Large language models are improving exponentially?

Original link: https://spectrum.ieee.org/large-language-model-performance

Model Evaluation & Threat Research (METR) is benchmarking large language models (LLMs) to quantitatively measure their improvement over time. Its research focuses on assessing the ability of LLMs to complete complex tasks without human input. METR's key finding is that LLM capabilities are doubling approximately every seven months, according to a metric it developed called the "task-completion time horizon": how long a human would take to complete a task that an LLM can finish with a specified reliability. Extrapolating this trend, METR predicts that by 2030 advanced LLMs could reliably complete month-long software-based projects, such as starting a company or writing a novel, potentially finishing them in days or hours. The organization emphasizes that "messy" real-world tasks pose a greater challenge to LLMs. While acknowledging the potential risks of rapidly improving AI, METR suggests that hardware and robotics limitations could bottleneck the pace of progress.

A recent IEEE article claims that large language models (LLMs) are improving exponentially, based on a metric called the "task-completion time horizon": how long it would take a human programmer to complete a task an LLM can do with a specified reliability (e.g., 50%). By this metric, capabilities show a doubling period of about seven months. Commenters on Hacker News, however, are skeptical. Critics argue that the study relies on an "invented metric" and extrapolates its findings indefinitely while downplaying the "messiness" of real-world tasks. Several users question the choice of a 50% success rate as a benchmark, noting that such a rate would be unacceptable in a professional setting. Others challenge the claim that LLMs will write a decent novel by 2030, arguing that LLMs lack the emotional understanding and artistic judgment such creative work requires. Some believe improvements will plateau, in line with the Pareto principle, because the remaining 20% of problems are the hardest to solve. Others echo the point that hardware and robotics limitations could bottleneck progress.

Original article

Benchmarking large language models presents some unusual challenges. For one, the main purpose of many LLMs is to provide compelling text that’s indistinguishable from human writing. And success in that task may not correlate with metrics traditionally used to judge processor performance, such as instruction execution rate.

But there are solid reasons to persevere in attempting to gauge the performance of LLMs. Otherwise, it’s impossible to know quantitatively how much better LLMs are becoming over time—and to estimate when they might be capable of completing substantial and useful projects by themselves.

[Figure: scatter plot showing a negative correlation between success rate and task-messiness score. Large language models are more challenged by tasks that have a high "messiness" score. Credit: Model Evaluation & Threat Research]

That was a key motivation behind work at Model Evaluation & Threat Research (METR). The organization, based in Berkeley, Calif., “researches, develops, and runs evaluations of frontier AI systems’ ability to complete complex tasks without human input.” In March, the group released a paper called Measuring AI Ability to Complete Long Tasks, which reached a startling conclusion: According to a metric it devised, the capabilities of key LLMs are doubling every seven months. This realization leads to a second conclusion, equally stunning: By 2030, the most advanced LLMs should be able to complete, with 50 percent reliability, a software-based task that takes humans a full month of 40-hour workweeks. And the LLMs would likely be able to do many of these tasks much more quickly than humans, taking only days, or even just hours.
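To make the extrapolation concrete, here is a back-of-the-envelope sketch of the trend's arithmetic. The roughly one-hour starting horizon and the 160-hour definition of a work-month are illustrative assumptions, not figures stated in the article:

```python
import math

# Illustrative assumptions (not from the article): a time horizon of
# about 1 hour today, and one work-month = 4 weeks x 40 hours = 160 hours.
current_horizon_hours = 1.0
target_hours = 160.0
doubling_period_months = 7        # the doubling period METR reports

# The horizon grows as current * 2^(elapsed_months / doubling_period),
# so solve 2^(m / 7) = target / current for m:
months = doubling_period_months * math.log2(target_hours / current_horizon_hours)
print(f"~{months:.0f} months, i.e. about {months / 12:.1f} years")
# -> ~51 months (~4.3 years), which is how the trend lands around 2030
```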

Such tasks might include starting up a company, writing a novel, or greatly improving an existing LLM. The availability of LLMs with that kind of capability “would come with enormous stakes, both in terms of potential benefits and potential risks,” AI researcher Zach Stein-Perlman wrote in a blog post.

At the heart of the METR work is a metric the researchers devised called “task-completion time horizon.” It’s the amount of time human programmers would take, on average, to do a task that an LLM can complete with some specified degree of reliability, such as 50 percent. A plot of this metric for some general-purpose LLMs going back several years [main illustration at top] shows clear exponential growth, with a doubling period of about seven months. The researchers also considered the “messiness” factor of the tasks, with “messy” tasks being those that more resembled ones in the “real world,” according to METR researcher Megan Kinniment. Messier tasks were more challenging for LLMs [smaller chart, above].
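The article doesn't describe the estimation procedure, but a 50 percent time horizon can be read off a fit of success probability against (log) human task time. The sketch below is a minimal illustration under that assumption; the logistic fit, the sample data, and the use of scikit-learn are illustrative choices, not METR's published method:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each task: how long it takes a human (minutes), and whether the LLM
# succeeded. The numbers here are made up for illustration.
human_minutes = np.array([1, 4, 15, 60, 240, 960])
llm_success   = np.array([1, 1, 1,  0,  1,   0])

# Fit success probability against log2(human time)...
X = np.log2(human_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, llm_success)

# ...then solve for the task length at which predicted success is 50%:
# p = 0.5  <=>  w * log2(t) + b = 0  <=>  t = 2^(-b / w)
w, b = clf.coef_[0][0], clf.intercept_[0]
horizon_minutes = 2 ** (-b / w)
print(f"50% time horizon ≈ {horizon_minutes:.0f} human-minutes")
```

Plotting such horizons for successive model releases over several years is what yields the exponential curve the researchers describe.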

If the idea of LLMs improving themselves strikes you as having a certain singularity-robocalypse quality to it, Kinniment wouldn’t disagree with you. But she does add a caveat: “You could get acceleration that is quite intense and does make things meaningfully more difficult to control without it necessarily resulting in this massively explosive growth,” she says. It’s quite possible, she adds, that various factors could slow things down in practice. “Even if it were the case that we had very, very clever AIs, this pace of progress could still end up bottlenecked on things like hardware and robotics.”
