Predicting AI Timelines in 2025
Interpreting AI 2027's timelines in light of METR's long-task research

Hi everyone 👋
This month marks my 1+ year anniversary at OpenAI! I’ve been meaning to get back into writing for the past few months, but so many exciting things have happened this year that it’s been hard to dive back in. I’m really happy to be back and am hoping to do bi-weekly posts: reading papers one weekend and writing the other. More to come!
In the spirit of this newsletter changing form and focusing more on paper summaries over fast news, I’ve decided to change the handle from Superfast (which was ironic anyway!) to Superslow (much more appropriate). After thinking about my goals with this newsletter, I’d love to use it as a way to motivate reading some of the best papers out there right now.
Since moving to SF, I’ve also been lucky to be surrounded by incredibly talented AI researchers, many of whom are close friends. I’d love to use this newsletter as a way of exploring my friends’ favorite paper recommendations.
This week, I decided to start with a paper from METR that my friend David Rein suggested: Measuring AI Ability to Complete Long Tasks.
I’m hoping to tie in the RE Bench paper and interpret the results in the popular AI 2027 post in light of this newly(-ish) published research. If that sounds exciting to you too, let’s dive in!
📚 Predicting AI Timelines in 2025
“According to a recent METR report, the length of coding tasks AIs can handle, their “time horizon”, doubled every 7 months from 2019-2024 and every 4 months from 2024 onward. If the trend continues to speed up, by March 2027 AIs could succeed with 80% reliability on software tasks that would take a skilled human years to complete.”
tldr: AI progress is exponential — and now we can prove it!
Long horizon task research is really interesting. It informs both capabilities research and, of course, safety research. Notably, it also affects how we think about AI timelines: how long do we expect it will take us to reach AGI or superintelligent AI?
This can spawn conversations like:
1. What are the economic impacts of superintelligent AI? (economists and policy makers care about this)
2. What are the ethical implications? (philosophers and social scientists care about this)
3. Given beliefs about those timelines, how should I think about my life right now? (everyone, or at least the existentialists, cares about this?)
On question 3, I have had real conversations with friends in AI who have stopped investing money in their 401ks. Their AI timelines are too short. They don’t think they’ll ever need retirement savings. The optimistic view is: they won’t need savings because AIs will do everything and give everyone UBI. The pessimistic view is: they won’t need savings because AIs will destroy the world. 😬 What do you think?
METR’s research on long tasks
Many conversations about AI progress revolve around the idea of building an AI researcher. The building blocks of that are:
1. Build a great AI software engineer
2. Build a great AI researcher
3. Have a team of AI researchers solve all of the other problems in the world*
*You can think of this like an intelligence explosion that’s (relatively**) inexpensive, really fast, and always working. Imagine you have OAI’s Deep Research model working all the time to write summaries for you, research your next vacation, help you shop for deals online, and pair with you on some professional tasks. You watch it search the web, reason through tradeoffs, and come up with a final plan to suggest to you. A lot of the same skillset can be put towards traditional scientific research, which could unlock solutions to unsolved research questions in medicine, robotics, economics, and more.
**You can think of AIs as relatively inexpensive compared to human labor
And why start with SWE tasks? They’re the building block for steps 2 and 3. Once we can build a great AI software engineer, we’re a few short steps away from (2) AI researchers and (3) a team of AI researchers. And then AI development becomes recursive: AIs further develop future AIs.
Given that, METR started its AI progress research on long tasks in the software engineering domain. An added benefit: SWE tasks should have objectively correct and verifiable solutions, which helps with standardized grading across human and model responses.
The short summary of METR’s study:
The team collected 100s of SWE tasks from other datasets including RE-Bench (very hard), HCAST (medium hard), and Software Atomic Actions (easy)
They had human SWE experts complete tasks in those datasets and record how much time each task took. Each task now has an average-human-handling-time (AHHT? I’m making this up)
Models attempt to solve those same tasks and are graded on each as a pass or fail. In this particular paper, the headline threshold is a 50% success rate, i.e. the “time horizon” is the length of task a model can complete half the time
For each model, you can then look at the AHHT of the tasks it passed. METR plotted the resulting time horizon against each model’s release date, which gets us to our final question: AI timelines. You can see that here:
[Figure: 50% time horizon vs. model release date. Source: METR paper, page 2]
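To make the “time horizon” idea concrete, here’s a minimal sketch of the fitting step as I understand it from the paper: for one model, fit a logistic curve of success against log(task length in human time), then read off where predicted success crosses 50%. The numbers below are toy data for illustration, not METR’s.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: one model's pass/fail results on tasks with known human baseline times.
human_minutes = np.array([2, 5, 15, 30, 60, 120, 240, 480, 960])
passed        = np.array([1, 1, 1,  1,  1,  0,   1,   0,   0])

# METR-style fit: success probability as a logistic function of log(task length).
X = np.log2(human_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, passed)

# The 50% time horizon is the task length where predicted success crosses 0.5,
# i.e. where the fitted linear term a + b*log2(t) equals zero.
a, b = clf.intercept_[0], clf.coef_[0][0]
horizon_50 = 2 ** (-a / b)
print(f"50% time horizon ~ {horizon_50:.0f} human-minutes")
```

Repeat that per model, plot each model’s 50% horizon against its release date, and the exponential trend (and its doubling time) falls out of a straight-line fit on a log scale.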
So how do we get to the 2027 timeline for significant AI progress?
In the AI 2027 post, the authors, Kokotajlo et al., predict a few things:
AI capabilities will continue to progress exponentially
The doubling time will keep shrinking, from every 7 months to every 3-4 months to an even faster rate (pulled from METR’s report)
The required success rate on SWE tasks will move up from 50% (the threshold used in METR’s paper) to 80%+. This becomes important because we want to build reliable AI SWEs
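For intuition on how that kind of extrapolation compounds, here’s a back-of-the-envelope sketch. The starting horizon and dates below are placeholder assumptions for illustration, not METR’s or AI 2027’s published figures.

```python
from datetime import date

def horizon_at(target: date, start: date, start_horizon_hours: float,
               doubling_months: float) -> float:
    """Extrapolate a 50% time horizon that doubles every `doubling_months` months."""
    months_elapsed = (target.year - start.year) * 12 + (target.month - start.month)
    return start_horizon_hours * 2 ** (months_elapsed / doubling_months)

# Placeholder assumptions (NOT METR's or AI 2027's published numbers):
start = date(2025, 3, 1)        # a hypothetical "today"
start_horizon_hours = 1.0       # assume a ~1 hour horizon at 50% reliability today
target = date(2027, 3, 1)       # AI 2027's March 2027 milestone

for doubling in (7, 4):         # the 7-month and 4-month doubling regimes
    h = horizon_at(target, start, start_horizon_hours, doubling)
    print(f"doubling every {doubling} months -> ~{h:.0f} hour horizon by {target}")
```

Even with rough inputs, you can see why the doubling time matters far more than the starting point: shaving the doubling time from 7 months to 4 turns a ~10x gain over two years into a ~60x one.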
If you’re interested, METR also included a plot of AI progress timeline estimates at an 80% passing rate. The blue line is the 80% passing rate, the grey line is the 50% passing rate. The upshot? Timelines are much longer if the passing rate is higher:
[Figure: time horizon trends at the 50% vs. 80% success thresholds. Source: METR paper, page 12]
Still, that alone doesn’t get us to the 2027 timeline: by 2027 these plots only show a ~4+ hour task completed at a 50% pass rate, and only a ~1 hour task if we take the more conservative 80% pass rate.
Ways these researchers predict the 2027 timeline:
1. Extending the timeline research done by METR discussed above
2. Estimating how long it would take to saturate RE-Bench tasks (8+ hour tasks), while taking into account how this would be applied to real-world AI R&D tasks (how does that application work?)
3. Approaches 1 and 2, plus geopolitics and macroeconomics
Summary of AI 2027 estimates here (methods 1, 2, and 3):
Taking this all together, it seems like timeline extrapolation / forecasting based on METR’s report makes sense. We should consider:
How perfect do we need our AI agents to be? 50-80% is likely too low, so timelines may be longer than we expect
How much memory do these models have? This will affect how work is parallelized, planned, prioritized, and executed, which mimics the dynamics of SWE teams more than those of isolated ICs
I’ll leave you with this graphic from AI 2027. It maps what needs to be solved between RE-Bench saturation (very difficult, hours-long SWE tasks) and the existence of a superhuman AI coder:
Time horizon: how long will it be until AI models can solve discrete SWE tasks that currently take human experts a long time (many hours) to solve?
One way to estimate this is by extending METR’s long task research
Engineering complexity: how long will it take to solve real-world AI R&D tasks?
What’s a good proxy measurement? Maybe when models can complete 1+ month’s worth of an expert SWE’s work? What does that look like in the details? Maybe modifying 20k+ lines of code in a codebase that’s 500k+ lines (estimates here)
To test this, you can measure how many lines of code were modified and evaluate the quality of the work with pre-made unit tests that expert human SWEs have created (see the rough sketch after this list)
Feedback loops: how do you do the above without pre-made unit tests to verify outcomes, or without clear, comprehensive, bounded instructions? How do we teach our models to have SWE “taste”?
Give models a vaguely specified task and see if they successfully infer what an ideal outcome looks like
Parallel projects: how long will it take to create an AI product manager?
To test this, you can see how well models work across multiple codebases, large scale training pipelines, or large scale data analysis pipelines
How do they assess tradeoffs and select an ideal path forward?
Specialization: how long will it take to create an AI research lead?
Same as above but with AI research “taste”
Cost and speed: how do we optimize efficiency?
How do we go from “things that don’t scale” to “things that scale”?
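As a concrete illustration of the “lines modified plus pre-made unit tests” grading idea mentioned above, here’s a rough sketch. The repo path and the use of pytest are hypothetical stand-ins, not anything specified by METR or AI 2027.

```python
import subprocess

def grade_model_patch(repo_dir: str) -> dict:
    """Rough grading sketch: how much did the model change, and does the
    expert-written test suite still pass?"""
    # Count added + removed lines in the model's uncommitted changes.
    diff = subprocess.run(["git", "diff", "--numstat"],
                          cwd=repo_dir, capture_output=True, text=True)
    lines_changed = sum(
        int(added) + int(removed)
        for added, removed, _path in (row.split("\t", 2) for row in diff.stdout.splitlines())
        if added.isdigit() and removed.isdigit()  # binary files show "-" and are skipped
    )

    # Run the pre-made unit tests written by expert human SWEs.
    tests = subprocess.run(["pytest", "-q"], cwd=repo_dir,
                           capture_output=True, text=True)
    return {"lines_changed": lines_changed, "tests_passed": tests.returncode == 0}

# Hypothetical usage:
# print(grade_model_patch("/path/to/task-repo"))
```

The harder part, as the “feedback loops” bullet suggests, is grading tasks where no such expert-written test suite exists.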
If you’re interested in diving into more of the details and research method, I recommend reading the timeline forecast extension here published by the AI 2027 team. It’s long but detailed.
So how should we think about AI timelines given these three research papers and extensive discussion of confidence intervals?
It’s hard to say. In general, I’m skeptical of having any deep confidence in any particular timeline estimate. Some things to consider:
The confidence intervals cover quite a long range of time
It’s almost June 2025 and we have not seen significant progress from leading AI labs on saturating RE-Bench, which should push the starting date for these estimates out further
There may be considerations that slow down progress more than we expect: safety protocols, product implementation issues, quality control, a misunderstanding of relevant transfer knowledge, third-party tooling integration issues, macroeconomic/political frictions, and more
Given these open questions, I find the AI 2027 timeline forecasting super interesting research, and I treat it as more evidence towards shorter timelines, but with a healthy dose of skepticism and openness to the relevant questions that are still unanswered.
Some other nuggets I found interesting:
The two experts consulted for the AI 2027 forecast put RE-Bench tasks at saturation in August or September 2025 (this year!) as an early bound for their 80% confidence interval
I mentioned this earlier, but I’m writing this post ~June 2025, and given we haven’t seen anything close to saturation on RE-Bench tasks from current model testing, this seems like an overzealous estimate
In METR’s report, they find that switching from a 50% to an 80% success threshold reduces the time horizon by ~5x
If our goal is to have near perfect agents complete professional or personal tasks for us with high fidelity, timelines will likely be pushed out much further
How much further does the time horizon shrink if you require ~95% reliability instead of 80%? What about 99%?
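A rough way to ballpark this, assuming success stays logistic in log(task length) as in METR’s per-model fits: the reported ~5x drop from the 50% to the 80% threshold pins down the slope of that curve, and the slope tells you how much further the horizon shrinks at higher reliability. This is my own back-of-the-envelope extrapolation, not a number from the paper.

```python
import math

def logit(p: float) -> float:
    return math.log(p / (1 - p))

# Assume (as in METR's fits) success is logistic in log(task length):
#   P(success | t) = sigmoid(a - b * log t)
# which gives a horizon at reliability p of t_p = t_50 * exp(-logit(p) / b).
# The report's "80% horizon is ~5x shorter than the 50% horizon" pins down b:
b = logit(0.80) / math.log(5)          # ~0.86

for p in (0.80, 0.95, 0.99):
    shrink = math.exp(logit(p) / b)    # ratio t_50 / t_p
    print(f"{p:.0%} reliability -> horizon ~{shrink:.0f}x shorter than the 50% horizon")
```

Under that (strong) assumption, the 95% horizon comes out roughly 30x shorter than the 50% horizon, and the 99% horizon roughly 200x shorter, which is why the reliability bar you pick matters so much for timelines.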
Does this mean my job is safe? I hope so 🤞
🗞️ On My Reading List
Super short reading list that I’ll add to over the coming weeks. Let me know what you’re interested in exploring and I can prioritize new papers :)
That’s it! Have a great day and see you in two weeks! 👋
What did you think about today’s newsletter? Send me a DM on Twitter @barralexandra or reply to this email!
Thanks for reading Superslow AI.
If you enjoyed this post, feel free to share it with any AI-curious friends. Cheers!