The Unreasonable Effectiveness of Easy Training Data for Hard Tasks

Hi everyone 👋
Week 2 of Superslow and so far it’s living up to its name 🙂 This week, we’re exploring a paper that examines how effective “easy” training data can be at building strong model capabilities compared to training on more challenging, “hard” data.
This paper is authored by one of my best friends, Peter Hase, who recently wrapped up a residency on the safety team at Anthropic. Hope you enjoy!
Let’s dive in!
📚 Concepts & Learning
At a high level, this paper aims to answer the question: if I train my models on “easy” data, can I get the same performance as training on “hard” data?
Hase et al. tackle four questions in this paper:
How do we know if data is “hard” vs “easy”?
Do “easy”-trained models score highly on “hard” tests?
What are the tradeoffs between collecting easy vs hard datasets?
If the final upshot is that “easy”-trained models perform similarly well to “hard”-trained models on hard tests, is this behavior consistent across different model sizes and different levels of test difficulty?
The main findings are:
One way to measure this is through model performance on test data: if models perform worse on category A test sets than on category B test sets, this implies that the category A training sets are more difficult than the category B training sets (assuming train and test sets are drawn from the same pools). Hase et al. find this to be the case (as expected)
Yes! “Easy”-trained models perform ~70-100% as well as the “hard”-trained models:
If the “hard”-trained max score is 10 points
100% = “easy”-trained score is 10 points too
70% = “easy”-trained score is 7 points
My take: nice, this is an optimistic finding! Hase et al. imply this easy-to-hard generalization is not foolproof, so it would be interesting to see more research come out on how well it works for harder test sets (or for novel research)
Easy data is cheaper, faster, and easier to collect. You can collect it in higher volumes, and you can have higher confidence in its quality because QC measures are easier to implement (this makes sense)
The same final result on easy-to-hard generalization persists across a variety of models and model sizes. But one area of research that needs further investigation is how model performance declines when the gap in difficulty between the test set and the training set gets larger and larger
Okay, so how is this research set up? Some assumptions to start with:
It’s easier, faster, and cheaper to collect “easy” training data (e.g. high school level math problems) vs. “hard” data (e.g. PhD level math problems)
You can also collect “easy” data in larger volumes, so you can have more data for training and testing
You can also have higher confidence that the “easy” data is accurate because more people can verify the true answer — easy data is less noisy than hard data
More people can verify 2+2=4, so you can get an accurate ground truth label through consensus
Fewer people can verify any one deep, PhD level math proof, so it’s harder to get an accurate ground truth label through consensus
Question to answer: is it possible to improve a model’s ability to solve “hard” test problems when you only train it on “easy” problems?
Let’s also compare models trained on easy data vs hard data: how much uplift do you get when you do the harder, more expensive thing, which is train on “hard” problems?
Does the model trained on “hard” datasets perform better on “hard” tests?
If yes, how much better do the “hard”-trained models perform compared to the “easy”-trained models? (See the sketch below for one way this setup could look.)
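Here’s a minimal sketch of what such an experiment could look like. The helper functions (fine_tune, evaluate), the "hardness" field, and the threshold are hypothetical placeholders for illustration, not the paper’s actual code:

```python
# Minimal sketch of an easy-to-hard generalization experiment (illustrative only).
# fine_tune() and evaluate() are hypothetical stand-ins for a real training and
# eval harness; "hardness" is some per-example proxy such as grade level or
# number of reasoning steps.

def split_by_hardness(examples, threshold):
    """Split examples into easy and hard buckets using a hardness proxy."""
    easy = [ex for ex in examples if ex["hardness"] <= threshold]
    hard = [ex for ex in examples if ex["hardness"] > threshold]
    return easy, hard

def run_experiment(base_model, train_examples, test_examples, threshold,
                   fine_tune, evaluate):
    easy_train, hard_train = split_by_hardness(train_examples, threshold)
    _, hard_test = split_by_hardness(test_examples, threshold)

    # Three models evaluated on the same hard test set:
    return {
        "unsupervised": evaluate(base_model, hard_test),  # no fine-tuning (floor)
        "easy": evaluate(fine_tune(base_model, easy_train), hard_test),
        "hard": evaluate(fine_tune(base_model, hard_train), hard_test),
    }
```

The interesting comparison is how close the “easy” score gets to the “hard” score while only ever training on the cheap bucket.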
Implications:
Superhuman intelligence: if easy-to-hard generalization works, we should be able to build models that have superhuman capabilities (models that are more intelligent than any person) from human-generated datasets
Emergent behavior: relatedly, we’d expect these models to produce novel research. Humans do this — make new discoveries, innovate, and uncover novel research — so can we expect models to do this one day?
Adjacent implications: we can leverage this research when thinking about scalable oversight and model safety:
What happens when models are more intelligent than any human on earth?
How can we monitor and mitigate dangerous behaviors?
In the paper, Hase et al. discuss the idea of a Supervision Gap (formally, the Supervision Gap Recovered, or SGR), which is defined as follows:
SGR = (Easy − Unsupervised) / (Hard − Unsupervised)
Each variable above refers to the test score of models trained on different datasets. The models are:
Easy: model was trained on the “easy” dataset
Hard: model was trained on the “hard” dataset
Unsupervised: model wasn’t fine-tuned on anything (just the pretrained base model)
Scores from unsupervised models on a given test set should act as the floor — if you did no fine-tuning at all, what is the baseline performance we can expect?
Scores on the same test set from “easy”-trained models should hopefully be higher than scores from unsupervised models.
Scores on the same test set from “hard”-trained models should hopefully be even higher.
On a given test set, we would predict the following ranking of model scores (a toy calculation follows below):
“hard”-trained model score > “easy”-trained model score > unsupervised model score
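To make the definition concrete, here’s a quick toy calculation. The scores below are made up for illustration; they are not numbers from the paper:

```python
# Hypothetical accuracies on a "hard" test set (illustrative numbers only).
unsupervised = 40.0   # floor: no fine-tuning at all
easy_trained = 55.0   # fine-tuned on the "easy" dataset
hard_trained = 60.0   # fine-tuned on the "hard" dataset

# Supervision Gap Recovered: how much of the unsupervised-to-hard gap
# the "easy"-trained model closes.
sgr = (easy_trained - unsupervised) / (hard_trained - unsupervised)
print(f"SGR = {sgr:.0%}")  # -> SGR = 75%
```

Note that if the unsupervised floor were zero, SGR would reduce to the simple score ratio used in the 10-point example earlier.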
For example:
When tested on a held out set of college-level MMLU questions (i.e. “hard” test problems), the models score in the following ways:
Across a variety of “hard” tests, the main finding is that “easy”-trained models perform almost as well as “hard”-trained models. Hase et al. tested on “easy” and “hard” tests (see the two sets of graphs below) across a variety of existing benchmarks (ARC, MMLU, etc.).
When you look at the Easy vs Hard bars below (which are the test scores from each respective fine-tuned model), the majority of results show less than a 1-point difference (for some, only a ~0.1-point difference!) in scores between the two fine-tuned models:

Easy test sets

Hard test sets
This is surprising, because you would imagine “hard”-trained models would have higher capabilities, right? They would have learned more knowledge about the world from their “hard” dataset, which should lead to better performance on both easy and hard tests compared to the “easy”-trained models. Why doesn’t this happen? Hase et al. write:
“We hypothesize that this occurs because easy data elicits latent knowledge and skills from pretrained models in a hardness-invariant way.”
Interesting! I’m curious to read more about how easy data could elicit latent knowledge (if that’s what’s happening here!). Hase et al. explore whether the models are simply learning the format of the test-set questions and whether that helps with performance; they conclude that this is not the reason for the improved scores. If you have suggestions on where to continue reading about this, let me know!
Some other stuff from the paper:
Performance on “hard” tests given different training sets

This set of graphs measures how well fine-tuned models perform on the “hard” test set when trained on easy, medium, or hard datasets. For the most part, performance stays flat! This is consistent with other results from this paper: hard training datasets don’t seem to help that much with final model performance. This research paper was published in June 2024, so it would be interesting to see similar research in 2025 (I’m writing this as of June 2025, one year later) and see how the OpenAI o-series reasoning models do. The biggest gap in performance across easy-, medium-, and hard-trained models is on the reasoning-steps version of GSM8k. How do reasoning models perform on these benchmarks?
Unsupervised vs fine-tuned model performance

One thing I want to note across all of the results from this paper is that the difference in performance between the unsupervised model and either of the fine-tuned models (easy or hard) is not that large. It looks like ~5 points or less across all three model sizes above. While the focus of this paper is on easy-trained vs hard-trained performance, it might be worth stepping back entirely and asking:
Did the extra training data do that much?
If it did, what is the mechanism of improvement?
If we rule out learning about the question format as the cause of improved performance (Hase et al. address this in the paper), what are other reasons?
Is it that the training datasets have more relevant subject matter coverage that ultimately gets tested in the test set (compared to what can be learned just from unsupervised training)?
If you know of an adjacent paper that covers similar topics, let me know! I’d love to take a look.
🗞️ On My Reading List
🎁 Bonus
I read Toby Ord’s post on how improvements in inference-time compute could affect AI governance and found a few great nuggets:
Inference-time vs train-time compute:
“Because some tasks benefit more from additional inference than others, it is possible to tailor the amount of inference compute to the task, spending 1,000x the normal amount for a hard, deep maths problem, while just spending 1x on problems that are more intuitive. This kind of tailoring isn’t possible with pre-training scaling, where scaling up by 10x increases the costs for everything.”
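As a toy illustration of that tailoring idea (the thresholds, multipliers, and function below are hypothetical, not from Ord’s post):

```python
BASE_BUDGET = 1_000  # e.g. reasoning tokens spent on a routine query

def inference_budget(difficulty: float) -> int:
    """Map an estimated task difficulty in [0, 1] to an inference-compute budget."""
    if difficulty < 0.3:
        multiplier = 1       # intuitive problems: spend the normal amount
    elif difficulty < 0.8:
        multiplier = 10
    else:
        multiplier = 1_000   # hard, deep problems: spend far more at inference time
    return BASE_BUDGET * multiplier

# Pre-training scaling has no per-task dial like this: a 10x bigger training run
# raises the cost of every future query, whereas this multiplier is chosen per task.
```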
I’d be interested in exploring a post on how important training-time compute vs inference-time compute is to final model performance, and what implications that would have in the competitive landscape. Drop me a line if you have a post that covers this! Ord’s paper is a great starting point, I think.
Distillation
I thought this was a useful image on distillation of models:
Conclusions
And I thought his conclusions nicely wrapped up his predictions about the future of AI Governance:

That’s it! Have a great day and see you in two weeks! 👋
What did you think about today’s newsletter? Send me a DM on Twitter @barralexandra or reply to this email!
Thanks for reading Superslow AI. If you enjoyed this post, feel free to share it with any AI-curious friends. Cheers!