Why More Context Isn't Always Better: Context Window Fails Explained

Bigger isn't always better in LLMs: larger context windows don't reliably translate into better performance across a variety of LLM tests.

Hi everyone 👋

Today we’ll dive into why LLMs with huge context windows don’t actually perform better, plus my reading list and some fun Midjourney and AI-generated videos.

As a quick note, I’ll be traveling next week so we’ll either skip next week’s post, or it’ll be published later in the week. Thanks for your understanding and thanks so much for reading along with me :)

Let’s dive in!

DALL-E/Shutterstock image of a scientist analyzing a giant context window

Thanks for reading! Hit subscribe to stay updated on the most interesting news in AI.

📚 Concepts & Learning

Today we’re going to dive into the performance of different large language models (LLMs) across different context windows. The context window is the amount of text an LLM can take in at once; everything you type into the chat box counts toward it. When you ask ChatGPT to help you draft a cover letter and give it background on your work experience, skills, and more, all of that is context you’ve shared in the context window.

Some recent developments have made this an especially interesting research area. This past May, Anthropic announced an expanded context window, from 8k tokens to 100k tokens. That’s an increase from roughly 6k words to roughly 75k words, around the length of an average book.
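
If you want to sanity-check the tokens-to-words math yourself, here's a minimal sketch using OpenAI's tiktoken library (the exact count depends on the tokenizer; the ~0.75 words-per-token ratio is just a rule of thumb):

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the tokenizer used by GPT-3.5-Turbo and GPT-4
enc = tiktoken.get_encoding("cl100k_base")

text = "The context window is measured in tokens, not words."
tokens = enc.encode(text)

print(f"{len(text.split())} words -> {len(tokens)} tokens")
# A ~75k-word book lands somewhere around 100k tokens,
# roughly Claude's expanded context window.
```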

OpenAI also recently announced that they’re expanding the GPT-3.5-Turbo context window from 4k to 16k tokens. While they haven’t commented on this officially, OpenAI probably can’t make their context window too large because ChatGPT is a B2C product: there are millions of daily users, and expanding the context window would drastically increase the compute cost per query. Anthropic, by contrast, is B2B and mainly serves enterprises, with limited additional access for researchers. Their customer pool is much smaller as a result, which makes fantastic announcements like this recent one feasible.

Other companies are also turning their attention to context windows. Startups like Magic.dev have developed LLMs like LTM-1, with a 5M-token context window.

And it’s not just startups. Big players like Microsoft Research announced just last week a transformer-based LLM that scales to a context window of 1B tokens. They even say in the paper (linked below) that they expect users to want unlimited-length context windows. Looks like we’re just getting started…

Why does context window matter?

There’s been a debate around whether context windows will keep getting longer or whether AI memory will improve instead. Both matter if you want an AI product/chatbot to recall previous preferences or conversations you’ve had with it, which helps greatly with personalized content. I do a fuller breakdown of this (including some examples of ChatGPT failures) in a previous post. Check it out here.

But how do these LLMs with larger context windows perform? Is more context always better?

That’s the subject of today’s deep dive: recent comparative research on how well a variety of state-of-the-art LLMs actually use their larger context windows. Let’s dive in.

How well do language models use information in their context window?

First, let’s talk about the models used in this research:

Closed models:

  • OpenAI’s GPT-3.5-Turbo (4k tokens)

  • OpenAI’s GPT-3.5-Turbo-16k (16k tokens)

  • Anthropic’s Claude-1.3 (8k tokens)

  • Anthropic’s Claude-1.3-100k (100k tokens)

Open models:

  • MosaicML’s* MPT-30B (8k tokens)

  • MosaicML’s MPT-30B-Instruct (8k tokens)

  • LongChat-13B (16k tokens; builds on LLaMA-13B)

*MosaicML was just acquired by Databricks. Things move fast in AI!

**13B vs 30B refers to the model size: 13B parameters vs 30B parameters. Typically larger models have greater capabilities and thus perform better.

If you’re not familiar with all of these models, don’t worry. We’ll break them down throughout this post.

All of these models are language models: they take text in and return text out, just like ChatGPT/Bing Chat.

What tests did they run?

The researchers ran two tests:

  1. Read a lot of text, synthesize, and answer: From a set of documents (10, 20, or 30 pages long), find the page with the relevant information, synthesize it, and return an answer. For example: read this passage on the history of WWII, then tell me the capital of England.

  2. Pair matching: Given a long list of input-output pairs, find a unique input and return the output it’s paired with. (This is trivial with a deterministic lookup in code, as we’ll see below, but hard for LLMs.)

This research primarily tests output accuracy: a response counts as correct if it matches the answer in a pre-labeled dataset.
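
As a rough sketch of what "matches the answer" could mean in practice (the normalization and substring matching below are my assumptions, not necessarily the paper's exact metric):

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def is_correct(model_output: str, gold_answers: list[str]) -> bool:
    """Count the response as correct if any gold answer appears in it."""
    output = normalize(model_output)
    return any(normalize(ans) in output for ans in gold_answers)

print(is_correct("The capital of England is London.", ["London"]))  # True
```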

A little on the datasets

  • Experiment 1: Text synthesis task:

    • The dataset is pulled from the NaturalQuestions benchmark, and each model is tested on a variety of text lengths: some 10 pages, some 20, some 30.

    • There are ~2.7k examples for each text length.

    • The NaturalQuestions benchmark contains historical Google Search queries that human labelers have answered using information from Wikipedia.

  • Experiment 2: Pair matching task:

    • The dataset is a set of JSON-formatted input-output pairs, which essentially look like long strings of text.

    • Each model is tested on lists of 75, 140, or 300 input-output pairs, with exactly 1 pair containing the right information.

    • They ran 500 trials at each list length (75, 140, and 300); see the sketch after this list.
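
To make the pair-matching setup concrete, here's roughly how such a dataset could be generated. The use of random UUIDs and this exact JSON formatting are my assumptions, not the paper's spec:

```python
import json
import random
import uuid

def make_kv_prompt(num_pairs: int) -> tuple[str, str, str]:
    """Build a JSON blob of random key-value pairs plus one query key and its answer."""
    pairs = {str(uuid.uuid4()): str(uuid.uuid4()) for _ in range(num_pairs)}
    query_key = random.choice(list(pairs))   # the one pair we'll ask about
    gold_value = pairs[query_key]
    prompt = (
        json.dumps(pairs, indent=2)
        + f"\n\nWhat is the value associated with the key {query_key}?"
    )
    return prompt, query_key, gold_value

prompt, key, value = make_kv_prompt(75)  # also try 140 and 300
```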

Here’s the first experiment

  1. The chatbot is given multi-document context (aka input). At the end of the context, the chatbot is asked a question.

  2. In one of those documents is the correct answer. Let’s say it’s on page 1 in a stack of 20 pages.

  3. The chatbot returns an answer. It’s either correct or incorrect.

  4. This process is repeated, except the page with the correct answer is shuffled within the stack. Sometimes it’s on page 4, sometimes page 17, sometimes page 19, and so on (see the sketch after this list).
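
Here's a hedged sketch of how that shuffling could be done in code; the prompt wording and "Document i" formatting are my own, not the paper's exact templates:

```python
def build_prompt(gold_doc: str, distractors: list[str],
                 gold_position: int, question: str) -> str:
    """Insert the page that contains the answer at a chosen position
    among the distractor pages, then ask the question at the very end."""
    pages = distractors[:]
    pages.insert(gold_position, gold_doc)
    numbered = "\n\n".join(f"Document {i + 1}: {doc}" for i, doc in enumerate(pages))
    return f"{numbered}\n\nQuestion: {question}\nAnswer:"

# Same question, same 20 pages; only the gold page's position changes.
for position in [0, 9, 19]:
    prompt = build_prompt("England's capital is London. ...",
                          ["(irrelevant page)"] * 19,
                          position,
                          "What is the capital of England?")
```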

Here’s a toy example of 3 different test runs: the model gets the first two right (✅ ✅) and misses the third (❌).

This experiment is pretty similar to what we’d expect ChatGPT or Bing Chat to do in practice. The user asks a question; the chatbot searches the internet and reviews several web pages containing some relevant and lots of irrelevant information, then returns only the relevant parts to the user. It also needs to recognize when none of the pages contains relevant information and say so, rather than hallucinate facts that are untrue.

Here’s a quick snapshot of how each of the models performs:

Overall, the models exhibit a u-shaped curve: they perform best when the relevant information is at the beginning or the end of the context window, and their performance suffers when the relevant information is somewhere in the middle.

Here’s a snapshot of how the models perform on closed-book (aka baseline) vs oracle (aka with additional context).

Closed-book vs Oracle

Closed-book is the baseline LLM response: the model must answer using only the internal knowledge it learned during pre-training/fine-tuning. Oracle is the response when the relevant information is provided in the context window.

GPT-3.5-Turbo (4k and 16k) have the highest closed-book performance (55%) and oracle performance (88%). Interestingly, the difference between a 4k and 16k context window in GPT-3.5-Turbo doesn’t seem to affect performance: they both perform relatively similarly. This would lead us to believe that larger context windows don’t actually lead to better performance on the same task.

GPT-3.5-Turbo’s lowest performance comes when the correct answer is on page 10 of 20 total pages: it scores 52.9% on those tests. This backs up the overall research claim that performance dips (for all models) when the correct information is in the middle of the context window.

Here’s the second experiment

  1. The chatbot is given a list of input-output pairs (75, 140, or 300 pairs). At the end of that context, the chatbot is asked a question.

  2. One of those pairs has the correct answer. Let’s say it’s in pair 10 in a stack of 300 pairs.

  3. The chatbot returns an answer. It’s either correct or incorrect.

  4. This process is repeated, except the pair location with the correct answer is shuffled in the list. Sometimes it’s in pair 40, sometimes pair 170, sometimes pair 290, etc.

For example, let’s say the list contains pairs of celebrity first and last names. If I give the LLM a unique first name, will it return the matching last name?

Again, a toy example of 3 runs: two return the correct last name (✅ ✅) and one doesn’t (❌).

Interestingly, this kind of test is trivial for standard computer programs: a deterministic lookup in code, or even a simple formula in Excel or Airtable, gets it right every time. So it’s striking that LLMs struggle with something this straightforward; it shows there are still basic accuracy issues with LLMs.
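
For contrast, here's the entire "deterministic mapping" version as a plain dictionary lookup (the names are just an illustration):

```python
pairs = {
    "Ariana": "Grande",
    "Dwayne": "Johnson",
    "Taylor": "Swift",
    # ...imagine 300 of these
}

def lookup(first_name: str) -> str:
    # Exact retrieval: position in the list is irrelevant, accuracy is 100%.
    return pairs[first_name]

print(lookup("Dwayne"))  # -> "Johnson"
```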

In this experiment, Anthropic’s Claude performs nearly perfectly across all tasks (a huge win!). In comparison, the other models tended to struggle:

As you can see, the maximum length possible for a comparative test is 16k tokens: even though Anthropic’s Claude goes up to 100k tokens, the next-largest context window among the tested models is only 16k.

Overall, there is some u-shaped behavior, but some models perform near perfectly. It is particularly interesting to see the performance of GPT-3.5 vs Claude-100k in the third graph, because in other tests OpenAI’s GPT models have outperformed Anthropic’s Claude.

Some other hypotheses the researchers tested

  • Decoder-only vs encoder-decoder architectures: prediction from prior context only vs bidirectional use of context

  • Query-aware contextualization: placing the question at the beginning and end of the context window

  • Effects of instruction fine-tuning: regular pre-trained models vs instruction fine-tuned models

Decoder-only vs encoder-decoder architectures

First, what is an encoder-decoder vs decoder-only architecture? A full answer deserves a much deeper dive, but here’s the high-level version:

Imagine you have a fill-in-the-blank quiz.

Decoder-only models use only the context before the “blank” to guess what should go in its place.

For example:

On the 4th of July, she had a burger and _____

The model guess might be: fries
Or it could be: milkshake
Or it could be: ketchup

All of these seem like reasonable answers.

In contrast, encoder-decoder models can use context both before and after the “blank” to inform their guess.

For example:

On the 4th of July, she had a burger and _____. She skipped the condiments and saved dessert for later.

The model might respond: fries (and be correct).

Other options like milkshake and ketchup seem less reasonable given the additional context the model sees in the following sentence.
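
If it helps to see the difference as code, it boils down to the attention mask. Here's a purely illustrative numpy sketch, not any particular model's implementation:

```python
import numpy as np

n = 5  # sequence length in tokens

# Decoder-only: a causal mask, so token i can only attend to tokens 0..i (prior context).
causal_mask = np.tril(np.ones((n, n), dtype=bool))

# Encoder (bidirectional): every token can attend to every other token, before and after.
bidirectional_mask = np.ones((n, n), dtype=bool)

print(causal_mask.astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```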

The researchers tested open-source models that are decoder-only, meaning they look only at prior context to inform their predictions. Since the researchers can’t be sure of the closed models’ architectures, they did not test them in this section.

Decoder-only models:

  • MPT-30B-Instruct

  • Longchat-13B-16k

The encoder-decoder models tested are:

  • FLAN-T5-XXL

  • FLAN-UL2

Here are the results:

Performance is on the y-axis (higher is better). Position of the correct answer (aka page location) is on the x-axis (1st position is early, last position is late in the context window). U-shape means the models perform well when the relevant text is in the beginning or end of the context window.

Although the researchers note that the Flan models are relatively robust, from the results here it looks like the decoder-only Longchat-13B-16k mostly outperforms the other models, sometimes by a large margin. In the 30-document tests (the right-most graph), the encoder-decoder models actually seem to perform the worst in the middle positions.

Overall, all of these models show a u-shaped curve across all 3 tests, which indicates that they perform best when the relevant text is at the beginning or end of the context window. They recall best the information shown first or last.

Overall, my take is that the encoder-decoder architecture doesn’t necessarily improve performance, although intuitively it seems like it should, given the context the model sees both before and after the relevant area of prediction. I’m sure I have much more reading to do in the encoder-decoder vs decoder-only space, so feel free to reply here with any blogs/arXiv papers you recommend 🙂

Takeaway: Encoder-decoder models perform well when tested on token lengths that are similar in size/length to those in their training dataset. However, they show u-shaped performance when evaluated on sequences longer than those seen in training.

Query-aware contextualization

Another test they ran is query-aware contextualization:

The adjustment here is to include the question both before and after the context. That way, the model knows what to look for (what to pay attention to) before being presented with the full context, and is reminded of the question again at the end. Using our celebrity-name case, here’s a quick example:
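
A minimal mock-up of what that prompt could look like (the wording here is my own, not the paper's template):

```python
question = "What is Ariana's last name?"

context = "\n".join(f"{first}: {last}" for first, last in [
    ("Dwayne", "Johnson"),
    ("Taylor", "Swift"),
    ("Ariana", "Grande"),
])

# Query-aware contextualization: the question appears both BEFORE and AFTER the context.
prompt = f"{question}\n\n{context}\n\n{question}"
print(prompt)
```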

Two results: do you want the good news or the bad news?

The good news

On pair-matching tests (Experiment 2), the performance of decoder-only models (MPT-30B-Instruct and Longchat-13B-16k) is significantly improved.

Why?

Given the query up front, the decoder-only models can contextualize the content as they read it and then extract the right information.

For example:

The 16k context window version of GPT-3.5-Turbo gets a nearly perfect score on matching-pair tests, whereas previously it achieved 45.6% accuracy on the same test. (Is this a hint that GPT-3.5 is a decoder-only model?)

That intuitively totally makes sense.

Imagine a teacher gave you a list of celebrity first and last names, told you to read it over, and said she’d ask you a question about the list when you’re ready. It would be easier to retrieve the right information if you were given the question up front, before reading the full list: the question contextualizes your search and attention.

The bad news

When asked to find and decipher answers in multi-page text (Experiment 1), performance stays about the same when the relevant context is up front. Surprisingly, the models actually perform slightly worse when the relevant context sits in the middle or toward the end of the text.

That’s an interesting outcome. Think of this test as a passage comprehension test. You’d think having context on the question before reading the passage would help focus your attention on the right pages or documents. But according to this portion of the paper, it doesn’t.

Overall, query-aware contextualization improves the performance on pair matching tests (Experiment 2) for decoder-only models, but it doesn’t change much in the performance of multi-page text analysis (Experiment 1).

Instruct vs non-instruct models

Another hypothesis tested was instruct vs non-instruct performance.

First, what are instruct vs non-instruct models?

Non-instruct models are plain pre-trained models that focus on text completion. Instruct models are trained to follow instructions and answer questions. Instruct models are typically created from non-instruct (i.e., pre-trained) models that are then fine-tuned on a labeled dataset of instruction-following inputs and outputs (like question-and-answer pairs).

For example:

Imagine you query a non-instruct and instruct model this:

Input: What is the capital of England?

A non-instruct model might respond in auto-complete fashion, like this:

Input: What is the capital of England?

Output: What is the capital of France? What is the capital of the U.S.A.? These are all questions covered in our first course here. Be sure to sign up on the link below.

(Okay, fair… this is a trivial example, but you get the point. It’s auto-complete rather than actually answering the question.)

In contrast, an instruct model might respond in an instruction-following manner like this:

Input: What is the capital of England?

Output: London.

How might this affect the model performance on understanding context?

When instruct models are fine-tuned (via supervised fine-tuning) on instruction datasets, the instruction (“make this a bulleted list”, “what is X…”, “summarize this text”, etc.) is commonly placed at the beginning of the context window. This might lead instruct models to place greater weight on the beginning of the input context, since attending there led to successful outcomes during training.
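
For a sense of what that looks like, here's an illustrative instruction fine-tuning record (the field names and formatting are made up for illustration; real datasets vary):

```python
sft_example = {
    # The instruction sits at the very start of the context window...
    "prompt": (
        "Summarize the following text in one sentence.\n\n"
        "Text: Larger context windows let models read more input at once, "
        "but reading more doesn't guarantee using it well."
    ),
    # ...and the model is fine-tuned to produce this target output.
    "completion": "Bigger context windows don't automatically mean better answers.",
}
```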

Here’s a super short explainer on pre-training vs SFT vs RLHF in case you’re interested in diving deeper (link). Instruct fine-tuning typically occurs at the SFT stage.

Okay so now let’s look at the comparative performance of instruct vs non-instruct models:

The u-shaped trend is similar in both instruct and non-instruct models, but the instruction fine-tuned model performs uniformly better than the non-instruct model.

Is this because of the instruction fine-tuning? Probably. But it might also be because the instruct model has simply been trained on more data. So there could be multiple factors behind the instruct model’s better performance. It would be interesting to explore this in future research!

The big takeaways

  1. Models respond best when the relevant information is at the beginning or end of the context window.

  2. Models with larger context windows don’t necessarily use that extra context any better than models with smaller context windows:

    • See performance of GPT-3.5-Turbo 4k-tokens vs 16k-tokens above

    • On Experiment 1, when models are given more than 20 documents, their performance only marginally improves (~1.5% for GPT-3.5-Turbo and only ~1% for Claude-1.3). More context isn’t always better.

Why are these language models so bad at understanding the full context?

A lot of this likely comes down to the transformer architecture these language models are built on (which we’ll dive into more in a later post). With self-attention, time and memory costs grow quadratically with the input sequence length, so it takes a lot of time and money to train LLMs on longer text sequences. As a result, these models are typically trained on shorter sequences, and the downside is that they then won’t (in theory) perform well on the longer sequences that larger context windows allow. That’s exactly what we’ve seen corroborated in the research above.
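
To put rough numbers on "quadratic": self-attention computes a score for every pair of tokens, so the cost grows with the square of the sequence length. A back-of-the-envelope sketch:

```python
for n_tokens in [4_000, 16_000, 100_000]:
    # One attention score per pair of tokens (per head, per layer).
    pairwise_scores = n_tokens ** 2
    print(f"{n_tokens:>7,} tokens -> {pairwise_scores:,} pairwise attention scores")

#   4,000 tokens -> 16,000,000 pairwise attention scores
#  16,000 tokens -> 256,000,000 pairwise attention scores
# 100,000 tokens -> 10,000,000,000 pairwise attention scores
```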

Bonus: LLMs act like humans

In psychology, we observe primacy and recency biases in recall tasks. Imagine you’re at a speech coaching session. What is some canonical advice they’d give you? Start with the most important information up front and end strong with the big takeaways. Why? People remember the beginning and the end of your talk the most. Kinda like these LLMs in retrieval tasks?

Today’s paper touches on the serial-position effect, which posits that in free recall of elements from a list, people remember the first and last elements best. As the paper notes, this is a pretty surprising (and interesting!) observation in LLMs, since in principle an LLM should be able to retrieve any token from its context window.

What do you think? Drop me a line if you have any thoughts here. 🧠

Want to keep reading?

  • The paper (Link)

  • Microsoft Research 1B LLM (Link)

  • a16z take on this research (Link)

🗞️ News

What’s on my reading list?

Mostly research these days! Lmk if you have any recs :)

  • FedTP: Federated Learning by Transformer Personalization (Link) (Thanks Alex Raymond for the rec! And congrats again on the new role as Head of AI at Doppl 🎉)

  • AI Agents that “Self-Reflect” Perform Better in Changing Environments (Stanford HAI: Link)

  • What should the UK’s ÂŁ100 million Foundation Model Taskforce do? (Jack Clark: Link)

  • GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models (Google DeepMind: Link)

  • LongNet: Scaling Transformers to 1,000,000,000 Tokens (Microsoft Research: Link)

  • Towards Measuring the Representation of Subjective Global Opinions in Language Models (Anthropic: Link)

Product:

  • Anthropic announced Claude 2! (Link)

🎁 Miscellaneous

Jen-erative AI

Another Midjourney gem from Nick St. Pierre

ChatGPT invented banned 1980s video games and Midjourney rendered them 

That’s it! Have a great day and see you next week! 👋

What did you think about today’s newsletter? Send me a DM on Twitter @barralexandra or reply to this email!

Thanks for reading Superfast AI. If you enjoyed this post, feel free to share it with any AI-curious friends. Cheers!