Are Emergent Abilities of Large Language Models a Mirage? (Quick Summary)

Anshu Kumar
3 min read · Dec 22, 2023

NeurIPS 2023 Best Paper Award, paper 1


This year, NeurIPS 2023 awarded 6 papers across 3 different categories. Take a look here: https://blog.neurips.cc/2023/12/11/announcing-the-neurips-2023-paper-awards/

Let's dive into this paper, which scrutinizes the claim that new abilities emerge as model size increases.

What are emergent abilities in large language models?

Emergent abilities in large language models (LLMs) refer to capabilities that are not present in smaller-scale models but become apparent in larger-scale models.

Examples of emergent abilities in large language models are:

  • Solving math word problems
  • Performing arithmetic
  • Storytelling
  • Multi-step reasoning
  • Taking college-level exams

How have these claims been demonstrated?

https://arxiv.org/pdf/2206.07682.pdf

In the figure linked above from the original emergent-abilities paper, you can observe a sharp gain in the metric after a certain model size.

Why could emergent abilities be a mirage?

The primary reason investigated in this paper is the choice of metric used to claim emergent abilities.

Choice of metric: an LLM’s per-token error rate changes smoothly and predictably with scale, but the chosen evaluation metrics scale that error rate non-linearly or discontinuously.

Examples of metrics that non-linearly or discontinuously scale the per-token error rate (a short simulation after this list makes the effect concrete):

  • BLEU
  • Multiple Choice Grade
  • ROUGE-L-Sum
  • Exact String Match
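
Here is a minimal sketch of the paper’s core argument (my own illustration, not the paper’s code; all numbers are made up): per-token accuracy is assumed to improve smoothly with model size, but Exact String Match, which requires every token of the answer to be correct, stays near zero and then shoots up only once per-token accuracy is already high.

```python
import numpy as np

# Assumed smooth scaling of per-token accuracy with model size.
# All constants here are illustrative, not taken from the paper.
params = np.logspace(7, 11, 9)                   # model sizes: 10M .. 100B
per_token_acc = np.exp(-3000 / np.sqrt(params))  # smooth, monotone in scale

seq_len = 30  # hypothetical length of the target answer, in tokens

# Exact String Match succeeds only if ALL tokens are correct: acc ** L.
exact_match = per_token_acc ** seq_len

for n, pt, em in zip(params, per_token_acc, exact_match):
    print(f"{n:10.1e} params | per-token acc {pt:.3f} | exact match {em:.4f}")
```

Per-token accuracy climbs gradually from about 0.39 to 0.99 across the scales, while exact match sits near zero until the largest models and then leaps, which is exactly the “sharp gain after a certain model size” pattern in the figure above.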

Emergent abilities appear only for specific metrics like those above, not for specific task-model families. On BIG-Bench, possible emergent abilities appear with at most 5 out of 39 metrics.

> 92% of emergent abilities appear under one of two metrics: Multiple Choice Grade and Exact String Match.

https://arxiv.org/pdf/2304.15004.pdf

In the figure above from the paper, we can see that on BIG-Bench tasks the spike occurs for only a few discontinuous metrics.

When evaluated using the discontinuous Multiple Choice Grade metric, the LaMDA model family exhibits emergent abilities. However, when evaluated using the continuous BIG-Bench metric, Brier Score, the LaMDA model family’s emergent abilities disappear.
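
A toy comparison shows why the switch of metric makes the “emergence” vanish (again an assumed setup, not the paper’s code): as the model’s probability on the correct option of a 4-way multiple choice question rises smoothly, the discontinuous Multiple Choice Grade jumps from 0 to 1 the moment the correct option becomes the argmax, while the Brier Score improves smoothly throughout.

```python
import numpy as np

# Assumed 4-option multiple choice setup; probability mass on the wrong
# options is split evenly. Values are illustrative only.
p_correct = np.linspace(0.10, 0.60, 11)  # smooth improvement with scale
p_wrong = (1 - p_correct) / 3

# Multiple Choice Grade: 1 if the correct option is the argmax, else 0.
mc_grade = (p_correct > p_wrong).astype(int)

# Brier Score: squared error between predicted probs and the one-hot target.
brier = (1 - p_correct) ** 2 + 3 * p_wrong ** 2

for p, g, b in zip(p_correct, mc_grade, brier):
    print(f"p(correct)={p:.2f} | MC grade {g} | Brier {b:.3f}")
```

The Brier Score falls steadily as p(correct) grows, but the grade flips from 0 to 1 in a single step once p(correct) exceeds 0.25: the same smooth underlying improvement, read through two different metrics.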

The paper also shows how emergent abilities can be deliberately induced for vision tasks across various architectures: fully connected, convolutional, and self-attention networks.

The secondary reason

Due to the limited amount of test data, the performance of smaller models can be inaccurately estimated as exactly zero, creating the misconception that smaller models are entirely incapable of performing the task.
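
A quick back-of-the-envelope sketch (numbers assumed) shows how easily this happens: if a small model’s true accuracy on a task is 4% but the test set has only 25 questions, a large fraction of evaluation runs will report exactly 0%, which looks like a total absence of the ability.

```python
import numpy as np

# Assumed numbers: true accuracy 4%, test set of 25 questions.
rng = np.random.default_rng(0)
true_acc, test_size, trials = 0.04, 25, 10_000

# Each trial simulates one evaluation run on the small test set.
scores = rng.binomial(test_size, true_acc, size=trials) / test_size
print(f"runs scoring exactly 0%: {(scores == 0).mean():.0%}")
# Analytically, (1 - 0.04) ** 25 is about 0.36: roughly a third of runs
# would wrongly suggest the model has no ability at all.
```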

This study has 4 implications:

1. When choosing metric(s), one should consider the metric’s effect on the per-token error rate and adapt the measuring process accordingly.

2. When making claims about the capabilities of large models, including proper controls is critical.

3. Scientific progress can be hampered when models and their outputs are not made public for independent scientific investigation.

4. When constructing a benchmark, it’s essential to carefully consider both the task and the metric to ensure that the evaluation accurately reflects the model’s capabilities in the specific context of the task at hand.

The authors are not denying that LLMs can have emergent abilities; they are showing that the emergent abilities claimed so far appear only under certain metrics.

Is it a Mirage?

I hope you liked this short summary of the paper. Stay tuned for summaries of the rest of the awarded papers. 😊 (Thanks!!)

References:

  1. Schaeffer et al., Are Emergent Abilities of Large Language Models a Mirage? https://arxiv.org/pdf/2304.15004.pdf
  2. Wei et al., Emergent Abilities of Large Language Models. https://arxiv.org/pdf/2206.07682.pdf
