mirror of
https://github.com/dair-ai/Prompt-Engineering-Guide
synced 2024-11-02 15:40:13 +00:00
# Scaling Instruction-Finetuned Language Models

import {Screenshot} from 'components/screenshot'
import FLAN1 from '../../img/flan-1.png'
import FLAN2 from '../../img/flan-2.png'
import FLAN3 from '../../img/flan-3.png'
import FLAN4 from '../../img/flan-4.png'
import FLAN5 from '../../img/flan-5.png'
import FLAN6 from '../../img/flan-6.png'
import FLAN7 from '../../img/flan-7.png'
import FLAN8 from '../../img/flan-8.png'
import FLAN9 from '../../img/flan-9.png'
import FLAN10 from '../../img/flan-10.png'
import FLAN11 from '../../img/flan-11.png'

## What's new?

<Screenshot src={FLAN1} alt="FLAN1" />
Image Source: [Scaling Instruction-Finetuned Language Models](https://arxiv.org/abs/2210.11416)

This paper explores the benefits of scaling [instruction finetuning](https://arxiv.org/pdf/2109.01652.pdf) and how it improves performance across a variety of models (PaLM, T5), prompting setups (zero-shot, few-shot, CoT), and benchmarks (MMLU, TyDiQA). It studies three aspects: scaling the number of tasks (1.8K tasks), scaling model size, and finetuning on chain-of-thought data (9 datasets used).

**Finetuning procedure:**
- 1.8K tasks were phrased as instructions and used to finetune the model
- The finetuning data is formatted both with and without exemplars, and with and without CoT

The finetuning tasks and held-out tasks are shown below:

<Screenshot src={FLAN11} alt="FLAN11" />
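The four data formats named in the finetuning procedure (with or without exemplars, with or without CoT) can be sketched as a small templating helper. This is only an illustration: the `Q:`/`A:` template, the `COT_SUFFIX` wording, and the names `Example`, `build_input`, and `build_target` are assumptions made for this example, not the templates used in the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical wording appended to the instruction when a rationale is wanted.
COT_SUFFIX = " Give the rationale before answering."


@dataclass
class Example:
    question: str
    answer: str
    rationale: Optional[str] = None  # chain-of-thought text, if annotated


def build_target(ex: Example, use_cot: bool) -> str:
    """Target text: rationale plus answer for CoT, otherwise the answer alone."""
    if use_cot and ex.rationale:
        return f"{ex.rationale} So the answer is {ex.answer}."
    return ex.answer


def build_input(instruction: str, query: Example,
                exemplars: Sequence[Example] = (), use_cot: bool = False) -> str:
    """Input text: instruction, optional solved exemplars, then the open query."""
    parts = [instruction + (COT_SUFFIX if use_cot else "")]
    for ex in exemplars:
        parts.append(f"Q: {ex.question}\nA: {build_target(ex, use_cot)}")
    parts.append(f"Q: {query.question}\nA:")
    return "\n\n".join(parts)


demo = Example("What is 2 + 3?", "5", "2 plus 3 equals 5.")
query = Example("What is 4 + 1?", "5", "4 plus 1 equals 5.")

# Enumerate the four variants: {no exemplars, exemplars} x {no CoT, CoT}.
for exemplars in ([], [demo]):
    for use_cot in (False, True):
        print(build_input("Answer the question.", query, exemplars, use_cot))
        print("->", build_target(query, use_cot))
        print("---")
```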

## Capabilities & Key Results

- Instruction finetuning scales well with the number of tasks and the size of the model; this suggests the need to scale both the number of tasks and the model size further
- Adding CoT datasets to the finetuning mixture enables good performance on reasoning tasks
- Flan-PaLM has improved multilingual abilities: a 14.9% improvement on one-shot TyDiQA and an 8.1% improvement on arithmetic reasoning in under-represented languages
- Flan-PaLM also performs well on open-ended generation questions, which is a good indicator of improved usability
- Instruction finetuning also improves performance across responsible AI (RAI) benchmarks
- Flan-T5 instruction-tuned models demonstrate strong few-shot capabilities and outperform public checkpoints such as T5

**The results when scaling number of finetuning tasks and model size:** Scaling both the size of the model and the number of finetuning tasks is expected to continue improving performance, although scaling the number of tasks has diminishing returns.

<Screenshot src={FLAN2} alt="FLAN2" />
Image Source: [Scaling Instruction-Finetuned Language Models](https://arxiv.org/abs/2210.11416)

**The results when finetuning with non-CoT and CoT data:** Jointly finetuning on non-CoT and CoT data improves performance on both evaluations, compared to finetuning on just one or the other.

<Screenshot src={FLAN3} alt="FLAN3" />
Image Source: [Scaling Instruction-Finetuned Language Models](https://arxiv.org/abs/2210.11416)

In addition, self-consistency combined with CoT achieves SoTA results on several benchmarks. CoT + self-consistency also significantly improves results on benchmarks involving math problems (e.g., MGSM, GSM8K).

<Screenshot src={FLAN4} alt="FLAN4" />
Image Source: [Scaling Instruction-Finetuned Language Models](https://arxiv.org/abs/2210.11416)
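For readers unfamiliar with self-consistency, the idea is to sample several chain-of-thought completions for the same prompt and keep the final answer that appears most often. The sketch below shows only that voting step. The `sample` callable stands in for a model call with sampling enabled, and `extract_answer` uses a simple "last number in the text" heuristic; both are assumptions for illustration, not the paper's evaluation code.

```python
import re
from collections import Counter
from itertools import cycle
from typing import Callable, Optional


def extract_answer(completion: str) -> Optional[str]:
    """Return the last number mentioned in a completion, or None if there is none."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else None


def self_consistency(sample: Callable[[str], str], prompt: str, n: int = 5) -> Optional[str]:
    """Sample n reasoning paths and return the majority final answer."""
    answers = [extract_answer(sample(prompt)) for _ in range(n)]
    votes = Counter(a for a in answers if a is not None)
    return votes.most_common(1)[0][0] if votes else None


# Stand-in sampler that replays canned completions instead of calling a model.
canned = cycle([
    "She has 16 eggs, uses 7, and sells 9 at $2 each. The answer is 18.",
    "16 minus 7 is 9, times 2 is 20. The answer is 20.",
    "9 eggs remain, and 9 * 2 = 18. The answer is 18.",
])
print(self_consistency(lambda prompt: next(canned), "Q: ...\nA:", n=3))
```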

CoT finetuning unlocks zero-shot reasoning, activated by the phrase "let's think step-by-step", on BIG-Bench tasks. In general, zero-shot CoT Flan-PaLM outperforms zero-shot CoT PaLM without finetuning.

<Screenshot src={FLAN6} alt="FLAN6" />
Image Source: [Scaling Instruction-Finetuned Language Models](https://arxiv.org/abs/2210.11416)
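As a rough illustration of the zero-shot CoT setup described above, the trigger phrase is simply appended to the prompt, with no exemplars. The `Q:`/`A:` layout and the helper name `zero_shot_prompt` are assumptions made for this sketch; only the trigger phrase itself comes from the text.

```python
COT_TRIGGER = "Let's think step-by-step."


def zero_shot_prompt(question: str, cot: bool = False) -> str:
    """Build a zero-shot prompt, optionally ending with the CoT trigger phrase."""
    prompt = f"Q: {question}\nA:"
    return f"{prompt} {COT_TRIGGER}" if cot else prompt


question = "If a train leaves at 3 pm and travels for 2 hours, when does it arrive?"
print(zero_shot_prompt(question))            # plain zero-shot
print(zero_shot_prompt(question, cot=True))  # zero-shot CoT
```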

Below are some demonstrations of zero-shot CoT for PaLM and Flan-PaLM on unseen tasks.

<Screenshot src={FLAN5} alt="FLAN5" />
Image Source: [Scaling Instruction-Finetuned Language Models](https://arxiv.org/abs/2210.11416)

Below are more examples of zero-shot prompting. They show how the PaLM model struggles with repetition and fails to follow instructions in the zero-shot setting, whereas Flan-PaLM performs well. Few-shot exemplars can mitigate these errors.

<Screenshot src={FLAN7} alt="FLAN7" />
Image Source: [Scaling Instruction-Finetuned Language Models](https://arxiv.org/abs/2210.11416)

Below are some examples demonstrating more zero-shot capabilities of the Flan-PaLM model on several different types of challenging open-ended questions:

<Screenshot src={FLAN8} alt="FLAN8" />
Image Source: [Scaling Instruction-Finetuned Language Models](https://arxiv.org/abs/2210.11416)

<Screenshot src={FLAN9} alt="FLAN9" />
Image Source: [Scaling Instruction-Finetuned Language Models](https://arxiv.org/abs/2210.11416)

<Screenshot src={FLAN10} alt="FLAN10" />
Image Source: [Scaling Instruction-Finetuned Language Models](https://arxiv.org/abs/2210.11416)

You can try [Flan-T5 models on the Hugging Face Hub](https://huggingface.co/google/flan-t5-xxl).
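One way to try a Flan-T5 checkpoint locally is through the Hugging Face `transformers` library. The sketch below assumes `transformers` and `torch` are installed, and it defaults to the smaller `google/flan-t5-small` checkpoint rather than the linked `flan-t5-xxl` so that it fits on modest hardware. The helper names `instruction_prompt` and `generate` are made up for this example.

```python
def instruction_prompt(instruction: str, text: str) -> str:
    """Join an instruction and its input into a single prompt string."""
    return f"{instruction}: {text}"


def generate(prompt: str, model_name: str = "google/flan-t5-small",
             max_new_tokens: int = 64) -> str:
    """Run one prompt through a Flan-T5 checkpoint and return the decoded text."""
    # Imported lazily so the prompt helper works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


prompt = instruction_prompt("Translate English to German", "How old are you?")
print(prompt)
```

Calling `generate(prompt)` downloads the checkpoint on first use; passing `model_name="google/flan-t5-xxl"` selects the model linked above, which needs considerably more memory.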