# Multimodal CoT Prompting

import { Callout, FileTree } from 'nextra-theme-docs'
import {Screenshot} from 'components/screenshot'
import MCOT from '../../img/multimodal-cot.png'

[Zhang et al. (2023)](https://arxiv.org/abs/2302.00923) recently proposed a multimodal chain-of-thought prompting approach. Traditional CoT focuses on the language modality. In contrast, Multimodal CoT incorporates text and vision into a two-stage framework: the first stage generates a rationale from the multimodal input, and the second stage performs answer inference, leveraging the generated rationale.
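Note that the paper's actual method fine-tunes a model (around 1B parameters) rather than prompting an API. Still, the two-stage decomposition can be approximated with any vision-capable chat model. Below is a minimal Python sketch using OpenAI's Chat Completions API; the model name, image URL, question, and the `ask` helper are all placeholders for illustration, not part of the paper:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholders: substitute your own image and question.
IMAGE_URL = "https://example.com/science-question.png"
QUESTION = "Which property do these two objects have in common?"

def ask(prompt: str) -> str:
    """Hypothetical helper: send one text + image message, return the reply."""
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable chat model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": IMAGE_URL}},
            ],
        }],
    )
    return response.choices[0].message.content

# Stage 1: rationale generation from the combined text + vision input.
rationale = ask(
    f"Question: {QUESTION}\n"
    "Generate a step-by-step rationale grounded in the image before answering."
)

# Stage 2: answer inference, conditioned on the generated rationale.
answer = ask(
    f"Question: {QUESTION}\nRationale: {rationale}\n"
    "Based on the rationale above, give the final answer."
)

print(answer)
```

Keeping the stages as separate calls means the final answer is explicitly conditioned on a rationale that was itself grounded in the image, which is the core idea of the two-stage framework.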

The multimodal CoT model (1B) outperforms GPT-3.5 on the ScienceQA benchmark.

<Screenshot src={MCOT} alt="MCOT" />

Image Source: [Zhang et al. (2023)](https://arxiv.org/abs/2302.00923)

Further reading:

- [Language Is Not All You Need: Aligning Perception with Language Models](https://arxiv.org/abs/2302.14045) (Feb 2023)