# Multimodal CoT Prompting

import { Callout, FileTree } from 'nextra-theme-docs'
import {Screenshot} from 'components/screenshot'
import MCOT from '../../img/multimodal-cot.png'
[Zhang et al. (2023)](https://arxiv.org/abs/2302.00923) recently proposed a multimodal chain-of-thought prompting approach. Traditional CoT focuses on the language modality. In contrast, Multimodal CoT incorporates text and vision into a two-stage framework. The first stage generates rationales based on the multimodal information; the second stage, answer inference, then leverages these informative generated rationales.
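
To make the two-stage framework concrete, the sketch below applies the same decomposition as a prompting strategy against a vision-capable chat model via the OpenAI Python SDK. Note that this only illustrates the rationale-then-answer split; the paper itself fine-tunes a model rather than prompting an API, and the model name, image URL, question, and prompt wording here are all illustrative placeholders.

```python
# A minimal sketch of two-stage multimodal CoT as a prompting strategy.
# Stage 1 generates a rationale from the image + question; Stage 2 infers
# the answer conditioned on that rationale. Model name, image URL, and
# prompts are placeholders, not the paper's setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

IMAGE_URL = "https://example.com/science-diagram.png"  # placeholder image
QUESTION = "Which of these two magnets will attract each other?"

def vision_prompt(text: str) -> str:
    """Send one text + image turn to a vision-capable chat model."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": text},
                {"type": "image_url", "image_url": {"url": IMAGE_URL}},
            ],
        }],
    )
    return response.choices[0].message.content

# Stage 1: rationale generation from the multimodal input
rationale = vision_prompt(
    f"Question: {QUESTION}\n"
    "Generate a step-by-step rationale based on the image and the question."
)

# Stage 2: answer inference, leveraging the generated rationale
answer = vision_prompt(
    f"Question: {QUESTION}\nRationale: {rationale}\n"
    "Using the rationale above, give the final answer."
)
print(answer)
```

Running the two stages as separate calls mirrors the framework's key idea: the final answer is inferred from a rationale that was itself grounded in the image, rather than produced in a single pass.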
The Multimodal-CoT model, at under 1B parameters, outperforms GPT-3.5 on the ScienceQA benchmark.
<Screenshot src={MCOT} alt="MCOT" />
Further reading:
- [Language Is Not All You Need: Aligning Perception with Language Models](https://arxiv.org/abs/2302.14045) (Feb 2023)