multimodal CoT

pull/22/head
Elvis Saravia 1 year ago
parent db5414fbd6
commit 91495493ed

@@ -89,6 +89,7 @@ The following are the latest papers (sorted by release date) on prompt engineeri
- [Large Language Models are Zero-Shot Reasoners](https://arxiv.org/abs/2205.11916) (May 2022)
- [MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning](https://arxiv.org/abs/2205.00445) (May 2022)
- [Toxicity Detection with Generative Prompt-based Inference](https://arxiv.org/abs/2205.12390) (May 2022)
- [Learning to Transfer Prompts for Text Generation](https://arxiv.org/abs/2205.01543) (May 2022)
- [The Unreliability of Explanations in Few-shot Prompting for Textual Reasoning](https://arxiv.org/abs/2205.03401) (May 2022)
- [A Taxonomy of Prompt Modifiers for Text-To-Image Generation](https://arxiv.org/abs/2204.13988) (Apr 2022)
- [PromptChainer: Chaining Large Language Model Prompts through Visual Programming](https://arxiv.org/abs/2203.06566) (Mar 2022)

@@ -7,7 +7,7 @@ In this section, we discuss other miscellaneous but important topics in prompt e
Topic:
- [Program-Aided Language Models](#program-aided-language-models)
- [ReAct](#react)
- [Multimodal Prompting](#multimodal-prompting)
- [Multimodal CoT Prompting](#multimodal-cot-prompting)
- [GraphPrompts](#graphprompts)
---
@@ -30,10 +30,13 @@ The ReAct framework can allow LLMs to interact with external tools to retrieve a
Full example coming soon!
---
## Multimodal Prompting
In this section, we will cover some examples of multimodal prompting techniques and applications that leverage multiple modalities as opposed to just text alone.
## Multimodal CoT Prompting
Examples coming soon!
[Zhang et al. (2023)](https://arxiv.org/abs/2302.00923) recently proposed a multimodal chain-of-thought prompting approach. Traditional CoT focuses on the language modality alone. In contrast, Multimodal CoT incorporates both text and vision into a two-stage framework: the first stage generates rationales from the combined text and vision inputs; the second stage, answer inference, uses those generated rationales to produce the final answer.
The multimodal CoT model (1B) outperforms GPT-3.5 on the ScienceQA benchmark.
![](../img/multimodal-cot.png)
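The two-stage framework described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `call_model` is a hypothetical stub standing in for a real vision-language model call, and the question, options, and returned strings are invented for the example.

```python
def call_model(prompt, image=None):
    # Hypothetical stub for a vision-language model call. A real
    # implementation would send the text prompt plus image features
    # to a multimodal model and return its generated text.
    if "Therefore, the answer is" in prompt:
        return "(B) the bulb lights up"
    return "The diagram shows a closed circuit, so current can flow."

def multimodal_cot(question, options, image=None):
    # Stage 1: rationale generation, conditioned on BOTH the text
    # of the question and the vision input.
    rationale_prompt = (
        f"Question: {question}\n"
        f"Options: {', '.join(options)}\n"
        "Look at the image and generate a step-by-step rationale."
    )
    rationale = call_model(rationale_prompt, image=image)

    # Stage 2: answer inference, conditioned on the generated rationale.
    answer_prompt = (
        f"Question: {question}\n"
        f"Options: {', '.join(options)}\n"
        f"Rationale: {rationale}\n"
        "Therefore, the answer is:"
    )
    answer = call_model(answer_prompt, image=image)
    return {"rationale": rationale, "answer": answer}

result = multimodal_cot(
    "What happens when the switch is closed?",
    ["(A) nothing", "(B) the bulb lights up"],
    image=None,  # pass raw image bytes/features here in a real setup
)
print(result["answer"])
```

The key design point is that the answer prompt in stage 2 includes the stage-1 rationale, so the final inference is grounded in reasoning over both modalities rather than the question text alone.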
---
## GraphPrompts

Binary file added: ../img/multimodal-cot.png (171 KiB)
