import { Callout, FileTree } from 'nextra-theme-docs'
import {Screenshot} from 'components/screenshot'
import GEMINI1 from '../../img/gemini/gemini-1.png'
import GEMINI2 from '../../img/gemini/gemini-architecture.png'
import GEMINI3 from '../../img/gemini/gemini-result.png'
import GEMINI4 from '../../img/gemini/gemini-2.png'
import GEMINI5 from '../../img/gemini/gemini-3.png'
import GEMINI6 from '../../img/gemini/gemini-6.png'
import GEMINI7 from '../../img/gemini/gemini-7.png'
In this guide, we provide an overview of the Gemini models and how to effectively prompt and use them. The guide also includes capabilities, tips, applications, limitations, papers, and additional reading materials related to the Gemini models.
## Introduction to Gemini
Gemini is the newest most capable AI model from Google Deepmind. It's built with multimodal capabilities from the ground up and can showcases impressive crossmodal reasoning across texts, images, video, audio, and code.
Gemini comes in three sizes:
- **Ultra** - the most capable of the model series and good for highly complex tasks
- **Pro** - considered the best model for scaling across a wide range of tasks
- **Nano** - an efficient model for on-device memory-constrained tasks and use-cases; they include 1.8B (Nano-1) and 3.25B (Nano-2) parameters models and distilled from large Gemini models and quantized to 4-bit.
According to the accompanying [technical report](https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf), Gemini advances state of the art in 30 of 32 benchmarks covering tasks such as language, coding, reasoning, and multimodal reasoning.
It is the first model to achieve human-expert performance on [MMLU](https://paperswithcode.com/dataset/mmlu) (a popular exam benchmark), and claim state of the art in 20 multimodal benchmarks. Gemini Ultra achieves 90.0% on MMLU and 62.4% on the [MMMU benchmark](https://mmmu-benchmark.github.io/) which requires college-level subject knowledge and reasoning.
The Gemini models are trained to support 32k context length and built of top of Transformer decoders with efficient attention mechanisms (e.g., [multi-query attention](https://arxiv.org/abs/1911.02150)). They support textual input interleaved with audio and visual inputs and can produce text and image outputs.
<Screenshot src={GEMINI2} alt="GEMINI2" />
The models are trained on both multimodal and multilingual data such as web documents, books, and code data, including images, audio, and video data. The models are trained jointly across all modalities and show strong crossmodal reasoning capabilities and even strong capabilities in each domain.
## Gemini Experimental Results
Gemini Ultra achieves highest accuracy when combined with approaches like [chain-of-thought (CoT) prompting](https://www.promptingguide.ai/techniques/cot) and [self-consistency](https://www.promptingguide.ai/techniques/consistency) which helps dealing with model uncertainty.
As reported in the technical report, Gemini Ultra improves its performance on MMLU from 84.0% with greedy sampling to 90.0% with uncertainty-routed chain-of-thought approach (involve CoT and majority voting) with 32 samples while it marginally improves to 85.0% with the use of 32 chain-of-thought samples only. Similarly, CoT and self-consistency achieves 94.4% accuracy on the GSM8K grade-school math benchmark. In addition, Gemini Ultra correctly implements 74.4% of the [HumanEval](https://paperswithcode.com/dataset/humaneval) code completion problems. Below is a table summarizing the results of Gemini and how the models compare to other notable models.
<Screenshot src={GEMINI3} alt="GEMINI3" />
The Gemini Nano Models also show strong performance on factuality (i.e. retrieval-related tasks), reasoning, STEM, coding, multimodal and multilingual tasks.
Besides standard multilingual capabilities, Gemini shows great performance on multilingual math and summarization benchmarks like [MGSM](https://paperswithcode.com/dataset/mgsm) and [XLSum](https://paperswithcode.com/dataset/xl-sum), respectively.
The Gemini models are trained on a sequence length of 32K and are found to retrieve correct values with 98% accuracy when queried across the context length. This is an important capability to support new use cases such as retrieval over documents and video understanding.
The instruction-tuned Gemini models are consistently preferred by human evaluators on important capabilities such as instruction following, creative writing, and safety.
## Gemini Multimodal Reasoning Capabilities
Gemini is trained natively multimodal and exhibits the ability to combine capabilities across modalities with the reasoning capabilities of the language model. Capabilities include but not limited to information extraction from tables, charts, and figures. Other interesting capabilities include discerning fine-grained details from inputs, aggregating context across space and time, and combining information across different modalities.
Gemini consistently outperforms existing approaches across image understanding tasks such as high-level object recognition, fine-grained transcription, chart understanding, and multimodal reasoning. Some of the image understanding and generation capabilities also transfer across a diverse set of global language (e.g., generating image descriptions using languages like Hindi and Romanian).
### Verifying and Correcting
Gemini models display impressive crossmodal reasoning capabilities. For instance, the figure below demonstrates a solution to a physics problem drawn by a teacher (left). Gemini is then prompted to reason about the question and explain where the student went wrong in the solution if they did so. The model is also instructed to solve the problem and use LaTeX for the math parts. The response (right) is the solution provided by the model which explains the problem and solution with details.
<Screenshot src={GEMINI1} alt="GEMINI1" />
### Rearranging Figures
Below is another interesting example from the technical report showing Gemini's multimodal reasoning capabilities to generate matplotlib code for rearranging subplots. The multimodal prompt is shown on the top left, the generated code on the right, and the rendered code on the bottom left. The model is leveraging several capabilities to solve the task such as recognition, code generation, abstract reasoning on subplot location, and instruction following to rearrange the subplots in their desired positions.
<Screenshot src={GEMINI4} alt="GEMINI4" />
### Video Understanding
Gemini Ultra achieves state-of-the-art results on various few-shot video captioning tasks and zero-shot video question answering. The example below shows that the model is provided a video and text instruction as input. It can analyze the video and reason about the situation to provide an appropriate answer or in this case recommendations on how the person could improve their technique.
<Screenshot src={GEMINI5} alt="GEMINI5" />
### Image Understanding
Gemini Ultra can also take few-shot prompts and generate images. For example, as shown in the example below, it can be prompted with one example of interleaved image and text where the user provides information about two colors and image suggestions. The model then take the final instruction in the prompt and then respond with the colors it sees together with some ideas.
<Screenshot src={GEMINI6} alt="GEMINI6" />
### Modality Combination
The Gemini models also show the ability to process a sequence of audio and images natively. From the example, you can observe that the model can be prompted with a sequence of audio and images. The model is able to then send back a text response that's taking the context of each interaction.
<Screenshot src={GEMINI7} alt="GEMINI7" />
## Gemini Generalist Coding Agent
Gemini is also used to build a generalist agent called [AlphaCode 2](https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf) that combines it's reasoning capabilities with search and tool-use to solve competitive programming problems. AlphaCode 2 ranks within the top 15% of entrants on the Codeforces competitive programming platform.
## References
- [Introducing Gemini: our largest and most capable AI model](https://blog.google/technology/ai/google-gemini-ai/#sundar-note)
- [How it’s Made: Interacting with Gemini through multimodal prompting](https://developers.googleblog.com/2023/12/how-its-made-gemini-multimodal-prompting.html)
- [Welcome to the Gemini era](https://deepmind.google/technologies/gemini/#introduction)
- [Gemini: A Family of Highly Capable Multimodal Models - Technical Report](https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf)
- [Fast Transformer Decoding: One Write-Head is All You Need](https://arxiv.org/abs/1911.02150)