context caching

main
Elvis Saravia 2 weeks ago
parent acf2b6ff57
commit fe056a6212

@@ -0,0 +1,276 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"%%capture\n",
"pip install -q -U google-generativeai"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"from google.generativeai import caching\n",
"import google.generativeai as genai\n",
"import os\n",
"import time\n",
"import datetime\n",
"\n",
"from dotenv import load_dotenv\n",
"\n",
"load_dotenv()\n",
"\n",
"genai.configure(api_key=os.environ[\"GEMINI_API_KEY\"])"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"file_name = \"weekly-ai-papers.txt\"\n",
"file = genai.upload_file(path=file_name)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# Wait for the file to finish processing\n",
"while file.state.name == \"PROCESSING\":\n",
" print('Waiting for video to be processed.')\n",
" time.sleep(2)\n",
" video_file = genai.get_file(file.name)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"File processing complete: https://generativelanguage.googleapis.com/v1beta/files/n146hu3zpxvv\n"
]
}
],
"source": [
"print(f'File processing complete: ' + file.uri)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"# Create a cache with a 5 minute TTL\n",
"cache = caching.CachedContent.create(\n",
" model=\"models/gemini-1.5-flash-001\",\n",
" display_name=\"ml papers of the week\", # used to identify the cache\n",
" system_instruction=\"You are an expert AI researcher, and your job is to answer user's query based on the file you have access to.\",\n",
" contents=[file],\n",
" ttl=datetime.timedelta(minutes=15),\n",
")\n",
"\n",
"# create the model\n",
"model = genai.GenerativeModel.from_cached_content(cached_content=cache)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The latest AI papers of the week, according to the file provided, are from **June 3 - June 9, 2024**. \n",
"\n",
"Here is a summary:\n",
"\n",
"1. **NLLB**: This paper proposes a massive multilingual model that leverages transfer learning across 200 languages. It achieves a significant improvement in translation quality. \n",
"2. **Extracting Concepts from GPT-4**: This paper presents a new method to extract interpretable patterns from GPT-4, making the model more understandable and predictable.\n",
"3. **Mamba-2**: This paper introduces an enhanced architecture combining state space models (SSMs) and structured attention, leading to improved performance on tasks requiring large state capacity.\n",
"4. **MatMul-free LLMs**: This paper proposes an implementation that eliminates matrix multiplication operations from LLMs, achieving significant memory reduction while maintaining performance.\n",
"5. **Buffer of Thoughts**: This paper presents a new prompting technique to enhance LLM-based reasoning, improving accuracy and efficiency compared to other methods.\n",
"6. **SaySelf**: This paper introduces a framework to teach LLMs to express accurate confidence estimates and rationales, boosting model transparency and reliability. \n",
"7. **The Geometry of Concepts in LLMs**: This paper studies how hierarchical relationships between concepts are encoded in LLMs, revealing insights into the model's internal representation. \n",
"8. **Aligning LLMs with Demonstrated Feedback**: This paper proposes a method to align LLMs to specific settings using a limited number of demonstrations, leading to improved task alignment across domains.\n",
"9. **Towards Scalable Automated Alignment of LLMs**: This paper explores different strategies for aligning LLMs, including aligning through inductive bias, imitation, and environmental feedback.\n",
"10. **AgentGym**: This paper presents a new framework for LLM-based agents, enabling them to explore various environments and tasks, going beyond previously seen data.\n",
"\n",
"You can find links to the papers, as well as related tweets, in the file. \n",
"\n"
]
}
],
"source": [
"# query the model\n",
"response = model.generate_content([\"Can you please tell me the latest AI papers of the week?\"])\n",
"\n",
"print(response.text)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Here are the papers mentioned in the document that discuss Mamba:\n",
"\n",
"* **Mamba-2** - a new architecture that combines state space models (SSMs) and structured attention; it uses 8x larger states and trains 50% faster; the new state space duality layer is more efficient and scalable compared to the approach used in Mamba; it also improves results on tasks that require large state capacity. \n",
"\n",
"* **MoE-Mamba** - an approach to efficiently scale LLMs by combining state space models (SSMs) with Mixture of Experts (MoE); MoE-Mamba, outperforms both Mamba and Transformer-MoE; it reaches the same performance as Mamba in 2.2x less training steps while preserving the inference performance gains of Mamba against the Transformer. \n",
"\n",
"* **MambaByte** - adapts Mamba SSM to learn directly from raw bytes; bytes lead to longer sequences which autoregressive Transformers will scale poorly on; this work reports huge benefits related to faster inference and even outperforms subword Transformers. \n",
"\n"
]
}
],
"source": [
"response = model.generate_content([\"Can you list the papers that mention Mamba? List the title of the paper and summary.\"])\n",
"print(response.text)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Here are some of the innovations around long-context LLMs from the papers listed:\n",
"\n",
"**1. Leave No Context Behind** \n",
"* **Paper:** Leave No Context Behind\n",
"* **Summary:** This paper proposes Infini-attention, an attention mechanism that incorporates a compressive memory module into a vanilla attention mechanism, enabling Transformers to effectively process infinitely long inputs with bounded memory footprint and computation.\n",
"\n",
"**2. DeepSeek-V2**\n",
"* **Paper:** DeepSeek-V2\n",
"* **Summary:** A 236B parameter Mixture-of-Experts (MoE) model that supports a context length of 128K tokens. It uses Multi-head Latent Attention (MLA) for efficient inference by compressing the Key-Value (KV) cache into a latent vector.\n",
"\n",
"**3. Make Your LLM Fully Utilize the Context**\n",
"* **Paper:** Make Your LLM Fully Utilize the Context\n",
"* **Summary:** This paper presents an approach to overcome the \"lost-in-the-middle\" challenge in LLMs. It applies an \"information-intensive\" training procedure to enable the LLM to fully utilize the context. \n",
"\n",
"**4. Gemini 1.5 Flash**\n",
"* **Paper:** Gemini 1.5 Flash \n",
"* **Summary:** A lightweight transformer decoder model with a 2M context window and multimodal capabilities. It's designed for efficiency and yields the fastest output generation of all models on several evaluated languages.\n",
"\n",
"**5. Grok-1.5**\n",
"* **Paper:** Grok-1.5 \n",
"* **Summary:** A long-context LLM that can process contexts of up to 128K tokens. It demonstrates strong retrieval capabilities.\n",
"\n",
"**6. Large World Model**\n",
"* **Paper:** Large World Model\n",
"* **Summary:** A general-purpose 1M context multimodal model trained on long videos and books using RingAttention. It sets new benchmarks in difficult retrieval tasks and long video understanding.\n",
"\n",
"**7. MambaByte**\n",
"* **Paper:** MambaByte\n",
"* **Summary:** Adapts the Mamba state space model to learn directly from raw bytes, enabling faster inference and outperforming subword Transformers.\n",
"\n",
"**8. Efficient Inference of LLMs**\n",
"* **Paper:** Efficient Inference of LLMs\n",
"* **Summary:** Proposes a layer-condensed KV cache for efficient inference in LLMs. It only computes and caches the key-values (KVs) of a small number of layers, leading to memory savings and improved inference throughput. \n",
"\n",
"**9. Retrieval Augmented Thoughts (RAT)**\n",
"* **Paper:** Retrieval Augmented Thoughts\n",
"* **Summary:** Shows that iteratively revising a chain of thoughts with information retrieval can significantly improve LLM reasoning and generation in long-horizon generation tasks.\n",
"\n",
"**10. Are Long-LLMs A Necessity For Long-Context Tasks?**\n",
"* **Paper:** Are Long-LLMs A Necessity For Long-Context Tasks?\n",
"* **Summary:** This paper claims that long-LLMs are not a necessity for solving long-context tasks. It proposes a reasoning framework to enable short-LLMs to address long-context tasks by adaptively accessing and utilizing the context based on the presented tasks. \n",
"\n",
"**11. Leave No Context Behind**\n",
"* **Paper:** Leave No Context Behind \n",
"* **Summary:** Integrates compressive memory into a vanilla dot-product attention layer to enable Transformer LLMs to effectively process infinitely long inputs with bounded memory footprint and computation.\n",
"\n",
"**12. The Illusion of State in State-Space Models**\n",
"* **Paper:** The Illusion of State in State-Space Models\n",
"* **Summary:** Investigates the expressive power of state space models (SSMs) and reveals that they are limited similar to transformers in that SSMs cannot express computation outside the complexity class 𝖳𝖢^0. \n",
"\n",
"**13. StreamingLLM**\n",
"* **Paper:** StreamingLLM\n",
"* **Summary:** Enables efficient streaming LLMs with attention sinks, a phenomenon where the KV states of initial tokens will largely recover the performance of window attention. \n",
"\n",
"**14. UniIR**\n",
"* **Paper:** UniIR\n",
"* **Summary:** A unified instruction-guided multimodal retriever that handles eight retrieval tasks across modalities. It can generalize to unseen retrieval tasks and achieves robust performance across existing datasets and zero-shot generalization to new tasks. \n",
"\n",
"**15. LongLoRA**\n",
"* **Paper:** LongLoRA\n",
"* **Summary:** An efficient fine-tuning approach to significantly extend the context windows of pre-trained LLMs. It implements shift short attention, a substitute that approximates the standard self-attention pattern during training. \n",
"\n",
"**16. Recurrent Memory Finds What LLMs Miss**\n",
"* **Paper:** Recurrent Memory Finds What LLMs Miss\n",
"* **Summary:** Explores the capability of transformer-based models in extremely long context processing. It finds that both GPT-4 and RAG performance heavily rely on the first 25% of the input. It reports that recurrent memory augmentation of transformer models achieves superior performance on documents of up to 10 million tokens. \n",
"\n",
"**17. System 2 Attention**\n",
"* **Paper:** System 2 Attention\n",
"* **Summary:** Leverages the reasoning and instruction following capabilities of LLMs to decide what to attend to. It regenerates input context to only include relevant portions before attending to the regenerated context to elicit the final response from the model. \n",
"\n",
"**18. Extending Context Window of LLMs**\n",
"* **Paper:** Extending Context Window of LLMs\n",
"* **Summary:** Extends the context window of LLMs like LLaMA to up to 32K with minimal fine-tuning (within 1000 steps). \n",
"\n",
"**19. Efficient Context Window Extension of LLMs**\n",
"* **Paper:** Efficient Context Window Extension of LLMs\n",
"* **Summary:** Proposes a compute-efficient method for efficiently extending the context window of LLMs beyond what it was pretrained on. \n",
"\n"
]
}
],
"source": [
"response = model.generate_content([\"What are some of the innovations around long context LLMs? List the title of the paper and summary.\"])\n",
"print(response.text)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "peguide",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.-1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

@@ -1,4 +1,4 @@
# LLM Applications
# LLM Applications & Guides
import { Callout } from 'nextra-theme-docs'
import {Cards, Card} from 'nextra-theme-docs'

@@ -1,5 +1,6 @@
{
"function_calling": "Function Calling",
"context-caching": "Context Caching with LLMs",
"generating": "Generating Data",
"synthetic_rag": "Generating Synthetic Dataset for RAG",
"generating_textbooks": "Tackling Generated Datasets Diversity",

@@ -0,0 +1,55 @@
# Context Caching with Gemini 1.5 Flash
import {Cards, Card} from 'nextra-theme-docs'
import {CodeIcon} from 'components/icons'
Google recently released a new feature called [context caching](https://ai.google.dev/gemini-api/docs/caching?lang=python), which is available via the Gemini API for the Gemini 1.5 Pro and Gemini 1.5 Flash models. This guide provides a basic example of how to use context caching with Gemini 1.5 Flash.
<iframe width="100%"
height="415px"
src="https://www.youtube.com/embed/987Pd89EDPs?si=j43isgNb0uwH5AeI" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
allowFullScreen
/>
### The Use Case: Analyzing a Year's Worth of ML Papers
The guide demonstrates how you can use context caching to analyze the summaries of all the [ML papers we've documented over the past year](https://github.com/dair-ai/ML-Papers-of-the-Week). We store these summaries in a text file, which can now be fed to the Gemini 1.5 Flash model and queried efficiently.
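
As a minimal sketch of this data-preparation step, the summaries can be downloaded from the repository's README and saved as plain text. The raw URL below is an assumption (point it at wherever your copy of the summaries lives); the output filename matches the one used in the notebook.

```python
import urllib.request

# Assumed raw URL for the ML-Papers-of-the-Week README; adjust as needed
README_URL = "https://raw.githubusercontent.com/dair-ai/ML-Papers-of-the-Week/main/README.md"

# Download the markdown summaries and store them as a plain text file
with urllib.request.urlopen(README_URL) as resp:
    text = resp.read().decode("utf-8")

with open("weekly-ai-papers.txt", "w", encoding="utf-8") as f:
    f.write(text)
```
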
### The Process: Uploading, Caching, and Querying
1. **Data Preparation:** First, convert the README file (containing the summaries) into a plain text file.
2. **Utilizing the Gemini API:** You can upload the text file using the Google `generativeai` library.
3. **Implementing Context Caching:** A cache is created using the `caching.CachedContent.create()` function (see the end-to-end sketch after this list). This involves:
* Specifying the Gemini 1.5 Flash model.
* Providing a name for the cache.
* Defining an instruction for the model (e.g., "You are an expert AI researcher...").
* Setting a time-to-live (TTL) for the cache (e.g., 15 minutes).
4. **Creating the Model:** We then create a generative model instance using the cached content.
5. **Querying:** We can start querying the model with natural language questions like:
* "Can you please tell me the latest AI papers of the week?"
* "Can you list the papers that mention Mamba? List the title of the paper and summary."
* "What are some of the innovations around long-context LLMs? List the title of the paper and summary."
The results were promising. The model accurately retrieved and summarized information from the text file. Context caching proved highly efficient, eliminating the need to repeatedly send the entire text file with each query.
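
One way to sanity-check that the cache is actually being hit is to inspect the token accounting on a response. The usage-metadata fields below are taken from recent versions of the `google-generativeai` library and may differ across releases.

```python
# Token accounting for the previous response (field names may vary by SDK version)
usage = response.usage_metadata
print("prompt tokens:", usage.prompt_token_count)
print("tokens served from the cache:", usage.cached_content_token_count)
print("output tokens:", usage.candidates_token_count)
```
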
This workflow has the potential to be a valuable tool for researchers, allowing them to:
* Quickly analyze and query large amounts of research data.
* Retrieve specific findings without manually searching through documents.
* Conduct interactive research sessions without wasting prompt tokens.
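
Since a cache only lives for its TTL, longer sessions may also want to inspect, extend, or delete caches explicitly. Here is a minimal sketch using the `caching` module; the calls follow the Gemini context-caching docs, but verify the exact signatures against your installed library version.

```python
# List the caches associated with this API key
for c in caching.CachedContent.list():
    print(c.name, c.display_name)

# Extend the TTL of the cache for a longer research session
cache.update(ttl=datetime.timedelta(hours=1))

# Delete the cache once the session is over
cache.delete()
```
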
We are excited to explore further applications of context caching, especially within more complex scenarios like agentic workflows.
The notebook can be found below:
<Cards>
<Card
icon={<CodeIcon />}
title="Context Caching with Gemini APIs"
href="https://github.com/dair-ai/Prompt-Engineering-Guide/blob/main/notebooks/gemini-context-caching.ipynb"
/>
</Cards>