## Fireworks.AI + LangChain + RAG
 
[Fireworks AI](https://python.langchain.com/docs/integrations/llms/fireworks) aims to provide the best experience when working with LangChain. This notebook is an example of using Fireworks and LangChain together for retrieval-augmented generation (RAG).

See [our models page](https://fireworks.ai/models) for the full list of models. We use `accounts/fireworks/models/mixtral-8x7b-instruct` for RAG in this tutorial.

For the RAG target, we will use the [Gemma technical report](https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf).

In [1]:
%pip install --quiet pypdf chromadb tiktoken openai
%pip uninstall -y langchain-fireworks
%pip install --editable /mnt/disks/data/langchain/libs/partners/fireworks


[notice] A new release of pip is available: 23.2.1 -> 24.0
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
Found existing installation: langchain-fireworks 0.0.1
Uninstalling langchain-fireworks-0.0.1:
  Successfully uninstalled langchain-fireworks-0.0.1
Note: you may need to restart the kernel to use updated packages.
Obtaining file:///mnt/disks/data/langchain/libs/partners/fireworks
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: langchain-fireworks
  Building editable for langchain-fireworks (pyproject.toml) ... done
  Created 

In [3]:
import fireworks

print(fireworks)
import fireworks.client

<module 'fireworks' from '/mnt/disks/data/langchain/.venv/lib/python3.9/site-packages/fireworks/__init__.py'>


In [None]:
# Load
import requests
from langchain_community.document_loaders import PyPDFLoader

# Download the PDF from a URL and save it to a temporary location
url = "https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf"
response = requests.get(url, timeout=60)
response.raise_for_status()  # fail early if the download did not succeed
file_name = "temp_file.pdf"
with open(file_name, "wb") as pdf:
    pdf.write(response.content)

loader = PyPDFLoader(file_name)
data = loader.load()

# Split
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)

# Add to vectorDB
from langchain_community.vectorstores import Chroma
from langchain_fireworks.embeddings import FireworksEmbeddings

vectorstore = Chroma.from_documents(
    documents=all_splits,
    collection_name="rag-chroma",
    embedding=FireworksEmbeddings(),
)

retriever = vectorstore.as_retriever()
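
The cell above splits the PDF into chunks, embeds them, and exposes a retriever over the resulting vector store. As a rough illustration of that idea only, here is a minimal standard-library sketch. It is not how `RecursiveCharacterTextSplitter`, `FireworksEmbeddings`, or Chroma work internally: the word-window splitter, the bag-of-words `embed` function, and the toy document text are all assumptions made up for this example.

```python
import math
from collections import Counter


def split_words(text, chunk_size=9, chunk_overlap=0):
    """Naive splitter (assumption): fixed windows of words, not recursive splitting."""
    words = text.split()
    step = chunk_size - chunk_overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]


def embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Toy text for illustration only; it is not taken from the Gemma report.
doc = (
    "the model uses rotary positional embeddings in every layer "
    "training data was filtered for quality and safety reasons "
    "the tokenizer has a large vocabulary of subword pieces"
)
chunks = split_words(doc, chunk_size=9, chunk_overlap=0)
top = retrieve("which positional embeddings does the model use", chunks, k=1)
print(top)
```

A real embedding model captures meaning rather than exact word overlap, so the ranking in the notebook can differ substantially from what this toy similarity would produce.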

In [3]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

# RAG prompt
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# LLM: use the Fireworks-hosted Mixtral model named at the top of this tutorial
from langchain_fireworks import Fireworks

llm = Fireworks(
    model="accounts/fireworks/models/mixtral-8x7b-instruct",
    temperature=0.0,
    max_tokens=2000,
    top_k=1,
)

# RAG chain
chain = (
    RunnableParallel({"context": retriever, "question": RunnablePassthrough()})
    | prompt
    | llm
    | StrOutputParser()
)
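
To make the data flow of the chain easier to follow, here is a plain-Python sketch of the first two stages: the same question is routed to both the retriever and the `question` slot, and the results are filled into the prompt template. `fake_retriever` and `build_prompt` are hypothetical names used only here. Joining the chunk texts with blank lines is a choice made for this sketch; the chain above passes the retriever output to the template as-is.

```python
TEMPLATE = """Answer the question based only on the following context:
{context}

Question: {question}
"""


def fake_retriever(question):
    """Hypothetical retriever: returns fixed chunk texts regardless of the question."""
    return ["chunk one text", "chunk two text"]


def build_prompt(question, retriever):
    # Like RunnableParallel, feed the same input to both branches.
    inputs = {"context": retriever(question), "question": question}
    # Assumption for this sketch: join the chunks into one context string.
    context = "\n\n".join(inputs["context"])
    return TEMPLATE.format(context=context, question=inputs["question"])


prompt_text = build_prompt("What is in the report?", fake_retriever)
print(prompt_text)
```

The resulting string is what would then be sent to the LLM, whose raw completion `StrOutputParser` returns as a plain string.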

In [4]:
chain.invoke("What are the Architectural details of Mixtral?")

'\nAnswer: The architectural details of Mixtral are as follows:\n- Dimension (dim): 4096\n- Number of layers (n\\_layers): 32\n- Dimension of each head (head\\_dim): 128\n- Hidden dimension (hidden\\_dim): 14336\n- Number of heads (n\\_heads): 32\n- Number of kv heads (n\\_kv\\_heads): 8\n- Context length (context\\_len): 32768\n- Vocabulary size (vocab\\_size): 32000\n- Number of experts (num\\_experts): 8\n- Number of top k experts (top\\_k\\_experts): 2\n\nMixtral is based on a transformer architecture and uses the same modifications as described in [18], with the notable exceptions that Mixtral supports a fully dense context length of 32k tokens, and the feedforward block picks from a set of 8 distinct groups of parameters. At every layer, for every token, a router network chooses two of these groups (the “experts”) to process the token and combine their output additively. This technique increases the number of parameters of a model while controlling cost and latency, as the model 

Trace: 

https://smith.langchain.com/public/935fd642-06a6-4b42-98e3-6074f93115cd/r