templates: add RAG template for Intel Xeon Scalable Processors (#18424)
**Description:** This template uses Chroma and TGI (Text Generation Inference) to run RAG on Intel Xeon Scalable Processors. It serves as a demonstration for users, illustrating how to deploy the RAG service on Intel Xeon Scalable Processors and showcasing the resulting performance enhancements.

**Issue:** None

**Dependencies:** The template contains the Poetry project requirements needed to run it. CPU TGI batching is a work in progress.

**Twitter handle:** None

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
This commit is contained in: parent d4673a3507, commit 0175906437
templates/intel-rag-xeon/README.md (new file, 97 lines)
@@ -0,0 +1,97 @@
# RAG example on Intel Xeon

This template performs RAG using Chroma and Text Generation Inference on Intel® Xeon® Scalable Processors.

Intel® Xeon® Scalable processors feature built-in accelerators for more performance per core and unmatched AI performance, with advanced security technologies for the most in-demand workload requirements, all while offering the greatest cloud choice and application portability. For more information, see [Intel® Xeon® Scalable Processors](https://www.intel.com/content/www/us/en/products/details/processors/xeon/scalable.html).

## Environment Setup

To use [🤗 text-generation-inference](https://github.com/huggingface/text-generation-inference) on Intel® Xeon® Scalable Processors, follow these steps:

### Launch a local server instance on an Intel Xeon server:

```bash
model=Intel/neural-chat-7b-v3-3
volume=$PWD/data  # share a volume with the Docker container to avoid downloading weights every run

docker run --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.4 --model-id $model
```

For gated models such as `LLAMA-2`, you will have to pass `-e HUGGING_FACE_HUB_TOKEN=<token>` to the `docker run` command above with a valid Hugging Face Hub read token.
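For illustration only, a gated model launch might look like the sketch below; the model ID shown is a placeholder and is not part of this template:

```bash
# Sketch: launching a gated model requires a valid Hugging Face Hub read token
model=meta-llama/Llama-2-7b-chat-hf  # placeholder gated model id, not used by this template
volume=$PWD/data

docker run --shm-size 1g -p 8080:80 -v $volume:/data \
  -e HUGGING_FACE_HUB_TOKEN=<token> \
  ghcr.io/huggingface/text-generation-inference:1.4 --model-id $model
```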
Please follow this guide to obtain a [Hugging Face access token](https://huggingface.co/docs/hub/security-tokens) and export the `HUGGINGFACEHUB_API_TOKEN` environment variable with the token.

```bash
export HUGGINGFACEHUB_API_TOKEN=<token>
```

Send a request to check if the endpoint is working:

```bash
curl localhost:8080/generate -X POST -d '{"inputs":"Which NFL team won the Super Bowl in the 2010 season?","parameters":{"max_new_tokens":128, "do_sample": true}}' -H 'Content-Type: application/json'
```

For more details, please refer to [text-generation-inference](https://github.com/huggingface/text-generation-inference).
## Populating with data

If you want to populate the DB with some example data, you can run the commands below:

```shell
poetry install
poetry run python ingest.py
```

The script processes and stores sections from the Edgar 10k filing data for Nike (`nke-10k-2023.pdf`) in a Chroma database.
## Usage

To use this package, you should first have the LangChain CLI installed:

```shell
pip install -U langchain-cli
```

To create a new LangChain project and install this as the only package, you can do:

```shell
langchain app new my-app --package intel-rag-xeon
```

If you want to add this to an existing project, you can just run:

```shell
langchain app add intel-rag-xeon
```

And add the following code to your `server.py` file:

```python
from intel_rag_xeon import chain as xeon_rag_chain

add_routes(app, xeon_rag_chain, path="/intel-rag-xeon")
```
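For reference, a minimal `server.py` might look like the sketch below; the FastAPI app creation and the `add_routes` import from `langserve` are assumed from the standard LangServe scaffold rather than taken from this template:

```python
# Minimal server.py sketch, assuming the standard LangServe scaffold
from fastapi import FastAPI
from langserve import add_routes

from intel_rag_xeon import chain as xeon_rag_chain

app = FastAPI()

# Expose the RAG chain under /intel-rag-xeon
add_routes(app, xeon_rag_chain, path="/intel-rag-xeon")

if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
```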
(Optional) Let's now configure LangSmith. LangSmith will help us trace, monitor and debug LangChain applications. LangSmith is currently in private beta; you can sign up [here](https://smith.langchain.com/). If you don't have access, you can skip this section.

```shell
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=<your-api-key>
export LANGCHAIN_PROJECT=<your-project>  # if not specified, defaults to "default"
```

If you are inside this directory, then you can spin up a LangServe instance directly by:

```shell
langchain serve
```

This will start the FastAPI app with a server running locally at
[http://localhost:8000](http://localhost:8000)
We can see all templates at [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs)
We can access the playground at [http://127.0.0.1:8000/intel-rag-xeon/playground](http://127.0.0.1:8000/intel-rag-xeon/playground)

We can access the template from code with:

```python
from langserve.client import RemoteRunnable

runnable = RemoteRunnable("http://localhost:8000/intel-rag-xeon")
```
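For example, a usage sketch, assuming the server above is running and the example data has been ingested:

```python
# Sketch: call the remote RAG chain like any other runnable
from langserve.client import RemoteRunnable

runnable = RemoteRunnable("http://localhost:8000/intel-rag-xeon")
print(runnable.invoke("What was Nike's revenue in 2023?"))
```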
templates/intel-rag-xeon/data/nke-10k-2023.pdf (new binary file; not shown)

templates/intel-rag-xeon/ingest.py (new file, 49 lines)
@@ -0,0 +1,49 @@
import os

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import UnstructuredFileLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document


def ingest_documents():
    """
    Ingest the PDF in the data/ directory that contains
    Edgar 10k filings data for Nike into Chroma.
    """
    # Load the first PDF found in the data/ directory
    data_path = "data/"
    doc = [os.path.join(data_path, file) for file in os.listdir(data_path)][0]

    print("Parsing 10k filing doc for NIKE", doc)

    # Split the filing into overlapping chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1500, chunk_overlap=100, add_start_index=True
    )
    loader = UnstructuredFileLoader(doc, mode="single", strategy="fast")
    chunks = loader.load_and_split(text_splitter)

    print("Done preprocessing. Created", len(chunks), "chunks of the original pdf")

    # Create vectorstore
    embedder = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )

    documents = []
    for chunk in chunks:
        doc = Document(page_content=chunk.page_content, metadata=chunk.metadata)
        documents.append(doc)

    # Add to vectorDB
    _ = Chroma.from_documents(
        documents=documents,
        collection_name="xeon-rag",
        embedding=embedder,
        persist_directory="/tmp/xeon_rag_db",
    )


if __name__ == "__main__":
    ingest_documents()
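As a quick sanity check (a sketch, not part of this template), the persisted collection can be reopened with the same embedding model, collection name, and persist directory used above:

```python
# Sketch: reopen the persisted Chroma DB created by ingest.py and query it
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

embedder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = Chroma(
    persist_directory="/tmp/xeon_rag_db",
    embedding_function=embedder,
    collection_name="xeon-rag",
)
print(db.similarity_search("What was Nike's revenue in 2023?")[0].page_content)
```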
templates/intel-rag-xeon/intel_rag_xeon.ipynb (new file, 62 lines)
@@ -0,0 +1,62 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "681a5d1e",
   "metadata": {},
   "source": [
    "## Connect to RAG App\n",
    "\n",
    "Assuming you are already running this server:\n",
    "```bash\n",
    "langchain serve\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d774be2a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langserve.client import RemoteRunnable\n",
    "\n",
    "gaudi_rag = RemoteRunnable(\"http://localhost:8000/intel-rag-xeon\")\n",
    "\n",
    "print(gaudi_rag.invoke(\"What was Nike's revenue in 2023?\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "07ae0005",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(gaudi_rag.invoke(\"How many employees work at Nike?\"))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
templates/intel-rag-xeon/intel_rag_xeon/__init__.py (new file, 3 lines)
@@ -0,0 +1,3 @@

from intel_rag_xeon.chain import chain

__all__ = ["chain"]
templates/intel-rag-xeon/intel_rag_xeon/chain.py (new file, 72 lines)
@@ -0,0 +1,72 @@
from langchain.callbacks import streaming_stdout
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFaceEndpoint
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_core.vectorstores import VectorStoreRetriever


# Make this look better in the docs.
class Question(BaseModel):
    __root__: str


# Init Embeddings
embedder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

knowledge_base = Chroma(
    persist_directory="/tmp/xeon_rag_db",
    embedding_function=embedder,
    collection_name="xeon-rag",
)
# Sanity check that ingest.py has populated the vectorstore
query = "What was Nike's revenue in 2023?"
docs = knowledge_base.similarity_search(query)
print(docs[0].page_content)
retriever = VectorStoreRetriever(
    vectorstore=knowledge_base, search_type="mmr", search_kwargs={"k": 1, "fetch_k": 5}
)

# Define our prompt
template = """
Use the following pieces of context from the retrieved
dataset to answer the question. Do not make up an answer if there is no
context provided to help answer it.

Context:
---------
{context}

---------
Question: {question}
---------

Answer:
"""


prompt = ChatPromptTemplate.from_template(template)


# Text Generation Inference endpoint launched during environment setup
ENDPOINT_URL = "http://localhost:8080"
callbacks = [streaming_stdout.StreamingStdOutCallbackHandler()]
model = HuggingFaceEndpoint(
    endpoint_url=ENDPOINT_URL,
    max_new_tokens=512,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.01,
    repetition_penalty=1.03,
    streaming=True,
)

# RAG Chain
chain = (
    RunnableParallel({"context": retriever, "question": RunnablePassthrough()})
    | prompt
    | model
    | StrOutputParser()
).with_types(input_type=Question)
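As a rough local test (a sketch, not part of this template), the exported chain can be invoked directly once the TGI endpoint is running and `ingest.py` has populated the vectorstore:

```python
# Sketch: invoke the exported chain directly, outside of LangServe
from intel_rag_xeon import chain

print(chain.invoke("What was Nike's revenue in 2023?"))
```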
templates/intel-rag-xeon/poetry.lock (generated new file, 5693 lines; diff suppressed because it is too large)
templates/intel-rag-xeon/pyproject.toml (new file, 51 lines)
@@ -0,0 +1,51 @@
[tool.poetry]
name = "intel-rag-xeon"
version = "0.0.1"
description = "Run a RAG app on Intel Xeon Scalable Processors"
authors = [
    "Liang Lv <liang1.lv@intel.com>",
]
readme = "README.md"

[tool.poetry.dependencies]
python = ">=3.9,<3.13"
langchain = "^0.1"
fastapi = "^0.104.0"
sse-starlette = "^1.6.5"
sentence-transformers = "2.2.2"
tiktoken = ">=0.5.1"
chromadb = ">=0.4.14"
beautifulsoup4 = ">=4.12.2"

[tool.poetry.dependencies.unstructured]
version = "^0.10.27"
extras = [
    "pdf",
]

[tool.poetry.group.dev.dependencies]
poethepoet = "^0.24.1"
langchain-cli = ">=0.0.21"

[tool.langserve]
export_module = "intel_rag_xeon.chain"
export_attr = "chain"

[tool.templates-hub]
use-case = "rag"
author = "Intel"
integrations = ["Intel", "HuggingFace"]
tags = ["vectordbs"]

[tool.poe.tasks.start]
cmd = "uvicorn langchain_cli.dev_scripts:create_demo_server --reload --port $port --host $host"
args = [
    { name = "port", help = "port to run on", default = "8000" },
    { name = "host", help = "host to run on", default = "127.0.0.1" },
]

[build-system]
requires = [
    "poetry-core",
]
build-backend = "poetry.core.masonry.api"
templates/intel-rag-xeon/tests/__init__.py (new empty file)