templates: add RAG template for Intel Xeon Scalable Processors (#18424)

**Description:** This template uses Chroma and TGI (Text Generation Inference) to run RAG on Intel Xeon Scalable Processors. It demonstrates how to deploy a RAG service on these processors and showcases the resulting performance improvements.

**Issue:** None

**Dependencies:** The template includes the Poetry project requirements needed to run it. CPU TGI batching is WIP.

**Twitter handle:** None

---------

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
6 months ago · 0175906437
parent d4673a3507
commit 0175906437
9 changed files with 6027 additions and 0 deletions
--- a/templates/intel-rag-xeon/README.md
+++ b/templates/intel-rag-xeon/README.md
@@ -0,0 +1,97 @@
+# RAG example on Intel Xeon
+This template performs RAG using Chroma and Text Generation Inference on Intel® Xeon® Scalable Processors.
+Intel® Xeon® Scalable Processors feature built-in accelerators for more performance-per-core and unmatched AI performance, with advanced security technologies for the most in-demand workload requirements, all while offering the greatest cloud choice and application portability. For details, see [Intel® Xeon® Scalable Processors](https://www.intel.com/content/www/us/en/products/details/processors/xeon/scalable.html).
+
+## Environment Setup
+To use [🤗 text-generation-inference](https://github.com/huggingface/text-generation-inference) on Intel® Xeon® Scalable Processors, please follow these steps:
+
+
+### Launch a local server instance on Intel Xeon Server:
+```bash
+model=Intel/neural-chat-7b-v3-3
+volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
+
+docker run --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.4 --model-id $model
+```
+
+For gated models such as `LLAMA-2`, you will have to pass `-e HUGGING_FACE_HUB_TOKEN=<token>` to the `docker run` command above with a valid Hugging Face Hub read token.
+
+Follow the [Hugging Face token](https://huggingface.co/docs/hub/security-tokens) guide to get an access token, then export it as the `HUGGINGFACEHUB_API_TOKEN` environment variable.
+
+```bash
+export HUGGINGFACEHUB_API_TOKEN=<token> 
+```
+
+Send a request to check that the endpoint is working:
+
+```bash
+curl localhost:8080/generate -X POST -d '{"inputs":"Which NFL team won the Super Bowl in the 2010 season?","parameters":{"max_new_tokens":128, "do_sample": true}}'   -H 'Content-Type: application/json'
+```
+
+For more details, refer to [text-generation-inference](https://github.com/huggingface/text-generation-inference).
+
+
+## Populating with data
+
+If you want to populate the DB with some example data, you can run the below commands:
+```shell
+poetry install
+poetry run python ingest.py
+```
+
+The script processes sections of Nike's Edgar 10-K filing (`nke-10k-2023.pdf`) and stores them in a Chroma database.
+
+## Usage
+
+To use this package, you should first have the LangChain CLI installed:
+
+```shell
+pip install -U langchain-cli
+```
+
+To create a new LangChain project and install this as the only package, you can do:
+
+```shell
+langchain app new my-app --package intel-rag-xeon
+```
+
+If you want to add this to an existing project, you can just run:
+
+```shell
+langchain app add intel-rag-xeon
+```
+
+And add the following code to your `server.py` file:
+```python
+from intel_rag_xeon import chain as xeon_rag_chain
+
+add_routes(app, xeon_rag_chain, path="/intel-rag-xeon")
+```
+
+(Optional) Let's now configure LangSmith. LangSmith will help us trace, monitor, and debug LangChain applications. LangSmith is currently in private beta; you can sign up [here](https://smith.langchain.com/). If you don't have access, you can skip this section.
+
+```shell
+export LANGCHAIN_TRACING_V2=true
+export LANGCHAIN_API_KEY=<your-api-key>
+export LANGCHAIN_PROJECT=<your-project>  # if not specified, defaults to "default"
+```
+
+If you are inside this directory, then you can spin up a LangServe instance directly by:
+
+```shell
+langchain serve
+```
+
+This will start the FastAPI app with a server running locally at
+[http://localhost:8000](http://localhost:8000)
+
+We can see all templates at [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs)
+We can access the playground at [http://127.0.0.1:8000/intel-rag-xeon/playground](http://127.0.0.1:8000/intel-rag-xeon/playground)
+
+We can access the template from code with:
+
+```python
+from langserve.client import RemoteRunnable
+
+runnable = RemoteRunnable("http://localhost:8000/intel-rag-xeon")
+```
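The endpoint check in the README's Environment Setup section (the `curl` request to `/generate`) can also be sketched in Python with only the standard library. This is a minimal sketch, assuming the TGI container is reachable at `localhost:8080` as in the `docker run` command and that the response JSON carries a `generated_text` field; the helper names `build_generate_request` and `generate` are hypothetical.

```python
import json
import urllib.request

# Assumption: matches the -p 8080:80 port mapping in the docker run command.
TGI_URL = "http://localhost:8080/generate"


def build_generate_request(prompt, max_new_tokens=128, do_sample=True, url=TGI_URL):
    """Build the same POST request that the README's curl command sends."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "do_sample": do_sample},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, timeout=60.0):
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["generated_text"]


# Example call, only meaningful while the TGI container is up:
# print(generate("Which NFL team won the Super Bowl in the 2010 season?"))
```

Separating request construction from sending keeps the payload easy to inspect without a live server.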
--- a/templates/intel-rag-xeon/data/nke-10k-2023.pdf
+++ b/templates/intel-rag-xeon/data/nke-10k-2023.pdf
--- a/templates/intel-rag-xeon/ingest.py
+++ b/templates/intel-rag-xeon/ingest.py
@@ -0,0 +1,49 @@
+import os
+
+from langchain.text_splitter import RecursiveCharacterTextSplitter
+from langchain_community.document_loaders import UnstructuredFileLoader
+from langchain_community.embeddings import HuggingFaceEmbeddings
+from langchain_community.vectorstores import Chroma
+from langchain_core.documents import Document
+
+
+def ingest_documents():
+    """
+    Ingest PDF to Chroma from the data/ directory that
+    contains Edgar 10k filings data for Nike.
+    """
+    # Load list of pdfs
+    data_path = "data/"
+    doc = [os.path.join(data_path, file) for file in os.listdir(data_path)][0]
+
+    print("Parsing 10k filing doc for NIKE", doc)
+
+    text_splitter = RecursiveCharacterTextSplitter(
+        chunk_size=1500, chunk_overlap=100, add_start_index=True
+    )
+    loader = UnstructuredFileLoader(doc, mode="single", strategy="fast")
+    chunks = loader.load_and_split(text_splitter)
+
+    print("Done preprocessing. Created", len(chunks), "chunks of the original pdf")
+
+    # Create vectorstore
+    embedder = HuggingFaceEmbeddings(
+        model_name="sentence-transformers/all-MiniLM-L6-v2"
+    )
+
+    documents = [
+        Document(page_content=chunk.page_content, metadata=chunk.metadata)
+        for chunk in chunks
+    ]
+
+    # Add to vectorDB
+    _ = Chroma.from_documents(
+        documents=documents,
+        collection_name="xeon-rag",
+        embedding=embedder,
+        persist_directory="/tmp/xeon_rag_db",
+    )
+
+
+if __name__ == "__main__":
+    ingest_documents()
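The chunking parameters used in `ingest.py` (`chunk_size=1500`, `chunk_overlap=100`, `add_start_index=True`) can be illustrated with a simplified fixed-window splitter. This is only a sketch of the overlap idea: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraph and line breaks, so its real chunk boundaries will differ. The function name `chunk_text` and the dictionary layout are hypothetical.

```python
def chunk_text(text, chunk_size=1500, chunk_overlap=100):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(
            {
                "start_index": start,  # analogous to add_start_index=True
                "text": text[start : start + chunk_size],
            }
        )
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, consecutive windows start 1400 characters apart, so the last 100 characters of one chunk reappear at the start of the next.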
--- a/templates/intel-rag-xeon/intel_rag_xeon.ipynb
+++ b/templates/intel-rag-xeon/intel_rag_xeon.ipynb
@@ -0,0 +1,62 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "681a5d1e",
+   "metadata": {},
+   "source": [
+    "## Connect to RAG App\n",
+    "\n",
+    "Assuming you are already running this server:\n",
+    "```bash\n",
+    "langchain serve\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d774be2a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langserve.client import RemoteRunnable\n",
+    "\n",
+    "xeon_rag = RemoteRunnable(\"http://localhost:8000/intel-rag-xeon\")\n",
+    "\n",
+    "print(xeon_rag.invoke(\"What was Nike's revenue in 2023?\"))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "07ae0005",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(xeon_rag.invoke(\"How many employees work at Nike?\"))"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
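The notebook queries the served chain through `RemoteRunnable`. The same call can be sketched over plain HTTP, assuming the app exposes LangServe's `/invoke` route under the `/intel-rag-xeon` path and returns the answer in an `output` field. The helper names `build_invoke_request` and `ask` are hypothetical.

```python
import json
import urllib.request

# Assumption: same base URL the notebook passes to RemoteRunnable.
BASE_URL = "http://localhost:8000/intel-rag-xeon"


def build_invoke_request(question, base_url=BASE_URL):
    """Build a POST to the chain's invoke route; the chain input is a plain string."""
    body = json.dumps({"input": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/invoke",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, timeout=120.0):
    """Send the question and return the chain's answer (needs the server running)."""
    with urllib.request.urlopen(build_invoke_request(question), timeout=timeout) as resp:
        return json.loads(resp.read())["output"]


# Example call, only meaningful while `langchain serve` is running:
# print(ask("What was Nike's revenue in 2023?"))
```

This can be handy for clients that cannot install `langserve`, at the cost of handling the JSON envelope by hand.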
--- a/templates/intel-rag-xeon/intel_rag_xeon/__init__.py
+++ b/templates/intel-rag-xeon/intel_rag_xeon/__init__.py
@@ -0,0 +1,3 @@
+from intel_rag_xeon.chain import chain
+
+__all__ = ["chain"]
--- a/templates/intel-rag-xeon/intel_rag_xeon/chain.py
+++ b/templates/intel-rag-xeon/intel_rag_xeon/chain.py
@@ -0,0 +1,72 @@
+from langchain.callbacks import streaming_stdout
+from langchain_community.embeddings import HuggingFaceEmbeddings
+from langchain_community.llms import HuggingFaceEndpoint
+from langchain_community.vectorstores import Chroma
+from langchain_core.output_parsers import StrOutputParser
+from langchain_core.prompts import ChatPromptTemplate
+from langchain_core.pydantic_v1 import BaseModel
+from langchain_core.runnables import RunnableParallel, RunnablePassthrough
+from langchain_core.vectorstores import VectorStoreRetriever
+
+
+# Make this look better in the docs.
+class Question(BaseModel):
+    __root__: str
+
+
+# Init Embeddings
+embedder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
+
+knowledge_base = Chroma(
+    persist_directory="/tmp/xeon_rag_db",
+    embedding_function=embedder,
+    collection_name="xeon-rag",
+)
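+# Sanity check at import time; assumes ingest.py has already populated the DB.
+docs = knowledge_base.similarity_search("What was Nike's revenue in 2023?")
+print(docs[0].page_content if docs else "No documents found in xeon-rag")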
+retriever = VectorStoreRetriever(
+    vectorstore=knowledge_base, search_type="mmr", search_kwargs={"k": 1, "fetch_k": 5}
+)
+
+# Define our prompt
+template = """
+Use the following pieces of context from retrieved
+dataset to answer the question. Do not make up an answer if there is no
+context provided to help answer it.
+
+Context:
+---------
+{context}
+
+---------
+Question: {question}
+---------
+
+Answer:
+"""
+
+
+prompt = ChatPromptTemplate.from_template(template)
+
+
+ENDPOINT_URL = "http://localhost:8080"
+callbacks = [streaming_stdout.StreamingStdOutCallbackHandler()]
+model = HuggingFaceEndpoint(
+    endpoint_url=ENDPOINT_URL,
+    max_new_tokens=512,
+    top_k=10,
+    top_p=0.95,
+    typical_p=0.95,
+    temperature=0.01,
+    repetition_penalty=1.03,
+    streaming=True,
+)
+
+# RAG Chain
+chain = (
+    RunnableParallel({"context": retriever, "question": RunnablePassthrough()})
+    | prompt
+    | model
+    | StrOutputParser()
+).with_types(input_type=Question)
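The retriever in `chain.py` is configured with `search_type="mmr"`, `k=1`, and `fetch_k=5`. As a rough illustration of maximal marginal relevance, the sketch below re-ranks a handful of candidate vectors in pure Python. It is not LangChain's implementation: the names `cosine` and `mmr_select` are hypothetical, and the `lambda_mult` default of 0.5 is an assumption.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query_vec, candidate_vecs, k=1, lambda_mult=0.5):
    """Pick k candidate indices, trading relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidate_vecs)))

    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, candidate_vecs[i])
            # Similarity to the closest already-selected candidate (0 if none yet).
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)

    return selected
```

In this sketch the first pick is always the most relevant candidate, so with `k=1` the diversity term never comes into play; it only changes the result when more than one document is requested.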
--- a/templates/intel-rag-xeon/poetry.lock
+++ b/templates/intel-rag-xeon/poetry.lock
--- a/templates/intel-rag-xeon/pyproject.toml
+++ b/templates/intel-rag-xeon/pyproject.toml
@@ -0,0 +1,51 @@
+[tool.poetry]
+name = "intel-rag-xeon"
+version = "0.0.1"
+description = "Run a RAG app on Intel Xeon Scalable Processors"
+authors = [
+    "Liang Lv <liang1.lv@intel.com>",
+]
+readme = "README.md"
+
+[tool.poetry.dependencies]
+python = ">=3.9,<3.13"
+langchain = "^0.1"
+fastapi = "^0.104.0"
+sse-starlette = "^1.6.5"
+sentence-transformers = "2.2.2"
+tiktoken = ">=0.5.1"
+chromadb = ">=0.4.14"
+beautifulsoup4 = ">=4.12.2"
+
+[tool.poetry.dependencies.unstructured]
+version = "^0.10.27"
+extras = [
+    "pdf",
+]
+
+[tool.poetry.group.dev.dependencies]
+poethepoet = "^0.24.1"
+langchain-cli = ">=0.0.21"
+
+[tool.langserve]
+export_module = "intel_rag_xeon.chain"
+export_attr = "chain"
+
+[tool.templates-hub]
+use-case = "rag"
+author = "Intel"
+integrations = ["Intel", "HuggingFace"]
+tags = ["vectordbs"]
+
+[tool.poe.tasks.start]
+cmd = "uvicorn langchain_cli.dev_scripts:create_demo_server --reload --port $port --host $host"
+args = [
+    { name = "port", help = "port to run on", default = "8000" },
+    { name = "host", help = "host to run on", default = "127.0.0.1" },
+]
+
+[build-system]
+requires = [
+    "poetry-core",
+]
+build-backend = "poetry.core.masonry.api"
--- a/templates/intel-rag-xeon/tests/__init__.py
+++ b/templates/intel-rag-xeon/tests/__init__.py