templates: add RAG template for Intel Xeon Scalable Processors (#18424)

**Description:**
This template uses Chroma and TGI (Text Generation Inference) to run
RAG on Intel Xeon Scalable Processors. It serves as a demonstration for
users, illustrating how to deploy the RAG service on Intel Xeon
Scalable Processors and showcasing the resulting performance
improvements.

**Issue:**
None

**Dependencies:**
The template includes a Poetry project file with the requirements
needed to run it.
CPU TGI batching is still a work in progress.

**Twitter handle:**
None

---------

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Co-authored-by: Bagatur <22008038+baskaryan@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
lvliang-intel 2024-03-30 05:37:32 +08:00 committed by GitHub
parent d4673a3507
commit 0175906437
9 changed files with 6027 additions and 0 deletions


@@ -0,0 +1,97 @@
# RAG example on Intel Xeon
This template performs RAG using Chroma and Text Generation Inference on Intel® Xeon® Scalable Processors.
Intel® Xeon® Scalable processors feature built-in accelerators for more performance per core and unmatched AI performance, with advanced security technologies for the most in-demand workload requirements, all while offering the greatest cloud choice and application portability. For more details, see [Intel® Xeon® Scalable Processors](https://www.intel.com/content/www/us/en/products/details/processors/xeon/scalable.html).
## Environment Setup
To use [🤗 text-generation-inference](https://github.com/huggingface/text-generation-inference) on Intel® Xeon® Scalable Processors, please follow these steps:
### Launch a local server instance on an Intel Xeon server:
```bash
model=Intel/neural-chat-7b-v3-3
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.4 --model-id $model
```
For gated models such as `LLAMA-2`, you will have to pass `-e HUGGING_FACE_HUB_TOKEN=<token>` to the `docker run` command above with a valid Hugging Face Hub read token.
Please follow [Hugging Face tokens](https://huggingface.co/docs/hub/security-tokens) to get an access token and export it as the `HUGGINGFACEHUB_API_TOKEN` environment variable:
```bash
export HUGGINGFACEHUB_API_TOKEN=<token>
```
Send a request to check if the endpoint is working:
```bash
curl localhost:8080/generate -X POST -d '{"inputs":"Which NFL team won the Super Bowl in the 2010 season?","parameters":{"max_new_tokens":128, "do_sample": true}}' -H 'Content-Type: application/json'
```
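If you prefer to check the endpoint from Python, here is a minimal sketch of the same request using `requests` (an assumption on our part; it is not a dependency of this template):
```python
import requests

# Same request as the curl command above, sent to the local TGI server.
response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Which NFL team won the Super Bowl in the 2010 season?",
        "parameters": {"max_new_tokens": 128, "do_sample": True},
    },
    timeout=60,
)
print(response.json())
```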
For more details, please refer to [text-generation-inference](https://github.com/huggingface/text-generation-inference).
## Populating with data
If you want to populate the DB with some example data, you can run the commands below:
```shell
poetry install
poetry run python ingest.py
```
The script processes and stores sections from the Edgar 10-K filing for Nike (`nke-10k-2023.pdf`) in a Chroma database.
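To spot-check the ingestion, you can reopen the persisted collection and run a similarity search. A minimal sketch, assuming the `/tmp/xeon_rag_db` directory and `xeon-rag` collection name used by `ingest.py`:
```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Reuse the same embedding model the ingestion script used.
embedder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = Chroma(
    persist_directory="/tmp/xeon_rag_db",
    embedding_function=embedder,
    collection_name="xeon-rag",
)

# Print the most relevant chunk for a sample question about the 10-K filing.
docs = db.similarity_search("What was Nike's revenue in 2023?", k=1)
print(docs[0].page_content)
```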
## Usage
To use this package, you should first have the LangChain CLI installed:
```shell
pip install -U langchain-cli
```
To create a new LangChain project and install this as the only package, you can do:
```shell
langchain app new my-app --package intel-rag-xeon
```
If you want to add this to an existing project, you can just run:
```shell
langchain app add intel-rag-xeon
```
And add the following code to your `server.py` file:
```python
from intel_rag_xeon import chain as xeon_rag_chain
add_routes(app, xeon_rag_chain, path="/intel-rag-xeon")
```
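For context, here is a minimal sketch of how these two lines sit inside the `server.py` scaffold generated by `langchain app new` (the `FastAPI` app object and the `uvicorn` entry point below are assumptions based on that scaffold, not part of this template):
```python
from fastapi import FastAPI
from langserve import add_routes

from intel_rag_xeon import chain as xeon_rag_chain

app = FastAPI()

# Serve the RAG chain; the playground will be at /intel-rag-xeon/playground.
add_routes(app, xeon_rag_chain, path="/intel-rag-xeon")

if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
```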
(Optional) Let's now configure LangSmith. LangSmith will help us trace, monitor, and debug LangChain applications. LangSmith is currently in private beta; you can sign up [here](https://smith.langchain.com/). If you don't have access, you can skip this section.
```shell
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=<your-api-key>
export LANGCHAIN_PROJECT=<your-project> # if not specified, defaults to "default"
```
If you are inside this directory, then you can spin up a LangServe instance directly by:
```shell
langchain serve
```
This will start the FastAPI app with a server running locally at
[http://localhost:8000](http://localhost:8000).
We can see all templates at [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs)
We can access the playground at [http://127.0.0.1:8000/intel-rag-xeon/playground](http://127.0.0.1:8000/intel-rag-xeon/playground)
We can access the template from code with:
```python
from langserve.client import RemoteRunnable
runnable = RemoteRunnable("http://localhost:8000/intel-rag-xeon")
```
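The remote runnable can then be called like any local chain, for example:
```python
print(runnable.invoke("What was Nike's revenue in 2023?"))
```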

Binary file not shown.


@@ -0,0 +1,49 @@
import os

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import UnstructuredFileLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document


def ingest_documents():
    """
    Ingest the PDF in the data/ directory (Edgar 10-K filing data
    for Nike) into a Chroma vector store.
    """
    # Load the list of PDFs and pick the first one
    data_path = "data/"
    doc = [os.path.join(data_path, file) for file in os.listdir(data_path)][0]

    print("Parsing 10k filing doc for NIKE", doc)

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1500, chunk_overlap=100, add_start_index=True
    )
    loader = UnstructuredFileLoader(doc, mode="single", strategy="fast")
    chunks = loader.load_and_split(text_splitter)

    print("Done preprocessing. Created", len(chunks), "chunks of the original pdf")

    # Create vectorstore
    embedder = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )

    documents = []
    for chunk in chunks:
        doc = Document(page_content=chunk.page_content, metadata=chunk.metadata)
        documents.append(doc)

    # Add to vectorDB
    _ = Chroma.from_documents(
        documents=documents,
        collection_name="xeon-rag",
        embedding=embedder,
        persist_directory="/tmp/xeon_rag_db",
    )


if __name__ == "__main__":
    ingest_documents()


@@ -0,0 +1,62 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "681a5d1e",
"metadata": {},
"source": [
"## Connect to RAG App\n",
"\n",
"Assuming you are already running this server:\n",
"```bash\n",
"langchain serve\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d774be2a",
"metadata": {},
"outputs": [],
"source": [
"from langserve.client import RemoteRunnable\n",
"\n",
"xeon_rag = RemoteRunnable(\"http://localhost:8000/intel-rag-xeon\")\n",
"\n",
"print(xeon_rag.invoke(\"What was Nike's revenue in 2023?\"))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "07ae0005",
"metadata": {},
"outputs": [],
"source": [
"print(xeon_rag.invoke(\"How many employees work at Nike?\"))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -0,0 +1,3 @@
from intel_rag_xeon.chain import chain
__all__ = ["chain"]


@@ -0,0 +1,72 @@
from langchain.callbacks import streaming_stdout
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFaceEndpoint
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_core.vectorstores import VectorStoreRetriever


# Make this look better in the docs.
class Question(BaseModel):
    __root__: str


# Init Embeddings
embedder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

knowledge_base = Chroma(
    persist_directory="/tmp/xeon_rag_db",
    embedding_function=embedder,
    collection_name="xeon-rag",
)

# Quick sanity check that the vector store is populated
query = "What was Nike's revenue in 2023?"
docs = knowledge_base.similarity_search(query)
print(docs[0].page_content)

retriever = VectorStoreRetriever(
    vectorstore=knowledge_base, search_type="mmr", search_kwargs={"k": 1, "fetch_k": 5}
)

# Define our prompt
template = """
Use the following pieces of context from retrieved
dataset to answer the question. Do not make up an answer if there is no
context provided to help answer it.
Context:
---------
{context}
---------
Question: {question}
---------
Answer:
"""

prompt = ChatPromptTemplate.from_template(template)

# Text Generation Inference endpoint launched in the environment setup step
ENDPOINT_URL = "http://localhost:8080"
callbacks = [streaming_stdout.StreamingStdOutCallbackHandler()]
model = HuggingFaceEndpoint(
    endpoint_url=ENDPOINT_URL,
    max_new_tokens=512,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.01,
    repetition_penalty=1.03,
    streaming=True,
)

# RAG Chain
chain = (
    RunnableParallel({"context": retriever, "question": RunnablePassthrough()})
    | prompt
    | model
    | StrOutputParser()
).with_types(input_type=Question)
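With the TGI endpoint running on `localhost:8080` and the Chroma store populated by `ingest.py`, the chain above can also be exercised directly in Python; a minimal sketch:
```python
from intel_rag_xeon import chain

# Assumes the TGI server and the populated /tmp/xeon_rag_db from the steps above.
print(chain.invoke("What was Nike's revenue in 2023?"))
```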

templates/intel-rag-xeon/poetry.lock (generated file, 5693 additions): diff suppressed because it is too large.


@@ -0,0 +1,51 @@
[tool.poetry]
name = "intel-rag-xeon"
version = "0.0.1"
description = "Run a RAG app on Intel Xeon Scalable Processors"
authors = [
"Liang Lv <liang1.lv@intel.com>",
]
readme = "README.md"
[tool.poetry.dependencies]
python = ">=3.9,<3.13"
langchain = "^0.1"
fastapi = "^0.104.0"
sse-starlette = "^1.6.5"
sentence-transformers = "2.2.2"
tiktoken = ">=0.5.1"
chromadb = ">=0.4.14"
beautifulsoup4 = ">=4.12.2"
[tool.poetry.dependencies.unstructured]
version = "^0.10.27"
extras = [
"pdf",
]
[tool.poetry.group.dev.dependencies]
poethepoet = "^0.24.1"
langchain-cli = ">=0.0.21"
[tool.langserve]
export_module = "intel_rag_xeon.chain"
export_attr = "chain"
[tool.templates-hub]
use-case = "rag"
author = "Intel"
integrations = ["Intel", "HuggingFace"]
tags = ["vectordbs"]
[tool.poe.tasks.start]
cmd = "uvicorn langchain_cli.dev_scripts:create_demo_server --reload --port $port --host $host"
args = [
{ name = "port", help = "port to run on", default = "8000" },
{ name = "host", help = "host to run on", default = "127.0.0.1" },
]
[build-system]
requires = [
"poetry-core",
]
build-backend = "poetry.core.masonry.api"