# Semi-structured RAG
This template performs RAG on semi-structured data (e.g., a PDF with text and tables).
See this [blog post](https://langchain-blog.ghost.io/ghost/#/editor/post/652dc74e0633850001e977d4) for useful background context.
## Data loading
We use [partition_pdf](https://unstructured-io.github.io/unstructured/bricks/partition.html#partition-pdf) from Unstructured to extract both table and text elements.
This requires some system-level packages. For example, on macOS with Homebrew:
```
brew install tesseract poppler
```
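As a rough illustration of the extraction step, the sketch below calls `partition_pdf` and separates table elements from text elements. The helper names, the `example.pdf` file name, and the `infer_table_structure=True` setting are assumptions made for this example; `chain.py` may configure the call differently.

```python
from typing import Any, Iterable, List, Tuple


def split_tables_and_text(elements: Iterable[Any]) -> Tuple[List[str], List[str]]:
    """Separate table elements from all other elements, returning their text."""
    tables: List[str] = []
    texts: List[str] = []
    for el in elements:
        # Fall back to str() for objects that have no `text` attribute.
        content = getattr(el, "text", str(el))
        # Unstructured's table element classes are named "Table" / "TableChunk".
        if type(el).__name__.startswith("Table"):
            tables.append(content)
        else:
            texts.append(content)
    return tables, texts


def load_pdf(path: str) -> Tuple[List[str], List[str]]:
    """Partition a PDF and return (table_texts, other_texts)."""
    # Imported lazily so the helper above can be used without Unstructured.
    from unstructured.partition.pdf import partition_pdf

    # infer_table_structure=True asks Unstructured to recover table layout
    # (an assumption for this sketch, not necessarily the template's setting).
    elements = partition_pdf(filename=path, infer_table_structure=True)
    return split_tables_and_text(elements)


# Usage, with a hypothetical file name:
#   tables, texts = load_pdf("example.pdf")
```

Keeping tables and text apart makes it possible to handle each kind of element differently before indexing.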
## Chroma
[Chroma](https://python.langchain.com/docs/integrations/vectorstores/chroma) is an open-source vector database.
This template will create and add documents to the vector database in `chain.py`.
These documents can be loaded from [many sources](https://python.langchain.com/docs/integrations/document_loaders).
## LLM
Be sure that `OPENAI_API_KEY` is set in order to access the OpenAI models.
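One way to catch a missing key early is a small check at startup. This is a minimal sketch; `require_env` is a hypothetical helper, not part of the template.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before starting the server.")
    return value


# Usage: call this before building the chain.
#   require_env("OPENAI_API_KEY")
```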
## Adding the template
Create your LangServe app:
```
langchain app new my-app
cd my-app
```
Add template:
```
langchain app add rag-semi-structured
```
Start server:
```
langchain serve
```
See the Jupyter notebook `rag_semi_structured` for various ways to connect to the template.
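For example, a running server can be called from Python through LangServe's `RemoteRunnable`. The sketch below assumes the server listens on `http://localhost:8000` and that the template is mounted at `/rag-semi-structured`; both depend on how your app is configured, and the helper names are hypothetical.

```python
def template_url(
    base_url: str = "http://localhost:8000",
    path: str = "rag-semi-structured",
) -> str:
    """Join the server base URL and the template route (both are assumptions)."""
    return f"{base_url.rstrip('/')}/{path.strip('/')}"


def connect(base_url: str = "http://localhost:8000"):
    """Return a client for the template served at base_url."""
    # Imported lazily so template_url() works without langserve installed.
    from langserve.client import RemoteRunnable

    return RemoteRunnable(template_url(base_url))


# Usage, assuming the chain accepts a plain question string:
#   rag_app = connect()
#   rag_app.invoke("What does the first table show?")
```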