# langchain/templates/rag-redis/ingest.py

import os

from langchain_community.document_loaders import UnstructuredFileLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Redis
from langchain_text_splitters import RecursiveCharacterTextSplitter
from rag_redis.config import EMBED_MODEL, INDEX_NAME, INDEX_SCHEMA, REDIS_URL


def ingest_documents():
    """
    Ingest a single 10-K filing into Redis.

    Reads the first file found in the data/ directory, which is expected to
    hold EDGAR 10-K filing PDFs for Nike, then chunks, embeds, and indexes it.
    """
    # Pick the first file in data/ (sorted so the choice is deterministic)
    company_name = "Nike"
    data_path = "data/"
    doc = sorted(os.path.join(data_path, file) for file in os.listdir(data_path))[0]

    print("Parsing 10k filing doc for", company_name, doc)  # noqa: T201

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1500, chunk_overlap=100, add_start_index=True
    )
    loader = UnstructuredFileLoader(doc, mode="single", strategy="fast")
    chunks = loader.load_and_split(text_splitter)

    print("Done preprocessing. Created", len(chunks), "chunks of the original pdf")  # noqa: T201
    # Create vectorstore
    embedder = HuggingFaceEmbeddings(model_name=EMBED_MODEL)

    _ = Redis.from_texts(
        # prepending the company name can sometimes help with semantic retrieval,
        # especially when multiple companies are indexed
        texts=[f"Company: {company_name}. " + chunk.page_content for chunk in chunks],
        metadatas=[chunk.metadata for chunk in chunks],
        embedding=embedder,
        index_name=INDEX_NAME,
        index_schema=INDEX_SCHEMA,
        redis_url=REDIS_URL,
    )


if __name__ == "__main__":
    ingest_documents()