mirror of
https://github.com/hwchase17/langchain
synced 2024-10-31 15:20:26 +00:00
2667ddc686
**Description: a description of the change** Fixed `make docs_build` and related scripts, which caused errors. There are several changes. First, I split the build of the documentation and the API Reference into two separate commands, because building them separately takes less time. The commands for the documentation are `make docs_build`, `make docs_clean`, and `make docs_linkcheck`. The commands for the API Reference are `make api_docs_build`, `api_docs_clean`, and `api_docs_linkcheck`. It looked like `docs/.local_build.sh` could be used to build the documentation, so I used that. Since `.local_build.sh` was also building the API Reference internally, I removed that step. I also added some Bash options to `.local_build.sh` so that it stops on errors. Furthermore, I added `cd "${SCRIPT_DIR}"` at the beginning so that the script works no matter which directory it is executed in. `docs/api_reference/api_reference.rst` was removed because it is generated by `docs/api_reference/create_api_rst.py`, and it was added to `.gitignore`. Finally, the description in CONTRIBUTING.md was modified. **Issue: the issue # it fixes (if applicable)** https://github.com/hwchase17/langchain/issues/6413 **Dependencies: any dependencies required for this change** `nbdoc` was missing from the docs group, so it was added. I installed it with the `poetry add --group docs nbdoc` command. I am concerned that poetry.lock may need modifications. I would greatly appreciate it if you could pay close attention to this file during the review. **Tag maintainer** - General / Misc / if you don't know who to tag: @baskaryan If this PR needs any additional changes, I'll be happy to make them! --------- Co-authored-by: Bagatur <baskaryan@gmail.com>
222 lines
6.7 KiB
Plaintext
{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Question answering over group chat messages using Activeloop's Deep Lake\n",
    "In this tutorial, we are going to use LangChain + Activeloop's Deep Lake with GPT-4 to semantically search and ask questions over a group chat.\n",
    "\n",
    "View a working demo [here](https://twitter.com/thisissukh_/status/1647223328363679745)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Install required packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python3 -m pip install --upgrade langchain 'deeplake[enterprise]' openai tiktoken"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Add API keys"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import getpass\n",
    "from langchain.document_loaders import PyPDFLoader, TextLoader\n",
    "from langchain.embeddings.openai import OpenAIEmbeddings\n",
    "from langchain.text_splitter import (\n",
    "    RecursiveCharacterTextSplitter,\n",
    "    CharacterTextSplitter,\n",
    ")\n",
    "from langchain.vectorstores import DeepLake\n",
    "from langchain.chains import ConversationalRetrievalChain, RetrievalQA\n",
    "from langchain.chat_models import ChatOpenAI\n",
    "from langchain.llms import OpenAI\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")\n",
    "activeloop_token = getpass.getpass(\"Activeloop Token:\")\n",
    "os.environ[\"ACTIVELOOP_TOKEN\"] = activeloop_token\n",
    "os.environ[\"ACTIVELOOP_ORG\"] = getpass.getpass(\"Activeloop Org:\")\n",
    "\n",
    "org_id = os.environ[\"ACTIVELOOP_ORG\"]\n",
    "embeddings = OpenAIEmbeddings()\n",
    "\n",
    "dataset_path = \"hub://\" + org_id + \"/data\""
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Create sample data"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can generate a sample group chat conversation using ChatGPT with this prompt:\n",
    "\n",
    "```\n",
    "Generate a group chat conversation with three friends talking about their day, referencing real places and fictional names. Make it funny and as detailed as possible.\n",
    "```\n",
    "\n",
    "I've already generated such a chat in `messages.txt`. We can keep it simple and use this for our example.\n",
    "\n",
    "## 4. Ingest chat embeddings\n",
    "\n",
    "We load the messages from the text file, split them into chunks, and upload them to the Activeloop Deep Lake vector store."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read the whole chat log into memory\n",
    "with open(\"messages.txt\") as f:\n",
    "    chat_text = f.read()\n",
    "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
    "pages = text_splitter.split_text(chat_text)\n",
    "\n",
    "text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)\n",
    "texts = text_splitter.create_documents(pages)\n",
    "\n",
    "print(texts)\n",
    "\n",
    "# org_id is defined in the API keys cell above\n",
    "dataset_path = \"hub://\" + org_id + \"/data\"\n",
    "embeddings = OpenAIEmbeddings()\n",
    "db = DeepLake.from_documents(\n",
    "    texts, embeddings, dataset_path=dataset_path, overwrite=True\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Optional`: You can also use Deep Lake's Managed Tensor Database as a hosting service and run queries there. To do so, specify the runtime parameter as `{'tensor_db': True}` when creating the vector store. With this configuration, queries are executed on the Managed Tensor Database rather than on the client side. Note that this functionality does not apply to datasets stored locally or in memory. If a vector store has already been created outside of the Managed Tensor Database, it can be transferred to the Managed Tensor Database by following the prescribed steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# with open(\"messages.txt\") as f:\n",
    "#     chat_text = f.read()\n",
    "# text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
    "# pages = text_splitter.split_text(chat_text)\n",
    "\n",
    "# text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)\n",
    "# texts = text_splitter.create_documents(pages)\n",
    "\n",
    "# print(texts)\n",
    "\n",
    "# dataset_path = \"hub://\" + org_id + \"/data\"\n",
    "# embeddings = OpenAIEmbeddings()\n",
    "# db = DeepLake.from_documents(\n",
    "#     texts, embeddings, dataset_path=dataset_path, overwrite=True, runtime={\"tensor_db\": True}\n",
    "# )"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Ask questions\n",
    "\n",
    "Now we can ask a question and get an answer back with a semantic search:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "db = DeepLake(dataset_path=dataset_path, read_only=True, embedding_function=embeddings)\n",
    "\n",
    "retriever = db.as_retriever()\n",
    "retriever.search_kwargs[\"distance_metric\"] = \"cos\"\n",
    "retriever.search_kwargs[\"k\"] = 4\n",
    "\n",
    "qa = RetrievalQA.from_chain_type(\n",
    "    llm=OpenAI(), chain_type=\"stuff\", retriever=retriever, return_source_documents=False\n",
    ")\n",
    "\n",
    "# Example query: What was the restaurant the group was talking about called?\n",
    "query = input(\"Enter query:\")\n",
    "\n",
    "# Example answer: The Hungry Lobster\n",
    "ans = qa({\"query\": query})\n",
    "\n",
    "print(ans)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}