{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "59723cea",
   "metadata": {},
   "source": [
    "# StarRocks\n",
    "\n",
    ">[StarRocks](https://www.starrocks.io/) is a high-performance analytical database.\n",
    "`StarRocks` is a next-gen, sub-second MPP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.\n",
    "\n",
    ">`StarRocks` is usually categorized as an OLAP database, and it has shown excellent performance in [ClickBench — a Benchmark For Analytical DBMS](https://benchmark.clickhouse.com/). Because it has a very fast vectorized execution engine, it can also be used as a fast vector database.\n",
    "\n",
    "This notebook shows how to use the StarRocks vector store."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1685854f",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "311d44bb-4aca-4f3b-8f97-5e1f29238e40",
   "metadata": {},
   "outputs": [],
   "source": [
    "#!pip install pymysql"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c891bba",
   "metadata": {},
   "source": [
    "Set `update_vectordb = False` at the beginning. If no docs have been updated, there is no need to rebuild their embeddings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "3c85fb93",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/dirlt/utils/py3env/lib/python3.9/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.7) or chardet (5.1.0)/charset_normalizer (2.0.9) doesn't match a supported version!\n",
      "  warnings.warn(\"urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported \"\n"
     ]
    }
   ],
   "source": [
    "from langchain.embeddings.openai import OpenAIEmbeddings\n",
    "from langchain.vectorstores import StarRocks\n",
    "from langchain.vectorstores.starrocks import StarRocksSettings\n",
    "from langchain.vectorstores import Chroma\n",
    "from langchain.text_splitter import CharacterTextSplitter, TokenTextSplitter\n",
    "from langchain.llms import OpenAI\n",
    "from langchain.chains import VectorDBQA\n",
    "from langchain.document_loaders import DirectoryLoader\n",
    "from langchain.chains import RetrievalQA\n",
    "from langchain.document_loaders import TextLoader, UnstructuredMarkdownLoader\n",
    "\n",
    "update_vectordb = False"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ee821c00",
   "metadata": {},
   "source": [
    "## Load docs and split them into tokens"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "34ba0cfd",
   "metadata": {},
   "source": [
    "Load all Markdown files under the `docs` directory.\n",
    "\n",
    "For the StarRocks documentation, you can clone the repo from https://github.com/StarRocks/starrocks; it contains a `docs` directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "85912696",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = DirectoryLoader(\n",
    "    \"./docs\", glob=\"**/*.md\", loader_cls=UnstructuredMarkdownLoader\n",
    ")\n",
    "documents = loader.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b415fe2a",
   "metadata": {},
   "source": [
    "Split the docs into token-based chunks, and set `update_vectordb = True` because there are new docs/chunks to embed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "07e8acff",
   "metadata": {},
   "outputs": [],
   "source": [
    "# load text splitter and split docs into snippets of text\n",
    "text_splitter = TokenTextSplitter(chunk_size=400, chunk_overlap=50)\n",
    "split_docs = text_splitter.split_documents(documents)\n",
    "\n",
    "# tell vectordb to update text embeddings\n",
    "update_vectordb = True"
   ]
  },
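  {
   "cell_type": "markdown",
   "id": "7c1e5a90",
   "metadata": {},
   "source": [
    "To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping windows. It is only an illustration, not the splitter's actual code: `chunk_with_overlap` is a hypothetical helper, and it splits on whitespace, whereas `TokenTextSplitter` counts tokens produced by a real tokenizer. Each chunk repeats the last `chunk_overlap` items of the previous one, so text near a boundary appears in both chunks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2d4b8f16",
   "metadata": {},
   "outputs": [],
   "source": [
    "def chunk_with_overlap(tokens, chunk_size, chunk_overlap):\n",
    "    \"\"\"Split a list into windows of chunk_size items sharing chunk_overlap items.\"\"\"\n",
    "    if chunk_overlap >= chunk_size:\n",
    "        raise ValueError(\"chunk_overlap must be smaller than chunk_size\")\n",
    "    step = chunk_size - chunk_overlap  # how far each new window advances\n",
    "    chunks = []\n",
    "    start = 0\n",
    "    while start < len(tokens):\n",
    "        chunks.append(tokens[start : start + chunk_size])\n",
    "        if start + chunk_size >= len(tokens):\n",
    "            break  # the last window already reached the end\n",
    "        start += step\n",
    "    return chunks\n",
    "\n",
    "\n",
    "# Assumption: whitespace-separated words stand in for tokens.\n",
    "words = \"a b c d e f g h i j\".split()\n",
    "for chunk in chunk_with_overlap(words, chunk_size=4, chunk_overlap=1):\n",
    "    print(chunk)"
   ]
  },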
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "1f365370",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
"Document(page_content='Compile StarRocks with Docker\\n\\nThis topic describes how to compile StarRocks using Docker.\\n\\nOverview\\n\\nStarRocks provides development environment images for both Ubuntu 22.04 and CentOS 7.9. With the image, you can launch a Docker container and compile StarRocks in the container.\\n\\nStarRocks version and DEV ENV image\\n\\nDifferent branches of StarRocks correspond to different development environment images provided on StarRocks Docker Hub.\\n\\nFor Ubuntu 22.04:\\n\\n| Branch name | Image name |\\n | --------------- | ----------------------------------- |\\n | main | starrocks/dev-env-ubuntu:latest |\\n | branch-3.0 | starrocks/dev-env-ubuntu:3.0-latest |\\n | branch-2.5 | starrocks/dev-env-ubuntu:2.5-latest |\\n\\nFor CentOS 7.9:\\n\\n| Branch name | Image name |\\n | --------------- | ------------------------------------ |\\n | main | starrocks/dev-env-centos7:latest |\\n | branch-3.0 | starrocks/dev-env-centos7:3.0-latest |\\n | branch-2.5 | starrocks/dev-env-centos7:2.5-latest |\\n\\nPrerequisites\\n\\nBefore compiling StarRocks, make sure the following requirements are satisfied:\\n\\nHardware\\n\\n', metadata={'source': 'docs/developers/build-starrocks/Build_in_docker.md'})"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "split_docs[-20]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "50012b29",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# docs = 657, # splits = 2802\n"
     ]
    }
   ],
   "source": [
    "print(\"# docs = %d, # splits = %d\" % (len(documents), len(split_docs)))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5371f152",
   "metadata": {},
   "source": [
    "## Create vectordb instance"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15702d9c",
   "metadata": {},
   "source": [
    "### Use StarRocks as vectordb"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "ced7dbe1",
   "metadata": {},
   "outputs": [],
   "source": [
    "def gen_starrocks(update_vectordb, embeddings, settings):\n",
    "    if update_vectordb:\n",
    "        docsearch = StarRocks.from_documents(split_docs, embeddings, config=settings)\n",
    "    else:\n",
    "        docsearch = StarRocks(embeddings, settings)\n",
    "    return docsearch"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15d86fda",
   "metadata": {},
   "source": [
    "## Convert tokens into embeddings and put them into vectordb"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ff1322ea",
   "metadata": {},
   "source": [
    "Here we use StarRocks as the vectordb. You can configure the StarRocks instance via `StarRocksSettings`.\n",
    "\n",
    "Configuring a StarRocks instance is much like configuring a MySQL instance. You need to specify:\n",
    "1. host/port\n",
    "2. username (default: 'root')\n",
    "3. password (default: '')\n",
    "4. database (default: 'default')\n",
    "5. table (default: 'langchain')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "26410d9b",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
"Inserting data...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2802/2802 [02:26<00:00, 19.11it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m\u001b[1mzya.langchain @ 127.0.0.1:41003\u001b[0m\n",
      "\n",
      "\u001b[1musername: root\u001b[0m\n",
      "\n",
      "Table Schema:\n",
      "----------------------------------------------------------------------------\n",
      "|\u001b[94mname \u001b[0m|\u001b[96mtype \u001b[0m|\u001b[96mkey \u001b[0m|\n",
      "----------------------------------------------------------------------------\n",
      "|\u001b[94mid \u001b[0m|\u001b[96mvarchar(65533) \u001b[0m|\u001b[96mtrue \u001b[0m|\n",
      "|\u001b[94mdocument \u001b[0m|\u001b[96mvarchar(65533) \u001b[0m|\u001b[96mfalse \u001b[0m|\n",
      "|\u001b[94membedding \u001b[0m|\u001b[96marray<float> \u001b[0m|\u001b[96mfalse \u001b[0m|\n",
      "|\u001b[94mmetadata \u001b[0m|\u001b[96mvarchar(65533) \u001b[0m|\u001b[96mfalse \u001b[0m|\n",
      "----------------------------------------------------------------------------\n",
      "\n"
     ]
    }
   ],
   "source": [
    "embeddings = OpenAIEmbeddings()\n",
    "\n",
    "# configure StarRocks settings (host/port/user/pw/db)\n",
    "settings = StarRocksSettings()\n",
    "settings.port = 41003\n",
    "settings.host = \"127.0.0.1\"\n",
    "settings.username = \"root\"\n",
    "settings.password = \"\"\n",
    "settings.database = \"zya\"\n",
    "docsearch = gen_starrocks(update_vectordb, embeddings, settings)\n",
    "\n",
    "print(docsearch)\n",
    "\n",
    "update_vectordb = False"
   ]
  },
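  {
   "cell_type": "markdown",
   "id": "9e3a6b47",
   "metadata": {},
   "source": [
    "A retriever built on a vector store returns the stored chunks whose embeddings are most similar to the embedding of the query. The toy sketch below ranks two made-up chunks by cosine similarity in plain Python. It is an assumption-laden illustration of the idea, not the code path the StarRocks integration runs: the chunk names and 3-dimensional vectors are hypothetical, and real OpenAI embeddings have far more dimensions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b05d7c21",
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "\n",
    "def cosine_similarity(a, b):\n",
    "    \"\"\"Cosine of the angle between two equal-length, non-zero vectors.\"\"\"\n",
    "    dot = sum(x * y for x, y in zip(a, b))\n",
    "    norm_a = math.sqrt(sum(x * x for x in a))\n",
    "    norm_b = math.sqrt(sum(y * y for y in b))\n",
    "    return dot / (norm_a * norm_b)\n",
    "\n",
    "\n",
    "# Hypothetical embeddings, chosen by hand for illustration only.\n",
    "toy_store = {\n",
    "    \"chunk about query profiles\": [0.9, 0.1, 0.0],\n",
    "    \"chunk about Docker builds\": [0.0, 0.2, 0.9],\n",
    "}\n",
    "query_embedding = [1.0, 0.0, 0.0]\n",
    "\n",
    "# Rank chunk names from most to least similar to the query.\n",
    "ranked = sorted(\n",
    "    toy_store,\n",
    "    key=lambda name: cosine_similarity(query_embedding, toy_store[name]),\n",
    "    reverse=True,\n",
    ")\n",
    "print(ranked)"
   ]
  },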
  {
   "cell_type": "markdown",
   "id": "bde66626",
   "metadata": {},
   "source": [
    "## Build a QA chain and ask it a question"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "84921814",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " No, profile is not enabled by default. To enable profile, set the variable `enable_profile` to `true` using the command `set enable_profile = true;`\n"
     ]
    }
   ],
   "source": [
    "llm = OpenAI()\n",
    "qa = RetrievalQA.from_chain_type(\n",
    "    llm=llm, chain_type=\"stuff\", retriever=docsearch.as_retriever()\n",
    ")\n",
    "query = \"is profile enabled by default? if not, how to enable profile?\"\n",
    "resp = qa.run(query)\n",
    "print(resp)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}