mirror of
https://github.com/hwchase17/langchain
synced 2024-11-04 06:00:26 +00:00
a591cdb67d
**Description:** Add cookbook for RAG with baidu QIANFAN and elasticsearch. Co-authored-by: wemysschen <root@icoding-cwx.bcc-szzj.baidu.com>
169 lines
5.3 KiB
Plaintext
169 lines
5.3 KiB
Plaintext
{
|
||
"cells": [
|
||
{
|
||
"attachments": {},
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"# RAG based on Qianfan and BES"
|
||
]
|
||
},
|
||
{
|
||
"attachments": {},
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"This notebook is an implementation of Retrieval augmented generation (RAG) using Baidu Qianfan Platform combined with Baidu ElasricSearch, where the original data is located on BOS.\n",
|
||
"## Baidu Qianfan\n",
|
||
"Baidu AI Cloud Qianfan Platform is a one-stop large model development and service operation platform for enterprise developers. Qianfan not only provides including the model of Wenxin Yiyan (ERNIE-Bot) and the third-party open-source models, but also provides various AI development tools and the whole set of development environment, which facilitates customers to use and develop large model applications easily.\n",
|
||
"\n",
|
||
"## Baidu ElasticSearch\n",
|
||
"[Baidu Cloud VectorSearch](https://cloud.baidu.com/doc/BES/index.html?from=productToDoc) is a fully managed, enterprise-level distributed search and analysis service which is 100% compatible to open source. Baidu Cloud VectorSearch provides low-cost, high-performance, and reliable retrieval and analysis platform level product services for structured/unstructured data. As a vector database , it supports multiple index types and similarity distance methods. "
|
||
]
|
||
},
|
||
{
|
||
"attachments": {},
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Installation and Setup\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"#!pip install qianfan\n",
|
||
"#!pip install bce-python-sdk\n",
|
||
"#!pip install elasticsearch == 7.11.0"
|
||
]
|
||
},
|
||
{
|
||
"attachments": {},
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Imports"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"from baidubce.bce_client_configuration import BceClientConfiguration\n",
|
||
"from baidubce.auth.bce_credentials import BceCredentials\n",
|
||
"from langchain.document_loaders.baiducloud_bos_directory import BaiduBOSDirectoryLoader\n",
|
||
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
|
||
"from langchain.embeddings.huggingface import HuggingFaceEmbeddings\n",
|
||
"from langchain.vectorstores import BESVectorStore\n",
|
||
"from langchain.llms.baidu_qianfan_endpoint import QianfanLLMEndpoint"
|
||
]
|
||
},
|
||
{
|
||
"attachments": {},
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Document loading"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"bos_host = \"your bos eddpoint\"\n",
|
||
"access_key_id = \"your bos access ak\"\n",
|
||
"secret_access_key = \"your bos access sk\"\n",
|
||
"\n",
|
||
"# create BceClientConfiguration\n",
|
||
"config = BceClientConfiguration(credentials=BceCredentials(access_key_id, secret_access_key), endpoint = bos_host)\n",
|
||
"\n",
|
||
"loader = BaiduBOSDirectoryLoader(conf=config, bucket=\"llm-test\", prefix=\"llm/\")\n",
|
||
"documents = loader.load()\n",
|
||
"\n",
|
||
"text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=0)\n",
|
||
"split_docs = text_splitter.split_documents(documents)"
|
||
]
|
||
},
|
||
{
|
||
"attachments": {},
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Embedding and VectorStore"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"embeddings = HuggingFaceEmbeddings(model_name=\"shibing624/text2vec-base-chinese\")\n",
|
||
"embeddings.client = sentence_transformers.SentenceTransformer(embeddings.model_name)\n",
|
||
"\n",
|
||
"db = BESVectorStore.from_documents(\n",
|
||
" documents=split_docs, embedding=embeddings, bes_url=\"your bes url\", index_name='test-index', vector_query_field='vector'\n",
|
||
" )\n",
|
||
"\n",
|
||
"db.client.indices.refresh(index='test-index')\n",
|
||
"retriever = db.as_retriever()"
|
||
]
|
||
},
|
||
{
|
||
"attachments": {},
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"## QA Retriever"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {},
|
||
"outputs": [],
|
||
"source": [
|
||
"llm = QianfanLLMEndpoint(model=\"ERNIE-Bot\", qianfan_ak='your qianfan ak', qianfan_sk='your qianfan sk', streaming=True)\n",
|
||
"qa = RetrievalQA.from_chain_type(llm=llm, chain_type=\"refine\", retriever=retriever, return_source_documents=True)\n",
|
||
"\n",
|
||
"query = \"什么是张量?\"\n",
|
||
"print(qa.run(query))"
|
||
]
|
||
},
|
||
{
|
||
"attachments": {},
|
||
"cell_type": "markdown",
|
||
"metadata": {},
|
||
"source": [
|
||
"> 张量(Tensor)是一个数学概念,用于表示多维数据。它是一个可以表示多个数值的数组,可以是标量、向量、矩阵等。在深度学习和人工智能领域中,张量常用于表示神经网络的输入、输出和权重等。"
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"kernelspec": {
|
||
"display_name": "Python 3",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"name": "python",
|
||
"version": "3.9.17"
|
||
},
|
||
"orig_nbformat": 4,
|
||
"vscode": {
|
||
"interpreter": {
|
||
"hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"
|
||
}
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 2
|
||
}
|