@ -44,7 +44,7 @@
"metadata": {},
"metadata": {},
"source": [
"source": [
"_Note: depending on your LangChain setup, you may need to install/upgrade other dependencies needed for this demo_\n",
"_Note: depending on your LangChain setup, you may need to install/upgrade other dependencies needed for this demo_\n",
"_(specifically, recent versions of `datasets` `openai` `pypdf` and `tiktoken` are required)._"
"_(specifically, recent versions of `datasets`, `openai`, `pypdf` and `tiktoken` are required)._"
]
]
},
},
{
{
@ -65,7 +65,6 @@
"from langchain.embeddings import OpenAIEmbeddings\n",
"from langchain.embeddings import OpenAIEmbeddings\n",
"from langchain.prompts import ChatPromptTemplate\n",
"from langchain.prompts import ChatPromptTemplate\n",
"\n",
"\n",
"# if not present yet, run: pip install \"datasets==2.14.6\"\n",
"from langchain.schema import Document\n",
"from langchain.schema import Document\n",
"from langchain.schema.output_parser import StrOutputParser\n",
"from langchain.schema.output_parser import StrOutputParser\n",
"from langchain.schema.runnable import RunnablePassthrough\n",
"from langchain.schema.runnable import RunnablePassthrough\n",
@ -145,7 +144,7 @@
"outputs": [],
"outputs": [],
"source": [
"source": [
"ASTRA_DB_API_ENDPOINT = input(\"ASTRA_DB_API_ENDPOINT = \")\n",
"ASTRA_DB_API_ENDPOINT = input(\"ASTRA_DB_API_ENDPOINT = \")\n",
"ASTRA_DB_TOKEN = getpass(\"ASTRA_DB_TOKEN = \")"
"ASTRA_DB_APPLICATION_ TOKEN = getpass(\"ASTRA_DB_APPLICATION _TOKEN = \")"
]
]
},
},
{
{
@ -159,7 +158,7 @@
" embedding=embe,\n",
" embedding=embe,\n",
" collection_name=\"astra_vector_demo\",\n",
" collection_name=\"astra_vector_demo\",\n",
" api_endpoint=ASTRA_DB_API_ENDPOINT,\n",
" api_endpoint=ASTRA_DB_API_ENDPOINT,\n",
" token=ASTRA_DB_TOKEN,\n",
" token=ASTRA_DB_APPLICATION_ TOKEN,\n",
")"
")"
]
]
},
},
@ -171,6 +170,14 @@
"### Load a dataset"
"### Load a dataset"
]
]
},
},
{
"cell_type": "markdown",
"id": "552e56b0-301a-4b06-99c7-57ba6faa966f",
"metadata": {},
"source": [
"Convert each entry in the source dataset into a `Document`, then write them into the vector store:"
]
},
{
{
"cell_type": "code",
"cell_type": "code",
"execution_count": null,
"execution_count": null,
@ -190,6 +197,16 @@
"print(f\"\\nInserted {len(inserted_ids)} documents.\")"
"print(f\"\\nInserted {len(inserted_ids)} documents.\")"
]
]
},
},
{
"cell_type": "markdown",
"id": "79d4f436-ef04-4288-8f79-97c9abb983ed",
"metadata": {},
"source": [
"In the above, `metadata` dictionaries are created from the source data and are part of the `Document`.\n",
"\n",
"_Note: check the [Astra DB API Docs](https://docs.datastax.com/en/astra-serverless/docs/develop/dev-with-json.html#_json_api_limits) for the valid metadata field names: some characters are reserved and cannot be used._"
]
},
{
{
"cell_type": "markdown",
"cell_type": "markdown",
"id": "084d8802-ab39-4262-9a87-42eafb746f92",
"id": "084d8802-ab39-4262-9a87-42eafb746f92",
@ -213,6 +230,16 @@
"print(f\"\\nInserted {len(inserted_ids_2)} documents.\")"
"print(f\"\\nInserted {len(inserted_ids_2)} documents.\")"
]
]
},
},
{
"cell_type": "markdown",
"id": "63840eb3-8b29-4017-bc2f-301bf5001f28",
"metadata": {},
"source": [
"_Note: you may want to speed up the execution of `add_texts` and `add_documents` by increasing the concurrency level for_\n",
"_these bulk operations - check out the `*_concurrency` parameters in the class constructor and the `add_texts` docstrings_\n",
"_for more details. Depending on the network and the client machine specifications, your best-performing choice of parameters may vary._"
]
},
{
{
"cell_type": "markdown",
"cell_type": "markdown",
"id": "c031760a-1fc5-4855-adf2-02ed52fe2181",
"id": "c031760a-1fc5-4855-adf2-02ed52fe2181",
@ -625,7 +652,7 @@
"outputs": [],
"outputs": [],
"source": [
"source": [
"ASTRA_DB_ID = input(\"ASTRA_DB_ID = \")\n",
"ASTRA_DB_ID = input(\"ASTRA_DB_ID = \")\n",
"ASTRA_DB_TOKEN = getpass(\"ASTRA_DB_TOKEN = \")\n",
"ASTRA_DB_APPLICATION_ TOKEN = getpass(\"ASTRA_DB_APPLICATION _TOKEN = \")\n",
"\n",
"\n",
"desired_keyspace = input(\"ASTRA_DB_KEYSPACE (optional, can be left empty) = \")\n",
"desired_keyspace = input(\"ASTRA_DB_KEYSPACE (optional, can be left empty) = \")\n",
"if desired_keyspace:\n",
"if desired_keyspace:\n",
@ -645,7 +672,7 @@
"\n",
"\n",
"cassio.init(\n",
"cassio.init(\n",
" database_id=ASTRA_DB_ID,\n",
" database_id=ASTRA_DB_ID,\n",
" token=ASTRA_DB_TOKEN,\n",
" token=ASTRA_DB_APPLICATION_ TOKEN,\n",
" keyspace=ASTRA_DB_KEYSPACE,\n",
" keyspace=ASTRA_DB_KEYSPACE,\n",
")"
")"
]
]