langchain/docs/extras/integrations/providers/vectara/vectara_text_generation.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Vectara Text Generation\n",
    "\n",
    "This notebook is based on [text generation](https://github.com/langchain-ai/langchain/blob/master/docs/modules/chains/index_examples/vector_db_text_generation.ipynb) notebook and adapted to Vectara."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prepare Data\n",
    "\n",
    "First, we prepare the data. For this example, we fetch a documentation site that consists of markdown files hosted on Github and split them into small enough Documents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from langchain.llms import OpenAI\n",
    "from langchain.docstore.document import Document\n",
    "import requests\n",
    "from langchain.vectorstores import Vectara\n",
    "from langchain.text_splitter import CharacterTextSplitter\n",
    "from langchain.prompts import PromptTemplate\n",
    "import pathlib\n",
    "import subprocess\n",
    "import tempfile"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Cloning into '.'...\n"
     ]
    }
   ],
   "source": [
    "def get_github_docs(repo_owner, repo_name):\n",
    "    with tempfile.TemporaryDirectory() as d:\n",
    "        subprocess.check_call(\n",
    "            f\"git clone --depth 1 https://github.com/{repo_owner}/{repo_name}.git .\",\n",
    "            cwd=d,\n",
    "            shell=True,\n",
    "        )\n",
    "        git_sha = (\n",
    "            subprocess.check_output(\"git rev-parse HEAD\", shell=True, cwd=d)\n",
    "            .decode(\"utf-8\")\n",
    "            .strip()\n",
    "        )\n",
    "        repo_path = pathlib.Path(d)\n",
    "        markdown_files = list(repo_path.glob(\"*/*.md\")) + list(\n",
    "            repo_path.glob(\"*/*.mdx\")\n",
    "        )\n",
    "        for markdown_file in markdown_files:\n",
    "            with open(markdown_file, \"r\") as f:\n",
    "                relative_path = markdown_file.relative_to(repo_path)\n",
    "                github_url = f\"https://github.com/{repo_owner}/{repo_name}/blob/{git_sha}/{relative_path}\"\n",
    "                yield Document(page_content=f.read(), metadata={\"source\": github_url})\n",
    "\n",
    "\n",
    "sources = get_github_docs(\"yirenlu92\", \"deno-manual-forked\")\n",
    "\n",
    "source_chunks = []\n",
    "splitter = CharacterTextSplitter(separator=\" \", chunk_size=1024, chunk_overlap=0)\n",
    "for source in sources:\n",
    "    for chunk in splitter.split_text(source.page_content):\n",
    "        source_chunks.append(chunk)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set Up Vector DB\n",
    "\n",
    "Now that we have the documentation content in chunks, let's put all this information in a vector index for easy retrieval."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "search_index = Vectara.from_texts(source_chunks, embedding=None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set Up LLM Chain with Custom Prompt\n",
    "\n",
    "Next, let's set up a simple LLM chain but give it a custom prompt for blog post generation. Note that the custom prompt is parameterized and takes two inputs: `context`, which will be the documents fetched from the vector search, and `topic`, which is given by the user."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.chains import LLMChain\n",
    "\n",
    "prompt_template = \"\"\"Use the context below to write a 400 word blog post about the topic below:\n",
    "    Context: {context}\n",
    "    Topic: {topic}\n",
    "    Blog post:\"\"\"\n",
    "\n",
    "PROMPT = PromptTemplate(template=prompt_template, input_variables=[\"context\", \"topic\"])\n",
    "\n",
    "llm = OpenAI(openai_api_key=os.environ[\"OPENAI_API_KEY\"], temperature=0)\n",
    "\n",
    "chain = LLMChain(llm=llm, prompt=PROMPT)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generate Text\n",
    "\n",
    "Finally, we write a function to apply our inputs to the chain. The function takes an input parameter `topic`. We find the documents in the vector index that correspond to that `topic`, and use them as additional context in our simple LLM chain."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_blog_post(topic):\n",
    "    docs = search_index.similarity_search(topic, k=4)\n",
    "    inputs = [{\"context\": doc.page_content, \"topic\": topic} for doc in docs]\n",
    "    print(chain.apply(inputs))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[{'text': '\\n\\nWhen it comes to running Deno CLI tasks, environment variables can be a powerful tool for customizing the behavior of your tasks. With the Deno Task Definition interface, you can easily configure environment variables to be set when executing your tasks.\\n\\nThe Deno Task Definition interface is configured in a `tasks.json` within your workspace. It includes a `env` field, which allows you to specify any environment variables that should be set when executing the task. For example, if you wanted to set the `NODE_ENV` environment variable to `production` when running a Deno task, you could add the following to your `tasks.json`:\\n\\n```json\\n{\\n \"version\": \"2.0.0\",\\n \"tasks\": [\\n {\\n \"type\": \"deno\",\\n \"command\": \"run\",\\n \"args\": [\\n \"mod.ts\"\\n ],\\n \"env\": {\\n \"NODE_ENV\": \"production\"\\n },\\n \"problemMatcher\": [\\n \"$deno\"\\n ],\\n \"label\": \"deno: run\"\\n }\\n ]\\n}\\n```\\n\\nThe Deno language server and this extension also'}, {'text': '\\n\\nEnvironment variables are a great way to store and access data in your applications. They are especially useful when you need to store sensitive information such as API keys, passwords, and other credentials.\\n\\nDeno.env is a library that provides getter and setter methods for environment variables. This makes it easy to store and retrieve data from environment variables. For example, you can use the setter method to set a variable like this:\\n\\n```ts\\nDeno.env.set(\"FIREBASE_API_KEY\", \"examplekey123\");\\nDeno.env.set(\"FIREBASE_AUTH_DOMAIN\", \"firebasedomain.com\");\\n```\\n\\nAnd then you can use the getter method to retrieve the data like this:\\n\\n```ts\\nconsole.log(Deno.env.get(\"FIREBASE_API_KEY\")); // examplekey123\\nconsole.log(Deno.env.get(\"FIREBASE_AUTH_DOMAIN\")); // firebasedomain.com\\n```\\n\\nYou can also store environment variables in a `.env` file and retrieve them using `dotenv` in the standard'}, {'text': '\\n\\nEnvironment variables are a powerful tool for developers, allowing them to store and access data without hard-coding it into their applications. Deno, the secure JavaScript and TypeScript runtime, offers built-in support for environment variables with the `Deno.env` API.\\n\\nUsing `Deno.env` is simple. It has getter and setter methods that allow you to easily set and retrieve environment variables. For example, you can set the `FIREBASE_API_KEY` and `FIREBASE_AUTH_DOMAIN` environment variables like this:\\n\\n```ts\\nDeno.env.set(\"FIREBASE_API_KEY\", \"examplekey123\");\\nDeno.env.set(\"FIREBASE_AUTH_DOMAIN\", \"firebasedomain.com\");\\n```\\n\\nAnd then you can retrieve them like this:\\n\\n```ts\\nconsole.log(Deno.env.get(\"FIREBASE_API_KEY\")); // examplekey123\\nconsole.log(Deno.env.get(\"FIREBASE_AUTH_DOMAIN\")); // firebasedomain.com\\n```'}, {'text': '\\n\\nEnvironment variables are an important part of any programming language, and Deno is no exception. Environment variables are used to store information about the environment in which a program is running, such as the operating system, user preferences, and other settings. In Deno, environment variables are used to set up proxies, control the output of colors, and more.\\n\\nThe `NO_PROXY` environment variable is a de facto standard in Deno that indicates which hosts should bypass the proxy set in other environment variables. This is useful for developers who want to access certain resources without having to go through a proxy. For more information on this standard, you can check out the website no-color.org.\\n\\nThe `Deno.noColor` environment variable is another important environment variable in Deno. This variable is used to control the output of colors in the Deno terminal. By setting this variable to true, you can disable the output of colors in the terminal. This can be useful for developers who want to focus on the output of their code without being distracted by the colors.\\n\\nFinally, the `Deno.env` environment variable is used to access the environment variables set in the Deno runtime. This variable is useful for developers who want'}]\n"
     ]
    }
   ],
   "source": [
    "generate_blog_post(\"environment variables\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}