Web research retriever (#8102)

Given a user question, this will:
* Use an LLM to generate a set of related search queries.
* Run a Google search for each query.
* Collect the URLs from the search results.
* Check for any new URLs that haven't yet been processed (i.e., not in
self.url_database).
* Load, transform, and add only these new URLs to the vectorstore.
* Query the vectorstore for documents relevant to the questions
generated by the LLM.
* Return only the unique documents as the final result.

This avoids reprocessing URLs across multiple runs of similar queries,
which should improve the retriever's performance. It also keeps track of
all URLs that have been processed, which can be useful for debugging or
for understanding the retriever's behavior. A rough sketch of the
de-duplication steps is shown below.
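
As a sketch (the function names here are illustrative, not part of the API; see
`langchain/retrievers/web_research.py` in this diff for the real implementation):

```python
from typing import List
from langchain.schema import Document

def new_urls_to_load(search_result_urls: List[str], url_database: List[str]) -> List[str]:
    """Keep only URLs that have not already been loaded into the vectorstore."""
    return list(set(search_result_urls).difference(url_database))

def unique_documents(docs: List[Document]) -> List[Document]:
    """De-duplicate retrieved splits by page content and metadata."""
    unique = {
        (doc.page_content, tuple(sorted(doc.metadata.items()))): doc for doc in docs
    }
    return list(unique.values())
```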

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Lance Martin committed via GitHub
parent d1d691caa4
commit 7a00f17033

@ -0,0 +1,580 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "9c0ffe42",
"metadata": {},
"source": [
"# WebResearchRetriever\n",
"\n",
"Given a query, this retriever will: \n",
"\n",
"* Formulate a set of relate Google searches\n",
"* Search for each \n",
"* Load all the resulting URLs\n",
"* Then embed and perform similarity search with the query on the consolidate page content"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "13548212",
"metadata": {},
"outputs": [],
"source": [
"from langchain.retrievers.web_research import WebResearchRetriever"
]
},
{
"cell_type": "markdown",
"id": "90b1dcbd",
"metadata": {},
"source": [
"### Simple usage\n",
"\n",
"Specify the LLM to use for Google search query generation."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "e63d1c8b",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from langchain.vectorstores import Chroma\n",
"from langchain.embeddings import OpenAIEmbeddings\n",
"from langchain.chat_models.openai import ChatOpenAI\n",
"from langchain.utilities import GoogleSearchAPIWrapper\n",
"\n",
"# Vectorstore\n",
"vectorstore = Chroma(embedding_function=OpenAIEmbeddings(),persist_directory=\"./chroma_db_oai\")\n",
"\n",
"# LLM\n",
"llm = ChatOpenAI(temperature=0)\n",
"\n",
"# Search \n",
"os.environ[\"GOOGLE_CSE_ID\"] = \"xxx\"\n",
"os.environ[\"GOOGLE_API_KEY\"] = \"xxx\"\n",
"search = GoogleSearchAPIWrapper()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "118b50aa",
"metadata": {},
"outputs": [],
"source": [
"# Initialize\n",
"web_research_retriever = WebResearchRetriever.from_llm(\n",
" vectorstore=vectorstore,\n",
" llm=llm, \n",
" search=search, \n",
")"
]
},
{
"cell_type": "markdown",
"id": "39114da4",
"metadata": {},
"source": [
"`Run with citations`\n",
"\n",
"We can use `RetrievalQAWithSourcesChain` to retrieve docs and provide citations"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "0b330acd",
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Fetching pages: 100%|###################################################################################################################################| 1/1 [00:00<00:00, 3.33it/s]\n"
]
},
{
"data": {
"text/plain": [
"{'question': 'How do LLM Powered Autonomous Agents work?',\n",
" 'answer': \"LLM Powered Autonomous Agents work by using LLM (large language model) as the core controller of the agent's brain. It is complemented by several key components, including planning, memory, and tool use. The agent system is designed to be a powerful general problem solver. \\n\",\n",
" 'sources': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain.chains import RetrievalQAWithSourcesChain\n",
"user_input = \"How do LLM Powered Autonomous Agents work?\"\n",
"qa_chain = RetrievalQAWithSourcesChain.from_chain_type(llm,retriever=web_research_retriever)\n",
"result = qa_chain({\"question\": user_input})\n",
"result"
]
},
{
"cell_type": "markdown",
"id": "357559fd",
"metadata": {},
"source": [
"`Run with logging`\n",
"\n",
"Here, we use `get_relevant_documents` method to return docs."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "2c4e8ab3",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:langchain.retrievers.web_research:Generating questions for Google Search ...\n",
"INFO:langchain.retrievers.web_research:Questions for Google Search (raw): {'question': 'What is Task Decomposition in LLM Powered Autonomous Agents?', 'text': LineList(lines=['1. How do LLM powered autonomous agents utilize task decomposition?\\n', '2. Can you explain the concept of task decomposition in LLM powered autonomous agents?\\n', '3. What role does task decomposition play in the functioning of LLM powered autonomous agents?\\n', '4. Why is task decomposition important for LLM powered autonomous agents?\\n'])}\n",
"INFO:langchain.retrievers.web_research:Questions for Google Search: ['1. How do LLM powered autonomous agents utilize task decomposition?\\n', '2. Can you explain the concept of task decomposition in LLM powered autonomous agents?\\n', '3. What role does task decomposition play in the functioning of LLM powered autonomous agents?\\n', '4. Why is task decomposition important for LLM powered autonomous agents?\\n']\n",
"INFO:langchain.retrievers.web_research:Searching for relevat urls ...\n",
"INFO:langchain.retrievers.web_research:Searching for relevat urls ...\n",
"INFO:langchain.retrievers.web_research:Search results: [{'title': \"LLM Powered Autonomous Agents | Lil'Log\", 'link': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'snippet': 'Jun 23, 2023 ... Task decomposition can be done (1) by LLM with simple prompting like \"Steps for XYZ.\\\\n1.\" , \"What are the subgoals for achieving XYZ?'}]\n",
"INFO:langchain.retrievers.web_research:Searching for relevat urls ...\n",
"INFO:langchain.retrievers.web_research:Search results: [{'title': \"LLM Powered Autonomous Agents | Lil'Log\", 'link': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'snippet': 'Jun 23, 2023 ... Task decomposition can be done (1) by LLM with simple prompting like \"Steps for XYZ.\\\\n1.\" , \"What are the subgoals for achieving XYZ?\" , (2)\\xa0...'}]\n",
"INFO:langchain.retrievers.web_research:Searching for relevat urls ...\n",
"INFO:langchain.retrievers.web_research:Search results: [{'title': \"LLM Powered Autonomous Agents | Lil'Log\", 'link': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'snippet': 'Jun 23, 2023 ... In a LLM-powered autonomous agent system, LLM functions as the ... Task decomposition can be done (1) by LLM with simple prompting like\\xa0...'}]\n",
"INFO:langchain.retrievers.web_research:Searching for relevat urls ...\n",
"INFO:langchain.retrievers.web_research:Search results: [{'title': \"LLM Powered Autonomous Agents | Lil'Log\", 'link': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'snippet': 'Jun 23, 2023 ... Agent System Overview In a LLM-powered autonomous agent system, ... Task decomposition can be done (1) by LLM with simple prompting like\\xa0...'}]\n",
"INFO:langchain.retrievers.web_research:New URLs to load: []\n"
]
}
],
"source": [
"# Run\n",
"import logging\n",
"logging.basicConfig()\n",
"logging.getLogger(\"langchain.retrievers.web_research\").setLevel(logging.INFO)\n",
"user_input = \"What is Task Decomposition in LLM Powered Autonomous Agents?\"\n",
"docs = web_research_retriever.get_relevant_documents(user_input)"
]
},
{
"cell_type": "markdown",
"id": "b681a846",
"metadata": {},
"source": [
"`Generate answer using retrieved docs`\n",
"\n",
"We can use `load_qa_chain` for QA using the retrieved docs"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "ceca5681",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Task decomposition in LLM-powered autonomous agents refers to the process of breaking down a complex task into smaller, more manageable subgoals. This allows the agent to efficiently handle and execute the individual steps required to complete the overall task. By decomposing the task, the agent can prioritize and organize its actions, making it easier to plan and execute the necessary steps towards achieving the desired outcome.'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain.chains.question_answering import load_qa_chain\n",
"chain = load_qa_chain(llm, chain_type=\"stuff\")\n",
"output = chain({\"input_documents\": docs, \"question\": user_input},return_only_outputs=True)\n",
"output['output_text']"
]
},
{
"cell_type": "markdown",
"id": "0c0e57bb",
"metadata": {},
"source": [
"### More flexibility\n",
"\n",
"Pass an LLM chain with custom prompt and output parsing"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "3d84ea47",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import re\n",
"from typing import List\n",
"from langchain.chains import LLMChain\n",
"from pydantic import BaseModel, Field\n",
"from langchain.prompts import PromptTemplate\n",
"from langchain.output_parsers.pydantic import PydanticOutputParser\n",
"\n",
"# LLMChain\n",
"search_prompt = PromptTemplate(\n",
" input_variables=[\"question\"],\n",
" template=\"\"\"You are an assistant tasked with improving Google search \n",
" results. Generate FIVE Google search queries that are similar to\n",
" this question. The output should be a numbered list of questions and each\n",
" should have a question mark at the end: {question}\"\"\",\n",
")\n",
"\n",
"class LineList(BaseModel):\n",
" \"\"\"List of questions.\"\"\"\n",
"\n",
" lines: List[str] = Field(description=\"Questions\")\n",
"\n",
"class QuestionListOutputParser(PydanticOutputParser):\n",
" \"\"\"Output parser for a list of numbered questions.\"\"\"\n",
"\n",
" def __init__(self) -> None:\n",
" super().__init__(pydantic_object=LineList)\n",
"\n",
" def parse(self, text: str) -> LineList:\n",
" lines = re.findall(r\"\\d+\\..*?\\n\", text)\n",
" return LineList(lines=lines)\n",
" \n",
"llm_chain = LLMChain(\n",
" llm=llm,\n",
" prompt=search_prompt,\n",
" output_parser=QuestionListOutputParser(),\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "851b0471",
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:langchain.retrievers.web_research:Generating questions for Google Search ...\n",
"INFO:langchain.retrievers.web_research:Questions for Google Search (raw): {'question': 'What is Task Decomposition in LLM Powered Autonomous Agents?', 'text': LineList(lines=['1. How do LLM powered autonomous agents use task decomposition?\\n', '2. Why is task decomposition important for LLM powered autonomous agents?\\n', '3. Can you explain the concept of task decomposition in LLM powered autonomous agents?\\n', '4. What are the benefits of task decomposition in LLM powered autonomous agents?\\n'])}\n",
"INFO:langchain.retrievers.web_research:Questions for Google Search: ['1. How do LLM powered autonomous agents use task decomposition?\\n', '2. Why is task decomposition important for LLM powered autonomous agents?\\n', '3. Can you explain the concept of task decomposition in LLM powered autonomous agents?\\n', '4. What are the benefits of task decomposition in LLM powered autonomous agents?\\n']\n",
"INFO:langchain.retrievers.web_research:Searching for relevat urls ...\n",
"INFO:langchain.retrievers.web_research:Searching for relevat urls ...\n",
"INFO:langchain.retrievers.web_research:Search results: [{'title': \"LLM Powered Autonomous Agents | Lil'Log\", 'link': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'snippet': 'Jun 23, 2023 ... Task decomposition can be done (1) by LLM with simple prompting like \"Steps for XYZ.\\\\n1.\" , \"What are the subgoals for achieving XYZ?'}]\n",
"INFO:langchain.retrievers.web_research:Searching for relevat urls ...\n",
"INFO:langchain.retrievers.web_research:Search results: [{'title': \"LLM Powered Autonomous Agents | Lil'Log\", 'link': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'snippet': 'Jun 23, 2023 ... Task decomposition can be done (1) by LLM with simple prompting like \"Steps for XYZ.\\\\n1.\" , \"What are the subgoals for achieving XYZ?\" , (2)\\xa0...'}]\n",
"INFO:langchain.retrievers.web_research:Searching for relevat urls ...\n",
"INFO:langchain.retrievers.web_research:Search results: [{'title': \"LLM Powered Autonomous Agents | Lil'Log\", 'link': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'snippet': 'Jun 23, 2023 ... Task decomposition can be done (1) by LLM with simple prompting like \"Steps for XYZ.\\\\n1.\" , \"What are the subgoals for achieving XYZ?'}]\n",
"INFO:langchain.retrievers.web_research:Searching for relevat urls ...\n",
"INFO:langchain.retrievers.web_research:Search results: [{'title': \"LLM Powered Autonomous Agents | Lil'Log\", 'link': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'snippet': 'Jun 23, 2023 ... Task decomposition can be done (1) by LLM with simple prompting like \"Steps for XYZ.\\\\n1.\" , \"What are the subgoals for achieving XYZ?'}]\n",
"INFO:langchain.retrievers.web_research:New URLs to load: ['https://lilianweng.github.io/posts/2023-06-23-agent/']\n",
"INFO:langchain.retrievers.web_research:Grabbing most relevant splits from urls ...\n",
"Fetching pages: 100%|###################################################################################################################################| 1/1 [00:00<00:00, 6.32it/s]\n"
]
}
],
"source": [
"# Initialize\n",
"web_research_retriever_llm_chain = WebResearchRetriever(\n",
" vectorstore=vectorstore,\n",
" llm_chain=llm_chain, \n",
" search=search, \n",
")\n",
"\n",
"# Run\n",
"docs = web_research_retriever_llm_chain.get_relevant_documents(user_input)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "1ee52163",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(docs)"
]
},
{
"cell_type": "markdown",
"id": "4f9530c0",
"metadata": {},
"source": [
"### Run locally\n",
"\n",
"Specify LLM and embeddings that will run locally (e.g., on your laptop)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "8cf0d155",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"llama.cpp: loading model from /Users/rlm/Desktop/Code/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin\n",
"llama_model_load_internal: format = ggjt v3 (latest)\n",
"llama_model_load_internal: n_vocab = 32000\n",
"llama_model_load_internal: n_ctx = 4096\n",
"llama_model_load_internal: n_embd = 5120\n",
"llama_model_load_internal: n_mult = 256\n",
"llama_model_load_internal: n_head = 40\n",
"llama_model_load_internal: n_layer = 40\n",
"llama_model_load_internal: n_rot = 128\n",
"llama_model_load_internal: freq_base = 10000.0\n",
"llama_model_load_internal: freq_scale = 1\n",
"llama_model_load_internal: ftype = 2 (mostly Q4_0)\n",
"llama_model_load_internal: n_ff = 13824\n",
"llama_model_load_internal: model size = 13B\n",
"llama_model_load_internal: ggml ctx size = 0.09 MB\n",
"llama_model_load_internal: mem required = 9132.71 MB (+ 1608.00 MB per state)\n",
"llama_new_context_with_model: kv self size = 3200.00 MB\n",
"ggml_metal_init: allocating\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Found model file at /Users/rlm/.cache/gpt4all/ggml-all-MiniLM-L6-v2-f16.bin\n",
"llama_new_context_with_model: max tensor size = 87.89 MB\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"ggml_metal_init: using MPS\n",
"ggml_metal_init: loading '/Users/rlm/miniforge3/envs/llama/lib/python3.9/site-packages/llama_cpp/ggml-metal.metal'\n",
"ggml_metal_init: loaded kernel_add 0x110fbd600\n",
"ggml_metal_init: loaded kernel_mul 0x110fbeb30\n",
"ggml_metal_init: loaded kernel_mul_row 0x110fbf350\n",
"ggml_metal_init: loaded kernel_scale 0x110fbf9e0\n",
"ggml_metal_init: loaded kernel_silu 0x110fc0150\n",
"ggml_metal_init: loaded kernel_relu 0x110fbd950\n",
"ggml_metal_init: loaded kernel_gelu 0x110fbdbb0\n",
"ggml_metal_init: loaded kernel_soft_max 0x110fc14d0\n",
"ggml_metal_init: loaded kernel_diag_mask_inf 0x110fc1980\n",
"ggml_metal_init: loaded kernel_get_rows_f16 0x110fc22a0\n",
"ggml_metal_init: loaded kernel_get_rows_q4_0 0x110fc2ad0\n",
"ggml_metal_init: loaded kernel_get_rows_q4_1 0x110fc3260\n",
"ggml_metal_init: loaded kernel_get_rows_q2_K 0x110fc3ad0\n",
"ggml_metal_init: loaded kernel_get_rows_q3_K 0x110fc41c0\n",
"ggml_metal_init: loaded kernel_get_rows_q4_K 0x110fc48c0\n",
"ggml_metal_init: loaded kernel_get_rows_q5_K 0x110fc4fa0\n",
"ggml_metal_init: loaded kernel_get_rows_q6_K 0x110fc56a0\n",
"ggml_metal_init: loaded kernel_rms_norm 0x110fc5da0\n",
"ggml_metal_init: loaded kernel_norm 0x110fc64d0\n",
"ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x2a5c19990\n",
"ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x2a5c1d4a0\n",
"ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x2a5c19fc0\n",
"ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x2a5c1dcc0\n",
"ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x2a5c1e420\n",
"ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x2a5c1edc0\n",
"ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x2a5c1fd90\n",
"ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x2a5c20540\n",
"ggml_metal_init: loaded kernel_rope 0x2a5c20d40\n",
"ggml_metal_init: loaded kernel_alibi_f32 0x2a5c21730\n",
"ggml_metal_init: loaded kernel_cpy_f32_f16 0x2a5c21ab0\n",
"ggml_metal_init: loaded kernel_cpy_f32_f32 0x2a5c22080\n",
"ggml_metal_init: loaded kernel_cpy_f16_f16 0x2a5c231d0\n",
"ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB\n",
"ggml_metal_init: hasUnifiedMemory = true\n",
"ggml_metal_init: maxTransferRate = built-in GPU\n",
"ggml_metal_add_buffer: allocated 'data ' buffer, size = 6984.06 MB, ( 6984.52 / 21845.34)\n",
"ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1040.00 MB, ( 8024.52 / 21845.34)\n",
"ggml_metal_add_buffer: allocated 'kv ' buffer, size = 3202.00 MB, (11226.52 / 21845.34)\n",
"ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 597.00 MB, (11823.52 / 21845.34)\n",
"AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | \n",
"ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 512.00 MB, (12335.52 / 21845.34)\n",
"objc[33471]: Class GGMLMetalClass is implemented in both /Users/rlm/miniforge3/envs/llama/lib/python3.9/site-packages/llama_cpp/libllama.dylib (0x2c7368208) and /Users/rlm/miniforge3/envs/llama/lib/python3.9/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/libreplit-mainline-metal.dylib (0x5ebf48208). One of the two will be used. Which one is undefined.\n",
"objc[33471]: Class GGMLMetalClass is implemented in both /Users/rlm/miniforge3/envs/llama/lib/python3.9/site-packages/llama_cpp/libllama.dylib (0x2c7368208) and /Users/rlm/miniforge3/envs/llama/lib/python3.9/site-packages/gpt4all/llmodel_DO_NOT_MODIFY/build/libllamamodel-mainline-metal.dylib (0x5ec374208). One of the two will be used. Which one is undefined.\n"
]
}
],
"source": [
"from langchain.llms import LlamaCpp\n",
"from langchain.embeddings import GPT4AllEmbeddings\n",
"from langchain.callbacks.manager import CallbackManager\n",
"from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler\n",
"\n",
"n_gpu_layers = 1 # Metal set to 1 is enough.\n",
"n_batch = 512 # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon Chip.\n",
"callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])\n",
"llama = LlamaCpp(\n",
" model_path=\"/Users/rlm/Desktop/Code/llama.cpp/llama-2-13b-chat.ggmlv3.q4_0.bin\",\n",
" n_gpu_layers=n_gpu_layers,\n",
" n_batch=n_batch,\n",
" n_ctx=4096, # Context window\n",
" max_tokens=1000, # Max tokens to generate\n",
" f16_kv=True, # MUST set to True, otherwise you will run into problem after a couple of calls\n",
" callback_manager=callback_manager,\n",
" verbose=True,\n",
")\n",
"\n",
"vectorstore_llama = Chroma(embedding_function=GPT4AllEmbeddings(),persist_directory=\"./chroma_db_llama\")"
]
},
{
"cell_type": "markdown",
"id": "00f93dd4",
"metadata": {},
"source": [
"We supplied `StreamingStdOutCallbackHandler()`, so model outputs (e.g., generated questions) are streamed. \n",
"\n",
"We also have logging on, so we seem them there too."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "3e0561ca",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:langchain.retrievers.web_research:Generating questions for Google Search ...\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" Sure, here are five Google search queries that are similar to \"What is Task Decomposition in LLM Powered Autonomous Agents?\":\n",
"\n",
"1. How does Task Decomposition work in LLM Powered Autonomous Agents? \n",
"2. What are the benefits of using Task Decomposition in LLM Powered Autonomous Agents? \n",
"3. Can you provide examples of Task Decomposition in LLM Powered Autonomous Agents? \n",
"4. How does Task Decomposition improve the performance of LLM Powered Autonomous Agents? \n",
"5. What are some common challenges or limitations of using Task Decomposition in LLM Powered Autonomous Agents, and how can they be addressed?"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"llama_print_timings: load time = 8585.01 ms\n",
"llama_print_timings: sample time = 124.24 ms / 164 runs ( 0.76 ms per token, 1320.04 tokens per second)\n",
"llama_print_timings: prompt eval time = 8584.83 ms / 101 tokens ( 85.00 ms per token, 11.76 tokens per second)\n",
"llama_print_timings: eval time = 7268.55 ms / 163 runs ( 44.59 ms per token, 22.43 tokens per second)\n",
"llama_print_timings: total time = 16236.13 ms\n",
"INFO:langchain.retrievers.web_research:Questions for Google Search (raw): {'question': 'What is Task Decomposition in LLM Powered Autonomous Agents?', 'text': LineList(lines=['1. How does Task Decomposition work in LLM Powered Autonomous Agents? \\n', '2. What are the benefits of using Task Decomposition in LLM Powered Autonomous Agents? \\n', '3. Can you provide examples of Task Decomposition in LLM Powered Autonomous Agents? \\n', '4. How does Task Decomposition improve the performance of LLM Powered Autonomous Agents? \\n'])}\n",
"INFO:langchain.retrievers.web_research:Questions for Google Search: ['1. How does Task Decomposition work in LLM Powered Autonomous Agents? \\n', '2. What are the benefits of using Task Decomposition in LLM Powered Autonomous Agents? \\n', '3. Can you provide examples of Task Decomposition in LLM Powered Autonomous Agents? \\n', '4. How does Task Decomposition improve the performance of LLM Powered Autonomous Agents? \\n']\n",
"INFO:langchain.retrievers.web_research:Searching for relevat urls ...\n",
"INFO:langchain.retrievers.web_research:Searching for relevat urls ...\n",
"INFO:langchain.retrievers.web_research:Search results: [{'title': \"LLM Powered Autonomous Agents | Lil'Log\", 'link': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'snippet': 'Jun 23, 2023 ... Task decomposition can be done (1) by LLM with simple prompting like \"Steps for XYZ.\\\\n1.\" , \"What are the subgoals for achieving XYZ?'}]\n",
"INFO:langchain.retrievers.web_research:Searching for relevat urls ...\n",
"INFO:langchain.retrievers.web_research:Search results: [{'title': \"LLM Powered Autonomous Agents | Lil'Log\", 'link': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'snippet': 'Jun 23, 2023 ... Task decomposition can be done (1) by LLM with simple prompting like \"Steps for XYZ.\\\\n1.\" , \"What are the subgoals for achieving XYZ?\" , (2)\\xa0...'}]\n",
"INFO:langchain.retrievers.web_research:Searching for relevat urls ...\n",
"INFO:langchain.retrievers.web_research:Search results: [{'title': \"LLM Powered Autonomous Agents | Lil'Log\", 'link': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'snippet': 'Jun 23, 2023 ... A complicated task usually involves many steps. An agent needs to know what they are and plan ahead. Task Decomposition#. Chain of thought (CoT;\\xa0...'}]\n",
"INFO:langchain.retrievers.web_research:Searching for relevat urls ...\n",
"INFO:langchain.retrievers.web_research:Search results: [{'title': \"LLM Powered Autonomous Agents | Lil'Log\", 'link': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'snippet': 'Jun 23, 2023 ... Agent System Overview In a LLM-powered autonomous agent system, ... Task decomposition can be done (1) by LLM with simple prompting like\\xa0...'}]\n",
"INFO:langchain.retrievers.web_research:New URLs to load: ['https://lilianweng.github.io/posts/2023-06-23-agent/']\n",
"INFO:langchain.retrievers.web_research:Grabbing most relevant splits from urls ...\n",
"Fetching pages: 100%|###################################################################################################################################| 1/1 [00:00<00:00, 10.49it/s]\n",
"Llama.generate: prefix-match hit\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" The content discusses Task Decomposition in LLM Powered Autonomous Agents, which involves breaking down large tasks into smaller, manageable subgoals for efficient handling of complex tasks.\n",
"SOURCES:\n",
"https://lilianweng.github.io/posts/2023-06-23-agent/"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n",
"llama_print_timings: load time = 8585.01 ms\n",
"llama_print_timings: sample time = 52.88 ms / 72 runs ( 0.73 ms per token, 1361.55 tokens per second)\n",
"llama_print_timings: prompt eval time = 125925.13 ms / 2358 tokens ( 53.40 ms per token, 18.73 tokens per second)\n",
"llama_print_timings: eval time = 3504.16 ms / 71 runs ( 49.35 ms per token, 20.26 tokens per second)\n",
"llama_print_timings: total time = 129584.60 ms\n"
]
},
{
"data": {
"text/plain": [
"{'question': 'What is Task Decomposition in LLM Powered Autonomous Agents?',\n",
" 'answer': ' The content discusses Task Decomposition in LLM Powered Autonomous Agents, which involves breaking down large tasks into smaller, manageable subgoals for efficient handling of complex tasks.\\n',\n",
" 'sources': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from langchain.chains import RetrievalQAWithSourcesChain\n",
"# Initialize\n",
"web_research_retriever = WebResearchRetriever.from_llm(\n",
" vectorstore=vectorstore_llama,\n",
" llm=llama, \n",
" search=search, \n",
")\n",
"\n",
"# Run\n",
"user_input = \"What is Task Decomposition in LLM Powered Autonomous Agents?\"\n",
"qa_chain = RetrievalQAWithSourcesChain.from_chain_type(llama,retriever=web_research_retriever)\n",
"result = qa_chain({\"question\": user_input})\n",
"result"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@ -1,4 +1,3 @@
"""Web base loader class."""
import asyncio
import logging
import warnings
@ -86,7 +85,12 @@ class AsyncHtmlLoader(BaseLoader):
                        headers=self.session.headers,
                        ssl=None if self.session.verify else False,
                    ) as response:
                        try:
                            text = await response.text()
                        except UnicodeDecodeError:
                            logger.error(f"Failed to decode content from {url}")
                            text = ""
                        return text
                except aiohttp.ClientConnectionError as e:
                    if i == retries - 1:
                        raise

@ -31,6 +31,7 @@ from langchain.retrievers.time_weighted_retriever import (
)
from langchain.retrievers.vespa_retriever import VespaRetriever
from langchain.retrievers.weaviate_hybrid_search import WeaviateHybridSearchRetriever
from langchain.retrievers.web_research import WebResearchRetriever
from langchain.retrievers.wikipedia import WikipediaRetriever
from langchain.retrievers.zep import ZepRetriever
from langchain.retrievers.zilliz import ZillizRetriever
@ -65,5 +66,6 @@ __all__ = [
"ZepRetriever",
"ZillizRetriever",
"DocArrayRetriever",
"WebResearchRetriever",
"EnsembleRetriever",
]

@ -0,0 +1,215 @@
import logging
import re
from typing import List, Optional

from pydantic import BaseModel, Field

from langchain.callbacks.manager import (
    AsyncCallbackManagerForRetrieverRun,
    CallbackManagerForRetrieverRun,
)
from langchain.chains import LLMChain
from langchain.chains.prompt_selector import ConditionalPromptSelector
from langchain.document_loaders import AsyncHtmlLoader
from langchain.document_transformers import Html2TextTransformer
from langchain.llms import LlamaCpp
from langchain.llms.base import BaseLLM
from langchain.output_parsers.pydantic import PydanticOutputParser
from langchain.prompts import BasePromptTemplate, PromptTemplate
from langchain.schema import BaseRetriever, Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.utilities import GoogleSearchAPIWrapper
from langchain.vectorstores.base import VectorStore

logger = logging.getLogger(__name__)


class SearchQueries(BaseModel):
    """Search queries to run to research for the user's goal."""

    queries: List[str] = Field(
        ..., description="List of search queries to look up on Google"
    )


DEFAULT_LLAMA_SEARCH_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""<<SYS>> \n You are an assistant tasked with improving Google search
results. \n <</SYS>> \n\n [INST] Generate FIVE Google search queries that
are similar to this question. The output should be a numbered list of questions
and each should have a question mark at the end: \n\n {question} [/INST]""",
)

DEFAULT_SEARCH_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an assistant tasked with improving Google search
results. Generate FIVE Google search queries that are similar to
this question. The output should be a numbered list of questions and each
should have a question mark at the end: {question}""",
)


class LineList(BaseModel):
    """List of questions."""

    lines: List[str] = Field(description="Questions")


class QuestionListOutputParser(PydanticOutputParser):
    """Output parser for a list of numbered questions."""

    def __init__(self) -> None:
        super().__init__(pydantic_object=LineList)

    def parse(self, text: str) -> LineList:
        lines = re.findall(r"\d+\..*?\n", text)
        return LineList(lines=lines)
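

# Illustrative example (not part of the module): given the raw LLM output
#   "1. How does X work?\n2. Why is X useful?\n"
# the parser above returns
#   LineList(lines=["1. How does X work?\n", "2. Why is X useful?\n"])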


class WebResearchRetriever(BaseRetriever):
    # Inputs
    vectorstore: VectorStore = Field(
        ..., description="Vector store for storing web pages"
    )
    llm_chain: LLMChain
    search: GoogleSearchAPIWrapper = Field(..., description="Google Search API Wrapper")
    max_splits_per_doc: int = Field(100, description="Maximum splits per document")
    num_search_results: int = Field(1, description="Number of pages per Google search")
    text_splitter: RecursiveCharacterTextSplitter = Field(
        RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=50),
        description="Text splitter for splitting web pages into chunks",
    )
    url_database: List[str] = Field(
        default_factory=list, description="List of processed URLs"
    )

    @classmethod
    def from_llm(
        cls,
        vectorstore: VectorStore,
        llm: BaseLLM,
        search: GoogleSearchAPIWrapper,
        prompt: Optional[BasePromptTemplate] = None,
        max_splits_per_doc: int = 100,
        num_search_results: int = 1,
        text_splitter: RecursiveCharacterTextSplitter = RecursiveCharacterTextSplitter(
            chunk_size=1500, chunk_overlap=50
        ),
    ) -> "WebResearchRetriever":
        """Initialize from llm using default template.

        Args:
            vectorstore: Vector store for storing web pages
            llm: llm for search question generation
            search: GoogleSearchAPIWrapper
            prompt: prompt for generating search questions
            max_splits_per_doc: Maximum splits per document to keep
            num_search_results: Number of pages per Google search
            text_splitter: Text splitter for splitting web pages into chunks

        Returns:
            WebResearchRetriever
        """
        if not prompt:
            QUESTION_PROMPT_SELECTOR = ConditionalPromptSelector(
                default_prompt=DEFAULT_SEARCH_PROMPT,
                conditionals=[
                    (lambda llm: isinstance(llm, LlamaCpp), DEFAULT_LLAMA_SEARCH_PROMPT)
                ],
            )
            prompt = QUESTION_PROMPT_SELECTOR.get_prompt(llm)

        # Use chat model prompt
        llm_chain = LLMChain(
            llm=llm,
            prompt=prompt,
            output_parser=QuestionListOutputParser(),
        )

        return cls(
            vectorstore=vectorstore,
            llm_chain=llm_chain,
            search=search,
            max_splits_per_doc=max_splits_per_doc,
            num_search_results=num_search_results,
            text_splitter=text_splitter,
        )
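
    # Hypothetical usage sketch (see the notebook in this PR for a real run):
    #   retriever = WebResearchRetriever.from_llm(
    #       vectorstore=vectorstore, llm=llm, search=GoogleSearchAPIWrapper()
    #   )
    #   docs = retriever.get_relevant_documents("user question")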

    def search_tool(self, query: str, num_search_results: int = 1) -> List[dict]:
        """Returns num_search_results pages per Google search."""
        result = self.search.results(query, num_search_results)
        return result
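
    # Note: each result is a dict with "title", "link", and "snippet" keys,
    # as seen in the log output captured in the notebook above.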

    def _get_relevant_documents(
        self,
        query: str,
        *,
        run_manager: CallbackManagerForRetrieverRun,
    ) -> List[Document]:
        """Search Google for documents related to the query input.

        Args:
            query: user query

        Returns:
            Relevant documents from all various urls.
        """
        # Get search questions
        logger.info("Generating questions for Google Search ...")
        result = self.llm_chain({"question": query})
        logger.info(f"Questions for Google Search (raw): {result}")
        questions = getattr(result["text"], "lines", [])
        logger.info(f"Questions for Google Search: {questions}")

        # Get urls
        logger.info("Searching for relevant urls ...")
        urls_to_look = []
        for query in questions:
            # Google search
            search_results = self.search_tool(query, self.num_search_results)
            logger.info("Searching for relevant urls ...")
            logger.info(f"Search results: {search_results}")
            for res in search_results:
                urls_to_look.append(res["link"])

        # Relevant urls
        urls = set(urls_to_look)

        # Check for any new urls that we have not processed
        new_urls = list(urls.difference(self.url_database))
        logger.info(f"New URLs to load: {new_urls}")

        # Load, split, and add new urls to vectorstore
        if new_urls:
            loader = AsyncHtmlLoader(new_urls)
            html2text = Html2TextTransformer()
            logger.info("Indexing new urls...")
            docs = loader.load()
            docs = list(html2text.transform_documents(docs))
            docs = self.text_splitter.split_documents(docs)
            self.vectorstore.add_documents(docs)
            self.url_database.extend(new_urls)

        # Search for relevant splits
        # TODO: make this async
        logger.info("Grabbing most relevant splits from urls...")
        docs = []
        for query in questions:
            docs.extend(self.vectorstore.similarity_search(query))

        # Get unique docs
        unique_documents_dict = {
            (doc.page_content, tuple(sorted(doc.metadata.items()))): doc for doc in docs
        }
        unique_documents = list(unique_documents_dict.values())
        return unique_documents

    async def _aget_relevant_documents(
        self,
        query: str,
        *,
        run_manager: AsyncCallbackManagerForRetrieverRun,
    ) -> List[Document]:
        raise NotImplementedError