langchain/docs/extras/integrations/chat_loaders/facebook.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "e4bd269b",
   "metadata": {},
   "source": [
    "# Facebook Messenger\n",
    "\n",
    "This notebook shows how to load data from Facebook in a format you can finetune on. The overall steps are:\n",
    "\n",
    "1. Download your messenger data to disk.\n",
    "2. Create the Chat Loader and call `loader.load()` (or `loader.lazy_load()`) to perform the conversion.\n",
    "3. Optionally use `merge_chat_runs` to combine message from the same sender in sequence, and/or `map_ai_messages` to convert messages from the specified sender to the \"AIMessage\" class. Once you've done this, call `convert_messages_for_finetuning` to prepare your data for fine-tuning.\n",
    "\n",
    "\n",
    "Once this has been done, you can fine-tune your model. To do so you would complete the following steps:\n",
    "\n",
    "4. Upload your messages to OpenAI and run a fine-tuning job.\n",
    "6. Use the resulting model in your LangChain app!\n",
    "\n",
    "\n",
    "Let's begin.\n",
    "\n",
    "\n",
    "## 1. Download Data\n",
    "\n",
    "To download your own messenger data, following instructions [here](https://www.zapptales.com/en/download-facebook-messenger-chat-history-how-to/). IMPORTANT - make sure to download them in JSON format (not HTML).\n",
    "\n",
    "We are hosting an example dump at [this google drive link](https://drive.google.com/file/d/1rh1s1o2i7B-Sk1v9o8KNgivLVGwJ-osV/view?usp=sharing) that we will use in this walkthrough."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "647f2158-a42e-4634-b283-b8492caf542a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "File file.zip downloaded.\n",
      "File file.zip has been unzipped.\n"
     ]
    }
   ],
   "source": [
    "# This uses some example data\n",
    "import requests\n",
    "import zipfile\n",
    "\n",
    "def download_and_unzip(url: str, output_path: str = 'file.zip') -> None:\n",
    "    file_id = url.split('/')[-2]\n",
    "    download_url = f'https://drive.google.com/uc?export=download&id={file_id}'\n",
    "\n",
    "    response = requests.get(download_url)\n",
    "    if response.status_code != 200:\n",
    "        print('Failed to download the file.')\n",
    "        return\n",
    "\n",
    "    with open(output_path, 'wb') as file:\n",
    "        file.write(response.content)\n",
    "        print(f'File {output_path} downloaded.')\n",
    "\n",
    "    with zipfile.ZipFile(output_path, 'r') as zip_ref:\n",
    "        zip_ref.extractall()\n",
    "        print(f'File {output_path} has been unzipped.')\n",
    "\n",
    "# URL of the file to download\n",
    "url = 'https://drive.google.com/file/d/1rh1s1o2i7B-Sk1v9o8KNgivLVGwJ-osV/view?usp=sharing'\n",
    "\n",
    "# Download and unzip\n",
    "download_and_unzip(url)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "48ef8bb1-fc28-453c-835a-94a552f05a91",
   "metadata": {},
   "source": [
    "## 2. Create Chat Loader\n",
    "\n",
    "We have 2 different `FacebookMessengerChatLoader` classes, one for an entire directory of chats, and one to load individual files. We"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "a0869bc6",
   "metadata": {},
   "outputs": [],
   "source": [
    "directory_path = \"./hogwarts\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "0460bf25",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.chat_loaders.facebook_messenger import (\n",
    "    SingleFileFacebookMessengerChatLoader,\n",
    "    FolderFacebookMessengerChatLoader,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "f61ee277",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = SingleFileFacebookMessengerChatLoader(\n",
    "    path=\"./hogwarts/inbox/HermioneGranger/messages_Hermione_Granger.json\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "ec466ad7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[HumanMessage(content=\"Hi Hermione! How's your summer going so far?\", additional_kwargs={'sender': 'Harry Potter'}, example=False),\n",
       " HumanMessage(content=\"Harry! Lovely to hear from you. My summer is going well, though I do miss everyone. I'm spending most of my time going through my books and researching fascinating new topics. How about you?\", additional_kwargs={'sender': 'Hermione Granger'}, example=False),\n",
       " HumanMessage(content=\"I miss you all too. The Dursleys are being their usual unpleasant selves but I'm getting by. At least I can practice some spells in my room without them knowing. Let me know if you find anything good in your researching!\", additional_kwargs={'sender': 'Harry Potter'}, example=False)]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chat_session = loader.load()[0]\n",
    "chat_session[\"messages\"][:3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "8a3ee473",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = FolderFacebookMessengerChatLoader(\n",
    "    path=\"./hogwarts\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "9f41e122",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "9"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chat_sessions = loader.load()\n",
    "len(chat_sessions)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d4aa3580-adc1-4b48-9bba-0e8e8d9f44ce",
   "metadata": {},
   "source": [
    "## 3. Prepare for fine-tuning\n",
    "\n",
    "Calling `load()` returns all the chat messages we could extract as human messages. When conversing with chat bots, conversations typically follow a more strict alternating dialogue pattern relative to real conversations. \n",
    "\n",
    "You can choose to merge message \"runs\" (consecutive messages from the same sender) and select a sender to represent the \"AI\". The fine-tuned LLM will learn to generate these AI messages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "5a78030d-b757-4bbe-8a6c-841056f46df7",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.chat_loaders.utils import (\n",
    "    merge_chat_runs,\n",
    "    map_ai_messages,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "ff35b028-78bf-4c5b-9ec6-939fe67de7f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "merged_sessions = merge_chat_runs(chat_sessions)\n",
    "alternating_sessions = list(map_ai_messages(merged_sessions, \"Harry Potter\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "4b11906e-a496-4d01-9f0d-1938c14147bf",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[AIMessage(content=\"Professor Snape, I was hoping I could speak with you for a moment about something that's been concerning me lately.\", additional_kwargs={'sender': 'Harry Potter'}, example=False),\n",
       " HumanMessage(content=\"What is it, Potter? I'm quite busy at the moment.\", additional_kwargs={'sender': 'Severus Snape'}, example=False),\n",
       " AIMessage(content=\"I apologize for the interruption, sir. I'll be brief. I've noticed some strange activity around the school grounds at night. I saw a cloaked figure lurking near the Forbidden Forest last night. I'm worried someone may be plotting something sinister.\", additional_kwargs={'sender': 'Harry Potter'}, example=False)]"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Now all of Harry Potter's messages will take the AI message class\n",
    "# which maps to the 'assistant' role in OpenAI's training format\n",
    "alternating_sessions[0]['messages'][:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d985478d-062e-47b9-ae9a-102f59be07c0",
   "metadata": {},
   "source": [
    "#### Now we can convert to OpenAI format dictionaries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "21372331",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.adapters.openai import convert_messages_for_finetuning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "92c5ae7a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Prepared 9 dialogues for training\n"
     ]
    }
   ],
   "source": [
    "training_data = convert_messages_for_finetuning(alternating_sessions)\n",
    "print(f\"Prepared {len(training_data)} dialogues for training\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "dfcbd181",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'role': 'assistant',\n",
       "  'content': \"Professor Snape, I was hoping I could speak with you for a moment about something that's been concerning me lately.\"},\n",
       " {'role': 'user',\n",
       "  'content': \"What is it, Potter? I'm quite busy at the moment.\"},\n",
       " {'role': 'assistant',\n",
       "  'content': \"I apologize for the interruption, sir. I'll be brief. I've noticed some strange activity around the school grounds at night. I saw a cloaked figure lurking near the Forbidden Forest last night. I'm worried someone may be plotting something sinister.\"}]"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "training_data[0][:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f1a9fd64-4f9f-42d3-b5dc-2a340e51e9e7",
   "metadata": {},
   "source": [
    "OpenAI currently requires at least 10 training examples for a fine-tuning job, though they recommend between 50-100 for most tasks. Since we only have 9 chat sessions, we can subdivide them (optionally with some overlap) so that each training example is comprised of a portion of a whole conversation.\n",
    "\n",
    "Facebook chat sessions (1 per person) often span multiple days and conversations,\n",
    "so the long-range dependencies may not be that important to model anyhow."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "id": "13cd290a-b1e9-4686-bb5e-d99de8b8612b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "100"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Our chat is alternating, we will make each datapoint a group of 8 messages,\n",
    "# with 2 messages overlapping\n",
    "chunk_size = 8\n",
    "overlap = 2\n",
    "\n",
    "training_examples = [\n",
    "    conversation_messages[i: i + chunk_size] \n",
    "    for conversation_messages in training_data\n",
    "    for i in range(\n",
    "        0, len(conversation_messages) - chunk_size + 1, \n",
    "        chunk_size - overlap)\n",
    "]\n",
    "\n",
    "len(training_examples)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cc8baf41-ff07-4492-96bd-b2472ee7cef9",
   "metadata": {},
   "source": [
    "## 4. Fine-tune the model\n",
    "\n",
    "It's time to fine-tune the model. Make sure you have `openai` installed\n",
    "and have set your `OPENAI_API_KEY` appropriately"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "id": "95ce3f63-3c80-44b2-9060-534ad74e16fa",
   "metadata": {},
   "outputs": [],
   "source": [
    "# %pip install -U openai --quiet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "id": "ab9e28eb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "File file-zCyNBeg4snpbBL7VkvsuhCz8 ready afer 30.55 seconds.\n"
     ]
    }
   ],
   "source": [
    "import json\n",
    "from io import BytesIO\n",
    "import time\n",
    "\n",
    "import openai\n",
    "\n",
    "# We will write the jsonl file in memory\n",
    "my_file = BytesIO()\n",
    "for m in training_examples:\n",
    "    my_file.write((json.dumps({\"messages\": m}) + \"\\n\").encode('utf-8'))\n",
    "\n",
    "my_file.seek(0)\n",
    "training_file = openai.File.create(\n",
    "  file=my_file,\n",
    "  purpose='fine-tune'\n",
    ")\n",
    "\n",
    "# OpenAI audits each training file for compliance reasons.\n",
    "# This make take a few minutes\n",
    "status = openai.File.retrieve(training_file.id).status\n",
    "start_time = time.time()\n",
    "while status != \"processed\":\n",
    "    print(f\"Status=[{status}]... {time.time() - start_time:.2f}s\", end=\"\\r\", flush=True)\n",
    "    time.sleep(5)\n",
    "    status = openai.File.retrieve(training_file.id).status\n",
    "print(f\"File {training_file.id} ready after {time.time() - start_time:.2f} seconds.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "759a7f51-fde9-4b75-aaa9-e600e6537bd1",
   "metadata": {},
   "source": [
    "With the file ready, it's time to kick off a training job."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "id": "3f451425",
   "metadata": {},
   "outputs": [],
   "source": [
    "job = openai.FineTuningJob.create(\n",
    "    training_file=training_file.id,\n",
    "    model=\"gpt-3.5-turbo\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "489b23ef-5e14-42a9-bafb-44220ec6960b",
   "metadata": {},
   "source": [
    "Grab a cup of tea while your model is being prepared. This may take some time!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "id": "bac1637a-c087-4523-ade1-c47f9bf4c6f4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Status=[running]... 908.87s\r"
     ]
    }
   ],
   "source": [
    "status = openai.FineTuningJob.retrieve(job.id).status\n",
    "start_time = time.time()\n",
    "while status != \"succeeded\":\n",
    "    print(f\"Status=[{status}]... {time.time() - start_time:.2f}s\", end=\"\\r\", flush=True)\n",
    "    time.sleep(5)\n",
    "    job = openai.FineTuningJob.retrieve(job.id)\n",
    "    status = job.status"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "id": "535895e1-bc69-40e5-82ed-e24ed2baeeee",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ft:gpt-3.5-turbo-0613:personal::7rDwkaOq\n"
     ]
    }
   ],
   "source": [
    "print(job.fine_tuned_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "502ff73b-f9e9-49ce-ba45-401811e57946",
   "metadata": {},
   "source": [
    "## 5. Use in LangChain\n",
    "\n",
    "You can use the resulting model ID directly the `ChatOpenAI` model class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "id": "3925d60d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.chat_models import ChatOpenAI\n",
    "\n",
    "model = ChatOpenAI(\n",
    "    model=job.fine_tuned_model,\n",
    "    temperature=1,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "id": "7190cf2e-ab34-4ceb-bdad-45f24f069c29",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.prompts import ChatPromptTemplate\n",
    "from langchain.schema.output_parser import StrOutputParser\n",
    "\n",
    "prompt = ChatPromptTemplate.from_messages(\n",
    "    [\n",
    "        (\"human\", \"{input}\"),\n",
    "    ]\n",
    ")\n",
    "\n",
    "chain = prompt | model | StrOutputParser()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "id": "f02057e9-f914-40b1-9c9d-9432ff594b98",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The usual - Potions, Transfiguration, Defense Against the Dark Arts. What about you?"
     ]
    }
   ],
   "source": [
    "for tok in chain.stream({\"input\": \"What classes are you taking?\"}):\n",
    "    print(tok, end=\"\", flush=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "35331503-3cc6-4d64-955e-64afe6b5fef3",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
Chat Loaders (#9708) Still working out interface/notebooks + need discord data dump to test out things other than copy+paste Update: - Going to remove the 'user_id' arg in the loaders themselves and just standardize on putting the "sender" arg in the extra kwargs. Then can provide a utility function to map these to ai and human messages - Going to move the discord one into just a notebook since I don't have a good dump to test on and copy+paste maybe isn't the greatest thing to support in v0 - Need to do more testing on slack since it seems the dump only includes channels and NOT 1 on 1 convos - --------- Co-authored-by: Harrison Chase <hw.chase.17@gmail.com> 2023-08-25 00:23:27 +00:00			`{`
			`"cells": [`
			`{`
			`"cell_type": "markdown",`
			`"id": "e4bd269b",`
			`"metadata": {},`
			`"source": [`
			`"# Facebook Messenger\n",`
			`"\n",`
			`"This notebook shows how to load data from Facebook in a format you can finetune on. The overall steps are:\n",`
			`"\n",`
			`"1. Download your messenger data to disk.\n",`
			"2. Create the Chat Loader and call `loader.load()` (or `loader.lazy_load()`) to perform the conversion.\n",
			"3. Optionally use `merge_chat_runs` to combine message from the same sender in sequence, and/or `map_ai_messages` to convert messages from the specified sender to the \"AIMessage\" class. Once you've done this, call `convert_messages_for_finetuning` to prepare your data for fine-tuning.\n",
			`"\n",`
			`"\n",`
			`"Once this has been done, you can fine-tune your model. To do so you would complete the following steps:\n",`
			`"\n",`
			`"4. Upload your messages to OpenAI and run a fine-tuning job.\n",`
			`"6. Use the resulting model in your LangChain app!\n",`
			`"\n",`
			`"\n",`
			`"Let's begin.\n",`
			`"\n",`
			`"\n",`
			`"## 1. Download Data\n",`
			`"\n",`
			`"To download your own messenger data, following instructions [here](https://www.zapptales.com/en/download-facebook-messenger-chat-history-how-to/). IMPORTANT - make sure to download them in JSON format (not HTML).\n",`
			`"\n",`
			`"We are hosting an example dump at [this google drive link](https://drive.google.com/file/d/1rh1s1o2i7B-Sk1v9o8KNgivLVGwJ-osV/view?usp=sharing) that we will use in this walkthrough."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 2,`
			`"id": "647f2158-a42e-4634-b283-b8492caf542a",`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"File file.zip downloaded.\n",`
			`"File file.zip has been unzipped.\n"`
			`]`
			`}`
			`],`
			`"source": [`
			`"# This uses some example data\n",`
			`"import requests\n",`
			`"import zipfile\n",`
			`"\n",`
			`"def download_and_unzip(url: str, output_path: str = 'file.zip') -> None:\n",`
			`" file_id = url.split('/')[-2]\n",`
			`" download_url = f'https://drive.google.com/uc?export=download&id={file_id}'\n",`
			`"\n",`
			`" response = requests.get(download_url)\n",`
			`" if response.status_code != 200:\n",`
			`" print('Failed to download the file.')\n",`
			`" return\n",`
			`"\n",`
			`" with open(output_path, 'wb') as file:\n",`
			`" file.write(response.content)\n",`
			`" print(f'File {output_path} downloaded.')\n",`
			`"\n",`
			`" with zipfile.ZipFile(output_path, 'r') as zip_ref:\n",`
			`" zip_ref.extractall()\n",`
			`" print(f'File {output_path} has been unzipped.')\n",`
			`"\n",`
			`"# URL of the file to download\n",`
			`"url = 'https://drive.google.com/file/d/1rh1s1o2i7B-Sk1v9o8KNgivLVGwJ-osV/view?usp=sharing'\n",`
			`"\n",`
			`"# Download and unzip\n",`
			`"download_and_unzip(url)\n"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"id": "48ef8bb1-fc28-453c-835a-94a552f05a91",`
			`"metadata": {},`
			`"source": [`
			`"## 2. Create Chat Loader\n",`
			`"\n",`
			"We have 2 different `FacebookMessengerChatLoader` classes, one for an entire directory of chats, and one to load individual files. We"
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 1,`
			`"id": "a0869bc6",`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"directory_path = \"./hogwarts\""`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 3,`
			`"id": "0460bf25",`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"from langchain.chat_loaders.facebook_messenger import (\n",`
			`" SingleFileFacebookMessengerChatLoader,\n",`
			`" FolderFacebookMessengerChatLoader,\n",`
			`")"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 6,`
			`"id": "f61ee277",`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"loader = SingleFileFacebookMessengerChatLoader(\n",`
			`" path=\"./hogwarts/inbox/HermioneGranger/messages_Hermione_Granger.json\",\n",`
			`")"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 9,`
			`"id": "ec466ad7",`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"[HumanMessage(content=\"Hi Hermione! How's your summer going so far?\", additional_kwargs={'sender': 'Harry Potter'}, example=False),\n",`
			`" HumanMessage(content=\"Harry! Lovely to hear from you. My summer is going well, though I do miss everyone. I'm spending most of my time going through my books and researching fascinating new topics. How about you?\", additional_kwargs={'sender': 'Hermione Granger'}, example=False),\n",`
			`" HumanMessage(content=\"I miss you all too. The Dursleys are being their usual unpleasant selves but I'm getting by. At least I can practice some spells in my room without them knowing. Let me know if you find anything good in your researching!\", additional_kwargs={'sender': 'Harry Potter'}, example=False)]"`
			`]`
			`},`
			`"execution_count": 9,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"chat_session = loader.load()[0]\n",`
			`"chat_session[\"messages\"][:3]"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 10,`
			`"id": "8a3ee473",`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"loader = FolderFacebookMessengerChatLoader(\n",`
			`" path=\"./hogwarts\",\n",`
			`")"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 12,`
			`"id": "9f41e122",`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"9"`
			`]`
			`},`
			`"execution_count": 12,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"chat_sessions = loader.load()\n",`
			`"len(chat_sessions)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"id": "d4aa3580-adc1-4b48-9bba-0e8e8d9f44ce",`
			`"metadata": {},`
			`"source": [`
			`"## 3. Prepare for fine-tuning\n",`
			`"\n",`
			"Calling `load()` returns all the chat messages we could extract as human messages. When conversing with chat bots, conversations typically follow a more strict alternating dialogue pattern relative to real conversations. \n",
			`"\n",`
			`"You can choose to merge message \"runs\" (consecutive messages from the same sender) and select a sender to represent the \"AI\". The fine-tuned LLM will learn to generate these AI messages."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 14,`
			`"id": "5a78030d-b757-4bbe-8a6c-841056f46df7",`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"from langchain.chat_loaders.utils import (\n",`
			`" merge_chat_runs,\n",`
			`" map_ai_messages,\n",`
			`")"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 17,`
			`"id": "ff35b028-78bf-4c5b-9ec6-939fe67de7f7",`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"merged_sessions = merge_chat_runs(chat_sessions)\n",`
			`"alternating_sessions = list(map_ai_messages(merged_sessions, \"Harry Potter\"))"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 19,`
			`"id": "4b11906e-a496-4d01-9f0d-1938c14147bf",`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"[AIMessage(content=\"Professor Snape, I was hoping I could speak with you for a moment about something that's been concerning me lately.\", additional_kwargs={'sender': 'Harry Potter'}, example=False),\n",`
			`" HumanMessage(content=\"What is it, Potter? I'm quite busy at the moment.\", additional_kwargs={'sender': 'Severus Snape'}, example=False),\n",`
			`" AIMessage(content=\"I apologize for the interruption, sir. I'll be brief. I've noticed some strange activity around the school grounds at night. I saw a cloaked figure lurking near the Forbidden Forest last night. I'm worried someone may be plotting something sinister.\", additional_kwargs={'sender': 'Harry Potter'}, example=False)]"`
			`]`
			`},`
			`"execution_count": 19,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"# Now all of Harry Potter's messages will take the AI message class\n",`
			`"# which maps to the 'assistant' role in OpenAI's training format\n",`
			`"alternating_sessions[0]['messages'][:3]"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"id": "d985478d-062e-47b9-ae9a-102f59be07c0",`
			`"metadata": {},`
			`"source": [`
			`"#### Now we can convert to OpenAI format dictionaries"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 20,`
			`"id": "21372331",`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"from langchain.adapters.openai import convert_messages_for_finetuning"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 38,`
			`"id": "92c5ae7a",`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"Prepared 9 dialogues for training\n"`
			`]`
			`}`
			`],`
			`"source": [`
			`"training_data = convert_messages_for_finetuning(alternating_sessions)\n",`
			`"print(f\"Prepared {len(training_data)} dialogues for training\")"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 33,`
			`"id": "dfcbd181",`
			`"metadata": {`
			`"scrolled": true`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"[{'role': 'assistant',\n",`
			`" 'content': \"Professor Snape, I was hoping I could speak with you for a moment about something that's been concerning me lately.\"},\n",`
			`" {'role': 'user',\n",`
			`" 'content': \"What is it, Potter? I'm quite busy at the moment.\"},\n",`
			`" {'role': 'assistant',\n",`
			`" 'content': \"I apologize for the interruption, sir. I'll be brief. I've noticed some strange activity around the school grounds at night. I saw a cloaked figure lurking near the Forbidden Forest last night. I'm worried someone may be plotting something sinister.\"}]"`
			`]`
			`},`
			`"execution_count": 33,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"training_data[0][:3]"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"id": "f1a9fd64-4f9f-42d3-b5dc-2a340e51e9e7",`
			`"metadata": {},`
			`"source": [`
			`"OpenAI currently requires at least 10 training examples for a fine-tuning job, though they recommend between 50-100 for most tasks. Since we only have 9 chat sessions, we can subdivide them (optionally with some overlap) so that each training example is comprised of a portion of a whole conversation.\n",`
			`"\n",`
			`"Facebook chat sessions (1 per person) often span multiple days and conversations,\n",`
			`"so the long-range dependencies may not be that important to model anyhow."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 42,`
			`"id": "13cd290a-b1e9-4686-bb5e-d99de8b8612b",`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"100"`
			`]`
			`},`
			`"execution_count": 42,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"# Our chat is alternating, we will make each datapoint a group of 8 messages,\n",`
			`"# with 2 messages overlapping\n",`
			`"chunk_size = 8\n",`
			`"overlap = 2\n",`
			`"\n",`
			`"training_examples = [\n",`
			`" conversation_messages[i: i + chunk_size] \n",`
			`" for conversation_messages in training_data\n",`
			`" for i in range(\n",`
			`" 0, len(conversation_messages) - chunk_size + 1, \n",`
			`" chunk_size - overlap)\n",`
			`"]\n",`
			`"\n",`
			`"len(training_examples)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"id": "cc8baf41-ff07-4492-96bd-b2472ee7cef9",`
			`"metadata": {},`
			`"source": [`
			`"## 4. Fine-tune the model\n",`
			`"\n",`
			"It's time to fine-tune the model. Make sure you have `openai` installed\n",
			"and have set your `OPENAI_API_KEY` appropriately"
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 43,`
			`"id": "95ce3f63-3c80-44b2-9060-534ad74e16fa",`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"# %pip install -U openai --quiet"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 58,`
			`"id": "ab9e28eb",`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"File file-zCyNBeg4snpbBL7VkvsuhCz8 ready afer 30.55 seconds.\n"`
			`]`
			`}`
			`],`
			`"source": [`
			`"import json\n",`
			`"from io import BytesIO\n",`
			`"import time\n",`
			`"\n",`
			`"import openai\n",`
			`"\n",`
			`"# We will write the jsonl file in memory\n",`
			`"my_file = BytesIO()\n",`
			`"for m in training_examples:\n",`
			`" my_file.write((json.dumps({\"messages\": m}) + \"\\n\").encode('utf-8'))\n",`
			`"\n",`
			`"my_file.seek(0)\n",`
			`"training_file = openai.File.create(\n",`
			`" file=my_file,\n",`
			`" purpose='fine-tune'\n",`
			`")\n",`
			`"\n",`
			`"# OpenAI audits each training file for compliance reasons.\n",`
			`"# This make take a few minutes\n",`
			`"status = openai.File.retrieve(training_file.id).status\n",`
			`"start_time = time.time()\n",`
			`"while status != \"processed\":\n",`
			`" print(f\"Status=[{status}]... {time.time() - start_time:.2f}s\", end=\"\\r\", flush=True)\n",`
			`" time.sleep(5)\n",`
			`" status = openai.File.retrieve(training_file.id).status\n",`
			`"print(f\"File {training_file.id} ready after {time.time() - start_time:.2f} seconds.\")"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"id": "759a7f51-fde9-4b75-aaa9-e600e6537bd1",`
			`"metadata": {},`
			`"source": [`
			`"With the file ready, it's time to kick off a training job."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 59,`
			`"id": "3f451425",`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"job = openai.FineTuningJob.create(\n",`
			`" training_file=training_file.id,\n",`
			`" model=\"gpt-3.5-turbo\",\n",`
			`")"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"id": "489b23ef-5e14-42a9-bafb-44220ec6960b",`
			`"metadata": {},`
			`"source": [`
			`"Grab a cup of tea while your model is being prepared. This may take some time!"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 60,`
			`"id": "bac1637a-c087-4523-ade1-c47f9bf4c6f4",`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"Status=[running]... 908.87s\r"`
			`]`
			`}`
			`],`
			`"source": [`
			`"status = openai.FineTuningJob.retrieve(job.id).status\n",`
			`"start_time = time.time()\n",`
			`"while status != \"succeeded\":\n",`
			`" print(f\"Status=[{status}]... {time.time() - start_time:.2f}s\", end=\"\\r\", flush=True)\n",`
			`" time.sleep(5)\n",`
			`" job = openai.FineTuningJob.retrieve(job.id)\n",`
			`" status = job.status"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 66,`
			`"id": "535895e1-bc69-40e5-82ed-e24ed2baeeee",`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"ft:gpt-3.5-turbo-0613:personal::7rDwkaOq\n"`
			`]`
			`}`
			`],`
			`"source": [`
			`"print(job.fine_tuned_model)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"id": "502ff73b-f9e9-49ce-ba45-401811e57946",`
			`"metadata": {},`
			`"source": [`
			`"## 5. Use in LangChain\n",`
			`"\n",`
			"You can use the resulting model ID directly the `ChatOpenAI` model class."
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 67,`
			`"id": "3925d60d",`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"from langchain.chat_models import ChatOpenAI\n",`
			`"\n",`
			`"model = ChatOpenAI(\n",`
			`" model=job.fine_tuned_model,\n",`
			`" temperature=1,\n",`
			`")"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 69,`
			`"id": "7190cf2e-ab34-4ceb-bdad-45f24f069c29",`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"from langchain.prompts import ChatPromptTemplate\n",`
			`"from langchain.schema.output_parser import StrOutputParser\n",`
			`"\n",`
			`"prompt = ChatPromptTemplate.from_messages(\n",`
			`" [\n",`
			`" (\"human\", \"{input}\"),\n",`
			`" ]\n",`
			`")\n",`
			`"\n",`
			`"chain = prompt \| model \| StrOutputParser()"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 72,`
			`"id": "f02057e9-f914-40b1-9c9d-9432ff594b98",`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"The usual - Potions, Transfiguration, Defense Against the Dark Arts. What about you?"`
			`]`
			`}`
			`],`
			`"source": [`
			`"for tok in chain.stream({\"input\": \"What classes are you taking?\"}):\n",`
			`" print(tok, end=\"\", flush=True)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"id": "35331503-3cc6-4d64-955e-64afe6b5fef3",`
			`"metadata": {},`
			`"outputs": [],`
			`"source": []`
			`}`
			`],`
			`"metadata": {`
			`"kernelspec": {`
			`"display_name": "Python 3 (ipykernel)",`
			`"language": "python",`
			`"name": "python3"`
			`},`
			`"language_info": {`
			`"codemirror_mode": {`
			`"name": "ipython",`
			`"version": 3`
			`},`
			`"file_extension": ".py",`
			`"mimetype": "text/x-python",`
			`"name": "python",`
			`"nbconvert_exporter": "python",`
			`"pygments_lexer": "ipython3",`
add gmail loader (#9810) 2023-08-28 00:18:09 +00:00			`"version": "3.10.1"`
Chat Loaders (#9708) Still working out interface/notebooks + need discord data dump to test out things other than copy+paste Update: - Going to remove the 'user_id' arg in the loaders themselves and just standardize on putting the "sender" arg in the extra kwargs. Then can provide a utility function to map these to ai and human messages - Going to move the discord one into just a notebook since I don't have a good dump to test on and copy+paste maybe isn't the greatest thing to support in v0 - Need to do more testing on slack since it seems the dump only includes channels and NOT 1 on 1 convos - --------- Co-authored-by: Harrison Chase <hw.chase.17@gmail.com> 2023-08-25 00:23:27 +00:00			`}`
			`},`
			`"nbformat": 4,`
			`"nbformat_minor": 5`
			`}`