langchain/docs/extras/integrations/chat_loaders/facebook.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "e4bd269b",
   "metadata": {},
   "source": [
    "# Facebook Messenger\n",
    "\n",
    "This notebook shows how to load data from Facebook Messenger in a format you can fine-tune on. The overall steps are:\n",
    "\n",
    "1. Download your Messenger data to disk.\n",
    "2. Create the Chat Loader and call `loader.load()` (or `loader.lazy_load()`) to perform the conversion.\n",
    "3. Optionally use `merge_chat_runs` to combine consecutive messages from the same sender, and/or `map_ai_messages` to convert messages from the specified sender to the \"AIMessage\" class. Once you've done this, call `convert_messages_for_finetuning` to prepare your data for fine-tuning.\n",
    "\n",
    "\n",
    "Once this is done, you can fine-tune your model by completing the following steps:\n",
    "\n",
    "4. Upload your messages to OpenAI and run a fine-tuning job.\n",
    "5. Use the resulting model in your LangChain app!\n",
    "\n",
    "\n",
    "Let's begin.\n",
    "\n",
    "\n",
    "## 1. Download Data\n",
    "\n",
    "To download your own Messenger data, follow the instructions [here](https://www.zapptales.com/en/download-facebook-messenger-chat-history-how-to/). IMPORTANT: make sure to download the data in JSON format (not HTML).\n",
    "\n",
    "We are hosting an example dump at [this Google Drive link](https://drive.google.com/file/d/1rh1s1o2i7B-Sk1v9o8KNgivLVGwJ-osV/view?usp=sharing) that we will use in this walkthrough."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "647f2158-a42e-4634-b283-b8492caf542a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "File file.zip downloaded.\n",
      "File file.zip has been unzipped.\n"
     ]
    }
   ],
   "source": [
    "# This uses some example data\n",
    "import requests\n",
    "import zipfile\n",
    "\n",
    "def download_and_unzip(url: str, output_path: str = 'file.zip') -> None:\n",
    "    file_id = url.split('/')[-2]\n",
    "    download_url = f'https://drive.google.com/uc?export=download&id={file_id}'\n",
    "\n",
    "    response = requests.get(download_url)\n",
    "    if response.status_code != 200:\n",
    "        print('Failed to download the file.')\n",
    "        return\n",
    "\n",
    "    with open(output_path, 'wb') as file:\n",
    "        file.write(response.content)\n",
    "        print(f'File {output_path} downloaded.')\n",
    "\n",
    "    with zipfile.ZipFile(output_path, 'r') as zip_ref:\n",
    "        zip_ref.extractall()\n",
    "        print(f'File {output_path} has been unzipped.')\n",
    "\n",
    "# URL of the file to download\n",
    "url = 'https://drive.google.com/file/d/1rh1s1o2i7B-Sk1v9o8KNgivLVGwJ-osV/view?usp=sharing'\n",
    "\n",
    "# Download and unzip\n",
    "download_and_unzip(url)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "48ef8bb1-fc28-453c-835a-94a552f05a91",
   "metadata": {},
   "source": [
    "## 2. Create Chat Loader\n",
    "\n",
    "There are two `FacebookMessengerChatLoader` classes: `FolderFacebookMessengerChatLoader` for an entire directory of chats, and `SingleFileFacebookMessengerChatLoader` for individual files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "a0869bc6",
   "metadata": {},
   "outputs": [],
   "source": [
    "directory_path = \"./hogwarts\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "0460bf25",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.chat_loaders.facebook_messenger import (\n",
    "    SingleFileFacebookMessengerChatLoader,\n",
    "    FolderFacebookMessengerChatLoader,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "f61ee277",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = SingleFileFacebookMessengerChatLoader(\n",
    "    path=\"./hogwarts/inbox/HermioneGranger/messages_Hermione_Granger.json\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "ec466ad7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[HumanMessage(content=\"Hi Hermione! How's your summer going so far?\", additional_kwargs={'sender': 'Harry Potter'}, example=False),\n",
       " HumanMessage(content=\"Harry! Lovely to hear from you. My summer is going well, though I do miss everyone. I'm spending most of my time going through my books and researching fascinating new topics. How about you?\", additional_kwargs={'sender': 'Hermione Granger'}, example=False),\n",
       " HumanMessage(content=\"I miss you all too. The Dursleys are being their usual unpleasant selves but I'm getting by. At least I can practice some spells in my room without them knowing. Let me know if you find anything good in your researching!\", additional_kwargs={'sender': 'Harry Potter'}, example=False)]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chat_session = loader.load()[0]\n",
    "chat_session[\"messages\"][:3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "8a3ee473",
   "metadata": {},
   "outputs": [],
   "source": [
    "loader = FolderFacebookMessengerChatLoader(\n",
    "    path=\"./hogwarts\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "9f41e122",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "9"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chat_sessions = loader.load()\n",
    "len(chat_sessions)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d4aa3580-adc1-4b48-9bba-0e8e8d9f44ce",
   "metadata": {},
   "source": [
    "## 3. Prepare for fine-tuning\n",
    "\n",
    "Calling `load()` returns all the chat messages we could extract as human messages. Conversations with chat bots typically follow a stricter alternating dialogue pattern than real conversations.\n",
    "\n",
    "You can choose to merge message \"runs\" (consecutive messages from the same sender) and select a sender to represent the \"AI\". The fine-tuned LLM will learn to generate these AI messages."
   ]
  },
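  {
   "cell_type": "markdown",
   "id": "sketch-merge-runs",
   "metadata": {},
   "source": [
    "To make these two steps concrete, here is a minimal sketch on plain dictionaries. It is not the library's implementation: the helper names, the `sender`/`content` keys, the blank-line separator, and the demo messages are all assumptions made for illustration.\n",
    "\n",
    "```python\n",
    "from itertools import groupby\n",
    "\n",
    "\n",
    "def merge_runs_sketch(messages):\n",
    "    # Join each run of consecutive messages from one sender into a single message\n",
    "    merged = []\n",
    "    for sender, run in groupby(messages, key=lambda m: m['sender']):\n",
    "        # The blank-line separator is an assumption for this sketch\n",
    "        merged.append({'sender': sender, 'content': '\\n\\n'.join(m['content'] for m in run)})\n",
    "    return merged\n",
    "\n",
    "\n",
    "def assign_roles_sketch(messages, ai_sender):\n",
    "    # The chosen sender becomes 'assistant'; every other sender becomes 'user'\n",
    "    return [\n",
    "        {'role': 'assistant' if m['sender'] == ai_sender else 'user', 'content': m['content']}\n",
    "        for m in messages\n",
    "    ]\n",
    "\n",
    "\n",
    "# Hypothetical messages, not taken from the example dump\n",
    "demo_messages = [\n",
    "    {'sender': 'Hermione Granger', 'content': 'Harry!'},\n",
    "    {'sender': 'Hermione Granger', 'content': 'Have you started the essay?'},\n",
    "    {'sender': 'Harry Potter', 'content': 'Not yet.'},\n",
    "]\n",
    "demo_roles = assign_roles_sketch(merge_runs_sketch(demo_messages), ai_sender='Harry Potter')\n",
    "print(demo_roles)\n",
    "```\n",
    "\n",
    "On the demo messages, Hermione's two lines collapse into one `user` entry, followed by one `assistant` entry for Harry."
   ]
  },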
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "5a78030d-b757-4bbe-8a6c-841056f46df7",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.chat_loaders.utils import (\n",
    "    merge_chat_runs,\n",
    "    map_ai_messages,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "ff35b028-78bf-4c5b-9ec6-939fe67de7f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "merged_sessions = merge_chat_runs(chat_sessions)\n",
    "alternating_sessions = list(map_ai_messages(merged_sessions, \"Harry Potter\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "4b11906e-a496-4d01-9f0d-1938c14147bf",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[AIMessage(content=\"Professor Snape, I was hoping I could speak with you for a moment about something that's been concerning me lately.\", additional_kwargs={'sender': 'Harry Potter'}, example=False),\n",
       " HumanMessage(content=\"What is it, Potter? I'm quite busy at the moment.\", additional_kwargs={'sender': 'Severus Snape'}, example=False),\n",
       " AIMessage(content=\"I apologize for the interruption, sir. I'll be brief. I've noticed some strange activity around the school grounds at night. I saw a cloaked figure lurking near the Forbidden Forest last night. I'm worried someone may be plotting something sinister.\", additional_kwargs={'sender': 'Harry Potter'}, example=False)]"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Now all of Harry Potter's messages will take the AI message class\n",
    "# which maps to the 'assistant' role in OpenAI's training format\n",
    "alternating_sessions[0]['messages'][:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d985478d-062e-47b9-ae9a-102f59be07c0",
   "metadata": {},
   "source": [
    "#### Now we can convert to OpenAI format dictionaries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "21372331",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.adapters.openai import convert_messages_for_finetuning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "92c5ae7a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Prepared 9 dialogues for training\n"
     ]
    }
   ],
   "source": [
    "training_data = convert_messages_for_finetuning(alternating_sessions)\n",
    "print(f\"Prepared {len(training_data)} dialogues for training\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "dfcbd181",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'role': 'assistant',\n",
       "  'content': \"Professor Snape, I was hoping I could speak with you for a moment about something that's been concerning me lately.\"},\n",
       " {'role': 'user',\n",
       "  'content': \"What is it, Potter? I'm quite busy at the moment.\"},\n",
       " {'role': 'assistant',\n",
       "  'content': \"I apologize for the interruption, sir. I'll be brief. I've noticed some strange activity around the school grounds at night. I saw a cloaked figure lurking near the Forbidden Forest last night. I'm worried someone may be plotting something sinister.\"}]"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "training_data[0][:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f1a9fd64-4f9f-42d3-b5dc-2a340e51e9e7",
   "metadata": {},
   "source": [
    "OpenAI currently requires at least 10 training examples for a fine-tuning job, though they recommend 50 to 100 for most tasks. Since we only have 9 chat sessions, we can subdivide them (optionally with some overlap) so that each training example consists of a portion of a whole conversation.\n",
    "\n",
    "Facebook chat sessions (1 per person) often span multiple days and conversations,\n",
    "so the long-range dependencies may not be that important to model anyhow."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "id": "13cd290a-b1e9-4686-bb5e-d99de8b8612b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "100"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Our chat is alternating, so we make each datapoint a group of 8 messages,\n",
    "# with 2 messages overlapping between consecutive groups\n",
    "chunk_size = 8\n",
    "overlap = 2\n",
    "\n",
    "training_examples = [\n",
    "    conversation_messages[i: i + chunk_size] \n",
    "    for conversation_messages in training_data\n",
    "    for i in range(\n",
    "        0, len(conversation_messages) - chunk_size + 1, \n",
    "        chunk_size - overlap)\n",
    "]\n",
    "\n",
    "len(training_examples)"
   ]
  },
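  {
   "cell_type": "markdown",
   "id": "sketch-window-starts",
   "metadata": {},
   "source": [
    "The window arithmetic in the cell above can be checked in isolation. The sketch below lists the start index of each window for a conversation of a given length; the helper name and the message counts are illustrative assumptions, while the `range()` bounds mirror the ones used above.\n",
    "\n",
    "```python\n",
    "def window_starts_sketch(n_messages, chunk_size=8, overlap=2):\n",
    "    # Each window starts `chunk_size - overlap` items after the previous one\n",
    "    step = chunk_size - overlap\n",
    "    # Only full windows are kept, as in the list comprehension above\n",
    "    return list(range(0, n_messages - chunk_size + 1, step))\n",
    "\n",
    "\n",
    "print(window_starts_sketch(20))  # windows [0:8], [6:14], [12:20]\n",
    "print(window_starts_sketch(21))  # same starts; the final message is left out\n",
    "print(window_starts_sketch(7))   # shorter than one window, so no examples\n",
    "```\n",
    "\n",
    "Two consequences follow from keeping only full windows: a conversation with fewer than `chunk_size` messages contributes no training examples, and trailing messages that do not fill a whole window are dropped."
   ]
  },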
  {
   "cell_type": "markdown",
   "id": "cc8baf41-ff07-4492-96bd-b2472ee7cef9",
   "metadata": {},
   "source": [
    "## 4. Fine-tune the model\n",
    "\n",
    "It's time to fine-tune the model. Make sure you have `openai` installed\n",
    "and have set your `OPENAI_API_KEY` appropriately."
   ]
  },
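  {
   "cell_type": "markdown",
   "id": "sketch-validate-examples",
   "metadata": {},
   "source": [
    "Before uploading, it may be worth sanity-checking the examples locally. The sketch below is an illustration under stated assumptions, not an official validator: the helper name, the set of accepted roles, and the demo example are hypothetical, and the checks only cover the `role`/`content` shape shown earlier.\n",
    "\n",
    "```python\n",
    "import json\n",
    "\n",
    "# Roles assumed to be acceptable in a chat-format training example\n",
    "ALLOWED_ROLES_SKETCH = {'system', 'user', 'assistant'}\n",
    "\n",
    "\n",
    "def check_example_sketch(messages):\n",
    "    # Collect human-readable problems; an empty list means the checks passed\n",
    "    problems = []\n",
    "    for i, m in enumerate(messages):\n",
    "        if m.get('role') not in ALLOWED_ROLES_SKETCH:\n",
    "            problems.append(f'message {i}: unexpected role')\n",
    "        if not isinstance(m.get('content'), str):\n",
    "            problems.append(f'message {i}: content is not a string')\n",
    "    if not any(m.get('role') == 'assistant' for m in messages):\n",
    "        problems.append('no assistant message to learn from')\n",
    "    return problems\n",
    "\n",
    "\n",
    "# Hypothetical example, round-tripped through JSON as it would be in a JSONL line\n",
    "demo_example = [\n",
    "    {'role': 'user', 'content': 'What classes are you taking?'},\n",
    "    {'role': 'assistant', 'content': 'Potions and Transfiguration.'},\n",
    "]\n",
    "demo_line = json.dumps({'messages': demo_example})\n",
    "print(check_example_sketch(json.loads(demo_line)['messages']))\n",
    "```\n",
    "\n",
    "To apply it to the data prepared above, one could loop over `training_examples` and inspect any non-empty result."
   ]
  },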
  {
   "cell_type": "code",
   "execution_count": 43,
   "id": "95ce3f63-3c80-44b2-9060-534ad74e16fa",
   "metadata": {},
   "outputs": [],
   "source": [
    "# %pip install -U openai --quiet"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "id": "ab9e28eb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "File file-zCyNBeg4snpbBL7VkvsuhCz8 ready after 30.55 seconds.\n"
     ]
    }
   ],
   "source": [
    "import json\n",
    "from io import BytesIO\n",
    "import time\n",
    "\n",
    "import openai\n",
    "\n",
    "# We will write the jsonl file in memory\n",
    "my_file = BytesIO()\n",
    "for m in training_examples:\n",
    "    my_file.write((json.dumps({\"messages\": m}) + \"\\n\").encode('utf-8'))\n",
    "\n",
    "my_file.seek(0)\n",
    "training_file = openai.File.create(\n",
    "  file=my_file,\n",
    "  purpose='fine-tune'\n",
    ")\n",
    "\n",
    "# OpenAI audits each training file for compliance reasons.\n",
    "# This may take a few minutes\n",
    "status = openai.File.retrieve(training_file.id).status\n",
    "start_time = time.time()\n",
    "while status != \"processed\":\n",
    "    print(f\"Status=[{status}]... {time.time() - start_time:.2f}s\", end=\"\\r\", flush=True)\n",
    "    time.sleep(5)\n",
    "    status = openai.File.retrieve(training_file.id).status\n",
    "print(f\"File {training_file.id} ready after {time.time() - start_time:.2f} seconds.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "759a7f51-fde9-4b75-aaa9-e600e6537bd1",
   "metadata": {},
   "source": [
    "With the file ready, it's time to kick off a training job."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "id": "3f451425",
   "metadata": {},
   "outputs": [],
   "source": [
    "job = openai.FineTuningJob.create(\n",
    "    training_file=training_file.id,\n",
    "    model=\"gpt-3.5-turbo\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "489b23ef-5e14-42a9-bafb-44220ec6960b",
   "metadata": {},
   "source": [
    "Grab a cup of tea while your model is being prepared. This may take some time!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "id": "bac1637a-c087-4523-ade1-c47f9bf4c6f4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Status=[running]... 908.87s\r"
     ]
    }
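   ],
   "source": [
    "status = openai.FineTuningJob.retrieve(job.id).status\n",
    "start_time = time.time()\n",
    "while status != \"succeeded\":\n",
    "    # Stop polling if the job reaches a terminal failure state\n",
    "    if status in (\"failed\", \"cancelled\"):\n",
    "        raise RuntimeError(f\"Fine-tuning job {job.id} ended with status {status!r}\")\n",
    "    print(f\"Status=[{status}]... {time.time() - start_time:.2f}s\", end=\"\\r\", flush=True)\n",
    "    time.sleep(5)\n",
    "    job = openai.FineTuningJob.retrieve(job.id)\n",
    "    status = job.status"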
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "id": "535895e1-bc69-40e5-82ed-e24ed2baeeee",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ft:gpt-3.5-turbo-0613:personal::7rDwkaOq\n"
     ]
    }
   ],
   "source": [
    "print(job.fine_tuned_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "502ff73b-f9e9-49ce-ba45-401811e57946",
   "metadata": {},
   "source": [
    "## 5. Use in LangChain\n",
    "\n",
    "You can use the resulting model ID directly in the `ChatOpenAI` model class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "id": "3925d60d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.chat_models import ChatOpenAI\n",
    "\n",
    "model = ChatOpenAI(\n",
    "    model=job.fine_tuned_model,\n",
    "    temperature=1,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "id": "7190cf2e-ab34-4ceb-bdad-45f24f069c29",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.prompts import ChatPromptTemplate\n",
    "from langchain.schema.output_parser import StrOutputParser\n",
    "\n",
    "prompt = ChatPromptTemplate.from_messages(\n",
    "    [\n",
    "        (\"human\", \"{input}\"),\n",
    "    ]\n",
    ")\n",
    "\n",
    "chain = prompt | model | StrOutputParser()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "id": "f02057e9-f914-40b1-9c9d-9432ff594b98",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The usual - Potions, Transfiguration, Defense Against the Dark Arts. What about you?"
     ]
    }
   ],
   "source": [
    "for tok in chain.stream({\"input\": \"What classes are you taking?\"}):\n",
    "    print(tok, end=\"\", flush=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "35331503-3cc6-4d64-955e-64afe6b5fef3",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}