{ "cells": [ { "cell_type": "markdown", "id": "e4bd269b", "metadata": {}, "source": [ "# Facebook Messenger\n", "\n", "This notebook shows how to load data from Facebook in a format you can finetune on. The overall steps are:\n", "\n", "1. Download your messenger data to disk.\n", "2. Create the Chat Loader and call `loader.load()` (or `loader.lazy_load()`) to perform the conversion.\n", "3. Optionally use `merge_chat_runs` to combine message from the same sender in sequence, and/or `map_ai_messages` to convert messages from the specified sender to the \"AIMessage\" class. Once you've done this, call `convert_messages_for_finetuning` to prepare your data for fine-tuning.\n", "\n", "\n", "Once this has been done, you can fine-tune your model. To do so you would complete the following steps:\n", "\n", "4. Upload your messages to OpenAI and run a fine-tuning job.\n", "6. Use the resulting model in your LangChain app!\n", "\n", "\n", "Let's begin.\n", "\n", "\n", "## 1. Download Data\n", "\n", "To download your own messenger data, following instructions [here](https://www.zapptales.com/en/download-facebook-messenger-chat-history-how-to/). IMPORTANT - make sure to download them in JSON format (not HTML).\n", "\n", "We are hosting an example dump at [this google drive link](https://drive.google.com/file/d/1rh1s1o2i7B-Sk1v9o8KNgivLVGwJ-osV/view?usp=sharing) that we will use in this walkthrough." ] }, { "cell_type": "code", "execution_count": 2, "id": "647f2158-a42e-4634-b283-b8492caf542a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "File file.zip downloaded.\n", "File file.zip has been unzipped.\n" ] } ], "source": [ "# This uses some example data\n", "import requests\n", "import zipfile\n", "\n", "def download_and_unzip(url: str, output_path: str = 'file.zip') -> None:\n", " file_id = url.split('/')[-2]\n", " download_url = f'https://drive.google.com/uc?export=download&id={file_id}'\n", "\n", " response = requests.get(download_url)\n", " if response.status_code != 200:\n", " print('Failed to download the file.')\n", " return\n", "\n", " with open(output_path, 'wb') as file:\n", " file.write(response.content)\n", " print(f'File {output_path} downloaded.')\n", "\n", " with zipfile.ZipFile(output_path, 'r') as zip_ref:\n", " zip_ref.extractall()\n", " print(f'File {output_path} has been unzipped.')\n", "\n", "# URL of the file to download\n", "url = 'https://drive.google.com/file/d/1rh1s1o2i7B-Sk1v9o8KNgivLVGwJ-osV/view?usp=sharing'\n", "\n", "# Download and unzip\n", "download_and_unzip(url)\n" ] }, { "cell_type": "markdown", "id": "48ef8bb1-fc28-453c-835a-94a552f05a91", "metadata": {}, "source": [ "## 2. Create Chat Loader\n", "\n", "We have 2 different `FacebookMessengerChatLoader` classes, one for an entire directory of chats, and one to load individual files. We" ] }, { "cell_type": "code", "execution_count": 1, "id": "a0869bc6", "metadata": {}, "outputs": [], "source": [ "directory_path = \"./hogwarts\"" ] }, { "cell_type": "code", "execution_count": 3, "id": "0460bf25", "metadata": {}, "outputs": [], "source": [ "from langchain.chat_loaders.facebook_messenger import (\n", " SingleFileFacebookMessengerChatLoader,\n", " FolderFacebookMessengerChatLoader,\n", ")" ] }, { "cell_type": "code", "execution_count": 6, "id": "f61ee277", "metadata": {}, "outputs": [], "source": [ "loader = SingleFileFacebookMessengerChatLoader(\n", " path=\"./hogwarts/inbox/HermioneGranger/messages_Hermione_Granger.json\",\n", ")" ] }, { "cell_type": "code", "execution_count": 9, "id": "ec466ad7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[HumanMessage(content=\"Hi Hermione! How's your summer going so far?\", additional_kwargs={'sender': 'Harry Potter'}, example=False),\n", " HumanMessage(content=\"Harry! Lovely to hear from you. My summer is going well, though I do miss everyone. I'm spending most of my time going through my books and researching fascinating new topics. How about you?\", additional_kwargs={'sender': 'Hermione Granger'}, example=False),\n", " HumanMessage(content=\"I miss you all too. The Dursleys are being their usual unpleasant selves but I'm getting by. At least I can practice some spells in my room without them knowing. Let me know if you find anything good in your researching!\", additional_kwargs={'sender': 'Harry Potter'}, example=False)]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "chat_session = loader.load()[0]\n", "chat_session[\"messages\"][:3]" ] }, { "cell_type": "code", "execution_count": 10, "id": "8a3ee473", "metadata": {}, "outputs": [], "source": [ "loader = FolderFacebookMessengerChatLoader(\n", " path=\"./hogwarts\",\n", ")" ] }, { "cell_type": "code", "execution_count": 12, "id": "9f41e122", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "9" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "chat_sessions = loader.load()\n", "len(chat_sessions)" ] }, { "cell_type": "markdown", "id": "d4aa3580-adc1-4b48-9bba-0e8e8d9f44ce", "metadata": {}, "source": [ "## 3. Prepare for fine-tuning\n", "\n", "Calling `load()` returns all the chat messages we could extract as human messages. When conversing with chat bots, conversations typically follow a more strict alternating dialogue pattern relative to real conversations. \n", "\n", "You can choose to merge message \"runs\" (consecutive messages from the same sender) and select a sender to represent the \"AI\". The fine-tuned LLM will learn to generate these AI messages." ] }, { "cell_type": "code", "execution_count": 14, "id": "5a78030d-b757-4bbe-8a6c-841056f46df7", "metadata": {}, "outputs": [], "source": [ "from langchain.chat_loaders.utils import (\n", " merge_chat_runs,\n", " map_ai_messages,\n", ")" ] }, { "cell_type": "code", "execution_count": 17, "id": "ff35b028-78bf-4c5b-9ec6-939fe67de7f7", "metadata": {}, "outputs": [], "source": [ "merged_sessions = merge_chat_runs(chat_sessions)\n", "alternating_sessions = list(map_ai_messages(merged_sessions, \"Harry Potter\"))" ] }, { "cell_type": "code", "execution_count": 19, "id": "4b11906e-a496-4d01-9f0d-1938c14147bf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[AIMessage(content=\"Professor Snape, I was hoping I could speak with you for a moment about something that's been concerning me lately.\", additional_kwargs={'sender': 'Harry Potter'}, example=False),\n", " HumanMessage(content=\"What is it, Potter? I'm quite busy at the moment.\", additional_kwargs={'sender': 'Severus Snape'}, example=False),\n", " AIMessage(content=\"I apologize for the interruption, sir. I'll be brief. I've noticed some strange activity around the school grounds at night. I saw a cloaked figure lurking near the Forbidden Forest last night. I'm worried someone may be plotting something sinister.\", additional_kwargs={'sender': 'Harry Potter'}, example=False)]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Now all of Harry Potter's messages will take the AI message class\n", "# which maps to the 'assistant' role in OpenAI's training format\n", "alternating_sessions[0]['messages'][:3]" ] }, { "cell_type": "markdown", "id": "d985478d-062e-47b9-ae9a-102f59be07c0", "metadata": {}, "source": [ "#### Now we can convert to OpenAI format dictionaries" ] }, { "cell_type": "code", "execution_count": 20, "id": "21372331", "metadata": {}, "outputs": [], "source": [ "from langchain.adapters.openai import convert_messages_for_finetuning" ] }, { "cell_type": "code", "execution_count": 38, "id": "92c5ae7a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Prepared 9 dialogues for training\n" ] } ], "source": [ "training_data = convert_messages_for_finetuning(alternating_sessions)\n", "print(f\"Prepared {len(training_data)} dialogues for training\")" ] }, { "cell_type": "code", "execution_count": 33, "id": "dfcbd181", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[{'role': 'assistant',\n", " 'content': \"Professor Snape, I was hoping I could speak with you for a moment about something that's been concerning me lately.\"},\n", " {'role': 'user',\n", " 'content': \"What is it, Potter? I'm quite busy at the moment.\"},\n", " {'role': 'assistant',\n", " 'content': \"I apologize for the interruption, sir. I'll be brief. I've noticed some strange activity around the school grounds at night. I saw a cloaked figure lurking near the Forbidden Forest last night. I'm worried someone may be plotting something sinister.\"}]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "training_data[0][:3]" ] }, { "cell_type": "markdown", "id": "f1a9fd64-4f9f-42d3-b5dc-2a340e51e9e7", "metadata": {}, "source": [ "OpenAI currently requires at least 10 training examples for a fine-tuning job, though they recommend between 50-100 for most tasks. Since we only have 9 chat sessions, we can subdivide them (optionally with some overlap) so that each training example is comprised of a portion of a whole conversation.\n", "\n", "Facebook chat sessions (1 per person) often span multiple days and conversations,\n", "so the long-range dependencies may not be that important to model anyhow." ] }, { "cell_type": "code", "execution_count": 42, "id": "13cd290a-b1e9-4686-bb5e-d99de8b8612b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "100" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Our chat is alternating, we will make each datapoint a group of 8 messages,\n", "# with 2 messages overlapping\n", "chunk_size = 8\n", "overlap = 2\n", "\n", "training_examples = [\n", " conversation_messages[i: i + chunk_size] \n", " for conversation_messages in training_data\n", " for i in range(\n", " 0, len(conversation_messages) - chunk_size + 1, \n", " chunk_size - overlap)\n", "]\n", "\n", "len(training_examples)" ] }, { "cell_type": "markdown", "id": "cc8baf41-ff07-4492-96bd-b2472ee7cef9", "metadata": {}, "source": [ "## 4. Fine-tune the model\n", "\n", "It's time to fine-tune the model. Make sure you have `openai` installed\n", "and have set your `OPENAI_API_KEY` appropriately" ] }, { "cell_type": "code", "execution_count": 43, "id": "95ce3f63-3c80-44b2-9060-534ad74e16fa", "metadata": {}, "outputs": [], "source": [ "# %pip install -U openai --quiet" ] }, { "cell_type": "code", "execution_count": 58, "id": "ab9e28eb", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "File file-zCyNBeg4snpbBL7VkvsuhCz8 ready afer 30.55 seconds.\n" ] } ], "source": [ "import json\n", "from io import BytesIO\n", "import time\n", "\n", "import openai\n", "\n", "# We will write the jsonl file in memory\n", "my_file = BytesIO()\n", "for m in training_examples:\n", " my_file.write((json.dumps({\"messages\": m}) + \"\\n\").encode('utf-8'))\n", "\n", "my_file.seek(0)\n", "training_file = openai.File.create(\n", " file=my_file,\n", " purpose='fine-tune'\n", ")\n", "\n", "# OpenAI audits each training file for compliance reasons.\n", "# This make take a few minutes\n", "status = openai.File.retrieve(training_file.id).status\n", "start_time = time.time()\n", "while status != \"processed\":\n", " print(f\"Status=[{status}]... {time.time() - start_time:.2f}s\", end=\"\\r\", flush=True)\n", " time.sleep(5)\n", " status = openai.File.retrieve(training_file.id).status\n", "print(f\"File {training_file.id} ready after {time.time() - start_time:.2f} seconds.\")" ] }, { "cell_type": "markdown", "id": "759a7f51-fde9-4b75-aaa9-e600e6537bd1", "metadata": {}, "source": [ "With the file ready, it's time to kick off a training job." ] }, { "cell_type": "code", "execution_count": 59, "id": "3f451425", "metadata": {}, "outputs": [], "source": [ "job = openai.FineTuningJob.create(\n", " training_file=training_file.id,\n", " model=\"gpt-3.5-turbo\",\n", ")" ] }, { "cell_type": "markdown", "id": "489b23ef-5e14-42a9-bafb-44220ec6960b", "metadata": {}, "source": [ "Grab a cup of tea while your model is being prepared. This may take some time!" ] }, { "cell_type": "code", "execution_count": 60, "id": "bac1637a-c087-4523-ade1-c47f9bf4c6f4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Status=[running]... 908.87s\r" ] } ], "source": [ "status = openai.FineTuningJob.retrieve(job.id).status\n", "start_time = time.time()\n", "while status != \"succeeded\":\n", " print(f\"Status=[{status}]... {time.time() - start_time:.2f}s\", end=\"\\r\", flush=True)\n", " time.sleep(5)\n", " job = openai.FineTuningJob.retrieve(job.id)\n", " status = job.status" ] }, { "cell_type": "code", "execution_count": 66, "id": "535895e1-bc69-40e5-82ed-e24ed2baeeee", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ft:gpt-3.5-turbo-0613:personal::7rDwkaOq\n" ] } ], "source": [ "print(job.fine_tuned_model)" ] }, { "cell_type": "markdown", "id": "502ff73b-f9e9-49ce-ba45-401811e57946", "metadata": {}, "source": [ "## 5. Use in LangChain\n", "\n", "You can use the resulting model ID directly the `ChatOpenAI` model class." ] }, { "cell_type": "code", "execution_count": 67, "id": "3925d60d", "metadata": {}, "outputs": [], "source": [ "from langchain.chat_models import ChatOpenAI\n", "\n", "model = ChatOpenAI(\n", " model=job.fine_tuned_model,\n", " temperature=1,\n", ")" ] }, { "cell_type": "code", "execution_count": 69, "id": "7190cf2e-ab34-4ceb-bdad-45f24f069c29", "metadata": {}, "outputs": [], "source": [ "from langchain.prompts import ChatPromptTemplate\n", "from langchain.schema.output_parser import StrOutputParser\n", "\n", "prompt = ChatPromptTemplate.from_messages(\n", " [\n", " (\"human\", \"{input}\"),\n", " ]\n", ")\n", "\n", "chain = prompt | model | StrOutputParser()" ] }, { "cell_type": "code", "execution_count": 72, "id": "f02057e9-f914-40b1-9c9d-9432ff594b98", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The usual - Potions, Transfiguration, Defense Against the Dark Arts. What about you?" ] } ], "source": [ "for tok in chain.stream({\"input\": \"What classes are you taking?\"}):\n", " print(tok, end=\"\", flush=True)" ] }, { "cell_type": "code", "execution_count": null, "id": "35331503-3cc6-4d64-955e-64afe6b5fef3", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.2" } }, "nbformat": 4, "nbformat_minor": 5 }