mirror of
https://github.com/hwchase17/langchain
synced 2024-10-31 15:20:26 +00:00
{
"cells": [
{
"cell_type": "markdown",
"id": "e4bd269b",
"metadata": {},
"source": [
"# Facebook Messenger\n",
"\n",
"This notebook shows how to load data from Facebook in a format you can finetune on. The overall steps are:\n",
"\n",
"1. Download your messenger data to disk.\n",
"2. Create the Chat Loader and call `loader.load()` (or `loader.lazy_load()`) to perform the conversion.\n",
"3. Optionally use `merge_chat_runs` to combine message from the same sender in sequence, and/or `map_ai_messages` to convert messages from the specified sender to the \"AIMessage\" class. Once you've done this, call `convert_messages_for_finetuning` to prepare your data for fine-tuning.\n",
"\n",
"\n",
"Once this has been done, you can fine-tune your model. To do so you would complete the following steps:\n",
"\n",
"4. Upload your messages to OpenAI and run a fine-tuning job.\n",
"6. Use the resulting model in your LangChain app!\n",
"\n",
"\n",
"Let's begin.\n",
"\n",
"\n",
"## 1. Download Data\n",
"\n",
"To download your own messenger data, following instructions [here](https://www.zapptales.com/en/download-facebook-messenger-chat-history-how-to/). IMPORTANT - make sure to download them in JSON format (not HTML).\n",
"\n",
"We are hosting an example dump at [this google drive link](https://drive.google.com/file/d/1rh1s1o2i7B-Sk1v9o8KNgivLVGwJ-osV/view?usp=sharing) that we will use in this walkthrough."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "647f2158-a42e-4634-b283-b8492caf542a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"File file.zip downloaded.\n",
"File file.zip has been unzipped.\n"
]
}
],
"source": [
"# This uses some example data\n",
"import requests\n",
"import zipfile\n",
"\n",
"def download_and_unzip(url: str, output_path: str = 'file.zip') -> None:\n",
" file_id = url.split('/')[-2]\n",
|
|
" download_url = f'https://drive.google.com/uc?export=download&id={file_id}'\n",
|
|
"\n",
|
|
" response = requests.get(download_url)\n",
|
|
" if response.status_code != 200:\n",
|
|
" print('Failed to download the file.')\n",
|
|
" return\n",
|
|
"\n",
|
|
" with open(output_path, 'wb') as file:\n",
|
|
" file.write(response.content)\n",
|
|
" print(f'File {output_path} downloaded.')\n",
|
|
"\n",
|
|
" with zipfile.ZipFile(output_path, 'r') as zip_ref:\n",
|
|
" zip_ref.extractall()\n",
|
|
" print(f'File {output_path} has been unzipped.')\n",
"\n",
"# URL of the file to download\n",
"url = 'https://drive.google.com/file/d/1rh1s1o2i7B-Sk1v9o8KNgivLVGwJ-osV/view?usp=sharing'\n",
"\n",
"# Download and unzip\n",
"download_and_unzip(url)\n"
]
},
{
"cell_type": "markdown",
"id": "48ef8bb1-fc28-453c-835a-94a552f05a91",
"metadata": {},
"source": [
"## 2. Create Chat Loader\n",
"\n",
"We have 2 different `FacebookMessengerChatLoader` classes, one for an entire directory of chats, and one to load individual files. We"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "a0869bc6",
"metadata": {},
"outputs": [],
"source": [
"directory_path = \"./hogwarts\""
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "0460bf25",
"metadata": {},
"outputs": [],
"source": [
"from langchain.chat_loaders.facebook_messenger import (\n",
" SingleFileFacebookMessengerChatLoader,\n",
|
|
" FolderFacebookMessengerChatLoader,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "f61ee277",
"metadata": {},
"outputs": [],
"source": [
"loader = SingleFileFacebookMessengerChatLoader(\n",
" path=\"./hogwarts/inbox/HermioneGranger/messages_Hermione_Granger.json\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "ec466ad7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[HumanMessage(content=\"Hi Hermione! How's your summer going so far?\", additional_kwargs={'sender': 'Harry Potter'}, example=False),\n",
" HumanMessage(content=\"Harry! Lovely to hear from you. My summer is going well, though I do miss everyone. I'm spending most of my time going through my books and researching fascinating new topics. How about you?\", additional_kwargs={'sender': 'Hermione Granger'}, example=False),\n",
" HumanMessage(content=\"I miss you all too. The Dursleys are being their usual unpleasant selves but I'm getting by. At least I can practice some spells in my room without them knowing. Let me know if you find anything good in your researching!\", additional_kwargs={'sender': 'Harry Potter'}, example=False)]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chat_session = loader.load()[0]\n",
"chat_session[\"messages\"][:3]"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "8a3ee473",
"metadata": {},
"outputs": [],
"source": [
"loader = FolderFacebookMessengerChatLoader(\n",
" path=\"./hogwarts\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "9f41e122",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"9"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chat_sessions = loader.load()\n",
"len(chat_sessions)"
]
},
{
"cell_type": "markdown",
"id": "d4aa3580-adc1-4b48-9bba-0e8e8d9f44ce",
"metadata": {},
"source": [
"## 3. Prepare for fine-tuning\n",
"\n",
"Calling `load()` returns all the chat messages we could extract as human messages. When conversing with chat bots, conversations typically follow a more strict alternating dialogue pattern relative to real conversations. \n",
"\n",
"You can choose to merge message \"runs\" (consecutive messages from the same sender) and select a sender to represent the \"AI\". The fine-tuned LLM will learn to generate these AI messages."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "5a78030d-b757-4bbe-8a6c-841056f46df7",
"metadata": {},
"outputs": [],
"source": [
"from langchain.chat_loaders.utils import (\n",
" merge_chat_runs,\n",
|
|
" map_ai_messages,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "ff35b028-78bf-4c5b-9ec6-939fe67de7f7",
"metadata": {},
"outputs": [],
"source": [
"merged_sessions = merge_chat_runs(chat_sessions)\n",
"alternating_sessions = list(map_ai_messages(merged_sessions, \"Harry Potter\"))"
]
},
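{
"cell_type": "markdown",
"id": "sketch-merge-runs-note",
"metadata": {},
"source": [
"To make these two helpers concrete, here is a minimal plain-Python sketch of the idea. It is not LangChain's implementation: messages are plain dicts with assumed `sender` and `content` keys, merged messages are joined with a blank line (an assumption), and `merge_runs` and `assign_roles` are hypothetical names used only for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-merge-runs-code",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch only: plain dicts stand in for LangChain message objects.\n",
"def merge_runs(messages):\n",
"    # Join consecutive messages from the same sender into a single message.\n",
"    merged = []\n",
"    for msg in messages:\n",
"        if merged and merged[-1]['sender'] == msg['sender']:\n",
"            # Assumed separator: a blank line between the joined messages.\n",
"            merged[-1]['content'] += '\\n\\n' + msg['content']\n",
"        else:\n",
"            merged.append(dict(msg))  # copy so the input list is not mutated\n",
"    return merged\n",
"\n",
"\n",
"def assign_roles(messages, ai_sender):\n",
"    # Label messages from `ai_sender` as 'assistant' and all others as 'user'.\n",
"    return [\n",
"        {'role': 'assistant' if m['sender'] == ai_sender else 'user',\n",
"         'content': m['content']}\n",
"        for m in messages\n",
"    ]\n",
"\n",
"\n",
"demo = [\n",
"    {'sender': 'Harry Potter', 'content': 'Hi Hermione!'},\n",
"    {'sender': 'Harry Potter', 'content': 'How is your summer going?'},\n",
"    {'sender': 'Hermione Granger', 'content': 'Lovely to hear from you.'},\n",
"]\n",
"assign_roles(merge_runs(demo), 'Harry Potter')"
]
},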
{
"cell_type": "code",
"execution_count": 19,
"id": "4b11906e-a496-4d01-9f0d-1938c14147bf",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[AIMessage(content=\"Professor Snape, I was hoping I could speak with you for a moment about something that's been concerning me lately.\", additional_kwargs={'sender': 'Harry Potter'}, example=False),\n",
" HumanMessage(content=\"What is it, Potter? I'm quite busy at the moment.\", additional_kwargs={'sender': 'Severus Snape'}, example=False),\n",
" AIMessage(content=\"I apologize for the interruption, sir. I'll be brief. I've noticed some strange activity around the school grounds at night. I saw a cloaked figure lurking near the Forbidden Forest last night. I'm worried someone may be plotting something sinister.\", additional_kwargs={'sender': 'Harry Potter'}, example=False)]"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Now all of Harry Potter's messages will take the AI message class\n",
"# which maps to the 'assistant' role in OpenAI's training format\n",
"alternating_sessions[0]['messages'][:3]"
]
},
{
"cell_type": "markdown",
"id": "d985478d-062e-47b9-ae9a-102f59be07c0",
"metadata": {},
"source": [
"#### Now we can convert to OpenAI format dictionaries"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "21372331",
"metadata": {},
"outputs": [],
"source": [
"from langchain.adapters.openai import convert_messages_for_finetuning"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "92c5ae7a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Prepared 9 dialogues for training\n"
]
}
],
"source": [
"training_data = convert_messages_for_finetuning(alternating_sessions)\n",
"print(f\"Prepared {len(training_data)} dialogues for training\")"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "dfcbd181",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"[{'role': 'assistant',\n",
" 'content': \"Professor Snape, I was hoping I could speak with you for a moment about something that's been concerning me lately.\"},\n",
" {'role': 'user',\n",
" 'content': \"What is it, Potter? I'm quite busy at the moment.\"},\n",
" {'role': 'assistant',\n",
" 'content': \"I apologize for the interruption, sir. I'll be brief. I've noticed some strange activity around the school grounds at night. I saw a cloaked figure lurking near the Forbidden Forest last night. I'm worried someone may be plotting something sinister.\"}]"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_data[0][:3]"
]
},
{
"cell_type": "markdown",
"id": "f1a9fd64-4f9f-42d3-b5dc-2a340e51e9e7",
"metadata": {},
"source": [
"OpenAI currently requires at least 10 training examples for a fine-tuning job, though they recommend between 50-100 for most tasks. Since we only have 9 chat sessions, we can subdivide them (optionally with some overlap) so that each training example is comprised of a portion of a whole conversation.\n",
"\n",
"Facebook chat sessions (1 per person) often span multiple days and conversations,\n",
"so the long-range dependencies may not be that important to model anyhow."
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "13cd290a-b1e9-4686-bb5e-d99de8b8612b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"100"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Our chat is alternating, we will make each datapoint a group of 8 messages,\n",
"# with 2 messages overlapping\n",
"chunk_size = 8\n",
"overlap = 2\n",
"\n",
"training_examples = [\n",
" conversation_messages[i: i + chunk_size] \n",
|
|
" for conversation_messages in training_data\n",
|
|
" for i in range(\n",
|
|
" 0, len(conversation_messages) - chunk_size + 1, \n",
|
|
" chunk_size - overlap)\n",
"]\n",
"\n",
"len(training_examples)"
]
},
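{
"cell_type": "markdown",
"id": "sketch-window-arithmetic-note",
"metadata": {},
"source": [
"As a quick check of the windowing arithmetic, the sketch below applies the same comprehension to a hypothetical 20-message conversation in which integers stand in for messages. With a chunk size of 8 and an overlap of 2 the stride is 6, so windows start at indices 0, 6 and 12. Any trailing remainder shorter than the chunk size is dropped, as are conversations with fewer than 8 messages. The `toy_` names are illustrative only and are kept separate from the variables used above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-window-arithmetic-code",
"metadata": {},
"outputs": [],
"source": [
"# Toy data: integers stand in for the messages of one conversation.\n",
"toy_conversation = list(range(20))\n",
"toy_chunk_size = 8\n",
"toy_overlap = 2\n",
"\n",
"toy_windows = [\n",
"    toy_conversation[i: i + toy_chunk_size]\n",
"    for i in range(\n",
"        0, len(toy_conversation) - toy_chunk_size + 1,\n",
"        toy_chunk_size - toy_overlap)\n",
"]\n",
"\n",
"# Show the first element, last element and length of each window.\n",
"for window in toy_windows:\n",
"    print(window[0], window[-1], len(window))"
]
},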
{
"cell_type": "markdown",
"id": "cc8baf41-ff07-4492-96bd-b2472ee7cef9",
"metadata": {},
"source": [
"## 4. Fine-tune the model\n",
"\n",
"It's time to fine-tune the model. Make sure you have `openai` installed\n",
"and have set your `OPENAI_API_KEY` appropriately"
]
},
{
"cell_type": "code",
"execution_count": 43,
"id": "95ce3f63-3c80-44b2-9060-534ad74e16fa",
"metadata": {},
"outputs": [],
"source": [
"# %pip install -U openai --quiet"
]
},
{
"cell_type": "code",
"execution_count": 58,
"id": "ab9e28eb",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"File file-zCyNBeg4snpbBL7VkvsuhCz8 ready afer 30.55 seconds.\n"
]
}
],
"source": [
"import json\n",
"from io import BytesIO\n",
"import time\n",
"\n",
"import openai\n",
"\n",
"# We will write the jsonl file in memory\n",
"my_file = BytesIO()\n",
"for m in training_examples:\n",
" my_file.write((json.dumps({\"messages\": m}) + \"\\n\").encode('utf-8'))\n",
"\n",
"my_file.seek(0)\n",
"training_file = openai.File.create(\n",
" file=my_file,\n",
|
|
" purpose='fine-tune'\n",
")\n",
"\n",
"# OpenAI audits each training file for compliance reasons.\n",
"# This make take a few minutes\n",
"status = openai.File.retrieve(training_file.id).status\n",
"start_time = time.time()\n",
"while status != \"processed\":\n",
" print(f\"Status=[{status}]... {time.time() - start_time:.2f}s\", end=\"\\r\", flush=True)\n",
|
|
" time.sleep(5)\n",
|
|
" status = openai.File.retrieve(training_file.id).status\n",
"print(f\"File {training_file.id} ready after {time.time() - start_time:.2f} seconds.\")"
]
},
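{
"cell_type": "markdown",
"id": "sketch-jsonl-check-note",
"metadata": {},
"source": [
"A local sanity check of the JSONL payload can catch formatting problems before they reach the upload step. The sketch below is not an official validator: `check_jsonl` is a hypothetical helper that only verifies that each line parses as JSON and holds a non-empty `messages` list whose entries carry `role` and `content` keys. It runs on a small made-up sample. Passing it the bytes of the in-memory file built above (for example via `my_file.getvalue()`) should work the same way."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-jsonl-check-code",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"\n",
"# Hypothetical helper: a light local sanity check, not OpenAI's own validation.\n",
"def check_jsonl(raw_bytes):\n",
"    count = 0\n",
"    for line in raw_bytes.decode('utf-8').splitlines():\n",
"        record = json.loads(line)\n",
"        messages = record['messages']\n",
"        assert messages, 'each example needs at least one message'\n",
"        for message in messages:\n",
"            assert {'role', 'content'} <= set(message), 'missing role or content'\n",
"        count += 1\n",
"    return count\n",
"\n",
"\n",
"# Made-up sample with a single two-message training example.\n",
"sample = [[\n",
"    {'role': 'user', 'content': 'Hi'},\n",
"    {'role': 'assistant', 'content': 'Hello'},\n",
"]]\n",
"sample_bytes = ''.join(\n",
"    json.dumps({'messages': m}) + '\\n' for m in sample\n",
").encode('utf-8')\n",
"check_jsonl(sample_bytes)"
]
},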
{
"cell_type": "markdown",
"id": "759a7f51-fde9-4b75-aaa9-e600e6537bd1",
"metadata": {},
"source": [
"With the file ready, it's time to kick off a training job."
]
},
{
"cell_type": "code",
"execution_count": 59,
"id": "3f451425",
"metadata": {},
"outputs": [],
"source": [
"job = openai.FineTuningJob.create(\n",
" training_file=training_file.id,\n",
" model=\"gpt-3.5-turbo\",\n",
")"
]
},
{
"cell_type": "markdown",
"id": "489b23ef-5e14-42a9-bafb-44220ec6960b",
"metadata": {},
"source": [
"Grab a cup of tea while your model is being prepared. This may take some time!"
]
},
{
"cell_type": "code",
"execution_count": 60,
"id": "bac1637a-c087-4523-ade1-c47f9bf4c6f4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Status=[running]... 908.87s\r"
]
}
],
"source": [
"status = openai.FineTuningJob.retrieve(job.id).status\n",
"start_time = time.time()\n",
"while status != \"succeeded\":\n",
" print(f\"Status=[{status}]... {time.time() - start_time:.2f}s\", end=\"\\r\", flush=True)\n",
|
|
" time.sleep(5)\n",
|
|
" job = openai.FineTuningJob.retrieve(job.id)\n",
|
|
" status = job.status"
]
},
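{
"cell_type": "markdown",
"id": "sketch-polling-helper-note",
"metadata": {},
"source": [
"Polling loops like this one can also be wrapped in a small reusable helper that stops on any terminal state and gives up after a timeout. The sketch below is independent of the OpenAI SDK: `wait_for_status` is a hypothetical function, the status names passed to it are assumptions to check against the API documentation, and a fake status source is used so the cell runs without any API calls."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-polling-helper-code",
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"\n",
"# Hypothetical helper: poll `get_status` until it returns a terminal state.\n",
"def wait_for_status(get_status, terminal_states, timeout_s=3600.0, interval_s=5.0):\n",
"    start = time.time()\n",
"    while True:\n",
"        status = get_status()\n",
"        if status in terminal_states:\n",
"            return status\n",
"        if time.time() - start > timeout_s:\n",
"            raise TimeoutError(f'Still {status!r} after {timeout_s:.0f}s')\n",
"        time.sleep(interval_s)\n",
"\n",
"\n",
"# Demo with a fake status source instead of a real fine-tuning job.\n",
"fake_statuses = iter(['queued', 'running', 'succeeded'])\n",
"wait_for_status(\n",
"    lambda: next(fake_statuses),\n",
"    {'succeeded', 'failed', 'cancelled'},\n",
"    interval_s=0.0,\n",
")"
]
},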
{
"cell_type": "code",
"execution_count": 66,
"id": "535895e1-bc69-40e5-82ed-e24ed2baeeee",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ft:gpt-3.5-turbo-0613:personal::7rDwkaOq\n"
]
}
],
"source": [
"print(job.fine_tuned_model)"
]
},
{
"cell_type": "markdown",
"id": "502ff73b-f9e9-49ce-ba45-401811e57946",
"metadata": {},
"source": [
"## 5. Use in LangChain\n",
"\n",
"You can use the resulting model ID directly the `ChatOpenAI` model class."
]
},
{
"cell_type": "code",
"execution_count": 67,
"id": "3925d60d",
"metadata": {},
"outputs": [],
"source": [
"from langchain.chat_models import ChatOpenAI\n",
"\n",
"model = ChatOpenAI(\n",
" model=job.fine_tuned_model,\n",
|
|
" temperature=1,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 69,
"id": "7190cf2e-ab34-4ceb-bdad-45f24f069c29",
"metadata": {},
"outputs": [],
"source": [
"from langchain.prompts import ChatPromptTemplate\n",
"from langchain.schema.output_parser import StrOutputParser\n",
"\n",
"prompt = ChatPromptTemplate.from_messages(\n",
" [\n",
|
|
" (\"human\", \"{input}\"),\n",
|
|
" ]\n",
")\n",
"\n",
"chain = prompt | model | StrOutputParser()"
]
},
{
"cell_type": "code",
"execution_count": 72,
"id": "f02057e9-f914-40b1-9c9d-9432ff594b98",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The usual - Potions, Transfiguration, Defense Against the Dark Arts. What about you?"
]
}
],
"source": [
"for tok in chain.stream({\"input\": \"What classes are you taking?\"}):\n",
" print(tok, end=\"\", flush=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "35331503-3cc6-4d64-955e-64afe6b5fef3",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}