"This notebook shows how to load data from Facebook in a format you can finetune on. The overall steps are:\n",
"\n",
"1. Download your messenger data to disk.\n",
"2. Create the Chat Loader and call `loader.load()` (or `loader.lazy_load()`) to perform the conversion.\n",
"3. Optionally use `merge_chat_runs` to combine message from the same sender in sequence, and/or `map_ai_messages` to convert messages from the specified sender to the \"AIMessage\" class. Once you've done this, call `convert_messages_for_finetuning` to prepare your data for fine-tuning.\n",
"\n",
"\n",
"Once this has been done, you can fine-tune your model. To do so you would complete the following steps:\n",
"\n",
"4. Upload your messages to OpenAI and run a fine-tuning job.\n",
"6. Use the resulting model in your LangChain app!\n",
"\n",
"\n",
"Let's begin.\n",
"\n",
"\n",
"## 1. Download Data\n",
"\n",
"To download your own messenger data, following instructions [here](https://www.zapptales.com/en/download-facebook-messenger-chat-history-how-to/). IMPORTANT - make sure to download them in JSON format (not HTML).\n",
"\n",
"We are hosting an example dump at [this google drive link](https://drive.google.com/file/d/1rh1s1o2i7B-Sk1v9o8KNgivLVGwJ-osV/view?usp=sharing) that we will use in this walkthrough."
"[HumanMessage(content=\"Hi Hermione! How's your summer going so far?\", additional_kwargs={'sender': 'Harry Potter'}, example=False),\n",
" HumanMessage(content=\"Harry! Lovely to hear from you. My summer is going well, though I do miss everyone. I'm spending most of my time going through my books and researching fascinating new topics. How about you?\", additional_kwargs={'sender': 'Hermione Granger'}, example=False),\n",
" HumanMessage(content=\"I miss you all too. The Dursleys are being their usual unpleasant selves but I'm getting by. At least I can practice some spells in my room without them knowing. Let me know if you find anything good in your researching!\", additional_kwargs={'sender': 'Harry Potter'}, example=False)]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chat_session = loader.load()[0]\n",
"chat_session[\"messages\"][:3]"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "8a3ee473",
"metadata": {},
"outputs": [],
"source": [
"loader = FolderFacebookMessengerChatLoader(\n",
" path=\"./hogwarts\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "9f41e122",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"9"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chat_sessions = loader.load()\n",
"len(chat_sessions)"
]
},
{
"cell_type": "markdown",
"id": "d4aa3580-adc1-4b48-9bba-0e8e8d9f44ce",
"metadata": {},
"source": [
"## 3. Prepare for fine-tuning\n",
"\n",
"Calling `load()` returns all the chat messages we could extract as human messages. When conversing with chat bots, conversations typically follow a more strict alternating dialogue pattern relative to real conversations. \n",
"\n",
"You can choose to merge message \"runs\" (consecutive messages from the same sender) and select a sender to represent the \"AI\". The fine-tuned LLM will learn to generate these AI messages."
"[AIMessage(content=\"Professor Snape, I was hoping I could speak with you for a moment about something that's been concerning me lately.\", additional_kwargs={'sender': 'Harry Potter'}, example=False),\n",
" HumanMessage(content=\"What is it, Potter? I'm quite busy at the moment.\", additional_kwargs={'sender': 'Severus Snape'}, example=False),\n",
" AIMessage(content=\"I apologize for the interruption, sir. I'll be brief. I've noticed some strange activity around the school grounds at night. I saw a cloaked figure lurking near the Forbidden Forest last night. I'm worried someone may be plotting something sinister.\", additional_kwargs={'sender': 'Harry Potter'}, example=False)]"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Now all of Harry Potter's messages will take the AI message class\n",
"# which maps to the 'assistant' role in OpenAI's training format\n",
"alternating_sessions[0]['messages'][:3]"
]
},
{
"cell_type": "markdown",
"id": "d985478d-062e-47b9-ae9a-102f59be07c0",
"metadata": {},
"source": [
"#### Now we can convert to OpenAI format dictionaries"
"print(f\"Prepared {len(training_data)} dialogues for training\")"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "dfcbd181",
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"[{'role': 'assistant',\n",
" 'content': \"Professor Snape, I was hoping I could speak with you for a moment about something that's been concerning me lately.\"},\n",
" {'role': 'user',\n",
" 'content': \"What is it, Potter? I'm quite busy at the moment.\"},\n",
" {'role': 'assistant',\n",
" 'content': \"I apologize for the interruption, sir. I'll be brief. I've noticed some strange activity around the school grounds at night. I saw a cloaked figure lurking near the Forbidden Forest last night. I'm worried someone may be plotting something sinister.\"}]"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"training_data[0][:3]"
]
},
{
"cell_type": "markdown",
"id": "f1a9fd64-4f9f-42d3-b5dc-2a340e51e9e7",
"metadata": {},
"source": [
"OpenAI currently requires at least 10 training examples for a fine-tuning job, though they recommend between 50-100 for most tasks. Since we only have 9 chat sessions, we can subdivide them (optionally with some overlap) so that each training example is comprised of a portion of a whole conversation.\n",
"\n",
"Facebook chat sessions (1 per person) often span multiple days and conversations,\n",
"so the long-range dependencies may not be that important to model anyhow."
]
},
{
"cell_type": "code",
"execution_count": 42,
"id": "13cd290a-b1e9-4686-bb5e-d99de8b8612b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"100"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Our chat is alternating, we will make each datapoint a group of 8 messages,\n",