mirror of
https://github.com/hwchase17/langchain
synced 2024-10-29 17:07:25 +00:00
dff00ea91e
Still working out interface/notebooks + need discord data dump to test out things other than copy+paste Update: - Going to remove the 'user_id' arg in the loaders themselves and just standardize on putting the "sender" arg in the extra kwargs. Then can provide a utility function to map these to ai and human messages - Going to move the discord one into just a notebook since I don't have a good dump to test on and copy+paste maybe isn't the greatest thing to support in v0 - Need to do more testing on slack since it seems the dump only includes channels and NOT 1 on 1 convos - --------- Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
188 lines
6.6 KiB
Plaintext
188 lines
6.6 KiB
Plaintext
---
|
||
sidebar_position: 0
|
||
---
|
||
|
||
# Chat loaders
|
||
|
||
Like document loaders, chat loaders are utilities designed to help load conversations from popular communication platforms such as Facebook, Slack, Discord, etc. These are loaded into memory as LangChain chat message objects. Such utilities facilitate tasks such as fine-tuning a language model to match your personal style or voice.
|
||
|
||
This brief guide will illustrate the process using [OpenAI's fine-tuning API](https://platform.openai.com/docs/guides/fine-tuning) comprised of six steps:
|
||
|
||
1. Export your Facebook Messenger chat data in a compatible format for your intended chat loader.
|
||
2. Load the chat data into memory as LangChain chat message objects. (_this is what is covered in each integration notebook in this section of the documentation_).
|
||
- Assign a person to the "AI" role and optionally filter, group, and merge messages.
|
||
3. Export these acquired messages in a format expected by the fine-tuning API.
|
||
4. Upload this data to OpenAI.
|
||
5. Fine-tune your model.
|
||
6. Implement the fine-tuned model in LangChain.
|
||
|
||
This guide is not wholly comprehensive but is designed to take you through the fundamentals of going from raw data to fine-tuned model.
|
||
|
||
We will demonstrate the procedure through an example of fine-tuning a `gpt-3.5-turbo` model on Facebook Messenger data.
|
||
|
||
### 1. Export your chat data
|
||
|
||
To export your Facebook messenger data, you can follow the [instructions here](https://www.zapptales.com/en/download-facebook-messenger-chat-history-how-to/).
|
||
|
||
:::important JSON format
|
||
You must select "JSON format" (instead of HTML) when exporting your data to be compatible with the current loader.
|
||
:::
|
||
|
||
OpenAI requires at least 10 examples to fine-tune your model, but they recommend between 50-100 for more optimal results.
|
||
You can use the example data stored at [this google drive link](https://drive.google.com/file/d/1rh1s1o2i7B-Sk1v9o8KNgivLVGwJ-osV/view?usp=sharing) to test the process.
|
||
|
||
### 2. Load the chat
|
||
|
||
Once you've obtained your chat data, you can load it into memory as LangChain chat message objects. Here’s an example of loading data using the Python code:
|
||
|
||
```python
|
||
from langchain.chat_loaders.facebook_messenger import FolderFacebookMessengerChatLoader
|
||
|
||
loader = FolderFacebookMessengerChatLoader(
|
||
path="./facebook_messenger_chats",
|
||
)
|
||
|
||
chat_sessions = loader.load()
|
||
```
|
||
|
||
In this snippet, we point the loader to a directory of Facebook chat dumps which are then loaded as multiple "sessions" of messages, one session per conversation file.
|
||
|
||
Once you've loaded the messages, you should decide which person you want to fine-tune the model to (usually yourself). You can also decide to merge consecutive messages from the same sender into a single chat message.
|
||
For both of these tasks, you can use the chat_loaders utilities to do so:
|
||
|
||
```
|
||
from langchain.chat_loaders.utils import (
|
||
merge_chat_runs,
|
||
map_ai_messages,
|
||
)
|
||
|
||
merged_sessions = merge_chat_runs(chat_sessions)
|
||
alternating_sessions = list(map_ai_messages(merged_sessions, "My Name"))
|
||
```
|
||
|
||
### 3. Export messages to OpenAI format
|
||
|
||
Convert the chat messages to dictionaries using the `convert_messages_for_finetuning` function. Then, group the data into chunks for better context modeling and overlap management.
|
||
|
||
```python
|
||
from langchain.adapters.openai import convert_messages_for_finetuning
|
||
|
||
openai_messages = convert_messages_for_finetuning(chat_sessions)
|
||
```
|
||
|
||
At this point, the data is ready for upload to OpenAI. You can choose to split up conversations into smaller chunks for training if you
|
||
do not have enough conversations to train on. Feel free to play around with different chunk sizes or with adding system messages to the fine-tuning data.
|
||
|
||
```python
|
||
chunk_size = 8
|
||
overlap = 2
|
||
|
||
message_groups = [
|
||
conversation_messages[i: i + chunk_size]
|
||
for conversation_messages in openai_messages
|
||
for i in range(
|
||
0, len(conversation_messages) - chunk_size + 1,
|
||
chunk_size - overlap)
|
||
]
|
||
|
||
len(message_groups)
|
||
# 9
|
||
```
|
||
|
||
### 4. Upload the data to OpenAI
|
||
|
||
Ensure you have set your OpenAI API key by following these [instructions](https://platform.openai.com/account/api-keys), then upload the training file.
|
||
An audit is performed to ensure data compliance, so you may have to wait a few minutes for the dataset to become ready for use.
|
||
|
||
```python
|
||
import time
|
||
import json
|
||
import io
|
||
|
||
import openai
|
||
|
||
my_file = io.BytesIO()
|
||
for group in message_groups:
|
||
my_file.write((json.dumps({"messages": group}) + "\n").encode('utf-8'))
|
||
|
||
my_file.seek(0)
|
||
training_file = openai.File.create(
|
||
file=my_file,
|
||
purpose='fine-tune'
|
||
)
|
||
|
||
# Wait while the file is processed
|
||
status = openai.File.retrieve(training_file.id).status
|
||
start_time = time.time()
|
||
while status != "processed":
|
||
print(f"Status=[{status}]... {time.time() - start_time:.2f}s", end="\r", flush=True)
|
||
time.sleep(5)
|
||
status = openai.File.retrieve(training_file.id).status
|
||
print(f"File {training_file.id} ready after {time.time() - start_time:.2f} seconds.")
|
||
```
|
||
|
||
Once this is done, you can proceed to the model training!
|
||
|
||
### 5. Fine-tune the model
|
||
|
||
Start the fine-tuning job with your chosen base model.
|
||
|
||
```python
|
||
job = openai.FineTuningJob.create(
|
||
training_file=training_file.id,
|
||
model="gpt-3.5-turbo",
|
||
)
|
||
```
|
||
|
||
This might take a while. Check the status with `openai.FineTuningJob.retrieve(job.id).status` and wait for it to report `succeeded`.
|
||
|
||
```python
|
||
# It may take 10-20+ minutes to complete training.
|
||
status = openai.FineTuningJob.retrieve(job.id).status
|
||
start_time = time.time()
|
||
while status != "succeeded":
|
||
print(f"Status=[{status}]... {time.time() - start_time:.2f}s", end="\r", flush=True)
|
||
time.sleep(5)
|
||
job = openai.FineTuningJob.retrieve(job.id)
|
||
status = job.status
|
||
```
|
||
|
||
### 6. Use the model in LangChain
|
||
|
||
You're almost there! Use the fine-tuned model in LangChain.
|
||
|
||
```python
|
||
from langchain import chat_models
|
||
|
||
model_name = job.fine_tuned_model
|
||
# Example: ft:gpt-3.5-turbo-0613:personal::5mty86jblapsed
|
||
model = chat_models.ChatOpenAI(model=model_name)
|
||
```
|
||
|
||
```python
|
||
from langchain.prompts import ChatPromptTemplate
|
||
from langchain.schema.output_parser import StrOutputParser
|
||
|
||
prompt = ChatPromptTemplate.from_messages(
|
||
[
|
||
("human", "{input}"),
|
||
]
|
||
)
|
||
|
||
chain = prompt | model | StrOutputParser()
|
||
|
||
for tok in chain.stream({"input": "What classes are you taking?"}):
|
||
print(tok, end="", flush=True)
|
||
|
||
# The usual - Potions, Transfiguration, Defense Against the Dark Arts. What about you?
|
||
```
|
||
|
||
And that's it! You've successfully fine-tuned a model and used it in LangChain.
|
||
|
||
## Supported Chat Loaders
|
||
|
||
LangChain currently supports the following chat loaders. Feel free to contribute more!
|
||
|
||
import DocCardList from "@theme/DocCardList";
|
||
|
||
<DocCardList /> |