"This goal is to simulate an architecture where the chat front end and the LLM are running as separate services that need to communicate with one another over an internal network.\n",
"It's an alternative to typical pattern of requesting a response from the model via a REST API (there's more info on why you would want to do this at the end of the notebook)."
"### 2. Build and install the llama-cpp-python library (with CUDA enabled so that we can advantage of Google Colab GPU\n",
"\n",
"The `llama-cpp-python` library is a Python wrapper around the `llama-cpp` library which enables you to efficiently leverage just a CPU to run quantized LLMs.\n",
"\n",
"When you use the standard `pip install llama-cpp-python` command, you do not get GPU support by default. Generation can be very slow if you rely on just the CPU in Google Colab, so the following command adds an extra option to build and install\n",
"`llama-cpp-python` with GPU support (make sure you have a GPU-enabled runtime selected in Google Colab)."
"### 3. Download and setup Kafka and Zookeeper instances\n",
"\n",
"Download the Kafka binaries from the Apache website and start the servers as daemons. We'll use the default configurations (provided by Apache Kafka) for spinning up the instances."
"# Set the current role to the role constant and initialize variables for supplementary customer metadata:\n",
"role = AGENT_ROLE"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HgJjJ9aZ-liy"
},
"source": [
"### 6. Download the \"llama-2-7b-chat.Q4_K_M.gguf\" model\n",
"\n",
"Download the quantized LLama-2 7B model from Hugging Face which we will use as a local LLM (rather than relying on REST API calls to an external service)."
"### 7. Load the model and initialize conversational memory\n",
"\n",
"Load Llama 2 and set the conversation buffer to 300 tokens using `ConversationTokenBufferMemory`. This value was used for running Llama in a CPU only container, so you can raise it if running in Google Colab. It prevents the container that is hosting the model from running out of memory.\n",
"Here, we're overriding the default system persona so that the chatbot has the personality of Marvin The Paranoid Android from the Hitchhiker's Guide to the Galaxy."
"### 8. Initialize the chat conversation with the chat bot\n",
"\n",
"We configure the chatbot to initialize the conversation by sending a fixed greeting to a \"chat\" Kafka topic. The \"chat\" topic gets automatically created when we send the first message."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "KYyo5TnV_YC3"
},
"outputs": [],
"source": [
"def chat_init():\n",
" chat_id = str(\n",
" uuid.uuid4()\n",
" ) # Give the conversation an ID for effective message keying\n",
"This function defines how the chatbot should reply to incoming messages. Instead of sending a fixed message like the previous cell, we generate a reply using Llama-2 and send that reply back to the \"chat\" Kafka topic."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"id": "yN5t71hY_hgn"
},
"outputs": [],
"source": [
"def reply(row: dict, state: State):\n",
" print(\"-------------------------------\")\n",
" print(\"Received:\")\n",
" print(row)\n",
" print(\"-------------------------------\")\n",
" print(f\"Thinking about the reply to: {row['text']}...\")\n",
" # Replace previous role and text values of the row so that it can be sent back to Kafka as a new message\n",
" # containing the agents role and reply\n",
" return row"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HZHwmIR0_kFY"
},
"source": [
"### 10. Check the Kafka topic for new human messages and have the model generate a reply\n",
"\n",
"If you are running this cell for this first time, run it and wait until you see Marvin's greeting ('Hello my name is Marvin...') in the console output. Stop the cell manually and proceed to the next cell where you'll be prompted for your reply.\n",
"\n",
"Once you have typed in your message, come back to this cell. Your reply is also sent to the same \"chat\" topic. The Kafka consumer checks for new messages and filters out messages that originate from the chatbot itself, leaving only the latest human messages.\n",
"\n",
"Once a new human message is detected, the reply function is triggered.\n",
"\n",
"\n",
"\n",
"_STOP THIS CELL MANUALLY WHEN YOU RECEIVE A REPLY FROM THE LLM IN THE OUTPUT_"
"# Publish the processed SDF to a Kafka topic specified by the output_topic object.\n",
"sdf = sdf.to_topic(output_topic)\n",
"\n",
"app.run(sdf)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EwXYrmWD_0CX"
},
"source": [
"\n",
"### 11. Enter a human message\n",
"\n",
"Run this cell to enter your message that you want to sent to the model. It uses another Kafka producer to send your text to the \"chat\" Kafka topic for the model to pick up (requires running the previous cell again)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "6sxOPxSP_3iu"
},
"outputs": [],
"source": [
"chat_input = input(\"Please enter your reply: \")\n",
"print(\"\\n\\nRUN THE PREVIOUS CELL TO HAVE THE CHATBOT GENERATE A REPLY\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cSx3s7TBBegg"
},
"source": [
"### Why route chat messages through Kafka?\n",
"\n",
"It's easier to interact with the LLM directly using LangChains built-in conversation management features. Plus you can also use a REST API to generate a response from an externally hosted model. So why go to the trouble of using Apache Kafka?\n",
"\n",
"There are a few reasons, such as:\n",
"\n",
" * **Integration**: Many enterprises want to run their own LLMs so that they can keep their data in-house. This requires integrating LLM-powered components into existing architectures that might already be decoupled using some kind of message bus.\n",
"\n",
" * **Scalability**: Apache Kafka is designed with parallel processing in mind, so many teams prefer to use it to more effectively distribute work to available workers (in this case the \"worker\" is a container running an LLM).\n",
" * **Durability**: Kafka is designed to allow services to pick up where another service left off in the case where that service experienced a memory issue or went offline. This prevents data loss in highly complex, distributed architectures where multiple systems are communicating with one another (LLMs being just one of many interdependent systems that also include vector databases and traditional databases).\n",
"For more background on why event streaming is a good fit for Gen AI application architecture, see Kai Waehner's article [\"Apache Kafka + Vector Database + LLM = Real-Time GenAI\"](https://www.kai-waehner.de/blog/2023/11/08/apache-kafka-flink-vector-database-llm-real-time-genai/)."