langchain/docs/modules/models/llms/integrations/self_hosted_examples.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "9597802c",
   "metadata": {},
   "source": [
    "# Self-Hosted Models via Runhouse\n",
    "This example goes over how to use LangChain and [Runhouse](https://github.com/run-house/runhouse) to interact with models hosted on your own GPU, or on-demand GPUs on AWS, GCP, AWS, or Lambda.\n",
    "\n",
    "For more information, see [Runhouse](https://github.com/run-house/runhouse) or the [Runhouse docs](https://runhouse-docs.readthedocs-hosted.com/en/latest/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6fb585dd",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.llms import SelfHostedPipeline, SelfHostedHuggingFaceLLM\n",
    "from langchain import PromptTemplate, LLMChain\n",
    "import runhouse as rh"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "06d6866e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# For an on-demand A100 with GCP, Azure, or Lambda\n",
    "gpu = rh.cluster(name=\"rh-a10x\", instance_type=\"A100:1\", use_spot=False)\n",
    "\n",
    "# For an on-demand A10G with AWS (no single A100s on AWS)\n",
    "# gpu = rh.cluster(name='rh-a10x', instance_type='g5.2xlarge', provider='aws')\n",
    "\n",
    "# For an existing cluster\n",
    "# gpu = rh.cluster(ips=['<ip of the cluster>'], \n",
    "#                  ssh_creds={'ssh_user': '...', 'ssh_private_key':'<path_to_key>'},\n",
    "#                  name='rh-a10x')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "035dea0f",
   "metadata": {},
   "outputs": [],
   "source": [
    "template = \"\"\"Question: {question}\n",
    "\n",
    "Answer: Let's think step by step.\"\"\"\n",
    "\n",
    "prompt = PromptTemplate(template=template, input_variables=[\"question\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3f3458d9",
   "metadata": {},
   "outputs": [],
   "source": [
    "llm = SelfHostedHuggingFaceLLM(model_id=\"gpt2\", hardware=gpu, model_reqs=[\"pip:./\", \"transformers\", \"torch\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "a641dbd9",
   "metadata": {},
   "outputs": [],
   "source": [
    "llm_chain = LLMChain(prompt=prompt, llm=llm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "6fb6fdb2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO | 2023-02-17 05:42:23,537 | Running _generate_text via gRPC\n",
      "INFO | 2023-02-17 05:42:24,016 | Time to send message: 0.48 seconds\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "\"\\n\\nLet's say we're talking sports teams who won the Super Bowl in the year Justin Beiber\""
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "question = \"What NFL team won the Super Bowl in the year Justin Beiber was born?\"\n",
    "\n",
    "llm_chain.run(question)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c88709cd",
   "metadata": {},
   "source": [
    "You can also load more custom models through the SelfHostedHuggingFaceLLM interface:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "22820c5a",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "llm = SelfHostedHuggingFaceLLM(\n",
    "    model_id=\"google/flan-t5-small\",\n",
    "    task=\"text2text-generation\",\n",
    "    hardware=gpu,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "1528e70f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO | 2023-02-17 05:54:21,681 | Running _generate_text via gRPC\n",
      "INFO | 2023-02-17 05:54:21,937 | Time to send message: 0.25 seconds\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'berlin'"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "llm(\"What is the capital of Germany?\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7a0c3746",
   "metadata": {},
   "source": [
    "Using a custom load function, we can load a custom pipeline directly on the remote hardware:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "893eb1d3",
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_pipeline():\n",
    "    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline  # Need to be inside the fn in notebooks\n",
    "    model_id = \"gpt2\"\n",
    "    tokenizer = AutoTokenizer.from_pretrained(model_id)\n",
    "    model = AutoModelForCausalLM.from_pretrained(model_id)\n",
    "    pipe = pipeline(\n",
    "        \"text-generation\", model=model, tokenizer=tokenizer, max_new_tokens=10\n",
    "    )\n",
    "    return pipe\n",
    "\n",
    "def inference_fn(pipeline, prompt, stop = None):\n",
    "    return pipeline(prompt)[0][\"generated_text\"][len(prompt):]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "087d50dc",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "llm = SelfHostedHuggingFaceLLM(model_load_fn=load_pipeline, hardware=gpu, inference_fn=inference_fn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "feb8da8e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO | 2023-02-17 05:42:59,219 | Running _generate_text via gRPC\n",
      "INFO | 2023-02-17 05:42:59,522 | Time to send message: 0.3 seconds\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'john w. bush'"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "llm(\"Who is the current US president?\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af08575f",
   "metadata": {},
   "source": [
    "You can send your pipeline directly over the wire to your model, but this will only work for small models (<2 Gb), and will be pretty slow:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d23023b9",
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline = load_pipeline()\n",
    "llm = SelfHostedPipeline.from_pipeline(\n",
    "    pipeline=pipeline, hardware=gpu, model_reqs=model_reqs\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fcb447a1",
   "metadata": {},
   "source": [
    "Instead, we can also send it to the hardware's filesystem, which will be much faster."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7206b7d6",
   "metadata": {},
   "outputs": [],
   "source": [
    "rh.blob(pickle.dumps(pipeline), path=\"models/pipeline.pkl\").save().to(gpu, path=\"models\")\n",
    "\n",
    "llm = SelfHostedPipeline.from_pipeline(pipeline=\"models/pipeline.pkl\", hardware=gpu)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}