{
"cells": [
{
"cell_type": "markdown",
"id": "959300d4",
"metadata": {},
"source": [
"# Hugging Face Local Pipelines\n",
"\n",
"Hugging Face models can be run locally through the `HuggingFacePipeline` class.\n",
"\n",
"The [Hugging Face Model Hub](https://huggingface.co/models) hosts over 120k models, 20k datasets, and 50k demo apps (Spaces), all open source and publicly available, in an online platform where people can easily collaborate and build ML together.\n",
"\n",
"These can be called from LangChain either through this local pipeline wrapper or by calling their hosted inference endpoints through the HuggingFaceHub class."
]
},
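{
"cell_type": "markdown",
"id": "b2c4a1e0",
"metadata": {},
"source": [
"For comparison, a minimal sketch of the hosted route via `HuggingFaceHub` is shown below. It assumes the `huggingface_hub` package is installed and a valid `HUGGINGFACEHUB_API_TOKEN` environment variable is set; the `repo_id` is only an illustrative choice. The rest of this notebook focuses on the local pipeline wrapper."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c3d5f2a1",
"metadata": {},
"outputs": [],
"source": [
"# Sketch of the hosted alternative (assumes huggingface_hub is installed and\n",
"# HUGGINGFACEHUB_API_TOKEN is set); the repo_id below is just an example.\n",
"from langchain_community.llms import HuggingFaceHub\n",
"\n",
"hub_llm = HuggingFaceHub(\n",
"    repo_id=\"gpt2\",\n",
"    model_kwargs={\"temperature\": 0.7, \"max_new_tokens\": 10},\n",
")\n",
"\n",
"print(hub_llm.invoke(\"Hugging Face is\"))"
]
},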
{
"cell_type": "markdown",
"id": "4c1b8450-5eaf-4d34-8341-2d785448a1ff",
"metadata": {
"tags": []
},
"source": [
"To use, you should have the ``transformers`` Python [package installed](https://pypi.org/project/transformers/), as well as [PyTorch](https://pytorch.org/get-started/locally/). You can also install `xformers` for a more memory-efficient attention implementation."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d772b637-de00-4663-bd77-9bc96d798db2",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%pip install --upgrade --quiet transformers"
]
},
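{
"cell_type": "markdown",
"id": "1a2b3c4d",
"metadata": {},
"source": [
"Optionally, the `xformers` package mentioned above can be installed the same way if you want the memory-efficient attention implementation."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5e6f7a8b",
"metadata": {},
"outputs": [],
"source": [
"%pip install --upgrade --quiet xformers"
]
},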
{
"cell_type": "markdown",
"id": "91ad075f-71d5-4bc8-ab91-cc0ad5ef16bb",
"metadata": {},
"source": [
"### Model Loading\n",
"\n",
"Models can be loaded by specifying the model parameters using the `from_model_id` method."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "165ae236-962a-4763-8052-c4836d78a5d2",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline\n",
"\n",
"hf = HuggingFacePipeline.from_model_id(\n",
"    model_id=\"gpt2\",\n",
"    task=\"text-generation\",\n",
"    pipeline_kwargs={\"max_new_tokens\": 10},\n",
")"
]
},
{
"cell_type": "markdown",
"id": "00104b27-0c15-4a97-b198-4512337ee211",
"metadata": {},
"source": [
"They can also be loaded by passing in an existing `transformers` pipeline directly."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f426a4f",
"metadata": {},
"outputs": [],
"source": [
"from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline\n",
"from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline\n",
"\n",
"model_id = \"gpt2\"\n",
"tokenizer = AutoTokenizer.from_pretrained(model_id)\n",
"model = AutoModelForCausalLM.from_pretrained(model_id)\n",
"pipe = pipeline(\"text-generation\", model=model, tokenizer=tokenizer, max_new_tokens=10)\n",
"hf = HuggingFacePipeline(pipeline=pipe)"
]
},
{
"cell_type": "markdown",
"id": "60e7ba8d",
"metadata": {},
"source": [
"### Create Chain\n",
"\n",
"With the model loaded into memory, you can compose it with a prompt to\n",
"form a chain."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3acf0069",
"metadata": {},
"outputs": [],
"source": [
"from langchain.prompts import PromptTemplate\n",
"\n",
"template = \"\"\"Question: {question}\n",
"\n",
"Answer: Let's think step by step.\"\"\"\n",
"prompt = PromptTemplate.from_template(template)\n",
"\n",
"chain = prompt | hf\n",
"\n",
"question = \"What is electroencephalography?\"\n",
"\n",
"print(chain.invoke({\"question\": question}))"
]
},
{
"cell_type": "markdown",
"id": "dbbc3a37",
"metadata": {},
"source": [
"### GPU Inference\n",
"\n",
"When running on a machine with a GPU, you can specify the `device=n` parameter to put the model on the specified device.\n",
"It defaults to `-1` for CPU inference.\n",
"\n",
"If you have multiple GPUs and/or the model is too large for a single GPU, you can specify `device_map=\"auto\"`, which requires and uses the [Accelerate](https://huggingface.co/docs/accelerate/index) library to automatically determine how to load the model weights (see the sketch after the next cell).\n",
"\n",
"*Note*: `device` and `device_map` should not be specified together, as doing so can lead to unexpected behavior."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "703c91c8",
"metadata": {},
"outputs": [],
"source": [
"gpu_llm = HuggingFacePipeline.from_model_id(\n",
"    model_id=\"gpt2\",\n",
"    task=\"text-generation\",\n",
"    device=0,  # replace with device_map=\"auto\" to use the accelerate library.\n",
"    pipeline_kwargs={\"max_new_tokens\": 10},\n",
")\n",
"\n",
"gpu_chain = prompt | gpu_llm\n",
"\n",
"question = \"What is electroencephalography?\"\n",
"\n",
"print(gpu_chain.invoke({\"question\": question}))"
]
},
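{
"cell_type": "markdown",
"id": "9c1e4f2a",
"metadata": {},
"source": [
"Following the comment in the cell above, a minimal sketch of the `device_map=\"auto\"` variant is shown below. It assumes, per that comment, that `from_model_id` accepts `device_map` in place of `device`, and that the `accelerate` package is installed."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0d2f5b3c",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: let Accelerate decide where to place the model weights.\n",
"# Requires the accelerate package; do not combine with the `device` parameter.\n",
"auto_llm = HuggingFacePipeline.from_model_id(\n",
"    model_id=\"gpt2\",\n",
"    task=\"text-generation\",\n",
"    device_map=\"auto\",\n",
"    pipeline_kwargs={\"max_new_tokens\": 10},\n",
")\n",
"\n",
"auto_chain = prompt | auto_llm\n",
"\n",
"print(auto_chain.invoke({\"question\": question}))"
]
},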
{
"cell_type": "markdown",
"id": "59276016",
"metadata": {},
"source": [
"### Batch GPU Inference\n",
"\n",
"If you are running on a machine with a GPU, you can also run inference on the GPU in batch mode."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "097ba62f",
"metadata": {},
"outputs": [],
"source": [
"gpu_llm = HuggingFacePipeline.from_model_id(\n",
"    model_id=\"bigscience/bloom-1b7\",\n",
"    task=\"text-generation\",\n",
"    device=0,  # -1 for CPU\n",
"    batch_size=2,  # adjust as needed based on GPU memory and model size.\n",
"    model_kwargs={\"temperature\": 0, \"max_length\": 64},\n",
")\n",
"\n",
"gpu_chain = prompt | gpu_llm.bind(stop=[\"\\n\\n\"])\n",
"\n",
"questions = []\n",
"for i in range(4):\n",
"    questions.append({\"question\": f\"What is the number {i} in French?\"})\n",
"\n",
"answers = gpu_chain.batch(questions)\n",
"for answer in answers:\n",
"    print(answer)"
]
},
{
"cell_type": "markdown",
"id": "df1d41d9",
"metadata": {},
"source": [
"### Inference with OpenVINO backend\n",
"\n",
"To deploy a model with OpenVINO, you can specify the `backend=\"openvino\"` parameter to use OpenVINO as the backend inference framework.\n",
"\n",
"If you have an Intel GPU, you can specify `model_kwargs={\"device\": \"GPU\"}` to run inference on it."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "efb73dd7-77bf-4436-92e5-51306af45bd7",
"metadata": {},
"outputs": [],
"source": [
"%pip install --upgrade-strategy eager \"optimum[openvino,nncf]\" --quiet"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "70f6826c",
"metadata": {},
"outputs": [],
"source": [
"ov_config = {\"PERFORMANCE_HINT\": \"LATENCY\", \"NUM_STREAMS\": \"1\", \"CACHE_DIR\": \"\"}\n",
"\n",
"ov_llm = HuggingFacePipeline.from_model_id(\n",
"    model_id=\"gpt2\",\n",
"    task=\"text-generation\",\n",
"    backend=\"openvino\",\n",
"    model_kwargs={\"device\": \"CPU\", \"ov_config\": ov_config},\n",
"    pipeline_kwargs={\"max_new_tokens\": 10},\n",
")\n",
"\n",
"ov_chain = prompt | ov_llm\n",
"\n",
"question = \"What is electroencephalography?\"\n",
"\n",
"print(ov_chain.invoke({\"question\": question}))"
]
},
{
"cell_type": "markdown",
"id": "12524837-e9ab-455a-86be-66b95f4f893a",
"metadata": {},
"source": [
"### Inference with local OpenVINO model\n",
"\n",
"It is possible to [export your model](https://github.com/huggingface/optimum-intel?tab=readme-ov-file#export) to the OpenVINO IR format with the CLI and load the model from a local folder.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3d1104a2-79c7-43a6-aa1c-8076a5ad7747",
"metadata": {},
"outputs": [],
"source": [
"!optimum-cli export openvino --model gpt2 ov_model"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ac71e60d-5595-454e-8602-03ebb0248205",
"metadata": {},
"outputs": [],
"source": [
"ov_llm = HuggingFacePipeline.from_model_id(\n",
"    model_id=\"ov_model\",\n",
"    task=\"text-generation\",\n",
"    backend=\"openvino\",\n",
"    model_kwargs={\"device\": \"CPU\", \"ov_config\": ov_config},\n",
"    pipeline_kwargs={\"max_new_tokens\": 10},\n",
")\n",
"\n",
"ov_chain = prompt | ov_llm\n",
"\n",
"question = \"What is electroencephalography?\"\n",
"\n",
"print(ov_chain.invoke({\"question\": question}))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}