{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "25750d3a",
   "metadata": {},
   "source": [
    "# How to automate tasks with functions (S3 bucket example)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "5478248c",
   "metadata": {},
   "source": [
    "This notebook demonstrates how to use ChatGPT function calling to perform tasks on Amazon S3 buckets. It covers key S3 operations: listing buckets and objects, searching for a specific file across all buckets, uploading a file to a bucket, and downloading a file from a bucket. The OpenAI Chat API interprets the user's instructions, selects the appropriate function call based on the input, and generates a natural-language response.\n",
    "\n",
    "**Requirements**:\n",
    "To run the notebook, generate an AWS access key with S3 write permission and store the key ID and secret in a local environment file alongside the OpenAI API key. The `.env` file format:\n",
    "```\n",
    "AWS_ACCESS_KEY_ID=<your-key>\n",
    "AWS_SECRET_ACCESS_KEY=<your-key>\n",
    "OPENAI_API_KEY=<your-key>\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ed2fd156",
   "metadata": {},
   "outputs": [],
   "source": [
    "! pip install openai\n",
    "! pip install boto3\n",
    "! pip install tenacity\n",
    "! pip install python-dotenv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "9617e95e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from openai import OpenAI\n",
    "import json\n",
    "import boto3\n",
    "import os\n",
    "import datetime\n",
    "from urllib.request import urlretrieve\n",
    "\n",
    "# load environment variables\n",
    "from dotenv import load_dotenv\n",
    "load_dotenv() "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "c8fb09f0",
   "metadata": {},
   "source": [
    "## Initial setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "6d5b1991",
   "metadata": {},
   "outputs": [],
   "source": [
    "# The OpenAI client created below reads OPENAI_API_KEY from the environment (loaded from .env above)\n",
    "GPT_MODEL = \"gpt-3.5-turbo\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "a571b8d7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: if the environment file fails to load, set the AWS credentials directly with the code below\n",
    "# os.environ['AWS_ACCESS_KEY_ID'] = ''\n",
    "# os.environ['AWS_SECRET_ACCESS_KEY'] = ''\n",
    "\n",
    "# Create S3 client\n",
    "s3_client = boto3.client('s3')\n",
    "\n",
    "# Create openai client\n",
    "client = OpenAI()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "3cfd8cf9",
   "metadata": {},
   "source": [
    "## Utilities"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "487f43bd",
   "metadata": {},
   "source": [
    "To connect user questions or commands to the appropriate function, we need to provide ChatGPT with the necessary function details and expected parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "da4a804b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# List of tool definitions that describe the S3 operations to the GPT model\n",
    "functions = [\n",
    "    {   \n",
    "        \"type\": \"function\",\n",
    "        \"function\":{\n",
    "            \"name\": \"list_buckets\",\n",
    "            \"description\": \"List all available S3 buckets\",\n",
    "            \"parameters\": {\n",
    "                \"type\": \"object\",\n",
    "                \"properties\": {}\n",
    "            }\n",
    "        }\n",
    "    },\n",
    "    {\n",
    "        \"type\": \"function\",\n",
    "        \"function\":{\n",
    "            \"name\": \"list_objects\",\n",
    "            \"description\": \"List the objects or files inside a given S3 bucket\",\n",
    "            \"parameters\": {\n",
    "                \"type\": \"object\",\n",
    "                \"properties\": {\n",
    "                    \"bucket\": {\"type\": \"string\", \"description\": \"The name of the S3 bucket\"},\n",
    "                    \"prefix\": {\"type\": \"string\", \"description\": \"The folder path in the S3 bucket\"},\n",
    "                },\n",
    "                \"required\": [\"bucket\"],\n",
    "            },\n",
    "        }\n",
    "    },\n",
    "    {   \n",
    "        \"type\": \"function\",\n",
    "        \"function\":{\n",
    "            \"name\": \"download_file\",\n",
    "            \"description\": \"Download a specific file from an S3 bucket to a local destination folder.\",\n",
    "            \"parameters\": {\n",
    "                \"type\": \"object\",\n",
    "                \"properties\": {\n",
    "                    \"bucket\": {\"type\": \"string\", \"description\": \"The name of the S3 bucket\"},\n",
    "                    \"key\": {\"type\": \"string\", \"description\": \"The path to the file inside the bucket\"},\n",
    "                    \"directory\": {\"type\": \"string\", \"description\": \"The local destination directory to download the file to; it should be specified by the user.\"},\n",
    "                },\n",
    "                \"required\": [\"bucket\", \"key\", \"directory\"],\n",
    "            }\n",
    "        }\n",
    "    },\n",
    "    {\n",
    "        \"type\": \"function\",\n",
    "        \"function\":{\n",
    "            \"name\": \"upload_file\",\n",
    "            \"description\": \"Upload a file to an S3 bucket\",\n",
    "            \"parameters\": {\n",
    "                \"type\": \"object\",\n",
    "                \"properties\": {\n",
    "                    \"source\": {\"type\": \"string\", \"description\": \"The local source path or remote URL\"},\n",
    "                    \"bucket\": {\"type\": \"string\", \"description\": \"The name of the S3 bucket\"},\n",
    "                    \"key\": {\"type\": \"string\", \"description\": \"The path to the file inside the bucket\"},\n",
    "                    \"is_remote_url\": {\"type\": \"boolean\", \"description\": \"Is the provided source a URL (True) or local path (False)\"},\n",
    "                },\n",
    "                \"required\": [\"source\", \"bucket\", \"key\", \"is_remote_url\"],\n",
    "            }\n",
    "        }\n",
    "    },\n",
    "    {\n",
    "        \"type\": \"function\",\n",
    "        \"function\":{\n",
    "            \"name\": \"search_s3_objects\",\n",
    "            \"description\": \"Search for a specific file name inside an S3 bucket\",\n",
    "            \"parameters\": {\n",
    "                \"type\": \"object\",\n",
    "                \"properties\": {\n",
    "                    \"search_name\": {\"type\": \"string\", \"description\": \"The name of the file you want to search for\"},\n",
    "                    \"bucket\": {\"type\": \"string\", \"description\": \"The name of the S3 bucket\"},\n",
    "                    \"prefix\": {\"type\": \"string\", \"description\": \"The folder path in the S3 bucket\"},\n",
    "                    \"exact_match\": {\"type\": \"boolean\", \"description\": \"Set exact_match to True if the search should match the exact file name. Set exact_match to False to match any file whose name contains the search string\"}\n",
    "                },\n",
    "                \"required\": [\"search_name\"],\n",
    "            },\n",
    "        }\n",
    "    }\n",
    "]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "4cf69ccd",
   "metadata": {},
   "source": [
    "Create helper functions to interact with the S3 service, such as listing buckets, listing objects, downloading and uploading files, and searching for specific files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "cf30f14e",
   "metadata": {},
   "outputs": [],
   "source": [
    "def datetime_converter(obj):\n",
    "    if isinstance(obj, datetime.datetime):\n",
    "        return obj.isoformat()\n",
    "    raise TypeError(f\"Object of type {obj.__class__.__name__} is not JSON serializable\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "37736b74",
   "metadata": {},
   "outputs": [],
   "source": [
    "def list_buckets():\n",
    "    response = s3_client.list_buckets()\n",
    "    return json.dumps(response['Buckets'], default=datetime_converter)\n",
    "\n",
    "def list_objects(bucket, prefix=''):\n",
    "    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)\n",
    "    return json.dumps(response.get('Contents', []), default=datetime_converter)\n",
    "\n",
    "def download_file(bucket, key, directory):\n",
    "    filename = os.path.basename(key)\n",
    "\n",
    "    # Create the local directory if it does not exist yet\n",
    "    os.makedirs(directory or '.', exist_ok=True)\n",
    "\n",
    "    # Resolve destination to the correct file path\n",
    "    destination = os.path.join(directory, filename)\n",
    "\n",
    "    s3_client.download_file(bucket, key, destination)\n",
    "    return json.dumps({\"status\": \"success\", \"bucket\": bucket, \"key\": key, \"destination\": destination})\n",
    "\n",
    "def upload_file(source, bucket, key, is_remote_url=False):\n",
    "    if is_remote_url:\n",
    "        file_name = os.path.basename(source)\n",
    "        urlretrieve(source, file_name)\n",
    "        source = file_name\n",
    "       \n",
    "    s3_client.upload_file(source, bucket, key)\n",
    "    return json.dumps({\"status\": \"success\", \"source\": source, \"bucket\": bucket, \"key\": key})\n",
    "\n",
    "def search_s3_objects(search_name, bucket=None, prefix='', exact_match=True):\n",
    "    search_name = search_name.lower()\n",
    "    \n",
    "    if bucket is None:\n",
    "        buckets_response = json.loads(list_buckets())\n",
    "        buckets = [bucket_info[\"Name\"] for bucket_info in buckets_response]\n",
    "    else:\n",
    "        buckets = [bucket]\n",
    "\n",
    "    results = []\n",
    "\n",
    "    for bucket_name in buckets:\n",
    "        objects_response = json.loads(list_objects(bucket_name, prefix))\n",
    "        if exact_match:\n",
    "            bucket_results = [obj for obj in objects_response if search_name == obj['Key'].lower()]\n",
    "        else:\n",
    "            bucket_results = [obj for obj in objects_response if search_name in obj['Key'].lower()]\n",
    "\n",
    "        if bucket_results:\n",
    "            results.extend([{\"Bucket\": bucket_name, \"Object\": obj} for obj in bucket_results])\n",
    "\n",
    "    return json.dumps(results)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "ec9df90d",
   "metadata": {},
   "source": [
    "The dictionary below maps each function name to its implementation, so that the function requested in a ChatGPT response can be looked up and executed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "03c3d555",
   "metadata": {},
   "outputs": [],
   "source": [
    "available_functions = {\n",
    "    \"list_buckets\": list_buckets,\n",
    "    \"list_objects\": list_objects,\n",
    "    \"download_file\": download_file,\n",
    "    \"upload_file\": upload_file,\n",
    "    \"search_s3_objects\": search_s3_objects\n",
    "}"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "447b03ca",
   "metadata": {},
   "source": [
    "## ChatGPT"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "796acfdd",
   "metadata": {},
   "outputs": [],
   "source": [
    "def chat_completion_request(messages, functions=None, function_call='auto', \n",
    "                            model_name=GPT_MODEL):\n",
    "    \n",
    "    if functions is not None:\n",
    "        return client.chat.completions.create(\n",
    "            model=model_name,\n",
    "            messages=messages,\n",
    "            tools=functions,\n",
    "            tool_choice=function_call)\n",
    "    else:\n",
    "        return client.chat.completions.create(\n",
    "            model=model_name,\n",
    "            messages=messages)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "b5ce70f8",
   "metadata": {},
   "source": [
    "### Conversation flow"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "1bdaa162",
   "metadata": {},
   "source": [
    "Create a main function for the chatbot, which takes user input, sends it to the OpenAI Chat API, receives a response, executes any function calls generated by the API, and returns a final response to the user."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "3e2e9192",
   "metadata": {},
   "outputs": [],
   "source": [
    "def run_conversation(user_input, topic=\"S3 bucket functions\", is_log=False):\n",
    "\n",
    "    system_message = f\"Don't make assumptions about what values to plug into functions. Ask for clarification if a user request is ambiguous. If the user asks a question not related to {topic}, respond that your scope is {topic} only.\"\n",
    "    \n",
    "    messages = [{\"role\": \"system\", \"content\": system_message},\n",
    "                {\"role\": \"user\", \"content\": user_input}]\n",
    "    \n",
    "    # Call the model to get a response\n",
    "    response = chat_completion_request(messages, functions=functions)\n",
    "    response_message = response.choices[0].message\n",
    "    \n",
    "    if is_log:\n",
    "        print(response.choices)\n",
    "    \n",
    "    # Check whether the model requested one or more tool calls\n",
    "    if response_message.tool_calls:\n",
    "        # Add the assistant message that carries the tool calls to the conversation\n",
    "        messages.append(response_message)\n",
    "        \n",
    "        # Execute every requested call; the API expects one tool message per tool_call_id\n",
    "        for tool_call in response_message.tool_calls:\n",
    "            function_name = tool_call.function.name\n",
    "            function_args = json.loads(tool_call.function.arguments)\n",
    "            \n",
    "            # Call the function and add its result to the conversation\n",
    "            function_response = available_functions[function_name](**function_args)\n",
    "            messages.append({\n",
    "                \"role\": \"tool\",\n",
    "                \"content\": function_response,\n",
    "                \"tool_call_id\": tool_call.id,\n",
    "            })\n",
    "        \n",
    "        # Call the model again to summarize the results\n",
    "        second_response = chat_completion_request(messages)\n",
    "        final_message = second_response.choices[0].message.content\n",
    "    else:\n",
    "        final_message = response_message.content\n",
    "\n",
    "    return final_message"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "2069a3a8",
   "metadata": {},
   "source": [
    "### S3 bucket bot testing\n",
    "In the following examples, make sure to replace the placeholders such as `<file_name>`, `<bucket_name>`, and `<directory_path>` with your specific values before execution."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "7d00b643",
   "metadata": {},
   "source": [
    "#### Listing and searching"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "7f89ea89",
   "metadata": {},
   "source": [
    "Let's start by listing all the available buckets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a8cafc30",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(run_conversation('list my S3 buckets'))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "e9f051a1",
   "metadata": {},
   "source": [
    "You can ask the assistant to search for a specific file name either in all the buckets or in a specific one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c5a18e3a",
   "metadata": {},
   "outputs": [],
   "source": [
    "search_file = '<file_name>'\n",
    "print(run_conversation(f'search for a file {search_file} in all buckets'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "afc2d6e6",
   "metadata": {},
   "outputs": [],
   "source": [
    "search_word = '<file_name_part>'\n",
    "bucket_name = '<bucket_name>'\n",
    "print(run_conversation(f'search for a file whose name contains {search_word} in {bucket_name}'))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "9bc4a08d",
   "metadata": {},
   "source": [
    "As instructed in the system message, the model is expected to ask the user for clarification when the parameter values are ambiguous."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "c58d7372",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sure, to help me find what you're looking for, could you please provide the name of the file you want to search for and the name of the S3 bucket? Also, should the search match the file name exactly, or should it also consider partial matches?\n"
     ]
    }
   ],
   "source": [
    "print(run_conversation('search for a file'))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "24df23df",
   "metadata": {},
   "source": [
    "#### Validate edge cases\n",
    "\n",
    "We also instructed the model to reject irrelevant tasks. Let's test it out and see how it works in action."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "baaedf21",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Apologies for the misunderstanding, but I am only able to assist with S3 bucket functions. Can you please ask a question related to S3 bucket functions?\n"
     ]
    }
   ],
   "source": [
    "# The model should decline questions that fall outside its scope\n",
    "print(run_conversation('what is the weather today'))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "d65d3617",
   "metadata": {},
   "source": [
    "The provided functions are not limited to just retrieving information. They can also assist the user in uploading or downloading files."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "ad1b0659",
   "metadata": {},
   "source": [
    "#### Download a file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7c9bc251",
   "metadata": {},
   "outputs": [],
   "source": [
    "search_file = '<file_name>'\n",
    "bucket_name = '<bucket_name>'\n",
    "local_directory = '<directory_path>'\n",
    "print(run_conversation(f'download {search_file} from {bucket_name} bucket to {local_directory} directory'))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "95f1810e",
   "metadata": {},
   "source": [
    "#### Upload a file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0078875c",
   "metadata": {},
   "outputs": [],
   "source": [
    "local_file = '<file_name>'\n",
    "bucket_name = '<bucket_name>'\n",
    "print(run_conversation(f'upload {local_file} to {bucket_name} bucket'))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}