{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction to GPT-4o and GPT-4o mini\n",
"\n",
"GPT-4o (\"o\" for \"omni\") and GPT-4o mini are natively multimodal models designed to handle a combination of text, audio, and video inputs, and can generate outputs in text, audio, and image formats. GPT-4o mini is the lightweight version of GPT-4o.\n",
"\n",
"### Background\n",
"\n",
"Before GPT-4o, users could interact with ChatGPT using Voice Mode, which operated with three separate models. GPT-4o integrates these capabilities into a single model that's trained across text, vision, and audio. This unified approach ensures that all inputs — whether text, visual, or auditory — are processed cohesively by the same neural network.\n",
"\n",
"GPT-4o mini is the next iteration of this omni model family, available in a smaller and cheaper version. This model offers higher accuracy than GPT-3.5 Turbo while being just as fast and supporting multimodal inputs and outputs.\n",
"\n",
"### Current API Capabilities\n",
"\n",
"Currently, the API supports `{text, image}` inputs only, with `{text}` outputs, the same modalities as `gpt-4-turbo`.\n",
"\n",
"Additional modalities, including audio, will be introduced soon. This guide will help you get started with using GPT-4o mini for text, image, and video understanding."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting Started"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Install OpenAI SDK for Python\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install --upgrade openai"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure the OpenAI client and submit a test request\n",
"To setup the client for our use, we need to create an API key to use with our request. Skip these steps if you already have an API key for usage. \n",
"\n",
"You can get an API key by following these steps:\n",
"1. [Create a new project](https://help.openai.com/en/articles/9186755-managing-your-work-in-the-api-platform-with-projects)\n",
"2. [Generate an API key in your project](https://platform.openai.com/api-keys)\n",
"3. (RECOMMENDED, BUT NOT REQUIRED) [Setup your API key for all projects as an env var](https://platform.openai.com/docs/quickstart/step-2-set-up-your-api-key)\n",
"\n",
"Once we have this setup, let's start with a simple {text} input to the model for our first request. We'll use both `system` and `user` messages for our first request, and we'll receive a response from the `assistant` role."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI \n",
"import os\n",
"\n",
"## Set the API key and model name\n",
"MODEL=\"gpt-4o-mini\"\n",
"client = OpenAI(api_key=os.environ.get(\"OPENAI_API_KEY\", \"<your OpenAI API key if not set as an env var>\"))"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Assistant: Of course! \\( 2 + 2 = 4 \\).\n"
]
}
],
"source": [
"completion = client.chat.completions.create(\n",
" model=MODEL,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": \"You are a helpful assistant. Help me with my math homework!\"}, # <-- This is the system message that provides context to the model\n",
" {\"role\": \"user\", \"content\": \"Hello! Could you solve 2+2?\"} # <-- This is the user message for which the model will generate a response\n",
" ]\n",
")\n",
"\n",
"print(\"Assistant: \" + completion.choices[0].message.content)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Image Processing\n",
"GPT-4o mini can directly process images and take intelligent actions based on the image. We can provide images in two formats:\n",
"1. Base64 Encoded\n",
"2. URL\n",
"\n",
"Let's first view the image we'll use, then try sending this image as both Base64 and as a URL link to the API"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAnEAAAEcCAAAAACNZL39AAAACXBIWXMAAAsSAAALEgHS3X78AAAfF0lEQVR42u2d63LjuLKlv0xAAMmyXKf3mT3v/34zs0902S2JFICcH3K55Lsk60JKWBHd0W1bFJhYyAQSeRGjouKM0CqCisq4isq4iorKuIrKuIqKyriKyriKyriKisq4iiuFryI4BwTEfKLe8FTGnYVv4qw4S3iplKtW9QyE82TFMlqIlXGVEKdGF7xJInvKDwb5cevrr6r5U69pg7af2RAGaJdiLlXGVZyScE5my42ouwU6s6FZVsZVnAjNutw9Aob4BN0CaJdSKuMqTiLcu8dmnV0xA6RZuwGZDfjEDbtJKuNOBrUwiGmhXQAaejHDFZdddm64WblXxp1MsmjBJ54OCj6rZAPxSTXhwuJWF2KlxokId08JpChPJ9PsizfAEhguL5tbFUzVcSdAt6RZ4TJ/NmyzhOaNyLVEem51L1cZdwqhgst3j2LdPzwTTqT85tj8EV/wurpF4Vereny+SYfL8RELz4QjNSa0T//ziBbK4JHKuIrvopHIwpvr+bf1f37sV4j+9vyaFSsaHrnFvVy1qkeWJ+CT2OtdmvgEW55f2fzRDe7lqo47Kt8ktlCQaK+o1KSgbutnlhCN8QbtatVxxxSmFnDF3lFdgsvE1Yu1bkAc7MZmoDLuqHxzWYuz8laoqtuukicS4iS1yxuzrNWqHotv8rNEshq5vEehHFReBlwbuOKX3FiUZtVxR5KjFiCsbXOL+vb3Yf3W2LYrNlb4luKXqo47CiJGgGDY+/elNri3p4SlRS/ZIatbWptVx30f7QpXzGf7ZEvmCu+FxYmjaAa/vhVhVR33/UUra5psIdknx04pTt7zhIQ8s3wPtxOjWXXctyWIw4pK/nQzpqHn/dBfkbiiWd2MM7jquO+hk0DORUqxz3f/fSC8+wuTVWCNOr0Nb3Bl3LfglgwOkK/MYvn4D7IMmtFst0G5yrjDIdIUURyE/KVJ9ImPDgdFSiPJe4lyA5yr+7jDRRcTkhApu8hQtuLj3vmtc2uDdunytU9I1XGH8k2c5gx6vxPh2I6Pewd5CKBLn69ezVUdd5jYVIoRU9EdE+xdwZfPuPljIX/9BxW59gJMlXGHCI1ugcSVzHZOAnwdH/fOX4SBZp0JV55YWK3q3miFsHDOVth6Z3KEV/Fxb2G9xVX2hCvXAVXH7S8yMS2OLLpHyZq38XHvYpZ8uvIZqTpuX/zAXCFA2a9GUnb0O/xZEpGrLiNZGbcfGrdwmhuW+3oxIrn7OhAug85x13xgrYzbC7O+xFnxiWZf25cIi691nN2TH7rhmqel1gHeA+KN0HtJuL1DKAvQfblrNiH2izgUl69WiPXksLOoFLPYg8j+dBBiv1Otc1e6hU8OrrWSZrWqu1JGXJkbvUI4SP/0u9Wdzn6BEXK+1q1c1XG7wZlipV0XykF3AgJ3j2q7BF6KNEstV9v8oTJuJymFgU0PkKeCl/s/QaQIO35WlILMhniN+Q/Vqu4Ah+BQI5DKgUvUZtiuZUaseJgl+mucnarjvsQs4bIWvObDd/OiRfexx04yzuUrjF2q3pGvoBaR7NfyveunQtnnxJFF/vpPbsr1aYSq4z7H/YMnic0fvrmPF8LAXlX1hTCg5erOD3Uf9+msy8PPlDpzD9+vR7Pmg9SaD2AMf+ElXltXparjPpENIWeAI4SCi5iWu4d9PnL/ALh8bWmFVcd9TBINQ4dr7jhGAJEhPO71iV+G+tzQXFeOV9VxH0qmXeKTluPoGMHlsN43816epuea2ipVHffRXOsSTW1pj1VRMLPe+0lWWkW5KpdC1XHvwmdkNsDRNlECyCHkFWmWQa6okmZl3PuEa1nCXVkcTTxyIOMQQirE/loOEJVxbxBLQgugcsSIoYMZxyy5HPtmpdcRMlf3ca9wL0PC3UEIeRwhamtzXa8rrqQuetVxL+EKPjfLo/jgXgr6YB0Hgpj64To8c1XHvcCsQJotPTqmJD6zRkpQfxWOuarjtoXhsstfJs8f+Oxv6DhA2wX4dAWFcKqO2yKFh0wK2Ckcrg0WDv90WQLJ5elv5irjnvl2T5o50JPU/WjCiuY71aWLx5GlnXwqa7WqvwUhIhnBTrM93/Dk/u/vPGNGAv76n2lPWdVxADiPFecxO5FzP4q4uz2v8l9jnSMN/xOmreWqjoOn3pN3jyf0Pgg+0X33CkOgGcqBJxAZh3Ol6jhEWooapyQcQRNyhDszWZXWOMRNMuN/iTzvAvVisQGVcXMosZQTF25bF22//xTDGkrUA8qiz5L7v/hIlB8i4i3/14U4d+tWtVsn9QMixZ30TktNy1GcadKsUErT72tavSTnhpAoRF0eK+7vBhkXB6BZHSi+TRBRWNupxX9wtNK7T4rrIrpnZZJZwicxLYRhE4l1mamftlXVKAORsArcy0xmcS9j46LicTMdwoRuLC0w78tPk9zs87E2dRRaVxrNMMBPd5nxT1jHCcD8cZYKEAaftNCsdn4fifTRhmZ1ltV+PB0Hrtw94vJeil3oFk6SmE/QLb6V7X2rjBMxN7Men5pBMs6vCavd20ZG6bl7pFmd56rymIwD4W4Y2KMGiiOsaJc+rIrL7fKSkz5Zxklja02+lPhUeTL27ZJmvSt/hLtHYn+uQs/HZRw/FpvEwl1r4YjL3P/akO/CUX9T3ccJq94yqbiVmcnck91S71Y7xsnK7CcLtA8TrSyu8K8sjn63vZz4HPjFHBfI3YVnbqI6rqHH5flj+U3AdinWLtkpQy8OagZ3i3K2tz+yjoN2BY0Nux2y7x/ioG5ttEvCcNlj0kQZJ+At/7EpswT45PMOJdrUfHKZTzuzjZ9xSMh511Bl6RZigkFclzBwyaYRk7SqIh6aLH82McmBc6TGvvTsRwMEVTvnWm+O/kRLGUHYJXwpLgAL5qQv3UBMzeWCniap4/71P5tradtWIWLi/JcOA8VCSe2SMztAj6/j4MfCJ9CmfOkSEu4fNzvcJuUgPcr87wvN/BQZJ8wf1F51YfuxaJdfb1Hk90vvV3RmnIxDwtruHgHaxed/2T53AxAtxHWJ/VcfqYx7hvOp+ETs9x664Mps6Ba7e+2OSbjjMw6kWTVl/WOPsBdPRku4XGLY9PZxUtZC8e6pSK782PkVRDUzsKK1MxMOH0/zXFuxSrYUZh9SshHZ9of4fEchczEfyQRPDn9ZjsVmzJAfrUhvu10Qivy00hqN5/z25L4/0YMtYs7MJXn/nbu+Jy63frni8SfgFs2Fzg6Ts6oSe7rNNc39gyFSdovdnWUD7h4v5BiIw0msKuAK+KSzdX7nS9ulT7F/sc3T+S/iuhCtv4Qgpsc4lYLLYZDQm/gkzvU+fV0hpFtu7r65UPWOk5wcnh4toe8W6Bt3tmjBp9jzYs8mEAYRP9R93I4zVwIlDpgZtqaxXk2Lky8mZUWXu4Ta9XVYM+vjglC0eaNNnCQxvL6gIayRs+9kt0Y1LfGqaaFZOcrTbZb3vejn3
eXF4XstFy2IdUIdB0KzwuXXl/RqNEOhWb3wXHorBj5dqu7mxBgn+Gy8KDovsUfaxceUc79Fe8lyWCdlHAjE9etAcgnSE/wibLVbEtHclFS4lDtualbVk01jLNvH0x7CIgwf3CNFKR0uOv59hQb1GSauL+2r3tNS+rkOCx388Pyz4HJcDSWG5aWGOjEd58k+8SLIS2j74hPvuzR9RuIKCP1FB35iHQe+aCam/OKMcPfI/S8Nf27BXIFowzvHjMq496EGtEPevjNQc5kwvCtEIQzzB9zF9dvJGYeII9EMeXu5NevcrJ5vWH7XjnJxUWNHdoQFR7PMvDxpBZohmrw3zz7zOCddsUF9Fg0pO1Zl6/Yhsc6smphEpZNWXG5zTBwlVfvwlTGxk4Oa5lfx/a6IZt4Jp276sDZ8HkOd8NPrOGh6l3F5O2ROECngE2hBNYXh8FTL42Byt1zWvl4lyizDm/RNEXQwdamMgXAe5qcex4osCNuXrGaIJxacK47iSITVZVfg1M6qpXsk3b0o/efaITa
"text/plain": [
"<IPython.core.display.Image object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from IPython.display import Image, display, Audio, Markdown\n",
"import base64\n",
"\n",
"IMAGE_PATH = \"data/triangle.png\"\n",
"\n",
"# Preview image for context\n",
"display(Image(IMAGE_PATH))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Base64 Image Processing"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"To find the area of the triangle, you can use the formula:\n",
"\n",
"\\[\n",
"\\text{Area} = \\frac{1}{2} \\times \\text{base} \\times \\text{height}\n",
"\\]\n",
"\n",
"In the triangle you provided:\n",
"\n",
"- The base is \\(9\\) (the length at the bottom).\n",
"- The height is \\(5\\) (the vertical line from the top vertex to the base).\n",
"\n",
"Now, plug in the values:\n",
"\n",
"\\[\n",
"\\text{Area} = \\frac{1}{2} \\times 9 \\times 5\n",
"\\]\n",
"\n",
"Calculating this gives:\n",
"\n",
"\\[\n",
"\\text{Area} = \\frac{1}{2} \\times 45 = 22.5\n",
"\\]\n",
"\n",
"So, the area of the triangle is **22.5 square units**."
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Open the image file and encode it as a base64 string\n",
"def encode_image(image_path):\n",
" with open(image_path, \"rb\") as image_file:\n",
" return base64.b64encode(image_file.read()).decode(\"utf-8\")\n",
"\n",
"base64_image = encode_image(IMAGE_PATH)\n",
"\n",
"response = client.chat.completions.create(\n",
" model=MODEL,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": \"You are a helpful assistant that responds in Markdown. Help me with my math homework!\"},\n",
" {\"role\": \"user\", \"content\": [\n",
" {\"type\": \"text\", \"text\": \"What's the area of the triangle?\"},\n",
" {\"type\": \"image_url\", \"image_url\": {\n",
" \"url\": f\"data:image/png;base64,{base64_image}\"}\n",
" }\n",
" ]}\n",
" ],\n",
" temperature=0.0,\n",
")\n",
"\n",
"display(Markdown(response.choices[0].message.content))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### URL Image Processing"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"It seems there was an error processing the image, so I can't see the triangle or its dimensions. However, I can help you calculate the area of a triangle if you provide the base and height or the lengths of the sides.\n",
"\n",
"The area \\( A \\) of a triangle can be calculated using the formula:\n",
"\n",
"1. **Using base and height**:\n",
" \\[\n",
" A = \\frac{1}{2} \\times \\text{base} \\times \\text{height}\n",
" \\]\n",
"\n",
"2. **Using Heron's formula** (if you know the lengths of all three sides \\( a, b, c \\)):\n",
" \\[\n",
" s = \\frac{a + b + c}{2} \\quad \\text{(semi-perimeter)}\n",
" \\]\n",
" \\[\n",
" A = \\sqrt{s(s-a)(s-b)(s-c)}\n",
" \\]\n",
"\n",
"Please provide the necessary dimensions, and I'll help you calculate the area!"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"response = client.chat.completions.create(\n",
" model=MODEL,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": \"You are a helpful assistant that responds in Markdown. Help me with my math homework!\"},\n",
" {\"role\": \"user\", \"content\": [\n",
" {\"type\": \"text\", \"text\": \"What's the area of the triangle?\"},\n",
" {\"type\": \"image_url\", \"image_url\": {\n",
" \"url\": \"https://upload.wikimedia.org/wikipedia/commons/e/e2/The_Algebra_of_Mohammed_Ben_Musa_-_page_82b.png\"}\n",
" }\n",
" ]}\n",
" ],\n",
" temperature=0.0,\n",
")\n",
"\n",
"display(Markdown(response.choices[0].message.content))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Video Processing\n",
"While it's not possible to directly send a video to the API, GPT-4o can understand videos if you sample frames and then provide them as images. \n",
"\n",
"Since GPT-4o mini in the API does not yet support audio-in (as of July 2024), we'll use a combination of GPT-4o mini and Whisper to process both the audio and visual for a provided video, and showcase two usecases:\n",
"1. Summarization\n",
"2. Question and Answering\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setup for Video Processing\n",
"We'll use two python packages for video processing - opencv-python and moviepy. \n",
"\n",
"These require [ffmpeg](https://ffmpeg.org/about.html), so make sure to install this beforehand. Depending on your OS, you may need to run `brew install ffmpeg` or `sudo apt install ffmpeg`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install opencv-python\n",
"%pip install moviepy"
]
},
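{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before processing any video, it can help to confirm that `ffmpeg` is actually on your `PATH`. The optional check below is a minimal sketch using only the Python standard library; the install commands it mentions are examples and will vary by OS."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: confirm ffmpeg is available before processing any video\n",
"import shutil\n",
"\n",
"ffmpeg_path = shutil.which(\"ffmpeg\")\n",
"if ffmpeg_path:\n",
"    print(f\"ffmpeg found at {ffmpeg_path}\")\n",
"else:\n",
"    print(\"ffmpeg not found - install it first (e.g. `brew install ffmpeg` or `sudo apt install ffmpeg`)\")"
]
},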
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Process the video into two components: frames and audio"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"import cv2\n",
"from moviepy.editor import VideoFileClip\n",
"import time\n",
"import base64\n",
"\n",
"# We'll be using the OpenAI DevDay Keynote Recap video. You can review the video here: https://www.youtube.com/watch?v=h02ti0Bl6zk\n",
"VIDEO_PATH = \"data/keynote_recap.mp4\""
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MoviePy - Writing audio in data/keynote_recap.mp3\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" "
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"MoviePy - Done.\n",
"Extracted 218 frames\n",
"Extracted audio to data/keynote_recap.mp3\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\r"
]
}
],
"source": [
"def process_video(video_path, seconds_per_frame=2):\n",
" base64Frames = []\n",
" base_video_path, _ = os.path.splitext(video_path)\n",
"\n",
" video = cv2.VideoCapture(video_path)\n",
" total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))\n",
" fps = video.get(cv2.CAP_PROP_FPS)\n",
" frames_to_skip = int(fps * seconds_per_frame)\n",
" curr_frame=0\n",
"\n",
" # Loop through the video and extract frames at specified sampling rate\n",
" while curr_frame < total_frames - 1:\n",
" video.set(cv2.CAP_PROP_POS_FRAMES, curr_frame)\n",
" success, frame = video.read()\n",
" if not success:\n",
" break\n",
" _, buffer = cv2.imencode(\".jpg\", frame)\n",
" base64Frames.append(base64.b64encode(buffer).decode(\"utf-8\"))\n",
" curr_frame += frames_to_skip\n",
" video.release()\n",
"\n",
" # Extract audio from video\n",
" audio_path = f\"{base_video_path}.mp3\"\n",
" clip = VideoFileClip(video_path)\n",
" clip.audio.write_audiofile(audio_path, bitrate=\"32k\")\n",
" clip.audio.close()\n",
" clip.close()\n",
"\n",
" print(f\"Extracted {len(base64Frames)} frames\")\n",
" print(f\"Extracted audio to {audio_path}\")\n",
" return base64Frames, audio_path\n",
"\n",
"# Extract 1 frame per second. You can adjust the `seconds_per_frame` parameter to change the sampling rate\n",
"base64Frames, audio_path = process_video(VIDEO_PATH, seconds_per_frame=1)\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"image/jpeg": "/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAIBAQEBAQIBAQECAgICAgQDAgICAgUEBAMEBgUGBgYFBgYGBwkIBgcJBwYGCAsICQoKCgoKBggLDAsKDAkKCgr/2wBDAQICAgICAgUDAwUKBwYHCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgoKCgr/wAARCALQBQADASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD4Dooor6g/lcKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiig
AooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooA
"text/plain": [
"<IPython.core.display.Image object>"
]
},
"metadata": {
"image/jpeg": {
"width": 600
}
},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"\n",
" <audio controls=\"controls\" >\n",
" <source src=\"data:audio/mpeg;base64,SUQzBAAAAAAAI1RTU0UAAAAPAAADTGF2ZjU4LjI5LjEwMAAAAAAAAAAAAAAA//tQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAASW5mbwAAAA8AAB+PAAziXQADBggLDRASFRcaHB8hJCYpKy4wMzU4Oj1AQkZIS01QUlVXWlxfYWRmaWtucHN1eHp9gIKFh4qMkJKVl5qcn6GkpqmrrrCztbi6vcDCxcfKzM/R1Nba3N/h5Obp6+7w8/X4+v0AAAAATGF2YzU4LjU0AAAAAAAAAAAAAAAAJAU0AAAAAAAM4l3hR4sIAAAAAAAAAAAAAAAAAAAAAP/7EGQAD/AAAGkAAAAIAAANIAAAAQAAAaQAAAAgAAA0gAAABAUQKxYyaP/mjIKIFYqaMnPzJpUx8eTAfuTAz4CD8MHfgy0p0DOUWP4gqYR24rJ+wBBIBamBwyjXBuQSvMpccm1qsYxi//sSZCIP8AAAaQAAAAgAAA0gAAABAAABpAAAACAAADSAAAAEgGaL0VOC5BghiRyRt1wuLvaMTZJYul3q9H00qgAHRGYAKdxAIlFB8GYpKKEvXU6iz2fRruDjMBLugaPJKekXGxlDtObQ//sQZESP8AAAaQAAAAgAAA0gAAABAAABpAAAACAAADSAAAAExzHdMnRA2HhQYPeab/Xp1/H6//9Di9RMws3uMGFKMamXqb6t//fVx/wpPyYeGiCoEHN4gs1nrPU3/UAEINIAF4AeCS7/+xJkZo/wAABpAAAACAAADSAAAAEAAAGkAAAAIAAANIAAAAQCPpTTf3r/PfsRQAAg1nEDHcJDaNSB3mL3G+L/1dfYuApuAAFurHDiP/JKAAQJlQZVAAG20Vpf8P3gEXMFyaLkZzr6tnz/+xBkiQMwAABpAAAACAAADSAAAAEBJALXAARgICSAWuAAjARHFSsLiABFc6MmG9CIykl35wAhpksERo6K4MrewAcOD0ptWeU1R9Ob4/9JREBBnMCFs84SA2eLopso/s9v7FUAAAhGAP/7EkShgRBtAMCAIxgCDqDYQARgCEOAGxEgjECgT4BktACMBBhqkKNw3lKD8rTItgoxFFHRKFj98PlrX6S1N1RmoAABGqCNUAAVBAgUCXDHRHpWmvX1cp/agpFUfYuqSlPQQGkmydtFo//7EGSdifC+BkbJJiiQFKDIkATKBAIoSRgEgExARgMiwAM0CBw9ZC44fDNcBvzlJrMHFLJkpkx4bujZPAPaW4kO35a73j/65lUAAIAgYAAKPQcj40TmB/zw9oDAkiWatNbRAGAAha+6//sSZJeBUJsGRoGBWYAUQMjVDSUSApwbLaCUISArgic0ECQGjyyGhcAAEiJACp2IS5VIvUoPDxq0Xdq734PcXszOguQqoCEysxNnGkVAaYgU+qmtrsZUqx267IA7oiz0+Onfk2oAAhZi//sQZJYBENcGSWmBOYAQYNjQCYMSA0gZIMCxYIBMgiW0EKAEo0vuMpNWB2m5926qy7s/XegAAACCBlwAAEpJKcpJw8WwFcpHscaowak0+M0NpDTmsruo+2BPKg1KUlVjsRH4w2dqmC7/+xJki4mRFAhHSG8YQBgA2U0EZRID4CEewbxBAFMDZGQmIEjukc7QiSPCBxIkfxoAgXkSMF3HocusvqSmGYnexR1cpSq5aK+EXKCUCK0IIg+oG1mDGReGnxiXS+31e2sAAIBg2xBmE1P/+xBkdokRGghH0G8oQB2CSPklojID8GkhJ4RMwFMDZfQShCDATOBff9GD+k6ohdRh63TnZrT+FawADjDUYAAgDgQKstlp8A4+mlFmxLvhHZFaHIIQZcI45cE4Fhk+WyMCsrHRQzhgev/7EmReCZD+B8poLDgoFyI40DwlWARUSRonhE4AVgNkJBSUELLxS7k3/aQge4hbzaIYlDRU0pgxWAXIqI78+lyPf61bh6pAAM+AQ/MXUPOBtEkpLVO9wbH7y07daYCJ/T6ihggbGDYR3//7EGRJDRDuCUhB7BDAGsDJSgWFBANcHyRI4SAAawMlqYSITEDVHlYqBSxTwfsI1Oeo0hOJv6P9YCGHUxNhOB6wWkuDwyETWApUtp93VfuhEYz4Y/R3FICdMRg6BE8USugZq8BKrXY5//sSZDOBkOgaSsNYKCgaYNkRaywAA7hHJs0w4kBWg2TZlhRI+SsuqiwofCZywTYKGPzqvFRgHpJ7/LsielBW/8v+ocBAJiKBMHTKDBMUD/+j+tUx5Sp1mIND2VE2xKW0nLb+c+whmOvX//sQZCCPcNQRyYMaOCAPwNmFPGIVA0RHJA0ArEA9g2VBhIhIgE/9NMNPHukS0RBIItjPu2u4jhgZokd50Cx8vqv0mfe4XtL3VHmclBLDEiCObQxAnfTf+i/+KfABtCUFpj4wIwNT+Fn/+xJkGI8wpQbKAxpYIA8iOWBhIhICRBswDCQCwDiDZhTxAMxexvK/1GygFKJiLE/8r6jByl+Slw3Y9VF6Hi5aEDYLZX/p/prIVJ8iRD1L4sLiXG+S8qDAJqwdhjI1g+UDOfQKS1ogdwX/+xBkGQfwhAbMgww4kA7g2VBhghIByBs0rCQC4D6DZYD0iEjsxARHUuGlvR6P5eoqFdH05wi6FEE9vgy53IfZxlJVisNIfnheqhG8s7o9MAgnQDMhqhW0erDZfq+V/rhEzBQBbECYkv/7EmQcD/B/BkyBuUgAD4D5YGFnEgIEGzYMJKJAO4PmAPMcSJ8IMvdo8rWoYonlQZ5SzQ71BYr/22KBsm+xPY59NhHGo//87WDrCDACCSNA2KeV/5b+j2mApK0FgoWQ7hR/u9P9SugMJ//7EGQfD/BvBk2DBiiQDsDJkBnnBAHcDzYMYMAAO4NmAYScSKQYUHnaEC4HT9/F/6cB0hVWlWRllAnP/+S/p4ccK6Me5nUEM1QnBYinooZBq0TCMbTRX1BzU3fyVboMGaQK+LWKujjf//sSZCOP8H4GzYMMEJAOAJmAYSASAeAhNAwMRmA6g2XAnCQAABi//KVBGgc4F4rI2HQIOOj/lZolYCkmmDzsYznWEdV39eJYqj+FKsBaSPt4OrEf/LKIgVoJa8CMN9sb6TqBmFp11k5B//sQZCgP8H0GzoMPKJAOYNmQPAIwAeQbOAw84kA+A2YBhKxIrKRJDnsdaqgcb/ytYMQKGAURtKaorCyQNlvb/+iCErL8AFFcyHhFIRN/6f6aqA6VODdpHhQIud49PE0f4Yuoar6EAg3/+xJkK4/wgQhNgwgRmA6A2bBh6hICLBs4DDziQD4DZsDzKICcgEUeGjbv6I6IoggCcWVK3FO9RC8vfu8rNDph56agRr1i/pgTdI9FfAEhlKAskMdM5jXz049yckj3/20oNCZBGeSecT7/+xBkLg/whAhNgwwQsA6A2bBhKhICHB84DDziQDsDJsGGIExPONxV/9NQK2NwCL5FB6DokysCaT1u8lXEDRbT5FD5u3VX1v//WqUVOXGDAJq2mQjW
BRrf+/+U/rXgAFhgACkN4BK4tP/7EmQwj/CdBk4B+HgAD0DZoGAGMAIwGzYMJCSAOQHmgYxgAMDf/13RyY68BOGhu2QN0wJIv/7v1aHaEkBAFZlg9K/wh79/lOsQYMOzSUpVh3wYWR/xD/K1H8XnBil53bGN2FmqMj6qYf/7EGQxjzCUEc6DCTkADyDZ5T1CEQIoGzoMICSAPINmwYMEkE4FZBQBa1/DyKnAO4//qf/NYoug1KzwcbsaNUIn/9UKQAaISLsUMGNLhopBG/dy8JoeQqwkOCIzvoM/o9+C2jAVLkBF//sSZDKP8IMGzoMGOQAPIMmgPwYCAlQZOAxhAEA3AycBhhxIf8iI6+hv1fKQMgKwwAKUj5VpgOqu/rXByzo9HUKSjjOKt5kINN7rvEMHxoEmEbkZzWecFs6Q/zkGk4LKp5tfgDhbfiPE//sQZDUD8IMG0CsPKJgNgMmwQekCAfAbPAwkBIA4g2cAp5wQX/yPQwCEXssaG0tAKm/+V/
" Your browser does not support the audio element.\n",
" </audio>\n",
" "
],
"text/plain": [
"<IPython.lib.display.Audio object>"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"## Display the frames and audio for context\n",
"display_handle = display(None, display_id=True)\n",
"for img in base64Frames:\n",
" display_handle.update(Image(data=base64.b64decode(img.encode(\"utf-8\")), width=600))\n",
" time.sleep(0.025)\n",
"\n",
"Audio(audio_path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example 1: Summarization\n",
"Now that we have both the video frames and the audio, let's run a few different tests to generate a video summary to compare the results of using the models with different modalities. We should expect to see that the summary generated with context from both visual and audio inputs will be the most accurate, as the model is able to use the entire context from the video.\n",
"\n",
"1. Visual Summary\n",
"2. Audio Summary\n",
"3. Visual + Audio Summary\n",
"\n",
"#### Visual Summary\n",
"The visual summary is generated by sending the model only the frames from the video. With just the frames, the model is likely to capture the visual aspects, but will miss any details discussed by the speaker."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"# OpenAI Dev Day Summary\n",
"\n",
"## Overview\n",
"The video captures highlights from OpenAI's Dev Day, showcasing new advancements and features in AI technology, particularly focusing on the latest updates to their models and tools.\n",
"\n",
"## Key Highlights\n",
"\n",
"### Event Introduction\n",
"- The event is branded as \"OpenAI Dev Day,\" setting the stage for discussions on AI advancements.\n",
"\n",
"### Keynote Recap\n",
"- The keynote features a recap of significant updates and innovations in OpenAI's offerings.\n",
"\n",
"### New Features and Models\n",
"- Introduction of **GPT-4 Turbo** and **DALL-E 3**, emphasizing improvements in performance and capabilities.\n",
"- Discussion on **JSON Mode** and **Function Calling**, showcasing how these features enhance user interaction with AI.\n",
"\n",
"### Enhanced User Experience\n",
"- Presentation of new functionalities that allow for better control and expanded knowledge in AI interactions.\n",
"- Emphasis on **context length** and **more control** over AI responses.\n",
"\n",
"### Pricing and Efficiency\n",
"- Announcement of pricing structures for GPT-4 Turbo, highlighting cost-effectiveness with reduced token usage.\n",
"\n",
"### Custom Models\n",
"- Introduction of custom models that allow developers to tailor AI functionalities to specific needs.\n",
"\n",
"### Community Engagement\n",
"- Encouragement for developers to build applications using natural language, fostering a collaborative environment.\n",
"\n",
"### Closing Remarks\n",
"- The event concludes with a call to action for developers to engage with OpenAI's tools and contribute to the AI ecosystem.\n",
"\n",
"## Conclusion\n",
"OpenAI Dev Day serves as a platform for unveiling new technologies and fostering community engagement, aiming to empower developers with advanced AI tools and capabilities."
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"response = client.chat.completions.create(\n",
" model=MODEL,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": \"You are generating a video summary. Please provide a summary of the video. Respond in Markdown.\"},\n",
" {\"role\": \"user\", \"content\": [\n",
" \"These are the frames from the video.\",\n",
" *map(lambda x: {\"type\": \"image_url\", \n",
" \"image_url\": {\"url\": f'data:image/jpg;base64,{x}', \"detail\": \"low\"}}, base64Frames)\n",
" ],\n",
" }\n",
" ],\n",
" temperature=0,\n",
")\n",
"display(Markdown(response.choices[0].message.content))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The results are as expected - the model is able to capture the high level aspects of the video visuals, but misses the details provided in the speech.\n",
"\n",
"#### Audio Summary\n",
"The audio summary is generated by sending the model the audio transcript. With just the audio, the model is likely to bias towards the audio content, and will miss the context provided by the presentations and visuals.\n",
"\n",
"`{audio}` input for GPT-4o isn't currently available but will be coming soon! For now, we use our existing `whisper-1` model to process the audio"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"# OpenAI Dev Day Summary\n",
"\n",
"Welcome to the inaugural OpenAI Dev Day, where several exciting updates and features were announced:\n",
"\n",
"## Key Announcements\n",
"\n",
"- **Launch of GPT-4 Turbo**: \n",
" - Supports up to **128,000 tokens** of context.\n",
" - Introduces **JSON mode** for valid JSON responses.\n",
" - Improved function calling capabilities.\n",
"\n",
"- **Knowledge Retrieval**: \n",
" - New feature allowing models to access external documents and databases for enhanced knowledge.\n",
"\n",
"- **Dolly 3 and Vision Models**: \n",
" - Integration of Dolly 3, GPT-4 Turbo with Vision, and a new Text-to-Speech model into the API.\n",
"\n",
"- **Custom Models Program**: \n",
" - Collaboration with companies to create tailored models for specific use cases.\n",
"\n",
"- **Increased Rate Limits**: \n",
" - Doubling of tokens per minute for established GPT-4 customers, with options for further adjustments in API settings.\n",
"\n",
"- **Cost Efficiency**: \n",
" - GPT-4 Turbo is **3x cheaper** for prompt tokens and **2x cheaper** for completion tokens compared to GPT-4.\n",
"\n",
"- **Introduction of GPTs**: \n",
" - Tailored versions of ChatGPT for specific purposes, allowing users to create private or public GPTs easily through conversation.\n",
"\n",
"- **Upcoming GPT Store**: \n",
" - Launching later this month for sharing GPT creations.\n",
"\n",
"- **Assistance API Enhancements**: \n",
" - Features include persistent threads, built-in retrieval, a code interpreter, and improved function calling.\n",
"\n",
"## Conclusion\n",
"\n",
"OpenAI is excited about the future of AI integration and the potential for users to leverage these new tools. The team looks forward to seeing the innovative applications that will emerge from these advancements. Thank you for participating in this event!"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Transcribe the audio\n",
"transcription = client.audio.transcriptions.create(\n",
" model=\"whisper-1\",\n",
" file=open(audio_path, \"rb\"),\n",
")\n",
"## OPTIONAL: Uncomment the line below to print the transcription\n",
"#print(\"Transcript: \", transcription.text + \"\\n\\n\")\n",
"\n",
"response = client.chat.completions.create(\n",
" model=MODEL,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\":\"\"\"You are generating a transcript summary. Create a summary of the provided transcription. Respond in Markdown.\"\"\"},\n",
" {\"role\": \"user\", \"content\": [\n",
" {\"type\": \"text\", \"text\": f\"The audio transcription is: {transcription.text}\"}\n",
" ],\n",
" }\n",
" ],\n",
" temperature=0,\n",
")\n",
"display(Markdown(response.choices[0].message.content))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The audio summary is biased towards the content discussed during the speech, but comes out with much less structure than the video summary.\n",
"\n",
"#### Audio + Visual Summary\n",
"The Audio + Visual summary is generated by sending the model both the visual and the audio from the video at once. When sending both of these, the model is expected to better summarize since it can perceive the entire video at once."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"# OpenAI Dev Day Summary\n",
"\n",
"## Overview\n",
"The first-ever OpenAI Dev Day introduced several exciting updates and features, primarily focusing on the launch of **GPT-4 Turbo**. This new model enhances capabilities and expands the potential for developers.\n",
"\n",
"## Key Announcements\n",
"\n",
"### 1. **GPT-4 Turbo**\n",
"- Supports up to **128,000 tokens** of context.\n",
"- Offers improved performance in following instructions and handling multiple function calls.\n",
"\n",
"### 2. **JSON Mode**\n",
"- A new feature that ensures responses are formatted in valid JSON, enhancing data handling.\n",
"\n",
"### 3. **Retrieval Feature**\n",
"- Allows models to access external knowledge from documents or databases, improving the accuracy and relevance of responses.\n",
"\n",
"### 4. **DALL·E 3 and Vision Capabilities**\n",
"- Introduction of **DALL·E 3**, **GPT-4 Turbo with Vision**, and a new **Text-to-Speech** model, all available in the API.\n",
"\n",
"### 5. **Custom Models Program**\n",
"- A new initiative where OpenAI researchers collaborate with companies to create tailored models for specific use cases.\n",
"\n",
"### 6. **Rate Limits and Pricing**\n",
"- Doubling of tokens per minute for established GPT-4 customers.\n",
"- **GPT-4 Turbo** is significantly cheaper, with a **3x reduction** in prompt tokens and **2x reduction** in completion tokens.\n",
"\n",
"### 7. **Introduction of GPTs**\n",
"- Tailored versions of ChatGPT designed for specific purposes, combining instructions, expanded knowledge, and actions.\n",
"- Users can create private or public GPTs without needing coding skills.\n",
"\n",
"### 8. **Assistance API**\n",
"- Features persistent threads, built-in retrieval, a code interpreter, and improved function calling, making it easier for developers to manage conversations and data.\n",
"\n",
"## Conclusion\n",
"The event highlighted OpenAI's commitment to enhancing AI capabilities and accessibility for developers. The advancements presented are expected to empower users to create innovative applications and solutions. OpenAI looks forward to future developments and encourages ongoing collaboration with the developer community."
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"## Generate a summary with visual and audio\n",
"response = client.chat.completions.create(\n",
" model=MODEL,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\":\"\"\"You are generating a video summary. Create a summary of the provided video and its transcript. Respond in Markdown\"\"\"},\n",
" {\"role\": \"user\", \"content\": [\n",
" \"These are the frames from the video.\",\n",
" *map(lambda x: {\"type\": \"image_url\", \n",
" \"image_url\": {\"url\": f'data:image/jpg;base64,{x}', \"detail\": \"low\"}}, base64Frames),\n",
" {\"type\": \"text\", \"text\": f\"The audio transcription is: {transcription.text}\"}\n",
" ],\n",
" }\n",
"],\n",
" temperature=0,\n",
")\n",
"display(Markdown(response.choices[0].message.content))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After combining both the video and audio, we're able to get a much more detailed and comprehensive summary for the event which uses information from both the visual and audio elements from the video.\n",
"\n",
"### Example 2: Question and Answering\n",
"For the Q&A, we'll use the same concept as before to ask questions of our processed video while running the same 3 tests to demonstrate the benefit of combining input modalities:\n",
"1. Visual Q&A\n",
"2. Audio Q&A\n",
"3. Visual + Audio Q&A "
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"QUESTION = \"Question: Why did Sam Altman have an example about raising windows and turning the radio on?\""
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"Visual QA:Sam Altman used the example of raising windows and turning the radio on to illustrate the concept of function calling in AI. This example demonstrates how AI can interpret natural language commands and translate them into specific functions or actions, making interactions more intuitive and user-friendly. By showing a relatable scenario, he highlighted the advancements in AI's ability to understand and execute complex instructions seamlessly."
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"qa_visual_response = client.chat.completions.create(\n",
" model=MODEL,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": \"Use the video to answer the provided question. Respond in Markdown.\"},\n",
" {\"role\": \"user\", \"content\": [\n",
" \"These are the frames from the video.\",\n",
" *map(lambda x: {\"type\": \"image_url\", \"image_url\": {\"url\": f'data:image/jpg;base64,{x}', \"detail\": \"low\"}}, base64Frames),\n",
" QUESTION\n",
" ],\n",
" }\n",
" ],\n",
" temperature=0,\n",
")\n",
"display(Markdown(\"Visual QA:\" + qa_visual_response.choices[0].message.content))"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"Audio QA:\n",
"The transcription provided does not include any mention of Sam Altman discussing raising windows or turning the radio on. Therefore, I cannot provide an answer to that specific question based on the given transcription. If you have more context or another transcription that includes that example, please share it, and I would be happy to help!"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"qa_audio_response = client.chat.completions.create(\n",
" model=MODEL,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\":\"\"\"Use the transcription to answer the provided question. Respond in Markdown.\"\"\"},\n",
" {\"role\": \"user\", \"content\": f\"The audio transcription is: {transcription.text}. \\n\\n {QUESTION}\"},\n",
" ],\n",
" temperature=0,\n",
")\n",
"display(Markdown(\"Audio QA:\\n\" + qa_audio_response.choices[0].message.content))"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"Both QA:\n",
"Sam Altman used the example of raising windows and turning the radio on to illustrate the new function calling feature in the GPT-4 Turbo model. This example demonstrates how the model can interpret natural language commands and translate them into specific function calls, making it easier for users to interact with the system in a more intuitive way. It highlights the model's ability to understand context and execute multiple actions based on user instructions."
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"qa_both_response = client.chat.completions.create(\n",
" model=MODEL,\n",
" messages=[\n",
" {\"role\": \"system\", \"content\":\"\"\"Use the video and transcription to answer the provided question.\"\"\"},\n",
" {\"role\": \"user\", \"content\": [\n",
" \"These are the frames from the video.\",\n",
" *map(lambda x: {\"type\": \"image_url\", \n",
" \"image_url\": {\"url\": f'data:image/jpg;base64,{x}', \"detail\": \"low\"}}, base64Frames),\n",
" {\"type\": \"text\", \"text\": f\"The audio transcription is: {transcription.text}\"},\n",
" QUESTION\n",
" ],\n",
" }\n",
" ],\n",
" temperature=0,\n",
")\n",
"display(Markdown(\"Both QA:\\n\" + qa_both_response.choices[0].message.content))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Comparing the three answers, the most accurate answer is generated by using both the audio and visual from the video. Sam Altman did not discuss the raising windows or radio on during the Keynote, but referenced an improved capability for the model to execute multiple functions in a single request while the examples were shown behind him.\n",
"\n",
"## Conclusion\n",
"\n",
"Integrating many input modalities such as audio, visual, and textual, significantly enhances the performance of the model on a diverse range of tasks. This multimodal approach allows for more comprehensive understanding and interaction, mirroring more closely how humans perceive and process information. \n",
"\n",
"Currently, GPT-4o and GPT-4o mini in the API support text and image inputs, with audio capabilities coming soon."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}