"GPT-4o (\"o\" for \"omni\") and GPT-4o mini are natively multimodal models designed to handle a combination of text, audio, and video inputs, and can generate outputs in text, audio, and image formats. GPT-4o mini is the lightweight version of GPT-4o.\n",
"Before GPT-4o, users could interact with ChatGPT using Voice Mode, which operated with three separate models. GPT-4o integrates these capabilities into a single model that's trained across text, vision, and audio. This unified approach ensures that all inputs — whether text, visual, or auditory — are processed cohesively by the same neural network.\n",
"\n",
"GPT-4o mini is the next iteration of this omni model family, available in a smaller and cheaper version. This model offers higher accuracy than GPT-3.5 Turbo while being just as fast and supporting multimodal inputs and outputs.\n",
"Currently, the API supports `{text, image}` inputs only, with `{text}` outputs, the same modalities as `gpt-4-turbo`.\n",
"\n",
"Additional modalities, including audio, will be introduced soon. This guide will help you get started with using GPT-4o mini for text, image, and video understanding."
"### Configure the OpenAI client and submit a test request\n",
"To setup the client for our use, we need to create an API key to use with our request. Skip these steps if you already have an API key for usage. \n",
"\n",
"You can get an API key by following these steps:\n",
"1. [Create a new project](https://help.openai.com/en/articles/9186755-managing-your-work-in-the-api-platform-with-projects)\n",
"2. [Generate an API key in your project](https://platform.openai.com/api-keys)\n",
"3. (RECOMMENDED, BUT NOT REQUIRED) [Setup your API key for all projects as an env var](https://platform.openai.com/docs/quickstart/step-2-set-up-your-api-key)\n",
"\n",
"Once we have this setup, let's start with a simple {text} input to the model for our first request. We'll use both `system` and `user` messages for our first request, and we'll receive a response from the `assistant` role."
" {\"role\": \"system\", \"content\": \"You are a helpful assistant. Help me with my math homework!\"}, # <-- This is the system message that provides context to the model\n",
" {\"role\": \"user\", \"content\": \"Hello! Could you solve 2+2?\"} # <-- This is the user message for which the model will generate a response\n",
"It seems there was an error processing the image, so I can't see the triangle or its dimensions. However, I can help you calculate the area of a triangle if you provide the base and height or the lengths of the sides.\n",
"\n",
"The area \\( A \\) of a triangle can be calculated using the formula:\n",
"\n",
"1. **Using base and height**:\n",
" \\[\n",
" A = \\frac{1}{2} \\times \\text{base} \\times \\text{height}\n",
" \\]\n",
"\n",
"2. **Using Heron's formula** (if you know the lengths of all three sides \\( a, b, c \\)):\n",
" \\[\n",
" s = \\frac{a + b + c}{2} \\quad \\text{(semi-perimeter)}\n",
" \\]\n",
" \\[\n",
" A = \\sqrt{s(s-a)(s-b)(s-c)}\n",
" \\]\n",
"\n",
"Please provide the necessary dimensions, and I'll help you calculate the area!"
"Since GPT-4o mini in the API does not yet support audio-in (as of July 2024), we'll use a combination of GPT-4o mini and Whisper to process both the audio and visual for a provided video, and showcase two usecases:\n",
"We'll use two python packages for video processing - opencv-python and moviepy. \n",
"\n",
"These require [ffmpeg](https://ffmpeg.org/about.html), so make sure to install this beforehand. Depending on your OS, you may need to run `brew install ffmpeg` or `sudo apt install ffmpeg`"
"Now that we have both the video frames and the audio, let's run a few different tests to generate a video summary to compare the results of using the models with different modalities. We should expect to see that the summary generated with context from both visual and audio inputs will be the most accurate, as the model is able to use the entire context from the video.\n",
"\n",
"1. Visual Summary\n",
"2. Audio Summary\n",
"3. Visual + Audio Summary\n",
"\n",
"#### Visual Summary\n",
"The visual summary is generated by sending the model only the frames from the video. With just the frames, the model is likely to capture the visual aspects, but will miss any details discussed by the speaker."
"The video captures highlights from OpenAI's Dev Day, showcasing new advancements and features in AI technology, particularly focusing on the latest updates to their models and tools.\n",
"\n",
"## Key Highlights\n",
"\n",
"### Event Introduction\n",
"- The event is branded as \"OpenAI Dev Day,\" setting the stage for discussions on AI advancements.\n",
"\n",
"### Keynote Recap\n",
"- The keynote features a recap of significant updates and innovations in OpenAI's offerings.\n",
"\n",
"### New Features and Models\n",
"- Introduction of **GPT-4 Turbo** and **DALL-E 3**, emphasizing improvements in performance and capabilities.\n",
"- Discussion on **JSON Mode** and **Function Calling**, showcasing how these features enhance user interaction with AI.\n",
"\n",
"### Enhanced User Experience\n",
"- Presentation of new functionalities that allow for better control and expanded knowledge in AI interactions.\n",
"- Emphasis on **context length** and **more control** over AI responses.\n",
"\n",
"### Pricing and Efficiency\n",
"- Announcement of pricing structures for GPT-4 Turbo, highlighting cost-effectiveness with reduced token usage.\n",
"\n",
"### Custom Models\n",
"- Introduction of custom models that allow developers to tailor AI functionalities to specific needs.\n",
"\n",
"### Community Engagement\n",
"- Encouragement for developers to build applications using natural language, fostering a collaborative environment.\n",
"\n",
"### Closing Remarks\n",
"- The event concludes with a call to action for developers to engage with OpenAI's tools and contribute to the AI ecosystem.\n",
"\n",
"## Conclusion\n",
"OpenAI Dev Day serves as a platform for unveiling new technologies and fostering community engagement, aiming to empower developers with advanced AI tools and capabilities."
"The results are as expected - the model is able to capture the high level aspects of the video visuals, but misses the details provided in the speech.\n",
"\n",
"#### Audio Summary\n",
"The audio summary is generated by sending the model the audio transcript. With just the audio, the model is likely to bias towards the audio content, and will miss the context provided by the presentations and visuals.\n",
"\n",
"`{audio}` input for GPT-4o isn't currently available but will be coming soon! For now, we use our existing `whisper-1` model to process the audio"
"Welcome to the inaugural OpenAI Dev Day, where several exciting updates and features were announced:\n",
"\n",
"## Key Announcements\n",
"\n",
"- **Launch of GPT-4 Turbo**: \n",
" - Supports up to **128,000 tokens** of context.\n",
" - Introduces **JSON mode** for valid JSON responses.\n",
" - Improved function calling capabilities.\n",
"\n",
"- **Knowledge Retrieval**: \n",
" - New feature allowing models to access external documents and databases for enhanced knowledge.\n",
"\n",
"- **Dolly 3 and Vision Models**: \n",
" - Integration of Dolly 3, GPT-4 Turbo with Vision, and a new Text-to-Speech model into the API.\n",
"\n",
"- **Custom Models Program**: \n",
" - Collaboration with companies to create tailored models for specific use cases.\n",
"\n",
"- **Increased Rate Limits**: \n",
" - Doubling of tokens per minute for established GPT-4 customers, with options for further adjustments in API settings.\n",
"\n",
"- **Cost Efficiency**: \n",
" - GPT-4 Turbo is **3x cheaper** for prompt tokens and **2x cheaper** for completion tokens compared to GPT-4.\n",
"\n",
"- **Introduction of GPTs**: \n",
" - Tailored versions of ChatGPT for specific purposes, allowing users to create private or public GPTs easily through conversation.\n",
"\n",
"- **Upcoming GPT Store**: \n",
" - Launching later this month for sharing GPT creations.\n",
"\n",
"- **Assistance API Enhancements**: \n",
" - Features include persistent threads, built-in retrieval, a code interpreter, and improved function calling.\n",
"\n",
"## Conclusion\n",
"\n",
"OpenAI is excited about the future of AI integration and the potential for users to leverage these new tools. The team looks forward to seeing the innovative applications that will emerge from these advancements. Thank you for participating in this event!"
" {\"role\": \"system\", \"content\":\"\"\"You are generating a transcript summary. Create a summary of the provided transcription. Respond in Markdown.\"\"\"},\n",
" {\"role\": \"user\", \"content\": [\n",
" {\"type\": \"text\", \"text\": f\"The audio transcription is: {transcription.text}\"}\n",
"The audio summary is biased towards the content discussed during the speech, but comes out with much less structure than the video summary.\n",
"\n",
"#### Audio + Visual Summary\n",
"The Audio + Visual summary is generated by sending the model both the visual and the audio from the video at once. When sending both of these, the model is expected to better summarize since it can perceive the entire video at once."
"The first-ever OpenAI Dev Day introduced several exciting updates and features, primarily focusing on the launch of **GPT-4 Turbo**. This new model enhances capabilities and expands the potential for developers.\n",
"\n",
"## Key Announcements\n",
"\n",
"### 1. **GPT-4 Turbo**\n",
"- Supports up to **128,000 tokens** of context.\n",
"- Offers improved performance in following instructions and handling multiple function calls.\n",
"\n",
"### 2. **JSON Mode**\n",
"- A new feature that ensures responses are formatted in valid JSON, enhancing data handling.\n",
"\n",
"### 3. **Retrieval Feature**\n",
"- Allows models to access external knowledge from documents or databases, improving the accuracy and relevance of responses.\n",
"\n",
"### 4. **DALL·E 3 and Vision Capabilities**\n",
"- Introduction of **DALL·E 3**, **GPT-4 Turbo with Vision**, and a new **Text-to-Speech** model, all available in the API.\n",
"\n",
"### 5. **Custom Models Program**\n",
"- A new initiative where OpenAI researchers collaborate with companies to create tailored models for specific use cases.\n",
"\n",
"### 6. **Rate Limits and Pricing**\n",
"- Doubling of tokens per minute for established GPT-4 customers.\n",
"- **GPT-4 Turbo** is significantly cheaper, with a **3x reduction** in prompt tokens and **2x reduction** in completion tokens.\n",
"\n",
"### 7. **Introduction of GPTs**\n",
"- Tailored versions of ChatGPT designed for specific purposes, combining instructions, expanded knowledge, and actions.\n",
"- Users can create private or public GPTs without needing coding skills.\n",
"\n",
"### 8. **Assistance API**\n",
"- Features persistent threads, built-in retrieval, a code interpreter, and improved function calling, making it easier for developers to manage conversations and data.\n",
"\n",
"## Conclusion\n",
"The event highlighted OpenAI's commitment to enhancing AI capabilities and accessibility for developers. The advancements presented are expected to empower users to create innovative applications and solutions. OpenAI looks forward to future developments and encourages ongoing collaboration with the developer community."
" {\"role\": \"system\", \"content\":\"\"\"You are generating a video summary. Create a summary of the provided video and its transcript. Respond in Markdown\"\"\"},\n",
"After combining both the video and audio, we're able to get a much more detailed and comprehensive summary for the event which uses information from both the visual and audio elements from the video.\n",
"\n",
"### Example 2: Question and Answering\n",
"For the Q&A, we'll use the same concept as before to ask questions of our processed video while running the same 3 tests to demonstrate the benefit of combining input modalities:\n",
"Visual QA:Sam Altman used the example of raising windows and turning the radio on to illustrate the concept of function calling in AI. This example demonstrates how AI can interpret natural language commands and translate them into specific functions or actions, making interactions more intuitive and user-friendly. By showing a relatable scenario, he highlighted the advancements in AI's ability to understand and execute complex instructions seamlessly."
"The transcription provided does not include any mention of Sam Altman discussing raising windows or turning the radio on. Therefore, I cannot provide an answer to that specific question based on the given transcription. If you have more context or another transcription that includes that example, please share it, and I would be happy to help!"
"Sam Altman used the example of raising windows and turning the radio on to illustrate the new function calling feature in the GPT-4 Turbo model. This example demonstrates how the model can interpret natural language commands and translate them into specific function calls, making it easier for users to interact with the system in a more intuitive way. It highlights the model's ability to understand context and execute multiple actions based on user instructions."
"Comparing the three answers, the most accurate answer is generated by using both the audio and visual from the video. Sam Altman did not discuss the raising windows or radio on during the Keynote, but referenced an improved capability for the model to execute multiple functions in a single request while the examples were shown behind him.\n",
"Integrating many input modalities such as audio, visual, and textual, significantly enhances the performance of the model on a diverse range of tasks. This multimodal approach allows for more comprehensive understanding and interaction, mirroring more closely how humans perceive and process information. \n",