openai-cookbook/examples/Whisper_processing_guide.ipynb
Gabor Cselle 2c441ab9a2
Migrate all notebooks to API V1 (#914)
Co-authored-by: ayush rajgor <ayushrajgorar@gmail.com>
2024-01-24 19:05:14 -06:00


{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Enhancing Whisper transcriptions: pre- & post-processing techniques\n",
"\n",
"This notebook offers a guide to improving Whisper's transcriptions. We'll prepare the audio by trimming leading silence and splitting it into segments, which can improve Whisper's transcription quality. After transcription, we'll refine the output by adding punctuation, adjusting product terminology (e.g., 'five two nine' to '529'), and mitigating Unicode issues. These strategies will help improve the clarity of your transcriptions, but remember that customization for your specific use case may be beneficial.\n",
"\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"To get started, let's import a few different libraries:\n",
"\n",
"- [PyDub](http://pydub.com/) is a simple and easy-to-use Python library for audio processing tasks such as slicing, concatenating, and exporting audio files.\n",
"\n",
"- The `Audio` class from the `IPython.display` module allows you to create an audio control that can play sound in Jupyter notebooks, providing a straightforward way to play audio data directly in your notebook.\n",
"\n",
"- For our audio file, we'll use a fictional earnings call written by ChatGPT and read aloud by the author. This audio file is relatively short, but it should give you an illustrative idea of how these pre- and post-processing steps can be applied to any audio file."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"import os\n",
"import urllib.request\n",
"from IPython.display import Audio\n",
"from pathlib import Path\n",
"from pydub import AudioSegment\n",
"import ssl"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [],
"source": [
"client = OpenAI(api_key=os.environ.get(\"OPENAI_API_KEY\", \"<your OpenAI API key if not set as env var>\"))"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"('data/EarningsCall.wav', <http.client.HTTPMessage at 0x11be41f50>)"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
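],
"source": [
"# set download paths\n",
"earnings_call_remote_filepath = \"https://cdn.openai.com/API/examples/data/EarningsCall.wav\"\n",
"\n",
"# set local save locations\n",
"earnings_call_filepath = \"data/EarningsCall.wav\"\n",
"\n",
"# download example audio files and save locally\n",
"Path(earnings_call_filepath).parent.mkdir(parents=True, exist_ok=True)  # make sure the data/ directory exists\n",
"# NOTE: the next line disables HTTPS certificate verification; remove it if your certificates are set up correctly\n",
"ssl._create_default_https_context = ssl._create_unverified_context\n",
"urllib.request.urlretrieve(earnings_call_remote_filepath, earnings_call_filepath)"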
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"At times, files with long silences at the beginning can cause Whisper to transcribe the audio incorrectly. We'll use PyDub to detect and trim the silence.\n",
"\n",
"Here, we've set the silence threshold to -20 dBFS (decibels relative to full scale). You can change this if you would like."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"# Function to detect leading silence\n",
"# Returns the number of milliseconds until the first sound (chunk averaging more than X decibels)\n",
"def milliseconds_until_sound(sound, silence_threshold_in_decibels=-20.0, chunk_size=10):\n",
"    trim_ms = 0  # ms\n",
"\n",
"    assert chunk_size > 0  # to avoid infinite loop\n",
"    while sound[trim_ms:trim_ms+chunk_size].dBFS < silence_threshold_in_decibels and trim_ms < len(sound):\n",
"        trim_ms += chunk_size\n",
"\n",
"    return trim_ms\n"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"def trim_start(filepath):\n",
"    path = Path(filepath)\n",
"    directory = path.parent\n",
"    filename = path.name\n",
"    audio = AudioSegment.from_file(filepath, format=\"wav\")\n",
"    start_trim = milliseconds_until_sound(audio)\n",
"    trimmed = audio[start_trim:]\n",
"    new_filename = directory / f\"trimmed_{filename}\"\n",
"    trimmed.export(new_filename, format=\"wav\")\n",
"    return trimmed, new_filename\n"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"def transcribe_audio(file, output_dir):\n",
"    audio_path = os.path.join(output_dir, file)\n",
"    with open(audio_path, 'rb') as audio_data:\n",
"        transcription = client.audio.transcriptions.create(\n",
"            model=\"whisper-1\", file=audio_data)\n",
"        return transcription.text"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"At times, we've seen Unicode character injection in transcripts. Removing any non-ASCII characters should help mitigate this issue.\n",
"\n",
"Keep in mind that you should not use this function if you are transcribing a language written in a non-Latin script, such as Greek, Cyrillic, Arabic, or Chinese."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"# Define function to remove non-ascii characters\n",
"def remove_non_ascii(text):\n",
"    return ''.join(i for i in text if ord(i) < 128)\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"This function will add formatting and punctuation to our transcript. Whisper generates a transcript with punctuation but without formatting."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"# Define function to add punctuation\n",
"def punctuation_assistant(ascii_transcript):\n",
"\n",
"    system_prompt = \"\"\"You are a helpful assistant that adds punctuation to text.\n",
"    Preserve the original words and only insert necessary punctuation such as periods,\n",
"    commas, capitalization, symbols like dollar signs or percentage signs, and formatting.\n",
"    Use only the context provided. If there is no context provided say, 'No context provided'\\n\"\"\"\n",
"    response = client.chat.completions.create(\n",
"        model=\"gpt-3.5-turbo\",\n",
"        temperature=0,\n",
"        messages=[\n",
"            {\n",
"                \"role\": \"system\",\n",
"                \"content\": system_prompt\n",
"            },\n",
"            {\n",
"                \"role\": \"user\",\n",
"                \"content\": ascii_transcript\n",
"            }\n",
"        ]\n",
"    )\n",
"    return response\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Our audio file is a recording of a fake earnings call that mentions many financial products. If Whisper transcribes these financial product names incorrectly, this function can help correct them."
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"# Define function to fix product misspellings\n",
"def product_assistant(ascii_transcript):\n",
"    system_prompt = \"\"\"You are an intelligent assistant specializing in financial products;\n",
"    your task is to process transcripts of earnings calls, ensuring that all references to\n",
"    financial products and common financial terms are in the correct format. For each\n",
"    financial product or common term that is typically abbreviated as an acronym, the full term \n",
"    should be spelled out followed by the acronym in parentheses. For example, '401k' should be\n",
"    transformed to '401(k) retirement savings plan', 'HSA' should be transformed to 'Health Savings Account (HSA)'\n",
"    , 'ROA' should be transformed to 'Return on Assets (ROA)', 'VaR' should be transformed to 'Value at Risk (VaR)'\n",
", and 'PB' should be transformed to 'Price to Book (PB) ratio'. Similarly, transform spoken numbers representing \n",
"financial products into their numeric representations, followed by the full name of the product in parentheses. \n",
"For instance, 'five two nine' to '529 (Education Savings Plan)' and 'four zero one k' to '401(k) (Retirement Savings Plan)'.\n",
"    However, be aware that some acronyms can have different meanings based on the context (e.g., 'LTV' can stand for \n",
"'Loan to Value' or 'Lifetime Value'). You will need to discern from the context which term is being referred to \n",
"and apply the appropriate transformation. In cases where numerical figures or metrics are spelled out but do not \n",
"represent specific financial products (like 'twenty three percent'), these should be left as is. Your role is to\n",
"    analyze and adjust financial product terminology in the text. Once you've done that, produce the adjusted \n",
"    transcript and a list of the words you've changed\"\"\"\n",
"    response = client.chat.completions.create(\n",
"        model=\"gpt-4\",\n",
"        temperature=0,\n",
"        messages=[\n",
"            {\n",
"                \"role\": \"system\",\n",
"                \"content\": system_prompt\n",
"            },\n",
"            {\n",
"                \"role\": \"user\",\n",
"                \"content\": ascii_transcript\n",
"            }\n",
"        ]\n",
"    )\n",
"    return response\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The `trim_start` function will create a new file with 'trimmed_' prefixed to the original file name."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"# Trim the start of the original audio file\n",
"# trim_start returns both the trimmed AudioSegment and the path of the exported file\n",
"trimmed_audio, trimmed_filename = trim_start(earnings_call_filepath)\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Our fake earnings call audio file is fairly short, so we'll split it into one-minute segments. Keep in mind that you can adjust the segment length as needed."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [],
"source": [
"# Segment audio\n",
"trimmed_audio = AudioSegment.from_wav(trimmed_filename)  # Load the trimmed audio file\n",
"\n",
"one_minute = 1 * 60 * 1000  # Duration for each segment (in milliseconds)\n",
"\n",
"start_time = 0  # Start time for the first segment\n",
"\n",
"i = 0  # Index for naming the segmented files\n",
"\n",
"output_dir_trimmed = \"trimmed_earnings_directory\"  # Output directory for the segmented files\n",
"\n",
"if not os.path.isdir(output_dir_trimmed):  # Create the output directory if it does not exist\n",
"    os.makedirs(output_dir_trimmed)\n",
"\n",
"while start_time < len(trimmed_audio):  # Loop over the trimmed audio file\n",
"    segment = trimmed_audio[start_time:start_time + one_minute]  # Extract a segment\n",
"    segment.export(os.path.join(output_dir_trimmed, f\"trimmed_{i:02d}.wav\"), format=\"wav\")  # Save the segment\n",
"    start_time += one_minute  # Update the start time for the next segment\n",
"    i += 1  # Increment the index for naming the next file\n"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [],
"source": [
"# Get list of trimmed and segmented audio files and sort them numerically\n",
"audio_files = sorted(\n",
"    (f for f in os.listdir(output_dir_trimmed) if f.endswith(\".wav\")),\n",
"    key=lambda f: int(''.join(filter(str.isdigit, f)))\n",
")\n"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
"# Use a loop to apply the transcribe function to all audio files\n",
"transcriptions = [transcribe_audio(file, output_dir_trimmed) for file in audio_files]\n"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
"# Concatenate the transcriptions\n",
"full_transcript = ' '.join(transcriptions)"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Good afternoon, everyone. And welcome to FinTech Plus Sync's second quarter 2023 earnings call. I'm John Doe, CEO of FinTech Plus. We've had a stellar Q2 with a revenue of 125 million, a 25% increase year over year. Our gross profit margin stands at a solid 58%, due in part to cost efficiencies gained from our scalable business model. Our EBITDA has surged to 37.5 million, translating to a remarkable 30% EBITDA margin. Our net income for the quarter rose to 16 million, which is a noteworthy increase from 10 million in Q2 2022. Our total addressable market has grown substantially thanks to the expansion of our high yield savings product line and the new RoboAdvisor platform. We've been diversifying our asset-backed securities portfolio, investing heavily in collateralized. debt obligations, and residential mortgage-backed securities. We've also invested $25 million in AAA rated corporate bonds, enhancing our risk adjusted returns. As for our balance sheet, total assets reached $1.5 billion with total liabilities at $900 million, leaving us with a solid equity base of $600 million. Our debt-to-equity ratio stands at 1.5, a healthy figure considering our expansionary phase. We continue to see substantial organic user growth, with customer acquisition cost dropping by 15% and lifetime value growing by 25%. Our LTVCAC ratio is at an impressive 3.5%. In terms of risk management, we have a value-at-risk model in place with a 99%... confidence level indicating that our maximum loss will not exceed 5 million in the next trading day. We've adopted a conservative approach to managing our leverage and have a healthy tier one capital ratio of 12.5%. Our forecast for the coming quarter is positive. We expect revenue to be around 135 million and 8% quarter over quarter growth driven primarily by our cutting edge blockchain solutions and AI driven predictive analytics. We're also excited about the upcoming IPO of our FinTech subsidiary Pay Plus, which we expect to raise 200 million, significantly bolstering our liquidity and paving the way for aggressive growth strategies. We thank our shareholders for their continued faith in us and we look forward to an even more successful Q3. Thank you so much.\n"
]
}
],
"source": [
"print(full_transcript)"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [],
"source": [
"# Remove non-ascii characters from the transcript\n",
"ascii_transcript = remove_non_ascii(full_transcript)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Good afternoon, everyone. And welcome to FinTech Plus Sync's second quarter 2023 earnings call. I'm John Doe, CEO of FinTech Plus. We've had a stellar Q2 with a revenue of 125 million, a 25% increase year over year. Our gross profit margin stands at a solid 58%, due in part to cost efficiencies gained from our scalable business model. Our EBITDA has surged to 37.5 million, translating to a remarkable 30% EBITDA margin. Our net income for the quarter rose to 16 million, which is a noteworthy increase from 10 million in Q2 2022. Our total addressable market has grown substantially thanks to the expansion of our high yield savings product line and the new RoboAdvisor platform. We've been diversifying our asset-backed securities portfolio, investing heavily in collateralized. debt obligations, and residential mortgage-backed securities. We've also invested $25 million in AAA rated corporate bonds, enhancing our risk adjusted returns. As for our balance sheet, total assets reached $1.5 billion with total liabilities at $900 million, leaving us with a solid equity base of $600 million. Our debt-to-equity ratio stands at 1.5, a healthy figure considering our expansionary phase. We continue to see substantial organic user growth, with customer acquisition cost dropping by 15% and lifetime value growing by 25%. Our LTVCAC ratio is at an impressive 3.5%. In terms of risk management, we have a value-at-risk model in place with a 99%... confidence level indicating that our maximum loss will not exceed 5 million in the next trading day. We've adopted a conservative approach to managing our leverage and have a healthy tier one capital ratio of 12.5%. Our forecast for the coming quarter is positive. We expect revenue to be around 135 million and 8% quarter over quarter growth driven primarily by our cutting edge blockchain solutions and AI driven predictive analytics. We're also excited about the upcoming IPO of our FinTech subsidiary Pay Plus, which we expect to raise 200 million, significantly bolstering our liquidity and paving the way for aggressive growth strategies. We thank our shareholders for their continued faith in us and we look forward to an even more successful Q3. Thank you so much.\n"
]
}
],
"source": [
"print(ascii_transcript)"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [],
"source": [
"# Use punctuation assistant function\n",
"response = punctuation_assistant(ascii_transcript)"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [],
"source": [
"# Extract the punctuated transcript from the model's response\n",
"punctuated_transcript = response.choices[0].message.content\n"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Good afternoon, everyone. And welcome to FinTech Plus Sync's second quarter 2023 earnings call. I'm John Doe, CEO of FinTech Plus. We've had a stellar Q2 with a revenue of $125 million, a 25% increase year over year. Our gross profit margin stands at a solid 58%, due in part to cost efficiencies gained from our scalable business model. Our EBITDA has surged to $37.5 million, translating to a remarkable 30% EBITDA margin. Our net income for the quarter rose to $16 million, which is a noteworthy increase from $10 million in Q2 2022. Our total addressable market has grown substantially thanks to the expansion of our high yield savings product line and the new RoboAdvisor platform. We've been diversifying our asset-backed securities portfolio, investing heavily in collateralized debt obligations, and residential mortgage-backed securities. We've also invested $25 million in AAA rated corporate bonds, enhancing our risk-adjusted returns. As for our balance sheet, total assets reached $1.5 billion with total liabilities at $900 million, leaving us with a solid equity base of $600 million. Our debt-to-equity ratio stands at 1.5, a healthy figure considering our expansionary phase. We continue to see substantial organic user growth, with customer acquisition cost dropping by 15% and lifetime value growing by 25%. Our LTVCAC ratio is at an impressive 3.5%. In terms of risk management, we have a value-at-risk model in place with a 99% confidence level indicating that our maximum loss will not exceed $5 million in the next trading day. We've adopted a conservative approach to managing our leverage and have a healthy tier one capital ratio of 12.5%. Our forecast for the coming quarter is positive. We expect revenue to be around $135 million and 8% quarter over quarter growth driven primarily by our cutting-edge blockchain solutions and AI-driven predictive analytics. We're also excited about the upcoming IPO of our FinTech subsidiary Pay Plus, which we expect to raise $200 million, significantly bolstering our liquidity and paving the way for aggressive growth strategies. We thank our shareholders for their continued faith in us and we look forward to an even more successful Q3. Thank you so much.\n"
]
}
],
"source": [
"print(punctuated_transcript)"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [],
"source": [
"# Use product assistant function\n",
"response = product_assistant(punctuated_transcript)\n"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [],
"source": [
"# Extract the final transcript from the model's response\n",
"final_transcript = response.choices[0].message.content"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Good afternoon, everyone. And welcome to FinTech Plus Sync's second quarter 2023 earnings call. I'm John Doe, CEO of FinTech Plus. We've had a stellar second quarter (Q2) with a revenue of $125 million, a 25% increase year over year. Our gross profit margin stands at a solid 58%, due in part to cost efficiencies gained from our scalable business model. Our Earnings Before Interest, Taxes, Depreciation, and Amortization (EBITDA) has surged to $37.5 million, translating to a remarkable 30% EBITDA margin. Our net income for the quarter rose to $16 million, which is a noteworthy increase from $10 million in second quarter (Q2) 2022. Our total addressable market has grown substantially thanks to the expansion of our high yield savings product line and the new RoboAdvisor platform. We've been diversifying our asset-backed securities portfolio, investing heavily in Collateralized Debt Obligations (CDOs), and Residential Mortgage-Backed Securities (RMBS). We've also invested $25 million in AAA rated corporate bonds, enhancing our risk-adjusted returns. As for our balance sheet, total assets reached $1.5 billion with total liabilities at $900 million, leaving us with a solid equity base of $600 million. Our Debt-to-Equity (D/E) ratio stands at 1.5, a healthy figure considering our expansionary phase. We continue to see substantial organic user growth, with Customer Acquisition Cost (CAC) dropping by 15% and Lifetime Value (LTV) growing by 25%. Our LTV to CAC (LTVCAC) ratio is at an impressive 3.5%. In terms of risk management, we have a Value at Risk (VaR) model in place with a 99% confidence level indicating that our maximum loss will not exceed $5 million in the next trading day. We've adopted a conservative approach to managing our leverage and have a healthy Tier 1 Capital ratio of 12.5%. Our forecast for the coming quarter is positive. We expect revenue to be around $135 million and 8% quarter over quarter growth driven primarily by our cutting-edge blockchain solutions and AI-driven predictive analytics. We're also excited about the upcoming Initial Public Offering (IPO) of our FinTech subsidiary Pay Plus, which we expect to raise $200 million, significantly bolstering our liquidity and paving the way for aggressive growth strategies. We thank our shareholders for their continued faith in us and we look forward to an even more successful third quarter (Q3). Thank you so much.\n",
"\n",
"Words Changed:\n",
"1. Q2 -> second quarter (Q2)\n",
"2. EBITDA -> Earnings Before Interest, Taxes, Depreciation, and Amortization (EBITDA)\n",
"3. Q2 2022 -> second quarter (Q2) 2022\n",
"4. CDOs -> Collateralized Debt Obligations (CDOs)\n",
"5. RMBS -> Residential Mortgage-Backed Securities (RMBS)\n",
"6. D/E -> Debt-to-Equity (D/E)\n",
"7. CAC -> Customer Acquisition Cost (CAC)\n",
"8. LTV -> Lifetime Value (LTV)\n",
"9. LTVCAC -> LTV to CAC (LTVCAC)\n",
"10. VaR -> Value at Risk (VaR)\n",
"11. IPO -> Initial Public Offering (IPO)\n",
"12. Q3 -> third quarter (Q3)\n"
]
}
],
"source": [
"print(final_transcript)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}