{ "cells": [ { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "# TV Script Generation\n", "In this project, you'll generate your own [Simpsons](https://en.wikipedia.org/wiki/The_Simpsons) TV scripts using RNNs. You'll be using part of the [Simpsons dataset](https://www.kaggle.com/wcukierski/the-simpsons-by-the-data) of scripts from 27 seasons. The neural network you'll build will generate a new TV script for a scene at [Moe's Tavern](https://simpsonswiki.com/wiki/Moe's_Tavern).\n", "## Get the Data\n", "The data is already provided for you. You'll be using a subset of the original dataset. It consists of only the scenes in Moe's Tavern. This doesn't include other versions of the tavern, like \"Moe's Cavern\", \"Flaming Moe's\", \"Uncle Moe's Family Feed-Bag\", etc." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "\"\"\"\n", "DON'T MODIFY ANYTHING IN THIS CELL\n", "\"\"\"\n", "import helper\n", "\n", "data_dir = './data/simpsons/moes_tavern_lines.txt'\n", "text = helper.load_data(data_dir)\n", "# Ignore notice, since we don't use it for analysing the data\n", "text = text[81:]" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Explore the Data\n", "Play around with `view_sentence_range` to view different parts of the data." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset Stats\n", "Roughly the number of unique words: 11492\n", "Number of scenes: 262\n", "Average number of sentences in each scene: 15.248091603053435\n", "Number of lines: 4257\n", "Average number of words in each line: 11.50434578341555\n", "\n", "The sentences 0 to 10:\n", "Moe_Szyslak: (INTO PHONE) Moe's Tavern. 
Where the elite meet to drink.\n", "Bart_Simpson: Eh, yeah, hello, is Mike there? Last name, Rotch.\n", "Moe_Szyslak: (INTO PHONE) Hold on, I'll check. (TO BARFLIES) Mike Rotch. Mike Rotch. Hey, has anybody seen Mike Rotch, lately?\n", "Moe_Szyslak: (INTO PHONE) Listen you little puke. One of these days I'm gonna catch you, and I'm gonna carve my name on your back with an ice pick.\n", "Moe_Szyslak: What's the matter Homer? You're not your normal effervescent self.\n", "Homer_Simpson: I got my problems, Moe. Give me another one.\n", "Moe_Szyslak: Homer, hey, you should not drink to forget your problems.\n", "Barney_Gumble: Yeah, you should only drink to enhance your social skills.\n", "\n", "\n" ] } ], "source": [ "view_sentence_range = (0, 10)\n", "\n", "\"\"\"\n", "DON'T MODIFY ANYTHING IN THIS CELL\n", "\"\"\"\n", "import numpy as np\n", "\n", "print('Dataset Stats')\n", "print('Roughly the number of unique words: {}'.format(len({word: None for word in text.split()})))\n", "scenes = text.split('\\n\\n')\n", "print('Number of scenes: {}'.format(len(scenes)))\n", "sentence_count_scene = [scene.count('\\n') for scene in scenes]\n", "print('Average number of sentences in each scene: {}'.format(np.average(sentence_count_scene)))\n", "\n", "sentences = [sentence for scene in scenes for sentence in scene.split('\\n')]\n", "print('Number of lines: {}'.format(len(sentences)))\n", "word_count_sentence = [len(sentence.split()) for sentence in sentences]\n", "print('Average number of words in each line: {}'.format(np.average(word_count_sentence)))\n", "\n", "print()\n", "print('The sentences {} to {}:'.format(*view_sentence_range))\n", "print('\\n'.join(text.split('\\n')[view_sentence_range[0]:view_sentence_range[1]]))" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)\r\n", "00:01.0 ISA 
bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]\r\n", "00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]\r\n", "00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)\r\n", "00:02.0 VGA compatible controller: Cirrus Logic GD 5446\r\n", "00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)\r\n", "00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)\r\n" ] } ], "source": [ "!lspci" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "#### Exploring Sentence Lengths \n", "\n", "Find the average sentence length to use as the RNN sequence length." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "data": { "text/html": [ "\n", "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/javascript": [ "\n", "(function(global) {\n", " function now() {\n", " return new Date();\n", " }\n", "\n", " var force = true;\n", "\n", " if (typeof (window._bokeh_onload_callbacks) === \"undefined\" || force === true) {\n", " window._bokeh_onload_callbacks = [];\n", " window._bokeh_is_loading = undefined;\n", " }\n", "\n", "\n", " \n", " if (typeof (window._bokeh_timeout) === \"undefined\" || force === true) {\n", " window._bokeh_timeout = Date.now() + 5000;\n", " window._bokeh_failed_load = false;\n", " }\n", "\n", " var NB_LOAD_WARNING = {'data': {'text/html':\n", " \"\\n\"+\n", " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", " \"
\\n\"+\n", " \"\\n\"+\n",
" \"from bokeh.resources import INLINE\\n\"+\n",
" \"output_notebook(resources=INLINE)\\n\"+\n",
" \"
\\n\"+\n",
" \"