{
"cells": [
{
"cell_type": "markdown",
"id": "2fad852c",
"metadata": {},
"source": [
"# Enriching a relational dataset to create a graph dataset using LLMs\n",
"\n",
"\n",
"In this notebook, we will cover how to transform a relational dataset into a knowledge graph.\n",
"\n",
"This allows to find relationships between data points more easily, which can be useful when building apps that can leverage those relationships.\n",
"\n",
"### Use case\n",
"\n",
"As an example, we'll use the [Amazon UK Products 2023 Dataset](https://www.kaggle.com/datasets/asaniczka/amazon-uk-products-dataset-2023) and transform it to import it into a Neo4J database.\n",
"\n",
"The graph database can later be used to build a recommendation system, by leveraging common relationships between products.\n",
"\n",
"We will use GPT-3.5-turbo to extract entities from the products' titles and use those entities to create our graph. \n",
"\n",
"You can use the example dataset or your own, adapting the entities extracted to your specific use case."
]
},
{
"cell_type": "markdown",
"id": "d62a1dab",
"metadata": {},
"source": [
"## Preparing the dataset\n",
"\n",
"/!\\ The dataset is not included in this repo - please download it from here: [Amazon UK Products 2023 Dataset](https://www.kaggle.com/datasets/asaniczka/amazon-uk-products-dataset-2023)\n",
"\n",
"\n",
"After downloading the dataset from Kaggle, we will filter out a large portion of it as it contains 2.2M products and it would be too long to run the entity extraction on all of it.\n",
"\n",
"If you're using your own dataset, feel free to skip this step but be aware that the entity extraction takes a long time."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "7f0f8bc6",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>asin</th>\n",
" <th>title</th>\n",
" <th>imgUrl</th>\n",
" <th>productURL</th>\n",
" <th>stars</th>\n",
" <th>reviews</th>\n",
" <th>price</th>\n",
" <th>isBestSeller</th>\n",
" <th>boughtInLastMonth</th>\n",
" <th>categoryName</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>B09B96TG33</td>\n",
" <td>Echo Dot (5th generation, 2022 release) | Big ...</td>\n",
" <td>https://m.media-amazon.com/images/I/71C3lbbeLs...</td>\n",
" <td>https://www.amazon.co.uk/dp/B09B96TG33</td>\n",
" <td>4.7</td>\n",
" <td>15308</td>\n",
" <td>21.99</td>\n",
" <td>False</td>\n",
" <td>0</td>\n",
" <td>Hi-Fi Speakers</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>B01HTH3C8S</td>\n",
" <td>Anker Soundcore mini, Super-Portable Bluetooth...</td>\n",
" <td>https://m.media-amazon.com/images/I/61c5rSxwP0...</td>\n",
" <td>https://www.amazon.co.uk/dp/B01HTH3C8S</td>\n",
" <td>4.7</td>\n",
" <td>98099</td>\n",
" <td>23.99</td>\n",
" <td>True</td>\n",
" <td>0</td>\n",
" <td>Hi-Fi Speakers</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>B09B8YWXDF</td>\n",
" <td>Echo Dot (5th generation, 2022 release) | Big ...</td>\n",
" <td>https://m.media-amazon.com/images/I/61j3SEUjMJ...</td>\n",
" <td>https://www.amazon.co.uk/dp/B09B8YWXDF</td>\n",
" <td>4.7</td>\n",
" <td>15308</td>\n",
" <td>21.99</td>\n",
" <td>False</td>\n",
" <td>0</td>\n",
" <td>Hi-Fi Speakers</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>B09B8T5VGV</td>\n",
" <td>Echo Dot with clock (5th generation, 2022 rele...</td>\n",
" <td>https://m.media-amazon.com/images/I/71yf6yTNWS...</td>\n",
" <td>https://www.amazon.co.uk/dp/B09B8T5VGV</td>\n",
" <td>4.7</td>\n",
" <td>7205</td>\n",
" <td>31.99</td>\n",
" <td>False</td>\n",
" <td>0</td>\n",
" <td>Hi-Fi Speakers</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>B09WX6QD65</td>\n",
" <td>Introducing Echo Pop | Full sound compact Wi-F...</td>\n",
" <td>https://m.media-amazon.com/images/I/613dEoF9-r...</td>\n",
" <td>https://www.amazon.co.uk/dp/B09WX6QD65</td>\n",
" <td>4.6</td>\n",
" <td>1881</td>\n",
" <td>17.99</td>\n",
" <td>False</td>\n",
" <td>0</td>\n",
" <td>Hi-Fi Speakers</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" asin title \\\n",
"0 B09B96TG33 Echo Dot (5th generation, 2022 release) | Big ... \n",
"1 B01HTH3C8S Anker Soundcore mini, Super-Portable Bluetooth... \n",
"2 B09B8YWXDF Echo Dot (5th generation, 2022 release) | Big ... \n",
"3 B09B8T5VGV Echo Dot with clock (5th generation, 2022 rele... \n",
"4 B09WX6QD65 Introducing Echo Pop | Full sound compact Wi-F... \n",
"\n",
" imgUrl \\\n",
"0 https://m.media-amazon.com/images/I/71C3lbbeLs... \n",
"1 https://m.media-amazon.com/images/I/61c5rSxwP0... \n",
"2 https://m.media-amazon.com/images/I/61j3SEUjMJ... \n",
"3 https://m.media-amazon.com/images/I/71yf6yTNWS... \n",
"4 https://m.media-amazon.com/images/I/613dEoF9-r... \n",
"\n",
" productURL stars reviews price \\\n",
"0 https://www.amazon.co.uk/dp/B09B96TG33 4.7 15308 21.99 \n",
"1 https://www.amazon.co.uk/dp/B01HTH3C8S 4.7 98099 23.99 \n",
"2 https://www.amazon.co.uk/dp/B09B8YWXDF 4.7 15308 21.99 \n",
"3 https://www.amazon.co.uk/dp/B09B8T5VGV 4.7 7205 31.99 \n",
"4 https://www.amazon.co.uk/dp/B09WX6QD65 4.6 1881 17.99 \n",
"\n",
" isBestSeller boughtInLastMonth categoryName \n",
"0 False 0 Hi-Fi Speakers \n",
"1 True 0 Hi-Fi Speakers \n",
"2 False 0 Hi-Fi Speakers \n",
"3 False 0 Hi-Fi Speakers \n",
"4 False 0 Hi-Fi Speakers "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"# Path where the downloaded file is located\n",
"# Update this with your own file path if it is different\n",
"file_path = \"data/amazon_product_db.csv\"\n",
"\n",
"df = pd.read_csv(file_path)\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "4eefabe1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(2222742, 10)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "c00d1f00",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['asin', 'title', 'imgUrl', 'productURL', 'stars', 'reviews', 'price',\n",
" 'isBestSeller', 'boughtInLastMonth', 'categoryName'],\n",
" dtype='object')"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.columns"
]
},
{
"cell_type": "markdown",
"id": "b325192d",
"metadata": {},
"source": [
"### Filtering out data\n",
"\n",
"Let's imagine we want to use this dataset to find relevant products to recommend to users. \n",
"\n",
"There are a few categories we want to skip, as they are probably not something we want to recommend to buy on a whim.\n",
"\n",
"We will also filter out products that don't have a great rating and that are not best sellers."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "0e908db8",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['Hi-Fi Speakers', 'CD, Disc & Tape Players', 'Wearable Technology',\n",
" 'Light Bulbs', 'Bathroom Lighting',\n",
" 'Heating, Cooling & Air Quality', 'Coffee & Espresso Machines',\n",
" 'Lab & Scientific Products', 'Smart Speakers',\n",
" 'Motorbike Clothing', 'Motorbike Accessories',\n",
" 'Motorbike Batteries', 'Motorbike Boots & Luggage',\n",
" 'Motorbike Chassis', 'Handmade Home & Kitchen Products',\n",
" 'Hardware', 'Storage & Home Organisation',\n",
" 'Fireplaces, Stoves & Accessories', 'PC Gaming Accessories',\n",
" 'USB Gadgets', 'Blank Media Cases & Wallets', 'Car & Motorbike',\n",
" 'Boys', 'Sports & Outdoors', 'Microphones', 'String Instruments',\n",
" 'Karaoke Equipment', 'PA & Stage',\n",
" 'General Music-Making Accessories', 'Wind Instruments',\n",
" 'Handmade Gifts', 'Fragrances', 'Calendars & Personal Organisers',\n",
" 'Furniture & Lighting', 'Computer Printers', 'Ski Goggles',\n",
" 'Snowboards', 'Skiing Poles', 'Downhill Ski Boots',\n",
" 'Hiking Hand & Foot Warmers', 'Pet Supplies',\n",
" 'Plants, Seeds & Bulbs', 'Garden Furniture & Accessories',\n",
" 'Bird & Wildlife Care', 'Storage & Organisation',\n",
" 'Living Room Furniture', 'Bedding & Linen',\n",
" 'Curtain & Blind Accessories', 'Skin Care',\n",
" \"Kids' Art & Craft Supplies\", \"Kids' Play Vehicles\", 'Hobbies',\n",
" 'Laptops', 'Projectors', 'Graphics Cards', 'Computer Memory',\n",
" 'Motherboards', 'Power Supplies', 'CPUs', 'Computer Screws',\n",
" 'Streaming Clients', '3D Printers', 'Barebone PCs',\n",
" \"Women's Sports & Outdoor Shoes\", 'Luxury Food & Drink',\n",
" 'Alexa Built-In Devices', 'PC & Video Games', 'SIM Cards',\n",
" 'Mobile Phone Accessories', 'Birthday Gifts',\n",
" 'Handmade Kitchen & Dining', 'Abrasive & Finishing Products',\n",
" 'Professional Medical Supplies', 'Cutting Tools',\n",
" 'Material Handling Products', 'Packaging & Shipping Supplies',\n",
" 'Power & Hand Tools', 'Agricultural Equipment & Supplies',\n",
" 'Tennis Shoes', 'Boating Footwear', 'Cycling Shoes', 'Bath & Body',\n",
" 'Home Brewing & Wine Making', 'Tableware',\n",
" 'Kitchen Storage & Organisation', 'Kitchen Tools & Gadgets',\n",
" 'Cookware', 'Water Coolers, Filters & Cartridges',\n",
" 'Beer, Wine & Spirits', 'Manicure & Pedicure Products', 'Flashes',\n",
" 'Computers, Components & Accessories', 'Home Audio Record Players',\n",
" 'Radios & Boomboxes', 'Car & Vehicle Electronics',\n",
" 'eBook Readers & Accessories', 'Lighting',\n",
" 'Small Kitchen Appliances', 'Motorbike Engines & Engine Parts',\n",
" 'Motorbike Drive & Gears', 'Motorbike Brakes',\n",
" 'Motorbike Exhaust & Exhaust Systems',\n",
" 'Motorbike Handlebars, Controls & Grips',\n",
" 'Mowers & Outdoor Power Tools', 'Kitchen & Bath Fixtures',\n",
" 'Rough Plumbing', 'Monitor Accessories', 'Cables & Accessories',\n",
" 'Guitars & Gear', 'Pens, Pencils & Writing Supplies',\n",
" 'School & Educational Supplies', 'Ski Clothing',\n",
" 'Outdoor Heaters & Fire Pits', 'Garden Décor', 'Beauty',\n",
" 'Made in Italy Handmade', 'Cushions & Accessories',\n",
" 'Home Fragrance', 'Window Treatments',\n",
" 'Home Entertainment Furniture', 'Dining Room Furniture',\n",
" 'Home Bar Furniture', 'Kitchen Linen', 'Mattress Pads & Toppers',\n",
" \"Children's Bedding\", 'Bedding Accessories',\n",
" 'Games & Game Accessories', 'Dolls & Accessories',\n",
" 'Sports Toys & Outdoor', 'Monitors', 'I/O Port Cards',\n",
" 'Computer Cases', 'KVM Switches', 'Printers & Accessories',\n",
" 'Telephones, VoIP & Accessories', 'Handmade Artwork',\n",
" 'Industrial Electrical', 'Test & Measurement',\n",
" '3D Printing & Scanning', 'Basketball Footwear', 'Make-up',\n",
" 'Surveillance Cameras', 'Photo Printers', 'Tripods & Monopods',\n",
" 'Mobile Phones & Communication', 'Electrical Power Accessories',\n",
" 'Radio Communication', 'Outdoor Rope Lights',\n",
" 'Vacuums & Floorcare', 'Large Appliances', 'Motorbike Lighting',\n",
" 'Motorbike Seat Covers', 'Motorbike Instruments',\n",
" 'Motorbike Electrical & Batteries', 'Lights and switches', 'Plugs',\n",
" 'Home Entertainment', 'Girls',\n",
" 'Painting Supplies, Tools & Wall Treatments', 'Building Supplies',\n",
" 'Safety & Security', 'Tablet Accessories',\n",
" 'Keyboards, Mice & Input Devices', 'Laptop Accessories',\n",
" 'Headphones & Earphones', 'Baby', 'Smartwatches',\n",
" 'Piano & Keyboard', 'Drums & Percussion',\n",
" 'Synthesisers, Samplers & Digital Instruments',\n",
" 'Office Electronics', 'Office Supplies', 'Gardening',\n",
" 'Outdoor Cooking', 'Decking & Fencing',\n",
" 'Thermometers & Meteorological Instruments',\n",
" 'Pools, Hot Tubs & Supplies', 'Health & Personal Care',\n",
" 'Decorative Artificial Flora', 'Candles & Holders',\n",
" 'Signs & Plaques', 'Home Office Furniture', 'Bathroom Furniture',\n",
" 'Inflatable Beds, Pillows & Accessories', 'Bathroom Linen',\n",
" 'Bedding Collections', \"Kids' Play Figures\", 'Baby & Toddler Toys',\n",
" 'Learning & Education Toys', 'Toy Advent Calendars',\n",
" 'Electronic Toys', 'Tablets', 'External Sound Cards',\n",
" 'Internal TV Tuner & Video Capture Cards',\n",
" 'External TV Tuners & Video Capture Cards',\n",
" 'Scanners & Accessories', \"Men's Sports & Outdoor Shoes\",\n",
" 'Darts & Dartboards', 'Table Tennis', 'Billiard, Snooker & Pool',\n",
" 'Bowling', 'Trampolines & Accessories',\n",
" 'Handmade Clothing, Shoes & Accessories', 'Handmade Home Décor',\n",
" 'Handmade', 'Smart Home Security & Lighting',\n",
" 'Professional Education Supplies',\n",
" 'Hydraulics, Pneumatics & Plumbing', 'Ballet & Dancing Footwear',\n",
" 'Cricket Shoes', 'Golf Shoes', 'Boxing Shoes', 'Men',\n",
" 'Headphones, Earphones & Accessories', 'Bakeware', 'Grocery',\n",
" 'Lenses', 'Camcorders', 'Camera & Photo Accessories',\n",
" 'Household Batteries, Chargers & Accessories',\n",
" 'Home Cinema, TV & Video', 'Hi-Fi & Home Audio Accessories',\n",
" 'Portable Sound & Video Products', 'Outdoor Lighting', 'Torches',\n",
" 'Sports Supplements', 'Ironing & Steamers',\n",
" \"Customers' Most Loved\", 'Cameras', 'Electrical',\n",
" 'Construction Machinery', 'Handmade Baby Products', 'USB Hubs',\n",
" 'Computer Audio & Video Accessories', 'Adapters',\n",
" 'Computer & Server Racks', 'Hard Drive Accessories',\n",
" 'Printer Accessories', 'Computer Memory Card Accessories',\n",
" 'Uninterruptible Power Supply Units & Accessories',\n",
" 'Luggage and travel gear', 'Bass Guitars & Gear',\n",
" 'Recording & Computer', 'DJ & VJ Equipment',\n",
" 'Art & Craft Supplies', 'Office Paper Products', 'Ski Helmets',\n",
" 'Snowboard Boots', 'Snowboard Bindings', 'Downhill Skis',\n",
" 'Snow Sledding Equipment', 'Networking Devices',\n",
" 'Garden Storage & Housing', 'Garden Tools & Watering Equipment',\n",
" 'Photo Frames', 'Rugs, Pads & Protectors', 'Mirrors', 'Clocks',\n",
" 'Doormats', 'Decorative Home Accessories', 'Boxes & Organisers',\n",
" 'Slipcovers', 'Vases', 'Bedroom Furniture', 'Hallway Furniture',\n",
" 'Jigsaws & Puzzles', 'Building & Construction Toys',\n",
" 'Remote & App-Controlled Devices', \"Kids' Dress Up & Pretend Play\",\n",
" 'Soft Toys', 'Desktop PCs', 'External Optical Drives',\n",
" 'Internal Optical Drives', 'Network Cards', 'Data Storage',\n",
" 'Mobile Phones & Smartphones', 'Handmade Jewellery',\n",
" 'Gifts for Him', 'Gifts for Her', 'Women', 'Hockey Shoes',\n",
" 'Climbing Footwear', 'Equestrian Sports Boots', 'Arts & Crafts',\n",
" 'Hair Care', 'Coffee, Tea & Espresso', 'Digital Cameras',\n",
" 'Digital Frames', 'Action Cameras', 'Film Cameras',\n",
" 'Binoculars, Telescopes & Optics', 'Media Streaming Devices',\n",
" 'Hi-Fi Receivers & Separates', 'GPS, Finders & Accessories',\n",
" 'Indoor Lighting', 'String Lights'], dtype=object)"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['categoryName'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "a01f39fc",
"metadata": {},
"outputs": [],
"source": [
"categories_to_delete = ['CD, Disc & Tape Players',\n",
" 'Light Bulbs', 'Bathroom Lighting',\n",
" 'Heating, Cooling & Air Quality',\n",
" 'Lab & Scientific Products',\n",
" 'Motorbike Batteries', 'Motorbike Boots & Luggage',\n",
" 'Motorbike Chassis',\n",
" 'Fireplaces, Stoves & Accessories', 'Blank Media Cases & Wallets', 'Car & Motorbike',\n",
" 'PA & Stage',\n",
" 'Wind Instruments',\n",
" 'Computer Printers', 'Ski Goggles',\n",
" 'Snowboards', 'Skiing Poles', 'Downhill Ski Boots',\n",
" 'Hiking Hand & Foot Warmers', 'Pet Supplies',\n",
" 'Plants, Seeds & Bulbs', \n",
" 'Bird & Wildlife Care','Projectors', 'Graphics Cards', 'Computer Memory',\n",
" 'Motherboards', 'Power Supplies', 'CPUs', 'Computer Screws',\n",
" 'Streaming Clients', 'Barebone PCs',\n",
" 'SIM Cards',\n",
" 'Abrasive & Finishing Products',\n",
" 'Professional Medical Supplies', 'Cutting Tools',\n",
" 'Material Handling Products', 'Packaging & Shipping Supplies',\n",
" 'Power & Hand Tools', 'Agricultural Equipment & Supplies',\n",
" 'Tennis Shoes', 'Boating Footwear', 'Cycling Shoes', 'Water Coolers, Filters & Cartridges',\n",
" 'Flashes',\n",
" 'Computers, Components & Accessories', 'Motorbike Engines & Engine Parts',\n",
" 'Motorbike Drive & Gears', 'Motorbike Brakes',\n",
" 'Motorbike Exhaust & Exhaust Systems',\n",
" 'Motorbike Handlebars, Controls & Grips',\n",
" 'Mowers & Outdoor Power Tools', 'Kitchen & Bath Fixtures',\n",
" 'Rough Plumbing', 'Monitor Accessories', 'Cables & Accessories',\n",
" 'School & Educational Supplies',\n",
" 'Outdoor Heaters & Fire Pits', 'Window Treatments',\n",
" 'Mattress Pads & Toppers',\n",
" \"Children's Bedding\", 'I/O Port Cards',\n",
" 'Computer Cases', 'KVM Switches', 'Printers & Accessories',\n",
" 'Telephones, VoIP & Accessories',\n",
" 'Industrial Electrical', 'Test & Measurement',\n",
" 'Electrical Power Accessories',\n",
" 'Radio Communication', 'Outdoor Rope Lights',\n",
" 'Vacuums & Floorcare', 'Large Appliances', 'Motorbike Lighting',\n",
" 'Motorbike Seat Covers', 'Motorbike Instruments',\n",
" 'Motorbike Electrical & Batteries', 'Lights and switches', 'Plugs',\n",
" 'Painting Supplies, Tools & Wall Treatments', 'Building Supplies',\n",
" 'Safety & Security', 'Tablet Accessories',\n",
" 'Decking & Fencing',\n",
" 'Thermometers & Meteorological Instruments',\n",
" 'Pools, Hot Tubs & Supplies',\n",
" 'Signs & Plaques',\n",
" 'Inflatable Beds, Pillows & Accessories', 'External Sound Cards',\n",
" 'Internal TV Tuner & Video Capture Cards',\n",
" 'External TV Tuners & Video Capture Cards',\n",
" 'Scanners & Accessories',\n",
" 'Professional Education Supplies',\n",
" 'Hydraulics, Pneumatics & Plumbing', 'Grocery',\n",
" 'Household Batteries, Chargers & Accessories',\n",
" 'Torches',\n",
" 'Sports Supplements', 'Ironing & Steamers',\n",
" 'Electrical',\n",
" 'Construction Machinery', 'Handmade Baby Products', 'USB Hubs',\n",
" 'Adapters',\n",
" 'Computer & Server Racks', 'Hard Drive Accessories',\n",
" 'Printer Accessories', 'Computer Memory Card Accessories',\n",
" 'Uninterruptible Power Supply Units & Accessories',\n",
" 'Recording & Computer', 'Office Paper Products', 'Ski Helmets',\n",
" 'Snowboard Boots', 'Snowboard Bindings', 'Downhill Skis',\n",
" 'Snow Sledding Equipment', 'Networking Devices',\n",
" 'Rugs, Pads & Protectors',\n",
" 'Slipcovers', 'External Optical Drives',\n",
" 'Internal Optical Drives', 'Network Cards', 'Data Storage',\n",
" 'Mobile Phones & Smartphones', 'Media Streaming Devices',\n",
" 'Hi-Fi Receivers & Separates', 'GPS, Finders & Accessories']"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "28494927",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"296\n",
"126\n"
]
}
],
"source": [
"print(len(df['categoryName'].unique()))\n",
"print(len(categories_to_delete))"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "59fd7457",
"metadata": {},
"outputs": [],
"source": [
"# Removing all categories\n",
"df_filtered = df[~df['categoryName'].isin(categories_to_delete)]"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "274a550a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1743315, 10)"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_filtered.shape"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "15e6ffe5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(707380, 10)"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Removing all items poorly rated\n",
"\n",
"threshold = 3.8\n",
"\n",
"df_filtered = df_filtered[df_filtered['stars'] >= 3.8]\n",
"df_filtered.shape"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "21d078e5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(4091, 10)"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Keeping only best sellers\n",
"\n",
"df_best_seller = df_filtered[df_filtered['isBestSeller']]\n",
"df_best_seller.shape"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "40e0cc1a",
"metadata": {},
"outputs": [],
"source": [
"df_best_seller.reset_index(inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "39bc1102",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>index</th>\n",
" <th>asin</th>\n",
" <th>title</th>\n",
" <th>imgUrl</th>\n",
" <th>productURL</th>\n",
" <th>stars</th>\n",
" <th>reviews</th>\n",
" <th>price</th>\n",
" <th>isBestSeller</th>\n",
" <th>boughtInLastMonth</th>\n",
" <th>categoryName</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>B01HTH3C8S</td>\n",
" <td>Anker Soundcore mini, Super-Portable Bluetooth...</td>\n",
" <td>https://m.media-amazon.com/images/I/61c5rSxwP0...</td>\n",
" <td>https://www.amazon.co.uk/dp/B01HTH3C8S</td>\n",
" <td>4.7</td>\n",
" <td>98099</td>\n",
" <td>23.99</td>\n",
" <td>True</td>\n",
" <td>0</td>\n",
" <td>Hi-Fi Speakers</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>15</td>\n",
" <td>B09B97BPSW</td>\n",
" <td>Echo Dot Kids (5th generation, 2022 release) |...</td>\n",
" <td>https://m.media-amazon.com/images/I/71OimazcmO...</td>\n",
" <td>https://www.amazon.co.uk/dp/B09B97BPSW</td>\n",
" <td>4.6</td>\n",
" <td>1017</td>\n",
" <td>26.99</td>\n",
" <td>True</td>\n",
" <td>0</td>\n",
" <td>Hi-Fi Speakers</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>17</td>\n",
" <td>B09B8XRZYB</td>\n",
" <td>Echo Dot Kids (5th generation, 2022 release) |...</td>\n",
" <td>https://m.media-amazon.com/images/I/71QKSOmP-I...</td>\n",
" <td>https://www.amazon.co.uk/dp/B09B8XRZYB</td>\n",
" <td>4.6</td>\n",
" <td>1017</td>\n",
" <td>26.99</td>\n",
" <td>True</td>\n",
" <td>0</td>\n",
" <td>Hi-Fi Speakers</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>36</td>\n",
" <td>B08L84ST93</td>\n",
" <td>Bose Solo Soundbar Series II - TV Speaker with...</td>\n",
" <td>https://m.media-amazon.com/images/I/61kib4a8uq...</td>\n",
" <td>https://www.amazon.co.uk/dp/B08L84ST93</td>\n",
" <td>4.6</td>\n",
" <td>2799</td>\n",
" <td>169.00</td>\n",
" <td>True</td>\n",
" <td>0</td>\n",
" <td>Hi-Fi Speakers</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>55</td>\n",
" <td>B08CMJ2YZX</td>\n",
" <td>Sanyun SW208 3\" Active Bluetooth 5.0 Bookshelf...</td>\n",
" <td>https://m.media-amazon.com/images/I/81PdWvZcOB...</td>\n",
" <td>https://www.amazon.co.uk/dp/B08CMJ2YZX</td>\n",
" <td>4.4</td>\n",
" <td>974</td>\n",
" <td>59.49</td>\n",
" <td>True</td>\n",
" <td>0</td>\n",
" <td>Hi-Fi Speakers</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" index asin title \\\n",
"0 1 B01HTH3C8S Anker Soundcore mini, Super-Portable Bluetooth... \n",
"1 15 B09B97BPSW Echo Dot Kids (5th generation, 2022 release) |... \n",
"2 17 B09B8XRZYB Echo Dot Kids (5th generation, 2022 release) |... \n",
"3 36 B08L84ST93 Bose Solo Soundbar Series II - TV Speaker with... \n",
"4 55 B08CMJ2YZX Sanyun SW208 3\" Active Bluetooth 5.0 Bookshelf... \n",
"\n",
" imgUrl \\\n",
"0 https://m.media-amazon.com/images/I/61c5rSxwP0... \n",
"1 https://m.media-amazon.com/images/I/71OimazcmO... \n",
"2 https://m.media-amazon.com/images/I/71QKSOmP-I... \n",
"3 https://m.media-amazon.com/images/I/61kib4a8uq... \n",
"4 https://m.media-amazon.com/images/I/81PdWvZcOB... \n",
"\n",
" productURL stars reviews price \\\n",
"0 https://www.amazon.co.uk/dp/B01HTH3C8S 4.7 98099 23.99 \n",
"1 https://www.amazon.co.uk/dp/B09B97BPSW 4.6 1017 26.99 \n",
"2 https://www.amazon.co.uk/dp/B09B8XRZYB 4.6 1017 26.99 \n",
"3 https://www.amazon.co.uk/dp/B08L84ST93 4.6 2799 169.00 \n",
"4 https://www.amazon.co.uk/dp/B08CMJ2YZX 4.4 974 59.49 \n",
"\n",
" isBestSeller boughtInLastMonth categoryName \n",
"0 True 0 Hi-Fi Speakers \n",
"1 True 0 Hi-Fi Speakers \n",
"2 True 0 Hi-Fi Speakers \n",
"3 True 0 Hi-Fi Speakers \n",
"4 True 0 Hi-Fi Speakers "
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_best_seller.head()"
]
},
{
"cell_type": "markdown",
"id": "5cddff7a",
"metadata": {},
"source": [
"## Extracting entities\n",
"\n",
"Now that we've drastically reduced the number of products we will be working with, we can use GPT-3.5-turbo to extract entities from the products' titles. \n",
"\n",
"Extracting these entities will allow us to create nodes to populate the graph, and visualize relationships between products and different types of entities such as characteristics, color, etc."
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "3e6e5350",
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"\n",
"# Make sure you have your OpenAI key set up as the OPENAI_API_KEY environment variable, or set it manually\n",
"\n",
"# Set the OpenAI API key env variable manually\n",
"# os.environ[\"OPENAI_API_KEY\"] = \"<your_api_key>\"\n",
"\n",
"client = OpenAI()"
]
},
{
"cell_type": "markdown",
"id": "ba60a1c2",
"metadata": {},
"source": [
"### Describing entities\n",
"\n",
"The first step to extract entities is to define which types of entities we want to extract. Here, we will define a few entities that are relevant to a product recommendation system, with the meaning of each entity type.\n",
"\n",
"These are arbitrary and could be changed depending on your use case."
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "6885b750",
"metadata": {},
"outputs": [],
"source": [
"entity_types = {\n",
" \"description\": \"Item detailed description, for example 'high waist pants', 'outdoor plant pot', 'chef kitchen knife'\",\n",
" \"type\": \"Item type, for example 'women clothing', 'plant pot', 'kitchen knife'\",\n",
" \"characteristic\": \"if present, item characteristics, for example 'waterproof', 'adhesive', 'easy to use'\",\n",
" \"measurement\": \"if present, dimensions of the item\", \n",
" \"brand\": \"if present, brand of the item\",\n",
" \"color\": \"if present, specific color of the item.\",\n",
" \"color_group\": \"if the color is present, this is the broader color group. For example, 'navy blue' is part of the color group 'blue', 'burgundy' is part of 'red', or 'lilac' is part of purple.\",\n",
" \"age_group\": \"target age group for the product, one of 'babies', 'children', 'teenagers', 'adults'. If it is not clear whether the product is aimed at a specific age group, it should be for 'adults'.\",\n",
" \"gender_group\": \"target gender for the product, one of 'women', 'men', 'all'. If it is not clear whether the product is aimed at a specific gender, it should be for 'all'.\"\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "1bb2575b",
"metadata": {},
"source": [
"### Crafting a prompt\n",
"\n",
"We will then use those entity types to craft a prompt for the model to extract the entities we are looking for.\n",
"We will use `gpt-3.5-turbo-1106` as we can instruct this model to only output valid json. \n",
"\n",
"The prompt should describe in details the output expected, and include examples of how to extract the entities."
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "3959dea7",
"metadata": {},
"outputs": [],
"source": [
"import json"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "20779b10",
"metadata": {},
"outputs": [],
"source": [
"system_prompt = f'''\n",
" You are an agent specialized in finding entities in online product descriptions.\n",
" The user will give you a product description.\n",
" Your task is to identify entities from the product description.\n",
"\n",
" The entities can be of those types:\n",
"\n",
" {json.dumps(entity_types)}\n",
"\n",
" You must return a JSON output containing for every type of entity found a list of values.\n",
" If you cannot find an entity type, return an empty array for this entity.\n",
" If you found one entity of this type, return an array with one value.\n",
" If you found 2 entities of this type, return an array with 2 values.\n",
" Etc.\n",
"\n",
" \n",
" Only use lower cases letters when defining entities values, and remove adjectives and specificities from values to try and have the simplest words or groups of words.\n",
" \n",
" For example:\n",
" \n",
" With the description: \"Super adhesive 100% waterproof outdoor 360° beautiful light\"\n",
" You could extract the characteristics:\n",
" - adhesive\n",
" - waterproof\n",
" - outdoor\n",
" \n",
" The description: outdoor 360° light\n",
" And the type: outdoor light\n",
" \n",
" \n",
" -----\n",
" \n",
" \n",
" Examples:\n",
" \n",
" 1. Description: \"YUVORA 3D Brick Wall Stickers | PE Foam Fancy Wallpaper for Walls,\n",
" Waterproof & Self Adhesive, White Color 3D Latest Unique Design Wallpaper for Home (70*70 CMT)\"\n",
" \n",
" Expected result:\n",
" {{\n",
" \"description\": [\"3d brick wall sticker\"],\n",
" \"type\": [\"wall sticker\", \"wallpaper\"],\n",
" \"brand\": [\"yuvora\"],\n",
" \"characteristic\": [\"waterproof\", \"self-adhesive\", \"fancy\"],\n",
" \"color\": [\"white\"],\n",
" \"color_group\": [\"white\"],\n",
" \"age_group\": [\"adults\"],\n",
" \"gender_group\": [\"all\"]\n",
" }}\n",
" \n",
" 2. Description: \"Marks & Spencer Girls' Pyjama Sets T86_2561C_Navy Mix_9-10Y\"\n",
" \n",
" Expected result:\n",
" {{\n",
" \"description\": [\"pyjama sets\"],\n",
" \"type\": [\"pyjamas\"],\n",
" \"brand\": [\"marks & spencer\"],\n",
" \"characteristic\": [],\n",
" \"color\": [\"navy\"],\n",
" \"color_group\": [\"blue\"],\n",
" \"age_group\": [\"children\"],\n",
" \"gender_group\": [\"women\"]\n",
" }}\n",
" \n",
" 3. Description: \"Star Trek 50th Anniversary Cereamic Storage Jar\"\n",
" \n",
" Expected result:\n",
" {{\n",
" \"description\": [\"star trek storage jar\"],\n",
" \"type\": [\"storage jar\"],\n",
" \"brand\": [],\n",
" \"characteristic\": [\"ceramic\", \"star trek\"],\n",
" \"color\": [],\n",
" \"color_group\": [],\n",
" \"age_group\": [\"adults\"],\n",
" \"gender_group\": [\"all\"]\n",
" }}\n",
" \n",
"\n",
"'''"
]
},
{
"cell_type": "markdown",
"id": "bf0437c1",
"metadata": {},
"source": [
"### Calling the model\n",
"\n",
"We will define a function to extract entities on a given text, and run this on every line in our dataset. "
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "d19ed06d",
"metadata": {},
"outputs": [],
"source": [
"model = \"gpt-3.5-turbo-1106\"\n",
"\n",
"def extract_entities(text, model=model):\n",
" completion = client.chat.completions.create(\n",
" model=model,\n",
" temperature=0,\n",
" response_format= {\n",
" \"type\": \"json_object\"\n",
" },\n",
" messages=[\n",
" {\n",
" \"role\": \"system\",\n",
" \"content\": system_prompt\n",
" },\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": text\n",
" }\n",
" ]\n",
" )\n",
"\n",
" return completion.choices[0].message.content "
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "afcdd78e",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"description\": [\"echo dot\"],\n",
" \"type\": [\"smart speaker\"],\n",
" \"brand\": [\"amazon\"],\n",
" \"characteristic\": [\"wi-fi\", \"bluetooth\", \"vibrant sound\", \"alexa\"],\n",
" \"color\": [\"charcoal\"],\n",
" \"color_group\": [\"black\"],\n",
" \"age_group\": [\"adults\"],\n",
" \"gender_group\": [\"all\"]\n",
"}\n"
]
}
],
"source": [
"# Example\n",
"title = \"Echo Dot (5th generation, 2022 release) | Big vibrant sound Wi-Fi and Bluetooth smart speaker with Alexa | Charcoal\"\n",
"\n",
"print(extract_entities(title))"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "b027b04f",
"metadata": {},
"outputs": [],
"source": [
"data_entities = []"
]
},
{
"cell_type": "markdown",
"id": "08d94209",
"metadata": {},
"source": [
"Running this will take a while so you can do it by chunks.\n",
"\n",
"Feel free to skip this step entirely and load the already prepared result."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b41250f9",
"metadata": {},
"outputs": [],
"source": [
"for i, row in df_best_seller[:100].iterrows():\n",
" try:\n",
" print(f\"#{i} - {row['title'][:20]}\")\n",
" entities = json.loads(extract_entities(row['title']))\n",
" product_data_string = row.to_json(orient='columns')\n",
" product_data = json.loads(product_data_string)\n",
" product_data.update(entities)\n",
" data_entities.append(product_data)\n",
" except Exception as e:\n",
" logging.error(e)"
]
},
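{
"cell_type": "markdown",
"id": "5a8b1c3d",
"metadata": {},
"source": [
"Processing the dataset in chunks can be sketched as follows. This is only an illustration: `iter_chunks` is a hypothetical helper, and the chunk size of 100 is an arbitrary choice. Each `(start, end)` pair can be used as `df_best_seller[start:end]` in the loop above, saving `data_entities` to disk between chunks."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7d0e2f4a",
"metadata": {},
"outputs": [],
"source": [
"def iter_chunks(total, chunk_size=100):\n",
"    \"\"\"Yield (start, end) index pairs that cover `total` rows in order.\"\"\"\n",
"    for start in range(0, total, chunk_size):\n",
"        yield start, min(start + chunk_size, total)\n",
"\n",
"# Example with 250 rows: the last chunk is shorter than the others\n",
"print(list(iter_chunks(250)))"
]
},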
{
"cell_type": "code",
"execution_count": null,
"id": "e00036fe",
"metadata": {},
"outputs": [],
"source": [
"print(len(data_entities))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d66d2355",
"metadata": {},
"outputs": [],
"source": [
"file_path = 'data/data_entities.json'\n",
"\n",
"# Saving the file locally\n",
"with open(file_path, 'w') as file:\n",
" json.dump(data_entities, file, indent=4)\n",
"\n",
"print(f\"Data written to {file_path}\")"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "ead7f764",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"4090\n"
]
}
],
"source": [
"# Load result from local file\n",
"file_path = 'data/amazon_product_db.json'\n",
"with open(file_path, 'r') as file:\n",
" data_entities = json.load(file)\n",
"\n",
"print(len(data_entities))"
]
},
{
"cell_type": "markdown",
"id": "923f9a1d",
"metadata": {},
"source": [
"## Loading data in the database\n",
"\n",
"We will use cypher queries to load this data into a Neo4j database."
]
},
{
"cell_type": "markdown",
"id": "95cb4781",
"metadata": {},
"source": [
"### Setting up the database\n",
"\n",
"There are several ways to set up a Neo4j database, but the easiest would be to use the Neo4J Desktop app and create a local database. \n",
"\n",
"You can follow the steps to do so [here](https://neo4j.com/docs/desktop-manual/current/operations/create-dbms/).\n",
"\n",
"Once this is done, you can grab your credentials to connect to your new DB."
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "ba44ec86",
"metadata": {},
"outputs": [],
"source": [
"#!pip install neo4j\n",
"from neo4j import GraphDatabase"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "515b6f31",
"metadata": {},
"outputs": [],
"source": [
"url = \"bolt://localhost:7687\"\n",
"username = \"neo4j\"\n",
"password = \"<your_password>\"\n",
"\n",
"\n",
"driver = GraphDatabase.driver(url, auth=(username, password))"
]
},
{
"cell_type": "markdown",
"id": "e5a9a2ab",
"metadata": {},
"source": [
"### Loading the data\n",
"\n",
"We will iterate over our array of objects and import them into the database with a Cypher query, using a relationships map to determine which relationships to create between nodes."
]
},
{
"cell_type": "code",
"execution_count": 56,
"id": "40e41d78",
"metadata": {},
"outputs": [],
"source": [
"entities_map = {\n",
" \"description\": {\n",
" \"entity_name\": \"Description\",\n",
" \"relationship_type\": \"HAS_DESCRIPTION\"\n",
" },\n",
" \"type\": {\n",
" \"entity_name\": \"Type\",\n",
" \"relationship_type\": \"HAS_TYPE\"\n",
" },\n",
" \"characteristic\": {\n",
" \"entity_name\": \"Characteristic\",\n",
" \"relationship_type\": \"HAS_CHARACTERISTIC\"\n",
" },\n",
" \"measurement\": {\n",
" \"entity_name\": \"Measurement\",\n",
" \"relationship_type\": \"HAS_MEASUREMENT\"\n",
" }, \n",
" \"brand\": {\n",
" \"entity_name\": \"Brand\",\n",
" \"relationship_type\": \"HAS_BRAND\"\n",
" \n",
" },\n",
" \"color\": {\n",
" \"entity_name\": \"Color\",\n",
" \"relationship_type\": \"HAS_COLOR\"\n",
" },\n",
" \"color_group\": {\n",
" \"entity_name\": \"ColorGroup\",\n",
" \"relationship_type\": \"HAS_COLOR_GROUP\"\n",
" },\n",
" \"age_group\": {\n",
" \"entity_name\": \"AgeGroup\",\n",
" \"relationship_type\": \"IS_FOR_AGE\"\n",
"\n",
" },\n",
" \"gender_group\": {\n",
" \"entity_name\": \"GenderGroup\",\n",
" \"relationship_type\": \"IS_FOR_GENDER\"\n",
" }\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 61,
"id": "734b4168",
"metadata": {},
"outputs": [],
"source": [
"def run_query(query, parameters=None):\n",
" with driver.session() as session:\n",
" result = session.run(query, parameters)\n",
" return [r.data() for r in result]\n",
" \n",
"def load_data(json_data):\n",
" query = '''WITH $json_data as data\n",
" MERGE (p:Product {\n",
" asin: data.asin,\n",
" title: data.title,\n",
" imgUrl: data.imgUrl,\n",
" productURL: data.productURL,\n",
" stars: data.stars,\n",
" reviews: data.reviews,\n",
" price: data.price,\n",
" isBestSeller: data.isBestSeller,\n",
" boughtInLastMonth: data.boughtInLastMonth\n",
" })\n",
" WITH p, data\n",
" MERGE (c:Category {value: data.categoryName})\n",
" MERGE (p)-[:HAS_CATEGORY]->(c)\n",
" '''\n",
" for e in entities_map.keys():\n",
" if e in json_data:\n",
" query += f'''\n",
" WITH p, data\n",
" UNWIND {json_data[e]} as {e}\n",
" MERGE ({e[:1]}{e[-1:]}:{entities_map[e]['entity_name']} {{value: {e}}})\n",
" MERGE (p)-[:{entities_map[e]['relationship_type']}]->({e[:1]}{e[-1:]})\n",
" '''\n",
" run_query(query, {\"json_data\": json_data})"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ff6547a1",
"metadata": {},
"outputs": [],
"source": [
"i = 1\n",
"for i in range(len(data_entities)):\n",
" p = data_entities[i]\n",
" print(f\"#{i} {p['title'][:20]}\")\n",
" load_data(p)\n",
" i+=1"
]
},
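{
"cell_type": "markdown",
"id": "1b6c8d9e",
"metadata": {},
"source": [
"The kind of overlap the graph makes visible can also be previewed in plain Python. The sketch below lists the entity values two product dicts have in common. `shared_entities` is an illustrative helper rather than part of the original notebook, and the two sample products are made up. In the notebook, `keys` could be `entities_map.keys()` and the products could be any two items of `data_entities`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4f3a5b7c",
"metadata": {},
"outputs": [],
"source": [
"def shared_entities(product_a, product_b, keys):\n",
"    \"\"\"Return {key: sorted common values} for every key where both products overlap.\"\"\"\n",
"    shared = {}\n",
"    for key in keys:\n",
"        # Treat a missing or null value as an empty list\n",
"        common = set(product_a.get(key) or []) & set(product_b.get(key) or [])\n",
"        if common:\n",
"            shared[key] = sorted(common)\n",
"    return shared\n",
"\n",
"# Made-up products for illustration only\n",
"a = {\"brand\": [\"amazon\"], \"color\": [\"charcoal\"], \"characteristic\": [\"alexa\", \"wi-fi\"]}\n",
"b = {\"brand\": [\"amazon\"], \"color\": [\"white\"], \"characteristic\": [\"alexa\"]}\n",
"print(shared_entities(a, b, [\"brand\", \"color\", \"characteristic\"]))"
]
},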
{
"cell_type": "markdown",
"id": "a1c3282b",
"metadata": {},
"source": [
"## Wrapping up\n",
"\n",
"Now that we've loaded the data in our Neo4j database, we can explore it using the Neo4j browser and see the relationships between products, which would be much harder to surface using a traditional database.\n",
"\n",
"For example, one product could have 3 different colors, and each color could be linked to multiple products as well.\n",
"\n",
"And a product could have a brand, a characteristic, and a category in common with another product, meaning they have a lot in common - again, what would be hard to figure out with a relational database jumps out when looking at a graph.\n",
"\n",
"Hopefully, this example can apply to multiple use cases, and you can see relationships between your data points more clearly with this data enrichment technique using GPT-3.5!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}