langchain/docs/modules/document_loaders/examples/web_base.ipynb

118 lines
12 KiB
Plaintext
Raw Normal View History

2023-02-09 15:52:50 +00:00
{
"cells": [
{
"cell_type": "markdown",
"id": "bf920da0",
"metadata": {},
"source": [
"# Web Base\n",
"\n",
"This covers how to load all text from webpages into a document format that we can use downstream. For more custom logic for loading webpages look at some child class examples such as IMSDbLoader, AZLyricsLoader, and CollegeConfidentialLoader"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "00b6de21",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import WebBaseLoader"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "0231df35",
"metadata": {},
"outputs": [],
"source": [
"loader = WebBaseLoader(\"https://www.espn.com/\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "f06bdc4e",
"metadata": {},
"outputs": [],
"source": [
"data = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "a390d79f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(page_content=\"\\n\\n\\n\\n\\n\\n\\n\\n\\nESPN - Serving Sports Fans. Anytime. Anywhere.\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n Skip to main content\\n \\n\\n Skip to navigation\\n \\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n<\\n\\n>\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nMenuESPN\\n\\n\\nSearch\\n\\n\\n\\nscores\\n\\n\\n\\nNFLNBANHLNCAAMNCAAWSoccer…MLBNCAAFGolfTennisSports BettingBoxingCaribbean SeriesCFLNCAACricketF1HorseLLWSMMANASCARNBA G LeagueOlympic SportsRacingRN BBRN FBRugbyWNBAWWEX GamesXFLMore ESPNFantasyListenWatchESPN+\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n \\n\\nSUBSCRIBE NOW\\n\\n\\n\\n\\n\\nUFC 284: Makhachev vs. Volkanovski (ESPN+ PPV)\\n\\n\\n\\n\\n\\n\\n\\nMen's College Hoops: Select Games\\n\\n\\n\\n\\n\\n\\n\\nWomen's College Hoops: Select Games\\n\\n\\n\\n\\n\\n\\n\\nNHL: Select Games\\n\\n\\n\\n\\n\\n\\n\\nGerman Cup: Round of 16\\n\\n\\n\\n\\n\\n\\n\\n30 For 30: Bullies Of Baltimore\\n\\n\\n\\n\\n\\n\\n\\nMatt Miller's Two-Round NFL Mock Draft\\n\\n\\nQuick Links\\n\\n\\n\\n\\nSuper Bowl LVII\\n\\n\\n\\n\\n\\n\\n\\nSuper Bowl Betting\\n\\n\\n\\n\\n\\n\\n\\nNBA Trade Machine\\n\\n\\n\\n\\n\\n\\n\\nNBA All-Star Game\\n\\n\\n\\n\\n\\n\\n\\nFantasy Baseball: Sign Up\\n\\n\\n\\n\\n\\n\\n\\nHow To Watch NHL Games\\n\\n\\n\\n\\n\\n\\n\\nGames For Me\\n\\n\\n\\n\\n\\n\\nFavorites\\n\\n\\n\\n\\n\\n\\n Manage Favorites\\n \\n\\n\\n\\nCustomize ESPNSign UpLog InESPN Sites\\n\\n\\n\\n\\nESPN Deportes\\n\\n\\n\\n\\n\\n\\n\\nAndscape\\n\\n\\n\\n\\n\\n\\n\\nespnW\\n\\n\\n\\n\\n\\n\\n\\nESPNFC\\n\\n\\n\\n\\n\\n\\n\\nX Games\\n\\n\\n\\n\\n\\n\\n\\nSEC Network\\n\\n\\nESPN Apps\\n\\n\\n\\n\\nESPN\\n\\n\\n\\n\\n\\n\\n\\nESPN Fantasy\\n\\n\\nFollow ESPN\\n\\n\\n\\n\\nFacebook\\n\\n\\n\\n\\n\\n\\n\\nTwitter\\n\\n\\n\\n\\n\\n\\n\\nInstagram\\n\\n\\n\\n\\n\\n\\n\\nSnapchat\\n\\n\\n\\n\\n\\n\\n\\nYouTube\\n\\n\\n\\n\\n\\n\\n\\nThe ESPN Daily Podcast\\n\\n\\nAP Photo/Mark J. Terrilllive\\n\\n\\n\\nChristian Wood elevates for the big-time stuffChristian Wood elevates for the big-time stuff15m0:29\\n\\n\\nKyrie Irving nails the treyKyrie Irving nails the trey37m0:17\\n\\n\\nDwight Powell rises up for putback dunkDwight Powell throws down the putback dunk for the Mavericks.38m0:16\\n\\n\\nKyrie sinks his first basket with the MavericksKyrie Irving drains the jump shot early vs. the Clippers for his first points with the Mavericks.39m0:17\\n\\n\\nReggie Bullock pulls up for wide open 3Reggie Bullock is left wide open for the 3-pointer early vs. the Clippers.46m0:21\\n\\n\\n\\nTOP HEADLINESSources: Lakers get PG Russell in 3-team tradeTrail Blazers shipping Hart to Knicks, sources sayUConn loses two straight for first time in 30 yearsNFL's Goodell on officiating: Never been betterNFLPA's Smith: Get rid of 'intrusive' NFL combineAlex Morgan: 'Bizarre' for Saudis to sponsor WWCBills' Hamlin makes appearance to receive awardWWE Hall of Famer Lawler recovering from strokeWhich NFL team trades up to No. 1?NBA TRADE DEADLINE3 P.M. ET ON THURSDAYTrade grades: What to make of the three-team deal involving Russell Westbrook and D'Angelo RussellESPN NBA Insider Kevin Pelton is handing out grades for the biggest moves.2hLayne Murdoch Jr./NBAE via Getty ImagesNBA trade tracker: Grades, details for every deal for the 2022-23 seasonWhich players are finding new homes and which teams are making trades during the free-agency frenzy?59mESPN.comNBA trade deadline: Latest buzz and newsNBA SCOREBOARDWEDNESDAY'S GAMESSee AllCLEAR THE RUNWAYJalen Green soars for lefty alley-oop1h0:19Jarrett Allen skies to drop the hammer2h0:16Once the undisputed greatest, Joe Montana is still working things out15hWright ThompsonSUPER BOWL LVII6:30 P.M. ET ON SUNDAYBarbershop tales, a fistfight and brotherly love: Unto
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "878179f7",
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"# Use this piece of code for testing new custom BeautifulSoup parsers\n",
"\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"\n",
"html_doc = requests.get(\"{INSERT_NEW_URL_HERE}\")\n",
"soup = BeautifulSoup(html_doc.text, 'html.parser')\n",
"\n",
"# Beautiful soup logic to be exported to langchain.document_loaders.webpage.py\n",
"# Example: transcript = soup.select_one(\"td[class='scrtext']\").text\n",
"# BS4 documentation can be found here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/\n",
"\n",
"\"\"\";"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ca330a63",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.1"
}
},
"nbformat": 4,
"nbformat_minor": 5
}