mirror of
https://github.com/hwchase17/langchain
synced 2024-11-08 07:10:35 +00:00
71f98db2fe
- Description: Added newspaper3k based news article loader. Provide a list of urls. - Issue: N/A - Dependencies: newspaper3k, - Tag maintainer: @rlancemartin , @eyurtsev - Twitter handle: @ruze --------- Co-authored-by: Bagatur <baskaryan@gmail.com>
193 lines
7.6 KiB
Plaintext
193 lines
7.6 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "2dfc4698",
|
|
"metadata": {},
|
|
"source": [
|
|
"# News URL\n",
|
|
"\n",
|
|
"This covers how to load HTML news articles from a list of URLs into a document format that we can use downstream."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"id": "16c3699e",
|
|
"metadata": {
|
|
"ExecuteTime": {
|
|
"end_time": "2023-08-02T21:18:18.886031400Z",
|
|
"start_time": "2023-08-02T21:18:17.682345Z"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from langchain.document_loaders import NewsURLLoader"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"id": "836fbac1",
|
|
"metadata": {
|
|
"ExecuteTime": {
|
|
"end_time": "2023-08-02T21:18:18.895539800Z",
|
|
"start_time": "2023-08-02T21:18:18.895539800Z"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"urls = [\n",
|
|
" \"https://www.bbc.com/news/world-us-canada-66388172\",\n",
|
|
" \"https://www.bbc.com/news/entertainment-arts-66384971\",\n",
|
|
"]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "33089aba-ff74-4d00-8f40-9449c29587cc",
|
|
"metadata": {},
|
|
"source": [
|
|
"Pass in urls to load them into Documents"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"id": "00f46fda",
|
|
"metadata": {
|
|
"ExecuteTime": {
|
|
"end_time": "2023-08-02T21:18:19.227074500Z",
|
|
"start_time": "2023-08-02T21:18:18.895539800Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"First article: page_content='In testimony to the congressional committee examining the 6 January riot, Mrs Powell said she did not review all of the many claims of election fraud she made, telling them that \"no reasonable person\" would view her claims as fact. Neither she nor her representatives have commented.' metadata={'title': 'Donald Trump indictment: What do we know about the six co-conspirators?', 'link': 'https://www.bbc.com/news/world-us-canada-66388172', 'authors': [], 'language': 'en', 'description': 'Six people accused of helping Mr Trump undermine the election have been described by prosecutors.', 'publish_date': None}\n",
|
|
"\n",
|
|
"Second article: page_content='Ms Williams added: \"If there\\'s anything that I can do in my power to ensure that dancers or singers or whoever decides to work with her don\\'t have to go through that same experience, I\\'m going to do that.\"' metadata={'title': \"Lizzo dancers Arianna Davis and Crystal Williams: 'No one speaks out, they are scared'\", 'link': 'https://www.bbc.com/news/entertainment-arts-66384971', 'authors': [], 'language': 'en', 'description': 'The US pop star is being sued for sexual harassment and fat-shaming but has yet to comment.', 'publish_date': None}\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"loader = NewsURLLoader(urls=urls)\n",
|
|
"data = loader.load()\n",
|
|
"print(\"First article: \", data[0])\n",
|
|
"print(\"\\nSecond article: \", data[1])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"source": [
|
|
"Use nlp=True to run nlp analysis and generate keywords + summary"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"id": "98ac26c488315bff"
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"id": "b68a26b3",
|
|
"metadata": {
|
|
"ExecuteTime": {
|
|
"end_time": "2023-08-02T21:18:19.585758200Z",
|
|
"start_time": "2023-08-02T21:18:19.227074500Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"First article: page_content='In testimony to the congressional committee examining the 6 January riot, Mrs Powell said she did not review all of the many claims of election fraud she made, telling them that \"no reasonable person\" would view her claims as fact. Neither she nor her representatives have commented.' metadata={'title': 'Donald Trump indictment: What do we know about the six co-conspirators?', 'link': 'https://www.bbc.com/news/world-us-canada-66388172', 'authors': [], 'language': 'en', 'description': 'Six people accused of helping Mr Trump undermine the election have been described by prosecutors.', 'publish_date': None, 'keywords': ['powell', 'know', 'donald', 'trump', 'review', 'indictment', 'telling', 'view', 'reasonable', 'person', 'testimony', 'coconspirators', 'riot', 'representatives', 'claims'], 'summary': 'In testimony to the congressional committee examining the 6 January riot, Mrs Powell said she did not review all of the many claims of election fraud she made, telling them that \"no reasonable person\" would view her claims as fact.\\nNeither she nor her representatives have commented.'}\n",
|
|
"\n",
|
|
"Second article: page_content='Ms Williams added: \"If there\\'s anything that I can do in my power to ensure that dancers or singers or whoever decides to work with her don\\'t have to go through that same experience, I\\'m going to do that.\"' metadata={'title': \"Lizzo dancers Arianna Davis and Crystal Williams: 'No one speaks out, they are scared'\", 'link': 'https://www.bbc.com/news/entertainment-arts-66384971', 'authors': [], 'language': 'en', 'description': 'The US pop star is being sued for sexual harassment and fat-shaming but has yet to comment.', 'publish_date': None, 'keywords': ['davis', 'lizzo', 'singers', 'experience', 'crystal', 'ensure', 'arianna', 'theres', 'williams', 'power', 'going', 'dancers', 'im', 'speaks', 'work', 'ms', 'scared'], 'summary': 'Ms Williams added: \"If there\\'s anything that I can do in my power to ensure that dancers or singers or whoever decides to work with her don\\'t have to go through that same experience, I\\'m going to do that.\"'}\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"loader = NewsURLLoader(urls=urls, nlp=True)\n",
|
|
"data = loader.load()\n",
|
|
"print(\"First article: \", data[0])\n",
|
|
"print(\"\\nSecond article: \", data[1])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": "['powell',\n 'know',\n 'donald',\n 'trump',\n 'review',\n 'indictment',\n 'telling',\n 'view',\n 'reasonable',\n 'person',\n 'testimony',\n 'coconspirators',\n 'riot',\n 'representatives',\n 'claims']"
|
|
},
|
|
"execution_count": 5,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"data[0].metadata['keywords']"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"ExecuteTime": {
|
|
"end_time": "2023-08-02T21:18:19.585758200Z",
|
|
"start_time": "2023-08-02T21:18:19.585758200Z"
|
|
}
|
|
},
|
|
"id": "ae37e004e0284b1d"
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": "'In testimony to the congressional committee examining the 6 January riot, Mrs Powell said she did not review all of the many claims of election fraud she made, telling them that \"no reasonable person\" would view her claims as fact.\\nNeither she nor her representatives have commented.'"
|
|
},
|
|
"execution_count": 6,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"data[0].metadata['summary']"
|
|
],
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"ExecuteTime": {
|
|
"end_time": "2023-08-02T21:18:19.598966800Z",
|
|
"start_time": "2023-08-02T21:18:19.594950200Z"
|
|
}
|
|
},
|
|
"id": "7676155fb175e53e"
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.10.6"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|