langchain/docs/extras/integrations/document_loaders/geopandas.ipynb
Bagatur b485c3048b
rm base64 images from docs (#10110)
Causing problems indexing docs and notebook images don't render after markdown conversion anyways
2023-09-01 15:15:12 -07:00

189 lines
6.1 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"id": "ca4c8c2a",
"metadata": {},
"source": [
"# Geopandas\n",
"\n",
"[Geopandas](https://geopandas.org/en/stable/index.html) is an open source project to make working with geospatial data in python easier. \n",
"\n",
"GeoPandas extends the datatypes used by pandas to allow spatial operations on geometric types. \n",
"\n",
"Geometric operations are performed by shapely. Geopandas further depends on fiona for file access and matplotlib for plotting.\n",
"\n",
"LLM applications (chat, QA) that utilize geospatial data are an interesting area for exploration."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "00b3bf80",
"metadata": {},
"outputs": [],
"source": [
"! pip install sodapy\n",
"! pip install pandas\n",
"! pip install geopandas"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "cecc9320",
"metadata": {},
"outputs": [],
"source": [
"import ast\n",
"import pandas as pd\n",
"import geopandas as gpd\n",
"from langchain.document_loaders import OpenCityDataLoader"
]
},
{
"cell_type": "markdown",
"id": "04981332",
"metadata": {},
"source": [
"Create a GeoPandas dataframe from [`Open City Data`](https://python.langchain.com/docs/integrations/document_loaders/open_city_data) as an example input."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5e7de46b",
"metadata": {},
"outputs": [],
"source": [
"# Load Open City Data\n",
"dataset = \"tmnf-yvry\" # San Francisco crime data\n",
"loader = OpenCityDataLoader(city_id=\"data.sfgov.org\", dataset_id=dataset, limit=5000)\n",
"docs = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "7cda2e38",
"metadata": {},
"outputs": [],
"source": [
"# Convert list of dictionaries to DataFrame\n",
"df = pd.DataFrame([ast.literal_eval(d.page_content) for d in docs])\n",
"\n",
"# Extract latitude and longitude\n",
"df[\"Latitude\"] = df[\"location\"].apply(lambda loc: loc[\"coordinates\"][1])\n",
"df[\"Longitude\"] = df[\"location\"].apply(lambda loc: loc[\"coordinates\"][0])\n",
"\n",
"# Create geopandas DF\n",
"gdf = gpd.GeoDataFrame(\n",
" df, geometry=gpd.points_from_xy(df.Longitude, df.Latitude), crs=\"EPSG:4326\"\n",
")\n",
"\n",
"# Only keep valid longitudes and latitudes for San Francisco\n",
"gdf = gdf[\n",
" (gdf[\"Longitude\"] >= -123.173825)\n",
" & (gdf[\"Longitude\"] <= -122.281780)\n",
" & (gdf[\"Latitude\"] >= 37.623983)\n",
" & (gdf[\"Latitude\"] <= 37.929824)\n",
"]"
]
},
{
"cell_type": "markdown",
"id": "030a535c",
"metadata": {},
"source": [
"Visiualization of the sample of SF crimne data. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8148a63e",
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"# Load San Francisco map data\n",
"sf = gpd.read_file(\"https://data.sfgov.org/resource/3psu-pn9h.geojson\")\n",
"\n",
"# Plot the San Francisco map and the points\n",
"fig, ax = plt.subplots(figsize=(10, 10))\n",
"sf.plot(ax=ax, color=\"white\", edgecolor=\"black\")\n",
"gdf.plot(ax=ax, color=\"red\", markersize=5)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "a081a9d1",
"metadata": {},
"source": [
"Load GeoPandas dataframe as a `Document` for downstream processing (embedding, chat, etc). \n",
"\n",
"The `geometry` will be the default `page_content` columns, and all other columns are placed in `metadata`.\n",
"\n",
"But, we can specify the `page_content_column`."
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "381a5f7b",
"metadata": {},
"outputs": [],
"source": [
"from langchain.document_loaders import GeoDataFrameLoader\n",
"\n",
"loader = GeoDataFrameLoader(data_frame=gdf, page_content_column=\"geometry\")\n",
"docs = loader.load()"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "74baf6ee",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(page_content='POINT (-122.420084075249 37.7083109744362)', metadata={'pdid': '4133422003074', 'incidntnum': '041334220', 'incident_code': '03074', 'category': 'ROBBERY', 'descript': 'ROBBERY, BODILY FORCE', 'dayofweek': 'Monday', 'date': '2004-11-22T00:00:00.000', 'time': '17:50', 'pddistrict': 'INGLESIDE', 'resolution': 'NONE', 'address': 'GENEVA AV / SANTOS ST', 'x': '-122.420084075249', 'y': '37.7083109744362', 'location': {'type': 'Point', 'coordinates': [-122.420084075249, 37.7083109744362]}, ':@computed_region_26cr_cadq': '9', ':@computed_region_rxqg_mtj9': '8', ':@computed_region_bh8s_q3mv': '309', ':@computed_region_6qbp_sg9q': nan, ':@computed_region_qgnn_b9vv': nan, ':@computed_region_ajp5_b2md': nan, ':@computed_region_yftq_j783': nan, ':@computed_region_p5aj_wyqh': nan, ':@computed_region_fyvs_ahh9': nan, ':@computed_region_6pnf_4xz7': nan, ':@computed_region_jwn9_ihcz': nan, ':@computed_region_9dfj_4gjx': nan, ':@computed_region_4isq_27mq': nan, ':@computed_region_pigm_ib2e': nan, ':@computed_region_9jxd_iqea': nan, ':@computed_region_6ezc_tdp2': nan, ':@computed_region_h4ep_8xdi': nan, ':@computed_region_n4xg_c4py': nan, ':@computed_region_fcz8_est8': nan, ':@computed_region_nqbw_i6c3': nan, ':@computed_region_2dwj_jsy4': nan, 'Latitude': 37.7083109744362, 'Longitude': -122.420084075249})"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs[0]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}