langchain/docs/docs/integrations/document_loaders/glue_catalog.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "MwTWzDxYgbrR"
   },
   "source": [
    "# Glue Catalog\n",
    "\n",
    "\n",
    "The [AWS Glue Data Catalog](https://docs.aws.amazon.com/en_en/glue/latest/dg/catalog-and-crawler.html) is a centralized metadata repository that allows you to manage, access, and share metadata about your data stored in AWS. It acts as a metadata store for your data assets, enabling various AWS services and your applications to query and connect to the data they need efficiently.\n",
    "\n",
    "When you define data sources, transformations, and targets in AWS Glue, the metadata about these elements is stored in the Data Catalog. This includes information about data locations, schema definitions, runtime metrics, and more. It supports various data store types, such as Amazon S3, Amazon RDS, Amazon Redshift, and external databases compatible with JDBC. It is also directly integrated with Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR, allowing these services to directly access and query the data.\n",
    "\n",
    "The Langchain GlueCatalogLoader will get the schema of all tables inside the given Glue database in the same format as Pandas dtype."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting up\n",
    "\n",
    "- Follow [instructions to set up an AWS accoung](https://docs.aws.amazon.com/athena/latest/ug/setting-up.html).\n",
    "- Install the boto3 library: `pip install boto3`\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "076NLjfngoWJ"
   },
   "outputs": [],
   "source": [
    "from langchain_community.document_loaders.glue_catalog import GlueCatalogLoader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "XpMRQwU9gu44"
   },
   "outputs": [],
   "source": [
    "database_name = \"my_database\"\n",
    "profile_name = \"my_profile\"\n",
    "\n",
    "loader = GlueCatalogLoader(\n",
    "    database=database_name,\n",
    "    profile_name=profile_name,\n",
    ")\n",
    "\n",
    "schemas = loader.load()\n",
    "print(schemas)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example with table filtering\n",
    "\n",
    "Table filtering allows you to selectively retrieve schema information for a specific subset of tables within a Glue database. Instead of loading the schemas for all tables, you can use the `table_filter` argument to specify exactly which tables you're interested in."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.document_loaders.glue_catalog import GlueCatalogLoader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "database_name = \"my_database\"\n",
    "profile_name = \"my_profile\"\n",
    "table_filter = [\"table1\", \"table2\", \"table3\"]\n",
    "\n",
    "loader = GlueCatalogLoader(\n",
    "    database=database_name, profile_name=profile_name, table_filter=table_filter\n",
    ")\n",
    "\n",
    "schemas = loader.load()\n",
    "print(schemas)"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}