{ "cells": [ { "cell_type": "markdown", "id": "1f3a5ebf", "metadata": {}, "source": [ "# Airbyte JSON" ] }, { "cell_type": "markdown", "id": "35ac77b1-449b-44f7-b8f3-3494d55c286e", "metadata": {}, "source": [ ">[Airbyte](https://github.com/airbytehq/airbyte) is a data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes. It has the largest catalog of ELT connectors to data warehouses and databases." ] }, { "cell_type": "markdown", "id": "1fe72234-3110-4c07-a766-3dc505dd25cc", "metadata": {}, "source": [ "This notebook covers how to load data from any Airbyte source into a local JSON file that can be read in as a document.\n", "\n", "Prereqs:\n", "Have Docker Desktop installed.\n", "\n", "Steps:\n", "\n", "1) Clone Airbyte from GitHub - `git clone https://github.com/airbytehq/airbyte.git`\n", "\n", "2) Switch into the Airbyte directory - `cd airbyte`\n", "\n", "3) Start Airbyte - `docker compose up`\n", "\n", "4) In your browser, visit http://localhost:8000. You will be asked for a username and password. By default, the username is `airbyte` and the password is `password`.\n", "\n", "5) Set up any source you wish.\n", "\n", "6) Set the destination to Local JSON with a destination path of your choice - let's say `/json_data`. Set up a manual sync.\n", "\n", "7) Run the connection.\n", "\n", "8) To see which files were created, navigate to `file:///tmp/airbyte_local`.\n", "\n", "9) Find your data and copy its path. Pass that path to the loader below. 
It should start with `/tmp/airbyte_local`\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "180c8b74", "metadata": {}, "outputs": [], "source": [ "from langchain.document_loaders import AirbyteJSONLoader" ] }, { "cell_type": "code", "execution_count": 2, "id": "4af10665", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "_airbyte_raw_pokemon.jsonl\n" ] } ], "source": [ "!ls /tmp/airbyte_local/json_data/" ] }, { "cell_type": "code", "execution_count": 3, "id": "721d9316", "metadata": {}, "outputs": [], "source": [ "loader = AirbyteJSONLoader('/tmp/airbyte_local/json_data/_airbyte_raw_pokemon.jsonl')" ] }, { "cell_type": "code", "execution_count": 4, "id": "9858b946", "metadata": {}, "outputs": [], "source": [ "data = loader.load()" ] }, { "cell_type": "code", "execution_count": 8, "id": "fca024cb", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "abilities: \n", "ability: \n", "name: blaze\n", "url: https://pokeapi.co/api/v2/ability/66/\n", "\n", "is_hidden: False\n", "slot: 1\n", "\n", "\n", "ability: \n", "name: solar-power\n", "url: https://pokeapi.co/api/v2/ability/94/\n", "\n", "is_hidden: True\n", "slot: 3\n", "\n", "base_experience: 267\n", "forms: \n", "name: charizard\n", "url: https://pokeapi.co/api/v2/pokemon-form/6/\n", "\n", "game_indices: \n", "game_index: 180\n", "version: \n", "name: red\n", "url: https://pokeapi.co/api/v2/version/1/\n", "\n", "\n", "\n", "game_index: 180\n", "version: \n", "name: blue\n", "url: https://pokeapi.co/api/v2/version/2/\n", "\n", "\n", "\n", "game_index: 180\n", "version: \n", "n\n" ] } ], "source": [ "print(data[0].page_content[:500])" ] }, { "cell_type": "code", "execution_count": null, "id": "9fa002a5", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", 
"version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 5 }