mirror of https://github.com/hwchase17/langchain
airbyte[patch]: init pkg (#18236)
parent
ac1d7d9de8
commit
be8d2ff5f7
@ -0,0 +1,292 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "1f3a5ebf",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# AirbyteLoader"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "35ac77b1-449b-44f7-b8f3-3494d55c286e",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
">[Airbyte](https://github.com/airbytehq/airbyte) is a data integration platform for ELT pipelines from APIs, databases & files to warehouses & lakes. It has the largest catalog of ELT connectors to data warehouses and databases.\n",
|
||||
"\n",
|
||||
"This covers how to load any source from Airbyte into LangChain documents\n",
|
||||
"\n",
|
||||
"## Installation\n",
|
||||
"\n",
|
||||
"In order to use `AirbyteLoader` you need to install the `langchain-airbyte` integration package."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "180c8b74",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"%pip install -qU langchain-airbyte"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "3dd92c62",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Loading Documents\n",
|
||||
"\n",
|
||||
"By default, the `AirbyteLoader` will load any structured data from a stream and output yaml-formatted documents."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"id": "721d9316",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"```yaml\n",
|
||||
"academic_degree: PhD\n",
|
||||
"address:\n",
|
||||
" city: Lauderdale Lakes\n",
|
||||
" country_code: FI\n",
|
||||
" postal_code: '75466'\n",
|
||||
" province: New Jersey\n",
|
||||
" state: Hawaii\n",
|
||||
" street_name: Stoneyford\n",
|
||||
" street_number: '1112'\n",
|
||||
"age: 44\n",
|
||||
"blood_type: \"O\\u2212\"\n",
|
||||
"created_at: '2004-04-02T13:05:27+00:00'\n",
|
||||
"email: bread2099+1@outlook.com\n",
|
||||
"gender: Fluid\n",
|
||||
"height: '1.62'\n",
|
||||
"id: 1\n",
|
||||
"language: Belarusian\n",
|
||||
"name: Moses\n",
|
||||
"nationality: Dutch\n",
|
||||
"occupation: Track Worker\n",
|
||||
"telephone: 1-467-194-2318\n",
|
||||
"title: M.Sc.Tech.\n",
|
||||
"updated_at: '2024-02-27T16:41:01+00:00'\n",
|
||||
"weight: 6\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from langchain_airbyte import AirbyteLoader\n",
|
||||
"\n",
|
||||
"loader = AirbyteLoader(\n",
|
||||
" source=\"source-faker\",\n",
|
||||
" stream=\"users\",\n",
|
||||
" config={\"count\": 10},\n",
|
||||
")\n",
|
||||
"docs = loader.load()\n",
|
||||
"print(docs[0].page_content[:500])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "fca024cb",
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"source": [
|
||||
"You can also specify a custom prompt template for formatting documents:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "9fa002a5",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"My name is Verdie and I am 1.73 meters tall.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from langchain_core.prompts import PromptTemplate\n",
|
||||
"\n",
|
||||
"loader_templated = AirbyteLoader(\n",
|
||||
" source=\"source-faker\",\n",
|
||||
" stream=\"users\",\n",
|
||||
" config={\"count\": 10},\n",
|
||||
" template=PromptTemplate.from_template(\n",
|
||||
" \"My name is {name} and I am {height} meters tall.\"\n",
|
||||
" ),\n",
|
||||
")\n",
|
||||
"docs_templated = loader_templated.load()\n",
|
||||
"print(docs_templated[0].page_content)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "d3e6d887",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Lazy Loading Documents\n",
|
||||
"\n",
|
||||
"One of the powerful features of `AirbyteLoader` is its ability to load large documents from upstream sources. When working with large datasets, the default `.load()` behavior can be slow and memory-intensive. To avoid this, you can use the `.lazy_load()` method to load documents in a more memory-efficient manner."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"id": "684b9187",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Just calling lazy load is quick! This took 0.0001 seconds\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import time\n",
|
||||
"\n",
|
||||
"loader = AirbyteLoader(\n",
|
||||
" source=\"source-faker\",\n",
|
||||
" stream=\"users\",\n",
|
||||
" config={\"count\": 3},\n",
|
||||
" template=PromptTemplate.from_template(\n",
|
||||
" \"My name is {name} and I am {height} meters tall.\"\n",
|
||||
" ),\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"start_time = time.time()\n",
|
||||
"my_iterator = loader.lazy_load()\n",
|
||||
"print(\n",
|
||||
" f\"Just calling lazy load is quick! This took {time.time() - start_time:.4f} seconds\"\n",
|
||||
")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "6b24a64b",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"And you can iterate over documents as they're yielded:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"id": "3e8355d0",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"My name is Andera and I am 1.91 meters tall.\n",
|
||||
"My name is Jody and I am 1.85 meters tall.\n",
|
||||
"My name is Zonia and I am 1.53 meters tall.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"for doc in my_iterator:\n",
|
||||
" print(doc.page_content)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "d1040d81",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"You can also lazy load documents in an async manner with `.alazy_load()`:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"id": "dc5d0911",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"My name is Carmelina and I am 1.74 meters tall.\n",
|
||||
"My name is Ali and I am 1.90 meters tall.\n",
|
||||
"My name is Rochell and I am 1.83 meters tall.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"loader = AirbyteLoader(\n",
|
||||
" source=\"source-faker\",\n",
|
||||
" stream=\"users\",\n",
|
||||
" config={\"count\": 3},\n",
|
||||
" template=PromptTemplate.from_template(\n",
|
||||
" \"My name is {name} and I am {height} meters tall.\"\n",
|
||||
" ),\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"my_async_iterator = loader.alazy_load()\n",
|
||||
"\n",
|
||||
"async for doc in my_async_iterator:\n",
|
||||
" print(doc.page_content)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ba4ede33",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Configuration\n",
|
||||
"\n",
|
||||
"`AirbyteLoader` can be configured with the following options:\n",
|
||||
"\n",
|
||||
"- `source` (str, required): The name of the Airbyte source to load from.\n",
|
||||
"- `stream` (str, required): The name of the stream to load from (Airbyte sources can return multiple streams)\n",
|
||||
"- `config` (dict, required): The configuration for the Airbyte source\n",
|
||||
"- `template` (PromptTemplate, optional): A custom prompt template for formatting documents\n",
|
||||
"- `include_metadata` (bool, optional, default True): Whether to include all fields as metadata in the output documents\n",
|
||||
"\n",
|
||||
"The majority of the configuration will be in `config`, and you can find the specific configuration options in the \"Config field reference\" for each source in the [Airbyte documentation](https://docs.airbyte.com/integrations/)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "2e2ed269",
|
||||
"metadata": {},
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.11.4"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
@ -0,0 +1 @@
|
||||
__pycache__
|
@ -0,0 +1,21 @@
|
||||
MIT License
|
||||
|
||||
Copyright (c) 2024 LangChain, Inc.
|
||||
|
||||
Permission is hereby granted, free of charge, to any person obtaining a copy
|
||||
of this software and associated documentation files (the "Software"), to deal
|
||||
in the Software without restriction, including without limitation the rights
|
||||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
||||
copies of the Software, and to permit persons to whom the Software is
|
||||
furnished to do so, subject to the following conditions:
|
||||
|
||||
The above copyright notice and this permission notice shall be included in all
|
||||
copies or substantial portions of the Software.
|
||||
|
||||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
||||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
||||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
||||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
||||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
||||
SOFTWARE.
|
@ -0,0 +1,56 @@
|
||||
# Development task runner for the langchain-airbyte partner package.
.PHONY: all format lint test tests integration_tests docker_tests help extended_tests

# Default target executed when no arguments are given to make.
all: help

# Define a variable for the test file path.
TEST_FILE ?= tests/unit_tests/
# Integration targets override TEST_FILE via a target-specific variable.
integration_test integration_tests: TEST_FILE = tests/integration_tests/

test tests integration_test integration_tests:
	poetry run pytest $(TEST_FILE)

######################
# LINTING AND FORMATTING
######################

# Define a variable for Python and notebook files.
PYTHON_FILES=.
MYPY_CACHE=.mypy_cache
lint format: PYTHON_FILES=.
# *_diff targets only lint files changed relative to master.
lint_diff format_diff: PYTHON_FILES=$(shell git diff --relative=libs/partners/airbyte --name-only --diff-filter=d master | grep -E '\.py$$|\.ipynb$$')
lint_package: PYTHON_FILES=langchain_airbyte
lint_tests: PYTHON_FILES=tests
# Separate mypy cache so package and test runs don't clobber each other.
lint_tests: MYPY_CACHE=.mypy_cache_test

lint lint_diff lint_package lint_tests:
	poetry run ruff .
	poetry run ruff format $(PYTHON_FILES) --diff
	poetry run ruff --select I $(PYTHON_FILES)
	mkdir -p $(MYPY_CACHE); poetry run mypy $(PYTHON_FILES) --cache-dir $(MYPY_CACHE)

format format_diff:
	poetry run ruff format $(PYTHON_FILES)
	poetry run ruff --select I --fix $(PYTHON_FILES)

spell_check:
	poetry run codespell --toml pyproject.toml

spell_fix:
	poetry run codespell --toml pyproject.toml -w

# Import-smoke-check every package module; $^ expands to all .py files found.
check_imports: $(shell find langchain_airbyte -name '*.py')
	poetry run python ./scripts/check_imports.py $^

######################
# HELP
######################

help:
	@echo '----'
	@echo 'check_imports - check imports'
	@echo 'format - run code formatters'
	@echo 'lint - run linters'
	@echo 'test - run unit tests'
	@echo 'tests - run unit tests'
	@echo 'test TEST_FILE=<test_file> - run all tests in file'
|
@ -0,0 +1,27 @@
|
||||
# langchain-airbyte
|
||||
|
||||
This package contains the LangChain integration with Airbyte
|
||||
|
||||
## Installation
|
||||
|
||||
```bash
|
||||
pip install -U langchain-airbyte
|
||||
```
|
||||
|
||||
The integration package doesn't have any global environment variables that need to be
|
||||
set, but some integrations (e.g. `source-github`) may need credentials passed in.
|
||||
|
||||
## Document Loaders
|
||||
|
||||
The `AirbyteLoader` class exposes a single document loader for Airbyte sources.
|
||||
|
||||
```python
|
||||
from langchain_airbyte import AirbyteLoader
|
||||
|
||||
loader = AirbyteLoader(
|
||||
source="source-faker",
|
||||
stream="users",
|
||||
config={"count": 100},
|
||||
)
|
||||
docs = loader.load()
|
||||
```
|
@ -0,0 +1,3 @@
|
||||
from langchain_airbyte.document_loaders import AirbyteLoader
|
||||
|
||||
__all__ = ["AirbyteLoader"]
|
@ -0,0 +1,121 @@
|
||||
"""Airbyte document loaders."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from typing import (
|
||||
TYPE_CHECKING,
|
||||
Any,
|
||||
AsyncIterator,
|
||||
Dict,
|
||||
Iterator,
|
||||
List,
|
||||
Mapping,
|
||||
Optional,
|
||||
TypeVar,
|
||||
)
|
||||
|
||||
import airbyte as ab
|
||||
from langchain_core.documents import Document
|
||||
from langchain_core.prompts import PromptTemplate
|
||||
from langchain_core.runnables import run_in_executor
|
||||
from langchain_core.vectorstores import VectorStore
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from langchain.text_splitter import TextSplitter
|
||||
from langchain_core.documents import Document
|
||||
|
||||
VST = TypeVar("VST", bound=VectorStore)
|
||||
|
||||
|
||||
class AirbyteLoader:
    """Load LangChain Documents from an Airbyte source stream.

    Records are pulled through PyAirbyte and converted to Documents, either
    using Airbyte's built-in document conversion or a caller-supplied
    ``PromptTemplate`` for the page content.

    Example:
        .. code-block:: python

            from langchain_airbyte import AirbyteLoader

            loader = AirbyteLoader(
                source="github",
                stream="pull_requests",
            )
            documents = loader.lazy_load()
    """

    def __init__(
        self,
        source: str,
        stream: str,
        *,
        config: Optional[Dict] = None,
        include_metadata: bool = True,
        template: Optional[PromptTemplate] = None,
    ):
        # PyAirbyte resolves the connector by name and scopes it to one stream.
        self._airbyte_source = ab.get_source(source, config=config, streams=[stream])
        self._stream = stream
        self._template = template
        self._include_metadata = include_metadata

    def load(self) -> List[Document]:
        """Eagerly load all source data and return it as a list of Documents."""
        return list(self.lazy_load())

    def load_and_split(
        self, text_splitter: Optional[TextSplitter] = None
    ) -> List[Document]:
        """Load Documents and split into chunks. Chunks are returned as Documents.

        Args:
            text_splitter: TextSplitter instance to use for splitting documents.
                Defaults to RecursiveCharacterTextSplitter.

        Returns:
            List of Documents.
        """
        # Imported lazily so `langchain` is only required when splitting is used.
        from langchain.text_splitter import RecursiveCharacterTextSplitter

        splitter: TextSplitter = (
            RecursiveCharacterTextSplitter() if text_splitter is None else text_splitter
        )
        return splitter.split_documents(self.lazy_load())

    def lazy_load(self) -> Iterator[Document]:
        """Lazily yield Documents one record at a time."""
        if self._template:
            # Render each raw record through the caller's prompt template.
            records: Iterator[Mapping[str, Any]] = self._airbyte_source.get_records(
                self._stream
            )
            for record in records:
                record_metadata = dict(record) if self._include_metadata else {}
                yield Document(
                    page_content=self._template.format(**record),
                    metadata=record_metadata,
                )
        else:
            # No template: use Airbyte's own document conversion for the stream.
            for ab_doc in self._airbyte_source.get_documents(self._stream):
                if self._include_metadata:
                    doc_metadata = {
                        **ab_doc.metadata,
                        "_last_modified": ab_doc.last_modified,
                        "_id": ab_doc.id,
                    }
                else:
                    doc_metadata = {}
                yield Document(
                    page_content=ab_doc.content,
                    metadata=doc_metadata,
                )

    async def alazy_load(self) -> AsyncIterator[Document]:
        """Async variant of ``lazy_load``; each ``next()`` runs in an executor."""
        iterator = await run_in_executor(None, self.lazy_load)
        sentinel = object()  # unique marker signalling iterator exhaustion
        while True:
            item = await run_in_executor(None, next, iterator, sentinel)  # type: ignore[call-arg, arg-type]
            if item is sentinel:
                break
            yield item  # type: ignore[misc]
|
File diff suppressed because it is too large
Load Diff
@ -0,0 +1,85 @@
|
||||
[tool.poetry]
|
||||
name = "langchain-airbyte"
|
||||
version = "0.1.0"
|
||||
description = "An integration package connecting Airbyte and LangChain"
|
||||
authors = []
|
||||
readme = "README.md"
|
||||
repository = "https://github.com/langchain-ai/langchain"
|
||||
license = "MIT"
|
||||
|
||||
[tool.poetry.urls]
|
||||
"Source Code" = "https://github.com/langchain-ai/langchain/tree/master/libs/partners/airbyte"
|
||||
|
||||
[tool.poetry.dependencies]
|
||||
python = "^3.10"
|
||||
langchain-core = "^0.1"
|
||||
airbyte = "^0.7.0"
|
||||
|
||||
[tool.poetry.group.test]
|
||||
optional = true
|
||||
|
||||
[tool.poetry.group.test.dependencies]
|
||||
pytest = "^7.4.3"
|
||||
pytest-asyncio = "^0.23.2"
|
||||
langchain-core = { path = "../../core", develop = true }
|
||||
|
||||
[tool.poetry.group.codespell]
|
||||
optional = true
|
||||
|
||||
[tool.poetry.group.codespell.dependencies]
|
||||
codespell = "^2.2.6"
|
||||
|
||||
[tool.poetry.group.test_integration]
|
||||
optional = true
|
||||
|
||||
[tool.poetry.group.test_integration.dependencies]
|
||||
|
||||
[tool.poetry.group.lint]
|
||||
optional = true
|
||||
|
||||
[tool.poetry.group.lint.dependencies]
|
||||
ruff = "^0.1.8"
|
||||
|
||||
[tool.poetry.group.typing.dependencies]
|
||||
mypy = "^1.7.1"
|
||||
langchain-core = { path = "../../core", develop = true }
|
||||
langchain = "^0.1.9"
|
||||
|
||||
[tool.poetry.group.dev]
|
||||
optional = true
|
||||
|
||||
[tool.poetry.group.dev.dependencies]
|
||||
langchain-core = { path = "../../core", develop = true }
|
||||
|
||||
[tool.ruff.lint]
|
||||
select = [
|
||||
"E", # pycodestyle
|
||||
"F", # pyflakes
|
||||
"I", # isort
|
||||
"T201", # print
|
||||
]
|
||||
|
||||
[tool.mypy]
|
||||
disallow_untyped_defs = "True"
|
||||
|
||||
[tool.coverage.run]
|
||||
omit = ["tests/*"]
|
||||
|
||||
[build-system]
|
||||
requires = ["poetry-core>=1.0.0"]
|
||||
build-backend = "poetry.core.masonry.api"
|
||||
|
||||
[tool.pytest.ini_options]
|
||||
# --strict-markers will raise errors on unknown marks.
|
||||
# https://docs.pytest.org/en/7.1.x/how-to/mark.html#raising-errors-on-unknown-marks
|
||||
#
|
||||
# https://docs.pytest.org/en/7.1.x/reference/reference.html
|
||||
# --strict-config any warnings encountered while parsing the `pytest`
|
||||
# section of the configuration file raise errors.
|
||||
addopts = "--strict-markers --strict-config --durations=5"
|
||||
# Registering custom markers.
|
||||
# https://docs.pytest.org/en/7.1.x/example/markers.html#registering-markers
|
||||
markers = [
|
||||
"compile: mark placeholder test used to compile integration tests without running them",
|
||||
]
|
||||
asyncio_mode = "auto"
|
@ -0,0 +1,17 @@
|
||||
import sys
import traceback
from importlib.machinery import SourceFileLoader


def check_files(files: list) -> bool:
    """Try to import each file; return True if any import failed.

    Failures are reported to stdout (file name, traceback, blank separator).
    """
    has_failure = False
    for file in files:
        try:
            SourceFileLoader("x", file).load_module()
        except Exception:
            # BUG FIX: original assigned to a misspelled `has_faillure`,
            # so the script always exited 0 even when imports failed.
            has_failure = True
            print(file)  # noqa: T201
            traceback.print_exc()
            print()  # noqa: T201
    return has_failure


if __name__ == "__main__":
    sys.exit(1 if check_files(sys.argv[1:]) else 0)
|
@ -0,0 +1,27 @@
|
||||
#!/bin/bash
#
# This script searches for lines starting with "import pydantic" or "from pydantic"
# in tracked files within a Git repository.
#
# Usage: ./scripts/check_pydantic.sh /path/to/repository

# Check if a path argument is provided
if [ $# -ne 1 ]; then
  echo "Usage: $0 /path/to/repository"
  exit 1
fi

repository_path="$1"

# Search for lines matching the pattern within the specified repository.
# NOTE(review): `git grep` exits non-zero when nothing matches; that exit code
# is deliberately ignored here — only the captured output is inspected below.
result=$(git -C "$repository_path" grep -E '^import pydantic|^from pydantic')

# Check if any matching lines were found; fail the build with guidance if so.
if [ -n "$result" ]; then
  echo "ERROR: The following lines need to be updated:"
  echo "$result"
  echo "Please replace the code with an import from langchain_core.pydantic_v1."
  echo "For example, replace 'from pydantic import BaseModel'"
  echo "with 'from langchain_core.pydantic_v1 import BaseModel'"
  exit 1
fi
|
@ -0,0 +1,18 @@
|
||||
#!/bin/bash
# Fail if this package imports from langchain, langchain_experimental,
# or langchain_community at module level.

set -eu

# Initialize a variable to keep track of errors
errors=0

# make sure not importing from langchain, langchain_experimental, or langchain_community
# NOTE: grep exits non-zero when it finds no match. Because the grep is the
# non-final command of an `&&` list, `set -e` does not abort on a clean tree;
# the counter is only incremented when a forbidden import IS found.
git --no-pager grep '^from langchain\.' . && errors=$((errors+1))
git --no-pager grep '^from langchain_experimental\.' . && errors=$((errors+1))
git --no-pager grep '^from langchain_community\.' . && errors=$((errors+1))

# Decide on an exit status based on the errors
if [ "$errors" -gt 0 ]; then
  exit 1
else
  exit 0
fi
|
@ -0,0 +1,7 @@
|
||||
import pytest
|
||||
|
||||
|
||||
@pytest.mark.compile
def test_placeholder() -> None:
    """Compile-only placeholder: lets CI build integration tests without running them."""
|
@ -0,0 +1,28 @@
|
||||
"""Test the Airbyte document loader against a live GitHub source."""
|
||||
|
||||
import os
|
||||
|
||||
from langchain_airbyte import AirbyteLoader
|
||||
|
||||
GITHUB_TOKEN = os.environ.get("GITHUB_TOKEN")
|
||||
|
||||
|
||||
def test_load_github() -> None:
    """Integration test: load GitHub issues through the Airbyte connector."""
    loader = AirbyteLoader(
        source="source-github",
        stream="issues",
        config={
            "repositories": ["airbytehq/quickstarts"],
            "credentials": {"personal_access_token": GITHUB_TOKEN},
        },
    )
    docs = loader.load()
    assert docs
    # at least one issue should carry a non-empty body in its metadata
    assert any(
        doc.metadata.get("body") for doc in docs
    ), "No documents with body found"
|
@ -0,0 +1,77 @@
|
||||
from langchain_core.prompts import PromptTemplate
|
||||
|
||||
from langchain_airbyte import AirbyteLoader
|
||||
|
||||
|
||||
def test_initialization() -> None:
    """Smoke test: constructing the loader should not raise."""
    AirbyteLoader(source="source-faker", stream="users", config={"count": 3})
|
||||
|
||||
|
||||
def test_load() -> None:
    """Eager load should return exactly the configured number of records."""
    loader = AirbyteLoader(source="source-faker", stream="users", config={"count": 5})
    assert len(loader.load()) == 5
|
||||
|
||||
|
||||
def test_lazy_load() -> None:
    """Lazy load should yield one Document per faker record."""
    loader = AirbyteLoader(source="source-faker", stream="users", config={"count": 3})
    assert sum(1 for _ in loader.lazy_load()) == 3
|
||||
|
||||
|
||||
async def test_alazy_load() -> None:
    """Async lazy load should yield one Document per faker record."""
    loader = AirbyteLoader(source="source-faker", stream="users", config={"count": 3})
    doc_count = 0
    async for _ in loader.alazy_load():
        doc_count += 1
    assert doc_count == 3
|
||||
|
||||
|
||||
def test_load_with_template() -> None:
    """Page content should be rendered through the supplied prompt template."""
    loader = AirbyteLoader(
        source="source-faker",
        stream="users",
        config={"count": 3},
        template=PromptTemplate.from_template("My name is {name}"),
    )
    docs = loader.load()
    assert len(docs) == 3
    for doc in docs:
        assert doc.page_content.startswith("My name is ")
        # metadata should still carry the raw record fields
        assert doc.metadata["name"]  # should have a name
|
||||
|
||||
|
||||
def test_load_no_metadata() -> None:
    """With include_metadata=False every Document has empty metadata."""
    loader = AirbyteLoader(
        source="source-faker",
        stream="users",
        config={"count": 3},
        include_metadata=False,
    )
    docs = loader.load()
    assert len(docs) == 3
    assert all(doc.metadata == {} for doc in docs)
|
@ -0,0 +1,9 @@
|
||||
from langchain_airbyte import __all__
|
||||
|
||||
EXPECTED_ALL = [
|
||||
"AirbyteLoader",
|
||||
]
|
||||
|
||||
|
||||
def test_all_imports() -> None:
    """The package should export exactly the expected public names."""
    assert sorted(__all__) == sorted(EXPECTED_ALL)
|
Loading…
Reference in New Issue