Document loader for Cube Semantic Layer (#6882)

### Description

This pull request introduces the "Cube Semantic Layer" document loader,
which demonstrates the retrieval of Cube's data model metadata in a
format suitable for passing to LLMs as embeddings. This enhancement aims
to provide contextual information and improve the understanding of data.

Twitter handle:
@the_cube_dev

---------

Co-authored-by: rlm <pexpresss31@gmail.com>
pull/7238/head
Mike Nitsenko 1 year ago committed by GitHub
parent e533da8bf2
commit d669b9ece9

@ -0,0 +1,118 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Cube Semantic Layer"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook demonstrates the process of retrieving Cube's data model metadata in a format suitable for passing to LLMs as embeddings, thereby enhancing contextual information."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### About Cube"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"[Cube](https://cube.dev/) is the Semantic Layer for building data apps. It helps data engineers and application developers access data from modern data stores, organize it into consistent definitions, and deliver it to every application."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Cube's data model provides structure and definitions that serve as context for an LLM to understand the data and generate correct queries. The LLM doesn't need to navigate complex joins and metric calculations, because Cube abstracts those away and provides a simple interface that operates on business-level terminology instead of SQL table and column names. This simplification makes the LLM less error-prone and helps it avoid hallucinations."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"`CubeSemanticLoader` requires two arguments:\n",
"\n",
"| Input Parameter | Description |\n",
"| --- | --- |\n",
"| `cube_api_url` | The URL of your Cube deployment's REST API metadata endpoint. Please refer to the [Cube documentation](https://cube.dev/docs/http-api/rest#configuration-base-path) for more information on configuring the base path. |\n",
"| `cube_api_token` | The authentication token generated from your Cube API secret. Please refer to the [Cube documentation](https://cube.dev/docs/security#generating-json-web-tokens-jwt) for instructions on generating JSON Web Tokens (JWT). |\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import jwt\n",
"from langchain.document_loaders import CubeSemanticLoader\n",
"\n",
"api_url = \"https://api-example.gcp-us-central1.cubecloudapp.dev/cubejs-api/v1/meta\"\n",
"cubejs_api_secret = \"api-secret-here\"\n",
"security_context = {}\n",
"# Read more about security context here: https://cube.dev/docs/security\n",
"api_token = jwt.encode(security_context, cubejs_api_secret, algorithm=\"HS256\")\n",
"\n",
"loader = CubeSemanticLoader(api_url, api_token)\n",
"\n",
"documents = loader.load()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Returns:\n",
"\n",
"A list of documents with the following attributes:\n",
"\n",
"- `page_content`\n",
"- `metadata`\n",
"  - `table_name`\n",
"  - `column_name`\n",
"  - `column_data_type`\n",
"  - `column_title`\n",
"  - `column_description`"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"> page_content='table name: orders_view, column name: orders_view.total_amount, column data type: number, column title: Orders View Total Amount, column description: None' metadata={'table_name': 'orders_view', 'column_name': 'orders_view.total_amount', 'column_data_type': 'number', 'column_title': 'Orders View Total Amount', 'column_description': 'None'}"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
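
The notebook builds `api_token` with `jwt.encode(security_context, cubejs_api_secret, algorithm="HS256")`. As a rough illustration of what an HS256-signed token looks like, the sketch below assembles one with the standard library only. The helper names `_b64url` and `make_hs256_token`, the placeholder secret, and the one-hour `exp` claim are assumptions made for this example and are not part of the pull request; in practice the notebook's `jwt.encode` call is the simpler route.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as used for JWT segments."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_hs256_token(payload: dict, secret: str) -> str:
    """Build a compact JWT signed with HMAC-SHA256 (hypothetical helper)."""
    header = {"alg": "HS256", "typ": "JWT"}
    segments = [
        _b64url(json.dumps(header, separators=(",", ":")).encode("utf-8")),
        _b64url(json.dumps(payload, separators=(",", ":")).encode("utf-8")),
    ]
    # The signature covers "<header>.<payload>" exactly as transmitted.
    signing_input = ".".join(segments).encode("ascii")
    signature = hmac.new(
        secret.encode("utf-8"), signing_input, hashlib.sha256
    ).digest()
    return ".".join(segments + [_b64url(signature)])


# Placeholder values: the secret and the expiry claim are assumptions.
security_context = {"exp": int(time.time()) + 3600}
api_token = make_hs256_token(security_context, "api-secret-here")
```

The resulting string has three dot-separated segments (header, payload, signature) and would be passed to the loader as `cube_api_token`.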

@ -29,6 +29,7 @@ from langchain.document_loaders.college_confidential import CollegeConfidentialL
from langchain.document_loaders.confluence import ConfluenceLoader
from langchain.document_loaders.conllu import CoNLLULoader
from langchain.document_loaders.csv_loader import CSVLoader, UnstructuredCSVLoader
from langchain.document_loaders.cube_semantic import CubeSemanticLoader
from langchain.document_loaders.dataframe import DataFrameLoader
from langchain.document_loaders.diffbot import DiffbotLoader
from langchain.document_loaders.directory import DirectoryLoader
@ -175,6 +176,7 @@ __all__ = [
"CoNLLULoader",
"CollegeConfidentialLoader",
"ConfluenceLoader",
"CubeSemanticLoader",
"DataFrameLoader",
"DiffbotLoader",
"DirectoryLoader",

@ -0,0 +1,78 @@
from typing import List

import requests

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader


class CubeSemanticLoader(BaseLoader):
    """Load Cube semantic layer metadata."""

    def __init__(
        self,
        cube_api_url: str,
        cube_api_token: str,
    ):
        self.cube_api_url = cube_api_url
        """Use the REST API of your Cube's deployment.
        Please find out more information here:
        https://cube.dev/docs/http-api/rest#configuration-base-path
        """
        self.cube_api_token = cube_api_token
        """Authentication tokens are generated based on your Cube's API secret.
        Please find out more information here:
        https://cube.dev/docs/security#generating-json-web-tokens-jwt
        """

    def load(self) -> List[Document]:
        """Makes a call to Cube's REST API metadata endpoint.

        Only cubes whose type is "view" are included.

        Returns:
            A list of documents, one per measure or dimension, with attributes:
                - page_content: a one-line summary of the fields listed below
                - metadata
                    - table_name
                    - column_name
                    - column_data_type
                    - column_title
                    - column_description
        """
        headers = {
            "Content-Type": "application/json",
            "Authorization": self.cube_api_token,
        }

        response = requests.get(self.cube_api_url, headers=headers)
        response.raise_for_status()
        raw_meta_json = response.json()
        cubes = raw_meta_json.get("cubes", [])
        docs = []

        for cube in cubes:
            # Skip plain cubes; only views are exposed as documents.
            if cube.get("type") != "view":
                continue

            cube_name = cube.get("name")

            measures = cube.get("measures", [])
            dimensions = cube.get("dimensions", [])

            for item in measures + dimensions:
                # Missing fields are stringified, so an absent value becomes "None".
                metadata = dict(
                    table_name=str(cube_name),
                    column_name=str(item.get("name")),
                    column_data_type=str(item.get("type")),
                    column_title=str(item.get("title")),
                    column_description=str(item.get("description")),
                )

                page_content = f"table name: {str(cube_name)}, "
                page_content += f"column name: {str(item.get('name'))}, "
                page_content += f"column data type: {str(item.get('type'))}, "
                page_content += f"column title: {str(item.get('title'))}, "
                page_content += f"column description: {str(item.get('description'))}"

                docs.append(Document(page_content=page_content, metadata=metadata))

        return docs
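
To make the flattening step in `load()` easier to follow without a running Cube deployment, here is a minimal, dependency-free sketch of the same transformation. The function name `meta_to_rows`, the tuple return shape, and the sample payload are assumptions for illustration only; the loader itself returns `Document` objects and fetches the payload over HTTP.

```python
from typing import Any, Dict, List, Tuple


def meta_to_rows(raw_meta: Dict[str, Any]) -> List[Tuple[str, Dict[str, str]]]:
    """Flatten a /meta-style payload into (page_content, metadata) pairs."""
    rows = []
    for cube in raw_meta.get("cubes", []):
        # Mirror the loader: anything that is not a view is skipped.
        if cube.get("type") != "view":
            continue
        for item in cube.get("measures", []) + cube.get("dimensions", []):
            # str() turns a missing field into the literal string "None".
            metadata = {
                "table_name": str(cube.get("name")),
                "column_name": str(item.get("name")),
                "column_data_type": str(item.get("type")),
                "column_title": str(item.get("title")),
                "column_description": str(item.get("description")),
            }
            page_content = ", ".join(
                f"{key.replace('_', ' ')}: {value}"
                for key, value in metadata.items()
            )
            rows.append((page_content, metadata))
    return rows


# Hypothetical payload: one plain cube (skipped) and one view (kept).
sample = {
    "cubes": [
        {"type": "cube", "name": "orders", "measures": [{"name": "orders.count"}]},
        {
            "type": "view",
            "name": "orders_view",
            "measures": [
                {"name": "orders_view.count", "type": "number",
                 "title": "Orders View Count"}
            ],
            "dimensions": [
                {"name": "orders_view.status", "type": "string",
                 "title": "Orders View Status", "description": "Order status"}
            ],
        },
    ]
}
rows = meta_to_rows(sample)
```

Each entry in `rows` corresponds to one measure or dimension of a view, which matches the one-document-per-column shape described in the `load()` docstring.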

@ -0,0 +1,86 @@
from typing import List
from unittest import TestCase
from unittest.mock import MagicMock, patch

import requests

from langchain.docstore.document import Document
from langchain.document_loaders import CubeSemanticLoader


class TestCubeSemanticLoader(TestCase):
    @patch.object(requests, "get")
    def test_load_success(self, mock_get: MagicMock) -> None:
        # Arrange
        cube_api_url: str = "https://example.com/cube_api"
        cube_api_token: str = "abc123"
        mock_response: MagicMock = MagicMock()
        mock_response.status_code = 200
        mock_response_json: dict = {
            "cubes": [
                {
                    "type": "view",
                    "name": "cube1",
                    "measures": [{"type": "sum", "name": "sales", "title": "Sales"}],
                    "dimensions": [
                        {
                            "type": "string",
                            "name": "product_name",
                            "title": "Product Name",
                        }
                    ],
                }
            ]
        }
        mock_response.json.return_value = mock_response_json
        mock_get.return_value = mock_response

        expected_docs: List[Document] = [
            Document(
                page_content=(
                    "table name: cube1, "
                    "column name: sales, "
                    "column data type: sum, "
                    "column title: Sales, "
                    "column description: None"
                ),
                metadata={
                    "table_name": "cube1",
                    "column_name": "sales",
                    "column_data_type": "sum",
                    "column_title": "Sales",
                    "column_description": "None",
                },
            ),
            Document(
                page_content=(
                    "table name: cube1, "
                    "column name: product_name, "
                    "column data type: string, "
                    "column title: Product Name, "
                    "column description: None"
                ),
                metadata={
                    "table_name": "cube1",
                    "column_name": "product_name",
                    "column_data_type": "string",
                    "column_title": "Product Name",
                    "column_description": "None",
                },
            ),
        ]

        loader: CubeSemanticLoader = CubeSemanticLoader(cube_api_url, cube_api_token)

        # Act
        result: List[Document] = loader.load()

        # Assert
        self.assertEqual(result, expected_docs)
        mock_get.assert_called_once_with(
            cube_api_url,
            headers={
                "Content-Type": "application/json",
                "Authorization": cube_api_token,
            },
        )