Deep Lake retriever example analyzing Twitter the-algorithm source code (#2602)

Improvements to Deep Lake Vector Store
- much faster view loading of embeddings after filters with
`fetch_chunks=True`
- 2x faster ingestion
- use np.float32 for embeddings to save 2x storage, LZ4 compression for
text and metadata storage (saves up to 4x storage for text data)
- user defined functions as filters
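
As a rough illustration of the storage claim above (a sketch, not a benchmark of Deep Lake itself): NumPy's default float type is 64-bit, so casting embeddings to `np.float32` halves the raw byte count. The array shape and the variable names below are made up for the example.

```python
import numpy as np

# Hypothetical batch of 1,000 embeddings with 1,536 dimensions each.
rng = np.random.default_rng(0)
emb64 = rng.random((1000, 1536))       # NumPy defaults to float64
emb32 = emb64.astype(np.float32)       # same values stored as 32-bit floats

print(emb64.nbytes, emb32.nbytes)      # raw size in bytes before compression
ratio = emb64.nbytes / emb32.nbytes    # storage ratio between the two dtypes
max_err = float(np.abs(emb64 - emb32).max())  # rounding error from the cast
```

The cast costs a small rounding error per value, which is usually negligible for similarity search.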

Docs
- Added a full retriever example for analyzing the Twitter the-algorithm
source code with GPT-4
- Added a use case for code analysis (please let us know your thoughts
on how we can improve it)

---------

Co-authored-by: Davit Buniatyan <d@activeloop.ai>
Davit Buniatyan 1 year ago committed by GitHub
parent 5c0c5fafb2
commit aaac7071a3

@ -8,8 +8,9 @@ This page covers how to use the Deep Lake ecosystem within LangChain.
## More Resources
1. [Ultimate Guide to LangChain & Deep Lake: Build ChatGPT to Answer Questions on Your Financial Data](https://www.activeloop.ai/resources/ultimate-guide-to-lang-chain-deep-lake-build-chat-gpt-to-answer-questions-on-your-financial-data/)
1. Here is [whitepaper](https://www.deeplake.ai/whitepaper) and [academic paper](https://arxiv.org/pdf/2209.10785.pdf) for Deep Lake
2. Here is a set of additional resources available for review: [Deep Lake](https://github.com/activeloopai/deeplake), [Getting Started](https://docs.activeloop.ai/getting-started) and [Tutorials](https://docs.activeloop.ai/hub-tutorials)
2. [Twitter the-algorithm codebase analysis with Deep Lake](../modules/indexes/retrievers/examples/twitter-the-algorithm-analysis-deeplake.ipynb)
3. Here is [whitepaper](https://www.deeplake.ai/whitepaper) and [academic paper](https://arxiv.org/pdf/2209.10785.pdf) for Deep Lake
4. Here is a set of additional resources available for review: [Deep Lake](https://github.com/activeloopai/deeplake), [Getting Started](https://docs.activeloop.ai/getting-started) and [Tutorials](https://docs.activeloop.ai/hub-tutorials)
## Installation and Setup
- Install the Python package with `pip install deeplake`

@ -71,6 +71,8 @@ The above modules can be used in a variety of ways. LangChain also provides guid
- `Querying Tabular Data <./use_cases/tabular.html>`_: If you want to understand how to use LLMs to query data that is stored in a tabular format (csvs, SQL, dataframes, etc) you should read this page.
- `Code Understanding <./use_cases/code.html>`_: If you want to understand how to use LLMs to query source code from github, you should read this page.
- `Interacting with APIs <./use_cases/apis.html>`_: Enabling LLMs to interact with APIs is extremely powerful in order to give them more up-to-date information and allow them to take actions.
- `Extraction <./use_cases/extraction.html>`_: Extract structured information from text.

@ -0,0 +1,442 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Analysis of Twitter the-algorithm source code with LangChain, GPT4 and Deep Lake\n",
"In this tutorial, we are going to use LangChain + Deep Lake with GPT-4 to analyze the code base of the Twitter algorithm. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python3 -m pip install --upgrade langchain deeplake openai tiktoken"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Define the OpenAI embeddings and the Deep Lake multi-modal vector store API, then authenticate. For full documentation of Deep Lake, see https://docs.activeloop.ai/ and the API reference at https://docs.deeplake.ai/en/latest/"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"from langchain.vectorstores import DeepLake\n",
"\n",
"os.environ['OPENAI_API_KEY']='sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'\n",
"embeddings = OpenAIEmbeddings()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Authenticate into Deep Lake if you want to create your own dataset and publish it. You can get an API key from the platform at https://app.activeloop.ai"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!activeloop login -t <TOKEN>"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Index the code base (optional)\n",
"You can skip this part and jump directly to using the already indexed dataset. Otherwise, first clone the repository, then parse and chunk the code base, and finally index it with OpenAI embeddings."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!git clone https://github.com/twitter/the-algorithm # replace with any repository of your choice "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Load all files inside the repository"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from langchain.document_loaders import TextLoader\n",
"\n",
"root_dir = './the-algorithm'\n",
"docs = []\n",
"for dirpath, dirnames, filenames in os.walk(root_dir):\n",
" for file in filenames:\n",
" try: \n",
" loader = TextLoader(os.path.join(dirpath, file), encoding='utf-8')\n",
" docs.extend(loader.load_and_split())\n",
" except Exception as e: \n",
" pass"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, chunk the files"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import CharacterTextSplitter\n",
"\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"texts = text_splitter.split_documents(docs)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Execute the indexing. It takes about 4 minutes to compute the embeddings and upload them to Activeloop. You can then make the dataset public."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"db = DeepLake.from_documents(texts, embeddings, dataset_path=\"hub://davitbun/twitter-algorithm\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Question Answering on Twitter algorithm codebase\n",
"First load the dataset, then construct the retriever and the conversational chain."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/davitbun/twitter-algorithm\n",
"\n",
"hub://davitbun/twitter-algorithm loaded successfully.\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Deep Lake Dataset in hub://davitbun/twitter-algorithm already exists, loading from the storage\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset(path='hub://davitbun/twitter-algorithm', read_only=True, tensors=['embedding', 'ids', 'metadata', 'text'])\n",
"\n",
" tensor htype shape dtype compression\n",
" ------- ------- ------- ------- ------- \n",
" embedding generic (23152, 1536) float32 None \n",
" ids text (23152, 1) str None \n",
" metadata json (23152, 1) str None \n",
" text text (23152, 1) str None \n"
]
}
],
"source": [
"db = DeepLake(dataset_path=\"hub://davitbun/twitter-algorithm\", read_only=True, embedding_function=embeddings)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"\n",
"retriever = db.as_retriever()\n",
"retriever.search_kwargs['distance_metric'] = 'cos'\n",
"retriever.search_kwargs['fetch_k'] = 100\n",
"retriever.search_kwargs['maximal_marginal_relevance'] = True\n",
"retriever.search_kwargs['k'] = 20"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also specify user-defined functions using [Deep Lake filters](https://docs.deeplake.ai/en/latest/deeplake.core.dataset.html#deeplake.core.dataset.Dataset.filter)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"def filter(x):\n",
" # filter based on source code\n",
" if 'com.google' in x['text'].data()['value']:\n",
" return False\n",
" \n",
" # filter based on path e.g. extension\n",
" metadata = x['metadata'].data()['value']\n",
" return 'scala' in metadata['source'] or 'py' in metadata['source']\n",
"\n",
"### turn on below for custom filtering\n",
"# retriever.search_kwargs['filter'] = filter"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"from langchain.chat_models import ChatOpenAI\n",
"from langchain.chains import ConversationalRetrievalChain\n",
"\n",
"model = ChatOpenAI(model='gpt-4') # 'gpt-3.5-turbo',\n",
"qa = ConversationalRetrievalChain.from_llm(model,retriever=retriever)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"questions = [\n",
" \"What does favCountParams do?\",\n",
" \"is it Likes + Bookmarks, or not clear from the code?\",\n",
" \"What are the major negative modifiers that lower your linear ranking parameters?\", \n",
" \"How do you get assigned to SimClusters?\",\n",
" \"What is needed to migrate from one SimClusters to another SimClusters?\",\n",
" \"How much do I get boosted within my cluster?\", \n",
" \"How does Heavy ranker work. what are its main inputs?\",\n",
" \"How can one influence Heavy ranker?\",\n",
" \"why threads and long tweets do so well on the platform?\",\n",
" \"Are thread and long tweet creators building a following that reacts to only threads?\",\n",
" \"Do you need to follow different strategies to get most followers vs to get most likes and bookmarks per tweet?\",\n",
" \"Content meta data and how it impacts virality (e.g. ALT in images).\",\n",
" \"What are some unexpected fingerprints for spam factors?\",\n",
" \"Is there any difference between company verified checkmarks and blue verified individual checkmarks?\",\n",
"] \n",
"chat_history = []\n",
"\n",
"for question in questions: \n",
" result = qa({\"question\": question, \"chat_history\": chat_history})\n",
" chat_history.append((question, result['answer']))\n",
" print(f\"-> **Question**: {question} \\n\")\n",
" print(f\"**Answer**: {result['answer']} \\n\")\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"-> **Question**: is it Likes + Bookmarks, or not clear from the code?\n",
"\n",
"**Answer**: From the provided code, it is not clear if the favorite count metric is determined by the sum of likes and bookmarks. The favorite count is mentioned in the code, but there is no explicit reference to how it is calculated in terms of likes and bookmarks. \n",
"\n",
"-> **Question**: What are the major negative modifiers that lower your linear ranking parameters?\n",
"\n",
"**Answer**: In the given code, major negative modifiers that lower the linear ranking parameters are:\n",
"\n",
"1. `scoringData.querySpecificScore`: This score adjustment is based on the query-specific information. If its value is negative, it will lower the linear ranking parameters.\n",
"\n",
"2. `scoringData.authorSpecificScore`: This score adjustment is based on the author-specific information. If its value is negative, it will also lower the linear ranking parameters.\n",
"\n",
"Please note that I cannot provide more information on the exact calculations of these negative modifiers, as the code for their determination is not provided. \n",
"\n",
"-> **Question**: How do you get assigned to SimClusters?\n",
"\n",
"**Answer**: The assignment to SimClusters occurs through a Metropolis-Hastings sampling-based community detection algorithm that is run on the Producer-Producer similarity graph. This graph is created by computing the cosine similarity scores between the users who follow each producer. The algorithm identifies communities or clusters of Producers with similar followers, and takes a parameter *k* for specifying the number of communities to be detected.\n",
"\n",
"After the community detection, different users and content are represented as sparse, interpretable vectors within these identified communities (SimClusters). The resulting SimClusters embeddings can be used for various recommendation tasks. \n",
"\n",
"-> **Question**: What is needed to migrate from one SimClusters to another SimClusters?\n",
"\n",
"**Answer**: To migrate from one SimClusters representation to another, you can follow these general steps:\n",
"\n",
"1. **Prepare the new representation**: Create the new SimClusters representation using any necessary updates or changes in the clustering algorithm, similarity measures, or other model parameters. Ensure that this new representation is properly stored and indexed as needed.\n",
"\n",
"2. **Update the relevant code and configurations**: Modify the relevant code and configuration files to reference the new SimClusters representation. This may involve updating paths or dataset names to point to the new representation, as well as changing code to use the new clustering method or similarity functions if applicable.\n",
"\n",
"3. **Test the new representation**: Before deploying the changes to production, thoroughly test the new SimClusters representation to ensure its effectiveness and stability. This may involve running offline jobs like candidate generation and label candidates, validating the output, as well as testing the new representation in the evaluation environment using evaluation tools like TweetSimilarityEvaluationAdhocApp.\n",
"\n",
"4. **Deploy the changes**: Once the new representation has been tested and validated, deploy the changes to production. This may involve creating a zip file, uploading it to the packer, and then scheduling it with Aurora. Be sure to monitor the system to ensure a smooth transition between representations and verify that the new representation is being used in recommendations as expected.\n",
"\n",
"5. **Monitor and assess the new representation**: After the new representation has been deployed, continue to monitor its performance and impact on recommendations. Take note of any improvements or issues that arise and be prepared to iterate on the new representation if needed. Always ensure that the results and performance metrics align with the system's goals and objectives. \n",
"\n",
"-> **Question**: How much do I get boosted within my cluster?\n",
"\n",
"**Answer**: It's not possible to determine the exact amount your content is boosted within your cluster in the SimClusters representation without specific data about your content and its engagement metrics. However, a combination of factors, such as the favorite score and follow score, alongside other engagement signals and SimCluster calculations, influence the boosting of content. \n",
"\n",
"-> **Question**: How does Heavy ranker work. what are its main inputs?\n",
"\n",
"**Answer**: The Heavy Ranker is a machine learning model that plays a crucial role in ranking and scoring candidates within the recommendation algorithm. Its primary purpose is to predict the likelihood of a user engaging with a tweet or connecting with another user on the platform.\n",
"\n",
"Main inputs to the Heavy Ranker consist of:\n",
"\n",
"1. Static Features: These are features that can be computed directly from a tweet at the time it's created, such as whether it has a URL, has cards, has quotes, etc. These features are produced by the Index Ingester as the tweets are generated and stored in the index.\n",
"\n",
"2. Real-time Features: These per-tweet features can change after the tweet has been indexed. They mostly consist of social engagements like retweet count, favorite count, reply count, and some spam signals that are computed with later activities. The Signal Ingester, which is part of a Heron topology, processes multiple event streams to collect and compute these real-time features.\n",
"\n",
"3. User Table Features: These per-user features are obtained from the User Table Updater that processes a stream written by the user service. This input is used to store sparse real-time user information, which is later propagated to the tweet being scored by looking up the author of the tweet.\n",
"\n",
"4. Search Context Features: These features represent the context of the current searcher, like their UI language, their content consumption, and the current time (implied). They are combined with Tweet Data to compute some of the features used in scoring.\n",
"\n",
"These inputs are then processed by the Heavy Ranker to score and rank candidates based on their relevance and likelihood of engagement by the user. \n",
"\n",
"-> **Question**: How can one influence Heavy ranker?\n",
"\n",
"**Answer**: To influence the Heavy Ranker's output or ranking of content, consider the following actions:\n",
"\n",
"1. Improve content quality: Create high-quality and engaging content that is relevant, informative, and valuable to users. High-quality content is more likely to receive positive user engagement, which the Heavy Ranker considers when ranking content.\n",
"\n",
"2. Increase user engagement: Encourage users to interact with content through likes, retweets, replies, and comments. Higher engagement levels can lead to better ranking in the Heavy Ranker's output.\n",
"\n",
"3. Optimize your user profile: A user's reputation, based on factors such as their follower count and follower-to-following ratio, may impact the ranking of their content. Maintain a good reputation by following relevant users, keeping a reasonable follower-to-following ratio and engaging with your followers.\n",
"\n",
"4. Enhance content discoverability: Use relevant keywords, hashtags, and mentions in your tweets, making it easier for users to find and engage with your content. This increased discoverability may help improve the ranking of your content by the Heavy Ranker.\n",
"\n",
"5. Leverage multimedia content: Experiment with different content formats, such as videos, images, and GIFs, which may capture users' attention and increase engagement, resulting in better ranking by the Heavy Ranker.\n",
"\n",
"6. User feedback: Monitor and respond to feedback for your content. Positive feedback may improve your ranking, while negative feedback provides an opportunity to learn and improve.\n",
"\n",
"Note that the Heavy Ranker uses a combination of machine learning models and various features to rank the content. While the above actions may help influence the ranking, there are no guarantees as the ranking process is determined by a complex algorithm, which evolves over time. \n",
"\n",
"-> **Question**: why threads and long tweets do so well on the platform?\n",
"\n",
"**Answer**: Threads and long tweets perform well on the platform for several reasons:\n",
"\n",
"1. **More content and context**: Threads and long tweets provide more information and context about a topic, which can make the content more engaging and informative for users. People tend to appreciate a well-structured and detailed explanation of a subject or a story, and threads and long tweets can do that effectively.\n",
"\n",
"2. **Increased user engagement**: As threads and long tweets provide more content, they also encourage users to engage with the tweets through replies, retweets, and likes. This increased engagement can lead to better visibility of the content, as the Twitter algorithm considers user engagement when ranking and surfacing tweets.\n",
"\n",
"3. **Narrative structure**: Threads enable users to tell stories or present arguments in a step-by-step manner, making the information more accessible and easier to follow. This narrative structure can capture users' attention and encourage them to read through the entire thread and interact with the content.\n",
"\n",
"4. **Expanded reach**: When users engage with a thread, their interactions can bring the content to the attention of their followers, helping to expand the reach of the thread. This increased visibility can lead to more interactions and higher performance for the threaded tweets.\n",
"\n",
"5. **Higher content quality**: Generally, threads and long tweets require more thought and effort to create, which may lead to higher quality content. Users are more likely to appreciate and interact with high-quality, well-reasoned content, further improving the performance of these tweets within the platform.\n",
"\n",
"Overall, threads and long tweets perform well on Twitter because they encourage user engagement and provide a richer, more informative experience that users find valuable. \n",
"\n",
"-> **Question**: Are thread and long tweet creators building a following that reacts to only threads?\n",
"\n",
"**Answer**: Based on the provided code and context, there isn't enough information to conclude if the creators of threads and long tweets primarily build a following that engages with only thread-based content. The code provided is focused on Twitter's recommendation and ranking algorithms, as well as infrastructure components like Kafka, partitions, and the Follow Recommendations Service (FRS). To answer your question, data analysis of user engagement and results of specific edge cases would be required. \n",
"\n",
"-> **Question**: Do you need to follow different strategies to get most followers vs to get most likes and bookmarks per tweet?\n",
"\n",
"**Answer**: Yes, different strategies need to be followed to maximize the number of followers compared to maximizing likes and bookmarks per tweet. While there may be some overlap in the approaches, they target different aspects of user engagement.\n",
"\n",
"Maximizing followers: The primary focus is on growing your audience on the platform. Strategies include:\n",
"\n",
"1. Consistently sharing high-quality content related to your niche or industry.\n",
"2. Engaging with others on the platform by replying, retweeting, and mentioning other users.\n",
"3. Using relevant hashtags and participating in trending conversations.\n",
"4. Collaborating with influencers and other users with a large following.\n",
"5. Posting at optimal times when your target audience is most active.\n",
"6. Optimizing your profile by using a clear profile picture, catchy bio, and relevant links.\n",
"\n",
"Maximizing likes and bookmarks per tweet: The focus is on creating content that resonates with your existing audience and encourages engagement. Strategies include:\n",
"\n",
"1. Crafting engaging and well-written tweets that encourage users to like or save them.\n",
"2. Incorporating visually appealing elements, such as images, GIFs, or videos, that capture attention.\n",
"3. Asking questions, sharing opinions, or sparking conversations that encourage users to engage with your tweets.\n",
"4. Using analytics to understand the type of content that resonates with your audience and tailoring your tweets accordingly.\n",
"5. Posting a mix of educational, entertaining, and promotional content to maintain variety and interest.\n",
"6. Timing your tweets strategically to maximize engagement, likes, and bookmarks per tweet.\n",
"\n",
"Both strategies can overlap, and you may need to adapt your approach by understanding your target audience's preferences and analyzing your account's performance. However, it's essential to recognize that maximizing followers and maximizing likes and bookmarks per tweet have different focuses and require specific strategies. \n",
"\n",
"-> **Question**: Content meta data and how it impacts virality (e.g. ALT in images).\n",
"\n",
"**Answer**: There is no direct information in the provided context about how content metadata, such as ALT text in images, impacts the virality of a tweet or post. However, it's worth noting that including ALT text can improve the accessibility of your content for users who rely on screen readers, which may lead to increased engagement for a broader audience. Additionally, metadata can be used in search engine optimization, which might improve the visibility of the content, but the context provided does not mention any specific correlation with virality. \n",
"\n",
"-> **Question**: What are some unexpected fingerprints for spam factors?\n",
"\n",
"**Answer**: In the provided context, an unusual indicator of spam factors is when a tweet contains a non-media, non-news link. If the tweet has a link but does not have an image URL, video URL, or news URL, it is considered a potential spam vector, and a threshold for user reputation (tweepCredThreshold) is set to MIN_TWEEPCRED_WITH_LINK.\n",
"\n",
"While this rule may not cover all possible unusual spam indicators, it is derived from the specific codebase and logic shared in the context. \n",
"\n",
"-> **Question**: Is there any difference between company verified checkmarks and blue verified individual checkmarks?\n",
"\n",
"**Answer**: Yes, there is a distinction between the verified checkmarks for companies and blue verified checkmarks for individuals. The code snippet provided mentions \"Blue-verified account boost\" which indicates that there is a separate category for blue verified accounts. Typically, blue verified checkmarks are used to indicate notable individuals, while verified checkmarks are for companies or organizations. \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.0"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}

@ -0,0 +1,25 @@
# Code Understanding
Overview
LangChain can be used to analyze GitHub code repositories. By combining VectorStores, the ConversationalRetrievalChain, and GPT-4, it can answer questions in the context of an entire GitHub repository or generate new code. This page outlines the essential components of the system and explains how to use LangChain for code comprehension, contextual question answering, and code generation in GitHub repositories.
## Conversational Retriever Chain
The ConversationalRetrievalChain is a retrieval-focused system that interacts with the data stored in a VectorStore. Using techniques such as context-aware filtering and ranking, it retrieves the most relevant code snippets and information for a given user query, taking conversation history and context into account.
LangChain Workflow for Code Understanding and Generation
1. Index the code base: Clone the target repository, load all files within, chunk the files, and execute the indexing process. Optionally, you can skip this step and use an already indexed dataset.
2. Embedding and Code Store: Code snippets are embedded using a code-aware embedding model and stored in a VectorStore.
3. Query Understanding: GPT-4 processes user queries, grasping the context and extracting relevant details.
4. Construct the Retriever: The ConversationalRetrievalChain searches the VectorStore to identify the most relevant code snippets for a given query.
5. Build the Conversational Chain: Customize the retriever settings and define any user-defined filters as needed.
6. Ask questions: Define a list of questions to ask about the codebase, and then use the ConversationalRetrievalChain to generate answers. The LLM (GPT-4) generates comprehensive, context-aware answers based on the retrieved code snippets and the conversation history.
The full tutorial is available below.
- [Twitter the-algorithm codebase analysis with Deep Lake](../modules/indexes/retrievers/examples/twitter-the-algorithm-analysis-deeplake.ipynb): A notebook walking through how to parse GitHub source code and run conversational queries against it.
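
The retrieval part of the workflow above can be sketched without any external service. This is a toy illustration only: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, `retrieve` is a hypothetical helper, and the chat history is simply concatenated with the question, whereas the real chain uses the LLM to work with the conversation history.

```python
import numpy as np

def embed(texts, vocab):
    """Toy bag-of-words embedding (assumption: stands in for a real model)."""
    vectors = np.zeros((len(texts), len(vocab)), dtype=np.float32)
    for row, text in enumerate(texts):
        for token in text.lower().split():
            if token in vocab:
                vectors[row, vocab[token]] += 1.0
    return vectors

def retrieve(question, chat_history, chunks, vectors, vocab, k=1):
    """Rank chunks by cosine similarity to the question plus earlier questions."""
    context = " ".join(q for q, _ in chat_history) + " " + question
    query = embed([context], vocab)[0]
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    scores = vectors @ query / np.where(norms == 0, 1.0, norms)
    best = np.argsort(scores)[::-1][:k]  # highest similarity first
    return [chunks[i] for i in best]

# Pretend these are chunks produced by loading and splitting a repository.
chunks = [
    "def heavy_ranker score tweets by engagement",
    "def sim_clusters group producers by followers",
]
tokens = sorted({t for c in chunks for t in c.lower().split()})
vocab = {tok: i for i, tok in enumerate(tokens)}
vectors = embed(chunks, vocab)

first = retrieve("how are tweets score", [], chunks, vectors, vocab)
history = [("how does sim_clusters work", "it groups producers")]
follow_up = retrieve("what is it grouped by", history, chunks, vectors, vocab)
```

The follow-up question contains no cluster-specific term on its own; it only lands on the SimClusters chunk because the earlier question is part of the query.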

@ -4,7 +4,7 @@ from __future__ import annotations
import logging
import uuid
from functools import partial
from typing import Any, Dict, Iterable, List, Optional, Sequence, Tuple
from typing import Any, Callable, Dict, Iterable, List, Optional, Sequence, Tuple
import numpy as np
@ -46,6 +46,7 @@ def vector_search(
# Calculate the distance between the query_vector and all data_vectors
distances = distance_metric_map[distance_metric](query_embedding, data_vectors)
nearest_indices = np.argsort(distances)
nearest_indices = (
nearest_indices[::-1][:k] if distance_metric in ["cos"] else nearest_indices[:k]
)
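
The slicing above depends on whether the metric is a similarity or a distance. A minimal NumPy sketch of that idea follows; `top_k` and the sample vectors are hypothetical and only mirror the ordering logic of this hunk, not the full `vector_search` function.

```python
import numpy as np

def top_k(query, data, k, metric="L2"):
    """Return the indices of the k best matches (hypothetical helper)."""
    if metric == "cos":
        # Cosine similarity: larger means closer, so take the end of the sort.
        scores = data @ query / (
            np.linalg.norm(data, axis=1) * np.linalg.norm(query)
        )
        return np.argsort(scores)[::-1][:k]
    # L2 distance: smaller means closer, so keep the ascending order.
    scores = np.linalg.norm(data - query, axis=1)
    return np.argsort(scores)[:k]

data = np.array(
    [[1.0, 0.0], [0.0, 1.0], [3.0, 0.1], [-1.0, 0.0]], dtype=np.float32
)
query = np.array([1.0, 0.0], dtype=np.float32)

cos_idx = top_k(query, data, k=2, metric="cos")  # direction matters, length does not
l2_idx = top_k(query, data, k=2, metric="L2")    # absolute position matters
```

The third vector points almost the same way as the query but is far away, so it ranks second under cosine similarity and last under L2.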
@ -93,12 +94,17 @@ class DeepLake(VectorStore):
dataset_path: str = _LANGCHAIN_DEFAULT_DEEPLAKE_PATH,
token: Optional[str] = None,
embedding_function: Optional[Embeddings] = None,
read_only: Optional[bool] = None,
read_only: Optional[bool] = False,
ingestion_batch_size: int = 1024,
num_workers: int = 4,
) -> None:
"""Initialize with Deep Lake client."""
self.ingestion_batch_size = ingestion_batch_size
self.num_workers = num_workers
try:
import deeplake
from deeplake.constants import MB
except ImportError:
raise ValueError(
"Could not import deeplake python package. "
@ -115,6 +121,7 @@ class DeepLake(VectorStore):
self.ds.summary()
else:
self.ds = deeplake.empty(dataset_path, token=token, overwrite=True)
with self.ds:
self.ds.create_tensor(
"text",
@ -122,6 +129,7 @@ class DeepLake(VectorStore):
create_id_tensor=False,
create_sample_info_tensor=False,
create_shape_tensor=False,
chunk_compression="lz4",
)
self.ds.create_tensor(
"metadata",
@ -129,13 +137,16 @@ class DeepLake(VectorStore):
create_id_tensor=False,
create_sample_info_tensor=False,
create_shape_tensor=False,
chunk_compression="lz4",
)
self.ds.create_tensor(
"embedding",
htype="generic",
dtype=np.float32,
create_id_tensor=False,
create_sample_info_tensor=False,
create_shape_tensor=False,
max_chunk_size=64 * MB,
create_shape_tensor=True,
)
self.ds.create_tensor(
"ids",
@ -143,6 +154,7 @@ class DeepLake(VectorStore):
create_id_tensor=False,
create_sample_info_tensor=False,
create_shape_tensor=False,
chunk_compression="lz4",
)
self._embedding_function = embedding_function
@ -170,29 +182,45 @@ class DeepLake(VectorStore):
text_list = list(texts)
if self._embedding_function is None:
embeddings: Sequence[Optional[List[float]]] = [None] * len(text_list)
else:
embeddings = self._embedding_function.embed_documents(text_list)
if metadatas is None:
metadatas = [{}] * len(text_list)
elements = zip(text_list, embeddings, metadatas, ids)
elements = list(zip(text_list, metadatas, ids))
@self._deeplake.compute
def ingest(sample_in: list, sample_out: list) -> None:
s = {
"text": sample_in[0],
"embedding": sample_in[1],
"metadata": sample_in[2],
"ids": sample_in[3],
}
sample_out.append(s)
ingest().eval(list(elements), self.ds)
self.ds.commit(allow_empty=True)
text_list = [s[0] for s in sample_in]
embeds: Sequence[Optional[np.ndarray]] = []
if self._embedding_function is not None:
embeddings = self._embedding_function.embed_documents(text_list)
embeds = [np.array(e, dtype=np.float32) for e in embeddings]
else:
embeds = [None] * len(text_list)
for s, e in zip(sample_in, embeds):
sample_out.append(
{
"text": s[0],
"metadata": s[1],
"ids": s[2],
"embedding": e,
}
)
batch_size = min(self.ingestion_batch_size, len(elements))
batched = [
elements[i : i + batch_size] for i in range(0, len(elements), batch_size)
]
ingest().eval(
batched,
self.ds,
num_workers=min(self.num_workers, len(batched) // self.num_workers),
)
self.ds.commit(allow_empty=True)
self.ds.summary()
return ids
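
The batching in the hunk above can be sketched in isolation. `make_batches` is a hypothetical helper that mirrors the slicing logic; the guard for empty input is an addition for this sketch, since `range()` rejects a step of zero.

```python
def make_batches(elements, ingestion_batch_size=1024):
    """Split elements into consecutive batches (hypothetical helper)."""
    if not elements:
        return []  # assumption: nothing to ingest, so no batches
    batch_size = min(ingestion_batch_size, len(elements))
    return [
        elements[i : i + batch_size]
        for i in range(0, len(elements), batch_size)
    ]

# Each element is a (text, metadata, id) tuple, as in the hunk above.
elements = [(f"text {i}", {"source": f"file_{i}.py"}, str(i)) for i in range(10)]
batches = make_batches(elements, ingestion_batch_size=4)
sizes = [len(b) for b in batches]
flattened = [e for b in batches for e in b]
```

Embedding one batch at a time inside the ingestion function keeps each embedding request bounded by the batch size instead of sending every text at once.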
def search(
@ -203,7 +231,7 @@ class DeepLake(VectorStore):
distance_metric: str = "L2",
use_maximal_marginal_relevance: Optional[bool] = False,
fetch_k: Optional[int] = 20,
filter: Optional[Dict[str, str]] = None,
filter: Optional[Any[Dict[str, str], Callable, str]] = None,
return_score: Optional[bool] = False,
**kwargs: Any,
) -> Any[List[Document], List[Tuple[Document, float]]]:
@ -216,7 +244,9 @@ class DeepLake(VectorStore):
distance_metric: `L2` for Euclidean, `L1` for Manhattan,
`max` L-infinity distance, `cos` for cosine similarity,
'dot' for dot product. Defaults to `L2`.
filter: Attribute filter by metadata example {'key': 'value'}.
filter: Attribute filter by metadata example {'key': 'value'}. It can also
take [Deep Lake filter]
(https://docs.deeplake.ai/en/latest/deeplake.core.dataset.html#deeplake.core.dataset.Dataset.filter)
Defaults to None.
maximal_marginal_relevance: Whether to use maximal marginal relevance.
Defaults to False.
@ -232,8 +262,10 @@ class DeepLake(VectorStore):
# attribute based filtering
if filter is not None:
view = view.filter(partial(dp_filter, filter=filter))
if isinstance(filter, dict):
filter = partial(dp_filter, filter=filter)
view = view.filter(filter)
if len(view) == 0:
return []
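
The dict-or-callable dispatch in this hunk can be illustrated on plain Python lists. Everything below is a hypothetical stand-in: `dict_filter` approximates a metadata match and `apply_filter` plays the role of `view.filter`; neither is the Deep Lake API.

```python
from functools import partial

def dict_filter(sample, filter):
    """Keep a sample only if every key/value pair matches its metadata."""
    metadata = sample["metadata"]
    return all(metadata.get(key) == value for key, value in filter.items())

def apply_filter(samples, filter=None):
    """Hypothetical stand-in for the dict-or-callable dispatch."""
    if filter is None:
        return samples
    if isinstance(filter, dict):
        # A dict is turned into a predicate; a callable is used as-is.
        filter = partial(dict_filter, filter=filter)
    return [s for s in samples if filter(s)]

samples = [
    {"text": "object Foo", "metadata": {"source": "a.scala"}},
    {"text": "import os", "metadata": {"source": "b.py"}},
]

by_dict = apply_filter(samples, {"source": "b.py"})
by_func = apply_filter(
    samples, lambda s: s["metadata"]["source"].endswith(".scala")
)
```

Converting the dict into a predicate up front means the rest of the search path only ever deals with one kind of filter.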
@ -252,8 +284,7 @@ class DeepLake(VectorStore):
query
) # type: ignore
query_emb = np.array(emb, dtype=np.float32)
embeddings = view.embedding.numpy()
embeddings = view.embedding.numpy(fetch_chunks=True)
k_search = fetch_k if use_maximal_marginal_relevance else k
indices, scores = vector_search(
query_emb,
@ -261,8 +292,8 @@ class DeepLake(VectorStore):
k=k_search,
distance_metric=distance_metric.lower(),
)
view = view[indices]
view = view[indices]
if use_maximal_marginal_relevance:
indices = maximal_marginal_relevance(
query_emb, embeddings[indices], k=min(k, len(indices))
