langchain/docs/use_cases/code/twitter-the-algorithm-analysis-deeplake.ipynb
Davit Buniatyan 2c0023393b
Deep Lake mini upgrades (#3375)
Improvements
* set default num_workers for ingestion to 0
* upgraded notebooks for avoiding dataset creation ambiguity
* added `force_delete_dataset_by_path`
* bumped deeplake to 3.3.0
* creds arg passing to deeplake object that would allow custom S3

Notes
* please double check if poetry is not messed up (thanks!)

Asks
* Would be great to create a shared slack channel for quick questions

---------

Co-authored-by: Davit Buniatyan <d@activeloop.ai>
2023-04-23 21:23:54 -07:00

413 lines
24 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Analysis of Twitter the-algorithm source code with LangChain, GPT4 and Deep Lake\n",
"In this tutorial, we are going to use Langchain + Deep Lake with GPT4 to analyze the code base of the twitter algorithm. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python3 -m pip install --upgrade langchain deeplake openai tiktoken"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Define OpenAI embeddings, Deep Lake multi-modal vector store api and authenticate. For full documentation of Deep Lake please follow [docs](https://docs.activeloop.ai/) and [API reference](https://docs.deeplake.ai/en/latest/).\n",
"\n",
"Authenticate into Deep Lake if you want to create your own dataset and publish it. You can get an API key from the [platform](https://app.activeloop.ai)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import getpass\n",
"\n",
"from langchain.embeddings.openai import OpenAIEmbeddings\n",
"from langchain.vectorstores import DeepLake\n",
"\n",
"os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')\n",
"os.environ['ACTIVELOOP_TOKEN'] = getpass.getpass('Activeloop Token:')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"embeddings = OpenAIEmbeddings(disallowed_special=())"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"disallowed_special=() is required to avoid `Exception: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte` from tiktoken for some repositories"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Index the code base (optional)\n",
"You can directly skip this part and directly jump into using already indexed dataset. To begin with, first we will clone the repository, then parse and chunk the code base and use OpenAI indexing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!git clone https://github.com/twitter/the-algorithm # replace any repository of your choice "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load all files inside the repository"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from langchain.document_loaders import TextLoader\n",
"\n",
"root_dir = './the-algorithm'\n",
"docs = []\n",
"for dirpath, dirnames, filenames in os.walk(root_dir):\n",
" for file in filenames:\n",
" try: \n",
" loader = TextLoader(os.path.join(dirpath, file), encoding='utf-8')\n",
" docs.extend(loader.load_and_split())\n",
" except Exception as e: \n",
" pass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, chunk the files"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import CharacterTextSplitter\n",
"\n",
"text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
"texts = text_splitter.split_documents(docs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Execute the indexing. This will take about ~4 mins to compute embeddings and upload to Activeloop. You can then publish the dataset to be public."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"username = \"davitbun\" # replace with your username from app.activeloop.ai\n",
"db = DeepLake(dataset_path=f\"hub://{username}/twitter-algorithm\", embedding_function=embeddings, public=True) #dataset would be publicly available\n",
"db.add_documents(texts)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Question Answering on Twitter algorithm codebase\n",
"First load the dataset, construct the retriever, then construct the Conversational Chain"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"db = DeepLake(dataset_path=\"hub://davitbun/twitter-algorithm\", read_only=True, embedding_function=embeddings)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"retriever = db.as_retriever()\n",
"retriever.search_kwargs['distance_metric'] = 'cos'\n",
"retriever.search_kwargs['fetch_k'] = 100\n",
"retriever.search_kwargs['maximal_marginal_relevance'] = True\n",
"retriever.search_kwargs['k'] = 10"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also specify user defined functions using [Deep Lake filters](https://docs.deeplake.ai/en/latest/deeplake.core.dataset.html#deeplake.core.dataset.Dataset.filter)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def filter(x):\n",
" # filter based on source code\n",
" if 'com.google' in x['text'].data()['value']:\n",
" return False\n",
" \n",
" # filter based on path e.g. extension\n",
" metadata = x['metadata'].data()['value']\n",
" return 'scala' in metadata['source'] or 'py' in metadata['source']\n",
"\n",
"### turn on below for custom filtering\n",
"# retriever.search_kwargs['filter'] = filter"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain.chat_models import ChatOpenAI\n",
"from langchain.chains import ConversationalRetrievalChain\n",
"\n",
"model = ChatOpenAI(model='gpt-3.5-turbo') # switch to 'gpt-4'\n",
"qa = ConversationalRetrievalChain.from_llm(model,retriever=retriever)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"questions = [\n",
" \"What does favCountParams do?\",\n",
" \"is it Likes + Bookmarks, or not clear from the code?\",\n",
" \"What are the major negative modifiers that lower your linear ranking parameters?\", \n",
" \"How do you get assigned to SimClusters?\",\n",
" \"What is needed to migrate from one SimClusters to another SimClusters?\",\n",
" \"How much do I get boosted within my cluster?\", \n",
" \"How does Heavy ranker work. what are its main inputs?\",\n",
" \"How can one influence Heavy ranker?\",\n",
" \"why threads and long tweets do so well on the platform?\",\n",
" \"Are thread and long tweet creators building a following that reacts to only threads?\",\n",
" \"Do you need to follow different strategies to get most followers vs to get most likes and bookmarks per tweet?\",\n",
" \"Content meta data and how it impacts virality (e.g. ALT in images).\",\n",
" \"What are some unexpected fingerprints for spam factors?\",\n",
" \"Is there any difference between company verified checkmarks and blue verified individual checkmarks?\",\n",
"] \n",
"chat_history = []\n",
"\n",
"for question in questions: \n",
" result = qa({\"question\": question, \"chat_history\": chat_history})\n",
" chat_history.append((question, result['answer']))\n",
" print(f\"-> **Question**: {question} \\n\")\n",
" print(f\"**Answer**: {result['answer']} \\n\")\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"-> **Question**: What does favCountParams do? \n",
"\n",
"**Answer**: `favCountParams` is an optional ThriftLinearFeatureRankingParams instance that represents the parameters related to the \"favorite count\" feature in the ranking process. It is used to control the weight of the favorite count feature while ranking tweets. The favorite count is the number of times a tweet has been marked as a favorite by users, and it is considered an important signal in the ranking of tweets. By using `favCountParams`, the system can adjust the importance of the favorite count while calculating the final ranking score of a tweet. \n",
"\n",
"-> **Question**: is it Likes + Bookmarks, or not clear from the code?\n",
"\n",
"**Answer**: From the provided code, it is not clear if the favorite count metric is determined by the sum of likes and bookmarks. The favorite count is mentioned in the code, but there is no explicit reference to how it is calculated in terms of likes and bookmarks. \n",
"\n",
"-> **Question**: What are the major negative modifiers that lower your linear ranking parameters?\n",
"\n",
"**Answer**: In the given code, major negative modifiers that lower the linear ranking parameters are:\n",
"\n",
"1. `scoringData.querySpecificScore`: This score adjustment is based on the query-specific information. If its value is negative, it will lower the linear ranking parameters.\n",
"\n",
"2. `scoringData.authorSpecificScore`: This score adjustment is based on the author-specific information. If its value is negative, it will also lower the linear ranking parameters.\n",
"\n",
"Please note that I cannot provide more information on the exact calculations of these negative modifiers, as the code for their determination is not provided. \n",
"\n",
"-> **Question**: How do you get assigned to SimClusters?\n",
"\n",
"**Answer**: The assignment to SimClusters occurs through a Metropolis-Hastings sampling-based community detection algorithm that is run on the Producer-Producer similarity graph. This graph is created by computing the cosine similarity scores between the users who follow each producer. The algorithm identifies communities or clusters of Producers with similar followers, and takes a parameter *k* for specifying the number of communities to be detected.\n",
"\n",
"After the community detection, different users and content are represented as sparse, interpretable vectors within these identified communities (SimClusters). The resulting SimClusters embeddings can be used for various recommendation tasks. \n",
"\n",
"-> **Question**: What is needed to migrate from one SimClusters to another SimClusters?\n",
"\n",
"**Answer**: To migrate from one SimClusters representation to another, you can follow these general steps:\n",
"\n",
"1. **Prepare the new representation**: Create the new SimClusters representation using any necessary updates or changes in the clustering algorithm, similarity measures, or other model parameters. Ensure that this new representation is properly stored and indexed as needed.\n",
"\n",
"2. **Update the relevant code and configurations**: Modify the relevant code and configuration files to reference the new SimClusters representation. This may involve updating paths or dataset names to point to the new representation, as well as changing code to use the new clustering method or similarity functions if applicable.\n",
"\n",
"3. **Test the new representation**: Before deploying the changes to production, thoroughly test the new SimClusters representation to ensure its effectiveness and stability. This may involve running offline jobs like candidate generation and label candidates, validating the output, as well as testing the new representation in the evaluation environment using evaluation tools like TweetSimilarityEvaluationAdhocApp.\n",
"\n",
"4. **Deploy the changes**: Once the new representation has been tested and validated, deploy the changes to production. This may involve creating a zip file, uploading it to the packer, and then scheduling it with Aurora. Be sure to monitor the system to ensure a smooth transition between representations and verify that the new representation is being used in recommendations as expected.\n",
"\n",
"5. **Monitor and assess the new representation**: After the new representation has been deployed, continue to monitor its performance and impact on recommendations. Take note of any improvements or issues that arise and be prepared to iterate on the new representation if needed. Always ensure that the results and performance metrics align with the system's goals and objectives. \n",
"\n",
"-> **Question**: How much do I get boosted within my cluster?\n",
"\n",
"**Answer**: It's not possible to determine the exact amount your content is boosted within your cluster in the SimClusters representation without specific data about your content and its engagement metrics. However, a combination of factors, such as the favorite score and follow score, alongside other engagement signals and SimCluster calculations, influence the boosting of content. \n",
"\n",
"-> **Question**: How does Heavy ranker work. what are its main inputs?\n",
"\n",
"**Answer**: The Heavy Ranker is a machine learning model that plays a crucial role in ranking and scoring candidates within the recommendation algorithm. Its primary purpose is to predict the likelihood of a user engaging with a tweet or connecting with another user on the platform.\n",
"\n",
"Main inputs to the Heavy Ranker consist of:\n",
"\n",
"1. Static Features: These are features that can be computed directly from a tweet at the time it's created, such as whether it has a URL, has cards, has quotes, etc. These features are produced by the Index Ingester as the tweets are generated and stored in the index.\n",
"\n",
"2. Real-time Features: These per-tweet features can change after the tweet has been indexed. They mostly consist of social engagements like retweet count, favorite count, reply count, and some spam signals that are computed with later activities. The Signal Ingester, which is part of a Heron topology, processes multiple event streams to collect and compute these real-time features.\n",
"\n",
"3. User Table Features: These per-user features are obtained from the User Table Updater that processes a stream written by the user service. This input is used to store sparse real-time user information, which is later propagated to the tweet being scored by looking up the author of the tweet.\n",
"\n",
"4. Search Context Features: These features represent the context of the current searcher, like their UI language, their content consumption, and the current time (implied). They are combined with Tweet Data to compute some of the features used in scoring.\n",
"\n",
"These inputs are then processed by the Heavy Ranker to score and rank candidates based on their relevance and likelihood of engagement by the user. \n",
"\n",
"-> **Question**: How can one influence Heavy ranker?\n",
"\n",
"**Answer**: To influence the Heavy Ranker's output or ranking of content, consider the following actions:\n",
"\n",
"1. Improve content quality: Create high-quality and engaging content that is relevant, informative, and valuable to users. High-quality content is more likely to receive positive user engagement, which the Heavy Ranker considers when ranking content.\n",
"\n",
"2. Increase user engagement: Encourage users to interact with content through likes, retweets, replies, and comments. Higher engagement levels can lead to better ranking in the Heavy Ranker's output.\n",
"\n",
"3. Optimize your user profile: A user's reputation, based on factors such as their follower count and follower-to-following ratio, may impact the ranking of their content. Maintain a good reputation by following relevant users, keeping a reasonable follower-to-following ratio and engaging with your followers.\n",
"\n",
"4. Enhance content discoverability: Use relevant keywords, hashtags, and mentions in your tweets, making it easier for users to find and engage with your content. This increased discoverability may help improve the ranking of your content by the Heavy Ranker.\n",
"\n",
"5. Leverage multimedia content: Experiment with different content formats, such as videos, images, and GIFs, which may capture users' attention and increase engagement, resulting in better ranking by the Heavy Ranker.\n",
"\n",
"6. User feedback: Monitor and respond to feedback for your content. Positive feedback may improve your ranking, while negative feedback provides an opportunity to learn and improve.\n",
"\n",
"Note that the Heavy Ranker uses a combination of machine learning models and various features to rank the content. While the above actions may help influence the ranking, there are no guarantees as the ranking process is determined by a complex algorithm, which evolves over time. \n",
"\n",
"-> **Question**: why threads and long tweets do so well on the platform?\n",
"\n",
"**Answer**: Threads and long tweets perform well on the platform for several reasons:\n",
"\n",
"1. **More content and context**: Threads and long tweets provide more information and context about a topic, which can make the content more engaging and informative for users. People tend to appreciate a well-structured and detailed explanation of a subject or a story, and threads and long tweets can do that effectively.\n",
"\n",
"2. **Increased user engagement**: As threads and long tweets provide more content, they also encourage users to engage with the tweets through replies, retweets, and likes. This increased engagement can lead to better visibility of the content, as the Twitter algorithm considers user engagement when ranking and surfacing tweets.\n",
"\n",
"3. **Narrative structure**: Threads enable users to tell stories or present arguments in a step-by-step manner, making the information more accessible and easier to follow. This narrative structure can capture users' attention and encourage them to read through the entire thread and interact with the content.\n",
"\n",
"4. **Expanded reach**: When users engage with a thread, their interactions can bring the content to the attention of their followers, helping to expand the reach of the thread. This increased visibility can lead to more interactions and higher performance for the threaded tweets.\n",
"\n",
"5. **Higher content quality**: Generally, threads and long tweets require more thought and effort to create, which may lead to higher quality content. Users are more likely to appreciate and interact with high-quality, well-reasoned content, further improving the performance of these tweets within the platform.\n",
"\n",
"Overall, threads and long tweets perform well on Twitter because they encourage user engagement and provide a richer, more informative experience that users find valuable. \n",
"\n",
"-> **Question**: Are thread and long tweet creators building a following that reacts to only threads?\n",
"\n",
"**Answer**: Based on the provided code and context, there isn't enough information to conclude if the creators of threads and long tweets primarily build a following that engages with only thread-based content. The code provided is focused on Twitter's recommendation and ranking algorithms, as well as infrastructure components like Kafka, partitions, and the Follow Recommendations Service (FRS). To answer your question, data analysis of user engagement and results of specific edge cases would be required. \n",
"\n",
"-> **Question**: Do you need to follow different strategies to get most followers vs to get most likes and bookmarks per tweet?\n",
"\n",
"**Answer**: Yes, different strategies need to be followed to maximize the number of followers compared to maximizing likes and bookmarks per tweet. While there may be some overlap in the approaches, they target different aspects of user engagement.\n",
"\n",
"Maximizing followers: The primary focus is on growing your audience on the platform. Strategies include:\n",
"\n",
"1. Consistently sharing high-quality content related to your niche or industry.\n",
"2. Engaging with others on the platform by replying, retweeting, and mentioning other users.\n",
"3. Using relevant hashtags and participating in trending conversations.\n",
"4. Collaborating with influencers and other users with a large following.\n",
"5. Posting at optimal times when your target audience is most active.\n",
"6. Optimizing your profile by using a clear profile picture, catchy bio, and relevant links.\n",
"\n",
"Maximizing likes and bookmarks per tweet: The focus is on creating content that resonates with your existing audience and encourages engagement. Strategies include:\n",
"\n",
"1. Crafting engaging and well-written tweets that encourage users to like or save them.\n",
"2. Incorporating visually appealing elements, such as images, GIFs, or videos, that capture attention.\n",
"3. Asking questions, sharing opinions, or sparking conversations that encourage users to engage with your tweets.\n",
"4. Using analytics to understand the type of content that resonates with your audience and tailoring your tweets accordingly.\n",
"5. Posting a mix of educational, entertaining, and promotional content to maintain variety and interest.\n",
"6. Timing your tweets strategically to maximize engagement, likes, and bookmarks per tweet.\n",
"\n",
"Both strategies can overlap, and you may need to adapt your approach by understanding your target audience's preferences and analyzing your account's performance. However, it's essential to recognize that maximizing followers and maximizing likes and bookmarks per tweet have different focuses and require specific strategies. \n",
"\n",
"-> **Question**: Content meta data and how it impacts virality (e.g. ALT in images).\n",
"\n",
"**Answer**: There is no direct information in the provided context about how content metadata, such as ALT text in images, impacts the virality of a tweet or post. However, it's worth noting that including ALT text can improve the accessibility of your content for users who rely on screen readers, which may lead to increased engagement for a broader audience. Additionally, metadata can be used in search engine optimization, which might improve the visibility of the content, but the context provided does not mention any specific correlation with virality. \n",
"\n",
"-> **Question**: What are some unexpected fingerprints for spam factors?\n",
"\n",
"**Answer**: In the provided context, an unusual indicator of spam factors is when a tweet contains a non-media, non-news link. If the tweet has a link but does not have an image URL, video URL, or news URL, it is considered a potential spam vector, and a threshold for user reputation (tweepCredThreshold) is set to MIN_TWEEPCRED_WITH_LINK.\n",
"\n",
"While this rule may not cover all possible unusual spam indicators, it is derived from the specific codebase and logic shared in the context. \n",
"\n",
"-> **Question**: Is there any difference between company verified checkmarks and blue verified individual checkmarks?\n",
"\n",
"**Answer**: Yes, there is a distinction between the verified checkmarks for companies and blue verified checkmarks for individuals. The code snippet provided mentions \"Blue-verified account boost\" which indicates that there is a separate category for blue verified accounts. Typically, blue verified checkmarks are used to indicate notable individuals, while verified checkmarks are for companies or organizations. \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}