mirror of https://github.com/hwchase17/langchain synced 2024-11-02 09:40:22 +00:00

Andrew Zhou 64c4a698a8 More comprehensive readthedocs document loader (#12382)	2023-10-29 16:26:53 -07:00

## Description:

When building our own readthedocs.io scraper, we noticed a couple of interesting things:

1. Text lines with a lot of nested <span> tags give unclean text with a bunch of newlines. For example, in [Langchain's documentation](https://api.python.langchain.com/en/latest/document_loaders/langchain.document_loaders.readthedocs.ReadTheDocsLoader.html#langchain.document_loaders.readthedocs.ReadTheDocsLoader), a single line is represented by a complicated nested HTML structure, and the naive `soup.get_text()` call currently being made creates a newline for each nested HTML element. The document loader therefore returns a messy, newline-separated blob of text. This is true in a lot of cases.

Additionally, content from iframes, code from scripts, CSS from styles, etc. is picked up if it sits under the selected element (which happens more often than you'd think). For example, [this page](https://pydeck.gl/gallery/contour_layer.html#) yields 1.5 million characters of such content.

Therefore, I wrote a recursive _get_clean_text(soup) class function that 1. skips all irrelevant elements, and 2. only adds newlines when necessary.

2. Index pages (like [this one](https://api.python.langchain.com/en/latest/api_reference.html)) would be loaded, chunked, and eventually embedded. This is bad not only because the user embeds irrelevant information, but also because index pages are very likely to show up in retrieved content, making retrieval less effective (in our tests). Therefore, I added a bool parameter `exclude_index_pages` that defaults to False (the current behavior, although I'd petition to default this to True) and skips all pages where links take up 50%+ of the page. Through manual testing, this seems to be the best threshold.

## Other Information:

- Issue: n/a
- Dependencies: n/a
- Tag maintainer: n/a
- Twitter handle: @andrewthezhou

---------

Co-authored-by: Andrew Zhou <andrew@heykona.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>

.devcontainer	rename repo namespace to langchain-ai (#11259 )	2023-10-01 15:30:58 -04:00
.github	update contributing (#12532 )	2023-10-29 16:22:18 -07:00
cookbook	notebook fmt (#12498 )	2023-10-29 15:50:09 -07:00
docker	Update Dockerfile.base (#11556 )	2023-10-09 16:43:04 +01:00
docs	Bagatur/fix doc ci (#12529 )	2023-10-29 16:15:18 -07:00
libs	More comprehensive readthedocs document loader (#12382 )	2023-10-29 16:26:53 -07:00
templates	notebook fmt (#12498 )	2023-10-29 15:50:09 -07:00
.gitattributes	Update dev container (#6189 )	2023-06-16 15:42:14 -07:00
.gitignore	Add LCEL to LLM intro (#11835 )	2023-10-15 14:59:45 -07:00
.readthedocs.yaml	customize rtd build (#11797 )	2023-10-13 19:50:22 -07:00
CITATION.cff	rename repo namespace to langchain-ai (#11259 )	2023-10-01 15:30:58 -04:00
LICENSE
Makefile	Bagatur/fix doc ci (#12529 )	2023-10-29 16:15:18 -07:00
MIGRATE.md	cr	2023-07-28 17:47:00 -07:00
poetry.lock	Bagatur/fix doc ci (#12529 )	2023-10-29 16:15:18 -07:00
poetry.toml	Unbreak devcontainer (#8154 )	2023-07-23 19:33:47 -07:00
pyproject.toml	Bagatur/fix doc ci (#12529 )	2023-10-29 16:15:18 -07:00
README.md	Improved readability of Docs (#12136 )	2023-10-22 17:16:30 -07:00
SECURITY.md	Update `SECURITY.md` email address. (#9558 )	2023-08-21 14:52:21 -04:00
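
The latest commit message above describes two ideas: extracting text while skipping non-content elements and adding newlines only where needed, and skipping pages where links make up 50% or more of the text. The block below is a minimal sketch of both using only the Python standard library. It is not the loader's actual code (the commit describes a recursive function over a parsed soup); the class name `CleanTextParser`, the tag sets, and the helpers `clean_text` and `is_index_page` are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumed tag sets for this sketch; the real loader may use different ones.
SKIP_TAGS = {"script", "style", "iframe", "noscript"}
BLOCK_TAGS = {"p", "div", "li", "br", "h1", "h2", "h3", "pre", "tr"}


class CleanTextParser(HTMLParser):
    """Collects visible text and counts how much of it sits inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.link_depth = 0  # > 0 while inside an <a> element
        self.link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif tag == "a":
            self.link_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")  # newline only at block boundaries

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth:
            return
        elif tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth:
            return
        self.parts.append(data)
        if self.link_depth:
            self.link_chars += len(data.strip())


def clean_text(html):
    """Return (text, link_chars) with inline tags flattened onto one line."""
    parser = CleanTextParser()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.parts)
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = [" ".join(line.split()) for line in raw.split("\n")]
    return "\n".join(line for line in lines if line), parser.link_chars


def is_index_page(html, threshold=0.5):
    """True when link text makes up at least `threshold` of the page text."""
    text, link_chars = clean_text(html)
    return bool(text) and link_chars / len(text) >= threshold


page = "<div><p>Hello <span>big</span> <span>world</span></p><script>var x = 1;</script><p>Bye</p></div>"
index = '<ul><li><a href="a.html">Page A</a></li><li><a href="b.html">Page B</a></li></ul>'
print(clean_text(page)[0])
print(is_index_page(page), is_index_page(index))
```

Nested inline tags such as `<span>` stay on one line because only the assumed block-level tags emit a newline, and script contents never reach the output.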

README.md

🦜️🔗 LangChain

⚡ Building applications with LLMs through composability ⚡

Looking for the JS/TS version? Check out LangChain.js.

To help you ship LangChain apps to production faster, check out LangSmith. LangSmith is a unified developer platform for building, testing, and monitoring LLM applications. Fill out this form to get off the waitlist or speak with our sales team.

🚨Breaking Changes for select chains (SQLDatabase) on 7/28/23

In an effort to make langchain leaner and safer, we are moving select chains to langchain_experimental. This migration has already started, but we are remaining backwards compatible until 7/28. On that date, we will remove functionality from langchain. Read more about the motivation and the progress here. Read how to migrate your code here.

Quick Install

`pip install langchain` or `pip install langsmith && conda install langchain -c conda-forge`

🤔 What is this?

Large language models (LLMs) are emerging as a transformative technology, enabling developers to build applications that they previously could not. However, using these LLMs in isolation is often insufficient for creating a truly powerful app - the real power comes when you can combine them with other sources of computation or knowledge.

This library aims to assist in the development of those types of applications. Common examples of these applications include:

❓ Question Answering over specific documents

Documentation
End-to-end Example: Question Answering over Notion Database

💬 Chatbots

Documentation
End-to-end Example: Chat-LangChain

🤖 Agents

Documentation
End-to-end Example: GPT+WolframAlpha

📖 Documentation

Please see here for full documentation on:

Getting started (installation, setting up the environment, simple examples)
How-To examples (demos, integrations, helper functions)
Reference (full API docs)
Resources (high-level explanation of core concepts)

🚀 What can this help with?

There are six main areas that LangChain is designed to help with. These are, in increasing order of complexity:

📃 LLMs and Prompts:

This includes prompt management, prompt optimization, a generic interface for all LLMs, and common utilities for working with LLMs.

🔗 Chains:

Chains go beyond a single LLM call and involve sequences of calls (whether to an LLM or a different utility). LangChain provides a standard interface for chains, lots of integrations with other tools, and end-to-end chains for common applications.
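
As a rough illustration of the idea of a chain as a sequence of calls, here is a minimal sketch in plain Python. It does not use LangChain's own interface; `fake_llm`, `make_chain`, and the `Step` alias are hypothetical names, and the "model" simply upper-cases its prompt so the example runs without any API key.

```python
from typing import Callable, List

Step = Callable[[str], str]


def fake_llm(prompt: str) -> str:
    """Stand-in for a model call; it just echoes the prompt in upper case."""
    return prompt.upper()


def make_chain(steps: List[Step]) -> Step:
    """Compose steps so each one receives the previous step's output."""

    def run(text: str) -> str:
        for step in steps:
            text = step(text)
        return text

    return run


chain = make_chain([
    lambda topic: f"Write a slogan about {topic}.",  # prompt formatting
    fake_llm,                                        # "LLM" call
    str.strip,                                       # output cleanup
])
print(chain("parrots"))
```

Each step only needs to accept and return a string, which is what lets an LLM call and an ordinary utility function sit in the same sequence.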

📚 Data Augmented Generation:

Data Augmented Generation involves specific types of chains that first interact with an external data source to fetch data for use in the generation step. Examples include summarization of long pieces of text and question answering over specific data sources.
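
The fetch-then-generate pattern can be sketched without any external service. The following is an illustrative example only: `retrieve` uses a toy word-overlap score rather than embeddings, and both function names and the sample documents are assumptions, not part of LangChain.

```python
def retrieve(question, documents, k=1):
    """Rank documents by how many words they share with the question."""
    words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, context):
    """Place the fetched data in the prompt used for the generation step."""
    joined = "\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


docs = [
    "LangChain is a library for building LLM applications.",
    "Parrots are colorful birds.",
]
question = "what is langchain"
context = retrieve(question, docs)
print(build_prompt(question, context))
```

The resulting prompt would then be passed to a model; that call is left out here so the sketch stays self-contained.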

🤖 Agents:

Agents involve an LLM making decisions about which Actions to take, taking that Action, seeing an Observation, and repeating that until done. LangChain provides a standard interface for agents, a selection of agents to choose from, and examples of end-to-end agents.
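
The decide, act, observe, repeat loop described above can be sketched as follows. This is a hedged illustration rather than LangChain's agent interface: `fake_policy` stands in for the LLM's decision step, and `calculator`, `TOOLS`, and `run_agent` are hypothetical names.

```python
def calculator(expression: str) -> str:
    """Toy tool: evaluates 'a op b' for +, - and *."""
    a, op, b = expression.split()
    a, b = float(a), float(b)
    results = {"+": a + b, "-": a - b, "*": a * b}
    return str(results[op])


TOOLS = {"calculator": calculator}


def fake_policy(question, observations):
    """Stand-in for the LLM: pick an action based on what has been seen."""
    if not observations:
        return ("calculator", "6 * 7")
    return ("finish", f"The answer is {observations[-1]}")


def run_agent(question, policy, tools, max_steps=5):
    """Repeat: choose an Action, run it, record the Observation."""
    observations = []
    for _ in range(max_steps):
        action, action_input = policy(question, observations)
        if action == "finish":
            return action_input
        observations.append(tools[action](action_input))
    return "Stopped: step limit reached"


print(run_agent("What is 6 times 7?", fake_policy, TOOLS))
```

The `max_steps` cap is a design choice of this sketch so that a policy that never finishes cannot loop forever.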

🧠 Memory:

Memory refers to persisting state between calls of a chain/agent. LangChain provides a standard interface for memory, a collection of memory implementations, and examples of chains/agents that use memory.
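
To make "persisting state between calls" concrete, here is a minimal sketch of a conversation buffer. The class `BufferMemory`, the `chat` helper, and `counting_llm` are hypothetical and are not LangChain's memory interface; the fake model just reports how many human turns it can see in its prompt.

```python
class BufferMemory:
    """Keeps every turn and replays the history into the next prompt."""

    def __init__(self):
        self.turns = []

    def load(self) -> str:
        return "\n".join(f"{speaker}: {text}" for speaker, text in self.turns)

    def save(self, user_input: str, output: str) -> None:
        self.turns.append(("Human", user_input))
        self.turns.append(("AI", output))


def chat(memory, user_input, llm):
    """One call: read the stored state, call the model, write state back."""
    prompt = f"{memory.load()}\nHuman: {user_input}\nAI:"
    output = llm(prompt)
    memory.save(user_input, output)
    return output


def counting_llm(prompt: str) -> str:
    """Stand-in model: reports how many human turns appear in the prompt."""
    return f"turn {prompt.count('Human:')}"


memory = BufferMemory()
first = chat(memory, "hi", counting_llm)
second = chat(memory, "hello again", counting_llm)
print(first, second)
```

The second call sees the first exchange only because the memory object carried it over; the model function itself is stateless.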

🧐 Evaluation:

[BETA] Generative models are notoriously hard to evaluate with traditional metrics. One new way of evaluating them is by using language models themselves to do the evaluation. LangChain provides some prompts/chains for assisting in this.

For more information on these concepts, please see our full documentation.

💁 Contributing

As an open-source project in a rapidly developing field, we are extremely open to contributions, whether it be in the form of a new feature, improved infrastructure, or better documentation.

For detailed information on how to contribute, see here.
