docs: added template to `arxiv` page (#21846)

Updated the `arXiv` page with the arXiv references from Templates (previously
it listed only references from Docs and API Refs, not Templates).
Re #21450 
CC @eyurtsev
pull/21843/head^2
Leonid Ganeline 4 months ago committed by GitHub
parent e6207ad4f3
commit 6a59f76f2b

@ -1,54 +1,146 @@
# arXiv
LangChain implements the latest research in the field of Natural Language Processing.
This page contains `arXiv` papers referenced in the LangChain Documentation and API Reference.
This page contains `arXiv` papers referenced in the LangChain Documentation, API Reference,
and Templates.
## Summary
| arXiv id / Title | Authors | Published date 🔻 | LangChain Documentation and API Reference |
|------------------|---------|-------------------|-------------------------|
| `2307.03172v3` [Lost in the Middle: How Language Models Use Long Contexts](http://arxiv.org/abs/2307.03172v3) | Nelson F. Liu, Kevin Lin, John Hewitt, et al. | 2023-07-06 | `Docs:` [docs/modules/data_connection/retrievers/long_context_reorder](https://python.langchain.com/docs/modules/data_connection/retrievers/long_context_reorder)
| arXiv id / Title | Authors | Published date 🔻 | LangChain Documentation|
|------------------|---------|-------------------|------------------------|
| `2312.06648v2` [Dense X Retrieval: What Retrieval Granularity Should We Use?](http://arxiv.org/abs/2312.06648v2) | Tong Chen, Hongwei Wang, Sihao Chen, et al. | 2023-12-11 | `Template:` [propositional-retrieval](https://python.langchain.com/docs/templates/propositional-retrieval)
| `2311.09210v1` [Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models](http://arxiv.org/abs/2311.09210v1) | Wenhao Yu, Hongming Zhang, Xiaoman Pan, et al. | 2023-11-15 | `Template:` [chain-of-note-wiki](https://python.langchain.com/docs/templates/chain-of-note-wiki)
| `2310.06117v2` [Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models](http://arxiv.org/abs/2310.06117v2) | Huaixiu Steven Zheng, Swaroop Mishra, Xinyun Chen, et al. | 2023-10-09 | `Template:` [stepback-qa-prompting](https://python.langchain.com/docs/templates/stepback-qa-prompting)
| `2305.14283v3` [Query Rewriting for Retrieval-Augmented Large Language Models](http://arxiv.org/abs/2305.14283v3) | Xinbei Ma, Yeyun Gong, Pengcheng He, et al. | 2023-05-23 | `Template:` [rewrite-retrieve-read](https://python.langchain.com/docs/templates/rewrite-retrieve-read)
| `2305.08291v1` [Large Language Model Guided Tree-of-Thought](http://arxiv.org/abs/2305.08291v1) | Jieyi Long | 2023-05-15 | `API:` [langchain_experimental.tot](https://api.python.langchain.com/en/latest/experimental_api_reference.html#module-langchain_experimental.tot)
| `2305.06983v2` [Active Retrieval Augmented Generation](http://arxiv.org/abs/2305.06983v2) | Zhengbao Jiang, Frank F. Xu, Luyu Gao, et al. | 2023-05-11 | `Docs:` [docs/modules/chains](https://python.langchain.com/docs/modules/chains)
| `2303.17580v4` [HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face](http://arxiv.org/abs/2303.17580v4) | Yongliang Shen, Kaitao Song, Xu Tan, et al. | 2023-03-30 | `API:` [langchain_experimental.autonomous_agents](https://api.python.langchain.com/en/latest/experimental_api_reference.html#module-langchain_experimental.autonomous_agents)
| `2303.08774v6` [GPT-4 Technical Report](http://arxiv.org/abs/2303.08774v6) | OpenAI, Josh Achiam, Steven Adler, et al. | 2023-03-15 | `Docs:` [docs/integrations/vectorstores/mongodb_atlas](https://python.langchain.com/docs/integrations/vectorstores/mongodb_atlas)
| `2301.10226v4` [A Watermark for Large Language Models](http://arxiv.org/abs/2301.10226v4) | John Kirchenbauer, Jonas Geiping, Yuxin Wen, et al. | 2023-01-24 | `API:` [langchain_community.llms...HuggingFaceTextGenInference](https://api.python.langchain.com/en/latest/llms/langchain_community.llms.huggingface_text_gen_inference.HuggingFaceTextGenInference.html#langchain_community.llms.huggingface_text_gen_inference.HuggingFaceTextGenInference), [langchain_community.llms...HuggingFaceEndpoint](https://api.python.langchain.com/en/latest/llms/langchain_community.llms.huggingface_endpoint.HuggingFaceEndpoint.html#langchain_community.llms.huggingface_endpoint.HuggingFaceEndpoint), [langchain_community.llms...OCIModelDeploymentTGI](https://api.python.langchain.com/en/latest/llms/langchain_community.llms.oci_data_science_model_deployment_endpoint.OCIModelDeploymentTGI.html#langchain_community.llms.oci_data_science_model_deployment_endpoint.OCIModelDeploymentTGI)
| `2212.10496v1` [Precise Zero-Shot Dense Retrieval without Relevance Labels](http://arxiv.org/abs/2212.10496v1) | Luyu Gao, Xueguang Ma, Jimmy Lin, et al. | 2022-12-20 | `Docs:` [docs/use_cases/query_analysis/techniques/hyde](https://python.langchain.com/docs/use_cases/query_analysis/techniques/hyde), `API:` [langchain.chains...HypotheticalDocumentEmbedder](https://api.python.langchain.com/en/latest/chains/langchain.chains.hyde.base.HypotheticalDocumentEmbedder.html#langchain.chains.hyde.base.HypotheticalDocumentEmbedder)
| `2212.08073v1` [Constitutional AI: Harmlessness from AI Feedback](http://arxiv.org/abs/2212.08073v1) | Yuntao Bai, Saurav Kadavath, Sandipan Kundu, et al. | 2022-12-15 | `Docs:` [docs/guides/productionization/evaluation/string/criteria_eval_chain](https://python.langchain.com/docs/guides/productionization/evaluation/string/criteria_eval_chain)
| `2301.10226v4` [A Watermark for Large Language Models](http://arxiv.org/abs/2301.10226v4) | John Kirchenbauer, Jonas Geiping, Yuxin Wen, et al. | 2023-01-24 | `API:` [langchain_community.llms...OCIModelDeploymentTGI](https://api.python.langchain.com/en/latest/llms/langchain_community.llms.oci_data_science_model_deployment_endpoint.OCIModelDeploymentTGI.html#langchain_community.llms.oci_data_science_model_deployment_endpoint.OCIModelDeploymentTGI), [langchain_community.llms...HuggingFaceTextGenInference](https://api.python.langchain.com/en/latest/llms/langchain_community.llms.huggingface_text_gen_inference.HuggingFaceTextGenInference.html#langchain_community.llms.huggingface_text_gen_inference.HuggingFaceTextGenInference), [langchain_community.llms...HuggingFaceEndpoint](https://api.python.langchain.com/en/latest/llms/langchain_community.llms.huggingface_endpoint.HuggingFaceEndpoint.html#langchain_community.llms.huggingface_endpoint.HuggingFaceEndpoint)
| `2212.10496v1` [Precise Zero-Shot Dense Retrieval without Relevance Labels](http://arxiv.org/abs/2212.10496v1) | Luyu Gao, Xueguang Ma, Jimmy Lin, et al. | 2022-12-20 | `API:` [langchain.chains...HypotheticalDocumentEmbedder](https://api.python.langchain.com/en/latest/chains/langchain.chains.hyde.base.HypotheticalDocumentEmbedder.html#langchain.chains.hyde.base.HypotheticalDocumentEmbedder), `Template:` [hyde](https://python.langchain.com/docs/templates/hyde)
| `2212.07425v3` [Robust and Explainable Identification of Logical Fallacies in Natural Language Arguments](http://arxiv.org/abs/2212.07425v3) | Zhivar Sourati, Vishnu Priya Prasanna Venkatesh, Darshan Deshpande, et al. | 2022-12-12 | `API:` [langchain_experimental.fallacy_removal](https://api.python.langchain.com/en/latest/experimental_api_reference.html#module-langchain_experimental.fallacy_removal)
| `2211.13892v2` [Complementary Explanations for Effective In-Context Learning](http://arxiv.org/abs/2211.13892v2) | Xi Ye, Srinivasan Iyer, Asli Celikyilmaz, et al. | 2022-11-25 | `API:` [langchain_core.example_selectors...MaxMarginalRelevanceExampleSelector](https://api.python.langchain.com/en/latest/example_selectors/langchain_core.example_selectors.semantic_similarity.MaxMarginalRelevanceExampleSelector.html#langchain_core.example_selectors.semantic_similarity.MaxMarginalRelevanceExampleSelector)
| `2211.10435v2` [PAL: Program-aided Language Models](http://arxiv.org/abs/2211.10435v2) | Luyu Gao, Aman Madaan, Shuyan Zhou, et al. | 2022-11-18 | `API:` [langchain_experimental.pal_chain...PALChain](https://api.python.langchain.com/en/latest/pal_chain/langchain_experimental.pal_chain.base.PALChain.html#langchain_experimental.pal_chain.base.PALChain), [langchain_experimental.pal_chain](https://api.python.langchain.com/en/latest/experimental_api_reference.html#module-langchain_experimental.pal_chain)
| `2209.10785v2` [Deep Lake: a Lakehouse for Deep Learning](http://arxiv.org/abs/2209.10785v2) | Sasun Hambardzumyan, Abhinav Tuli, Levon Ghukasyan, et al. | 2022-09-22 | `Docs:` [docs/integrations/providers/activeloop_deeplake](https://python.langchain.com/docs/integrations/providers/activeloop_deeplake)
| `2205.12654v1` [Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages](http://arxiv.org/abs/2205.12654v1) | Kevin Heffernan, Onur Çelebi, Holger Schwenk | 2022-05-25 | `API:` [langchain_community.embeddings...LaserEmbeddings](https://api.python.langchain.com/en/latest/embeddings/langchain_community.embeddings.laser.LaserEmbeddings.html#langchain_community.embeddings.laser.LaserEmbeddings)
| `2204.00498v1` [Evaluating the Text-to-SQL Capabilities of Large Language Models](http://arxiv.org/abs/2204.00498v1) | Nitarshan Rajkumar, Raymond Li, Dzmitry Bahdanau | 2022-03-15 | `Docs:` [docs/use_cases/sql/quickstart](https://python.langchain.com/docs/use_cases/sql/quickstart), `API:` [langchain_community.utilities...SQLDatabase](https://api.python.langchain.com/en/latest/utilities/langchain_community.utilities.sql_database.SQLDatabase.html#langchain_community.utilities.sql_database.SQLDatabase), [langchain_community.utilities...SparkSQL](https://api.python.langchain.com/en/latest/utilities/langchain_community.utilities.spark_sql.SparkSQL.html#langchain_community.utilities.spark_sql.SparkSQL)
| `2204.00498v1` [Evaluating the Text-to-SQL Capabilities of Large Language Models](http://arxiv.org/abs/2204.00498v1) | Nitarshan Rajkumar, Raymond Li, Dzmitry Bahdanau | 2022-03-15 | `API:` [langchain_community.utilities...SQLDatabase](https://api.python.langchain.com/en/latest/utilities/langchain_community.utilities.sql_database.SQLDatabase.html#langchain_community.utilities.sql_database.SQLDatabase), [langchain_community.utilities...SparkSQL](https://api.python.langchain.com/en/latest/utilities/langchain_community.utilities.spark_sql.SparkSQL.html#langchain_community.utilities.spark_sql.SparkSQL)
| `2202.00666v5` [Locally Typical Sampling](http://arxiv.org/abs/2202.00666v5) | Clara Meister, Tiago Pimentel, Gian Wiher, et al. | 2022-02-01 | `API:` [langchain_community.llms...HuggingFaceTextGenInference](https://api.python.langchain.com/en/latest/llms/langchain_community.llms.huggingface_text_gen_inference.HuggingFaceTextGenInference.html#langchain_community.llms.huggingface_text_gen_inference.HuggingFaceTextGenInference), [langchain_community.llms...HuggingFaceEndpoint](https://api.python.langchain.com/en/latest/llms/langchain_community.llms.huggingface_endpoint.HuggingFaceEndpoint.html#langchain_community.llms.huggingface_endpoint.HuggingFaceEndpoint)
| `2103.00020v1` [Learning Transferable Visual Models From Natural Language Supervision](http://arxiv.org/abs/2103.00020v1) | Alec Radford, Jong Wook Kim, Chris Hallacy, et al. | 2021-02-26 | `API:` [langchain_experimental.open_clip](https://api.python.langchain.com/en/latest/experimental_api_reference.html#module-langchain_experimental.open_clip)
| `1909.05858v2` [CTRL: A Conditional Transformer Language Model for Controllable Generation](http://arxiv.org/abs/1909.05858v2) | Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, et al. | 2019-09-11 | `API:` [langchain_community.llms...HuggingFaceTextGenInference](https://api.python.langchain.com/en/latest/llms/langchain_community.llms.huggingface_text_gen_inference.HuggingFaceTextGenInference.html#langchain_community.llms.huggingface_text_gen_inference.HuggingFaceTextGenInference), [langchain_community.llms...HuggingFaceEndpoint](https://api.python.langchain.com/en/latest/llms/langchain_community.llms.huggingface_endpoint.HuggingFaceEndpoint.html#langchain_community.llms.huggingface_endpoint.HuggingFaceEndpoint)
| `1908.10084v1` [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](http://arxiv.org/abs/1908.10084v1) | Nils Reimers, Iryna Gurevych | 2019-08-27 | `Docs:` [docs/integrations/text_embedding/sentence_transformers](https://python.langchain.com/docs/integrations/text_embedding/sentence_transformers)
## Lost in the Middle: How Language Models Use Long Contexts
- **arXiv id:** 2307.03172v3
- **Title:** Lost in the Middle: How Language Models Use Long Contexts
- **Authors:** Nelson F. Liu, Kevin Lin, John Hewitt, et al.
- **Published Date:** 2023-07-06
- **URL:** http://arxiv.org/abs/2307.03172v3
- **LangChain Documentation:** [docs/modules/data_connection/retrievers/long_context_reorder](https://python.langchain.com/docs/modules/data_connection/retrievers/long_context_reorder)
**Abstract:** While recent language models have the ability to take long contexts as input,
relatively little is known about how well they use longer context. We analyze
the performance of language models on two tasks that require identifying
relevant information in their input contexts: multi-document question answering
and key-value retrieval. We find that performance can degrade significantly
when changing the position of relevant information, indicating that current
language models do not robustly make use of information in long input contexts.
In particular, we observe that performance is often highest when relevant
information occurs at the beginning or end of the input context, and
significantly degrades when models must access relevant information in the
middle of long contexts, even for explicitly long-context models. Our analysis
provides a better understanding of how language models use their input context
and provides new evaluation protocols for future long-context language models.
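
The positional effect described in this abstract suggests one mitigation: put the most relevant retrieved documents at the start and end of the context and the least relevant ones in the middle. The following is a minimal sketch of that idea, not the code behind the linked `long_context_reorder` page. The function name `reorder_for_long_context` is hypothetical, and the input is assumed to be sorted from most to least relevant.

```python
from collections import deque


def reorder_for_long_context(docs_by_relevance):
    """Return the documents with the strongest ones at both edges.

    Assumption: `docs_by_relevance` is sorted from most to least relevant.
    Even ranks fill the front left-to-right, odd ranks fill the back
    right-to-left, so the weakest documents end up in the middle.
    """
    front = []
    back = deque()
    for rank, doc in enumerate(docs_by_relevance):
        if rank % 2 == 0:
            front.append(doc)
        else:
            back.appendleft(doc)
    return front + list(back)


if __name__ == "__main__":
    print(reorder_for_long_context(["d1", "d2", "d3", "d4", "d5"]))
```

With five documents ranked `d1` (best) to `d5` (worst), the two strongest documents land in the first and last positions and `d5` lands in the center.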
## Dense X Retrieval: What Retrieval Granularity Should We Use?
- **arXiv id:** 2312.06648v2
- **Title:** Dense X Retrieval: What Retrieval Granularity Should We Use?
- **Authors:** Tong Chen, Hongwei Wang, Sihao Chen, et al.
- **Published Date:** 2023-12-11
- **URL:** http://arxiv.org/abs/2312.06648v2
- **LangChain:**
- **Template:** [propositional-retrieval](https://python.langchain.com/docs/templates/propositional-retrieval)
**Abstract:** Dense retrieval has become a prominent method to obtain relevant context or
world knowledge in open-domain NLP tasks. When we use a learned dense retriever
on a retrieval corpus at inference time, an often-overlooked design choice is
the retrieval unit in which the corpus is indexed, e.g. document, passage, or
sentence. We discover that the retrieval unit choice significantly impacts the
performance of both retrieval and downstream tasks. Distinct from the typical
approach of using passages or sentences, we introduce a novel retrieval unit,
proposition, for dense retrieval. Propositions are defined as atomic
expressions within text, each encapsulating a distinct factoid and presented in
a concise, self-contained natural language format. We conduct an empirical
comparison of different retrieval granularity. Our results reveal that
proposition-based retrieval significantly outperforms traditional passage or
sentence-based methods in dense retrieval. Moreover, retrieval by proposition
also enhances the performance of downstream QA tasks, since the retrieved texts
are more condensed with question-relevant information, reducing the need for
lengthy input tokens and minimizing the inclusion of extraneous, irrelevant
information.
## Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models
- **arXiv id:** 2311.09210v1
- **Title:** Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models
- **Authors:** Wenhao Yu, Hongming Zhang, Xiaoman Pan, et al.
- **Published Date:** 2023-11-15
- **URL:** http://arxiv.org/abs/2311.09210v1
- **LangChain:**
- **Template:** [chain-of-note-wiki](https://python.langchain.com/docs/templates/chain-of-note-wiki)
**Abstract:** Retrieval-augmented language models (RALMs) represent a substantial
advancement in the capabilities of large language models, notably in reducing
factual hallucination by leveraging external knowledge sources. However, the
reliability of the retrieved information is not always guaranteed. The
retrieval of irrelevant data can lead to misguided responses, potentially
causing the model to overlook its inherent knowledge, even when it possesses
adequate information to address the query. Moreover, standard RALMs often
struggle to assess whether they possess adequate knowledge, both intrinsic and
retrieved, to provide an accurate answer. In situations where knowledge is
lacking, these systems should ideally respond with "unknown" when the answer is
unattainable. In response to these challenges, we introduce Chain-of-Noting
(CoN), a novel approach aimed at improving the robustness of RALMs in facing
noisy, irrelevant documents and in handling unknown scenarios. The core idea of
CoN is to generate sequential reading notes for retrieved documents, enabling a
thorough evaluation of their relevance to the given question and integrating
this information to formulate the final answer. We employed ChatGPT to create
training data for CoN, which was subsequently trained on an LLaMa-2 7B model.
Our experiments across four open-domain QA benchmarks show that RALMs equipped
with CoN significantly outperform standard RALMs. Notably, CoN achieves an
average improvement of +7.9 in EM score given entirely noisy retrieved
documents and +10.5 in rejection rates for real-time questions that fall
outside the pre-training knowledge scope.
## Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models
- **arXiv id:** 2310.06117v2
- **Title:** Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models
- **Authors:** Huaixiu Steven Zheng, Swaroop Mishra, Xinyun Chen, et al.
- **Published Date:** 2023-10-09
- **URL:** http://arxiv.org/abs/2310.06117v2
- **LangChain:**
- **Template:** [stepback-qa-prompting](https://python.langchain.com/docs/templates/stepback-qa-prompting)
**Abstract:** We present Step-Back Prompting, a simple prompting technique that enables
LLMs to do abstractions to derive high-level concepts and first principles from
instances containing specific details. Using the concepts and principles to
guide reasoning, LLMs significantly improve their abilities in following a
correct reasoning path towards the solution. We conduct experiments of
Step-Back Prompting with PaLM-2L, GPT-4 and Llama2-70B models, and observe
substantial performance gains on various challenging reasoning-intensive tasks
including STEM, Knowledge QA, and Multi-Hop Reasoning. For instance, Step-Back
Prompting improves PaLM-2L performance on MMLU (Physics and Chemistry) by 7%
and 11% respectively, TimeQA by 27%, and MuSiQue by 7%.
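
The two stages the abstract describes, abstraction and then reasoning, can be sketched as three model calls. This is an illustrative outline under stated assumptions, not the `stepback-qa-prompting` template's code. `llm` is a hypothetical callable that maps a prompt string to a completion string, and the prompt wording is invented for the example.

```python
from typing import Callable


def step_back_answer(question: str, llm: Callable[[str], str]) -> str:
    """Answer `question` by first abstracting it (hypothetical helper)."""
    # Stage 1 (abstraction): ask for a more general question behind the
    # specific one.
    step_back_question = llm(
        "Rewrite the following question as a more general question about "
        f"the underlying concept or principle:\n{question}"
    )
    # Answer the general question to obtain high-level principles.
    principles = llm(step_back_question)
    # Stage 2 (reasoning): answer the original question, guided by them.
    return llm(
        f"Principles:\n{principles}\n\n"
        f"Using the principles above, answer the question:\n{question}"
    )
```

Any function with a `str -> str` signature can stand in for `llm`, including a stub that returns canned text, which makes the control flow easy to test without a model.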
## Query Rewriting for Retrieval-Augmented Large Language Models
- **arXiv id:** 2305.14283v3
- **Title:** Query Rewriting for Retrieval-Augmented Large Language Models
- **Authors:** Xinbei Ma, Yeyun Gong, Pengcheng He, et al.
- **Published Date:** 2023-05-23
- **URL:** http://arxiv.org/abs/2305.14283v3
- **LangChain:**
- **Template:** [rewrite-retrieve-read](https://python.langchain.com/docs/templates/rewrite-retrieve-read)
**Abstract:** Large Language Models (LLMs) play powerful, black-box readers in the
retrieve-then-read pipeline, making remarkable progress in knowledge-intensive
tasks. This work introduces a new framework, Rewrite-Retrieve-Read instead of
the previous retrieve-then-read for the retrieval-augmented LLMs from the
perspective of the query rewriting. Unlike prior studies focusing on adapting
either the retriever or the reader, our approach pays attention to the
adaptation of the search query itself, for there is inevitably a gap between
the input text and the needed knowledge in retrieval. We first prompt an LLM to
generate the query, then use a web search engine to retrieve contexts.
Furthermore, to better align the query to the frozen modules, we propose a
trainable scheme for our pipeline. A small language model is adopted as a
trainable rewriter to cater to the black-box LLM reader. The rewriter is
trained using the feedback of the LLM reader by reinforcement learning.
Evaluation is conducted on downstream tasks, open-domain QA and multiple-choice
QA. Experimental results show consistent performance improvement, indicating
that our framework is proven effective and scalable, and brings a new framework
for retrieval-augmented LLM.
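
The pipeline order in this abstract (rewrite the query, retrieve with it, then read) can be sketched with three pluggable steps. This is a minimal sketch, not the `rewrite-retrieve-read` template's implementation. The names `rewrite`, `search`, and `read` are hypothetical callables supplied by the caller, and the trainable rewriter from the paper is out of scope here.

```python
from typing import Callable, List


def rewrite_retrieve_read(
    question: str,
    rewrite: Callable[[str], str],
    search: Callable[[str], List[str]],
    read: Callable[[str, List[str]], str],
) -> str:
    """Run the three stages in order (all three callables are assumptions)."""
    # Rewrite: turn the raw question into a query suited to the retriever.
    query = rewrite(question)
    # Retrieve: fetch contexts with the rewritten query, not the raw input.
    contexts = search(query)
    # Read: the reader still sees the original question plus the contexts.
    return read(question, contexts)
```

Keeping the original question for the read step while searching with the rewritten query reflects the gap the abstract points to between the input text and the knowledge needed for retrieval.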
## Large Language Model Guided Tree-of-Thought
@ -57,8 +149,9 @@ and provides new evaluation protocols for future long-context language models.
- **Authors:** Jieyi Long
- **Published Date:** 2023-05-15
- **URL:** http://arxiv.org/abs/2305.08291v1
- **LangChain:**
- **LangChain API Reference:** [langchain_experimental.tot](https://api.python.langchain.com/en/latest/experimental_api_reference.html#module-langchain_experimental.tot)
- **API Reference:** [langchain_experimental.tot](https://api.python.langchain.com/en/latest/experimental_api_reference.html#module-langchain_experimental.tot)
**Abstract:** In this paper, we introduce the Tree-of-Thought (ToT) framework, a novel
approach aimed at improving the problem-solving capabilities of auto-regressive
@ -78,35 +171,6 @@ significantly increase the success rate of Sudoku puzzle solving. Our
implementation of the ToT-based Sudoku solver is available on GitHub:
https://github.com/jieyilong/tree-of-thought-puzzle-solver.
## Active Retrieval Augmented Generation
- **arXiv id:** 2305.06983v2
- **Title:** Active Retrieval Augmented Generation
- **Authors:** Zhengbao Jiang, Frank F. Xu, Luyu Gao, et al.
- **Published Date:** 2023-05-11
- **URL:** http://arxiv.org/abs/2305.06983v2
- **LangChain Documentation:** [docs/modules/chains](https://python.langchain.com/docs/modules/chains)
**Abstract:** Despite the remarkable ability of large language models (LMs) to comprehend
and generate language, they have a tendency to hallucinate and create factually
inaccurate output. Augmenting LMs by retrieving information from external
knowledge resources is one promising solution. Most existing retrieval
augmented LMs employ a retrieve-and-generate setup that only retrieves
information once based on the input. This is limiting, however, in more general
scenarios involving generation of long texts, where continually gathering
information throughout generation is essential. In this work, we provide a
generalized view of active retrieval augmented generation, methods that
actively decide when and what to retrieve across the course of the generation.
We propose Forward-Looking Active REtrieval augmented generation (FLARE), a
generic method which iteratively uses a prediction of the upcoming sentence to
anticipate future content, which is then utilized as a query to retrieve
relevant documents to regenerate the sentence if it contains low-confidence
tokens. We test FLARE along with baselines comprehensively over 4 long-form
knowledge-intensive generation tasks/datasets. FLARE achieves superior or
competitive performance on all tasks, demonstrating the effectiveness of our
method. Code and datasets are available at https://github.com/jzbjyb/FLARE.
## HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
- **arXiv id:** 2303.17580v4
@ -114,8 +178,9 @@ method. Code and datasets are available at https://github.com/jzbjyb/FLARE.
- **Authors:** Yongliang Shen, Kaitao Song, Xu Tan, et al.
- **Published Date:** 2023-03-30
- **URL:** http://arxiv.org/abs/2303.17580v4
- **LangChain:**
- **LangChain API Reference:** [langchain_experimental.autonomous_agents](https://api.python.langchain.com/en/latest/experimental_api_reference.html#module-langchain_experimental.autonomous_agents)
- **API Reference:** [langchain_experimental.autonomous_agents](https://api.python.langchain.com/en/latest/experimental_api_reference.html#module-langchain_experimental.autonomous_agents)
**Abstract:** Solving complicated AI tasks with different domains and modalities is a key
step toward artificial general intelligence. While there are numerous AI models
@ -144,8 +209,9 @@ realization of artificial general intelligence.
- **Authors:** OpenAI, Josh Achiam, Steven Adler, et al.
- **Published Date:** 2023-03-15
- **URL:** http://arxiv.org/abs/2303.08774v6
- **LangChain Documentation:** [docs/integrations/vectorstores/mongodb_atlas](https://python.langchain.com/docs/integrations/vectorstores/mongodb_atlas)
- **LangChain:**
- **Documentation:** [docs/integrations/vectorstores/mongodb_atlas](https://python.langchain.com/docs/integrations/vectorstores/mongodb_atlas)
**Abstract:** We report the development of GPT-4, a large-scale, multimodal model which can
accept image and text inputs and produce text outputs. While less capable than
@ -167,8 +233,9 @@ more than 1/1,000th the compute of GPT-4.
- **Authors:** John Kirchenbauer, Jonas Geiping, Yuxin Wen, et al.
- **Published Date:** 2023-01-24
- **URL:** http://arxiv.org/abs/2301.10226v4
- **LangChain:**
- **LangChain API Reference:** [langchain_community.llms.huggingface_text_gen_inference.HuggingFaceTextGenInference](https://api.python.langchain.com/en/latest/llms/langchain_community.llms.huggingface_text_gen_inference.HuggingFaceTextGenInference.html#langchain_community.llms.huggingface_text_gen_inference.HuggingFaceTextGenInference), [langchain_community.llms.huggingface_endpoint.HuggingFaceEndpoint](https://api.python.langchain.com/en/latest/llms/langchain_community.llms.huggingface_endpoint.HuggingFaceEndpoint.html#langchain_community.llms.huggingface_endpoint.HuggingFaceEndpoint), [langchain_community.llms.oci_data_science_model_deployment_endpoint.OCIModelDeploymentTGI](https://api.python.langchain.com/en/latest/llms/langchain_community.llms.oci_data_science_model_deployment_endpoint.OCIModelDeploymentTGI.html#langchain_community.llms.oci_data_science_model_deployment_endpoint.OCIModelDeploymentTGI)
- **API Reference:** [langchain_community.llms...OCIModelDeploymentTGI](https://api.python.langchain.com/en/latest/llms/langchain_community.llms.oci_data_science_model_deployment_endpoint.OCIModelDeploymentTGI.html#langchain_community.llms.oci_data_science_model_deployment_endpoint.OCIModelDeploymentTGI), [langchain_community.llms...HuggingFaceTextGenInference](https://api.python.langchain.com/en/latest/llms/langchain_community.llms.huggingface_text_gen_inference.HuggingFaceTextGenInference.html#langchain_community.llms.huggingface_text_gen_inference.HuggingFaceTextGenInference), [langchain_community.llms...HuggingFaceEndpoint](https://api.python.langchain.com/en/latest/llms/langchain_community.llms.huggingface_endpoint.HuggingFaceEndpoint.html#langchain_community.llms.huggingface_endpoint.HuggingFaceEndpoint)
**Abstract:** Potential harms of large language models can be mitigated by watermarking
model output, i.e., embedding signals into generated text that are invisible to
@ -191,8 +258,10 @@ family, and discuss robustness and security.
- **Authors:** Luyu Gao, Xueguang Ma, Jimmy Lin, et al.
- **Published Date:** 2022-12-20
- **URL:** http://arxiv.org/abs/2212.10496v1
- **LangChain Documentation:** [docs/use_cases/query_analysis/techniques/hyde](https://python.langchain.com/docs/use_cases/query_analysis/techniques/hyde)
- **LangChain API Reference:** [langchain.chains.hyde.base.HypotheticalDocumentEmbedder](https://api.python.langchain.com/en/latest/chains/langchain.chains.hyde.base.HypotheticalDocumentEmbedder.html#langchain.chains.hyde.base.HypotheticalDocumentEmbedder)
- **LangChain:**
- **API Reference:** [langchain.chains...HypotheticalDocumentEmbedder](https://api.python.langchain.com/en/latest/chains/langchain.chains.hyde.base.HypotheticalDocumentEmbedder.html#langchain.chains.hyde.base.HypotheticalDocumentEmbedder)
- **Template:** [hyde](https://python.langchain.com/docs/templates/hyde)
**Abstract:** While dense retrieval has been shown effective and efficient across tasks and
languages, it remains difficult to create effective fully zero-shot dense
@ -212,35 +281,6 @@ state-of-the-art unsupervised dense retriever Contriever and shows strong
performance comparable to fine-tuned retrievers, across various tasks (e.g. web
search, QA, fact verification) and languages (e.g. sw, ko, ja).
## Constitutional AI: Harmlessness from AI Feedback
- **arXiv id:** 2212.08073v1
- **Title:** Constitutional AI: Harmlessness from AI Feedback
- **Authors:** Yuntao Bai, Saurav Kadavath, Sandipan Kundu, et al.
- **Published Date:** 2022-12-15
- **URL:** http://arxiv.org/abs/2212.08073v1
- **LangChain Documentation:** [docs/guides/productionization/evaluation/string/criteria_eval_chain](https://python.langchain.com/docs/guides/productionization/evaluation/string/criteria_eval_chain)
**Abstract:** As AI systems become more capable, we would like to enlist their help to
supervise other AIs. We experiment with methods for training a harmless AI
assistant through self-improvement, without any human labels identifying
harmful outputs. The only human oversight is provided through a list of rules
or principles, and so we refer to the method as 'Constitutional AI'. The
process involves both a supervised learning and a reinforcement learning phase.
In the supervised phase we sample from an initial model, then generate
self-critiques and revisions, and then finetune the original model on revised
responses. In the RL phase, we sample from the finetuned model, use a model to
evaluate which of the two samples is better, and then train a preference model
from this dataset of AI preferences. We then train with RL using the preference
model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a
result we are able to train a harmless but non-evasive AI assistant that
engages with harmful queries by explaining its objections to them. Both the SL
and RL methods can leverage chain-of-thought style reasoning to improve the
human-judged performance and transparency of AI decision making. These methods
make it possible to control AI behavior more precisely and with far fewer human
labels.
## Robust and Explainable Identification of Logical Fallacies in Natural Language Arguments
- **arXiv id:** 2212.07425v3
@ -248,8 +288,9 @@ labels.
- **Authors:** Zhivar Sourati, Vishnu Priya Prasanna Venkatesh, Darshan Deshpande, et al.
- **Published Date:** 2022-12-12
- **URL:** http://arxiv.org/abs/2212.07425v3
- **LangChain:**
- **LangChain API Reference:** [langchain_experimental.fallacy_removal](https://api.python.langchain.com/en/latest/experimental_api_reference.html#module-langchain_experimental.fallacy_removal)
- **API Reference:** [langchain_experimental.fallacy_removal](https://api.python.langchain.com/en/latest/experimental_api_reference.html#module-langchain_experimental.fallacy_removal)
**Abstract:** The spread of misinformation, propaganda, and flawed argumentation has been
amplified in the Internet era. Given the volume of data and the subtlety of
@ -280,8 +321,9 @@ further work on logical fallacy identification.
- **Authors:** Xi Ye, Srinivasan Iyer, Asli Celikyilmaz, et al.
- **Published Date:** 2022-11-25
- **URL:** http://arxiv.org/abs/2211.13892v2
- **LangChain:**
- **LangChain API Reference:** [langchain_core.example_selectors.semantic_similarity.MaxMarginalRelevanceExampleSelector](https://api.python.langchain.com/en/latest/example_selectors/langchain_core.example_selectors.semantic_similarity.MaxMarginalRelevanceExampleSelector.html#langchain_core.example_selectors.semantic_similarity.MaxMarginalRelevanceExampleSelector)
- **API Reference:** [langchain_core.example_selectors...MaxMarginalRelevanceExampleSelector](https://api.python.langchain.com/en/latest/example_selectors/langchain_core.example_selectors.semantic_similarity.MaxMarginalRelevanceExampleSelector.html#langchain_core.example_selectors.semantic_similarity.MaxMarginalRelevanceExampleSelector)
**Abstract:** Large language models (LLMs) have exhibited remarkable capabilities in
learning from explanations in prompts, but there has been limited understanding
@ -307,8 +349,9 @@ performance across three real-world tasks on multiple LLMs.
- **Authors:** Luyu Gao, Aman Madaan, Shuyan Zhou, et al.
- **Published Date:** 2022-11-18
- **URL:** http://arxiv.org/abs/2211.10435v2
- **LangChain:**
- **LangChain API Reference:** [langchain_experimental.pal_chain.base.PALChain](https://api.python.langchain.com/en/latest/pal_chain/langchain_experimental.pal_chain.base.PALChain.html#langchain_experimental.pal_chain.base.PALChain), [langchain_experimental.pal_chain](https://api.python.langchain.com/en/latest/experimental_api_reference.html#module-langchain_experimental.pal_chain)
- **API Reference:** [langchain_experimental.pal_chain...PALChain](https://api.python.langchain.com/en/latest/pal_chain/langchain_experimental.pal_chain.base.PALChain.html#langchain_experimental.pal_chain.base.PALChain), [langchain_experimental.pal_chain](https://api.python.langchain.com/en/latest/experimental_api_reference.html#module-langchain_experimental.pal_chain)
**Abstract:** Large language models (LLMs) have recently demonstrated an impressive ability
to perform arithmetic and symbolic reasoning tasks, when provided with a few
@ -340,8 +383,9 @@ publicly available at http://reasonwithpal.com/ .
- **Authors:** Sasun Hambardzumyan, Abhinav Tuli, Levon Ghukasyan, et al.
- **Published Date:** 2022-09-22
- **URL:** http://arxiv.org/abs/2209.10785v2
- **LangChain Documentation:** [docs/integrations/providers/activeloop_deeplake](https://python.langchain.com/docs/integrations/providers/activeloop_deeplake)
- **LangChain:**
- **Documentation:** [docs/integrations/providers/activeloop_deeplake](https://python.langchain.com/docs/integrations/providers/activeloop_deeplake)
**Abstract:** Traditional data lakes provide critical data infrastructure for analytical
workloads by enabling time travel, running SQL queries, ingesting data with
@ -367,8 +411,9 @@ TensorFlow, JAX, and integrate with numerous MLOps tools.
- **Authors:** Kevin Heffernan, Onur Çelebi, Holger Schwenk
- **Published Date:** 2022-05-25
- **URL:** http://arxiv.org/abs/2205.12654v1
- **LangChain:**
- **LangChain API Reference:** [langchain_community.embeddings.laser.LaserEmbeddings](https://api.python.langchain.com/en/latest/embeddings/langchain_community.embeddings.laser.LaserEmbeddings.html#langchain_community.embeddings.laser.LaserEmbeddings)
- **API Reference:** [langchain_community.embeddings...LaserEmbeddings](https://api.python.langchain.com/en/latest/embeddings/langchain_community.embeddings.laser.LaserEmbeddings.html#langchain_community.embeddings.laser.LaserEmbeddings)
**Abstract:** Scaling multilingual representation learning beyond the hundred most frequent
languages is challenging, in particular to cover the long tail of low-resource
@ -395,8 +440,9 @@ encoders, mine bitexts, and validate the bitexts by training NMT systems.
- **Authors:** Nitarshan Rajkumar, Raymond Li, Dzmitry Bahdanau
- **Published Date:** 2022-03-15
- **URL:** http://arxiv.org/abs/2204.00498v1
- **LangChain Documentation:** [docs/use_cases/sql/quickstart](https://python.langchain.com/docs/use_cases/sql/quickstart)
- **LangChain API Reference:** [langchain_community.utilities.sql_database.SQLDatabase](https://api.python.langchain.com/en/latest/utilities/langchain_community.utilities.sql_database.SQLDatabase.html#langchain_community.utilities.sql_database.SQLDatabase), [langchain_community.utilities.spark_sql.SparkSQL](https://api.python.langchain.com/en/latest/utilities/langchain_community.utilities.spark_sql.SparkSQL.html#langchain_community.utilities.spark_sql.SparkSQL)
- **LangChain:**
- **API Reference:** [langchain_community.utilities...SQLDatabase](https://api.python.langchain.com/en/latest/utilities/langchain_community.utilities.sql_database.SQLDatabase.html#langchain_community.utilities.sql_database.SQLDatabase), [langchain_community.utilities...SparkSQL](https://api.python.langchain.com/en/latest/utilities/langchain_community.utilities.spark_sql.SparkSQL.html#langchain_community.utilities.spark_sql.SparkSQL)
**Abstract:** We perform an empirical evaluation of Text-to-SQL capabilities of the Codex
language model. We find that, without any finetuning, Codex is a strong
@ -413,8 +459,9 @@ few-shot examples.
- **Authors:** Clara Meister, Tiago Pimentel, Gian Wiher, et al.
- **Published Date:** 2022-02-01
- **URL:** http://arxiv.org/abs/2202.00666v5
- **LangChain:**
- **LangChain API Reference:** [langchain_community.llms.huggingface_text_gen_inference.HuggingFaceTextGenInference](https://api.python.langchain.com/en/latest/llms/langchain_community.llms.huggingface_text_gen_inference.HuggingFaceTextGenInference.html#langchain_community.llms.huggingface_text_gen_inference.HuggingFaceTextGenInference), [langchain_community.llms.huggingface_endpoint.HuggingFaceEndpoint](https://api.python.langchain.com/en/latest/llms/langchain_community.llms.huggingface_endpoint.HuggingFaceEndpoint.html#langchain_community.llms.huggingface_endpoint.HuggingFaceEndpoint)
- **API Reference:** [langchain_community.llms...HuggingFaceTextGenInference](https://api.python.langchain.com/en/latest/llms/langchain_community.llms.huggingface_text_gen_inference.HuggingFaceTextGenInference.html#langchain_community.llms.huggingface_text_gen_inference.HuggingFaceTextGenInference), [langchain_community.llms...HuggingFaceEndpoint](https://api.python.langchain.com/en/latest/llms/langchain_community.llms.huggingface_endpoint.HuggingFaceEndpoint.html#langchain_community.llms.huggingface_endpoint.HuggingFaceEndpoint)
**Abstract:** Today's probabilistic language generators fall short when it comes to
producing coherent and fluent text despite the fact that the underlying models
@ -444,8 +491,9 @@ reducing degenerate repetitions.
- **Authors:** Alec Radford, Jong Wook Kim, Chris Hallacy, et al.
- **Published Date:** 2021-02-26
- **URL:** http://arxiv.org/abs/2103.00020v1
- **LangChain:**
- **LangChain API Reference:** [langchain_experimental.open_clip](https://api.python.langchain.com/en/latest/experimental_api_reference.html#module-langchain_experimental.open_clip)
- **API Reference:** [langchain_experimental.open_clip](https://api.python.langchain.com/en/latest/experimental_api_reference.html#module-langchain_experimental.open_clip)
**Abstract:** State-of-the-art computer vision systems are trained to predict a fixed set
of predetermined object categories. This restricted form of supervision limits
@ -475,8 +523,9 @@ https://github.com/OpenAI/CLIP.
- **Authors:** Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, et al.
- **Published Date:** 2019-09-11
- **URL:** http://arxiv.org/abs/1909.05858v2
- **LangChain:**
- **LangChain API Reference:** [langchain_community.llms.huggingface_text_gen_inference.HuggingFaceTextGenInference](https://api.python.langchain.com/en/latest/llms/langchain_community.llms.huggingface_text_gen_inference.HuggingFaceTextGenInference.html#langchain_community.llms.huggingface_text_gen_inference.HuggingFaceTextGenInference), [langchain_community.llms.huggingface_endpoint.HuggingFaceEndpoint](https://api.python.langchain.com/en/latest/llms/langchain_community.llms.huggingface_endpoint.HuggingFaceEndpoint.html#langchain_community.llms.huggingface_endpoint.HuggingFaceEndpoint)
- **API Reference:** [langchain_community.llms...HuggingFaceTextGenInference](https://api.python.langchain.com/en/latest/llms/langchain_community.llms.huggingface_text_gen_inference.HuggingFaceTextGenInference.html#langchain_community.llms.huggingface_text_gen_inference.HuggingFaceTextGenInference), [langchain_community.llms...HuggingFaceEndpoint](https://api.python.langchain.com/en/latest/llms/langchain_community.llms.huggingface_endpoint.HuggingFaceEndpoint.html#langchain_community.llms.huggingface_endpoint.HuggingFaceEndpoint)
**Abstract:** Large-scale language models show promising text generation capabilities, but
users cannot easily control particular aspects of the generated text. We
@ -497,8 +546,9 @@ full-sized, pretrained versions of CTRL at https://github.com/salesforce/ctrl.
- **Authors:** Nils Reimers, Iryna Gurevych
- **Published Date:** 2019-08-27
- **URL:** http://arxiv.org/abs/1908.10084v1
- **LangChain Documentation:** [docs/integrations/text_embedding/sentence_transformers](https://python.langchain.com/docs/integrations/text_embedding/sentence_transformers)
- **LangChain:**
- **Documentation:** [docs/integrations/text_embedding/sentence_transformers](https://python.langchain.com/docs/integrations/text_embedding/sentence_transformers)
**Abstract:** BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) has set a new
state-of-the-art performance on sentence-pair regression tasks like semantic

@ -11,15 +11,14 @@ from typing import Any, Dict, List, Set
from pydantic.v1 import BaseModel, root_validator
# TODO parse docstrings for arXiv references
# TODO Generate a page with a table of the references with correspondent modules/classes/functions.
logger = logging.getLogger(__name__)
_ROOT_DIR = Path(os.path.abspath(__file__)).parents[2]
DOCS_DIR = _ROOT_DIR / "docs" / "docs"
CODE_DIR = _ROOT_DIR / "libs"
TEMPLATES_DIR = _ROOT_DIR / "templates"
ARXIV_ID_PATTERN = r"https://arxiv\.org/(abs|pdf)/(\d+\.\d+)"
LANGCHAIN_PYTHON_URL = "python.langchain.com"
@dataclass
@ -27,8 +26,9 @@ class ArxivPaper:
"""ArXiv paper information."""
arxiv_id: str
referencing_docs: list[str] # TODO: Add the referencing docs
referencing_api_refs: list[str] # TODO: Add the referencing docs
referencing_doc2url: dict[str, str]
referencing_api_ref2url: dict[str, str]
referencing_template2url: dict[str, str]
title: str
authors: list[str]
abstract: str
@ -218,6 +218,35 @@ def search_code_for_arxiv_references(code_dir: Path) -> dict[str, set[str]]:
return arxiv_id2module_name_and_members_reduced
def search_templates_for_arxiv_references(templates_dir: Path) -> dict[str, set[str]]:
arxiv_url_pattern = re.compile(ARXIV_ID_PATTERN)
# exclude_strings = {"file_path", "metadata", "link", "loader", "PyPDFLoader"}
# loop all the Readme.md files since they are parsed into LangChain documentation
# exclude the Readme.md in the root folder
files = (
p.resolve()
for p in Path(templates_dir).glob("**/*")
if p.name.lower() in {"readme.md"} and p.parent.name != "templates"
)
arxiv_id2template_names: dict[str, set[str]] = {}
for file in files:
with open(file, "r", encoding="utf-8") as f:
lines = f.readlines()
for line in lines:
# if any(exclude_string in line for exclude_string in exclude_strings):
# continue
matches = arxiv_url_pattern.search(line)
if matches:
arxiv_id = matches.group(2)
template_name = file.parent.name
if arxiv_id not in arxiv_id2template_names:
arxiv_id2template_names[arxiv_id] = {template_name}
else:
arxiv_id2template_names[arxiv_id].add(template_name)
return arxiv_id2template_names
def _get_doc_path(file_parts: tuple[str, ...], file_extension) -> str:
"""Get the relative path to the documentation page
from the absolute path of the file.
@ -257,58 +286,70 @@ def _get_module_name(file_parts: tuple[str, ...]) -> str:
def compound_urls(
arxiv_id2file_names: dict[str, set[str]], arxiv_id2code_urls: dict[str, set[str]]
arxiv_id2file_names: dict[str, set[str]],
arxiv_id2code_urls: dict[str, set[str]],
arxiv_id2templates: dict[str, set[str]],
) -> dict[str, dict[str, set[str]]]:
arxiv_id2urls = dict()
for arxiv_id, code_urls in arxiv_id2code_urls.items():
arxiv_id2urls[arxiv_id] = {"api": code_urls}
# intersection of the two sets
if arxiv_id in arxiv_id2file_names:
arxiv_id2urls[arxiv_id]["docs"] = arxiv_id2file_names[arxiv_id]
# format urls and verify that the urls are correct
arxiv_id2file_names_new = {}
for arxiv_id, file_names in arxiv_id2file_names.items():
if arxiv_id not in arxiv_id2code_urls:
arxiv_id2urls[arxiv_id] = {"docs": file_names}
key2urls = {
key: _format_doc_url(key)
for key in file_names
if _is_url_ok(_format_doc_url(key))
}
if key2urls:
arxiv_id2file_names_new[arxiv_id] = key2urls
arxiv_id2code_urls_new = {}
for arxiv_id, code_urls in arxiv_id2code_urls.items():
key2urls = {
key: _format_api_ref_url(key)
for key in code_urls
if _is_url_ok(_format_api_ref_url(key))
}
if key2urls:
arxiv_id2code_urls_new[arxiv_id] = key2urls
arxiv_id2templates_new = {}
for arxiv_id, templates in arxiv_id2templates.items():
key2urls = {
key: _format_template_url(key)
for key in templates
if _is_url_ok(_format_template_url(key))
}
if key2urls:
arxiv_id2templates_new[arxiv_id] = key2urls
arxiv_id2type2key2urls = dict.fromkeys(
arxiv_id2file_names_new | arxiv_id2code_urls_new | arxiv_id2templates_new
)
arxiv_id2type2key2urls = {k: {} for k in arxiv_id2type2key2urls}
for arxiv_id, key2urls in arxiv_id2file_names_new.items():
arxiv_id2type2key2urls[arxiv_id]["docs"] = key2urls
for arxiv_id, key2urls in arxiv_id2code_urls_new.items():
arxiv_id2type2key2urls[arxiv_id]["apis"] = key2urls
for arxiv_id, key2urls in arxiv_id2templates_new.items():
arxiv_id2type2key2urls[arxiv_id]["templates"] = key2urls
# reverse sort by the arxiv_id (the newest papers first)
ret = dict(sorted(arxiv_id2urls.items(), key=lambda item: item[0], reverse=True))
ret = dict(
sorted(arxiv_id2type2key2urls.items(), key=lambda item: item[0], reverse=True)
)
return ret
def _format_doc_link(doc_paths: list[str]) -> list[str]:
return [
f"[{doc_path}](https://python.langchain.com/{doc_path})"
for doc_path in doc_paths
]
def _is_url_ok(url: str) -> bool:
"""Check if the url page is open without error."""
import requests
def _format_api_ref_link(
doc_paths: list[str], compact: bool = False
) -> list[str]: # TODO
# agents/langchain_core.agents.AgentAction.html#langchain_core.agents.AgentAction
ret = []
for doc_path in doc_paths:
module = doc_path.split("#")[1].replace("module-", "")
if compact and module.count(".") > 2:
# langchain_community.llms.oci_data_science_model_deployment_endpoint.OCIModelDeploymentTGI
# -> langchain_community.llms...OCIModelDeploymentTGI
module_parts = module.split(".")
module = f"{module_parts[0]}.{module_parts[1]}...{module_parts[-1]}"
ret.append(
f"[{module}](https://api.python.langchain.com/en/latest/{doc_path.split('langchain.com/')[-1]})"
)
return ret
def log_results(arxiv_id2urls):
arxiv_ids = arxiv_id2urls.keys()
doc_number, api_number = 0, 0
for urls in arxiv_id2urls.values():
if "docs" in urls:
doc_number += len(urls["docs"])
if "api" in urls:
api_number += len(urls["api"])
logger.info(
f"Found {len(arxiv_ids)} arXiv references in the {doc_number} docs and in {api_number} API Refs."
)
try:
response = requests.get(url)
response.raise_for_status()
except requests.exceptions.RequestException as ex:
logger.warning(f"Could not open the {url}.")
return False
return True
class ArxivAPIWrapper(BaseModel):
@ -335,7 +376,7 @@ class ArxivAPIWrapper(BaseModel):
return values
def get_papers(
self, arxiv_id2urls: dict[str, dict[str, set[str]]]
self, arxiv_id2type2key2urls: dict[str, dict[str, dict[str, str]]]
) -> list[ArxivPaper]:
"""
Performs an arxiv search and returns information about the papers found.
@ -343,8 +384,8 @@ class ArxivAPIWrapper(BaseModel):
If an error occurs or no documents found, error text
is returned instead.
Args:
arxiv_id2urls: Dictionary with arxiv_id as key and dictionary
with sets of doc file names and API Ref urls.
arxiv_id2type2key2urls: Dictionary with arxiv_id as key and dictionary
with dicts of doc file names/API objects/templates to urls.
Returns:
List of ArxivPaper objects.
@ -356,10 +397,10 @@ class ArxivAPIWrapper(BaseModel):
else:
return [str(a) for a in authors]
if not arxiv_id2urls:
if not arxiv_id2type2key2urls:
return []
try:
arxiv_ids = list(arxiv_id2urls.keys())
arxiv_ids = list(arxiv_id2type2key2urls.keys())
results = self.arxiv_search(
id_list=arxiv_ids,
max_results=len(arxiv_ids),
@ -374,38 +415,99 @@ class ArxivAPIWrapper(BaseModel):
abstract=result.summary,
url=result.entry_id,
published_date=str(result.published.date()),
referencing_docs=urls["docs"] if "docs" in urls else [],
referencing_api_refs=urls["api"] if "api" in urls else [],
referencing_doc2url=type2key2urls["docs"]
if "docs" in type2key2urls
else {},
referencing_api_ref2url=type2key2urls["apis"]
if "apis" in type2key2urls
else {},
referencing_template2url=type2key2urls["templates"]
if "templates" in type2key2urls
else {},
)
for result, urls in zip(results, arxiv_id2urls.values())
for result, type2key2urls in zip(results, arxiv_id2type2key2urls.values())
]
return papers
def generate_arxiv_references_page(file_name: str, papers: list[ArxivPaper]) -> None:
def _format_doc_url(doc_path: str) -> str:
return f"https://{LANGCHAIN_PYTHON_URL}/{doc_path}"
def _format_api_ref_url(doc_path: str, compact: bool = False) -> str:
# agents/langchain_core.agents.AgentAction.html#langchain_core.agents.AgentAction
return f"https://api.{LANGCHAIN_PYTHON_URL}/en/latest/{doc_path.split('langchain.com/')[-1]}"
def _format_template_url(template_name: str) -> str:
return f"https://{LANGCHAIN_PYTHON_URL}/docs/templates/{template_name}"
def _compact_module_full_name(doc_path: str) -> str:
# agents/langchain_core.agents.AgentAction.html#langchain_core.agents.AgentAction
module = doc_path.split("#")[1].replace("module-", "")
if module.count(".") > 2:
# langchain_community.llms.oci_data_science_model_deployment_endpoint.OCIModelDeploymentTGI
# -> langchain_community.llms...OCIModelDeploymentTGI
module_parts = module.split(".")
module = f"{module_parts[0]}.{module_parts[1]}...{module_parts[-1]}"
return module
def log_results(arxiv_id2type2key2urls):
arxiv_ids = arxiv_id2type2key2urls.keys()
doc_number, api_number, templates_number = 0, 0, 0
for type2key2url in arxiv_id2type2key2urls.values():
if "docs" in type2key2url:
doc_number += len(type2key2url["docs"])
if "apis" in type2key2url:
api_number += len(type2key2url["apis"])
if "templates" in type2key2url:
templates_number += len(type2key2url["templates"])
logger.warning(
f"Found {len(arxiv_ids)} arXiv references in the {doc_number} docs, {api_number} API Refs,"
f" and {templates_number} Templates."
)
def generate_arxiv_references_page(file_name: Path, papers: list[ArxivPaper]) -> None:
with open(file_name, "w") as f:
# Write the table headers
f.write("""# arXiv
LangChain implements the latest research in the field of Natural Language Processing.
This page contains `arXiv` papers referenced in the LangChain Documentation and API Reference.
This page contains `arXiv` papers referenced in the LangChain Documentation, API Reference,
and Templates.
## Summary
| arXiv id / Title | Authors | Published date 🔻 | LangChain Documentation and API Reference |
|------------------|---------|-------------------|-------------------------|
| arXiv id / Title | Authors | Published date 🔻 | LangChain Documentation|
|------------------|---------|-------------------|------------------------|
""")
for paper in papers:
refs = []
if paper.referencing_docs:
if paper.referencing_doc2url:
refs += [
"`Docs:` " + ", ".join(_format_doc_link(paper.referencing_docs))
"`Docs:` "
+ ", ".join(
f"[{key}]({url})"
for key, url in paper.referencing_doc2url.items()
)
]
if paper.referencing_api_refs:
if paper.referencing_api_ref2url:
refs += [
"`API:` "
+ ", ".join(
_format_api_ref_link(paper.referencing_api_refs, compact=True)
f"[{_compact_module_full_name(key)}]({url})"
for key, url in paper.referencing_api_ref2url.items()
)
]
if paper.referencing_template2url:
refs += [
"`Template:` "
+ ", ".join(
f"[{key}]({url})"
for key, url in paper.referencing_template2url.items()
)
]
refs_str = ", ".join(refs)
@ -417,15 +519,23 @@ This page contains `arXiv` papers referenced in the LangChain Documentation and
for paper in papers:
docs_refs = (
f"- **LangChain Documentation:** {', '.join(_format_doc_link(paper.referencing_docs))}"
if paper.referencing_docs
f" - **Documentation:** {', '.join(f'[{key}]({url})' for key, url in paper.referencing_doc2url.items())}"
if paper.referencing_doc2url
else ""
)
api_ref_refs = (
f"- **LangChain API Reference:** {', '.join(_format_api_ref_link(paper.referencing_api_refs))}"
if paper.referencing_api_refs
f" - **API Reference:** {', '.join(f'[{_compact_module_full_name(key)}]({url})' for key, url in paper.referencing_api_ref2url.items())}"
if paper.referencing_api_ref2url
else ""
)
template_refs = (
f" - **Template:** {', '.join(f'[{key}]({url})' for key, url in paper.referencing_template2url.items())}"
if paper.referencing_template2url
else ""
)
refs = "\n".join(
[el for el in [docs_refs, api_ref_refs, template_refs] if el]
)
f.write(f"""
## {paper.title}
@ -434,13 +544,14 @@ This page contains `arXiv` papers referenced in the LangChain Documentation and
- **Authors:** {', '.join(paper.authors)}
- **Published Date:** {paper.published_date}
- **URL:** {paper.url}
{docs_refs}
{api_ref_refs}
- **LangChain:**
{refs}
**Abstract:** {paper.abstract}
""")
logger.info(f"Created the {file_name} file with {len(papers)} arXiv references.")
logger.warning(f"Created the {file_name} file with {len(papers)} arXiv references.")
def main():
@ -450,14 +561,17 @@ def main():
arxiv_id2module_name_and_members
)
arxiv_id2file_names = search_documentation_for_arxiv_references(DOCS_DIR)
arxiv_id2urls = compound_urls(arxiv_id2file_names, arxiv_id2code_urls)
log_results(arxiv_id2urls)
arxiv_id2templates = search_templates_for_arxiv_references(TEMPLATES_DIR)
arxiv_id2type2key2urls = compound_urls(
arxiv_id2file_names, arxiv_id2code_urls, arxiv_id2templates
)
log_results(arxiv_id2type2key2urls)
# get the arXiv paper information
papers = ArxivAPIWrapper().get_papers(arxiv_id2urls)
papers = ArxivAPIWrapper().get_papers(arxiv_id2type2key2urls)
# generate the arXiv references page
output_file = str(DOCS_DIR / "additional_resources" / "arxiv_references.mdx")
output_file = DOCS_DIR / "additional_resources" / "arxiv_references.mdx"
generate_arxiv_references_page(output_file, papers)
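
The matching behaviour of `ARXIV_ID_PATTERN`, as used by `search_templates_for_arxiv_references`, can be illustrated with a minimal in-memory sketch. It is not the script's implementation: the helper `collect_arxiv_ids`, the README snippets, and the template name `other-template` are hypothetical and exist only for this example. The sketch assumes the pattern and the first-match-per-line logic shown in the diff.

```python
import re

# Same pattern as ARXIV_ID_PATTERN in the diff above.
ARXIV_ID_PATTERN = r"https://arxiv\.org/(abs|pdf)/(\d+\.\d+)"


def collect_arxiv_ids(readmes: dict[str, str]) -> dict[str, set[str]]:
    """Map each arXiv id to the template names whose README text cites it.

    `readmes` is a hypothetical stand-in for the README.md files on disk:
    it maps a template name to that template's README content.
    """
    pattern = re.compile(ARXIV_ID_PATTERN)
    arxiv_id2template_names: dict[str, set[str]] = {}
    for template_name, text in readmes.items():
        for line in text.splitlines():
            # Only the first match on each line is used, as in the diff.
            match = pattern.search(line)
            if match:
                arxiv_id = match.group(2)
                arxiv_id2template_names.setdefault(arxiv_id, set()).add(template_name)
    return arxiv_id2template_names


# Made-up README snippets for illustration only.
sample_readmes = {
    "propositional-retrieval": "See https://arxiv.org/abs/2312.06648 for details.",
    "stepback-qa-prompting": "Paper: https://arxiv.org/pdf/2310.06117.pdf",
    "other-template": "Background: http://arxiv.org/abs/2312.06648v2",
}

result = collect_arxiv_ids(sample_readmes)
print(result)
```

Two properties of the pattern show up here. The version suffix (`v2`) and the `.pdf` extension are not part of the captured id, and a link written with `http://` rather than `https://` is not matched at all, so `other-template` is absent from the result.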
