#docs: text splitters improvements
Changes are only in the Jupyter notebooks.
- added links to the source packages and a short description of these
packages
- removed " Text Splitters" suffixes from the TOC elements (they made
the list of the text splitters messy)
- moved text splitters, based on the length function into a separate
list. They can be mixed with any classes from the "Text Splitters", so
it is a different classification.
## Who can review?
@hwchase17 - project lead
@eyurtsev
@vowelparrot
NOTE: please, check out the results of the `Python code` text splitter
example (text_splitters/examples/python.ipynb). It looks suboptimal.
# Added another helpful way for developers who want to set OpenAI API
Key dynamically
Previous methods like exporting environment variables are good for
project-wide settings.
But many use cases need to assign API keys dynamically, recently.
```python
from langchain.llms import OpenAI
llm = OpenAI(openai_api_key="OPENAI_API_KEY")
```
## Before submitting
```bash
export OPENAI_API_KEY="..."
```
Or,
```python
import os
os.environ["OPENAI_API_KEY"] = "..."
```
<hr>
Thank you.
Cheers,
Bongsang
# Documentation for Azure OpenAI embeddings model
- OPENAI_API_VERSION environment variable is needed for the endpoint
- The constructor does not work with model, it works with deployment.
I fixed it in the notebook.
(This is my first contribution)
## Who can review?
@hwchase17
@agola
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
# Add bs4 html parser
* Some minor refactors
* Extract the bs4 html parsing code from the bs html loader
* Move some tests from integration tests to unit tests
# Add generic document loader
* This PR adds a generic document loader which can assemble a loader
from a blob loader and a parser
* Adds a registry for parsers
* Populate registry with a default mimetype based parser
## Expected changes
- Parsing involves loading content via IO so can be sped up via:
* Threading in sync
* Async
- The actual parsing logic may be computatinoally involved: may need to
figure out to add multi-processing support
- May want to add suffix based parser since suffixes are easier to
specify in comparison to mime types
## Before submitting
No notebooks yet, we first need to get a few of the basic parsers up
(prior to advertising the interface)
It's currently not possible to change the `TEMPLATE_TOOL_RESPONSE`
prompt for ConversationalChatAgent, this PR changes that.
---------
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
# Update deployments doc with langcorn API server
API server example
```python
from fastapi import FastAPI
from langcorn import create_service
app: FastAPI = create_service(
"examples.ex1:chain",
"examples.ex2:chain",
"examples.ex3:chain",
"examples.ex4:sequential_chain",
"examples.ex5:conversation",
"examples.ex6:conversation_with_summary",
)
```
More examples: https://github.com/msoedov/langcorn/tree/main/examples
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
# Docs and code review fixes for Docugami DataLoader
1. I noticed a couple of hyperlinks that are not loading in the
langchain docs (I guess need explicit anchor tags). Added those.
2. In code review @eyurtsev had a
[suggestion](https://github.com/hwchase17/langchain/pull/4727#discussion_r1194069347)
to allow string paths. Turns out just updating the type works (I tested
locally with string paths).
# Pre-submission checks
I ran `make lint` and `make tests` successfully.
---------
Co-authored-by: Taqi Jaffri <tjaffri@docugami.com>
# Fix Homepage Typo
## Who can review?
Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested... not sure
# Docs: improvements in the `retrievers/examples/` notebooks
Its primary purpose is to make the Jupyter notebook examples
**consistent** and more suitable for first-time viewers.
- add links to the integration source (if applicable) with a short
description of this source;
- removed `_retriever` suffix from the file names (where it existed) for
consistency;
- removed ` retriever` from the notebook title (where it existed) for
consistency;
- added code to install necessary Python package(s);
- added code to set up the necessary API Key.
- very small fixes in notebooks from other folders (for consistency):
- docs/modules/indexes/vectorstores/examples/elasticsearch.ipynb
- docs/modules/indexes/vectorstores/examples/pinecone.ipynb
- docs/modules/models/llms/integrations/cohere.ipynb
- fixed misspelling in langchain/retrievers/time_weighted_retriever.py
comment (sorry, about this change in a .py file )
## Who can review
@dev2049
# Remove unused variables in Milvus vectorstore
This PR simply removes a variable unused in Milvus. The variable looks
like a copy-paste from other functions in Milvus but it is really
unnecessary.
# Fix TypeError in Vectorstore Redis class methods
This change resolves a TypeError that was raised when invoking the
`from_texts_return_keys` method from the `from_texts` method in the
`Redis` class. The error was due to the `cls` argument being passed
explicitly, which led to it being provided twice since it's also
implicitly passed in class methods. No relevant tests were added as the
issue appeared to be better suited for linters to catch proactively.
Changes:
- Removed `cls=cls` from the call to `from_texts_return_keys` in the
`from_texts` method.
Related to:
https://github.com/hwchase17/langchain/pull/4653
# Remove unnecessary comment
Remove unnecessary comment accidentally included in #4800
## Before submitting
- no test
- no document
## Who can review?
Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:
# Fixed typos (issues #4818 & #4668 & more typos)
- At some places, it said `model = ChatOpenAI(model='gpt-3.5-turbo')`
but should be `model = ChatOpenAI(model_name='gpt-3.5-turbo')`
- Fixes some other typos
Fixes#4818, #4668
## Who can review?
Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:
Models
- @hwchase17
- @agola11
Agents / Tools / Toolkits
- @vowelparrot
Previously, the client expected a strict 'prompt' or 'messages' format
and wouldn't permit running a chat model or llm on prompts or messages
(respectively).
Since many datasets may want to specify custom key: string , relax this
requirement.
Also, add support for running a chat model on raw prompts and LLM on
chat messages through their respective fallbacks.
# Your PR Title (What it does)
<!--
Thank you for contributing to LangChain! Your PR will appear in our next
release under the title you set. Please make sure it highlights your
valuable contribution.
Replace this with a description of the change, the issue it fixes (if
applicable), and relevant context. List any dependencies required for
this change.
After you're done, someone will review your PR. They may suggest
improvements. If no one reviews your PR within a few days, feel free to
@-mention the same people again, as notifications can get lost.
-->
<!-- Remove if not applicable -->
Fixes # (issue)
## Before submitting
<!-- If you're adding a new integration, include an integration test and
an example notebook showing its use! -->
## Who can review?
Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:
<!-- For a quicker response, figure out the right person to tag with @
@hwchase17 - project lead
Tracing / Callbacks
- @agola11
Async
- @agola11
DataLoaders
- @eyurtsev
Models
- @hwchase17
- @agola11
Agents / Tools / Toolkits
- @vowelparrot
VectorStores / Retrievers / Memory
- @dev2049
-->
# Fix subclassing OpenAIEmbeddings
Fixes#4498
## Before submitting
- Problem: Due to annotated type `Tuple[()]`.
- Fix: Change the annotated type to "Iterable[str]". Even though
tiktoken use
[Collection[str]](095924e02c/tiktoken/core.py (L80))
type annotation, but pydantic doesn't support Collection type, and
[Iterable](https://docs.pydantic.dev/latest/usage/types/#typing-iterables)
is the closest to Collection.
# fix: agenerate miss run_manager args in llm.py
<!--
Thank you for contributing to LangChain! Your PR will appear in our next
release under the title you set. Please make sure it highlights your
valuable contribution.
Replace this with a description of the change, the issue it fixes (if
applicable), and relevant context. List any dependencies required for
this change.
After you're done, someone will review your PR. They may suggest
improvements. If no one reviews your PR within a few days, feel free to
@-mention the same people again, as notifications can get lost.
-->
<!-- Remove if not applicable -->
Fixes # (issue)
fix: agenerate miss run_manager args in llm.py
<!-- For a quicker response, figure out the right person to tag with @
@hwchase17 - project lead
Tracing / Callbacks
- @agola11
Async
- @agola11
DataLoaders
- @eyurtsev
Models
- @hwchase17
- @agola11
Agents / Tools / Toolkits
- @vowelparrot
VectorStores / Retrievers / Memory
- @dev2049
-->
ArxivAPIWrapper searches and downloads PDFs to get related information.
But I found that it doesn't delete the downloaded file. The reason why
this is a problem is that a lot of PDF files remain on the server. For
example, one size is about 28M.
So, I added a delete line because it's too big to maintain on the
server.
# Clean up downloaded PDF files
- Changes: Added new line to delete downloaded file
- Background: To get the information on arXiv's paper, ArxivAPIWrapper
class downloads a PDF.
It's a natural approach, but the wrapper retains a lot of PDF files on
the server.
- Problem: One size of PDFs is about 28M. It's too big to maintain on a
small server like AWS.
- Dependency: import os
Thank you.
---------
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
# Get the memory importance score from regex matched group
In `GenerativeAgentMemory`, the `_score_memory_importance()` will make a
prompt to get a rating score. The prompt is:
```
prompt = PromptTemplate.from_template(
"On the scale of 1 to 10, where 1 is purely mundane"
+ " (e.g., brushing teeth, making bed) and 10 is"
+ " extremely poignant (e.g., a break up, college"
+ " acceptance), rate the likely poignancy of the"
+ " following piece of memory. Respond with a single integer."
+ "\nMemory: {memory_content}"
+ "\nRating: "
)
```
For some LLM, it will respond with, for example, `Rating: 8`. Thus we
might want to get the score from the matched regex group.
The function _get_prompt() was returning the DEFAULT_EXAMPLES even if
some custom examples were given. The return FewShotPromptTemplate was
returnong DEFAULT_EXAMPLES and not examples
# The cohere embedding model do not use large, small. It is deprecated.
Changed the modules default model
Fixes#4694
Co-authored-by: rajib76 <rajib76@yahoo.com>
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
**Feature**: This PR adds `from_template_file` class method to
BaseStringMessagePromptTemplate. This is useful to help user to create
message prompt templates directly from template files, including
`ChatMessagePromptTemplate`, `HumanMessagePromptTemplate`,
`AIMessagePromptTemplate` & `SystemMessagePromptTemplate`.
**Tests**: Unit tests have been added in this PR.
Co-authored-by: charosen <charosen@bupt.cn>
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
# Removed usage of deprecated methods
Replaced `SQLDatabaseChain` deprecated direct initialisation with
`from_llm` method
## Who can review?
@hwchase17
@agola11
---------
Co-authored-by: imeckr <chandanroutray2012@gmail.com>
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
# Fixed query checker for SQLDatabaseChain
When `SQLDatabaseChain`'s llm attribute was deprecated, the query
checker stopped working if `SQLDatabaseChain` is initialised via
`from_llm` method. With this fix, `SQLDatabaseChain`'s query checker
would use the same `llm` as used in the `llm_chain`
## Who can review?
@hwchase17 - project lead
Co-authored-by: imeckr <chandanroutray2012@gmail.com>
- Installation of non-colab packages
- Get API keys
# Added dependencies to make notebook executable on hosted notebooks
## Who can review?
Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:
@hwchase17
@vowelparrot