# Refactor the test workflow
This PR refactors the tests to run using a single test workflow. This
makes it easier to relaunch failing tests and see in the UI which test
failed since the jobs are grouped together.
## Before submitting
## Who can review?
Thanks to @anna-charlotte and @jupyterjazz for the contribution! Made
few small changes to get it across the finish line
---------
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Signed-off-by: jupyterjazz <saba.sturua@jina.ai>
Co-authored-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Co-authored-by: jupyterjazz <saba.sturua@jina.ai>
Co-authored-by: Saba Sturua <45267439+jupyterjazz@users.noreply.github.com>
# Add action to test with all dependencies installed
PR adds a custom action for setting up poetry that allows specifying a
cache key:
https://github.com/actions/setup-python/issues/505#issuecomment-1273013236
This makes it possible to run 2 types of unit tests:
(1) unit tests with only core dependencies
(2) unit tests with extended dependencies (e.g., those that rely on an
optional pdf parsing library)
As part of this PR, we're moving some pdf parsing tests into the
unit-tests section and making sure that these unit tests get executed
when running with extended dependencies.
# ODF File Loader
Adds a data loader for handling Open Office ODT files. Requires
`unstructured>=0.6.3`.
### Testing
The following should work using the `fake.odt` example doc from the
[`unstructured` repo](https://github.com/Unstructured-IO/unstructured).
```python
from langchain.document_loaders import UnstructuredODTLoader
loader = UnstructuredODTLoader(file_path="fake.odt", mode="elements")
loader.load()
loader = UnstructuredODTLoader(file_path="fake.odt", mode="single")
loader.load()
```
Any import that touches langchain.retrievers currently requires Lark.
Here's one attempt to fix. Not very pretty, very open to other ideas.
Alternatives I thought of are 1) make Lark requirement, 2) put
everything in parser.py in the try/except. Neither sounds much better
Related to #4316, #4275
Fixed two small bugs (as reported in issue #1619 ) in the filtering by
metadata for `chroma` databases :
- ```langchain.vectorstores.chroma.similarity_search``` takes a
```filter``` input parameter but do not forward it to
```langchain.vectorstores.chroma.similarity_search_with_score```
- ```langchain.vectorstores.chroma.similarity_search_by_vector```
doesn't take this parameter in input, although it could be very useful,
without any additional complexity - and it would thus be coherent with
the syntax of the two other functions.
Co-authored-by: Davis Chase <130488702+dev2049@users.noreply.github.com>
Currently, MultiPromptChain instantiates a ChatOpenAI LLM instance for
the default chain to use if none of the prompts passed match. This seems
like an error as it means that you can't use your choice of LLM, or
configure how to instantiate the default LLM (e.g. passing in an API key
that isn't in the usual env variable).
Fixes#4153
If the sender of a message in a group chat isn't in your contact list,
they will appear with a ~ prefix in the exported chat. This PR adds
support for parsing such lines.
# Add support for Qdrant nested filter
This extends the filter functionality for the Qdrant vectorstore. The
current filter implementation is limited to a single-level metadata
structure; however, Qdrant supports nested metadata filtering. This
extends the functionality for users to maximize the filter functionality
when using Qdrant as the vectorstore.
Reference: https://qdrant.tech/documentation/filtering/#nested-key
---------
Signed-off-by: Aivin V. Solatorio <avsolatorio@gmail.com>
This pr makes it possible to extract more metadata from websites for
later use.
my usecase:
parsing ld+json or microdata from sites and store it as structured data
in the metadata field
- added `Wikipedia` retriever. It is effectively a wrapper for
`WikipediaAPIWrapper`. It wrapps load() into get_relevant_documents()
- sorted `__all__` in the `retrievers/__init__`
- added integration tests for the WikipediaRetriever
- added an example (as Jupyter notebook) for the WikipediaRetriever
# Minor Wording Documentation Change
```python
agent_chain.run("When's my friend Eric's surname?")
# Answer with 'Zhu'
```
is change to
```python
agent_chain.run("What's my friend Eric's surname?")
# Answer with 'Zhu'
```
I think when is a residual of the old query that was "When’s my friends
Eric`s birthday?".
# Add PDF parser implementations
This PR separates the data loading from the parsing for a number of
existing PDF loaders.
Parser tests have been designed to help encourage developers to create a
consistent interface for parsing PDFs.
This interface can be made more consistent in the future by adding
information into the initializer on desired behavior with respect to splitting by
page etc.
This code is expected to be backwards compatible -- with the exception
of a bug fix with pymupdf parser which was returning `bytes` in the page
content rather than strings.
Also changing the lazy parser method of document loader to return an
Iterator rather than Iterable over documents.
## Before submitting
<!-- If you're adding a new integration, include an integration test and
an example notebook showing its use! -->
## Who can review?
Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:
@
<!-- For a quicker response, figure out the right person to tag with @
@hwchase17 - project lead
Tracing / Callbacks
- @agola11
Async
- @agola11
DataLoader Abstractions
- @eyurtsev
LLM/Chat Wrappers
- @hwchase17
- @agola11
Tools / Toolkits
- @vowelparrot
-->
# Add MimeType Based Parser
This PR adds a MimeType Based Parser. The parser inspects the mime-type
of the blob it is parsing and based on the mime-type can delegate to the sub
parser.
## Before submitting
Waiting on adding notebooks until more implementations are landed.
## Who can review?
Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:
@hwchase17
@vowelparrot
# Update Writer LLM integration
Changes the parameters and base URL to be in line with Writer's current
API.
Based on the documentation on this page:
https://dev.writer.com/reference/completions-1
# Fix grammar in Text Splitters docs
Just a small fix of grammar in the documentation:
"That means there two different axes" -> "That means there are two
different axes"
Add a notebook in the `experimental/` directory detailing:
- How to capture traces with the v2 endpoint
- How to create datasets
- How to run traces over the dataset
Ensure compatibility with both SQLAlchemy v1/v2
fix the issue when using SQLAlchemy v1 (reported at #3884)
`
langchain/vectorstores/pgvector.py", line 168, in
create_tables_if_not_exists
self._conn.commit()
AttributeError: 'Connection' object has no attribute 'commit'
`
Ref Doc :
https://docs.sqlalchemy.org/en/14/changelog/migration_20.html#migration-20-autocommit
Handle duplicate and incorrectly specified OpenAI params
Thanks @PawelFaron for the fix! Made small update
Closes#4331
---------
Co-authored-by: PawelFaron <42373772+PawelFaron@users.noreply.github.com>
Co-authored-by: Pawel Faron <ext-pawel.faron@vaisala.com>
### Description
Add `similarity_search_with_score` method for OpenSearch to return
scores along with documents in the search results
Signed-off-by: Naveen Tatikonda <navtat@amazon.com>
fix: solve the infinite loop caused by 'add_memory' function when run
'pause_to_reflect' function
run steps:
'add_memory' -> 'pause_to_reflect' -> 'add_memory': infinite loop
This PR adds:
* Option to show a tqdm progress bar when using the file system blob loader
* Update pytest run configuration to be stricter
* Adding a new marker that checks that required pkgs exist