Commit Graph

2602 Commits

Author SHA1 Message Date
Harrison Chase
bd8d418a95 Merge branch 'master' of github.com:hwchase17/langchain 2023-06-18 16:45:49 -07:00
Harrison Chase
3a75d59c3d searx - docs 2023-06-18 16:45:42 -07:00
MIDORIBIN
5be465bd86
Fixed PermissionError on windows (#6170)
Fixed PermissionError that occurred when downloading PDF files via http
in BasePDFLoader on windows.

When downloading PDF files via http in BasePDFLoader, NamedTemporaryFile
is used.
On **Windows**, a file created this way cannot be opened a second time while it is still open. [Python
Doc](https://docs.python.org/3.9/library/tempfile.html#tempfile.NamedTemporaryFile)

So, we created a **temporary directory** with TemporaryDirectory and
placed the downloaded file there.
The temporary directory is deleted in the destructor.
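
A minimal sketch of the approach described above, assuming a hypothetical `TempPDF` helper (this is not the actual `BasePDFLoader` code): the downloaded bytes are written into a `TemporaryDirectory`, the file handle is closed right away, and the path can then be reopened on any platform.

```python
import tempfile
from pathlib import Path


class TempPDF:
    """Hypothetical holder that keeps downloaded bytes in a temporary directory."""

    def __init__(self, data: bytes, name: str = "tmp.pdf") -> None:
        # The directory, not an open file handle, owns the lifetime of the file.
        self._temp_dir = tempfile.TemporaryDirectory()
        self.file_path = Path(self._temp_dir.name) / name
        # write_bytes opens, writes and closes, so the path can be reopened later.
        self.file_path.write_bytes(data)

    def close(self) -> None:
        # Explicit cleanup; calling it more than once is harmless.
        self._temp_dir.cleanup()

    def __del__(self) -> None:
        # Mirrors the commit message: the directory is removed in the destructor.
        if hasattr(self, "_temp_dir"):
            self._temp_dir.cleanup()
```

Because no handle stays open on the file, any PDF parser can open `file_path` by name, which is the step that failed with `NamedTemporaryFile` on Windows.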

Fixes #2698

#### Who can review?

Tag maintainers/contributors who might be interested:

  - @eyurtsev
  - @hwchase17
2023-06-18 16:39:57 -07:00
xleven
4fc7939848
fix link of callbacks on modules page (#6323)
The
[Callbacks](https://python.langchain.com/docs/modules/callbacks/getting_started/)
link on the [Modules](https://python.langchain.com/docs/modules/) page led to a "Page
Not Found" error.
2023-06-18 15:08:12 -07:00
Vijay
2b3b4e0f60
Add the ability to run the map_reduce chains process results step as async (#6181)
This will add the ability to add an AsyncCallbackManager (handler) for
the reducer chain, which would be able to stream the tokens via the
`async def on_llm_new_token` callback method



Fixes [#5532](https://github.com/hwchase17/langchain/issues/5532)


 @hwchase17  @agola11 
The following code snippet explains how this change would be used to
enable `reduce_llm` with streaming support in a `map_reduce` chain

I have tested this change and it works for the streaming use-case of
reducer responses. I am happy to share more information if this
solution makes sense.

```

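# Import paths are assumed for the LangChain release current at the time of this commit.
from typing import Any

from langchain.callbacks.base import AsyncCallbackHandler
from langchain.callbacks.manager import AsyncCallbackManager
from langchain.chains import ConversationalRetrievalChain
from langchain.chains.question_answering import load_qa_chain
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI


# AsyncHandler
class StreamingLLMCallbackHandler(AsyncCallbackHandler):
    """Callback handler for streaming LLM responses."""

    def __init__(self, websocket):
        self.websocket = websocket

    # This callback method is awaited for every new token
    async def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        resp = ChatResponse(sender="bot", message=token, type="stream")
        await self.websocket.send_json(resp.dict())


# Chain
# `websocket`, `ChatResponse`, `manager`, `vectorstore`, `question_generator`,
# `question` and `chat_history` are expected to come from the surrounding application.
stream_handler = StreamingLLMCallbackHandler(websocket)
stream_manager = AsyncCallbackManager([stream_handler])

# The reducer LLM streams its tokens through the async handler
streaming_llm = ChatOpenAI(
    streaming=True,
    callback_manager=stream_manager,
    verbose=False,
    temperature=0,
)
main_llm = OpenAI(
    temperature=0,
    verbose=False,
)

doc_chain = load_qa_chain(
    llm=main_llm,
    reduce_llm=streaming_llm,
    chain_type="map_reduce",
    callback_manager=manager,
)
qa_chain = ConversationalRetrievalChain(
    retriever=vectorstore.as_retriever(),
    combine_docs_chain=doc_chain,
    question_generator=question_generator,
    callback_manager=manager,
)

# Must run inside a coroutine. Here `acall` will trigger `acombine_docs` on
# `map_reduce`, which should then call `_aprocess_result`, which in turn will
# call `self.combine_document_chain.arun`, hence the async callback will be awaited.
result = await qa_chain.acall(
    {"question": question, "chat_history": chat_history}
)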
```
2023-06-18 13:19:56 -07:00
Alvaro Bartolome
e0dea577ee
Extend ArgillaCallbackHandler support (#6153)
Hi again @agola11! 🤗

## What's in this PR?

After playing around with different chains, we noticed that some chains
were using different `output_key`s and we were only handling some of them, so
we've extended the support to any output, whether it's a Python list
or a string.
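
As a rough illustration of handling "any output", here is a sketch that flattens a chain's output dictionary into a list of strings regardless of which key is used. The function name `collect_output_texts` is hypothetical and this is not the `ArgillaCallbackHandler` code itself.

```python
from typing import Any, Dict, List


def collect_output_texts(outputs: Dict[str, Any]) -> List[str]:
    """Flatten chain outputs to a list of strings, whatever the output key is."""
    texts: List[str] = []
    for value in outputs.values():
        if isinstance(value, str):
            texts.append(value)
        elif isinstance(value, list):
            # Keep only the string items of list-valued outputs.
            texts.extend(item for item in value if isinstance(item, str))
    return texts
```

The key point is that nothing depends on a fixed `output_key`: every value is inspected and only its type decides how it is handled.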

Kudos to @dvsrepo for spotting this!

---------

Co-authored-by: Daniel Vila Suero <daniel@argilla.io>
2023-06-18 11:18:33 -07:00
Harrison Chase
a8cb9ee013
Harrison/gdrive enhancements (#6375)
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2023-06-18 11:07:23 -07:00
rafael
ebfffaa38f
Guardrails output parser: Pass LLM api for reasking (#6089)
Fixes https://github.com/ShreyaR/guardrails/issues/155 

Enables guardrails reasking by specifying an LLM api in the output
parser.
2023-06-18 10:50:20 -07:00
Davis Chase
ec850e607f
bump 203 (#6372) 2023-06-18 09:20:47 -07:00
Lance Martin
370becdfc2
Add self query retriever example with MD header splitting (#6359)
Flesh out the notebook example for `MarkdownHeaderTextSplitter`
2023-06-17 21:40:20 -07:00
Lance Martin
2c97fbabbd
Update MD header text splitter notebook (#6339)
Highlight use case for maintaining header groups when splitting.
2023-06-17 13:19:27 -07:00
Harrison Chase
a2bbe3dda4
Harrison/mmr support for opensearch (#6349)
Co-authored-by: Mehmet Öner Yalçın <oneryalcin@gmail.com>
2023-06-17 12:22:37 -07:00
Davis Chase
2eea5d4cb4
Add ignore vercel preview script (#6320)
Skip building the docs preview for any branch that doesn't start
with `__docs__`. Will eventually update to look at code diff directories,
but patching for now.
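
The branch check can be sketched as follows. The function names are hypothetical, and the exit-code convention (1 lets the build proceed, 0 skips it) is an assumption about how the hosting platform's ignored-build step interprets the script's result.

```python
def should_build_preview(branch: str) -> bool:
    """Only branches whose name starts with `__docs__` get a docs preview."""
    return branch.startswith("__docs__")


def ignore_build_exit_code(branch: str) -> int:
    # Assumed convention: exit code 1 lets the build proceed, 0 skips it.
    return 1 if should_build_preview(branch) else 0
```

A wrapper script would typically read the branch name from an environment variable (for example `VERCEL_GIT_COMMIT_REF`, assumed here) and pass the result to `sys.exit`.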
2023-06-17 11:17:08 -07:00
Harrison Chase
7a48d9ee82 Merge branch 'master' of github.com:hwchase17/langchain 2023-06-17 11:16:19 -07:00
Kenny
e30fdffd1e
Add new openai 0613 model costs (#6110)
Added costs for gpt-4-32k-0613, gpt-4-0613, gpt-3.5-turbo-16k,
gpt-3.5-turbo-0613, and gpt-3.5-turbo-16k-0613 to openai_info callback
based on this [OpenAI
post](https://openai.com/blog/function-calling-and-other-api-updates)

@agola11
2023-06-17 11:11:47 -07:00
Dhruvil Shah
2eec687474
update web_base.py to have verify option (#6107)
We propose an enhancement to the web-based loader initialize method by
introducing a "verify" option. This enhancement addresses the issue of
SSL verification errors encountered on certain web pages. By providing
users with the option to set the verify parameter to False, we offer
greater flexibility and control.
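
To show what turning verification off means at the TLS level, here is a standard-library sketch. The helper `build_ssl_context` is hypothetical and is not the loader's implementation; it only mirrors the idea of a boolean `verify` switch.

```python
import ssl


def build_ssl_context(verify: bool = True) -> ssl.SSLContext:
    """Return a TLS context that optionally skips certificate verification."""
    context = ssl.create_default_context()
    if not verify:
        # Order matters: hostname checking must be disabled before CERT_NONE is set.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context
```

Such a context could be passed to `urllib.request.urlopen(url, context=...)`. With `verify=False` the server's certificate is not checked at all, so this is best limited to pages you already trust.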

### Fixes #6079 

#### Who can review?
@eyurtsev @hwchase17

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
2023-06-17 11:10:48 -07:00
Harrison Chase
680d6bbbf8 fix titles in documentation 2023-06-17 11:09:11 -07:00
Nuno Campos
e194dc5306
Make lckwargs private (#6344)
2023-06-17 19:08:25 +01:00
Harrison Chase
8cfb52ddbb fix spelling 2023-06-17 11:06:54 -07:00
zengbo
5d5298087f
Custom Anthropic API URL (#6221)
[Feature] Users can customize the Anthropic API URL

#### Who can review?

Tag maintainers/contributors who might be interested:

  Models
  - @hwchase17
  - @agola11
2023-06-17 11:01:29 -07:00
Harrison Chase
61e4a1adf9
Harrison/faiss score (#6341)
Co-authored-by: Frank Stein <16441059+simonfromla@users.noreply.github.com>
Co-authored-by: Sims Juju <sims@Ju.lan>
2023-06-17 11:00:47 -07:00
Harrison Chase
42a28ac1ba
Harrison/error zero tools (#6340)
Co-authored-by: Juhee Kim <46583939+juppytt@users.noreply.github.com>
2023-06-17 11:00:35 -07:00
Slawomir Gonet
eef62bf4e9
qdrant: search by vector (#6043)
Added support for `search_by_vector` to the Qdrant vector store.

### Who can review
VectorStores / Retrievers / Memory
- @dev2049
2023-06-17 09:44:28 -07:00
Mark
b7ba7e8a7b
Allow GoogleDrive to authenticate via application default credentials on Cloud Run/GCE etc without service key (#6035)
@eyurtsev

The existing GoogleDrive implementation always needs a service account
to be available at the credentials location. When running on GCP
services such as Cloud Run, a service account already exists in the
metadata of the service, so no physical key is necessary. This change
adds a check to see if it is running in such an environment, and uses
that authentication instead.

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
2023-06-17 09:44:17 -07:00
lonestriker
6f36f0f930
Add oobabooga/text-generation-webui support as a llm (#5997)
Add oobabooga/text-generation-webui support as an LLM. Currently
supports text-generation-webui's non-streaming API interface.
Allows users who already have text-gen running to use the same models
with langchain.

#### Before submitting

Simple usage, similar to existing supported LLMs:

```
from langchain.llms import TextGen
llm = TextGen(model_url = "http://localhost:5000")
```
#### Who can review?

 @hwchase17 - project lead

---------

Co-authored-by: Hien Ngo <Hien.Ngo@adia.ae>
2023-06-17 09:42:15 -07:00
Richy Wang
444ca3f669
Improve AnalyticDB Vector Store implementation without affecting user (#6086)
Hi there:

I previously implemented the AnalyticDB VectorStore using two tables to store
documents. Using a single table seems to be a better approach, so this
commit tries to improve the AnalyticDB VectorStore implementation without
affecting user behavior:

1. Streamline the `post_init` behavior by creating a single table with
vector indexing.
2. Update the `add_texts` API for document insertion.
3. Optimize `similarity_search_with_score_by_vector` to retrieve results
directly from the table.
4. Implement `_similarity_search_with_relevance_scores`.
5. Add an `embedding_dimension` parameter to support embedding functions
with different dimensions.

Users can continue using the API as before.
The previously added test cases are sufficient to cover this commit.
2023-06-17 09:36:31 -07:00
Ja-sonYun
cdd1d78bf2
make modelname_to_contextsize as a staticmethod (#6040)
Fixes #6039
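
A short sketch of why a static method helps here: the lookup can be called on the class without constructing an instance. `ExampleLLM` and its table entries are hypothetical stand-ins, not the library's class or its real context sizes.

```python
from typing import Dict


class ExampleLLM:
    """Hypothetical stand-in for an LLM wrapper class."""

    # Illustrative entries only; not the library's actual table.
    _CONTEXT_SIZES: Dict[str, int] = {"small-model": 2048, "large-model": 8192}

    @staticmethod
    def modelname_to_contextsize(modelname: str) -> int:
        """Look up the context size without needing an instance."""
        try:
            return ExampleLLM._CONTEXT_SIZES[modelname]
        except KeyError:
            raise ValueError(f"Unknown model: {modelname}") from None
```

Calling `ExampleLLM.modelname_to_contextsize("large-model")` works directly on the class, so no API key or other constructor arguments are needed just to query a context size.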

#### Who can review?

Tag maintainers/contributors who might be interested:
@hwchase17 @agola11
2023-06-17 09:13:08 -07:00
Saba Sturua
427551eabf
DocArray as a Retriever (#6031)
## DocArray as a Retriever

[DocArray](https://github.com/docarray/docarray) is an open-source tool
for managing your multi-modal data. It offers flexibility to store and
search through your data using various document index backends. This PR
introduces `DocArrayRetriever` - which works with any available backend
and serves as a retriever for Langchain apps.

Also, I added 2 notebooks:
- DocArray Backends - intro to all 5 currently supported backends, how to
initialize, index, and use them as a retriever
- DocArray Usage - showcasing what additional search parameters you can
pass to create versatile retrievers

Example:
```python
from docarray.index import InMemoryExactNNIndex
from docarray import BaseDoc, DocList
from docarray.typing import NdArray
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.retrievers import DocArrayRetriever


# define document schema
class MyDoc(BaseDoc):
    description: str
    description_embedding: NdArray[1536]


embeddings = OpenAIEmbeddings()
# create documents
descriptions = ["description 1", "description 2"]
desc_embeddings = embeddings.embed_documents(texts=descriptions)
docs = DocList[MyDoc](
    [
        MyDoc(description=desc, description_embedding=embedding)
        for desc, embedding in zip(descriptions, desc_embeddings)
    ]
)

# initialize document index with data
db = InMemoryExactNNIndex[MyDoc](docs)

# create a retriever
retriever = DocArrayRetriever(
    index=db,
    embeddings=embeddings,
    search_field="description_embedding",
    content_field="description",
)

# find the relevant document
doc = retriever.get_relevant_documents("action movies")
print(doc)
```

#### Who can review?

@dev2049

---------

Signed-off-by: jupyterjazz <saba.sturua@jina.ai>
2023-06-17 09:09:33 -07:00
Masafumi Mori
7bb437146d
fix links to prompt templates and example selectors (#6332)
Fixes invalid links to prompt templates and example selectors on the
[Prompts](https://python.langchain.com/docs/modules/model_io/prompts/)
page.

#### Before submitting
Just a small note: before opening this PR I tried to run `make docs_clean` and other
related commands described
[here](https://github.com/hwchase17/langchain/blob/master/.github/CONTRIBUTING.md#build-documentation-locally),
but they gave me an error:
```bash
langchain % make docs_clean
Traceback (most recent call last):
  File "/Users/masafumi/Downloads/langchain/.venv/bin/make", line 5, in <module>
    from scripts.proto import main
ModuleNotFoundError: No module named 'scripts'
make: *** [docs_clean] Error 1
# Poetry (version 1.5.1)
# Python 3.9.13
```
I couldn't figure out how to fix this, so I didn't run those commands.
But the links should work.

#### Who can review?

Tag maintainers/contributors who might be interested:
@hwchase17

Similar issue #6323

Co-authored-by: masafumimori <m.masafumimori@outlook.com>
2023-06-17 09:07:14 -07:00
Francisco Ingham
83eea230f3
changed height in the nb example (#6327)
changed height in the example to a more reasonable number (from 9 feet
to 6 feet)
2023-06-17 00:05:48 -07:00
James O'Dwyer
0475d015fe
Handle Managed Motorhead Data Key (#6169)
# Handle Managed Motorhead Data Key
Managed Motorhead will return a payload with a `data` key. We need to
handle this to properly access messages from the server.
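
A minimal sketch of that handling, assuming the managed response nests the usual body under `data` while a self-hosted response does not. The helper name `unwrap_payload` and the exact payload shape are assumptions for illustration.

```python
from typing import Any, Dict


def unwrap_payload(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return the inner body when the server wraps it in a `data` key."""
    data = payload.get("data")
    # Managed responses nest the body under "data"; others are returned as-is.
    return data if isinstance(data, dict) else payload
```

After unwrapping, the caller can read `messages` the same way for both kinds of server.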
2023-06-16 20:36:18 -07:00
Luke Stanley
364f8e7b5d
Better Entity Memory code documentation (#6318)
Just adds some comments and docstring improvements.

There was some behaviour that was quite unclear to me at first like:
- "when do things get updated?"
- "why are there only entity names and no summaries?"
- "why do the entity names disappear?" 

Now it can be much more obvious to many.

I am lukestanley on Twitter.
2023-06-16 18:08:44 -07:00
Harrison Chase
af18413d97
Harrison/deeplake new features (#6263)
Co-authored-by: adilkhan <adilkhan.sarsen@nu.edu.kz>
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
2023-06-16 17:53:55 -07:00
Davis Chase
6640293087
fix eval guide links (#6319) 2023-06-16 17:53:46 -07:00
ljeagle
ad324a39ae
Improve the performance of add_texts interface and upgrade the AwaDB from 0.3.2 to 0.3.3 (#6316)
1. Changed the implementation of add_texts interface for the AwaDB
vector store in order to improve the performance
2. Upgrade the AwaDB from 0.3.2 to 0.3.3

---------

Co-authored-by: vincent <awadb.vincent@gmail.com>
2023-06-16 16:50:01 -07:00
Davis Chase
24b2af5218
nit (#6305) 2023-06-16 16:21:27 -07:00
Pierre Alexandre SCHEMBRI
9ca11c06b7
Fixes #6282 (#6283)
Fixes #6282 

1 liner to fix default http headers not passed by `LLMRequestsChain`
2023-06-16 16:21:01 -07:00
Davis Chase
23cdebddc4
Del linkcheck readme (#6317) 2023-06-16 16:18:45 -07:00
Brigit Murtaugh
ccd916babe
Update dev container (#6189)
Fixes https://github.com/hwchase17/langchain/issues/6172

As described in https://github.com/hwchase17/langchain/issues/6172, I'd
love to help update the dev container in this project.

**Summary of changes:**
- Dev container now builds (the current container in this repo won't
build for me)
- Dockerfile updates
- Update image to our [currently-maintained Python
image](https://github.com/devcontainers/images/tree/main/src/python/.devcontainer)
(`mcr.microsoft.com/devcontainers/python`) rather than the deprecated
image from vscode-dev-containers
- Move Dockerfile to root of repo - in order for `COPY` to work
properly, it needs the files (in this case, `pyproject.toml` and
`poetry.toml`) in the same directory
- devcontainer.json updates
- Removed `customizations` and `remoteUser` since they should be covered
by the updated image in the Dockerfile
     - Update comments
- Update docker-compose.yaml to properly point to updated Dockerfile
- Add a .gitattributes to avoid line ending conversions, which can
result in hundreds of pending changes
([info](https://code.visualstudio.com/docs/devcontainers/tips-and-tricks#_resolving-git-line-ending-issues-in-containers-resulting-in-many-modified-files))
- Add a README in the .devcontainer folder and info on the dev container
in the contributing.md

**Outstanding questions:**
- Is it expected for `poetry install` to take some time? It takes about
30 minutes for this dev container to finish building in a Codespace, but
a user should only have to experience this once. Through some online
investigation, this doesn't seem unusual
- Versions of poetry newer than 1.3.2 failed every time - based on some
of the guidance in contributing.md and other online resources, it seemed
changing poetry versions might be a good solution. 1.3.2 is from Jan
2023

---------

Co-authored-by: bamurtaugh <brmurtau@microsoft.com>
Co-authored-by: Samruddhi Khandale <samruddhikhandale@github.com>
2023-06-16 15:42:14 -07:00
Davis Chase
03b5891cf7
more redirect (#6314) 2023-06-16 14:43:59 -07:00
Davis Chase
eaee492dbc
basic redirect (#6309) 2023-06-16 13:39:58 -07:00
Davis Chase
d2243757a3
update readme (#6304) 2023-06-16 12:27:16 -07:00
Davis Chase
2f47e5c766
update api link (#6303) 2023-06-16 12:18:17 -07:00
Davis Chase
d558bcfad8
rm ignore_vercel (#6302) 2023-06-16 12:06:58 -07:00
Davis Chase
87e502c6bc
Doc refactor (#6300)
Co-authored-by: jacoblee93 <jacoblee93@gmail.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
2023-06-16 11:52:56 -07:00
Harrison Chase
94c82a189d
bump to 202 (#6262) 2023-06-16 06:52:36 -07:00
hp0404
b01cf0dd54
ArxivAPIWrapper - doc_content_chars_max (#6063)
This PR refactors the ArxivAPIWrapper class making
`doc_content_chars_max` parameter optional. Additionally, tests have
been added to ensure the functionality of the doc_content_chars_max
parameter.
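
What an optional character limit could look like, as a sketch: `None` means no truncation. The helper `truncate_content` is hypothetical and is not the wrapper's actual code.

```python
from typing import Optional


def truncate_content(text: str, doc_content_chars_max: Optional[int] = None) -> str:
    """Cut text to at most `doc_content_chars_max` characters, or keep it whole."""
    if doc_content_chars_max is None:
        # No limit configured: return the document content unchanged.
        return text
    return text[:doc_content_chars_max]
```

Checking for `None` explicitly, rather than truthiness, keeps a limit of `0` meaningful.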

Fixes #6027
2023-06-15 22:16:42 -07:00
Daniel King
a9b97aa6f4
Update output format of MosaicML endpoint to be more flexible (#6060)
There will likely be another change or two coming over the next couple
weeks as we stabilize the API, but putting this one in now which just
makes the integration a bit more flexible with the response output
format.

```
(langchain) danielking@MML-1B940F4333E2 langchain % pytest tests/integration_tests/llms/test_mosaicml.py tests/integration_tests/embeddings/test_mosaicml.py 
=================================================================================== test session starts ===================================================================================
platform darwin -- Python 3.10.11, pytest-7.3.1, pluggy-1.0.0
rootdir: /Users/danielking/github/langchain
configfile: pyproject.toml
plugins: asyncio-0.20.3, mock-3.10.0, dotenv-0.5.2, cov-4.0.0, anyio-3.6.2
asyncio: mode=strict
collected 12 items                                                                                                                                                                        

tests/integration_tests/llms/test_mosaicml.py ......                                                                                                                                [ 50%]
tests/integration_tests/embeddings/test_mosaicml.py ......                                                                                                                          [100%]

=================================================================================== slowest 5 durations ===================================================================================
4.76s call     tests/integration_tests/llms/test_mosaicml.py::test_retry_logic
4.74s call     tests/integration_tests/llms/test_mosaicml.py::test_mosaicml_llm_call
4.13s call     tests/integration_tests/llms/test_mosaicml.py::test_instruct_prompt
0.91s call     tests/integration_tests/llms/test_mosaicml.py::test_short_retry_does_not_loop
0.66s call     tests/integration_tests/llms/test_mosaicml.py::test_mosaicml_extra_kwargs
=================================================================================== 12 passed in 19.70s ===================================================================================
```

#### Who can review?

  @hwchase17
  @dev2049
2023-06-15 22:15:39 -07:00
JaysonAlbert
50d9c7d5a4
Fix: change the chatgpt plugin retriever metadata format (#5920)
The current implementation puts the doc itself in as the metadata, but the
document returned by the ChatGPT plugin retriever already has a `metadata`
field, so it's better to use that instead.

The original code throws the following exception when using
`RetrievalQAWithSourcesChain`, because it cannot find the field
`metadata`:

```python
Exception has occurred: ValueError       (note: full exception trace is shown but execution is paused at: _run_module_as_main)
Document prompt requires documents to have metadata variables: ['source']. Received document with missing metadata: ['source'].
  File "/home/wangjie/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/chains/combine_documents/base.py", line 27, in format_document
    raise ValueError(
  File "/home/wangjie/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/chains/combine_documents/stuff.py", line 65, in <listcomp>
    doc_strings = [format_document(doc, self.document_prompt) for doc in docs]
  File "/home/wangjie/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/chains/combine_documents/stuff.py", line 65, in _get_inputs
    doc_strings = [format_document(doc, self.document_prompt) for doc in docs]
  File "/home/wangjie/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/chains/combine_documents/stuff.py", line 85, in combine_docs
    inputs = self._get_inputs(docs, **kwargs)
  File "/home/wangjie/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/chains/combine_documents/base.py", line 84, in _call
    output, extra_return_dict = self.combine_docs(
  File "/home/wangjie/anaconda3/envs/chatglm/lib/python3.10/site-packages/langchain/chains/base.py", line 140, in __call__
    raise e
```

Additionally, the `metadata` field in the `chatgpt plugin retriever`
has these fields by default:
```json
{
    "source":  "file",   //email, file or chat
    "source_id": "filename.docx", // the filename
    "url": "", 
    ...
}
```
So we should map `source_id` to `source` in the langchain metadata.

```python
metadata = d.pop("metadata", d)
if metadata.get("source_id"):
    metadata["source"] = metadata.pop("source_id")
```

#### Who can review?
@dev2049

---------

Co-authored-by: wangjie <wangjie@htffund.com>
2023-06-15 22:04:45 -07:00
Harrison Chase
e67b26eee9
Harrison/openai functions (#6261)
Co-authored-by: Francisco Ingham <24279597+fpingham@users.noreply.github.com>
2023-06-15 21:54:39 -07:00