seems linkchecker isn't catching them because it runs on generated html.
at that point the links are already missing.
the generation process seems to strip invalid references when they can't
be re-written from md to html.
I used https://github.com/tcort/markdown-link-check to check the doc
source directly.
There are a few false positives on localhost for development.
I noticed that the "getting started" guide section on agents included an
example test where the agent was getting the question wrong 😅
I guess Olivia Wilde's dating life is too tough to keep track of for
this simple agent example. Let's change it to something a little easier,
so users who are running their agent for the first time are less likely
to be confused by a result that doesn't match that which is on the docs.
Added support for document loaders for Azure Blob Storage using a
connection string. Fixes#1805
---------
Co-authored-by: Mick Vleeshouwer <mick@imick.nl>
Ran into a broken build if bs4 wasn't installed in the project.
Minor tweak to follow the other doc loaders optional package-loading
conventions.
Also updated html docs to include reference to this new html loader.
side note: Should there be 2 different html-to-text document loaders?
This new one only handles local files, while the existing unstructured
html loader handles HTML from local and remote. So it seems like the
improvement was adding the title to the metadata, which is useful but
could also be added to `html.py`
In https://github.com/hwchase17/langchain/issues/1716 , it was
identified that there were two .py files performing similar tasks. As a
resolution, one of the files has been removed, as its purpose had
already been fulfilled by the other file. Additionally, the init has
been updated accordingly.
Furthermore, the how_to_guides.rst file has been updated to include
links to documentation that was previously missing. This was deemed
necessary as the existing list on
https://langchain.readthedocs.io/en/latest/modules/document_loaders/how_to_guides.html
was incomplete, causing confusion for users who rely on the full list of
documentation on the left sidebar of the website.
The GPT Index project is transitioning to the new project name,
LlamaIndex.
I've updated a few files referencing the old project name and repository
URL to the current ones.
From the [LlamaIndex repo](https://github.com/jerryjliu/llama_index):
> NOTE: We are rebranding GPT Index as LlamaIndex! We will carry out
this transition gradually.
>
> 2/25/2023: By default, our docs/notebooks/instructions now reference
"LlamaIndex" instead of "GPT Index".
>
> 2/19/2023: By default, our docs/notebooks/instructions now use the
llama-index package. However the gpt-index package still exists as a
duplicate!
>
> 2/16/2023: We have a duplicate llama-index pip package. Simply replace
all imports of gpt_index with llama_index if you choose to pip install
llama-index.
I'm not associated with LlamaIndex in any way. I just noticed the
discrepancy when studying the lanchain documentation.
# What does this PR do?
This PR adds similar to `llms` a SageMaker-powered `embeddings` class.
This is helpful if you want to leverage Hugging Face models on SageMaker
for creating your indexes.
I added a example into the
[docs/modules/indexes/examples/embeddings.ipynb](https://github.com/hwchase17/langchain/compare/master...philschmid:add-sm-embeddings?expand=1#diff-e82629e2894974ec87856aedd769d4bdfe400314b03734f32bee5990bc7e8062)
document. The example currently includes some `_### TEMPORARY: Showing
how to deploy a SageMaker Endpoint from a Hugging Face model ###_ ` code
showing how you can deploy a sentence-transformers to SageMaker and then
run the methods of the embeddings class.
@hwchase17 please let me know if/when i should remove the `_###
TEMPORARY: Showing how to deploy a SageMaker Endpoint from a Hugging
Face model ###_` in the description i linked to a detail blog on how to
deploy a Sentence Transformers so i think we don't need to include those
steps here.
I also reused the `ContentHandlerBase` from
`langchain.llms.sagemaker_endpoint` and changed the output type to `any`
since it is depending on the implementation.
Fixes the import typo in the vector db text generator notebook for the
chroma library
Co-authored-by: Anupam <anupam@10-16-252-145.dynapool.wireless.nyu.edu>
Use the following code to test:
```python
import os
from langchain.llms import OpenAI
from langchain.chains.api import podcast_docs
from langchain.chains import APIChain
# Get api key here: https://openai.com/pricing
os.environ["OPENAI_API_KEY"] = "sk-xxxxx"
# Get api key here: https://www.listennotes.com/api/pricing/
listen_api_key = 'xxx'
llm = OpenAI(temperature=0)
headers = {"X-ListenAPI-Key": listen_api_key}
chain = APIChain.from_llm_and_api_docs(llm, podcast_docs.PODCAST_DOCS, headers=headers, verbose=True)
chain.run("Search for 'silicon valley bank' podcast episodes, audio length is more than 30 minutes, return only 1 results")
```
Known issues: the api response data might be too big, and we'll get such
error:
`openai.error.InvalidRequestError: This model's maximum context length
is 4097 tokens, however you requested 6733 tokens (6477 in your prompt;
256 for the completion). Please reduce your prompt; or completion
length.`
New to Langchain, was a bit confused where I should find the toolkits
section when I'm at `agent/key_concepts` docs. I added a short link that
points to the how to section.
```
class Joke(BaseModel):
setup: str = Field(description="question to set up a joke")
punchline: str = Field(description="answer to resolve the joke")
joke_query = "Tell me a joke."
# Or, an example with compound type fields.
#class FloatArray(BaseModel):
# values: List[float] = Field(description="list of floats")
#
#float_array_query = "Write out a few terms of fiboacci."
model = OpenAI(model_name='text-davinci-003', temperature=0.0)
parser = PydanticOutputParser(pydantic_object=Joke)
prompt = PromptTemplate(
template="Answer the user query.\n{format_instructions}\n{query}\n",
input_variables=["query"],
partial_variables={"format_instructions": parser.get_format_instructions()}
)
_input = prompt.format_prompt(query=joke_query)
print("Prompt:\n", _input.to_string())
output = model(_input.to_string())
print("Completion:\n", output)
parsed_output = parser.parse(output)
print("Parsed completion:\n", parsed_output)
```
```
Prompt:
Answer the user query.
The output should be formatted as a JSON instance that conforms to the JSON schema below. For example, the object {"foo": ["bar", "baz"]} conforms to the schema {"foo": {"description": "a list of strings field", "type": "string"}}.
Here is the output schema:
---
{"setup": {"description": "question to set up a joke", "type": "string"}, "punchline": {"description": "answer to resolve the joke", "type": "string"}}
---
Tell me a joke.
Completion:
{"setup": "Why don't scientists trust atoms?", "punchline": "Because they make up everything!"}
Parsed completion:
setup="Why don't scientists trust atoms?" punchline='Because they make up everything!'
```
Ofc, works only with LMs of sufficient capacity. DaVinci is reliable but
not always.
---------
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
PromptLayer now has support for [several different tracking
features.](https://magniv.notion.site/Track-4deee1b1f7a34c1680d085f82567dab9)
In order to use any of these features you need to have a request id
associated with the request.
In this PR we add a boolean argument called `return_pl_id` which will
add `pl_request_id` to the `generation_info` dictionary associated with
a generation.
We also updated the relevant documentation.
add the state_of_the_union.txt file so that its easier to follow through
with the example.
---------
Co-authored-by: Jithin James <jjmachan@pop-os.localdomain>
* Zapier Wrapper and Tools (implemented by Zapier Team)
* Zapier Toolkit, examples with mrkl agent
---------
Co-authored-by: Mike Knoop <mikeknoop@gmail.com>
Co-authored-by: Robert Lewis <robert.lewis@zapier.com>
### Summary
Allows users to pass in `**unstructured_kwargs` to Unstructured document
loaders. Implemented with the `strategy` kwargs in mind, but will pass
in other kwargs like `include_page_breaks` as well. The two currently
supported strategies are `"hi_res"`, which is more accurate but takes
longer, and `"fast"`, which processes faster but with lower accuracy.
The `"hi_res"` strategy is the default. For PDFs, if `detectron2` is not
available and the user selects `"hi_res"`, the loader will fallback to
using the `"fast"` strategy.
### Testing
#### Make sure the `strategy` kwarg works
Run the following in iPython to verify that the `"fast"` strategy is
indeed faster.
```python
from langchain.document_loaders import UnstructuredFileLoader
loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", strategy="fast", mode="elements")
%timeit loader.load()
loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", mode="elements")
%timeit loader.load()
```
On my system I get:
```python
In [3]: from langchain.document_loaders import UnstructuredFileLoader
In [4]: loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", strategy="fast", mode="elements")
In [5]: %timeit loader.load()
247 ms ± 369 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", mode="elements")
In [7]: %timeit loader.load()
2.45 s ± 31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
#### Make sure older versions of `unstructured` still work
Run `pip install unstructured==0.5.3` and then verify the following runs
without error:
```python
from langchain.document_loaders import UnstructuredFileLoader
loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", mode="elements")
loader.load()
```
# Description
Add `RediSearch` vectorstore for LangChain
RediSearch: [RediSearch quick
start](https://redis.io/docs/stack/search/quick_start/)
# How to use
```
from langchain.vectorstores.redisearch import RediSearch
rds = RediSearch.from_documents(docs, embeddings,redisearch_url="redis://localhost:6379")
```
Seeing a lot of issues in Discord in which the LLM is not using the
correct LIMIT clause for different SQL dialects. ie, it's using `LIMIT`
for mssql instead of `TOP`, or instead of `ROWNUM` for Oracle, etc.
I think this could be due to us specifying the LIMIT statement in the
example rows portion of `table_info`. So the LLM is seeing the `LIMIT`
statement used in the prompt.
Since we can't specify each dialect's method here, I think it's fine to
just replace the `SELECT... LIMIT 3;` statement with `3 rows from
table_name table:`, and wrap everything in a block comment directly
following the `CREATE` statement. The Rajkumar et al paper wrapped the
example rows and `SELECT` statement in a block comment as well anyway.
Thoughts @fpingham?
`OnlinePDFLoader` and `PagedPDFSplitter` lived separate from the rest of
the pdf loaders.
Because they're all similar, I propose moving all to `pdy.py` and the
same docs/examples page.
Additionally, `PagedPDFSplitter` naming doesn't match the pattern the
rest of the loaders follow, so I renamed to `PyPDFLoader` and had it
inherit from `BasePDFLoader` so it can now load from remote file
sources.
Provide shared memory capability for the Agent.
Inspired by #1293 .
## Problem
If both Agent and Tools (i.e., LLMChain) use the same memory, both of
them will save the context. It can be annoying in some cases.
## Solution
Create a memory wrapper that ignores the save and clear, thereby
preventing updates from Agent or Tools.
Simple CSV document loader which wraps `csv` reader, and preps the file
with a single `Document` per row.
The column header is prepended to each value for context which is useful
for context with embedding and semantic search