fix:get bilibili subtitles (#8165)

- Description: fix the `BiliBiliLoader` loader
  - Issue: the API response has changed

![image](https://github.com/langchain-ai/langchain/assets/2113954/91216793-82f8-4c82-a018-d49f36f5f6aa)
The previously used API no longer returns the "subtitle_url" property.

![image](https://github.com/langchain-ai/langchain/assets/2113954/a8ec2a7a-f40d-4c2a-b7d0-0ccdf2b327cc)
We should use another API to get the `subtitle_url` property.
The `subtitle_url` returned by this API does not include the URL scheme (`https:`),
so it needs to be added.

  - Dependencies: None
  - Tag maintainer: @rlancemartin
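
The scheme fix described above can be sketched as a small standalone helper. This is only an illustration: `ensure_scheme` and its `default_scheme` parameter are hypothetical names, not part of the loader, and the fallback for bare `host/path` strings is an assumption that goes slightly beyond the `startswith("http")` check in the commit.

```python
def ensure_scheme(sub_url: str, default_scheme: str = "https") -> str:
    """Return sub_url with an explicit scheme (hypothetical helper)."""
    # Already absolute: leave untouched.
    if sub_url.startswith(("http://", "https://")):
        return sub_url
    # Protocol-relative form ("//host/path"), the case this commit handles.
    if sub_url.startswith("//"):
        return f"{default_scheme}:{sub_url}"
    # Assumption: anything else is a bare "host/path" string.
    return f"{default_scheme}://{sub_url}"


# example.com is a placeholder host, not a real subtitle endpoint.
print(ensure_scheme("//example.com/subtitle.json"))
```

Matching on the full `http://` / `https://` prefixes, rather than on `"http"` alone, avoids misreading a host name that happens to begin with `http`.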
liguoqinjim 2023-08-05 05:30:41 +08:00 committed by GitHub
parent 21771a6f1c
commit d00a247da7
GPG Key ID: 4AEE18F83AFDEB23


@@ -54,12 +54,14 @@ class BiliBiliLoader(BaseLoader):
         video_info = sync(v.get_info())
         video_info.update({"url": url})
+        sub = sync(v.get_subtitle(video_info["cid"]))
         # Get subtitle url
-        subtitle = video_info.pop("subtitle")
-        sub_list = subtitle["list"]
+        sub_list = sub["subtitles"]
         if sub_list:
             sub_url = sub_list[0]["subtitle_url"]
+            if not sub_url.startswith("http"):
+                sub_url = "https:" + sub_url
             result = requests.get(sub_url)
             raw_sub_titles = json.loads(result.content)["body"]
             raw_transcript = " ".join([c["content"] for c in raw_sub_titles])
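
The last two lines of the hunk turn the downloaded subtitle JSON into a single transcript string. A minimal offline sketch of that step follows, assuming the payload has the `{"body": [{"content": ...}, ...]}` shape implied by the code above. The function name `transcript_from_subtitle_json` and the sample payload are hypothetical, and returning an empty string when `body` is missing is an added assumption, not the loader's behavior.

```python
import json
from typing import Union


def transcript_from_subtitle_json(payload: Union[str, bytes]) -> str:
    """Join the "content" field of every subtitle entry with spaces."""
    # Assumption: a missing "body" key is treated as "no subtitles".
    body = json.loads(payload).get("body", [])
    return " ".join(entry["content"] for entry in body)


# Made-up payload for illustration; real responses carry more fields.
sample = json.dumps({"body": [{"content": "hello"}, {"content": "world"}]})
print(transcript_from_subtitle_json(sample))
```

Keeping the parsing separate from the `requests.get` call makes this step testable without network access.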