forked from Archives/langchain
WhatsApp document loader - update regex (#2776)
I was testing out the WhatsApp Document loader, and noticed that sometimes the date is of the following format (notice the additional underscore):

```
3/24/23, 1:54_PM - +91 99999 99999 joined using this group's invite link
3/24/23, 6:29_PM - +91 99999 99999: When are we starting then?
```

Weirdly, the underscore is visible in Vim, but not in editors like VSCode. I presume it is some unusual character/line terminator. Nevertheless, I think handling this edge case will make the document loader more robust.
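The commit message does not identify the character that Vim renders as an underscore. One way to find out would be to list the non-ASCII code points in a suspect line. The sketch below is only an illustration: `describe_non_ascii` is a hypothetical helper, and the U+202F (narrow no-break space) in the sample is an assumed stand-in, not something confirmed by the original export.

```python
import unicodedata


def describe_non_ascii(line):
    """Return (index, code point, Unicode name) for each non-ASCII character."""
    found = []
    for index, char in enumerate(line):
        if ord(char) > 127:
            name = unicodedata.name(char, "UNKNOWN")
            found.append((index, f"U+{ord(char):04X}", name))
    return found


# Assumption: the sample uses U+202F between the time and "PM" purely as an example.
sample = "3/24/23, 1:54\u202fPM - +91 99999 99999: When are we starting then?"
for entry in describe_non_ascii(sample):
    print(entry)
```

Running the same helper on a line from a real export would show which character, if any, sits between the time and the AM/PM marker.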
parent 2db9b7a45d
commit 7688bf9182
@@ -8,4 +8,5 @@
 1/23/23, 3:02 AM - User 1: I thought you were selling the blue one!
 1/23/23, 3:18 AM - User 2: No Im sorry it was my mistake, the blue one is not for sale
 1/23/23, 3:19 AM - User 1: Oh no worries! Bye
 1/23/23, 3:19 AM - User 2: Bye!
+1/23/23, 3:22_AM - User 1: And let me know if anything changes
@@ -26,9 +26,14 @@ class WhatsAppChatLoader(BaseLoader):
         with open(p, encoding="utf8") as f:
             lines = f.readlines()
 
+        message_line_regex = (
+            r"(\d{1,2}/\d{1,2}/\d{2,4}, "
+            r"\d{1,2}:\d{1,2}[ _]?(?:AM|PM)?) - "
+            r"(.*?): (.*)"
+        )
         for line in lines:
             result = re.match(
-                r"(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{1,2}(?: AM| PM)?) - (.*?): (.*)",
+                message_line_regex,
                 line.strip(),
             )
             if result:
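To see what the regex change does in practice, here is a minimal standalone sketch that runs the old and new patterns against lines taken from this commit. The two patterns are copied from the diff; the `parse_line` helper and the constant names are hypothetical and not part of `WhatsAppChatLoader`.

```python
import re

# Pattern before this commit: only a plain space is accepted before AM/PM.
OLD_REGEX = r"(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{1,2}(?: AM| PM)?) - (.*?): (.*)"

# Pattern after this commit: an optional space or underscore before AM/PM.
NEW_REGEX = (
    r"(\d{1,2}/\d{1,2}/\d{2,4}, "
    r"\d{1,2}:\d{1,2}[ _]?(?:AM|PM)?) - "
    r"(.*?): (.*)"
)


def parse_line(line, pattern=NEW_REGEX):
    """Return (timestamp, sender, text) for a message line, or None if no match."""
    result = re.match(pattern, line.strip())
    if result is None:
        return None
    return result.groups()


samples = [
    "1/23/23, 3:19 AM - User 2: Bye!",
    "1/23/23, 3:22_AM - User 1: And let me know if anything changes",
    "3/24/23, 1:54_PM - +91 99999 99999 joined using this group's invite link",
]
for line in samples:
    print("old:", parse_line(line, OLD_REGEX))
    print("new:", parse_line(line, NEW_REGEX))
```

With the old pattern the `3:22_AM` line is not matched at all, while the new pattern captures it. The "joined using this group's invite link" line matches neither pattern because it has no `sender: text` part, so system notices of that shape are still skipped.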