WhatsApp document loader - update regex (#2776)

While testing the WhatsApp document loader, I noticed that the
timestamp sometimes has the following format (note the underscore
between the time and AM/PM):
```
3/24/23, 1:54_PM - +91 99999 99999 joined using this group's invite link
3/24/23, 6:29_PM - +91 99999 99999: When are we starting then?
```

Weirdly, the underscore is visible in Vim, but not in editors like
VS Code. I presume it is some unusual character or line terminator.
Either way, I think handling this edge case will make the document
loader more robust.
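
One way to check which character actually sits between the time and AM/PM is to list the non-ASCII characters in an exported line. The sketch below is only an illustration: the helper name `describe_non_ascii` is hypothetical, and the sample line assumes the separator is U+202F, which the commit does not confirm.

```python
import unicodedata


def describe_non_ascii(line):
    """Return (index, code point, Unicode name) for each non-ASCII character."""
    found = []
    for index, char in enumerate(line):
        if ord(char) > 127:
            # Fall back to a placeholder for characters without a Unicode name.
            name = unicodedata.name(char, "<unnamed>")
            found.append((index, f"U+{ord(char):04X}", name))
    return found


# Hypothetical sample: the separator before "PM" is ASSUMED to be U+202F.
sample = "3/24/23, 1:54\u202fPM - +91 99999 99999: When are we starting then?"
for entry in describe_non_ascii(sample):
    print(entry)
```

Running this on a real export line would show whether the separator is a literal underscore or some other character that an editor happens to render that way.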
Rounak Datta 2023-04-13 22:18:32 +05:30 committed by GitHub
parent 2db9b7a45d
commit 7688bf9182
2 changed files with 8 additions and 2 deletions

@@ -8,4 +8,5 @@
 1/23/23, 3:02 AM - User 1: I thought you were selling the blue one!
 1/23/23, 3:18 AM - User 2: No Im sorry it was my mistake, the blue one is not for sale
 1/23/23, 3:19 AM - User 1: Oh no worries! Bye
 1/23/23, 3:19 AM - User 2: Bye!
+1/23/23, 3:22_AM - User 1: And let me know if anything changes

@@ -26,9 +26,14 @@ class WhatsAppChatLoader(BaseLoader):
         with open(p, encoding="utf8") as f:
             lines = f.readlines()
+        message_line_regex = (
+            r"(\d{1,2}/\d{1,2}/\d{2,4}, "
+            r"\d{1,2}:\d{1,2}[ _]?(?:AM|PM)?) - "
+            r"(.*?): (.*)"
+        )
         for line in lines:
             result = re.match(
-                r"(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{1,2}(?: AM| PM)?) - (.*?): (.*)",
+                message_line_regex,
                 line.strip(),
             )
             if result:
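
To see what the updated pattern accepts, here is a minimal standalone sketch. The pattern is copied from the diff; the `parse_line` helper and the `MESSAGE_LINE_REGEX` constant name are hypothetical and not part of the loader.

```python
import re

# Pattern copied from the diff: an optional space or underscore may
# separate the time from an optional AM/PM marker.
MESSAGE_LINE_REGEX = (
    r"(\d{1,2}/\d{1,2}/\d{2,4}, "
    r"\d{1,2}:\d{1,2}[ _]?(?:AM|PM)?) - "
    r"(.*?): (.*)"
)


def parse_line(line):
    """Return (timestamp, sender, text), or None if the line is not a message."""
    result = re.match(MESSAGE_LINE_REGEX, line.strip())
    return result.groups() if result else None


# Regular space before AM.
print(parse_line("1/23/23, 3:19 AM - User 2: Bye!"))
# Underscore before AM, as in the new test fixture line.
print(parse_line("1/23/23, 3:22_AM - User 1: And let me know if anything changes"))
# A system notice has no "sender: text" part, so it does not match.
print(parse_line("3/24/23, 1:54_PM - +91 99999 99999 joined using this group's invite link"))
```

Note that the character class `[ _]` matches only an ASCII space or a literal underscore. If the separator in a real export turns out to be a different character, such as a no-break space, the class would need to include it as well.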