forked from Archives/langchain
WhatsApp document loader - update regex (#2776)
I was testing out the WhatsApp document loader and noticed that the date sometimes has the following format (notice the additional underscore):

```
3/24/23, 1:54_PM - +91 99999 99999 joined using this group's invite link
3/24/23, 6:29_PM - +91 99999 99999: When are we starting then?
```

Weirdly, the underscore is visible in Vim but not in editors like VSCode. I presume it is some unusual character or line terminator. Nevertheless, I think handling this edge case will make the document loader more robust.
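Since the message only presumes what the odd character is, one way to check is to print the code point that sits between the time and the AM/PM marker. The following is a minimal sketch, not part of this commit; `describe_separator` is a hypothetical helper, and the sample lines in the usage are made up for illustration.

```python
import re
import unicodedata


def describe_separator(line):
    """Return the code point and Unicode name of the character between
    the time and AM/PM, or None if the line has no AM/PM marker.

    Hypothetical helper for inspecting a chat export; not part of the loader.
    """
    match = re.search(r"\d{1,2}:\d{2}(.)(?:AM|PM)", line)
    if match is None:
        return None
    ch = match.group(1)
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"


# A literal underscore, as it appears in the commit message.
print(describe_separator("3/24/23, 1:54_PM - User 1: hello"))
# An ordinary space, as in the existing test data.
print(describe_separator("1/23/23, 3:19 AM - User 2: Bye!"))
```

If the character turns out not to be a literal underscore (for example some kind of non-breaking space, which is only an assumption here), the `[ _]` character class in the new regex would need to be widened accordingly.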
parent 2db9b7a45d
commit 7688bf9182
@@ -9,3 +9,4 @@
 1/23/23, 3:18 AM - User 2: No Im sorry it was my mistake, the blue one is not for sale
 1/23/23, 3:19 AM - User 1: Oh no worries! Bye
 1/23/23, 3:19 AM - User 2: Bye!
+1/23/23, 3:22_AM - User 1: And let me know if anything changes
@@ -26,9 +26,14 @@ class WhatsAppChatLoader(BaseLoader):
         with open(p, encoding="utf8") as f:
             lines = f.readlines()
 
+        message_line_regex = (
+            r"(\d{1,2}/\d{1,2}/\d{2,4}, "
+            r"\d{1,2}:\d{1,2}[ _]?(?:AM|PM)?) - "
+            r"(.*?): (.*)"
+        )
         for line in lines:
             result = re.match(
-                r"(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{1,2}(?: AM| PM)?) - (.*?): (.*)",
+                message_line_regex,
                 line.strip(),
             )
             if result:
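
To see the effect of the change, the old and new patterns can be run side by side on lines taken from the test data and the commit message. This is a standalone sketch rather than the loader itself; `OLD_REGEX`, `NEW_REGEX`, and `parse_line` are names introduced here for illustration only.

```python
import re

# Pattern used before this commit (copied from the removed line).
OLD_REGEX = r"(\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{1,2}(?: AM| PM)?) - (.*?): (.*)"

# Pattern introduced by this commit (copied from the added lines).
NEW_REGEX = (
    r"(\d{1,2}/\d{1,2}/\d{2,4}, "
    r"\d{1,2}:\d{1,2}[ _]?(?:AM|PM)?) - "
    r"(.*?): (.*)"
)


def parse_line(line, pattern=NEW_REGEX):
    """Return (date, sender, text) for a message line, else None.

    Hypothetical helper mirroring the re.match call in the diff.
    """
    result = re.match(pattern, line.strip())
    return result.groups() if result else None


underscore_line = "1/23/23, 3:22_AM - User 1: And let me know if anything changes"
plain_line = "1/23/23, 3:19 AM - User 2: Bye!"
join_line = "3/24/23, 1:54_PM - +91 99999 99999 joined using this group's invite link"

print(parse_line(underscore_line, OLD_REGEX))  # old pattern: no match
print(parse_line(underscore_line))             # new pattern: parsed
print(parse_line(plain_line))                  # ordinary lines still parse
print(parse_line(join_line))                   # no "sender: text" part, so no match
```

Note that the join notification is rejected by both patterns because it has no `sender: text` part, so only the actual message line from the commit message benefits from the new pattern.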