openai-cookbook/examples/data/parsed_pdf_docs.json

1 line
79 KiB
JSON

Added a new notebook: "Parse PDF docs for RAG applications" (#1080)
7 months ago
[{"filename": "rag-deck.pdf", "text": "RAG\nTechnique\n\nFebruary 2024\n\n\fOverview\n\nRetrieval-Augmented Generation \nenhances the capabilities of language \nmodels by combining them with a \nretrieval system. This allows the model \nto leverage external knowledge sources \nto generate more accurate and \ncontextually relevant responses.\n\nExample use cases\n\n- Provide answers with up-to-date \n\ninformation\n\n- Generate contextual responses\n\nWhat we\u2019ll cover\n\n\u25cf Technical patterns\n\n\u25cf Best practices\n\n\u25cf Common pitfalls\n\n\u25cf Resources\n\n3\n\n\fWhat is RAG\n\nRetrieve information to Augment the model\u2019s knowledge and Generate the output\n\n\u201cWhat is your \nreturn policy?\u201d\n\nask\n\nresult\n\nsearch\n\nLLM\n\nreturn information\n\nTotal refunds: 0-14 days\n50% of value vouchers: 14-30 days\n$5 discount on next order: > 30 days\n\n\u201cYou can get a full refund up \nto 14 days after the \npurchase, then up to 30 days \nyou would get a voucher for \nhalf the value of your order\u201d\n\nKnowledge \nBase / External \nsources\n\n4\n\n\fWhen to use RAG\n\nGood for \u2705\n\nNot good for \u274c\n\n\u25cf\n\n\u25cf\n\nIntroducing new information to the model \n\n\u25cf\n\nTeaching the model a speci\ufb01c format, style, \n\nto update its knowledge\n\nReducing hallucinations by controlling \n\ncontent\n\n/!\\ Hallucinations can still happen with RAG\n\nor language\n\u2794 Use \ufb01ne-tuning or custom models instead\n\n\u25cf\n\nReducing token usage\n\u2794 Consider \ufb01ne-tuning depending on the use \n\ncase\n\n5\n\n\fTechnical patterns\n\nData preparation\n\nInput processing\n\nRetrieval\n\nAnswer Generation\n\n\u25cf Chunking\n\n\u25cf\n\n\u25cf\n\nEmbeddings\n\nAugmenting \ncontent\n\n\u25cf\n\nInput \naugmentation\n\n\u25cf NER\n\n\u25cf\n\nSearch\n\n\u25cf Context window\n\n\u25cf Multi-step \nretrieval\n\n\u25cf Optimisation\n\n\u25cf\n\nSafety checks\n\n\u25cf\n\nEmbeddings\n\n\u25cf Re-ranking\n\n6\n\n\fTechnical 
patterns\nData preparation\n\nchunk documents into multiple \npieces for easier consumption\n\ncontent\n\nembeddings\n\n0.983, 0.123, 0.289\u2026\n\n0.876, 0.145, 0.179\u2026\n\n0.983, 0.123, 0.289\u2026\n\nAugment content \nusing LLMs\n\nEx: parse text only, ask gpt-4 to rephrase & \nsummarize each part, generate bullet points\u2026\n\nBEST PRACTICES\n\nPre-process content for LLM \nconsumption: \nAdd summary, headers for each \npart, etc.\n+ curate relevant data sources\n\nKnowledge \nBase\n\nCOMMON PITFALLS\n\n\u2794 Having too much low-quality \n\ncontent\n\n\u2794 Having too large documents\n\n7\n\n\fTechnical patterns\nData preparation: chunking\n\nWhy chunking?\n\nIf your system doesn\u2019t require \nentire documents to provide \nrelevant answers, you can \nchunk them into multiple pieces \nfor easier consumption (reduced \ncost & latency).\n\nOther approaches: graphs or \nmap-reduce\n\nThings to consider\n\n\u25cf\n\nOverlap:\n\n\u25cb\n\n\u25cb\n\nShould chunks be independent or overlap one \nanother?\nIf they overlap, by how much?\n\n\u25cf\n\nSize of chunks: \n\n\u25cb What is the optimal chunk size for my use case?\n\u25cb\n\nDo I want to include a lot in the context window or \njust the minimum?\n\n\u25cf Where to chunk:\n\n\u25cb\n\n\u25cb\n\nShould I chunk every N tokens or use speci\ufb01c \nseparators? 
\nIs there a logical way to split the context that would \nhelp the retrieval process?\n\n\u25cf What to return:\n\n\u25cb\n\n\u25cb\n\nShould I return chunks across multiple documents \nor top chunks within the same doc?\nShould chunks be linked together with metadata to \nindicate common properties?\n\n8\n\n\fTechnical patterns\nData preparation: embeddings\n\nWhat to embed?\n\nDepending on your use case \nyou might not want just to \nembed the text in the \ndocuments but metadata as well \n- anything that will make it easier \nto surface this speci\ufb01c chunk or \ndocument when performing a \nsearch\n\nExamples\n\nEmbedding Q&A posts in a forum\nYou might want to embed the title of the posts, \nthe text of the original question and the c