openai-cookbook/examples/data/parsed_pdf_docs.json
katia-openai e92df85ad4
Added a new notebook: "Parse PDF docs for RAG applications" (#1080)
2024-02-29 13:54:06 +00:00

1 line · 79 KiB · JSON

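The file is a single-line JSON array in which each object holds one parsed PDF: a "filename", the raw extracted "text", and a "pages_description" list with a natural-language summary of each page. As a minimal sketch (not part of the file itself; the local path and field handling are assumptions based on the structure visible below), it can be loaded and inspected like this:

import json

# Load the parsed PDF data (path assumed; adjust to wherever the file lives locally).
with open("examples/data/parsed_pdf_docs.json") as f:
    docs = json.load(f)

# Each entry describes one parsed PDF: its filename, the raw extracted text,
# and per-page descriptions, matching the fields visible in the data below.
for doc in docs:
    pages = doc.get("pages_description", [])
    print(f'{doc["filename"]}: {len(doc["text"])} characters of text, {len(pages)} page descriptions')

The raw file contents follow.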
[{"filename": "rag-deck.pdf", "text": "RAG\nTechnique\n\nFebruary 2024\n\n\fOverview\n\nRetrieval-Augmented Generation \nenhances the capabilities of language \nmodels by combining them with a \nretrieval system. This allows the model \nto leverage external knowledge sources \nto generate more accurate and \ncontextually relevant responses.\n\nExample use cases\n\n- Provide answers with up-to-date \n\ninformation\n\n- Generate contextual responses\n\nWhat we\u2019ll cover\n\n\u25cf Technical patterns\n\n\u25cf Best practices\n\n\u25cf Common pitfalls\n\n\u25cf Resources\n\n3\n\n\fWhat is RAG\n\nRetrieve information to Augment the model\u2019s knowledge and Generate the output\n\n\u201cWhat is your \nreturn policy?\u201d\n\nask\n\nresult\n\nsearch\n\nLLM\n\nreturn information\n\nTotal refunds: 0-14 days\n50% of value vouchers: 14-30 days\n$5 discount on next order: > 30 days\n\n\u201cYou can get a full refund up \nto 14 days after the \npurchase, then up to 30 days \nyou would get a voucher for \nhalf the value of your order\u201d\n\nKnowledge \nBase / External \nsources\n\n4\n\n\fWhen to use RAG\n\nGood for \u2705\n\nNot good for \u274c\n\n\u25cf\n\n\u25cf\n\nIntroducing new information to the model \n\n\u25cf\n\nTeaching the model a speci\ufb01c format, style, \n\nto update its knowledge\n\nReducing hallucinations by controlling \n\ncontent\n\n/!\\ Hallucinations can still happen with RAG\n\nor language\n\u2794 Use \ufb01ne-tuning or custom models instead\n\n\u25cf\n\nReducing token usage\n\u2794 Consider \ufb01ne-tuning depending on the use \n\ncase\n\n5\n\n\fTechnical patterns\n\nData preparation\n\nInput processing\n\nRetrieval\n\nAnswer Generation\n\n\u25cf Chunking\n\n\u25cf\n\n\u25cf\n\nEmbeddings\n\nAugmenting \ncontent\n\n\u25cf\n\nInput \naugmentation\n\n\u25cf NER\n\n\u25cf\n\nSearch\n\n\u25cf Context window\n\n\u25cf Multi-step \nretrieval\n\n\u25cf Optimisation\n\n\u25cf\n\nSafety checks\n\n\u25cf\n\nEmbeddings\n\n\u25cf Re-ranking\n\n6\n\n\fTechnical patterns\nData preparation\n\nchunk documents into multiple \npieces for easier consumption\n\ncontent\n\nembeddings\n\n0.983, 0.123, 0.289\u2026\n\n0.876, 0.145, 0.179\u2026\n\n0.983, 0.123, 0.289\u2026\n\nAugment content \nusing LLMs\n\nEx: parse text only, ask gpt-4 to rephrase & \nsummarize each part, generate bullet points\u2026\n\nBEST PRACTICES\n\nPre-process content for LLM \nconsumption: \nAdd summary, headers for each \npart, etc.\n+ curate relevant data sources\n\nKnowledge \nBase\n\nCOMMON PITFALLS\n\n\u2794 Having too much low-quality \n\ncontent\n\n\u2794 Having too large documents\n\n7\n\n\fTechnical patterns\nData preparation: chunking\n\nWhy chunking?\n\nIf your system doesn\u2019t require \nentire documents to provide \nrelevant answers, you can \nchunk them into multiple pieces \nfor easier consumption (reduced \ncost & latency).\n\nOther approaches: graphs or \nmap-reduce\n\nThings to consider\n\n\u25cf\n\nOverlap:\n\n\u25cb\n\n\u25cb\n\nShould chunks be independent or overlap one \nanother?\nIf they overlap, by how much?\n\n\u25cf\n\nSize of chunks: \n\n\u25cb What is the optimal chunk size for my use case?\n\u25cb\n\nDo I want to include a lot in the context window or \njust the minimum?\n\n\u25cf Where to chunk:\n\n\u25cb\n\n\u25cb\n\nShould I chunk every N tokens or use speci\ufb01c \nseparators? 
\nIs there a logical way to split the context that would \nhelp the retrieval process?\n\n\u25cf What to return:\n\n\u25cb\n\n\u25cb\n\nShould I return chunks across multiple documents \nor top chunks within the same doc?\nShould chunks be linked together with metadata to \nindicate common properties?\n\n8\n\n\fTechnical patterns\nData preparation: embeddings\n\nWhat to embed?\n\nDepending on your use case \nyou might not want just to \nembed the text in the \ndocuments but metadata as well \n- anything that will make it easier \nto surface this speci\ufb01c chunk or \ndocument when performing a \nsearch\n\nExamples\n\nEmbedding Q&A posts in a forum\nYou might want to embed the title of the posts, \nthe text of the original question and the content of \nthe top answers.\nAdditionally, if the posts are tagged by topic or \nwith keywords, you can embed those too.\n\nEmbedding product specs\nIn additional to embedding the text contained in \ndocuments describing the products, you might \nwant to add metadata that you have on the \nproduct such as the color, size, etc. in your \nembeddings.\n\n9\n\n\fTechnical patterns\nData preparation: augmenting content\n\nWhat does \u201cAugmenting \ncontent\u201d mean?\n\nAugmenting content refers to \nmodi\ufb01cations of the original content \nto make it more digestible for a \nsystem relying on RAG. The \nmodi\ufb01cations could be a change in \nformat, wording, or adding \ndescriptive content such as \nsummaries or keywords.\n\nExample approaches\n\nMake it a guide*\nReformat the content to look more like \na step-by-step guide with clear \nheadings and bullet-points, as this \nformat is more easily understandable \nby an LLM.\n\nAdd descriptive metadata*\nConsider adding keywords or text that \nusers might search for when thinking \nof a speci\ufb01c product or service.\n\nMultimodality\nLeverage models \nsuch as Whisper or \nGPT-4V to \ntransform audio or \nvisual content into \ntext.\nFor example, you \ncan use GPT-4V to \ngenerate tags for \nimages or to \ndescribe slides.\n\n* GPT-4 can do this for you with the right prompt\n\n10\n\n\fTechnical patterns\nInput processing\n\nProcess input according to task\n\nQ&A\nHyDE: Ask LLM to hypothetically answer the \nquestion & use the answer to search the KB\n\nembeddings\n\n0.983, 0.123, 0.289\u2026\n\n0.876, 0.145, 0.179\u2026\n\nContent search\nPrompt LLM to rephrase input & optionally add \nmore context\n\nquery\n\nSELECT * from items\u2026\n\nDB search\nNER: Find relevant entities to be used for a \nkeyword search or to construct a search query\n\nkeywords\n\nred\n\nsummer\n\nBEST PRACTICES\n\nConsider how to transform the \ninput to match content in the \ndatabase\nConsider using metadata to \naugment the user input\n\nCOMMON PITFALLS\n\n\u2794 Comparing directly the input \nto the database without \nconsidering the task \nspeci\ufb01cities \n\n11\n\n\fTechnical patterns\nInput processing: input augmentation\n\nWhat is input augmentation?\n\nExample approaches\n\nAugmenting the input means turning \nit into something di\ufb00erent, either \nrephrasing it, splitting it in several \ninputs or expanding it.\nThis helps boost performance as \nthe LLM might understand better \nthe user intent.\n\nQuery \nexpansion*\nRephrase the \nquery to be \nmore \ndescriptive\n\nHyDE*\nHypothetically \nanswer the \nquestion & use \nthe answer to \nsearch the KB\n\nSplitting a query in N*\nWhen there is more than 1 question or \nintent in a user query, consider \nsplitting it in several queries\n\nFallback\nConsider 
\nimplementing a \n\ufb02ow where the LLM \ncan ask for \nclari\ufb01cation when \nthere is not enough \ninformation in the \noriginal user query \nto get a result\n(Especially relevant \nwith tool usage)\n\n* GPT-4 can do this for you with the right prompt\n\n12\n\n\fTechnical patterns\nInput processing: NER\n\nWhy use NER?\n\nUsing NER (Named Entity \nRecognition) allows to extract \nrelevant entities from the input, that \ncan then be used for more \ndeterministic search queries. \nThis can be useful when the scope \nis very constrained.\n\nExample\n\nSearching for movies\nIf you have a structured database containing \nmetadata on movies, you can extract genre, \nactors or directors names, etc. from the user \nquery and use this to search the database\n\nNote: You can use exact values or embeddings after \nhaving extracted the relevant entities\n\n13\n\n\fTechnical patterns\nRetrieval\n\nre-ranking\n\nINPUT\n\nembeddings\n\n0.983, 0.123, 0.289\u2026\n\n0.876, 0.145, 0.179\u2026\n\nquery\n\nSELECT * from items\u2026\n\nkeywords\n\nred\n\nsummer\n\nSemantic \nsearch\n\nRESULTS\n\nRESULTS\n\nvector DB\n\nrelational / \nnosql db\n\nFINAL RESULT\n\nUsed to \ngenerate output\n\nBEST PRACTICES\n\nUse a combination of semantic \nsearch and deterministic queries \nwhere possible\n\n+ Cache output where possible\n\nCOMMON PITFALLS\n\n\u2794 The wrong elements could be \ncompared when looking at \ntext similarity, that is why \nre-ranking is important\n\n14\n\n\fTechnical patterns\nRetrieval: search\n\nHow to search?\n\nSemantic search\n\nKeyword search\n\nSearch query\n\nThere are many di\ufb00erent \napproaches to search depending on \nthe use case and the existing \nsystem.\n\nUsing embeddings, you \ncan perform semantic \nsearches. You can \ncompare embeddings \nwith what is in your \ndatabase and \ufb01nd the \nmost similar.\n\nIf you have extracted \nspeci\ufb01c entities or \nkeywords to search for, \nyou can search for these \nin your database.\n\nBased on the extracted \nentities you have or the \nuser input as is, you can \nconstruct search queries \n(SQL, cypher\u2026) and use \nthese queries to search \nyour database.\n\nYou can use a hybrid approach and combine several of these.\nYou can perform multiple searches in parallel or in sequence, or \nsearch for keywords with their embeddings for example.\n\n15\n\n\fTechnical patterns\nRetrieval: multi-step retrieval\n\nWhat is multi-step retrieval?\n\nIn some cases, there might be \nseveral actions to be performed to \nget the required information to \ngenerate an answer.\n\nThings to consider\n\n\u25cf\n\nFramework to be used:\n\n\u25cb When there are multiple steps to perform, \nconsider whether you want to handle this \nyourself or use a framework to make it easier\n\n\u25cf\n\nCost & Latency:\n\n\u25cb\n\n\u25cb\n\nPerforming multiple steps at the retrieval \nstage can increase latency and cost \nsigni\ufb01cantly\nConsider performing actions in parallel to \nreduce latency\n\n\u25cf\n\nChain of Thought:\n\n\u25cb\n\n\u25cb\n\nGuide the assistant with the chain of thought \napproach: break down instructions into \nseveral steps, with clear guidelines on \nwhether to continue, stop or do something \nelse. 
\nThis is more appropriate when tasks need to \nbe performed sequentially - for example: \u201cif \nthis didn\u2019t work, then do this\u201d\n\n16\n\n\fTechnical patterns\nRetrieval: re-ranking\n\nWhat is re-ranking?\n\nExample approaches\n\nRe-ranking means re-ordering the \nresults of the retrieval process to \nsurface more relevant results.\nThis is particularly important when \ndoing semantic searches.\n\nRule-based re-ranking\nYou can use metadata to rank results by relevance. For \nexample, you can look at the recency of the documents, at \ntags, speci\ufb01c keywords in the title, etc.\n\nRe-ranking algorithms\nThere are several existing algorithms/approaches you can use \nbased on your use case: BERT-based re-rankers, \ncross-encoder re-ranking, TF-IDF algorithms\u2026\n\n17\n\n\fTechnical patterns\nAnswer Generation\n\nFINAL RESULT\n\nPiece of content \nretrieved\n\nLLM\n\nPrompt including \nthe content\n\nUser sees the \n\ufb01nal result\n\nBEST PRACTICES\n\nEvaluate performance after each \nexperimentation to assess if it\u2019s \nworth exploring other paths\n+ Implement guardrails if applicable\n\nCOMMON PITFALLS\n\n\u2794 Going for \ufb01ne-tuning without \ntrying other approaches\n\u2794 Not paying attention to the \nway the model is prompted\n\n18\n\n\fTechnical patterns\nAnswer Generation: context window\n\nHow to manage context?\n\nDepending on your use case, there are \nseveral things to consider when \nincluding retrieved content into the \ncontext window to generate an answer. \n\nThings to consider\n\n\u25cf\n\nContext window max size:\n\n\u25cb\n\n\u25cb\n\nThere is a maximum size, so putting too \nmuch content is not ideal\nIn conversation use cases, the \nconversation will be part of the context \nas well and will add to that size\n\n\u25cf\n\nCost & Latency vs Accuracy:\n\n\u25cb More context results in increased \n\nlatency and additional costs since there \nwill be more input tokens\nLess context might also result in \ndecreased accuracy\n\n\u25cb\n\n\u25cf\n\n\u201cLost in the middle\u201d problem:\n\n\u25cb When there is too much context, LLMs \ntend to forget the text \u201cin the middle\u201d of \nthe content and might look over some \nimportant information.\n\n19\n\n\fTechnical patterns\nAnswer Generation: optimisation\n\nHow to optimise?\n\nThere are a few di\ufb00erent \nmethods to consider when \noptimising a RAG application.\nTry them from left to right, and \niterate with several of these \napproaches if needed.\n\nPrompt Engineering\n\nFew-shot examples\n\nFine-tuning\n\nAt each point of the \nprocess, experiment with \ndi\ufb00erent prompts to get \nthe expected input format \nor generate a relevant \noutput.\nTry guiding the model if \nthe process to get to the \n\ufb01nal outcome contains \nseveral steps.\n\nIf the model doesn\u2019t \nbehave as expected, \nprovide examples of what \nyou want e.g. provide \nexample user inputs and \nthe expected processing \nformat.\n\nIf giving a few examples \nisn\u2019t enough, consider \n\ufb01ne-tuning a model with \nmore examples for each \nstep of the process: you \ncan \ufb01ne-tune to get a \nspeci\ufb01c input processing \nor output format.\n\n20\n\n\fTechnical patterns\nAnswer Generation: safety checks\n\nWhy include safety checks?\n\nJust because you provide the model \nwith (supposedly) relevant context \ndoesn\u2019t mean the answer will \nsystematically be truthful or on-point.\nDepending on the use case, you \nmight want to double-check. 
\n\nExample evaluation framework: RAGAS\n\n21\n\n\f", "pages_description": ["Overview\n\nRetrieval-Augmented Generation models enhance the capabilities of language models by combining them with a retrieval system. This allows the model to leverage external knowledge sources to generate more accurate and contextually relevant responses.\n\nExample use cases include providing answers with up-to-date information and generating contextual responses.\n\nWhat we'll cover includes technical patterns, best practices, common pitfalls, and resources.", "What is RAG\n\nThe content describes a process where a person asks a question, \"What is your return policy?\" This question is directed to an entity labeled LLM, which then searches a Knowledge Base or External sources. The Knowledge Base contains information on the return policy, stating that total refunds are available for 0-14 days, 50% of value in vouchers for 14-30 days, and a $5 discount on the next order for periods greater than 30 days. The LLM returns this information, and as a result, the person receives an answer: \"You can get a full refund up to 14 days after the purchase, then up to 30 days you would get a voucher for half the value of your order.\" The process illustrates how the RAG (Retrieve information to Augment the model's knowledge and Generate the output) system operates to provide answers to queries by retrieving relevant information from a knowledge source.", "When to use RAG\n\nThe content is divided into two sections, one highlighting the positive aspects of using RAG and the other outlining its limitations.\n\nGood for:\n- Introducing new information to the model to update its knowledge.\n- Reducing hallucinations by controlling content. However, it is noted that hallucinations can still happen with RAG.\n\nNot good for:\n- Teaching the model a specific format, style, or language. It is suggested to use fine-tuning or custom models instead.\n- Reducing token usage. For this purpose, fine-tuning should be considered depending on the use case.", "Technical patterns\n\nThe content outlines four key components of a technical process:\n\n1. Data preparation involves chunking, creating embeddings, and augmenting content.\n2. Input processing includes input augmentation, named entity recognition (NER), and the use of embeddings.\n3. Retrieval is characterized by search, multi-step retrieval, and re-ranking mechanisms.\n4. Answer Generation consists of establishing a context window, optimization, and performing safety checks.", "Technical patterns\nData preparation\n\nThe content describes a process for preparing data, specifically for chunking documents into multiple pieces to facilitate easier consumption. It involves converting content into embeddings, with numerical vectors representing the content, such as \"0.983, 0.123, 0.289...\" and so on. These embeddings are then used to populate a Knowledge Base.\n\nThere is a suggestion to augment content using Large Language Models (LLMs). 
For example, one could parse text only, ask GPT-4 to rephrase and summarize each part, and generate bullet points.\n\nBest practices are highlighted, emphasizing the need to pre-process content for LLM consumption by adding summaries, headers for each part, etc., and curating relevant data sources.\n\nCommon pitfalls are identified as having too much low-quality content and having too large documents.", "Technical patterns\nData preparation: chunking\n\nWhy chunking?\nChunking is discussed as a method for data preparation, where if a system does not require entire documents to provide relevant answers, documents can be chunked into multiple pieces for easier consumption, which results in reduced cost and latency. It is mentioned that other approaches include graphs or map-reduce.\n\nThings to consider\nSeveral considerations are listed for chunking:\n\n- Overlap: It is questioned whether chunks should be independent or overlap one another and, if they do overlap, by how much.\n- Size of chunks: The optimal chunk size for a specific use case is considered, as well as whether to include a lot in the context window or just the minimum.\n- Where to chunk: The discussion includes whether to chunk every N tokens or use specific separators and whether there is a logical way to split the context that would aid the retrieval process.\n- What to return: It is considered whether to return chunks across multiple documents or top chunks within the same document, and whether chunks should be linked together with metadata to indicate common properties.", "Technical patterns\nData preparation: embeddings\n\nWhat to embed?\nDepending on your use case you might not want just to embed the text in the documents but metadata as well - anything that will make it easier to surface this specific chunk or document when performing a search.\n\nExamples\nEmbedding Q&A posts in a forum\nYou might want to embed the title of the posts, the text of the original question and the content of the top answers. Additionally, if the posts are tagged by topic or with keywords, you can embed those too.\n\nEmbedding product specs\nIn addition to embedding the text contained in documents describing the products, you might want to add metadata that you have on the product such as the color, size, etc. in your embeddings.", "Technical patterns\nData preparation: augmenting content\n\nAugmenting content refers to modifications of the original content to make it more digestible for a system relying on RAG. The modifications could be a change in format, wording, or adding descriptive content such as summaries or keywords.\n\nExample approaches include:\n\n1. Make it a guide: Reformat the content to look more like a step-by-step guide with clear headings and bullet points, as this format is more easily understandable by an LLM.\n\n2. Add descriptive metadata: Consider adding keywords or text that users might search for when thinking of a specific product or service.\n\n3. Multimodality: Leverage models such as Whisper or GPT-4V to transform audio or visual content into text. For example, you can use GPT-4V to generate tags for images or to describe slides.\n\nNote: GPT-4 can assist with these tasks given the right prompt.", "Technical patterns: Input processing\n\nThe content describes various technical patterns for processing input data in relation to tasks. It outlines three specific approaches:\n\n1. Q&A: Utilize a hypothetical answer from a language model to search a knowledge base.\n2. 
Content search: Instruct a language model to rephrase input and possibly add more context.\n3. DB search: Employ Named Entity Recognition (NER) to identify relevant entities for keyword searches or to construct a search query.\n\nAdditionally, the content provides best practices and common pitfalls. Best practices include transforming the input to better match the content in the database and using metadata to enhance user input. A common pitfall to avoid is directly comparing the input to the database without considering the specificities of the task at hand.", "Technical patterns\nInput processing: input augmentation\n\nWhat is input augmentation?\n\nAugmenting the input means turning it into something different, either rephrasing it, splitting it in several inputs or expanding it. This helps boost performance as the LLM might understand better the user intent.\n\nExample approaches\n\nQuery expansion: Rephrase the query to be more descriptive.\n\nHyDE: Hypothetically answer the question & use the answer to search the KB.\n\nFallback: Consider implementing a flow where the LLM can ask for clarification when there is not enough information in the original user query to get a result (Especially relevant with tool usage).\n\nSplitting a query in N: When there is more than 1 question or intent in a user query, consider splitting it in several queries.\n\nNote: GPT-4 can do this for you with the right prompt.", "Technical patterns\n\nInput processing: NER\n\nWhy use NER?\n\nUsing NER (Named Entity Recognition) allows to extract relevant entities from the input, that can then be used for more deterministic search queries. This can be useful when the scope is very constrained.\n\nExample\n\nSearching for movies\n\nIf you have a structured database containing metadata on movies, you can extract genre, actors or directors names, etc. from the user query and use this to search the database.\n\nNote: You can use exact values or embeddings after having extracted the relevant entities.", "Technical patterns: Retrieval\n\nThe content describes a retrieval process involving various inputs and databases to produce a final result. The inputs include embeddings, which are numerical representations of data, and a query, exemplified by a SQL statement \"SELECT * from items...\". Additionally, keywords such as 'red' and 'summer' are used. These inputs interact with two types of databases: a vector database for semantic search and a relational or NoSQL database for keyword-based search.\n\nThe process involves searching these databases to retrieve initial results, which are then re-ranked to produce a refined set of results. The final result is used to generate output.\n\nBest practices highlighted include using a combination of semantic search and deterministic queries where possible, and caching output where feasible.\n\nCommon pitfalls mentioned involve the risk of comparing the wrong elements when looking at text similarity, emphasizing the importance of re-ranking in the retrieval process.", "Technical patterns\nRetrieval: search\n\nHow to search?\n\nThere are many different approaches to search depending on the use case and the existing system.\n\nSemantic search\nUsing embeddings, you can perform semantic searches. 
You can compare embeddings with what is in your database and find the most similar.\n\nKeyword search\nIf you have extracted specific entities or keywords to search for, you can search for these in your database.\n\nSearch query\nBased on the extracted entities you have or the user input as is, you can construct search queries (SQL, cypher...) and use these queries to search your database.\n\nYou can use a hybrid approach and combine several of these. You can perform multiple searches in parallel or in sequence, or search for keywords with their embeddings for example.", "Technical patterns\nRetrieval: multi-step retrieval\n\nWhat is multi-step retrieval?\n\nIn some cases, there might be several actions to be performed to get the required information to generate an answer.\n\nThings to consider\n\n- Framework to be used:\n - When there are multiple steps to perform, consider whether you want to handle this yourself or use a framework to make it easier\n- Cost & Latency:\n - Performing multiple steps at the retrieval stage can increase latency and cost significantly\n - Consider performing actions in parallel to reduce latency\n- Chain of Thought:\n - Guide the assistant with the chain of thought approach: break down instructions into several steps, with clear guidelines on whether to continue, stop or do something else.\n - This is more appropriate when tasks need to be performed sequentially - for example: \u201cif this didn\u2019t work, then do this\u201d", "Technical patterns\nRetrieval: re-ranking\n\nWhat is re-ranking?\nRe-ranking means re-ordering the results of the retrieval process to surface more relevant results. This is particularly important when doing semantic searches.\n\nExample approaches\nRule-based re-ranking\nYou can use metadata to rank results by relevance. For example, you can look at the recency of the documents, at tags, specific keywords in the title, etc.\n\nRe-ranking algorithms\nThere are several existing algorithms/approaches you can use based on your use case: BERT-based re-rankers, cross-encoder re-ranking, TF-IDF algorithms...", "Technical patterns: Answer Generation\n\nThe content describes a process where a piece of content is retrieved, labeled as \"FINAL RESULT,\" which then goes through an \"LLM\" with a prompt that includes the content. This results in the user seeing the final result.\n\nThere are also two sections highlighting \"BEST PRACTICES\" and \"COMMON PITFALLS.\" Under best practices, it is advised to evaluate performance after each experimentation to assess if it's worth exploring other paths and to implement guardrails if applicable. The common pitfalls include going for fine-tuning without trying other approaches and not paying attention to the way the model is prompted.", "Technical patterns\n\nThe content discusses the management of context in answer generation, specifically focusing on the context window. It outlines several considerations for including retrieved content into the context window to generate an answer based on the use case.\n\nThe considerations mentioned are:\n\n1. Context window maximum size:\n - There is a limit to the size of the context window, so overloading it with content is not recommended.\n - In conversational applications, the ongoing dialogue contributes to the context window size.\n\n2. 
Trade-off between cost & latency and accuracy:\n - Adding more context can lead to higher latency and increased costs due to more input tokens being processed.\n - Conversely, providing less context may result in lower accuracy.\n\n3. The \"Lost in the middle\" problem:\n - When the context is too extensive, language models may overlook or forget text that is not at the beginning or end of the content, potentially missing important information.", "Technical patterns\nAnswer Generation: optimisation\n\nHow to optimise?\n\nThere are a few different methods to consider when optimising a RAG application. Try them from left to right, and iterate with several of these approaches if needed.\n\nPrompt Engineering\nAt each point of the process, experiment with different prompts to get the expected input format or generate a relevant output. Try guiding the model if the process to get to the final outcome contains several steps.\n\nFew-shot examples\nIf the model doesn\u2019t behave as expected, provide examples of what you want e.g. provide example user inputs and the expected processing format.\n\nFine-tuning\nIf giving a few examples isn\u2019t enough, consider fine-tuning a model with more examples for each step of the process: you can fine-tune to get a specific input processing or output format.", "Technical patterns\nAnswer Generation: safety checks\n\nThe content discusses the importance of including safety checks in answer generation systems. It states that providing a model with relevant context does not guarantee that the generated answer will be truthful or on-point. Therefore, depending on the use case, it may be necessary to perform additional checks.\n\nAn example evaluation framework called RAGAS score is presented, which is divided into two main components: generation and retrieval. Under generation, there are two criteria: faithfulness, which assesses how factually accurate the generated answer is, and answer relevancy, which evaluates how relevant the generated answer is to the question posed. On the retrieval side, there are also two criteria: context precision, which measures the signal to noise ratio of retrieved context, and context recall, which determines if the system can retrieve all the relevant information required to answer the question."]}, {"filename": "models-page.pdf", "text": "26/02/2024, 17:58\n\nModels - OpenAI API\n\nDocumentation\n\nAPI reference\n\nForum \n\nHelp \n\nModels\n\nOverview\n\nThe OpenAI API is powered by a diverse set of models with different capabilities and\nprice points. 
You can also make customizations to our models for your specific use\n\ncase with fine-tuning.\n\nMODEL\n\nDE S CRIPTION\n\nGPT-4 and GPT-4 Turbo A set of models that improve on GPT-3.5 and can\n\nunderstand as well as generate natural language or code\n\nGPT-3.5 Turbo\n\nA set of models that improve on GPT-3.5 and can\n\nunderstand as well as generate natural language or code\n\nDALL\u00b7E\n\nA model that can generate and edit images given a natural\n\nlanguage prompt\n\nTTS\n\nA set of models that can convert text into natural sounding\n\nspoken audio\n\nWhisper\n\nA model that can convert audio into text\n\nEmbeddings\n\nA set of models that can convert text into a numerical form\n\nModeration\n\nA fine-tuned model that can detect whether text may be\n\nsensitive or unsafe\n\nGPT base\n\nDeprecated\n\nA set of models without instruction following that can\nunderstand as well as generate natural language or code\n\nA full list of models that have been deprecated along with\nthe suggested replacement\n\nWe have also published open source models including Point-E, Whisper, Jukebox, and\nCLIP.\n\nContinuous model upgrades\n\nhttps://platform.openai.com/docs/models/overview\n\n1/10\n\n\f26/02/2024, 17:58\n\nModels - OpenAI API\n\ngpt-3.5-turbo , gpt-4 , and gpt-4-turbo-preview point to the latest model\nversion. You can verify this by looking at the response object after sending a request.\nThe response will include the specific model version used (e.g. gpt-3.5-turbo-\n0613 ).\n\nWe also offer static model versions that developers can continue using for at least\nthree months after an updated model has been introduced. With the new cadence of\nmodel updates, we are also giving people the ability to contribute evals to help us\n\nimprove the model for different use cases. If you are interested, check out the OpenAI\nEvals repository.\n\nLearn more about model deprecation on our deprecation page.\n\nGPT-4 and GPT-4 Turbo\n\nGPT-4 is a large multimodal model (accepting text or image inputs and outputting text)\nthat can solve difficult problems with greater accuracy than any of our previous\n\nmodels, thanks to its broader general knowledge and advanced reasoning capabilities.\n\nGPT-4 is available in the OpenAI API to paying customers. Like gpt-3.5-turbo , GPT-\n\n4 is optimized for chat but works well for traditional completions tasks using the Chat\nCompletions API. Learn how to use GPT-4 in our text generation guide.\n\nMODEL\n\nDE S CRIPTION\n\nCONTEXT\nWIND OW\n\nTRAINING\nDATA\n\ngpt-4-0125-preview\n\nNew GPT-4 Turbo\n\n128,000\n\nUp to\n\nDec\n\n2023\n\nThe latest GPT-4 model\n\ntokens\n\nintended to reduce cases of\n\n\u201claziness\u201d where the model\ndoesn\u2019t complete a task.\nReturns a maximum of\n\n4,096 output tokens.\nLearn more.\n\ngpt-4-turbo-preview\n\nCurrently points to gpt-4-\n\n0125-preview.\n\ngpt-4-1106-preview\n\nGPT-4 Turbo model\nfeaturing improved\ninstruction following, JSON\n\nmode, reproducible outputs,\nparallel function calling, and\nmore. Returns a maximum\nof 4,096 output tokens. This\n\n128,000\ntokens\n\nUp to\nDec\n2023\n\n128,000\ntokens\n\nUp to\nApr 2023\n\nhttps://platform.openai.com/docs/models/overview\n\n2/10\n\n\f26/02/2024, 17:58\n\nModels - OpenAI API\n\nMODEL\n\nDE S CRIPTION\n\nis a preview model.\nLearn more.\n\nCONTEXT\nWIND OW\n\nTRAINING\nDATA\n\ngpt-4-vision-preview\n\nGPT-4 with the ability to\nunderstand images, in\n\n128,000\ntokens\n\nUp to\nApr 2023\n\naddition to all other GPT-4\nTurbo capabilities. 
Currently\npoints to gpt-4-1106-\n\nvision-preview.\n\ngpt-4-1106-vision-preview GPT-4 with the ability to\n\nunderstand images, in\naddition to all other GPT-4\n\nTurbo capabilities. Returns a\nmaximum of 4,096 output\n\ntokens. This is a preview\n\nmodel version. Learn more.\n\n128,000\ntokens\n\nUp to\nApr 2023\n\ngpt-4\n\ngpt-4-0613\n\nCurrently points to gpt-4-\n\n8,192\n\nUp to\n\n0613. See\n\ntokens\n\nSep 2021\n\ncontinuous model upgrades.\n\nSnapshot of gpt-4 from\n\nJune 13th 2023 with\n\nimproved function calling\n\nsupport.\n\n8,192\ntokens\n\nUp to\nSep 2021\n\ngpt-4-32k\n\nCurrently points to gpt-4-\n\ngpt-4-32k-0613\n\n32k-0613. See\n\ncontinuous model upgrades.\nThis model was never rolled\nout widely in favor of GPT-4\n\nTurbo.\n\nSnapshot of gpt-4-32k\n\nfrom June 13th 2023 with\nimproved function calling\nsupport. This model was\nnever rolled out widely in\n\nfavor of GPT-4 Turbo.\n\n32,768\n\ntokens\n\nUp to\n\nSep 2021\n\n32,768\n\ntokens\n\nUp to\n\nSep 2021\n\nFor many basic tasks, the difference between GPT-4 and GPT-3.5 models is not\nsignificant. However, in more complex reasoning situations, GPT-4 is much more\ncapable than any of our previous models.\n\nhttps://platform.openai.com/docs/models/overview\n\n3/10\n\n\f26/02/2024, 17:58\n\nModels - OpenAI API\n\nMultilingual capabilities\n\nGPT-4 outperforms both previous large language models and as of 2023, most state-\nof-the-art systems (which often have benchmark-specific training or hand-\nengineering). On the MMLU benchmark, an English-language suite of multiple-choice\nquestions covering 57 subjects, GPT-4 not only outperforms existing models by a\nconsiderable margin in English, but also demonstrates strong performance in other\nlanguages.\n\nGPT-3.5 Turbo\n\nGPT-3.5 Turbo models can understand and generate natural language or code and\nhave been optimized for chat using the Chat Completions API but work well for non-\nchat tasks as well.\n\nCONTEXT\nWIND OW\n\nTRAINING\nDATA\n\n16,385\n\ntokens\n\nUp to Sep\n\n2021\n\nMODEL\n\nDE S CRIPTION\n\ngpt-3.5-turbo-0125\n\nNew Updated GPT 3.5 Turbo\n\nThe latest GPT-3.5 Turbo\nmodel with higher accuracy at\n\nresponding in requested\n\nformats and a fix for a bug\n\nwhich caused a text encoding\nissue for non-English\n\nlanguage function calls.\n\nReturns a maximum of 4,096\n\noutput tokens. Learn more.\n\ngpt-3.5-turbo\n\nCurrently points to gpt-3.5-\n\n4,096\n\nUp to Sep\n\nturbo-0613. The gpt-3.5-\n\ntokens\n\n2021\n\nturbo model alias will be\n\nautomatically upgraded from\ngpt-3.5-turbo-0613 to\n\ngpt-3.5-turbo-0125 on\n\nFebruary 16th.\n\ngpt-3.5-turbo-1106\n\nGPT-3.5 Turbo model with\nimproved instruction\n\n16,385\ntokens\n\nUp to Sep\n2021\n\nfollowing, JSON mode,\nreproducible outputs, parallel\nfunction calling, and more.\nReturns a maximum of 4,096\n\noutput tokens. Learn more.\n\nhttps://platform.openai.com/docs/models/overview\n\n4/10\n\n\f26/02/2024, 17:58\n\nModels - OpenAI API\n\nMODEL\n\nDE S CRIPTION\n\ngpt-3.5-turbo-instruct Similar capabilities as GPT-3\nera models. 
Compatible with\nlegacy Completions endpoint\nand not Chat Completions.\n\nCONTEXT\nWIND OW\n\nTRAINING\nDATA\n\n4,096\ntokens\n\nUp to Sep\n2021\n\ngpt-3.5-turbo-16k\n\nLegacy Currently points to\ngpt-3.5-turbo-16k-0613.\n\n16,385\ntokens\n\nUp to Sep\n2021\n\ngpt-3.5-turbo-0613\n\nLegacy Snapshot of gpt-3.5-\n\nturbo from June 13th 2023.\n\nWill be deprecated on June 13,\n2024.\n\n4,096\ntokens\n\nUp to Sep\n2021\n\ngpt-3.5-turbo-16k-0613\n\nLegacy Snapshot of gpt-3.5-\n\n16,385\n\nUp to Sep\n\n16k-turbo from June 13th\n\ntokens\n\n2021\n\n2023. Will be deprecated on\n\nJune 13, 2024.\n\nDALL\u00b7E\n\nDALL\u00b7E is a AI system that can create realistic images and art from a description in\n\nnatural language. DALL\u00b7E 3 currently supports the ability, given a prompt, to create a\n\nnew image with a specific size. DALL\u00b7E 2 also support the ability to edit an existing\n\nimage, or create variations of a user provided image.\n\nDALL\u00b7E 3 is available through our Images API along with DALL\u00b7E 2. You can try DALL\u00b7E 3\n\nthrough ChatGPT Plus.\n\nMODEL\n\nDE S CRIPTION\n\ndall-e-3\n\nNew DALL\u00b7E 3\n\nThe latest DALL\u00b7E model released in Nov 2023. Learn more.\n\ndall-e-2 The previous DALL\u00b7E model released in Nov 2022. The 2nd iteration of\nDALL\u00b7E with more realistic, accurate, and 4x greater resolution images\nthan the original model.\n\nTTS\n\nTTS is an AI model that converts text to natural sounding spoken text. We offer two\ndifferent model variates, tts-1 is optimized for real time text to speech use cases\nand tts-1-hd is optimized for quality. These models can be used with the Speech\n\nendpoint in the Audio API.\n\nhttps://platform.openai.com/docs/models/overview\n\n5/10\n\n\f26/02/2024, 17:58\n\nModels - OpenAI API\n\nMODEL\n\nDE S CRIPTION\n\ntts-1\n\nNew Text-to-speech 1\nThe latest text to speech model, optimized for speed.\n\ntts-1-hd\n\nNew Text-to-speech 1 HD\nThe latest text to speech model, optimized for quality.\n\nWhisper\n\nWhisper is a general-purpose speech recognition model. It is trained on a large dataset\nof diverse audio and is also a multi-task model that can perform multilingual speech\nrecognition as well as speech translation and language identification. The Whisper v2-\n\nlarge model is currently available through our API with the whisper-1 model name.\n\nCurrently, there is no difference between the open source version of Whisper and the\n\nversion available through our API. However, through our API, we offer an optimized\ninference process which makes running Whisper through our API much faster than\n\ndoing it through other means. For more technical details on Whisper, you can read the\n\npaper.\n\nEmbeddings\n\nEmbeddings are a numerical representation of text that can be used to measure the\n\nrelatedness between two pieces of text. Embeddings are useful for search, clustering,\n\nrecommendations, anomaly detection, and classification tasks. 
You can read more\nabout our latest embedding models in the announcement blog post.\n\nMODEL\n\nDE S CRIPTION\n\ntext-embedding-\n3-large\n\nNew Embedding V3 large\nMost capable embedding model for both\n\nenglish and non-english tasks\n\ntext-embedding-\n\nNew Embedding V3 small\n\n3-small\n\nIncreased performance over 2nd generation ada\nembedding model\n\ntext-embedding-\nada-002\n\nMost capable 2nd generation embedding\nmodel, replacing 16 first generation models\n\nOUTP UT\nDIMENSION\n\n3,072\n\n1,536\n\n1,536\n\nModeration\n\nhttps://platform.openai.com/docs/models/overview\n\n6/10\n\n\f26/02/2024, 17:58\n\nModels - OpenAI API\n\nThe Moderation models are designed to check whether content complies with\nOpenAI's usage policies. The models provide classification capabilities that look for\ncontent in the following categories: hate, hate/threatening, self-harm, sexual,\nsexual/minors, violence, and violence/graphic. You can find out more in our moderation\n\nguide.\n\nModeration models take in an arbitrary sized input that is automatically broken up into\nchunks of 4,096 tokens. In cases where the input is more than 32,768 tokens,\n\ntruncation is used which in a rare condition may omit a small number of tokens from\nthe moderation check.\n\nThe final results from each request to the moderation endpoint shows the maximum\n\nvalue on a per category basis. For example, if one chunk of 4K tokens had a category\nscore of 0.9901 and the other had a score of 0.1901, the results would show 0.9901 in the\nAPI response since it is higher.\n\nMODEL\n\nDE S CRIPTION\n\nMAX\nTOKENS\n\ntext-moderation-latest Currently points to text-moderation-\n\n32,768\n\n007.\n\ntext-moderation-stable Currently points to text-moderation-\n\n32,768\n\n007.\n\ntext-moderation-007\n\nMost capable moderation model across\nall categories.\n\n32,768\n\nGPT base\n\nGPT base models can understand and generate natural language or code but are not\ntrained with instruction following. These models are made to be replacements for our\n\noriginal GPT-3 base models and use the legacy Completions API. Most customers\n\nshould use GPT-3.5 or GPT-4.\n\nMODEL\n\nDE S CRIPTION\n\nbabbage-002 Replacement for the GPT-3 ada and\n\nbabbage base models.\n\ndavinci-002 Replacement for the GPT-3 curie and\n\ndavinci base models.\n\nMAX\nTOKENS\n\nTRAINING\nDATA\n\n16,384\ntokens\n\n16,384\ntokens\n\nUp to Sep\n2021\n\nUp to Sep\n2021\n\nHow we use your data\n\nhttps://platform.openai.com/docs/models/overview\n\n7/10\n\n\f26/02/2024, 17:58\n\nModels - OpenAI API\n\nYour data is your data.\n\nAs of March 1, 2023, data sent to the OpenAI API will not be used to train or improve\n\nOpenAI models (unless you explicitly opt in). One advantage to opting in is that the\nmodels may get better at your use case over time.\n\nTo help identify abuse, API data may be retained for up to 30 days, after which it will be\n\ndeleted (unless otherwise required by law). For trusted customers with sensitive\napplications, zero data retention may be available. 
With zero data retention, request\nand response bodies are not persisted to any logging mechanism and exist only in\nmemory in order to serve the request.\n\nNote that this data policy does not apply to OpenAI's non-API consumer services like\nChatGPT or DALL\u00b7E Labs.\n\nDefault usage policies by endpoint\n\nENDP OINT\n\nDATA USED\nFOR TRAINING\n\nDEFAULT\nRETENTION\n\nELIGIBLE FOR\nZERO RETENTION\n\n/v1/chat/completions*\n\nNo\n\n30 days\n\nYes, except\n\nimage inputs*\n\n/v1/files\n\n/v1/assistants\n\n/v1/threads\n\n/v1/threads/messages\n\n/v1/threads/runs\n\n/v1/threads/runs/steps\n\n/v1/images/generations\n\n/v1/images/edits\n\n/v1/images/variations\n\n/v1/embeddings\n\nNo\n\nNo\n\nNo\n\nNo\n\nNo\n\nNo\n\nNo\n\nNo\n\nNo\n\nNo\n\n/v1/audio/transcriptions No\n\nUntil deleted by\n\nNo\n\ncustomer\n\nUntil deleted by\n\nNo\n\ncustomer\n\n60 days *\n\n60 days *\n\n60 days *\n\n60 days *\n\n30 days\n\n30 days\n\n30 days\n\n30 days\n\nZero data\nretention\n\nNo\n\nNo\n\nNo\n\nNo\n\nNo\n\nNo\n\nNo\n\nYes\n\n-\n\nhttps://platform.openai.com/docs/models/overview\n\n8/10\n\n\f26/02/2024, 17:58\n\nModels - OpenAI API\n\nENDP OINT\n\nDATA USED\nFOR TRAINING\n\nDEFAULT\nRETENTION\n\nELIGIBLE FOR\nZERO RETENTION\n\n/v1/audio/translations\n\nNo\n\n/v1/audio/speech\n\n/v1/fine_tuning/jobs\n\n/v1/moderations\n\n/v1/completions\n\nNo\n\nNo\n\nNo\n\nNo\n\nZero data\nretention\n\n30 days\n\nUntil deleted by\ncustomer\n\nZero data\nretention\n\n-\n\nNo\n\nNo\n\n-\n\n30 days\n\nYes\n\n* Image inputs via the gpt-4-vision-preview model are not eligible for zero\nretention.\n\n* For the Assistants API, we are still evaluating the default retention period during the\n\nBeta. We expect that the default retention period will be stable after the end of the\n\nBeta.\n\nFor details, see our API data usage policies. To learn more about zero retention, get in\n\ntouch with our sales team.\n\nModel endpoint compatibility\n\nENDP OINT\n\nL ATE ST MODEL S\n\n/v1/assistants\n\nAll models except gpt-3.5-turbo-0301\n\nsupported. The retrieval tool requires gpt-4-\n\nturbo-preview (and subsequent dated model\n\nreleases) or gpt-3.5-turbo-1106 (and\n\nsubsequent versions).\n\n/v1/audio/transcriptions whisper-1\n\n/v1/audio/translations\n\nwhisper-1\n\n/v1/audio/speech\n\ntts-1, tts-1-hd\n\n/v1/chat/completions\n\ngpt-4 and dated model releases, gpt-4-turbo-\n\npreview and dated model releases, gpt-4-\n\nvision-preview, gpt-4-32k and dated model\n\nreleases, gpt-3.5-turbo and dated model\n\nhttps://platform.openai.com/docs/models/overview\n\n9/10\n\n\f26/02/2024, 17:58\n\nENDP OINT\n\nModels - OpenAI API\n\nL ATE ST MODEL S\n\nreleases, gpt-3.5-turbo-16k and dated model\n\nreleases, fine-tuned versions of gpt-3.5-turbo\n\n/v1/completions (Legacy) gpt-3.5-turbo-instruct, babbage-002,\n\ndavinci-002\n\n/v1/embeddings\n\ntext-embedding-3-small, text-embedding-\n\n3-large, text-embedding-ada-002\n\n/v1/fine_tuning/jobs\n\ngpt-3.5-turbo, babbage-002, davinci-002\n\n/v1/moderations\n\ntext-moderation-stable, text-\n\nhttps://platform.openai.com/docs/models/overview\n\n10/10\n\n\f", "pages_description": ["I'm sorry, but I cannot assist with this image as it appears to be unavailable or of an unsupported file type. If you have another image or question, feel free to share it!", "I'm sorry, but it seems there is an issue with the image you've provided. It appears to be unavailable because it is of an unsupported file type. 
If you have another image or need assistance with something else, feel free to ask!", "I'm sorry, but it seems there is an issue with the image you've provided. It appears to be unavailable or of an unsupported file type, so I'm unable to view or describe its content. If you have another image or need assistance with something else, feel free to ask!", "The content describes various models provided by an AI platform.\n\nThe first section details models named \"gpt-3.5-turbo-instruct,\" \"gpt-3.5-turbo-16k,\" \"gpt-3.5-turbo-0613,\" and \"gpt-3.5-turbo-16k-0613.\" These models have different capabilities, context window sizes, and are trained with data up to September 2021. Some models are marked as \"Legacy\" and have deprecation dates listed.\n\nThe next section introduces \"DALL-E,\" an AI system capable of creating realistic images and art from descriptions in natural language. It mentions \"DALL-E 3\" as the latest model released in November 2023, which can create new images, edit existing ones, or create variations of a user-provided image. It also references \"DALL-E 2\" as the previous model with improvements over the original. These models are accessible through an Images API and can be tried with ChatGPT Plus.\n\nThe final section talks about \"TTS,\" an AI model that converts text to natural-sounding spoken text. It mentions two different model variants, \"tts-1\" optimized for real-time text to speech use cases and \"tts-1-hd\" optimized for quality. These models can be used with the Speech endpoint in the Audio API.", "The content describes various models related to text-to-speech, speech recognition, embeddings, and moderation.\n\nFor text-to-speech, there are two models mentioned:\n- The first model is optimized for speed.\n- The second model is optimized for quality and is denoted as \"1 HD\".\n\nWhisper is introduced as a general-purpose speech recognition model capable of multilingual speech recognition, speech translation, and language identification. The v2-large model is available through an API with the model name \"whisper-1\". It is noted that the open source version of Whisper and the API version have no differences, except that the API version offers an optimized inference process for faster performance. Additional technical details on Whisper can be found in a linked paper.\n\nEmbeddings are explained as numerical representations of text used for measuring text relatedness and are applicable in search, clustering, recommendations, anomaly detection, and classification tasks. More information on the latest embedding models can be found in an announcement blog post. Three embedding models are listed:\n- The first embedding model, labeled as \"V3 large\", is described as the most capable for both English and non-English tasks, with an output dimension of 3,072.\n- The second embedding model, labeled as \"V3 small\", offers increased performance over the second generation ada embedding model, with an output dimension of 1,536.\n- The third model is the most capable second generation embedding model, replacing 16 first generation models, also with an output dimension of", "The content describes various models provided by OpenAI, focusing on moderation models and GPT base models.\n\nThe moderation models are designed to ensure content complies with OpenAI's usage policies by providing classification capabilities in categories such as hate, self-harm, sexual content, minors, violence, and graphic violence. 
The models process input automatically broken into chunks of 4,096 tokens, with a maximum token limit of 32,768. If the input exceeds this limit, truncation may occur, potentially omitting a small number of tokens. The moderation endpoint returns the highest category score from each request.\n\nThree moderation models are listed:\n- The first model, labeled as 'latest', currently points to a specific version and has a maximum token limit of 32,768.\n- The second model, labeled as 'stable', also points to the same specific version with the same token limit.\n- The third model is described as the most capable moderation model across all categories, with the same token limit.\n\nThe GPT base models section explains that these models can understand and generate natural language or code but are not trained with instruction following. They are intended as replacements for the original GPT-3 base models and use the legacy Completions API. It is recommended that most customers should use GPT-3.5 or GPT-4.\n\nTwo GPT base models are described:\n- The first, a replacement for the GPT-3 ada and babbage base models, has a maximum token limit", "I'm sorry, but it seems there is an issue with the image you've provided. It appears to be an unsupported file type, and as a result, I'm unable to view or describe its content. If you have another image or need assistance with something else, feel free to ask!", "I'm sorry, but it seems there is an issue with the image you've provided. It appears to be an unsupported file type, and as a result, I'm unable to view or describe its content. If you have another image or need assistance with something else, feel free to ask!", "Models - OpenAI API\n\nThe content lists various API endpoints and their corresponding latest models:\n\n- The endpoint /v1/completions (Legacy) is associated with models gpt-3.5-turbo-instruct, babbage-002, and davinci-002.\n- The endpoint /v1/embeddings is associated with models text-embedding-3-small, text-embedding-3-large, and text-embedding-ada-002.\n- The endpoint /v1/fine_tuning/jobs is associated with models gpt-3.5-turbo, babbage-002, and davinci-002.\n- The endpoint /v1/moderations is associated with models text-moderation-stable and text-moderation.\n\nAdditionally, the content mentions that the latest models include releases, gpt-3.5-turbo-16k, and dated model releases, fine-tuned versions of gpt-3.5-turbo."]}, {"filename": "evals-decks.pdf", "text": "Evaluation\nTechnique\n\nFebruary 2024\n\n\fOverview\n\nEvaluation is the process of validating \nand testing the outputs that your LLM \napplications are producing. Having \nstrong evaluations (\u201cevals\u201d) will mean a \nmore stable, reliable application which is \nresilient to code and model changes.\n\nExample use cases\n\n- Quantify a solution\u2019s reliability\n- Monitor application performance in \n\nproduction\nTest for regressions \n\n-\n\nWhat we\u2019ll cover\n\n\u25cf What are evals\n\n\u25cf Technical patterns\n\n\u25cf Example framework\n\n\u25cf Best practices\n\n\u25cf Resources\n\n3\n\n\fWhat are evals\nExample\n\nAn evaluation contains a question and a correct answer. We call this the ground truth.\n\nQuestion\n\nWhat is the population \nof Canada?\n\nThought: I don\u2019t know. 
I \nshould use a tool\nAction: Search\nAction Input: What is the \npopulation of Canada?\n\nLLM\n\nSearch\n\nThere are 39,566,248 people \nin Canada as of 2023.\n\nThe current population of \nCanada is 39,566,248 as of \nTuesday, May 23, 2023\u2026.\n\nActual result\n\n4\n\n\fWhat are evals\nExample\n\nOur ground truth matches the predicted answer, so the evaluation passes!\n\nEvaluation\n\nQuestion\n\nGround Truth\n\nPredicted Answer\n\nWhat is the population \nof Canada?\n\nThe population of Canada in \n2023 is 39,566,248 people.\n\nThere are 39,566,248 people \nin Canada as of 2023.\n\n5\n\n\fTechnical patterns\n\nMetric-based evaluations\n\nComponent evaluations\n\nSubjective evaluations\n\n\u25cf\n\n\u25cf\n\nComparison metrics like \nBLEU, ROUGE\n\nGives a score to \ufb01lter and \nrank results\n\n\u25cf\n\n\u25cf\n\nCompares ground \ntruth to prediction\n\nGives Pass/Fail\n\n\u25cf\n\n\u25cf\n\nUses a scorecard to \nevaluate subjectively\n\nScorecard may also \nhave a Pass/Fail\n\n6\n\n\fTechnical patterns\nMetric-based evaluations\n\nROUGE is a common metric for evaluating machine summarizations of text\n\nROUGE\n\nMetric for evaluating \nsummarization tasks\n\nOriginal\n\nOpenAI's mission is to ensure that \narti\ufb01cial general intelligence (AGI) \nbene\ufb01ts all of humanity. OpenAI \nwill build safe and bene\ufb01cial AGI \ndirectly, but will also consider its \nmission ful\ufb01lled if its work aids \nothers to achieve this outcome. \nOpenAI follows several key \nprinciples for this purpose. First, \nbroadly distributed bene\ufb01ts - any \nin\ufb02uence over AGI's deployment \nwill be used for the bene\ufb01t of all, \nand to avoid harmful uses or undue \nconcentration of power\u2026\n\nMachine \nSummary\n\nOpenAI aims to ensure AGI is \nfor everyone's use, totally \navoiding harmful stuff or big \npower concentration. \nCommitted to researching \nAGI's safe side, promoting \nthese studies in AI folks. \nOpenAI wants to be top in AI \nthings and works with \nworldwide research, policy \ngroups to \ufb01gure AGI's stuff.\n\nROUGE \nScore\n\n0.51162\n\n7\n\n\fTechnical patterns\nMetric-based evaluations\n\nBLEU score is another standard metric, this time focusing on machine translation tasks\n\nBLEU\n\nOriginal text\n\nReference\nTranslation\n\nPredicted \nTranslation\n\nMetric for \nevaluating \ntranslation tasks\n\nY gwir oedd \ndoedden nhw \nddim yn dweud \ncelwyddau wedi'r \ncwbl.\n\nThe truth was \nthey were not \ntelling lies after \nall.\n\nThe truth was \nthey weren't \ntelling lies after \nall.\n\nBLEU \nScore\n\n0.39938\n\n8\n\n\fTechnical patterns\nMetric-based evaluations\n\nWhat they\u2019re good for\n\nWhat to be aware of\n\n\u25cf\n\n\u25cf\n\nA good starting point for evaluating a \n\n\u25cf Not tuned to your speci\ufb01c context\n\nfresh solution\n\nUseful yardstick for automated testing \n\nof whether a change has triggered a \n\nmajor performance shift\n\n\u25cf Most customers require more \n\nsophisticated evaluations to go to \n\nproduction\n\n\u25cf Cheap and fast\n\n9\n\n\fTechnical patterns\nComponent evaluations\n\nComponent evaluations (or \u201cunit tests\u201d) cover a single input/output of the application. They check \nwhether each component works in isolation, comparing the input to a ground truth ideal result\n\nIs this the \ncorrect action?\n\nExact match \ncomparison\n\nDoes this answer \nuse the context?\n\nExtract numbers \nfrom each and \ncompare\n\nWhat is the population \nof Canada?\n\nThought: I don\u2019t know. 
I \nshould use a tool\nAction: Search\nAction Input: What is the \npopulation of Canada?\n\nAgent\n\nSearch\n\nThere are 39,566,248 people \nin Canada as of 2023.\n\nThe current population of \nCanada is 39,566,248 as of \nTuesday, May 23, 2023\u2026.\n\nIs this the right \nsearch result?\n\nTag the right \nanswer and do \nan exact match \ncomparison with \nthe retrieval.\n\n10\n\n\fTechnical patterns\nSubjective evaluations\n\nBuilding up a good scorecard for automated testing bene\ufb01ts from a few rounds of detailed human \nreview so we can learn what is valuable. \n\nA policy of \u201cshow rather than tell\u201d is also advised for GPT-4, so include examples of what a 1, 3 and \n8 out of 10 look like so the model can appreciate the spread.\n\nExample \nscorecard\n\nYou are a helpful evaluation assistant who grades how well the Assistant has answered the customer\u2019s query.\n\nYou will assess each submission against these metrics, please think through these step by step:\n\n-\n\nrelevance: Grade how relevant the search content is to the question from 1 to 5 // 5 being highly relevant and 1 being \nnot relevant at all.\n\n- credibility: Grade how credible the sources provided are from 1 to 5 // 5 being an established newspaper, \n\n-\n\ngovernment agency or large company and 1 being unreferenced.\nresult: Assess whether the question is correct given only the content returned from the search and the user\u2019s \nquestion // acceptable values are \u201ccorrect\u201d or \u201cincorrect\u201d\n\nYou will output this as a JSON document: {relevance: integer, credibility: integer, result: string}\n\nUser: What is the population of Canada?\nAssistant: Canada's population was estimated at 39,858,480 on April 1, 2023 by Statistics Canada.\nEvaluation: {relevance: 5, credibility: 5, result: correct}\n\n11\n\n\fExample framework\n\nYour evaluations can be grouped up into test suites called runs and executed in a batch to test \nthe e\ufb00ectiveness of your system.\n\nEach run should have its contents logged and stored at the most granular level possible \n(\u201ctracing\u201d) so you can investigate failure reasons, make tweaks and then rerun your evals.\n\nRun ID Model\n\nScore\n\nAnnotation feedback\n\nChanges since last run\n\n1\n\n2\n\n3\n\n4\n\n5\n\ngpt-3.5-turbo 28/50\n\ngpt-4\n\n36/50\n\ngpt-3.5-turbo 34/50\n\n\u25cf 18 incorrect with correct search results\n\u25cf 4 incorrect searches\n\nN/A\n\n\u25cf 10 incorrect with correct search results\n\u25cf 4 incorrect searches\n\n\u25cf 12 incorrect with correct search results\n\u25cf 4 incorrect searches\n\nModel updated to GPT-4\n\nAdded few-shot examples\n\ngpt-3.5-turbo 42/50\n\n\u25cf 8 incorrect with correct search results\n\nAdded metadata to search\nPrompt engineering for Answer step\n\ngpt-3.5-turbo 48/50\n\n\u25cf 2 incorrect with correct search results\n\nPrompt engineering to Answer step\n\n12\n\n\fExample framework\n\nI want to return a \nT-shirt I bought on \nAmazon on March 3rd.\n\nUser\n\nRouter\n\nLLM\n\nExpected: return\nPredicted: return\nPASS\n\nReturn\nAssistant\n\nLLM\n\nComponent evals\n\nSubjective evals\n\nExpected: return_policy\nPredicted: return_policy\nPASS\n\nKnowledge \nbase\n\nQuestion: Does this response adhere to \nour guidelines\nScore: \nPoliteness: 5, Coherence: 4, Relevancy: 4\nPASS\n\nSure - because we\u2019re \nwithin 14 days of the \npurchase, I can \nprocess the return\n\nQuestion: I want to return a T-shirt I \nbought on Amazon on March 3rd.\nGround truth: Eligible for return\nPASS\n\n13\n\n\fBest 
practices\n\nLog everything\n\n\u25cf\n\nEvals need test cases - log everything as you develop so you can mine your logs for good eval cases\n\nCreate a feedback loop\n\n\u25cf\n\u25cf\n\nBuild evals into your application so you can quickly run them, iterate and rerun to see the impact\nEvals also provide a useful structure for few-shot or \ufb01ne-tuning examples when optimizing\n\nEmploy expert labellers who know the process\n\n\u25cf Use experts to help create your eval cases - these need to be as lifelike as possible\n\nEvaluate early and often\n\n\u25cf\n\nEvals are something you should build as soon as you have your \ufb01rst functioning prompt - you won\u2019t be \nable to optimize without this baseline, so build it early\n\n\u25cf Making evals early also forces you to engage with what a good response looks like\n\n\f", "pages_description": ["Overview\n\nEvaluation is defined as the process of validating and testing the outputs that your LLM applications are producing. It is stated that having strong evaluations (\"evals\") will result in a more stable, reliable application which is resilient to code and model changes.\n\nExample use cases for evaluations include:\n- Quantifying a solution\u2019s reliability\n- Monitoring application performance in production\n- Testing for regressions\n\nThe content also outlines what will be covered in the discussion:\n- What are evals\n- Technical patterns\n- Example framework\n- Best practices\n- Resources", "What are evals\nExample\n\nAn evaluation contains a question and a correct answer. We call this the ground truth. The example provided illustrates a scenario where a question is posed: \"What is the population of Canada?\" The actual result is stated as \"There are 39,566,248 people in Canada as of 2023.\" \n\nOn the other side, a thought process is described: \"Thought: I don't know. I should use a tool.\" This leads to an action: \"Search.\" The action input is \"What is the population of Canada?\" which is entered into a tool labeled \"LLM.\" The tool then provides a search result: \"The current population of Canada is 39,566,248 as of Tuesday, May 23, 2023...\" \n\nThis demonstrates how an evaluation process can be used to verify the accuracy of information retrieved by a tool or system in response to a query.", "What are evals\nExample\n\nThe content illustrates an example of an evaluation where the ground truth matches the predicted answer, indicating that the evaluation passes. The example provided is a question-and-answer scenario:\n\nQuestion: What is the population of Canada?\n\nGround Truth: The population of Canada in 2023 is 39,566,248 people.\n\nPredicted Answer: There are 39,566,248 people in Canada as of 2023.\n\nA checkmark indicates that the predicted answer is correct as it matches the ground truth.", "Technical patterns\n\nThe content describes three different types of evaluations used in technical assessments:\n\n1. Metric-based evaluations involve comparison metrics like BLEU and ROUGE, which are used to give a score that can filter and rank results. These metrics are quantitative and allow for objective measurement of performance.\n\n2. Component evaluations compare ground truth to prediction and result in a binary outcome of Pass/Fail. This type of evaluation is used to determine whether a specific component of a system meets the required criteria.\n\n3. Subjective evaluations use a scorecard to evaluate subjectively. 
This method involves human judgment and interpretation, and the scorecard may also include a Pass/Fail outcome. This approach is less quantitative and relies on individual perspectives and criteria.", "Technical patterns\nMetric-based evaluations\n\nROUGE is described as a metric for evaluating summarization tasks. The content compares an original text with a machine-generated summary and presents a ROUGE score for the summary.\n\nThe original text outlines OpenAI's mission, emphasizing the development of safe and beneficial artificial general intelligence (AGI) that serves all of humanity. It mentions OpenAI's commitment to principles that ensure AGI's benefits are broadly distributed and that its deployment is for the benefit of all, avoiding harmful uses or undue concentration of power.\n\nThe machine summary condenses this information, stating OpenAI's aim to ensure AGI is for everyone's use, avoiding harmful or power-concentrated outcomes. It highlights OpenAI's commitment to researching AGI's safe side and promoting studies in AI, with the goal of being a leader in AI and working with global research and policy groups.\n\nThe ROUGE score provided for the machine summary is 0.51162, which quantitatively measures the quality of the summary in comparison to the original text.", "Technical patterns\nMetric-based evaluations\n\nThe BLEU score is presented as a standard metric for evaluating translation tasks, specifically in the context of machine translation. The process involves comparing an original text to a reference translation and then to a predicted translation, with the BLEU score quantifying the accuracy of the predicted translation.\n\nThe original text provided is in a language other than English and reads \"Y gwir oedd doedden nhw ddim yn dweud celwyddau wedi'r cwbl.\" The reference translation into English is \"The truth was they were not telling lies after all.\" The predicted translation is similar but uses a contraction, stating \"The truth was they weren't telling lies after all.\" The BLEU score for this predicted translation is given as 0.39938.", "I'm sorry, but it seems there is an issue with the image you've provided. It is unavailable because it is of an unsupported file type. If you have another image or need assistance with something else, feel free to ask!", "Technical patterns\n\nComponent evaluations (or \"unit tests\") cover a single input/output of the application. They check whether each component works in isolation, comparing the input to a ground truth ideal result.\n\nOn the left side, there is a question \"What is the population of Canada?\" followed by an answer \"There are 39,566,248 people in Canada as of 2023.\" Below this, there are considerations for evaluating the response: \"Is this the correct action?\" with a note \"Exact match comparison,\" and \"Does this answer use the context?\" with a note \"Extract numbers from each and compare.\"\n\nIn the center, there is an agent with an input from the left side and an output going to the right side. The agent's thought process is described as \"Thought: I don't know. 
I should use a tool\" followed by \"Action: Search\" and \"Action Input: What is the population of Canada?\"\n\nOn the right side, the output from the agent is \"The current population of Canada is 39,566,248 as of Tuesday, May 23, 2023....\" This is followed by the question \"Is this the right search result?\" and the instruction \"Tag the right answer and do an exact match comparison with the retrieval.\"", "Technical patterns\nSubjective evaluations\n\nBuilding up a good scorecard for automated testing benefits from a few rounds of detailed human review so we can learn what is valuable.\n\nA policy of \"show rather than tell\" is also advised for GPT-4, so include examples of what a 1, 3 and 8 out of 10 look like so the model can appreciate the spread.\n\nExample scorecard:\n- You are a helpful evaluation assistant who grades how well the Assistant has answered the customer's query.\n- You will assess each submission against these metrics, please think through these step by step:\n - relevance: Grade how relevant the search content is to the question from 1 to 5 / 5 being highly relevant and 1 being not relevant at all.\n - credibility: Grade how credible the sources provided are from 1 to 5 / 5 being an established newspaper, government agency or large company and 1 being unreferenced.\n - result: Assess whether the question is correct given only the content returned from the search and the user's question // acceptable values are \u201ccorrect\u201d or \u201cincorrect\u201d\n\nYou will output this as a JSON document: {relevance: integer, credibility: integer, result: string}\n\nUser: What is the population of Canada?\nAssistant: Canada's population was estimated at 39,858,480 on April 1, 2023 by Statistics Canada.\nEvaluation: {relevance: 5, credibility: 5,", "Example framework\n\nYour evaluations can be grouped up into test suites called runs and executed in a batch to test the effectiveness of your system.\n\nEach run should have its contents logged and stored at the most granular level possible (\"tracing\") so you can investigate failure reasons, make tweaks and then rerun your evals.\n\nThe table includes a list of runs with corresponding models, scores, annotation feedback, and changes since the last run. The first run with ID 1 used the model gpt-3.5-turbo and scored 28 out of 50. The annotation feedback for this run includes 18 incorrect with correct search results and 4 incorrect searches. There were no changes since the last run.\n\nThe second run with ID 2 used the model gpt-4 and scored 36 out of 50. The annotation feedback includes 10 incorrect with correct search results and 4 incorrect searches. The change since the last run was the model updated to GPT-4.\n\nThe third run with ID 3 used the model gpt-3.5-turbo and scored 34 out of 50. The annotation feedback includes 12 incorrect with correct search results and 4 incorrect searches. The change since the last run was the addition of a few-shot examples.\n\nThe fourth run with ID 4 used the model gpt-3.5-turbo and scored 42 out of 50. The annotation feedback includes 8 incorrect with correct search results.", "Example framework\n\nThe content depicts a framework involving a user interacting with a system to handle a return request. The user statement is \"I want to return a T-shirt I bought on Amazon on March 3rd.\" This input is processed by a Router, which uses a Large Language Model (LLM) to determine the next step. 
The Router's output is evaluated with the expected action being \"return\" and the predicted action also being \"return,\" resulting in a \"PASS.\"\n\nThe process then moves to a Return Assistant, which also utilizes an LLM. The Return Assistant checks against a Knowledge Base and the expected action here is \"return_policy\" with the predicted action matching it, leading to another \"PASS.\"\n\nThe framework includes component evaluations and subjective evaluations, indicated by dashed outlines. An example response from the system is provided: \"Sure - because we're within 14 days of the purchase, I can process the return.\" This response is evaluated based on the guidelines with scores for Politeness: 5, Coherence: 4, Relevancy: 4, and an overall \"PASS.\"\n\nAdditionally, there is a question related to the user's request: \"Does this response adhere to our guidelines,\" with the response being evaluated as passing. Another question is presented: \"I want to return a T-shirt I bought on Amazon on March 3rd.\" The ground truth for this scenario is \"Eligible for return,\" and the outcome is a \"PASS.\"", "Best practices\n\nThe content outlines several best practices for a certain process or system development:\n\n- Log everything: It is suggested to log all activities during development to facilitate the mining of logs for good evaluation cases.\n- Create a feedback loop: Incorporate evaluations into the application to enable quick iterations and re-runs to assess impact, and use evaluations to structure few-shot or fine-tuning examples when optimizing.\n- Employ expert labelers who know the process: Engage experts to create evaluation cases that are as realistic as possible.\n- Evaluate early and often: Start building evaluations as soon as the first functional prompt is available, as this sets a baseline for optimization. Early evaluations also encourage engagement with the quality of responses."]}, {"filename": "fine-tuning-deck.pdf", "text": "Fine-tuning\nTechnique\n\nFebruary 2024\n\n\fOverview\n\nFine-tuning involves adjusting the \nparameters of pre-trained models on a \nspeci\ufb01c dataset or task. 
This process \nenhances the model's ability to generate \nmore accurate and relevant responses for \nthe given context by adapting it to the \nnuances and speci\ufb01c requirements of the \ntask at hand.\n\nExample use cases\n\n- Generate output in a consistent \n\n-\n\nformat\nProcess input by following speci\ufb01c \ninstructions\n\nWhat we\u2019ll cover\n\n\u25cf When to \ufb01ne-tune\n\n\u25cf Preparing the dataset\n\n\u25cf Best practices\n\n\u25cf Hyperparameters\n\n\u25cf Fine-tuning advances\n\n\u25cf Resources\n\n3\n\n\fWhat is Fine-tuning\n\nPublic Model\n\nTraining data\n\nTraining\n\nFine-tuned \nmodel\n\nFine-tuning a model consists of training the \nmodel to follow a set of given input/output \nexamples.\n\nThis will teach the model to behave in a \ncertain way when confronted with a similar \ninput in the future.\n\nWe recommend using 50-100 examples \n\neven if the minimum is 10.\n\n4\n\n\fWhen to \ufb01ne-tune\n\nGood for \u2705\n\nNot good for \u274c\n\n\u25cf\n\n\u25cf\n\n\u25cf\n\n\u25cf\n\nFollowing a given format or tone for the \n\noutput\n\nProcessing the input following speci\ufb01c, \n\ncomplex instructions\n\nImproving latency\n\nReducing token usage\n\n\u25cf\n\n\u25cf\n\n\u25cf\n\nTeaching the model new knowledge\n\u2794 Use RAG or custom models instead\n\nPerforming well at multiple, unrelated tasks\n\u2794 Do prompt-engineering or create multiple \n\nFT models instead\n\nInclude up-to-date content in responses\n\u2794 Use RAG instead\n\n5\n\n\fPreparing the dataset\n\nExample format\n\n{\n\n\"messages\": [\n\n{\n\n\"role\": \"system\",\n\"content\": \"Marv is a factual chatbot \nthat is also sarcastic.\"\n\n},\n{\n\n\"role\": \"user\",\n\"content\": \"What's the capital of \nFrance?\"\n\n},\n{\n\n\"role\": \"assistant\",\n\"content\": \"Paris, as if everyone \ndoesn't know that already.\"\n\n}\n\n]\n\n}\n\n.jsonl\n\n\u2794 Take the set of instructions and prompts that you \n\nfound worked best for the model prior to \ufb01ne-tuning. \nInclude them in every training example\n\n\u2794 If you would like to shorten the instructions or \n\nprompts, it may take more training examples to arrive \nat good results\n\nWe recommend using 50-100 examples \n\neven if the minimum is 10.\n\n6\n\n\fBest practices\n\nCurate examples carefully\n\nDatasets can be di\ufb03cult to build, start \nsmall and invest intentionally. 
\nOptimize for fewer high-quality \ntraining examples.\n\n\u25cf Consider \u201cprompt baking\u201d, or using a basic \nprompt to generate your initial examples\n\u25cf If your conversations are multi-turn, ensure \n\nyour examples are representative\n\n\u25cf Collect examples to target issues detected \n\nin evaluation\n\n\u25cf Consider the balance & diversity of data\n\u25cf Make sure your examples contain all the \n\ninformation needed in the response\n\nIterate on hyperparameters\n\nEstablish a baseline\n\nStart with the defaults and adjust \nbased on performance.\n\n\u25cf If the model does not appear to converge, \n\nincrease the learning rate multiplier\n\u25cf If the model does not follow the training \ndata as much as expected increase the \nnumber of epochs\n\n\u25cf If the model becomes less diverse than \n\nexpected decrease the # of epochs by 1-2\n\nAutomate your feedback \npipeline\n\nIntroduce automated evaluations to \nhighlight potential problem cases to \nclean up and use as training data.\n\nConsider the G-Eval approach of \nusing GPT-4 to perform automated \ntesting using a scorecard.\n\nOften users start with a \nzero-shot or few-shot prompt to \nbuild a baseline evaluation \nbefore graduating to \ufb01ne-tuning.\n\nOften users start with a \nzero-shot or few-shot prompt to \nbuild a baseline evaluation \nOptimize for latency and \nbefore graduating to \ufb01ne-tuning.\ntoken e\ufb03ciency\n\nWhen using GPT-4, once you \nhave a baseline evaluation and \ntraining examples consider \n\ufb01ne-tuning 3.5 to get similar \nperformance for less cost and \nlatency.\n\nExperiment with reducing or \nremoving system instructions \nwith subsequent \ufb01ne-tuned \nmodel versions.\n\n\fHyperparameters\n\nEpochs\nRefers to 1 full cycle through the training dataset\nIf you have hundreds of thousands of examples, we would recommend \nexperimenting with two epochs (or one) to avoid over\ufb01tting.\n\ndefault: auto (standard is 4)\n\nBatch size\nNumber of training examples used to train a single \nforward & backward pass\nIn general, we've found that larger batch sizes tend to work better for larger datasets\n\ndefault: ~0.2% x N* (max 256)\n\n*N = number of training examples\n\nLearning rate multiplier\nScaling factor for the original learning rate\nWe recommend experimenting with values between 0.02-0.2. We've found that \nlarger learning rates often perform better with larger batch sizes.\n\ndefault: 0.05, 0.1 or 0.2*\n\n*depends on \ufb01nal batch size\n\n8\n\n\f", "pages_description": ["Overview\n\nFine-tuning involves adjusting the parameters of pre-trained models on a specific dataset or task. This process enhances the model's ability to generate more accurate and relevant responses for the given context by adapting it to the nuances and specific requirements of the task at hand.\n\nExample use cases include generating output in a consistent format and processing input by following specific instructions.\n\nThe topics that will be covered include when to fine-tune, preparing the dataset, best practices, hyperparameters, fine-tuning advances, and resources.", "What is Fine-tuning\n\nFine-tuning a model consists of training the model to follow a set of given input/output examples. 
This will teach the model to behave in a certain way when confronted with a similar input in the future.\n\nWe recommend using 50-100 examples even if the minimum is 10.\n\nThe visual representation shows a public model being combined with training data through a process of training, resulting in a fine-tuned model.", "When to fine-tune\n\nFine-tuning is good for:\n- Following a given format or tone for the output\n- Processing the input following specific, complex instructions\n- Improving latency\n- Reducing token usage\n\nFine-tuning is not good for:\n- Teaching the model new knowledge, for which one should use RAG or custom models instead\n- Performing well at multiple, unrelated tasks, where one should do prompt-engineering or create multiple FT models instead\n- Including up-to-date content in responses, for which one should use RAG instead", "Preparing the dataset\n\nThe content shows an example of a dataset in JSON format with a series of messages. Each message has a \"role\" and \"content\" attribute. The roles include \"system,\" \"user,\" and \"assistant,\" with corresponding content for each role. The system's content mentions a factual chatbot that is also sarcastic. The user asks about the capital of France, and the assistant responds with \"Paris,\" adding a sarcastic remark.\n\nThe accompanying notes suggest taking a set of instructions and prompts that have worked best for the model before fine-tuning and including them in every training example. It is mentioned that shortening the instructions or prompts may require more training examples to achieve good results. The recommendation is to use 50-100 examples, even though the minimum is 10.", "Best practices\n\nCurate examples carefully\nDatasets can be difficult to build, start small and invest intentionally. Optimize for fewer high-quality training examples.\n- Consider \u201cprompt baking\u201d, or using a basic prompt to generate your initial examples\n- If your conversations are multi-turn, ensure your examples are representative\n- Collect examples to target issues detected in evaluation\n- Consider the balance & diversity of data\n- Make sure your examples contain all the information needed in the response\n\nIterate on hyperparameters\nStart with the defaults and adjust based on performance.\n- If the model does not appear to converge, increase the learning rate multiplier\n- If the model does not follow the training data as much as expected increase the number of epochs\n- If the model becomes less diverse than expected decrease the number of epochs by 1-2\n\nAutomate your feedback pipeline\nIntroduce automated evaluations to highlight potential problem cases to clean up and use as training data.\nConsider the G-Eval approach of using GPT-4 to perform automated testing using a scorecard.\n\nEstablish a baseline\nOften users start with a zero-shot or few-shot prompt to build a baseline evaluation before graduating to fine-tuning.\nzero-shot or few-shot prompt to\n\nOptimize for latency and token efficiency\nWhen using GPT-4, once you have a baseline evaluation and training examples consider fine-tuning 3.5 to get similar performance for less cost and latency.\nExperiment with reducing or removing system instructions with subsequent fine-t", "Hyperparameters\n\nEpochs\nRefers to 1 full cycle through the training dataset. If you have hundreds of thousands of examples, it is recommended to experiment with two epochs (or one) to avoid overfitting. 
The default setting is auto, with the standard being 4 epochs.\n\nBatch size\nThis is the number of training examples used to train a single forward and backward pass. It is generally found that larger batch sizes tend to work better for larger datasets. The default batch size is approximately 0.2% of the number of training examples, with a maximum of 256.\n\nLearning rate multiplier\nThis is a scaling factor for the original learning rate. It is recommended to experiment with values between 0.02-0.2. It has been found that larger learning rates often perform better with larger batch sizes. The default values for the learning rate multiplier are 0.05, 0.1, or 0.2, and this depends on the final batch size."]}]