diff --git a/articles/techniques_to_improve_reliability.md b/articles/techniques_to_improve_reliability.md index 28a285d9..433e1615 100644 --- a/articles/techniques_to_improve_reliability.md +++ b/articles/techniques_to_improve_reliability.md @@ -33,10 +33,10 @@ A: Let's think step by step. ``` ```text-davinci-002 -There are 16 balls in total. -Half of the balls are golf balls. -That means that there are 8 golf balls. -Half of the golf balls are blue. +There are 16 balls in total. +Half of the balls are golf balls. +That means that there are 8 golf balls. +Half of the golf balls are blue. That means that there are 4 blue golf balls. ``` @@ -65,73 +65,73 @@ The rest of this article shares techniques for improving reliability of large la One way to give a model more time and space to think is to break tasks into simpler pieces. As an example, consider a task where we ask the model a multiple-choice question about some text - in this case, a game of Clue. When asked directly, `text-davinci-002` isn't able to put clues 3 & 5 together, and answers incorrectly: - + ```text-davinci-002 Use the following clues to answer the following multiple-choice question. - + Clues: 1. Miss Scarlett was the only person in the lounge. 2. The person with the pipe was in the kitchen. 3. Colonel Mustard was the only person in the observatory. 4. Professor Plum was not in the library nor the billiard room. 5. The person with the candlestick was in the observatory. - + Question: Was Colonel Mustard in the observatory with the candlestick? (a) Yes; Colonel Mustard was in the observatory with the candlestick (b) No; Colonel Mustard was not in the observatory with the candlestick (c) Unknown; there is not enough information to determine whether Colonel Mustard was in the observatory with the candlestick - + Solution: ``` - + ```text-davinci-002 (c) Unknown; there is not enough information to determine whether Colonel Mustard was in the observatory with the candlestick ``` - + Although clues 3 and 5 establish that Colonel Mustard was the only person in the observatory and that the person in the observatory had the candlestick, the model fails to combine them into a correct answer of (a) Yes. - + However, instead of asking for the answer directly, we can split the task into three pieces: - + - First, go through the clues one by one and consider whether the clue is potentially relevant - Second, combine the relevant clues to reason out the answer to the question - Third, write the final answer: either (a), (b), or (c) - + ```text-davinci-002 Use the following clues to answer the following multiple-choice question, using the following procedure: (1) First, go through the clues one by one and consider whether the clue is potentially relevant (2) Second, combine the relevant clues to reason out the answer to the question (3) Third, map the answer to one of the multiple choice answers: either (a), (b), or (c) - + Clues: 1. Miss Scarlett was the only person in the lounge. 2. The person with the pipe was in the kitchen. 3. Colonel Mustard was the only person in the observatory. 4. Professor Plum was not in the library nor the billiard room. 5. The person with the candlestick was in the observatory. - + Question: Was Colonel Mustard in the observatory with the candlestick? 
(a) Yes; Colonel Mustard was in the observatory with the candlestick (b) No; Colonel Mustard was not in the observatory with the candlestick (c) Unknown; there is not enough information to determine whether Colonel Mustard was in the observatory with the candlestick - + Solution: (1) First, go through the clues one by one and consider whether the clue is potentially relevant: ``` - + ```text-davinci-002 Clue 1: Miss Scarlett was the only person in the lounge. This clue is not relevant. Clue 2: The person with the pipe was in the kitchen. This clue is not relevant. Clue 3: Colonel Mustard was the only person in the observatory. This clue is relevant. Clue 4: Professor Plum was not in the library nor the billiard room. This clue is not relevant. Clue 5: The person with the candlestick was in the observatory. This clue is relevant. - + (2) Second, combine the relevant clues to reason out the answer to the question: Clue 3 and Clue 5 both say that the person with the candlestick was in the observatory. Colonel Mustard was the only person in the observatory, so Colonel Mustard was the person with the candlestick. - + (3) Third, write the final answer: (a) Yes; Colonel Mustard was in the observatory with the candlestick ``` - + By giving the model more time and space to think, and guiding it along a reasoning plan, it's able to figure out the correct answer of (a) Yes. Another benefit of splitting complex instructions into smaller subtasks is that it can help keep the model focused on each subtask. @@ -151,9 +151,9 @@ Summary: ```text-davinci-002 The text explains that statistics is a science that studies the variability, collection, organization, analysis, interpretation, and presentation of data, as well as the random process that generates them following the laws of probability. - ``` +``` - However, if we first ask the model to identify the language of the text, and then summarize the text, it becomes more reliable: +However, if we first ask the model to identify the language of the text, and then summarize the text, it becomes more reliable: ```text-davinci-002 First, identify the language of the text. Second, summarize the text using the original language of the text. The summary should be one sentence long. @@ -182,22 +182,22 @@ Another powerful technique for improving the reliability of answers is to prompt Published by [Takeshi Kojima et al. in 2022](https://arxiv.org/abs/2205.11916), the easiest way to prompt a model to reason out the answer is to simply prepend answers with `Let's think step by step.` Figure 2 illustrates an example: -[![zero-shot reasoning example](images/zero-shot_reasoners_fig2.png) -
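To make the trick concrete, here is a minimal sketch of how you might apply it yourself, assuming the legacy completions-style API used elsewhere in this article; the helper function, prompt wording, and model name are illustrative assumptions rather than anything prescribed by the paper:

```python
import openai  # assumes the legacy completions-style client used elsewhere in this article


def zero_shot_chain_of_thought(question: str) -> str:
    """Append 'Let's think step by step.' so the model reasons before answering."""
    prompt = f"Q: {question}\nA: Let's think step by step."
    response = openai.Completion.create(
        model="text-davinci-002",  # model named in this article; substitute whichever model you use
        prompt=prompt,
        temperature=0,
        max_tokens=256,
    )
    # The reasoning steps and the final answer come back together as one block of text.
    return response["choices"][0]["text"]


print(zero_shot_chain_of_thought(
    "A juggler can juggle 16 balls. Half of the balls are golf balls, "
    "and half of the golf balls are blue. How many blue golf balls are there?"
))
```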
Source: *Large Language Models are Zero-Shot Reasoners* by Takeshi Kojima et al. (2022).](https://arxiv.org/abs/2205.11916) +[![zero-shot reasoning example](/images/zero-shot_reasoners_fig2.png) +
Source: _Large Language Models are Zero-Shot Reasoners_ by Takeshi Kojima et al. (2022).](https://arxiv.org/abs/2205.11916) #### Results Applying this simple trick to the MultiArith math dataset, the authors found `Let's think step by step` quadrupled the accuracy, from 18% to 79%! -[![zero-shot reasoning example](images/zero-shot_reasoners_tab5.png) -
Source: *Large Language Models are Zero-Shot Reasoners* by Takeshi Kojima et al. (2022).](https://arxiv.org/abs/2205.11916) +[![zero-shot reasoning example](/images/zero-shot_reasoners_tab5.png) +
Source: _Large Language Models are Zero-Shot Reasoners_ by Takeshi Kojima et al. (2022).](https://arxiv.org/abs/2205.11916) #### Implications Although the `Let's think step by step` trick works well on math problems, it's not effective on all tasks. The authors found that it was most helpful for multi-step arithmetic problems, symbolic reasoning problems, strategy problems, and other reasoning problems. It didn't help with simple math problems or common sense questions, and presumably wouldn't help with many other non-reasoning tasks either. -[![zero-shot reasoning example](images/zero-shot_reasoners_tab1.png) -
Source: *Large Language Models are Zero-Shot Reasoners* by Takeshi Kojima et al. (2022).](https://arxiv.org/abs/2205.11916) +[![zero-shot reasoning example](/images/zero-shot_reasoners_tab1.png) +
Source: _Large Language Models are Zero-Shot Reasoners_ by Takeshi Kojima et al. (2022).](https://arxiv.org/abs/2205.11916) To learn more, read the [full paper](https://arxiv.org/abs/2205.11916). @@ -248,13 +248,13 @@ Because the Toyota Prius Prime meets all of the criteria for a federal tax credi Prompting the model to reason out its answers can be done in many ways. One way is to demonstrate with a few examples ('few-shot'), as studied by [Jason Wei and Denny Zhou et al. from Google](https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html). Here's an example few-shot chain-of-thought prompt: -[![chain of thought example](images/chain_of_thought_fig1.png) -
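Before the figures, here is a rough sketch of what such a few-shot chain-of-thought call might look like in code; the demonstration, model name, and sampling parameters are illustrative assumptions rather than the authors' exact prompt:

```python
import openai  # assumes the legacy completions-style client used elsewhere in this article

# Each demonstration shows the reasoning before the answer, so the model imitates
# the same step-by-step pattern when it reaches the final, unanswered question.
few_shot_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?
A:"""

response = openai.Completion.create(
    model="text-davinci-002",  # substitute whichever model you use
    prompt=few_shot_prompt,
    temperature=0,
    max_tokens=128,
)
print(response["choices"][0]["text"])
```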
Source: *Chain of Thought Prompting Elicits Reasoning in Large Language Models* Jason Wei and Denny Zhou et al. (2022)](https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html) +[![chain of thought example](/images/chain_of_thought_fig1.png) +
Source: _Chain of Thought Prompting Elicits Reasoning in Large Language Models_ Jason Wei and Denny Zhou et al. (2022)](https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html) More demonstrations of reasoning chains written by human labelers: -[![chain of thought example](images/chain_of_thought_fig3.png) -
Source: *Chain of Thought Prompting Elicits Reasoning in Large Language Models* Jason Wei and Denny Zhou et al. (2022)](https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html) +[![chain of thought example](/images/chain_of_thought_fig3.png) +
Source: _Chain of Thought Prompting Elicits Reasoning in Large Language Models_ Jason Wei and Denny Zhou et al. (2022)](https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html) [(Note that it has been called into question whether pears actually float)](https://twitter.com/Meaningness/status/1561062170074370048?s=20&t=mpHt8f3RRboztXxdhLFnWQ) @@ -262,13 +262,13 @@ More demonstrations of reasoning chains written by human labelers: Testing on grade school math problems, the authors found that chain of thought prompting tripled the solve rate, from 18% to 57%. -[![chain of thought example](images/chain_of_thought_fig5.png) -
Source: *Chain of Thought Prompting Elicits Reasoning in Large Language Models* Jason Wei and Denny Zhou et al. (2022)](https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html) +[![chain of thought example](/images/chain_of_thought_fig5.png) +
Source: _Chain of Thought Prompting Elicits Reasoning in Large Language Models_ Jason Wei and Denny Zhou et al. (2022)](https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html) In addition to math problems, chain of thought prompting also lifted performance on questions related to sports understanding, coin flip tracking, and last letter concatenation. In most cases, not many examples were needed to saturate the performance gains (less than 8 or so). -[![chain of thought example](images/chain_of_thought_fig11.png) -
Source: *Chain of Thought Prompting Elicits Reasoning in Large Language Models* Jason Wei and Denny Zhou et al. (2022)](https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html) +[![chain of thought example](/images/chain_of_thought_fig11.png) +
Source: _Chain of Thought Prompting Elicits Reasoning in Large Language Models_ Jason Wei and Denny Zhou et al. (2022)](https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html) To learn more, read the [full paper](https://arxiv.org/abs/2201.11903). @@ -284,8 +284,8 @@ In general, to eke out maximum performance on a task, you'll need to fine-tune a In 2022, Eric Zelikman and Yuhuai Wu et al. published a clever procedure for using a few-shot prompt to generate a dataset of explanations that could be used to fine-tune a model. The idea is to use a few-shot prompt to generate candidate explanations, and only keep the explanations that produce the correct answer. Then, to get additional explanations for some of the incorrect answers, retry the few-shot prompt but with correct answers given as part of the question. The authors called their procedure STaR (Self-taught Reasoner): -[![STaR procedure](images/star_fig1.png) -
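As a loose sketch of the filtering idea (not the authors' code), the function below keeps an explanation only when it reaches the known correct answer, and otherwise retries once with the answer supplied as a hint; the `complete` helper, the prompt wording, and the simple substring check are all assumptions you would adapt to your own dataset:

```python
import openai  # assumes the legacy completions-style client used elsewhere in this article


def complete(prompt: str, temperature: float = 0.7) -> str:
    # Hypothetical helper; substitute whichever model/client you actually use.
    response = openai.Completion.create(
        model="text-davinci-002", prompt=prompt, temperature=temperature, max_tokens=256
    )
    return response["choices"][0]["text"].strip()


def star_style_filter(question: str, correct_answer: str, few_shot_prompt: str):
    """Return (question, explanation) if the explanation reaches the right answer, else None."""
    # Step 1: sample a candidate explanation + answer from the few-shot prompt.
    explanation = complete(few_shot_prompt + f"\nQ: {question}\nA:")
    if correct_answer in explanation:  # crude check; use whatever answer matching fits your task
        return question, explanation
    # Step 2 ('rationalization'): retry with the correct answer given as part of the question,
    # so a usable explanation can still be harvested for fine-tuning.
    hinted = complete(few_shot_prompt + f"\nQ: {question} (hint: the answer is {correct_answer})\nA:")
    if correct_answer in hinted:
        return question, hinted
    return None  # discard questions that never get a correct explanation
```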
Source: *STaR: Bootstrapping Reasoning With Reasoning* by Eric Zelikman and Yujuai Wu et al. (2022)](https://arxiv.org/abs/2203.14465) +[![STaR procedure](/images/star_fig1.png) +
Source: _STaR: Bootstrapping Reasoning With Reasoning_ by Eric Zelikman and Yuhuai Wu et al. (2022)](https://arxiv.org/abs/2203.14465) With this technique, you can combine the benefits of fine-tuning with the benefits of chain-of-thought prompting without needing to write thousands of example explanations. @@ -293,8 +293,8 @@ With this technique, you can combine the benefits of fine-tuning with the benefi When the authors applied this technique to a Common Sense Q&A dataset, they found that STaR outperformed both chain-of-thought prompting alone (73% > 37%) and fine-tuning alone (73% > 60%): -[![STaR results](images/star_tab1.png) -
Source: *STaR: Bootstrapping Reasoning With Reasoning* by Eric Zelikman and Yujuai Wu et al. (2022)](https://arxiv.org/abs/2203.14465) +[![STaR results](/images/star_tab1.png) +
Source: _STaR: Bootstrapping Reasoning With Reasoning_ by Eric Zelikman and Yuhuai Wu et al. (2022)](https://arxiv.org/abs/2203.14465) To learn more, read the [full paper](https://arxiv.org/abs/2203.14465). @@ -312,15 +312,15 @@ A number of extensions of chain-of-thought prompting have been published as well Published by Antonia Creswell et al., one extension of the chain-of-thought technique is to split the single prompt for generating explanations and answers into smaller parts. First, a prompt selects a relevant subset of facts from the text ('selection prompt'). Then, a second prompt infers a conclusion from the selected facts ('inference prompt'). These prompts are then alternated in a loop to generate multiple steps of reasoning and eventually land on a final answer. The authors illustrate the idea in the following figure: -[![Selection-inference prompting](images/selection-inference_fig1.png) -
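A simplified sketch of that alternating loop might look like the following; the two prompt templates, the `complete` helper, the model name, and the fixed step count are illustrative assumptions, not the authors' implementation:

```python
import openai  # assumes the legacy completions-style client used elsewhere in this article


def complete(prompt: str) -> str:
    # Hypothetical helper; substitute whichever model/client you actually use.
    response = openai.Completion.create(
        model="text-davinci-002", prompt=prompt, temperature=0, max_tokens=128
    )
    return response["choices"][0]["text"].strip()


def selection_inference(facts: list, question: str, max_steps: int = 3) -> str:
    """Alternate a selection prompt and an inference prompt, feeding each conclusion back in."""
    facts = list(facts)  # copy so the caller's list is not mutated
    for _ in range(max_steps):
        fact_list = "\n".join(f"- {fact}" for fact in facts)
        # Selection prompt: pick the subset of facts relevant to the question.
        selected = complete(
            f"Facts:\n{fact_list}\n\nQuestion: {question}\n"
            "List only the facts needed to answer the question:"
        )
        # Inference prompt: draw one new conclusion from the selected facts.
        conclusion = complete(
            f"Facts:\n{selected}\n\nWhat single new fact follows from these facts?"
        )
        facts.append(conclusion)
    # Answer from the accumulated facts once the loop finishes.
    fact_list = "\n".join(f"- {fact}" for fact in facts)
    return complete(f"Facts:\n{fact_list}\n\nQuestion: {question}\nAnswer:")
```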
Source: *Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning* by Antonia Creswell et al. (2022)](https://arxiv.org/abs/2205.09712) +[![Selection-inference prompting](/images/selection-inference_fig1.png) +
Source: _Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning_ by Antonia Creswell et al. (2022)](https://arxiv.org/abs/2205.09712) #### Results When applied to a 7B-parameter model, the authors found that selection-inference prompting substantially improved performance relative to chain-of-thought prompting on the bAbi and Proof Writer benchmark tasks (both of which require longer sequences of reasoning steps). The best performance they achieved combined both selection-inference prompting with fine-tuning. -[![Selection-inference prompting](images/selection-inference_fig4.png) -
Source: *Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning* by Antonia Creswell et al. (2022)](https://arxiv.org/abs/2205.09712) +[![Selection-inference prompting](/images/selection-inference_fig4.png) +
Source: _Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning_ by Antonia Creswell et al. (2022)](https://arxiv.org/abs/2205.09712) #### Implications @@ -351,33 +351,33 @@ The halter model brings a couple of advantages: - it can tell the selection-inference process to stop or keep going, as necessary. - if the process never halts, you'll get no answer, which is often preferable to a hallucinated guess -[![Faithful reasoning](images/faithful-reasoning_fig3.png) -
Source: *Faithful Reasoning Using Large Language Models* by Antonia Creswell et al. (2022)](https://arxiv.org/abs/2208.14271) +[![Faithful reasoning](/images/faithful-reasoning_fig3.png) +
Source: _Faithful Reasoning Using Large Language Models_ by Antonia Creswell et al. (2022)](https://arxiv.org/abs/2208.14271) -[![Faithful reasoning](images/faithful-reasoning_fig5.png) -
Source: *Faithful Reasoning Using Large Language Models* by Antonia Creswell et al. (2022)](https://arxiv.org/abs/2208.14271) +[![Faithful reasoning](/images/faithful-reasoning_fig5.png) +
Source: _Faithful Reasoning Using Large Language Models_ by Antonia Creswell et al. (2022)](https://arxiv.org/abs/2208.14271) Second, the authors add a value function, which is used to assess the quality of reasoning steps and search over multiple reasoning trajectories. This echoes a common theme for increasing reliability; instead of generating a single answer from the model, generate a set of answers and then use some type of value function / discriminator / verifier model to pick the best one. -[![Faithful reasoning](images/faithful-reasoning_fig7.png) -
Source: *Faithful Reasoning Using Large Language Models* by Antonia Creswell et al. (2022)](https://arxiv.org/abs/2208.14271) +[![Faithful reasoning](/images/faithful-reasoning_fig7.png) +
Source: _Faithful Reasoning Using Large Language Models_ by Antonia Creswell et al. (2022)](https://arxiv.org/abs/2208.14271) In addition to these two extensions, the authors also use a trick to reduce hallucination of fake facts. Rather than asking the model to write out factual sentences, they fine-tune a model to work with sentence labels (e.g., sen1) instead. This helps prevent the model from hallucinating fake facts not mentioned in the prompt context. -[![Faithful reasoning](images/faithful-reasoning_fig4.png) -
Source: *Faithful Reasoning Using Large Language Models* by Antonia Creswell et al. (2022)](https://arxiv.org/abs/2208.14271) +[![Faithful reasoning](/images/faithful-reasoning_fig4.png) +
Source: _Faithful Reasoning Using Large Language Models_ by Antonia Creswell et al. (2022)](https://arxiv.org/abs/2208.14271) #### Results The authors evaluated their technique on two benchmarks: the ProofWriter task (not shown) and [EntailmentBankQA](https://allenai.org/data/entailmentbank) (shown). The technique increased accuracy substantially, especially on harder reasoning problems. -![Faithful reasoning](images/faithful-reasoning_tab2.png) -
Source: *Faithful Reasoning Using Large Language Models* by Antonia Creswell et al. (2022)](https://arxiv.org/abs/2208.14271) +![Faithful reasoning](/images/faithful-reasoning_tab2.png) +
Source: _Faithful Reasoning Using Large Language Models_ by Antonia Creswell et al. (2022)](https://arxiv.org/abs/2208.14271) In addition, their sentence label manipulation trick essentially eliminated hallucination! -![Faithful reasoning](images/faithful-reasoning_tab5.png) -
Source: *Faithful Reasoning Using Large Language Models* by Antonia Creswell et al. (2022)](https://arxiv.org/abs/2208.14271) +![Faithful reasoning](/images/faithful-reasoning_tab5.png) +
Source: _Faithful Reasoning Using Large Language Models_ by Antonia Creswell et al. (2022)](https://arxiv.org/abs/2208.14271) #### Implications @@ -399,18 +399,18 @@ In addition to doing poorly on long reasoning chains (where selection-inference Least-to-most prompting is another technique that splits up reasoning tasks into smaller, more reliable subtasks. The idea is to elicit a subtask from the model by prompting it with something like `To solve {question}, we need to first solve: "`. Then, with that subtask in hand, the model can generate a solution. The solution is appended to the original question and the process is repeated until a final answer is produced. -[![Least-to-most prompting](images/least-to-most_fig1.png) -
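As an illustrative sketch (not the authors' code), the loop below elicits a subproblem with the quoted phrasing, solves it, and appends the solution to the context before asking again; the `complete` helper, prompt wording, model name, and step limit are assumptions to adapt to your task:

```python
import openai  # assumes the legacy completions-style client used elsewhere in this article


def complete(prompt: str, stop=None) -> str:
    # Hypothetical helper; substitute whichever model/client you actually use.
    response = openai.Completion.create(
        model="text-davinci-002", prompt=prompt, temperature=0, max_tokens=256, stop=stop
    )
    return response["choices"][0]["text"].strip()


def least_to_most(question: str, max_subproblems: int = 3) -> str:
    """Repeatedly elicit a subproblem, solve it, and fold the solution back into the context."""
    context = question
    for _ in range(max_subproblems):
        # Elicit the next subproblem, using the phrasing quoted in the article.
        subproblem = complete(
            f'To solve "{context}", we need to first solve: "', stop='"'
        )
        # Solve the subproblem, then append it and its answer to the running context.
        sub_answer = complete(f"{context}\n\nSubproblem: {subproblem}\nAnswer:")
        context += f"\n{subproblem}: {sub_answer}"
    # With the intermediate solutions in context, ask for the final answer.
    return complete(f"{context}\n\nFinal answer:")
```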
Source: *Least-to-most Prompting Enables Complex Reasoning in Large Language Models* by Denny Zhou et al. (2022)](https://arxiv.org/abs/2205.10625) +[![Least-to-most prompting](/images/least-to-most_fig1.png) +
Source: _Least-to-most Prompting Enables Complex Reasoning in Large Language Models_ by Denny Zhou et al. (2022)](https://arxiv.org/abs/2205.10625) #### Results When applied to benchmarks involving long reasoning chains using `code-davinci-002` (which is optimized for code but can still understand text), the authors measured gains as large as 16% -> 99.7%! [ -![Least-to-most prompting results on last-letter-concatenation task](images/least-to-most_tab4.png) -![Least-to-most prompting results on SCAN](images/least-to-most_tab9.png) -![Least-to-most prompting results on DROP numerical reasoning](images/least-to-most_tab11.png) -
Source: *Least-to-most Prompting Enables Complex Reasoning in Large Language Models* by Denny Zhou et al. (2022)](https://arxiv.org/abs/2205.10625) +![Least-to-most prompting results on last-letter-concatenation task](/images/least-to-most_tab4.png) +![Least-to-most prompting results on SCAN](/images/least-to-most_tab9.png) +![Least-to-most prompting results on DROP numerical reasoning](/images/least-to-most_tab11.png) +
Source: _Least-to-most Prompting Enables Complex Reasoning in Large Language Models_ by Denny Zhou et al. (2022)](https://arxiv.org/abs/2205.10625) #### Implications @@ -426,7 +426,7 @@ To learn more, read the [full paper](https://arxiv.org/abs/2205.10625). #### Method -In contrast to the previous techniques, which try to maximize the likelihood of correct answers, another approach is to use GPT-3 to generate a tree of possible explanations (both correct *and incorrect*), and then analyze their relationships to guess at which set is correct. This technique was coined maieutic prompting by [Jaehun Jung et al. in May 2022](https://arxiv.org/abs/2205.11822) (maieutic means relating to the Socratic method of asking questions to elicit ideas). +In contrast to the previous techniques, which try to maximize the likelihood of correct answers, another approach is to use GPT-3 to generate a tree of possible explanations (both correct _and incorrect_), and then analyze their relationships to guess at which set is correct. This technique was coined maieutic prompting by [Jaehun Jung et al. in May 2022](https://arxiv.org/abs/2205.11822) (maieutic means relating to the Socratic method of asking questions to elicit ideas). The method is complicated, and works as follows: @@ -444,15 +444,14 @@ The method is complicated, and works as follows: - Use a solver to find the most self-consistent set of beliefs, and take those as true [ - ![Maieutic prompting](images/maieutic_fig2.png) - ![Maieutic prompting](images/maieutic_fig6.png) -
Source: *Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations* by Jaehun Jung et al. (2022)](https://arxiv.org/abs/2205.11822) - +![Maieutic prompting](/images/maieutic_fig2.png) +![Maieutic prompting](/images/maieutic_fig6.png) +
Source: _Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations_ by Jaehun Jung et al. (2022)](https://arxiv.org/abs/2205.11822) #### Results -[![Maieutic prompting results](images/maieutic_tab1.png) -
Source: *Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations* by Jaehun Jung et al. (2022)](https://arxiv.org/abs/2205.11822) +[![Maieutic prompting results](/images/maieutic_tab1.png) +
Source: _Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations_ by Jaehun Jung et al. (2022)](https://arxiv.org/abs/2205.11822) #### Implications @@ -468,15 +467,15 @@ To learn more, read the [full paper](https://arxiv.org/abs/2205.11822). For tasks with a discrete set of answers, one simple way to improve reliability is to sample multiple explanations & answers from the model (using a positive temperature) and then pick the final answer that appears most often. -[![Self-consistency method](images/self-consistency_fig1.png) -
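A minimal sketch of that majority vote, assuming the legacy completions-style API used elsewhere in this article and a task where each sample's final answer sits on its last line; the model name, temperature, sample count, and answer-extraction rule are illustrative assumptions:

```python
from collections import Counter

import openai  # assumes the legacy completions-style client used elsewhere in this article


def self_consistent_answer(prompt: str, n_samples: int = 10) -> str:
    """Sample several step-by-step completions at positive temperature and majority-vote the result."""
    response = openai.Completion.create(
        model="text-davinci-002",  # substitute whichever model you use
        prompt=prompt,
        temperature=0.7,  # positive temperature so the sampled reasoning paths differ
        max_tokens=256,
        n=n_samples,  # draw several completions in one call
    )
    # Treat the last line of each completion as that sample's final answer.
    finals = [
        choice["text"].strip().splitlines()[-1]
        for choice in response["choices"]
        if choice["text"].strip()
    ]
    answer, _count = Counter(finals).most_common(1)[0]
    return answer
```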
Source: *Self-Consistency Improves Chain of Thought Reasoning in Language Models* by Xuezhi Wang et al. (2022)](https://arxiv.org/abs/2203.11171) +[![Self-consistency method](/images/self-consistency_fig1.png) +
Source: _Self-Consistency Improves Chain of Thought Reasoning in Language Models_ by Xuezhi Wang et al. (2022)](https://arxiv.org/abs/2203.11171) #### Results This technique lifted accuracies by anywhere from 1 to 24 percentage points on a suite of math and reasoning benchmarks. (Plotted below are results from Google's LaMDA model; using Google's larger PaLM model, the baselines were higher but the gains were a bit smaller.) -[![Self-consistency results](images/self-consistency_fig3.png) -
Source: *Self-Consistency Improves Chain of Thought Reasoning in Language Models* by Xuezhi Wang et al. (2022)](https://arxiv.org/abs/2203.11171) +[![Self-consistency results](/images/self-consistency_fig3.png) +
Source: _Self-Consistency Improves Chain of Thought Reasoning in Language Models_ by Xuezhi Wang et al. (2022)](https://arxiv.org/abs/2203.11171) #### Implications @@ -500,15 +499,15 @@ In 2021, OpenAI researchers applied this technique to grade school math problems - Using those solutions, with some labeled correct and some labeled incorrect, they fine-tuned a verifier model to classify whether a question and candidate solution was correct or incorrect - Finally, at test time, the generative model creates 100 solutions to each problem, and the one with the highest score according to the verifier model is picked as the final answer -[![Verifier method](images/verifiers_fig3.png) -
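As a rough sketch of the generate-then-rank pattern (not the paper's actual models), the function below samples many candidate solutions and returns the one that a supplied scoring function likes best; the prompt, model name, and `score_solution` callable are placeholders standing in for your own generator and fine-tuned verifier:

```python
import openai  # assumes the legacy completions-style client used elsewhere in this article


def best_of_n_with_verifier(question: str, score_solution, n: int = 20) -> str:
    """Generate n candidate solutions and keep the one the verifier scores highest.

    `score_solution(question, solution) -> float` stands in for the fine-tuned verifier
    described above; any scoring model or heuristic with that shape will work.
    """
    response = openai.Completion.create(
        model="text-davinci-002",  # the paper fine-tuned its own generator; this is a stand-in
        prompt=f"{question}\nShow your work, then state the final answer.\n",
        temperature=0.8,  # positive temperature so the candidates differ
        max_tokens=256,
        n=n,
    )
    candidates = [choice["text"] for choice in response["choices"]]
    return max(candidates, key=lambda solution: score_solution(question, solution))
```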
Source: *Training Verifiers to Solve Math Word Problems* by Karl Cobbe et al. (2021)](https://arxiv.org/abs/2110.14168) +[![Verifier method](/images/verifiers_fig3.png) +
Source: _Training Verifiers to Solve Math Word Problems_ by Karl Cobbe et al. (2021)](https://arxiv.org/abs/2110.14168) #### Results With a 175B GPT-3 model and 8,000 training examples, this technique substantially lifted grade school math accuracy from ~33% to ~55%. -[![Verifier results](images/verifiers_fig5.png) -
Source: *Training Verifiers to Solve Math Word Problems* by Karl Cobbe et al. (2021)](https://arxiv.org/abs/2110.14168) +[![Verifier results](/images/verifiers_fig5.png) +
Source: _Training Verifiers to Solve Math Word Problems_ by Karl Cobbe et al. (2021)](https://arxiv.org/abs/2110.14168) #### Implications @@ -525,27 +524,27 @@ Although the techniques above vary in their approach, they all share the goal of This paradigm of trying to build a reliable system out of less reliable components is reminiscent of probabilistic programming, and many of the analysis techniques of that field can be applied to this one. -In the paper *Language Model Cascades*, David Dohan et al. interpret the above techniques in the paradigm of probabilistic graphical models: +In the paper _Language Model Cascades_, David Dohan et al. interpret the above techniques in the paradigm of probabilistic graphical models: #### Chain of thought prompting -[![graphical model of chain of thought prompting](images/lm_cascades_fig1.png) -
Source: *Language Model Cascades* by David Dohan et al. (2022)](https://arxiv.org/abs/2207.10342) +[![graphical model of chain of thought prompting](/images/lm_cascades_fig1.png) +
Source: _Language Model Cascades_ by David Dohan et al. (2022)](https://arxiv.org/abs/2207.10342) #### Fine-tuned chain of thought prompting / Self-taught reasoner -[![graphical model of fine-tuned chain of thought prompting](images/lm_cascades_fig3.png) -
Source: *Language Model Cascades* by David Dohan et al. (2022)](https://arxiv.org/abs/2207.10342) +[![graphical model of fine-tuned chain of thought prompting](/images/lm_cascades_fig3.png) +
Source: _Language Model Cascades_ by David Dohan et al. (2022)](https://arxiv.org/abs/2207.10342) #### Selection-inference prompting -[![graphical model of selection-inference prompting](images/lm_cascades_fig4.png) -
Source: *Language Model Cascades* by David Dohan et al. (2022)](https://arxiv.org/abs/2207.10342) +[![graphical model of selection-inference prompting](/images/lm_cascades_fig4.png) +
Source: _Language Model Cascades_ by David Dohan et al. (2022)](https://arxiv.org/abs/2207.10342) #### Verifiers -[![graphical model of verifiers](images/lm_cascades_fig5.png) -
Source: *Language Model Cascades* by David Dohan et al. (2022)](https://arxiv.org/abs/2207.10342) +[![graphical model of verifiers](/images/lm_cascades_fig5.png) +
Source: _Language Model Cascades_ by David Dohan et al. (2022)](https://arxiv.org/abs/2207.10342) #### Implications @@ -560,7 +559,7 @@ In the future, expect better models and better techniques to be published. Even ## Bibliography | Lesson | Paper | Date | -|--------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------|----------| +| ------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------- | -------- | | Break complex tasks into simpler subtasks (and consider exposing the intermediate outputs to users) | [AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts](https://arxiv.org/abs/2110.01691) | 2021 Oct | | You can improve output by generating many candidates, and then picking the one that looks best | [Training Verifiers to Solve Math Word Problems](https://arxiv.org/abs/2110.14168) | 2021 Oct | | On reasoning tasks, models do better when they reason step-by-step before answering | [Chain of Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/abs/2201.11903) | 2022 Jan | @@ -571,4 +570,4 @@ In the future, expect better models and better techniques to be published. Even | On long reasoning problems, you can improve step-by-step reasoning by splitting the problem into pieces to solve incrementally | [Least-to-most Prompting Enables Complex Reasoning in Large Language Models](https://arxiv.org/abs/2205.10625) | 2022 May | | You can have the model analyze both good and bogus explanations to figure out which set of explanations are most consistent | [Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations](https://arxiv.org/abs/2205.11822) | 2022 May | | You can think about these techniques in terms of probabilistic programming, where systems comprise unreliable components | [Language Model Cascades](https://arxiv.org/abs/2207.10342) | 2022 Jul | -| You can eliminate hallucination with sentence label manipulation, and you can reduce wrong answers with a 'halter' prompt | [Faithful Reasoning Using Large Language Models](https://arxiv.org/abs/2208.14271) | 2022 Aug | +| You can eliminate hallucination with sentence label manipulation, and you can reduce wrong answers with a 'halter' prompt | [Faithful Reasoning Using Large Language Models](https://arxiv.org/abs/2208.14271) | 2022 Aug | diff --git a/examples/Search_reranking_with_cross-encoders.ipynb b/examples/Search_reranking_with_cross-encoders.ipynb index a5cc7e6b..126c1220 100644 --- a/examples/Search_reranking_with_cross-encoders.ipynb +++ b/examples/Search_reranking_with_cross-encoders.ipynb @@ -1,812 +1,812 @@ { - "cells": [ - { - "cell_type": "markdown", - "id": "7f30b8b2", - "metadata": {}, - "source": [ - "# Search reranking with cross-encoders\n", - "\n", - "This notebook takes you through examples of using a cross-encoder to re-rank search results.\n", - "\n", - "This is a common use case with our customers, where you've implemented semantic search using embeddings (produced using a [bi-encoder](https://www.sbert.net/examples/applications/retrieve_rerank/README.html#retrieval-bi-encoder)) but the results are not as accurate as 
your use case requires. A possible cause is that there is some business rule you can use to rerank the documents such as how recent or how popular a document is. \n", - "\n", - "However, often there are subtle domain-specific rules that help determine relevancy, and this is where a cross-encoder can be useful. Cross-encoders are more accurate than bi-encoders but they don't scale well, so using them to re-order a shortened list returned by semantic search is the ideal use case.\n", - "\n", - "### Example\n", - "\n", - "Consider a search task with D documents and Q queries.\n", - "\n", - "The brute force approach of computing every pairwise relevance is expensive; its cost scales as ```D * Q```. This is known as **cross-encoding**.\n", - "\n", - "A faster approach is **embeddings-based search**, in which an embedding is computed once for each document and query, and then re-used multiple times to cheaply compute pairwise relevance. Because embeddings are only computed once, its cost scales as ```D + Q```. This is known as **bi-encoding**.\n", - "\n", - "Although embeddings-based search is faster, the quality can be worse. To get the best of both, one common approach is to use embeddings (or another bi-encoder) to cheaply identify top candidates, and then use GPT (or another cross-encoder) to expensively re-rank those top candidates. The cost of this hybrid approach scales as ```(D + Q) * cost of embedding + (N * Q) * cost of re-ranking```, where ```N``` is the number of candidates re-ranked.\n", - "\n", - "### Walkthrough\n", - "\n", - "To illustrate this approach we'll use ```text-davinci-003``` with ```logprobs``` enabled to build a GPT-powered cross-encoder. Our GPT models have strong general language understanding, which when tuned with some few-shot examples can provide a simple and effective cross-encoding option.\n", - "\n", - "This notebook drew on this great [article](https://weaviate.io/blog/cross-encoders-as-reranker) by Weaviate, and this [excellent explanation](https://www.sbert.net/examples/applications/cross-encoder/README.html) of bi-encoders vs. cross-encoders from Sentence Transformers." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "71cb361f", - "metadata": {}, - "outputs": [], - "source": [ - "!pip install openai\n", - "!pip install arxiv\n", - "!pip install tenacity\n", - "!pip install pandas\n", - "!pip install tiktoken" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "90f3b829", - "metadata": {}, - "outputs": [], - "source": [ - "import arxiv\n", - "from math import exp\n", - "import openai\n", - "import pandas as pd\n", - "from tenacity import retry, wait_random_exponential, stop_after_attempt\n", - "import tiktoken" - ] - }, - { - "cell_type": "markdown", - "id": "fdada886", - "metadata": {}, - "source": [ - "## Search\n", - "\n", - "We'll use the arXiv search service for this example, but this step could be performed by any search service you have. 
The key item to consider is over-fetching slightly to capture all the potentially relevant documents, before re-sorting them.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "bf16c893", - "metadata": {}, - "outputs": [], - "source": [ - "query = \"how do bi-encoders work for sentence embeddings\"\n", - "search = arxiv.Search(\n", - " query=query, max_results=20, sort_by=arxiv.SortCriterion.Relevance\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "4b020a1b", - "metadata": {}, - "outputs": [], - "source": [ - "result_list = []\n", - "\n", - "for result in search.results():\n", - " result_dict = {}\n", - "\n", - " result_dict.update({\"title\": result.title})\n", - " result_dict.update({\"summary\": result.summary})\n", - "\n", - " # Taking the first url provided\n", - " result_dict.update({\"article_url\": [x.href for x in result.links][0]})\n", - " result_dict.update({\"pdf_url\": [x.href for x in result.links][1]})\n", - " result_list.append(result_dict)" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "4fdce882", - "metadata": {}, - "outputs": [ + "cells": [ { - "data": { - "text/plain": [ - "{'title': 'SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features',\n", - " 'summary': 'Models based on large-pretrained language models, such as S(entence)BERT,\\nprovide effective and efficient sentence embeddings that show high correlation\\nto human similarity ratings, but lack interpretability. On the other hand,\\ngraph metrics for graph-based meaning representations (e.g., Abstract Meaning\\nRepresentation, AMR) can make explicit the semantic aspects in which two\\nsentences are similar. However, such metrics tend to be slow, rely on parsers,\\nand do not reach state-of-the-art performance when rating sentence similarity.\\n In this work, we aim at the best of both worlds, by learning to induce\\n$S$emantically $S$tructured $S$entence BERT embeddings (S$^3$BERT). Our\\nS$^3$BERT embeddings are composed of explainable sub-embeddings that emphasize\\nvarious semantic sentence features (e.g., semantic roles, negation, or\\nquantification). We show how to i) learn a decomposition of the sentence\\nembeddings into semantic features, through approximation of a suite of\\ninterpretable AMR graph metrics, and how to ii) preserve the overall power of\\nthe neural embeddings by controlling the decomposition learning process with a\\nsecond objective that enforces consistency with the similarity ratings of an\\nSBERT teacher model. In our experimental studies, we show that our approach\\noffers interpretability -- while fully preserving the effectiveness and\\nefficiency of the neural sentence embeddings.',\n", - " 'article_url': 'http://arxiv.org/abs/2206.07023v2',\n", - " 'pdf_url': 'http://arxiv.org/pdf/2206.07023v2'}" + "cell_type": "markdown", + "id": "7f30b8b2", + "metadata": {}, + "source": [ + "# Search reranking with cross-encoders\n", + "\n", + "This notebook takes you through examples of using a cross-encoder to re-rank search results.\n", + "\n", + "This is a common use case with our customers, where you've implemented semantic search using embeddings (produced using a [bi-encoder](https://www.sbert.net/examples/applications/retrieve_rerank/README.html#retrieval-bi-encoder)) but the results are not as accurate as your use case requires. 
A possible cause is that there is some business rule you can use to rerank the documents such as how recent or how popular a document is. \n", + "\n", + "However, often there are subtle domain-specific rules that help determine relevancy, and this is where a cross-encoder can be useful. Cross-encoders are more accurate than bi-encoders but they don't scale well, so using them to re-order a shortened list returned by semantic search is the ideal use case.\n", + "\n", + "### Example\n", + "\n", + "Consider a search task with D documents and Q queries.\n", + "\n", + "The brute force approach of computing every pairwise relevance is expensive; its cost scales as ```D * Q```. This is known as **cross-encoding**.\n", + "\n", + "A faster approach is **embeddings-based search**, in which an embedding is computed once for each document and query, and then re-used multiple times to cheaply compute pairwise relevance. Because embeddings are only computed once, its cost scales as ```D + Q```. This is known as **bi-encoding**.\n", + "\n", + "Although embeddings-based search is faster, the quality can be worse. To get the best of both, one common approach is to use embeddings (or another bi-encoder) to cheaply identify top candidates, and then use GPT (or another cross-encoder) to expensively re-rank those top candidates. The cost of this hybrid approach scales as ```(D + Q) * cost of embedding + (N * Q) * cost of re-ranking```, where ```N``` is the number of candidates re-ranked.\n", + "\n", + "### Walkthrough\n", + "\n", + "To illustrate this approach we'll use ```text-davinci-003``` with ```logprobs``` enabled to build a GPT-powered cross-encoder. Our GPT models have strong general language understanding, which when tuned with some few-shot examples can provide a simple and effective cross-encoding option.\n", + "\n", + "This notebook drew on this great [article](https://weaviate.io/blog/cross-encoders-as-reranker) by Weaviate, and this [excellent explanation](https://www.sbert.net/examples/applications/cross-encoder/README.html) of bi-encoders vs. cross-encoders from Sentence Transformers." 
] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "result_list[0]" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "id": "7e6abb5b", - "metadata": {}, - "outputs": [ + }, { - "name": "stdout", - "output_type": "stream", - "text": [ - "1: SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features\n", - "2: Are Classes Clusters?\n", - "3: Semantic Composition in Visually Grounded Language Models\n", - "4: Evaluating the Construct Validity of Text Embeddings with Application to Survey Questions\n", - "5: Learning Probabilistic Sentence Representations from Paraphrases\n", - "6: Exploiting Twitter as Source of Large Corpora of Weakly Similar Pairs for Semantic Sentence Embeddings\n", - "7: How to Probe Sentence Embeddings in Low-Resource Languages: On Structural Design Choices for Probing Task Evaluation\n", - "8: Clustering and Network Analysis for the Embedding Spaces of Sentences and Sub-Sentences\n", - "9: Vec2Sent: Probing Sentence Embeddings with Natural Language Generation\n", - "10: Non-Linguistic Supervision for Contrastive Learning of Sentence Embeddings\n", - "11: SentPWNet: A Unified Sentence Pair Weighting Network for Task-specific Sentence Embedding\n", - "12: Learning Joint Representations of Videos and Sentences with Web Image Search\n", - "13: Character-based Neural Networks for Sentence Pair Modeling\n", - "14: Train Once, Test Anywhere: Zero-Shot Learning for Text Classification\n", - "15: Hierarchical GPT with Congruent Transformers for Multi-Sentence Language Models\n", - "16: Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models\n", - "17: In Search for Linear Relations in Sentence Embedding Spaces\n", - "18: Learning to Borrow -- Relation Representation for Without-Mention Entity-Pairs for Knowledge Graph Completion\n", - "19: Efficient and Flexible Topic Modeling using Pretrained Embeddings and Bag of Sentences\n", - "20: Relational Sentence Embedding for Flexible Semantic Matching\n" - ] - } - ], - "source": [ - "for i, result in enumerate(result_list):\n", - " print(f\"{i + 1}: {result['title']}\")" - ] - }, - { - "cell_type": "markdown", - "id": "d5727678", - "metadata": {}, - "source": [ - "## Cross-encoder\n", - "\n", - "We'll create a cross-encoder using the ```Completions``` endpoint - the key factors to consider here are:\n", - "- Make your examples domain-specific - the strength of cross-encoders comes when you tailor them to your domain.\n", - "- There is a trade-off between how many potential examples to re-rank vs. processing speed. Consider batching and parallel processing cross-encoder requests to process them more quickly.\n", - "\n", - "The steps here are:\n", - "- Build a prompt to assess relevance and provide few-shot examples to tune it to your domain.\n", - "- Add a ```logit bias``` for the tokens for ``` Yes``` and ``` No``` to decrease the likelihood of any other tokens occurring.\n", - "- Return the classification of yes/no as well as the ```logprobs```.\n", - "- Rerank the results by the ```logprobs``` keyed on ``` Yes```." 
- ] - }, - { - "cell_type": "code", - "execution_count": 6, - "id": "ca634bf9", - "metadata": {}, - "outputs": [ + "cell_type": "code", + "execution_count": null, + "id": "71cb361f", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install openai\n", + "!pip install arxiv\n", + "!pip install tenacity\n", + "!pip install pandas\n", + "!pip install tiktoken" + ] + }, { - "data": { - "text/plain": [ - "([3363], [1400])" + "cell_type": "code", + "execution_count": 1, + "id": "90f3b829", + "metadata": {}, + "outputs": [], + "source": [ + "import arxiv\n", + "from math import exp\n", + "import openai\n", + "import pandas as pd\n", + "from tenacity import retry, wait_random_exponential, stop_after_attempt\n", + "import tiktoken" ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "tokens = [\" Yes\", \" No\"]\n", - "tokenizer = tiktoken.encoding_for_model(\"text-davinci-003\")\n", - "ids = [tokenizer.encode(token) for token in tokens]\n", - "ids[0], ids[1]" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "4fdf8c11", - "metadata": {}, - "outputs": [], - "source": [ - "prompt = '''\n", - "You are an Assistant responsible for helping detect whether the retrieved document is relevant to the query. For a given input, you need to output a single token: \"Yes\" or \"No\" indicating the retrieved document is relevant to the query.\n", - "\n", - "Query: How to plant a tree?\n", - "Document: \"\"\"Cars were invented in 1886, when German inventor Carl Benz patented his Benz Patent-Motorwagen.[3][4][5] Cars became widely available during the 20th century. One of the first cars affordable by the masses was the 1908 Model T, an American car manufactured by the Ford Motor Company. Cars were rapidly adopted in the US, where they replaced horse-drawn carriages.[6] In Europe and other parts of the world, demand for automobiles did not increase until after World War II.[7] The car is considered an essential part of the developed economy.\"\"\"\n", - "Relevant: No\n", - "\n", - "Query: Has the coronavirus vaccine been approved?\n", - "Document: \"\"\"The Pfizer-BioNTech COVID-19 vaccine was approved for emergency use in the United States on December 11, 2020.\"\"\"\n", - "Relevant: Yes\n", - "\n", - "Query: What is the capital of France?\n", - "Document: \"\"\"Paris, France's capital, is a major European city and a global center for art, fashion, gastronomy and culture. Its 19th-century cityscape is crisscrossed by wide boulevards and the River Seine. Beyond such landmarks as the Eiffel Tower and the 12th-century, Gothic Notre-Dame cathedral, the city is known for its cafe culture and designer boutiques along the Rue du Faubourg Saint-Honoré.\"\"\"\n", - "Relevant: Yes\n", - "\n", - "Query: What are some papers to learn about PPO reinforcement learning?\n", - "Document: \"\"\"Proximal Policy Optimization and its Dynamic Version for Sequence Generation: In sequence generation task, many works use policy gradient for model optimization to tackle the intractable backpropagation issue when maximizing the non-differentiable evaluation metrics or fooling the discriminator in adversarial learning. In this paper, we replace policy gradient with proximal policy optimization (PPO), which is a proved more efficient reinforcement learning algorithm, and propose a dynamic approach for PPO (PPO-dynamic). We demonstrate the efficacy of PPO and PPO-dynamic on conditional sequence generation tasks including synthetic experiment and chit-chat chatbot. 
The results show that PPO and PPO-dynamic can beat policy gradient by stability and performance.\"\"\"\n", - "Relevant: Yes\n", - "\n", - "Query: Explain sentence embeddings\n", - "Document: \"\"\"Inside the bubble: exploring the environments of reionisation-era Lyman-α emitting galaxies with JADES and FRESCO: We present a study of the environments of 16 Lyman-α emitting galaxies (LAEs) in the reionisation era (5.85%) observed in our sample of LAEs, indicating the presence of ionised ``bubbles'' with physical sizes of the order of 0.1pMpc≲Rion≲1pMpc in a patchy reionisation scenario where the bubbles are embedded in a fully neutral IGM. Around half of the LAEs in our sample are found to coincide with large-scale galaxy overdensities seen in FRESCO at z∼5.8-5.9 and z∼7.3, suggesting Lyman-α transmission is strongly enhanced in such overdense regions, and underlining the importance of LAEs as tracers of the first large-scale ionised bubbles. Considering only spectroscopically confirmed galaxies, we find our sample of UV-faint LAEs (MUV≳−20mag) and their direct neighbours are generally not able to produce the required ionised regions based on the Lyman-α transmission properties, suggesting lower-luminosity sources likely play an important role in carving out these bubbles. These observations demonstrate the combined power of JWST multi-object and slitless spectroscopy in acquiring a unique view of the early stages of Cosmic Reionisation via the most distant LAEs.\"\"\"\n", - "Relevant: No\n", - "\n", - "Query: {query}\n", - "Document: \"\"\"{document}\"\"\"\n", - "Relevant:\n", - "'''\n", - "\n", - "\n", - "@retry(wait=wait_random_exponential(min=1, max=40), stop=stop_after_attempt(3))\n", - "def document_relevance(query, document):\n", - " response = openai.Completion.create(\n", - " model=\"text-davinci-003\",\n", - " prompt=prompt.format(query=query, document=content),\n", - " temperature=0,\n", - " logprobs=1,\n", - " logit_bias={3363: 1, 1400: 1},\n", - " )\n", - "\n", - " return (\n", - " query,\n", - " document,\n", - " response[\"choices\"][0][\"text\"],\n", - " response[\"choices\"][0][\"logprobs\"][\"token_logprobs\"][0],\n", - " )" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "753cd363", - "metadata": {}, - "outputs": [], - "source": [ - "content = result_list[0][\"title\"] + \": \" + result_list[0][\"summary\"]\n", - "\n", - "# Set logprobs to 1 so our response will include the most probable token the model identified\n", - "response = openai.Completion.create(\n", - " model=\"text-davinci-003\",\n", - " prompt=prompt.format(query=query, document=content),\n", - " temperature=0,\n", - " logprobs=1,\n", - " logit_bias={3363: 1, 1400: 1},\n", - " max_tokens=1,\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "id": "7efef2fe", - "metadata": {}, - "outputs": [ + }, { - "name": "stdout", - "output_type": "stream", - "text": [ - "Result was Yes\n", - "Logprobs was -0.05869877\n", - "\n", - "Below is the full logprobs object\n", - "\n", - "\n", - "{\n", - " \"tokens\": [\n", - " \"Yes\"\n", - " ],\n", - " \"token_logprobs\": [\n", - " -0.05869877\n", - " ],\n", - " \"top_logprobs\": [\n", - " {\n", - " \"Yes\": -0.05869877\n", - " }\n", - " ],\n", - " \"text_offset\": [\n", - " 5764\n", - " ]\n", - "}\n" - ] - } - ], - "source": [ - "result = response[\"choices\"][0]\n", - "print(f\"Result was {result['text']}\")\n", - "print(f\"Logprobs was {result['logprobs']['token_logprobs'][0]}\")\n", - "print(\"\\nBelow is the full logprobs 
object\\n\\n\")\n", - "print(result[\"logprobs\"])" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "id": "7683b6f7", - "metadata": {}, - "outputs": [], - "source": [ - "output_list = []\n", - "for x in result_list:\n", - " content = x[\"title\"] + \": \" + x[\"summary\"]\n", - "\n", - " try:\n", - " output_list.append(document_relevance(query, document=content))\n", - "\n", - " except Exception as e:\n", - " print(e)" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "id": "57576313", - "metadata": {}, - "outputs": [ + "cell_type": "markdown", + "id": "fdada886", + "metadata": {}, + "source": [ + "## Search\n", + "\n", + "We'll use the arXiv search service for this example, but this step could be performed by any search service you have. The key item to consider is over-fetching slightly to capture all the potentially relevant documents, before re-sorting them.\n" + ] + }, { - "data": { - "text/plain": [ - "[('how do bi-encoders work for sentence embeddings',\n", - " 'SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features: Models based on large-pretrained language models, such as S(entence)BERT,\\nprovide effective and efficient sentence embeddings that show high correlation\\nto human similarity ratings, but lack interpretability. On the other hand,\\ngraph metrics for graph-based meaning representations (e.g., Abstract Meaning\\nRepresentation, AMR) can make explicit the semantic aspects in which two\\nsentences are similar. However, such metrics tend to be slow, rely on parsers,\\nand do not reach state-of-the-art performance when rating sentence similarity.\\n In this work, we aim at the best of both worlds, by learning to induce\\n$S$emantically $S$tructured $S$entence BERT embeddings (S$^3$BERT). Our\\nS$^3$BERT embeddings are composed of explainable sub-embeddings that emphasize\\nvarious semantic sentence features (e.g., semantic roles, negation, or\\nquantification). We show how to i) learn a decomposition of the sentence\\nembeddings into semantic features, through approximation of a suite of\\ninterpretable AMR graph metrics, and how to ii) preserve the overall power of\\nthe neural embeddings by controlling the decomposition learning process with a\\nsecond objective that enforces consistency with the similarity ratings of an\\nSBERT teacher model. In our experimental studies, we show that our approach\\noffers interpretability -- while fully preserving the effectiveness and\\nefficiency of the neural sentence embeddings.',\n", - " 'Yes',\n", - " -0.05326408),\n", - " ('how do bi-encoders work for sentence embeddings',\n", - " 'Are Classes Clusters?: Sentence embedding models aim to provide general purpose embeddings for\\nsentences. Most of the models studied in this paper claim to perform well on\\nSTS tasks - but they do not report on their suitability for clustering. This\\npaper looks at four recent sentence embedding models (Universal Sentence\\nEncoder (Cer et al., 2018), Sentence-BERT (Reimers and Gurevych, 2019), LASER\\n(Artetxe and Schwenk, 2019), and DeCLUTR (Giorgi et al., 2020)). It gives a\\nbrief overview of the ideas behind their implementations. It then investigates\\nhow well topic classes in two text classification datasets (Amazon Reviews (Ni\\net al., 2019) and News Category Dataset (Misra, 2018)) map to clusters in their\\ncorresponding sentence embedding space. While the performance of the resulting\\nclassification model is far from perfect, it is better than random. 
This is\\ninteresting because the classification model has been constructed in an\\nunsupervised way. The topic classes in these real life topic classification\\ndatasets can be partly reconstructed by clustering the corresponding sentence\\nembeddings.',\n", - " 'No',\n", - " -0.009535169),\n", - " ('how do bi-encoders work for sentence embeddings',\n", - " \"Semantic Composition in Visually Grounded Language Models: What is sentence meaning and its ideal representation? Much of the expressive\\npower of human language derives from semantic composition, the mind's ability\\nto represent meaning hierarchically & relationally over constituents. At the\\nsame time, much sentential meaning is outside the text and requires grounding\\nin sensory, motor, and experiential modalities to be adequately learned.\\nAlthough large language models display considerable compositional ability,\\nrecent work shows that visually-grounded language models drastically fail to\\nrepresent compositional structure. In this thesis, we explore whether & how\\nmodels compose visually grounded semantics, and how we might improve their\\nability to do so.\\n Specifically, we introduce 1) WinogroundVQA, a new compositional visual\\nquestion answering benchmark, 2) Syntactic Neural Module Distillation, a\\nmeasure of compositional ability in sentence embedding models, 3) Causal\\nTracing for Image Captioning Models to locate neural representations vital for\\nvision-language composition, 4) Syntactic MeanPool to inject a compositional\\ninductive bias into sentence embeddings, and 5) Cross-modal Attention\\nCongruence Regularization, a self-supervised objective function for\\nvision-language relation alignment. We close by discussing connections of our\\nwork to neuroscience, psycholinguistics, formal semantics, and philosophy.\",\n", - " 'No',\n", - " -0.008887106),\n", - " ('how do bi-encoders work for sentence embeddings',\n", - " \"Evaluating the Construct Validity of Text Embeddings with Application to Survey Questions: Text embedding models from Natural Language Processing can map text data\\n(e.g. words, sentences, documents) to supposedly meaningful numerical\\nrepresentations (a.k.a. text embeddings). While such models are increasingly\\napplied in social science research, one important issue is often not addressed:\\nthe extent to which these embeddings are valid representations of constructs\\nrelevant for social science research. We therefore propose the use of the\\nclassic construct validity framework to evaluate the validity of text\\nembeddings. We show how this framework can be adapted to the opaque and\\nhigh-dimensional nature of text embeddings, with application to survey\\nquestions. We include several popular text embedding methods (e.g. fastText,\\nGloVe, BERT, Sentence-BERT, Universal Sentence Encoder) in our construct\\nvalidity analyses. We find evidence of convergent and discriminant validity in\\nsome cases. We also show that embeddings can be used to predict respondent's\\nanswers to completely new survey questions. Furthermore, BERT-based embedding\\ntechniques and the Universal Sentence Encoder provide more valid\\nrepresentations of survey questions than do others. 
Our results thus highlight\\nthe necessity to examine the construct validity of text embeddings before\\ndeploying them in social science research.\",\n", - " 'No',\n", - " -0.008583762),\n", - " ('how do bi-encoders work for sentence embeddings',\n", - " 'Learning Probabilistic Sentence Representations from Paraphrases: Probabilistic word embeddings have shown effectiveness in capturing notions\\nof generality and entailment, but there is very little work on doing the\\nanalogous type of investigation for sentences. In this paper we define\\nprobabilistic models that produce distributions for sentences. Our\\nbest-performing model treats each word as a linear transformation operator\\napplied to a multivariate Gaussian distribution. We train our models on\\nparaphrases and demonstrate that they naturally capture sentence specificity.\\nWhile our proposed model achieves the best performance overall, we also show\\nthat specificity is represented by simpler architectures via the norm of the\\nsentence vectors. Qualitative analysis shows that our probabilistic model\\ncaptures sentential entailment and provides ways to analyze the specificity and\\npreciseness of individual words.',\n", - " 'No',\n", - " -0.011975748),\n", - " ('how do bi-encoders work for sentence embeddings',\n", - " \"Exploiting Twitter as Source of Large Corpora of Weakly Similar Pairs for Semantic Sentence Embeddings: Semantic sentence embeddings are usually supervisedly built minimizing\\ndistances between pairs of embeddings of sentences labelled as semantically\\nsimilar by annotators. Since big labelled datasets are rare, in particular for\\nnon-English languages, and expensive, recent studies focus on unsupervised\\napproaches that require not-paired input sentences. We instead propose a\\nlanguage-independent approach to build large datasets of pairs of informal\\ntexts weakly similar, without manual human effort, exploiting Twitter's\\nintrinsic powerful signals of relatedness: replies and quotes of tweets. We use\\nthe collected pairs to train a Transformer model with triplet-like structures,\\nand we test the generated embeddings on Twitter NLP similarity tasks (PIT and\\nTURL) and STSb. We also introduce four new sentence ranking evaluation\\nbenchmarks of informal texts, carefully extracted from the initial collections\\nof tweets, proving not only that our best model learns classical Semantic\\nTextual Similarity, but also excels on tasks where pairs of sentences are not\\nexact paraphrases. Ablation studies reveal how increasing the corpus size\\ninfluences positively the results, even at 2M samples, suggesting that bigger\\ncollections of Tweets still do not contain redundant information about semantic\\nsimilarities.\",\n", - " 'No',\n", - " -0.01219046),\n", - " ('how do bi-encoders work for sentence embeddings',\n", - " \"How to Probe Sentence Embeddings in Low-Resource Languages: On Structural Design Choices for Probing Task Evaluation: Sentence encoders map sentences to real valued vectors for use in downstream\\napplications. To peek into these representations - e.g., to increase\\ninterpretability of their results - probing tasks have been designed which\\nquery them for linguistic knowledge. However, designing probing tasks for\\nlesser-resourced languages is tricky, because these often lack large-scale\\nannotated data or (high-quality) dependency parsers as a prerequisite of\\nprobing task design in English. 
To investigate how to probe sentence embeddings\\nin such cases, we investigate sensitivity of probing task results to structural\\ndesign choices, conducting the first such large scale study. We show that\\ndesign choices like size of the annotated probing dataset and type of\\nclassifier used for evaluation do (sometimes substantially) influence probing\\noutcomes. We then probe embeddings in a multilingual setup with design choices\\nthat lie in a 'stable region', as we identify for English, and find that\\nresults on English do not transfer to other languages. Fairer and more\\ncomprehensive sentence-level probing evaluation should thus be carried out on\\nmultiple languages in the future.\",\n", - " 'No',\n", - " -0.015550519),\n", - " ('how do bi-encoders work for sentence embeddings',\n", - " 'Clustering and Network Analysis for the Embedding Spaces of Sentences and Sub-Sentences: Sentence embedding methods offer a powerful approach for working with short\\ntextual constructs or sequences of words. By representing sentences as dense\\nnumerical vectors, many natural language processing (NLP) applications have\\nimproved their performance. However, relatively little is understood about the\\nlatent structure of sentence embeddings. Specifically, research has not\\naddressed whether the length and structure of sentences impact the sentence\\nembedding space and topology. This paper reports research on a set of\\ncomprehensive clustering and network analyses targeting sentence and\\nsub-sentence embedding spaces. Results show that one method generates the most\\nclusterable embeddings. In general, the embeddings of span sub-sentences have\\nbetter clustering properties than the original sentences. The results have\\nimplications for future sentence embedding models and applications.',\n", - " 'No',\n", - " -0.012663184),\n", - " ('how do bi-encoders work for sentence embeddings',\n", - " 'Vec2Sent: Probing Sentence Embeddings with Natural Language Generation: We introspect black-box sentence embeddings by conditionally generating from\\nthem with the objective to retrieve the underlying discrete sentence. We\\nperceive of this as a new unsupervised probing task and show that it correlates\\nwell with downstream task performance. We also illustrate how the language\\ngenerated from different encoders differs. We apply our approach to generate\\nsentence analogies from sentence embeddings.',\n", - " 'Yes',\n", - " -0.004863006),\n", - " ('how do bi-encoders work for sentence embeddings',\n", - " 'Non-Linguistic Supervision for Contrastive Learning of Sentence Embeddings: Semantic representation learning for sentences is an important and\\nwell-studied problem in NLP. The current trend for this task involves training\\na Transformer-based sentence encoder through a contrastive objective with text,\\ni.e., clustering sentences with semantically similar meanings and scattering\\nothers. In this work, we find the performance of Transformer models as sentence\\nencoders can be improved by training with multi-modal multi-task losses, using\\nunpaired examples from another modality (e.g., sentences and unrelated\\nimage/audio data). In particular, besides learning by the contrastive loss on\\ntext, our model clusters examples from a non-linguistic domain (e.g.,\\nvisual/audio) with a similar contrastive loss at the same time. The reliance of\\nour framework on unpaired non-linguistic data makes it language-agnostic,\\nenabling it to be widely applicable beyond English NLP. 
Experiments on 7\\nsemantic textual similarity benchmarks reveal that models trained with the\\nadditional non-linguistic (images/audio) contrastive objective lead to higher\\nquality sentence embeddings. This indicates that Transformer models are able to\\ngeneralize better by doing a similar task (i.e., clustering) with unpaired\\nexamples from different modalities in a multi-task fashion.',\n", - " 'No',\n", - " -0.013869206)]" + "cell_type": "code", + "execution_count": 2, + "id": "bf16c893", + "metadata": {}, + "outputs": [], + "source": [ + "query = \"how do bi-encoders work for sentence embeddings\"\n", + "search = arxiv.Search(\n", + " query=query, max_results=20, sort_by=arxiv.SortCriterion.Relevance\n", + ")" ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "output_list[:10]" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "id": "29a4dc08", - "metadata": {}, - "outputs": [ + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "4b020a1b", + "metadata": {}, + "outputs": [], + "source": [ + "result_list = []\n", + "\n", + "for result in search.results():\n", + " result_dict = {}\n", + "\n", + " result_dict.update({\"title\": result.title})\n", + " result_dict.update({\"summary\": result.summary})\n", + "\n", + " # Taking the first url provided\n", + " result_dict.update({\"article_url\": [x.href for x in result.links][0]})\n", + " result_dict.update({\"pdf_url\": [x.href for x in result.links][1]})\n", + " result_list.append(result_dict)" + ] + }, { - "data": { - "text/html": [ - "
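For reference, each arXiv hit gathered above is later collapsed into a single "title: summary" string before being scored by the cross-encoder. A minimal sketch of that preparation step, assuming `result_list` from the cells above; the `max_chars` truncation is an extra guard added here, not something the notebook itself does:

```python
def to_document(result: dict, max_chars: int = 2000) -> str:
    # Collapse one arXiv hit into the single text block the cross-encoder will judge.
    # Truncation is a hypothetical guard against very long abstracts.
    return f"{result['title']}: {result['summary']}"[:max_chars]

# Assumes result_list built from the search results above
documents = [to_document(r) for r in result_list]
```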
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
indexquerydocumentpredictionlogprobsprobabilityyes_probability
00how do bi-encoders work for sentence embeddingsSBERT studies Meaning Representations: Decompo...Yes-0.0532640.9481300.948130
11how do bi-encoders work for sentence embeddingsAre Classes Clusters?: Sentence embedding mode...No-0.0095350.9905100.009490
22how do bi-encoders work for sentence embeddingsSemantic Composition in Visually Grounded Lang...No-0.0088870.9911520.008848
33how do bi-encoders work for sentence embeddingsEvaluating the Construct Validity of Text Embe...No-0.0085840.9914530.008547
44how do bi-encoders work for sentence embeddingsLearning Probabilistic Sentence Representation...No-0.0119760.9880960.011904
\n", - "
" + "cell_type": "code", + "execution_count": 4, + "id": "4fdce882", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'title': 'SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features',\n", + " 'summary': 'Models based on large-pretrained language models, such as S(entence)BERT,\\nprovide effective and efficient sentence embeddings that show high correlation\\nto human similarity ratings, but lack interpretability. On the other hand,\\ngraph metrics for graph-based meaning representations (e.g., Abstract Meaning\\nRepresentation, AMR) can make explicit the semantic aspects in which two\\nsentences are similar. However, such metrics tend to be slow, rely on parsers,\\nand do not reach state-of-the-art performance when rating sentence similarity.\\n In this work, we aim at the best of both worlds, by learning to induce\\n$S$emantically $S$tructured $S$entence BERT embeddings (S$^3$BERT). Our\\nS$^3$BERT embeddings are composed of explainable sub-embeddings that emphasize\\nvarious semantic sentence features (e.g., semantic roles, negation, or\\nquantification). We show how to i) learn a decomposition of the sentence\\nembeddings into semantic features, through approximation of a suite of\\ninterpretable AMR graph metrics, and how to ii) preserve the overall power of\\nthe neural embeddings by controlling the decomposition learning process with a\\nsecond objective that enforces consistency with the similarity ratings of an\\nSBERT teacher model. In our experimental studies, we show that our approach\\noffers interpretability -- while fully preserving the effectiveness and\\nefficiency of the neural sentence embeddings.',\n", + " 'article_url': 'http://arxiv.org/abs/2206.07023v2',\n", + " 'pdf_url': 'http://arxiv.org/pdf/2206.07023v2'}" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } ], - "text/plain": [ - " index query \\\n", - "0 0 how do bi-encoders work for sentence embeddings \n", - "1 1 how do bi-encoders work for sentence embeddings \n", - "2 2 how do bi-encoders work for sentence embeddings \n", - "3 3 how do bi-encoders work for sentence embeddings \n", - "4 4 how do bi-encoders work for sentence embeddings \n", - "\n", - " document prediction logprobs \\\n", - "0 SBERT studies Meaning Representations: Decompo... Yes -0.053264 \n", - "1 Are Classes Clusters?: Sentence embedding mode... No -0.009535 \n", - "2 Semantic Composition in Visually Grounded Lang... No -0.008887 \n", - "3 Evaluating the Construct Validity of Text Embe... No -0.008584 \n", - "4 Learning Probabilistic Sentence Representation... 
No -0.011976 \n", - "\n", - " probability yes_probability \n", - "0 0.948130 0.948130 \n", - "1 0.990510 0.009490 \n", - "2 0.991152 0.008848 \n", - "3 0.991453 0.008547 \n", - "4 0.988096 0.011904 " + "source": [ + "result_list[0]" ] - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "output_df = pd.DataFrame(\n", - " output_list, columns=[\"query\", \"document\", \"prediction\", \"logprobs\"]\n", - ").reset_index()\n", - "# Use exp() to convert logprobs into probability\n", - "output_df[\"probability\"] = output_df[\"logprobs\"].apply(exp)\n", - "# Reorder based on likelihood of being Yes\n", - "output_df[\"yes_probability\"] = output_df.apply(\n", - " lambda x: x[\"probability\"] * -1 + 1\n", - " if x[\"prediction\"] == \"No\"\n", - " else x[\"probability\"],\n", - " axis=1,\n", - ")\n", - "output_df.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "id": "a647f120", - "metadata": {}, - "outputs": [ + }, { - "data": { - "text/html": [ - "
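The logprob attached to each prediction above converts to a probability with `exp()`, and when the predicted token is " No" the probability of " Yes" is taken as its complement. A minimal sketch of that arithmetic, using logprob values copied from the outputs above:

```python
from math import exp

def relevance_probability(prediction: str, logprob: float) -> float:
    # exp(logprob) is the probability of the token the model actually emitted.
    p = exp(logprob)
    # When that token was " No", take the complement as the probability of " Yes".
    # With the logit bias restricting output to Yes/No this is a fair approximation.
    return p if prediction.strip() == "Yes" else 1 - p

print(relevance_probability("Yes", -0.053264))  # ~0.948, the SBERT paper above
print(relevance_probability("No", -0.009535))   # ~0.009, the "Are Classes Clusters?" paper
```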
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
level_0indexquerydocumentpredictionlogprobsprobabilityyes_probability
01616how do bi-encoders work for sentence embeddingsIn Search for Linear Relations in Sentence Emb...Yes-0.0048240.9951870.995187
188how do bi-encoders work for sentence embeddingsVec2Sent: Probing Sentence Embeddings with Nat...Yes-0.0048630.9951490.995149
21919how do bi-encoders work for sentence embeddingsRelational Sentence Embedding for Flexible Sem...Yes-0.0388140.9619300.961930
300how do bi-encoders work for sentence embeddingsSBERT studies Meaning Representations: Decompo...Yes-0.0532640.9481300.948130
41515how do bi-encoders work for sentence embeddingsSentence-T5: Scalable Sentence Encoders from P...No-0.2918930.7468490.253151
566how do bi-encoders work for sentence embeddingsHow to Probe Sentence Embeddings in Low-Resour...No-0.0155510.9845700.015430
61818how do bi-encoders work for sentence embeddingsEfficient and Flexible Topic Modeling using Pr...No-0.0152960.9848200.015180
799how do bi-encoders work for sentence embeddingsNon-Linguistic Supervision for Contrastive Lea...No-0.0138690.9862270.013773
81212how do bi-encoders work for sentence embeddingsCharacter-based Neural Networks for Sentence P...No-0.0128660.9872160.012784
977how do bi-encoders work for sentence embeddingsClustering and Network Analysis for the Embedd...No-0.0126630.9874170.012583
\n", - "
" + "cell_type": "code", + "execution_count": 5, + "id": "7e6abb5b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1: SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features\n", + "2: Are Classes Clusters?\n", + "3: Semantic Composition in Visually Grounded Language Models\n", + "4: Evaluating the Construct Validity of Text Embeddings with Application to Survey Questions\n", + "5: Learning Probabilistic Sentence Representations from Paraphrases\n", + "6: Exploiting Twitter as Source of Large Corpora of Weakly Similar Pairs for Semantic Sentence Embeddings\n", + "7: How to Probe Sentence Embeddings in Low-Resource Languages: On Structural Design Choices for Probing Task Evaluation\n", + "8: Clustering and Network Analysis for the Embedding Spaces of Sentences and Sub-Sentences\n", + "9: Vec2Sent: Probing Sentence Embeddings with Natural Language Generation\n", + "10: Non-Linguistic Supervision for Contrastive Learning of Sentence Embeddings\n", + "11: SentPWNet: A Unified Sentence Pair Weighting Network for Task-specific Sentence Embedding\n", + "12: Learning Joint Representations of Videos and Sentences with Web Image Search\n", + "13: Character-based Neural Networks for Sentence Pair Modeling\n", + "14: Train Once, Test Anywhere: Zero-Shot Learning for Text Classification\n", + "15: Hierarchical GPT with Congruent Transformers for Multi-Sentence Language Models\n", + "16: Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models\n", + "17: In Search for Linear Relations in Sentence Embedding Spaces\n", + "18: Learning to Borrow -- Relation Representation for Without-Mention Entity-Pairs for Knowledge Graph Completion\n", + "19: Efficient and Flexible Topic Modeling using Pretrained Embeddings and Bag of Sentences\n", + "20: Relational Sentence Embedding for Flexible Semantic Matching\n" + ] + } ], - "text/plain": [ - " level_0 index query \\\n", - "0 16 16 how do bi-encoders work for sentence embeddings \n", - "1 8 8 how do bi-encoders work for sentence embeddings \n", - "2 19 19 how do bi-encoders work for sentence embeddings \n", - "3 0 0 how do bi-encoders work for sentence embeddings \n", - "4 15 15 how do bi-encoders work for sentence embeddings \n", - "5 6 6 how do bi-encoders work for sentence embeddings \n", - "6 18 18 how do bi-encoders work for sentence embeddings \n", - "7 9 9 how do bi-encoders work for sentence embeddings \n", - "8 12 12 how do bi-encoders work for sentence embeddings \n", - "9 7 7 how do bi-encoders work for sentence embeddings \n", - "\n", - " document prediction logprobs \\\n", - "0 In Search for Linear Relations in Sentence Emb... Yes -0.004824 \n", - "1 Vec2Sent: Probing Sentence Embeddings with Nat... Yes -0.004863 \n", - "2 Relational Sentence Embedding for Flexible Sem... Yes -0.038814 \n", - "3 SBERT studies Meaning Representations: Decompo... Yes -0.053264 \n", - "4 Sentence-T5: Scalable Sentence Encoders from P... No -0.291893 \n", - "5 How to Probe Sentence Embeddings in Low-Resour... No -0.015551 \n", - "6 Efficient and Flexible Topic Modeling using Pr... No -0.015296 \n", - "7 Non-Linguistic Supervision for Contrastive Lea... No -0.013869 \n", - "8 Character-based Neural Networks for Sentence P... No -0.012866 \n", - "9 Clustering and Network Analysis for the Embedd... 
No -0.012663 \n", - "\n", - " probability yes_probability \n", - "0 0.995187 0.995187 \n", - "1 0.995149 0.995149 \n", - "2 0.961930 0.961930 \n", - "3 0.948130 0.948130 \n", - "4 0.746849 0.253151 \n", - "5 0.984570 0.015430 \n", - "6 0.984820 0.015180 \n", - "7 0.986227 0.013773 \n", - "8 0.987216 0.012784 \n", - "9 0.987417 0.012583 " + "source": [ + "for i, result in enumerate(result_list):\n", + " print(f\"{i + 1}: {result['title']}\")" ] - }, - "execution_count": 13, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Return reranked results\n", - "reranked_df = output_df.sort_values(\n", - " by=[\"yes_probability\"], ascending=False\n", - ").reset_index()\n", - "reranked_df.head(10)" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "id": "610b2c7f", - "metadata": {}, - "outputs": [ + }, { - "data": { - "text/plain": [ - "'In Search for Linear Relations in Sentence Embedding Spaces: We present an introductory investigation into continuous-space vector\\nrepresentations of sentences. We acquire pairs of very similar sentences\\ndiffering only by a small alterations (such as change of a noun, adding an\\nadjective, noun or punctuation) from datasets for natural language inference\\nusing a simple pattern method. We look into how such a small change within the\\nsentence text affects its representation in the continuous space and how such\\nalterations are reflected by some of the popular sentence embedding models. We\\nfound that vector differences of some embeddings actually reflect small changes\\nwithin a sentence.'" + "cell_type": "markdown", + "id": "d5727678", + "metadata": {}, + "source": [ + "## Cross-encoder\n", + "\n", + "We'll create a cross-encoder using the ```Completions``` endpoint - the key factors to consider here are:\n", + "- Make your examples domain-specific - the strength of cross-encoders comes when you tailor them to your domain.\n", + "- There is a trade-off between how many potential examples to re-rank vs. processing speed. Consider batching and parallel processing cross-encoder requests to process them more quickly.\n", + "\n", + "The steps here are:\n", + "- Build a prompt to assess relevance and provide few-shot examples to tune it to your domain.\n", + "- Add a ```logit bias``` for the tokens for ``` Yes``` and ``` No``` to decrease the likelihood of any other tokens occurring.\n", + "- Return the classification of yes/no as well as the ```logprobs```.\n", + "- Rerank the results by the ```logprobs``` keyed on ``` Yes```." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "ca634bf9", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "([3363], [1400])" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "tokens = [\" Yes\", \" No\"]\n", + "tokenizer = tiktoken.encoding_for_model(\"text-davinci-003\")\n", + "ids = [tokenizer.encode(token) for token in tokens]\n", + "ids[0], ids[1]" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "4fdf8c11", + "metadata": {}, + "outputs": [], + "source": [ + "prompt = '''\n", + "You are an Assistant responsible for helping detect whether the retrieved document is relevant to the query. 
For a given input, you need to output a single token: \"Yes\" or \"No\" indicating the retrieved document is relevant to the query.\n", + "\n", + "Query: How to plant a tree?\n", + "Document: \"\"\"Cars were invented in 1886, when German inventor Carl Benz patented his Benz Patent-Motorwagen.[3][4][5] Cars became widely available during the 20th century. One of the first cars affordable by the masses was the 1908 Model T, an American car manufactured by the Ford Motor Company. Cars were rapidly adopted in the US, where they replaced horse-drawn carriages.[6] In Europe and other parts of the world, demand for automobiles did not increase until after World War II.[7] The car is considered an essential part of the developed economy.\"\"\"\n", + "Relevant: No\n", + "\n", + "Query: Has the coronavirus vaccine been approved?\n", + "Document: \"\"\"The Pfizer-BioNTech COVID-19 vaccine was approved for emergency use in the United States on December 11, 2020.\"\"\"\n", + "Relevant: Yes\n", + "\n", + "Query: What is the capital of France?\n", + "Document: \"\"\"Paris, France's capital, is a major European city and a global center for art, fashion, gastronomy and culture. Its 19th-century cityscape is crisscrossed by wide boulevards and the River Seine. Beyond such landmarks as the Eiffel Tower and the 12th-century, Gothic Notre-Dame cathedral, the city is known for its cafe culture and designer boutiques along the Rue du Faubourg Saint-Honoré.\"\"\"\n", + "Relevant: Yes\n", + "\n", + "Query: What are some papers to learn about PPO reinforcement learning?\n", + "Document: \"\"\"Proximal Policy Optimization and its Dynamic Version for Sequence Generation: In sequence generation task, many works use policy gradient for model optimization to tackle the intractable backpropagation issue when maximizing the non-differentiable evaluation metrics or fooling the discriminator in adversarial learning. In this paper, we replace policy gradient with proximal policy optimization (PPO), which is a proved more efficient reinforcement learning algorithm, and propose a dynamic approach for PPO (PPO-dynamic). We demonstrate the efficacy of PPO and PPO-dynamic on conditional sequence generation tasks including synthetic experiment and chit-chat chatbot. The results show that PPO and PPO-dynamic can beat policy gradient by stability and performance.\"\"\"\n", + "Relevant: Yes\n", + "\n", + "Query: Explain sentence embeddings\n", + "Document: \"\"\"Inside the bubble: exploring the environments of reionisation-era Lyman-α emitting galaxies with JADES and FRESCO: We present a study of the environments of 16 Lyman-α emitting galaxies (LAEs) in the reionisation era (5.85%) observed in our sample of LAEs, indicating the presence of ionised ``bubbles'' with physical sizes of the order of 0.1pMpc≲Rion≲1pMpc in a patchy reionisation scenario where the bubbles are embedded in a fully neutral IGM. Around half of the LAEs in our sample are found to coincide with large-scale galaxy overdensities seen in FRESCO at z∼5.8-5.9 and z∼7.3, suggesting Lyman-α transmission is strongly enhanced in such overdense regions, and underlining the importance of LAEs as tracers of the first large-scale ionised bubbles. 
Considering only spectroscopically confirmed galaxies, we find our sample of UV-faint LAEs (MUV≳−20mag) and their direct neighbours are generally not able to produce the required ionised regions based on the Lyman-α transmission properties, suggesting lower-luminosity sources likely play an important role in carving out these bubbles. These observations demonstrate the combined power of JWST multi-object and slitless spectroscopy in acquiring a unique view of the early stages of Cosmic Reionisation via the most distant LAEs.\"\"\"\n", + "Relevant: No\n", + "\n", + "Query: {query}\n", + "Document: \"\"\"{document}\"\"\"\n", + "Relevant:\n", + "'''\n", + "\n", + "\n", + "@retry(wait=wait_random_exponential(min=1, max=40), stop=stop_after_attempt(3))\n", + "def document_relevance(query, document):\n", + " response = openai.Completion.create(\n", + " model=\"text-davinci-003\",\n", + " prompt=prompt.format(query=query, document=content),\n", + " temperature=0,\n", + " logprobs=1,\n", + " logit_bias={3363: 1, 1400: 1},\n", + " )\n", + "\n", + " return (\n", + " query,\n", + " document,\n", + " response[\"choices\"][0][\"text\"],\n", + " response[\"choices\"][0][\"logprobs\"][\"token_logprobs\"][0],\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "753cd363", + "metadata": {}, + "outputs": [], + "source": [ + "content = result_list[0][\"title\"] + \": \" + result_list[0][\"summary\"]\n", + "\n", + "# Set logprobs to 1 so our response will include the most probable token the model identified\n", + "response = openai.Completion.create(\n", + " model=\"text-davinci-003\",\n", + " prompt=prompt.format(query=query, document=content),\n", + " temperature=0,\n", + " logprobs=1,\n", + " logit_bias={3363: 1, 1400: 1},\n", + " max_tokens=1,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "7efef2fe", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Result was Yes\n", + "Logprobs was -0.05869877\n", + "\n", + "Below is the full logprobs object\n", + "\n", + "\n", + "{\n", + " \"tokens\": [\n", + " \"Yes\"\n", + " ],\n", + " \"token_logprobs\": [\n", + " -0.05869877\n", + " ],\n", + " \"top_logprobs\": [\n", + " {\n", + " \"Yes\": -0.05869877\n", + " }\n", + " ],\n", + " \"text_offset\": [\n", + " 5764\n", + " ]\n", + "}\n" + ] + } + ], + "source": [ + "result = response[\"choices\"][0]\n", + "print(f\"Result was {result['text']}\")\n", + "print(f\"Logprobs was {result['logprobs']['token_logprobs'][0]}\")\n", + "print(\"\\nBelow is the full logprobs object\\n\\n\")\n", + "print(result[\"logprobs\"])" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "7683b6f7", + "metadata": {}, + "outputs": [], + "source": [ + "output_list = []\n", + "for x in result_list:\n", + " content = x[\"title\"] + \": \" + x[\"summary\"]\n", + "\n", + " try:\n", + " output_list.append(document_relevance(query, document=content))\n", + "\n", + " except Exception as e:\n", + " print(e)" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "57576313", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[('how do bi-encoders work for sentence embeddings',\n", + " 'SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features: Models based on large-pretrained language models, such as S(entence)BERT,\\nprovide effective and efficient sentence embeddings that show high correlation\\nto human similarity ratings, but lack 
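The scoring loop above issues one completion request per document, so latency grows linearly with the number of candidates. One way to act on the earlier advice about batching and parallel processing is to fan the requests out over a thread pool. This is a rough sketch rather than the notebook's own code: `max_workers` is an arbitrary guess to tune against your rate limits, and it assumes the relevance helper scores the document passed to it (as written, `document_relevance` formats the module-level `content` variable rather than its `document` argument, so that argument should be threaded through first).

```python
from concurrent.futures import ThreadPoolExecutor

# Assumes query, result_list and document_relevance from the cells above
documents = [x["title"] + ": " + x["summary"] for x in result_list]

# Score all candidates concurrently; each call is an independent completion request.
with ThreadPoolExecutor(max_workers=5) as executor:
    parallel_output = list(
        executor.map(lambda doc: document_relevance(query, document=doc), documents)
    )
```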
interpretability. On the other hand,\\ngraph metrics for graph-based meaning representations (e.g., Abstract Meaning\\nRepresentation, AMR) can make explicit the semantic aspects in which two\\nsentences are similar. However, such metrics tend to be slow, rely on parsers,\\nand do not reach state-of-the-art performance when rating sentence similarity.\\n In this work, we aim at the best of both worlds, by learning to induce\\n$S$emantically $S$tructured $S$entence BERT embeddings (S$^3$BERT). Our\\nS$^3$BERT embeddings are composed of explainable sub-embeddings that emphasize\\nvarious semantic sentence features (e.g., semantic roles, negation, or\\nquantification). We show how to i) learn a decomposition of the sentence\\nembeddings into semantic features, through approximation of a suite of\\ninterpretable AMR graph metrics, and how to ii) preserve the overall power of\\nthe neural embeddings by controlling the decomposition learning process with a\\nsecond objective that enforces consistency with the similarity ratings of an\\nSBERT teacher model. In our experimental studies, we show that our approach\\noffers interpretability -- while fully preserving the effectiveness and\\nefficiency of the neural sentence embeddings.',\n", + " 'Yes',\n", + " -0.05326408),\n", + " ('how do bi-encoders work for sentence embeddings',\n", + " 'Are Classes Clusters?: Sentence embedding models aim to provide general purpose embeddings for\\nsentences. Most of the models studied in this paper claim to perform well on\\nSTS tasks - but they do not report on their suitability for clustering. This\\npaper looks at four recent sentence embedding models (Universal Sentence\\nEncoder (Cer et al., 2018), Sentence-BERT (Reimers and Gurevych, 2019), LASER\\n(Artetxe and Schwenk, 2019), and DeCLUTR (Giorgi et al., 2020)). It gives a\\nbrief overview of the ideas behind their implementations. It then investigates\\nhow well topic classes in two text classification datasets (Amazon Reviews (Ni\\net al., 2019) and News Category Dataset (Misra, 2018)) map to clusters in their\\ncorresponding sentence embedding space. While the performance of the resulting\\nclassification model is far from perfect, it is better than random. This is\\ninteresting because the classification model has been constructed in an\\nunsupervised way. The topic classes in these real life topic classification\\ndatasets can be partly reconstructed by clustering the corresponding sentence\\nembeddings.',\n", + " 'No',\n", + " -0.009535169),\n", + " ('how do bi-encoders work for sentence embeddings',\n", + " \"Semantic Composition in Visually Grounded Language Models: What is sentence meaning and its ideal representation? Much of the expressive\\npower of human language derives from semantic composition, the mind's ability\\nto represent meaning hierarchically & relationally over constituents. At the\\nsame time, much sentential meaning is outside the text and requires grounding\\nin sensory, motor, and experiential modalities to be adequately learned.\\nAlthough large language models display considerable compositional ability,\\nrecent work shows that visually-grounded language models drastically fail to\\nrepresent compositional structure. 
In this thesis, we explore whether & how\\nmodels compose visually grounded semantics, and how we might improve their\\nability to do so.\\n Specifically, we introduce 1) WinogroundVQA, a new compositional visual\\nquestion answering benchmark, 2) Syntactic Neural Module Distillation, a\\nmeasure of compositional ability in sentence embedding models, 3) Causal\\nTracing for Image Captioning Models to locate neural representations vital for\\nvision-language composition, 4) Syntactic MeanPool to inject a compositional\\ninductive bias into sentence embeddings, and 5) Cross-modal Attention\\nCongruence Regularization, a self-supervised objective function for\\nvision-language relation alignment. We close by discussing connections of our\\nwork to neuroscience, psycholinguistics, formal semantics, and philosophy.\",\n", + " 'No',\n", + " -0.008887106),\n", + " ('how do bi-encoders work for sentence embeddings',\n", + " \"Evaluating the Construct Validity of Text Embeddings with Application to Survey Questions: Text embedding models from Natural Language Processing can map text data\\n(e.g. words, sentences, documents) to supposedly meaningful numerical\\nrepresentations (a.k.a. text embeddings). While such models are increasingly\\napplied in social science research, one important issue is often not addressed:\\nthe extent to which these embeddings are valid representations of constructs\\nrelevant for social science research. We therefore propose the use of the\\nclassic construct validity framework to evaluate the validity of text\\nembeddings. We show how this framework can be adapted to the opaque and\\nhigh-dimensional nature of text embeddings, with application to survey\\nquestions. We include several popular text embedding methods (e.g. fastText,\\nGloVe, BERT, Sentence-BERT, Universal Sentence Encoder) in our construct\\nvalidity analyses. We find evidence of convergent and discriminant validity in\\nsome cases. We also show that embeddings can be used to predict respondent's\\nanswers to completely new survey questions. Furthermore, BERT-based embedding\\ntechniques and the Universal Sentence Encoder provide more valid\\nrepresentations of survey questions than do others. Our results thus highlight\\nthe necessity to examine the construct validity of text embeddings before\\ndeploying them in social science research.\",\n", + " 'No',\n", + " -0.008583762),\n", + " ('how do bi-encoders work for sentence embeddings',\n", + " 'Learning Probabilistic Sentence Representations from Paraphrases: Probabilistic word embeddings have shown effectiveness in capturing notions\\nof generality and entailment, but there is very little work on doing the\\nanalogous type of investigation for sentences. In this paper we define\\nprobabilistic models that produce distributions for sentences. Our\\nbest-performing model treats each word as a linear transformation operator\\napplied to a multivariate Gaussian distribution. We train our models on\\nparaphrases and demonstrate that they naturally capture sentence specificity.\\nWhile our proposed model achieves the best performance overall, we also show\\nthat specificity is represented by simpler architectures via the norm of the\\nsentence vectors. 
Qualitative analysis shows that our probabilistic model\\ncaptures sentential entailment and provides ways to analyze the specificity and\\npreciseness of individual words.',\n", + " 'No',\n", + " -0.011975748),\n", + " ('how do bi-encoders work for sentence embeddings',\n", + " \"Exploiting Twitter as Source of Large Corpora of Weakly Similar Pairs for Semantic Sentence Embeddings: Semantic sentence embeddings are usually supervisedly built minimizing\\ndistances between pairs of embeddings of sentences labelled as semantically\\nsimilar by annotators. Since big labelled datasets are rare, in particular for\\nnon-English languages, and expensive, recent studies focus on unsupervised\\napproaches that require not-paired input sentences. We instead propose a\\nlanguage-independent approach to build large datasets of pairs of informal\\ntexts weakly similar, without manual human effort, exploiting Twitter's\\nintrinsic powerful signals of relatedness: replies and quotes of tweets. We use\\nthe collected pairs to train a Transformer model with triplet-like structures,\\nand we test the generated embeddings on Twitter NLP similarity tasks (PIT and\\nTURL) and STSb. We also introduce four new sentence ranking evaluation\\nbenchmarks of informal texts, carefully extracted from the initial collections\\nof tweets, proving not only that our best model learns classical Semantic\\nTextual Similarity, but also excels on tasks where pairs of sentences are not\\nexact paraphrases. Ablation studies reveal how increasing the corpus size\\ninfluences positively the results, even at 2M samples, suggesting that bigger\\ncollections of Tweets still do not contain redundant information about semantic\\nsimilarities.\",\n", + " 'No',\n", + " -0.01219046),\n", + " ('how do bi-encoders work for sentence embeddings',\n", + " \"How to Probe Sentence Embeddings in Low-Resource Languages: On Structural Design Choices for Probing Task Evaluation: Sentence encoders map sentences to real valued vectors for use in downstream\\napplications. To peek into these representations - e.g., to increase\\ninterpretability of their results - probing tasks have been designed which\\nquery them for linguistic knowledge. However, designing probing tasks for\\nlesser-resourced languages is tricky, because these often lack large-scale\\nannotated data or (high-quality) dependency parsers as a prerequisite of\\nprobing task design in English. To investigate how to probe sentence embeddings\\nin such cases, we investigate sensitivity of probing task results to structural\\ndesign choices, conducting the first such large scale study. We show that\\ndesign choices like size of the annotated probing dataset and type of\\nclassifier used for evaluation do (sometimes substantially) influence probing\\noutcomes. We then probe embeddings in a multilingual setup with design choices\\nthat lie in a 'stable region', as we identify for English, and find that\\nresults on English do not transfer to other languages. Fairer and more\\ncomprehensive sentence-level probing evaluation should thus be carried out on\\nmultiple languages in the future.\",\n", + " 'No',\n", + " -0.015550519),\n", + " ('how do bi-encoders work for sentence embeddings',\n", + " 'Clustering and Network Analysis for the Embedding Spaces of Sentences and Sub-Sentences: Sentence embedding methods offer a powerful approach for working with short\\ntextual constructs or sequences of words. 
By representing sentences as dense\\nnumerical vectors, many natural language processing (NLP) applications have\\nimproved their performance. However, relatively little is understood about the\\nlatent structure of sentence embeddings. Specifically, research has not\\naddressed whether the length and structure of sentences impact the sentence\\nembedding space and topology. This paper reports research on a set of\\ncomprehensive clustering and network analyses targeting sentence and\\nsub-sentence embedding spaces. Results show that one method generates the most\\nclusterable embeddings. In general, the embeddings of span sub-sentences have\\nbetter clustering properties than the original sentences. The results have\\nimplications for future sentence embedding models and applications.',\n", + " 'No',\n", + " -0.012663184),\n", + " ('how do bi-encoders work for sentence embeddings',\n", + " 'Vec2Sent: Probing Sentence Embeddings with Natural Language Generation: We introspect black-box sentence embeddings by conditionally generating from\\nthem with the objective to retrieve the underlying discrete sentence. We\\nperceive of this as a new unsupervised probing task and show that it correlates\\nwell with downstream task performance. We also illustrate how the language\\ngenerated from different encoders differs. We apply our approach to generate\\nsentence analogies from sentence embeddings.',\n", + " 'Yes',\n", + " -0.004863006),\n", + " ('how do bi-encoders work for sentence embeddings',\n", + " 'Non-Linguistic Supervision for Contrastive Learning of Sentence Embeddings: Semantic representation learning for sentences is an important and\\nwell-studied problem in NLP. The current trend for this task involves training\\na Transformer-based sentence encoder through a contrastive objective with text,\\ni.e., clustering sentences with semantically similar meanings and scattering\\nothers. In this work, we find the performance of Transformer models as sentence\\nencoders can be improved by training with multi-modal multi-task losses, using\\nunpaired examples from another modality (e.g., sentences and unrelated\\nimage/audio data). In particular, besides learning by the contrastive loss on\\ntext, our model clusters examples from a non-linguistic domain (e.g.,\\nvisual/audio) with a similar contrastive loss at the same time. The reliance of\\nour framework on unpaired non-linguistic data makes it language-agnostic,\\nenabling it to be widely applicable beyond English NLP. Experiments on 7\\nsemantic textual similarity benchmarks reveal that models trained with the\\nadditional non-linguistic (/images/audio) contrastive objective lead to higher\\nquality sentence embeddings. This indicates that Transformer models are able to\\ngeneralize better by doing a similar task (i.e., clustering) with unpaired\\nexamples from different modalities in a multi-task fashion.',\n", + " 'No',\n", + " -0.013869206)]" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "output_list[:10]" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "29a4dc08", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
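Each "Yes"/"No" prediction above was constrained by the `logit_bias` built from the tiktoken ids earlier. If you adapt the pattern to another model or another pair of labels, the bias dictionary can be derived from the tokenizer rather than hard-coded; a small sketch, assuming each label still encodes to a single token:

```python
import tiktoken

tokenizer = tiktoken.encoding_for_model("text-davinci-003")

# Derive the logit bias from the tokenizer instead of hard-coding {3363: 1, 1400: 1}.
# Assumes each label string encodes to exactly one token for the chosen model.
logit_bias = {tokenizer.encode(label)[0]: 1 for label in (" Yes", " No")}
print(logit_bias)  # {3363: 1, 1400: 1} for text-davinci-003
```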
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
indexquerydocumentpredictionlogprobsprobabilityyes_probability
00how do bi-encoders work for sentence embeddingsSBERT studies Meaning Representations: Decompo...Yes-0.0532640.9481300.948130
11how do bi-encoders work for sentence embeddingsAre Classes Clusters?: Sentence embedding mode...No-0.0095350.9905100.009490
22how do bi-encoders work for sentence embeddingsSemantic Composition in Visually Grounded Lang...No-0.0088870.9911520.008848
33how do bi-encoders work for sentence embeddingsEvaluating the Construct Validity of Text Embe...No-0.0085840.9914530.008547
44how do bi-encoders work for sentence embeddingsLearning Probabilistic Sentence Representation...No-0.0119760.9880960.011904
\n", + "
" + ], + "text/plain": [ + " index query \\\n", + "0 0 how do bi-encoders work for sentence embeddings \n", + "1 1 how do bi-encoders work for sentence embeddings \n", + "2 2 how do bi-encoders work for sentence embeddings \n", + "3 3 how do bi-encoders work for sentence embeddings \n", + "4 4 how do bi-encoders work for sentence embeddings \n", + "\n", + " document prediction logprobs \\\n", + "0 SBERT studies Meaning Representations: Decompo... Yes -0.053264 \n", + "1 Are Classes Clusters?: Sentence embedding mode... No -0.009535 \n", + "2 Semantic Composition in Visually Grounded Lang... No -0.008887 \n", + "3 Evaluating the Construct Validity of Text Embe... No -0.008584 \n", + "4 Learning Probabilistic Sentence Representation... No -0.011976 \n", + "\n", + " probability yes_probability \n", + "0 0.948130 0.948130 \n", + "1 0.990510 0.009490 \n", + "2 0.991152 0.008848 \n", + "3 0.991453 0.008547 \n", + "4 0.988096 0.011904 " + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "output_df = pd.DataFrame(\n", + " output_list, columns=[\"query\", \"document\", \"prediction\", \"logprobs\"]\n", + ").reset_index()\n", + "# Use exp() to convert logprobs into probability\n", + "output_df[\"probability\"] = output_df[\"logprobs\"].apply(exp)\n", + "# Reorder based on likelihood of being Yes\n", + "output_df[\"yes_probability\"] = output_df.apply(\n", + " lambda x: x[\"probability\"] * -1 + 1\n", + " if x[\"prediction\"] == \"No\"\n", + " else x[\"probability\"],\n", + " axis=1,\n", + ")\n", + "output_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "a647f120", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
level_0indexquerydocumentpredictionlogprobsprobabilityyes_probability
01616how do bi-encoders work for sentence embeddingsIn Search for Linear Relations in Sentence Emb...Yes-0.0048240.9951870.995187
188how do bi-encoders work for sentence embeddingsVec2Sent: Probing Sentence Embeddings with Nat...Yes-0.0048630.9951490.995149
21919how do bi-encoders work for sentence embeddingsRelational Sentence Embedding for Flexible Sem...Yes-0.0388140.9619300.961930
300how do bi-encoders work for sentence embeddingsSBERT studies Meaning Representations: Decompo...Yes-0.0532640.9481300.948130
41515how do bi-encoders work for sentence embeddingsSentence-T5: Scalable Sentence Encoders from P...No-0.2918930.7468490.253151
566how do bi-encoders work for sentence embeddingsHow to Probe Sentence Embeddings in Low-Resour...No-0.0155510.9845700.015430
61818how do bi-encoders work for sentence embeddingsEfficient and Flexible Topic Modeling using Pr...No-0.0152960.9848200.015180
799how do bi-encoders work for sentence embeddingsNon-Linguistic Supervision for Contrastive Lea...No-0.0138690.9862270.013773
81212how do bi-encoders work for sentence embeddingsCharacter-based Neural Networks for Sentence P...No-0.0128660.9872160.012784
977how do bi-encoders work for sentence embeddingsClustering and Network Analysis for the Embedd...No-0.0126630.9874170.012583
\n", + "
" + ], + "text/plain": [ + " level_0 index query \\\n", + "0 16 16 how do bi-encoders work for sentence embeddings \n", + "1 8 8 how do bi-encoders work for sentence embeddings \n", + "2 19 19 how do bi-encoders work for sentence embeddings \n", + "3 0 0 how do bi-encoders work for sentence embeddings \n", + "4 15 15 how do bi-encoders work for sentence embeddings \n", + "5 6 6 how do bi-encoders work for sentence embeddings \n", + "6 18 18 how do bi-encoders work for sentence embeddings \n", + "7 9 9 how do bi-encoders work for sentence embeddings \n", + "8 12 12 how do bi-encoders work for sentence embeddings \n", + "9 7 7 how do bi-encoders work for sentence embeddings \n", + "\n", + " document prediction logprobs \\\n", + "0 In Search for Linear Relations in Sentence Emb... Yes -0.004824 \n", + "1 Vec2Sent: Probing Sentence Embeddings with Nat... Yes -0.004863 \n", + "2 Relational Sentence Embedding for Flexible Sem... Yes -0.038814 \n", + "3 SBERT studies Meaning Representations: Decompo... Yes -0.053264 \n", + "4 Sentence-T5: Scalable Sentence Encoders from P... No -0.291893 \n", + "5 How to Probe Sentence Embeddings in Low-Resour... No -0.015551 \n", + "6 Efficient and Flexible Topic Modeling using Pr... No -0.015296 \n", + "7 Non-Linguistic Supervision for Contrastive Lea... No -0.013869 \n", + "8 Character-based Neural Networks for Sentence P... No -0.012866 \n", + "9 Clustering and Network Analysis for the Embedd... No -0.012663 \n", + "\n", + " probability yes_probability \n", + "0 0.995187 0.995187 \n", + "1 0.995149 0.995149 \n", + "2 0.961930 0.961930 \n", + "3 0.948130 0.948130 \n", + "4 0.746849 0.253151 \n", + "5 0.984570 0.015430 \n", + "6 0.984820 0.015180 \n", + "7 0.986227 0.013773 \n", + "8 0.987216 0.012784 \n", + "9 0.987417 0.012583 " + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Return reranked results\n", + "reranked_df = output_df.sort_values(\n", + " by=[\"yes_probability\"], ascending=False\n", + ").reset_index()\n", + "reranked_df.head(10)" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "610b2c7f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'In Search for Linear Relations in Sentence Embedding Spaces: We present an introductory investigation into continuous-space vector\\nrepresentations of sentences. We acquire pairs of very similar sentences\\ndiffering only by a small alterations (such as change of a noun, adding an\\nadjective, noun or punctuation) from datasets for natural language inference\\nusing a simple pattern method. We look into how such a small change within the\\nsentence text affects its representation in the continuous space and how such\\nalterations are reflected by some of the popular sentence embedding models. We\\nfound that vector differences of some embeddings actually reflect small changes\\nwithin a sentence.'" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Inspect our new top document following reranking\n", + "reranked_df[\"document\"][0]" + ] + }, + { + "cell_type": "markdown", + "id": "f372d311", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "We've shown how to create a tailored cross-encoder to rerank academic papers. 
This approach will work best where there are domain-specific nuances that can be used to pick the most relevant corpus for your users, and where some pre-filtering has taken place to limit the amount of data the cross-encoder will need to process. \n", + "\n", + "A few typical use cases we've seen are:\n", + "- Returning a list of 100 most relevant stock reports, then re-ordering into a top 5 or 10 based on the detailed context of a particular set of customer portfolios\n", + "- Running after a classic rules-based search that gets the top 100 or 1000 most relevant results to prune it according to a specific user's context\n", + "\n", + "\n", + "### Taking this forward\n", + "\n", + "Taking the few-shot approach, as we have here, can work well when the domain is general enough that a small number of examples will cover most reranking cases. However, as the differences between documents become more specific you may want to consider the ```Fine-tuning``` endpoint to make a more elaborate cross-encoder with a wider variety of examples.\n", + "\n", + "There is also a latency impact of using ```text-davinci-003``` that you'll need to consider, with even our few examples above taking a couple seconds each - again, the ```Fine-tuning``` endpoint may help you here if you are able to get decent results from an ```ada``` or ```babbage``` fine-tuned model.\n", + "\n", + "We've used the ```Completions``` endpoint from OpenAI to build our cross-encoder, but this area is well-served by the open-source community. [Here](https://huggingface.co/cross-encoder/mmarco-mMiniLMv2-L12-H384-v1) is an example from HuggingFace, for example.\n", + "\n", + "We hope you find this useful for tuning your search use cases, and look forward to seeing what you build." ] - }, - "execution_count": 14, - "metadata": {}, - "output_type": "execute_result" } - ], - "source": [ - "# Inspect our new top document following reranking\n", - "reranked_df[\"document\"][0]" - ] - }, - { - "cell_type": "markdown", - "id": "f372d311", - "metadata": {}, - "source": [ - "## Conclusion\n", - "\n", - "We've shown how to create a tailored cross-encoder to rerank academic papers. This approach will work best where there are domain-specific nuances that can be used to pick the most relevant corpus for your users, and where some pre-filtering has taken place to limit the amount of data the cross-encoder will need to process. \n", - "\n", - "A few typical use cases we've seen are:\n", - "- Returning a list of 100 most relevant stock reports, then re-ordering into a top 5 or 10 based on the detailed context of a particular set of customer portfolios\n", - "- Running after a classic rules-based search that gets the top 100 or 1000 most relevant results to prune it according to a specific user's context\n", - "\n", - "\n", - "### Taking this forward\n", - "\n", - "Taking the few-shot approach, as we have here, can work well when the domain is general enough that a small number of examples will cover most reranking cases. 
However, as the differences between documents become more specific you may want to consider the ```Fine-tuning``` endpoint to make a more elaborate cross-encoder with a wider variety of examples.\n", - "\n", - "There is also a latency impact of using ```text-davinci-003``` that you'll need to consider, with even our few examples above taking a couple seconds each - again, the ```Fine-tuning``` endpoint may help you here if you are able to get decent results from an ```ada``` or ```babbage``` fine-tuned model.\n", - "\n", - "We've used the ```Completions``` endpoint from OpenAI to build our cross-encoder, but this area is well-served by the open-source community. [Here](https://huggingface.co/cross-encoder/mmarco-mMiniLMv2-L12-H384-v1) is an example from HuggingFace, for example.\n", - "\n", - "We hope you find this useful for tuning your search use cases, and look forward to seeing what you build." - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "openai_test", - "language": "python", - "name": "openai_test" + ], + "metadata": { + "kernelspec": { + "display_name": "openai_test", + "language": "python", + "name": "openai_test" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.11" + } }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.11" - } - }, - "nbformat": 4, - "nbformat_minor": 5 + "nbformat": 4, + "nbformat_minor": 5 }
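Following on from the fine-tuning suggestion in the conclusion, this is a sketch of how labelled (query, document, relevance) triples could be written out in the prompt/completion JSONL format used by the legacy Fine-tuning endpoint for models such as `ada` or `babbage`. The file name and the example rows are placeholders, not data from this notebook:

```python
import json

# Hypothetical labelled examples - in practice these would come from your own domain.
training_examples = [
    {"query": "how do bi-encoders work for sentence embeddings",
     "document": "Vec2Sent: Probing Sentence Embeddings with Natural Language Generation: ...",
     "label": " Yes"},
    {"query": "how do bi-encoders work for sentence embeddings",
     "document": "Learning Joint Representations of Videos and Sentences with Web Image Search: ...",
     "label": " No"},
]

with open("cross_encoder_train.jsonl", "w") as f:
    for ex in training_examples:
        record = {
            "prompt": f"Query: {ex['query']}\nDocument: \"\"\"{ex['document']}\"\"\"\nRelevant:",
            "completion": ex["label"],
        }
        f.write(json.dumps(record) + "\n")
```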