Updated several notebooks:
- fixed titles that were inconsistent or broke the ToC sorting order
- added missing source descriptions and links
- fixed formatting
"[ERNIE Embedding-V1](https://cloud.baidu.com/doc/WENXINWORKSHOP/s/alj562vvu) is a text representation model based on Baidu Wenxin's large-scale model technology, \n",
"[ERNIE Embedding-V1](https://cloud.baidu.com/doc/WENXINWORKSHOP/s/alj562vvu) is a text representation model based on `Baidu Wenxin` large-scale model technology, \n",
"which converts text into a vector form represented by numerical values, and is used in text retrieval, information recommendation, knowledge mining and other scenarios."
"which converts text into a vector form represented by numerical values, and is used in text retrieval, information recommendation, knowledge mining and other scenarios."
"[FastEmbed](https://qdrant.github.io/fastembed/) is a lightweight, fast, Python library built for embedding generation. \n",
">[FastEmbed](https://qdrant.github.io/fastembed/) from [Qdrant](https://qdrant.tech) is a lightweight, fast, Python library built for embedding generation. \n",
"\n",
">\n",
"- Quantized model weights\n",
">- Quantized model weights\n",
"- ONNX Runtime, no PyTorch dependency\n",
">- ONNX Runtime, no PyTorch dependency\n",
"- CPU-first design\n",
">- CPU-first design\n",
"- Data-parallelism for encoding of large datasets."
">- Data-parallelism for encoding of large datasets."
"Let's load the HuggingFace instruct Embeddings class."
"\n",
">[Hugging Face sentence-transformers](https://huggingface.co/sentence-transformers) is a Python framework for state-of-the-art sentence, text and image embeddings.\n",
">One of the instruct embedding models is used in the `HuggingFaceInstructEmbeddings` class.\n"
"### Loading the Johnsnowlabs embedding class to generate and query embeddings\n",
">[John Snow Labs](https://nlp.johnsnowlabs.com/) NLP & LLM ecosystem includes software libraries for state-of-the-art AI at scale, Responsible AI, No-Code AI, and access to over 20,000 models for Healthcare, Legal, Finance, etc.\n",
"\n",
">\n",
"Models are loaded with [nlp.load](https://nlp.johnsnowlabs.com/docs/en/jsl/load_api) and spark session is started with [nlp.start()](https://nlp.johnsnowlabs.com/docs/en/jsl/start-a-sparksession) under the hood.\n",
">Models are loaded with [nlp.load](https://nlp.johnsnowlabs.com/docs/en/jsl/load_api) and spark session is started >with [nlp.start()](https://nlp.johnsnowlabs.com/docs/en/jsl/start-a-sparksession) under the hood.\n",
"For all 24.000+ models, see the [John Snow Labs Model Models Hub](https://nlp.johnsnowlabs.com/models)\n"
">For all 24.000+ models, see the [John Snow Labs Model Models Hub](https://nlp.johnsnowlabs.com/models)\n"
],
]
"metadata": {
"collapsed": false
}
},
},
{
{
"cell_type": "markdown",
"cell_type": "markdown",
"metadata": {},
"source": [
"source": [
"! pip install johnsnowlabs\n"
"## Setting up"
],
]
"metadata": {
"collapsed": false
}
},
},
{
{
"cell_type": "code",
"cell_type": "code",
"execution_count": null,
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"outputs": [],
"source": [
"source": [
"# If you have a enterprise license, you can run this to install enterprise features\n",
"#### Define some example texts . These could be any documents that you want to analyze - for example, news articles, social media posts, or product reviews."
],
"metadata": {
"metadata": {
"collapsed": false
"collapsed": false,
}
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"Define some example texts . These could be any documents that you want to analyze - for example, news articles, social media posts, or product reviews."
]
},
},
{
{
"cell_type": "code",
"cell_type": "code",
"execution_count": null,
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"outputs": [],
"source": [
"source": [
"texts = [\"Cancer is caused by smoking\", \"Antibiotics aren't painkiller\"]"
"texts = [\"Cancer is caused by smoking\", \"Antibiotics aren't painkiller\"]"
],
]
"metadata": {
"collapsed": false
}
},
},
{
{
"cell_type": "markdown",
"cell_type": "markdown",
"source": [
"#### Generate and print embeddings for the texts . The JohnSnowLabsEmbeddings class generates an embedding for each document, which is a numerical representation of the document's content. These embeddings can be used for various natural language processing tasks, such as document similarity comparison or text classification."
],
"metadata": {
"metadata": {
"collapsed": false
"collapsed": false,
}
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"Generate and print embeddings for the texts . The JohnSnowLabsEmbeddings class generates an embedding for each document, which is a numerical representation of the document's content. These embeddings can be used for various natural language processing tasks, such as document similarity comparison or text classification."
]
},
},
{
{
"cell_type": "code",
"cell_type": "code",
"execution_count": null,
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"outputs": [],
"source": [
"source": [
"embeddings = embedder.embed_documents(texts)\n",
"embeddings = embedder.embed_documents(texts)\n",
"for i, embedding in enumerate(embeddings):\n",
"for i, embedding in enumerate(embeddings):\n",
" print(f\"Embedding for document {i+1}: {embedding}\")"
" print(f\"Embedding for document {i+1}: {embedding}\")"
],
]
"metadata": {
"collapsed": false
}
},
},
{
{
"cell_type": "markdown",
"cell_type": "markdown",
"source": [
"#### Generate and print an embedding for a single piece of text. You can also generate an embedding for a single piece of text, such as a search query. This can be useful for tasks like information retrieval, where you want to find documents that are similar to a given query."
],
"metadata": {
"metadata": {
"collapsed": false
"collapsed": false,
}
"jupyter": {
"outputs_hidden": false
}
},
"source": [
"Generate and print an embedding for a single piece of text. You can also generate an embedding for a single piece of text, such as a search query. This can be useful for tasks like information retrieval, where you want to find documents that are similar to a given query."
">[SentenceTransformers](https://www.sbert.net/) embeddings are called using the `HuggingFaceEmbeddings` integration. We have also added an alias for `SentenceTransformerEmbeddings` for users who are more familiar with directly using that package.\n",
">[Hugging Face sentence-transformers](https://huggingface.co/sentence-transformers) is a Python framework for state-of-the-art sentence, text and image embeddings.\n",
">One of the embedding models is used in the `HuggingFaceEmbeddings` class.\n",
">We have also added an alias for `SentenceTransformerEmbeddings` for users who are more familiar with directly using that package.\n",
"\n",
"\n",
"`SentenceTransformers` is a python package that can generate text and image embeddings, originating from [Sentence-BERT](https://arxiv.org/abs/1908.10084)"
"`sentence_transformers` package models are originating from [Sentence-BERT](https://arxiv.org/abs/1908.10084)"
">[TensorFlow Hub](https://www.tensorflow.org/hub) is a repository of trained machine learning models ready for fine-tuning and deployable anywhere. Reuse trained models like `BERT` and `Faster R-CNN` with just a few lines of code.\n",