v1.20.0-rc1
Release date: 2023-08-30 20:29:07
Latest deepset-ai/haystack release: v2.4.0 (2024-08-15 17:39:00)
⭐ Highlights
🪄LostInTheMiddleRanker and DiversityRanker
We are excited to introduce two new rankers to Haystack: LostInTheMiddleRanker and DiversityRanker!
LostInTheMiddleRanker is based on the research paper "Lost in the Middle: How Language Models Use Long Contexts" by Liu et al. It reorders documents according to the "Lost in the Middle" strategy, which places the most relevant paragraphs at the beginning and end of the context, while less relevant paragraphs are positioned in the middle. This ranker can be used in Retrieval-Augmented Generation (RAG) pipelines.
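The reordering can be pictured in a few lines of Python. This is a minimal sketch, assuming the input list is already sorted from most to least relevant. The function name `lost_in_the_middle_order` is hypothetical and not part of the Haystack API, and the actual ranker may differ in its details.

```python
from typing import List, TypeVar

T = TypeVar("T")


def lost_in_the_middle_order(ranked: List[T]) -> List[T]:
    """Reorder items (sorted by descending relevance) so the most relevant
    ones sit at both ends and the least relevant ones end up in the middle.

    Hypothetical helper for illustration only.
    """
    front: List[T] = []
    back: List[T] = []
    for position, item in enumerate(ranked):
        # Alternate: ranks 1, 3, 5, ... fill the front, ranks 2, 4, ... fill the back.
        if position % 2 == 0:
            front.append(item)
        else:
            back.append(item)
    # Reverse the back half so the second-best item becomes the very last one.
    return front + back[::-1]


# Items are labelled by relevance rank (1 = most relevant).
print(lost_in_the_middle_order([1, 2, 3, 4, 5]))  # [1, 3, 5, 4, 2]
```

With this layout the two best documents occupy the first and last positions, which is where the paper reports that language models make the best use of context.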
DiversityRanker aims to maximize the overall diversity of the given documents. It leverages sentence-transformer models to calculate semantic embeddings for each document. The ranker orders documents so that each subsequent document is, on average, the least similar to the documents already selected. This ranking results in a list where each subsequent document contributes the most to the overall diversity of the selected document set.
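The greedy selection described above can be sketched as follows. This is an illustration under stated assumptions, not the Haystack implementation: embeddings are passed in as plain lists of floats instead of being computed by a sentence-transformer model, all vectors are assumed to be non-zero, and the choice of the first document (`start`) is left to the caller. The names `cosine` and `greedy_diversity_order` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def greedy_diversity_order(embeddings: List[Sequence[float]], start: int = 0) -> List[int]:
    """Return document indices ordered so that each next document has the
    lowest average similarity to the documents already selected.

    Hypothetical helper; `start` is the index of the first document
    (an assumption, for example the document most relevant to the query).
    """
    if not embeddings:
        return []
    selected = [start]
    remaining = [i for i in range(len(embeddings)) if i != start]
    while remaining:
        def average_similarity(candidate: int) -> float:
            total = sum(cosine(embeddings[candidate], embeddings[j]) for j in selected)
            return total / len(selected)

        # Pick the candidate that is, on average, least similar to the selection.
        next_index = min(remaining, key=average_similarity)
        selected.append(next_index)
        remaining.remove(next_index)
    return selected


# Documents 0 and 1 are near-duplicates; document 2 points in another direction.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(greedy_diversity_order(vectors))  # [0, 2, 1]
```

The near-duplicate of the first document is pushed to the end, which is the behavior the ranker is designed to produce for highly relevant yet similar documents.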
📰 New release note management
We have implemented a new release note management system, reno. From now on, every contributor is responsible for adding release notes for the feature or bugfix they're introducing in Haystack in the same Pull Request containing the code changes. The goal is to encourage detailed and accurate notes for every release, especially when it comes to complex features or breaking changes.
See how to work with the new release notes in our Contribution Guide.
⬆️ Upgrade Notes
- If you're a Haystack contributor, you need a new tool called `reno` to manage the release notes. Please run `pip install -e .[dev]` to ensure you have `reno` available in your environment.
- The OpenSearch custom query syntax has changed: the old filter placeholders for `custom_query` are no longer supported. Replace your custom filter expressions with the new `${filters}` placeholder:

  Old:

      retriever = BM25Retriever(
          custom_query="""
          {
              "query": {
                  "bool": {
                      "should": [{"multi_match": {
                          "query": ${query},
                          "type": "most_fields",
                          "fields": ["content", "title"]}}
                      ],
                      "filter": [
                          {"terms": {"year": ${years}}},
                          {"terms": {"quarter": ${quarters}}},
                          {"range": {"date": {"gte": ${date}}}}
                      ]
                  }
              }
          }
          """
      )

      retriever.retrieve(
          query="What is the meaning of life?",
          filters={"years": [2019, 2020], "quarters": [1, 2, 3], "date": "2019-03-01"}
      )

  New:

      retriever = BM25Retriever(
          custom_query="""
          {
              "query": {
                  "bool": {
                      "should": [{"multi_match": {
                          "query": ${query},
                          "type": "most_fields",
                          "fields": ["content", "title"]}}
                      ],
                      "filter": ${filters}
                  }
              }
          }
          """
      )

      retriever.retrieve(
          query="What is the meaning of life?",
          filters={"year": [2019, 2020], "quarter": [1, 2, 3], "date": {"$gte": "2019-03-01"}}
      )
- This update impacts only those who have created custom invocation layers by subclassing PromptModelInvocationLayer. Previously, the invoke() method in your custom layer received all prompt template parameters (like query, documents, etc.) as keyword arguments. With this change, these parameters are no longer passed in as keyword arguments. If you've implemented such a custom layer, you may need to update your code to accommodate this change.
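To picture how the `${query}` and `${filters}` placeholders from the custom query upgrade note get filled in, here is a minimal sketch using Python's standard `string.Template`, which happens to use the same `${name}` syntax. It does not claim to show Haystack's internals: the helper `render_custom_query` is hypothetical, and the filter clause is hand-written in query DSL form because the translation from Haystack's filter dictionary to that form is not covered in these notes.

```python
import json
from string import Template

# Query body with the two placeholders named in the release notes.
CUSTOM_QUERY = """
{
    "query": {
        "bool": {
            "should": [{"multi_match": {
                "query": ${query},
                "type": "most_fields",
                "fields": ["content", "title"]}}
            ],
            "filter": ${filters}
        }
    }
}
"""


def render_custom_query(query: str, filter_clause: list) -> dict:
    """Substitute both placeholders with JSON-encoded values and parse the result.

    Hypothetical helper for illustration only.
    """
    rendered = Template(CUSTOM_QUERY).substitute(
        query=json.dumps(query),            # becomes a quoted JSON string
        filters=json.dumps(filter_clause),  # becomes a JSON array of clauses
    )
    return json.loads(rendered)


# Assumption: a filter clause written by hand for this example.
filter_clause = [
    {"terms": {"year": [2019, 2020]}},
    {"range": {"date": {"gte": "2019-03-01"}}},
]
body = render_custom_query("What is the meaning of life?", filter_clause)
print(json.dumps(body["query"]["bool"]["filter"]))
```

Because the whole filter clause arrives through a single placeholder, the query template no longer needs one placeholder per filtered field.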
🥳 New Features
- The LostInTheMiddleRanker can be used like other rankers in Haystack. After initializing LostInTheMiddleRanker with the desired parameters, you can use it to reorder a list of documents based on the "Lost in the Middle" order: the most relevant documents are located at the top and bottom of the returned list, while the least relevant documents are found in the middle. We advise that you use this ranker in combination with other rankers and place it towards the end of the pipeline.
- The DiversityRanker can be used like other rankers in Haystack. It can be particularly helpful when you have highly relevant yet similar sets of documents. By ensuring a diversity of documents, this new ranker makes fuller use of the documents and, particularly in RAG pipelines, potentially contributes to more accurate and richer model responses.
- When using `custom_query` in `BM25Retriever` along with `OpenSearch` or `Elasticsearch`, we added support for dynamic `filters`, like in regular queries. With this change, you can pass filters at query time without having to modify the `custom_query`: instead of defining filter expressions and field placeholders, all you have to do is set the `${filters}` placeholder, analogous to the `${query}` placeholder, in your `custom_query`. For example:

      {
          "query": {
              "bool": {
                  "should": [{"multi_match": {
                      "query": ${query},  // mandatory query placeholder
                      "type": "most_fields",
                      "fields": ["content", "title"]}}
                  ],
                  "filter": ${filters}  // optional filters placeholder
              }
          }
      }
- `DeepsetCloudDocumentStore` supports searching multiple fields in sparse queries. This enables you to search meta fields as well when using `BM25Retriever`. For example, set `search_fields=["content", "title"]` to search the `title` meta field along with the document `content`.
- Rework `DocumentWriter` to remove `DocumentStoreAwareMixin`. Now we require a generic `DocumentStore` when initializing the writer.
- Rework `MemoryRetriever` to remove `DocumentStoreAwareMixin`. Now we require a `MemoryDocumentStore` when initializing the retriever.
- Introduced the `allowed_domains` parameter in `WebRetriever` for domain-specific searches, thus enabling "talk to a website" and "talk to docs" scenarios.
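The idea behind restricting a search to a set of domains can be illustrated with the standard library alone. This is only a conceptual sketch: it filters a list of result URLs after the fact, and it makes no claim about how `WebRetriever` applies `allowed_domains` internally. The helper name `filter_by_domain` and the example URLs are hypothetical.

```python
from typing import Iterable, List
from urllib.parse import urlparse


def filter_by_domain(urls: Iterable[str], allowed_domains: Iterable[str]) -> List[str]:
    """Keep only URLs whose host is an allowed domain or one of its subdomains.

    Hypothetical helper for illustration only.
    """
    allowed = {domain.lower() for domain in allowed_domains}
    kept = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        # Match the domain itself or any subdomain, but not look-alike hosts.
        if any(host == domain or host.endswith("." + domain) for domain in allowed):
            kept.append(url)
    return kept


results = [
    "https://docs.example.com/guide",
    "https://example.org/page",
    "https://example.com/",
    "https://notexample.com/",
]
print(filter_by_domain(results, ["example.com"]))
# ['https://docs.example.com/guide', 'https://example.com/']
```

Limiting results to a documentation site in this way is what makes a "talk to docs" scenario possible: the generator only sees content from the chosen domains.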
✨ Enhancements
- The WebRetriever now employs an enhanced caching mechanism that caches web page content based on search engine results rather than the query.
- Upgrade Transformers to the latest version 4.32.1 so that Haystack benefits from Llama and T5 bugfixes: https://github.com/huggingface/transformers/releases/tag/v4.32.1
- Upgrade Transformers to version 4.32.0. This version adds support for the GPTQ quantization and integrates MPT models.
- Add top_k parameter to the DiversityRanker init method.
- Enable setting the `max_length` value when running PromptNodes using local HF text2text-generation models.
- Enable passing use_fast to the underlying transformers' pipeline.
- Enhance FileTypeClassifier to detect media file types like mp3, mp4, mpeg, m4a, and similar.
- Minor PromptNode HFLocalInvocationLayer test improvements.
- Several minor enhancements for LinkContentFetcher:
  - Dynamic content handler resolution
  - Custom User-Agent header (optional, minimize blocking)
  - PDF support
  - Register new content handlers
- If LinkContentFetcher encounters a block or receives any response code other than HTTPStatus.OK, return the search engine snippet as content, if it's available.
- Allow loading Tokenizers for prompt models not natively supported by transformers by setting `trust_remote_code` to `True`.
- Refactor and simplify WebRetriever to use the LinkContentFetcher component.
- Remove template variables from invocation layer kwargs.
- Allow WebRetriever users to specify a custom LinkContentFetcher instance.
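The snippet fallback described for LinkContentFetcher boils down to a small decision rule. The sketch below shows that rule in isolation, assuming the status code, page body, and search engine snippet are already available. The function `content_or_snippet` is hypothetical and is not the LinkContentFetcher API.

```python
from http import HTTPStatus
from typing import Optional


def content_or_snippet(status_code: int, body: Optional[str], snippet: Optional[str]) -> Optional[str]:
    """Return the fetched page body on HTTP 200, otherwise fall back to the
    search engine snippet (which may itself be missing).

    Hypothetical helper for illustration only.
    """
    if status_code == HTTPStatus.OK:
        return body
    # Blocked or failed request: use the snippet if the search engine gave one.
    return snippet


print(content_or_snippet(200, "full page text", "short snippet"))  # full page text
print(content_or_snippet(403, None, "short snippet"))              # short snippet
```

Falling back to the snippet means a blocked page still contributes some content to the pipeline instead of being dropped entirely.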
🐛 Bug Fixes
- Fix the bug that the responses of Agents using local HF models contain the prompt text.
- Fix issue 5485: TransformersImageToText.generate_captions accepts "str".
- Fix StopWordsCriteria not checking stop word tokens in a continuous and sequential order.
- Ensure the leading whitespace in the generated text is preserved when using `stop_words` in the Hugging Face invocation layer of the PromptNode.
- Restrict the criteria for identifying an OpenAI model in the PromptNode and in the EmbeddingRetriever. Previously, the criteria were quite loose, leading to more false positives.
- Make the Crawler work properly with Selenium>=4.11.0. Simplify the Crawler, as the new version of Selenium automatically finds or installs the necessary drivers.
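The StopWordsCriteria fix concerns the difference between "all stop word tokens appear somewhere" and "the stop word tokens appear next to each other, in order". A minimal sketch of the second check is shown below, using plain lists of token ids. The function `contains_contiguous` is hypothetical and is not the code used in Haystack.

```python
from typing import Sequence


def contains_contiguous(generated: Sequence[int], stop: Sequence[int]) -> bool:
    """Return True if the token ids in `stop` occur in `generated`
    as one contiguous run, in the same order.

    Hypothetical helper for illustration only.
    """
    window = len(stop)
    if window == 0:
        return False
    # Slide a window of the stop word's length over the generated ids.
    return any(
        list(generated[i:i + window]) == list(stop)
        for i in range(len(generated) - window + 1)
    )


stop_word_ids = [1, 2, 3]
print(contains_contiguous([5, 1, 2, 3, 9], stop_word_ids))  # True
print(contains_contiguous([1, 5, 2, 9, 3], stop_word_ids))  # False: ids present but scattered
```

A check that only tests membership would stop generation on the second sequence, even though the stop word was never actually produced.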
👁️ Haystack 2.0 preview
- Adds FileExtensionClassifier to preview components.
- Add Sentence Transformers Document Embedder. It computes embeddings of Documents. The embedding of each Document is stored in the `embedding` field of the Document.
- Add Sentence Transformers Text Embedder. It is a simple component that embeds strings into vectors.
- Add Answer base class for Haystack v2.
- Add GeneratedAnswer and ExtractedAnswer.
- Improve error messaging in the FileExtensionClassifier constructor to avoid common mistakes.
- Migrate existing v2 components to Canals 0.4.0.
- Fix TextFileToDocument using wrong Document class.
- Change import paths under the "preview" package to minimize module namespace pollution.
- Migrate all components to Canals==0.7.0.
- Add serialization and deserialization methods for all Haystack components.
- Added new DocumentWriter component to Haystack v2 preview so that documents can be written to stores.
- Copy lazy_imports.py to preview.
- Remove `BaseTestComponent` class used to test `Component`s.
- Remove `DocumentStoreAwareMixin` as it's not necessary anymore.
- Remove Pipeline specialisation to support DocumentStores.
- Add Sentence Transformers Embedding Backend. It will be used by Embedder components and is responsible for computing embeddings.
- Add utility function `store_class` factory to create `Store`s for testing purposes.
- Add `from_dict` and `to_dict` methods to `Store` `Protocol`.
- Add default `from_dict` and `to_dict` implementations to classes decorated with `@store`.
- Add new TextFileToDocument component to Haystack v2 preview so that text files can be converted to Haystack Documents.
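As a rough picture of what a `to_dict` / `from_dict` pair provides, the sketch below round-trips a toy object through a plain dictionary. Everything in it is an assumption made for illustration: `ToyStore` is a hypothetical class, and the dictionary layout (`type` plus `init_parameters`) is not taken from these notes and may not match the format used by the preview package.

```python
from typing import Any, Dict


class ToyStore:
    """Hypothetical store used only to illustrate a serialization round trip."""

    def __init__(self, index: str = "default", capacity: int = 100):
        self.index = index
        self.capacity = capacity

    def to_dict(self) -> Dict[str, Any]:
        # Assumed layout: the class name plus the arguments needed to rebuild it.
        return {
            "type": type(self).__name__,
            "init_parameters": {"index": self.index, "capacity": self.capacity},
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ToyStore":
        if data.get("type") != cls.__name__:
            raise ValueError(f"Cannot build {cls.__name__} from type {data.get('type')!r}")
        return cls(**data.get("init_parameters", {}))


original = ToyStore(index="articles", capacity=10)
restored = ToyStore.from_dict(original.to_dict())
print(restored.index, restored.capacity)  # articles 10
```

Storing only the constructor arguments keeps the serialized form small and lets the object be rebuilt by calling the constructor again.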