0.13.0
Release date: 2024-03-30 05:37:23
Unstructured-IO/unstructured latest release: 0.13.7 (2024-05-09 01:28:21)
Enhancements
- Add .metadata.is_continuation to text-split chunks. .metadata.is_continuation=True is added to second-and-later chunks formed by text-splitting an oversized Table element but not to their counterpart Text element splits. Add this indicator for CompositeElement to allow text-split continuation chunks to be identified for downstream processes that may wish to skip intentionally redundant metadata values in continuation chunks.
- Add compound_structure_acc metric to table eval. Add a new property to unstructured.metrics.table_eval.TableEvaluation: composite_structure_acc, which is computed from the element-level row and column index and content accuracy scores.
- Add .metadata.orig_elements to chunks. .metadata.orig_elements: list[Element] is added to chunks during the chunking process (when requested) to allow access to information from the elements each chunk was formed from. This is useful, for example, to recover metadata fields that cannot be consolidated to a single value for a chunk, like page_number, coordinates, and image_base64.
- Add --include_orig_elements option to Ingest CLI. By default, when chunking, the original elements used to form each chunk are added to chunk.metadata.orig_elements for each chunk. The include_orig_elements parameter allows the user to turn off this behavior to produce a smaller payload when they don't need this metadata.
- Add Google VertexAI embedder. Adds VertexAI embeddings to support embedding via Google Vertex AI.
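As a rough illustration of how a downstream process might use the is_continuation flag, the sketch below filters out text-split continuation chunks. StubMetadata, StubChunk, and leading_chunks are hypothetical stand-ins written for this example; they are not the library's own classes or functions, and only the is_continuation attribute name comes from the notes above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StubMetadata:
    """Hypothetical stand-in for chunk metadata (not the library's class)."""
    filename: Optional[str] = None
    is_continuation: Optional[bool] = None


@dataclass
class StubChunk:
    """Hypothetical stand-in for a chunk produced by text-splitting."""
    text: str
    metadata: StubMetadata = field(default_factory=StubMetadata)


def leading_chunks(chunks: List[StubChunk]) -> List[StubChunk]:
    """Return only the chunks that are not text-split continuations."""
    # The flag is only ever set to True on second-and-later splits, so a
    # missing or None value is treated as "not a continuation".
    return [c for c in chunks if not getattr(c.metadata, "is_continuation", None)]


chunks = [
    StubChunk("row 1 ... row 40", StubMetadata(filename="report.pdf")),
    StubChunk("row 41 ... row 80", StubMetadata(filename="report.pdf", is_continuation=True)),
    StubChunk("A short paragraph.", StubMetadata(filename="report.pdf")),
]
firsts = leading_chunks(chunks)
print([c.text for c in firsts])
```

A consumer that stores metadata once per source element could index only the chunks returned here and attach the continuation chunks' text without repeating the metadata.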
Features
- Chunking populates .metadata.orig_elements for each chunk. This behavior allows the text and metadata of the elements combined to make each chunk to be accessed. This can be important, for example, to recover metadata such as .coordinates that cannot be consolidated across elements and so is dropped from chunks. This option is controlled by the include_orig_elements parameter to partition_*() or to the chunking functions. This option defaults to True so original elements are preserved by default. This behavior is not yet supported via the REST APIs or SDKs but will be in a closely subsequent PR to other unstructured repositories. The original elements will also not serialize or deserialize yet; this will also be added in a closely subsequent PR.
- Add Clarifai destination connector. Adds support for writing partitioned and chunked documents into Clarifai.
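To make the orig_elements idea concrete, here is a minimal sketch of recovering page numbers from the elements a chunk was formed from. StubElement, StubElementMetadata, and page_numbers are hypothetical names used only for illustration; the attribute names orig_elements and page_number follow the notes above, and the None handling assumes that orig_elements is simply absent when the option is turned off.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StubElementMetadata:
    """Hypothetical stand-in for element or chunk metadata."""
    page_number: Optional[int] = None
    orig_elements: Optional[List["StubElement"]] = None


@dataclass
class StubElement:
    """Hypothetical stand-in for an element or a chunk."""
    text: str
    metadata: StubElementMetadata = field(default_factory=StubElementMetadata)


def page_numbers(chunk: StubElement) -> List[int]:
    """Collect the distinct page numbers of a chunk's original elements."""
    # Assumption: orig_elements is None when original elements were not kept.
    origs = chunk.metadata.orig_elements or []
    pages = {
        e.metadata.page_number
        for e in origs
        if e.metadata.page_number is not None
    }
    return sorted(pages)


title = StubElement("Results", StubElementMetadata(page_number=3))
body = StubElement("Accuracy improved.", StubElementMetadata(page_number=4))
chunk = StubElement(
    "Results\n\nAccuracy improved.",
    StubElementMetadata(orig_elements=[title, body]),
)
print(page_numbers(chunk))  # [3, 4]
```

The same pattern would apply to other per-element fields that cannot be reduced to a single chunk-level value, such as coordinates.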
Fixes
- Fix clean_pdfminer_inner_elements() to remove only pdfminer (embedded) elements merged with inferred elements. Previously, some embedded elements were removed even if they were not merged with inferred elements. Now, only embedded elements that are already merged with inferred elements are removed.
- Clarify IAM Role Requirement for GCS Platform Connectors. The GCS Source Connector requires the Storage Object Viewer IAM role and the GCS Destination Connector requires the Storage Object Creator IAM role.
- Change table extraction defaults. Change table extraction defaults in favor of using the skip_infer_table_types parameter and reflect these changes in documentation.
- Fix OneDrive dates with inconsistent formatting. Adds logic to conditionally support dates returned by office365 that may vary in date formatting or may be a datetime rather than a string. See the previous fix for SharePoint.
- Adds tracking for AstraDB. Adds tracking info so AstraDB can see what source called their API.
- Support AWS Bedrock Embeddings in ingest CLI. The configs required to instantiate the bedrock embedding class are now exposed in the API, and the version of boto being used meets the minimum requirement to introduce the bedrock runtime required to hit the service.
- Change MongoDB redacting. The original redact-secrets solution was causing issues in platform. This fix uses our standard logging redact solution.
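The OneDrive date fix above describes accepting values that may be a datetime or a string in varying formats. The following is a minimal sketch of that general idea, not the connector's actual code: normalize_timestamp and CANDIDATE_FORMATS are hypothetical names, and the listed formats are assumptions, since the notes do not say which formats office365 returns.

```python
from datetime import datetime
from typing import Union

# Assumed candidate formats; the release notes do not list the real ones.
CANDIDATE_FORMATS = (
    "%Y-%m-%dT%H:%M:%S.%f%z",
    "%Y-%m-%dT%H:%M:%S%z",
    "%Y-%m-%dT%H:%M:%S.%f",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d %H:%M:%S",
)


def normalize_timestamp(value: Union[datetime, str]) -> str:
    """Return an ISO 8601 string for a datetime or a date string."""
    # A datetime needs no parsing, only formatting.
    if isinstance(value, datetime):
        return value.isoformat()
    # Otherwise try each assumed format in turn until one parses.
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized timestamp format: {value!r}")


print(normalize_timestamp(datetime(2024, 3, 30, 5, 37, 23)))
print(normalize_timestamp("2024-03-30 05:37:23"))
```

Trying formats in order keeps the accepted inputs explicit; an unrecognized string raises an error instead of being passed through silently.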