0.10.26
版本发布时间: 2023-10-25 23:12:48
Unstructured-IO/unstructured最新发布版本:0.15.12(2024-09-13 22:39:58)
0.10.26
Enhancements
- Add CI evaluation workflow Adds evaluation metrics to the current ingest workflow to measure the performance of each file extracted as well as aggregated-level performance.
Features
-
Functionality to catch and classify overlapping/nested elements Method to identify overlapping-bboxes cases within detected elements in a document. It returns two values: a boolean defining if there are overlapping elements present, and a list reporting them with relevant metadata. The output includes information about the
overlapping_elements
,overlapping_case
,overlapping_percentage
,largest_ngram_percentage
,overlap_percentage_total
,max_area
,min_area
, andtotal_area
. - Add Local connector source metadata python's os module used to pull stats from local file when processing via the local connector and populates fields such as last modified time, created time.
- Add Local connector source metadata. python's os module used to pull stats from local file when processing via the local connector and populates fields such as last modified time, created time.
Fixes
- Fixes elements partitioned from an image file missing certain metadata Metadata for image files, like file type, was being handled differently from other file types. This caused a bug where other metadata, like the file name, was being missed. This change brought metadata handling for image files to be more in line with the handling for other file types so that file name and other metadata fields are being captured.
-
Adds
typing-extensions
as an explicit dependency This package is an implicit dependency, but the module is being imported directly inunstructured.documents.elements
so the dependency should be explicit in case changes in other dependencies lead totyping-extensions
being dropped as a dependency. -
Stop passing
extract_tables
tounstructured-inference
since it is now supported inunstructured
instead Table extraction previously occurred inunstructured-inference
, but that logic, except for the table model itself, is now a part of theunstructured
library. Thus the parameter triggering table extraction is no longer passed to theunstructured-inference
package. Also noted the table output regression for PDF files. -
Fix a bug in Table partitioning Previously the
skip_infer_table_types
variable used inpartition
was not being passed down to specific file partitioners. Now you can utilize theskip_infer_table_types
list variable when callingpartition
to specify the filetypes for which you want to skip table extraction, or theinfer_table_structure
boolean variable on the file specific partitioning function. - Fix partition docx without sections Some docx files, like those from teams output, do not contain sections and it would produce no results because the code assumes all components are in sections. Now if no sections is detected from a document we iterate through the paragraphs and return contents found in the paragraphs.
-
Fix out-of-order sequencing of split chunks. Fixes behavior where "split" chunks were inserted at the beginning of the chunk sequence. This would produce a chunk sequence like [5a, 5b, 3a, 3b, 1, 2, 4] when sections 3 and 5 exceeded
max_characters
. - Deserialization of ingest docs fixed When ingest docs are being deserialized as part of the ingest pipeline process (cli), there were certain fields that weren't getting persisted (metadata and date processed). The from_dict method was updated to take these into account and a unit test added to check.
- Map source cli command configs when destination set Due to how the source connector is dynamically called when the destination connector is set via the CLI, the configs were being set incorrectoy, causing the source connector to break. The configs were fixed and updated to take into account Fsspec-specific connectors.