0.10.26

Unstructured-IO/unstructured

Release date: 2023-10-25 23:12:48

Latest release of Unstructured-IO/unstructured: 0.15.12 (2024-09-13 22:39:58)

Enhancements

Add CI evaluation workflow. Adds evaluation metrics to the current ingest workflow to measure performance for each extracted file as well as in aggregate.

Features

Functionality to catch and classify overlapping/nested elements. Adds a method that identifies overlapping bounding boxes among the elements detected in a document. It returns two values: a boolean indicating whether any overlapping elements are present, and a list reporting them with relevant metadata. The output includes overlapping_elements, overlapping_case, overlapping_percentage, largest_ngram_percentage, overlap_percentage_total, max_area, min_area, and total_area.
Add Local connector source metadata. Python's os module is used to pull stats from the local file when processing via the local connector and to populate fields such as last modified time and created time.
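
As a rough illustration of the Local connector entry above, the following minimal sketch reads file stats with Python's os module and maps them to metadata fields. The function name local_source_metadata and the dictionary keys are hypothetical and are not taken from the unstructured codebase; only os.stat and its st_mtime, st_ctime, and st_size attributes are standard library behavior.

```python
import os
from datetime import datetime, timezone


def local_source_metadata(path: str) -> dict:
    """Collect basic source metadata for a local file (hypothetical helper)."""
    stats = os.stat(path)
    return {
        "filename": os.path.basename(path),
        # st_mtime: time of the last content modification.
        "date_modified": datetime.fromtimestamp(
            stats.st_mtime, tz=timezone.utc
        ).isoformat(),
        # st_ctime: creation time on Windows, but the time of the last
        # metadata change on most Unix systems.
        "date_created": datetime.fromtimestamp(
            stats.st_ctime, tz=timezone.utc
        ).isoformat(),
        "size_bytes": stats.st_size,
    }
```

Note that st_ctime is only a true creation time on some platforms, so a "created time" field populated this way should be treated as approximate.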

Fixes

Fixes elements partitioned from an image file missing certain metadata. Metadata for image files, like file type, was being handled differently from other file types. This caused a bug where other metadata, like the file name, was being missed. This change brings metadata handling for image files more in line with the handling for other file types, so that file name and other metadata fields are captured.
Adds typing-extensions as an explicit dependency. This package is an implicit dependency, but the module is imported directly in unstructured.documents.elements, so the dependency should be explicit in case changes in other dependencies lead to typing-extensions being dropped as a dependency.
Stop passing extract_tables to unstructured-inference since it is now supported in unstructured instead. Table extraction previously occurred in unstructured-inference, but that logic, except for the table model itself, is now part of the unstructured library. Thus the parameter triggering table extraction is no longer passed to the unstructured-inference package. The table output regression for PDF files is also noted.
Fix a bug in Table partitioning. Previously the skip_infer_table_types variable used in partition was not being passed down to specific file partitioners. Now you can use the skip_infer_table_types list variable when calling partition to specify the file types for which you want to skip table extraction, or the infer_table_structure boolean variable on the file-specific partitioning function.
Fix partition docx without sections. Some docx files, like those from Teams output, do not contain sections, and partitioning produced no results because the code assumed all components are in sections. Now, if no sections are detected in a document, the partitioner iterates through the paragraphs and returns the contents found in them.
Fix out-of-order sequencing of split chunks. Fixes behavior where "split" chunks were inserted at the beginning of the chunk sequence. This would produce a chunk sequence like [5a, 5b, 3a, 3b, 1, 2, 4] when sections 3 and 5 exceeded max_characters.
Deserialization of ingest docs fixed. When ingest docs are deserialized as part of the ingest pipeline process (CLI), certain fields (metadata and date processed) weren't being persisted. The from_dict method was updated to take these into account, and a unit test was added to check.
Map source CLI command configs when destination set. Due to how the source connector is dynamically called when the destination connector is set via the CLI, the configs were being set incorrectly, causing the source connector to break. The configs were fixed and updated to take into account Fsspec-specific connectors.
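
To make the split-chunk ordering fix above concrete, here is a minimal sketch of the intended behavior: when a section exceeds max_characters, its pieces are emitted in place rather than at the front of the sequence. The functions split_text and chunk_sections are hypothetical illustrations, and the fixed-width slicing is a stand-in assumption, not the library's actual chunking logic.

```python
from typing import List


def split_text(text: str, max_characters: int) -> List[str]:
    """Cut text into consecutive pieces of at most max_characters each."""
    return [
        text[i : i + max_characters]
        for i in range(0, len(text), max_characters)
    ]


def chunk_sections(sections: List[str], max_characters: int) -> List[str]:
    """Return chunks in document order, splitting oversized sections in place."""
    chunks: List[str] = []
    for section in sections:
        if len(section) > max_characters:
            # Extend at the current position so split pieces stay
            # where their parent section appeared.
            chunks.extend(split_text(section, max_characters))
        else:
            chunks.append(section)
    return chunks
```

With five sections where the third and fifth are oversized, this yields the order [1, 2, 3a, 3b, 4, 5a, 5b] instead of the buggy [5a, 5b, 3a, 3b, 1, 2, 4].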
