0.15.2
Release date: 2024-08-13 21:40:55
Latest release of Unstructured-IO/unstructured: 0.15.12 (2024-09-13 22:39:58)
Enhancements
- Improve directory handling when extracting image blocks. The figures directory is no longer created when the extract_image_block_to_payload parameter is set to True.
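The behavior described above can be sketched in isolation. This is a minimal illustration, not the library's implementation: `store_image_block`, its parameters, and the returned keys are hypothetical names chosen for the example. It shows the idea of creating the output directory lazily, only when an image is actually written to disk.

```python
import base64
from pathlib import Path


def store_image_block(
    image_bytes: bytes,
    output_dir: Path,
    name: str,
    to_payload: bool,
) -> dict:
    """Hypothetical helper: return metadata for one extracted image block."""
    if to_payload:
        # Payload mode: embed the image as base64 and never touch the filesystem,
        # so no output directory is created.
        return {"image_base64": base64.b64encode(image_bytes).decode("ascii")}

    # File mode: create the directory only at the point a file is written.
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / name
    path.write_bytes(image_bytes)
    return {"image_path": str(path)}
```

With `to_payload=True` the function returns the encoded image and leaves the target directory untouched; with `to_payload=False` the directory appears as a side effect of the first write.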
Features
- Added per-class Object Detection metrics in the evaluation. The metrics include average precision, precision, recall, and f1-score for each class in the dataset.
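As a rough illustration of what per-class metrics involve, the sketch below computes precision, recall, and f1-score for each class from true-positive, false-positive, and false-negative counts. The `ClassCounts` type, the function name, and the assumption that detections have already been matched to ground truth are all illustrative and are not taken from the library. Average precision is left out because it needs ranked confidence scores rather than plain counts.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ClassCounts:
    """Hypothetical container for matched-detection counts of one class."""
    tp: int  # detections matched to a ground-truth box
    fp: int  # detections with no matching ground-truth box
    fn: int  # ground-truth boxes with no matching detection


def per_class_metrics(counts: Dict[str, ClassCounts]) -> Dict[str, Dict[str, float]]:
    """Compute precision, recall, and f1-score separately for each class."""
    metrics: Dict[str, Dict[str, float]] = {}
    for label, c in counts.items():
        predicted = c.tp + c.fp
        actual = c.tp + c.fn
        # Guard against division by zero when a class has no predictions
        # or no ground-truth instances.
        precision = c.tp / predicted if predicted else 0.0
        recall = c.tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics
```

Reporting the numbers per class rather than as one aggregate makes it visible when a model does well on a frequent class but poorly on a rare one.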
Fixes
- Updates NLTK data file for compatibility with nltk>=3.8.2. The NLTK data file now contains punkt_tab, making it possible to upgrade to nltk>=3.8.2. nltk==3.8.2 patches CVE-2024-39705.
- Renames Astra to Astra DB. Conforms with DataStax internal naming conventions.
- Accommodate single-column CSV files. Resolves a limitation of partition_csv() where delimiter detection would fail on a single-column CSV file (which naturally has no delimiters).
- Accommodate image/jpg in PPTX as alias for image/jpeg. Resolves problem partitioning PPTX files having an invalid image/jpg (should be image/jpeg) MIME-type in the [Content_Types].xml member of the PPTX Zip archive.
- Fixes an issue in Object Detection metrics. The issue was in preprocessing/validating the ground truth and predicted data for object detection metrics.
- Removes dependency on unstructured.pytesseract. Unstructured forked pytesseract while waiting for code to be upstreamed. Now that the new version has been released, this fork can be removed.
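The single-column CSV fix above concerns a general pitfall of delimiter detection: a file with one column contains no delimiter to detect. The sketch below shows one way to handle that case with the standard library's `csv.Sniffer`. It is an assumption-laden illustration, not the code used by partition_csv(); `detect_delimiter`, `read_rows`, and the candidate delimiter set are hypothetical.

```python
import csv
import io
from typing import List, Optional

# Assumed set of delimiters worth considering; not taken from the library.
CANDIDATE_DELIMITERS = ",;\t|"


def detect_delimiter(sample: str) -> Optional[str]:
    """Return the detected delimiter, or None for a single-column sample."""
    # A single-column file contains none of the candidates, so skip sniffing.
    if not any(d in sample for d in CANDIDATE_DELIMITERS):
        return None
    try:
        dialect = csv.Sniffer().sniff(sample, delimiters=CANDIDATE_DELIMITERS)
        return dialect.delimiter
    except csv.Error:
        # Sniffing was inconclusive; fall back to the conventional comma.
        return ","


def read_rows(text: str) -> List[List[str]]:
    """Parse CSV text, treating an undetectable delimiter as a comma."""
    delimiter = detect_delimiter(text) or ","
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))
```

For a single-column input, `read_rows` yields one-element rows instead of raising, which is the outcome the fix describes.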