0.15.8
Release date: 2024-08-27 23:55:33
Latest release of Unstructured-IO/unstructured: 0.15.12 (2024-09-13 22:39:58)
Enhancements
- Bump unstructured.paddleocr to 2.8.1.0.
Features
- Add MixedbreadAI embedder. Adds MixedbreadAI embeddings to support embedding via Mixedbread AI.
Fixes
- Replace pillow-heif with pi-heif. Replaces pillow-heif with pi-heif due to more permissive licensing on the wheel for pi-heif.
- Minify text_as_html from DOCX. Previously .metadata.text_as_html for DOCX tables was "bloated" with whitespace and noise elements introduced by tabulate that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count without preserving all text.
- Fall back to filename extension-based file-type detection for unidentified OLE files. Resolves a problem where a DOC file that could not be detected as such by filetype was incorrectly identified as a MSG file.
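
The extension fallback described in the last fix can be illustrated with a minimal sketch. This is not the library's actual implementation: the function name detect_ole_type, the OLE_EXTENSIONS map, and the content_guess parameter are all hypothetical and exist only for this example. The only external fact relied on is the 8-byte signature that opens every OLE (Compound File Binary) container, which legacy DOC, XLS, PPT, and MSG files share.

```python
from pathlib import Path
from typing import Optional

# Signature at the start of every OLE / Compound File Binary container.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Hypothetical extension map, for illustration only.
OLE_EXTENSIONS = {".doc": "doc", ".xls": "xls", ".ppt": "ppt", ".msg": "msg"}


def detect_ole_type(
    head: bytes, filename: str, content_guess: Optional[str] = None
) -> Optional[str]:
    """Return a file-type label for an OLE file, or None if undetermined.

    head: the first bytes of the file.
    content_guess: result of content-based detection, or None if it failed
        (a stand-in for whatever a content sniffer would report).
    """
    if not head.startswith(OLE_MAGIC):
        return None  # not an OLE container, so this sketch does not apply
    if content_guess is not None:
        return content_guess  # content-based detection succeeded; trust it
    # Content-based detection failed: fall back to the filename extension
    # instead of assuming a default such as MSG.
    return OLE_EXTENSIONS.get(Path(filename).suffix.lower())
```

With this ordering, an OLE file named report.doc whose contents could not be classified resolves to "doc" rather than being treated as a MSG file, while a file with an unrecognized extension stays undetermined.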