v3.2.0rc2
版本发布时间: 2024-05-07 07:42:40
delta-io/delta最新发布版本:v3.2.1rc1(2024-09-05 00:48:36)
We are excited to announce the release of Delta Lake 3.2.0 (RC2)! Instructions for how to use this release candidate are at the end of these notes. To give feedback on this release candidate, please post in the Delta Users Slack here or create issues in our Delta repository.
Highlights
- Support for Liquid clustering to reduce write amplification using incremental clustering.
- Preview support for Type Widening to allow users to change the type of columns without having to rewrite data.
- Preview support for Apache Hudi in Delta UniForm tables.
Delta Spark
Delta Spark 3.2.0 is built on Apache Spark™ 3.5. Similar to Apache Spark, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.
- Documentation: https://docs.delta.io/3.2.0/index.html
- API documentation: https://docs.delta.io/3.2.0/delta-apidoc.html#delta-spark
- RC2 artifacts: delta-spark_2.12, delta-spark_2.13, delta-contribs_2.12, delta_contribs_2.13, delta-storage, delta-storage-s3-dynamodb
- RC2 Python artifacts: See
delta-3.2-rc2-python-artifacts.zip
attached
The key features of this release are:
-
Support for Liquid clustering: This allows for incremental clustering based on ZCubes and reduces the write amplification by not touching files already well clustered (i.e., files in stable ZCubes). Users can now use the ALTER TABLE CLUSTER BY syntax to change clustering columns and use the DESCRIBE DETAIL command to check the clustering columns. In addition, Delta Spark now supports DeltaTable
clusterBy
API in both Python and Scala to allow creating clustered tables using DeltaTable API. See the documentation and examples for more information. - Preview support for Type Widening: Delta Spark can now change the type of a column from
byte
toshort
tointeger
using the ALTER TABLE t CHANGE COLUMN col TYPE type command or with schema evolution during MERGE and INSERT operations. The table remains readable by Delta 3.2 readers without requiring the data to be rewritten. For compatibility with older versions, a rewrite of the data can be triggered using theALTER TABLE t DROP FEATURE 'typeWidening-preview’
command.- Note that this feature is in preview and that tables created with this preview feature enabled may not be compatible with future Delta Spark releases.
- Support for Vacuum Inventory: Delta Spark now extends the VACUUM SQL command to allow users to specify an inventory table in a VACUUM command. When an inventory table is provided, VACUUM will consider the files listed there instead of doing the full listing of the table directory, which can be time consuming for very large tables. See the docs here.
-
Support for Vacuum Writer Protocol Check: Delta Spark can now support
vacuumProtocolCheck
ReaderWriter feature which ensures consistent application of reader and writer protocol checks duringVACUUM
operations, addressing potential protocol discrepancies and mitigating the risk of data corruption due to skipped writer checks. - Preview support for In-Commit Timestamps: When enabled, this preview feature persists monotonically increasing timestamps within Delta commits, ensuring they are not affected by file operations. When enabled, time travel queries will yield consistent results, even if the table directory is relocated.
- Note that this feature is in preview and that tables created with this preview feature enabled may not be compatible with future Delta Spark releases.
- Deletion Vectors Read Performance Improvements: Two improvements were introduced to DVs in Delta 3.2.
- Removing broadcasting of DV information to executors: This work improves stability by reducing drivers’ memory consumption, preventing potential Driver OOM for very large Delta tables like 1TB+. This work also improves performance by saving us fixed broadcasting overhead in reading small Delta Tables.
- Supporting predicate pushdown and splitting in scans with DVs: Improving performance of DV reads with filters queries thanks to predicate pushdown and splitting. This feature gains 2x performance improvement on average.
-
Support for Row Tracking: Delta Spark can now write to tables that maintain information that allows identifying rows across multiple versions of a Delta table. Delta Spark can now also access this tracking information using the two metadata fields
_metadata.row_id
and_metadata.row_commit_version
.
Other notable changes include:
- Delta Sharing: reduce the minimum RPC interval in delta sharing streaming from 30 seconds to 10 seconds
- Improve the performance of write operations by skipping collecting commit stats
-
New SQL configurations to specify Delta Log cache size (
spark.databricks.delta.delta.log.cacheSize
) and retention duration (spark.databricks.delta.delta.log.cacheRetentionMinutes
) - Fix bug in plan validation due to inconsistent field metadata in MERGE
- Improved metrics during VACUUM for better visibility
- Hive Metastore schema sync: The truncation threshold for schemas with long fields is now user configurable
Delta Universal Format (UniForm)
- Documentation: https://docs.delta.io/3.2.0/delta-uniform.html
- RC2 artifacts: delta-iceberg_2.12, delta-iceberg_2.13, delta-hudi_2.12, delta-hudi-2.13
Hudi is now supported by Delta Universal format in addition to Iceberg. Writing to a Delta UniForm table can generate Hudi metadata, alongside Delta. This feature is contributed by XTable.
Create a UniForm-enabled that automatically generates Hudi metadata using the following command:
CREATE TABLE T (c1 INT) USING DELTA TBLPROPERTIES ('delta.universalFormat.enabledFormats' = hudi);
See the documentation here for more details.
Other notable changes include:
- Throw a better error if Iceberg conversion fails during initial sync
- Fix a bug in Delta Universal Format to support correct table overwrites
Delta Kernel
- API documentation: https://docs.delta.io/3.2.0/api/java/kernel/index.html
- RC2 artifacts: delta-kernel-api, delta-kernel-defaults
The Delta Kernel project is a set of Java libraries (Rust will be coming soon!) for building Delta connectors that can read (and, soon, write to) Delta tables without the need to understand the Delta protocol details). In this release,e we improved the read support to make it production-ready by adding numerous performance improvements, additional functionality, and improved protocol support.
-
Support for time travel. Now you can read a table snapshot at a version id or snapshot at a timestamp.
-
Improved Delta protocol support.
-
Support for reading tables with
checkpoint v2
. - Support for reading tables with
timestamp
partition type data column. -
Support for reading tables with column data type
timestamp_ntz
.
-
Support for reading tables with
-
Improved table metadata read performance and reliability on very large tables with millions of files
- Improved checkpoint reading latency by pushing the partition predicate to the checkpoint Parquet reader to minimize reading number of checkpoint files read.
- Improved state reconstruction latency by using
LogStore
s fromdelta-storage
module for fasterlistFrom
calls. -
Retry loading the
_last_checkpoint
checkpoint in case of transient failures. Loading the last checkpoint info from this file helps construct the Delta table state faster. - Optimization to minimize the number of listing calls to object store when trying to find a last checkpoint at or before a version.
-
Other notable changes include:
-
Support for
IS_NULL
expression. Now thePredicate
passed to KernelScanBuilder
can includeIS_NULL
predicates. -
Support for custom
ParquetHandler
implementations to multiple Parquet files in parallel. The current default implementation reads one file at a time, but the connectors can implement their own customParquetHandler
to read the Parquet files in parallel.
-
Support for
For more information, refer to:
- User guide on step-by-step process of using Kernel in a standalone Java program or in a distributed processing connector.
- Slides explaining the rationale behind Kernel and the API design.
- Example Java programs that illustrate how to read Delta tables using the Kernel APIs.
- Table and default Engine API Java documentation
- Migration guide to upgrade your connector to use the 3.2.0 APIs
How to use this Release Candidate
Download Spark 3.5 from https://spark.apache.org/downloads.html.
Important: Clear your package cache to ensure you’re effectively testing the latest Delta RC and not a previously released binary:
rm -rf ~/.ivy2/cache/
For this release candidate, we have published the artifacts to a staging repository. Here’s how you can use them:
Spark Submit
- Add
--repositories https://oss.sonatype.org/content/repositories/iodelta-1138
to the command line arguments. - Example:
spark-submit --packages io.delta:delta-spark_2.12:3.2.0 --repositories https://oss.sonatype.org/content/repositories/iodelta-1138 examples/examples.py
- Currently Spark shells (PySpark and Scala) do not accept the external repositories option. However, once the artifacts have been downloaded to the local cache, the shells can be run with Delta 3.2.0 by just providing the
--packages io.delta:delta-spark_2.12:3.2.0
argument.
Spark Shell
bin/spark-shell --packages io.delta:delta-spark_2.12:3.2.0 \
--repositories https://oss.sonatype.org/content/repositories/iodelta-1138 \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
Spark SQL
bin/spark-sql --packages io.delta:delta-spark_2.12:3.2.0 \
--repositories https://oss.sonatype.org/content/repositories/iodelta-1138 \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
Maven
<repositories>
<repository>
<id>staging-repo</id>
<url>https://oss.sonatype.org/content/repositories/iodelta-1138</url>
</repository>
</repositories>
<dependency>
<groupId>io.delta</groupId>
<artifactId>delta-spark_2.12</artifactId>
<version>3.2.0</version>
</dependency>
SBT Project
libraryDependencies += "io.delta" %% "delta-spark" % "3.2.0"
resolvers += "Delta" at https://oss.sonatype.org/content/repositories/iodelta-1138
(PySpark) Delta-Spark
- Download two artifacts from pre-release: https://github.com/delta-io/delta/releases/tag/v3.2.0rc2
- Artifacts to download are:
- delta-spark-3.2.0.tar.gz
- delta_spark-3.2.0-py3-none-any.whl
- Keep them in one directory. Let’s call that
~/Downloads
-
pip install ~/Downloads/delta_spark-3.2.0-py3-none-any.whl
-
pip show delta-spark
should show output similar to the below
Name: delta-spark
Version: 3.2.0
Summary: Python APIs for using Delta Lake with Apache Spark
Home-page: https://github.com/delta-io/delta/
Author: The Delta Lake Project Authors
Author-email: delta-users@googlegroups.com
License: Apache-2.0
Location: /home/<user.name>/.conda/envs/delta-release/lib/python3.8/site-packages
Requires: importlib-metadata, pyspark
Credits
Adam Binford, Ala Luszczak, Allison Portis, Ami Oka, Andreas Chatzistergiou, Arun Ravi M V, Babatunde Micheal Okutubo, Bo Gao, Carmen Kwan, Chirag Singh, Chloe Xia, Christos Stavrakakis, Costas Zarifis, Daniel Tenedorio, Davin Tjong, Dhruv Arya, Felipe Pessoto, Fred Storage Liu, Fredrik Klauss, Gabriel Russo, Hao Jiang, Hyukjin Kwon, Ian Streeter, Jason Teoh, Jiaheng Tang, Jing Zhan, Jintian Liang, Johan Lasperas, Jonas Irgens Kylling, Juliusz Sompolski, Kaiqi Jin, Lars Kroll, Lin Zhou, Miles Cole, Nick Lanham, Ole Sasse, Paddy Xu, Prakhar Jain, Rachel Bushrian, Rajesh Parangi, Renan Tomazoni Pinzon, Sabir Akhadov, Scott Sandre, Simon Dahlbacka, Sumeet Varma, Tai Le, Tathagata Das, Thang Long Vu, Tim Brown, Tom van Bussel, Venki Korukanti, Wei Luo, Wenchen Fan, Xupeng Li, Yousof Hosny, Gene Pang, Jintao Shen, Kam Cheung Ting, panbingkun, ram-seek, Sabir Akhadov, sokolat, tangjiafu