v4.0.0rc1
Release date: 2024-06-14 00:28:01
Latest release of delta-io/delta: v3.2.1rc1 (2024-09-05 00:48:36)
We are excited to announce the preview release of Delta Lake 4.0.0 on the preview release of Apache Spark 4.0.0! This release gives a preview of the following exciting new features.
- Support for Spark Connect (aka Delta Connect): an extension for Spark Connect that lets Delta be used with Spark Connect's decoupled client-server architecture.
- Support for Type Widening to allow users to change the type of columns without having to rewrite data.
- Support for the Variant data type to enable semi-structured storage and data processing, for flexibility and performance.
- Support for Coordinated Commits table feature which makes the commit protocol very flexible and allows reliable multi-cloud and multi-engine writes.
Read below for more details. In addition, a few existing artifacts are unavailable in this release; they are listed at the end.
Delta Spark
Delta Spark 4.0 preview is built on Apache Spark™ 4.0.0-preview1. Similar to Apache Spark, we have released Maven artifacts for Scala 2.13.
- Documentation: https://docs.delta.io/4.0.0-preview/index.html
- Maven artifacts: delta-spark_2.13, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb
- Python artifacts: https://pypi.org/project/delta-spark/4.0.0rc1/
The key features of this release are:
- Support for Spark Connect (aka Delta Connect): Spark Connect is a new initiative in Apache Spark that adds a decoupled client-server infrastructure, allowing Spark applications to connect remotely to a Spark server and run SQL / DataFrame operations. Delta Connect allows Delta operations to be made in applications running in such client-server mode. For more information on how to use Delta Connect, see the Delta Connect documentation.
- Support for Coordinated Commits: Coordinated Commits is a new writer table feature which allows users to designate a “Commit Coordinator” for their Delta table. A commit coordinator is an entity with a unique identifier which maintains information about commits. Once a commit coordinator has been set for a table, all writes to the table must be coordinated through it. This single point of ownership of commits for the table makes cross-environment (e.g. cross cloud) writes safe. Examples of Commit Coordinators are catalogs (Hive Metastore, Unity Catalog, etc.), DynamoDB, or any system which can implement the commit coordinator API. This release also adds a DynamoDB Commit Coordinator which can use a DynamoDB table to coordinate commits for a table. Delta tables with commit coordinators are still readable through the object storage paths, making reads backward compatible. See the Delta Coordinated Commits documentation for more details.
- Support for Type Widening: Delta Spark can now change the type of a column to a wider type using the `ALTER TABLE t CHANGE COLUMN col TYPE type` command or with schema evolution during `MERGE` and `INSERT` operations. See the type widening documentation for a list of all supported type changes and additional information. The table will be readable by Delta 4.0 readers without requiring the data to be rewritten. For compatibility with older versions, a rewrite of the data can be triggered using the `ALTER TABLE t DROP FEATURE 'typeWidening'` command.
- Support for Variant data type: The Variant data type is a new Apache Spark data type. The Variant data type enables flexible and efficient processing of semi-structured data, without a user-specified schema. Variant data does not require a fixed schema on write. Instead, Variant data is queried using the schema-on-read approach. The Variant data type allows flexible ingestion by not requiring a write schema, and enables faster processing with the Spark Variant binary encoding format. Please see the documentation and the example for more details.
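The Coordinated Commits description above rests on one idea: a single owner decides which write becomes the next table version. The following is a minimal in-memory sketch of that idea only. The class `InMemoryCommitCoordinator`, the exception `CommitConflict`, and the writer names are hypothetical and are not the Delta commit coordinator API; a real coordinator (a catalog or DynamoDB, as mentioned above) would also need durability and concurrency control.

```python
from dataclasses import dataclass, field
from typing import List


class CommitConflict(Exception):
    """Raised when a writer proposes a version that is not the next one."""


@dataclass
class InMemoryCommitCoordinator:
    """Toy coordinator (hypothetical): sole owner of a table's commit history."""

    coordinator_id: str
    commits: List[dict] = field(default_factory=list)  # list index == version

    def latest_version(self) -> int:
        # -1 means the table has no commits yet.
        return len(self.commits) - 1

    def commit(self, version: int, writer: str, actions: List[str]) -> int:
        # Accept a commit only if it targets exactly the next version.
        expected = self.latest_version() + 1
        if version != expected:
            raise CommitConflict(f"expected version {expected}, got {version}")
        self.commits.append({"writer": writer, "actions": list(actions)})
        return version


coordinator = InMemoryCommitCoordinator("demo-coordinator")
coordinator.commit(0, "writer-on-cloud-a", ["add part-0.parquet"])

# Two writers in different environments both try to create version 1.
coordinator.commit(1, "writer-on-cloud-b", ["add part-1.parquet"])
try:
    coordinator.commit(1, "writer-on-cloud-a", ["add part-2.parquet"])
except CommitConflict as exc:
    rejected = str(exc)  # the losing writer must re-read the table and retry
```

Because every write passes through one coordinator, the second attempt at version 1 is rejected no matter where the writer runs, which is the property that makes cross-environment writes safe.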
Other notable changes include:
- Support protocol version downgrades when the existing table features exist in the lower protocol version.
- Support dropping table features for columnMapping and vacuumProtocolCheck.
- Support `CREATE TABLE LIKE` with user provided properties. Previously any properties that were provided in the SQL command were ignored and only the properties from the source table were used.
- Fix liquid clustering to automatically fall back to Z-order clustering when clustering on a single column. Previously, any attempts to optimize the table would fail.
- Pushdown query filters when reading CDF so the filters can be used for partition pruning and row group skipping.
- Improve the performance of finding the last complete checkpoint with more efficient file listing.
- Fix a bug where providing a query filter that compares two `Literal` expressions would cause an infinite loop when constructing data skipping filters.
- Fix In-Commit Timestamps to use `clock.currentTimeMillis()` instead of `System.nanoTime()` for large commits since some systems return a very small number when `System.nanoTime()` is called.
- Fix streaming CDF queries to not read log entries beyond `endOffset` for reduced processing time.
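The In-Commit Timestamps fix above turns on the difference between a wall clock and a monotonic timer. As a rough Python analogy (not Delta's own code), `time.time_ns()` is anchored to the Unix epoch, while `time.monotonic_ns()` has an undefined reference point and is only meaningful as the difference between two readings, much like Java's `System.nanoTime()`. The function names below are hypothetical.

```python
import time


def commit_timestamp_millis() -> int:
    """Wall-clock milliseconds since the Unix epoch; usable as a timestamp."""
    return time.time_ns() // 1_000_000


def elapsed_millis(start_ns: int) -> int:
    """Milliseconds elapsed since start_ns, a reading of time.monotonic_ns()."""
    return (time.monotonic_ns() - start_ns) // 1_000_000


start_ns = time.monotonic_ns()            # arbitrary origin, may be a small number
timestamp_ms = commit_timestamp_millis()  # anchored to the epoch
duration_ms = elapsed_millis(start_ns)    # only the difference is meaningful
```

A raw monotonic reading can be tiny on a freshly started system, so it works for measuring durations but not as a commit timestamp.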
More features to come in the final release of Delta 4.0!
Delta Kernel Java
- Maven artifacts: delta-kernel-api, delta-kernel-defaults
The Delta Kernel project is a set of Java and Rust libraries for building Delta connectors that can read and write to Delta tables without the need to understand the Delta protocol details.
This release of Delta Kernel Java contains the following changes:
- Write timestamps using the `INT64` physical format in Parquet in the `DefaultParquetHandler`. Previously they were written as `INT96` which is an outdated and deprecated format for timestamps.
- Lazily evaluate comparator expressions in the `DefaultExpressionHandler`. Previously expressions would be eagerly evaluated for every row in the underlying vectors.
- Support SQL expression `LIKE` in the `DefaultExpressionHandler`.
- Support legacy Parquet schemas for map type and array type in the `DefaultParquetHandler`.
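For readers unfamiliar with the `LIKE` semantics mentioned above, here is a small language-neutral sketch in Python. It illustrates the standard SQL rules (`%` matches any run of characters, `_` matches exactly one, and an optional escape character makes the next character literal) by translating the pattern to a regular expression. It is not the Kernel's implementation; `like_to_regex` and `sql_like` are hypothetical helper names, and SQL NULL handling is omitted.

```python
import re
from typing import Optional


def like_to_regex(pattern: str, escape: Optional[str] = None) -> "re.Pattern[str]":
    """Translate a SQL LIKE pattern into an equivalent compiled regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if escape is not None and ch == escape:
            # The character after the escape is matched literally.
            i += 1
            if i >= len(pattern):
                raise ValueError("escape character at end of pattern")
            parts.append(re.escape(pattern[i]))
        elif ch == "%":
            parts.append(".*")   # any run of characters, including none
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
        i += 1
    # DOTALL lets the wildcards match newline characters as well.
    return re.compile("".join(parts), re.DOTALL)


def sql_like(value: str, pattern: str, escape: Optional[str] = None) -> bool:
    """Return True if the whole value matches the LIKE pattern."""
    return like_to_regex(pattern, escape).fullmatch(value) is not None


matches_prefix = sql_like("delta-kernel", "delta%")
matches_single = sql_like("delta", "delt_")
matches_literal_percent = sql_like("100%", "100\\%", escape="\\")
```

The match is anchored to the whole value (`fullmatch`), so `sql_like("delta", "delt")` is false even though the pattern is a prefix of the value.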
In addition to the above Delta Kernel Java changes, Delta Kernel Rust released its first version 0.1, which is available at https://crates.io/crates/delta_kernel.
Limitations
The following features from Delta 3.2 are not supported in this preview release. We are working with the community to address the following gaps by the final release of Delta 4.0:
- In Delta Spark, UniForm with Iceberg and Hudi is not yet available because those projects do not yet support Spark 4.0.
- Delta Flink, Delta Standalone, and Delta Hive are not available yet.
Credits
Abhishek Radhakrishnan, Allison Portis, Ami Oka, Andreas Chatzistergiou, Anish, Carmen Kwan, Chirag Singh, Christos Stavrakakis, Dhruv Arya, Felipe Pessoto, Fred Storage Liu, Hyukjin Kwon, James DeLoye, Jiaheng Tang, Johan Lasperas, Jun, Kaiqi Jin, Krishnan Paranji Ravi, Lin Zhou, Lukas Rupprecht, Ole Sasse, Paddy Xu, Prakhar Jain, Qianru Lao, Richard Chen, Sabir Akhadov, Scott Sandre, Sergiu Pocol, Sumeet Varma, Tai Le Manh, Tathagata Das, Thang Long Vu, Tom van Bussel, Venki Korukanti, Wenchen Fan, Yan Zhao, zzl-7