v3.2.0rc2

Release date: 2024-05-07 07:42:40

Latest delta-io/delta release: v3.2.1rc1 (2024-09-05 00:48:36)

We are excited to announce the release of Delta Lake 3.2.0 (RC2)! Instructions for how to use this release candidate are at the end of these notes. To give feedback on this release candidate, please post in the Delta Users Slack here or create issues in our Delta repository.

Highlights

Support for Liquid clustering to reduce write amplification using incremental clustering.
Preview support for Type Widening to allow users to change the type of columns without having to rewrite data.
Preview support for Apache Hudi in Delta UniForm tables.

Delta Spark

Delta Spark 3.2.0 is built on Apache Spark™ 3.5. Similar to Apache Spark, we have released Maven artifacts for both Scala 2.12 and Scala 2.13.

Documentation: https://docs.delta.io/3.2.0/index.html
API documentation: https://docs.delta.io/3.2.0/delta-apidoc.html#delta-spark
RC2 artifacts: delta-spark_2.12, delta-spark_2.13, delta-contribs_2.12, delta-contribs_2.13, delta-storage, delta-storage-s3-dynamodb
RC2 Python artifacts: See delta-3.2-rc2-python-artifacts.zip attached

The key features of this release are:

Support for Liquid clustering: This allows for incremental clustering based on ZCubes and reduces the write amplification by not touching files already well clustered (i.e., files in stable ZCubes). Users can now use the ALTER TABLE CLUSTER BY syntax to change clustering columns and use the DESCRIBE DETAIL command to check the clustering columns. In addition, Delta Spark now supports DeltaTable clusterBy API in both Python and Scala to allow creating clustered tables using DeltaTable API. See the documentation and examples for more information.
Preview support for Type Widening: Delta Spark can now change the type of a column from byte to short to integer using the ALTER TABLE t CHANGE COLUMN col TYPE type command or with schema evolution during MERGE and INSERT operations. The table remains readable by Delta 3.2 readers without requiring the data to be rewritten. For compatibility with older versions, a rewrite of the data can be triggered using the ALTER TABLE t DROP FEATURE 'typeWidening-preview' command.
- Note that this feature is in preview and that tables created with this preview feature enabled may not be compatible with future Delta Spark releases.
Support for Vacuum Inventory: Delta Spark now extends the VACUUM SQL command to allow users to specify an inventory table in a VACUUM command. When an inventory table is provided, VACUUM will consider the files listed there instead of doing the full listing of the table directory, which can be time consuming for very large tables. See the docs here.
Support for Vacuum Writer Protocol Check: Delta Spark now supports the vacuumProtocolCheck ReaderWriter feature, which ensures that reader and writer protocol checks are applied consistently during VACUUM operations. This addresses potential protocol discrepancies and mitigates the risk of data corruption due to skipped writer checks.
Preview support for In-Commit Timestamps: When enabled, this preview feature persists monotonically increasing timestamps within Delta commits, ensuring they are not affected by file operations. When enabled, time travel queries will yield consistent results, even if the table directory is relocated.
- Note that this feature is in preview and that tables created with this preview feature enabled may not be compatible with future Delta Spark releases.
Deletion Vectors Read Performance Improvements: Two improvements were introduced to DVs in Delta 3.2.
- Removing broadcasting of DV information to executors: This work improves stability by reducing driver memory consumption, preventing potential driver OOM errors on very large Delta tables (1TB+). It also improves performance by avoiding the fixed broadcasting overhead when reading small Delta tables.
- Supporting predicate pushdown and splitting in scans with DVs: This improves the performance of filtered queries on tables with DVs through predicate pushdown and file splitting. This feature yields a 2x performance improvement on average.
Support for Row Tracking: Delta Spark can now write to tables that maintain information that allows identifying rows across multiple versions of a Delta table. Delta Spark can now also access this tracking information using the two metadata fields _metadata.row_id and _metadata.row_commit_version.
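
As a rough illustration of the SQL surface described in the features above, the following Python sketch collects one statement sequence per feature. The table and column names (events, tracked_events, events_inventory, id, ts, amount) are hypothetical. The table properties delta.enableTypeWidening and delta.enableRowTracking and the USING INVENTORY clause are not spelled out in these notes; they are assumptions to verify against the Delta 3.2.0 documentation. The statements are plain strings, so nothing runs until run_feature is called with a SparkSession that has Delta configured.

```python
# Hypothetical tables and columns; property names are assumptions, see the lead-in.
FEATURE_SQL = {
    "liquid_clustering": [
        # Create a clustered table, change its clustering columns, then inspect them.
        "CREATE TABLE events (id BIGINT, ts TIMESTAMP, amount SMALLINT) "
        "USING DELTA CLUSTER BY (id)",
        "ALTER TABLE events CLUSTER BY (ts)",
        "DESCRIBE DETAIL events",
    ],
    "type_widening": [
        # Enable the preview feature, then widen short -> integer without a rewrite.
        "ALTER TABLE events SET TBLPROPERTIES ('delta.enableTypeWidening' = 'true')",
        "ALTER TABLE events CHANGE COLUMN amount TYPE INT",
        # Only needed when the table must be readable by older Delta versions.
        "ALTER TABLE events DROP FEATURE 'typeWidening-preview'",
    ],
    "vacuum_inventory": [
        # Use a pre-built inventory table instead of listing the table directory.
        "VACUUM events USING INVENTORY events_inventory",
    ],
    "row_tracking": [
        "CREATE TABLE tracked_events (id BIGINT) USING DELTA "
        "TBLPROPERTIES ('delta.enableRowTracking' = 'true')",
        "SELECT _metadata.row_id, _metadata.row_commit_version, id FROM tracked_events",
    ],
}


def run_feature(spark, feature):
    """Run one feature's statements in order and return the resulting DataFrames."""
    return [spark.sql(statement) for statement in FEATURE_SQL[feature]]
```

For example, run_feature(spark, "liquid_clustering") would issue the three clustering statements in order on an existing session.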

Other notable changes include:

Delta Sharing: reduce the minimum RPC interval in delta sharing streaming from 30 seconds to 10 seconds
Improve the performance of write operations by skipping collecting commit stats
New SQL configurations to specify Delta Log cache size (spark.databricks.delta.delta.log.cacheSize) and retention duration (spark.databricks.delta.delta.log.cacheRetentionMinutes)
Fix bug in plan validation due to inconsistent field metadata in MERGE
Improved metrics during VACUUM for better visibility
Hive Metastore schema sync: The truncation threshold for schemas with long fields is now user configurable
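
The two Delta Log cache settings listed above can be passed like any other Spark configuration. The sketch below turns them into spark-submit style arguments; the values 100 and 30 are placeholders chosen for illustration only, since these notes do not state defaults or recommended values.

```python
# Keys are taken from the notes above; the values are illustrative placeholders.
LOG_CACHE_CONF = {
    "spark.databricks.delta.delta.log.cacheSize": "100",
    "spark.databricks.delta.delta.log.cacheRetentionMinutes": "30",
}


def to_conf_args(conf):
    """Flatten a conf mapping into ['--conf', 'key=value', ...] for spark-submit."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print the arguments so they can be pasted after `spark-submit`.
    print(" ".join(to_conf_args(LOG_CACHE_CONF)))
```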

Delta Universal Format (UniForm)

Documentation: https://docs.delta.io/3.2.0/delta-uniform.html
RC2 artifacts: delta-iceberg_2.12, delta-iceberg_2.13, delta-hudi_2.12, delta-hudi_2.13

Hudi is now supported by Delta Universal format in addition to Iceberg. Writing to a Delta UniForm table can generate Hudi metadata, alongside Delta. This feature is contributed by XTable.

Create a UniForm-enabled table that automatically generates Hudi metadata using the following command:

CREATE TABLE T (c1 INT) USING DELTA TBLPROPERTIES ('delta.universalFormat.enabledFormats' = 'hudi');

See the documentation here for more details.

Other notable changes include:

Throw a better error if Iceberg conversion fails during initial sync
Fix a bug in Delta Universal Format to support correct table overwrites

Delta Kernel

API documentation: https://docs.delta.io/3.2.0/api/java/kernel/index.html
RC2 artifacts: delta-kernel-api, delta-kernel-defaults

The Delta Kernel project is a set of Java libraries (Rust will be coming soon!) for building Delta connectors that can read (and, soon, write to) Delta tables without the need to understand the Delta protocol details. In this release, we improved the read support to make it production-ready by adding numerous performance improvements, additional functionality, and improved protocol support.

Support for time travel. Now you can read a table snapshot at a version id or snapshot at a timestamp.
Improved Delta protocol support.
- Support for reading tables with checkpoint v2.
- Support for reading tables with partition columns of type timestamp.
- Support for reading tables with column data type timestamp_ntz.
Improved table metadata read performance and reliability on very large tables with millions of files
- Improved checkpoint reading latency by pushing the partition predicate to the checkpoint Parquet reader to minimize the number of checkpoint files read.
- Improved state reconstruction latency by using LogStores from delta-storage module for faster listFrom calls.
- Retry loading the _last_checkpoint file in case of transient failures. Loading the last checkpoint info from this file helps construct the Delta table state faster.
- Optimization to minimize the number of listing calls to object store when trying to find a last checkpoint at or before a version.
Other notable changes include:
- Support for IS_NULL expression. Now the Predicate passed to Kernel ScanBuilder can include IS_NULL predicates.
- Support for custom ParquetHandler implementations to read multiple Parquet files in parallel. The current default implementation reads one file at a time, but connectors can implement their own custom ParquetHandler to read the Parquet files in parallel.

For more information, refer to:

User guide on step-by-step process of using Kernel in a standalone Java program or in a distributed processing connector.
Slides explaining the rationale behind Kernel and the API design.
Example Java programs that illustrate how to read Delta tables using the Kernel APIs.
Table and default Engine API Java documentation
Migration guide to upgrade your connector to use the 3.2.0 APIs

How to use this Release Candidate

Download Spark 3.5 from https://spark.apache.org/downloads.html.

Important: Clear your package cache to ensure you’re effectively testing the latest Delta RC and not a previously released binary: rm -rf ~/.ivy2/cache/

For this release candidate, we have published the artifacts to a staging repository. Here’s how you can use them:

Spark Submit

Add --repositories https://oss.sonatype.org/content/repositories/iodelta-1138 to the command line arguments.
Example:

spark-submit --packages io.delta:delta-spark_2.12:3.2.0 --repositories https://oss.sonatype.org/content/repositories/iodelta-1138 examples/examples.py

Currently Spark shells (PySpark and Scala) do not accept the external repositories option. However, once the artifacts have been downloaded to the local cache, the shells can be run with Delta 3.2.0 by just providing the --packages io.delta:delta-spark_2.12:3.2.0 argument.

Spark Shell

bin/spark-shell --packages io.delta:delta-spark_2.12:3.2.0 \
  --repositories https://oss.sonatype.org/content/repositories/iodelta-1138 \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

Spark SQL

bin/spark-sql --packages io.delta:delta-spark_2.12:3.2.0 \
  --repositories https://oss.sonatype.org/content/repositories/iodelta-1138 \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

Maven

<repositories>
  <repository>
    <id>staging-repo</id>
    <url>https://oss.sonatype.org/content/repositories/iodelta-1138</url>
  </repository>
</repositories>
<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-spark_2.12</artifactId>
  <version>3.2.0</version>
</dependency>

SBT Project

libraryDependencies += "io.delta" %% "delta-spark" % "3.2.0"
resolvers += "Delta" at "https://oss.sonatype.org/content/repositories/iodelta-1138"

(PySpark) Delta-Spark

Download two artifacts from pre-release: https://github.com/delta-io/delta/releases/tag/v3.2.0rc2
Artifacts to download are:
- delta-spark-3.2.0.tar.gz
- delta_spark-3.2.0-py3-none-any.whl
Keep them in one directory. Let’s call that ~/Downloads
pip install ~/Downloads/delta_spark-3.2.0-py3-none-any.whl
pip show delta-spark should show output similar to the below

Name: delta-spark
Version: 3.2.0
Summary: Python APIs for using Delta Lake with Apache Spark
Home-page: https://github.com/delta-io/delta/
Author: The Delta Lake Project Authors
Author-email: delta-users@googlegroups.com
License: Apache-2.0
Location: /home/<user.name>/.conda/envs/delta-release/lib/python3.8/site-packages
Requires: importlib-metadata, pyspark
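
The same check can be scripted. The sketch below is a minimal example, not part of the release tooling: installed_version reads the installed delta-spark version through the standard library, and rc_session_conf mirrors the spark-shell flags shown earlier in these notes as Spark configuration keys (spark.jars.packages and spark.jars.repositories are standard Spark settings). The helper names and the application name delta-rc-check are hypothetical. build_session needs pyspark and network access to the staging repository, so it is defined here but not called.

```python
from importlib import metadata

# Staging repository URL as given in these notes.
STAGING_REPO = "https://oss.sonatype.org/content/repositories/iodelta-1138"


def installed_version(package="delta-spark"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def rc_session_conf(scala_version="2.12"):
    """Spark settings equivalent to the spark-shell flags for this release candidate."""
    return {
        "spark.jars.packages": f"io.delta:delta-spark_{scala_version}:3.2.0",
        "spark.jars.repositories": STAGING_REPO,
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }


def build_session(conf):
    """Create a SparkSession from the mapping; requires pyspark to be installed."""
    # Imported lazily so the helpers above work on machines without Spark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("delta-rc-check")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

After the pip install step, installed_version() would be expected to return the same version string that pip show reports.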

Credits

Adam Binford, Ala Luszczak, Allison Portis, Ami Oka, Andreas Chatzistergiou, Arun Ravi M V, Babatunde Micheal Okutubo, Bo Gao, Carmen Kwan, Chirag Singh, Chloe Xia, Christos Stavrakakis, Costas Zarifis, Daniel Tenedorio, Davin Tjong, Dhruv Arya, Felipe Pessoto, Fred Storage Liu, Fredrik Klauss, Gabriel Russo, Hao Jiang, Hyukjin Kwon, Ian Streeter, Jason Teoh, Jiaheng Tang, Jing Zhan, Jintian Liang, Johan Lasperas, Jonas Irgens Kylling, Juliusz Sompolski, Kaiqi Jin, Lars Kroll, Lin Zhou, Miles Cole, Nick Lanham, Ole Sasse, Paddy Xu, Prakhar Jain, Rachel Bushrian, Rajesh Parangi, Renan Tomazoni Pinzon, Sabir Akhadov, Scott Sandre, Simon Dahlbacka, Sumeet Varma, Tai Le, Tathagata Das, Thang Long Vu, Tim Brown, Tom van Bussel, Venki Korukanti, Wei Luo, Wenchen Fan, Xupeng Li, Yousof Hosny, Gene Pang, Jintao Shen, Kam Cheung Ting, panbingkun, ram-seek, sokolat, tangjiafu

delta-3.2-rc2-python-artifacts.zip
