pytorch/pytorch v2.5.0

Released: 2024-10-18 00:26:53

PyTorch 2.5 Release Notes

Highlights

We are excited to announce the release of PyTorch® 2.5! This release features a new cuDNN backend for SDPA, enabling speedups by default for users of SDPA on H100 or newer GPUs. In addition, regional compilation of torch.compile offers a way to reduce the cold-start time of torch.compile by allowing users to compile a repeated nn.Module (e.g. a transformer layer in an LLM) without recompilations. Finally, the TorchInductor CPP backend offers solid performance speedups with numerous enhancements such as FP16 support, the CPP wrapper, AOT-Inductor mode, and max-autotune mode. This release is composed of 4095 commits from 504 contributors since PyTorch 2.4. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.5. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page. Please also check out our new ecosystem project releases for TorchRec and TorchFix.

Beta | Prototype
CuDNN backend for SDPA | FlexAttention
torch.compile regional compilation without recompilations | Compiled Autograd
TorchDynamo added support for exception handling & MutableMapping types | Flight Recorder
TorchInductor CPU backend optimization | Max-autotune Support on CPU with GEMM Template
  | TorchInductor on Windows
  | FP16 support on CPU path for both eager mode and TorchInductor CPP backend
  | Autoload Device Extension
  | Enhanced Intel GPU support

*To see a full list of public feature submissions click here.

BETA FEATURES

[Beta] CuDNN backend for SDPA

The cuDNN "Fused Flash Attention" backend has landed for torch.nn.functional.scaled_dot_product_attention. On NVIDIA H100 GPUs this can provide up to a 75% speedup over FlashAttentionV2. The speedup is enabled by default for all users of SDPA on H100 or newer GPUs.
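A minimal usage sketch (assuming the SDPBackend.CUDNN_ATTENTION value exposed by torch.nn.attention): the backend is chosen automatically, but it can be pinned explicitly to confirm it is being used.

import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

# Half-precision tensors shaped (batch, heads, seq_len, head_dim) on a CUDA device.
q, k, v = (torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16) for _ in range(3))

# Restrict SDPA to the cuDNN backend; this errors out instead of silently
# falling back if the backend is unavailable on the current GPU.
with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v)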

[Beta] torch.compile regional compilation without recompilations

Regional compilation without recompilations is enabled via torch._dynamo.config.inline_inbuilt_nn_modules, which defaults to True in 2.5+. This option allows users to compile a repeated nn.Module (e.g. a transformer layer in an LLM) without recompilations. Compared to compiling the full model, this option can result in smaller compilation latencies, with a 1%-5% performance degradation.

See the tutorial for more information.
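A minimal sketch of the pattern, with hypothetical Block and Model classes standing in for a transformer: compile the repeated layer rather than the full model, so every instance reuses one compiled region.

import torch
import torch.nn as nn

class Block(nn.Module):  # hypothetical stand-in for a transformer layer
    def __init__(self, dim=256):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.ff(x)

class Model(nn.Module):
    def __init__(self, num_layers=12):
        super().__init__()
        self.layers = nn.ModuleList(Block() for _ in range(num_layers))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

torch._dynamo.config.inline_inbuilt_nn_modules = True  # already the default in 2.5+

model = Model()
for layer in model.layers:
    layer.compile()  # compile the repeated region; all blocks share one compiled artifact

out = model(torch.randn(4, 256))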

[Beta] TorchInductor CPU backend optimization

This feature advances Inductor's CPU backend optimization, including CPP backend code generation and FX fusions with customized CPU kernels. The Inductor CPU backend supports vectorization of common data types and all Inductor IR operations, along with static and symbolic shapes. It is compatible with both Linux and Windows and supports the default Python wrapper, the CPP wrapper, and AOT-Inductor mode.

Additionally, it extends the max-autotune mode of the GEMM template (prototyped in 2.5), offering further performance gains. The backend supports various FX fusions, lowering to customized kernels such as oneDNN for Linear/Conv operations and SDPA. The Inductor CPU backend consistently achieves performance speedups across three benchmark suites—TorchBench, Hugging Face, and timms—outperforming eager mode in 97.5% of the 193 models tested.
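A minimal sketch on CPU; the torch._inductor.config.cpp_wrapper flag is assumed here as the switch from the default Python wrapper to the CPP wrapper mentioned above.

import torch

torch._inductor.config.cpp_wrapper = True  # opt into the CPP wrapper (the Python wrapper is the default)

model = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.GELU()).eval()
compiled = torch.compile(model)  # Inductor is the default torch.compile backend
with torch.no_grad():
    out = compiled(torch.randn(32, 128))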

PROTOTYPE FEATURES

[Prototype] FlexAttention

We've introduced a flexible API that enables implementing various attention mechanisms such as Sliding Window, Causal Mask, and PrefixLM with just a few lines of idiomatic PyTorch code. This API leverages torch.compile to generate a fused FlashAttention kernel, which eliminates extra memory allocation and achieves performance comparable to handwritten implementations. Additionally, we automatically generate the backwards pass using PyTorch's autograd machinery. Furthermore, our API can take advantage of sparsity in the attention mask, resulting in significant improvements over standard attention implementations.

For more information and examples, please refer to the official blog post and Attention Gym.
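A minimal sketch of a causal mask expressed as a score_mod, following the flex_attention API in torch.nn.attention.flex_attention.

import torch
from torch.nn.attention.flex_attention import flex_attention

def causal(score, b, h, q_idx, kv_idx):
    # Keep scores where the query attends to earlier (or equal) positions; mask out the rest.
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

q, k, v = (torch.randn(2, 8, 1024, 64, device="cuda") for _ in range(3))

# Compiling flex_attention generates the fused kernel described above.
out = torch.compile(flex_attention)(q, k, v, score_mod=causal)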

[Prototype] Compiled Autograd

Compiled Autograd is an extension to the PT2 stack allowing the capture of the entire backward pass. Unlike the backward graph traced by AOT dispatcher, Compiled Autograd tracing is deferred until backward execution time, which makes it impervious to forward pass graph breaks, and allows it to record backward hooks into the graph.

Please refer to the tutorial for more information.
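A minimal sketch following the pattern in the tutorial: enable the config flag and run backward() inside a compiled function so the backward graph is captured at execution time.

import torch

torch._dynamo.config.compiled_autograd = True

model = torch.nn.Linear(10, 10)

@torch.compile
def train_step(x):
    loss = model(x).sum()
    loss.backward()  # the backward pass is traced here, at backward execution time

train_step(torch.randn(8, 10))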

[Prototype] Flight Recorder

Flight recorder is a new debugging tool that helps debug stuck jobs. The tool works by continuously capturing information about collectives as they run. Upon detecting a stuck job, the information can be used to quickly identify misbehaving ranks/machines along with code stack traces.

For more information please refer to the following tutorial.
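A minimal sketch; the environment variable names below are taken from the flight recorder tutorial and should be treated as assumptions.

import os

# Enable the in-memory trace buffer and dumping before the process group is created.
os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"  # number of collective entries retained per rank
os.environ["TORCH_NCCL_DUMP_ON_TIMEOUT"] = "1"       # dump recorded traces when a watchdog timeout fires

import torch.distributed as dist
dist.init_process_group(backend="nccl")
# ... run collectives as usual; if the job hangs, the dumped traces identify the misbehaving ranks.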

[Prototype] Max-autotune Support on CPU with GEMM Template

Max-autotune mode for the Inductor CPU backend in torch.compile profiles multiple implementations of operations at compile time and selects the best-performing one. This is particularly beneficial for GEMM-related operations, using a C++ template-based GEMM implementation as an alternative to the ATen-based approach with oneDNN and MKL libraries. We support FP32, BF16, FP16, and INT8 with epilogue fusions for x86 CPUs. We’ve seen up to 7% geomean speedup on the dynamo benchmark suites and up to 20% boost in next-token latency for LLM inference.

For more information please refer to the tutorial.
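A minimal sketch on CPU: mode="max-autotune" asks Inductor to benchmark candidate GEMM implementations (including the C++ template) at compile time and keep the fastest one.

import torch

model = torch.nn.Linear(1024, 1024).eval()  # a GEMM-heavy layer where autotuning helps
x = torch.randn(64, 1024)

compiled = torch.compile(model, mode="max-autotune")
with torch.no_grad():
    out = compiled(x)  # the first call triggers profiling of the candidate kernels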

[Prototype] TorchInductor CPU on Windows

The Inductor CPU backend in torch.compile now works on Windows. We currently support MSVC (cl), clang (clang-cl), and the Intel compiler (icx-cl) for Inductor on Windows.

See the tutorial for more details.

[Prototype] FP16 support on CPU path for both eager mode and TorchInductor CPP backend

Float16 is a commonly used reduced-precision floating point type for improving performance in neural network inference and training. As of this release, float16 is supported on the CPU path for both eager mode and the TorchInductor CPP backend.
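A minimal sketch of float16 inference on CPU, both in eager mode and through torch.compile (which lowers to the TorchInductor CPP backend on CPU).

import torch

model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU()).eval().half()
x = torch.randn(16, 256).half()

with torch.no_grad():
    eager_out = model(x)                    # eager-mode float16 on CPU
    compiled_out = torch.compile(model)(x)  # the same path through the Inductor CPP backend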

[Prototype] Autoload Device Extension

PyTorch now supports autoloading for out-of-tree device extensions, streamlining integration by eliminating the need for manual imports. This feature, enabled through the torch.backends entrypoint, simplifies usage by ensuring seamless extension loading, while allowing users to disable it via an environment variable if needed.

See the tutorial for more information.
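A hypothetical packaging sketch for an out-of-tree backend (the package name torch_foo and the _autoload hook are made up): registering under the torch.backends entry-point group lets a plain import torch load the extension automatically.

# setup.py of a hypothetical out-of-tree backend package "torch_foo".
from setuptools import setup

setup(
    name="torch_foo",
    packages=["torch_foo"],
    entry_points={
        "torch.backends": [
            "torch_foo = torch_foo:_autoload",  # hypothetical module-level initialization hook
        ],
    },
)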

[Prototype] Enhanced Intel GPU support

Enhanced Intel GPU support is now available for both the Intel® Data Center GPU Max Series and Intel® Client GPUs (Intel® Core™ Ultra processors with built-in Intel® Arc™ graphics, and Intel® Arc™ Graphics for dGPU parts), making it easier to accelerate your machine learning workflows on Intel GPUs in the PyTorch 2.5 release. We have also enabled initial support for PyTorch on Windows for Intel® Client GPUs in this release.

These features are available through PyTorch preview and nightly binary PIP wheels. For more information regarding Intel GPU support, please refer to the documentation.
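With the preview or nightly wheels installed, Intel GPUs are exposed through the "xpu" device; a minimal sketch:

import torch

if torch.xpu.is_available():
    model = torch.nn.Linear(128, 128).to("xpu")
    x = torch.randn(32, 128, device="xpu")
    out = torch.compile(model)(x)  # torch.compile also targets Intel GPUs in 2.5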

Backwards Incompatible changes

Distributed

Export

Inductor

mps

nn

Optimizer Frontend

Python Frontend

ONNX

Options to torch.onnx.export (except for the first three arguments) are now keyword-only (#131501)

Options can be supplied by keywords only to allow for future addition and evolution of the torch.onnx.export API.

Example:
Version 2.4:

torch.onnx.export(model, input, f, True, False)  

Version 2.5:

torch.onnx.export(model, input, f, export_params=True, verbose=False)  

Deprecated internal API torch.onnx._export has been removed (#133824)

torch.onnx._export is an internal API which is not meant for public consumption. Use the public torch.onnx.export instead.

Example:
Version 2.4:

torch.onnx._export(...)  

Version 2.5:

torch.onnx.export(...)  

The op_level_debug option from torch.onnx.ExportOptions has been removed (#134961)

This option, designed to identify operator discrepancies, proved unreliable and has been removed. Instead, use the report=True and verify=True options of torch.onnx.export(...) to validate exported models.

The ONNXProgramSerializer class has been removed (#135261)

The ONNX model in torch.onnx.ONNXProgram is now maintained and serialized by ONNX IR.
textproto, onnxtext, and json formats are supported by default when calling ONNXProgram.save() with a corresponding file extension.

The SymbolicContext class has been removed (#132184)

The deprecated torch.onnx.SymbolicContext class has been removed. (Non-dynamo) custom symbolic functions can no longer take ctx: torch.onnx.SymbolicContext as the first argument.

Support for caffe2 has been removed (#129021)

Some error classes have been removed

CheckerError and InvalidExportOptionsError are removed. Users can always catch RuntimeError to handle torch.onnx export errors.

Deprecations

Dynamo

Export

Inductor

Releng

ONNX

Supplying model keyword arguments to torch.onnx.export is deprecated (#131501)

The ability to supply model keyword arguments as a final dictionary is deprecated. Users should use the kwargs parameter instead.

Deprecated:

torch.onnx.export(model, (arg1, arg2, {"kwarg1": ...}))

Future:

torch.onnx.export(model, (arg1, arg2), kwargs={"kwarg1": ...})  

torch.onnx.OperatorExportTypes is deprecated (#131501)

The ability to supply operator_export_type in torch.onnx.export() is deprecated. Exported ONNX graphs will always use the ONNX opset domain. Options ONNX_FALLTHROUGH, ONNX_ATEN and ONNX_ATEN_FALLBACK are no longer supported. The OperatorExportTypes class will be removed in a future release.

The training option in torch.onnx.export is deprecated

Set the model training mode first before exporting instead.

Deprecated:

torch.onnx.export(model, inputs, path, training=torch.onnx.TrainingMode.EVAL)  

Future:

model = model.eval()  
torch.onnx.export(model, inputs, path)  

New features

Autograd frontend

Distributed

Flight Recorder with an analyzer

c10d

Dynamo

Export

Inductor

nn

Optim

Optimizer Frontend

Profiler

Python Frontend

Quantization

PT2E Numeric Debugger

Releng

XPU

Sparse Frontend

ONNX

The dynamo=True option and new export logic (#132530, #133743, #134304, #134782, #135378, #135399, #135786, #136162, #135134, #134976, #135367, #135418, #135591, #135520)

We introduce the dynamo=True option in torch.onnx.export(). This is recommended as a replacement for torch.onnx.dynamo_export starting in PyTorch 2.5.

Version 2.5:

onnx_program = torch.onnx.export(model, inputs, kwargs=kwargs, dynamo=True)  
# Use the external_data option to save weights as external data  
onnx_program.save("model.onnx", external_data=True)  
# To save without initializers  
onnx_program.save("model.onnx", include_initializers=False, keep_initializers_as_inputs=True)  

torch.onnx.export(model, args, dynamo=True, report=True, verify=True) leverages torch.export and ONNX IR to convert captured ExportedPrograms to ONNX efficiently and robustly. This new process reduces memory consumption by half compared to dynamo_export in 2.4, while preserving rich tensor shape and stack trace information in the ONNX graph. You can leverage the report=True option to obtain a conversion report in markdown format to diagnose any conversion issues. Set verify=True to verify the ONNX model numerically with ONNX Runtime.

When using external_data=True to save model weights as external data to the .onnx file, weights larger than 1 MB are now aligned at 64 KB addresses. This allows runtimes to memory-map weights for better memory efficiency during inference.

[NOTE]
The dynamo=True option currently supports only ONNX opset 18. Future releases will expand support to newer opsets.

[NOTE]
The dynamo=True option requires the latest versions of onnxscript and onnx packages.

Improvements

Autograd frontend

Composability

Custom ops:

Dynamic shapes:

Decompositions, FakeTensor and meta tensors

Operator decompositions, FakeTensors and meta tensors are used to trace out a graph in torch.compile and torch.export. They received several improvements:

Decompositions:
Meta tensors:
Misc fixes:

Cpp frontend

Cuda

Distributed

Activation Checkpointing (AC)

c10d

DeviceMesh

DTensor

DistributedStateDict (DSD)

FullyShardedDataParallel (FSDP)

fully_shard (FSDP2)

TorchElastic

TensorParallel (TP)

Pipelining

Dynamo

Export

ForEach Frontend

Fx

Inductor

mps

nn

Optim

Optimizer Frontend

Profiler

Python Frontend

Quantization

PT2E quantization

Observers

Export IR Migration

Others

Releng

Infrastructure

XPU

Intel GPU Backend for Inductor

Intel GPU ATen Operation

Intel GPU Runtime and Generalization

Nested-Tensor Frontend

cuDNN

Sparse Frontend

ONNX

ROCm

Bug fixes

Autograd frontend

Composability

Cuda

Distributed

Distributed checkpoint

c10d

CPU profiler for distributed

DSD

DeviceMesh

DTensor

FSDP2

TensorParallel (TP)

RPC

Dynamo

Export

ForEach Frontend

Fx

Jit

Linalg Frontend

mps

nn

Optim

Optimizer Frontend

Profiler

Python Frontend

Releng

XPU

Nested-Tensor Frontend

cuDNN

ONNX

ROCm

Performance

Cuda

Distributed

CPU profiler for distributed

Dynamo

Compile time improvements

Fx

Inductor

mps

Profiler

Quantization

cuDNN

Sparse Frontend

Documentation

Autograd frontend

Distributed

TorchElastic

c10d

DeviceMesh

DTensor

Pipelining

Dynamo

Fx

Inductor

jit

Linalg Frontend

mps

nn

Optim

Optimizer Frontend

Profiler

Python Frontend

Releng

XPU

Sparse Frontend

ONNX

Developers

Distributed

c10d

DTensor

FSDP

FSDP2

DSD

TorchElastic

Fx

Optim

Optimizer Frontend

Releng

XPU

ONNX

Security

Inductor

Linalg Frontend

Optimizer Frontend

Quantization
