
v2.2.0

pytorch/pytorch

Release date: 2024-01-31 01:58:51


PyTorch 2.2 Release Notes

Highlights

We are excited to announce the release of PyTorch® 2.2! PyTorch 2.2 offers ~2x performance improvements to scaled_dot_product_attention via FlashAttention-v2 integration, as well as AOTInductor, a new ahead-of-time compilation and deployment tool built for non-Python server-side deployments.

This release also includes improved torch.compile support for Optimizers, a number of new inductor optimizations, and a new logging mechanism called TORCH_LOGS.
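For example, existing calls to scaled_dot_product_attention pick up the FlashAttention-v2 kernel automatically on supported GPUs, with no code changes. A minimal sketch (the shapes and dtype below are illustrative assumptions, not from the release notes):

import torch
import torch.nn.functional as F

# Half-precision CUDA tensors in (batch, heads, seq_len, head_dim) layout are
# eligible for the flash attention backend.
q = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)

# Kernel selection (FlashAttention-2, memory-efficient, or math) happens
# automatically based on the inputs and the available hardware.
out = F.scaled_dot_product_attention(q, k, v)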

Please note that we are deprecating macOS x86 support, and PyTorch 2.2.x will be the last version that supports macOS x64.

Along with 2.2, we are also releasing a series of updates to the PyTorch domain libraries. More details can be found in the library updates blog.

This release is composed of 3,628 commits and 521 contributors since PyTorch 2.1. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.2. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.

Summary:

Stable: —
Beta: FlashAttentionV2 backend for scaled dot product attention, AOTInductor, TORCH_LOGS, torch.distributed.device_mesh, torch.compile + Optimizers
Prototype: PT 2 Quantization, Scaled dot product attention support for jagged layout NestedTensors
Performance Improvements: Inductor optimizations, aarch64-linux optimizations (AWS Graviton)

*To see a full list of public 2.2 - 1.12 feature submissions click here.
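As a quick illustration of the new TORCH_LOGS mechanism listed above, logging for torch.compile components can be enabled either via the TORCH_LOGS environment variable or from Python. A minimal sketch (the chosen components are just examples):

# Environment-variable form, e.g.:
#   TORCH_LOGS="dynamo,graph_breaks,recompiles" python train.py
import logging
import torch

# Programmatic form: debug logging for Dynamo plus the graph-break artifact.
torch._logging.set_logs(dynamo=logging.DEBUG, graph_breaks=True)

@torch.compile
def f(x):
    return x.sin() + x.cos()

f(torch.randn(8))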

Tracked Regressions

Performance reduction when using NVLSTree algorithm in NCCL 2.19.3 (#117748)

We have noticed a performance regression introduced to all-reduce in NCCL 2.19.3. Please use version 2.19.1 instead.
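To confirm which NCCL version a given CUDA build of PyTorch is using (a quick check, not an official workaround), you can query it from Python:

import torch

# Returns a version tuple such as (2, 19, 3) on CUDA builds that bundle NCCL.
print(torch.cuda.nccl.version())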

Poor numeric stability of loss when training with FSDP + DTensor (#117471)

We have observed that the loss can randomly flatline in some instances when training with FSDP + DTensor.

Backwards Incompatible Changes

Building PyTorch from source now requires GCC 9.4 or newer (#112858)

GCC 9.4 is the oldest version fully compatible with C++17, which the PyTorch codebase has migrated to from C++14.

Updated flash attention kernel in scaled_dot_product_attention to use Flash Attention v2 (#105602)

Previously, the v1 Flash Attention kernel had a Windows implementation, so a user on Windows could explicitly force the flash attention kernel to run via the sdp_kernel context manager with only flash attention enabled. In 2.2, Flash Attention v2 does not have a Windows implementation, so if the sdp_kernel context manager must be used on Windows, enable the memory-efficient or math kernel instead.

2.1
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    torch.nn.functional.scaled_dot_product_attention(q, k, v)

2.2
# Don't force flash attention to be used if using sdp_kernel on Windows
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=True):
    torch.nn.functional.scaled_dot_product_attention(q, k, v)

Rewrote DTensor (Tensor Parallel) APIs to improve UX (#114732)

In PyTorch 2.1 and earlier, users could use ParallelStyles like PairwiseParallel and specify input/output layouts with functions such as make_input_replicate_1d or make_output_replicate_1d, with default values provided for _prepare_input and _prepare_output. The Tensor Parallel UX looked like this:

from torch.distributed.tensor.parallel import parallelize_module
from torch.distributed.tensor.parallel.style import (
    ColwiseParallel,
    make_input_replicate_1d,
    make_input_reshard_replicate,
    make_input_shard_1d,
    make_input_shard_1d_last_dim,
    make_sharded_output_tensor,
    make_output_replicate_1d,
    make_output_reshard_tensor,
    make_output_shard_1d,
    make_output_tensor,
    PairwiseParallel,
)
from torch.distributed._tensor import DeviceMesh

module = DummyModule()
device_mesh = DeviceMesh("cuda", list(range(self.world_size)))
parallelize_module(module, device_mesh, PairwiseParallel(_prepare_input=make_input_replicate_1d))
...

Starting with PyTorch 2.2, the parallel styles have been simplified to just ColwiseParallel and RowwiseParallel, since the other ParallelStyles can be composed from these two. The input/output functions have been removed in favor of input_layouts and output_layouts kwargs that specify the sharding layouts of the input/output tensors. Finally, PrepareModuleInput/PrepareModuleOutput styles were added; they have no default layout arguments, so users must specify the layouts explicitly and think about how their tensors are sharded.

from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    PrepareModuleInput,
    RowwiseParallel,
    parallelize_module,
)
from torch.distributed._tensor import init_device_mesh, Replicate, Shard

module = SimpleMLPModule()
device_mesh = init_device_mesh("cuda", (self.world_size,))
parallelize_module(
    module,
    device_mesh,
    {
        "fqn": PrepareModuleInput(
                   input_layouts=Shard(0),
                   desired_input_layouts=Replicate()
               ),
        "fqn.net1": ColwiseParallel(),
        "fqn.net2": RowwiseParallel(output_layouts=Shard(0)),
    }
)
...

UntypedStorage.resize_ now uses the original device instead of the current device context (#113386)

Before this PR, UntypedStorage.resize_ would move data to the current CUDA device index (given by torch.cuda.current_device()). Now, UntypedStorage.resize_() keeps the data on the same device index that it was on before, regardless of the current device index.

2.1
>>> import torch
>>> with torch.cuda.device('cuda:0'):
...:     a = torch.zeros(0, device='cuda:1')
...:     print(a.device)
...:     a = a.untyped_storage().resize_(0)
...:     print(a.device)
cuda:1
cuda:0

2.2
>>> import torch
>>> with torch.cuda.device('cuda:0'):
...:     a = torch.zeros(0, device='cuda:1')
...:     print(a.device)
...:     a = a.untyped_storage().resize_(0)
...:     print(a.device)
cuda:1
cuda:1

Wrapping a function with set_grad_enabled will consume its global mutation (#113359)

This bc-breaking change fixes some unexpected behavior when set_grad_enabled is used as a decorator.

2.1
>>> import torch
>>> @torch.set_grad_enabled(False)  # unexpectedly, this mutates the global grad mode!
... def inner_func(x):
...     return x.sin()

>>> torch.is_grad_enabled()
False

2.2
>>> import torch
>>> @torch.set_grad_enabled(False)  # the decorator now consumes the global mutation
... def inner_func(x):
...     return x.sin()

>>> torch.is_grad_enabled()
True

Deprecated verbose parameter in LRScheduler constructors (#111302)

As part of our decision to move towards a consolidated logging system, we are deprecating the verbose flag in LRScheduler.

If you would like to print the learning rate during execution, please use get_last_lr().

2.1
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = ReduceLROnPlateau(optimizer, 'min', verbose=True)
for epoch in range(10):
    train(...)
    val_loss = validate(...)
    # Note that step should be called after validate()
    scheduler.step(val_loss)

2.2
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = ReduceLROnPlateau(optimizer, 'min')
for epoch in range(10):
    train(...)
    val_loss = validate(...)
    # Note that step should be called after validate()
    scheduler.step(val_loss)
    print(f"Epoch {epoch} has concluded with lr of {scheduler.get_last_lr()}")

Removed deprecated c10d multi-gpu-per-thread APIs (#114156)

In PyTorch 2.1 and earlier, users could use our multi-GPU c10d collective APIs such as all_reduce_multigpu:

2.1
import torch.distributed as dist

dist.broadcast_multigpu
dist.all_reduce_multigpu
dist.reduce_multigpu
dist.all_gather_multigpu
dist.reduce_scatter_multigpu
...

2.2
# These multi-GPU-per-thread collectives no longer exist; see the note below.

In PyTorch 2.2, these APIs have been removed because PyTorch Distributed's preferred programming model is one device per thread, as exemplified by the APIs in its documentation. The multi-GPU functions (which stood for multiple GPUs per CPU thread) had been deprecated since PyTorch 1.13.
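As a hedged migration sketch (not taken from the release notes): under the one-device-per-process model, each process owns a single GPU and calls the regular single-tensor collectives such as dist.all_reduce instead of the removed *_multigpu variants. A minimal example launched with torchrun, assuming one GPU per rank:

import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets LOCAL_RANK; each process drives exactly one GPU.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # One tensor per process/device replaces the old list-of-tensors style
    # used by all_reduce_multigpu.
    t = torch.ones(4, device=f"cuda:{local_rank}")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: {t}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()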

Rename torch.onnx.ExportOutput* to ONNXProgram* (#112263)

The output of torch.onnx.dynamo_export was renamed from torch.onnx.ExportOutput to torch.onnx.ONNXProgram to better align with the torch.export.export API, which returns a torch.export.ExportedProgram. This change eliminates any ambiguity between the two APIs.

2.1
export_output: torch.onnx.ExportOutput = torch.onnx.dynamo_export(...)

2.2
onnx_program: torch.onnx.ONNXProgram = torch.onnx.dynamo_export(...)
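For reference, a minimal end-to-end sketch of the renamed API (the toy module and file name below are illustrative, not from the release notes):

import torch

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x)

# dynamo_export now returns a torch.onnx.ONNXProgram (formerly ExportOutput).
onnx_program = torch.onnx.dynamo_export(TinyModel(), torch.randn(2, 3))
onnx_program.save("tiny_model.onnx")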

Fix functional::smooth_l1_loss signatures to not override beta (#109798)

Previously, there were two ways to pass beta to smooth_l1_loss: as a SmoothL1LossFuncOptions field or as a function parameter.

A beta specified as a function parameter would silently override the options value if both were set, which was unexpected behavior. Now, an error is thrown when beta is passed both ways.

Deprecations

Autograd API

Deprecate not passing use_reentrant kwarg to torch.utils.checkpoint.checkpoint_sequential explicitly (#114158)

The use_reentrant parameter should be passed explicitly; starting in version 2.4, an exception will be raised if it is not. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to the docs for more details on the differences between the two variants. Note that not passing the use_reentrant kwarg to torch.utils.checkpoint.checkpoint was already deprecated in a previous release.

2.1
import torch
from torch.utils.checkpoint import checkpoint_sequential

a = torch.randn(3, requires_grad=True)
modules_list = [
    torch.nn.Linear(3, 3),
    torch.nn.Linear(3, 3),
    torch.nn.Linear(3, 3),
]

# This would produce a warning in 2.2
checkpoint_sequential(modules_list, 3, a)

2.2
# Recommended
checkpoint_sequential(modules_list, 3, a, use_reentrant=False)

# To preserve existing behavior
checkpoint_sequential(modules_list, 3, a, use_reentrant=True)

Deprecate "fallthrough" as autograd fallback default (#113166)

Custom operators that do not have a kernel registered to the Autograd keys (e.g. AutogradCPU and AutogradCUDA) will now produce a warning when used with autograd. If your custom operator previously returned floating-point or complex Tensors that do not require grad, they will now require grad as long as grad mode is enabled and the inputs require grad. For users who would like the old behavior, register torch::CppFunction::makeFallthrough() to your Autograd key, as shown here.

The example below uses the torch.library API; if you are writing an operator in a C++ extension, please read this doc for more information.

import torch
import numpy as np

# Define the operator
torch.library.define("mylibrary::sin", "(Tensor x) -> Tensor")

# Add implementations for the cpu device
@torch.library.impl("mylibrary::sin", "cpu")
def f(x):
    return torch.from_numpy(np.sin(x.detach().numpy()))
x = torch.randn(3, requires_grad=True)
y = torch.ops.mylibrary.sin(x)
y.sum().backward()
2.1
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

2.2
UserWarning: mylibrary::sin: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd.

Linalg

Deprecate torch.cross default behavior (#108760)

Calling torch.cross without specifying the dim arg is now deprecated. This behavior will be changed to match that of torch.linalg.cross in a future release.
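A short illustration (the shapes are arbitrary): pass dim explicitly, or use torch.linalg.cross, which defaults to dim=-1.

import torch

a = torch.randn(4, 3)
b = torch.randn(4, 3)

# Deprecated: with no dim, torch.cross picks the first dimension of size 3,
# which can silently differ from the last-dim behavior most users expect.
# c = torch.cross(a, b)

# Preferred: be explicit, or use the linalg variant.
c = torch.cross(a, b, dim=-1)
c_linalg = torch.linalg.cross(a, b)  # dim=-1 by default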

Jit

NVFuser functionality has been removed from TorchScript (#110124, #111447, #110881)

Neural Network Compiler (NNC) has replaced NVFuser as the default GPU fuser for TorchScript in PyTorch 2.1, which also added a deprecation warning for NVFuser. The TorchScript functionality for NVFuser has now been fully removed and is no longer supported.

Optimizer

SparseAdam constructor will no longer accept raw Tensor type for params (#114425)

SparseAdam is now consistent with the rest of our optimizers and only accepts containers instead of individual Tensors/Parameters/param groups.

2.1
import torch
param = torch.rand(16, 32)
optimizer = torch.optim.SparseAdam(param)

2.2
import torch
param = torch.rand(16, 32)
optimizer = torch.optim.SparseAdam([param])

New Features

torch.compile

Dynamo

Inductor

torch.export

Build

Python API

Profiler

Quantization

Sparse API

NestedTensor API

Misc

Fx

ONNX

CPU

MPS

Vulkan

Improvements

torch.compile

Dynamo

Inductor

torch.export

Composability

Python API

torch.nn API

Linalg API

Optimizer API

torch.func

Misc

Quantization

NestedTensor API

Distributed

CPU

CUDA

Fx

Jit

MPS

ONNX

ROCm

Vulkan

Bug fixes

Autograd API

Cpp API

Foreach API

Linalg API

NestedTensor API

Optimizer API

Python API

Sparse API

torch.compile

Dynamo

Inductor

torch.export

torch.func API

torch.nn API

Build

Composability

CPU

CUDA

Distributed

Fx

Jit

Lazy

Mps

ONNX

Profiler

Quantization

Releng

Visualization

Vulkan

Performance

Autograd API

Cpp API

Linalg API

NestedTensor API

Optimizer API

Sparse API

torch.compile API

Inductor

torch.func API

CPU

CUDA

Distributed

Fx

Vulkan

Documentation

Autograd API

Dataloader API

Linalg API

Optimizer API

Python API

torch.compile API

Inductor

torch.export API

torch.func API

torch.nn API

Build

Composability

CUDA

Distributed

Mps

ONNX

Profiler

Quantization

Security

Releng
