pytorch/pytorch v2.1.0

Released: 2023-10-05 01:32:12

PyTorch 2.1 Release Notes

Highlights

We are excited to announce the release of PyTorch® 2.1! PyTorch 2.1 offers automatic dynamic shape support in torch.compile, torch.distributed.checkpoint for saving/loading distributed training jobs on multiple ranks in parallel, and torch.compile support for the NumPy API.
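
For example, the new NumPy support lets torch.compile trace functions written directly against the NumPy API. A minimal sketch (the function below is illustrative):

import numpy as np
import torch

@torch.compile
def numpy_fn(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    # NumPy calls inside the compiled region are traced and executed
    # through the PyTorch compiler stack.
    return np.sum(x * y, axis=-1)

out = numpy_fn(np.random.randn(4, 8), np.random.randn(4, 8))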

In addition, this release offers numerous performance improvements (e.g. CPU inductor improvements, AVX512 support, scaled-dot-product-attention support) as well as a prototype release of torch.export, a sound full-graph capture mechanism, and torch.export-based quantization.
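
As a sketch of the torch.export prototype mentioned above (a prototype API, so details may change between releases), a module can be captured into a single full graph like this:

import torch

class M(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1

# Capture a single, full graph of the module with example inputs.
ep = torch.export.export(M(), (torch.randn(3),))
print(ep.graph_module.graph)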

Along with 2.1, we are also releasing a series of updates to the PyTorch domain libraries. More details can be found in the library updates blog.

This release is composed of 6,682 commits and 784 contributors since 2.0. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.1. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.

Summary:

| Stable | Beta | Prototype | Performance Improvements |
|---|---|---|---|
|  | Automatic Dynamic Shapes | torch.export() | AVX512 kernel support |
|  | torch.distributed.checkpoint | torch.export-based Quantization | CPU optimizations for scaled-dot-product-attention (SDPA) |
|  | torch.compile + NumPy | semi-structured (2:4) sparsity | CPU optimizations for bfloat16 |
|  | torch.compile + Python 3.11 | cpp_wrapper for torchinductor |  |
|  | torch.compile + autograd.Function |  |  |
|  | third-party device integration: PrivateUse1 |  |  |

*To see a full list of public 2.1, 2.0, and 1.13 feature submissions click here.

For more details about these highlighted features, you can look at the release blogpost. Below are the full release notes for this release.

Backwards Incompatible Changes

Building PyTorch from source now requires C++ 17 (#100557)

The PyTorch codebase has migrated from the C++14 to the C++17 standard, so a C++17 compatible compiler is now required to compile PyTorch, to integrate with libtorch, or to implement a C++ PyTorch extension.
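
For example, a custom C++ extension now has to be built with a C++17-capable toolchain. A minimal, hypothetical setup.py sketch (the extension name and source file are placeholders):

from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name="my_ext",  # hypothetical extension name
    ext_modules=[
        CppExtension(
            "my_ext",
            ["my_ext.cpp"],
            # The compiler must support C++17 to build against PyTorch 2.1 headers.
            extra_compile_args=["-std=c++17"],
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)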

Disable torch.autograd.{backward, grad} for complex scalar output (#92753)

Gradients are not defined for functions that don't return real outputs; we now raise an error if you try to call backward on complex outputs. Previously, the complex component of the output was implicitly ignored. If you wish to preserve this behavior, you must now explicitly call .real on your complex outputs before calling .grad() or .backward().

Example

import torch

def fn(x):
    return (x * 0.5j).sum()

x = torch.ones(1, dtype=torch.double, requires_grad=True)
o = fn(x)

2.0.1

o.backward()

2.1

o.real.backward()

Update non-reentrant checkpoint to allow nesting and support autograd.grad (#90105)

As part of a larger refactor of torch.utils.checkpoint, we changed how activation checkpointing interacts with retain_graph=True. Previously, in 2.0.1, recomputed activations were kept alive if retain_graph=True; in PyTorch 2.1, the non-reentrant implementation now clears recomputed tensors on backward immediately upon unpack, even if retain_graph=True. This has the following additional implications: (1) Accessing ctx.saved_tensors twice in the same backward will now raise an error. (2) Accessing _saved_tensors multiple times will silently recompute the forward multiple times.

2.1

import torch
from torch.utils.checkpoint import checkpoint

class Func(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        out = x.exp()
        ctx.save_for_backward(out)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        out, = ctx.saved_tensors
        # Calling ctx.saved_tensors again will raise in 2.1
        out, = ctx.saved_tensors
        return grad_out * out

a = torch.tensor(1., requires_grad=True)

def fn(x):
    return Func.apply(x)

out = checkpoint(fn, a, use_reentrant=False)

def fn2(x):
    return x.exp()

out = checkpoint(fn2, a, use_reentrant=False)

out.grad_fn._saved_result
# Calling _saved_result will trigger another unpack, and lead to forward being
# recomputed again
out.grad_fn._saved_result

Only sync buffers when broadcast_buffers is True (#100729)

DDP now synchronizes module buffers across ranks only when broadcast_buffers=True.

2.0.1

from torch.nn.parallel import DistributedDataParallel as DDP
module = torch.nn.Linear(4, 8)
module = DDP(module) # Buffer is synchronized across all devices.
module = DDP(module, broadcast_buffers=False) # Buffer is still synchronized across all devices.
...

2.1

from torch.nn.parallel import DistributedDataParallel as DDP
module = torch.nn.Linear(4, 8)
module = DDP(module) # Buffer is synchronized across all devices.
module = DDP(module, broadcast_buffers=False) # Buffer is NOT synchronized across all devices.
...

Remove store barrier after PG init (#99937)

The store-based barrier at the end of init_process_group can now be disabled via the TORCH_DIST_INIT_BARRIER environment variable (it remains enabled by default).

2.0.1

from torch.distributed.distributed_c10d import init_process_group
init_process_group(...) # Will call _store_based_barrier in the end.
...

2.1

from torch.distributed.distributed_c10d import init_process_group
import os
os.environ["TORCH_DIST_INIT_BARRIER"] = "1" # This is the default behavior
init_process_group(...) # Will call _store_based_barrier in the end.
os.environ["TORCH_DIST_INIT_BARRIER"] = "0"
init_process_group(...) # Will not call _store_based_barrier in the end.
...

Disallow non-bool masks in torch.masked_{select, scatter, fill} (#96112, #97999, #96594)

This completes the deprecation cycle for non-bool masks: these functions now require the mask's dtype to be torch.bool.

>>> # 2.0.1
>>> inp = torch.rand(3)
>>> mask = torch.tensor([0, 1, 0], dtype=torch.uint8)
>>> torch.masked_select(inp, mask)
UserWarning: masked_select received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (Triggered internally at ../aten/src/ATen/native/TensorAdvancedIndexing.cpp:1855.)
  torch.masked_select(inp, mask)

>>> torch.masked_select(inp, mask.to(dtype=torch.bool))
# Works fine

>>> correct_mask = torch.tensor([0, 1, 0], dtype=torch.bool)
>>> torch.masked_select(inp, correct_mask)
# Works fine

>>> # 2.1
>>> inp = torch.rand(3)
>>> mask = torch.tensor([0, 1, 0], dtype=torch.uint8)
>>> torch.masked_select(inp, mask)
RuntimeError: masked_select: expected BoolTensor for mask

>>> correct_mask = torch.tensor([0, 1, 0], dtype=torch.bool)
>>> torch.masked_select(inp, correct_mask)
# Works fine

>>> torch.masked_select(inp, mask.to(dtype=torch.bool))
# Works fine

Fix the result of torch.unique to make it consistent with NumPy when dim is specified (#101693)

The dim argument was clarified and its behavior aligned with NumPy's to indicate which sub-tensors are compared for uniqueness. See the documentation for more details: https://pytorch.org/docs/stable/generated/torch.unique.html
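
A minimal illustration of the dim semantics (each slice along dim is the unit of uniqueness, matching numpy.unique(..., axis=0)):

import torch

x = torch.tensor([[1, 2], [1, 2], [3, 4]])
# Each slice along dim=0 (each row) is compared as a whole, so the
# duplicate row [1, 2] is collapsed.
print(torch.unique(x, dim=0))
# tensor([[1, 2],
#         [3, 4]])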

Make the Index Rounding Mode Consistent Between the 2D and 3D GridSample Nearest Neighbor Interpolations (#97000)

Prior to this change, for torch.nn.functional.grid_sample(mode='nearest') the forward 2D kernel used std::nearbyint whereas the forward 3D kernel used std::round in order to determine the nearest pixel locations after un-normalization of the grid. Additionally, the backward kernels for both used std::round. This change fixes the inconsistencies to use std::nearbyint, which rounds values that are exactly half-way (e.g. 2.5) to the nearest even integer, consistent with the behavior of torch.round. Unnormalized indices that are exactly half-way will now be rounded to the nearest even integer instead of being rounded away from 0.
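
The rounding convention that the nearest-neighbor kernels now follow is the same half-to-even rule used by torch.round, for example:

import torch

# Half-way values are rounded to the nearest even integer rather than
# away from zero; grid_sample's nearest-mode index rounding now matches this.
print(torch.round(torch.tensor([0.5, 1.5, 2.5, -0.5, -1.5])))
# tensor([ 0.,  2.,  2., -0., -2.])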

Turned input shapes (aka record_shapes) off by default for on-demand tracing (#97917)

Profiler traces collected by on-demand tracing via IPC Fabric will have record_shapes off by default.
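
For reference, when profiling through the Python API (a separate path from on-demand tracing), input-shape recording remains an explicit opt-in flag. A minimal sketch:

import torch
from torch.profiler import ProfilerActivity, profile

# record_shapes must be requested explicitly to collect input shapes.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    torch.mm(torch.randn(64, 64), torch.randn(64, 64))
print(prof.key_averages(group_by_input_shape=True).table(row_limit=5))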

When called with a 0-dim tensor input, torch.aminmax would previously inconsistently return a 1D tensor output on CPU, but a 0D tensor output on CUDA. This has been fixed, so we consistently return a 0D tensor in both cases. (#96171).

In v2.0.1:

>>> torch.aminmax(torch.tensor(1, device='cpu'), dim=0, keepdim=True)
__main__:1: UserWarning: An output with one or more elements was resized since it had shape [], which does not match the required output shape [1]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (Triggered internally at ../aten/src/ATen/native/Resize.cpp:24.)
torch.return_types.aminmax(
min=tensor([1]),
max=tensor([1]))
>>> torch.aminmax(torch.tensor(1, device='cpu'), dim=0, keepdim=False)
torch.return_types.aminmax(
min=tensor(1),
max=tensor(1))

In v2.1.0:

>>> torch.aminmax(torch.tensor(1, device='cpu'), dim=0, keepdim=True)
torch.return_types.aminmax(
min=tensor(1),
max=tensor(1))
>>> torch.aminmax(torch.tensor(1, device='cpu'), dim=0, keepdim=False)
torch.return_types.aminmax(
min=tensor(1),
max=tensor(1))

Change to the default behavior for custom operators registered to the dispatcher that do not have anything registered to an Autograd dispatch key

If you have a custom operator that has a CPU/CUDA kernel registered to the CPU/CUDA dispatch key, but has no implementation at the Autograd key, then:

Old behavior: When calling this operator with tensor inputs that require gradients, the tensor outputs would silently not require gradients.

New behavior: When calling this operator with tensor inputs that do require gradients, the tensor outputs will require gradients (as long as they are floating-point or complex) and will raise an error if you try to backpropagate through them.

More information on how to recover the old behavior can be found in the PRs (#104481, #105078).
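
A minimal sketch of the new behavior, using torch.library to register a custom op with only a CPU kernel (the namespace my_ns and the op name are made up for illustration):

import torch

lib = torch.library.Library("my_ns", "DEF")
lib.define("times_two(Tensor x) -> Tensor")
lib.impl("times_two", lambda x: x * 2, "CPU")  # no Autograd kernel registered

x = torch.randn(3, requires_grad=True)
out = torch.ops.my_ns.times_two(x)
print(out.requires_grad)  # 2.0.1: False (silently); 2.1: True
# out.sum().backward()    # 2.1: raises, since no backward formula is registered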

torch.autograd.Function: Raise an error if an input is returned as-is and saved for forward or backward in setup_context (#98051)

If you are writing a custom autograd Function that uses setup_context, and your forward function returns an input as-is as an output, then saving that tensor for forward or backward now raises an error. You should return an alias of the input instead.

2.0.1

class Cube(torch.autograd.Function):
    @staticmethod
    def forward(x):
        return x ** 3, x

    @staticmethod
    def setup_context(ctx, inputs, outputs):
        cube, x = outputs
        ctx.save_for_backward(x)

    @staticmethod
    def backward(ctx, grad_output, grad_x):
        # NB: grad_x intentionally not used in computation
        x, = ctx.saved_tensors
        result = grad_output * 3 * x ** 2
        return result

2.1

class Cube(torch.autograd.Function):
    @staticmethod
    def forward(x):
        return x ** 3, x.view_as(x)

    ...

Deprecations

Deprecate not specifying the use_reentrant flag explicitly when using torch.utils.checkpoint (#100551)

In PyTorch 2.1, if the use_reentrant flag is not explicitly passed, a warning is raised. To retain current behavior, pass use_reentrant=True. The default value will be updated to use_reentrant=False in the future. We recommend using use_reentrant=False.

2.1

torch.utils.checkpoint.checkpoint(fn, a) # Warns in 2.1 because use_reentrant was not passed explicitly

Deprecate torch.has_* attributes (#103279)

Use the version in the particular backend module at torch.backends.* to access these flags. Also note that these modules now properly differentiate between is_built() (compile-time availability) and is_available() (runtime availability).
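
For example, instead of the deprecated torch.has_cuda / torch.has_mps style flags:

import torch

print(torch.backends.cuda.is_built())     # was PyTorch compiled with CUDA support?
print(torch.cuda.is_available())          # is a usable CUDA device present at runtime?
print(torch.backends.mps.is_built())      # was PyTorch compiled with MPS support?
print(torch.backends.mps.is_available())  # is MPS usable at runtime?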

Deprecate check_sparse_nnz argument for torch.autograd.gradcheck (#97187)

2.0.1

torch.autograd.gradcheck(fn, inputs, check_sparse_nnz=True)

2.1

torch.autograd.gradcheck(fn, inputs, masked=True)

NVFuser integration with TorchScript is deprecated (#105185)

NVFuser replaced the Neural Network Compiler (NNC) as the default GPU fuser for TorchScript in PyTorch 1.13. In PyTorch 2.1, TorchScript switched its default fuser back to NNC, and NVFuser for TorchScript is now deprecated. Users can currently still choose NVFuser manually instead of NNC; see the fuser options for details on how to do this.
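
A sketch of selecting the fuser explicitly (assuming a CUDA device is available; NNC is "fuser1" and NVFuser is "fuser2" in the torch.jit.fuser context manager):

import torch

@torch.jit.script
def f(x):
    return torch.sin(x) * torch.cos(x)

x = torch.randn(1024, device="cuda")

with torch.jit.fuser("fuser1"):  # NNC, the default again in 2.1
    for _ in range(3):           # warm up so the profiling executor fuses
        f(x)

with torch.jit.fuser("fuser2"):  # NVFuser, now deprecated for TorchScript
    for _ in range(3):
        f(x)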

New features

Release Engineering

Python Frontend

optim

torch.compile

Sparse Frontend

Autograd

torch.nn

torch.export

functorch

Distributed

c10d

Distributed Tensor

FullyShardedDataParallel:

DTensor based Distributed Checkpoint

Profiler

ONNX

New TorchDynamo ONNX Exporter

New torch.compile ONNX Runtime backend (#107973, #106929, #106589)

- Usage: `torch.compile(..., backend="onnxrt")`
- Available when `torch.onnx.is_onnxrt_backend_supported()` returns `True`
- Additional Python package dependencies: `onnx`, `onnxscript`, `onnxruntime`
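
A minimal usage sketch (requires the optional dependencies listed above to be installed):

import torch

def f(x):
    return torch.sin(x) + torch.cos(x)

# Only use the backend when its runtime dependencies are present.
if torch.onnx.is_onnxrt_backend_supported():
    compiled = torch.compile(f, backend="onnxrt")
    print(compiled(torch.randn(8)))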

Additional TorchScript ONNX exporter operators:

Others

MPS

torch.fx

Quantization

Export Quantization:

JIT

Vulkan

Improvements

Python Frontend

Dataloader and DataPipe

torch.nn

functorch

optim

Linear Algebra

Autograd

Sparse

Nested Tensor

Foreach Frontend

Build Frontend

CPU

CUDA

MPS

torch.export

torch.fx

Quantization

Profiler

General Profiling

Memory Profiling

ONNX

TorchScript ONNX exporter

Distributed

Activation checkpointing

DistributedDataParallel (DDP)

FullyShardedDataParallel (FSDP)

Distributed Tensor (Prototype Release)

Distributed (c10d)

Distributed Checkpoint

Torch Elastic

RPC

Dynamo

Inductor

JIT

Misc

Bug fixes

Python Frontend

Autograd

optim

torch.nn

functorch

Distributed

Distributed (c10d)

FullyShardedDataParallel

Distributed Tensor (Prototype Release)

torch.compile

Dynamic Shapes

This release includes a large number of dynamic-shapes bugfixes, too many to enumerate one-by-one. Some important points:

Other bug fixes

In addition, we have the following fixes broken down into roughly 4 parts:

The first three cover a large number of general improvements to torch.compile, since torch.compile captures a graph internally by using these major components (fake tensors, prims and decomps, and AOTAutograd; see the docs at https://pytorch.org/get-started/pytorch-2.0/).

Primtorch and decompositions bugfixes

There were a large number of fixes to the primtorch and ref decompositions, which are used in torch.compile during graph capture. These all fixed quite a few bugs in torch.compile:

FakeTensor and Meta function fixes

Fake tensors and meta functions are used internally to perform "shape inference" during graph capture when running torch.compile. In particular, when we capture a graph of PyTorch operators, we'd like detailed information on the shapes of intermediate and output tensors in the graph. There were a large number of bugfixes and improvements to these two subsystems over the last release.
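
As a rough illustration of how fake tensors propagate shapes without real data (FakeTensorMode is an internal API and may change):

import torch
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    x = torch.empty(4, 8)
    y = torch.empty(8, 16)
    out = x @ y          # no real computation happens
    print(out.shape)     # torch.Size([4, 16]), inferred from metadata only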

Operator bugfixes:

Increased operator coverage:

Other:

AOTAutograd bugfixes

AOTAutograd is a major component of the torch.compile stack, and received many bugfixes and improvements over the last release.

Sparse

Linear Algebra

Profiler

Quantization

CUDA

Intel

MPS

Vulkan

Build

ONNX

TorchScript ONNX exporter

TorchDynamo ONNX exporter

torch.fx

Dynamo

Misc TorchDynamo fixes

Misc dynamic shapes fixes

Benchmark related bug fixes

Export related bug fixes

Logger bug fixes

Minifier related bug fixes

Inductor

JIT

Misc

Performance

General

torch.optim

torch.nn

Sparse

Improved performance in the following:

torch.compile

Distributed

Distributed (c10d)

Distributed Tensor (Prototype Release)

FullyShardedDataParallel:

CUDA

Intel

MPS

Vulkan

ONNX

Inductor

Release Engineering

torch.export

JIT

Documentation

CUDA

DataPipe

torch.fx

torch.export

Intel

Linear Algebra

optim

Python Frontend

Quantization

Inductor

Release Engineering

Dynamo

nn_frontend

ONNX

Distributed

FullyShardedDataParallel

Distributed (c10d)

Distributed Checkpoint

RPC

Sparse Frontend

Composability

Dynamic Shapes

Dynamo

Developers

torch.fx

Inductor

Composability

Release Engineering

Autograd Frontend

JIT

optim

ONNX

Distributed

FullyShardedDataParallel

Distributed (c10d)

DistributedDataParallel

Distributed Tensor (Prototype Release)

Sparse Frontend

Security

Release Engineering
