
PyTorch 2.4 Release Notes

Highlights

We are excited to announce the release of PyTorch® 2.4! PyTorch 2.4 adds support for the latest version of Python (3.12) for torch.compile. AOTInductor freezing gives developers running AOTInductor more performance-based optimizations by allowing the serialization of MKLDNN weights. In addition, a new default TCPStore server backend utilizing libuv has been introduced, which should significantly reduce initialization times for users running large-scale jobs. Finally, a new Python Custom Operator API makes it easier than before to integrate custom kernels into PyTorch, especially for torch.compile.
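
As an illustration of the new Python Custom Operator API, here is a minimal sketch (not taken from the release notes) that wraps a NumPy kernel behind a hypothetical mylib::numpy_sin operator, assuming the torch.library.custom_op decorator introduced in 2.4:

import numpy as np
import torch

# Register a custom operator backed by a third-party (NumPy) kernel.
@torch.library.custom_op("mylib::numpy_sin", mutates_args=())
def numpy_sin(x: torch.Tensor) -> torch.Tensor:
    # Run the computation outside of PyTorch and wrap the result back up.
    return torch.from_numpy(np.sin(x.cpu().numpy())).to(x.device)

# A "fake" implementation gives torch.compile the output shape/dtype
# without running the real kernel during tracing.
@numpy_sin.register_fake
def _(x):
    return torch.empty_like(x)

y = numpy_sin(torch.randn(4))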

This release is composed of 3661 commits and 475 contributors since PyTorch 2.3. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.4. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.

Beta
- Python 3.12 support for torch.compile
- AOTInductor Freezing for CPU
- New Higher-level Python Custom Operator API
- Switching TCPStore’s default server backend to libuv

Prototype
- FSDP2: DTensor-based per-parameter-sharding FSDP
- torch.distributed.pipelining, simplified pipeline parallelism
- Intel GPU is available through source build

Performance Improvements
- torch.compile optimizations for AWS Graviton (aarch64-linux) processors
- BF16 symbolic shape optimization in TorchInductor
- Performance optimizations for GenAI projects utilizing CPU devices

*To see a full list of public feature submissions click here.

Tracked Regressions

Subproc exception with torch.compile and onnxruntime-training

There is a reported issue (#131070) when using torch.compile if the onnxruntime-training library is installed. The issue will be fixed (#131194) in v2.4.1. It can be worked around locally by setting the environment variable TORCHINDUCTOR_WORKER_START=fork before executing the script.
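
A small sketch of the workaround applied from within the script itself, assuming setting the variable before torch is imported is equivalent to exporting it in the shell:

import os

# Assumed in-script equivalent of the documented workaround: use the fork
# start method for TorchInductor's compile workers. Set the variable before
# importing torch so the inductor configuration sees it.
os.environ["TORCHINDUCTOR_WORKER_START"] = "fork"

import torch

@torch.compile
def f(x):
    return x.sin() + x.cos()

f(torch.randn(8))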

cu118 wheels will not work with pre-cuda12 drivers

It was also reported (#130684) that the new version of Triton uses CUDA features that are not compatible with pre-CUDA 12 drivers. In this case, the workaround is to set TRITON_PTXAS_PATH manually as follows (adjust the path to match the local installation):

TRITON_PTXAS_PATH=/usr/local/lib/python3.10/site-packages/torch/bin/ptxas  python script.py

Backwards Incompatible Changes

Python frontend

Default ThreadPool size to number of physical cores (#125963)

Changed the default number of threads used for intra-op parallelism from the number of logical cores to the number of physical cores. This should reduce core oversubscription when running CPU workloads and improve performance. The previous behavior can be recovered by using torch.set_num_threads to set the number of threads to the desired value, as shown below.
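
A minimal sketch of restoring the old default, assuming os.cpu_count() reports the number of logical cores on the target machine:

import os
import torch

# Restore the pre-2.4 default of one intra-op thread per logical core.
# os.cpu_count() is assumed to return the logical core count here.
torch.set_num_threads(os.cpu_count())
print(torch.get_num_threads())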

Fix torch.quasirandom.SobolEngine.draw default dtype handling (#126781)

The default dtype value has been changed from torch.float32 to the current default dtype as given by torch.get_default_dtype() to be consistent with other APIs.
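
For instance, a small sketch of the new behavior and of how to keep float32 samples explicitly:

import torch
from torch.quasirandom import SobolEngine

# With a non-default global dtype, draw() now follows torch.get_default_dtype().
torch.set_default_dtype(torch.float64)
sample = SobolEngine(dimension=3).draw(5)
assert sample.dtype == torch.float64

# Pass dtype explicitly to keep the previous float32 behavior.
sample32 = SobolEngine(dimension=3).draw(5, dtype=torch.float32)
assert sample32.dtype == torch.float32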

Forbid subclassing torch._C._TensorBase directly (#125558)

This is an internal class that users could previously subclass to create objects that behave almost like a Tensor in Python, and it was advertised as such in some tutorials. This is no longer allowed, to improve consistency; all users should subclass torch.Tensor directly.
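
A minimal sketch of the supported pattern, subclassing torch.Tensor instead of the internal base class:

import torch

# Subclass the public torch.Tensor instead of torch._C._TensorBase.
class MyTensor(torch.Tensor):
    pass

# Tensor.as_subclass views an existing tensor as the subclass type.
t = torch.randn(3).as_subclass(MyTensor)
assert isinstance(t, MyTensor) and isinstance(t, torch.Tensor)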

Composability

Non-compositional usages of as_strided + mutation under torch.compile will raise an error (#122502)

The torch.compile flow involves functionalizing any mutations inside the region being compiled. torch.as_strided is an existing view op that can be used non-compositionally: when you call x.as_strided(...), as_strided only considers the underlying storage size of x and ignores its current size/stride/storage_offset when creating the new view. This makes it difficult to safely functionalize mutations on views of as_strided that are created non-compositionally, so we ban them rather than risk silent correctness issues under torch.compile.

An example of a non-compositional usage of as_strided followed by a mutation that we now error on is shown below. You can avoid this issue by rewriting your usage of as_strided so that it is compositional (for example: either use a different set of view ops instead of as_strided, or call as_strided directly on the base tensor instead of on an existing view of it).

@torch.compile
def foo(a):
    e = a.diagonal()
    # as_strided is being called on an existing view (e),
    # making it non-compositional. mutations to f under torch.compile
    # are not allowed, as we cannot easily functionalize them safely
    f = e.as_strided((2,), (1,), 0)
    f.add_(1.0)
    return a

We now verify schemas of custom ops at registration time (#124520)

Previously, you could register a custom op through the operator registration APIs but give it a schema that contained types unknown to the PyTorch Dispatcher. This behavior came from TorchScript, where “unknown” types were implicitly treated by the TorchScript interpreter as type variables. However, calling such a custom op through regular PyTorch would result in an error later. As of 2.4, we raise an error at registration time, when you first register the custom operator. You can get the old behavior back by constructing the schema with allow_typevars=true.

TORCH_LIBRARY(my_ns, m) {
  // this now raises an error at registration time: bar/baz are unknown types
  m.def("my_ns::foo(bar t) -> baz");
  // you can get back the old behavior with the below flag
  m.def(torch::schema("my_ns::foo(bar t) -> baz", /*allow_typevars*/ true));
}

Autograd frontend

Delete torch.autograd.function.traceable APIs (#122817)

The torch.autograd.function.traceable(...) API, which sets the is_traceable class attribute on a torch.autograd.Function class, was deprecated in 2.3 and has now been deleted. This API did not do anything and was only meant for internal purposes. The following raised a warning in 2.3 and now errors because the API has been deleted:

@torch.autograd.function.traceable
class Func(torch.autograd.Function):
    ...

Release engineering

Optim

Distributed

DeviceMesh

Update get_group and add get_all_groups (#128097)

In 2.3 and before, users could do:

mesh_2d = init_device_mesh(
    "cuda", (2, 2), mesh_dim_names=("dp", "tp")
)
mesh_2d.get_group()  # This will return all sub-pgs within the mesh
assert mesh_2d.get_group()[0] == mesh_2d.get_group(0)
assert mesh_2d.get_group()[1] == mesh_2d.get_group(1)

But from 2.4 forward, calling get_group without passing in a mesh dim raises a RuntimeError. Instead, users should use get_all_groups:

mesh_2d = init_device_mesh(
    "cuda", (2, 2), mesh_dim_names=("dp", "tp")
)
mesh_2d.get_group()  # This will throw a RuntimeError
assert mesh_2d.get_all_groups()[0] == mesh_2d.get_group(0)
assert mesh_2d.get_all_groups()[1] == mesh_2d.get_group(1)

Pipelining

Retire torch.distributed.pipeline (#127354)

In 2.3 and before, users could do:

import torch.distributed.pipeline  # emits a deprecation warning: this module will be removed; migrate to torch.distributed.pipelining

But from 2.4 forward, the import above raises a ModuleNotFoundError. Instead, users should use torch.distributed.pipelining:

import torch.distributed.pipeline # -> ModuleNotFoundError
import torch.distributed.pipelining

jit

Fx

Complete revamp of float/promotion sympy handling (#126905)

ONNX

Deprecations

Python frontend

Composability

CPP

Release Engineering

Optim

nn

Distributed

Profiler

Quantization

Export

XPU

ONNX

New Features

Python frontend

Composability

Optim

nn frontend

linalg

Distributed

c10d

FullyShardedDataParallel v2 (FSDP2)

Pipelining

Profiler

Dynamo

Export

Inductor

jit

MPS

XPU

ONNX

Vulkan

Improvements

Python frontend

Composability

Autograd frontend

Release Engineering

nn frontend

Optim

Foreach

cuda

Quantization

Distributed

c10d

DeviceMesh

Distributed quantization

DistributedDataParallel (DDP)

Distributed Checkpointing (DCP)

DTensor

FullyShardedDataParallel (FSDP)

ShardedTensor

TorchElastic

Tensor Parallel

Profiler

Profiler torch.profiler:

Memory Snapshot torch.cuda.memory._dump_snapshot:

Profiler record_function:

Export

Fx

Dynamo

Inductor

jit

ONNX

MPS

XPU

Bug fixes

Python frontend fixes

Composability fixes

cuda fixes

Autograd frontend fixes

Release Engineering fixes

nn frontend fixes

Optim fixes

linalg fixes

CPP fixes

Distributed fixes

c10d

DeviceMesh

DistributedDataParallel (DDP)

Distributed Checkpointing (DCP)

FullyShardedDataParallel (FSDP)

TorchElastic

Profiler fixes

Dynamo fixes

Export fixes

Fx fixes

Inductor fixes

ONNX fixes

MPS fixes

XPU fixes

Performance

Python frontend

cuda

nn frontend

Optim

linalg

Foreach

Distributed

C10d

DTensor

Distributed Checkpointing (DCP)

TorchElastic

jit

Fx

Inductor

MPS

XPU

Documentation

Python frontend

Composability

cuda

Autograd frontend

Release Engineering

nn frontend

Optim

linalg

Distributed

c10d

Distributed Checkpointing (DCP)

DTensor

FullyShardedDataParallel (FSDP)

Profiler

Export

Fx

Dynamo

Inductor

ONNX

XPU

Developers

Composability

Release Engineering

Optim

Distributed

c10d

DTensor

Distributed Checkpointing (DCP)

FullyShardedDataParallel (FSDP)

Miscellaneous

TorchElastic

Fx

Inductor

MPS

XPU

Security

Python frontend

Release Engineering
