v2.4.0
Release date: 2024-07-25 02:39:28
Latest pytorch/pytorch release: v2.5.1 (2024-10-30 01:58:24)
PyTorch 2.4 Release Notes
- Highlights
- Tracked Regressions
- Backward incompatible changes
- Deprecations
- New features
- Improvements
- Bug Fixes
- Performance
- Documentation
- Developers
- Security
Highlights
We are excited to announce the release of PyTorch® 2.4!
PyTorch 2.4 adds support for the latest version of Python (3.12) for torch.compile. AOTInductor freezing gives developers running AOTInductor more performance-based optimizations by allowing the serialization of MKLDNN weights. In addition, a new default TCPStore server backend utilizing libuv has been introduced, which should significantly reduce initialization times for users running large-scale jobs. Finally, a new Python Custom Operator API makes it easier than before to integrate custom kernels into PyTorch, especially for torch.compile.
This release is composed of 3661 commits and 475 contributors since PyTorch 2.3. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.4. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.
Beta | Prototype | Performance Improvements
--- | --- | ---
Python 3.12 support for torch.compile | FSDP2: DTensor-based per-parameter-sharding FSDP | torch.compile optimizations for AWS Graviton (aarch64-linux) processors
AOTInductor Freezing for CPU | torch.distributed.pipelining, simplified pipeline parallelism | BF16 symbolic shape optimization in TorchInductor
New Higher-level Python Custom Operator API | Intel GPU is available through source build | Performance optimizations for GenAI projects utilizing CPU devices
Switching TCPStore's default server backend to libuv | |
*To see a full list of public feature submissions click here.
Tracked Regressions
Subproc exception with torch.compile and onnxruntime-training
There is a reported issue (#131070) when using torch.compile if the onnxruntime-training library is installed. The issue will be fixed (#131194) in v2.4.1. It can be worked around locally by setting the environment variable TORCHINDUCTOR_WORKER_START=fork before executing the script.
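For example, assuming the entry point is a file named script.py (the file name is illustrative), the variable can be set for a single run from the command line:
TORCHINDUCTOR_WORKER_START=fork python script.py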
cu118 wheels will not work with pre-cuda12 drivers
It was also reported (#130684) that the new version of Triton uses CUDA features that are not compatible with pre-CUDA 12 drivers.
In this case, the workaround is to set TRITON_PTXAS_PATH manually as follows (adapt the path to the local installation):
TRITON_PTXAS_PATH=/usr/local/lib/python3.10/site-packages/torch/bin/ptxas python script.py
Backward Incompatible Changes
Python frontend
Default ThreadPool size to number of physical cores (#125963)
Changed the default number of threads used for intra-op parallelism from the number of logical cores to the number of physical cores. This should reduce core oversubscription when running CPU workloads and improve performance. The previous behavior can be recovered by using torch.set_num_threads to set the number of threads to the desired value.
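A minimal sketch of restoring the previous behavior by explicitly setting the intra-op thread count back to the logical core count (the right value is workload-dependent):
import os
import torch

# PyTorch 2.4 defaults to the physical core count; os.cpu_count() returns
# the logical core count, which matches the pre-2.4 default.
torch.set_num_threads(os.cpu_count())
print(torch.get_num_threads())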
Fix torch.quasirandom.SobolEngine.draw default dtype handling (#126781)
The default dtype value has been changed from torch.float32 to the current default dtype as given by torch.get_default_dtype() to be consistent with other APIs.
Forbid subclassing torch._C._TensorBase directly (#125558)
This internal class could previously be subclassed to create an object that is almost a Tensor in Python, and it was advertised as such in some tutorials. This is no longer allowed, to improve consistency; all users should subclass torch.Tensor directly.
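A minimal sketch of the supported pattern, subclassing the public torch.Tensor class (the subclass name is illustrative):
import torch

class MyTensor(torch.Tensor):  # subclass torch.Tensor, not torch._C._TensorBase
    pass

t = torch.randn(2, 2).as_subclass(MyTensor)
print(type(t))  # <class '__main__.MyTensor'>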
Composability
Non-compositional usages of as_strided + mutation under torch.compile will raise an error (#122502)
The torch.compile flow involves functionalizing any mutations inside the region being compiled. torch.as_strided is an existing view op that can be used non-compositionally: when you call x.as_strided(...), as_strided only considers the underlying storage size of x and ignores its current size/stride/storage_offset when creating a new view. This makes it difficult to safely functionalize mutations on views of as_strided that are created non-compositionally, so we ban them rather than risk silent correctness issues under torch.compile.
An example of a non-compositional usage of as_strided followed by mutation that we will error on is below. You can avoid this issue by re-writing your usage of as_strided so that it is compositional (for example: either use a different set of view ops instead of as_strided, or call as_strided directly on the base tensor instead of an existing view of it).
@torch.compile
def foo(a):
    e = a.diagonal()
    # as_strided is being called on an existing view (e),
    # making it non-compositional. Mutations to f under torch.compile
    # are not allowed, as we cannot easily functionalize them safely.
    f = e.as_strided((2,), (1,), 0)
    f.add_(1.0)
    return a
We now verify schemas of custom ops at registration time (#124520)
Previously, you could register a custom op through the operator registration APIs and give it a schema that contained types unknown to the PyTorch Dispatcher. This behavior came from TorchScript, where "unknown" types were implicitly treated by the TorchScript interpreter as type variables. However, calling such a custom op through regular PyTorch would result in an error later. As of 2.4, we raise an error at registration time, when you first register the custom operator. You can get the old behavior by constructing the schema with allow_typevars=true.
TORCH_LIBRARY(my_ns, m) {
  // this now raises an error at registration time: bar/baz are unknown types
  m.def("my_ns::foo(bar t) -> baz");
  // you can get back the old behavior with the below flag
  m.def(torch::schema("my_ns::foo(bar t) -> baz", /*allow_typevars*/ true));
}
Autograd frontend
Delete torch.autograd.function.traceable APIs (#122817)
The torch.autograd.function.traceable(...) API, which sets the is_traceable class attribute on a torch.autograd.Function class, was deprecated in 2.3 and is now deleted. This API does not do anything and was only meant for internal purposes. The following raised a warning in 2.3 and now errors because the API has been deleted:
@torch.autograd.function.traceable
class Func(torch.autograd.Function):
...
Release engineering
- Remove caffe2 db and distributed from build system (#125092)
Optim
- Remove SparseAdam's unusual allowance of raw Tensor input (#127081).
Distributed
DeviceMesh
Update get_group and add get_all_groups (#128097)
In 2.3 and before, users could do:
mesh_2d = init_device_mesh(
    "cuda", (2, 2), mesh_dim_names=("dp", "tp")
)
mesh_2d.get_group()  # This will return all sub-pgs within the mesh
assert mesh_2d.get_group()[0] == mesh_2d.get_group(0)
assert mesh_2d.get_group()[1] == mesh_2d.get_group(1)
But from 2.4 forward, calling get_group without passing in the dim will raise a RuntimeError. Instead, users should use get_all_groups:
mesh_2d = init_device_mesh(
    "cuda", (2, 2), mesh_dim_names=("dp", "tp")
)
mesh_2d.get_group()  # This will throw a RuntimeError
assert mesh_2d.get_all_groups()[0] == mesh_2d.get_group(0)
assert mesh_2d.get_all_groups()[1] == mesh_2d.get_group(1)
Pipelining
Retire torch.distributed.pipeline (#127354)
In 2.3 and before, users could do:
import torch.distributed.pipeline # warning saying that this will be removed and users need to migrate to torch.distributed.pipelining
But from 2.4 forward, the code above will raise a ModuleNotFoundError. Instead, users should use torch.distributed.pipelining:
import torch.distributed.pipeline # -> ModuleNotFoundError
import torch.distributed.pipelining
jit
- Fix serialization/deepcopy behavior for tensors that are aliasing but not equal (#126126)
Fx
Complete revamp of float/promotion sympy handling (#126905)
ONNX
- Remove caffe2 contrib and experiments (#125038)
Deprecations
Python frontend
- User warning when using torch.load with the default weights_only=False value (#129239, #129396, #129509). A warning is now raised if the weights_only value is not specified during a call to torch.load, encouraging users to adopt the safest practice when loading weights (see the sketch after this list).
- Deprecate device-specific autocast API (#126062). All the autocast APIs are unified under torch.amp, which can be used as a drop-in replacement for the torch.{device}.amp APIs (passing a device argument where applicable).
- Export torch.newaxis=None for Python Array API/Numpy consistency (#125026)
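A minimal sketch of the recommended loading pattern, passing weights_only=True explicitly so the new warning is not raised (the file name is illustrative):
import torch

torch.save(torch.nn.Linear(4, 4).state_dict(), "model.pt")

# Passing weights_only explicitly silences the warning and opts into the
# restricted unpickler, which only reconstructs tensors and other allowlisted types.
state = torch.load("model.pt", weights_only=True)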
Composability
- Deprecate calling FakeTensor.data_ptr in eager mode. FakeTensors are tensors without a valid data pointer, so in general their data pointer is not safe to access. This makes it easier for torch.compile to provide a nice error message when tracing custom ops that are not written in a PT2-friendly way into a graph (because, for example, they try to directly access a tensor's data pointer from a region of code being traced). More details on integrating custom ops with torch.compile can be found here (#123292)
- Dynamic shapes:
- SymInt-ify mem-efficient attention forward op signature (#125418)
- Don't call item() into torch.scalar_tensor uselessly (#125373)
- Fix scalar type for constraint_range to Long (#121752)
- Guard oblivious on meta registrations (#122216), vector_norm (#126772), and unbind (#124959)
- Make expected stride test in torch._prims_common size oblivious (#122370)
- Use torch._check for safety assert in _reshape_view_helper (#125187)
- Add a code comment about torch._check_is_size in tensor_split (#125292)
- Make min(stride, strides[idx]) in collapse_view_helper size oblivious (#125301)
- Don't short circuit if shape is same (#125188)
CPP
- Refactor autocast C++ APIs to be device-agnostic (#124359)
Release Engineering
- Removal of the QNNPACK third-party module (#126941)
Optim
- Deprecate LRScheduler.print_lr (#126105)
nn
- torch.nn.Hardtanh allowed min_val to be greater than max_val (#121627)
Distributed
- Distributed Checkpointing (DCP)
Deprecated submodules feature for distributed_state_dict (#127793)
In 2.3 and before, users could do:
model = AnyModel(device=torch.device("cuda"))
model_state_dict = get_model_state_dict(model)
set_model_state_dict(
    model,
    model_state_dict=new_model_state_dict,
    options=StateDictOptions(strict=False),
)
# Below way of calling the API is also legit
model_state_dict2 = get_model_state_dict(model, submodules={model.submodule})
set_model_state_dict(
    model,
    model_state_dict={model.submodule: new_submodel_state_dict},
    options=StateDictOptions(strict=False),
)
But from 2.4 forward, if users call get_model_state_dict or set_model_state_dict with a submodule path or state_dict, users will see a warning about the feature. To achieve the same functionality, users can manually filter the state_dict returned from the get_state_dict API and preprocess the model_state_dict before calling the set_state_dict API:
model = AnyModel(device=torch.device("cuda"))
model_state_dict = get_model_state_dict(model)
set_model_state_dict(
    model,
    model_state_dict=new_model_state_dict,
    options=StateDictOptions(strict=False),
)
# Deprecation warnings are thrown for the below way of calling the API
model_state_dict2 = get_model_state_dict(model, submodules={model.submodule})
set_model_state_dict(
    model,
    model_state_dict={model.submodule: new_submodel_state_dict},
    options=StateDictOptions(strict=False),
)
- FullyShardedDataParallel (FSDP)
Deprecate FSDP.state_dict_type and redirect users to distributed_state_dict (#127794)
In 2.3 and before, users could do:
model = AnyModel(device=torch.device("cuda"))
fsdp_model = FSDP(model)
# Users can do both ways below
get_model_state_dict(model)
with FSDP.state_dict_type(fsdp_model, StateDictType.FULL_STATE_DICT):
    fsdp_model.state_dict()
But from 2.4 forward, if users call state_dict or set state_dict with FSDP.state_dict_type, users will see warnings. The recommended solution is now to use get_model_state_dict and set_model_state_dict directly:
model = AnyModel(device=torch.device("cuda"))
fsdp_model = FSDP(model)
get_model_state_dict(model)
# Deprecation warnings are thrown for the below way of calling the API
with FSDP.state_dict_type(fsdp_model, StateDictType.FULL_STATE_DICT):
    fsdp_model.state_dict()
Profiler
- Remove FlameGraph usage steps from export_stacks docstring (#123102). The export_stacks API will continue to work as before; however, we've removed the docstring steps that use FlameGraph. PyTorch doesn't own FlameGraph and cannot guarantee that it functions properly.
Quantization
- Remove deprecated torch._aminmax operator (#125995). Use torch.aminmax instead.
Export
- Start deprecation of capture_pre_autograd_graph (#125848, #126403)
XPU
- Refactor autocast C++ APIs to be device-agnostic (#124359)
  - at::autocast::get_autocast_gpu_dtype() -> at::autocast::get_autocast_dtype(at::kCUDA)
  - at::autocast::get_autocast_cpu_dtype() -> at::autocast::get_autocast_dtype(at::kCPU)
- Refactor autocast Python APIs (#124479)
  - torch.get_autocast_gpu_dtype() -> torch.get_autocast_dtype("cuda")
  - torch.set_autocast_gpu_dtype(dtype) -> torch.set_autocast_dtype("cuda", dtype)
  - torch.is_autocast_enabled() -> torch.is_autocast_enabled("cuda")
  - torch.set_autocast_enabled(enabled) -> torch.set_autocast_enabled("cuda", enabled)
  - torch.get_autocast_cpu_dtype() -> torch.get_autocast_dtype("cpu")
- Make torch.amp.autocast more generic (#125103) (see the sketch after this list)
  - torch.cuda.amp.autocast(args...) -> torch.amp.autocast("cuda", args...)
  - torch.cpu.amp.autocast(args...) -> torch.amp.autocast("cpu", args...)
- Deprecate device-specific GradScaler autocast API (#126527)
  - torch.cuda.amp.GradScaler(args...) -> torch.amp.GradScaler("cuda", args...)
  - torch.cpu.amp.GradScaler(args...) -> torch.amp.GradScaler("cpu", args...)
- Generalize custom_fwd & custom_bwd to be device-agnostic (#126531)
  - torch.cuda.amp.custom_fwd(args...) -> torch.amp.custom_fwd(args..., device_type='cuda')
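A minimal sketch of migrating to the unified torch.amp entry points, assuming a CUDA device may or may not be available (the model and shapes are illustrative):
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(8, 8).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# torch.amp.GradScaler("cuda", ...) replaces torch.cuda.amp.GradScaler(...);
# scaling is only useful on CUDA, so it is disabled elsewhere.
scaler = torch.amp.GradScaler(device, enabled=(device == "cuda"))

x = torch.randn(4, 8, device=device)
# torch.amp.autocast("cuda", ...) replaces torch.cuda.amp.autocast(...).
with torch.amp.autocast(device, dtype=torch.float16 if device == "cuda" else torch.bfloat16):
    loss = model(x).sum()
scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()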
ONNX
- Remove more caffe2 files (#126628)
New Features
Python frontend
- Add
- support for unsigned int sizes for torch.unique (#123643)
- torch.OutOfMemoryError to signify out of memory error from any device (#121702)
- new device-agnostic API for autocast in torch.amp.* (#124938)
- new device-agnostic API for Stream/Event in torch.{Stream,Event} (#125757)
- channels last support to max, average and adaptive pooling functions (#116305)
- torch.serialization.add_safe_globals that allows users to allowlist classes for weights_only load (#124331, #124330, #127808) (see the sketch after this list)
- pickling support for torch.Generator (#126271)
- torch.utils.module_tracker to track position within torch.nn.Module hierarchy (#125352)
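A minimal sketch of allowlisting a user-defined class so a checkpoint containing it can still be loaded with weights_only=True (the class and file name are illustrative, and assume the object round-trips through the restricted unpickler):
import torch

class TrainingMeta:  # hypothetical extra object stored alongside weights
    def __init__(self, epoch=0):
        self.epoch = epoch

torch.save({"meta": TrainingMeta(epoch=3)}, "ckpt.pt")

# Without this call, weights_only=True would refuse to build TrainingMeta.
torch.serialization.add_safe_globals([TrainingMeta])
ckpt = torch.load("ckpt.pt", weights_only=True)
print(ckpt["meta"].epoch)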
Composability
- Add
- OpOverload.redispatch; use it in new custom ops API (#124089)
- mutated_args field to custom_op (#123129)
- new Python Custom Operators API
- register_autograd to register backward formulas for custom ops (#123110)
- torch.library.opcheck (#124496), torch.library.register_autograd (#124071), torch.library.register_kernel (#124299)
- Blanket ban kwarg-only Tensors (#124805)
- Change register_autograd to reflect ordering of setup_context and backward (#124403)
- Ensure torch.library doctests runs under xdoctest (#123282)
- Fix torch.library.register_fake's module reporting (#125037)
- New Custom Ops Documentation landing page (#127400)
- Refresh OpOverloadPacket if a new OpOverload gets added (#126863, #128000)
- Rename
- impl_abstract to register_fake, part 1/2 (#123937)
- register_impl to register_kernel (#124200)
- Schema inference now includes default values (#123453)
- Stop requiring a pystub for register_fake by default (#124064)
- Support TensorList inputs/outputs (#123615)
- Update the functionalization error message (#123261)
- add ability to provide manual schema (#124180)
- fix schema inference for kwarg-only args (#124637)
- mutated_args -> mutates_args (#123437)
- register_autograd supports non-tensor kwargonly-args (#124806)
- set some tags when constructing the op (#124414)
- setup_context fills in default values (#124852)
- torch.library.register_fake accepts more types (#124066)
- use new python custom ops API on prims ops (#124665)
Optim
- Enable torch.compile support for LRScheduler with Tensor LRs (#123751, #123752, #123753, #127190)
nn frontend
- Add RMSNorm module (#121364)
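A minimal sketch of the new module (the feature size is illustrative):
import torch
from torch import nn

rms_norm = nn.RMSNorm(64)  # normalizes over the trailing dimension of size 64
x = torch.randn(8, 64)
print(rms_norm(x).shape)  # torch.Size([8, 64])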
linalg
- Implement svd_lowrank and pca_lowrank for complex numbers (#125580)
- Extend preferred_backend on the ROCm backend. Add a cuBLASLt gemm implementation (#122106)
Distributed
c10d
- Implemented IntraNodeComm primitives for allgather_matmul (#118038)
- Add first differentiable collective all_to_all_single_grad (#123599)
- Add P2P versions of send/recv_object_list operations (#124379)
- Add a new Collectives API for doing distributed collectives operations in the Elastic store with more performant and debuggable primitives (#126695)
FullyShardedDataParallel v2 (FSDP2)
- FSDP2 is a new fully sharded data parallel implementation that uses DTensor-based dim-0 per-parameter sharding for improved flexibility (e.g. mixed-dtype all-gather, no constraints on requires_grad) without significant cost to performance. See the document for more details and a comparison with FSDP1 (#122888, #122907, #123142, #123362, #123491, #123857, #119302, #122908, #123953, #120952, #123988, #124293, #124318, #124319, #120256, #124513, #124955, #125191, #125269, #125394, #126070, #126267, #126305, #126166, #127585, #127776, #127832, #128138, #128117, #128242)
Pipelining
- PyTorch Distributed pipeline parallelism APIs were upstreamed from the PiPPy project and are available as a prototype release in PyTorch 2.4. The package is under torch.distributed.pipelining and consists of two parts: a splitting frontend and a distributed runtime. The splitting frontend takes your model code as-is, splits it up into “model partitions”, and captures the data-flow relationship. The distributed runtime executes the pipeline stages on different devices in parallel, handling things like micro-batch splitting, scheduling, communication, and gradient propagation. For more information please check out the documentation and tutorial (#126322, #124776, #125273, #125729, #125975, #126123, #126419, #126539, #126582, #126732, #126653, #127418, #127084, #127673, #127332, #127946, #128157, #128163, #127796, #128201, #128228, #128240, #128236, #128273, #128279, #128276, #128278, #127066)
Profiler
- Add profiler support for PrivateUse1 (#124818)
Dynamo
- torch.compile is compatible with Python 3.12.
- Guarding on nn module attributes (#125202): TorchDynamo now guards on nn module attributes. This was a frequently raised issue in the past (examples: #111785, #120248, #120958, #117758, #124357, #124717, #124817). This increases TorchDynamo soundness with minimal perf impact.
- Hardened the recently introduced tracing rules infrastructure. This allows torch.compile users to easily control TorchDynamo tracing of PyTorch internal code.
- Extended torch.compile support for the RAdam and Adamax optimizers. Compiled optimizers now demonstrate SOTA performance.
- Experimental feature: we introduced a new experimental flag torch._dynamo.config.inline_inbuilt_nn_modules to enable torch.compile to reuse compiled artifacts on repeated blocks in models. This gives another point in the tradeoff space of compilation time and performance speedup. By moving torch.compile from the full model to a repeated block (e.g. moving torch.compile from a full LLM to a repeated Transformer block), we can now achieve faster compilation time with some performance dip compared to compiling the full model. We plan to make this flag default to True in the 2.5 release (see the sketch below).
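A minimal sketch of the experimental flag, compiling a single repeated block rather than the whole model (the block and shapes are illustrative):
import torch
import torch._dynamo

# Experimental: allow Dynamo to inline built-in nn.Module calls so compiled
# artifacts can be reused across repeated, identically structured blocks.
torch._dynamo.config.inline_inbuilt_nn_modules = True

block = torch.nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
compiled_block = torch.compile(block)  # compile the repeated block, not the full model

x = torch.randn(2, 16, 64)
for _ in range(4):  # stands in for a stack of identical layers
    x = compiled_block(x)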
Export
- Introduce ShapesCollection, a dynamic shapes builder API (#124898)
Inductor
- Add higher order associative scan operator (#119430)
jit
- Add aten::sort.any op for sorting lists of arbitrary elements (#123982)
MPS
- Conform torch.mps to device module interface (#124676)
XPU
- Inductor Intel GPU backend (#121895)
- a new autocast API torch.amp.is_autocast_available(#124938)
- attributes to xpu device prop (#121898)
- XPU implementation for PyTorch ATen operators (#120891)
- generic stream/event on XPU backend (#125751)
- gpu trace on XPU (#121795)
- Switch to torch.float16 on XPU AMP mode (#127741)
ONNX
- quantized layer norm op to opset 17 (#127640)
- symbolic_opset19.py and symbolic_opset20.py to support opset 19/20, extend opset 18 support (#118828)
- Support for Some Bitwise Ops in Onnx Exporter (#126229)
- Allow ONNX models without parameters (#121904)
- Integrate onnxscript optimizer (#123379)
Vulkan
- quantized transposed 2D convolutions (#120151, #122547)
- the quantized ReLU operator (#123004)
Improvements
Python frontend
- bfloat16 support for torch.binary_cross_entropy on CPU (#123823)
- MAP_SHARED option for torch.load when mmap=True (#124889)
- default value when printing function signature (#127059)
- all variants of upsampling functions to be done in high precision in autocast (#121324)
Composability
- FakeTensors, meta tensors and python decompositions are used to perform shape propagation when tracing out a graph in torch.compile. There were many coverage improvements this release:
- New metas / fake tensor rules:
- aten._embedding_bag_dense_backward, aten._embedding_bag_per_sample_weights_backward (#125785), aten.randint.out, aten.rand.out (#122375), aten.unique2 (#124306), aten.histc (#124548), aten.channel_shuffle (#123033), aten._masked_scale (#127389), aten.addcdiv.ScalarList, aten.addcmul.ScalarList (#123486)
- New decomps:
- aten.resize_as (#122317), several out= variants of ops with existing decomps (#122979, #115437)
Autograd frontend
- nn.functional.batch_norm: add forward AD rule for the miopen backend (#125069)
- nn.functional.scaled_dot_product_attention: add backward rule for the cuDNN backend (#122510)
Release Engineering
- Add CI support for aarch64 linux. The CI is triggered when the ciflow/linux-aarch64 label is added. (#120931, #121284, #125255, #121136, #124781, #125599)
- Add experimental CUDA pip wheels for ARM architectures supporting the NVIDIA Hopper architecture as nightly binaries and a prototype for the PyTorch 2.4.0 release. (#126174, #127514)
- Add support for CUDA 12.4 in CI/CD (#121684, #121956, #127825, #125944, #128250)
- Add support for numpy 2.0.0rc1 in CI and CD (#123286, #122157)
- Enable support for torch.compile and triton with Python 3.12 in CI/CD (#127547, #123307, #126218)
- Intel GPU enablement in CI (#122254, #123920, #125655)
- Migrated CI/CD jobs to macOS 14 (#127582, #127853, #125801)
- ROCM: upgrade CI/CD to 6.1 (#124811, #118216, #124300, #125646)
- Update CUDNN version to 9.1.0.70 for CUDA 11.8, 12.1, 12.4 builds (#123475)
- Update NCCL submodule to v2.20.5 (#121635)
- Update submodule oneDNN to v3.4.2 (#126137)
- Wrapped deprecated function/class with typing_extensions.deprecated (#127689)
nn frontend
- Add swap_tensors path to nn parametrizations (#124130)
- Relax use_count constraints for swap_tensors when an AccumulateGrad node holds a reference (#127313)
- Increase numel limit to 2^63 for replicatepad1d (#122199)
- Use int64_t indexing for Upsample2d backwards (#123682)
- Remove warning from LazyModuleMixin constructor (#123968)
Optim
- Radam and Nadam support the flag for "maximize" (#126765, #127214)
- Include scheduler_on_plateau in optim.h (#121722)
Foreach
- Allow foreach ops to run for any backend, not just CPU (#127412)
cuda
- Update CUDA out of memory message with private pool info (#124673)
- Add autocast rule for torch.vdot (#125697)
- Fix type hint for cuda.get_device_name() and cuda.get_device_capability() (#126743)
Quantization
- X86 Inductor backend
  - Enable linear and linear-unary post-op gelu quant recipe for X86InductorQuantizer (#114853)
  - Add quantization recipe filter per operator type for X86InductorQuantizer (#122775)
  - Add matmul recipe into X86InductorQuantizer (#122776)
  - Improve performance of qconv by reducing integration overhead (#123240)
- PT2E quantization flow
- Add support for conv transpose + bn + {relu} weights fusion in PTQ and QAT (#122046, #123652)
- Simplify fake_quant_per_channel (#123186)
- Support fp8 quantization (#123161)
- Propagate get_attr meta through known ops only (#124415)
- Fix issue of lowering nn.linear ops with kwargs (#126331)
Distributed
c10d
- TORCH_NCCL_HIGH_PRIORITY option for ProcessGroupNCCL (#122830)
- __repr__ to the P2POp class (#126538)
- commCreateFromRanks to c10d (#127421, #127982)
- dist.get_node_local_rank helper (#123992)
- an option to enable the TCPStore libuv backend for c10d rendezvous (#124684)
- Captured dtype in Flight Recorder (#126581)
- Enable ncclCommDevIdxMap unconditionally (#122049)
- Extended the flight recorder dump from timeout to any exception (#123023)
- Make TCPStore server use libuv by default (#127957)
- Make get_node_local_rank() accept fallback_rank (#126737)
- Make abort communicators in destroy_process_group call on default and code cleanup (#124334)
- Mapped float8 types to uint8 for allgather (#126556)
- Optionally avoided rethrowing CUDA Errors in NCCL Watchdog (#126587)
- Wrapped TCPStore check in a try/catch (#127030)
- ProcessGroupWrapper support custom backend (#124447)
- ncclComm is not aborted before checking exception (#124466)
DeviceMesh
- Add a private init backend option (#124780)
- Initialized mesh tensor with CPU context (#124767)
- Add DeviceMesh.from_group() (#124787)
- Make _validate_tp_mesh_dim support 3D (#125763)
- Supported N groups in from_group (#126258)
- Make sure device mesh can be imported from torch.distributed (#126119)
Distributed quantization
- Used BFloat16 in distributed quantization when supported by NCCL (#125113)
DistributedDataParallel (DDP)
- Add a mode to avoid clone() in DDPSink (#122927)
Distributed Checkpointing (DCP)
- Add type_check param to copy state dict utils (#127417)
- Add strict option to DefaultPlanner (#123869)
- Always created requests for non-tensor objects (#125334)
- Always flattened mapping even if no tensors present (#125335)
- Correctly handle _extra_state (#125336)
- Implement broadcast_from_rank0 option for model/optim state_dict (#125338, #125339)
- Introduced async staging extension points (#122965)
- Make distributed state_dict support the case where torch.distributed is not initialized (#127385)
- Make param name consistent with overridden function (#124770)
- Remove the support of Dict[nn.Module, Dict[str, Any]] state_dict (#127070)
- Supported flattening the optimizer state_dict when saving and unflattening when loading (#127071)
- Unified the API signatures of set_model_state_dict and set_optimizer_state_dict (#127384)
DTensor
- backward support for scaled_dot_product_attention (flash-attention) (#122541)
- more foreach ops (#123214)
- op support for view_as_complex and view_as_real (#122569)
- op support for memory efficient attention (#122996)
- support for fused_adam and fused_adamw when lr is a tensor (#126750)
- ASGD foreach optimizer with associated unit tests (#121942)
- the handle of DTensor.device_mesh.device_type in dynamo (#118803)
- the support of placement kwargs for DTensor.to_local() in dynamo (#119947)
- scatter op with simple replication (#126713)
- distributed topk operator (#126711)
- Make Partial placement public (#127338, #127420)
- ensure expected input spec have correct tensor meta (#122949)
- ensure meta tensor random op does not alternate rng state (#125693)
- Move early return check into redistribute autograd function (#121653)
- Move some modules to private namespace (#127339)
- Standardized multi mesh-dim strategy with utils (#126712)
- 2D clip_grad_norm_ (#121945)
- simple replicate strategy for SVD (#127004)
- Turned on foreach implementation for (1) clip_grad_norm_ for DTensor by default (#126423), (2) optimizer for DTensor by default (#123394)
FullyShardedDataParallel (FSDP)
- device in pin_memory argument (#119878)
- private _unshard API (#124304)
- privateuse1 in FSDP's sharded grad scaler (#126971)
- Avoided CPU sync in clip_grad_norm_ (#122001)
- Marked pre_backward_hook unserializable (#125464)
- Skipped FSDP hooks based on dynamo config (#123021)
- Used generic device handle instead of cuda (#121620)
ShardedTensor
- Supported non-contiguous rank validation in sharded tensor (#123230)
TorchElastic
- debug info logging interface for expired timers (#123883)
- health check server hook in torch elastic (#122750, #123504)
- option for sharing TCPStore created by rendezvous handlers (#125743)
- support for binding to TCP in WorkerServer (#127986)
- Applied "distributed debug handlers" (#127805)
- Cleared timer for already terminated process (#122324)
- Skipped expired timer logging for empty expired timers (#125039)
Tensor Parallel
- wildcard support for Tensor Parallel parallelize_plan (#122968)
- kwargs support for prepare_module_input (#124114)
Profiler
Profiler torch.profiler:
- metrics for performance timing and other statistics collection (#123412)
- Kineto traces will export ns granularity for finer timestamps (#122425, #123650)
- Unified the device (CUDA, XPU, PrivateUse1) in profiler’s post processing (#123247)
- Improve profiler post processing by iterating frontend function events rather than all function events (#124596)
- Report strides in json traces (#125851)
- Register COLLECTIVE_COMM profiler activity type when available (#121461)
- Support third-party devices emit a range for each autograd operator (#125822)
Memory Snapshot torch.cuda.memory._dump_snapshot:
- Improve the description of blocks with missing frames in the Memory Visualizer (#124784)
- Add recordAnnotations to capture record_function annotations (#124179)
Profiler record_function:
- For with_effects, skip over profiler.record_function_exit (#121829)
- support for RecordFunctionFast to take inputs (#123208)
- support for kwargs in RecordFunctionFast (#123600)
- Collecting autograd sequence numbers on PythonTLSSnapshot dispatch keys for Nested Tensor (#123304)
Export
- a printer to the unflattened module (#124315)
- disable_forced_specializations flag (#124949, #126925)
- export support for auto_functionalize (#121990, #122177, #122246)
- readable placeholder names to ExportedProgram nodes (#123587, #123590, #124765)
- set_grad_enabled higher order operator (#123391, #125066, #121736)
- stack_trace for non-strict export (#121034)
- torch_fn, a more consistent metadata across strict and non-strict export (#122693)
- torchbind tracing support (#122619, #123370, #122622, #125490)
- Allow static constraints in dynamic_shapes (#121860)
- Ignore logging.Logger.* calls during dynamo export (#123402)
- Make metadata serialization more strict (#124411)
- Populate ShapeEnv's var_to_val during deserialization (#121759)
- Prototype TorchScript 2 ExportedProgram Converter (#126920, #127466)
- Provide refine function for automatically accepting dynamic shapes suggested fixes (#127436)
- Save/load example inputs in the ExportedProgram (#122618)
- Suggest constant dim values in dynamic shapes fixes (#125458)
- Support map in pre-dispatch functionalization (#121444)
- We introduced the concept of “effect tokens”, which is how we allow side-effectful operators in torch.compile/export (#121552, #122357)
Fx
- shape inference tool (#120097)
- device_ordinal to Subgraph in splitter_base (#125616)
- exclusion function to minimizer base (#124504)
- missing forbidden mutation methods in immutable collections (#125468)
- option to turn on return_tuple in _SplitterBase (#123868)
- prefix option to CapabilityBasedPartitioner (#126382)
- Create block traverse mode in minimizer for graph aware debugging (#125613)
- Implement Graph Transform Observer (#127427)
- Option to include stride and device annotation in gm.print_readable() (#123690)
- Register create_node_hook (#126671)
Dynamo
- We performed a careful audit and fixed all known memory leaks in TorchDynamo.
- We hardened torch.compile + __torch_function__ support by onboarding Scaled Dot Product Attention (SDPA) and TensorDict.
Inductor
- 0 initialization to Triton masked loads (#127311)
- HalideCodeCache (#126416)
- clone if output is a view from constant (#123200)
- config to allow buffer mutation (#126584)
- decompose_mem_bound_mm to the customization pre and post grad passes (#123376)
- inductor support (#123709)
- kernel_code logging artifact (#126631)
- lowering for avg_pool{1, 3}d (#116085), cummax, cummin (#120429)
- missing files to torch_key (#128230)
- mode to MemoryDep to track atomic accumulates (#123223)
- pybind for tensor_converter util functions (#121744)
- qlinear_pointwise.binary op for X86Inductor backend (#123144)
- support for multiple flexattention calls in a single compile (#125516)
- tensor_constantX to pass constant buffer update's check (#122562, #122690)
- the quant lift up pass in convert phase (#122777)
- a decomposition for select_scatter (#124426)
- Allow multiple cudagraph recordings per compiled graph (#126822)
- Automatic detection for buffer mutation and binary linking (#126706)
- Change
- OverridesData to take callables instead of strings (#123397)
- aot_compile callsites (#122225)
- Clean up for removing 2 decompose patterns (#123422)
- Codegen runtime asserts in Inductor (#124874)
- Customize pre grad and post grad patterns (#121915)
- Disallow fusions of foreach and reductions (#127048)
- Enable
- lowering of qlinear-binary(-unary) fusion for X86Inductor (#122593)
- mmaped weights when CUDA is used (#124346)
- meta internal AOTInductor compilation on ROCM (#124123)
- Enhance RecordFunctionFast input args and use input args in triton_heuristics.py (#123459)
- Filter non input symexprs from codecache guards (#128052)
- Get PT2 Cutlass backend working under fbcode (#125688)
- Hipifying aoti code_wrapper (#124241)
- Improve group batch fusion with same parent/users fusion enablement (#127648)
- Inductor respects strides for custom ops by default (#126986)
- Initial implementation of Inductor FX Graph Remote Cache (#124669)
- Make torch._inductor.dependencies.Dep a proper class (#124407)
- Move c10/util ostream function implementations to their headers (#123847)
- Move some cudagraphs checks into C++ (#122251)
- Pass triton kernel info to record function (#123871)
- Read the patterns from the config instead of hard-code passes (#125136)
- Remove
- API that allows for extra deferred runtime asserts during lowering (#124864)
- assertion for cat target_func (#125540)
- Serialize large weights (#123002)
- Specialize on unguarded alignment of example inputs (#123319)
- Split cat customization (#123045)
- Support
- CUDA_INC_PATH env variable when compiling extensions (#126808)
- custom op in JIT with cpp wrapper (#122554)
- pytrees as associative_scan input (#122137)
- use_runtime_constant_folding for CPU (#122563)
- Try to reuse old symbol name rather than new symbol name when renaming (#124782)
- Update the cpp_wrapper entry function signature (#121745)
- Use source code hash instead of torch version (#126092)
- Various improvements to error handling during autotuning (#126847)
- batch pointwise op + unbind stack pass in post grad (#126959)
- config target platform (#126306)
- disable comprehensive padding in fbcode (#124191)
- enable software pipelining on AMD devices (#125858)
- epilogue support for gemm template (#126019)
- make mask_rcnn inference work in max-autotune mode (#123008)
- pt2 dper passes: run shape prop before each pass (#122451)
- remove 2 decompose patterns (#123371)
- switch assume_aligned_inputs to False (#124336)
- unified the vectorized conversion with at::vec::convert for all data types (#119979)
jit
- Shape function fix for _batch_norm_with_update (#122430)
- Attach target function to OSError when source can't be found (#125248)
- Support getattr/hasattr on NamedTuple (#121863)
ONNX
- Allow fake models to run with ONNXProgram.call (#122230)
- Fix ONNX export with print (#123368)
- Improve torch.onnx.export runtime from O(n^2) to O(n) (#123025, #123027, #123063, #124909, #123028, #123028, #123029, #123026, #124912)
- Make ONNXProgram.model_proto and disk file the same (#122196)
- Skip optimizer when it fails (#127349)
- Update decomposition table to core ATen ops (#127353)
- beartype to emit warning instead of error by default (#123205)
MPS
- Add naive quantized int4_mm, int8_mm and .gputrace capture hooks (#125163)
- Better error-check for linear op (#124952)
- Enable
- index_select for complex types (#122590)
- torch.mm and other ops for complex dtypes (#127241)
- Implemented isin_Tensor_Tensor_out for MPS backend (#124896)
- Improve F.adaptive_avg_pool2d error messages on MPS backend (#124143)
- Native non-zero op implementation (#125355)
XPU
- Generalize host allocator to be device-agnostic(#123079)
- Make macro with AMP more generic(#124050)
- Refactor
- CUDA’s AMP autocast policy to be generic(#124051)
- gpu trace to be device-agnostic(#121794)
- Support generic Stream/Event on CUDA/HIP backend(#125757)
Bug fixes
Python frontend fixes
- DtoH sync in torch.index_put_ (#125952)
- torch.load map_location for wrapper subclass and device being serialized through numpy (#126728)
- memory leak in torch.dtype.to_complex() (#125154)
- nn.Parameter constructor type hint (#125106)
- parameter name in torch.can_cast to from_ (#126030)
- support of paths with space in torch.utils.cpp_extensions (#122974)
- Support numpy array in Tensor.eq (#122249)
Composability fixes
- FakeTensors, meta tensors and python decompositions are used to perform shape propagation when tracing out a graph in torch.compile. There were a number of bug fixes and improvements this release:
- FakeTensor fixes:
- Handle symbolic size access in FakeTensor (#124760)
- Avoid cuda init in FakeTensorMode (#124413)
- Do not run CUDA lazy init if it is triggered with fake mode on (#122636)
- Refactor faketensor ops that produce unbacked symints to memoize (#125623)
- Meta device fixes:
- fix meta tensor set_() incorrectly modifying nbytes of the storage (#123880)
- Fix aten._weight_int4pack_mm meta registration for float16 inputs (#124136)
- Fixes to python decompositions:
- aten.upsample_bicubic2d: support for uint8 (#120411)
- aten.upsample_nearest* ops: properly registered decomp to dispatch keys (#122782), (#122783)
- _refs.masked_fill: support privateuse1 device when value.device.type is cpu (#124835)
- _refs._reshape_view_helper: specialization shortcut for converting n-d to 1-d and 1-d to 2-d views (#127641)
- Fix decomp for torch.tensor(...) constructor with nested python lists(#125639)
- aten.rrelu_: fix decomp when default values are missing (#126978)
- AOTDispatcher is the component of the torch.compile stack that functionalizes and normalizes the graph and adds support for compiling the backward pass during training. There were several bug fixes and improvements to AOTDispatcher:
- Fix torch.compile used with triton kernels under inference_mode (#124489)
- Fix incorrect graph when functionalizing aten.expand followed by mutation (#122114)
- Properly keep input mutations in the graph when they are under torch.no_grad, even if there are outstanding aliases (#122433)
- Replay original views from the user code instead of falling back to as_strided in a few cases, which can improve performance of the backward pass in cases where torch.compile captures small graphs with outputs that alias graph inputs (#121007)
- For __torch_dispatch__-based tensor subclasses, support custom layout overrides under torch dispatch mode (#125379)
cuda fixes
- cuda array for empty arrays (#121458)
- a perf regression in kernel launcher for the foreach_* family of ops (#123566)
- CUDA out of memory error message formatting (#123984)
- CUblasLt compilation on windows (#125792)
Autograd frontend fixes
- torch.utils.checkpoint: Use pytrees to improve determination of what RNG state to stash (#121462)
- Fix error message of autograd (#123154)
Release Engineering fixes
- Fix mypy issues in fake_tensor.py (#124428)
- Fix running of: lintrunner --all-files --take FLAKE8 (#124771)
- Fix libc and libstdcxx installation on conda environments (#121556)
- Release engineering tooling and CI fixes. Workflows, Trymerge, Bot Labeler, Mergebot (#125042, #121762, #121920, #124965, #122155, #123301, #121733, #127567, #128080)
nn frontend fixes
- access to uninitialized memory in VSX vector functions for quantized values (#122399)
- swap_tensors path in nn.Module._apply for modules that inherit from RNNBase (RNN, GRU, LSTM) (#122800)
- ctc_loss zero/negative length corner cases (#123193)
- _LazyConvXdMixin.initialize_parameters and add related tests (#123756)
- load_state_dict with unexpected key whose prefix matches a valid key (#124385)
- requires_grad propagation in nn.utils.parametrize (#124888)
- nan with large bfloat16 values for FlashAttention backend of nn.functional.scaled_dot_product_attention
- issue in affine_grid_backward when grad_grid is non-contiguous (#124370)
- Add error checks for invalid inputs on thnn_conv2d (#121906) (#122135)
Optim fixes
- Wrong ASGD implementation (#125440, #126375)
- loading optimizer options from archive (#125215)
linalg fixes
- svd_lowrank(..., M) in the presence of broadcasting (#122681)
- linalg.vector_norm when used with autocast(cuda) (#125175)
CPP fixes
- Handle all types c10::isSigned (#125637)
- crash for AVX512 int4 matrix multiplication if weights are unaligned (#124128)
- loading custom C++ extension within DataParallel-ized model (#125404)
Distributed fixes
c10d
- coalescedCollective op Flight Recording (#120430)
- group_name/group_desc set up in eager initialization (#127053)
- bug in _update_process_group API (#128262)
- bug in update_process_group DDP API (#128092)
- excepthook crash on exit after destroy_process_group (#126739)
- various errors in TCPStoreLibUvBackend.cpp (#127230)
- work handle for coalescing manager (#122849)
- Add check for gloo availability when doing _ProcessGroupWrapper check (#124233)
- Initialize lastEnqueuedSeq_ and lastCompletedSeq_ in ProcessGroupNCCL (#121980)
- Ensured GIL is not released when calling to PyBytes (#128212)
- Guarded gpu context during abort (#127363)
- Make monitorThread sleep when we try to dump flight recorder (#123788)
- Only included NCCL-related header file with macro USE_C10D_NCCL (#127501)
- Prevented wait_tensor() calls on graph inputs from getting DCEd for AsyncCollectiveTensor (#125677)
DeviceMesh
- hash and eq not match (#123572)
- device type issue in _get_device_handle (#124390)
- Enable cache and reuse of sliced result to prevent funky behaviors and NCCL deadlock at large scale (#122975)
- Make dtype of mesh tensor from init_device_mesh() consistent with directly calling DeviceMesh() (#123677)
DistributedDataParallel (DDP)
- DDP no_sync when find_unused_parameters is True (#124193)
Distributed Checkpointing (DCP)
- to remove non_persistent buffer in distributed state dict (#125337)
- set_optimizer_state_dict() changes the parameters with some optimizers (#125708)
- various bugs for broadcast_from_rank0 (#127635)
- Remove the check of FSDP has root (#121544)
- Kept params in torch.distributed.checkpoint.state_dict.set_optimizer_state_dict (#127644)
FullyShardedDataParallel (FSDP)
- FSDP 2D state_dict to use run_check=False (#123802)
- HSDP: sharding placement (#123778), validation error msg (#123019)
- summon_full_params on submodule (#123290)
TorchElastic
- Make torch.multiprocessing.ProcessContext.join() wait for all child procs to exit before returning (#125969)
Profiler fixes
- an asynchronous trace bug where end timestamp overflows and events are years in the future (#124080)
- torch.profiler Schedule Function (Function Event only) to accumulate events (#125510)
- Add a sanity test to the unit testing (#124773)
- Add missing field device_resource_id in profiler events (#121480)
- Cleaned up deprecated use_cuda by default (#126180)
- Do not emit a warning when using CPU profiler only (#125654)
- Handle more cases of symbolic sizes/strides detection (#123696)
- Reduced warning msg in torch.profiler when using AMD (#124469)
- Release gil in prepareProfiler (#121949)
- Remove a redundant *1000 to timestamp since we already have ns precision (#124374)
- Split up profiler test file (#124856)
Dynamo fixes
- 'Could not infer dtype of SymBool' on torch.tensor call (#125656)
- 'get_attr' call in dynamo 'run_node' (#127696)
- 'get_real_value' on placeholder nodes (#127698)
- assume_constant_result for UnspecializedNNModuleVariable methods (#127695)
- guard_size_oblivious on non-symbolic expression (#123743)
- tvm backend interface (#126529)
- Add support for tensor's is_complex method (#124927)
- Allow asserts to fail (#126661)
- Forward OptimizedModule.setattr to the wrapped module (#122098)
- Initial exception handling support in dynamo (#126923)
- Keep track of ViewMeta with symbolic inputs (#125876)
- Support macOS and Linux/aarch64 platforms (#128124)
Export fixes
- GraphModuleDeserializer handling of signature (#122342)
- bug in get_update_constraint (#125194)
- conv decomp when decomposing to core-aten (#123283)
- mode not on stack error for while loop (#122323)
- runtime assertions to add call_function (#125878)
- to_copy to be inserted in the exported graph (#125628)
- unflattening with duplicate tensors (#125192)
- up nn_module_stack for nodes occurred around tracepoint ops (#124457)
- leaky fake tensor on attribute assignment, support buffer assignment (#122337)
- Allow Dim(1,2) for export dynamic shapes (v2 after revert) (#121910)
- Allow modules to be created in the forward (#125725)
- Correctly serialize empty list based on argument type (#123748)
- Forward fix failures for torch.export switch to predispatch (#126081)
- Handle param aliasing (#127471, #125509, #125758)
- Make error name private (#126715)
- More strictly respect scope when removing inputs in unflattener (#127607)
- Skip nn_module_stack verifier for non-fx.GraphModule modules (#122210)
Fx fixes
- fx graph triton import bug (#122041)
- graph partitioner and make runtime assertion work with submodules in export (#125793)
- infinite recursion in API BC test (#125706)
- mem size mismatch from split/chunk in const folding (#125199)
- triton import time cycles (#122059)
- Don't intersect when clamping for size oblivious (#123675)
- Don't use Proxy torch function in the sym size calls (#121981)
- FakeTensorProp assert consistency of sizes when metadata previously existed (#124059)
- Keep set_() input mutations in the AOTDispatcher graph, ban other cases (#122981)
- Make
- check_is_size clamp to sys.maxsize - 1, so sys.maxsize comparison returns False (#122372)
- torch._check understand Eq commutativity (#125629)
- Preserve
- node.meta when fusing subgraph (#125261)
- partitioner order (#122111)
- unbacked SymInt on SymNode (#120816)
- Remove
- duplicated nodes in dfs_iter_find_cycle (#125585)
- incorrect check (#123616)
- Skip index_put_ in dce (#122683)
Inductor fixes
- AFOC QPS Regression (#122944)
- C++ compilation error for tensor array in abi_compatible mode
- FakeTensorUpdater logic for updating fake tensors (#116168)
- a bool value codegen issue when calling custom ops (#127398)
- a bug when mutated buffer meets .to (#127671)
- a codegen issue when .item() is used for kernel arg (#126575)
- a dynamic shape problem when lowering diagonal (#121881)
- an internal test regression (#123481)
- another out-of-bounds access (#122580)
- cat backwards wrapping on symints (#121527)
- compilation_latency regression caused by #127060 (#127326)
- constant propagation pass (#114471)
- cuda compilation under fbcode remote execution (#126408)
- cummax and cummin lowering for empty case (#126461)
- cutlass path in inductor (#125463)
- edge case in JIT vs. AOT fusion after finalizing MultiTemplateBuffer (#126622)
- includes to system Python (#125285)
- issue with randint + symbolic shapes (#122428)
- issues in pre_grad passes (#123181)
- mask propagation in the presence of where (#125574)
- memory planning compile error (#123867)
- missing unbacked def for unbacked in input expr (#127770)
- nextafter in inductor CPP codegen (#126876)
- ops.scan for non-commutative operators (#126633)
- out-of-bounds read/write in cvt_int64_to_[fp32|int32] (#122511)
- scheduler typehints (#127769)
- test with inlining flag (#128200)
- to #126656 (#127050)
- triton codegen main do_bench_gpu import error (#126213)
- unbacked symbol in stride when using item() (#122298)
- unsupported type of output=s1 (#126797)
- ScatterFallback codegen (#124580)
- a constant tensor device move issue (#128265)
- an assertion for node debug str (#127021)
- grid z bug for large grid (#127448)
- invalid call to aoti_torch_tensor_copy_ (#126668)
- linear_add_bias path (#127597)
- loop ordering test (#127807)
- miss isa bool check (#128274)
- post_grad pattern (#127457)
- redis-related env vars in remote_cache.py (#127583)
- Add missing acosh op to vec256_float_neon.h (#122513)
- Back out
- "Added a check in register_lowering to avoid decomposed ops (#117632)" (#122709)
- "Precompile triton templates (#121998)" (#123305)
- Backport https://github.com/openai/triton/pull/3433 (#122470)
- Correctly calculate the numel with symint in DDP fusion (#124422)
- Disable stack allocation when there is a fallback op (#122367)
- Do not forward parent's value range to CSE variable for variables created within codegen (#123099)
- Do not propogate (#124769)
- Don't clamp slices generated from cat kernel (#124139)
- Enable B019 - flags memory leaks through LRU cache on method (#127686)
- FX graph cache: Fix bug handling constants (#121925)
- Fall back to eager mode when viewing with differing bitwidths (#120998, #121786)
- Implement masked_load for integral types (#122608)
- Improve unbacked SymInt input support in Inductor (#124739)
- Inductor: fix Conv output stride for dynamic shapes (#121400)
- Remove symbol exports in C shim for Windows (#125472)
- Revert "Inductor respects strides for custom ops by default (#126986)" (#127923)
- Use pexpr, not texpr in Triton launch codegen (#128038)
- turn off triton memcache for amd devices (#122560)
- typing scheduler.py [1/2]: Bug fix (#126610)
- use two pass reduction for deterministic reduction order (#115620)
- Forward fixes
- for D56289438 (#124882)
- for templates + views (#127446)
ONNX fixes
- Fix list dtype finding bug in dispatcher (#122327)
- Rename ort to maia in dynamo's ort backend (#124967)
- Cast checkpoint weights to match model parameter's dtype (#122100)
- Reduce excessive warning to info (#122442)
- Prevent dup initializers when ONNXProgram.save is called many times (#122435)
MPS fixes
- FFT descriptor fields to resolve precision issue (#125328)
- FFT implementation bug dropping negative frequency components (#123274)
- GELU, LeakyRELU and MISH on non-contiguous tensors (#123049)
- abs for complex types (#125662)
- copies larger than 4GB (#124635)
- crash with binary_cross_entropy is invoked for half dtypes (#124258)
- for MPS regression in scalar creation (#123234)
- for addcdiv contiguous problem (#124442)
- naive matmul for BFloat16 (#121731)
- nextafter for negative values (#125029)
- overflow in cumsum when dtype is bool (#125318)
- strided ELU op correctness issue (#125692) and mse_loss correctness issue (#125696)
- Fwd-fix for clamp regression (#122148)
- Remove in place views fixing various crashes (#124895)
XPU fixes
- record issue on XPUGuardImpl (#123523)
Performance
Python frontend
- Use sleef on macOS Apple silicon by default (#126509)
cuda
- Speed up torch.softmax kernel (#122970)
nn frontend
- Parallelize upsampling ops across the batch/channel dimension (#127082)
Optim
- Add fast fused kernels for Adam, AdamW, SGD, and Adagrad on CPU (#123074, #123629, #124905)
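A minimal sketch of opting into the fused CPU path via the existing fused flag (the model is illustrative):
import torch

model = torch.nn.Linear(128, 128)
# With CPU parameters, fused=True now dispatches to the new fused CPU kernel.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, fused=True)

loss = model(torch.randn(32, 128)).sum()
loss.backward()
opt.step()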
linalg
- Improvements:
- the CPU performance of linalg.vector_norm when reducing over a dimension of length 1 (#122143)
- performance of FP16 gemv on ARM (#126297, #126745, #126746, #126877, #127033) and BF16 gemm fallback on ARM (#126592)
- autotuning through TunableOp on ROCm (#124362)
Foreach
- Allow int vals to go down the fastpath for _foreach_max (#127303)
- _foreach_copy now supports different source/dest dtypes on the fast path (#127186)
Distributed
C10d
- Disable compute of collective duration by default (#122138)
DTensor
- Used str for reduce_op instead of c10d enum (#125172)
- Make early return for _split_tensor (#125810)
- Directly return local_tensor under no_grad (#128145)
Distributed Checkpointing (DCP)
- Improve the performance of distributed state_dict (#125501)
TorchElastic
- Changed the default monitor_interval for torchelastic to 0.1 sec (#124692)
- Add timing events to different stages of rendezvous (#125636)
jit
- Fix exponential memory usage when TorchScript types share the same name (#121874), (#121928)
Fx
- Add side table to FX Graph for O(1) op/target query (#121565)
- Apply guard knowledge to all simplifications (#123342)
- Do not calculate hint in advice_is_size (#124472)
- Enable FX graph and symbolic shape caching (#121697, #125258, #123724, #124610)
- Flatten/Unflatten micro optimization in proxy_tensor.py (#121993)
- Minor compile time optimization in has_free_symbols (#122144)
- Skip assert in check_is_size (#124209)
- Teach ShapeEnv that a <= b => a < b + 1 (#123436)
- Use sympy xreplace instead of subs (#124208)
- _find not update unchanged replacements (#124274)
- eval_static: guards, unbacked compute once (#124217)
Inductor
- Speedup convert<float>(Vectorized<half>::loadu(ptr, 8)) on ARM (#125889)
- Add more mm kernel choices (#125000)
- Add NEON ISA support on
- arm64 Macs (#122217)
- aarch64 (#123584)
MPS
- Improvements to perf of int4pack_mm (#125983, #127135, #125704)
- Making copy_cast, softmax and cat_out unranked (#123191)
XPU
- Intel GPU
- Convolution&Deconvolution aten operators(#117529)
- Matmul aten operators(addmm, badbmm, etc.)(#117202)
- Support xpu host allocator (#123080)
- oneDNN
- Conv primitive integration (#117512)
- Matmul primitive integration (#117112)
- library compilation for Intel GPU support (#117098)
Documentation
Python frontend
- Add doc for
- torch.distributions.utils.clamp_probs (#128136)
- the legacy constructor for Tensor (#122625)
- torch.Size.numel (#124186)
- torch.utils.benchmark.utils.compare.Compare (#125009)
- torch.utils.collect_env.get_env_info (#128021)
- Clarify Security Policy (#120531)
- Fixes doc
- example of torch.masked_scatter (#123664)
- for torch.load map_location (#125473)
- Improve doc for
- torch.set_default_dtype (#121730)
- torch.load weights_only argument (#127575)
- Update doc for
- functions in torch.multinomial (#125495)
- functions in torch.random (#125265)
- torch.dot (#125908)
Composability
- Add extended debugging options for troubleshooting
torch.compile
issues (#122028)
cuda
- Add doc for torch.cuda.nccl.version (#128022)
- Add documentation for nvtx.range (#121699)
Autograd frontend
- torch.autograd.Function: update docs for separate context and forward functions (#121955)
- torch.utils.checkpoint: Improve error message when use_reentrant=True is used with .grad() (#125155)
- Improve the clarity of the torch.Tensor.backward doc (#127201)
- Fix typing for torch.autograd.Function with ctx-less forward (#122167)
Release Engineering
- Fix torch and torch.compile links (#121823, #121824)
- Add
- fuzzer instructions to pt2 bug template (#123156)
- better instructions for pytorchbot merge command on cancel (#124947)
- instructions on how to run doc coverage locally (#123688)
nn frontend
- Fixes
- KLDiv example (#126857)
- torch.nn.TripletMarginLoss allowing margin less than or equal to 0 (#121978)
- example and typo in nn.ChannelShuffle and nn.PReLU docs (#123959)
- redundant tensor in nn.MaxUnpool2d example (#127850)
- wording in nn.Linear docstring (#127240)
- Improvements
- NLLLoss documentation (#127346)
- documentation of torch.nn.utils.rnn (#123559)
- return value documentation for nn.Module.load_state_dict (#123637)
- the example description for torch.nn.utils.rnn.pad_sequence (#123183)
- Update the is_causal explanation in the nn.functional.scaled_dot_product_attention doc (#127209)
- Warn SDPA users about dropout behavior (#126294)
Optim
- Document complex optimizer semantic behavior (#121667)
- Add missing parameter doc of Adagrad (#125886)
linalg
- Improve docs on the sorting of eig/eigvals (#127492)
Distributed
c10d
- Add
- a doc page for NCCL ENVs (#128235)
- migration notes for --local-rank option style change for torchrun for PyTorch 2.0 onwards (#109480)
- Documents
- 'tag' limitation for nccl send/recv (#125278)
- destroy_process_group usage (#122358)
- Fixes
- example in torch.distributed.new_subgroups docstring (#123492)
- the document of distributed.new_group() (#122703)
Distributed Checkpointing (DCP)
- Corrected typos in assert (#122633)
DTensor
- Add comment on replicate -> partial for _NormPartial (#121976)
- Updated public API docs for DTensor (#127340)
FullyShardedDataParallel (FSDP)
- Remove excessive warnings and rewrite FSDP docstrings (#123281)
- Fix docs for inter/intra node PG helpers (#126288)
- Updated docstring to include device_mesh arg (#126589)
Profiler
- Updated PT2+Profiler docs (#122272)
Export
- Fix documentation for register_fake_class (#126422)
Fx
- Document for add_var_to_val (#121850)
Dynamo
- Add a Dynamo deepdive to documentation (#122305)
- Update compile doc to suggest Module.compile (#123951)
- Fixes
- links rendering when surrounding code in Dynamo deepdive (#123427)
- the link to torch.compiler_custom_backends (#125865)
- typos in torch._dynamo.config.py (#126150)
- NumPy + backward example (#126872)
Inductor
- Fix aoti doc to avoid cannot bind non-const lvalue reference error (#121672)
- documentation for pattern_matcher.py (#127459)
ONNX
- Fix pytorch version for onnx in doc (#124182)
- Add docstring to masked_fill, expand, select, unsqueeze, cat fns (#128055)
- Documenting torch.onnx.operator.shape_as_tensor (#128051)
- Init sigmoid comments (#127983) (edited)
XPU
- PyTorch 2.4 XPU Getting Started (#127872)
- Update Intel GPU Support on README (#126001)
- Tensor (#126383 #127280)
- Stream (#121398)
- AMP (#127276 #127278)
- torch.compile with XPU support (#127879)
Developers
Composability
- cpu_fallback for aten::triu_indices on custom device crash (#121306)
- API to check whether running in torch_dispatch mode (#122339)
- clarify c10::Dispatcher kernel static asserts (#124519)
Release Engineering
- TD (target determination) reorders tests in CI based on heuristics and removes tests it believes to be irrelevant to the changes in the PR. (#121835, #121836, #122279, #122615, #122901, #124082, #122976, #125931)
- torchbench on-demand test workflow (#122624).
- BE: Ruff lint improvements (#124743, #124570)
- ability to save TORCH_COMPILE_DEBUG logs for CI failures (#124408)
- freezing option for cpu inductor accuracy test in inductor CI (#124715)
Optim
- Modify device check in capturable optimizer to support more devices (#124919)
- Improve typing and error messages in LRScheduler (#125556, #127943, #121633, #125161)
- Only initialize state if needed in SGD (#123757)
- Exempt torch.compile from more checks in Adamax (#123498)
- Merged the pyi files into py files of optimizer (#125153, #125452)
- Tighten fallback conditions for compiled optimizer (#125825)
Distributed
c10d
- Updated error message for sparse all-reduce (#121644)
- Add
- generic scuba logging capability into c10d (#121859)
- log the target of Flight Recorder dump (#122345)
- the source rank in the logs when detecting the timeout (#122850)
- more fields for periodic logging (#123860)
- pg_name and pg_desc to logger (#126409)
- Work's numel to logger for debugging purposes (#127468)
- Allow user to pass process group description for ProcessGroupNCCL (#123472)
- Print the duration of the broadcast of ncclunique_id (#123963)
- Pass and record process_group_name when creating ProcessGroupNCCL (#123117)
- Pass pg name and desc to NCCL communicator (#124149)
- Make only PG0 dump when the monitoring thread times out (#125356)
- split seq_id into collective_seq_id and p2p_seq_id (#125727)
- Print certain logs only on the head rank of each node (#125432)
- Warn about env vars only once during the program run (#127046)
DTensor
- Add some initial c10d ops to CommDebugMode (#125475)
- Remove unused failed_reason (#126710)
- Add all_reduce_coalesced tracing to CommDebugMode (#127025)
Distributed Checkpointing (DCP)
- additional logging for improved observability in DCP (#121352)
FullyShardedDataParallel (FSDP)
- Remove unnecessary warnings (#126365)
- warnings on wrapping ModuleList/ModuleDict (#124764)
Miscellaneous
- Remove dist_ prefix from TORCH_LOGS shortcuts (#126499)
- Make torch.distributed.breakpoint() to work under Python/Meta contexts (#118645)
TorchElastic
- Make log directory creation idempotent (#126496)
Fx
- Suggest TORCHDYNAMO_EXTENDED_DEBUG_ envvars when appropriate (#122473)
Inductor
- aoti_torch_item as a util function (#126352)
- model_type and global_rank for the scuba log for the dashboard Optimus pattern frequency monitor (#123398)
- Change the log for the group batch fusion (#122245)
- Do not use importlib.load_module (#122542)
- Enable FX graph caching on another round of inductor tests (#121994)
- Improves
- exception typing. Remove NOQAs (#125535)
- generate_extern_kernel_out's signature (#123351)
- logging (#122932)
- the optimus scuba log (#122361)
- Misc refactors (#126945)
- Only print bw result for the first time we benchmark a kernel (#123568)
- Refactor
- MultiOutput.codegen_list_tuple_access to use subclass type checks (#121662)
- indexing() into triton.py
- part of IterationRangesEntry into triton.py (#126944)
- some fallback op util functions (#126182)
- is_legacy_abi_kernel and abi_compatible_kernel (#121523)
- Renamed mutationlayout/aliasedlayout (#122474)
- Unify val_to_arg_str and val_to_cpp_arg_str (#126916)
- Update
- DTYPE_TO_CPP mapping (#126915)
- opinfo tests (flattened diff) (#124657)
- tensor_converter util functions (#121743)
- triton pin (#121268)
- Use C++17 helper templates (#122607)
- delete inductor config.trace.compile_profile (#127143)
- log pt2 config dict to signpost from inductor post grad (#124593)
- refactor
- code to use define_kernel and call_kernel similar to CUDA (#123704)
- device dispatch inside do_bench (#125736)
MPS
- Reorganize logics and naming in copy.mm (#123310)
- Pointer to the non-zero limit ticket (#124244)
- Introduce MetalShaderLibrary class (#125550)
- Include MPSGraphVenturaOps.h for complex types on macOS12 (#127859)
- Define _compute_tolerances (#121754)
XPU
- Support general device runtime Interface for Intel GPU (#121883)
- Enable triton installation for Intel GPU (#122254)
- Reuse inductor test for Intel GPU (#122866, #124147)
- Update Intel triton for Pytorch 2.4 release (#128615)
- Support reduction split for Intel GPU (#129337)
- call empty_cache for dynamo tests (#126377)
- Support xpu autocast policy (#124052)
Security
Python frontend
- warning for weights_only (#129239, #129396, #129509) (see Deprecations section)
Release Engineering
- Vulnerability related updates of packages used in CI (#124614, #124675, #124976, #124983, #125698, #126805, #126989)