v1.7.0
Release date: 2020-10-28 00:35:58
PyTorch 1.7.0 Release Notes
- Highlights
- Backwards Incompatible Changes
- New Features
- Improvements
- Performance
- Documentation
Highlights
The PyTorch 1.7 release includes a number of new APIs including support for NumPy-Compatible FFT operations, profiling tools and major updates to both distributed data parallel (DDP) and remote procedure call (RPC) based distributed training. In addition, several features moved to stable including custom C++ Classes, the memory profiler, the creation of custom tensor-like objects, user async functions in RPC and a number of other features in torch.distributed such as Per-RPC timeout, DDP dynamic bucketing and RRef helper.
A few of the highlights include:
- CUDA 11 is now officially supported with binaries available at PyTorch.org
- Updates and additions to profiling and performance for RPC, TorchScript and Stack traces in the autograd profiler
- (Beta) Support for NumPy compatible Fast Fourier transforms (FFT) via torch.fft
- (Prototype) Support for Nvidia A100 generation GPUs and native TF32 format
- (Prototype) Distributed training on Windows now supported
To reiterate, starting with PyTorch 1.6, features are now classified as stable, beta and prototype. You can see the detailed announcement here. Note that the prototype features listed in this blog are available as part of this release.
Front End APIs
[Beta] NumPy Compatible torch.fft module
FFT-related functionality is commonly used in a variety of scientific fields like signal processing. While PyTorch has historically supported a few FFT-related functions, the 1.7 release adds a new torch.fft module that implements FFT-related functions with the same API as NumPy.
This new module must be imported to be used in the 1.7 release, since its name conflicts with the historic (and now deprecated) torch.fft function.
Example usage:
>>> import torch.fft
>>> t = torch.arange(4)
>>> t
tensor([0, 1, 2, 3])
>>> torch.fft.fft(t)
tensor([ 6.+0.j, -2.+2.j, -2.+0.j, -2.-2.j])
>>> t = torch.tensor([0.+1.j, 2.+3.j, 4.+5.j, 6.+7.j])
>>> torch.fft.fft(t)
tensor([12.+16.j, -8.+0.j, -4.-4.j, 0.-8.j])
- Documentation | Link
[Beta] C++ Support for Transformer NN Modules
Since PyTorch 1.5, we’ve continued to maintain parity between the Python and C++ frontend APIs. This update allows developers to use the nn.transformer module abstraction from the C++ frontend. Moreover, developers no longer need to save a module from Python/JIT and load it into C++, as it can now be used in C++ directly.
- Documentation | Link
[Beta] torch.set_deterministic
Reproducibility (bit-for-bit determinism) may help identify errors when debugging or testing a program. To facilitate reproducibility, PyTorch 1.7 adds the torch.set_deterministic(bool) function that can direct PyTorch operators to select deterministic algorithms when available, and to throw a runtime error if an operation may result in nondeterministic behavior. By default, the flag this function controls is false and there is no change in behavior, meaning PyTorch may implement its operations nondeterministically by default.
More precisely, when this flag is true:
- Operations known to not have a deterministic implementation throw a runtime error;
- Operations with deterministic variants use those variants (usually with a performance penalty versus the non-deterministic version); and
- torch.backends.cudnn.deterministic = True is set.
Note that this is necessary, but not sufficient, for determinism within a single run of a PyTorch program. Other sources of randomness like random number generators, unknown operations, or asynchronous or distributed computation may still cause nondeterministic behavior.
See the documentation for torch.set_deterministic(bool) for the list of affected operations.
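A minimal usage sketch (which operations are actually affected depends on your build and hardware):
import torch

torch.set_deterministic(True)   # error out on known-nondeterministic ops, prefer deterministic variants
assert torch.is_deterministic()
# ... run the code under test ...
torch.set_deterministic(False)  # restore the default (non-strict) behavior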
Performance & Profiling
[Beta] Stack traces added to profiler
Users can now see not only operator name/inputs in the profiler output table but also where the operator is in the code. The workflow requires very little change to take advantage of this capability. The user uses the autograd profiler as before but with optional new parameters: with_stack and group_by_stack_n. Caution: regular profiling runs should not use this feature as it adds significant overhead.
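A minimal sketch; here group_by_stack_n is passed to key_averages() to group the recorded events by their originating stack frames:
import torch

x = torch.randn(16, 16, requires_grad=True)
with torch.autograd.profiler.profile(with_stack=True) as prof:
    y = (x @ x).sum()
    y.backward()
# Group events by the top 5 stack frames and print where each operator was called from
print(prof.key_averages(group_by_stack_n=5).table(sort_by="self_cpu_time_total"))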
Distributed Training & RPC
[Stable] TorchElastic now bundled into PyTorch docker image
Torchelastic offers a strict superset of the current torch.distributed.launch CLI with added features for fault-tolerance and elasticity. If the user is not interested in fault-tolerance, they can get exact functionality/behavior parity by setting max_restarts=0, with the added convenience of auto-assigned RANK and MASTER_ADDR|PORT (versus manually specified in torch.distributed.launch).
By bundling torchelastic in the same docker image as PyTorch, users can start experimenting with torchelastic right away without having to separately install it. In addition to convenience, this work is a nice-to-have when adding support for elastic parameters in the existing Kubeflow distributed PyTorch operators.
- Usage examples and how to get started | Link
[Beta] Support for uneven dataset inputs in DDP
PyTorch 1.7 introduces a new context manager to be used in conjunction with models trained using torch.nn.parallel.DistributedDataParallel to enable training with uneven dataset sizes across different processes. This feature enables greater flexibility when using DDP and prevents the user from having to manually ensure dataset sizes are the same across different processes. With this context manager, DDP will handle uneven dataset sizes automatically, which can prevent errors or hangs at the end of training.
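A minimal sketch of the new context manager; rank, world_size and the per-rank dataloader are assumed to be provided by the launcher, and the init_method file path is illustrative:
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

dist.init_process_group("gloo", init_method="file:///tmp/ddp_init", rank=rank, world_size=world_size)
model = DistributedDataParallel(torch.nn.Linear(10, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

with model.join():  # handles ranks that run out of batches before the others
    for batch in dataloader:
        loss = model(batch).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()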
[Beta] NCCL Reliability - Async Error/Timeout Handling
In the past, NCCL training runs would hang indefinitely due to stuck collectives, leading to a very unpleasant experience for users. This feature will abort stuck collectives and throw an exception/crash the process if a potential hang is detected. When used with something like torchelastic (which can recover the training process from the last checkpoint), users can have much greater reliability for distributed training. This feature is completely opt-in and sits behind an environment variable that needs to be explicitly set in order to enable this functionality (otherwise users will see the same behavior as before).
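The opt-in is done through the environment; the variable name below is taken from the 1.7 distributed documentation (an assumption worth checking against your version) and should be set before the process group is created:
import os

# Opt in to asynchronous error/timeout handling for ProcessGroupNCCL (off by default in 1.7)
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"
# ... then create the NCCL process group as usual, e.g.
# torch.distributed.init_process_group("nccl", ...)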
[Beta] TorchScript remote and rpc_sync
torch.distributed.rpc.rpc_async has been available in TorchScript in prior releases. For PyTorch 1.7, this functionality is extended to the remaining two core RPC APIs, torch.distributed.rpc.rpc_sync and torch.distributed.rpc.remote. This completes the major RPC APIs targeted for support in TorchScript; it allows users to use the existing Python RPC APIs within TorchScript (in a script function or script method, which releases the Python Global Interpreter Lock) and could possibly improve application performance in multithreaded environments.
[Beta] Distributed optimizer with TorchScript support
PyTorch provides a broad set of optimizers for training algorithms, and these have been used repeatedly as part of the Python API. However, users often want to use multithreaded training instead of multiprocess training, as it provides better resource utilization and efficiency in the context of large scale distributed training (e.g. distributed model parallel) or for any RPC-based training application. Users couldn’t do this with the distributed optimizer before because the Python Global Interpreter Lock (GIL) limitation had to be removed to achieve it.
In PyTorch 1.7, we are enabling TorchScript support in the distributed optimizer to remove the GIL and make it possible to run the optimizer in multithreaded applications. The new distributed optimizer has the exact same interface as before, but it automatically converts the optimizers within each worker into TorchScript to make each of them GIL-free. This is done by leveraging a functional optimizer concept and allowing the distributed optimizer to convert the computational portion of the optimizer into TorchScript. This will help use cases like distributed model parallel training and improve performance using multithreading.
Currently, the only optimizer that supports automatic conversion with TorchScript is Adagrad; all other optimizers will still work as before, without TorchScript support. We are working on expanding the coverage to all PyTorch optimizers and expect more to come in future releases. The usage to enable TorchScript support is automatic and exactly the same as the existing Python APIs; here is an example of how to use this:
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc
from torch import optim
from torch.distributed.optim import DistributedOptimizer

with dist_autograd.context() as context_id:
    # Forward pass.
    rref1 = rpc.remote("worker1", torch.add, args=(torch.ones(2), 3))
    rref2 = rpc.remote("worker1", torch.add, args=(torch.ones(2), 1))
    loss = rref1.to_here() + rref2.to_here()
    # Backward pass.
    dist_autograd.backward(context_id, [loss.sum()])
    # Optimizer: pass in optim.Adagrad; DistributedOptimizer will
    # automatically convert/compile it to TorchScript (GIL-free)
    dist_optim = DistributedOptimizer(
        optim.Adagrad,
        [rref1, rref2],
        lr=0.05,
    )
    dist_optim.step(context_id)
[Beta] Enhancements to RPC-based Profiling
Support for using the PyTorch profiler in conjunction with the RPC framework was first introduced in PyTorch 1.6. In PyTorch 1.7, the following enhancements have been made:
- Implemented better support for profiling TorchScript functions over RPC
- Achieved parity in terms of profiler features that work with RPC
- Added support for asynchronous RPC functions on the server-side (functions decorated with rpc.functions.async_execution).
Users are now able to use familiar profiling tools such as torch.autograd.profiler.profile() and torch.autograd.profiler.record_function, and this works transparently with the RPC framework with full feature support, profiling asynchronous functions and TorchScript functions.
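A minimal sketch, assuming RPC has already been initialized and a peer named "worker1" exists:
import torch
import torch.distributed.rpc as rpc

with torch.autograd.profiler.profile() as prof:
    fut = rpc.rpc_async("worker1", torch.add, args=(torch.ones(2), 1))
    fut.wait()
# RPC-related events now show up in the profiler output alongside local operators
print(prof.key_averages().table(sort_by="cpu_time_total"))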
[Prototype] Windows support for Distributed Training
PyTorch 1.7 brings prototype support for DistributedDataParallel and collective communications on the Windows platform. In this release, the support only covers Gloo-based ProcessGroup and FileStore.
To use this feature across multiple machines, please provide a file from a shared file system in init_process_group.
# initialize the process group
dist.init_process_group(
    "gloo",
    # For a multi-machine setup, shared files need six "/":
    # init_method="file://////{machine}/{share_folder}/file"
    # A local file needs three "/":
    init_method="file:///{your local file path}",
    rank=rank,
    world_size=world_size,
)
model = DistributedDataParallel(local_model, device_ids=[rank])
- Design doc | Link
- Documentation | Link
- Acknowledgement | gunandrose4u
Mobile
PyTorch Mobile supports both iOS and Android with binary packages available in Cocoapods and JCenter respectively. You can learn more about PyTorch-Mobile here.
[Beta] PyTorch Mobile Caching allocator for performance improvements
On some mobile platforms, such as Pixel, we observed that memory is returned to the system more aggressively. This results in frequent page faults, as PyTorch, being a functional framework, does not maintain state for operators: for most ops, outputs are allocated dynamically on each execution of the op. To ameliorate the resulting performance penalties, PyTorch 1.7 provides a simple caching allocator for CPU. The allocator caches allocations by tensor size and is currently available only via the PyTorch C++ API. The caching allocator itself is owned by the client, and thus its lifetime is also maintained by client code. Such a client-owned caching allocator can then be used with a scoped guard, c10::WithCPUCachingAllocatorGuard, to enable the use of cached allocations within that scope.
Example usage:
#include <c10/mobile/CPUCachingAllocator.h>
.....
c10::CPUCachingAllocator caching_allocator;
// Owned by client code. Can be a member of some client class so as to tie
// the lifetime of the caching allocator to that of the class.
.....
{
c10::optional<c10::WithCPUCachingAllocatorGuard> caching_allocator_guard;
if (FLAGS_use_caching_allocator) {
caching_allocator_guard.emplace(&caching_allocator);
}
....
model.forward(..);
}
.....
NOTE: The caching allocator is only available in mobile builds; using it outside of mobile builds has no effect.
Backwards Incompatible changes
Python API
torch.conj now returns the input as-is for real Tensors (#43270)
Previously, torch.conj and Tensor.conj were making a clone for Tensors of real dtype. They now return the Tensor as-is to improve performance.
You can recover the original behavior by adding a .clone() for real Tensors.
Note that this behavior is different from numpy, for which np.conj returns a new ndarray and ndarray.conj returns the ndarray as-is.
1.6.0:
>>> t.is_complex()
False
>>> t.conj() is t
False

1.7.0:
>>> t.is_complex()
False
>>> t.conj() is t
True
>>> t.conj().clone() is t
False
torch.tensor, torch.as_tensor, and torch.sparse_coo_tensor now use the input Tensor’s device when it is not specified (#41984)
This will change the device on which the Tensor is created, so the user can start seeing device mismatch errors. It also means, for sparse Tensors, that both of the provided Tensors must be on the same device if the device is not specified.
You can recover the original behavior by passing the device argument.
1.6.0:
>>> t.device
device(type='cuda:0')
>>> # tensor constructor
>>> torch.tensor(t, dtype=torch.float32).device
device(type='cpu')
>>> # sparse constructor
>>> torch.sparse_coo_tensor(
        torch.tensor(([0], [2]), device="cpu"),
        torch.tensor(([1.],), device="cuda"),
        size=(3, 3, 1)).device
device(type='cuda', index=0)

1.7.0:
>>> t.device
device(type='cuda:0')
>>> # tensor constructor
>>> torch.tensor(t, dtype=torch.float32).device
device(type='cuda:0')
>>> # Specify the device to get the same behavior as 1.6
>>> torch.tensor(t, dtype=torch.float32, device='cpu').device
device(type='cpu')
>>> # sparse constructor
>>> torch.sparse_coo_tensor(
        torch.tensor(([0], [2]), device="cpu"),
        torch.tensor(([1.],), device="cuda"),
        size=(3, 3, 1)).device
RuntimeError: backend of indices (CPU) must match backend of values (CUDA)
>>> # Specify the device to get the same behavior as 1.6
>>> torch.sparse_coo_tensor(
        torch.tensor(([0], [2]), device="cpu"),
        torch.tensor(([1.],), device="cuda"),
        size=(3, 3, 1),
        device="cuda:0").device
device(type='cuda', index=0)
torch.nn.utils.rnn.pack_padded_sequence: remove hidden cross-device copy for lengths (#41984)
In previous versions, when the lengths argument was a CUDA tensor, it would incorrectly be moved to the CPU silently. This could lead to surprising performance issues and CPU/GPU synchronizations when using CUDA, so it has been removed.
You need to make sure that lengths, when provided as a Tensor, is a CPU Tensor.
1.6.0:
>>> inp = torch.rand(10, 2, 3, device="cuda")
>>> lengths = torch.tensor([10, 7], device="cuda")
>>> torch.nn.utils.rnn.pack_padded_sequence(inp, lengths)
>>> # Implicitly moves lengths to the CPU and runs fine

1.7.0:
>>> inp = torch.rand(10, 2, 3, device="cuda")
>>> lengths = torch.tensor([10, 7], device="cuda")
>>> torch.nn.utils.rnn.pack_padded_sequence(inp, lengths)
RuntimeError: 'lengths' argument should be a 1D CPU int64 tensor,
but got 1D cuda:0 Long tensor
>>> # Ensure that lengths is already on the right device
>>> lengths = lengths.cpu()
>>> torch.nn.utils.rnn.pack_padded_sequence(inp, lengths)
>>> # Runs fine with no implicit move across devices
Improve torch.norm handling of keepdim=True (#41956)
Before this change, when calling torch.norm with keepdim=True and p='fro' or p=number, leaving all other optional arguments at their default values, the keepdim argument would be ignored. It is now properly respected.
Also, any time torch.norm was called with p='nuc' and keepdim=True, the result would have one fewer dimension than the input, and the dimensions could be out of order depending on which dimensions were being reduced. It now properly keeps all the dimensions.
You can recover the original behavior by setting keepdim=False.
NOTE: this function is now deprecated (see below) and we recommend you use torch.linalg.norm, which follows NumPy’s conventions.
1.6.0:
>>> t.size()
torch.Size([4, 4])
>>> t.norm(p='fro', keepdim=True).size()
torch.Size([])
>>> t.norm(p=3, keepdim=True).size()
torch.Size([])
>>> t.norm(p='nuc', keepdim=True).size()
torch.Size([1])

1.7.0:
>>> t.size()
torch.Size([4, 4])
>>> t.norm(p='fro', keepdim=True).size()
torch.Size([1, 1])
>>> t.norm(p=3, keepdim=True).size()
torch.Size([1, 1])
>>> t.norm(p='nuc', keepdim=True).size()
torch.Size([1, 1])
torch.split and torch.chunk: Fix view tracking for the autograd (#41567)
The autograd system is able to correctly handle modifications through views of Tensors by explicitly tracking known view operations. In prior releases, torch.split and torch.chunk were not marked as known view operations, which could lead to silently wrong gradients.
Note that since v1.5, inplace modification of views created by functions that return multiple views is deprecated. Such cases are not properly handled by the autograd and can lead to internal errors or wrong gradients. So, as a side effect of this view fix, inplace modifications of the outputs of torch.split and torch.chunk will now raise a warning and can still lead to internal errors or wrong gradients, whereas they previously computed wrong gradients silently.
If you see such a warning, you should replace the inplace operation with an out-of-place one.
You can recover the original behavior by using the new torch.unsafe_split and torch.unsafe_chunk. Note that these functions are only here to ease the transition and will also be removed in a future version.
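A minimal sketch of the transitional API (illustrative shapes):
import torch

x = torch.randn(4, requires_grad=True).clone()
# In 1.7, in-place modification of the outputs of torch.split / torch.chunk warns,
# because they are now tracked as views by autograd. The unsafe_* variants keep
# the old, untracked behavior during the transition period.
a, b = torch.unsafe_split(x, 2)
a.add_(1)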
torch.{argmin,argmax} now always return the first min/max index (#42004)
torch.argmin (torch.argmax) now always returns the index of the first minimum (maximum) element. This choice is consistent with NumPy. Previously, if there were multiple minima (maxima), the index returned could be that of any of them.
You cannot recover the original behavior as it was platform dependent and not guaranteed. If your code was relying on a specific index for your specific platform, you should update it to work with the first index; the new code will work on all platforms.
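For example:
>>> torch.argmax(torch.tensor([1, 3, 3, 2]))  # the maximum 3 first appears at index 1
tensor(1)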
torch.{min,max,median}: Update backward formula when doing full reduction (dim argument not provided) (#43519)
When no dimension is specified, a full reduction is performed and the gradient now flows back evenly towards all the inputs that realized the output value. The old behavior was to propagate the gradient only to one such input, selected arbitrarily. This should improve stability of training by gradient descent.
To recover the previous behavior, you can perform the reduction with the dim= argument. It will ensure that the gradient only flows back for the input whose index was returned.
1.6.0:
>>> a
tensor([3, 2, 3])
>>> a.max().backward()
>>> a.grad
tensor([0, 0, 1])

1.7.0:
>>> a
tensor([3, 2, 3])
>>> a.max().backward()
>>> a.grad
tensor([0.5, 0, 0.5])
>>> a.max(dim=0).max(dim=0).max(dim=0).backward()
>>> a.grad
tensor([0, 0, 1])
nn.BCELoss size mismatch warning is now an error (#41426)
This is the end of the deprecation cycle for this op, ensuring it does not have broadcasting semantics different from the NumPy broadcasting semantics used everywhere else in PyTorch’s codebase. You need to make sure all inputs are the same size to avoid the error.
1.6.0:
>>> bceloss = nn.BCELoss()
>>> a = torch.rand(25)
>>> b = torch.rand(25, 1)
>>> bceloss(a, b)
UserWarning: Using a target size (torch.Size([25, 1]))
that is different to the input size (torch.Size([25]))
is deprecated. Please ensure they have the same size.
tensor(1.0604)

1.7.0:
>>> bceloss = nn.BCELoss()
>>> a = torch.rand(25)
>>> b = torch.rand(25, 1)
>>> bceloss(a, b)
ValueError: Using a target size (torch.Size([25, 1]))
that is different to the input size (torch.Size([25]))
is deprecated. Please ensure they have the same size.
>>> b = b.reshape(25)
>>> bceloss(a, b)
tensor(1.0604)
Custom autograd.Function stops materializing None output Tensors (#41490)
To improve performance, custom autograd.Functions will no longer create a Tensor full of zeros when an input is differentiable but the user’s backward function returns None for it. This means that the final result of .backward() or autograd.grad() can now be None where it used to be a Tensor full of zeros.
You can recover the previous behavior by having your custom autograd.Function materialize the zero Tensor with torch.zeros_like(input) to replace the None output of the backward method.
import torch

# Custom Function that returns None for the gradient
class GetTwos(torch.autograd.Function):
    @staticmethod
    def forward(ctx, inp):
        return inp.clone().fill_(2)

    @staticmethod
    def backward(ctx, grad_out):
        # To recover the 1.6 behavior, replace the line below with
        # `return torch.zeros_like(grad_out)`
        return None

a = torch.rand(10, requires_grad=True)
b = GetTwos.apply(a)
b.sum().backward()

print(a.grad)
# In PyTorch 1.6 this will print
# tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
# In PyTorch 1.7 this will print
# None
Fix inplace detection for non-differentiable outputs (#41269)
We fixed a bug in the inplace detection code that was preventing the detection of some inplace operations for outputs that are not differentiable (like integer type Tensors). This can cause code that used to run fine to throw the error “a Tensor that was needed for backward was modified in an inplace operation”. Such failures are genuine, and the user code must be fixed to compute proper gradients. In general, this involves cloning the Tensor before modifying it inplace to make sure the backward pass can happen safely.
import torch

a = torch.rand(10, requires_grad=True)
with torch.no_grad():
    a[2] = 10
b, ind = a.max(dim=0)
# ind is 2 here

with torch.no_grad():
    t = torch.rand(10)
    t[4] = 10
    res = torch.max(t, dim=0, out=(torch.Tensor(), ind))
    # ind becomes 4 here

# This backward runs in 1.6 but will fail in 1.7
b.sum().backward()
print(a.grad)
# tensor([0., 0., 0., 0., 1., 0., 0., 0., 0., 0.])
# The value is wrongly at index 4 while it should be at index 2
# The issue is avoided by not modifying ind inplace, replacing the line
# above with:
# res = torch.max(t, dim=0, out=(torch.Tensor(), ind.clone()))
Add __torch_function__ for methods (#37091)
Functions, slicing and Tensor methods will now properly preserve the subclass type when possible.
>>> class SubTensor(torch.Tensor):
...     pass
>>> type(torch.add(SubTensor([0]), SubTensor([1]))).__name__
'SubTensor'
>>> type(torch.add(SubTensor([0]), torch.Tensor([1]))).__name__
'SubTensor'
The old behavior of “any operations on your subclass produces a torch.Tensor instead of the subclass” can be recovered by doing:
from torch._C import _disabled_torch_function_impl

class SubTensor(torch.Tensor):
    __torch_function__ = _disabled_torch_function_impl
For all details on how to use this feature, please refer to the doc page for it.
tensor.__iter__: Use torch.unbind instead of a for loop (#40884)
This improves performance significantly, but it changes the behavior of in-place operations on the values returned by the iterator. This happens only if either the input Tensor or any argument of the in-place operation is a Tensor that requires gradients, and it will fail with "Output X of UnbindBackward is a view and is being modified inplace".
You can recover the previous behavior by manually slicing the Tensor: [t[i] for i in range(t.size(0))], as shown in the example below.
1.6.0:
>>> x = torch.randn(5, 10, requires_grad=True)
>>> for i, v in enumerate(x):
>>>     v.fill_(i)

1.7.0:
>>> x = torch.randn(5, 10, requires_grad=True)
>>> for i, v in enumerate([x[j] for j in range(x.size(0))]):
>>>     v.fill_(i)
Updated most functions that take zero, one or two Tensor arguments, as well as indexing ops, to check for memory overlap in the Tensors being worked on (#43418, #43419, #43420, #43421, #43423, #43422)
This fixes silent correctness errors: something that used to be silently incorrect now errors out. Code that now raises this error must be updated to avoid the operation, which was returning wrong results, as shown in the example below:
>>> x = torch.randn(1, 3)
>>> # Create a tensor that has internal memory overlap
>>> y = x.expand(2, 3)
# In 1.6, this would not error out, but in 1.7, this errors out
>>> torch.nn.functional.elu(y, inplace=True)
RuntimeError: unsupported operation: more than one element of the written-to tensor refers to a single m
emory location. Please clone() the tensor before performing the operation.
# Here is the fix in 1.7
>>> torch.nn.functional.elu(y, inplace=False)
C++ API: Any external users of TensorIterator now always get the memory overlap check. The previous behavior can be recovered by setting set_check_mem_overlap(false) when creating the iterator.
TorchScript
TorchScript now correctly supports various exception type and custom exception message (#41907)
- Exceptions raised in TorchScript were traditionally replaced with a generic runtime error that doesn’t carry the exception type or message, leading to crashes that are difficult to pinpoint and debug. We improved TorchScript to correctly parse exception types and messages and surface them to users (see the sketch after this list).
- This change is backward incompatible because TorchScript now attempts to compile user code that creates custom exception messages instead of ignoring them. Any TorchScript-incompatible Python features used in those code snippets would lead to failures.
- There is no fixed formula to fix this backward incompatibility failure other than updating code that generates exceptions to be TorchScript-able.
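A minimal sketch of the new behavior; the function and message below are illustrative:
import torch

@torch.jit.script
def check_non_negative(x: torch.Tensor) -> torch.Tensor:
    if bool((x < 0).any()):
        # In 1.7 the ValueError type and this custom message are preserved,
        # instead of being replaced by a generic runtime error.
        raise ValueError("check_non_negative expects non-negative inputs")
    return x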
TorchScript now supports properties of TorchScript classes and ScriptModules (#42389, #42390)
- TorchScript added support for @property of TorchScript classes and ScriptModules. Custom setters and getters are also supported; custom deleters are not.
- This improvement is backward incompatible because TorchScript now attempts to script properties of existing classes and Modules. If these properties use Python or PyTorch features that are not supported in TorchScript, scripting will fail.
- There are two ways of fixing backward incompatibility failures introduced by this change: one is using @torch.jit.unused to annotate problematic properties, the other is to update the implementation of the property so that the getter and setter are scriptable (a sketch of a scriptable property follows this list).
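A rough sketch of a scriptable property (illustrative module; the getter only needs to be expressible in TorchScript):
import torch

class Scale(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self._factor = 2.0

    @property
    def factor(self) -> float:
        return self._factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.factor

scripted = torch.jit.script(Scale())  # the property is compiled instead of ignored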
Quantization
The convolution parameters now support versioning.
- This change means that any quantized convolution module saved using PyTorch 1.7+ cannot be loaded in v1.6 and lower.
- But this change is backward compatible: if the model (with conv layers) is saved in version 1.6, it can be safely loaded in version 1.7.
Some undocumented functions that were mistakenly made public have been removed
- torch.absolute_ has been removed; the Tensor method (Tensor.absolute_) should be used instead, just like all other inplace ops.
- torch.ExtraFilesMap is an internal JIT construct and should not be used.
TorchScript Compiler Update
In 1.7, we are enabling a Profiling Executor and a new Tensor-Expressions-based (TE) Fuser. All compilations will now go through one (an adjustable setting) profiling run and one optimization run. For the profiling run, complete tensor shapes are recorded and used by the new fuser. For the optimization run, the focus is on finding (in torch.jit.ScriptModules) and fusing element-wise operations over CUDA tensors into a single CUDA kernel.
The TE fuser is expected to deliver performance similar to the old fuser used in 1.6. It does, however, unlock more opportunities for performance improvements in future releases. In rare cases, performance of some models may degrade 5-10%. If you experience any regressions, please report them on GitHub so we can address them as soon as possible! For 1.7, we are providing an option for users to revert back to the old fuser by calling torch._C._jit_set_profiling_executor(False) in Python and torch::jit::getExecutorMode() = false; in C++. For more information, please see the “Graph Executor” section in our documentation.
Deprecations
Python API
torch.norm and torch.functional.norm are deprecated in favor of torch.linalg.norm (#44321)
The new torch.linalg.norm has the same behavior as numpy.linalg.norm. Both deprecated functions had odd behaviors for matrix and vector norms. You should refer to the doc here to find the exact behavior they had and how to replicate it with the new API.
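For example, a minimal sketch of the replacement API:
import torch

A = torch.randn(3, 4)
frob = torch.linalg.norm(A)                      # Frobenius norm for a matrix (NumPy's default)
row_norms = torch.linalg.norm(A, ord=2, dim=1)   # vector 2-norm of each row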
Deprecate fft functions in the torch. namespace in favor of the torch.fft. namespace (#44876)
Please use torch.fft.foo as a drop-in replacement for torch.foo for the following functions: fft, ifft, rfft and irfft.
Warns when some out= functions need to resize an output which is not 0-size (#42079)
This behavior is dangerous and leads to an API that is hard to use. It is being deprecated so that the API can be fixed in future versions. You should resize the output beforehand to avoid any issue in the future:
a = torch.rand(5)
b = torch.rand(25)
# This is deprecated
torch.add(a, a, out=b)
# This has the same behavior but will work in future versions
torch.add(a, a, out=b.resize_(0))
torch.optim: Warn for duplicate params in param group (#41597)
Providing the same Parameter multiple times in a single param group is most likely a user error and is being deprecated. Please open an issue if you have a valid use case that requires this feature.
torch.linspace and torch.logspace: Not giving the steps argument is deprecated (#43860)
The default steps argument that has been used historically in PyTorch is not consistent with other libraries and so is being removed to avoid confusion. For both functions, passing the steps=100 keyword argument recovers the original behavior.
1.6.0:
>>> torch.linspace(0, 10).size()
torch.Size([100])

1.7.0:
>>> torch.linspace(0, 10).size()
UserWarning: Not providing a value for linspace's
steps is deprecated and will throw a runtime error
in a future release.
torch.Size([100])
>>> torch.linspace(0, 10, steps=100).size()
torch.Size([100])
Distributed
- Make TensorPipe the default backend for RPC (#43246)
- Infer RPC backend type to preserve backward compatibility as we make TensorPipe the default (#45065)
- Add deprecation warning to ProcessGroup backend and make TensorPipe backend stable. (#45356)
- Add warnings on ProcessGroup and ProcessGroup::Work APIs which will be retired soon. (#46366)
New features
Python API
New namespaces:
New operators:
-
torch.count_nonzero
added (#39992) -
nn.SiLU
activation added (#41034) -
torch.logit
added (#41062) -
torch.gcd
,torch.lcm
added (#40651, #41552, #42254) -
torch.functional.atleast_{1d/2d/3d}
added (#41317) -
torch.isreal
added (#41298) -
nn.Unflatten
added (#41564) -
torch.movedim
added (#41480) -
torch.isposinf
,torch.isneginf
added (#41588) -
torch.signbit
added (#41589) -
torch.absolute
added (#42586) -
torch.clip
alias added (#42770) -
torch.quantile
added (#42755) -
torch.linalg.det
andtorch.outer
alias added (#42802) -
torch.nansum
added (#38628) -
torch.hypot
added (#42291) -
torch.nextafter
added (#42580) -
torch.hstack
,torch.vstack
,torch.dstack
added (#42799) -
torch.arccosh
alias added (#43107) -
Tensor.movedim
as a method added (#43122) -
torch.matrix_exp
added (#40161) -
torch.fix
alias added (#43326) -
torch.arccos
,torch.arcsin
,torch.arctan
aliases added (#43319) -
torch.negative
alias added (#43400) -
torch.maximum
,torch.minimum
added (#42579) -
torch.arctanh
,torch.arcsinh
aliases added (#43762) -
torch.linalg.norm
added (#42749, #43907) -
torch.amax
,torch.amin
added (#43819) -
torch.heaviside
added (#42523) -
torch.i0
added (#43132) -
torch.not_equal
,torch.greater
,torch.greater_equal
,torch.less
,torch.less_equal
aliases added (#43870) -
torch.exp2
added (#44184) -
torch.kaiser_window
added (#44271) -
torch.nanquantile
added (#44393) -
torch.multiply
,torch.divide
aliases added (#44463) -
nn.TripletMarginWithDistanceLoss
added (#43680) -
torch.fft.fft
,torch.fft.ifft
,torch.fft.rfft
,torch.fft.irfft
,torch.fft.hfft
,torch.fft.ihfft
added (#43011) -
torch.fft.fftn
,torch.fft.ifftn
,torch.fft.rfftn
,torch.fft.irfftn
added (#44550) -
optim.functional.adagrad
added (#44715) -
optim.functional.adam
added (#44791) -
torch.complex
,torch.polar
added (#39617) -
Tensor.__complex__
added (#43844) -
torch.vdot
added (#43004)
API extension:
-
torch.full
added support for bool and integer dtypes (#41912) -
torch.lt
andtorch.masked_select
added support for half dtype (#43704) -
torch.div
,torch.true_divide
,torch.atan2
added support for integer to float type promotion in (#42359) -
unflatten
added support for non-named dimensions (#42563) -
torch.polygamma
added support for n >= 2 (#42499) -
torch.qr
added backward support for wide input matrices (#42216) -
nn.Linear
for MKLDNN added support for no-bias (#43703) -
torch.lerp
added support for half dtype (#43541) - Updates
torch.div
to perform true division (end of deprecation cycle) (#42907) -
torch.scatter
added support for reductions on CUDA (#41977) - BFloat16 support type promotion (#41698, #43324)
- BFloat16 support on CUDA for
torch.pow
(#44760), unary ops and activations (#44813, #44824, #44834),torch.i0
(#44750),softmax
(#44837),div
,addcdiv
,addcmul
,mean
,var
(#44758),layernorm
(#45002),all pooling layers (#44836, #45151)),torch.logspace
(CPU and CUDA) (#44675), random kernels on Windows (#44918),torch.addmm
,torch.addmv
(#44986), loss functions (#45011), batched gemm (#45167), nccl path (#38515), binary logical operators (#42485),torch.neg
(#45240), Conv (non-cuDNN) (#45007),torch.abs
(#44804),torch.erfinv
(#43399), comparison ops (#44748) -
torch.asin
,torch.neg
added support for sparse Tensors (#44028) -
torch.softmax
added support for CUDA (#42307) -
Tensor.{real,imag}
added setter for these attributes (#39860) -
torch.{addmm,addmv}
added support for complex on CUDA (#40431, #43827) -
torch.bmm
added support for complex on CPU #42383, -
torch.{dot, vdot}
added support for complex (#42745) -
torch.stft
,torch.istft
added support for complex (#43886) -
torch.cholesky
added support for complex (#44895, #45267) -
torch.sgn
added (to support complex) (#39955) - Binary ops added support for complex (#43174)
- Add allowlist for complex backward (#45461)
Autograd
- Don't automatically materialize output grads with zeros for autograd.Function (#41821)
- Benchmark tool for the autograd.functional API (#43428)
- Added reset_grad API to remove gradients instead of setting them to zero (#44423)
- Allow Tensor-like objects in torch.autograd.gradcheck (#43877)
- Added support for nested calls of the @torch.no_grad() decorator (#44633)
- Added support for torch.lobpcg backward (#43002)
CUDA
- Added TF32 support (#41498)
- CUDA RTX30 series support (#45489, #45130)
- Note: At the time of the 1.7 release, the currently available and stable Nvidia CUDA libraries are not fully tuned for the RTX 3080 and 3090, so users might see performance regressions.
- torch.cuda.amp.GradScaler now supports sparse gradients (#36786)
- Autocast support for cudnn RNNs (#42385)
- Support AMP in nn.parallel (#43102)
- Support for tf32 in cudnn and a backends.cudnn.allow_tf32 flag to control it (#40737)
- Added torch.cuda.memory.list_gpu_processes to list running processes on a given GPU (#44616)
- Add env variable to bypass CUDACachingAllocator for debugging (#45294)
- Add non-deterministic alert to CUDA operations that use atomicAdd() (#41538)
C++ API
- nn::TransformerEncoderLayer added (#42633)
- nn::TransformerDecoderLayer added (#42717)
- nn::TransformerEncoder added (#43187)
- nn::TransformerDecoder added (#42886)
- nn::Transformer added (#44333)
- nn::Unflatten added (#42613)
- nn.ParameterList added (#41259)
- torch::cuda::manual_seed and torch::cuda::manual_seed_all added (#42638)
Mobile
- Support Tensor MemoryFormat in java wrappers (#40785)
- Add
mobile_optimized
boolean flag to optimized model. (#45479)
Vulkan
- Backend added (#36491, #43076)
- Add many operators
adaptive_avg_pool2d
(#41220),mm
(#41221),reshape
(#41223),max_pool2d
(#41379),add_
andrelu_
(#41380),cat
(#41434),add
andmul
(#42674) andavg_pool2d
(#42675). - Model preparation via
torch.utils.optimize_for_vulkan
(#44903) - Add to Java API option to load on Vulkan and test app (#44896, #44897)
Distributed
- Support alltoall collective in ProcessGroupGloo (#41424, #41690)
- Add a DDP Communication Hook providing the flexibility to completely override DDP gradient communication (#40848)
- Examples on how to use the DDP communication hook (#43310)
- Add NCCL Alltoall to NCCL process group (#42514)
- Support allgather and gather APIs for Python Objects (#42189)
- Join-based API to support uneven inputs in DDP (#42577)
- broadcast_object API for c10d (#43887)
- Async Error Handling support for ProcessGroupNCCL (#41050, #41051, #41052, #41053, #41054, #44163)
- Add a “gradient_as_bucket_view" parameter to DDP to reduce memory overhead (#44344)
- Add getNumKeys API to c10d TCPStore (#43962)
- Add DeleteKey API for c10d TCP Store (#45401)
Quantization
- New quantized ops
- Adaptive average pooling (#40271)
- Max pooling (#45152)
- Embedding and EmbeddingBag quantization (8-bit + partial support for 4-bit): (#40076, #41293, #41612, #42924, #42762, #42881, #43077, #43088, #43090, #43176, #43296, #43433, #43989, #44008, #44207, #44208, #44217, #45149, #44845, #44048, #42690, #42612)
- QNNPACK Transposed convolution2D and 3D (#39714, #40351, #40360, #40370, #40371, #44844, #45078, #45081)
- Operations on quantized tensors
- 1D batch normalization support (#42491)
- N-Dimensional constant padding (#43304)
- CELU operator (#39199)
- Support for FP16 quantization (#40708, #40709, #40710, #42147, #42221, #42222, #42348, #41049)
- Add Quantizer support to IValue (#42438)
- Custom module support (#44835)
- Preserving pre and post forward hooks (#37233)
Misc
- torch.set_deterministic and torch.is_deterministic: Raise error when the flag is set and a non-deterministic operation is used (#15359, #41377)
- Add CUDA 11 to nightly binaries (#44086, #43366)
- Dev Tool: Nightly checkout tool and doc in CONTRIBUTING.md (#42635, #43294)
- Website: Add docs for tagged version (include rc) on the general website (#45204)
- Build: Added BUILD_CAFFE2 flag to be able to disable caffe2 compilation (#43673)
- Dataloader: Add prefetch_factor argument to control the number of batches loaded ahead of time (#41130)
- Dataloader: Allow handling of np.memmap objects (#39847)
- ROCm: Add support for torch.utils.cpp_extension (#41257, #43528)
- ROCm: Enable complex BLAS (#43744)
- docker: Add torchelastic to docker image (#45438)
- docker: Add CUDA 11 support (#45071)
- docker: Use python 3.8 in pytorch docker image (#45466)
Improvements
Python API
- Use tree-based sum for floats to avoid numerical instability (#39516)
- nn.ReflectionPad: Add support for 0-dim batch sizes. (#39231)
- torch.scatter: Add reductions for CPU (#36447)
- Allow any valid ASCII python identifiers as dimnames (#40871)
- Improve Python warning prints when there is also an error (#41116)
- torch.iinfo, torch.finfo: Improve printing (#40488)
- torch.where: Add support for scalar input (#40336)
- torch.nonzero: Remove deprecation warning for as_tuple argument (#45413)
- torch.distributions.Categorical: Clamp logit to avoid -inf when calculating entropy (#41002)
- torch.futures.Future: Add done function to query the status of the future (#42013)
torch.nn
- nn.EmbeddingBag: Add support for include_last_offset=True when reduction is mean or max (#42215)
- nn.AvgPooling{1,2,3}d: Ensure all cells are valid in ceil mode to avoid division by 0 (#41368)
- nn.[Adaptive]MaxPool{1,2,3}d: Handle edge case when input is filled with -inf (#40665)
- nn.Hardsigmoid, nn.Hardswish: Add inplace option (#42346)
- nn.MSELoss, nn.L1Loss, nn.SmoothL1Loss: Add support for target that requires gradients. (#44437, #44471, #44486)
- nn.Parameter{List,Dict}: Add warning when improperly used (with DataParallel or weight_norm) (#44405)
- nn.functional.smooth_l1: Add beta parameter (#44433)
Build
- Report error when ATEN_THREADING is OMP and USE_OPENMP is turned off. (#40146)
- Raise nice error when trying to build PyTorch on 32-bit Windows system (#40321)
- Make setup.py Python-2 syntactically correct and work for version >= 3.9 (#41960, #46388)
- Don't proceed into setup.py too far if Python version is unsupported (#42870)
Distributed
- Support profiling rpc_async in TorchScript (#40652)
- Allow RPC to be initialized again after shutdown. (#42723)
- Support rpc_sync, rpc.remote in TorchScript (#43043, #43046)
- Make async_execution compatible with RRef helpers (#44666)
- Extend RPC profiling to support async function execution over RPC. (#44664)
- Support record_shapes in RPC profiling (#44419)
- Add variants for cuda.comm.broadcast/gather/scatter which store the result in a provided “out” parameter (#39681)
- Explicitly abort NCCL Communicators on ProcessGroupNCCL Destruction (#40585)
- Helper function to print out all DDP-relevant env vars (#41297)
- Add timeout to ProcessGroup Work Wait (#40944)
- Support Wait Timeout in ProcessGroupNCCL (#40946)
- Support work-level timeouts in ProcessGroupGloo (#40948)
- Support for torch.bool in ProcessGroupNCCL (#41959)
- DDP.train() returns self to stay consistent with nn.Module (#42131)
- Add a drop_last option in DistributedSampler to drop tail of the data to ensure data is even across ranks (#41171)
- Additional error checking for
torch.cuda.nccl
APIs. (#43247) - Support work.result() to get result tensors for allreduce for Gloo, NCCL backends (#43970)
- Add a device parameter to RemoteModule (#44254)
- Add remote_parameters() API for RemoteModule. (#43906)
- Add a warning log when there is high skew of uneven inputs in DDP training (#45238)
TorchScript
- Support string concatenation (cc29c192a6)
- Support using Python Enum in TorchScript (#41390,#41965,#42085,#42623,#42661,#42661,#42874,#43460,#43188,#44243,#44891)
- Support sorting list of strings (#42398)
- Support boolean key in dictionary (#42833)
- Support
@torch.no_grad
(#41371) - Support
del
to TorchScript classes (#44352) - Speed up saving modules in case of having many classes (#44589)
- Support Python Slice class in TorchScript (#44335)
- Support sorting a list of tuples (#43448)
- Enable
@torch.jit.unused
syntax for ignoring properties (#45261) - Enable ProfilingExecutor + TensorExpression (#45546) (#45546)
- Support
@torch.jit.unused
on a@torch.no_grad
decorated function (#41496) - Improve ModuleList indexing error msg (#43361)
- Better match behavior of loaded ScriptModules vs. freshly created ones (#43298)
- Support backend-lowered submodules (#41146)
- Allow freezing of modules containing interface attribute (#41860)
-
to_backend
API now accepts wrapped modules (#43612) - Allow submodule methods inference rules to be different (#43872)
- Support default values for arguments of class type methods (#45098)
- Improve sugared value's error message when closing over global variables (#42889)
- Support backend-lowered submodules (#40841)
- Turn on non-ASCII string literals serialization (#40719)
- Better printing of Tensor stride information (#45156)
Mobile
- Allow specifying PYTHON executable to build_android (#41927)
- Include all overloads for OSS custom build (a01e91e6b2)
Quantization
- Change the whitelist to allowlist (#41771, #41802)
- dequantize now supports list and tuple of tensors (#41079)
- Users now have a way to add an activation post process hook using the register_activation_post_process_hook function (#42342)
- add/mul now support different variants (#42769)
- Fake quantizer now has more info when printed (#43031)
- OP_LIST_TO_FUSER_METHOD is exposed to the user (#43286)
- quantize_jit can handle new upsample overloads (#43407)
- Setter/getter method for quantization and fusion mappings (#43990)
- fake_quant and observer can be disabled in scriptmodule (#44773)
- convert_jit can now take a preserved_attrs argument (#44490)
- SyncBN: preserve qconfig if it exists (#45317)
- Add quant APIs to save/load observer state_dict (#44846)
- Add version support for the conv parameters (#43524, #43086, #43651, #44671)
ONNX
In PyTorch 1.7, we have continued to add and improve PyTorch operator export to ONNX. We have enabled export of 10 new operators, and further enhanced and optimized export of 10+ torch operators to ONNX. We have also focused on improving export of TorchScript modules, in particular laying some groundwork required for better support in the near future. We have also created an API (torch.onnx.utils._find_missing_ops_onnx_export) as a diagnostic tool (preview only) to get a list of operators in a model that are not supported or implemented by the ONNX exporter. Support for export of torch.quantization.FakeQuantize has also been added to help enable some QAT workflows.
-
Add support to export more torch ops
torch.view_as
(#40496), fake quantize functions (#39738), embedding_bag (#41234, #44693),torch.eye
(#41357),Tensor.as_strided
(#41569),torch.tensor
(#41872), addition between list of tensors (#41888),Tensor.__floordiv__
(#43022),torch.nn.KLDivLoss
(#41858),Tensor.new_empty
andTensor.new_zeros
(#43506) -
Improves existing export logic and optimizing exported ONNX graph
- Add warning in ONNX export when constant folding is on in training-amenable mode (#40546)
- Fix export of
torch.full_like
(#40063) - Add pass that fuses Conv and BatchNormalization (#40547)
-
torch.where
export, add support for ByteTensor (#42264) - Fix scalar type cast for comparison ops (#37787)
-
torch.scatter
export, add support for src being scalar or different dtype (#42765, #43440) - Fix Squeeze operator when applied to a dimension with shape > 1 (#38476)
- Extend support for
torch.where
(#41544) - Update ops
torch.slice
(#42935),torch.split
(#43670),torch.repeat
(#43430),torch.arange
(#43777),len
(#43824),torch.narrow
(#44039), flatten (#40418), adaptive_pool (#46100)
-
Update export to follow pytorch changes
Misc
-
torch.utils.collect_env
: Collect more informations (python 32/64bit, clang version, CPU architecture, ROCm version) (#42887, #42961, #44106) -
torch.hub.load_local
: Allow to load models from any local directory (#44204) - Add warning if
import torch
is called from the source root (#39995) - Improve Dynamic Library loading for Windows (#40365)
- serialization: validate sparse tensors after loading (#34059)
- Add
--continue-through-error
option to run_test.sh script (#41136) - Tensorboard: Support custom
run_name
and ``hparam_domain_discretein
add_hparams` (#40660, #40720) - MKLDNN: Enable conv3d, batchnorm3d, max_pool3d and avg_pool3d (#40691, #40995, #40996)
- Profiler: Do not record zero duration kernel events (#41540)
- Profiler: Improve cuda time counting (#45209)
- Profiler: Adding
with_source
parameter to enable tracking source code (#43898) - Optim: Add verbose param for all schedulers (#41580)
- Pruning: check attributes before deleting (#41913)
- Autograd: In
zero_grad
, avoid using inpalcedetach
when it is not required (#41283) - Autograd: Update the
torch.div
backward formula to improve numerical stability (#43627) - Autograd: Print all traceback for higher order backwards in detect_anomaly (#43626)
- Autograd: Stop saving input of
torch.repeat
as onlyinput.dim()
is needed in backward (#40766) - CUDA: Improve cuDNN error messages to include call parameters (#45023)
- CUDA: Improve
device_count
and cuda init error detection and messages (#42249) - Improve Tensor layout propagation for pointwise ops to follow input layout more closely (#42922)
- Remove blacklist/whitelist references (#41447, #41644, #41636, #41777, #41822, #41691, #41789, #41979, #41627, #42011, #41796, #42067, #42091, #42097, #42071, #42089, #42279, #42047, #42088, #45260)
Python Type Annotations
- Update some types in top level
torch/*.py
(#40235, #40873) - Added typing for
Tensor
attributes and methods:T
andgrad_fn
(#40879),Tensor._version
(#41125),ndim
(#42909),nonzero
(#43053), #40499) - Added typing for
torch.serialization
(#40862) - Added typing for
torch.tensor
(#45077) - Added typing for
torch.Size
(#40879) - Added typing for
torch.futures
(#41675) - Added typing for
torch.random
(#42234) - Added typing for
torch.hub
(#42252) - Added typing for
collect_env.py
(#43062) - Added typing for
torch.utils
(#39392, #42647, #42711, #42960, #43806, #44136, #44216) - Added typing for
torch.nn
(#43044, #44093, #43080, #42231, #40669) - Added typing for
torch.sparse
(#43108) - Added typing for
torch.cuda.nvtx
(#43443) - Added typing for
torch.cuda.memory
(#43444) - Added typing for
torch.functional
(#43446) - Added typing for
torch.autograd
(#44451, #46206) - Added typing for
torch.quantization.fuse_modules
(#43786) - Added typing for
torch.nn.quantized
(#43186, #44154, #43110) - Added typing for
torch.testing._internal
submodules (#44575, #44805, #44832, #44911, #44927, #44985, #44971, #45107, #45368, #45375) - Added typing for
torch.backends.quantized
(#44794) - Added typing for
torch.backends.cuda
(#44916) - Added typing for
torch.cuda.{comm,nccl,amp}
(#45350, #45344, #45480) - Added typing for
torch.quasirandom
(#45434) - Fix typing for
jit.trace
andonnx.export
(#41093) - Fix typing for
torch/optim/lr_scheduler.pyi
(#41775, #41866)
Bug fixes
Python API
-
torch.linspace
: Fix step computation for large integral types (#40132) -
torch.pca_lowrank
: Fix un-expected memory consumption (#40853) -
torch.linspace
: Fix behavior for non-contiguous inputs on CPU (#41286) -
torch.div
: Fix division by low precision scalar (#41446) -
torch.expm1
: disable mkl as it produces wrong values in some cases (#41654) -
torch.utils.data.RandomSampler
: Stop generating samples one at a time when replacement=True (#41682) -
torch.nn.functional.grid_sample
: Fix 64-bit indexing (#41923) -
torch.nn.functional.grid_sample
: Fix crash whengrid
has NaNs (#42703) -
torch.det
: Fix on CPU (#35136) -
torch.interpolate
: Avoid zero division in cubic mode (#42093) -
torch.fmod
: Fix to work with zero divisors consistently (#41948) -
torch.masked_select
: Fix for discontiguous outputs (#41841) -
torch.cummin
,torch.cummax
: Fix for discontiguous inputs/outputs (#42507) -
torch.einsum
: Fix for discontiguous inputs (#42425) -
torch.orgqr
: Fix input size conditions (#42825) -
torch.manual_seed
: Fix argument unpacking (#42206) -
torch.searchsorted
: Properly mark output as non differentiable (#42933) -
torch.bucketize
: Properly mark output as non differentiable (#44102) -
torch.addmm
: Properly raise error on device mismatch (#43505) -
torch.chain_matmul
: Properly handle empty args (#43553) -
torch.multinomial
: Properly handle 0 size dim (#43775) -
torch.cholesky_solve
: Fix broadcast and error checking (#43137) -
torch.movedim
: Fix uniqueness check (#44307) -
torch.min
,torch.max
,torch.mean
: Properly throw error if dim is repeated (#44281) -
torch.lerp
: Fix for discontiguous outputs on CUDA (#44559) -
torch.addmv
,torch.mv
: Fix beta=0 case in slow path (#44681) -
torch.triangular_solve
: Fix error check on CPU (#44720) -
torch.empty_like
,torch.zeros_like
: Properly raise error if any memory format is provided with sparse input (#44058) -
torch.atan2
: Fix type promotion (#43466) -
torch.repeat
: Fix backward for 0 size repeats (#45212) -
torch.min
,torch.max
,torch.median
: Fix handling of nan in backward (#45280) -
torch.rdiv
: Properly make it consistent with div (#45407) -
torch.std
: Fix hanling of nan in backward (#45468) -
torch.distributions.Binomial
: Fix CUDA sampling at extreme points (#42702) -
torch.dot
,torch.vdot
: Add complex support (#45074) -
torch.pow
: Fix when scalar base is complex (#45259) -
torch.round
,torch.abs_
: Disable complex inputs (#45330) -
torch.svd
: Fix memory corruption for complex inputs (#45486) -
torch.view_as_complex
: Fix zero dimensional input (#44175) -
torch.kthvalue
: Fix for non-contiguous input (#46177) -
torch.save
: Fix python binding that could lead to out of bound read (#46207)
Torch.nn
-
nn.ModuleDict
: Fix input dict key ordering (#40905) -
nn.LayerNorm
: Fix handling ofgamma
in the backward whencreate_graph=True
(#41595) -
nn.functional.{max,avg}_pool{1,2,3}d
: Raise RuntimeError for zero stride (#41819) -
nn.Module
: Fix missing attribute when loading model from older version (#42290) -
nn.Embedding
: Raise proper error for 0-D weight (#42550) -
nn.SyncBatchNorm
: Fix forward pass for non-default process group (#43861) -
nn.functional.embedding_bag
: Fix for non-contiguous weight (#44032) -
nn.functional.upsample
: Add nondeterministic checks (df6ea62526) -
nn.GroupNorm
: Fix bug when input does not require_grad on CUDA (#44863) -
functional.{l1_loss,smoothl1_loss,mse_loss}
: Properly check that reduction strings are valid (#43527) -
functional.smoothl1_loss
: Properly raise error for negativebeta
values (#45759) -
functional.pad
: Fix extra memory allocation and invalid result for negative or zero pad when using circular padding (#39273)
C++ API
-
nn::MultiheadAttention
: Ensure all parameters are properly registered (#42037) -
Tensor::grad
: Fix the thread safety issues (#40887) -
Tensor::var
: Ensure thatvar(0)
does not call thevar(bool keepdim)
overload butvar(int dim)
(#40451)
Distributed
- Fix RPC and ProcessGroup GIL deadlock (#45088)
- Relax size check in flatten_for_scatter_gather (#40573)
- BAND, BOR and BXOR for NCCL all_reduce should throw runtime errors (#42669)
- Disallow creation of ProcessGroupNCCL without GPUs (#45642)
- Fix read/write of bulk data (#42504)
- Fix thread safety issue with distributed optimizers and TorchScript (#46071)
TorchScript
- Fix type annotations in select assignments (#40528)
- Fix compilation issues with GCC-5.4 (#41055, #41063, #43223)
- Fix JIT not round to even if constant is folded (#40897)
- Fix
torch.jit.freeze
import (#42319) - Fix
List[str].index
(#40348) - Fix
torch.jit.is_tracing()
so that it is correctly called rather than returning the method itself (#42486) - Fix Str -> Device implicit conversions (#43213)
- Fix
NaN
propagation in fuser's min/max implementation (#43590) - Cast return values of functions returning Any (#42259)
- Fix
NaN
propagation in TensorExpression fuser's min/max implementation (#43609) - Fix segfault in attribute lookup on loaded
ScriptModules
(#43284) - Fix casting of
unsigned char
, andabs(int)
(#44157) - Fix frac in CUDA fuser (#44152)
- Fix model_name not logged properly issue. (#45488)
- Fix
len
,contains
,getitem
inherited from interface class derived from nn container (#40789) - Fix support for FP16 in CudaCodgen (#44209)
- Fix
torch.tensor
for empty multidimensional-typed lists (#44652) - Fix freeze_module pass for sharedtype (#42457)
- Correctly clone schema in
insert_observers
(#40624) - Fix value association with dictionaries in the tracer (#40885)
- Fix preserve submodule attribute in freezing (#45143)
- Fix Half conversion of immediates in NNC Cuda backend (#45213)
- Fix a bug in
SplitWithMask
when splitting multiple times (#45141) - Fix inlining interface call in fork subgraph (#43790)
- Fix operator order in combineMultilane in TensorExpr fuser(#45157)
- Correctly mark Tensor types inferred from empty annotation as
inferred=True
(#45360) - Fix some bugs in Round+Mod simplification in NNC (#42934)
- Fix
set_grad_enabled
scripted version (#46060) - Fix for
dict.update()
scripted version (#46105) - Fix segfault when scripting nested classes (#46422)
- Fix memory leak in Profiling Mode (#46621)
Quantization
- Resolved namespace conflict in qnnpack for init_win symbol (a7e09b8727)
- Fix linking of qnnpack params on windows. (#40920)
- Adding zero point type check for per channel quantization (#40811)
- Remove activation_post_process in qat modules (#42343) (#43015)
-
qlinear_dynamic
: Fix ASAN error in QNNPACK's integration. (#41967) - Change quantizer to account for input tensor's memory format. (#42178)
- Fixing the output shape for the linear (#44513)
- Ensure observers and fq modules are scriptable (#44749)
- histogram observer: ensure buffer shape consistency (#44956)
- Attach qconfig to all modules (#42576)
- Fix qnnpack quantized activations for NHWC memory format (#46217)
ONNX
- Fix crash when exporting a model with
nn.Sequential
(#19227) - Fix default
ignore_index
for nll loss (#44816) - Rename Black to Block for various files (#42913)
- Fix bug in
onnx::SsaRewrite
(#42148)
Misc
- Fix torch.hub for new zipfile format (#42333)
- Preserve python backtrace in autograd engine errors (#43684)
- optim.SparseAdam: Fix check that params are dense on init (#43668)
- Fix clang build (#44934)
- nn::MultiheadAttention: Fix parameter registration (#42037)
- MaxPool2D: Fix memory leak for XNNPACK (#41874)
- Fix NumPy scalar detection for bool and complex types (#43644)
- Add missing file to BUILD.bazel (#40536)
- autograd.gradcheck: Add support for complex (#43208)
- Fix bug in mobile-specific CPU caching allocator (#43719)
Performance
Python API
- torch.{view_as_complex,view_as_real}: Remove unnecessary temporary Tensor (#44908)
- tensorboard.SummaryWriter.add_audio: Remove unnecessary for loops (#44201)
- Conv2d and Conv3d: bypass the im2col for 1x1 conv (#40324)
- Fix max_pool2d perf regression (#41174)
- Disable the mkldnn for conv2d in some special cases (#40610)
- addmm: Reduce constant time overhead (#41374)
- cumsum, cumprod: Enable non-synchronizing cub scan for cum* operations (#42036)
- max_pool2d: CUDA NCHW performance improvement (#42182)
- arange: Vectorize CPU implementation (#38697)
- istft: optimize by using col2im (#42826)
- LayerNorm: improved performance on CPU both forward and backward (#35750)
- silu: improved performance (#42976)
- addmv: improved performance for zero sized input cases (#41824)
- Mobile: Simple caching allocator for CPU (#42006)
- MaxPool1d: improved performance for cases without indices (#43745)
- adaptive_avg_pool2d: optimized code path for cases when output size is (1, 1) (#44211)
- Vectorized complex copy (#44722)
- cat: optimized cuda kernel (#44833)
- Vectorized int8_t on CPU (#44759)
- Vectorized bitwise_not (#45103)
- Added stateful XNNPack deconvolution2d operator to torch (#43233)
- Enabled mkldnn dilation convolution (#40483)
Distributed
- Skip allreducing local_used_maps_dev_ when find_unused_param=False in DDP to improve performance (#40407); see the usage sketch after this list
- Remove unnecessary copies in ProcessGroupGloo for multiple inputs allreduce (#43543)
- Add option to run NCCL operations on high priority cuda stream (#43796)
- Enhance DistributedOptimizer to be functional and torchscriptable to avoid GIL and global lock (#45221)
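The DDP item above concerns the find_unused_parameters flag of DistributedDataParallel (abbreviated find_unused_param in the note); leaving it at its default of False lets DDP skip the extra allreduce of the unused-parameter map. A minimal sketch, assuming a process group has already been initialized and local_rank is supplied by the launcher:
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes torch.distributed.init_process_group(...) has already run and
# local_rank is provided by the launcher (e.g. torch.distributed.launch).
model = nn.Linear(16, 4).to(local_rank)
ddp_model = DDP(
    model,
    device_ids=[local_rank],
    find_unused_parameters=False,  # default; enables the skipped allreduce described above
)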
TorchScript
- JIT pass for add relu fusion. (#39343)
- Optimize autodiff subgraph slicing (#41437)
- Don't re-run CSE on every block (#41479)
- Add loop unroll optimization in NNC (#42465)
- Speed up CUDA kernel launch when block/thread extents are statically known (#42899)
- Support merging adjacent fusion groups in TensorExpression Fuser. (#43671)
- Add passes to profiling executor pipeline (#43636)
- Improve performance of KernelSumMultipleAxes (#43905)
- Latency improvements for pointwise + reduction fusion (#45218)
- Add simplification of Loop + Condition patterns in NNC (#44764)
- Fix fallback graph in specialize autogradzero (#44654)
- Fix masking for all block and thread dimensions in CudaCodeGen (#44733)
- Improve performance of simple reduction and softmax in nvFuser (#40864)
- Add a new optimization pass, the Registerizer, which looks for common Stores and Loads to a single item in a buffer and replaces them with a local temporary scalar which is cheaper to write (#42606); a conceptual sketch follows this list
- Fuse identical conditions in NNC simplifier (#44886)
- Add _out variants and reuse memory in static runtime (#44128)
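To picture what the Registerizer does, here is a conceptual Python sketch (not actual NNC IR; the function names are illustrative only): repeated loads and stores to a single buffer element inside a loop are replaced by a cheap local scalar that is written back once.
# Before the pass: every iteration loads and stores out[0].
def sum_into_buffer(a, out):
    out[0] = 0
    for i in range(len(a)):
        out[0] = out[0] + a[i]

# After registerization: the loop works on a local temporary ("register")
# and performs a single store back to the buffer.
def sum_into_buffer_registerized(a, out):
    acc = 0
    for i in range(len(a)):
        acc = acc + a[i]
    out[0] = acc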
Mobile
- Add add_relu fusion pass to optimize_for_mobile. (#40252)
- optimize_for_mobile: bring packed params to root module (#42740)
- Apply selective build on RNN operators (#44132)
- Add neon backend for vectorization (#39341)
Quantization
- Use the _min_max function instead of two separate calls for min and max (#41570, #42957, #44537)
- Improve performance of the QNNPACK kernels (#41342, #42007, #42008)
- Speed up HistogramObserver by vectorizing critical path (#41041)
- Speed up AdaptivePool3d by checking if input is ChannelsLast or ChannelsLast3d (#42780)
- observers: use clamp instead of min/max in calculate_qparams (#43150)
- observers: use torch.all to check for valid min and max values (#43151)
- Avoid resizing in MinMaxObserver (#43789)
- observers: make eps a buffer (#43149)
Misc
- ROCm: Fix performance issues with torch.cat (#46323)
Documentation
Python API
- Numerous typo and grammatical improvements (#39854, #40217, #40285, #40544, #40692, #40617, #41025, #41031, #40984, #41066, #41203, #41263, #41384, #41526, #41563, #41632, #41643, #41599, #41799, #41679, #41835, #41851, #41963, #42016, #42076, #41946, #42046, #42065, #42236, #42184, #42734, #42923, #42891, #43063, #43131, #43395, #43588, #43583, #43697, #43779, #43569, #43893, #43695, #43973, #44667, #44753, #44740, #45045, #45192, #43308, #40334)
- Remove use of term “blacklist” (#41450)
- Add overflow notice for cuFFT on half precision (#40551)
- Add complex Note (#41012, #41252, #40450)
- Add documentation about data sharing for Tensors during serialization (#40412)
- Add nn.Module.training to docs (#40923)
- nn.CrossEntropyLoss: Clarify that the mean argument is weighted (#40991)
- torch.scatter_: Update doc with support for reduction methods (#40962)
- Fix HTTP links in documentation to HTTPS (#40878)
- Fix warnings when building docs (#41068, #41334, #41335, #44686)
- Add PyTorch Glossary (#40639)
- Fix documentation references following page split (#39086)
- Update serialization note to explain versioned symbols and dynamic versioning (#41395)
- Make elementwise comparison docs more consistent (#41626)
- Update CONTRIBUTING.md to explain how to use ccache (#41619)
- Add doc warning for LSTM non-deterministic behavior (#40893)
- Document default dim for cross being None (#41850)
- Clarify Python 3.6 is the minimum supported version in the installation section. (#41937)
- Split quantization subsection into smaller pages (#41321)
- Documentation for torch.optim.swa_utils (#41228)
- Improve the documentation of DistributedDataParallel (#42471)
- Update docs about CUDA stream priority (#41364)
- Update the documentation for torch.scatter to include the streams parameter (#42814)
- Update Tensor.clone doc (#42931, #43098)
- Update external links in the README.md (#43100)
- Update torch.Tensor.is_set_to documentation (#43052)
- Polish the nightly pull docs in CONTRIBUTING (#43494)
- Update the torch.qr documentation to include a warning about when QR.backward is well-defined (#43547)
- Update the instructions to build from source on Windows (#43479, #45553)
- Document the beta=0 behavior of BLAS functions (#43823)
- Fix docs for kwargs-only functions (#43586, #43589)
- Document torch.sub properly, add the torch.subtract alias (#43850)
- Update determinism documentation (#41692)
- Update instructions to build (#42850)
- Clarify nn.Batchnorm track_running_stats docs (#44445)
- Fix latex error in torch.heaviside docs (#44481)
- Update torch.median doc to explain returned value for even-sized input (#44562); see the example after this list
- Fix the nn.ELU formula in the docs (#43764)
- torch.min, torch.max: remove incorrect warning from docs (#44615)
- Reference torch.cuda.amp tutorial from core amp docs (#44725)
- Mention TF32 on related docs (#44690)
- Clarify that 5-D 'bilinear' grid_sample is actually trilinear (#45090)
- Update linalg warning + docs (#45415)
- Update torch.floor_divide documentation to clarify that it actually performs truncation division (#45411)
- Update torch.fft doc and make warning clearer (#45409)
- Update for complex autograd (#45270, #46281)
- Update nn.Flatten docs (#42084)
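To illustrate the torch.median behavior spelled out in the update above: for an input with an even number of elements, the lower of the two middle values is returned rather than their average.
>>> import torch
>>> t = torch.tensor([1., 2., 3., 4.])
>>> torch.median(t)
tensor(2.)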
Distributed
- Add a CONTRIBUTING.md for the distributed package. (#44224)
- Added docs for Store API (#45543)
- Add all_gather_object and gather_object documentation (#43772); a short usage sketch follows this list
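As a quick orientation for the newly documented object collectives, here is a minimal sketch of all_gather_object; it assumes the default process group has already been initialized on every rank and that the gathered objects are picklable:
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already been called by each process.
outputs = [None for _ in range(dist.get_world_size())]
dist.all_gather_object(outputs, {"rank": dist.get_rank()})
# outputs now holds one Python object contributed by each rank.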
TorchScript
- Fix torch.jit.trace_module documentation (#40248)
- Fix the docs for the inputs arg of torch.jit.trace_module (#41586)
- Add documentation for PYTORCH_JIT_TYPE_VERBOSITY (#42241)
- Grammatical corrections in JIT overview (#43473)
- Update docs for recently added JIT features, including Enum support, torch.no_grad, etc. (#45232); a small Enum example follows this list
- Add function signature for pixel_shuffle (#45661)
- Fix signature for torch.poisson in documentation (#45656)
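As a pointer to the newly documented Enum support, here is a minimal sketch of scripting a function that takes a Python Enum (the Color enum is illustrative only):
from enum import Enum
import torch

class Color(Enum):
    RED = 1
    GREEN = 2

@torch.jit.script
def is_red(c: Color) -> bool:
    # Enum values can be passed to and compared inside scripted functions.
    return c == Color.RED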
Mobile
- AAR native linking: add fbjni (#40578)
- Fix scripts (#44464)
- [PyTorch Mobile] Move some string ops to register_prim_ops.cpp and make them selective (#44500)
Quantization
- Fix several quantization documentation typos (#40567, #43693)
- Add an API summary section (#45848)
- Documentation for dynamically quantized RNN cells (#40896)
Misc
- Update ONNX docs for release (#45086)