
v1.6.0

pytorch/pytorch

Release date: 2020-07-29 01:13:18


PyTorch 1.6.0 Release Notes

Highlights

The PyTorch 1.6 release includes a number of new APIs, tools for performance improvement and profiling, as well as major updates to both distributed data parallel (DDP) and remote procedure call (RPC) based distributed training.

A few of the highlights include:

  1. Automatic mixed precision (AMP) training is now natively supported and a stable feature - thanks to NVIDIA’s contributions;
  2. Native TensorPipe support now added for tensor-aware, point-to-point communication primitives built specifically for machine learning;
  3. New profiling tools providing tensor-level memory consumption information; and
  4. Numerous improvements and new features for both distributed data parallel (DDP) training and the remote procedure call (RPC) packages.

Additionally, from this release onward, features will be classified as Stable, Beta and Prototype. Prototype features are not included as part of the binary distribution and are instead available through either building from source, using nightlies or via compiler flag. You can learn more about what this change means in the post here.

[Stable] Automatic Mixed Precision (AMP) Training

AMP allows users to easily enable automatic mixed precision training, which can deliver higher performance and memory savings of up to 50% on Tensor Core GPUs. Using the natively supported torch.cuda.amp API, AMP provides convenience methods for mixed precision, where some operations use the torch.float32 (float) datatype and other operations use torch.float16 (half). Some ops, like linear layers and convolutions, are much faster in float16. Other ops, like reductions, often require the dynamic range of float32. Mixed precision tries to match each op to its appropriate datatype.
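
Below is a minimal training-loop sketch using torch.cuda.amp; the tiny linear model, optimizer, and random data are placeholders for illustration only.

import torch

# A tiny placeholder model and data; the AMP pattern itself is what matters here.
model = torch.nn.Linear(16, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    inputs = torch.randn(8, 16, device='cuda')
    targets = torch.randint(0, 4, (8,), device='cuda')
    optimizer.zero_grad()
    # Run the forward pass under autocast so eligible ops execute in float16.
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(inputs), targets)
    # Scale the loss to avoid float16 gradient underflow, then step and update the scale.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()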

[Beta] TensorPipe backend for RPC

PyTorch 1.6 introduces a new backend for the RPC module which leverages the TensorPipe library, a tensor-aware point-to-point communication primitive targeted at machine learning, intended to complement the current primitives for distributed training in PyTorch (Gloo, MPI, ...) which are collective and blocking. The pairwise and asynchronous nature of TensorPipe lends itself to new networking paradigms that go beyond data parallel: client-server approaches (e.g., parameter server for embeddings, actor-learner separation in Impala-style RL, ...) and model and pipeline parallel training (think GPipe), gossip SGD, etc.

# One-line change needed to opt in
torch.distributed.rpc.init_rpc(
    ...
    backend=torch.distributed.rpc.BackendType.TENSORPIPE,
)

# No changes to the rest of the RPC API
torch.distributed.rpc.rpc_sync(...)

[Beta] Memory Profiler

The torch.autograd.profiler API now includes a memory profiler that lets you inspect the tensor memory cost of different operators inside your CPU and GPU models.

Here is an example usage of the API:

import torch
import torchvision.models as models
import torch.autograd.profiler as profiler

model = models.resnet18()
inputs = torch.randn(5, 3, 224, 224)
with profiler.profile(profile_memory=True, record_shapes=True) as prof:
    model(inputs)

# NOTE: some columns were removed for brevity
print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))
# ---------------------------  ---------------  ---------------  ---------------
# Name                         CPU Mem          Self CPU Mem     Number of Calls
# ---------------------------  ---------------  ---------------  ---------------
# empty                        94.79 Mb         94.79 Mb         123
# resize_                      11.48 Mb         11.48 Mb         2
# addmm                        19.53 Kb         19.53 Kb         1
# empty_strided                4 b              4 b              1
# conv2d                       47.37 Mb         0 b              20
# ---------------------------  ---------------  ---------------  ---------------

Distributed and RPC Features and Improvements

[Beta] DDP+RPC

PyTorch Distributed supports two powerful paradigms: DDP for full sync data parallel training of models and the RPC framework which allows for distributed model parallelism. Previously, these two features worked independently and users couldn’t mix and match them to try out hybrid parallelism paradigms.

Starting with PyTorch 1.6, we’ve enabled DDP and RPC to work together seamlessly so that users can combine these two techniques to achieve both data parallelism and model parallelism. An example is where users would like to place large embedding tables on parameter servers and use the RPC framework for embedding lookups, but store smaller dense parameters on trainers and use DDP to synchronize the dense parameters. Below is a simple code snippet.

# On each trainer

remote_emb = create_emb(on="ps", ...)
ddp_model = DDP(dense_model)

for data in batch:
    with torch.distributed.autograd.context():
        res = remote_emb(data)
        loss = ddp_model(res)
        torch.distributed.autograd.backward([loss])

[Beta] RPC - Asynchronous User Functions

RPC asynchronous user functions support the ability to yield and resume on the server side when executing a user-defined function. Prior to this feature, when a callee processed a request, one RPC thread waited until the user function returned. If the user function contained IO (e.g., a nested RPC) or signaling (e.g., waiting for another request to unblock), the corresponding RPC thread would sit idle waiting for these events. As a result, some applications had to use a very large number of threads and send additional RPC requests, which could potentially lead to performance degradation. To make a user function yield on such events, applications need to: 1) Decorate the function with the @rpc.functions.async_execution decorator; and 2) Let the function return a torch.futures.Future and install the resume logic as callbacks on the Future object. See below for an example:

@rpc.functions.async_execution
def async_add_chained(to, x, y, z):
    return rpc.rpc_async(to, torch.add, args=(x, y)).then(
        lambda fut: fut.wait() + z
    )

ret = rpc.rpc_sync(
    "worker1",
    async_add_chained,
    args=("worker2", torch.ones(2), 1, 1)
)

print(ret)  # prints tensor([3., 3.])

[Beta] Fork/Join Parallelism

This release adds support for a language-level construct as well as runtime support for coarse-grained parallelism in TorchScript code. This support is useful for situations such as running models in an ensemble in parallel, or running bidirectional components of recurrent nets in parallel, and unlocks the computational power of parallel architectures (e.g., many-core CPUs) for task-level parallelism.

Parallel execution of TorchScript programs is enabled through two primitives: torch.jit.fork and torch.jit.wait. In the below example, we parallelize execution of foo:

import torch
from typing import List

def foo(x):
    return torch.neg(x)

@torch.jit.script
def example(x):
    futures = [torch.jit.fork(foo, x) for _ in range(100)]
    results = [torch.jit.wait(future) for future in futures]
    return torch.sum(torch.stack(results))

print(example(torch.ones([])))

Backwards Incompatible Changes

Dropped support for Python <= 3.5 (#39879)

The minimum version of Python we support now is 3.6. Please upgrade your Python to match. If you use conda, instructions for setting up a new environment with Python >= 3.6 can be found here.

Throw a RuntimeError for deprecated torch.div and torch.addcdiv integer floor division behavior (#38762, #38620)

In 1.5.1 and older PyTorch releases, torch.div, torch.addcdiv, and the / operator performed integer floor division. In 1.6, attempting to perform integer division throws a RuntimeError, and in 1.7 the behavior will change so that these operations always perform true division (consistent with Python and NumPy division).

To floor divide integer tensors, please use torch.floor_divide instead.

Version 1.5.1:

>>> torch.tensor(3) / torch.tensor(2)
../aten/src/ATen/native/BinaryOps.cpp:81: UserWarning: Integer
division of tensors using div or / is deprecated, and in a future
release div will perform true division as in Python 3. Use true_divide
or floor_divide (// in Python) instead.
tensor(1)

Version 1.6.0:

>>> # NB: the following is equivalent to
>>> # torch.floor_divide(torch.tensor(3), torch.tensor(2))
>>> torch.tensor(3) // torch.tensor(2)
tensor(1)

The fix for torch.addcdiv is similar.

Version 1.5.1:

>>> input = torch.tensor(0)
>>> tensor = torch.tensor(1)
>>> other = torch.tensor(3)
>>> value = 1
>>> torch.addcdiv(input, tensor, other, value=value)
../aten/src/ATen/native/PointwiseOps.cpp:81: UserWarning:
Integer division with addcdiv is deprecated, and in a future
release addcdiv will perform a true division of tensor1 and
tensor2. The current addcdiv behavior can be replicated using
floor_divide for integral inputs (self + value * tensor1 // tensor2)
and division for float inputs (self + value * tensor1 / tensor2).
The new addcdiv behavior can be implemented with
true_divide (self + value * torch.true_divide(tensor1, tensor2).
tensor(0)

Version 1.6.0:

>>> input = torch.tensor(0)
>>> tensor = torch.tensor(1)
>>> other = torch.tensor(3)
>>> value = 1
>>> (input + torch.floor_divide(value * tensor, other))
tensor(0)

Prevent cross-device data movement for zero-dimension CUDA tensors in binary pointwise PyTorch operators (#38998)

In previous versions of PyTorch, zero dimensional CUDA tensors could be moved across devices implicitly while performing binary pointwise operations (e.g. addition, subtraction, multiplication, division, and others). For example,

torch.tensor(5, device='cuda:0') + torch.tensor((1, 1), device='cuda:1')

would work, even though the tensors are on different CUDA devices. This is a frequent source of user confusion, however, and PyTorch generally does not move data across devices without it being explicit. This functionality is removed in PyTorch 1.6.

To perform binary pointwise operations on data of different devices, please cast the tensors to the correct device by using Tensor.to:

Version 1.5.1:

>>> torch.tensor(5, device='cuda:0') + torch.tensor((1, 1), device='cuda:1')
torch.tensor([6, 6], device='cuda:1')

Version 1.6.0:

>>> torch.tensor(5, device='cuda:0').to('cuda:1') + torch.tensor((1, 1), device='cuda:1')
torch.tensor([6, 6], device='cuda:1')

Dropped support for CUDA 9.2 on Windows

In previous versions of PyTorch, we provided an installation option for Windows environments running CUDA 9.2. Starting from PyTorch 1.6.0, we are no longer providing those binaries. Please upgrade your CUDA version to 10.1 or 10.2 and install a PyTorch binary for one of those CUDA versions instead.

PyTorch release binaries dropped dedicated bytecode for CUDA compute capability 6.1; removed PTX for CUDA compute capability 3.7

To check whether you are affected, please find your GPU in the table at this link.

If you are using an Nvidia GPU with compute capability 6.1, you may notice a performance hit when using the release binaries (installed via pip or conda). We stopped building for CUDA compute capability 6.1, but PyTorch programs should still continue to work with those devices. If you do notice a performance hit, a workaround is to compile PyTorch from source.

If you are using an Nvidia GPU with compute capability 3.7 and relied on PTX, we have dropped support for that in our release binaries (installed via pip or conda). Potential workarounds are to install a previous version of PyTorch or to compile PyTorch from source.

Changed how bool tensors are constructed from non-bool values to match Python, C++, and NumPy (#38392)

In previous versions of PyTorch, when a bool tensor was constructed from a floating-point tensor, we would first convert the tensor to a long tensor, and then to a bool tensor. This is not consistent with how bools are interpreted in Python, C++, and NumPy (just to name a few), which interpret floating-point zero as False and everything else as True.

If you were relying on the previous behavior, the following code will achieve the same effect.

Version 1.5.1:

>>> torch.tensor([-2, -1, -0.9, 0, 0.9, 1, 2], dtype=torch.bool)
tensor([ True,  True, False, False, False,  True,  True])

Version 1.6.0:

>>> torch.tensor([-2, -1, -0.9, 0, 0.9, 1, 2]).long().bool()
tensor([ True,  True, False, False, False,  True,  True])

Throw RuntimeError when torch.full would infer a float dtype from a bool or integral fill value (#40364)

In PyTorch 1.6, bool and integral fill values given to torch.full must set the dtype or out keyword argument. In prior versions of PyTorch these fill values would return float tensors by default, but in PyTorch 1.7 they will return a bool or long tensor, respectively. The documentation for torch.full has been updated to reflect this.
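
For example, a minimal sketch of passing the now-required explicit dtype (the shapes and fill values are arbitrary):

import torch

# In 1.6, a bool or integral fill value without an explicit dtype (or out=) raises a RuntimeError.
# Passing dtype makes the intent explicit and is forward compatible with the 1.7 behavior.
t = torch.full((2, 2), 7, dtype=torch.long)
b = torch.full((2, 2), True, dtype=torch.bool)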

Enabled thread parallelism for autograd on CPU (#33157)

In previous versions of PyTorch, running .backward() in multiple threads caused them to be serialized in a specific order, resulting in no parallelism on CPU. In PyTorch 1.6.0, running .backward() in multiple threads no longer serializes the execution; instead, autograd will run those calls in parallel.

This is BC-breaking for the following two use cases:

In more detail, in 1.6.0, when you run backward() or grad() via python, TorchScript or the C++ API in multiple threads on CPU, you should expect to see extra concurrency. For example, you can manually write multithreaded Hogwild training code like:

import threading

import torch

# Define a train function to be used in different threads
def train_fn(model, input):
    # forward
    y = model(input)
    # backward
    y.sum().backward()
    # potential optimizer update

# define your model in python or in TorchScript
model = Model()
# Users write their own threading code to drive the train_fn
threads = []
for _ in range(10):
    # define or load the data
    input = torch.ones(5, 5, requires_grad=True)
    p = threading.Thread(target=train_fn, args=(model, input))
    p.start()
    threads.append(p)

for p in threads:
    p.join()

Note that when you use the same model and call backward() concurrently in multiple threads, model parameters are automatically shared across threads. The gradient accumulation might become non-deterministic because two backward calls might access and accumulate into the same .grad attribute. Although we do proper locking to avoid data corruption, we don't guarantee the order in which the ops are executed, so non-determinism might arise; this is an expected pattern in multithreaded training. To avoid this non-determinism, you can use the functional API torch.autograd.grad() to calculate the gradients instead of backward().
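
Below is a small sketch of that functional API; the tensors are placeholders:

import torch

x = torch.randn(4, requires_grad=True)
w = torch.randn(4, requires_grad=True)
loss = (x * w).sum()

# Gradients are returned instead of being accumulated into .grad, so concurrent
# callers do not race on the same .grad attribute.
gx, gw = torch.autograd.grad(loss, (x, w))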

For thread safety:

Change autograd gradient accumulation logic to yield .grads that match the weights' memory layout (#40358)

In previous versions of PyTorch, autograd would yield contiguous gradients. Now, gradients have the same memory layout as their respective weights. This should result in silent performance improvements. Since PyTorch operators generally support non-contiguous tensors, this should have no functional effect on most PyTorch programs. A known exception is when accessing param.grad and performing an operation that requires a contiguous tensor, such as param.grad.view(-1). In this case, you will receive an error as follows: RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

If a user wants to force accumulation into a grad with a particular layout, they can preset param.grad to a zeroed tensor with the desired strides, or manually convert the existing grad to the desired memory format (e.g., param.grad = param.grad.contiguous(memory_format=...)).
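
As an illustrative sketch (the convolution-style weight below is only an example, not taken from these notes), presetting the gradient layout might look like:

import torch

# A 4D conv-style weight stored in channels_last memory format.
param = torch.nn.Parameter(torch.randn(8, 3, 3, 3).to(memory_format=torch.channels_last))

# Preset .grad to a zeroed tensor with the layout we want accumulation to keep.
param.grad = torch.zeros_like(param, memory_format=torch.contiguous_format)

# After backward(), code that needs a flat view should prefer reshape over view:
# flat = param.grad.reshape(-1)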

See the below section on “Note: BC-breaking memory format changes” for more details.

Change memory format promotion rules of pointwise operators (#37968)

In previous versions of PyTorch, performing a binary pointwise operation between a Contiguous and a Channels Last tensor produced a Channels Last tensor. In PyTorch 1.6, these operations now return a tensor with the memory format of the first operand.

See the below section on “Note: BC-breaking memory format changes” for more details.

Note: BC-breaking memory format changes

Operations that now return tensors in a different memory format generally should have no functional effect on most PyTorch programs because PyTorch operators generally support non-contiguous tensors.

The most common incompatibility with Python programs is with the view operator, which has specific stride requirements. If these requirements are no longer met as a result of this change, you will get an error message indicating that you should use reshape instead, i.e. "RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead."

Another possible incompatibility is if you have a (usually C++) operator implementation that works directly on memory (i.e., calls data_ptr and relies on the strides being contiguous).

nn.functional.interpolate: recompute_scale_factor default behavior changed from True to False (#39453)

In PyTorch 1.5.1 and older versions, nn.functional.interpolate(input, size, scale_factor, ..., recompute_scale_factor) has a default of recompute_scale_factor = True. In PyTorch 1.6, we’ve changed the default to recompute_scale_factor = False.

Depending on the precision of the scale_factor, this may result in an output tensor with different values than before. To retain the old behavior, simply change your code to use recompute_scale_factor = True.

More concretely, what recompute_scale_factor = True means is, if the user passes in a scale_factor:

  1. We will first compute the new output size; and
  2. Then, we will compute a new scale_factor by dividing the output size by the input size and sending it to an internal helper function.
  3. The new scale_factor is used in the interpolate computation but in some cases is different from the scale_factor the user passed in.

This behavior resulted in a loss of precision, so we deprecated it in PyTorch 1.5.0. In PyTorch 1.6 and onward, recompute_scale_factor defaults to False, which means the scale_factor the user passes in is used directly by the internal helper function.
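
A minimal sketch of opting back into the old behavior (the input tensor and scale factor are arbitrary examples):

import torch
import torch.nn.functional as F

x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

# PyTorch 1.6 default: the given scale_factor is used directly in the interpolation.
y_new = F.interpolate(x, scale_factor=2.3, mode='nearest')

# Pre-1.6 behavior: recompute the scale factor from the inferred output size.
y_old = F.interpolate(x, scale_factor=2.3, mode='nearest', recompute_scale_factor=True)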

out= arguments of pointwise and reduction functions no longer participate in type promotion (#39655)

In PyTorch 1.5 passing the out= kwarg to some functions, like torch.add, could affect the computation. That is,

out = torch.add(a, b)

could produce a different result than

torch.add(a, b, out=out)

This is because previously the out argument participated in the type promotion rules. For greater consistency with NumPy, Python, and C++, in PyTorch 1.6 the out argument no longer participates in type promotion, and has no effect on the computation performed.
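
Below is a small sketch of the 1.6 behavior; the tensors and dtypes are arbitrary examples:

import torch

a = torch.tensor([3], dtype=torch.int64)
b = torch.tensor([2], dtype=torch.int64)
out = torch.empty(1, dtype=torch.float64)

# In 1.6, the addition is computed from the dtypes of a and b alone (int64);
# the result is then cast into the float64 `out` tensor.
torch.add(a, b, out=out)
print(out)  # tensor([5.], dtype=torch.float64)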

Changed torch.quasirandom.SobolEngine(..., scramble=True, seed=None) to respect torch.manual_seed when a seed has not been provided (#36427)

In previous versions of PyTorch, SobolEngine(..., scramble=True, seed=None) did not respect any calls to torch.manual_seed. The expected behavior for random number generation functions is to respect the seed set by torch.manual_seed, so we’ve changed SobolEngine to match.

If you were relying on the old behavior where SobolEngine ignores torch.manual_seed, please explicitly pass a different seed to SobolEngine:

Version 1.5.1:

>>> torch.manual_seed(1337)
# SobolEngine ignores the manual_seed and instead uses its own.
>>> x1 = SobolEngine(dimension=1, scramble=True, seed=None).draw(3)

Version 1.6.0:

>>> import time
>>> torch.manual_seed(1337)
# To replicate the old behavior, pass a seed to SobolEngine.
>>> ms_since_epoch = int(round(time.time() * 1000))
>>> x1 = SobolEngine(dimension=1, scramble=True, seed=ms_since_epoch).draw(3)

Tensor.random_(from, to): Enforce check that from and to are within the bounds of the Tensor’s dtype (#37507)

In previous versions of PyTorch, to and from did not have to be within the bounds of the tensor’s dtype (this raised a warning). The behavior of random_ in that case can be unexpected. We are making this a hard error starting from PyTorch 1.6.0; please modify your code if you run into the error.

Version 1.5.1:

>>> tensor = torch.zeros(10, dtype=torch.uint8)
# 256 is the maximum value for `to` for `torch.uint8`
>>> tensor.random_(0, 257)
UserWarning: to - 1 is out of bounds for unsigned char.

Version 1.6.0:

>>> tensor = torch.zeros(10, dtype=torch.uint8)
# 256 is the maximum value for `to` for `torch.uint8`
>>> tensor.random_(0, 256)

Dropped support for CUDA < 9.2 for source builds (#38977, #36846)

If you build PyTorch from source, we’ve dropped support for using CUDA < 9.2 (run nvcc --version to check your CUDA version). Users who install PyTorch packages via conda and/or pip are unaffected.

DataLoader’s __len__ changed to return number of batches when holding an IterableDataset (#38925)

In previous versions of PyTorch, len(<instance of dataloader holding an IterableDataset>) would return the number of examples in the dataset. We’ve changed it to be the number of batches (i.e., the number of examples divided by the DataLoader’s batch_size) to be consistent with the computation of length when the DataLoader has a BatchSampler.
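
A small sketch of the new behavior; RangeDataset below is a hypothetical example dataset, not part of PyTorch:

import torch
from torch.utils.data import DataLoader, IterableDataset

class RangeDataset(IterableDataset):
    def __init__(self, n):
        self.n = n
    def __iter__(self):
        return iter(range(self.n))
    def __len__(self):
        return self.n

loader = DataLoader(RangeDataset(10), batch_size=4)
# In 1.6, len(loader) reports batches: ceil(10 / 4) == 3, not the 10 examples.
print(len(loader))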

torch.backends.cudnn.flags: deleted unused verbose flag (#39228)

The verbose flag did nothing, so we deleted it. If you were passing a value to flags for verbose, please remove it.

RPC

RpcBackendOptions takes float instead of timedelta for timeout argument to stay consistent with timeout types in other TorchScriptable RPC APIs.

# v1.5
rpc.init_rpc(
    "worker1",
    rank=0,
    world_size=2,
    rpc_backend_options=rpc.ProcessGroupRpcBackendOptions(
        num_send_recv_threads=16,
        rpc_timeout=datetime.timedelta(seconds=20)
    )
)

# v1.6
rpc.init_rpc(
    "worker1",
    rank=0,
    world_size=2,
    rpc_backend_options=rpc.ProcessGroupRpcBackendOptions(
        num_send_recv_threads=16,
        rpc_timeout=20  # seconds
    )
)

TorchScript

The Default Executor Is Rolled Back To Legacy (#41017)

We rolled back to the old fuser and the legacy executor in this release in order to recover some reported performance regressions. In future releases we plan to reach the same or better performance with a new redesigned executor and fuser.

In order to switch back to the executor used in the 1.5 release, one could use the following API:
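
A sketch of one way to do this, assuming the internal JIT flags that controlled the profiling executor at the time (these are private torch._C APIs and may change):

import torch

# Opt back into the profiling executor and fuser used by the 1.5 release.
# These are internal flags, so treat this as a best-effort sketch.
torch._C._jit_set_profiling_executor(True)
torch._C._jit_set_profiling_mode(True)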

Added dynamic versioning (#40279)

Note: this isn’t actually BC-breaking but we are listing it here because it is BC-Improving.

The PyTorch Team recommends saving and loading modules with the same version of PyTorch. Older versions of PyTorch may not support newer modules, and newer versions may have removed or modified older behavior. These changes are explicitly described in PyTorch’s release notes, and modules relying on functionality that has changed may need to be updated to continue working properly.

In this release, the historic behavior of torch.div and torch.full is preserved for models saved via torch.jit.save in previous versions of PyTorch. Modules saved with the current version of PyTorch will use the latest torch.div and torch.full behavior. See the notes above for the BC changes to those operators.

Internals

The following is a list of BC-breaking changes to some of PyTorch’s internal components.

Dispatcher C++ API has had some spring cleaning. This is still considered an “internal” API, but it is becoming more public facing as it stabilizes.

autograd.gradcheck and autograd.gradgradcheck: Added a new default-true argument check_undefined_grad (#39400)

Internally, in the autograd engine, we use a special undefined Tensor value to represent zero-filled gradients and expect backward functions and user-defined torch.autograd.Functions to gracefully handle those values. When check_undefined_grad is True (the default for PyTorch 1.6+), gradcheck/gradgradcheck test that the operation in question supports undefined output gradients. This may cause a previously succeeding gradcheck to fail.

You can turn the check off by setting check_undefined_grad to False. As long as autograd does not error out due to an undefined gradient in your model, then everything should be fine.

Version 1.5.1:

>>> torch.autograd.gradcheck(my_custom_function, inputs)
True

Version 1.6.0:

>>> # To keep the previous behavior
>>> torch.autograd.gradcheck(my_custom_function, inputs, check_undefined_grad=False)
True

[C++ API] Changed the TensorIterator API (#39803)

TensorIterator is an implementation detail for writing kernels that is exposed in our C++ API. We’ve modified how developers interact with TensorIterator, please see the Pull Request for more details.

Removed torch._min and torch._max (#38440)

torch._min and torch._max are undocumented and were intended to be an implementation detail; we expect very few users, if any at all, to be using them. We’ve deleted them in PyTorch 1.6.0. Please use torch.min/torch.max instead if you are using torch._min/torch._max.

Deprecations

Deprecated old torch.save serialization format (#39460, #39893, #40288, #40793)

We have switched torch.save to use a zip file-based format by default rather than the old Pickle-based format. torch.load has retained the ability to load the old format, but use of the new format is recommended.

Usage is as follows:

m = MyMod()
torch.save(m.state_dict(), 'mymod.pt') # Saves a zipfile to mymod.pt

To use the old format, pass the flag _use_new_zipfile_serialization=False

m = MyMod()
torch.save(m.state_dict(), 'mymod.pt', _use_new_zipfile_serialization=False) # Saves pickle

Fixed missing deprecation warning for Tensor.nonzero() (#40187)

Calling torch.nonzero(tensor) with one argument, or Tensor.nonzero() with no arguments (both of which rely on the default as_tuple=False), is deprecated and will be removed in a future version of PyTorch. Please specify the as_tuple argument explicitly.
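
A short sketch of passing as_tuple explicitly (the input tensor is arbitrary):

import torch

t = torch.tensor([0, 1, 0, 2])

# Pass as_tuple explicitly to avoid the deprecation warning.
idx_matrix = torch.nonzero(t, as_tuple=False)  # (num_nonzero, ndim) index matrix
idx_tuple = torch.nonzero(t, as_tuple=True)    # tuple of 1-D index tensors, one per dim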

New Features

Python API

New Utilities

New Operators

C++ API

[Beta] Complex Tensor support

The PyTorch 1.6 release brings beta-level support for complex tensors. The UX is similar to existing PyTorch tensors and the new complex-specific functionality is compatible with NumPy’s complex arrays. In particular, you’ll be able to create and manipulate complex tensors, interop with previously existing code that represented complex tensors as tensors of size (..., 2), and more.

While this is an early version of this feature, and we expect it to improve over time, the overall goal is to provide a NumPy-compatible user experience that leverages PyTorch’s ability to run on accelerators and work with autograd to better support the scientific computing and ML communities.

Please find the full documentation here.
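
A small sketch of the new complex tensor support; the values are chosen only for illustration:

import torch

z = torch.tensor([1 + 2j, 3 - 1j], dtype=torch.complex64)

print(z.real, z.imag)          # component views of the complex tensor
print(torch.view_as_real(z))   # interop with the older (..., 2) float representation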

Python API:

C++ API:

Distributed

Mobile

New operator registration API

PyTorch 1.6 has a new, pybind11-based operator registration API which replaces the torch::RegisterOperators() class.

Before:

static auto registry =
  torch::RegisterOperators("my_ops::warp_perspective", &warp_perspective);

After:

TORCH_LIBRARY(my_ops, m) {
  m.def("warp_perspective", warp_perspective);
}

You can read more about this API in the custom C++ operators tutorial or the reference documentation.

The new API was developed in PRs #35061, #35629, #35706, #36222, #36223, #36258, #36742, #37019. Internal code was ported to this API in #36799, #36800, #36389, #37834, #38014; you may find the code examples in these PRs helpful for your ports.

ONNX

In PyTorch 1.6, we have added support for ONNX Opset 12. We have also enhanced the export of torchvision models, such as FasterRCNN, MaskRCNN, and KeypointRCNN, to support dynamic input image sizes. Export support for several new ops has also been added. A new operator export mode, ONNX_FALLTHROUGH, has been added to the export API that allows exporting the model with non-standard ONNX operators. For large (> 2 GB) model export (using the external_data_format=True argument), we now support models with large tensor data in attributes (not just model parameters).
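
Below is a minimal export sketch; the torchvision model, input shape, and file name are placeholders:

import torch
import torchvision

model = torchvision.models.resnet18(pretrained=False).eval()
dummy = torch.randn(1, 3, 224, 224)

# Export with opset 12; ONNX_FALLTHROUGH keeps operators without a standard ONNX
# mapping as custom ops instead of failing the export.
torch.onnx.export(
    model,
    dummy,
    "resnet18.onnx",
    opset_version=12,
    operator_export_type=torch.onnx.OperatorExportTypes.ONNX_FALLTHROUGH,
)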

New ONNX operator support:

Quantization

New quantization operators:

RPC

TorchScript

Improvements

Python API

Python Type Annotations

AMD/ROCm

C++ API

Distributed

Distributions

Internals

ONNX

Operator Benchmark

Profiler

Quantization

RPC

TorchScript


Bug Fixes

Python API

AMD/ROCm

C++ API

Distributed

Internals

ONNX

Operator Benchmark

Profiler

Quantization

RPC

TensorBoard

TorchScript

Performance

Misc

Distributed

Mobile

Quantization

RPC

TorchScript

Documentation

C++ API

Distributed

Quantization

RPC

TorchScript
