pytorch/pytorch v0.4.0, released 2018-04-25

PyTorch 0.4.0 release notes

Table of Contents

Major Core changes

Here is a summary of the updates to the most important core features users will use daily.

Major Changes and Potentially Breaking Changes:

Improvements:

We wrote a migration guide that should help you transition your code to new APIs and style. Please read it if you have code in a previous version of PyTorch that you would like to migrate.


The contents of this section (Major Core changes) are included in the migration guide.

Merging Tensor and Variable classes

torch.autograd.Variable and torch.Tensor are now the same class. More precisely, torch.Tensor is capable of tracking history and behaves like the old Variable; Variable wrapping continues to work as before but returns an object of type torch.Tensor. This means that you don't need the Variable wrapper everywhere in your code anymore.

The type() of a Tensor has changed

Note also that the type() of a Tensor no longer reflects the data type. Use isinstance() or x.type() instead:

>>> x = torch.DoubleTensor([1, 1, 1])
>>> print(type(x)) # was torch.DoubleTensor
<class 'torch.Tensor'>
>>> print(x.type())  # OK: 'torch.DoubleTensor'
'torch.DoubleTensor'
>>> print(isinstance(x, torch.DoubleTensor))  # OK: True
True

When does autograd start tracking history now?

requires_grad, the central flag for autograd, is now an attribute on Tensors. Let's see how this change manifests in code.

autograd uses the same rules previously used for Variables. It starts tracking history when any input Tensor of an operation has requires_grad=True. For example,

>>> x = torch.ones(1)  # create a tensor with requires_grad=False (default)
>>> x.requires_grad
False
>>> y = torch.ones(1)  # another tensor with requires_grad=False
>>> z = x + y
>>> # both inputs have requires_grad=False. so does the output
>>> z.requires_grad
False
>>> # then autograd won't track this computation. let's verify!
>>> z.backward()
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
>>>
>>> # now create a tensor with requires_grad=True
>>> w = torch.ones(1, requires_grad=True)
>>> w.requires_grad
True
>>> # add to the previous result that has requires_grad=False
>>> total = w + z
>>> # the total sum now requires grad!
>>> total.requires_grad
True
>>> # autograd can compute the gradients as well
>>> total.backward()
>>> w.grad
tensor([ 1.])
>>> # and no computation is wasted to compute gradients for x, y and z, which don't require grad
>>> z.grad == x.grad == y.grad == None
True
Manipulating requires_grad flag

Other than directly setting the attribute, you can change this flag in-place using my_tensor.requires_grad_(requires_grad=True), or, as in the above example, at creation time by passing it in as an argument (default is False), e.g.,

>>> existing_tensor.requires_grad_()
>>> existing_tensor.requires_grad
True
>>> my_tensor = torch.zeros(3, 4, requires_grad=True)
>>> my_tensor.requires_grad
True

What about .data?

.data was the primary way to get the underlying Tensor from a Variable. After this merge, calling y = x.data still has similar semantics. So y will be a Tensor that shares the same data with x, is unrelated to the computation history of x, and has requires_grad=False.

However, .data can be unsafe in some cases. Any changes on x.data wouldn't be tracked by autograd, and the computed gradients would be incorrect if x is needed in a backward pass. A safer alternative is to use x.detach(), which also returns a Tensor that shares data with x and has requires_grad=False, but whose in-place changes will be reported by autograd if x is needed in backward.
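As an illustration (a minimal sketch, not from the original notes; exp() is used because its backward pass reuses the forward output):

import torch

a = torch.tensor([1., 2., 3.], requires_grad=True)
b = a.exp()                   # backward of exp() needs b's values

# Unsafe: changes made through .data are invisible to autograd.
b.data.zero_()
b.sum().backward()            # runs, but a.grad is computed from the corrupted values

a2 = torch.tensor([1., 2., 3.], requires_grad=True)
b2 = a2.exp()

# Safer: .detach() also shares storage, but the in-place change is reported,
# so the following backward would raise a RuntimeError instead of silently
# producing wrong gradients:
b2.detach().zero_()
# b2.sum().backward()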

Some operations now return 0-dimensional (scalar) Tensors

Previously, indexing into a Tensor vector (1-dimensional tensor) gave a Python number, but indexing into a Variable vector (inconsistently!) gave a vector of size (1,). Similar behavior existed with reduction functions: tensor.sum() would return a Python number, but variable.sum() would return a vector of size (1,).

Fortunately, this release introduces proper scalar (0-dimensional tensor) support in PyTorch! Scalars can be created using the new torch.tensor function (which will be explained in more detail later; for now just think of it as the PyTorch equivalent of numpy.array). Now you can do things like:

>>> torch.tensor(3.1416)         # create a scalar directly
tensor(3.1416)
>>> torch.tensor(3.1416).size()  # scalar is 0-dimensional
torch.Size([])
>>> torch.tensor([3]).size()     # compare to a vector of size 1
torch.Size([1])
>>>
>>> vector = torch.arange(2, 6)  # this is a vector
>>> vector
tensor([ 2.,  3.,  4.,  5.])
>>> vector.size()
torch.Size([4])
>>> vector[3]                    # indexing into a vector gives a scalar
tensor(5.)
>>> vector[3].item()             # .item() gives the value as a Python number
5.0
>>> sum = torch.tensor([2, 3]).sum()
>>> sum
tensor(5)
>>> sum.size()
torch.Size([])

Accumulating losses

Consider the widely used pattern total_loss += loss.data[0] before 0.4.0. loss was a Variable wrapping a tensor of size (1,), but in 0.4.0 loss is now a scalar and has 0 dimensions. Indexing into a scalar doesn't make sense (it gives a warning now, but will be a hard error in 0.5.0): use loss.item() to get the Python number from a scalar.

Note that if you don't convert to a Python number when accumulating losses, you may find increased memory usage in your program. This is because the right-hand-side of the above expression used to be a Python float, while it is now a zero-dim Tensor. The total loss is thus accumulating Tensors and their gradient history, which may keep around large autograd graphs for much longer than necessary.
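For instance, a training loop that used to accumulate loss.data[0] would now look roughly like this (model, criterion and dataset are placeholders):

total_loss = 0.0
for input, target in dataset:
    loss = criterion(model(input), target)
    total_loss += loss.item()   # was: total_loss += loss.data[0] before 0.4.0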

Deprecation of volatile flag

The volatile flag is now deprecated and has no effect. Previously, any computation that involves a Variable with volatile=True won't be tracked by autograd. This has now been replaced by a set of more flexible context managers including torch.no_grad(), torch.set_grad_enabled(grad_mode), and others.

>>> x = torch.zeros(1, requires_grad=True)
>>> with torch.no_grad():
...     y = x * 2
>>> y.requires_grad
False
>>>
>>> is_train = False
>>> with torch.set_grad_enabled(is_train):
...     y = x * 2
>>> y.requires_grad
False
>>> torch.set_grad_enabled(True)  # this can also be used as a function
>>> y = x * 2
>>> y.requires_grad
True
>>> torch.set_grad_enabled(False)
>>> y = x * 2
>>> y.requires_grad
False

dtypes, devices and NumPy-style creation functions

In previous versions of PyTorch, we used to specify the data type (e.g. float vs double), device type (cpu vs cuda) and layout (dense vs sparse) together as a "tensor type". For example, torch.cuda.sparse.DoubleTensor was the Tensor type representing the double data type, living on CUDA devices, and with COO sparse tensor layout.

In this release, we introduce torch.dtype, torch.device and torch.layout classes to allow better management of these properties via NumPy-style creation functions.

torch.dtype

Below is a complete list of available torch.dtypes (data types) and their corresponding tensor types.

Data type torch.dtype Tensor types
32-bit floating point torch.float32 or torch.float torch.*.FloatTensor
64-bit floating point torch.float64 or torch.double torch.*.DoubleTensor
16-bit floating point torch.float16 or torch.half torch.*.HalfTensor
8-bit integer (unsigned) torch.uint8 torch.*.ByteTensor
8-bit integer (signed) torch.int8 torch.*.CharTensor
16-bit integer (signed) torch.int16 or torch.short torch.*.ShortTensor
32-bit integer (signed) torch.int32 or torch.int torch.*.IntTensor
64-bit integer (signed) torch.int64 or torch.long torch.*.LongTensor

Use torch.set_default_dtype and torch.get_default_dtype to manipulate the default dtype for floating point tensors.
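For example (a short sketch; the factory default is torch.float32):

>>> torch.get_default_dtype()
torch.float32
>>> torch.set_default_dtype(torch.float64)
>>> torch.empty(3).dtype   # new floating point tensors now default to float64
torch.float64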

torch.device

A torch.device contains a device type ('cpu' or 'cuda') and an optional device ordinal (id) for the device type. It can be initialized with torch.device('{device_type}') or torch.device('{device_type}:{device_ordinal}').

If the device ordinal is not present, this represents the current device for the device type; e.g., torch.device('cuda') is equivalent to torch.device('cuda:X') where X is the result of torch.cuda.current_device().
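For example (a quick sketch of the constructor forms described above):

>>> torch.device('cpu')
device(type='cpu')
>>> torch.device('cuda:1')
device(type='cuda', index=1)
>>> torch.device('cuda')    # the current CUDA device, i.e. 'cuda:X'
device(type='cuda')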

torch.layout

torch.layout represents the data layout of a Tensor. Currently torch.strided (dense tensors) and torch.sparse_coo (sparse tensors with COO format) are supported.

Creating Tensors

Methods that create a Tensor now also take in dtype, device, layout, and requires_grad options to specify the desired attributes on the returned Tensor. For example,

>>> device = torch.device("cuda:1")
>>> x = torch.randn(3, 3, dtype=torch.float64, device=device)
>>> x
tensor([[-0.6344,  0.8562, -1.2758],
        [ 0.8414,  1.7962,  1.0589],
        [-0.1369, -1.0462, -0.4373]], dtype=torch.float64, device='cuda:1')
>>> x.requires_grad  # default is False
False
>>> x = torch.zeros(3, requires_grad=True)
>>> x.requires_grad
True

torch.tensor

torch.tensor is one of the newly added tensor creation methods. It takes in array-like data of all kinds and copies the contained values into a new Tensor. As mentioned earlier, torch.tensor is the PyTorch equivalent of NumPy's numpy.array constructor. Unlike the torch.*Tensor methods, you can also create zero-dimensional Tensors (aka scalars) this way (a single Python number is treated as a size in the torch.*Tensor methods). Moreover, if a dtype argument isn't given, it will infer the suitable dtype for the data. It is the recommended way to create a tensor from existing data like a Python list. For example,

>>> cuda = torch.device("cuda")
>>> torch.tensor([[1], [2], [3]], dtype=torch.half, device=cuda)
tensor([[ 1],
        [ 2],
        [ 3]], device='cuda:0')
>>> torch.tensor(1)               # scalar
tensor(1)
>>> torch.tensor([1, 2.3]).dtype  # type inference
torch.float32
>>> torch.tensor([1, 2]).dtype    # type inference
torch.int64

We've also added more tensor creation methods. Some of them have torch.*_like and/or tensor.new_* variants.

  1. torch.*_like takes in an input Tensor instead of a shape. It returns a Tensor with the same attributes as the input Tensor by default, unless otherwise specified:

    >>> x = torch.randn(3, dtype=torch.float64)
    >>> torch.zeros_like(x)
    tensor([ 0.,  0.,  0.], dtype=torch.float64)
    >>> torch.zeros_like(x, dtype=torch.int)
    tensor([ 0,  0,  0], dtype=torch.int32)
    
  2. tensor.new_* can also create Tensors with the same attributes as tensor, but it always takes in a shape argument:

    >>> x = torch.randn(3, dtype=torch.float64)
    >>> x.new_ones(2)
    tensor([ 1.,  1.], dtype=torch.float64)
    >>> x.new_ones(4, dtype=torch.int)
    tensor([ 1,  1,  1,  1], dtype=torch.int32)
    

To specify the desired shape, you can either use a tuple (e.g., torch.zeros((2, 3))) or variable arguments (e.g., torch.zeros(2, 3)) in most cases.
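A quick check (illustrative):

>>> torch.zeros(2, 3).size() == torch.zeros((2, 3)).size()
True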

Name and returned Tensor (some of these also have torch.*_like and/or tensor.new_* variants, as noted above):

torch.empty: uninitialized memory
torch.zeros: all zeros
torch.ones: all ones
torch.full: filled with a given value
torch.rand: i.i.d. continuous Uniform[0, 1)
torch.randn: i.i.d. Normal(0, 1)
torch.randint: i.i.d. discrete Uniform in given range
torch.randperm: random permutation of {0, 1, ..., n - 1}
torch.tensor: copied from existing data (list, NumPy ndarray, etc.)
torch.from_numpy*: from NumPy ndarray (sharing storage without copying)
torch.arange, torch.range, and torch.linspace: uniformly spaced values in a given range
torch.logspace: logarithmically spaced values in a given range
torch.eye: identity matrix

*: torch.from_numpy only takes in a NumPy ndarray as its input argument.

Writing device-agnostic code

Previous versions of PyTorch made it difficult to write code that was device agnostic (i.e. that could run on both CUDA-enabled and CPU-only machines without modification).

PyTorch 0.4.0 makes this easier in two ways: the device attribute of a Tensor reports the torch.device it lives on, and the to() method on Tensors and Modules moves them to a given device.

We recommend the following pattern:

# at beginning of the script
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

...

# then whenever you get a new Tensor or Module
# this won't copy if they are already on the desired device
input = data.to(device)
model = MyModule(...).to(device)

Tensors

Full support for Advanced indexing

PyTorch now has full support for advanced indexing, following numpy's advanced indexing rules. The following examples are now possible:

a = torch.rand(10, 10, 10, 10)

# the indexing elements can have other shapes than 1
b = a[[[3, 2]], :, [[1, 3]]]

# broadcasting also supported in the indices, as well as lists,
# negative indices, slices, ellipses, numbers
c = a[[1, -2], 2:4, :, [1]]

# can also support tensors as indices
index = torch.tensor([2, 4])
d = a[index]

# and the indices can be on the GPU
# or CPU
e = a[index.cuda()]
f = a.cuda()[index]


mask = torch.rand(10) > 0.5
# we can now index with a mask that has fewer
# dimensions than the indexing tensor
c = a[mask, :5]

Fast Fourier Transform

New and updated Torch operators

a = torch.arange(0, 9).reshape(3, 3)
# the following transposes a
b = torch.einsum('ij->ji', (a,))

Rename async argument in .cuda() to non_blocking

The async keyword argument in conversion calls is now deprecated in PyTorch and has been replaced by non_blocking. This change was necessary because async will be a keyword in Python 3.7.
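A minimal sketch of the renamed argument (tensor names here are illustrative):

import torch

# only meaningful with pinned (page-locked) CPU memory and a CUDA device available
pinned = torch.empty(1024).pin_memory()
gpu_copy = pinned.cuda(non_blocking=True)   # was: pinned.cuda(async=True) before 0.4.0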

Neural Networks

A new autograd container that lets you trade compute for memory

The new checkpoint container allows you to only store a subset of the outputs necessary for backpropagation. If an output is missing (to save memory), the checkpoint container will recompute the intermediate outputs from the closest checkpoint, so that memory usage can be reduced (with an increase in computation time). Here is an example:

import torch
from torch import nn

# input
input = torch.rand(1, 10)
# suppose we have a very deep model
layers = [nn.Linear(10, 10) for _ in range(1000)]
model = nn.Sequential(*layers)
output = model(input)

The above model uses a lot of memory, because it needs to keep the intermediate values of every operation for backpropagation. checkpoint lets you reduce the memory requirements:


# create the input tensors and set the requires_grad=True
# NOTE: the requires_grad=True for the input is a current
# limitation of checkpointing. At least one of the 
# model inputs should have requires_grad=True. 
# If you don't do it, you might have empty gradients.
input = torch.rand(1, 10, requires_grad=True)
layers = [nn.Linear(10, 10) for _ in range(1000)]

# define function that will define where
# we will checkpoint and store
# intermediate gradients. In this case,
# we will only store one intermediate
# gradient, in the middle of the
# model

def run_first_half(*args):
    x = args[0]
    for layer in layers[:500]:
        x = layer(x)
    return x

def run_second_half(*args):
    x = args[0]
    for layer in layers[500:-1]:
        x = layer(x)
    return x

# now uses the new checkpoint functionality
from torch.utils.checkpoint import checkpoint

x = checkpoint(run_first_half, input)
x = checkpoint(run_second_half, x)
# last output need to be run without checkpoint
x = layers[-1](x)
x.sum().backward()  # works!

For sequential modules (which can have arbitrary blocks inside), a helper function checkpoint_sequential is provided, which takes care of the most common use-cases:

input = torch.rand(1, 10, requires_grad=True)
layers = [nn.Linear(10, 10) for _ in range(1000)]
model = nn.Sequential(*layers)

from torch.utils.checkpoint import checkpoint_sequential

# split in two blocks
num_segments = 2
x = checkpoint_sequential(model, num_segments, input)
x.sum().backward()  # works!

bottleneck - a tool to identify hotspots in your code

torch.utils.bottleneck (#5216, #6425) is a tool that can be used as an initial step for debugging bottlenecks in your program. It summarizes runs of your script with the Python profiler and PyTorch’s autograd profiler. See the bottleneck docs for more details.
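It is typically invoked from the command line on the script you want to profile, e.g. (script name and arguments are placeholders):

python -m torch.utils.bottleneck my_script.py --arg1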

reduce=False Losses

As of this release, all of our loss functions support the reduce keyword. Specifying reduce=False gives a Tensor per unit of loss instead of a single reduced loss. #4924, #5346, #5646, #4231, #4705, #5680
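For instance (a small sketch; MSELoss is just one of the losses that accept the keyword):

import torch
from torch import nn

loss_fn = nn.MSELoss(reduce=False)
per_element = loss_fn(torch.zeros(4), torch.ones(4))
# per_element has shape (4,): one loss value per element, instead of a single reduced scalar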

New modules and module improvements

torch.distributions

torch.distributions has expanded to include 24 basic probability distributions: Bernoulli, Beta, Binomial, Categorical, Cauchy, Chi2, Dirichlet, Exponential, FisherSnedecor, Gamma, Geometric, Gumbel, Laplace, LogNormal, Multinomial, MultivariateNormal, Normal, OneHotCategorical, Pareto, Poisson, RelaxedBernoulli, RelaxedOneHotCategorical, StudentT, and Uniform.

The Distribution interface has expanded to include many methods including .cdf(), .icdf(), .mean(), .variance(), .entropy(), and .perplexity(). Distributions now split tensor dimensions into sample_shape+batch_shape+event_shape. Most continuous distributions now also implement a differentiable .rsample() method to compute pathwise derivatives aka the reparameterization trick (check .has_rsample for availability):

>>> loc = torch.tensor(0., requires_grad=True)
>>> scale = torch.tensor(1., requires_grad=True)
>>> samples = Normal(loc, scale).rsample(sample_shape=(1000,))
>>> loss = (samples - 0.5).pow(4).mean()  # average over 1000 monte carlo samples
>>> grad(loss, [loc, scale])
(tensor(-7.5092), tensor(15.2704))

Most discrete distributions implement an .enumerate_support() method to make it easy to sum over all possible sample values (check .has_enumerate_support for availability).
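For example (illustrative):

>>> from torch.distributions import Bernoulli
>>> d = Bernoulli(torch.tensor([0.25, 0.75]))
>>> d.enumerate_support()   # one row per support value (0 and 1), for each batch element
tensor([[ 0.,  0.],
        [ 1.,  1.]])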

kl_divergence is defined for many pairs of distributions, e.g.

>>> x = torch.tensor(1.0, requires_grad=True)
>>> kl = kl_divergence(Uniform(-x, x), Normal(0., 1.))
>>> grad(kl, [x])[0]
tensor(-0.6667)

Distribution Transforms

New distributions can be created by combining TransformedDistribution with any number of Transform objects from the torch.distributions.transforms library, including: ExpTransform, PowerTransform, SigmoidTransform, AbsTransform, AffineTransform, SoftmaxTransform, StickBreakingTransform, LowerCholeskyTransform, and their inverses via the .inv property.
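As an illustration (a sketch, not from the original notes), a log-normal-style distribution can be built by pushing a standard Normal through ExpTransform:

from torch.distributions import Normal, TransformedDistribution
from torch.distributions.transforms import ExpTransform

base = Normal(0., 1.)
log_normal = TransformedDistribution(base, [ExpTransform()])
x = log_normal.rsample()          # strictly positive samples
log_p = log_normal.log_prob(x)    # density includes the transform's Jacobian correction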

Distribution Constraints

Distributions provide metadata about the constraints of their .support and about their arguments (.arg_constraints). These Constraint objects are registered with transforms using transform_to() and biject_to(). Together, constraints and transforms make it easy to specify new distributions in a generic way:

>>> scale = torch.tensor(1., requires_grad=True)
>>> p = Normal(0., scale)
>>> assert p.arg_constraints['scale'] == constraints.positive
>>> prior = TransformedDistribution(Normal(0., 1.),
...                                 transform_to(constraints.positive))

Constraints in the torch.distributions.constraints library include: boolean, greater_than(lower_bound), integer_interval(lower_bound, upper_bound), interval(lower_bound, upper_bound), lower_cholesky, lower_triangular, nonnegative_integer, positive, positive_definite, positive_integer, real, real_vector, simplex, and unit_interval.

Distributed

Helper utility for launching Distributed Training jobs

We have added a utility function to help launch jobs in a distributed setup. In order to launch a script that leverages DistributedDataParallel on either a single node or multiple nodes, we can make use of torch.distributed.launch as follows:

python -m torch.distributed.launch my_script.py --arg1 --arg2 --arg3

The script simplifies day to day usability of the distributed package.

You can read about its usage here: http://pytorch.org/docs/stable/distributed.html#launch-utility

A new distributed backend based on NCCL 2.0

PyTorch now has a new distributed backend, which leverages NCCL 2.0 for maximum speed. It also provides new APIs for collective operations on multiple GPUs. You can enable the new backend via

torch.distributed.init_process_group("nccl")

Other distributed improvements

C++ extensions

Previously, the official way of writing extensions using C or CUDA for custom modules was through the cffi extension. The drawback of this method was that it required a separate step for compiling the CUDA kernels, which could be a bit messy.

PyTorch now provides a better system for writing your own C++ / CUDA extensions. Example implementations using this new extension support can be found in the pytorch/extension-cpp repo.

We provide two compilation modes: ahead-of-time compilation via setuptools, and just-in-time compilation via torch.utils.cpp_extension.load (used in the example below).

In C++

// my_implementation.cpp
#include <torch/torch.h>
#include <unordered_set>

// can use templates as well. But let's keep it
// simple
using scalar_t = float;

at::Tensor unique_float(at::Tensor input_) {
  // only works for floats
  AT_ASSERT(input_.type().scalarType() == at::ScalarType::Float, "input must be a float tensor");
  // and CPU tensors
  AT_ASSERT(!input_.type().is_cuda(), "input must be a CPU tensor");
  
  // make the input contiguous, to simplify the implementation
  at::Tensor input = input_.contiguous();
  
  // get the pointer that holds the data
  scalar_t* input_data = input.data<scalar_t>();
  // let's use a function from the std library to implement
  // the unique function
  std::unordered_set<scalar_t> set(input_data, input_data + input.numel());
  
  // create the output tensor, with size set.size()
  at::Tensor output = input.type().tensor({static_cast<int64_t>(set.size())});
  scalar_t* output_data = output.data<scalar_t>();
  // copy the content of the set to the output tensor
  std::copy(set.begin(), set.end(), output_data);
  
  return output;
}

// this defines the functions exposed to Python
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("unique_float", &unique_float, "Unique for float tensors");
}

And then in Python

import torch
from torch.utils.cpp_extension import load as load_ext
# pass the source files, they will be compiled on the fly 
# and will return a python module
_C = load_ext('my_unique_lib', sources=['my_implementation.cpp'])

# now can use the functions implemented in C++
unique = _C.unique_float

a = torch.tensor([1.0, 2.0, 1.0])
print(unique(a))
# tensor([ 2.,  1.])

Windows support

PyTorch now officially supports Windows. We provide pre-compiled Conda binaries and pip wheels for Python 3.5 and 3.6. PyTorch on Windows doesn't support distributed training and might be a tad bit slower than Linux / OSX because Visual Studio supports an older version of OpenMP.

As always, you can use the commands at http://pytorch.org to install PyTorch on Windows. We have an FAQ that answers most questions you might have about Windows here: http://pytorch.org/docs/stable/notes/windows.html

ONNX Improvements

New ONNX operators

Improvements

Better RNN support

PyTorch can now export a subset of RNNs to ONNX #4409

Bugfixes

Miscellaneous improvements

For example, items can now be deleted from an nn.Sequential:

model = nn.Sequential(nn.Linear(2, 2), nn.ReLU(), nn.Linear(2, 2))
del model[1]  # deletes nn.ReLU

Performance improvements

Distributed

Bug fixes

torch operators

core

autograd

nn layers

CUDA

sparse

dataloader

optim

distributed and multi-gpu
