
v0.2.0

pytorch/pytorch

Released: 2017-08-28 22:43:31


Here comes the next major release of PyTorch, just in time for ICML. Install it today from our website http://pytorch.org. Package documentation for this release is available at http://pytorch.org/docs/0.2.0/

We're introducing long-awaited features such as Broadcasting, Advanced Indexing, Higher-order gradients and finally: Distributed PyTorch.

Due to the introduction of Broadcasting, the behavior of code in certain broadcastable situations is different from its behavior in 0.1.12. This might lead to silent bugs in your existing code. We've provided easy ways of identifying this ambiguous code in the Important Breakages and Workarounds section.

Table of contents:

  - Tensor Broadcasting (numpy-style)
  - Advanced Indexing for Tensors and Variables
  - Higher order gradients
  - Distributed PyTorch
  - New nn layers: SpatialTransformers, WeightNorm, EmbeddingBag, etc.
  - New in torch and autograd
  - Bug-fixes and small improvements
  - Important Breakages and Workarounds

Tensor Broadcasting (numpy-style)

In short, if a PyTorch operation supports broadcasting, then its Tensor arguments can be automatically expanded to be of equal sizes (without making copies of the data).

PyTorch Broadcasting semantics closely follow numpy-style broadcasting; if you are familiar with numpy broadcasting, things should just work as expected.

General Semantics

Two tensors are “broadcastable” if the following rules hold:

  - Each tensor has at least one dimension.
  - When iterating over the dimension sizes, starting at the trailing dimension, the dimension sizes must either be equal, one of them is 1, or one of them does not exist.

For Example:

>>> x=torch.FloatTensor(5,7,3)
>>> y=torch.FloatTensor(5,7,3)
# same shapes are always broadcastable (i.e. the above rules always hold)

# can line up trailing dimensions
>>> x=torch.FloatTensor(5,3,4,1)
>>> y=torch.FloatTensor(  3,1,1)

# x and y are broadcastable.
# 1st trailing dimension: both have size 1
# 2nd trailing dimension: y has size 1
# 3rd trailing dimension: x size == y size
# 4th trailing dimension: y dimension doesn't exist

# but:
>>> x=torch.FloatTensor(5,2,4,1)
>>> y=torch.FloatTensor(  3,1,1)
# x and y are not broadcastable, because in the 3rd trailing dimension 2 != 3

If two tensors x, y are "broadcastable", the resulting tensor size is calculated as follows:

  - If the number of dimensions of x and y is not equal, prepend 1 to the dimensions of the tensor with fewer dimensions to make them equal length.
  - Then, for each dimension size, the resulting dimension size is the max of the sizes of x and y along that dimension.

For Example:

# can line up trailing dimensions to make reading easier
>>> x=torch.FloatTensor(5,1,4,1)
>>> y=torch.FloatTensor(  3,1,1)
>>> (x+y).size()
torch.Size([5, 3, 4, 1])

# error case
>>> x=torch.FloatTensor(5,2,4,1)
>>> y=torch.FloatTensor(  3,1,1)
>>> (x+y).size()
RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 1

More details can be found on the PyTorch documentation site. Also, each torch function lists its broadcasting semantics in the documentation.

Advanced Indexing for Tensors and Variables

PyTorch now supports a subset of NumPy style advanced indexing. This allows users to select arbitrary indices at each dimension of the Tensor, including non-adjacent indices and duplicate indices, using the same []-style operation. This allows for a more flexible indexing strategy without needing calls to PyTorch's Index[Select, Add, ...] functions.

Let's look at some examples:

x = torch.Tensor(5, 5, 5)

# Pure integer array indexing - specify arbitrary indices at each dimension
x[[1, 2], [3, 2], [1, 0]]
--> yields a 2-element Tensor (x[1][3][1], x[2][2][0])

# also supports broadcasting, duplicates
x[[2, 3, 2], [0], [1]]
--> yields a 3-element Tensor (x[2][0][1], x[3][0][1], x[2][0][1])

# arbitrary indexer shapes allowed
x[[[1, 0], [0, 1]], [0], [1]].shape
--> yields a 2x2 Tensor [[x[1][0][1], x[0][0][1]],
                         [x[0][0][1], x[1][0][1]]]

# can use colon, ellipsis
x[[0, 3], :, :]
x[[0, 3], ...]
--> both yield a 2x5x5 Tensor [x[0], x[3]]

# can also use Tensors to index!
y = torch.LongTensor([0, 2, 4])
x[y, :, :]
--> yields a 3x5x5 Tensor [x[0], x[2], x[4]]

# selection with less than ndim, note the use of comma
x[[1, 3], ]
--> yields a 2x5x5 Tensor [x[1], x[3]]

Higher order gradients

Now you can evaluate higher order differentials in PyTorch. For example, you can compute Hessian-Vector products, penalize the norm of the gradients of your model, implement Unrolled GANs and Improved WGANs, etc.

In the 0.2 release, we've enabled the ability to compute higher order gradients for all of the torch.XXX functions and the most popular nn layers. The rest will be covered in the next release.

Here's a short example that penalizes the norm of the weight gradients of a Resnet-18 model, so that the volume of weights is slow-changing.

import torch
from torchvision.models import resnet18
from torch.autograd import Variable

model = resnet18().cuda()
# define an optimizer so the step at the end works (any optimizer will do)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# dummy inputs for the example
input = Variable(torch.randn(2,3,224,224).cuda(), requires_grad=True)
target = Variable(torch.zeros(2).long().cuda())

# as usual
output = model(input)
loss = torch.nn.functional.nll_loss(output, target)

grad_params = torch.autograd.grad(loss, model.parameters(), create_graph=True)
# torch.autograd.grad does not accumulate the gradients into the .grad attributes
# It instead returns the gradients as Variable tuples.

# now compute the 2-norm of the grad_params
grad_norm = 0
for grad in grad_params:
    grad_norm += grad.pow(2).sum()
grad_norm = grad_norm.sqrt()

# take the gradients wrt grad_norm. backward() will accumulate
# the gradients into the .grad attributes
grad_norm.backward()

# do an optimization step
optimizer.step()

We see two new concepts here:

  1. torch.autograd.grad is a function that takes in [outputs, list of inputs (for which you want gradients)], and returns the gradients wrt. these inputs as a tuple, rather than accumulating the gradients into the .grad attributes. This is useful if you want to further operate on the gradients (a small Hessian-vector product sketch follows below).
  2. You can operate on the gradients, and call backward() on them.
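
As a small illustration of these two concepts, here is a hedged sketch of a Hessian-vector product computed with torch.autograd.grad; the function f and the vector v below are arbitrary choices for the example, not something prescribed by the release:

import torch
from torch.autograd import Variable

# f(x) = sum(x^3), whose Hessian is diag(6 * x)
x = Variable(torch.randn(5), requires_grad=True)
v = torch.randn(5)  # the vector to multiply the Hessian with

y = (x ** 3).sum()

# first-order gradient, keeping the graph so we can differentiate again
grad_x, = torch.autograd.grad(y, x, create_graph=True)

# differentiate the scalar <grad_x, v> to obtain H(x) @ v
hvp, = torch.autograd.grad((grad_x * Variable(v)).sum(), x)

print(hvp)             # should match the analytic result below
print(6 * x.data * v)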

The list of nn layers that support higher order gradients is:

To enable higher order gradients, we've introduced a new style of writing autograd.Function (the current/old style of writing functions is fully backward compatible). You can read more about the new style of functions here.

Most of you don't write your own autograd.Functions; they are low-level primitives that introduce new operations to the autograd engine, where you specify the forward and backward calls.
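
For those who do, here is a minimal sketch of the new style: forward and backward are static methods that receive a context object (ctx), and the function is invoked via .apply. The Exp function is purely illustrative, and attribute names such as ctx.saved_tensors may differ slightly between 0.2 and later releases:

import torch
from torch.autograd import Function, Variable

class Exp(Function):
    # state needed for backward is stashed on ctx, not on self
    @staticmethod
    def forward(ctx, i):
        result = i.exp()
        ctx.save_for_backward(result)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        result, = ctx.saved_tensors
        return grad_output * result

x = Variable(torch.randn(3), requires_grad=True)
y = Exp.apply(x)   # note: call .apply, don't instantiate the Function
y.sum().backward()
print(x.grad)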

Distributed PyTorch

We introduce the torch.distributed package that allows you to exchange Tensors among multiple machines. Using this package, you can scale your network training over multiple machines and train with larger mini-batches. For example, you are given the primitives to implement Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.

The distributed package follows an MPI-style programming model. This means that there are functions provided to you such as send, recv, all_reduce that will exchange Tensors among nodes (machines).

For the machines to first identify each other and assign unique numbers (ranks) to each process, we provide simple initialization methods:

  - shared file system (requires that all processes can access a single file system)
  - IP multicast (requires that all processes are in the same network)
  - environment variables (requires you to manually assign ranks and know an address of a node reachable from all processes)

Our package documentation contains more details on initialization and available backends, but here's an example of initializing using a multicast address:

import torch.distributed as dist

dist.init_process_group(backend='tcp',
                        init_method='tcp://[ff15:1e18:5d4c:4cf0:d02d:b659:53ba:b0a7]:23456',
                        world_size=4)

print('Hello from process {} (out of {})!'.format(
        dist.get_rank(), dist.get_world_size()))

This would print Hello from process 2 (out of 4) on the 3rd machine.

World size is the number of processes that will participate in the job. Each will be assigned a rank, which is a number between 0 and world_size - 1, unique within this job. It will serve as a process identifier and will be used instead of an address to, for example, specify to which process a tensor should be sent.

Here's a snippet that shows how simple point-to-point communication can be performed:

# All processes (receiving ones too!) need to have tensors of appropriate
# size preallocated.
x = torch.Tensor(10)
if dist.get_rank() == 0:
    x.normal_()
    # Send x to process with rank 1
    dist.send(x, dst=1)
else:  # rank == 1
    # Receive data from process with rank 0 and save result in x
    dist.recv(x, src=0)

Asynchronous p2p functions (isend, irecv) are available too.
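
A minimal sketch of the asynchronous variants, assuming the same two-process setup as in the snippet above; isend/irecv return request objects that must be waited on before the tensor is safe to use:

x = torch.Tensor(10)
if dist.get_rank() == 0:
    x.normal_()
    req = dist.isend(x, dst=1)   # returns immediately
else:  # rank == 1
    req = dist.irecv(x, src=0)
# ... unrelated work can overlap with the communication here ...
req.wait()                       # block until the transfer has completed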

However, some communication patterns appear so often that more efficient collective calls have been developed. They typically engage the whole process group and are much faster than naive algorithms using send/recv. One example is all_reduce:

x = torch.Tensor([dist.get_rank()])
# Add tensors from all processes such that they all receive the result.
# x is an input and output to this operation.
dist.all_reduce(x)

The distributed package is fairly low-level, so that it allows you to implement more advanced algorithms and tailor the code to very specific purposes, but data-parallel training is such a common one that we have created high-level helpers for it.

Hence, we've introduced DistributedDataParallel, which is meant to be a nearly drop-in replacement for nn.DataParallel. Here's a code snippet demonstrating changes necessary to add it to existing training code:

# Wrap model in DistributedDataParallel (CUDA only for the moment)
model = torch.nn.parallel.DistributedDataParallel(model.cuda())

# Use a DistributedSampler to restrict each process to a distinct subset
# of the dataset.
train_dataset = ...
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, num_workers=args.workers,
    pin_memory=True, sampler=train_sampler)

for epoch in range(args.num_epochs):
    # Use the .set_epoch() method to reshuffle the dataset partition at every epoch
    train_sampler.set_epoch(epoch)
    # training loop
    ...

You can see a fuller ImageNet training example here.

New nn layers: SpatialTransformers, WeightNorm, EmbeddingBag, etc.
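
As a quick taste of two of these, here is a hedged sketch using nn.EmbeddingBag and the weight_norm wrapper; the sizes, mode, and names below are illustrative only:

import torch
import torch.nn as nn
from torch.autograd import Variable

# weight_norm reparameterizes a module's weight into magnitude and direction
linear = nn.utils.weight_norm(nn.Linear(20, 40), name='weight')

# EmbeddingBag computes sums (or means) over "bags" of embeddings without
# materializing the intermediate per-index embedding matrix
bag = nn.EmbeddingBag(num_embeddings=1000, embedding_dim=16, mode='sum')
indices = Variable(torch.LongTensor([1, 2, 4, 5, 4, 3]))
offsets = Variable(torch.LongTensor([0, 3]))   # two bags: [1, 2, 4] and [5, 4, 3]
out = bag(indices, offsets)                    # -> Variable of size 2 x 16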

New features

Intermediate Variables can now retain their gradients: call .retain_grad() on a non-leaf Variable before backward(), and its gradient will be populated in .grad, as the snippet below shows.

input = Variable(torch.rand(1, 3), requires_grad=True)
h1 = input * 3
out = (h1 * h1).sum()

h1.retain_grad()
out.backward()

print(h1.grad)
# without calling retain_grad(), h1.grad is None

New Layers

Training utilities

Learning Rate Schedulers: torch.optim.lr_scheduler provides several dumb and smart methods to adjust the current learning rate. They are quite convenient while experimenting, giving a proxy for what you as the user would likely want to do.

There are various strategies provided, which can be used depending on the situation; more can be read in the package docs:

  - ReduceLROnPlateau
  - LambdaLR
  - StepLR
  - MultiStepLR
  - ExponentialLR
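
For instance, a hedged sketch of StepLR (the model, learning rate, and schedule values below are arbitrary placeholders):

import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

model = nn.Linear(10, 2)
optimizer = SGD(model.parameters(), lr=0.1)

# decay the learning rate by a factor of 0.1 every 30 epochs
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):
    scheduler.step()          # adjust the learning rate for this epoch
    # ... train for one epoch using `optimizer` as usual ...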

ConcatDataset is a convenient dataset meta-class that can merge and concatenate two individual datasets.
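
A hedged sketch of how it might be used (the tensors here are dummy data):

import torch
from torch.utils.data import TensorDataset, ConcatDataset, DataLoader

# two dummy datasets with the same per-sample layout
d1 = TensorDataset(torch.randn(100, 3), torch.zeros(100).long())
d2 = TensorDataset(torch.randn(50, 3), torch.ones(50).long())

combined = ConcatDataset([d1, d2])   # len(combined) == 150
loader = DataLoader(combined, batch_size=32, shuffle=True)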

New in torch and autograd

Bug-fixes and small improvements

b = Variable(torch.zeros(1))
if b[0]: # errors now

Important Breakages and Workarounds

As you've read, we've introduced two important changes that are not backward compatible:

  - Broadcasting
  - Reduction functions such as sum(1) now default to keepdim=False

We provide different levels of Python warnings that you can enable to alert you if you are using deprecated behavior or if the behavior of your code has changed.

tl;dr

Here is a code snippet that you can add to the top of your scripts. Adding this code will generate warnings highlighting incompatible code.

Fix your code to no longer generate warnings.

# insert this to the top of your scripts (usually main.py)
import sys, warnings, traceback, torch
def warn_with_traceback(message, category, filename, lineno, file=None, line=None):
    sys.stderr.write(warnings.formatwarning(message, category, filename, lineno, line))
    traceback.print_stack(sys._getframe(2))
warnings.showwarning = warn_with_traceback; warnings.simplefilter('always', UserWarning);
torch.utils.backcompat.broadcast_warning.enabled = True
torch.utils.backcompat.keepdim_warning.enabled = True

Once all warnings disappear, you can remove the code snippet.

More elaborately

Now, let us see the three incompatible changes with examples.

Using the (now deprecated) 1-dimensional view pointwise function

Prior versions of PyTorch allowed certain pointwise functions to execute on tensors with different shapes, as long as the number of elements in each tensor was equal. The pointwise operation would then be carried out by viewing each tensor as 1-dimensional. PyTorch now supports broadcasting. The “1-dimensional” pointwise behavior is considered deprecated and will generate a Python warning in cases where tensors are not broadcastable, but have the same number of elements.

For example:

>>> torch.add(torch.ones(4), torch.ones(2,2))
__main__:1: UserWarning: self and other not broadcastable, but have the same
number of elements.  Falling back to deprecated pointwise behavior.
2
2
2
2
[torch.FloatTensor of size 4]

Broadcasting in code where it didn't happen before

The introduction of broadcasting can cause backwards incompatible changes in the case where two tensors do not have the same shape, but are broadcastable and have the same number of elements.

For example:

>>> torch.add(torch.ones(4,1), torch.randn(4))

would previously produce a Tensor with size: torch.Size([4,1]), but now produces a Tensor with size: torch.Size([4,4]).

In order to help identify cases in your code where backwards incompatibilities introduced by broadcasting may exist, you may set torch.utils.backcompat.broadcast_warning.enabled to True, which will generate a Python warning in such cases.

For Example:

>>> torch.utils.backcompat.broadcast_warning.enabled=True
>>> torch.add(torch.ones(4,1), torch.ones(4))
__main__:1: UserWarning: self and other do not have the same shape, but are broadcastable, and have the same number of elements.

Note that this setting can trigger warnings for valid uses of broadcasting (including in library code), so you probably want to turn this warning off after migrating your code.

KeepDim=False for Reduction Functions

To get a warning when using a dimensional reduction function with the default keepdim argument, set torch.utils.backcompat.keepdim_warning.enabled to True. For example:

>>> torch.sum(torch.ones(2,3), 1)
__main__:1: UserWarning: backwards compatibility: call to "sum" uses default value for keepdim which has changed default to False.  Consider passing as kwarg.
3
3
[torch.FloatTensor of size 2]

As with torch.utils.backcompat.broadcast_warning.enabled, this warning can trigger from valid code, so you most likely want to disable this warning after migrating your code.

Note also that using keepdim=False can cause your existing code to "just work" with broadcasting. For example:

# behavior with (old) keepdim=True, causes accidental broadcast
>>> torch.add(torch.ones(4), torch.ones(4,4).sum(dim=1, keepdim=True))
5  5  5  5
5  5  5  5
5  5  5  5
5  5  5  5
[torch.FloatTensor of size 4x4]

# new behavior with keepdim=False is equivalent to non-broadcasted result
>>> torch.add(torch.ones(4), torch.ones(4,4).sum(dim=1, keepdim=False))
5
5
5
5
[torch.FloatTensor of size 4]
