v11.0.0b2

Release date: 2022-04-27 15:44:54

Latest release of cupy/cupy: v13.3.0 (2024-08-22 15:42:45)

These are the release notes for v11.0.0b2. See here for the complete list of solved issues and merged PRs.

We are running a Gitter chat for general discussions and quick questions. Feel free to join the channel to talk with developers and users!

Highlights

JIT Improvements (#6620, #6640, #6649, #6668)

CuPy JIT has been further enhanced thanks to @leofang and @eternalphane! It is now possible to use CUDA cooperative groups and to access the .shape and .strides attributes of ndarrays.

import cupy
from cupyx import jit

@jit.rawkernel()
def kernel(x, y):
    size = x.shape[0]
    ntid = jit.gridDim.x * jit.blockDim.x
    tid = jit.blockIdx.x * jit.blockDim.x + jit.threadIdx.x
    for i in range(tid, size, ntid):
        y[i] = x[i]
    g = jit.cg.this_thread_block()
    g.sync()

x = cupy.arange(200, dtype=cupy.int64)
y = cupy.zeros((200,), dtype=cupy.int64)
kernel[2, 32](x, y)

print(kernel.cached_code)

The program above emits the following CUDA code:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

extern "C" __global__ void kernel(CArray<long long, 1, true, true> x, CArray<long long, 1, true, true> y) {
  ptrdiff_t i;
  ptrdiff_t size = thrust::get<0>(x.get_shape());
  unsigned int ntid = (gridDim.x * blockDim.x);
  unsigned int tid = ((blockIdx.x * blockDim.x) + threadIdx.x);
  for (ptrdiff_t __it = tid, __stop = size, __step = ntid; __it < __stop; __it += __step) {
    i = __it;
    y[i] = x[i];
  }
  cg::thread_block g = cg::this_thread_block();
  g.sync();
}
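The loop over `range(tid, size, ntid)` in the kernel above is a grid-stride loop: each thread starts at its global index and advances by the total number of threads, so 64 threads can cover 200 elements. The following is a CPU-only sketch in plain Python, not CuPy code. The function name `emulate_grid_stride_copy` is hypothetical, and the GPU threads are simulated one after another rather than in parallel.

```python
def emulate_grid_stride_copy(x, grid_dim, block_dim):
    """Emulate the copy loop of the JIT kernel sequentially on the CPU.

    Hypothetical helper for illustration only; not part of CuPy.
    """
    size = len(x)
    y = [0] * size
    writes = [0] * size          # how many times each element is written
    ntid = grid_dim * block_dim  # total number of threads in the grid
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            tid = block_idx * block_dim + thread_idx
            # Same loop as in the kernel: start at tid, advance by ntid.
            for i in range(tid, size, ntid):
                y[i] = x[i]
                writes[i] += 1
    return y, writes


# Same launch configuration as kernel[2, 32](x, y) with 200 elements.
x = list(range(200))
y, writes = emulate_grid_stride_copy(x, grid_dim=2, block_dim=32)
```

Under these assumptions every element is written exactly once, which is why the kernel needs no bounds check beyond the loop condition.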

Initial MPI and sparse matrix support in `cupyx.distributed` (#6628, #6658)

CuPy v10 added the cupyx.distributed API to perform interprocess communication using NCCL in a way similar to MPI. In CuPy v11 we are extending this API to support sparse matrices as defined in cupyx.scipy.sparse. Currently only the send/recv primitives are supported, but we will add support for collective calls in upcoming releases.

Additionally, it is now possible to use MPI (through the mpi4py Python package) to initialize the NCCL communicator. This avoids launching the TCP server otherwise used to exchange CPU values. Moreover, we recommend enabling MPI for sparse matrix communication: each communication call has to exchange metadata, which leads to device synchronization if MPI is not enabled.

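# run with mpiexec -n N python …

from mpi4py import MPI  # `import mpi4py` alone does not load the MPI submodule
import cupyx.distributed

mpi_comm = MPI.COMM_WORLD
workers = mpi_comm.Get_size()  # total number of processes
rank = mpi_comm.Get_rank()     # index of this process

# Initialize the NCCL communicator, using MPI for the setup.
comm = cupyx.distributed.init_process_group(workers, rank, use_mpi=True)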
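To illustrate why sending a sparse matrix involves a metadata exchange, here is a minimal sketch in plain Python. It does not show CuPy's actual wire format. The `CSRMetadata` class and both helper functions are hypothetical, and the choice of fields (shape, number of stored elements, dtype) is an assumption about what a receiver would need before it can allocate buffers for a CSR matrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CSRMetadata:
    """Hypothetical header describing a CSR matrix (not a CuPy type)."""
    shape: tuple  # (rows, cols)
    nnz: int      # number of stored elements
    dtype: str    # element type of the values


def describe_csr(data, indices, indptr, n_cols, dtype="float64"):
    """Build the header a sender could transmit before the arrays."""
    assert len(data) == len(indices)
    return CSRMetadata(shape=(len(indptr) - 1, n_cols),
                       nnz=len(data), dtype=dtype)


def receive_buffer_sizes(meta):
    """Sizes of the three arrays a receiver would allocate beforehand."""
    rows, _ = meta.shape
    return {"data": meta.nnz, "indices": meta.nnz, "indptr": rows + 1}


# 3x4 matrix with non-zeros at (0, 1), (2, 0) and (2, 3).
data = [5.0, 7.0, 9.0]
indices = [1, 0, 3]
indptr = [0, 1, 1, 3]

meta = describe_csr(data, indices, indptr, n_cols=4)
sizes = receive_buffer_sizes(meta)
```

Unlike a dense array, whose size both sides can agree on in advance, the buffer sizes here depend on `nnz`, which is only known to the sender. That is one plausible reason a per-call header is needed.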

Announcements

Introduction of generic `cupy-wheel` (EXPERIMENTAL) (#6012)

We have added a new package on PyPI called cupy-wheel. This meta package allows other libraries to declare a dependency on CuPy and have the CuPy binary wheel that matches the user's environment installed transparently. Users can also install CuPy using this package instead of manually specifying a CUDA/ROCm version.

pip install cupy-wheel

This package is only available for stable releases, as the current pre-release wheels are not hosted on PyPI.

This feature is currently experimental and subject to change, so we recommend that users do not distribute packages relying on it for now. Your suggestions and comments are very welcome (please visit #6688).
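After installing through the meta package, it can be useful to check which CuPy distribution actually ended up in the environment. The following sketch uses only the standard library. The helper name `installed_cupy_distributions` is hypothetical, and matching on the `cupy` name prefix is an assumption about how the binary wheels are named.

```python
from importlib import metadata


def installed_cupy_distributions():
    """Return {distribution name: version} for names starting with 'cupy'.

    Hypothetical helper; the prefix match is an assumption, not a CuPy API.
    """
    found = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if name and name.lower().startswith("cupy"):
            found[name] = dist.version
    return found


result = installed_cupy_distributions()
for name, version in sorted(result.items()):
    print(f"{name}=={version}")
```

An empty result means no distribution with that prefix is installed in the current environment.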