v11.0.0b2
Release date: 2022-04-27 15:44:54
Latest cupy/cupy release: v13.3.0 (2024-08-22 15:42:45)
This is the release note of v11.0.0b2. See here for the complete list of solved issues and merged PRs.
We are running a Gitter chat for general discussions and quick questions. Feel free to join the channel to talk with developers and users!
Highlights
JIT Improvements (#6620, #6640, #6649, #6668)
CuPy JIT has been further enhanced thanks to @leofang and @eternalphane!
It is now possible to use CUDA cooperative groups and access the .shape and .strides attributes of ndarrays.
import cupy
from cupyx import jit

@jit.rawkernel()
def kernel(x, y):
    size = x.shape[0]
    ntid = jit.gridDim.x * jit.blockDim.x
    tid = jit.blockIdx.x * jit.blockDim.x + jit.threadIdx.x
    for i in range(tid, size, ntid):
        y[i] = x[i]

    g = jit.cg.this_thread_block()
    g.sync()

x = cupy.arange(200, dtype=cupy.int64)
y = cupy.zeros((200,), dtype=cupy.int64)
kernel[2, 32](x, y)

print(kernel.cached_code)
The above program emits the CUDA code as follows:
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

extern "C" __global__ void kernel(CArray<long long, 1, true, true> x, CArray<long long, 1, true, true> y) {
  ptrdiff_t i;
  ptrdiff_t size = thrust::get<0>(x.get_shape());
  unsigned int ntid = (gridDim.x * blockDim.x);
  unsigned int tid = ((blockIdx.x * blockDim.x) + threadIdx.x);
  for (ptrdiff_t __it = tid, __stop = size, __step = ntid; __it < __stop; __it += __step) {
    i = __it;
    y[i] = x[i];
  }
  cg::thread_block g = cg::this_thread_block();
  g.sync();
}
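The kernel above uses a grid-stride loop: each thread starts at its global index and advances by the total number of threads, so a launch with fewer threads than elements still covers the whole array. The following is a CPU-only sketch of that indexing in plain Python, using the same launch configuration (2 blocks of 32 threads, 200 elements). It does not use CuPy, and `simulate_kernel` is a hypothetical helper introduced only for illustration.

```python
# CPU-only emulation of the grid-stride loop used by the kernel above.
# No GPU or CuPy is required; simulate_kernel is a hypothetical helper.

def simulate_kernel(x, grid_dim, block_dim):
    """Copy x into a new list the way the kernel's threads would."""
    size = len(x)
    y = [0] * size
    writes = [0] * size          # how many times each element was written
    ntid = grid_dim * block_dim  # total number of threads in the grid
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            tid = block_idx * block_dim + thread_idx
            # Each thread starts at its global id and strides by ntid.
            for i in range(tid, size, ntid):
                y[i] = x[i]
                writes[i] += 1
    return y, writes


x = list(range(200))
y, writes = simulate_kernel(x, grid_dim=2, block_dim=32)
print(y == x)                       # True: every element was copied
print(all(w == 1 for w in writes))  # True: each element written exactly once
```

With 64 threads and 200 elements, threads 0 to 7 handle four elements each and the remaining threads handle three, which is why no bounds check beyond the loop condition is needed.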
Initial MPI and sparse matrix support in cupyx.distributed (#6628, #6658)
CuPy v10 added the cupyx.distributed API to perform interprocess communication using NCCL in a way similar to MPI. In CuPy v11 we are extending this API to support sparse matrices as defined in cupyx.scipy.sparse. Currently only the send/recv primitives are supported, but we will add support for collective calls in upcoming releases.
Additionally, it is now possible to use MPI (through the mpi4py Python package) to initialize the NCCL communicator. This avoids launching the TCP server that is otherwise used to exchange CPU values. Moreover, we recommend enabling MPI for sparse matrix communication, because every communication call has to exchange metadata, which leads to device synchronization if MPI is not enabled.
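# run with mpiexec -n N python …
from mpi4py import MPI  # importing the MPI submodule initializes MPI
import cupyx.distributed

mpi_comm = MPI.COMM_WORLD
workers = mpi_comm.Get_size()  # total number of processes
rank = mpi_comm.Get_rank()     # index of this process
comm = cupyx.distributed.init_process_group(workers, rank, use_mpi=True)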
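To illustrate why sending a sparse matrix involves a metadata exchange, here is a conceptual, CPU-only sketch in plain Python. A CSR matrix travels as three arrays (data, indices, indptr), and the receiver cannot size its buffers until it knows the shape and the number of stored elements. The message layout and the helpers `dense_to_csr`, `pack_csr` and `unpack_csr` are assumptions made for this illustration; they are not the cupyx.distributed implementation or API.

```python
# Conceptual sketch only: this is NOT how cupyx.distributed is implemented.
# dense_to_csr, pack_csr and unpack_csr are hypothetical helpers.
import json


def dense_to_csr(rows):
    """Convert a dense list of lists into CSR arrays plus the shape."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        indptr.append(len(data))
    shape = (len(rows), len(rows[0]) if rows else 0)
    return data, indices, indptr, shape


def pack_csr(data, indices, indptr, shape):
    """Build a small metadata message and the three payload arrays."""
    meta = json.dumps({"format": "csr", "shape": list(shape), "nnz": len(data)})
    return meta, (data, indices, indptr)


def unpack_csr(meta, payload):
    """Rebuild the dense matrix on the receiving side from the metadata."""
    info = json.loads(meta)
    n_rows, n_cols = info["shape"]
    data, indices, indptr = payload
    assert len(data) == info["nnz"]  # metadata tells the receiver the sizes
    dense = [[0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for k in range(indptr[r], indptr[r + 1]):
            dense[r][indices[k]] = data[k]
    return dense


matrix = [[0, 2, 0], [0, 0, 0], [5, 0, 7]]
meta, payload = pack_csr(*dense_to_csr(matrix))
print(meta)
print(unpack_csr(meta, payload) == matrix)
```

In this sketch the metadata is a small host-side value, which is the kind of data that is cheaper to pass over a CPU channel such as MPI than to route through the device.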
Announcements
Introduction of generic cupy-wheel (EXPERIMENTAL) (#6012)
We have added a new package to PyPI called cupy-wheel. This meta package allows other libraries to declare a dependency on CuPy while transparently installing the CuPy binary wheel that matches the user's environment. Users can also install CuPy through this package instead of manually specifying a CUDA/ROCm version.
pip install cupy-wheel
This package is only available for stable releases, as the current pre-release wheels are not hosted on PyPI.
This feature is currently experimental and subject to change, so we recommend that users not distribute packages relying on it for now. Suggestions and comments are very welcome (please visit #6688).
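Because the meta package selects a concrete binary wheel, it can be useful to check which CuPy distribution actually ended up in the environment. The following is a minimal sketch using only the standard library's importlib.metadata; `find_cupy_distributions` is a hypothetical helper, and matching on the "cupy" name prefix is an assumption based on the wheel names listed for this release (for example cupy_cuda116).

```python
# Minimal sketch: list installed distributions whose name starts with "cupy".
# find_cupy_distributions is a hypothetical helper, not part of CuPy.
from importlib import metadata


def find_cupy_distributions():
    """Return sorted (name, version) pairs for installed cupy* distributions."""
    found = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name and name.lower().startswith("cupy"):
            found.add((name, dist.metadata.get("Version", "")))
    return sorted(found)


for name, version in find_cupy_distributions():
    print(name, version)
```

If no CuPy package is installed, the function returns an empty list and nothing is printed.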
Changes
New Features
- Support cooperative group in JIT compiler (#6620)
- Add support for sparse matrices in cupyx.distributed (#6628)
- JIT: Support compile-time for-loop unrolling (#6649)
- JIT: Support .shape and .strides (#6668)
Enhancements
- Add a few driver/runtime/nvrtc API wrappers (#6604)
- Implement flatten(order) (#6613)
- Implemented a __repr__ for cupyx.profiler._time._PerfCaseResult (#6617)
- JIT: Avoid calling default constructor if possible (#6619)
- Add missing cudaDevAttrMemoryPoolsSupported to hip (#6621)
- Add CC 3.2 to Tegra arch list (#6631)
- JIT: Add more cooperative group APIs (#6640)
- JIT: Add kernel.cached_code test (#6643)
- Use MPI for management in cupyx.distributed (#6658)
- Improve warning message in sparse (#6669)
Performance Improvements
- Improve copy and assign operation (#6181)
- Performance improvement of cupy.intersect1d (#6586)
Bug Fixes
- Define float16::operator-() only for ROCm 5.0+ (#6624)
- JIT: fix access to cached codes (#6639)
- Fix cuda python CI (#6652)
- Fix int64 overflow in cupy.polyval (#6664)
- JIT: Disable memcpy_async on CUDA 11.0 (#6671)
Documentation
- Add --pre option to instructions installing pre-releases (#6612)
- JIT: fix function signatures in the docs (#6648)
- Fix typo in performance guide (#6657)
Installation
- Add universal CuPy package (#6012)
Tests
- Run daily benchmark with head branch against latest release (#6598)
- CI: Trigger FlexCI for hotfix branches (#6625)
- Remove jenkins requirements (#6632)
- Fix TestIncludesCompileCUDA for HEAD tests (#6646)
- Trigger CUDA Python tests with /test mini (#6653)
- Fix missing f prefix on f-strings fix (#6674)
Contributors
The CuPy Team would like to thank all those who contributed to this release!
@asi1024 @code-review-doctor @danielg1111 @davidegavio @emcastillo @eternalphane @kmaehashi @leofang @okuta @takagi @toslunar
1、 cupy_cuda102-11.0.0b2-cp310-cp310-manylinux1_x86_64.whl 60.59MB
2、 cupy_cuda102-11.0.0b2-cp310-cp310-manylinux2014_aarch64.whl 34.84MB
3、 cupy_cuda102-11.0.0b2-cp310-cp310-win_amd64.whl 42.51MB
4、 cupy_cuda102-11.0.0b2-cp37-cp37m-manylinux1_x86_64.whl 59.06MB
5、 cupy_cuda102-11.0.0b2-cp37-cp37m-manylinux2014_aarch64.whl 33.14MB
6、 cupy_cuda102-11.0.0b2-cp37-cp37m-win_amd64.whl 42.42MB
7、 cupy_cuda102-11.0.0b2-cp38-cp38-manylinux1_x86_64.whl 62.25MB
8、 cupy_cuda102-11.0.0b2-cp38-cp38-manylinux2014_aarch64.whl 36.29MB
9、 cupy_cuda102-11.0.0b2-cp38-cp38-win_amd64.whl 42.51MB
10、 cupy_cuda102-11.0.0b2-cp39-cp39-manylinux1_x86_64.whl 60.51MB
11、 cupy_cuda102-11.0.0b2-cp39-cp39-manylinux2014_aarch64.whl 34.79MB
12、 cupy_cuda102-11.0.0b2-cp39-cp39-win_amd64.whl 42.51MB
13、 cupy_cuda110-11.0.0b2-cp310-cp310-manylinux1_x86_64.whl 75.21MB
14、 cupy_cuda110-11.0.0b2-cp310-cp310-win_amd64.whl 57.09MB
15、 cupy_cuda110-11.0.0b2-cp37-cp37m-manylinux1_x86_64.whl 73.68MB
16、 cupy_cuda110-11.0.0b2-cp37-cp37m-win_amd64.whl 57MB
17、 cupy_cuda110-11.0.0b2-cp38-cp38-manylinux1_x86_64.whl 76.87MB
18、 cupy_cuda110-11.0.0b2-cp38-cp38-win_amd64.whl 57.09MB
19、 cupy_cuda110-11.0.0b2-cp39-cp39-manylinux1_x86_64.whl 75.14MB
20、 cupy_cuda110-11.0.0b2-cp39-cp39-win_amd64.whl 57.09MB
21、 cupy_cuda111-11.0.0b2-cp310-cp310-manylinux1_x86_64.whl 94MB
22、 cupy_cuda111-11.0.0b2-cp310-cp310-win_amd64.whl 76.83MB
23、 cupy_cuda111-11.0.0b2-cp37-cp37m-manylinux1_x86_64.whl 92.46MB
24、 cupy_cuda111-11.0.0b2-cp37-cp37m-win_amd64.whl 76.74MB
25、 cupy_cuda111-11.0.0b2-cp38-cp38-manylinux1_x86_64.whl 95.66MB
26、 cupy_cuda111-11.0.0b2-cp38-cp38-win_amd64.whl 76.84MB
27、 cupy_cuda111-11.0.0b2-cp39-cp39-manylinux1_x86_64.whl 93.92MB
28、 cupy_cuda111-11.0.0b2-cp39-cp39-win_amd64.whl 76.83MB
29、 cupy_cuda112-11.0.0b2-cp310-cp310-manylinux1_x86_64.whl 75.63MB
30、 cupy_cuda112-11.0.0b2-cp310-cp310-win_amd64.whl 57.58MB
31、 cupy_cuda112-11.0.0b2-cp37-cp37m-manylinux1_x86_64.whl 74.09MB
32、 cupy_cuda112-11.0.0b2-cp37-cp37m-win_amd64.whl 57.49MB
33、 cupy_cuda112-11.0.0b2-cp38-cp38-manylinux1_x86_64.whl 77.29MB
34、 cupy_cuda112-11.0.0b2-cp38-cp38-win_amd64.whl 57.58MB
35、 cupy_cuda112-11.0.0b2-cp39-cp39-manylinux1_x86_64.whl 75.55MB
36、 cupy_cuda112-11.0.0b2-cp39-cp39-win_amd64.whl 57.58MB
37、 cupy_cuda113-11.0.0b2-cp310-cp310-manylinux1_x86_64.whl 72.8MB
38、 cupy_cuda113-11.0.0b2-cp310-cp310-win_amd64.whl 54.31MB
39、 cupy_cuda113-11.0.0b2-cp37-cp37m-manylinux1_x86_64.whl 71.27MB
40、 cupy_cuda113-11.0.0b2-cp37-cp37m-win_amd64.whl 54.22MB
41、 cupy_cuda113-11.0.0b2-cp38-cp38-manylinux1_x86_64.whl 74.47MB
42、 cupy_cuda113-11.0.0b2-cp38-cp38-win_amd64.whl 54.31MB
43、 cupy_cuda113-11.0.0b2-cp39-cp39-manylinux1_x86_64.whl 72.73MB
44、 cupy_cuda113-11.0.0b2-cp39-cp39-win_amd64.whl 54.31MB
45、 cupy_cuda114-11.0.0b2-cp310-cp310-manylinux1_x86_64.whl 81.28MB
46、 cupy_cuda114-11.0.0b2-cp310-cp310-win_amd64.whl 63MB
47、 cupy_cuda114-11.0.0b2-cp37-cp37m-manylinux1_x86_64.whl 79.75MB
48、 cupy_cuda114-11.0.0b2-cp37-cp37m-win_amd64.whl 62.91MB
49、 cupy_cuda114-11.0.0b2-cp38-cp38-manylinux1_x86_64.whl 82.94MB
50、 cupy_cuda114-11.0.0b2-cp38-cp38-win_amd64.whl 63.01MB
51、 cupy_cuda114-11.0.0b2-cp39-cp39-manylinux1_x86_64.whl 81.2MB
52、 cupy_cuda114-11.0.0b2-cp39-cp39-win_amd64.whl 63MB
53、 cupy_cuda115-11.0.0b2-cp310-cp310-manylinux1_x86_64.whl 78MB
54、 cupy_cuda115-11.0.0b2-cp310-cp310-win_amd64.whl 59.68MB
55、 cupy_cuda115-11.0.0b2-cp37-cp37m-manylinux1_x86_64.whl 76.46MB
56、 cupy_cuda115-11.0.0b2-cp37-cp37m-win_amd64.whl 59.59MB
57、 cupy_cuda115-11.0.0b2-cp38-cp38-manylinux1_x86_64.whl 79.66MB
58、 cupy_cuda115-11.0.0b2-cp38-cp38-win_amd64.whl 59.69MB
59、 cupy_cuda115-11.0.0b2-cp39-cp39-manylinux1_x86_64.whl 77.92MB
60、 cupy_cuda115-11.0.0b2-cp39-cp39-win_amd64.whl 59.68MB
61、 cupy_cuda116-11.0.0b2-cp310-cp310-manylinux1_x86_64.whl 78.04MB
62、 cupy_cuda116-11.0.0b2-cp310-cp310-win_amd64.whl 59.7MB
63、 cupy_cuda116-11.0.0b2-cp37-cp37m-manylinux1_x86_64.whl 76.5MB
64、 cupy_cuda116-11.0.0b2-cp37-cp37m-win_amd64.whl 59.61MB
65、 cupy_cuda116-11.0.0b2-cp38-cp38-manylinux1_x86_64.whl 79.7MB
66、 cupy_cuda116-11.0.0b2-cp38-cp38-win_amd64.whl 59.71MB
67、 cupy_cuda116-11.0.0b2-cp39-cp39-manylinux1_x86_64.whl 77.96MB
68、 cupy_cuda116-11.0.0b2-cp39-cp39-win_amd64.whl 59.7MB
69、 cupy_rocm_4_2-11.0.0b2-cp310-cp310-manylinux1_x86_64.whl 34.64MB
70、 cupy_rocm_4_2-11.0.0b2-cp37-cp37m-manylinux1_x86_64.whl 33.31MB
71、 cupy_rocm_4_2-11.0.0b2-cp38-cp38-manylinux1_x86_64.whl 36.11MB
72、 cupy_rocm_4_2-11.0.0b2-cp39-cp39-manylinux1_x86_64.whl 34.56MB
73、 cupy_rocm_4_3-11.0.0b2-cp310-cp310-manylinux1_x86_64.whl 36.22MB
74、 cupy_rocm_4_3-11.0.0b2-cp37-cp37m-manylinux1_x86_64.whl 34.9MB
75、 cupy_rocm_4_3-11.0.0b2-cp38-cp38-manylinux1_x86_64.whl 37.7MB
76、 cupy_rocm_4_3-11.0.0b2-cp39-cp39-manylinux1_x86_64.whl 36.15MB
77、 cupy_rocm_5_0-11.0.0b2-cp310-cp310-manylinux1_x86_64.whl 54.29MB
78、 cupy_rocm_5_0-11.0.0b2-cp37-cp37m-manylinux1_x86_64.whl 52.96MB
79、 cupy_rocm_5_0-11.0.0b2-cp38-cp38-manylinux1_x86_64.whl 55.77MB
80、 cupy_rocm_5_0-11.0.0b2-cp39-cp39-manylinux1_x86_64.whl 54.22MB