v2.2.1
Release date: 2021-12-08 13:04:30
Latest PaddlePaddle/Paddle release: v3.0.0-beta0 (2024-06-27 18:00:34)
2.2.1 Release Note
1. Important Updates
This release fixes a number of functional and performance issues in PaddlePaddle 2.2.0 and enhances several features. The highlights are as follows:
- Add `paddle.linalg.triangular_solve` to solve systems of linear equations with triangular coefficient matrices.
- Add the `paddle.device.cuda.graphs.CUDAGraph` API, which supports NVIDIA's CUDA Graph feature. Note that this API is still experimental and not yet stable.
- Fix known issues in basic APIs and Tensor indexing.
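To make the first highlight concrete: a system whose coefficient matrix is triangular can be solved by substitution alone, without general elimination. The following is a minimal pure-Python sketch of back substitution for an upper-triangular matrix. It only illustrates the underlying math; it is not Paddle's implementation, and `solve_upper_triangular` is a hypothetical helper name.

```python
def solve_upper_triangular(a, b):
    """Solve a @ x = b for x, where a is an n x n upper-triangular matrix.

    Hypothetical helper for illustration only (not a Paddle API).
    """
    n = len(a)
    x = [0.0] * n
    # Back substitution: the last equation has a single unknown, so start there
    # and work upwards, reusing the values already solved.
    for i in range(n - 1, -1, -1):
        if a[i][i] == 0:
            raise ValueError("singular matrix: zero on the diagonal")
        known = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - known) / a[i][i]
    return x


a = [[1.0, 1.0, 1.0],
     [0.0, 2.0, 1.0],
     [0.0, 0.0, -1.0]]
b = [0.0, -9.0, 5.0]
x = solve_upper_triangular(a, b)
```

For this input, `x` is `[7.0, -2.0, -5.0]`.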
2. Training Framework (Including Distributed Training)
(1) New Functions
API
- Add the `paddle.linalg.triangular_solve` API to solve systems of linear equations with triangular coefficient matrices. (#36714)
- Add the `paddle.device.cuda.graphs.CUDAGraph` API, which supports NVIDIA's CUDA Graph feature. It captures all GPU computation into a single CUDA Graph that can then be replayed many times, removing the framework's extra overhead and improving runtime performance. Note that this API is still experimental and not yet stable. (#37109)
- Add the `paddle.incubate.graph_send_recv` API for graph learning. It reduces the GPU or host memory consumed by intermediate variables during message passing, and supports four update modes: SUM, MEAN, MIN, and MAX. (#37205)
- Add the `paddle.incubate.operators.ResNetUnit` API, which fuses the convolution, batch normalization, and shortcut/bottleneck operations in ResNet networks. (#37109)
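The message-passing pattern behind `paddle.incubate.graph_send_recv` can be sketched in plain Python: each edge sends the source node's feature row to the destination node, and each destination reduces what it receives with one of the four modes. This is a conceptual sketch only. The helper name `send_recv` and the choice to give nodes with no incoming messages a row of zeros are assumptions for illustration, and the sketch deliberately materializes the per-node message lists that a fused operator is meant to avoid.

```python
def send_recv(x, src_index, dst_index, pool_type="sum"):
    """Gather rows of x along src_index and reduce them per dst_index.

    Hypothetical helper for illustration only (not a Paddle API).
    """
    num_nodes, width = len(x), len(x[0])
    # Collect, per destination node, the feature rows sent to it.
    inbox = [[] for _ in range(num_nodes)]
    for src, dst in zip(src_index, dst_index):
        inbox[dst].append(x[src])

    reducers = {
        "sum": sum,
        "mean": lambda col: sum(col) / len(col),
        "min": min,
        "max": max,
    }
    reduce_fn = reducers[pool_type]

    out = []
    for messages in inbox:
        if not messages:
            # Assumption: a node that receives nothing gets a row of zeros.
            out.append([0] * width)
        else:
            # zip(*messages) iterates column by column across the messages.
            out.append([reduce_fn(col) for col in zip(*messages)])
    return out


features = [[0, 2, 3], [1, 4, 5], [2, 6, 7]]
src = [0, 1, 2, 0]
dst = [1, 2, 1, 0]
summed = send_recv(features, src, dst, "sum")
```

Here node 1 receives the rows of nodes 0 and 2, so `summed` is `[[0, 2, 3], [2, 8, 10], [1, 4, 5]]`.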
(2) Function Optimization
API
- `paddle.incubate.FusedTransformerEncoderLayer` now supports `src_mask=None` and pure fp16. (#37229)
IR (Intermediate Representation)
- Dynamic Graph to Static Graph
  - When `@paddle.jit.to_static` decorates a standalone function, `train()` and `eval()` functions are now provided for switching between train and eval mode. (#37383)
Distributed Training
- Heterogeneous parameter server: improve support for splitting the graph an arbitrary number of times and add pipeline training, which increases training throughput. (#37446)
Others
- Strengthen the out-of-bounds check on the `index` of `paddle.scatter`, which could previously cause a core dump when out of range, and improve the corresponding error message. (#37431)
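The idea behind the `paddle.scatter` fix is to validate every index before writing, so that a bad index produces a descriptive error instead of an out-of-range memory write. A minimal pure-Python sketch of that pattern follows. The helper name `scatter_rows`, the overwrite-only behavior, and the assumption that valid indices run from 0 to the number of rows minus 1 are all illustrative and not taken from Paddle's source.

```python
def scatter_rows(x, index, updates):
    """Return a copy of x in which row index[i] is replaced by updates[i].

    Hypothetical helper for illustration only (not a Paddle API).
    """
    if len(index) != len(updates):
        raise ValueError("index and updates must have the same length")
    out = [list(row) for row in x]
    for pos, i in enumerate(index):
        # Check the bound before writing and say which entry was wrong.
        if not 0 <= i < len(out):
            raise IndexError(
                f"index[{pos}] = {i} is out of range for a tensor with {len(out)} rows"
            )
        out[i] = list(updates[pos])
    return out


result = scatter_rows([[1, 1], [2, 2], [3, 3]], [2, 0], [[9, 9], [8, 8]])
```

With these inputs `result` is `[[8, 8], [2, 2], [9, 9]]`, while an index such as 3 raises `IndexError` with a message naming the offending position.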
(3) Performance Optimization
- Optimize `paddle.top_k` to select an implementation based on the sizes of `k` and `input_width`: the cub implementation when k >= 75% of input_width, and the handwritten kernel implementation otherwise. (#37325)
- Optimize `paddle.fluid.optimizer.LarsMomentumOptimizer`, improving OP performance through optimizer operator fusion combined with CUDA Cooperative Groups. (#37109)
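The selection rule stated for `paddle.top_k` amounts to a one-line dispatch on the ratio of `k` to `input_width`. The sketch below only restates that rule in Python; the function name `choose_topk_backend` and the returned labels are hypothetical, and the real dispatch happens inside Paddle's GPU kernel code.

```python
def choose_topk_backend(k, input_width):
    """Pick an implementation label using the k >= 75% * input_width rule.

    Hypothetical helper for illustration only (not a Paddle API).
    """
    if not 1 <= k <= input_width:
        raise ValueError("k must satisfy 1 <= k <= input_width")
    # 4 * k >= 3 * input_width is k >= 0.75 * input_width in integer
    # arithmetic, which avoids floating-point rounding at the boundary.
    return "cub" if 4 * k >= 3 * input_width else "handwritten_kernel"


large_k = choose_topk_backend(75, 100)
small_k = choose_topk_backend(10, 100)
```

Here `large_k` is `"cub"` and `small_k` is `"handwritten_kernel"`.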
(4) Bug Fixes
API
- Fix the formula used by `paddle.nn.ELU` and `paddle.nn.functional.elu`, which produced wrong results when alpha < 0. Note that the inplace version, `paddle.nn.functional.elu_`, does not support alpha < 0 and raises an error in that case. ([#37437](https://github.com/PaddlePaddle/Paddle/pull/37437))
- Fix the `out_of_range` error raised when `paddle.slice` runs its backward pass. (#37584)
- `paddle.shape` does not support backward; `stop_gradient` is now explicitly set to `True`. (#37412)
- `paddle.arange` does not support backward; `stop_gradient` is now explicitly set to `True`. (#37486)
- `paddle.shard_index` now reports an error when the last dimension of the input data is not 1. (#37421)
- Fix the wrong dimension used in dequantization when `paddle.matmul` uses int8 quantization. (#36982)
- Fix the issue that `paddle.nn.Dropout` does not compute gradients in `eval` mode. (#37305)
- Fix the issue that `paddle.nn.functional.dropout`, in static graph mode, reports an error when the input `Tensor` shape contains -1 and that dimension is the one specified to be dropped. (#37223)
- Fix incorrect backward computation in the RNN APIs `paddle.nn.LSTM`, `paddle.nn.GRU`, and `paddle.nn.SimpleRNN` for multi-layer RNNs (with dropout set to 0) during CPU training. (#37086)
- Fix several issues in `paddle.incubate.FusedTransformerEncoderLayer`: incorrect gradients in the backward pass, incorrect handling of pre_layer_norm, incorrect parameter handling, parameters that were not passed through, and an add_bias calculation error. (#37229)
- Fix the issue that `paddle.incubate.fused_multi_head_attention` does not support `bias` being `None`. (#37411, #37566)
- Fix the unordered data loading of `paddle.vision.datasets.Cifar10` and `paddle.vision.datasets.Cifar100`. (#37528)
- Fix the dimension-check error raised when a one-dimensional `Tensor` is indexed with an ellipsis (...). (#37192)
- Fix the issue that the gradient attribute cannot be propagated through `Tensor` index assignment (`setitem`); see the issue for details. (#37028)
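The ELU item above is easier to follow with the formula written out. ELU is defined piecewise as x for x > 0 and alpha * (exp(x) - 1) otherwise. The release note does not say what the earlier formula was, so the clamped variant below is an assumption, shown only as one plausible way an implementation can agree with the definition for alpha >= 0 and still go wrong for alpha < 0. Both function names are hypothetical.

```python
import math


def elu(x, alpha=1.0):
    """Piecewise ELU definition; valid for any sign of alpha."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def elu_clamped(x, alpha=1.0):
    """Assumed buggy variant: matches elu() only when alpha >= 0.

    For alpha < 0 and x < 0 the term alpha * (exp(x) - 1) is positive,
    so min(0, ...) clamps it to 0 and the result is lost.
    """
    return max(0.0, x) + min(0.0, alpha * (math.exp(x) - 1.0))


same_for_positive_alpha = elu(-1.0, 1.0) == elu_clamped(-1.0, 1.0)
correct_negative_alpha = elu(-1.0, -1.0)
clamped_negative_alpha = elu_clamped(-1.0, -1.0)
```

With alpha = -1 and x = -1, the piecewise definition gives a positive value of about 0.632, while the clamped variant returns 0.0.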
IR(Intermediate Representation)
- Dynamic Graph to Static Graph
Distributed Training
- `fleet.load_model`: Fix the issue that the model-loading API is unavailable in parameter server mode. (#37461)
- `fleet.save_inference_model`: Fix the issue that, in parameter server mode, parameters are not pulled from the server side before dense parameters are saved. (#37461)
Others
- Fix a dynamic graph inplace issue: when an inplace operation is performed on a non-leaf node and backward is executed immediately afterwards, the gradients of that node and of the nodes before it are computed incorrectly. (#37420)
3. Paddle Inference
(1) Bug Fixes
- Further remove redundant debug logs when logging is explicitly disabled. (#37212)
- Fix the memory/GPU memory optimization strategies to avoid incorrect prediction results or crashes caused by improper memory/GPU memory optimization. (#37324, #37123)
- Fix the scale calculation error of QkvToContextPluginDynamicscale after fusion of the MultiHead structure in Transformer models, which was caused by wrong block and thread settings in the CUDA function. (#37096)
- Register all inference OPs for int8 quantization, fixing the issue that some inference OPs were not registered for int8 quantization for historical reasons. (#37266)