v2.1.1
Release date: 2021-07-01 16:43:33
2.1.1 Release Note
Important Updates
This release mainly fixes functional and performance issues found in 2.1.0 and strengthens several features. Highlights:
- Completed the API visibility optimization for the `paddle.distributed`, `paddle.device`, and `paddle.vision` namespaces.
- Dynamic-to-static conversion now supports user code in sublayers of the `paddle.nn.Sequential` container.
- Added AMP support for `SyncBatchNorm` in dynamic graph mode, improving the performance of the dynamic-graph `SyncBatchNorm` layer under AMP.
Training Framework
Functional Optimization (Including Distributed)
Basic API
- Added a recommended usage path for the `paddle.distributed`, `paddle.device`, and `paddle.vision` namespaces; see the 2.1.0 Release Note for details. (#33420)
- Added `paddle.is_compiled_with_rocm`. (#33228)
- Added bool-type input support for `paddle.strided_slice`. (#33373)
- Added bool-type input support for `paddle.equal_all`, `paddle.equal`, `paddle.greater_equal`, `paddle.greater_than`, `paddle.less_equal`, `paddle.less_than`, and `paddle.not_equal`. (#33551)
- Fixed `paddle.utils.download` not retrying on ConnectionError. (#33454)
- Fixed an infershape error in `paddle.gather` when axis is not 0. (#33553)
- Fixed a segmentation fault in `paddle.io.DataLoader` when `num_workers=0` and the `Dataset` yields GPU `Tensor`s that are fed to the `DataLoader`. (#33487, #33249)
- Fixed the backward pass reporting an unrelated error message when the result of a `slice` operation is used as the left-hand value of an inplace operation. (#32981)
- Fixed an error in `paddle.concat` with uint8 input in dynamic graph mode. (#33667)
- Fixed GPU memory overflow and abnormal output in `paddle.grid_sample`. (#33100, #33232)
- Fixed `roi_align` in align=True mode: when the input width or height of the rois is 0, the output feature should be 0. (#33446)
- Fixed `log_softmax` turning its input into nan in certain corner cases. (#32937)
Dynamic Graph to Static Graph
- Added support for dynamic-to-static conversion of user code in sublayers of the `paddle.nn.Sequential` container. (#33065)
- Fixed incorrect handling of Subscript syntax during the static type analysis phase of control-flow `for` statement conversion. (#32969)
- Refactored the dynamic-to-static `param_guard` logic to comprehensively solve `Tensor` type conversion between dynamic and static graphs. (#32985)
Distributed Training
- Fixed an error in `paddle.distributed.spawn` when using the default `nprocs` argument. (#33249)
- Fixed a hang at training startup caused by inconsistent creation of pipeline-parallel communication groups. (#32890, #33473)
- Fixed a failure to save parameters in hybrid parallelism. (#33595, #33588)
- Fixed the Fleet API being unable to run a `Program` directly. (#33511)
- Fixed a hang caused by uneven sample bucketing in the pure-GPU training mode of the heterogeneous parameter server. (#32957)
Hybrid Parallelism with Dynamic Graph
- Fixed an accuracy issue in `TensorParallel` by changing its parameter initialization method to guarantee the randomness of the parameters after slicing. (#33087)
- Fixed an accuracy issue in `PipeLineParallel` caused by incorrect use of `microbatch`. (#33097)
- Fixed a hang when the `new_group` API creates multiple communication groups. (#33553)
Mixed Precision Training
- Added AMP support for `SyncBatchNorm` in dynamic graph mode, improving the performance of the dynamic-graph `SyncBatchNorm` layer under AMP; the 8-card AMP speedup ratio of the `DeepLabV3P` model in PaddleSeg improves by 19%. (#33709)
Custom OP
- Removed the dependency on the `PADDLE_WITH_MKLDNN` macro when compiling custom OPs. (#32903)
- Set `GLIBCXX_USE_CXX11_ABI=1` by default, to resolve possible compile errors caused by low GCC versions. (#33185)
- Added support for C++14 syntax features, and enabled the `-std=c++14` compile option by default. (#33227)
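As a hedged illustration of these compile-option changes, a custom-OP `setup.py` (the package and file names below are hypothetical) now picks up `-std=c++14` and `GLIBCXX_USE_CXX11_ABI=1` from the defaults without any extra flags:

```python
# setup.py: build a custom OP with paddle's cpp_extension helpers;
# no explicit -std=c++14 or ABI flag is needed as of this release
from paddle.utils.cpp_extension import CppExtension, setup

setup(
    name='custom_relu',  # hypothetical package name
    ext_modules=CppExtension(sources=['custom_relu_op.cc']),  # hypothetical source file
)
```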
Others
- Fixed a random segmentation fault during training when `LoDTensorArray` is an Op input under multi-threading. (#32984)
- Fixed parameter regularization being executed twice when the regularizer of `paddle.ParamAttr` and the `weight_decay` of `paddle.optimizer.Momentum` are both specified as `L2Decay`. (#32881)
- Fixed garbled warning messages on Windows. (#33689)
Inference Deployment
Model Quantization
- Fixed OP quantization being skipped in the dynamic graph quantization training function. (#32879)
- Fixed `layer_norm` not saving the `out_threshold` attribute when a quantized model is saved. (#33610)
Paddle Inference
Function Upgrades
- Added a Paddle-TRT converter/plugin for `gather_nd` and `reduce_sum`. (#33365)
- Added `reshape` support in Paddle-TRT. (#33372)
Performance Optimization
- Added a TensorRT `layer_norm` dynamic-shape plugin to improve inference performance for models with dynamic shapes. (#33448)
Usability Optimization
- Added a prediction example document for the ROCm build of Paddle Inference, and added ROCm-related version information to the C++ prediction library's version.txt. (#33290)
- Updated the XPU compilation options; see #33581 for the specific options.
Bug Fixes
- Fixed incorrect results from `fused_fc_elementwise_layernorm` caused by too large a thread count on Hygon DCU. (#33299)
- Fixed the yolov3 model failing to run with GPU enabled on Jetson Nano and Jetson TX2. (#33442)
- Fixed a computation error in the Paddle-TensorRT `multihead_matmul` plugin when seq_len > 1024. (#33365)
- Fixed incorrect output of the ERNIE model under variable-length input when the input order is inconsistent. (#33622)
- Fixed an error when running OCR model prediction on GPU. (#33431)
- Fixed `paddle.static.io.normalize_program` not exporting `paddle.static.normalize_program`. (#33408)
- Fixed conv with stride > 1 failing on TensorRT 6.0. (#33198)
- Fixed an out-of-bounds GPU memory access when batch-predicting images. (#33370, #33531)
- Fixed the MKLDNN cache size setting not taking effect on X86 CPU. (#33571)
- Fixed an incorrect dimension setting in the TensorRT `conv2d_transpose` op converter. (#33242)
- Fixed incorrect results from prediction libraries compiled per CUDA Arch on Jetson devices; this release ships per-Arch Jetson prediction libraries for users who need a smaller library size. (#33269)
- Fixed an error about an unset calibration table path when loading a PaddleSlim quantized model from memory for prediction. (#33629)
- Fixed BERT/ERNIE reporting cuda error 400 when using TensorRT prediction on a non-0 card. (#33706)
- Fixed a cmake syntax error triggered by setting custom compilation parameters on Linux. (#33621)
- Optimized the computation precision of `layer_norm` and fixed Nan output on large input data. (#33420)
- Fixed an incorrect opt path on Windows when a model path using backslashes as separators is passed to TensorRT inference. (#33885)
Environment Adaptation
New Hardware Adaptation
Kunlun Hardware Training Support
- Fixed the `gather` op, and added support for `logsumexp`. (#32931)
Thanks to our Contributors
This release contains contributions from: Aurelius84, cc, ceci3, Chen Weihang, danleifeng, feng_shuai, houj04, jiangcheng, JZ-LIANG, Kaipeng Deng, lidanqing, LielinJiang, Lijunhui, lilong12, liuyuhui, liym27, Pei Yang, Peihan, Qi Li, Ren Wei (任卫), Roc, Shang Zhizhou, ShenLiang, Shibo Tao, TeslaZhao, tianshuo78520a, TTerror, wangguanzhong, Wangzheee, wawltor, WeiXin, wenbin, Wenyu, whs, Wilber, wuhuanzhou, Zhang Ting, zhiboniu, Zhou Wei, zhoujun, 李季, 王明冬