v2.0.0-rc0

PaddlePaddle/Paddle

Release time: 2020-10-30 11:37:50

Latest release of PaddlePaddle/Paddle: v3.0.0-beta0 (2024-06-27 18:00:34)

2.0-rc0 Release Note

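Important Updates

Compared with 2.0-beta, this release improves further in the following areas:

Default mode: Starting with paddle 2.0-rc, dynamic graph mode is enabled by default. To use static graph programming mode, call paddle.enable_static() to switch to it.
Framework APIs: Renamed 50 commonly used APIs, added 8 basic API implementations, removed 220 APIs (including aliases), added second-order derivative computation to 8 APIs, added Kunlun chip support to more APIs, made the distributed Fleet API official, and enhanced the high-level APIs.
Framework features: Improved dynamic-to-static conversion usage, model saving and loading, mixed precision training and quantization strategies, and distributed training strategies. Removed 6 build dependencies such as nltk; installation packages now support Python 3.8 and CUDA 10.1/10.2.
Inference engine: Enhanced int8 quantization, added operator version information, and strengthened and optimized oneDNN-related functionality and performance.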

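Training Framework

Basic APIs (Including Distributed)

Added APIs

Added the paddle.empty API, which returns uninitialized memory
Added the paddle.empty_like API, which returns uninitialized memory
Added the paddle.mv API, which returns the result of a matrix-vector multiplication
Added the paddle.multinomial multinomial distribution API
Added paddle.nn.LocalResponseNorm and paddle.nn.functional.local_response_norm
Added the paddle.nn.Pad1D/Pad2D/Pad3D APIs, supporting the constant, reflect, replicate and circular modes
Added paddle.add_n
Added the dynamic graph mixed precision training APIs paddle.amp.auto_cast and paddle.amp.GradScaler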

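Fixed and Improved APIs

paddle.reshape API supports bool type input
paddle.distribution.Categorical API gains the sample and log_prob methods
BatchNorm1D, BatchNorm2D, and BatchNorm3D gain support for the channel last data layout
Changed the parameter order of paddle.optimizer.Adam and paddle.optimizer.AdamW
yolo_box supports input feature maps whose H and W are not equal, enabling prediction on images with unequal height and width
paddle.nn.functional.interpolate supports a scale_factor input of type list
Added oneDNN support for the adaptive pool2d operator @intel
Added oneDNN support for dilated conv and dilated conv_transpose @intel
unique supports computation on GPU devices
paddle.multiply supports input of non-variable and tensor data types
paddle.nn.AdaptiveMaxPool1D/2D/3D and paddle.nn.functional.adaptivemaxpool1d/2d/3d: refactored the Python-side Pool API implementation
paddle.set_printoptions supports setting the display options of dynamic graph Tensors
paddle.assign API supports assignment from an array or tensor to a tensor
paddle.nn.functional.swish/paddle.nn.Swish: removed the beta parameter
paddle.nn.functional.thresholded_relu/paddle.nn.ThresholdedReLU: the threshold parameter defaults to 1.0
paddle.norm: after the upgrade, supports fro, inf, -inf, 0, 1, 2, and the p-norm for any positive real number p
RNN classes (SimpleRNN, LSTM, GRU): optimized the parameter order and the implementation of the base class RNNBase, and integrated cudnn lstm
Fixed abnormal GPU gradients of the adaptive_pool op in special output cases
Added second-order derivative support for batch_norm, abs, log, expand, tile, squeeze, unsqueeze, and matmul
Added Kunlun (XPU) training support for more than 50 operators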

API Name Changes

Renamed 50 APIs from 2.0-beta. For details, see link

Removed APIs (Including Aliases)

Removed 220 APIs (including aliases). For details, see link

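Multi-device/Distributed Training APIs

The Fleet API is now official and unified under paddle.distributed.fleet as Paddle's single entry point for general distributed training
paddle.distributed.fleet.DistributedStrategy is exposed as Paddle's unified entry point for defining parallel strategies
Added the paddle.distributed.fleet.meta_optimizer.RecomputeOptimizer API to support recomputation in distributed training
Added the paddle.distributed.fleet.meta_optimizer.GradientMergeOptimizer API to support gradient accumulation in distributed training
Added the paddle.distributed.fleet.meta_optimizer.PipelineOptimizer API to support pipeline parallelism in distributed training
paddle.distributed.fleet.DistributedStrategy gains the amp optimization strategy, which enables automatic mixed precision in distributed training
paddle.distributed.fleet.DistributedStrategy gains the dgc optimization strategy, which enables deep gradient compression in distributed training
paddle.distributed.fleet.DistributedStrategy gains the fp16_allreduce optimization strategy, which enables fp16 allreduce communication in distributed training
paddle.distributed.fleet.DistributedStrategy gains the lars optimization strategy, which supports the lars optimizer for large batch size distributed training
paddle.distributed.fleet.DistributedStrategy gains the lamb optimization strategy, which supports the lamb optimizer for large batch size distributed training
paddle.distributed.fleet supports combining multiple optimization strategies, with more than ten combinations such as amp+recompute, dgc+recompute, and amp+recompute+lars
paddle.distributed.fleet.DistributedStrategy gains the a_sync optimization strategy, which supports synchronous, asynchronous, GeoSGD, and heterogeneous parameter server training with parameter servers
paddle.distributed.fleet.DistributedStrategy gains the experimental auto optimization strategy, which supports automatic parallelism that selects among multiple strategies
Added fleetrun for launching distributed training tasks. It launches Collective mode on single-machine single-card, single-machine multi-card, and multi-machine multi-card setups, launches parameter server mode on CPU clusters, GPU clusters, and heterogeneous clusters, and supports direct submission to PaddleCloud clusters
paddle.distributed.fleet supports dynamic graph execution, including single-machine single-card, single-machine multi-card, and multi-machine multi-card dynamic graph training in GPU mode
paddle.distributed.fleet gains collective communication functions, supporting all_reduce, all_gather, and barrier
paddle.distributed.fleet gains distributed metric computation, including auc, rmse, mae, and acc
In paddle.distributed.fleet, the original fleet.main_program and fleet.startup_program are deprecated and replaced with paddle.static.default_main_program() and paddle.static.default_startup_program()
paddle.distributed.fleet supports heterogeneous parameter server mode, so that the fleet API together with a user-defined network can train on heterogeneous computing devices that cooperate across devices
Distributed collective communication APIs support CPU devices
paddle.distributed.fleet.DistributedStrategy gains the localsgd optimization strategy
paddle.distributed.fleet.DistributedStrategy gains the adaptivelocalsgd optimization strategy, a localsgd strategy that computes the step interval automatically in distributed training
Added InMemoryDataset and QueueDataset to paddle.distributed to support distributed training with Dataset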

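High-level APIs

Added the IterableDataset base class for streaming datasets. DataLoader supports multi-process acceleration of IterableDataset, and paddle.io.get_worker_info() returns the state of the child process so that data can be divided among processes
The places parameter of paddle.io.DataLoader is now optional. If places is not specified, the default places are used
Added 10 map-style datasets such as CIFAR10, CIFAR100, and Conll05st, with automatic dataset download and map-style data access
The DistributedBatchSampler interface gains the num_replicas and rank parameters for specifying the number of cards and the logical index of the current card
Added paddle.io.TensorDataset for reading tensor datasets
Added the paddle.io.Sampler base class, plus SequenceSampler and RandomSampler for fetching data in BatchSampler in sequential or random order
paddle.io.BatchSampler supports a Sampler as input, and the original input parameter indices is removed
Removed the original APIs under paddle.reader
The image transform operators in paddle.vision.transforms gain a PIL backend
paddle.summary supports Layers with multiple inputs and multiple outputs
model.save is upgraded. When saving an inference model in dynamic graph mode, the user no longer needs to call paddle.jit.to_static or add a decorator to the layer function (the dynamic-to-static feature). If inputs is passed in when the Model is initialized, the correct input shape is saved. Otherwise, the input shape is saved according to the inputs passed in when the model is run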

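Function Optimization (Including Distributed)

Basic Functions for Dynamic Graph

Added the clone interface of Tensor, which copies an identical Tensor. The cloned Tensor stays in the computation graph and supports gradient backpropagation
Supported in-place modification of a Tensor through indexing or slicing
Optimized the printing and display of dynamic graph Tensors. High-dimensional tensor display is aligned with numpy, and abbreviated display is supported
Optimized the __call__ method of the initializer class. Passing in a block is no longer required, so users no longer encounter the static graph block concept in dynamic graph mode
Hid the scale_loss and apply_collective_grads methods of the dynamic graph multi-card API DataParallel. Multi-card model code no longer needs to call these two methods, which simplifies the code and improves usability
Added oneDNN dynamic graph support, covering Resnet50 model training and inference. @intel

Dynamic Graph to Static Graph

Migrated the dynamic-to-static API interfaces to 2.0, simplifying the import path
The dynamic-to-static decorator to_static supports directly decorating a model instance, for example to_static(model, input_spec)
Added a default-value resolution mechanism for the name parameter in InputSpec. If name is not specified, the parameter name of the decorated function is used as name
StaticLayer is renamed to StaticFunction
Optimized the dynamic-to-static Debug log
Fixed dynamic-to-static bugs in some scenarios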

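Mixed Precision Training

Refactored the gradient validity check and dynamic loss scaling logic in static graph mixed precision training, and removed some condition block logic

Model Quantization

Added per-channel quantization for dynamic graphs, which computes quantization parameters per channel for the weights of layers such as Conv2D and Linear
Added computation of the output scale parameter of model layers during dynamic graph quantization-aware training, for use in server-side quantized inference deployment

Distributed Training Optimization

Supported pipeline parallel training
Supported heterogeneous distributed training in parameter server mode on PS+GPU, PS+Kunlun, PS+CPU, PS+CPU+GPU (Kunlun) and other device combinations. With a single GPU/Kunlun machine plus 10 CPU machines, a click-through-rate model with hundreds of billions of parameters trains on tens of millions of samples within minutes
Upgraded the large-scale sparse feature to support sparse IDs in the int64 range, self-growing sparse tables, configurable admission conditions, and incremental model saving
Distributed training supports control-flow multitasking, with performance more than 50% higher than instag multitasking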

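Model Saving and Loading

The paddle.jit.save interface can store Layer objects that have not been converted by paddle.jit.to_static, which widens the scenarios where the interface applies
Standardized the set_dict method name of APIs such as Layer and Optimizer, renaming it uniformly to set_state_dict
paddle.load can load the state_dict of a Layer from the result stored by the fluid.io.save_inference_model interface, connecting the interface systems and improving usability
paddle.load can load the state_dict of a Layer from the default result stored by the fluid.io.save_params/persistables interfaces, connecting the interface systems and improving usability
Changed the behavior of the paddle.save/load interfaces. paddle.save no longer adds a suffix to the stored result, and paddle.load returns only one result per call, which standardizes the interface semantics
paddle.jit.TranslatedLayer gains the program method, which returns the program of a model loaded by paddle.jit.load and makes the model structure easier to inspect
Removed paddle.SaveLoadConfig. For compatible-loading scenarios of interfaces such as paddle.jit.save, paddle.jit.load, and paddle.load, pass additional configuration through **kwargs, which simplifies the interfaces
Updated the meaning of the model_path parameter of the paddle.jit.save and paddle.jit.load interfaces. The string supplied by the user is used as the prefix of the stored files rather than as a directory
The original static graph APIs paddle.io.save, paddle.io.load, paddle.io.save_inference_model, and paddle.io.load_inference_model are moved to the paddle.static module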

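Performance Optimization (Including Distributed)

Improved the performance of the Argsort OP when the number of elements in the input Tensor equals the length of its axis dimension. The forward pass is 34 times faster and the backward pass is 10 times faster
Optimized the lars strategy. The time2train metric of ResNet50 distributed multi-card training with a 16k batch size is under 10 minutes
Added the fused_bn_add_act OP, which fuses the batch_norm, elementwise_add and activation OPs
Added the inplace addto strategy for gradient aggregation, which supports in-place gradient accumulation and improves performance by 6.3% in ResNet-50 mixed precision training

Debugging Analysis

Continued to improve the hint text of about 1500 error checks in paddle, making the framework easier to debug

Compiling and Installation

Added support for python3.8 in the installation package
Removed the installation dependency on matplotlib
Removed the installation dependency on graphviz
Removed the installation dependency on objgraph
Removed the installation dependency on netifaces
Removed the installation dependency on nltk
Removed the installation dependency on opencv
Added support for cuda10.1 and cuda10.2 in the installation package
The prediction library supports the cuda10.2-cudnn8-trt7.1 version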

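Bug Fixing

Fixed an error when gradient clipping with GradientClipByGlobalNorm is used in a network whose Paddle default dtype is float64
Fixed the failure to load CUDA-related DLLs with the CUDA 10.1/10.2 builds on Windows
Fixed a bug when copying a Tensor between CUDAPinnedPlace and other Places
Fixed an error when paddle.jit.load loads a Layer without parameters
Fixed a calculation error in paddle.diag for large inputs, and fixed abnormal memory usage of paddle.diag in the Windows Python 3.8 environment
Fixed the unreasonable output shape of paddle.topk when building static graph networks
Fixed a bug where paddle.io.DataLoader in multi-process mode exited with an error when launched through paddle.distributed.spawn
Fixed the problem that the runtime device set through paddle.set_device did not take effect in some scenarios
Fixed incorrect gradients caused by the backward pass of paddle.static.nn.while_loop using variables from the forward pass
Fixed the bug that fleet did not support paddle.optimizer
Fixed a discrepancy between the Adam optimizer formula and the paper
Fixed the problem of logsumexp making compilation too slow on some machines
Fixed the missing type check in ParamAttr
Fixed the average pooling kernel calculation on CPU when the AvgPool API is used with ceil_mode=true
Fixed the dimension mismatch when paddle.distributed.fleet.init_server() loads a model
Fixed the problem that training nodes did not support GPU in paddle.distributed.fleet parameter server mode
Fixed the precision diff of paddle.allclose with the float64 data type
Fixed errors in the backward pass of grouped conv operators (conv2d grad op with groups) @intel
Fixed the bug that a model decorated with to_static could not be saved after switching directly to eval mode
Fixed the bug that matmul did not support fp16
Fixed the poor performance and high GPU memory usage of the matmul backward computation
Fixed the error when bias_attr and weight_attr of paddle.nn.Transformer are specified as bool or list/tuple
Fixed the problem that dynamic_decode prediction decoding could not end early correctly
Fixed the incorrect result of paddle.unsqueeze when axis is a Tensor
Fixed problems caused by zero_copy in paddle.to_tensor in some scenarios by temporarily disabling the zero_copy behavior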

Inference

Paddle Inference

Changed the default name of the prediction library from fluid_inference to paddle_inference

Function Upgrading

Paddle-TRT dynamic shape supports Int8 models quantized by PaddleSlim
Paddle Inference GPU Int8 supports conv2d_transpose quantization
Added operator version information to prediction models
Added support for quantization and dequantization with shifted scales to the oneDNN INT8 quantization strategy @intel
- Add support for (de/re) quantization with shifted scales in INT8 quantization strategy
Added oneDNN BF16 support: the conv2d bf16 operator and the gru bf16 op are supported, and resnet50 bf16 model inference is enabled @intel
- Added CPU BF16 support: support conv2d bf16 operator and gru bf16 op, enabled resnet50 bf16 model inference.

Performance Optimization

The inference performance of the ERNIE model with Paddle-TRT FP16 on T4 is improved by 15%. @NVIDIA
With support for oneDNN FP32 GRU and oneDNN INT8 GRU, the GRU INT8 model runs about 1.49 times faster than NativeConfig inference (threads = 1, batch_size = 50) @intel
- Added support for oneDNN FP32 GRU and oneDNN INT8 GRU. The GRU INT8 model has 1.49X speed-up compared with NativeConfig inference (with thread=1, batch_size=50)
With oneDNN upgraded to 1.6, Ernie Large oneDNN inference on Skylake (Intel Core 6148) is about 2.7 times faster (unit test test_analyzer_ernie_large) @intel
- Since oneDNN is upgraded to 1.6, Ernie Large (test_analyzer_ernie_large) oneDNN inference has speed up ~2.7x.

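Bug Fixing

Fixed a memory leak with variable-length input when the Paddle Inference ZeroCopyRun interface is used with MKLDNN enabled
Fixed a prediction error when the ERNIE model contains shared parameters
Fixed an initialization error of the prediction library built with Paddle-TensorRT in environments where TensorRT is not installed
Fixed a dimension calculation error when the softmax op and layer_norm op are used with Paddle-TRT prediction
Fixed the problem that increasing cpu_math_library_num_threads_ did not improve prediction performance (PaddleOCR repository) @intel
- Fix the issue that increasing cpu_math_library_num_threads_ does not improve performance in PaddleOCR repository
Fixed the oneDNN concat data-overwriting error @intel
- Fix oneDNN concat overwriting data error
Fixed the error reported when running oneDNN inference on NHWC models @intel
- Fix the issue oneDNN inference with NHWC model report error
Fixed the oneDNN prediction failure of the rec_r34_vd_tps_bilstm_attn model @intel
- Fix rec_r34_vd_tps_bilstm_attn model oneDNN prediction failure
Fixed the oneDNN prediction failure of deeplabv3p_xception @intel
- Fix the deeplabv3p_xception MKLDNN inference failure by adding conv with dilations support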

2.0-rc0 Release Note

Important Updates

Default mode: Starting with paddle 2.0-rc, dynamic graph mode is enabled by default. To use static graph programming mode, call paddle.enable_static() to switch to it.
Framework APIs: Modify 58 commonly used API names, add 95 APIs (including migration from the earlier V1.8), remove 220 APIs (including alias removal), add the support of the Kunlun chips in 50 APIs, add the second-order derivative calculation in 8 APIs, and functionally enhance the distributed APIs and high-level APIs.
Framework features: Optimize the dynamic-to-static conversion usage, optimize model reading and loading, optimize mixed-precision training and quantization strategies, optimize distributed training strategies, and streamline compilation and installation package dependencies.
Inference engine: Enhance the int8 quantization capability, optimize the oneDNN performance, and fix a number of bugs.

Training Framework

Basic API (Including Distributed)

Name Change of Commonly Used APIs

Modified 58 API names. For details, see link

Added APIs

Added paddle.empty API to return uninitialized memory
Added paddle.empty_like API to return uninitialized memory
Added paddle.mv API to return the matrix-vector multiplication result
Added paddle.multinomial multinomial distribution API
Added paddle.nn.LocalResponseNorm and paddle.nn.functional.local_response_norm
Added paddle.nn.Pad1D/Pad2D/Pad3D api, and supported constant, reflect, replicate and circular modes
Added paddle.add_n
Added the dynamic graph mixed precision training APIs paddle.amp.auto_cast and paddle.amp.GradScaler
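
The four padding modes listed for Pad1D/Pad2D/Pad3D can be illustrated on a plain 1-D list. This is a minimal pure-Python sketch of the usual meaning of each mode, not Paddle's implementation; the helper name pad_1d and its signature are hypothetical.

```python
def pad_1d(values, left, right, mode="constant", fill=0):
    """Pad a 1-D sequence on both sides (hypothetical helper, not a Paddle API)."""
    values = list(values)
    n = len(values)
    if n == 0:
        raise ValueError("values must not be empty")
    if mode == "constant":
        # Fill the border with a fixed value.
        return [fill] * left + values + [fill] * right
    if mode == "replicate":
        # Repeat the edge element.
        return [values[0]] * left + values + [values[-1]] * right
    if mode == "reflect":
        # Mirror around the edge element without repeating it.
        if left >= n or right >= n:
            raise ValueError("reflect padding must be smaller than the input length")
        head = [values[i] for i in range(left, 0, -1)]
        tail = [values[n - 2 - i] for i in range(right)]
        return head + values + tail
    if mode == "circular":
        # Wrap around as if the sequence were periodic.
        head = [values[(i - left) % n] for i in range(left)]
        tail = [values[i % n] for i in range(right)]
        return head + values + tail
    raise ValueError("unknown mode: " + mode)


for m in ("constant", "reflect", "replicate", "circular"):
    print(m, pad_1d([1, 2, 3, 4], 2, 2, mode=m))
```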

Fixed and Improved APIs

paddle.reshape API supports bool type input
paddle.distribution.Categorical API is added with sample and log_prob methods
BatchNorm1D, BatchNorm2D, and BatchNorm3D are added with the support of the channel last data layout
Modified the parameter order of paddle.optimizer.Adam and paddle.optimizer.AdamW
yolo_box supports input feature maps whose H and W are not equal, enabling prediction on images with unequal height and width
paddle.nn.functional.interpolate supports a scale_factor input of type list
Added oneDNN support for the adaptive pool2d operator @intel
- Added adaptive pool2d operator oneDNN support
Added oneDNN support for dilated conv and dilated conv_transpose @intel
- Add oneDNN conv with dilations and conv_transpose with dilations support
unique supports the GPU device computing
paddle.multiply supports the input of non-variable and tensor data types
RNN classes (SimpleRNN, LSTM, and GRU) are optimized with the parameter order and the implementation of the base class RNNBase, and integrated with cudnn lstm
Fixed the GPU gradient anomaly of adaptive_pool op in special output cases

Removed APIs (Including Aliases)

Removed 220 APIs (including aliases), see link

Added the Second-order Derivation Function

batch_norm supports second-order derivation
abs supports second-order derivation
log supports second-order derivation
expand supports second-order derivation
tile supports second-order derivation
squeeze supports second-order derivation
unsqueeze supports second-order derivation
matmul supports second-order derivation
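
Second-order derivation means the gradient of an operator can itself be differentiated. As a framework-independent sanity check of what that second derivative should be for log, the sketch below uses a central finite difference; the helper name second_derivative and the step size are assumptions chosen for illustration.

```python
import math


def second_derivative(f, x, h=1e-4):
    """Approximate f''(x) with a central finite difference (illustrative helper)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


x = 2.0
numeric = second_derivative(math.log, x)
analytic = -1.0 / (x * x)  # d^2/dx^2 log(x) = -1 / x^2
print(numeric, analytic)
```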

Support of Kunlun (XPU) Devices

uniform_random, gaussian_random and truncated_gaussian_random support XPU devices
paddle.concat, paddle.assign and paddle.cast APIs support XPU devices
paddle.reshape and paddle.shape APIs support XPU devices
stack, pool2d, and roi_align support XPU devices
conv2d, dropout, and log_loss support XPU devices
softmax supports XPU devices
mean and softmax_with_cross_entropy support XPU devices
sgd and momentum support XPU devices
sum, sign, scale, accuracy, elementwise_mul, elementwise_div, elementwise_sub, and elementwise_max support XPU devices
slice supports XPU devices
mul, pow, relu, sigmoid, sqrt, square, tanh, log, abs, elementwise_add, gelu, and matmul_v2 support XPU devices
transpose supports XPU devices
reduce_sum and reduce_mean support XPU devices
batch_norm and layer_norm support XPU devices
fill_constant supports XPU devices
load supports XPU devices
lookup_table_v2_xpu and adam support XPU devices
gather supports XPU devices

Multi-device/Distributed Training APIs

The Fleet API is now official and unified under paddle.distributed.fleet as Paddle's single entry point for general distributed training
paddle.distributed.fleet.DistributedStrategy is exposed as Paddle's unified entry point for defining parallel strategies
Added paddle.distributed.fleet.meta_optimizer.RecomputeOptimizer API to support the distributed re-computing mechanism
Added paddle.distributed.fleet.meta_optimizer.GradientMergeOptimizer API to support the distributed gradient accumulation mechanism
Added paddle.distributed.fleet.meta_optimizer.PipelineOptimizer API to support the distributed pipeline parallel mechanism
paddle.distributed.fleet.DistributedStrategy is added with the amp optimization strategy to support the enabling of the automatic mixed precision mechanism in the distributed environment
paddle.distributed.fleet.DistributedStrategy is added with the dgc optimization strategy to support the enabling of deep gradient compression mechanism in the distributed environment
paddle.distributed.fleet.DistributedStrategy is added with the fp16_allreduce optimization strategy to support the enabling of fp16 allreduce communication mechanism in the distributed environment
paddle.distributed.fleet.DistributedStrategy is added with the lars optimization strategy to support the use of lars optimizer for large batch size training in the distributed environment
paddle.distributed.fleet.DistributedStrategy is added with the lamb optimization strategy to support the use of lamb optimizer for large batch size training in the distributed environment
paddle.distributed.fleet supports multi-optimization strategy combinations, including combinations of more than ten kinds of strategies such as amp+recompute, dgc+recompute, amp+recompute+lars, and so on
paddle.distributed.fleet.DistributedStrategy is added with the a_sync optimization strategy to support synchronous, asynchronous, GeoSGD, and heterogeneous parameter server optimization training by using the parameter servers in the distributed environment
paddle.distributed.fleet.DistributedStrategy is added with the auto experimental optimization strategy to support auto parallel for multi-strategy optimization in the distributed environment
Added fleetrun for launching distributed training tasks. It launches Collective mode on single-machine single-card, single-machine multi-card, and multi-machine multi-card setups, launches parameter server mode on CPU clusters, GPU clusters, and heterogeneous clusters, and supports direct submission to PaddleCloud clusters
paddle.distributed.fleet supports dynamic graph execution, including single-machine single-card, single-machine multi-card, and multi-machine multi-card dynamic graph training in GPU mode
paddle.distributed.fleet is added with collective communication functions, supporting all_reduce, all_gather and barrier
paddle.distributed.fleet is added with distributed metric computation, including auc, rmse, mae, and acc
In paddle.distributed.fleet, fleet.main_program and fleet.startup_program are deprecated and replaced with paddle.static.default_main_program() and paddle.static.default_startup_program()
paddle.distributed.fleet supports heterogeneous parameter server mode, so that the fleet API together with a user-defined network can train on heterogeneous computing devices that cooperate across devices
Distributed collective communication API supports CPU devices
paddle.distributed.fleet.DistributedStrategy is added with the localsgd optimization strategy
paddle.distributed.fleet.DistributedStrategy is added with the adaptivelocalsgd optimization strategy, a localsgd strategy that computes the step interval automatically in the distributed environment
Added InMemoryDataset and QueueDataset to paddle.distributed to support distributed training with Dataset
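
The gradient accumulation idea behind GradientMergeOptimizer can be sketched without any framework: gradients from k consecutive micro-batches are summed, and the parameter is updated once per k steps. The function below is a minimal scalar illustration under that assumption; its name, the k_steps and avg parameters, and the plain SGD update are hypothetical and do not reproduce Paddle's implementation.

```python
def train_with_gradient_merge(weight, micro_batch_grads, k_steps, lr, avg=True):
    """Apply one SGD update per k_steps accumulated gradients (illustrative sketch)."""
    accumulated = 0.0
    for step, grad in enumerate(micro_batch_grads, start=1):
        accumulated += grad
        if step % k_steps == 0:
            # Optionally average so the update matches one large batch.
            update = accumulated / k_steps if avg else accumulated
            weight -= lr * update
            accumulated = 0.0
    # Gradients left over from an incomplete group are not applied here.
    return weight


print(train_with_gradient_merge(1.0, [1.0, 3.0, 2.0, 2.0], k_steps=2, lr=0.1))
```

Accumulating before updating lets a run emulate a larger batch size than fits in device memory at once, at the cost of fewer parameter updates per pass over the data.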

High-level APIs

Added the IterableDataset base class for streaming datasets. DataLoader supports multi-process acceleration of IterableDataset, and paddle.io.get_worker_info() returns the state of the child process so that data can be divided among processes
The places parameter of paddle.io.DataLoader is now optional. If places is not specified, the default places are used
Added 10 map-style datasets such as CIFAR10, CIFAR100, and Conll05st, with automatic dataset download and map-style data access
The DistributedBatchSampler interface gains the num_replicas and rank parameters for specifying the number of cards and the logical index of the current card
Added paddle.io.TensorDataset for reading tensor datasets
Added the paddle.io.Sampler base class, plus SequenceSampler and RandomSampler for fetching data in BatchSampler in sequential or random order
paddle.io.BatchSampler supports a Sampler as input, and the original input parameter indices is removed
Removed the original APIs in paddle.reader
The image transform operators in paddle.vision.transforms gain a PIL backend
paddle.summary supports multi-input multi-output Layers
model.save is upgraded. When saving an inference model in dynamic graph mode, the user no longer needs to call paddle.jit.to_static or add a decorator to the layer function (the dynamic-to-static feature). If inputs is passed in when the Model is initialized, the correct input shape is saved. Otherwise, the input shape is saved according to the inputs passed in when the model is run
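
The role of the num_replicas and rank parameters can be shown with a simplified index split: each card receives a disjoint, equally sized share of the dataset indices. The sketch below uses a strided split with wrap-around padding, which is an assumption for illustration and not necessarily the scheme DistributedBatchSampler uses; the function name shard_indices is hypothetical.

```python
import math


def shard_indices(dataset_len, num_replicas, rank):
    """Return the sample indices assigned to one card (illustrative strided split)."""
    if dataset_len <= 0:
        raise ValueError("dataset_len must be positive")
    if not 0 <= rank < num_replicas:
        raise ValueError("rank must be in [0, num_replicas)")
    per_rank = math.ceil(dataset_len / num_replicas)
    total = per_rank * num_replicas
    # Wrap around so that every rank gets the same number of samples.
    padded = [i % dataset_len for i in range(total)]
    return padded[rank:total:num_replicas]


for r in range(4):
    print(r, shard_indices(10, num_replicas=4, rank=r))
```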

Function Optimization (Including Distributed)

Dynamic Graph

Added oneDNN dynamic graph support, covering Resnet50 model training and inference. @intel
- Added oneDNN dygraph training and inference support for Resnet50 model. It is faster than CPU NativeConfig training.

Dynamic Graph to Static Graph

Migrated the dynamic-to-static API interfaces to 2.0, simplifying the import path
The dynamic-to-static decorator to_static supports directly decorating a model instance, for example to_static(model, input_spec)
Added the parsing mechanism for the default value of the name parameter in InputSpec. If no name is specified, the decorated function parameter name is used as name
StaticLayer is renamed to StaticFunction
Optimized the dynamic to static Debug log
Fixed the dynamic to static bug in some scenes

Mixed Precision Training

Added fused_bn_add_act OP, with the integration of batch_norm, elementwise_add and activation OP
Added inplace addto strategy for gradient aggregation, to support in-situ gradient summation. The performance is improved by 6.3% in ResNet-50 mixed precision training
Re-constructed the gradient validity check and dynamic loss scaling logic in static graph mixed precision training, and removed some condition block logics
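
The gradient validity check and dynamic loss scaling mentioned above follow a common pattern: skip the update and shrink the scale when a gradient is inf or NaN, and grow the scale after a run of clean steps. The class below is a minimal pure-Python sketch of that pattern; the class name, parameter names, and default values are assumptions for illustration and this is not Paddle's implementation.

```python
import math


class DynamicLossScaler:
    """Illustrative dynamic loss scaling state machine (hypothetical, not a Paddle class)."""

    def __init__(self, init_scale=2.0 ** 15, incr_ratio=2.0, decr_ratio=0.5,
                 incr_every_n_steps=1000, decr_every_n_nan_or_inf=2):
        self.scale = init_scale
        self.incr_ratio = incr_ratio
        self.decr_ratio = decr_ratio
        self.incr_every_n_steps = incr_every_n_steps
        self.decr_every_n_nan_or_inf = decr_every_n_nan_or_inf
        self.good_steps = 0
        self.bad_steps = 0

    def update(self, grads):
        """Return True if the optimizer step should be applied for these gradients."""
        found_inf = any(not math.isfinite(g) for g in grads)
        if found_inf:
            # Invalid gradients: skip the step and possibly shrink the scale.
            self.good_steps = 0
            self.bad_steps += 1
            if self.bad_steps >= self.decr_every_n_nan_or_inf:
                self.scale = max(self.scale * self.decr_ratio, 1.0)
                self.bad_steps = 0
            return False
        # Valid gradients: count the clean step and possibly grow the scale.
        self.bad_steps = 0
        self.good_steps += 1
        if self.good_steps >= self.incr_every_n_steps:
            self.scale *= self.incr_ratio
            self.good_steps = 0
        return True


scaler = DynamicLossScaler(init_scale=8.0, incr_every_n_steps=2, decr_every_n_nan_or_inf=1)
for grads in ([1.0], [0.5], [float("inf")]):
    print(scaler.update(grads), scaler.scale)
```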

Distributed Training Optimization

Optimized the lars strategy. The time2train metric of ResNet50 distributed multi-card training with a 16k batch size is under 10 minutes
Supported pipeline parallel training
Supported heterogeneous distributed training in parameter server mode on PS+GPU, PS+Kunlun, PS+CPU, PS+CPU+GPU (Kunlun) and other device combinations. With a single GPU/Kunlun machine plus 10 CPU machines, a click-through-rate model with hundreds of billions of parameters trains on tens of millions of samples within minutes
Upgraded the large-scale sparse feature to support sparse IDs in the int64 range, self-growing sparse tables, configurable admission conditions, and incremental model saving
Distributed training supports control-flow multitasking, with performance more than 50% higher than instag multitasking

Model Quantization

Added per-channel quantization for dynamic graphs, which computes quantization parameters per channel for the weights of layers such as Conv2D and Linear
Added the function of getting the output scale parameter on model layer during dynamic graph quantization training for Server-side quantization inference deployment
Optimized offline quantization of static graphs to avoid saving temporary data to disk
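
Per-channel quantization computes one quantization parameter per weight channel instead of one for the whole tensor. The sketch below shows a symmetric abs-max variant on a nested list, treating each row as one output channel; the function names, the abs-max rule, and the row-as-channel layout are assumptions for illustration and do not reproduce Paddle's quantization code.

```python
def per_channel_abs_max_scales(weight, bits=8):
    """Compute one symmetric scale per channel (each row is one channel)."""
    qmax = 2 ** (bits - 1) - 1
    scales = []
    for channel in weight:
        abs_max = max(abs(v) for v in channel)
        # Fall back to 1.0 for an all-zero channel to avoid division by zero.
        scales.append(abs_max / qmax if abs_max > 0 else 1.0)
    return scales


def quantize_per_channel(weight, scales, bits=8):
    """Map each value to an integer in [-qmax, qmax] using its channel's scale."""
    qmax = 2 ** (bits - 1) - 1
    return [
        [max(-qmax, min(qmax, round(v / s))) for v in channel]
        for channel, s in zip(weight, scales)
    ]


w = [[0.5, -1.27], [0.02, -0.005]]
scales = per_channel_abs_max_scales(w)
print(scales)
print(quantize_per_channel(w, scales))
```

With a single scale for the whole tensor, the second channel in this example would collapse to values near zero; a per-channel scale keeps its full integer range.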

Model Saving and Loading

Supported paddle.jit.save interface to store the Layer object without paddle.jit.to_static transcription, to expand the interface usage scenarios
Standardized the set_dict method name of APIs such as Layer and Optimizer, renaming it uniformly to set_state_dict
Supported the loading of state_dict of Layer from the result stored in the fluid.io.save_inference_model interface by paddle.load
Supported the loading of state_dict of Layer from the default result stored in the fluid.io.save_params/persistables interface by paddle.load, to enable the interface system and improve the usability
Modified the paddle.save/load interface behavior. paddle.save does not add the suffix for the stored results. paddle.load returns only one result in each loading to standardize the interface semantics
paddle.jit.TranslatedLayer is added with the program method, which returns the program of a model loaded by paddle.jit.load and makes the model structure easier to inspect
Removed paddle.SaveLoadConfig. For paddle.jit.save, paddle.jit.load, paddle.load and other interface-compatible loading scenarios, use **kwargs to pass in additional configuration to simplify the use of the interface
Updated the meaning of model_path of the paddle.jit.save and paddle.jit.load interface parameter. The user input string is used as a prefix to the stored file, instead of a directory
Original static graph APIs such as paddle.io.save, paddle.io.load, paddle.io.save_inference_model, and paddle.io.load_inference_model are moved to the paddle.static module
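
The change from directory to file prefix for model_path affects where files end up. The sketch below only manipulates strings to show the difference; the suffixes .pdmodel and .pdiparams are assumptions used for illustration, and the helper name artifact_paths is hypothetical.

```python
import os


def artifact_paths(path_prefix, suffixes=(".pdmodel", ".pdiparams")):
    """Return the parent directory and file names derived from a path prefix."""
    directory = os.path.dirname(path_prefix)
    # The prefix itself is not a directory: each file is the prefix plus a suffix.
    files = [path_prefix + suffix for suffix in suffixes]
    return directory, files


directory, files = artifact_paths("output/resnet")
print(directory)
print(files)
```

Under the prefix interpretation, "output/resnet" yields files inside "output" whose names start with "resnet", rather than files inside a directory named "resnet".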

Performance Optimization (Including Distributed)

Improved the performance of the Argsort OP when the number of elements in the input Tensor equals the length of its axis dimension. The forward pass is 34 times faster and the backward pass is 10 times faster

Basic Functions for Dynamic Graph

Added the clone interface of Tensor, which copies an identical Tensor. The cloned Tensor stays in the computation graph and supports gradient backpropagation
Hid the scale_loss and apply_collective_grads methods of the dynamic graph multi-card API DataParallel. Multi-card model code no longer needs to call these two methods, which simplifies the code and improves usability
Supported in-place modification of a Tensor through indexing or slicing
Optimized the printing and display of dynamic graph Tensors. High-dimensional tensor display is aligned with numpy, and abbreviated display is supported
Optimized the __call__ method of the initializer class. Passing in a block is no longer required, so users no longer encounter the static graph block concept in dynamic graph mode

Debugging Analysis

Continued to improve the hint text of about 1500 error checks in paddle, making the framework easier to debug

Compiling and Installation

Added the support for python3.8 in the installation package
Removed the installation dependency on matplotlib
Removed the installation dependency on graphviz
Removed the installation dependency on objgraph
Removed the installation dependency on netifaces
Removed the installation dependency on nltk
Removed the installation dependency on opencv
Added the installer support for cuda10.1 and cuda 10.2
The prediction library supports cuda10.2-cudnn8-trt7.1 version

Bug Fixing

Fixed an error when gradient clipping with GradientClipByGlobalNorm is used in a network whose Paddle default dtype is float64
Fixed the failure to load CUDA-related DLLs with the CUDA 10.1/10.2 builds on Windows
Fixed a bug when copying a Tensor between CUDAPinnedPlace and other Places
Fixed an error when paddle.jit.load loads a Layer without parameters
Fixed a calculation error in paddle.diag for large inputs, and fixed abnormal memory usage of paddle.diag in the Windows Python 3.8 environment
Fixed the unreasonable output shape of paddle.topk when building static graph networks
Fixed a bug where paddle.io.DataLoader in multi-process mode exited with an error when launched through paddle.distributed.spawn
Fixed the problem that the runtime device set through paddle.set_device did not take effect in some scenarios
Fixed incorrect gradients caused by the backward pass of paddle.static.nn.while_loop using variables from the forward pass
Fixed the bug that fleet did not support paddle.optimizer
Fixed a discrepancy between the Adam optimizer formula and the paper
Fixed the problem of logsumexp making compilation too slow on some machines
Fixed the missing type check in ParamAttr
Fixed the average pooling kernel calculation on CPU when the AvgPool API is used with ceil_mode=true
Fixed the dimension mismatch when paddle.distributed.fleet.init_server() loads a model
Fixed the problem that training nodes did not support GPU in paddle.distributed.fleet parameter server mode
Fixed the precision diff of paddle.allclose with the float64 data type
Fixed errors in the backward pass of grouped conv operators (conv2d grad op with groups) @intel
- Fix the conv2d grad op with groups problems
Fixed the bug that a model decorated with to_static could not be saved after switching directly to eval mode
Fixed the bug that matmul did not support fp16
Fixed the poor performance and high GPU memory usage of the matmul backward computation
Fixed the error when bias_attr and weight_attr of paddle.nn.Transformer are specified as bool or list/tuple
Fixed the problem that dynamic_decode prediction decoding could not end early correctly
Fixed the incorrect result of paddle.unsqueeze when axis is a Tensor
Fixed problems caused by zero_copy in paddle.to_tensor in some scenarios by temporarily disabling the zero_copy behavior

Inference

Paddle Inference

Changed the default name of prediction library from fluid_inference to paddle_inference

API

Function Upgrading

Paddle-TRT dynamic shape supports PaddleSlim quantization of Int8 models
Paddle Inference GPU Int8 supports conv2d_transpose quantization
Added operator version information for the prediction model
Added support for quantization and dequantization with shifted scales to the oneDNN INT8 quantization strategy @intel
- Add support for (de/re) quantization with shifted scales in INT8 quantization strategy
Added oneDNN BF16 support: the conv2d bf16 operator and the gru bf16 op are supported, and resnet50 bf16 model inference is enabled @intel
- Added CPU BF16 support: support conv2d bf16 operator and gru bf16 op, enabled resnet50 bf16 model inference.

Performance Optimization

The inference performance of the ERNIE model with Paddle-TRT FP16 on T4 is improved by 15%. @NVIDIA
With support for oneDNN FP32 GRU and oneDNN INT8 GRU, the GRU INT8 model runs about 1.49 times faster than NativeConfig inference (threads = 1, batch_size = 50) @intel
- Added support for oneDNN FP32 GRU and oneDNN INT8 GRU. The GRU INT8 model has 1.49X speed-up compared with NativeConfig inference (with thread=1, batch_size=50)
With oneDNN upgraded to 1.6, Ernie Large oneDNN inference on Skylake (Intel Core 6148) is about 2.7 times faster (unit test test_analyzer_ernie_large) @intel
- Since oneDNN is upgraded to 1.6, Ernie Large (test_analyzer_ernie_large) oneDNN inference has speed up ~2.7x.

Bug Fixing

Fixed a memory leak with variable-length input when the Paddle Inference ZeroCopyRun interface is used with MKLDNN enabled
Fixed a prediction error when the ERNIE model contains shared parameters
Fixed an initialization error of the prediction library built with Paddle-TensorRT in environments where TensorRT is not installed
Fixed a dimension calculation error when the softmax op and layer_norm op are used with Paddle-TRT prediction
Fixed the problem that increasing cpu_math_library_num_threads_ did not improve prediction performance (PaddleOCR repository) @intel
- Fix the issue that increasing cpu_math_library_num_threads_ does not improve performance in PaddleOCR repository
Fixed the oneDNN concat data-overwriting error @intel
- Fix oneDNN concat overwriting data error
Fixed the error reported when running oneDNN inference on NHWC models @intel
- Fix the issue oneDNN inference with NHWC model report error
Fixed the oneDNN prediction failure of the rec_r34_vd_tps_bilstm_attn model @intel
- Fix rec_r34_vd_tps_bilstm_attn model oneDNN prediction failure
Fixed the oneDNN prediction failure of deeplabv3p_xception @intel
- Fix the deeplabv3p_xception MKLDNN inference failure by adding conv with dilations support
