v2.0.0-rc0
Release date: 2020-10-30 11:37:50
Latest PaddlePaddle/Paddle release: v3.0.0-beta0 (2024-06-27 18:00:34)
2.0-rc0 Release Note
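Important Updates
Compared with 2.0-beta, this release is further improved in the following areas:
- Default mode: Starting with paddle 2.0-rc, the dynamic graph mode is enabled by default. To use the static graph programming mode, call paddle.enable_static() to switch to it.
- Framework APIs: Renamed 50 commonly used APIs, added 8 basic APIs, removed 220 APIs (including aliases), added second-order derivative computation to 8 APIs, extended Kunlun chip support to more APIs, formalized the distributed Fleet API, and enhanced the high-level APIs.
- Framework features: Optimized the dynamic-to-static conversion usage, model saving and loading, mixed precision training and quantization strategies, and distributed training strategies. Removed 6 build dependencies, including nltk. The installation packages add support for Python 3.8 and CUDA 10.1/10.2.
- Inference engine: Enhanced the int8 quantization capability, added operator version information, and strengthened the functionality and performance of oneDNN-related features.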
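Training Framework
Basic APIs (Including Distributed)
Added APIs
- Added the paddle.empty API, which returns uninitialized memory
- Added the paddle.empty_like API, which returns uninitialized memory
- Added the paddle.mv API, which returns the matrix-vector multiplication result
- Added the paddle.multinomial multinomial distribution API
- Added paddle.nn.LocalResponseNorm and paddle.nn.functional.local_response_norm
- Added the paddle.nn.Pad1D/Pad2D/Pad3D APIs, which support the constant, reflect, replicate, and circular modes
- Added paddle.add_n
- Added the dynamic graph mixed precision training APIs paddle.amp.auto_cast and paddle.amp.GradScaler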
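Fixed and Improved APIs
- The paddle.reshape API supports bool type input
- The paddle.distribution.Categorical API adds the sample and log_prob methods
- BatchNorm1D, BatchNorm2D, and BatchNorm3D add support for the channel last data layout
- Changed the parameter order of paddle.optimizer.Adam and paddle.optimizer.AdamW
- yolo_box supports input feature maps whose H and W are not equal, so that images with unequal height and width can be predicted
- paddle.nn.functional.interpolate supports a list as the input type of scale_factor
- Added oneDNN support for the adaptive pool2d operator @intel
- Added oneDNN support for dilated conv and dilated conv_transpose @intel
- unique supports computation on GPU devices
- paddle.multiply supports inputs of non-Variable and Tensor data types
- paddle.nn.AdaptiveMaxPool1D/2D/3D and paddle.nn.functional.adaptive_max_pool1d/2d/3d: refactored the Python-side implementation of the Pool APIs
- paddle.set_printoptions supports setting the display options of dynamic graph Tensors
- The paddle.assign API supports assignment from an array or a tensor to a tensor
- paddle.nn.functional.swish/paddle.nn.Swish: removed the beta parameter
- paddle.nn.functional.thresholded_relu/paddle.nn.ThresholdedReLU: the default value of the threshold parameter is 1.0
- paddle.norm is upgraded to support fro, inf, -inf, 0, 1, 2, and the p-norm for any positive real number p
- RNN classes (SimpleRNN, LSTM, and GRU): optimized the parameter order and the implementation of the base class RNNBase, and integrated cudnn lstm
- Fixed the GPU gradient anomaly of the adaptive_pool op in special output cases
- Added second-order derivative support for batch_norm, abs, log, expand, tile, squeeze, unsqueeze, and matmul
- Added Kunlun (XPU) training support for more than 50 operators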
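The norm orders listed for paddle.norm above (fro, inf, -inf, 0, 1, 2, and any positive real p) can be illustrated for a 1-D input with a small standard-library sketch. It follows the usual mathematical definitions and does not call Paddle; the function name `vector_norm` is hypothetical, and how paddle.norm itself handles each case is not specified in these notes.

```python
import math


def vector_norm(values, p=2):
    """Norm of a 1-D sequence for p in {inf, -inf, 0} or any positive real p.

    For a vector, the Frobenius norm ("fro") coincides with p = 2.
    """
    magnitudes = [abs(v) for v in values]
    if p == math.inf:
        return float(max(magnitudes))        # largest absolute value
    if p == -math.inf:
        return float(min(magnitudes))        # smallest absolute value
    if p == 0:
        # Conventional "0-norm": the number of non-zero entries.
        return float(sum(1 for m in magnitudes if m != 0))
    if p < 0:
        raise ValueError("p must be inf, -inf, 0, or a positive real number")
    return sum(m ** p for m in magnitudes) ** (1.0 / p)


v = [3.0, -4.0]
l2 = vector_norm(v)                  # Euclidean length
l1 = vector_norm(v, p=1)             # sum of absolute values
largest = vector_norm(v, p=math.inf)
```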
API Name Changes
- Renamed 50 APIs from 2.0-beta. For details, see link
Removed APIs (Including Aliases)
- Removed 220 APIs (including aliases). For details, see link
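Multi-device/Distributed Training APIs
- The Fleet API is formalized and unified under paddle.distributed.fleet as the single entry point for distributed training in Paddle
- paddle.distributed.fleet.DistributedStrategy is exposed as the unified entry point for defining parallel strategies in Paddle
- Added the paddle.distributed.fleet.meta_optimizer.RecomputeOptimizer API to support recomputation in distributed training
- Added the paddle.distributed.fleet.meta_optimizer.GradientMergeOptimizer API to support gradient accumulation in distributed training
- Added the paddle.distributed.fleet.meta_optimizer.PipelineOptimizer API to support pipeline parallelism in distributed training
- paddle.distributed.fleet.DistributedStrategy adds the amp strategy, which enables automatic mixed precision in distributed training
- paddle.distributed.fleet.DistributedStrategy adds the dgc strategy, which enables deep gradient compression in distributed training
- paddle.distributed.fleet.DistributedStrategy adds the fp16_allreduce strategy, which enables fp16 allreduce communication in distributed training
- paddle.distributed.fleet.DistributedStrategy adds the lars strategy, which supports large-batch-size distributed training with the lars optimizer
- paddle.distributed.fleet.DistributedStrategy adds the lamb strategy, which supports large-batch-size distributed training with the lamb optimizer
- paddle.distributed.fleet supports combining multiple optimization strategies, with more than ten combinations such as amp+recompute, dgc+recompute, and amp+recompute+lars
- paddle.distributed.fleet.DistributedStrategy adds the a_sync strategy, which supports synchronous, asynchronous, GeoSGD, and heterogeneous parameter server training with parameter servers
- paddle.distributed.fleet.DistributedStrategy adds the experimental auto strategy, which supports automatic parallelism with multi-strategy optimization in distributed training
- Added fleetrun for launching distributed training jobs. It launches Collective mode on a single machine with a single card, a single machine with multiple cards, and multiple machines with multiple cards; launches parameter server mode on CPU clusters, GPU clusters, and heterogeneous clusters; and supports direct submission to PaddleCloud clusters
- paddle.distributed.fleet supports dynamic graph execution, including single-machine single-card, single-machine multi-card, and multi-machine multi-card dynamic graph training in GPU mode
- paddle.distributed.fleet adds collective communication functions: all_reduce, all_gather, and barrier
- paddle.distributed.fleet adds distributed metric computation, including auc, rmse, mae, and acc
- In paddle.distributed.fleet, fleet.main_program and fleet.startup_program are deprecated in favor of paddle.static.default_main_program() and paddle.static.default_startup_program()
- paddle.distributed.fleet supports heterogeneous parameter server mode: the fleet API, together with the user's network definition, enables training on heterogeneous computing devices and cross-device collaborative distributed training
- The distributed collective communication APIs support CPU devices
- paddle.distributed.fleet.DistributedStrategy adds the localsgd strategy
- paddle.distributed.fleet.DistributedStrategy adds the adaptivelocalsgd strategy, a localsgd strategy that automatically computes the step interval in distributed training
- Added InMemoryDataset and QueueDataset to paddle.distributed to support distributed training with Dataset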
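High-level APIs
- Added the IterableDataset base class to support streaming datasets. DataLoader supports multi-process acceleration of IterableDataset, and paddle.io.get_worker_info() returns the state of the current child process so that data can be divided among processes
- The places parameter of paddle.io.DataLoader is now optional; when places is not specified, the default places are used
- Added 10 map-style datasets, such as CIFAR10, CIFAR100, and Conll05st, which support automatic download and map-style data access
- The DistributedBatchSampler interface adds the num_replicas and rank parameters for specifying the number of cards and the logical index of the current card
- Added paddle.io.TensorDataset to support reading tensor datasets
- Added the paddle.io.Sampler base class, together with SequenceSampler and RandomSampler, for fetching data in BatchSampler in sequential or random order
- paddle.io.BatchSampler accepts a Sampler as input; the original input parameter indices is removed
- Retired the original APIs under paddle.reader
- The image transforms in paddle.vision.transforms add a backend for processing PIL images
- paddle.summary supports Layers with multiple inputs and multiple outputs
- model.save is upgraded. When saving an inference model in dynamic graph mode, users no longer need to call paddle.jit.to_static or add a decorator to the layer function (the dynamic-to-static feature). If inputs is passed when the Model is initialized, the correct input shape is saved; otherwise the input shape is saved according to the shape of the inputs passed when the model is run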
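Function Optimization (Including Distributed)
Basic Functions for Dynamic Graph
- Added the clone interface for Tensor, which copies an identical Tensor; the cloned Tensor remains in the computation graph and supports gradient backpropagation
- Supported in-place modification of a Tensor through indexing or slicing
- Optimized dynamic graph Tensor printing and display: high-dimensional tensor display is aligned with numpy and supports an abbreviated form
- Optimized the __call__ method of the initializer classes: block no longer needs to be passed in, so users do not encounter the static graph block concept in dynamic graph mode
- Hid the scale_loss and apply_collective_grads methods of the dynamic graph multi-card API DataParallel: these two methods no longer need to be called when writing multi-card model code, which simplifies usage and improves usability
- Added oneDNN dynamic graph support, covering Resnet50 model training and inference. @intel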
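Dynamic Graph to Static Graph
- Migrated the dynamic-to-static APIs to 2.0, simplifying the import path
- The to_static decorator can now decorate a model instance directly, for example to_static(model, input_spec)
- Added default-value resolution for the name parameter of InputSpec: if name is not specified, the parameter name of the decorated function is used as name
- StaticLayer is renamed to StaticFunction
- Optimized the dynamic-to-static debug log
- Fixed dynamic-to-static bugs in some scenarios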
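Mixed Precision Training
- Refactored the gradient validity check and dynamic loss scaling logic in static graph mixed precision training, and removed some condition block logic
Model Quantization
- Added per-channel quantization for dynamic graphs, which computes quantization parameters per channel for the weights of layers such as Conv2D and Linear
- Added computation of the output scale parameter of model layers during dynamic graph quantization training, for use in server-side quantized inference deployment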
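Distributed Training Optimization
- Supported pipeline parallel training
- Supported heterogeneous distributed training in parameter server mode on multiple device combinations, including PS+GPU, PS+Kunlun, PS+CPU, and PS+CPU+GPU (Kunlun). On a single GPU/Kunlun machine plus 10 CPU machines, a click-through-rate model with hundreds of billions of parameters is trained on tens of millions of samples within minutes
- Upgraded the large-scale sparse feature: it supports sparse IDs in the int64 range, self-growing sparse tables, configurable admission conditions, and incremental model saving
- Distributed training supports control-flow multi-task training, with performance more than 50% higher than instag multi-task training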
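Model Saving and Loading
- paddle.jit.save can store Layer objects that have not been converted by paddle.jit.to_static, which broadens the interface's usage scenarios
- Standardized the set_dict method name of APIs such as Layer and Optimizer, renaming it uniformly to set_state_dict
- paddle.load can load a Layer's state_dict from the results stored by fluid.io.save_inference_model, which connects the interface systems and improves usability
- paddle.load can load a Layer's state_dict from the default results stored by fluid.io.save_params/persistables, which connects the interface systems and improves usability
- Changed the behavior of paddle.save/load: paddle.save no longer adds a suffix to the stored result, and paddle.load returns only one result per call, which standardizes the interface semantics
- Added the program method to paddle.jit.TranslatedLayer for getting the program of a model loaded by paddle.jit.load, which makes the model structure easier to inspect
- Removed paddle.SaveLoadConfig. For compatible-loading scenarios of interfaces such as paddle.jit.save, paddle.jit.load, and paddle.load, pass additional configuration through **kwargs, which simplifies interface usage
- Changed the meaning of the model_path parameter of paddle.jit.save and paddle.jit.load: the string passed by the user is used as a prefix of the stored files rather than as a directory
- The original static graph APIs paddle.io.save, paddle.io.load, paddle.io.save_inference_model, and paddle.io.load_inference_model are moved to the paddle.static module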
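Performance Optimization (Including Distributed)
- Improved the performance of the Argsort OP when the number of elements in the input Tensor equals the length of its axis dimension: the forward pass is 34 times faster and the backward pass is 10 times faster
- Optimized the lars strategy: ResNet50 distributed multi-card training with a 16k batch size reaches a time2train of less than 10 minutes
- Added the fused_bn_add_act OP, which fuses the batch_norm, elementwise_add, and activation OPs
- Added the in-place addto strategy for gradient aggregation, which supports in-place gradient accumulation and improves performance by 6.3% in ResNet-50 mixed precision training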
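Debugging Analysis
- Continued to improve the messages of about 1500 error checks in paddle, making the framework easier to debug
Compiling and Installation
- Added support for python3.8 in the installation package
- Removed the installation dependency on matplotlib
- Removed the installation dependency on graphviz
- Removed the installation dependency on objgraph
- Removed the installation dependency on netifaces
- Removed the installation dependency on nltk
- Removed the installation dependency on opencv
- Added support for cuda10.1 and cuda10.2 in the installation package
- The prediction library supports the cuda10.2-cudnn8-trt7.1 version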
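Bug Fixing
- Fixed an error raised by gradient clipping with GradientClipByGlobalNorm in networks where the Paddle default dtype is float64
- Fixed the failure to load CUDA-related DLLs in the CUDA 10.1/10.2 builds on Windows
- Fixed a bug in copying Tensors between CUDAPinnedPlace and other Places
- Fixed an error when paddle.jit.load loads a Layer without parameters
- Fixed incorrect results of paddle.diag for large inputs, and fixed abnormal memory usage of paddle.diag on Windows with Python 3.8
- Fixed the unreasonable output shape of paddle.topk when building static graph networks
- Fixed the bug that paddle.io.DataLoader in multi-process mode exits with an error when launched through paddle.distributed.spawn
- Fixed the problem that the runtime device set through paddle.set_device does not take effect in some scenarios
- Fixed the gradient error caused by paddle.static.nn.while_loop using forward-computation variables in its backward computation
- Fixed the bug that fleet does not support paddle.optimizer
- Fixed the discrepancy between the Adam optimizer formula and the paper
- Fixed slow compilation on some machines caused by logsumexp
- Fixed the missing type check in ParamAttr
- Fixed the average pooling kernel calculation on CPU when the AvgPool API is used with ceil_mode=true
- Fixed the dimension mismatch when paddle.distributed.fleet.init_server() loads a model
- Fixed the problem that training nodes do not support GPU in paddle.distributed.fleet parameter server mode
- Fixed the precision diff of paddle.allclose with the float64 data type
- Fixed errors in the backward pass of grouped conv operators (conv2d grad op with groups) @intel
- Fixed the bug that a model decorated with to_static cannot be saved after switching directly to eval mode
- Fixed the bug that matmul does not support fp16
- Fixed the poor performance and high GPU memory usage of the matmul backward computation
- Fixed the error when the bias_attr and weight_attr parameters of paddle.nn.Transformer are specified as bool or list/tuple
- Fixed the problem that dynamic_decode cannot end decoding early correctly during prediction
- Fixed the incorrect result of paddle.unsqueeze when axis is a Tensor
- Fixed problems caused by zero_copy in paddle.to_tensor in some scenarios by temporarily disabling the zero_copy behavior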
Inference
Paddle Inference
- Changed the default name of the prediction library from fluid_inference to paddle_inference
Function Upgrading
- The Paddle-TRT dynamic shape feature supports Int8 models quantized by PaddleSlim
- Paddle Inference GPU Int8 supports conv2d_transpose quantization
- Added operator version information to prediction models
- Added support for (de/re)quantization with shifted scales to the oneDNN INT8 quantization strategy @intel
- Added oneDNN BF16 support: the conv2d bf16 operator and gru bf16 op are supported, and resnet50 bf16 model inference is enabled @intel
Performance Optimization
- Inference of the ERNIE model with Paddle-TRT FP16 on T4 is 15% faster. @NVIDIA
- Added support for oneDNN FP32 GRU and oneDNN INT8 GRU. The GRU INT8 model is about 1.49 times faster than NativeConfig inference (thread=1, batch_size=50) @intel
- With oneDNN upgraded to 1.6, Ernie Large oneDNN inference on Skylake (Intel Core 6148) is about 2.7 times faster (unit test test_analyzer_ernie_large) @intel
Bug Fixing
- Fixed a memory leak with variable-length input when the Paddle Inference ZeroCopyRun interface is used with MKLDNN enabled
- Fixed a prediction error when the ERNIE model contains shared parameters
- Fixed an initialization error of the prediction library built with Paddle-TensorRT in environments where TensorRT is not installed
- Fixed a dimension calculation error when the softmax op and layer_norm op are used in Paddle-TRT prediction
- Fixed the issue that increasing cpu_math_library_num_threads_ does not improve prediction performance (PaddleOCR repository) @intel
- Fixed the oneDNN concat data-overwriting error @intel
- Fixed the error reported when running oneDNN inference on an NHWC model @intel
- Fixed the oneDNN prediction failure of the rec_r34_vd_tps_bilstm_attn model @intel
- Fixed the deeplabv3p_xception oneDNN prediction failure by adding support for conv with dilations @intel
2.0-rc0 Release Note
Important Updates
- Default mode: Starting with paddle 2.0-rc, the dynamic graph mode is enabled by default. To use the static graph programming mode, call paddle.enable_static() to switch to it.
- Framework APIs: Renamed 58 commonly used APIs, added 95 APIs (including ones migrated from the earlier V1.8), removed 220 APIs (including aliases), added Kunlun chip support to 50 APIs, added second-order derivative computation to 8 APIs, and functionally enhanced the distributed APIs and high-level APIs.
- Framework features: Optimized the dynamic-to-static conversion usage, model saving and loading, mixed precision training and quantization strategies, and distributed training strategies, and streamlined the build and installation package dependencies.
- Inference engine: Enhanced the int8 quantization capability, optimized oneDNN performance, and fixed a number of bugs.
Training Framework
Basic API (Including Distributed)
Name Change of Commonly Used APIs
- Modified 58 API names. For details, see link
Added APIs
- Added the paddle.empty API, which returns uninitialized memory
- Added the paddle.empty_like API, which returns uninitialized memory
- Added paddle.mv API to return the matrix-vector multiplication result
- Added paddle.multinomial multinomial distribution API
- Added paddle.nn.LocalResponseNorm and paddle.nn.functional.local_response_norm
- Added the paddle.nn.Pad1D/Pad2D/Pad3D APIs, which support the constant, reflect, replicate, and circular modes
- Added paddle.add_n
- Added the dynamic graph mixed precision training APIs paddle.amp.auto_cast and paddle.amp.GradScaler
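The four padding modes named for Pad1D/Pad2D/Pad3D above can be illustrated on a plain 1-D list. This is a minimal standard-library sketch of the common meaning of these mode names, not Paddle's implementation; the function `pad1d` and its parameters are hypothetical, and Paddle's behavior may differ in details such as argument layout.

```python
def pad1d(values, left, right, mode="constant", fill=0):
    """Pad a 1-D list on both sides using one of four boundary modes."""
    n = len(values)
    if n == 0:
        raise ValueError("cannot pad an empty sequence")
    if mode not in ("constant", "reflect", "replicate", "circular"):
        raise ValueError("unknown mode: " + mode)
    if mode == "reflect" and max(left, right) > n - 1:
        raise ValueError("reflect padding must be smaller than the input length")

    def source_index(i):
        # i indexes the padded output, shifted so that 0 is the first real element.
        if 0 <= i < n:
            return i
        if mode == "replicate":          # repeat the edge value
            return min(max(i, 0), n - 1)
        if mode == "circular":           # wrap around to the other end
            return i % n
        if mode == "reflect":            # mirror without repeating the edge
            return -i if i < 0 else 2 * (n - 1) - i
        return None                      # constant: use the fill value

    padded = []
    for i in range(-left, n + right):
        j = source_index(i)
        padded.append(fill if j is None else values[j])
    return padded


examples = {m: pad1d([1, 2, 3], 2, 2, mode=m)
            for m in ("constant", "reflect", "replicate", "circular")}
```

Comparing the four entries of `examples` side by side shows how the modes differ only in which source element fills each out-of-range position.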
Fixed and Improved APIs
- paddle.reshape API supports bool type input
- The paddle.distribution.Categorical API adds the sample and log_prob methods
- BatchNorm1D, BatchNorm2D, and BatchNorm3D add support for the channel last data layout
- Changed the parameter order of paddle.optimizer.Adam and paddle.optimizer.AdamW
- yolo_box supports input feature maps whose H and W are not equal, so that images with unequal height and width can be predicted
- paddle.nn.functional.interpolate supports a list as the input type of scale_factor
- Added oneDNN support for the adaptive pool2d operator @intel
- Added oneDNN support for dilated conv and dilated conv_transpose @intel
- unique supports computation on GPU devices
- paddle.multiply supports the input of non-variable and tensor data types
- RNN classes (SimpleRNN, LSTM, and GRU): optimized the parameter order and the implementation of the base class RNNBase, and integrated cudnn lstm
- Fixed the GPU gradient anomaly of the adaptive_pool op in special output cases
Removed APIs (Including Aliases)
- Removed 220 APIs (including aliases), see link
Added the Second-order Derivation Function
- batch_norm supports second-order derivation
- abs supports second-order derivation
- log supports second-order derivation
- expand supports second-order derivation
- tile supports second-order derivation
- squeeze supports second-order derivation
- unsqueeze supports second-order derivation
- matmul supports second-order derivation
Support of Kunlun (XPU) Devices
- uniform_random, gaussian_random and truncated_gaussian_random support XPU devices
- paddle.concat, paddle.assign and paddle.cast APIs support XPU devices
- paddle.reshape and paddle.shape APIs support XPU devices
- stack, pool2d, and roi_align support XPU devices
- conv2d, dropout, and log_loss support XPU devices
- softmax supports XPU devices
- mean and softmax_with_cross_entropy support XPU devices
- sgd and momentum support XPU devices
- sum, sign, scale, accuracy, elementwise_mul, elementwise_div, elementwise_sub, and elementwise_max support XPU devices
- slice supports XPU devices
- mul, pow, relu, sigmoid, sqrt, square, tanh, log, abs, elementwise_add, gelu, and matmul_v2 support XPU devices
- transpose supports XPU devices
- reduce_sum and reduce_mean support XPU devices
- batch_norm and layer_norm support XPU devices
- fill_constant supports XPU devices
- load supports XPU devices
- lookup_table_v2_xpu and adam support XPU devices
- gather supports XPU devices
Multi-device/Distributed Training APIs
- The Fleet API is formalized and unified under paddle.distributed.fleet as the single entry point for distributed training in Paddle
- paddle.distributed.fleet.DistributedStrategy is exposed as the unified entry point for defining parallel strategies in Paddle
- Added the paddle.distributed.fleet.meta_optimizer.RecomputeOptimizer API to support recomputation in distributed training
- Added the paddle.distributed.fleet.meta_optimizer.GradientMergeOptimizer API to support gradient accumulation in distributed training
- Added the paddle.distributed.fleet.meta_optimizer.PipelineOptimizer API to support pipeline parallelism in distributed training
- paddle.distributed.fleet.DistributedStrategy adds the amp strategy, which enables automatic mixed precision in distributed training
- paddle.distributed.fleet.DistributedStrategy adds the dgc strategy, which enables deep gradient compression in distributed training
- paddle.distributed.fleet.DistributedStrategy adds the fp16_allreduce strategy, which enables fp16 allreduce communication in distributed training
- paddle.distributed.fleet.DistributedStrategy adds the lars strategy, which supports large-batch-size distributed training with the lars optimizer
- paddle.distributed.fleet.DistributedStrategy adds the lamb strategy, which supports large-batch-size distributed training with the lamb optimizer
- paddle.distributed.fleet supports combining multiple optimization strategies, with more than ten combinations such as amp+recompute, dgc+recompute, and amp+recompute+lars
- paddle.distributed.fleet.DistributedStrategy adds the a_sync strategy, which supports synchronous, asynchronous, GeoSGD, and heterogeneous parameter server training with parameter servers
- paddle.distributed.fleet.DistributedStrategy adds the experimental auto strategy, which supports automatic parallelism with multi-strategy optimization in distributed training
- Added fleetrun for launching distributed training jobs. It launches Collective mode on a single machine with a single card, a single machine with multiple cards, and multiple machines with multiple cards; launches parameter server mode on CPU clusters, GPU clusters, and heterogeneous clusters; and supports direct submission to PaddleCloud clusters
- paddle.distributed.fleet supports dynamic graph execution, including single-machine single-card, single-machine multi-card, and multi-machine multi-card dynamic graph training in GPU mode
- paddle.distributed.fleet adds collective communication functions: all_reduce, all_gather, and barrier
- paddle.distributed.fleet adds distributed metric computation, including auc, rmse, mae, and acc
- In paddle.distributed.fleet, fleet.main_program and fleet.startup_program are deprecated in favor of paddle.static.default_main_program() and paddle.static.default_startup_program()
- paddle.distributed.fleet supports heterogeneous parameter server mode: the fleet API, together with the user's network definition, enables training on heterogeneous computing devices and cross-device collaborative distributed training
- The distributed collective communication APIs support CPU devices
- paddle.distributed.fleet.DistributedStrategy adds the localsgd strategy
- paddle.distributed.fleet.DistributedStrategy adds the adaptivelocalsgd strategy, a localsgd strategy that automatically computes the step interval in distributed training
- Added InMemoryDataset and QueueDataset to paddle.distributed to support distributed training with Dataset
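The gradient accumulation idea behind GradientMergeOptimizer above can be shown with a scalar model: gradients from several micro-batches are accumulated without touching the parameter, and one update is applied at the end. The sketch below uses only the standard library and is not Paddle code; the function names are hypothetical, and averaging the accumulated gradient is an assumed convention (summing is the other common choice).

```python
def mean_gradient(w, batch):
    """Gradient of the loss 0.5 * (w * x - y) ** 2, averaged over a batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def sgd_step_with_gradient_merge(w, micro_batches, lr):
    """Accumulate gradients over k micro-batches, then apply one update."""
    accumulated = 0.0
    for batch in micro_batches:
        accumulated += mean_gradient(w, batch)   # no parameter update here
    merged = accumulated / len(micro_batches)    # assumed: average over k steps
    return w - lr * merged


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
micro_batches = [data[:2], data[2:]]             # two equally sized micro-batches

w_merged = sgd_step_with_gradient_merge(0.5, micro_batches, lr=0.01)
w_full = 0.5 - 0.01 * mean_gradient(0.5, data)   # one step on the full batch
```

With equally sized micro-batches, `w_merged` matches `w_full` up to floating-point rounding, which is why accumulation lets a large effective batch size fit into limited device memory.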
High-level APIs
- Added the IterableDataset base class to support streaming datasets. DataLoader supports multi-process acceleration of IterableDataset, and paddle.io.get_worker_info() returns the state of the current child process so that data can be divided among processes
- The places parameter of paddle.io.DataLoader is now optional; when places is not specified, the default places are used
- Added 10 map-style datasets, such as CIFAR10, CIFAR100, and Conll05st, which support automatic download and map-style data access
- The DistributedBatchSampler interface adds the num_replicas and rank parameters for specifying the number of cards and the logical index of the current card
- Added paddle.io.TensorDataset to support reading tensor datasets
- Added the paddle.io.Sampler base class, together with SequenceSampler and RandomSampler, for fetching data in BatchSampler in sequential or random order
- paddle.io.BatchSampler accepts a Sampler as input; the original input parameter indices is removed
- Removed the original APIs in paddle.reader
- The image transforms in paddle.vision.transforms add a backend for processing PIL images
- paddle.summary supports Layers with multiple inputs and multiple outputs
- model.save is upgraded. When saving an inference model in dynamic graph mode, users no longer need to call paddle.jit.to_static or add a decorator to the layer function (the dynamic-to-static feature). If inputs is passed when the Model is initialized, the correct input shape is saved; otherwise the input shape is saved according to the shape of the inputs passed when the model is run
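Dividing a streaming dataset among DataLoader worker processes, as mentioned for IterableDataset above, usually comes down to each worker keeping every n-th item of the stream. The sketch below shows that partitioning with the standard library only; `shard_stream`, `worker_id`, and `num_workers` are hypothetical names standing in for whatever the worker state reports, and the round-robin scheme is an assumption rather than something these notes prescribe.

```python
from itertools import islice


def shard_stream(stream, worker_id, num_workers):
    """Yield every num_workers-th item of stream, starting at offset worker_id."""
    if not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must be in [0, num_workers)")
    return islice(stream, worker_id, None, num_workers)


# Each worker iterates over its own copy of the stream and keeps only its share.
shards = [list(shard_stream(iter(range(10)), worker_id, 3))
          for worker_id in range(3)]
```

Every item lands in exactly one shard, so the workers together cover the stream without duplicating samples.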
Function Optimization (Including Distributed)
Dynamic Graph
- Added oneDNN dynamic graph support, covering Resnet50 model training and inference. It is faster than CPU NativeConfig training. @intel
Dynamic Graph to Static Graph
- Migrated the dynamic-to-static APIs to V2.0, simplifying the import path
- The to_static decorator can now decorate a model instance directly, for example to_static(model, input_spec)
- Added default-value resolution for the name parameter of InputSpec: if name is not specified, the parameter name of the decorated function is used as name
- StaticLayer is renamed to StaticFunction
- Optimized the dynamic-to-static debug log
- Fixed dynamic-to-static bugs in some scenarios
Mixed Precision Training
- Added the fused_bn_add_act OP, which fuses the batch_norm, elementwise_add, and activation OPs
- Added the in-place addto strategy for gradient aggregation, which supports in-place gradient accumulation and improves performance by 6.3% in ResNet-50 mixed precision training
- Refactored the gradient validity check and dynamic loss scaling logic in static graph mixed precision training, and removed some condition block logic
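The gradient validity check and dynamic loss scaling mentioned above follow a well-known pattern: skip the update and shrink the scale when a gradient overflows, and grow the scale again after a run of clean steps. The following standard-library sketch illustrates that pattern only; the class, its parameter names, and the default values are hypothetical and do not describe Paddle's internals.

```python
import math


class DynamicLossScaler:
    """Track a loss scale that backs off on overflow and grows when stable."""

    def __init__(self, init_scale=2.0 ** 15, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=1000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def unscale(self, scaled_grads):
        """Divide scaled gradients by the current scale before the update."""
        return [g / self.scale for g in scaled_grads]

    def update(self, scaled_grads):
        """Return True if the step should be applied, and adjust the scale."""
        if any(math.isinf(g) or math.isnan(g) for g in scaled_grads):
            self.scale *= self.backoff_factor    # overflow: shrink and skip
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= self.growth_factor     # stable: try a larger scale
            self._good_steps = 0
        return True


scaler = DynamicLossScaler(init_scale=8.0, growth_interval=2)
steps = ([1.0, 2.0], [float("inf"), 1.0], [0.5], [0.25])
applied = [scaler.update(grads) for grads in steps]
```

In this run the second step overflows, so it is skipped and the scale halves; two clean steps later the scale doubles back to its starting value.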
Distributed Training Optimization
- Optimized the lars strategy: ResNet50 distributed multi-card training with a 16k batch size reaches a time2train of less than 10 minutes
- Supported pipeline parallel training
- Supported heterogeneous distributed training in parameter server mode on multiple device combinations, including PS+GPU, PS+Kunlun, PS+CPU, and PS+CPU+GPU (Kunlun). On a single GPU/Kunlun machine plus 10 CPU machines, a click-through-rate model with hundreds of billions of parameters is trained on tens of millions of samples within minutes
- Upgraded the large-scale sparse feature: it supports sparse IDs in the int64 range, self-growing sparse tables, configurable admission conditions, and incremental model saving
- Distributed training supports control-flow multi-task training, with performance more than 50% higher than instag multi-task training
Model Quantization
- Added per-channel quantization for dynamic graphs, which computes quantization parameters per channel for the weights of layers such as Conv2D and Linear
- Added computation of the output scale parameter of model layers during dynamic graph quantization training, for use in server-side quantized inference deployment
- Optimized offline quantization of static graphs to avoid saving temporary data to disk
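Per-channel weight quantization, as listed above, computes a separate scale for each output channel instead of one scale for the whole weight tensor. The sketch below assumes symmetric abs-max int8 quantization, which these notes do not specify; it uses nested lists in place of tensors, and the function names are hypothetical.

```python
def quantize_per_channel(weight, bits=8):
    """Quantize each output channel (row) with its own abs-max scale."""
    qmax = 2 ** (bits - 1) - 1                   # 127 for int8
    scales, quantized = [], []
    for channel in weight:
        scale = max(abs(v) for v in channel) / qmax
        if scale == 0.0:
            scale = 1.0                          # all-zero channel: any scale works
        scales.append(scale)
        quantized.append([round(v / scale) for v in channel])
    return quantized, scales


def dequantize(quantized, scales):
    """Map integer codes back to floats using the per-channel scales."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


# Two channels with very different magnitudes each get their own scale.
weight = [[0.5, -1.0, 0.25], [0.01, 0.02, -0.04]]
codes, scales = quantize_per_channel(weight)
restored = dequantize(codes, scales)
```

Because the second channel gets its own, much smaller scale, its small weights still use the full integer range instead of collapsing to a few codes near zero.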
Model Saving and Loading
- paddle.jit.save can store Layer objects that have not been converted by paddle.jit.to_static, which broadens the interface's usage scenarios
- Standardized the set_dict method name of APIs such as Layer and Optimizer, renaming it uniformly to set_state_dict
- paddle.load can load a Layer's state_dict from the results stored by fluid.io.save_inference_model
- paddle.load can load a Layer's state_dict from the default results stored by fluid.io.save_params/persistables, which connects the interface systems and improves usability
- Changed the behavior of paddle.save/load: paddle.save no longer adds a suffix to the stored result, and paddle.load returns only one result per call, which standardizes the interface semantics
- Added the program method to paddle.jit.TranslatedLayer for getting the program of a model loaded by paddle.jit.load, which makes the model structure easier to inspect
- Removed paddle.SaveLoadConfig. For compatible-loading scenarios of interfaces such as paddle.jit.save, paddle.jit.load, and paddle.load, pass additional configuration through **kwargs, which simplifies interface usage
- Changed the meaning of the model_path parameter of paddle.jit.save and paddle.jit.load: the string passed by the user is used as a prefix of the stored files rather than as a directory
- The original static graph APIs paddle.io.save, paddle.io.load, paddle.io.save_inference_model, and paddle.io.load_inference_model are moved to the paddle.static module
Performance Optimization (Including Distributed)
- Improved the performance of the Argsort OP when the number of elements in the input Tensor equals the length of its axis dimension: the forward pass is 34 times faster and the backward pass is 10 times faster
Basic Functions for Dynamic Graph
- Added the clone interface for Tensor, which copies an identical Tensor; the cloned Tensor remains in the computation graph and supports gradient backpropagation
- Hid the scale_loss and apply_collective_grads methods of the dynamic graph multi-card API DataParallel: these two methods no longer need to be called when writing multi-card model code, which simplifies usage and improves usability
- Supported in-place modification of a Tensor through indexing or slicing
- Optimized dynamic graph Tensor printing and display: high-dimensional tensor display is aligned with numpy and supports an abbreviated form
- Optimized the __call__ method of the initializer classes: block no longer needs to be passed in, so users do not encounter the static graph block concept in dynamic graph mode
Debugging Analysis
- Continued to improve the messages of about 1500 error checks in paddle, making the framework easier to debug
Compiling and Installation
- Added support for python3.8 in the installation package
- Removed the installation dependency on matplotlib
- Removed the installation dependency on graphviz
- Removed the installation dependency on objgraph
- Removed the installation dependency on netifaces
- Removed the installation dependency on nltk
- Removed the installation dependency on opencv
- Added support for cuda10.1 and cuda10.2 in the installation package
- The prediction library supports the cuda10.2-cudnn8-trt7.1 version
Bug Fixing
- Fixed an error raised by gradient clipping with GradientClipByGlobalNorm in networks where the Paddle default dtype is float64
- Fixed the failure to load CUDA-related DLLs in the CUDA 10.1/10.2 builds on Windows
- Fixed a bug in copying Tensors between CUDAPinnedPlace and other Places
- Fixed an error when paddle.jit.load loads a Layer without parameters
- Fixed incorrect results of paddle.diag for large inputs, and fixed abnormal memory usage of paddle.diag on Windows with Python 3.8
- Fixed the unreasonable output shape of paddle.topk when building static graph networks
- Fixed the bug that paddle.io.DataLoader in multi-process mode exits with an error when launched through paddle.distributed.spawn
- Fixed the problem that the runtime device set through paddle.set_device does not take effect in some scenarios
- Fixed the gradient error caused by paddle.static.nn.while_loop using forward-computation variables in its backward computation
- Fixed the bug that fleet does not support paddle.optimizer
- Fixed the discrepancy between the Adam optimizer formula and the paper
- Fixed slow compilation on some machines caused by logsumexp
- Fixed the missing type check in ParamAttr
- Fixed the average pooling kernel calculation on CPU when the AvgPool API is used with ceil_mode=true
- Fixed the dimension mismatch when paddle.distributed.fleet.init_server() loads a model
- Fixed the problem that training nodes do not support GPU in paddle.distributed.fleet parameter server mode
- Fixed the precision diff of paddle.allclose with the float64 data type
- Fixed errors in the backward pass of grouped conv operators (conv2d grad op with groups) @intel
- Fixed the bug that a model decorated with to_static cannot be saved after switching directly to eval mode
- Fixed the bug that matmul does not support fp16
- Fixed the poor performance and high memory consumption of the matmul backward computation
- Fixed the error when the bias_attr and weight_attr parameters of paddle.nn.Transformer are specified as bool or list/tuple
- Fixed the problem that dynamic_decode cannot end decoding early correctly during prediction
- Fixed the incorrect result of paddle.unsqueeze when axis is a Tensor
- Fixed problems caused by zero_copy in paddle.to_tensor in some scenarios by temporarily disabling the zero_copy behavior
Inference
Paddle Inference
- Changed the default name of prediction library from fluid_inference to paddle_inference
API
Function Upgrading
- The Paddle-TRT dynamic shape feature supports Int8 models quantized by PaddleSlim
- Paddle Inference GPU Int8 supports conv2d_transpose quantization
- Added operator version information to prediction models
- Added support for (de/re)quantization with shifted scales to the oneDNN INT8 quantization strategy @intel
- Added oneDNN BF16 support: the conv2d bf16 operator and gru bf16 op are supported, and resnet50 bf16 model inference is enabled @intel
Performance Optimization
- Inference of the ERNIE model with Paddle-TRT FP16 on T4 is 15% faster. @NVIDIA
- Added support for oneDNN FP32 GRU and oneDNN INT8 GRU. The GRU INT8 model is about 1.49 times faster than NativeConfig inference (thread=1, batch_size=50) @intel
- With oneDNN upgraded to 1.6, Ernie Large oneDNN inference on Skylake (Intel Core 6148) is about 2.7 times faster (unit test test_analyzer_ernie_large) @intel
Bug Fixing
- Fixed a memory leak with variable-length input when the Paddle Inference ZeroCopyRun interface is used with MKLDNN enabled
- Fixed a prediction error when the ERNIE model contains shared parameters
- Fixed an initialization error of the prediction library built with Paddle-TensorRT in environments where TensorRT is not installed
- Fixed a dimension calculation error when the softmax op and layer_norm op are used in Paddle-TRT prediction
- Fixed the issue that increasing cpu_math_library_num_threads_ does not improve prediction performance (PaddleOCR repository) @intel
- Fixed the oneDNN concat data-overwriting error @intel
- Fixed the error reported when running oneDNN inference on an NHWC model @intel
- Fixed the oneDNN prediction failure of the rec_r34_vd_tps_bilstm_attn model @intel
- Fixed the deeplabv3p_xception oneDNN prediction failure by adding support for conv with dilations @intel