Release Note
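- Programming paradigm: Dynamic graph mode is enabled by default for model development and training, and dynamic-to-static conversion is used for model deployment and training acceleration. To use the static graph programming paradigm, switch to static graph mode by calling paddle.enable_static().
- API system: APIs have been added and the directory structure has been reorganized to make the framework easier to use; see the API documentation for details. A high-level API is also provided to simplify the workflow; see the PaddlePaddle High-Level API Usage Guide for details.
- Framework features: Data loading, dynamic graph execution, OP performance, mixed precision training, distributed training, and dynamic-static conversion have been enhanced and optimized.
- Environment adaptation: Added support for ARM-based CPUs, Python 3.8, and CUDA 10.1/10.2. Released experimental installation packages supporting CUDA 11 and the Baidu Kunlun chip. For details, see Start.
- Model zoo and development kits: Most models in the official PaddlePaddle model zoo and kits have been upgraded to PaddlePaddle framework 2.0.0.
- PaddleHub: Supports 2.0 dynamic graphs and has fully migrated to the dynamic graph programming mode, which makes model development and debugging more convenient and the finetune interface more flexible and easier to use.
- PaddleDetection: Supports 2.0 dynamic graphs. Covers mainstream detection algorithms (PP-YOLO, Faster-RCNN, SOLOv2), supports dynamic-static conversion, connects to inference deployment, and offers a more modular way to build networks.
- PaddleClas: Supports 2.0 dynamic graphs. Provides 29 series of classification algorithms and 134 pre-trained models, plus an optimization scheme based on SSLD knowledge distillation that generally improves classification accuracy by more than 3%.
- PaddleSeg: Supports 2.0 dynamic graphs. Provides 50+ high-quality pre-trained models, supports 15+ mainstream segmentation networks, and includes the industry SOTA model OCRNet, which improves the usability of the product.
- PaddleOCR: Supports 2.0 dynamic graphs. The PPOCR system, the text detection models (DB, EAST, SAST), and the text recognition models (Rosetta, CRNN, StarNet) have all been adapted to 2.0 dynamic graphs.
- PaddleGAN: Supports 2.0 dynamic graphs. All nine models, including style transfer, video enhancement, lip-sync transfer, and face cartoonization, are developed on dynamic graphs.
- PaddleRec: Supports 2.0 dynamic graphs. Installation-free, with unified dynamic and static network building, which simplifies research and production launch. Classic recommender-system datasets have also been organized and released.
- PaddleNLP: Supports 2.0 dynamic graphs. Provides 25+ pre-trained models and easy-to-use APIs that improve text modeling efficiency.
- Parakeet: Supports 2.0 dynamic graphs. The released acoustic models and vocoders work well with the dynamic graph version.
- PaddleVideo: Supports 2.0 dynamic graphs. Includes video classification and video action localization models, namely TSN, TSM, SlowFast, AttentionLSTM, and BMN, as well as the featured application pre-trained models VideoTag and FootballAction.
- AmazonDJL: An easy-to-use Java inference interface that supports the major operating systems (Mac/Windows/Linux) and the deployment of Paddle pre-trained models. For more information, see the official DJL documentation on Paddle support.
- The PaddlePaddle framework plans to drop support for Python 2 and Python 3.5 in a future version. We recommend upgrading to Python 3.8 to use PaddlePaddle.
- The PaddlePaddle framework plans to drop support for CUDA 9 in a future version. We recommend upgrading your CUDA version to use PaddlePaddle.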
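- Programming paradigm: PaddlePaddle 2.0.0 enables the imperative programming paradigm (dynamic graph) by default but still supports static graphs. Static graph code (including static graph code from version 1.8) can be run after adding paddle.enable_static().
- API: PaddlePaddle 2.0.0 recommends the APIs located in the paddle root directory, while all 1.x APIs are retained under paddle.fluid so that the earlier API system remains supported. Therefore, 1.x static graph training code runs normally on 2.0.0 after adding paddle.enable_static(), and models saved by 1.x training can be used for inference with 2.0.0.
- We have compiled a mapping table from the 1.8 APIs to the 2.0 APIs.
- We provide a migration tool to help you migrate code based on earlier versions to 2.0.0. See Version Migration Tool.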
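- Basic APIs
- API directory structure adjustment: The 1.x APIs are mainly located in the paddle.fluid directory. This version reorganizes the API directory structure so that the classification is more reasonable. For the adjusted directories, see the API documentation.
- Added 186 new APIs and fixed or improved 260 APIs. See the release notes of the 2.0.0 pre release version and the API documentation.
- Added distributed basic communication APIs to paddle.distributed: broadcast, all_reduce, reduce, all_gather, scatter, barrier; the dynamic graph multi-card training startup APIs spawn and init_parallel_env; and the unified dynamic-static startup method fleetrun.
- Network-building APIs are unified across dynamic and static graphs and run in both dynamic graph mode and static graph mode.
- High-level API
- Added the PaddlePaddle high-level API, which encapsulates common operations in model development such as network building, training, evaluation, prediction, and saving/loading, to enable low-code development. See the PaddlePaddle High-Level API Usage Guide.
- Added the distributed high-level API paddle.distributed.fleet, which supports combinations of multiple optimization strategies, automatic parallelism, distributed metrics calculation, and InMemoryDataset through the DistributedStrategy configuration.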
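- Usability optimization:
- Tensor enhancements: Added the Tensor copy interface Tensor.clone() and more than 120 Tensor computation interfaces (such as Tensor.cos()). Added in-place modification of a Tensor through indexing or slicing. Added automatic type promotion for operations between a Tensor and a Scalar. Optimized the printed output of dynamic graph Tensors so that the display format matches Numpy.
- Layer enhancements: Added the Layer deep copy interface Layer.deepcopy(). Added the interface Layer.dir() for viewing Layer attributes and functions. From this version on, the Trace function still records backward operations automatically after Layer.eval() is called; if backward recording is not needed, call paddle.no_grad() explicitly.
- Added the set_lr() interface to Optimizer so that the learning rate can be adjusted flexibly in dynamic graph mode.
- Added the set_global_initializer() interface for defining a global parameter initialization method.
- Simplified multi-card code: scale_loss and apply_collective_grads no longer need to be called explicitly.
- Performance optimization:
- APIs such as Embedding support sparse parameter gradient updates in multi-card training.
- Dynamic graph training and inference now support the Intel acceleration library oneDNN (formerly MKL-DNN). The Resnet50 model can be up to 6 times faster in the CPU training scenario.
- Added dynamic graph Inplace computation, which reuses Tensor storage and reduces video memory usage. Added the View method, which changes the Tensor description while sharing the underlying storage.
- [Incompatible upgrade] Added dynamic graph gradient accumulation, which effectively enlarges the BatchSize. The backward() interface no longer clears gradients by default; call optimizer.clear_grad() explicitly to clear them.
- Bug fixes:
- Fixed the issue where multiple models interfered with each other when switching between train and eval.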
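- Added support for the return syntax: a function can return early inside if-elif-else or loop conditions and can return tensors of different types or None.
- Added support for functions whose signature contains a **kwargs parameter.
- Added syntax support for iterating over Tensor and TensorList with for and for enumerate, which makes traversing a Tensor more flexible.
- Added support for more Python syntax, such as print, assert, cast, isinstance, tuple, and dict.pop().
- Changed the return type of dynamic-to-static conversion from a callable function to a Class. Its code, main_program, and other interfaces make it easier to obtain the converted static graph information.
- The dynamic-to-static decorator to_static can now decorate a model instance directly, for example to_static(model, input_spec).
- Added the jit.not_to_static decorator, which excludes a function from dynamic-to-static conversion.
- Added the set_verbosity() and set_code_level() interfaces, which set the level at which logs or intermediate code of the dynamic-to-static process are shown.
- Added InputSpec, which specifies the shape and data type of input Tensor variables for dynamic-to-static conversion.
- Optimized error messages: errors are located at the offending line of the original dynamic graph code, and messages unrelated to the user are hidden.
- Supports breakpoint debugging with pdb.set_trace().
- Added the paddle.jit.save interface for saving dynamic-to-static models. The interface can also save Layer objects that have not been transcribed by paddle.jit.to_static, as well as paddle.DataParallel models. Removed the old interface ProgramTranslator.save_inference_model.
- Added the paddle.jit.load interface for loading prediction models stored in static graph format, including models saved by paddle.jit.save and paddle.io.save_inference_model. A loaded model can be used for inference or fine-tuning under the dynamic graph.
- Added the program method to paddle.jit.TransLatedLayer for obtaining the program of a model loaded by paddle.jit.load, which helps in understanding the model structure.
- [Incompatible upgrade] The meaning of the model_path parameter of paddle.jit.save and paddle.jit.load has changed: it is now the prefix of the stored files rather than a directory.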
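- Mixed precision policy upgrade: In addition to the black and white list policy (hereinafter "O1 policy"), added the "Almost FP16" policy (hereinafter "O2 policy"), which uses FP16 for computation wherever possible.
- Added the FP16 Guard function, which lets users control whether an individual Op in the model uses the FP16 compute type.
- Users can customize which types of Op stay in FP32 computation.
- With the O2 policy, Resnet50 and Bert base train at 1400 images/s and 590 sequences/s, respectively, on a single V100 card.
- Usability optimization:
- A single package now manages all interfaces related to static graph mixed precision training.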
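- Collective communication All Reduce
- Supports hybrid parallel training of 100-billion-scale language models: pipeline parallel training based on the executor interface, the sharding-DP strategy, the GradientMerge+AMP strategy, the Recompute+Offload strategy, and the megatron strategy.
- Supports dynamic graphs: multi-stream communication, automatic rebuild group, high-performance sparse parameter communication, and multi-card gradient order consistency.
- Parameter server (PS)
- Upgraded large-scale sparse features: upgraded the large-scale sparse PS-API and abstracted base classes for the communication component, parameter table, and optimizer, so that users can extend them by subclassing. Also supports streaming training with 100 billion features, including feature admission, feature eviction, incremental training, and distributed metrics prediction. The communication layer switched from GRPC to BRPC.
- Open-sourced the heterogeneous parameter server, which supports the traditional pure-CPU PS, a pure-GPU PS based on three-level storage (SSD/memory/video memory), and a mixed PS of CPU machines with GPU or Kunlun machines. It can train click-through rate prediction models with trillions of parameters in minutes.
- New training mechanism support:
- Supports control-flow-based multi-task distributed training, which is more than 50% faster than Intag-based multi-task training.
- Distributed startup optimization
- Supports launching distributed low-level APIs.
- Removed the grpc dependency and added some fault tolerance, which improves the stability of distributed task startup.
- Supports starting multi-CPU collective communication with Gloo.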
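- Standardized the set_dict method name of APIs such as Layer and Optimizer: it is now set_state_dict everywhere.
- Enhanced paddle.load compatibility: a Layer's state_dict can be loaded from the results saved by interfaces such as fluid.io.save_inference_model and fluid.io.save_params/persistables.
- Changed the behavior of paddle.save/load to standardize the interface semantics: paddle.save no longer adds a suffix to the saved result, and paddle.load returns exactly one result per call.
- Removed paddle.SaveLoadConfig. For compatible-loading scenarios of paddle.jit.save, paddle.jit.load, and paddle.load, pass extra configuration through **kwargs, which simplifies the interfaces.
- Moved the former static graph APIs paddle.io.save, paddle.io.load, paddle.io.save_inference_model, and paddle.io.load_inference_model to the paddle.static module.
- Improved the paddle.static.load_program_state interface: when var_list is not specified, stray files in the load directory now produce a warning instead of an error.
- Extended the dynamic and static graph execution engines to support complex-valued neural network training and complex gradient accumulation.
- Added complex number support to Ops such as mul, div, matmul, kron, and abs.
- Added an API that converts Paddle 2.0 dynamic graph models to the ONNX protocol.
- Added conversion of models such as PPOCR, PPYOLO, FasterRCNN, and ERNIE.
- Broader Paddle op coverage: 88 Paddle OP operators are supported, and export to ONNX operator sets of versions 1 to 12 is supported.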
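- Data reading performance: Simplified the underlying DataLoader implementation in dynamic graph mode and reduced the overhead of reader threads, which further improves data reading efficiency and overall training speed. The overall training speed of MobileNetV1 on a single V100 card with BatchSize=128 improves by 34%.
- Upgraded and optimized dynamic graph network-building APIs: a large number of dynamic graph APIs now call automatically generated Pybind interfaces directly, which improves performance significantly.
- Improved the training performance of Resnet50 with oneDNN in dynamic graph mode. In the CPU scenario, Resnet50 oneDNN dynamic graph training is currently 6.4 times faster.
- argsort: Optimized the case where the number of elements in the input Tensor equals the length of the dimension being sorted. The forward pass is 34 times faster and the backward pass 10 times faster.
- dropout: Optimized GPU performance. FP32 performance improves by about 20% and FP16 performance by about 50%.
- cast: Optimized GPU performance, with a 10% to 20% improvement.
- softmax: Optimized GPU performance for axis=-1, with a 3x to 96x improvement depending on the shape.
- Other OPs: cumsum, reshape, Flatten, IndexSelect, Roll, elementwise_add, AdamW, and the RNN OPs (LSTM, GRU, SimpleRNN) all show clear performance improvements.
- Added the fused_bn_add_act fusion strategy, which automatically fuses and accelerates the batch_norm+elementwise_add+activation pattern.
- Added the inplace addto strategy for gradient aggregation, which supports in-place gradient accumulation and improves ResNet-50 mixed precision training performance by 6.3%.
- Optimized the lars strategy: the time2train metric of ResNet50 distributed multi-card training with a 16k batch size is under 10 minutes.
- Optimized the distributed performance of paddle.fleet amp: fixed the case where the last communication did not overlap with computation. FP16 performance on 4 machines with 32 cards improves by about 0.5%.
- Optimized the distributed performance of paddle.fleet.gradient_merge: gradients are aggregated before communication, which improves multi-machine performance by 20%-40% and reaches linear speedup.
- Optimized the performance of the parameter server communication component Communicator. With GEO communicating once every 400 batches, W2V model throughput and Simnet-Bow model performance both improve significantly. In Async mode, compared with PaddlePaddle framework 1.8, W2V model throughput improves by 11% and CTR-DNN model performance improves by 14%.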
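- Replaced the 100 or so places in the framework that raised exceptions through LOG(FATAL) with PADDLE_THROW, which improves the format and content of errors caused by behavior the framework does not support.
- Improved the Signal Handler implementation in the framework and optimized the format and content of errors reported when a system signal error occurs during execution.
- Optimized the framework error stack format: the compile-time Python error stack is moved below the native error stack, which makes error messages easier to read.
- Further improved the error types and messages of about 1500 check errors in the framework in total, which improves overall debugging usability.
- Enhanced dynamic graph error messages: error messages from the Pybind layer under the dynamic graph are systematically improved.
- Optimized the exception types raised on the Paddle Python side so that they align with native Python error types.
- The C++ error stack is hidden by default. Optimized the error format shown after hiding the C++ stack and removed the separator Error Message Summary so that the format aligns with native Python errors.
- Optimized the error messages shown when some APIs of the static module are used outside static graph mode. This covers 9 APIs: static.append_backward, static.gradients, static.scope_guard, static.Print, static.nn.embedding, static.nn.data_norm, static.nn.multi_box_head, static.nn.nce, and static.nn.py_func.
- Optimized the error message shown when a Tensor passed in under the dynamic graph is None.
- Optimized the printed output of Layer, which now shows the hierarchy of the sub-layers inside a Layer.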
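- Enhanced quantization during dynamic graph training: added a class that manages dynamic graph quantization in a unified manner. It currently supports quantizing weighted layers such as Conv2D and Linear, computing quantization parameters per channel for weights, quantizing weightless layers such as ReLU and Tanh, and skipping quantization for specified Layers.
- Added the computation of output scale parameters for model layers during dynamic graph quantization training, for use in server-side quantized inference deployment.
- Dynamic graph quantized models can be deployed for inference with Paddle-Lite.
- Offline quantization supports fusing conv+bn in advance and producing LSTM quantized models. The function for saving sampled data to temporary files has been removed.
- Static graph quantization supports Conv2d_tranpose quantization and per-channel quantization of Linear.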
Paddle Inference
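- The inference C++ API has been fully upgraded, and the new API is recommended. The old API is kept for now but emits a warning when used, and it is planned for removal in the future. The new API mainly standardizes naming and simplifies usage. Important changes include:
- A new namespace for the C++ interface that contains the inference-related interfaces.
- A renamed tensor type that serves as the default input/output representation of the inference interface.
- Simplified predictor creation that keeps support only for AnalysisConfig; other Config types are no longer supported.
- New service-related utility classes for use when multiple predictors are created.
- Paddle 2.0 adds or upgrades some operators. Starting from this version, forward operator versions are defined and subject to compatibility constraints. Aligning operator versions between frameworks ensures that the same operator version has the same definition and behavior in different frameworks, which makes the framework as a whole more robust.
- Added a registration mechanism for inference forward operator versions and included incompatible operator upgrades in the statistics.
- Added operator version information to prediction models. The inference library can identify the operator definitions that correspond to a model from the model file, which avoids computation errors caused by differing definitions.
- Old interfaces remain compatible, which improves usability.
- Added six APIs for serializing/deserializing a program, serializing/deserializing params, and saving a model/parameters to a file or loading them from a file.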
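NV GPU inference
- Added support for TRT 7.1.
- Added support for Jetson Nx hardware.
- Paddle-TensorRT has stronger support for PaddleSlim quantized models, covering CV tasks such as detection, classification, and segmentation.
- Paddle-TRT supports the clip op and can run the GhostNet classification model.
- Paddle-TRT supports models that contain mul ops with channelwise quantization and can run PaddleOCR detection and recognition quantized models in Paddle-TRT int8.
- The Paddle-TRT dynamic shape feature supports PaddleSlim quantized Int8 models.
X86 CPU inference
- Added oneDNN BF16 support: conv2d and gru bf16 computation is supported, and BF16 inference currently works for the resnet50, googlenet, mobilenetv1, and mobilenetv2 models.
- Added support for quantization and dequantization with shifted scales to the oneDNN INT8 quantization strategy.
- Added version compatibility support for some oneDNN operators.
- Added INT8 oneDNN kernel support on the CPU side.
- Made quantized models easier to test on CPU: the original model and the quantized model can be compared in the same test.
- Python inference now supports user-defined OPs.
Memory / video memory
- X86 inference supports dynamic graph quantized models.
- NVIDIA GPU inference supports dynamic graph quantized models.
- When built with ON_INFER, FLAGS_call_stack_level is on by default and error messages show the call stack.
- Upgraded the conversion and optimization of quantized models.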
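- NV GPU
- Optimized the CUDA ArgMin and ArgMax OPs, which reduces their binary size from 60M to 1.3M.
- ERNIE inference with Paddle-TRT FP16 on T4 is 15% faster.
- With TensorRT enabled, the ERNIE model now supports variable-length input, which improves performance by 147%. With cuda10.1, cudnn 7.6, tensorrt 6.0, OSS 7.2.1, the ernie-base-2.0 model, the QNLI dataset, and an input BatchSize of 32, performance on an Nvidia Tesla T4 rises from 905 sentences/s to 2237 sentences/s. Sample code: Paddle-Inference-Demo/c++.
- X86 CPU
- Added the conv + affine_op pass. On a 6248 machine, MASK-RCNN fp32 single-thread performance improves by 26%.
- Added the fc + gru pass and the oneDNN (formerly MKL-DNN) GRU fp32 kernel, which makes 4-thread inference of GRU fp32 models 20% faster on an Intel Xeon 6248 machine.
- With oneDNN INT8 GRU support, GRU INT8 models are about 1.65 times faster than NativeConfig inference (threads = 1, batch_size = 50).
- Added fuse support for oneDNN batchnorm + activation, which improves pvanet_ocr model performance by 2.8%.
- Added oneDNN FC + Gelu, FC + Sigmoid, and FC + tanh operator fusion, which improves BERT inference performance by 4.5%.
- Added oneDNN inplace support for some Ops.
- Optimized the oneDNN LRN op, which makes the GoogleNet fp32 model 1% faster.
- After upgrading oneDNN to 1.6, Ernie Large oneDNN inference on Skylake (Intel Core 6148) is about 2.7 times faster (measured with the unit test test_analyzer_ernie_large).
- Added the oneDNN forward operator for interpolate. ocr_det model inference is currently 2.04 times faster than plain CPU Native inference.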
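The percentage gains and "times faster" figures in this list are easy to misread, so here is a small worked check of the one item that gives both raw numbers (905 and 2237 sentences/s for ERNIE on T4). The helper name `percent_gain` is made up for this illustration.

```python
def percent_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# Throughput figures quoted for ERNIE with variable-length input on T4.
gain = percent_gain(905, 2237)
ratio = 2237 / 905

print(f"{gain:.0f}% higher throughput, {ratio:.2f}x the baseline")
```

A gain of 147% therefore corresponds to roughly 2.47 times the baseline throughput, not 1.47 times.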
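Paddle Lite
The on-device inference engine Paddle Lite v2.8 is adapted to the main framework v2.0.
- Released installation packages that support Kunlun chips on x86 CPUs and Phytium CPUs.
- Added Python 3.8 support to the installation packages.
- Added cuda10.1 and cuda10.2 support to the installation packages.
- (experimental) Released installation packages that support cuda11.
- Upgraded NCCL to 2.7.8 in the Paddle images for cuda10.1 and above and in the CI system images.
- Upgraded oneDNN (formerly MKL-DNN) from 1.3 to 1.5.
- Added the openssl-dev dependency to the images.
- Removed the installation dependencies nltk, opencv, scipy, rarfile, prettytable, pathlib, matplotlib, graphviz, and objgraph.
- Paddle avx and no_avx builds are released separately, which reduces the whl package size by 40%. The avx build is installed by default. Installation error messages are improved: the installer checks the user's CPU type against the Paddle build and reports the matching installation error.
- Improved the pypi installation experience for the Paddle develop version: the installation path is shorter, and pip --pre is enough to install it.
Inference engine Paddle Inference
- The inference library supports cuda10.2-cudnn8-trt7.1.
- Released installation packages that support jetpack and a C++ inference library that supports nv_jetson.
- Released two new wheel packages built with tensorrt: cuda10.0-cudnn7.6-trt6.0.1.5-python36 and cuda10.0-cudnn7.6-trt6.0.1.5-python36.
- Fixed the joint build strategy: GPU packages that include tensorrt are released separately, so that users who install other GPU packages no longer hit a missing-tensorrt error.
- Fixed duplicated content in the packaged inference library.
- Kunlun chip: supports single-card training and static graph multi-card training, with 10+ models released.
- Ascend 910 chip: supports single-card training.
- Because of an issue in cuDNN 8.0.x itself, many models run slower when the inference library is compiled with cuDNN 8.0.x and TensorRT acceleration is not used. This awaits a fix in a later cuDNN version. As a workaround, use TensorRT acceleration or cuDNN 7.6.
- Because of an issue in cuDNN 8.0.x itself, some models leak memory during inference with cuDNN 8.0.x. So far this has been observed when cuDNN's convolutionBiasActivationForward is used. As a workaround, disable conv_elementwise_add_act_fuse_pass、conv_elementwise_add_act_fuse_pass through config.pass_builder()->DeletePass() in the inference configuration. If the leak persists, try cuDNN 7.6 and send us the affected model through an issue so that we can analyze it.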
Release Note
The PaddlePaddle framework V2.0.0 has the following updates:
- Programming paradigm: Dynamic graph mode is enabled by default for model development and training, and dynamic-to-static conversion is used for model deployment and training acceleration. To use the static graph programming paradigm, switch to static graph mode by calling paddle.enable_static().
- API system: APIs have been added and the directory structure has been reorganized to make the framework easier to use; see the API documentation for details. A high-level API is also provided to simplify the workflow; see the PaddlePaddle High-Level API Usage Guide for details.
- Framework features: Data loading, dynamic graph execution, OP performance, mixed precision training, distributed training, and dynamic-static conversion have been enhanced and optimized.
- Environment adaptation: Added support for ARM-based CPUs, Python 3.8, and CUDA 10.1/10.2. Released experimental installation packages supporting CUDA 11 and the Baidu Kunlun chip. For details, see Start.
- Model zoo and development kits: Most models in the official PaddlePaddle model zoo and kits have been upgraded to PaddlePaddle framework V2.0.0.
- PaddleHub: Supports dynamic graph V2.0 and has fully migrated to the dynamic graph programming mode, which makes model development and debugging more convenient and the finetune interface more flexible and easier to use.
- PaddleDetection: Supports dynamic graph V2.0. Covers mainstream detection algorithms (PP-YOLO, Faster-RCNN, SOLOv2), supports dynamic-static conversion, connects to inference deployment, and offers a more modular way to build networks.
- PaddleClas: Supports dynamic graph V2.0. Provides 29 series of classification algorithms and 134 pre-trained models, plus an optimization scheme based on SSLD knowledge distillation that generally improves classification accuracy by more than 3%.
- PaddleSeg: Supports dynamic graph V2.0. Provides 50+ high-quality pre-trained models, supports 15+ mainstream segmentation networks, and includes the industry SOTA model OCRNet, which improves the usability of the product.
- PaddleOCR: Supports dynamic graph V2.0. The PPOCR system, the text detection models (DB, EAST, SAST), and the text recognition models (Rosetta, CRNN, StarNet) have all been adapted to dynamic graph V2.0.
- PaddleGAN: Supports dynamic graph V2.0. All nine models, including style transfer, video enhancement, lip-sync transfer, and face cartoonization, are developed on dynamic graphs.
- PaddleRec: Supports dynamic graph V2.0. Installation-free, with unified dynamic and static network building, which simplifies research and production launch. Classic recommender-system datasets have also been released.
- PaddleNLP: Supports dynamic graph V2.0. Provides 25+ pre-trained models and easy-to-use APIs that improve text modeling efficiency.
- Parakeet: Supports dynamic graph V2.0. The released acoustic models and vocoders work well with the dynamic graph version.
- PaddleVideo: Supports dynamic graph V2.0. Includes video classification and video action localization models, namely TSN, TSM, SlowFast, AttentionLSTM, and BMN, as well as the featured application pre-trained models VideoTag and FootballAction.
- AmazonDJL: An easy-to-use Java inference interface that supports the major operating systems (Mac/Windows/Linux) and the deployment of Paddle pre-trained models. For more information, see the official DJL documentation on Paddle support.
Forward-looking Preview
- The PaddlePaddle framework plans to drop support for Python 2 and Python 3.5 in a future version. We recommend upgrading to Python 3.8 to use PaddlePaddle.
- The PaddlePaddle framework plans to drop support for CUDA 9.0 in a future version. We recommend upgrading your CUDA version to use PaddlePaddle.
Training Framework
Compatibility instructions
- Programming paradigm: PaddlePaddle 2.0.0 enables the imperative programming paradigm (dynamic graph) by default but still supports static graphs. Static graph code (including static graph code from version 1.8) can be run after calling paddle.enable_static().
- API: PaddlePaddle 2.0.0 recommends the APIs located in the paddle root directory, while all 1.x APIs are retained under paddle.fluid so that the earlier API system remains supported. Therefore, 1.x static graph training code runs normally on 2.0.0 after calling paddle.enable_static(), and models saved by 1.x training can be used for inference with 2.0.0.
- We have compiled a mapping table from the 1.8 APIs to the 2.0 APIs.
- We provide a migration tool to help you migrate code based on earlier versions to 2.0.0. See Version Migration Tool.
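As a rough illustration of the mode switch described in the compatibility notes, the sketch below enters static graph mode with paddle.enable_static(), checks the current mode, and restores the dynamic graph default. The helper name `probe_static_mode` is made up for this example, and the import is guarded so that the snippet simply does nothing where PaddlePaddle is not installed.

```python
# Guarded import: the sketch returns None when PaddlePaddle is not installed.
try:
    import paddle
except ImportError:
    paddle = None


def probe_static_mode():
    """Enter static graph mode, record the mode, then restore dynamic mode."""
    if paddle is None:
        return None
    paddle.enable_static()      # 1.x-style static graph code runs after this call
    dynamic_inside = paddle.in_dynamic_mode()
    paddle.disable_static()     # back to the 2.0 default (dynamic graph)
    return dynamic_inside, paddle.in_dynamic_mode()


modes = probe_static_mode()
if modes is not None:
    print("dynamic while static enabled:", modes[0])
    print("dynamic after restore:", modes[1])
```

In a real migration, the paddle.enable_static() call would normally sit once at the top of the 1.x training script rather than inside a helper.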
Dynamic graph mode
Dynamic graph mode is enabled by default for model development and training, and dynamic-to-static conversion is used for model deployment and training acceleration. For details, see Dynamic graph and Dynamic-to-static graph.
API system
- Basic APIs
- API directory structure adjustment: The 1.x APIs are mainly located in the paddle.fluid directory. This version reorganizes the API directory structure so that the classification is more reasonable. For the adjusted directories, see the API documentation.
- Added 186 new APIs and fixed or improved 260 APIs. See the release notes of the 2.0.0 pre release version and the API documentation.
- Added distributed basic communication APIs to paddle.distributed: broadcast, all_reduce, reduce, all_gather, scatter, barrier; the dynamic graph multi-card training startup APIs spawn and init_parallel_env; and the unified dynamic-static startup method fleetrun.
- Network-building APIs are unified across dynamic and static graphs and run in both dynamic graph mode and static graph mode.
- High-level API
- Added the PaddlePaddle high-level API, which encapsulates common operations in model development such as network building, training, evaluation, prediction, and saving/loading, to enable low-code development. See the PaddlePaddle High-Level API Usage Guide.
- Added the distributed high-level API paddle.distributed.fleet, which supports combinations of multiple optimization strategies, automatic parallelism, distributed metrics calculation, and InMemoryDataset through the DistributedStrategy configuration.
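To make the semantics of two of the communication primitives named above concrete, here is a single-process simulation in plain Python. It does not call paddle.distributed; the functions `all_reduce_sum` and `broadcast` are stand-ins written for this sketch, with each inner list playing the role of the data held by one rank.

```python
def all_reduce_sum(per_rank_values):
    """Every rank ends up with the element-wise sum over all ranks."""
    total = [sum(column) for column in zip(*per_rank_values)]
    return [list(total) for _ in per_rank_values]


def broadcast(per_rank_values, src=0):
    """Every rank ends up with a copy of the source rank's data."""
    return [list(per_rank_values[src]) for _ in per_rank_values]


# Three simulated ranks, each holding a two-element gradient.
ranks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
reduced = all_reduce_sum(ranks)
shared = broadcast(ranks, src=1)

print(reduced)
print(shared)
```

The real APIs operate in place on tensors across processes; the simulation only shows what each rank holds afterwards.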
Function optimization (including distributed)
Dynamic graph basic functions
- Usability optimization:
- Tensor enhancements: Added the Tensor copy interface Tensor.clone() and more than 120 Tensor computation interfaces (e.g. Tensor.cos()). Added in-place modification of a Tensor through indexing or slicing. Added automatic type promotion for operations between a Tensor and a Scalar. Optimized the printed output of dynamic graph Tensors so that the display format matches Numpy.
- Layer enhancements: Added the Layer deep copy interface Layer.deepcopy(). Added the interface Layer.dir() for viewing Layer attributes and functions. From this version on, the Trace function still records backward operations automatically after Layer.eval() is called. If backward recording is not needed, call paddle.no_grad() explicitly.
- Added the set_lr() interface to Optimizer so that users can adjust the learning rate flexibly in dynamic graph mode.
- Added the set_global_initializer() interface for defining a global parameter initialization method.
- Simplified multi-card code: scale_loss and apply_collective_grads no longer need to be called explicitly.
- Performance optimization:
- APIs such as Embedding support sparse parameter gradient updates in multi-card training.
- Dynamic graph training and inference now support the Intel acceleration library oneDNN (formerly MKL-DNN). The Resnet50 model can be up to 6 times faster in the CPU training scenario.
- Added dynamic graph Inplace computation, which reuses Tensor storage and reduces video memory usage. Added the View method, which changes the Tensor description while sharing the underlying storage.
- [Incompatible upgrade] Added dynamic graph gradient accumulation, which effectively enlarges the BatchSize. The backward() interface no longer clears gradients by default; call optimizer.clear_grad() explicitly to clear them.
- Bug fixes:
- Fixed the issue where multiple models interfered with each other when switching between train and eval.
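The gradient accumulation item is easier to see with numbers. The following sketch is a plain-Python simulation rather than Paddle code: it uses a hypothetical one-parameter model y = w * x with a mean squared error loss, and shows that adding up the gradients of two micro-batches (each scaled by the number of micro-batches) gives the gradient of the full batch, which is why not clearing gradients between backward passes behaves like a larger batch size.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# Reference: one backward pass over the full batch of four samples.
full_batch_grad = grad_mse(w, xs, ys)

# Two micro-batches of two samples each; the stored gradient is added to,
# not overwritten, mimicking a backward pass that does not clear gradients.
micro_batches = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]
accumulated = 0.0
for mb_x, mb_y in micro_batches:
    accumulated += grad_mse(w, mb_x, mb_y) / len(micro_batches)

print(full_batch_grad, accumulated)

# After the optimizer step the stored gradient must be reset explicitly,
# which is the role optimizer.clear_grad() plays in the release note.
stored_after_clear = 0.0
```

The same reasoning explains the incompatibility: code written for earlier versions that never clears gradients would now keep summing them across iterations.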
Dynamic-to-static graph
Added syntax support for dynamic-to-static conversion
- Added support for the return syntax: a function can return early inside if-elif-else or loop conditions and can return tensors of different types or None.
- Added support for functions whose signature contains a **kwargs parameter.
- Added syntax support for iterating over Tensor and TensorList with for and for enumerate, which makes traversing a Tensor more flexible.
- Added support for more Python syntax, such as print, assert, cast, isinstance, tuple, and dict.pop().
Optimized the usability of dynamic-to-static conversion
- Changed the return type of dynamic-to-static conversion from a callable function to a Class. Its code, main_program, and other interfaces make it easier to obtain the converted static graph information.
- The dynamic-to-static decorator to_static can now decorate a model instance directly, for example to_static(model, input_spec).
- Added the jit.not_to_static decorator, which excludes a function from dynamic-to-static conversion.
- Added the set_verbosity() and set_code_level() interfaces, which set the level at which logs or intermediate code of the dynamic-to-static process are shown.
- Added InputSpec, which specifies the shape and data type of input Tensor variables for dynamic-to-static conversion.
- Optimized error messages: errors are located at the offending line of the original dynamic graph code, and messages unrelated to the user are hidden.
- Supports breakpoint debugging with pdb.set_trace().
Optimized deployment of model storage and loading APIs
- Added the paddle.jit.save interface for saving dynamic-to-static models. The interface can also save Layer objects that have not been transcribed by paddle.jit.to_static, as well as paddle.DataParallel models. Removed the old interface ProgramTranslator.save_inference_model.
- Added the paddle.jit.load interface for loading prediction models stored in static graph format, including models saved by paddle.jit.save and paddle.io.save_inference_model. A loaded model can be used for inference or fine-tuning under the dynamic graph.
- Added the program method to paddle.jit.TransLatedLayer for obtaining the program of a model loaded by paddle.jit.load, which helps in understanding the model structure.
- [Incompatible upgrade] The meaning of the model_path parameter of paddle.jit.save and paddle.jit.load has changed: it is now the prefix of the stored files rather than a directory.
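The difference between a directory and a file prefix can be sketched with pathlib alone. The helper `files_for_prefix` and the two suffixes used here are assumptions chosen for illustration; check the files your Paddle version actually writes.

```python
from pathlib import Path


def files_for_prefix(path_prefix, suffixes=(".pdmodel", ".pdiparams")):
    """List the files a save call would write for a given file prefix.

    The suffixes are illustrative assumptions, not taken from the release note.
    """
    prefix = Path(path_prefix)
    return [prefix.parent / (prefix.name + suffix) for suffix in suffixes]


# With prefix semantics, "inference/linear" names files inside "inference/",
# it does not name a directory called "linear".
paths = files_for_prefix("inference/linear")
for p in paths:
    print(p.as_posix())
```

Code that previously passed a directory such as "inference" would, under prefix semantics, produce files named after that last path component instead of files inside it.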
Mixed precision training
- Mixed precision policy upgrade: In addition to the black and white list policy (hereinafter "O1 policy"), added the "Almost FP16" policy (hereinafter "O2 policy"), which uses FP16 for computation wherever possible.
- Added the FP16 Guard function, which lets users control whether an individual Op in the model uses the FP16 compute type.
- Users can customize which types of Op stay in FP32 computation.
- With the O2 policy, Resnet50 and Bert base train at 1400 images/s and 590 sequences/s, respectively, on a single V100 card.
- Usability optimization:
- A single package now manages all interfaces related to static graph mixed precision training.
- A simplified name is provided so that users can customize the AMP black and white Op lists by using CustomOpLists.
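The contrast between the O1 and O2 policies can be sketched as a small decision function. This is a deliberately simplified model, not Paddle's implementation: the function `choose_dtype`, the op names in the two lists, and the rule that unlisted ops stay in FP32 under O1 are all assumptions made for illustration.

```python
def choose_dtype(op_type, level, white_list, black_list):
    """Pick the compute dtype for one op under a simplified O1/O2 policy."""
    if op_type in black_list:
        return "float32"    # black-listed ops always stay in FP32
    if level == "O2":
        return "float16"    # O2: FP16 wherever it is not forbidden
    # O1: only white-listed ops run in FP16 (simplification of this sketch)
    return "float16" if op_type in white_list else "float32"


# Hypothetical lists; a user-defined black list keeps softmax in FP32.
white_list = {"conv2d", "matmul"}
black_list = {"softmax"}
ops = ["conv2d", "relu", "softmax"]

plan_o1 = {op: choose_dtype(op, "O1", white_list, black_list) for op in ops}
plan_o2 = {op: choose_dtype(op, "O2", white_list, black_list) for op in ops}

print(plan_o1)
print(plan_o2)
```

Under these assumptions relu moves from FP32 to FP16 when switching from O1 to O2, while the black-listed softmax stays in FP32 in both cases.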
Optimization of the distributed training
- Collective communication All Reduce
- Supports hybrid parallel training of 100-billion-scale language models: pipeline parallel training based on the executor interface, the sharding-DP strategy, the GradientMerge+AMP strategy, the Recompute+Offload strategy, and the megatron strategy.
- Supports dynamic graphs: multi-stream communication, automatic rebuild group, high-performance sparse parameter communication, and multi-card gradient order consistency.
- Parameter server (PS)
- Upgraded large-scale sparse features: upgraded the large-scale sparse PS-API and abstracted base classes for the communication component, parameter table, and optimizer, so that users can extend them by subclassing. Also supports streaming training with 100 billion features, including feature admission, feature eviction, incremental training, and distributed metrics prediction. The communication layer switched from GRPC to BRPC.
- Open-sourced the heterogeneous parameter server: supports the traditional pure-CPU PS, a pure-GPU PS based on three-level storage (SSD/memory/video memory), and a mixed PS of CPU machines with GPU or Kunlun machines. It can train click-through rate prediction models with trillions of parameters in minutes.
- New training mechanism support:
- Supports control-flow-based multi-task distributed training, which is more than 50% faster than Intag-based multi-task training.
- Distributed startup optimization
- Supports launching distributed low-level APIs such as all_gather.
- Upgraded the launch interface: it supports specifying the number of processes on a single node and can be simplified to fleetrun.
- Removed the grpc dependency and added some fault tolerance, which improves the stability of distributed task startup.
- Supports starting multi-CPU collective communication with Gloo.
Model saving and loading
- Standardized the set_dict method name of APIs such as Layer and Optimizer: it is now set_state_dict everywhere.
- Enhanced paddle.load compatibility: a Layer's state_dict can be loaded from the results saved by interfaces such as fluid.io.save_inference_model and fluid.io.save_params/persistables.
- Changed the behavior of paddle.save/load to standardize the interface semantics: paddle.save no longer adds a suffix to the saved result, and paddle.load returns exactly one result per call.
- Removed paddle.SaveLoadConfig. For compatible-loading scenarios of paddle.jit.save, paddle.jit.load, and paddle.load, pass extra configuration through **kwargs, which simplifies the interfaces.
- Moved the former static graph APIs paddle.io.save, paddle.io.load, paddle.io.save_inference_model, and paddle.io.load_inference_model to the paddle.static module.
- Improved the paddle.static.load_program_state interface: when var_list is not specified, stray files in the load directory now produce a warning instead of an error.
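A minimal sketch of the renamed method, assuming PaddlePaddle 2.x is installed: weights are copied from one Linear layer to another through state_dict() and set_state_dict(). The helper name `copy_weights` and the layer sizes are arbitrary choices for this example, and the import is guarded so that the snippet is a no-op without Paddle.

```python
# Guarded import: the sketch returns None when PaddlePaddle is not installed.
try:
    import paddle
except ImportError:
    paddle = None


def copy_weights():
    """Copy parameters between two layers with the 2.0 method name."""
    if paddle is None:
        return None
    source = paddle.nn.Linear(3, 2)
    target = paddle.nn.Linear(3, 2)
    # set_state_dict replaces the former set_dict name
    target.set_state_dict(source.state_dict())
    return bool((source.weight.numpy() == target.weight.numpy()).all())


weights_match = copy_weights()
if weights_match is not None:
    print("weights identical after set_state_dict:", weights_match)
```

The same pattern applies to optimizers, whose state is restored with the same method name.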
Complex number computation
- Extended the dynamic and static graph execution engines to support complex-valued neural network training and complex gradient accumulation.
- Added complex number support to Ops such as mul, div, matmul, kron, and abs.
ONNX function upgrade
- Added an API that converts Paddle 2.0 dynamic graph models to the ONNX protocol.
- Added conversion of models such as PPOCR, PPYOLO, FasterRCNN, and ERNIE.
- Broader Paddle op coverage: 88 Paddle OP operators are supported, and export to ONNX operator sets of versions 1 to 12 is supported.
Performance optimization (including the distributed)
Dynamic graph performance optimization:
- Data reading performance: Simplified the underlying DataLoader implementation in dynamic graph mode and reduced the overhead of reader threads, which further improves data reading efficiency and overall training speed. The overall training speed of MobileNetV1 on a single V100 card with BatchSize = 128 improves by 34%.
- Upgraded and optimized dynamic graph network-building APIs: a large number of dynamic graph APIs now call automatically generated Pybind interfaces directly, which improves performance significantly.
- Improved the training performance of Resnet50 with oneDNN in dynamic graph mode. In the CPU scenario, Resnet50 oneDNN dynamic graph training is currently 6.4 times faster.
OP performance optimization:
- argsort: Optimized the case where the number of elements in the input Tensor equals the length of the dimension being sorted. The forward pass is 34 times faster and the backward pass 10 times faster.
- dropout: Optimized GPU performance. FP32 performance improves by about 20% and FP16 performance by about 50%.
- cast: Optimized GPU performance, with a 10% to 20% improvement.
- softmax: Optimized GPU performance for axis=-1, with a 3x to 96x improvement depending on the shape.
- Other OPs: cumsum, reshape, Flatten, IndexSelect, Roll, elementwise_add, AdamW, and the RNN OPs (LSTM, GRU, SimpleRNN) all show clear performance improvements.
Optimization strategy:
- Added the fused_bn_add_act fusion strategy, which automatically fuses and accelerates the batch_norm+elementwise_add+activation pattern.
- Added the inplace addto strategy for gradient aggregation, which supports in-place gradient accumulation and improves ResNet-50 mixed precision training performance by 6.3%.
- Optimized FastThreadedSSAGraphExecutor scheduling: fixed the issue where communication and computation did not overlap in the communication synchronization scenario. Resnet50 performance on 4 machines with 32 cards improves by about 0.3%.
Distributed performance optimization:
- Optimized the lars strategy: the time2train metric of ResNet50 distributed multi-card training with a 16k batch size is under 10 minutes.
- Optimized the distributed performance of paddle.fleet amp: fixed the case where the last communication did not overlap with computation. FP16 performance on 4 machines with 32 cards improves by about 0.5%.
- Optimized the distributed performance of paddle.fleet.gradient_merge: gradients are aggregated before communication, which improves multi-machine performance by 20%-40% and reaches linear speedup.
- Optimized the performance of the parameter server communication component Communicator. With GEO communicating once every 400 batches, W2V model throughput and Simnet-Bow model performance both improve significantly. In Async mode, compared with PaddlePaddle framework 1.8, W2V model throughput improves by 11% and CTR-DNN model performance improves by 14%.
Debugging analysis
- Replaced the 100 or so places in the framework that raised exceptions through LOG(FATAL) with PADDLE_THROW, which improves the format and content of errors caused by behavior the framework does not support.
- Improved the Signal Handler implementation in the framework and optimized the format and content of errors reported when a system signal error occurs during execution.
- Optimized the framework error stack format: the compile-time Python error stack is moved below the native error stack, which makes error messages easier to read.
- Further improved the error types and messages of about 1500 check errors in the framework in total, which improves overall debugging usability.
- Enhanced dynamic graph error messages: error messages from the Pybind layer under the dynamic graph are systematically improved.
- Optimized the exception types raised on the Paddle Python side so that they align with native Python error types.
- The C++ error stack is hidden by default. Optimized the error format shown after hiding the C++ stack and removed the separator Error Message Summary so that the format aligns with native Python errors.
- Optimized the error messages shown when some APIs of the static module are used outside static graph mode. This covers 9 APIs: static.append_backward, static.gradients, static.scope_guard, static.Print, static.nn.embedding, static.nn.data_norm, static.nn.multi_box_head, static.nn.nce, and static.nn.py_func.
- Optimized the error message shown when a Tensor passed in under the dynamic graph is None.
- Optimized the printed output of Layer, which now shows the hierarchy of the sub-layers inside a Layer.
Inference Deployment
Model quantization
- Enhanced quantization for dynamic graph training: added the quantization function of dynamic graphs for the
class in a unified manner. Currently, it supports quantization of weighted layers such as Conv2D and Linear. It also supports obtaining the per-channel quantization parameters of weighted layers, quantizing weightless layers such as ReLU and Tanh, and skipping the quantization of specified Layers.
- Added the function to obtain the output scale parameter of model layers during dynamic graph quantization training, for deploying quantized inference on the Server side.
- Dynamic graph quantized models support inference deployment using Paddle-Lite.
- For offline quantization, added support for fusing conv+bn in advance and for producing quantized LSTM models. Removed the function of saving sampled data to temporary files.
- For static graph quantization, added support for Conv2d_transpose quantization and for per-channel Linear quantization.
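To make the per-channel terminology above concrete, here is a minimal sketch of symmetric abs-max INT8 quantization with one scale per output channel. It is a generic illustration in plain Python, not Paddle's implementation; the function names and the sample weights are hypothetical.

```python
def quantize_per_channel(weights, num_bits=8):
    """Quantize each output channel with its own abs-max scale.

    weights: list of channels, each a flat list of floats.
    Returns (quantized integers, per-channel scales).
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    quantized, scales = [], []
    for channel in weights:
        abs_max = max(abs(w) for w in channel)
        # Guard against an all-zero channel.
        scale = abs_max / qmax if abs_max > 0 else 1.0
        scales.append(scale)
        quantized.append(
            [max(-qmax, min(qmax, round(w / scale))) for w in channel]
        )
    return quantized, scales


def dequantize_per_channel(quantized, scales):
    """Map the integers back to floats using the per-channel scales."""
    return [[q * s for q in ch] for ch, s in zip(quantized, scales)]


# Two channels with very different magnitudes.
weights = [[1.0, -0.3, 0.1], [0.03, -0.04, 0.01]]
q, scales = quantize_per_channel(weights)
restored = dequantize_per_channel(q, scales)
```

With a single per-tensor scale, the small second channel would be squeezed into a handful of integer levels; a separate scale per channel lets both channels use the full INT8 range.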
Paddle Inference
The default name of the inference library is changed from fluid_inference to paddle_inference.
- The inference C++ API is fully upgraded, and the new APIs are recommended. The old APIs are kept for now and report a warning when used; they are planned to be removed in the future. The new APIs standardize naming and simplify usage, including:
- A new
namespace for the C++ interface, containing inference-related interfaces.
- Renamed
as the default input/output representation of the inference interface.
- Simplified
, keeping support for only AnalysisConfig. Other Configs are not supported.
- Added service-related utility classes such as
, which can be used when multiple predictors are created.
Function upgrade
Operator-related version information
Some operators are newly added or upgraded in Paddle V2.0. Starting from this version, forward operator versions are defined with compatibility constraints. Aligning operator versions between frameworks ensures a consistent definition and behavior of the same operator, which enhances the overall robustness of the framework.
Added a registration mechanism for inference forward operator versions, which also records incompatible operator upgrades.
Added operator version information to inference models. Through the model file, the inference library can identify the operator definitions that the model corresponds to, avoiding calculation errors caused by differing definitions.
Model interface
- The
APIs are migrated to paddle.static to improve usability, while remaining compatible with the old interfaces.
- Added six APIs such as
, and load_from_file
for users to serialize/deserialize programs, serialize/deserialize parameters, and save models/parameters to files or load them from files.
NV GPU-related inference
- Added support for TRT 7.1.
- Added support for Jetson NX hardware.
- Paddle-TensorRT enhances the support for PaddleSlim quantized models, covering multiple CV tasks such as detection, classification, and segmentation.
- Paddle-TRT supports the clip op, and supports running the classification model GhostNet on Paddle-TRT.
- Paddle-TRT supports mul op models with channel-wise quantization, and supports running the PaddleOCR detection and recognition quantized models in Paddle-TRT int8.
- The Paddle-TRT dynamic shape function supports PaddleSlim quantized Int8 models.
X86 CPU-related inference
- Added support for oneDNN BF16: supports BF16 computation for conv2d and gru. BF16 inference is currently supported for the resnet50, googlenet, mobilenetv1 and mobilenetv2 models.
- Added support for quantization and dequantization of scales with bias in the oneDNN INT8 quantization strategy.
- Added version compatibility support for some oneDNN operators.
- Added the kernel support for
INT8 oneDNN on the CPU side.
- Improved the usability of testing quantized models on the CPU side. The original model and the quantized model can now be tested and compared at the same time.
Custom OP
- Added the support for user-defined Ops on Python-side inference.
Memory/GPU memory related
Added the TryShrinkMemory interface, which reduces the application's memory/GPU memory usage by releasing temporary tensors. For a demo, see Paddle-Inference-Demo.
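As a rough sketch of where such a call might sit, the helper below serves a burst of requests and then asks the predictor to release its temporary tensors. It assumes the Python predictor from paddle.inference exposes this interface as try_shrink_memory(); the helper name, the run_one callback, and the file names in the comments are hypothetical.

```python
def run_requests_then_shrink(predictor, requests, run_one):
    """Serve a burst of requests, then release temporary tensors.

    predictor: object exposing try_shrink_memory(); assumed here to be
               a paddle.inference predictor.
    run_one:   hypothetical callable that feeds one request to the
               predictor and returns its output.
    """
    outputs = [run_one(predictor, request) for request in requests]
    # Release temporary tensors once the burst has been served.
    predictor.try_shrink_memory()
    return outputs


# Usage with a real predictor (not executed here; file names are placeholders):
# from paddle.inference import Config, create_predictor
# predictor = create_predictor(Config("model.pdmodel", "model.pdiparams"))
# outputs = run_requests_then_shrink(predictor, batches, my_run_one)
```

Shrinking after an unusually large batch, rather than after every call, avoids repeatedly freeing and reallocating the same buffers.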
Dynamic graph quantized model support
- X86 inference supports dynamic graph quantized models.
- NVIDIA GPU inference supports dynamic graph quantized models.
Error message:
- When compiling with ON_INFER enabled, FLAGS_call_stack_level is turned on by default, and error messages include the call stack.
Performance optimization
- Improved the transformation and optimization of quantized models.
- NV GPU related
- Optimized the CUDA ArgMin and ArgMax OPs so that the binary size of the OPs is decreased from 60 M to 1.3 M.
- For the ERNIE model on T4 with Paddle-TRT FP16 inference, performance is improved by 15%.
- The ERNIE model adds support for variable-length inputs when TensorRT is enabled, improving performance by 147%. With cuda10.1, cudnn 7.6, tensorrt 6.0, OSS 7.2.1, the ernie-base-2.0 model, and the QNLI dataset, performance on an Nvidia Tesla T4 improves from 905 sentences/s to 2237 sentences/s at input BatchSize = 32. Sample code: Paddle-Inference-Demo/c++.
- X86 CPU related
- Added the conv + affine_op pass. The MASK-RCNN fp32 single-threaded performance is improved by 26% on a 6248 machine.
- Added the fc + gru pass and enabled the oneDNN (formerly MKL-DNN) GRU fp32 kernel, speeding up GRU fp32 model inference on 4 CPU threads by 20% on an Intel Xeon 6248 machine.
- By supporting oneDNN INT8 GRU, the GRU INT8 model is about 1.65 times faster than NativeConfig inference (threads = 1, batch_size = 50).
- Added fuse support for oneDNN batchnorm + activation. The pvanet_ocr model performance is improved by 2.8% as a result.
- Added the oneDNN FC + Gelu, FC + Sigmoid and FC + tanh operator fusions. BERT model inference performance is improved by 4.5%.
- Added oneDNN inplace support for some Ops.
- Optimized the oneDNN LRN op (1% speedup for the GoogleNet fp32 model).
- With oneDNN upgraded to 1.6, Ernie Large oneDNN inference on Skylake (Intel Core 6148) is about 2.7x faster (measured with the unit test test_analyzer_ernie_large).
- Added oneDNN support for the interpolate forward operator. ocr_det model inference is now 2.04x faster than CPU Native inference alone.
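The figures in this list mix two conventions: "improved by N%" and "Nx faster". The short sketch below converts between them, using the ERNIE variable-length throughput numbers quoted in the NV GPU items (905 and 2237 sentences/s) as a worked example. The function names are hypothetical.

```python
def speedup(before, after):
    """Ratio of new throughput to old throughput (the 'Nx' figure)."""
    return after / before


def improvement_percent(before, after):
    """Relative throughput gain in percent (the 'improved by N%' figure)."""
    return (after - before) / before * 100.0


# ERNIE on T4 with variable-length inputs: 905 -> 2237 sentences/s.
ratio = speedup(905, 2237)
gain = improvement_percent(905, 2237)
print(f"{ratio:.2f}x, +{gain:.0f}%")
```

The two conventions differ by exactly one unit of ratio: a 2.04x speedup corresponds to a 104% improvement, not 204%.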
Paddle Lite
The on-device inference engine Paddle Lite v2.8 is adapted to the main framework v2.0.
Environment Adaptation
Compile and install
Training Framework Paddle
- Released the installation package that supports using Kunlun chips with x86 CPUs and with FT CPUs.
- Added the support for python3.8 in the installation package.
- Added the installation packages for cuda10.1 and cuda10.2.
- (experimental) Released the installation package for cuda11.
- Upgraded the NCCL version to V2.7.8 in the Paddle images for cuda10.1 and later and in the CI system image.
- Upgraded oneDNN (former MKL-DNN) from V1.3 to V1.5.
- Added the pre-installed openssl-dev dependencies to the image.
- Removed installed dependencies: nltk, opencv, scipy, rarfile, prettytable, pathlib, matplotlib, graphviz, objgraph.
- Paddle's avx and no_avx packages are released separately, which reduces the whl package size by 40%. The avx version is installed by default. Optimized the installation error messages: the system checks the user's CPU type and Paddle version and automatically reports the corresponding installation error.
- Improved the pypi installation experience for the Paddle develop version and shortened the installation steps. You can install it with pip --pre.
Paddle inference engine
- The inference library supports cuda10.2-cudnn8-trt7.1 version.
- Released the installation package supporting jetpack and the C++ inference library supporting nv_jetson.
- Newly released two wheel packages jointly compiled with tensorrt: cuda10.0-cudnn7.6-trt6.0.1.5-python36 and cuda10.0-cudnn7.6-trt6.0.1.5-python36.
- Fixed the joint compilation strategy: the GPU package containing tensorrt is now released separately, to avoid the missing-tensorrt error when users install packages for other GPU versions.
- Fixed a duplication bug in the inference library packages.
Support of new hardware training
- Kunlun chip: supports single-card training and static graph multi-card training. Released 10+ models.
- Ascend 910 chip: supports single-card training.
Known Issues
- Due to limitations of cuDNN 8.0.x itself, performance degrades on many models when the inference library is compiled with cuDNN 8.0.x and TensorRT acceleration is not used. This is expected to be fixed in subsequent cuDNN versions. As a workaround, you can use TensorRT acceleration or cuDNN 7.6.
- Due to limitations of cuDNN 8.0.x itself, memory leaks occur in some models when cuDNN 8.0.x is used for inference. So far, the problem has been observed when cuDNN's convolutionBiasActivationForward is used. You can try to disable conv_elementwise_add_act_fuse_pass and conv_elementwise_add2_act_fuse_pass through the inference config with config.pass_builder()->DeletePass(). If the leak persists, you can try cuDNN 7.6 and send us the affected model in an issue for analysis.
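For Python users, the workaround for the memory leak might look like the sketch below. It assumes that paddle.inference.Config exposes delete_pass(name) as the Python counterpart of pass_builder()->DeletePass(), and that the two passes to disable are conv_elementwise_add_act_fuse_pass and conv_elementwise_add2_act_fuse_pass (the second name is an assumption). The helper name and the file names in the comments are hypothetical.

```python
# Passes assumed to route through cuDNN's convolutionBiasActivationForward.
CUDNN_FUSED_CONV_PASSES = (
    "conv_elementwise_add_act_fuse_pass",
    "conv_elementwise_add2_act_fuse_pass",  # assumed name of the second pass
)


def disable_cudnn_fused_conv_passes(config):
    """Remove the fused conv passes from an inference config.

    config: object exposing delete_pass(name); assumed here to be a
            paddle.inference.Config.
    """
    for name in CUDNN_FUSED_CONV_PASSES:
        config.delete_pass(name)
    return config


# Usage with a real config (not executed here; file names are placeholders):
# from paddle.inference import Config, create_predictor
# config = Config("model.pdmodel", "model.pdiparams")
# predictor = create_predictor(disable_cudnn_fused_conv_passes(config))
```

Disabling these fusions may cost some speed, so it is worth confirming that the leak actually disappears before keeping the change.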