v0.1.4
Release date: 2022-04-28 15:56:41
Latest release of hpcaitech/ColossalAI: v0.4.4 (2024-09-19 10:53:35)
Main Features
Here are the main improvements of this release:
- ColoTensor: a data structure that unifies the tensor representation of different parallel methods.
- Gemini: a more efficient Gemini implementation that reduces the overhead of collecting model data statistics.
- CLI: a command-line tool that helps users launch distributed training tasks more easily.
- Pipeline Parallelism (PP): a more user-friendly API for PP.
What's Changed
ColoTensor
- [tensor]fix colo_tensor torch_function by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/825
- [tensor]fix test_linear by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/826
- [tensor] ZeRO use ColoTensor as the base class. by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/828
- [tensor] revert zero tensors back by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/829
- [Tensor] overriding parameters() for Module using ColoTensor by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/889
- [tensor] refine linear and add gather for layernorm by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/893
- [Tensor] test parameters() as member function by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/896
- [Tensor] activation is an attr of ColoTensor by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/897
- [Tensor] initialize the ColoOptimizer by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/898
- [tensor] reorganize files by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/820
- [Tensor] apply ColoTensor on Torch functions by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/821
- [Tensor] update ColoTensor torch_function by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/822
- [tensor] lazy init by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/823
- [WIP] Applying ColoTensor on TP-1D-row Linear. by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/831
- Init Context supports lazy allocate model memory by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/842
- [Tensor] TP Linear 1D row by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/843
- [Tensor] add assert for colo_tensor 1Drow by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/846
- [Tensor] init a simple network training with ColoTensor by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/849
- [Tensor ] Add 1Drow weight reshard by spec by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/854
- [Tensor] add layer norm Op by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/852
- [tensor] an initial idea of tensor spec by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/865
- [Tensor] colo init context add device attr. by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/866
- [tensor] add cross_entropy_loss by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/868
- [Tensor] Add function to spec and update linear 1Drow and unit tests by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/869
- [tensor] customized op returns ColoTensor by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/875
- [Tensor] get named parameters for model using ColoTensors by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/874
- [Tensor] Add some attributes to ColoTensor by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/877
- [Tensor] make a simple net works with 1D row TP by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/879
- [tensor] wrap function in the torch_tensor to ColoTensor by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/881
- [Tensor] make ColoTensor more robust for getattr by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/886
- [Tensor] test model check results for a simple net by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/887
- [tensor] add ColoTensor 1Dcol by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/888
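Several entries above mention 1D row and 1D column tensor parallelism for Linear. As a rough illustration of the idea only, the sketch below uses plain Python lists in place of tensors and a loop in place of real ranks. The function names are hypothetical and are not ColossalAI APIs, and the convention assumed here is y = x @ w with w stored as (in_features rows, out_features columns).

```python
# Illustrative sketch only: lists stand in for tensors, a loop stands in
# for distributed ranks. None of these names come from ColossalAI.

def matvec(x, w):
    """Compute y = x @ w for a vector x (length k) and a k-by-n matrix w."""
    n = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(n)]


def row_parallel_matvec(x, w, world_size):
    """1D row split: each rank holds a slice of w's rows and the matching slice of x."""
    k = len(x)
    assert k % world_size == 0, "input dimension must divide evenly"
    chunk = k // world_size
    partials = []
    for rank in range(world_size):
        lo, hi = rank * chunk, (rank + 1) * chunk
        partials.append(matvec(x[lo:hi], w[lo:hi]))
    # The element-wise sum plays the role of an all-reduce across ranks.
    return [sum(vals) for vals in zip(*partials)]


def col_parallel_matvec(x, w, world_size):
    """1D column split: each rank holds a slice of w's columns and the full x."""
    n = len(w[0])
    assert n % world_size == 0, "output dimension must divide evenly"
    chunk = n // world_size
    out = []
    for rank in range(world_size):
        lo, hi = rank * chunk, (rank + 1) * chunk
        shard = [row[lo:hi] for row in w]
        # Concatenation plays the role of an all-gather across ranks.
        out.extend(matvec(x, shard))
    return out


x = [1, 2, 3, 4]
w = [[1, 2], [3, 4], [5, 6], [7, 8]]
full = matvec(x, w)
row_result = row_parallel_matvec(x, w, world_size=2)
col_result = col_parallel_matvec(x, w, world_size=2)
```

Both sharded variants reproduce the unsharded result. The row split needs a sum over partial outputs, while the column split only needs the output slices concatenated.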
Gemini + ZeRO
- [zero] add zero tensor shard strategy by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/793
- Revert "[zero] add zero tensor shard strategy" by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/806
- [gemini] a new tensor structure by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/818
- [gemini] APIs to set cpu memory capacity by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/809
- [DO NOT MERGE] [zero] init fp16 params directly in ZeroInitContext by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/808
- [gemini] collect cpu-gpu moving volume in each iteration by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/813
- [gemini] add GeminiMemoryManager by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/832
- [zero] use GeminiMemoryManager when sampling model data by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/850
- [gemini] polish code by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/855
- [gemini] add stateful tensor container by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/867
- [gemini] polish stateful_tensor_mgr by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/876
- [gemini] accelerate adjust_layout() by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/878
CLI
- [cli] added distributed launcher command by @YuliangLiu0306 in https://github.com/hpcaitech/ColossalAI/pull/791
- [cli] added micro benchmarking for tp by @YuliangLiu0306 in https://github.com/hpcaitech/ColossalAI/pull/789
- [cli] add missing requirement by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/805
- [cli] fixed a bug in user args and refactored the module structure by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/807
- [cli] fixed single-node process launching by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/812
- [cli] added check installation cli by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/815
- [CLI] refactored the launch CLI and fixed bugs in multi-node launching by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/844
- [cli] refactored micro-benchmarking cli and added more metrics by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/858
Pipeline Parallelism
- [pipelinable]use pipelinable context to initialize non-pipeline model by @YuliangLiu0306 in https://github.com/hpcaitech/ColossalAI/pull/816
- [pipelinable]use ColoTensor to replace dummy tensor. by @YuliangLiu0306 in https://github.com/hpcaitech/ColossalAI/pull/853
Misc
- [hotfix] fix auto tensor placement policy by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/775
- [hotfix] change the check assert in split batch 2d by @Wesley-Jzy in https://github.com/hpcaitech/ColossalAI/pull/772
- [hotfix] fix bugs in zero by @1SAA in https://github.com/hpcaitech/ColossalAI/pull/781
- [hotfix] fix grad offload when enabling reuse_fp16_shard by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/784
- [refactor] moving memtracer to gemini by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/801
- [log] display tflops if available by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/802
- [refactor] moving grad acc logic to engine by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/804
- [log] local throughput metrics by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/811
- [Bot] Synchronize Submodule References by @github-actions in https://github.com/hpcaitech/ColossalAI/pull/810
- [Bot] Synchronize Submodule References by @github-actions in https://github.com/hpcaitech/ColossalAI/pull/819
- [refactor] moving InsertPostInitMethodToModuleSubClasses to utils. by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/824
- [setup] allow installation with python 3.6 by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/834
- Revert "[WIP] Applying ColoTensor on TP-1D-row Linear." by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/835
- [dependency] removed torchvision by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/833
- [Bot] Synchronize Submodule References by @github-actions in https://github.com/hpcaitech/ColossalAI/pull/827
- [unittest] refactored unit tests for change in dependency by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/838
- [setup] use env var instead of option for cuda ext by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/839
- [hotfix] ColoTensor pin_memory by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/840
- modified the pp build for ckpt adaptation by @Gy-Lu in https://github.com/hpcaitech/ColossalAI/pull/803
- [hotfix] the bug of numel() in ColoTensor by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/845
- [hotfix] fix _post_init_method of zero init ctx by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/847
- [hotfix] add deconstructor for stateful tensor by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/848
- [utils] refactor profiler by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/837
- [ci] cache cuda extension by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/860
- hotfix tensor unittest bugs by @feifeibear in https://github.com/hpcaitech/ColossalAI/pull/862
- [usability] added assertion message in registry by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/864
- [doc] improved docstring in the communication module by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/863
- [doc] improved docstring in the logging module by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/861
- [doc] improved docstring in the amp module by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/857
- [usability] improved error messages in the context module by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/856
- [doc] improved error messages in initialize by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/872
- [doc] improved assertion messages in trainer by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/873
- [doc] improved docstring and assertion messages for the engine module by @FrankLeeeee in https://github.com/hpcaitech/ColossalAI/pull/871
- [hotfix] fix import error by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/880
- [setup] add local version label by @ver217 in https://github.com/hpcaitech/ColossalAI/pull/890
- [model_zoo] change qkv processing by @Gy-Lu in https://github.com/hpcaitech/ColossalAI/pull/870
Full Changelog: https://github.com/hpcaitech/ColossalAI/compare/v0.1.3...v0.1.4