v0.0.6
Release date: 2023-08-25 21:30:09
Highlights
- Support Qwen-7B with dynamic NTK scaling and logN scaling in turbomind
- Support tensor parallelism for W4A16
- Add OpenAI-like RESTful API (see the client sketch after this list)
- Support Llama-2 70B 4-bit quantization
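The new RESTful API follows the OpenAI convention, so a generic HTTP client can talk to it. Below is a minimal Python sketch; the host, port, endpoint path, and model name are illustrative assumptions, not lmdeploy's confirmed defaults.

```python
# Minimal sketch of calling the OpenAI-like RESTful API added in this
# release. Host, port, route, and model name are assumptions based on
# the OpenAI convention; check the server docs for the exact values.
import requests

API_BASE = "http://localhost:23333"  # assumed host/port

payload = {
    "model": "internlm-chat-7b",  # assumed model name
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
}

resp = requests.post(f"{API_BASE}/v1/chat/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```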
What's Changed
🚀 Features
- Profiling tool for huggingface and deepspeed models by @wangruohui in https://github.com/InternLM/lmdeploy/pull/161
- Support the Windows platform by @irexyc in https://github.com/InternLM/lmdeploy/pull/209
- Support dynamic NTK scaling and logN scaling for Qwen-7B in turbomind by @lzhangzz in https://github.com/InternLM/lmdeploy/pull/230 (see the sketch after this list)
- Add RESTful API by @AllentDan in https://github.com/InternLM/lmdeploy/pull/223
- Support context decoding with DP in pytorch by @wangruohui in https://github.com/InternLM/lmdeploy/pull/193
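For background on the Qwen-7B feature above: dynamic NTK scaling enlarges the RoPE base once the context outgrows the training length, and logN scaling damps query magnitudes at long positions. A minimal sketch of the commonly used formulas follows; the actual turbomind kernels may differ in detail.

```python
# Conceptual sketch of the two long-context tricks this release enables
# for Qwen-7B in turbomind. Formulas follow the common dynamic-NTK and
# logN definitions; not a description of turbomind's internals.
import math

def dynamic_ntk_base(base: float, dim: int, seq_len: int,
                     train_len: int, scale: float = 1.0) -> float:
    """Grow the RoPE base once the context exceeds the training length."""
    if seq_len <= train_len:
        return base
    alpha = scale * seq_len / train_len - (scale - 1)
    return base * alpha ** (dim / (dim - 2))

def logn_scale(pos: int, train_len: int) -> float:
    """Scale factor applied to the query at position `pos` (>= 1.0)."""
    return max(1.0, math.log(pos) / math.log(train_len))

# Example: extrapolating a model trained on 2k tokens to an 8k context.
print(dynamic_ntk_base(10000.0, 128, 8192, 2048))  # enlarged RoPE base
print(logn_scale(8192, 2048))                      # ~1.18
```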
💥 Improvements
- Support TP for W4A16 by @lzhangzz in https://github.com/InternLM/lmdeploy/pull/262 (see the sketch after this list)
- Pass chat template args including meta_prompt to the model (https://github.com/InternLM/lmdeploy/commit/7785142d7c13a21bc01c2e7c0bc10b82964371f1) by @AllentDan in https://github.com/InternLM/lmdeploy/pull/225
- Enable the Gradio server to call inference services through the RESTful API by @AllentDan in https://github.com/InternLM/lmdeploy/pull/287
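TP for W4A16 (#262) means each rank must dequantize its own shard independently, so the packed 4-bit weights and their per-group scales/zeros have to be partitioned along the same axis. The sketch below illustrates a column-parallel split under assumed shapes and group size; it is not turbomind's actual layout.

```python
# Conceptual sketch: sharding a W4A16 linear layer across TP ranks.
# Shapes and group size are illustrative assumptions.
import numpy as np

def shard_w4a16(qweight: np.ndarray, scales: np.ndarray,
                zeros: np.ndarray, tp: int, rank: int):
    """Column-parallel split: slice the output dimension for one TP rank."""
    out_features = qweight.shape[1]
    assert out_features % tp == 0, "output dim must be divisible by TP"
    step = out_features // tp
    cols = slice(rank * step, (rank + 1) * step)
    return qweight[:, cols], scales[:, cols], zeros[:, cols]

# Toy example: in=256, out=512, group_size=128 -> 2 scale rows.
qweight = np.zeros((256 // 8, 512), dtype=np.int32)   # 8 int4 values per int32
scales = np.ones((256 // 128, 512), dtype=np.float16)
zeros = np.zeros((256 // 128, 512), dtype=np.float16)
w, s, z = shard_w4a16(qweight, scales, zeros, tp=2, rank=0)
print(w.shape, s.shape, z.shape)  # (32, 256) (2, 256) (2, 256)
```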
🐞 Bug fixes
- Adjust dependency of gradio server by @AllentDan in https://github.com/InternLM/lmdeploy/pull/236
- Implement `movmatrix` using warp shuffling for CUDA < 11.8 by @lzhangzz in https://github.com/InternLM/lmdeploy/pull/267
- Add 'accelerate' to the requirement list by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/261
- Fix building with CUDA 11.3 by @lzhangzz in https://github.com/InternLM/lmdeploy/pull/280
- Pad tok_embedding and output weights to make their shapes divisible by TP by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/285 (see the sketch after this list)
- Fix llama2 70b & qwen quantization error by @pppppM in https://github.com/InternLM/lmdeploy/pull/273
- Import turbomind in gradio server only when it is needed by @AllentDan in https://github.com/InternLM/lmdeploy/pull/303
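On the padding fix (#285): when the vocabulary size is not divisible by the TP degree, the embedding and output weights cannot be split into equal shards. A minimal sketch of the idea follows, assuming a simple row-major embedding weight; names are illustrative, not lmdeploy's internals.

```python
# Minimal sketch of padding a token-embedding weight so its vocab
# dimension is divisible by the TP degree. Illustrative only.
import numpy as np

def pad_vocab_for_tp(weight: np.ndarray, tp: int) -> np.ndarray:
    """Pad dim 0 (vocab) with zero rows up to the next multiple of `tp`."""
    vocab, hidden = weight.shape
    pad = (-vocab) % tp
    if pad == 0:
        return weight
    return np.concatenate([weight, np.zeros((pad, hidden), weight.dtype)])

emb = np.random.randn(32001, 128).astype(np.float32)  # odd vocab size
padded = pad_vocab_for_tp(emb, tp=2)
print(padded.shape)  # (32002, 128)
```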
📚 Documentations
- Remove specified version in user guide by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/241
- docs(quantization): update description by @tpoisonooo in https://github.com/InternLM/lmdeploy/pull/253 and https://github.com/InternLM/lmdeploy/pull/272
- Check-in FAQ by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/256
- Add readthedocs by @RunningLeon in https://github.com/InternLM/lmdeploy/pull/208
🌐 Other
- Update workflow for building docker image by @RunningLeon in https://github.com/InternLM/lmdeploy/pull/282
- Change to github-hosted runner for building docker image by @RunningLeon in https://github.com/InternLM/lmdeploy/pull/291
Known issues
- Inference with the 4-bit Qwen-7B model fails. #307 is addressing this issue.
Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.0.5...v0.0.6