v0.4.0
Release date: 2024-04-23 19:18:37
Highlights
Support for Llama3 and additional Vision-Language Models (VLMs):
- We now support Llama3 and an extended range of Vision-Language Models (VLMs), including InternVL versions 1.1 and 1.2, Mini-Gemini, and InternLM-XComposer2.
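The newly supported VLMs go through the same high-level `pipeline` API as LLMs. A minimal sketch follows; the hub id is an assumption for illustration, and any supported VLM checkpoint should work the same way:

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

# Model id chosen for illustration; substitute any supported VLM checkpoint.
pipe = pipeline('OpenGVLab/InternVL-Chat-V1-2')

# Single-image chat: pass a (prompt, image) tuple.
image = load_image('tiger.jpeg')  # local path or URL
response = pipe(('describe this image', image))
print(response.text)
```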
Introduce online int4/int8 KV quantization and inference
- Data-free online quantization: no calibration dataset is required
- Supports all NVIDIA GPUs of Volta architecture (sm70) and above
- KV int8 quantization is nearly lossless in accuracy, and KV int4 accuracy stays within an acceptable range
- Efficient inference: with int8/int4 KV quantization applied to llama2-7b, RPS improves by approximately 30% and 40% respectively compared to fp16
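KV quantization is switched on through the `quant_policy` field of the engine config. A minimal sketch (model id chosen for illustration):

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# quant_policy=8 enables online KV int8 quantization; use 4 for KV int4,
# or 0 (the default) to keep the KV cache in fp16. Being data-free, it
# needs no calibration set.
engine_config = TurbomindEngineConfig(quant_policy=8)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=engine_config)
print(pipe(['Hello, please introduce yourself']))
```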
The following table shows evaluation results for three LLMs under different KV-cache numerical precisions:
| - | - | - | llama2-7b-chat | - | - | internlm2-chat-7b | - | - | qwen1.5-7b-chat | - | - |
|---|---|---|---|---|---|---|---|---|---|---|---|
| dataset | version | metric | kv fp16 | kv int8 | kv int4 | kv fp16 | kv int8 | kv int4 | kv fp16 | kv int8 | kv int4 |
| ceval | - | naive_average | 28.42 | 27.96 | 27.58 | 60.45 | 60.88 | 60.28 | 70.56 | 70.49 | 68.62 |
| mmlu | - | naive_average | 35.64 | 35.58 | 34.79 | 63.91 | 64 | 62.36 | 61.48 | 61.56 | 60.65 |
| triviaqa | 2121ce | score | 56.09 | 56.13 | 53.71 | 58.73 | 58.7 | 58.18 | 44.62 | 44.77 | 44.04 |
| gsm8k | 1d7fe4 | accuracy | 28.2 | 28.05 | 27.37 | 70.13 | 69.75 | 66.87 | 54.97 | 56.41 | 54.74 |
| race-middle | 9a54b6 | accuracy | 41.57 | 41.78 | 41.23 | 88.93 | 88.93 | 88.93 | 87.33 | 87.26 | 86.28 |
| race-high | 9a54b6 | accuracy | 39.65 | 39.77 | 40.77 | 85.33 | 85.31 | 84.62 | 82.53 | 82.59 | 82.02 |
The table below presents LMDeploy's inference performance with a quantized KV cache.
| model | kv type | test settings | RPS | v.s. kv fp16 |
|---|---|---|---|---|
| llama2-chat-7b | fp16 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 14.98 | 1.0 |
| - | int8 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 19.01 | 1.27 |
| - | int4 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 20.81 | 1.39 |
| llama2-chat-13b | fp16 | tp1 / ratio 0.9 / bs 128 / prompts 10000 | 8.55 | 1.0 |
| - | int8 | tp1 / ratio 0.9 / bs 256 / prompts 10000 | 10.96 | 1.28 |
| - | int4 | tp1 / ratio 0.9 / bs 256 / prompts 10000 | 11.91 | 1.39 |
| internlm2-chat-7b | fp16 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 24.13 | 1.0 |
| - | int8 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 25.28 | 1.05 |
| - | int4 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 25.80 | 1.07 |
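To relate the test settings to the engine configuration, here is a sketch of how settings like "tp1 / ratio 0.8 / bs 256" map onto `TurbomindEngineConfig`; reading "ratio" as `cache_max_entry_count` is my interpretation, and the benchmark then drives the 10000 prompts through the pipeline:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(
    tp=1,                       # tensor parallelism degree ("tp1")
    cache_max_entry_count=0.8,  # fraction of free GPU memory for KV cache ("ratio 0.8")
    max_batch_size=256,         # maximum concurrent batch size ("bs 256")
    quant_policy=8,             # 8 -> KV int8, 4 -> KV int4, 0 -> fp16 KV
)
pipe = pipeline('meta-llama/Llama-2-7b-chat-hf', backend_config=engine_config)
```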
What's Changed
🚀 Features
- Support qwen1.5 in turbomind engine by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/1406
- Online 8/4-bit KV-cache quantization by @lzhangzz in https://github.com/InternLM/lmdeploy/pull/1377
- Support qwen1.5-*-AWQ model inference in turbomind by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/1430
- support Internvl chat v1.1, v1.2 and v1.2-plus by @irexyc in https://github.com/InternLM/lmdeploy/pull/1425
- support Internvl chat llava by @irexyc in https://github.com/InternLM/lmdeploy/pull/1426
- Add llama3 chat template by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1461
- Support mini gemini llama by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1438
- add interactive api in service for VL models by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1444
- support output logprobs with turbomind backend by @irexyc in https://github.com/InternLM/lmdeploy/pull/1391 (a usage sketch follows after this list)
- support internlm-xcomposer2-7b & internlm-xcomposer2-4khd-7b by @irexyc in https://github.com/InternLM/lmdeploy/pull/1458
- Add qwen1.5 awq quantization by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1470
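Regarding the logprobs support from #1391 referenced above, a minimal sketch, assuming `logprobs` on `GenerationConfig` takes the number of top log probabilities to return per generated token:

```python
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('internlm/internlm2-chat-7b')

# Request the top-5 log probabilities for each generated token.
gen_config = GenerationConfig(logprobs=5, max_new_tokens=64)
response = pipe(['What is the capital of France?'], gen_config=gen_config)
print(response[0].logprobs)
```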
💥 Improvements
- Reduce binary size, add `sm_89` and `sm_90` targets by @lzhangzz in https://github.com/InternLM/lmdeploy/pull/1383
- Use new event loop instead of the current loop for pipeline by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1352
- Optimize inference of pytorch engine with tensor parallelism by @grimoire in https://github.com/InternLM/lmdeploy/pull/1397
- add llava-v1.6-34b template by @irexyc in https://github.com/InternLM/lmdeploy/pull/1408
- Initialize vl encoder first to avoid OOM by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1434
- Support model_name customization for api_server by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1403
- Expose dynamic split&fuse parameters by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/1433
- warning transformers version by @grimoire in https://github.com/InternLM/lmdeploy/pull/1453
- Optimize apply_rotary kernel and remove useless inference_mode by @grimoire in https://github.com/InternLM/lmdeploy/pull/1457
- set infinity timeout to nccl by @grimoire in https://github.com/InternLM/lmdeploy/pull/1465
- Feat: format internlm2 chat template by @liujiangning30 in https://github.com/InternLM/lmdeploy/pull/1456
🐞 Bug fixes
- handle SIGTERM by @grimoire in https://github.com/InternLM/lmdeploy/pull/1389
- fix chat cli `ArgumentError` raised in Python 3.11 by @RunningLeon in https://github.com/InternLM/lmdeploy/pull/1401
- Fix llama_triton_example by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1414
- Fix missing `--trust-remote-code` in converter, a side effect brought by PR #1406, by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/1420
- fix sampling kernel by @grimoire in https://github.com/InternLM/lmdeploy/pull/1417
- Fix loading single safetensor file error by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1427
- remove space in deepseek template by @grimoire in https://github.com/InternLM/lmdeploy/pull/1441
- fix free repetition_penalty_workspace_ buffer by @irexyc in https://github.com/InternLM/lmdeploy/pull/1467
- fix adapter failure when tp>1 by @grimoire in https://github.com/InternLM/lmdeploy/pull/1476
- get model in advance to fix downloading from modelscope error by @irexyc in https://github.com/InternLM/lmdeploy/pull/1473
- Fix the side effect in engine_instance brought by #1391 by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/1480
📚 Documentations
- Add model name corresponding to the test data in the doc by @wykvictor in https://github.com/InternLM/lmdeploy/pull/1400
- fix typo in get_started guide by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/1411
- Add async openai demo for api_server by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1409
- add the recommendation version for Python Backend by @zhyncs in https://github.com/InternLM/lmdeploy/pull/1436
- Update kv quantization and inference guide by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/1412
- update doc for llama3 by @zhyncs in https://github.com/InternLM/lmdeploy/pull/1462
🌐 Other
- hack cmakelist.txt in pr_test workflow by @zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1405
- Add benchmark report generated in summary by @zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1419
- add restful completions v1 test case by @ZhoujhZoe in https://github.com/InternLM/lmdeploy/pull/1416
- Add kvint4/8 ete testcase by @zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1448
- improve rotary embedding of qwen in torch engine by @grimoire in https://github.com/InternLM/lmdeploy/pull/1451
- change cutlass url in ut by @RunningLeon in https://github.com/InternLM/lmdeploy/pull/1464
- bump version to v0.4.0 by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/1469
New Contributors
- @wykvictor made their first contribution in https://github.com/InternLM/lmdeploy/pull/1400
- @ZhoujhZoe made their first contribution in https://github.com/InternLM/lmdeploy/pull/1416
- @liujiangning30 made their first contribution in https://github.com/InternLM/lmdeploy/pull/1456
Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.3.0...v0.4.0
Assets
1. lmdeploy-0.4.0+cu118-cp310-cp310-manylinux2014_x86_64.whl (72.08 MB)
2. lmdeploy-0.4.0+cu118-cp310-cp310-win_amd64.whl (50.26 MB)
3. lmdeploy-0.4.0+cu118-cp311-cp311-manylinux2014_x86_64.whl (72.09 MB)
4. lmdeploy-0.4.0+cu118-cp311-cp311-win_amd64.whl (50.26 MB)
5. lmdeploy-0.4.0+cu118-cp38-cp38-manylinux2014_x86_64.whl (72.09 MB)
6. lmdeploy-0.4.0+cu118-cp38-cp38-win_amd64.whl (50.26 MB)
7. lmdeploy-0.4.0+cu118-cp39-cp39-manylinux2014_x86_64.whl (72.08 MB)
8. lmdeploy-0.4.0+cu118-cp39-cp39-win_amd64.whl (50.25 MB)