v0.4.0
Release date: 2024-04-23 19:18:37
Highlights
Support for Llama3 and additional Vision-Language Models (VLMs):
- We now support Llama3 and an extended range of Vision-Language Models (VLMs), including InternVL versions 1.1 and 1.2, Mini-Gemini, and InternLM-XComposer2.
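The newly supported VLMs go through the same high-level `pipeline` API as LLMs. A minimal sketch follows; the hub id is an assumption for illustration, and any supported VLM checkpoint should work the same way:

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

# Model id chosen for illustration; substitute any supported VLM checkpoint.
pipe = pipeline('OpenGVLab/InternVL-Chat-V1-2')

# Single-image chat: pass a (prompt, image) tuple.
image = load_image('tiger.jpeg')  # local path or URL
response = pipe(('describe this image', image))
print(response.text)
```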
Introduce online int4/int8 KV quantization and inference
- Data-free online quantization: no calibration dataset is required
- Supports all NVIDIA GPUs of Volta architecture (sm70) and above
- KV int8 quantization is nearly lossless in accuracy, and KV int4 accuracy stays within an acceptable range
- Efficient inference: with int8/int4 KV quantization applied to llama2-7b, RPS improves by approximately 30% and 40% respectively compared to fp16
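KV quantization is switched on through the `quant_policy` field of the engine config. A minimal sketch (model id chosen for illustration):

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# quant_policy=8 enables online KV int8 quantization; use 4 for KV int4,
# or 0 (the default) to keep the KV cache in fp16. Being data-free, it
# needs no calibration set.
engine_config = TurbomindEngineConfig(quant_policy=8)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=engine_config)
print(pipe(['Hello, please introduce yourself']))
```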
The following table shows evaluation results for three LLMs under different KV-cache numerical precisions:
| - | - | - | llama2-7b-chat | - | - | internlm2-chat-7b | - | - | qwen1.5-7b-chat | - | - |
|---|---|---|---|---|---|---|---|---|---|---|---|
| dataset | version | metric | kv fp16 | kv int8 | kv int4 | kv fp16 | kv int8 | kv int4 | kv fp16 | kv int8 | kv int4 |
| ceval | - | naive_average | 28.42 | 27.96 | 27.58 | 60.45 | 60.88 | 60.28 | 70.56 | 70.49 | 68.62 |
| mmlu | - | naive_average | 35.64 | 35.58 | 34.79 | 63.91 | 64 | 62.36 | 61.48 | 61.56 | 60.65 |
| triviaqa | 2121ce | score | 56.09 | 56.13 | 53.71 | 58.73 | 58.7 | 58.18 | 44.62 | 44.77 | 44.04 |
| gsm8k | 1d7fe4 | accuracy | 28.2 | 28.05 | 27.37 | 70.13 | 69.75 | 66.87 | 54.97 | 56.41 | 54.74 |
| race-middle | 9a54b6 | accuracy | 41.57 | 41.78 | 41.23 | 88.93 | 88.93 | 88.93 | 87.33 | 87.26 | 86.28 |
| race-high | 9a54b6 | accuracy | 39.65 | 39.77 | 40.77 | 85.33 | 85.31 | 84.62 | 82.53 | 82.59 | 82.02 |
The table below presents LMDeploy's inference performance with a quantized KV cache.
| model | kv type | test settings | RPS | v.s. kv fp16 |
|---|---|---|---|---|
| llama2-chat-7b | fp16 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 14.98 | 1.0 |
| - | int8 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 19.01 | 1.27 |
| - | int4 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 20.81 | 1.39 |
| llama2-chat-13b | fp16 | tp1 / ratio 0.9 / bs 128 / prompts 10000 | 8.55 | 1.0 |
| - | int8 | tp1 / ratio 0.9 / bs 256 / prompts 10000 | 10.96 | 1.28 |
| - | int4 | tp1 / ratio 0.9 / bs 256 / prompts 10000 | 11.91 | 1.39 |
| internlm2-chat-7b | fp16 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 24.13 | 1.0 |
| - | int8 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 25.28 | 1.05 |
| - | int4 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 25.80 | 1.07 |
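To relate the test settings to the engine configuration, here is a sketch of how settings like "tp1 / ratio 0.8 / bs 256" map onto `TurbomindEngineConfig`; reading "ratio" as `cache_max_entry_count` is my interpretation, and the benchmark then drives the 10000 prompts through the pipeline:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(
    tp=1,                       # tensor parallelism degree ("tp1")
    cache_max_entry_count=0.8,  # fraction of free GPU memory for KV cache ("ratio 0.8")
    max_batch_size=256,         # maximum concurrent batch size ("bs 256")
    quant_policy=8,             # 8 -> KV int8, 4 -> KV int4, 0 -> fp16 KV
)
pipe = pipeline('meta-llama/Llama-2-7b-chat-hf', backend_config=engine_config)
```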
What's Changed
🚀 Features
- Support qwen1.5 in turbomind engine by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/1406
- Online 8/4-bit KV-cache quantization by @lzhangzz in https://github.com/InternLM/lmdeploy/pull/1377
- Support qwen1.5-*-AWQ model inference in turbomind by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/1430
- support Internvl chat v1.1, v1.2 and v1.2-plus by @irexyc in https://github.com/InternLM/lmdeploy/pull/1425
- support Internvl chat llava by @irexyc in https://github.com/InternLM/lmdeploy/pull/1426
- Add llama3 chat template by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1461
- Support mini gemini llama by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1438
- add interactive api in service for VL models by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1444
- support output logprobs with turbomind backend by @irexyc in https://github.com/InternLM/lmdeploy/pull/1391 (a usage sketch follows after this list)
- support internlm-xcomposer2-7b & internlm-xcomposer2-4khd-7b by @irexyc in https://github.com/InternLM/lmdeploy/pull/1458
- Add qwen1.5 awq quantization by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1470
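Regarding the logprobs support from #1391 referenced above, a minimal sketch, assuming `logprobs` on `GenerationConfig` takes the number of top log probabilities to return per generated token:

```python
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('internlm/internlm2-chat-7b')

# Request the top-5 log probabilities for each generated token.
gen_config = GenerationConfig(logprobs=5, max_new_tokens=64)
response = pipe(['What is the capital of France?'], gen_config=gen_config)
print(response[0].logprobs)
```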
💥 Improvements
- Reduce binary size, add `sm_89` and `sm_90` targets by @lzhangzz in https://github.com/InternLM/lmdeploy/pull/1383
- Use new event loop instead of the current loop for pipeline by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1352
- Optimize inference of pytorch engine with tensor parallelism by @grimoire in https://github.com/InternLM/lmdeploy/pull/1397
- add llava-v1.6-34b template by @irexyc in https://github.com/InternLM/lmdeploy/pull/1408
- Initialize vl encoder first to avoid OOM by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1434
- Support model_name customization for api_server by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1403
- Expose dynamic split&fuse parameters by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/1433
- warning transformers version by @grimoire in https://github.com/InternLM/lmdeploy/pull/1453
- Optimize apply_rotary kernel and remove useless inference_mode by @grimoire in https://github.com/InternLM/lmdeploy/pull/1457
- set infinity timeout to nccl by @grimoire in https://github.com/InternLM/lmdeploy/pull/1465
- Feat: format internlm2 chat template by @liujiangning30 in https://github.com/InternLM/lmdeploy/pull/1456
🐞 Bug fixes
- handle SIGTERM by @grimoire in https://github.com/InternLM/lmdeploy/pull/1389
- fix chat cli `ArgumentError` raised in Python 3.11 by @RunningLeon in https://github.com/InternLM/lmdeploy/pull/1401
- Fix llama_triton_example by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1414
- Fix missing `--trust-remote-code` in converter, a side effect brought by PR #1406, by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/1420
- fix sampling kernel by @grimoire in https://github.com/InternLM/lmdeploy/pull/1417
- Fix loading single safetensor file error by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1427
- remove space in deepseek template by @grimoire in https://github.com/InternLM/lmdeploy/pull/1441
- fix free repetition_penalty_workspace_ buffer by @irexyc in https://github.com/InternLM/lmdeploy/pull/1467
- fix adapter failure when tp>1 by @grimoire in https://github.com/InternLM/lmdeploy/pull/1476
- get model in advance to fix downloading from modelscope error by @irexyc in https://github.com/InternLM/lmdeploy/pull/1473
- Fix the side effect in engine_instance brought by #1391 by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/1480
📚 Documentations
- Add model name corresponding to the test data in the doc by @wykvictor in https://github.com/InternLM/lmdeploy/pull/1400
- fix typo in get_started guide by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/1411
- Add async openai demo for api_server by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1409
- add the recommendation version for Python Backend by @zhyncs in https://github.com/InternLM/lmdeploy/pull/1436
- Update kv quantization and inference guide by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/1412
- update doc for llama3 by @zhyncs in https://github.com/InternLM/lmdeploy/pull/1462
🌐 Other
- hack cmakelist.txt in pr_test workflow by @zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1405
- Add benchmark report generated in summary by @zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1419
- add restful completions v1 test case by @ZhoujhZoe in https://github.com/InternLM/lmdeploy/pull/1416
- Add kvint4/8 ete testcase by @zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1448
- improve rotary embedding of qwen in torch engine by @grimoire in https://github.com/InternLM/lmdeploy/pull/1451
- change cutlass url in ut by @RunningLeon in https://github.com/InternLM/lmdeploy/pull/1464
- bump version to v0.4.0 by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/1469
New Contributors
- @wykvictor made their first contribution in https://github.com/InternLM/lmdeploy/pull/1400
- @ZhoujhZoe made their first contribution in https://github.com/InternLM/lmdeploy/pull/1416
- @liujiangning30 made their first contribution in https://github.com/InternLM/lmdeploy/pull/1456
Full Changelog: https://github.com/InternLM/lmdeploy/compare/v0.3.0...v0.4.0
Assets
1. lmdeploy-0.4.0+cu118-cp310-cp310-manylinux2014_x86_64.whl (72.08 MB)
2. lmdeploy-0.4.0+cu118-cp310-cp310-win_amd64.whl (50.26 MB)
3. lmdeploy-0.4.0+cu118-cp311-cp311-manylinux2014_x86_64.whl (72.09 MB)
4. lmdeploy-0.4.0+cu118-cp311-cp311-win_amd64.whl (50.26 MB)
5. lmdeploy-0.4.0+cu118-cp38-cp38-manylinux2014_x86_64.whl (72.09 MB)
6. lmdeploy-0.4.0+cu118-cp38-cp38-win_amd64.whl (50.26 MB)
7. lmdeploy-0.4.0+cu118-cp39-cp39-manylinux2014_x86_64.whl (72.08 MB)
8. lmdeploy-0.4.0+cu118-cp39-cp39-win_amd64.whl (50.25 MB)