# v0.4.2
Release date: 2024-05-27 16:56:15
## Highlight
- Support 4-bit weight-only quantization and inference of VLMs, such as InternVL v1.5, LLaVA, and InternLM-XComposer2
  Quantization:

  ```shell
  lmdeploy lite auto_awq OpenGVLab/InternVL-Chat-V1-5 --work-dir ./InternVL-Chat-V1-5-AWQ
  ```

  Inference with the quantized model:

  ```python
  from lmdeploy import pipeline, TurbomindEngineConfig
  from lmdeploy.vl import load_image

  pipe = pipeline('./InternVL-Chat-V1-5-AWQ',
                  backend_config=TurbomindEngineConfig(tp=1, model_format='awq'))
  img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
  out = pipe(('describe this image', img))
  print(out)
  ```
- Balance the vision model weights across GPUs when deploying VLMs on multiple GPUs (see the combined sketch below)
  ```python
  from lmdeploy import pipeline, TurbomindEngineConfig
  from lmdeploy.vl import load_image

  pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5',
                  backend_config=TurbomindEngineConfig(tp=2))
  img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
  out = pipe(('describe this image', img))
  print(out)
  ```
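The two highlights compose. A minimal sketch (not from the release notes) that loads the AWQ checkpoint produced in the quantization step above while balancing the vision weights across two GPUs; the `./InternVL-Chat-V1-5-AWQ` workspace is assumed to exist:

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# 4-bit AWQ weights (model_format='awq') with the vision model
# balanced across two GPUs (tp=2)
pipe = pipeline('./InternVL-Chat-V1-5-AWQ',
                backend_config=TurbomindEngineConfig(tp=2, model_format='awq'))
img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
print(pipe(('describe this image', img)))
```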
## What's Changed

### 🚀 Features
- PyTorch engine hash-table-based prefix caching by @grimoire in https://github.com/InternLM/lmdeploy/pull/1429 (see the sketch after this list)
- support phi3 by @grimoire in https://github.com/InternLM/lmdeploy/pull/1497
- Turbomind prefix caching by @ispobock in https://github.com/InternLM/lmdeploy/pull/1450
- Enable search scale for awq by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1545
- [Feature] Support vl models quantization by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1553
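The prefix-caching items above (PyTorch engine in #1429, TurboMind in #1450) cache the KV blocks of shared prompt prefixes so a repeated system prompt is not recomputed per request. A minimal sketch, assuming the feature is switched on through an `enable_prefix_caching` field on the engine configs:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# `enable_prefix_caching` is assumed to be the switch added by the PRs
# above; PytorchEngineConfig is assumed to expose the same field.
pipe = pipeline('internlm/internlm2-chat-7b',
                backend_config=TurbomindEngineConfig(enable_prefix_caching=True))

# Both prompts share the same prefix, so the second request can reuse
# the cached KV blocks of the system prompt.
system = 'You are a concise assistant.\n'
print(pipe([system + 'What is prefix caching?',
            system + 'Why does it speed up serving?']))
```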
### 💥 Improvements
- make Qwen compatible with Slora when TP > 1 by @jjjjohnson in https://github.com/InternLM/lmdeploy/pull/1518
- Optimize slora by @grimoire in https://github.com/InternLM/lmdeploy/pull/1447
- Use a faster format for images in VLMs by @isidentical in https://github.com/InternLM/lmdeploy/pull/1575
- add chat-template args to chat cli by @RunningLeon in https://github.com/InternLM/lmdeploy/pull/1566
- Get the max session len from config.json by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1550
- Optimize w8a8 kernel by @grimoire in https://github.com/InternLM/lmdeploy/pull/1353
- support python 3.12 by @irexyc in https://github.com/InternLM/lmdeploy/pull/1605
- Optimize moe by @grimoire in https://github.com/InternLM/lmdeploy/pull/1520
- Balance vision model weights on multiple GPUs by @irexyc in https://github.com/InternLM/lmdeploy/pull/1591
- Support user-specified IMAGE_TOKEN position for the deepseek-vl model by @irexyc in https://github.com/InternLM/lmdeploy/pull/1627 (see the sketch after this list)
- Optimize GQA/MQA by @grimoire in https://github.com/InternLM/lmdeploy/pull/1649
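A minimal sketch of the user-specified IMAGE_TOKEN position from the deepseek-vl item above, assuming `IMAGE_TOKEN` is the placeholder exported by `lmdeploy.vl.constants`:

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN  # assumed export

pipe = pipeline('deepseek-ai/deepseek-vl-7b-chat')
img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')

# Put the image token exactly where you want it in the prompt instead
# of relying on the default insertion position.
prompt = f'{IMAGE_TOKEN}\nWhat animal appears in the image above?'
print(pipe((prompt, img)))
```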
### 🐞 Bug fixes
- fix logger init by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1598
- Bugfix: gen_config wrongly assigned `True` by @thelongestusernameofall in https://github.com/InternLM/lmdeploy/pull/1594
- Enable split-kv for attention by @lzhangzz in https://github.com/InternLM/lmdeploy/pull/1606
- Fix xcomposer2 vision model process by @irexyc in https://github.com/InternLM/lmdeploy/pull/1640
- Fix NTK scaling by @lzhangzz in https://github.com/InternLM/lmdeploy/pull/1636
- Fix illegal memory access when seq_len < 64 by @lzhangzz in https://github.com/InternLM/lmdeploy/pull/1616
- Fix llava vl template by @irexyc in https://github.com/InternLM/lmdeploy/pull/1620
- [side-effect] fix deepseek-vl when tp is 1 by @irexyc in https://github.com/InternLM/lmdeploy/pull/1648
- fix logprobs output by @irexyc in https://github.com/InternLM/lmdeploy/pull/1561 (see the sketch after this list)
- fix fused-moe in triton2.2.0 by @grimoire in https://github.com/InternLM/lmdeploy/pull/1654
- Align tokenizers in pipeline and api_server benchmark scripts by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1650
- [side-effect] fix UnboundLocalError for internlm-xcomposer2-4khd-7b by @irexyc in https://github.com/InternLM/lmdeploy/pull/1661
- remove paged attention prefill autotune by @grimoire in https://github.com/InternLM/lmdeploy/pull/1658
- Fix prompts possibly differing after an encode-decode round trip with transformers 4.41.0 by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1617
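The logprobs fix above can be exercised through `GenerationConfig`. A minimal sketch, assuming `logprobs` takes the number of top log-probabilities to return per generated token and that the response exposes them as `.logprobs`:

```python
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('internlm/internlm2-chat-7b')

# Ask for the top-5 log-probabilities of every generated token.
gen_config = GenerationConfig(max_new_tokens=32, logprobs=5)
resp = pipe(['What is 2 + 2?'], gen_config=gen_config)[0]
print(resp.text)
print(resp.logprobs)  # assumed: per-token {token_id: logprob} mappings
```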
### 📚 Documentations
- Fix typo in w8a8.md by @chg0901 in https://github.com/InternLM/lmdeploy/pull/1568
- Update doc for prefix caching by @ispobock in https://github.com/InternLM/lmdeploy/pull/1597
- Update VL document by @AllentDan in https://github.com/InternLM/lmdeploy/pull/1657
### 🌐 Other
- remove first empty token check and add input validation testcase by @zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1549
- add more model into benchmark and evaluate workflow by @zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1565
- add vl awq testcase and refactor pipeline testcase by @zhulinJulia24 in https://github.com/InternLM/lmdeploy/pull/1630
- bump version to v0.4.2 by @lvhan028 in https://github.com/InternLM/lmdeploy/pull/1644
## New Contributors
- @isidentical made their first contribution in https://github.com/InternLM/lmdeploy/pull/1575
- @chg0901 made their first contribution in https://github.com/InternLM/lmdeploy/pull/1568
- @thelongestusernameofall made their first contribution in https://github.com/InternLM/lmdeploy/pull/1594
**Full Changelog**: https://github.com/InternLM/lmdeploy/compare/v0.4.1...v0.4.2
## Assets

1. lmdeploy-0.4.2+cu118-cp310-cp310-manylinux2014_x86_64.whl (70.59 MB)
2. lmdeploy-0.4.2+cu118-cp310-cp310-win_amd64.whl (48.61 MB)
3. lmdeploy-0.4.2+cu118-cp311-cp311-manylinux2014_x86_64.whl (70.61 MB)
4. lmdeploy-0.4.2+cu118-cp311-cp311-win_amd64.whl (48.61 MB)
5. lmdeploy-0.4.2+cu118-cp312-cp312-manylinux2014_x86_64.whl (70.62 MB)
6. lmdeploy-0.4.2+cu118-cp312-cp312-win_amd64.whl (48.61 MB)
7. lmdeploy-0.4.2+cu118-cp38-cp38-manylinux2014_x86_64.whl (70.61 MB)
8. lmdeploy-0.4.2+cu118-cp38-cp38-win_amd64.whl (48.61 MB)
9. lmdeploy-0.4.2+cu118-cp39-cp39-manylinux2014_x86_64.whl (70.59 MB)
10. lmdeploy-0.4.2+cu118-cp39-cp39-win_amd64.whl (48.6 MB)