v0.6.1
Release date: 2024-09-12 05:44:44
Highlights
Model Support
- Added support for Pixtral (`mistralai/Pixtral-12B-2409`) (#8377, #8168)
- Added support for Llava-Next-Video (#7559), Qwen-VL (#8029), Qwen2-VL (#7905)
- Multi-input support for LLaVA (#8238), InternVL2 models (#8201)
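The multi-image additions surface through the offline `LLM` API. Below is a minimal sketch of a two-image LLaVA prompt; the model name, image files, and prompt template are illustrative, and `limit_mm_per_prompt` is raised explicitly under the assumption that the default allows only one image per prompt.

```python
# Minimal multi-image sketch (illustrative model/paths; assumes PIL is installed).
from PIL import Image
from vllm import LLM, SamplingParams

# Allow up to two images per prompt instead of the single-image default.
llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    limit_mm_per_prompt={"image": 2},
)

prompt = "USER: <image>\n<image>\nWhat differs between these two images? ASSISTANT:"
images = [Image.open("cat.jpg"), Image.open("dog.jpg")]

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": images}},
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```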
Performance Enhancements
- Memory optimization for `awq_gemm` and `awq_dequantize`, 2x throughput (#8248)
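These kernels are exercised whenever an AWQ checkpoint is served. A minimal sketch, with an illustrative model name:

```python
# Minimal sketch of loading an AWQ-quantized model (illustrative checkpoint name);
# generation on such a model goes through the awq_gemm/awq_dequantize kernels.
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ", quantization="awq")
out = llm.generate("Hello, my name is", SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```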
Production Engine
- Support for loading and unloading LoRA adapters in the API server (#6566); a usage sketch follows this list
- Add progress reporting to batch runner (#8060)
- Add support for NVIDIA ModelOpt static scaling checkpoints (#6112)
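For the dynamic-LoRA feature (#6566), adapters can be registered and removed against a running OpenAI-compatible server. The sketch below assumes the server was started with `--enable-lora` and `VLLM_ALLOW_RUNTIME_LORA_UPDATING=True`; the adapter name and path are illustrative.

```python
# Minimal sketch: load and later unload a LoRA adapter at runtime (illustrative
# adapter name and path; server assumed at localhost:8000, started with
# --enable-lora and VLLM_ALLOW_RUNTIME_LORA_UPDATING=True).
import requests

base = "http://localhost:8000"

# Register the adapter without restarting the server.
requests.post(
    f"{base}/v1/load_lora_adapter",
    json={"lora_name": "sql_adapter", "lora_path": "/path/to/sql-lora-adapter"},
).raise_for_status()

# ... requests using model="sql_adapter" can now be served ...

# Remove the adapter once it is no longer needed.
requests.post(
    f"{base}/v1/unload_lora_adapter",
    json={"lora_name": "sql_adapter"},
).raise_for_status()
```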
Others
- Update the docker image to use Python 3.12 for a small performance bump (#8133)
- Added CODE_OF_CONDUCT.md (#8161)
What's Changed
- [Doc] [Misc] Create CODE_OF_CONDUCT.md by @mmcelaney in https://github.com/vllm-project/vllm/pull/8161
- [bugfix] Upgrade minimum OpenAI version by @SolitaryThinker in https://github.com/vllm-project/vllm/pull/8169
- [Misc] Clean up RoPE forward_native by @WoosukKwon in https://github.com/vllm-project/vllm/pull/8076
- [ci] Mark LoRA test as soft-fail by @khluu in https://github.com/vllm-project/vllm/pull/8160
- [Core/Bugfix] Add query dtype as per FlashInfer API requirements. by @elfiegg in https://github.com/vllm-project/vllm/pull/8173
- [Doc] Add multi-image input example and update supported models by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/8181
- Inclusion of InternVLChatModel In PP_SUPPORTED_MODELS(Pipeline Parallelism) by @Manikandan-Thangaraj-ZS0321 in https://github.com/vllm-project/vllm/pull/7860
- [MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) by @alex-jw-brooks in https://github.com/vllm-project/vllm/pull/8029
- Move verify_marlin_supported to GPTQMarlinLinearMethod by @mgoin in https://github.com/vllm-project/vllm/pull/8165
- [Documentation][Spec Decode] Add documentation about lossless guarantees in Speculative Decoding in vLLM by @sroy745 in https://github.com/vllm-project/vllm/pull/7962
- [Core] Support load and unload LoRA in api server by @Jeffwan in https://github.com/vllm-project/vllm/pull/6566
- [BugFix] Fix Granite model configuration by @njhill in https://github.com/vllm-project/vllm/pull/8216
- [Frontend] Add --logprobs argument to `benchmark_serving.py` by @afeldman-nm in https://github.com/vllm-project/vllm/pull/8191
- [Misc] Use ray[adag] dependency instead of cuda by @ruisearch42 in https://github.com/vllm-project/vllm/pull/7938
- [CI/Build] Increasing timeout for multiproc worker tests by @alexeykondrat in https://github.com/vllm-project/vllm/pull/8203
- [Kernel] [Triton] Memory optimization for awq_gemm and awq_dequantize, 2x throughput by @rasmith in https://github.com/vllm-project/vllm/pull/8248
- [Misc] Remove `SqueezeLLM` by @dsikka in https://github.com/vllm-project/vllm/pull/8220
- [Model] Allow loading from original Mistral format by @patrickvonplaten in https://github.com/vllm-project/vllm/pull/8168
- [misc] [doc] [frontend] LLM torch profiler support by @SolitaryThinker in https://github.com/vllm-project/vllm/pull/7943 (a usage sketch follows this changelog)
- [Bugfix] Fix Hermes tool call chat template bug by @K-Mistele in https://github.com/vllm-project/vllm/pull/8256
- [Model] Multi-input support for LLaVA and fix embedding inputs for multi-image models by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/8238
- Enable Random Prefix Caching in Serving Profiling Tool (benchmark_serving.py) by @wschin in https://github.com/vllm-project/vllm/pull/8241
- [tpu][misc] fix typo by @youkaichao in https://github.com/vllm-project/vllm/pull/8260
- [Bugfix] Fix broken OpenAI tensorizer test by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/8258
- [Model][VLM] Support multi-images inputs for InternVL2 models by @Isotr0py in https://github.com/vllm-project/vllm/pull/8201
- [Model][VLM] Decouple weight loading logic for `Paligemma` by @Isotr0py in https://github.com/vllm-project/vllm/pull/8269
- ppc64le: Dockerfile fixed, and a script for buildkite by @sumitd2 in https://github.com/vllm-project/vllm/pull/8026
- [CI/Build] Use python 3.12 in cuda image by @joerunde in https://github.com/vllm-project/vllm/pull/8133
- [Bugfix] Fix async postprocessor in case of preemption by @alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/8267
- [Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility by @K-Mistele in https://github.com/vllm-project/vllm/pull/8272
- [Frontend] Add progress reporting to run_batch.py by @alugowski in https://github.com/vllm-project/vllm/pull/8060
- [Bugfix] Correct adapter usage for cohere and jamba by @vladislavkruglikov in https://github.com/vllm-project/vllm/pull/8292
- [Misc] GPTQ Activation Ordering by @kylesayrs in https://github.com/vllm-project/vllm/pull/8135
- [Misc] Fused MoE Marlin support for GPTQ by @dsikka in https://github.com/vllm-project/vllm/pull/8217
- Add NVIDIA Meetup slides, announce AMD meetup, and add contact info by @simon-mo in https://github.com/vllm-project/vllm/pull/8319
- [Bugfix] Fix missing `post_layernorm` in CLIP by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/8155
- [CI/Build] enable ccache/sccache for HIP builds by @dtrifiro in https://github.com/vllm-project/vllm/pull/8327
- [Frontend] Clean up type annotations for mistral tokenizer by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/8314
- [CI/Build] Enabling kernels tests for AMD, ignoring some of them that fail by @alexeykondrat in https://github.com/vllm-project/vllm/pull/8130
- Fix ppc64le buildkite job by @sumitd2 in https://github.com/vllm-project/vllm/pull/8309
- [Spec Decode] Move ops.advance_step to flash attn advance_step by @kevin314 in https://github.com/vllm-project/vllm/pull/8224
- [Misc] remove peft as dependency for prompt models by @prashantgupta24 in https://github.com/vllm-project/vllm/pull/8162
- [MISC] Keep chunked prefill enabled by default with long context when prefix caching is enabled by @comaniac in https://github.com/vllm-project/vllm/pull/8342
- [Bugfix] Ensure multistep lookahead allocation is compatible with cuda graph max capture by @alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/8340
- [Core/Bugfix] pass VLLM_ATTENTION_BACKEND to ray workers by @SolitaryThinker in https://github.com/vllm-project/vllm/pull/8172
- [CI/Build][Kernel] Update CUTLASS to 3.5.1 tag by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/8043
- [Misc] Skip loading extra bias for Qwen2-MOE GPTQ models by @jeejeelee in https://github.com/vllm-project/vllm/pull/8329
- [Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel by @Isotr0py in https://github.com/vllm-project/vllm/pull/8299
- [Hardware][NV] Add support for ModelOpt static scaling checkpoints. by @pavanimajety in https://github.com/vllm-project/vllm/pull/6112
- [model] Support for Llava-Next-Video model by @TKONIY in https://github.com/vllm-project/vllm/pull/7559
- [Frontend] Create ErrorResponse instead of raising exceptions in run_batch by @pooyadavoodi in https://github.com/vllm-project/vllm/pull/8347
- [Model][VLM] Add Qwen2-VL model support by @fyabc in https://github.com/vllm-project/vllm/pull/7905
- [Hardware][Intel] Support compressed-tensor W8A8 for CPU backend by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/7257
- [CI/Build] Excluding test_moe.py from AMD Kernels tests for investigation by @alexeykondrat in https://github.com/vllm-project/vllm/pull/8373
- [Bugfix] Add missing attributes in mistral tokenizer by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/8364
- [Kernel][Misc] Add meta functions for ops to prevent graph breaks by @bnellnm in https://github.com/vllm-project/vllm/pull/6917
- [Misc] Move device options to a single place by @akx in https://github.com/vllm-project/vllm/pull/8322
- [Speculative Decoding] Test refactor by @LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/8317
- Pixtral by @patrickvonplaten in https://github.com/vllm-project/vllm/pull/8377
- Bump version to v0.6.1 by @simon-mo in https://github.com/vllm-project/vllm/pull/8379
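As a usage note for the torch profiler support added in #7943: the sketch below assumes profiling is enabled by setting `VLLM_TORCH_PROFILER_DIR` before the engine is created; the model name and output directory are illustrative.

```python
# Minimal sketch of the LLM-level torch profiler hooks (illustrative model and path).
import os

# Must be set before the engine is constructed so the profiler is enabled.
os.environ["VLLM_TORCH_PROFILER_DIR"] = "/tmp/vllm_profile"

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")

llm.start_profile()
llm.generate("Hello, world", SamplingParams(max_tokens=16))
llm.stop_profile()
# Traces land under /tmp/vllm_profile and can be opened with the PyTorch
# profiler tooling (e.g. TensorBoard's profiler plugin).
```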
New Contributors
- @mmcelaney made their first contribution in https://github.com/vllm-project/vllm/pull/8161
- @elfiegg made their first contribution in https://github.com/vllm-project/vllm/pull/8173
- @Manikandan-Thangaraj-ZS0321 made their first contribution in https://github.com/vllm-project/vllm/pull/7860
- @sumitd2 made their first contribution in https://github.com/vllm-project/vllm/pull/8026
- @alugowski made their first contribution in https://github.com/vllm-project/vllm/pull/8060
- @vladislavkruglikov made their first contribution in https://github.com/vllm-project/vllm/pull/8292
- @kevin314 made their first contribution in https://github.com/vllm-project/vllm/pull/8224
- @TKONIY made their first contribution in https://github.com/vllm-project/vllm/pull/7559
- @akx made their first contribution in https://github.com/vllm-project/vllm/pull/8322
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.6.0...v0.6.1
Assets
- vllm-0.6.1+cu118-cp310-cp310-manylinux1_x86_64.whl (161.98 MB)
- vllm-0.6.1+cu118-cp311-cp311-manylinux1_x86_64.whl (161.98 MB)
- vllm-0.6.1+cu118-cp312-cp312-manylinux1_x86_64.whl (161.98 MB)
- vllm-0.6.1+cu118-cp38-cp38-manylinux1_x86_64.whl (161.98 MB)
- vllm-0.6.1+cu118-cp39-cp39-manylinux1_x86_64.whl (161.98 MB)