v0.6.1
Release date: 2024-09-12 05:44:44
Highlights
Model Support
- Added support for Pixtral (`mistralai/Pixtral-12B-2409`) (#8377, #8168)
- Added support for Llava-Next-Video (#7559), Qwen-VL (#8029), Qwen2-VL (#7905)
- Multi-input support for LLaVA (#8238), InternVL2 models (#8201)
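The multi-image additions surface through the offline `LLM` API. Below is a minimal sketch of a two-image LLaVA prompt; the model name, image files, and prompt template are illustrative, and `limit_mm_per_prompt` is raised explicitly under the assumption that the default allows only one image per prompt.

```python
# Minimal multi-image sketch (illustrative model/paths; assumes PIL is installed).
from PIL import Image
from vllm import LLM, SamplingParams

# Allow up to two images per prompt instead of the single-image default.
llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    limit_mm_per_prompt={"image": 2},
)

prompt = "USER: <image>\n<image>\nWhat differs between these two images? ASSISTANT:"
images = [Image.open("cat.jpg"), Image.open("dog.jpg")]

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": images}},
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```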
Performance Enhancements
- Memory optimization for `awq_gemm` and `awq_dequantize`, 2x throughput (#8248)
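These kernels are exercised whenever an AWQ checkpoint is served. A minimal sketch, with an illustrative model name:

```python
# Minimal sketch of loading an AWQ-quantized model (illustrative checkpoint name);
# generation on such a model goes through the awq_gemm/awq_dequantize kernels.
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ", quantization="awq")
out = llm.generate("Hello, my name is", SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```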
Production Engine
- Support for loading and unloading LoRA adapters in the API server (#6566); a usage sketch follows this list
- Add progress reporting to batch runner (#8060)
- Add support for NVIDIA ModelOpt static scaling checkpoints (#6112)
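For the dynamic-LoRA feature (#6566), adapters can be registered and removed against a running OpenAI-compatible server. The sketch below assumes the server was started with `--enable-lora` and `VLLM_ALLOW_RUNTIME_LORA_UPDATING=True`; the adapter name and path are illustrative.

```python
# Minimal sketch: load and later unload a LoRA adapter at runtime (illustrative
# adapter name and path; server assumed at localhost:8000, started with
# --enable-lora and VLLM_ALLOW_RUNTIME_LORA_UPDATING=True).
import requests

base = "http://localhost:8000"

# Register the adapter without restarting the server.
requests.post(
    f"{base}/v1/load_lora_adapter",
    json={"lora_name": "sql_adapter", "lora_path": "/path/to/sql-lora-adapter"},
).raise_for_status()

# ... requests using model="sql_adapter" can now be served ...

# Remove the adapter once it is no longer needed.
requests.post(
    f"{base}/v1/unload_lora_adapter",
    json={"lora_name": "sql_adapter"},
).raise_for_status()
```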
Others
- Update the docker image to use Python 3.12 for a small performance bump (#8133)
- Added CODE_OF_CONDUCT.md (#8161)
What's Changed
- [Doc] [Misc] Create CODE_OF_CONDUCT.md by @mmcelaney in https://github.com/vllm-project/vllm/pull/8161
- [bugfix] Upgrade minimum OpenAI version by @SolitaryThinker in https://github.com/vllm-project/vllm/pull/8169
- [Misc] Clean up RoPE forward_native by @WoosukKwon in https://github.com/vllm-project/vllm/pull/8076
- [ci] Mark LoRA test as soft-fail by @khluu in https://github.com/vllm-project/vllm/pull/8160
- [Core/Bugfix] Add query dtype as per FlashInfer API requirements. by @elfiegg in https://github.com/vllm-project/vllm/pull/8173
- [Doc] Add multi-image input example and update supported models by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/8181
- Inclusion of InternVLChatModel In PP_SUPPORTED_MODELS(Pipeline Parallelism) by @Manikandan-Thangaraj-ZS0321 in https://github.com/vllm-project/vllm/pull/7860
- [MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) by @alex-jw-brooks in https://github.com/vllm-project/vllm/pull/8029
- Move verify_marlin_supported to GPTQMarlinLinearMethod by @mgoin in https://github.com/vllm-project/vllm/pull/8165
- [Documentation][Spec Decode] Add documentation about lossless guarantees in Speculative Decoding in vLLM by @sroy745 in https://github.com/vllm-project/vllm/pull/7962
- [Core] Support load and unload LoRA in api server by @Jeffwan in https://github.com/vllm-project/vllm/pull/6566
- [BugFix] Fix Granite model configuration by @njhill in https://github.com/vllm-project/vllm/pull/8216
- [Frontend] Add --logprobs argument to `benchmark_serving.py` by @afeldman-nm in https://github.com/vllm-project/vllm/pull/8191
- [Misc] Use ray[adag] dependency instead of cuda by @ruisearch42 in https://github.com/vllm-project/vllm/pull/7938
- [CI/Build] Increasing timeout for multiproc worker tests by @alexeykondrat in https://github.com/vllm-project/vllm/pull/8203
- [Kernel] [Triton] Memory optimization for awq_gemm and awq_dequantize, 2x throughput by @rasmith in https://github.com/vllm-project/vllm/pull/8248
- [Misc] Remove `SqueezeLLM` by @dsikka in https://github.com/vllm-project/vllm/pull/8220
- [Model] Allow loading from original Mistral format by @patrickvonplaten in https://github.com/vllm-project/vllm/pull/8168
- [misc] [doc] [frontend] LLM torch profiler support by @SolitaryThinker in https://github.com/vllm-project/vllm/pull/7943 (a usage sketch follows this changelog)
- [Bugfix] Fix Hermes tool call chat template bug by @K-Mistele in https://github.com/vllm-project/vllm/pull/8256
- [Model] Multi-input support for LLaVA and fix embedding inputs for multi-image models by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/8238
- Enable Random Prefix Caching in Serving Profiling Tool (benchmark_serving.py) by @wschin in https://github.com/vllm-project/vllm/pull/8241
- [tpu][misc] fix typo by @youkaichao in https://github.com/vllm-project/vllm/pull/8260
- [Bugfix] Fix broken OpenAI tensorizer test by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/8258
- [Model][VLM] Support multi-images inputs for InternVL2 models by @Isotr0py in https://github.com/vllm-project/vllm/pull/8201
- [Model][VLM] Decouple weight loading logic for `Paligemma` by @Isotr0py in https://github.com/vllm-project/vllm/pull/8269
- ppc64le: Dockerfile fixed, and a script for buildkite by @sumitd2 in https://github.com/vllm-project/vllm/pull/8026
- [CI/Build] Use python 3.12 in cuda image by @joerunde in https://github.com/vllm-project/vllm/pull/8133
- [Bugfix] Fix async postprocessor in case of preemption by @alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/8267
- [Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility by @K-Mistele in https://github.com/vllm-project/vllm/pull/8272
- [Frontend] Add progress reporting to run_batch.py by @alugowski in https://github.com/vllm-project/vllm/pull/8060
- [Bugfix] Correct adapter usage for cohere and jamba by @vladislavkruglikov in https://github.com/vllm-project/vllm/pull/8292
- [Misc] GPTQ Activation Ordering by @kylesayrs in https://github.com/vllm-project/vllm/pull/8135
- [Misc] Fused MoE Marlin support for GPTQ by @dsikka in https://github.com/vllm-project/vllm/pull/8217
- Add NVIDIA Meetup slides, announce AMD meetup, and add contact info by @simon-mo in https://github.com/vllm-project/vllm/pull/8319
- [Bugfix] Fix missing `post_layernorm` in CLIP by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/8155
- [CI/Build] enable ccache/sccache for HIP builds by @dtrifiro in https://github.com/vllm-project/vllm/pull/8327
- [Frontend] Clean up type annotations for mistral tokenizer by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/8314
- [CI/Build] Enabling kernels tests for AMD, ignoring some of them that fail by @alexeykondrat in https://github.com/vllm-project/vllm/pull/8130
- Fix ppc64le buildkite job by @sumitd2 in https://github.com/vllm-project/vllm/pull/8309
- [Spec Decode] Move ops.advance_step to flash attn advance_step by @kevin314 in https://github.com/vllm-project/vllm/pull/8224
- [Misc] remove peft as dependency for prompt models by @prashantgupta24 in https://github.com/vllm-project/vllm/pull/8162
- [MISC] Keep chunked prefill enabled by default with long context when prefix caching is enabled by @comaniac in https://github.com/vllm-project/vllm/pull/8342
- [Bugfix] Ensure multistep lookahead allocation is compatible with cuda graph max capture by @alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/8340
- [Core/Bugfix] pass VLLM_ATTENTION_BACKEND to ray workers by @SolitaryThinker in https://github.com/vllm-project/vllm/pull/8172
- [CI/Build][Kernel] Update CUTLASS to 3.5.1 tag by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/8043
- [Misc] Skip loading extra bias for Qwen2-MOE GPTQ models by @jeejeelee in https://github.com/vllm-project/vllm/pull/8329
- [Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel by @Isotr0py in https://github.com/vllm-project/vllm/pull/8299
- [Hardware][NV] Add support for ModelOpt static scaling checkpoints. by @pavanimajety in https://github.com/vllm-project/vllm/pull/6112
- [model] Support for Llava-Next-Video model by @TKONIY in https://github.com/vllm-project/vllm/pull/7559
- [Frontend] Create ErrorResponse instead of raising exceptions in run_batch by @pooyadavoodi in https://github.com/vllm-project/vllm/pull/8347
- [Model][VLM] Add Qwen2-VL model support by @fyabc in https://github.com/vllm-project/vllm/pull/7905
- [Hardware][Intel] Support compressed-tensor W8A8 for CPU backend by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/7257
- [CI/Build] Excluding test_moe.py from AMD Kernels tests for investigation by @alexeykondrat in https://github.com/vllm-project/vllm/pull/8373
- [Bugfix] Add missing attributes in mistral tokenizer by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/8364
- [Kernel][Misc] Add meta functions for ops to prevent graph breaks by @bnellnm in https://github.com/vllm-project/vllm/pull/6917
- [Misc] Move device options to a single place by @akx in https://github.com/vllm-project/vllm/pull/8322
- [Speculative Decoding] Test refactor by @LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/8317
- Pixtral by @patrickvonplaten in https://github.com/vllm-project/vllm/pull/8377
- Bump version to v0.6.1 by @simon-mo in https://github.com/vllm-project/vllm/pull/8379
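As a usage note for the torch profiler support added in #7943: the sketch below assumes profiling is enabled by setting `VLLM_TORCH_PROFILER_DIR` before the engine is created; the model name and output directory are illustrative.

```python
# Minimal sketch of the LLM-level torch profiler hooks (illustrative model and path).
import os

# Must be set before the engine is constructed so the profiler is enabled.
os.environ["VLLM_TORCH_PROFILER_DIR"] = "/tmp/vllm_profile"

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")

llm.start_profile()
llm.generate("Hello, world", SamplingParams(max_tokens=16))
llm.stop_profile()
# Traces land under /tmp/vllm_profile and can be opened with the PyTorch
# profiler tooling (e.g. TensorBoard's profiler plugin).
```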
New Contributors
- @mmcelaney made their first contribution in https://github.com/vllm-project/vllm/pull/8161
- @elfiegg made their first contribution in https://github.com/vllm-project/vllm/pull/8173
- @Manikandan-Thangaraj-ZS0321 made their first contribution in https://github.com/vllm-project/vllm/pull/7860
- @sumitd2 made their first contribution in https://github.com/vllm-project/vllm/pull/8026
- @alugowski made their first contribution in https://github.com/vllm-project/vllm/pull/8060
- @vladislavkruglikov made their first contribution in https://github.com/vllm-project/vllm/pull/8292
- @kevin314 made their first contribution in https://github.com/vllm-project/vllm/pull/8224
- @TKONIY made their first contribution in https://github.com/vllm-project/vllm/pull/7559
- @akx made their first contribution in https://github.com/vllm-project/vllm/pull/8322
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.6.0...v0.6.1
Assets
- vllm-0.6.1+cu118-cp310-cp310-manylinux1_x86_64.whl (161.98 MB)
- vllm-0.6.1+cu118-cp311-cp311-manylinux1_x86_64.whl (161.98 MB)
- vllm-0.6.1+cu118-cp312-cp312-manylinux1_x86_64.whl (161.98 MB)
- vllm-0.6.1+cu118-cp38-cp38-manylinux1_x86_64.whl (161.98 MB)
- vllm-0.6.1+cu118-cp39-cp39-manylinux1_x86_64.whl (161.98 MB)