v0.5.2
Release date: 2024-07-16 02:01:34
Latest vllm-project/vllm release: v0.6.1 (2024-09-12 05:44:44)
Major Changes
- ❗ Planned breaking change ❗: we plan to remove beam search (see more in #6226) in the next few releases. This release comes with a warning when beam search is enabled for a request (see the sketch after this list). Please voice your concerns in the RFC if you have a valid use case for beam search in vLLM.
- The release has moved to a Python-version-agnostic wheel (#6394): a single wheel can be installed across all Python versions that vLLM supports.
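To illustrate the beam search deprecation above, the snippet below is a minimal sketch of an offline request with beam search enabled, which should now log the deprecation warning. It assumes the 0.5.x `SamplingParams` fields `use_beam_search` and `best_of` (removed in later releases along with beam search itself); the model name is only a placeholder.

```python
# Minimal sketch (assumes vLLM v0.5.x): a request with beam search enabled,
# which now emits the deprecation warning described above.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

# Beam search in 0.5.x is requested via use_beam_search/best_of and
# requires greedy sampling (temperature=0.0) within the beam.
params = SamplingParams(
    use_beam_search=True,
    best_of=4,
    temperature=0.0,
    max_tokens=64,
)

outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```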
Highlights
Model Support
- Add PaliGemma (#5189), Fuyu-8B (#3924)
- Support for soft tuned prompts (#4645)
- A new guide for adding multi-modal plugins (#6205)
Hardware
- AMD: unify CUDA_VISIBLE_DEVICES usage (#6352)
Performance
- ZeroMQ fallback for broadcasting large objects (#6183)
- Simplify code to support pipeline parallel (#6406)
- Turn off CUTLASS scaled_mm for Ada Lovelace (#6384)
- Use CUTLASS kernels for the FP8 layers with Bias (#6270)
Features
- Enabling bonus token in speculative decoding for KV cache based models (#5765)
- Medusa Implementation with Top-1 proposer (#4978)
- An experimental vLLM CLI for serving and querying an OpenAI-compatible server (#5090); see the usage sketch after this list
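As a usage note for the experimental CLI above: once the OpenAI-compatible server is running (e.g. launched with the new `vllm serve <model>` entry point), it can be queried with any OpenAI client. The sketch below uses the `openai` Python package; the port, API key, and model name are placeholder assumptions, not values fixed by vLLM.

```python
# Minimal sketch: query a locally running vLLM OpenAI-compatible server
# (started separately, e.g. via the experimental `vllm serve` CLI).
from openai import OpenAI

# Assumed defaults: server on localhost:8000, no real API key required.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="facebook/opt-125m",  # placeholder: whichever model the server was launched with
    prompt="San Francisco is a",
    max_tokens=32,
)
print(completion.choices[0].text)
```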
Others
- Add support for multi-node on CI (#5955)
- Benchmark: add H100 suite (#6047)
- [CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy (#5362)
- Build some nightly wheels (#6380)
What's Changed
- Update wheel builds to strip debug by @simon-mo in https://github.com/vllm-project/vllm/pull/6161
- Fix release wheel build env var by @simon-mo in https://github.com/vllm-project/vllm/pull/6162
- Move release wheel env var to Dockerfile instead by @simon-mo in https://github.com/vllm-project/vllm/pull/6163
- [Doc] Reorganize Supported Models by Type by @ywang96 in https://github.com/vllm-project/vllm/pull/6167
- [Doc] Move guide for multimodal model and other improvements by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/6168
- [Model] Add PaliGemma by @ywang96 in https://github.com/vllm-project/vllm/pull/5189
- add benchmark for fix length input and output by @haichuan1221 in https://github.com/vllm-project/vllm/pull/5857
- [ Misc ] Support Fp8 via `llm-compressor` by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6110
- [misc][frontend] log all available endpoints by @youkaichao in https://github.com/vllm-project/vllm/pull/6195
- do not exclude `object` field in CompletionStreamResponse by @kczimm in https://github.com/vllm-project/vllm/pull/6196
- [Bugfix] Fix benchmark args for randomly sampled dataset by @haichuan1221 in https://github.com/vllm-project/vllm/pull/5947
- [Kernel] reloading fused_moe config on the last chunk by @avshalomman in https://github.com/vllm-project/vllm/pull/6210
- [Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) by @afeldman-nm in https://github.com/vllm-project/vllm/pull/4888
- [Bugfix] use diskcache in outlines _get_guide #5436 by @ericperfect in https://github.com/vllm-project/vllm/pull/6203
- [Bugfix] Mamba cache Cuda Graph padding by @tomeras91 in https://github.com/vllm-project/vllm/pull/6214
- Add FlashInfer to default Dockerfile by @simon-mo in https://github.com/vllm-project/vllm/pull/6172
- [hardware][cuda] use device id under CUDA_VISIBLE_DEVICES for get_device_capability by @youkaichao in https://github.com/vllm-project/vllm/pull/6216
- [core][distributed] fix ray worker rank assignment by @youkaichao in https://github.com/vllm-project/vllm/pull/6235
- [Bugfix][TPU] Add missing None to model input by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6245
- [Bugfix][TPU] Fix outlines installation in TPU Dockerfile by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6256
- Add support for multi-node on CI by @khluu in https://github.com/vllm-project/vllm/pull/5955
- [CORE] Adding support for insertion of soft-tuned prompts by @SwapnilDreams100 in https://github.com/vllm-project/vllm/pull/4645
- [Docs] Docs update for Pipeline Parallel by @andoorve in https://github.com/vllm-project/vllm/pull/6222
- [Bugfix]fix and needs_scalar_to_array logic check by @qibaoyuan in https://github.com/vllm-project/vllm/pull/6238
- [Speculative Decoding] Medusa Implementation with Top-1 proposer by @abhigoyal1997 in https://github.com/vllm-project/vllm/pull/4978
- [core][distributed] add zmq fallback for broadcasting large objects by @youkaichao in https://github.com/vllm-project/vllm/pull/6183
- [Bugfix][TPU] Add prompt adapter methods to TPUExecutor by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6279
- [Doc] Guide for adding multi-modal plugins by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/6205
- [Bugfix] Support 2D input shape in MoE layer by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6287
- [Bugfix] MLPSpeculator: Use ParallelLMHead in tie_weights=False case. by @tdoublep in https://github.com/vllm-project/vllm/pull/6303
- [CI/Build] Enable mypy typing for remaining folders by @bmuskalla in https://github.com/vllm-project/vllm/pull/6268
- [Bugfix] OpenVINOExecutor abstractmethod error by @park12sj in https://github.com/vllm-project/vllm/pull/6296
- [Speculative Decoding] Enabling bonus token in speculative decoding for KV cache based models by @sroy745 in https://github.com/vllm-project/vllm/pull/5765
- [Bugfix][Neuron] Fix soft prompt method error in NeuronExecutor by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6313
- [Doc] Remove comments incorrectly copied from another project by @daquexian in https://github.com/vllm-project/vllm/pull/6286
- [Doc] Update description of vLLM support for CPUs by @DamonFool in https://github.com/vllm-project/vllm/pull/6003
- [BugFix]: set outlines pkg version by @xiangyang-95 in https://github.com/vllm-project/vllm/pull/6262
- [Bugfix] Fix snapshot download in serving benchmark by @ywang96 in https://github.com/vllm-project/vllm/pull/6318
- [Misc] refactor(config): clean up unused code by @aniaan in https://github.com/vllm-project/vllm/pull/6320
- [BugFix]: fix engine timeout due to request abort by @pushan01 in https://github.com/vllm-project/vllm/pull/6255
- [Bugfix] GPTBigCodeForCausalLM: Remove lm_head from supported_lora_modules. by @tdoublep in https://github.com/vllm-project/vllm/pull/6326
- [BugFix] get_and_reset only when scheduler outputs are not empty by @mzusman in https://github.com/vllm-project/vllm/pull/6266
- [ Misc ] Refactor Marlin Python Utilities by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6082
- Benchmark: add H100 suite by @simon-mo in https://github.com/vllm-project/vllm/pull/6047
- [bug fix] Fix llava next feature size calculation. by @xwjiang2010 in https://github.com/vllm-project/vllm/pull/6339
- [doc] update pipeline parallel in readme by @youkaichao in https://github.com/vllm-project/vllm/pull/6347
- [CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy by @KuntaiDu in https://github.com/vllm-project/vllm/pull/5362
- [ BugFix ] Prompt Logprobs Detokenization by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6223
- [Misc] Remove flashinfer warning, add flashinfer tests to CI by @LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/6351
- [distributed][misc] keep consistent with how pytorch finds libcudart.so by @youkaichao in https://github.com/vllm-project/vllm/pull/6346
- [Bugfix] Fix usage stats logging exception warning with OpenVINO by @helena-intel in https://github.com/vllm-project/vllm/pull/6349
- [Model][Phi3-Small] Remove scipy from blocksparse_attention by @mgoin in https://github.com/vllm-project/vllm/pull/6343
- [CI/Build] (2/2) Switching AMD CI to store images in Docker Hub by @adityagoel14 in https://github.com/vllm-project/vllm/pull/6350
- [ROCm][AMD][Bugfix] unify CUDA_VISIBLE_DEVICES usage in vllm to get device count and fixed navi3x by @hongxiayang in https://github.com/vllm-project/vllm/pull/6352
- [ Misc ] Remove separate bias add by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6353
- [Misc][Bugfix] Update transformers for tokenizer issue by @ywang96 in https://github.com/vllm-project/vllm/pull/6364
- [ Misc ] Support Models With Bias in `compressed-tensors` integration by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6356
- [Bugfix] Fix dtype mismatch in PaliGemma by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/6367
- [Build/CI] Checking/Waiting for the GPU's clean state by @Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/6379
- [Misc] add fixture to guided processor tests by @kevinbu233 in https://github.com/vllm-project/vllm/pull/6341
- [ci] Add grouped tests & mark tests to run by default for fastcheck pipeline by @khluu in https://github.com/vllm-project/vllm/pull/6365
- [ci] Add GHA workflows to enable full CI run by @khluu in https://github.com/vllm-project/vllm/pull/6381
- [MISC] Upgrade dependency to PyTorch 2.3.1 by @comaniac in https://github.com/vllm-project/vllm/pull/5327
- Build some nightly wheels by default by @simon-mo in https://github.com/vllm-project/vllm/pull/6380
- Fix release-pipeline.yaml by @simon-mo in https://github.com/vllm-project/vllm/pull/6388
- Fix interpolation in release pipeline by @simon-mo in https://github.com/vllm-project/vllm/pull/6389
- Fix release pipeline's -e flag by @simon-mo in https://github.com/vllm-project/vllm/pull/6390
- [Bugfix] Fix illegal memory access in FP8 MoE kernel by @comaniac in https://github.com/vllm-project/vllm/pull/6382
- [Misc] Add generated git commit hash as `vllm.__commit__` by @mgoin in https://github.com/vllm-project/vllm/pull/6386
- Fix release pipeline's dir permission by @simon-mo in https://github.com/vllm-project/vllm/pull/6391
- [Bugfix][TPU] Fix megacore setting for v5e-litepod by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6397
- [ci] Fix wording for GH bot by @khluu in https://github.com/vllm-project/vllm/pull/6398
- [Doc] Fix Typo in Doc by @esaliya in https://github.com/vllm-project/vllm/pull/6392
- [Bugfix] Fix hard-coded value of x in context_attention_fwd by @tdoublep in https://github.com/vllm-project/vllm/pull/6373
- [Docs] Clean up latest news by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6401
- [ci] try to add multi-node tests by @youkaichao in https://github.com/vllm-project/vllm/pull/6280
- Updating LM Format Enforcer version to v10.3 by @noamgat in https://github.com/vllm-project/vllm/pull/6411
- [ Misc ] More Cleanup of Marlin by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6359
- [Misc] Add deprecation warning for beam search by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6402
- [ Misc ] Apply MoE Refactor to Qwen2 + Deepseekv2 To Support Fp8 by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6417
- [Model] Initialize Fuyu-8B support by @Isotr0py in https://github.com/vllm-project/vllm/pull/3924
- Remove unnecessary trailing period in spec_decode.rst by @terrytangyuan in https://github.com/vllm-project/vllm/pull/6405
- [Kernel] Turn off CUTLASS scaled_mm for Ada Lovelace by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/6384
- [ci][build] fix commit id by @youkaichao in https://github.com/vllm-project/vllm/pull/6420
- [ Misc ] Enable Quantizing All Layers of DeekSeekv2 by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6423
- [Feature] vLLM CLI for serving and querying OpenAI compatible server by @EthanqX in https://github.com/vllm-project/vllm/pull/5090
- [Doc] xpu backend requires running setvars.sh by @rscohn2 in https://github.com/vllm-project/vllm/pull/6393
- [CI/Build] Cross python wheel by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6394
- [Bugfix] Benchmark serving script used global parameter 'args' in function 'sample_random_requests' by @lxline in https://github.com/vllm-project/vllm/pull/6428
- Report usage for beam search by @simon-mo in https://github.com/vllm-project/vllm/pull/6404
- Add FUNDING.yml by @simon-mo in https://github.com/vllm-project/vllm/pull/6435
- [BugFix] BatchResponseData body should be optional by @zifeitong in https://github.com/vllm-project/vllm/pull/6345
- [Doc] add env docs for flashinfer backend by @DefTruth in https://github.com/vllm-project/vllm/pull/6437
- [core][distributed] simplify code to support pipeline parallel by @youkaichao in https://github.com/vllm-project/vllm/pull/6406
- [Bugfix] Convert image to RGB by default by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/6430
- [doc][misc] doc update by @youkaichao in https://github.com/vllm-project/vllm/pull/6439
- [VLM] Minor space optimization for `ClipVisionModel` by @ywang96 in https://github.com/vllm-project/vllm/pull/6436
- [doc][distributed] add suggestion for distributed inference by @youkaichao in https://github.com/vllm-project/vllm/pull/6418
- [Kernel] Use CUTLASS kernels for the FP8 layers with Bias by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/6270
- [Misc] Use 0.0.9 version for flashinfer by @Pernekhan in https://github.com/vllm-project/vllm/pull/6447
- [Bugfix] Add custom Triton cache manager to resolve MoE MP issue by @tdoublep in https://github.com/vllm-project/vllm/pull/6140
- [Bugfix] use float32 precision in samplers/test_logprobs.py for comparing with HF by @tdoublep in https://github.com/vllm-project/vllm/pull/6409
- bump version to v0.5.2 by @simon-mo in https://github.com/vllm-project/vllm/pull/6433
- [misc][distributed] fix pp missing layer condition by @youkaichao in https://github.com/vllm-project/vllm/pull/6446
New Contributors
- @haichuan1221 made their first contribution in https://github.com/vllm-project/vllm/pull/5857
- @kczimm made their first contribution in https://github.com/vllm-project/vllm/pull/6196
- @ericperfect made their first contribution in https://github.com/vllm-project/vllm/pull/6203
- @qibaoyuan made their first contribution in https://github.com/vllm-project/vllm/pull/6238
- @abhigoyal1997 made their first contribution in https://github.com/vllm-project/vllm/pull/4978
- @bmuskalla made their first contribution in https://github.com/vllm-project/vllm/pull/6268
- @park12sj made their first contribution in https://github.com/vllm-project/vllm/pull/6296
- @daquexian made their first contribution in https://github.com/vllm-project/vllm/pull/6286
- @xiangyang-95 made their first contribution in https://github.com/vllm-project/vllm/pull/6262
- @aniaan made their first contribution in https://github.com/vllm-project/vllm/pull/6320
- @pushan01 made their first contribution in https://github.com/vllm-project/vllm/pull/6255
- @helena-intel made their first contribution in https://github.com/vllm-project/vllm/pull/6349
- @adityagoel14 made their first contribution in https://github.com/vllm-project/vllm/pull/6350
- @kevinbu233 made their first contribution in https://github.com/vllm-project/vllm/pull/6341
- @esaliya made their first contribution in https://github.com/vllm-project/vllm/pull/6392
- @EthanqX made their first contribution in https://github.com/vllm-project/vllm/pull/5090
- @rscohn2 made their first contribution in https://github.com/vllm-project/vllm/pull/6393
- @lxline made their first contribution in https://github.com/vllm-project/vllm/pull/6428
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.5.1...v0.5.2
1. vllm-0.5.2+cu118-cp310-cp310-manylinux1_x86_64.whl (140.59 MB)
2. vllm-0.5.2+cu118-cp311-cp311-manylinux1_x86_64.whl (140.59 MB)
3. vllm-0.5.2+cu118-cp38-cp38-manylinux1_x86_64.whl (140.59 MB)
4. vllm-0.5.2+cu118-cp39-cp39-manylinux1_x86_64.whl (140.59 MB)
5. vllm-0.5.2-cp310-cp310-manylinux1_x86_64.whl (140.14 MB)
6. vllm-0.5.2-cp311-cp311-manylinux1_x86_64.whl (140.14 MB)
7. vllm-0.5.2-cp38-cp38-manylinux1_x86_64.whl (140.14 MB)
8. vllm-0.5.2-cp39-cp39-manylinux1_x86_64.whl (140.14 MB)