v0.5.2
Release date: 2024-07-16 02:01:34
Latest vllm-project/vllm release: v0.6.1 (2024-09-12 05:44:44)
Major Changes
- ❗ Planned breaking change ❗: we plan to remove beam search (see more in #6226) in the next few releases. This release comes with a warning when beam search is enabled for a request (see the sketch after this list). Please voice your concerns in the RFC if you have a valid use case for beam search in vLLM.
- The release has moved to a Python-version-agnostic wheel (#6394): a single wheel can be installed across all Python versions that vLLM supports.
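To illustrate the beam search deprecation above, the snippet below is a minimal sketch of an offline request with beam search enabled, which should now log the deprecation warning. It assumes the 0.5.x `SamplingParams` fields `use_beam_search` and `best_of` (removed in later releases along with beam search itself); the model name is only a placeholder.

```python
# Minimal sketch (assumes vLLM v0.5.x): a request with beam search enabled,
# which now emits the deprecation warning described above.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

# Beam search in 0.5.x is requested via use_beam_search/best_of and
# requires greedy sampling (temperature=0.0) within the beam.
params = SamplingParams(
    use_beam_search=True,
    best_of=4,
    temperature=0.0,
    max_tokens=64,
)

outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```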
Highlights
Model Support
- Add PaliGemma (#5189), Fuyu-8B (#3924)
- Support for soft tuned prompts (#4645)
- A new guide for adding multi-modal plugins (#6205)
Hardware
- AMD: unify CUDA_VISIBLE_DEVICES usage (#6352)
Performance
- ZeroMQ fallback for broadcasting large objects (#6183)
- Simplify code to support pipeline parallel (#6406)
- Turn off CUTLASS scaled_mm for Ada Lovelace (#6384)
- Use CUTLASS kernels for the FP8 layers with Bias (#6270)
Features
- Enabling bonus token in speculative decoding for KV cache based models (#5765)
- Medusa Implementation with Top-1 proposer (#4978)
- An experimental vLLM CLI for serving and querying an OpenAI-compatible server (#5090); see the usage sketch after this list
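As a usage note for the experimental CLI above: once the OpenAI-compatible server is running (e.g. launched with the new `vllm serve <model>` entry point), it can be queried with any OpenAI client. The sketch below uses the `openai` Python package; the port, API key, and model name are placeholder assumptions, not values fixed by vLLM.

```python
# Minimal sketch: query a locally running vLLM OpenAI-compatible server
# (started separately, e.g. via the experimental `vllm serve` CLI).
from openai import OpenAI

# Assumed defaults: server on localhost:8000, no real API key required.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="facebook/opt-125m",  # placeholder: whichever model the server was launched with
    prompt="San Francisco is a",
    max_tokens=32,
)
print(completion.choices[0].text)
```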
Others
- Add support for multi-node on CI (#5955)
- Benchmark: add H100 suite (#6047)
- [CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy (#5362)
- Build some nightly wheels (#6380)
What's Changed
- Update wheel builds to strip debug by @simon-mo in https://github.com/vllm-project/vllm/pull/6161
- Fix release wheel build env var by @simon-mo in https://github.com/vllm-project/vllm/pull/6162
- Move release wheel env var to Dockerfile instead by @simon-mo in https://github.com/vllm-project/vllm/pull/6163
- [Doc] Reorganize Supported Models by Type by @ywang96 in https://github.com/vllm-project/vllm/pull/6167
- [Doc] Move guide for multimodal model and other improvements by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/6168
- [Model] Add PaliGemma by @ywang96 in https://github.com/vllm-project/vllm/pull/5189
- add benchmark for fix length input and output by @haichuan1221 in https://github.com/vllm-project/vllm/pull/5857
- [ Misc ] Support Fp8 via `llm-compressor` by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6110
- [misc][frontend] log all available endpoints by @youkaichao in https://github.com/vllm-project/vllm/pull/6195
- do not exclude `object` field in CompletionStreamResponse by @kczimm in https://github.com/vllm-project/vllm/pull/6196
- [Bugfix] Fix benchmark args for randomly sampled dataset by @haichuan1221 in https://github.com/vllm-project/vllm/pull/5947
- [Kernel] reloading fused_moe config on the last chunk by @avshalomman in https://github.com/vllm-project/vllm/pull/6210
- [Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) by @afeldman-nm in https://github.com/vllm-project/vllm/pull/4888
- [Bugfix] use diskcache in outlines _get_guide #5436 by @ericperfect in https://github.com/vllm-project/vllm/pull/6203
- [Bugfix] Mamba cache Cuda Graph padding by @tomeras91 in https://github.com/vllm-project/vllm/pull/6214
- Add FlashInfer to default Dockerfile by @simon-mo in https://github.com/vllm-project/vllm/pull/6172
- [hardware][cuda] use device id under CUDA_VISIBLE_DEVICES for get_device_capability by @youkaichao in https://github.com/vllm-project/vllm/pull/6216
- [core][distributed] fix ray worker rank assignment by @youkaichao in https://github.com/vllm-project/vllm/pull/6235
- [Bugfix][TPU] Add missing None to model input by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6245
- [Bugfix][TPU] Fix outlines installation in TPU Dockerfile by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6256
- Add support for multi-node on CI by @khluu in https://github.com/vllm-project/vllm/pull/5955
- [CORE] Adding support for insertion of soft-tuned prompts by @SwapnilDreams100 in https://github.com/vllm-project/vllm/pull/4645
- [Docs] Docs update for Pipeline Parallel by @andoorve in https://github.com/vllm-project/vllm/pull/6222
- [Bugfix]fix and needs_scalar_to_array logic check by @qibaoyuan in https://github.com/vllm-project/vllm/pull/6238
- [Speculative Decoding] Medusa Implementation with Top-1 proposer by @abhigoyal1997 in https://github.com/vllm-project/vllm/pull/4978
- [core][distributed] add zmq fallback for broadcasting large objects by @youkaichao in https://github.com/vllm-project/vllm/pull/6183
- [Bugfix][TPU] Add prompt adapter methods to TPUExecutor by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6279
- [Doc] Guide for adding multi-modal plugins by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/6205
- [Bugfix] Support 2D input shape in MoE layer by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6287
- [Bugfix] MLPSpeculator: Use ParallelLMHead in tie_weights=False case. by @tdoublep in https://github.com/vllm-project/vllm/pull/6303
- [CI/Build] Enable mypy typing for remaining folders by @bmuskalla in https://github.com/vllm-project/vllm/pull/6268
- [Bugfix] OpenVINOExecutor abstractmethod error by @park12sj in https://github.com/vllm-project/vllm/pull/6296
- [Speculative Decoding] Enabling bonus token in speculative decoding for KV cache based models by @sroy745 in https://github.com/vllm-project/vllm/pull/5765
- [Bugfix][Neuron] Fix soft prompt method error in NeuronExecutor by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6313
- [Doc] Remove comments incorrectly copied from another project by @daquexian in https://github.com/vllm-project/vllm/pull/6286
- [Doc] Update description of vLLM support for CPUs by @DamonFool in https://github.com/vllm-project/vllm/pull/6003
- [BugFix]: set outlines pkg version by @xiangyang-95 in https://github.com/vllm-project/vllm/pull/6262
- [Bugfix] Fix snapshot download in serving benchmark by @ywang96 in https://github.com/vllm-project/vllm/pull/6318
- [Misc] refactor(config): clean up unused code by @aniaan in https://github.com/vllm-project/vllm/pull/6320
- [BugFix]: fix engine timeout due to request abort by @pushan01 in https://github.com/vllm-project/vllm/pull/6255
- [Bugfix] GPTBigCodeForCausalLM: Remove lm_head from supported_lora_modules. by @tdoublep in https://github.com/vllm-project/vllm/pull/6326
- [BugFix] get_and_reset only when scheduler outputs are not empty by @mzusman in https://github.com/vllm-project/vllm/pull/6266
- [ Misc ] Refactor Marlin Python Utilities by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6082
- Benchmark: add H100 suite by @simon-mo in https://github.com/vllm-project/vllm/pull/6047
- [bug fix] Fix llava next feature size calculation. by @xwjiang2010 in https://github.com/vllm-project/vllm/pull/6339
- [doc] update pipeline parallel in readme by @youkaichao in https://github.com/vllm-project/vllm/pull/6347
- [CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy by @KuntaiDu in https://github.com/vllm-project/vllm/pull/5362
- [ BugFix ] Prompt Logprobs Detokenization by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6223
- [Misc] Remove flashinfer warning, add flashinfer tests to CI by @LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/6351
- [distributed][misc] keep consistent with how pytorch finds libcudart.so by @youkaichao in https://github.com/vllm-project/vllm/pull/6346
- [Bugfix] Fix usage stats logging exception warning with OpenVINO by @helena-intel in https://github.com/vllm-project/vllm/pull/6349
- [Model][Phi3-Small] Remove scipy from blocksparse_attention by @mgoin in https://github.com/vllm-project/vllm/pull/6343
- [CI/Build] (2/2) Switching AMD CI to store images in Docker Hub by @adityagoel14 in https://github.com/vllm-project/vllm/pull/6350
- [ROCm][AMD][Bugfix] unify CUDA_VISIBLE_DEVICES usage in vllm to get device count and fixed navi3x by @hongxiayang in https://github.com/vllm-project/vllm/pull/6352
- [ Misc ] Remove separate bias add by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6353
- [Misc][Bugfix] Update transformers for tokenizer issue by @ywang96 in https://github.com/vllm-project/vllm/pull/6364
- [ Misc ] Support Models With Bias in `compressed-tensors` integration by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6356
- [Bugfix] Fix dtype mismatch in PaliGemma by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/6367
- [Build/CI] Checking/Waiting for the GPU's clean state by @Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/6379
- [Misc] add fixture to guided processor tests by @kevinbu233 in https://github.com/vllm-project/vllm/pull/6341
- [ci] Add grouped tests & mark tests to run by default for fastcheck pipeline by @khluu in https://github.com/vllm-project/vllm/pull/6365
- [ci] Add GHA workflows to enable full CI run by @khluu in https://github.com/vllm-project/vllm/pull/6381
- [MISC] Upgrade dependency to PyTorch 2.3.1 by @comaniac in https://github.com/vllm-project/vllm/pull/5327
- Build some nightly wheels by default by @simon-mo in https://github.com/vllm-project/vllm/pull/6380
- Fix release-pipeline.yaml by @simon-mo in https://github.com/vllm-project/vllm/pull/6388
- Fix interpolation in release pipeline by @simon-mo in https://github.com/vllm-project/vllm/pull/6389
- Fix release pipeline's -e flag by @simon-mo in https://github.com/vllm-project/vllm/pull/6390
- [Bugfix] Fix illegal memory access in FP8 MoE kernel by @comaniac in https://github.com/vllm-project/vllm/pull/6382
- [Misc] Add generated git commit hash as `vllm.__commit__` by @mgoin in https://github.com/vllm-project/vllm/pull/6386
- Fix release pipeline's dir permission by @simon-mo in https://github.com/vllm-project/vllm/pull/6391
- [Bugfix][TPU] Fix megacore setting for v5e-litepod by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6397
- [ci] Fix wording for GH bot by @khluu in https://github.com/vllm-project/vllm/pull/6398
- [Doc] Fix Typo in Doc by @esaliya in https://github.com/vllm-project/vllm/pull/6392
- [Bugfix] Fix hard-coded value of x in context_attention_fwd by @tdoublep in https://github.com/vllm-project/vllm/pull/6373
- [Docs] Clean up latest news by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6401
- [ci] try to add multi-node tests by @youkaichao in https://github.com/vllm-project/vllm/pull/6280
- Updating LM Format Enforcer version to v10.3 by @noamgat in https://github.com/vllm-project/vllm/pull/6411
- [ Misc ] More Cleanup of Marlin by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6359
- [Misc] Add deprecation warning for beam search by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6402
- [ Misc ] Apply MoE Refactor to Qwen2 + Deepseekv2 To Support Fp8 by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6417
- [Model] Initialize Fuyu-8B support by @Isotr0py in https://github.com/vllm-project/vllm/pull/3924
- Remove unnecessary trailing period in spec_decode.rst by @terrytangyuan in https://github.com/vllm-project/vllm/pull/6405
- [Kernel] Turn off CUTLASS scaled_mm for Ada Lovelace by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/6384
- [ci][build] fix commit id by @youkaichao in https://github.com/vllm-project/vllm/pull/6420
- [ Misc ] Enable Quantizing All Layers of DeekSeekv2 by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6423
- [Feature] vLLM CLI for serving and querying OpenAI compatible server by @EthanqX in https://github.com/vllm-project/vllm/pull/5090
- [Doc] xpu backend requires running setvars.sh by @rscohn2 in https://github.com/vllm-project/vllm/pull/6393
- [CI/Build] Cross python wheel by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6394
- [Bugfix] Benchmark serving script used global parameter 'args' in function 'sample_random_requests' by @lxline in https://github.com/vllm-project/vllm/pull/6428
- Report usage for beam search by @simon-mo in https://github.com/vllm-project/vllm/pull/6404
- Add FUNDING.yml by @simon-mo in https://github.com/vllm-project/vllm/pull/6435
- [BugFix] BatchResponseData body should be optional by @zifeitong in https://github.com/vllm-project/vllm/pull/6345
- [Doc] add env docs for flashinfer backend by @DefTruth in https://github.com/vllm-project/vllm/pull/6437
- [core][distributed] simplify code to support pipeline parallel by @youkaichao in https://github.com/vllm-project/vllm/pull/6406
- [Bugfix] Convert image to RGB by default by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/6430
- [doc][misc] doc update by @youkaichao in https://github.com/vllm-project/vllm/pull/6439
- [VLM] Minor space optimization for `ClipVisionModel` by @ywang96 in https://github.com/vllm-project/vllm/pull/6436
- [doc][distributed] add suggestion for distributed inference by @youkaichao in https://github.com/vllm-project/vllm/pull/6418
- [Kernel] Use CUTLASS kernels for the FP8 layers with Bias by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/6270
- [Misc] Use 0.0.9 version for flashinfer by @Pernekhan in https://github.com/vllm-project/vllm/pull/6447
- [Bugfix] Add custom Triton cache manager to resolve MoE MP issue by @tdoublep in https://github.com/vllm-project/vllm/pull/6140
- [Bugfix] use float32 precision in samplers/test_logprobs.py for comparing with HF by @tdoublep in https://github.com/vllm-project/vllm/pull/6409
- bump version to v0.5.2 by @simon-mo in https://github.com/vllm-project/vllm/pull/6433
- [misc][distributed] fix pp missing layer condition by @youkaichao in https://github.com/vllm-project/vllm/pull/6446
New Contributors
- @haichuan1221 made their first contribution in https://github.com/vllm-project/vllm/pull/5857
- @kczimm made their first contribution in https://github.com/vllm-project/vllm/pull/6196
- @ericperfect made their first contribution in https://github.com/vllm-project/vllm/pull/6203
- @qibaoyuan made their first contribution in https://github.com/vllm-project/vllm/pull/6238
- @abhigoyal1997 made their first contribution in https://github.com/vllm-project/vllm/pull/4978
- @bmuskalla made their first contribution in https://github.com/vllm-project/vllm/pull/6268
- @park12sj made their first contribution in https://github.com/vllm-project/vllm/pull/6296
- @daquexian made their first contribution in https://github.com/vllm-project/vllm/pull/6286
- @xiangyang-95 made their first contribution in https://github.com/vllm-project/vllm/pull/6262
- @aniaan made their first contribution in https://github.com/vllm-project/vllm/pull/6320
- @pushan01 made their first contribution in https://github.com/vllm-project/vllm/pull/6255
- @helena-intel made their first contribution in https://github.com/vllm-project/vllm/pull/6349
- @adityagoel14 made their first contribution in https://github.com/vllm-project/vllm/pull/6350
- @kevinbu233 made their first contribution in https://github.com/vllm-project/vllm/pull/6341
- @esaliya made their first contribution in https://github.com/vllm-project/vllm/pull/6392
- @EthanqX made their first contribution in https://github.com/vllm-project/vllm/pull/5090
- @rscohn2 made their first contribution in https://github.com/vllm-project/vllm/pull/6393
- @lxline made their first contribution in https://github.com/vllm-project/vllm/pull/6428
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.5.1...v0.5.2
1. vllm-0.5.2+cu118-cp310-cp310-manylinux1_x86_64.whl (140.59 MB)
2. vllm-0.5.2+cu118-cp311-cp311-manylinux1_x86_64.whl (140.59 MB)
3. vllm-0.5.2+cu118-cp38-cp38-manylinux1_x86_64.whl (140.59 MB)
4. vllm-0.5.2+cu118-cp39-cp39-manylinux1_x86_64.whl (140.59 MB)
5. vllm-0.5.2-cp310-cp310-manylinux1_x86_64.whl (140.14 MB)
6. vllm-0.5.2-cp311-cp311-manylinux1_x86_64.whl (140.14 MB)
7. vllm-0.5.2-cp38-cp38-manylinux1_x86_64.whl (140.14 MB)
8. vllm-0.5.2-cp39-cp39-manylinux1_x86_64.whl (140.14 MB)