v0.5.5
Release date: 2024-08-24 02:37:46
Latest vllm-project/vllm release: v0.6.1 (2024-09-12 05:44:44)
Highlights
Performance Update
- We introduced a new mode that schedules multiple GPU steps in advance, reducing CPU overhead (#7000, #7387, #7452, #7703). Initial results show a 20% improvement in QPS for a single GPU running 8B and 30B models. You can pass `--num-scheduler-steps 8` to the API server (via `vllm serve`) or to `AsyncLLMEngine`; see the hedged sketch after this list. We are working on expanding coverage to the `LLM` class and aim to turn it on by default.
- Various enhancements:
- Use flashinfer sampling kernel when available, leading to 7% decoding throughput speedup (#7137)
- Reduce Python allocations, leading to 24% throughput speedup (#7162, #7364)
- Improvements to the zeromq based decoupled frontend (#7570, #7716, #7484)
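A minimal sketch of enabling multi-step scheduling programmatically is shown below. The `--num-scheduler-steps 8` flag for `vllm serve` is taken directly from the notes above; the assumption here is that the same option is exposed as the `num_scheduler_steps` field of `AsyncEngineArgs`, and the model name is only a placeholder.

```python
# Hedged sketch: multi-step scheduling on the async engine.
# Assumes the --num-scheduler-steps CLI flag maps to the
# num_scheduler_steps engine argument; the model name is a placeholder.
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # any ~8B model
    num_scheduler_steps=8,  # schedule 8 GPU steps ahead to reduce CPU overhead
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```

The server-side equivalent is simply passing `--num-scheduler-steps 8` to `vllm serve`.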
Model Support
- Support Jamba 1.5 (#7415, #7601, #6739)
- Support for the first audio model, `UltravoxModel` (#7615, #7446)
- Improvements to vision models:
- Support image embeddings as input (#6613)
- Support SigLIP encoder and alternative decoders for LLaVA models (#7153)
- Support loading GGUF models (#5191) with tensor parallelism (#7520); see the sketch after this list
- Progress on encoder/decoder models: support for serving encoder/decoder models (#7258) and architecture for cross-attention (#4942)
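Following up on the GGUF item above, here is a hedged sketch of loading a GGUF checkpoint with tensor parallelism. The file path, tokenizer name, and GPU count are placeholders, and pairing the GGUF file with its original Hugging Face tokenizer is an assumption rather than something stated in these notes.

```python
# Hedged sketch: a GGUF checkpoint sharded across 2 GPUs (#5191, #7520).
# All paths and names below are placeholders, not values from the release notes.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # local GGUF file (hypothetical)
    tokenizer="meta-llama/Meta-Llama-3.1-8B-Instruct",  # matching HF tokenizer (assumption)
    tensor_parallel_size=2,  # shard the quantized weights across 2 GPUs
)
out = llm.generate("The capital of France is", SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```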
Hardware Support
- AMD: Add fp8 Linear Layer for rocm (#7210)
- Enhancements to TPU support: load time W8A16 quantization (#7005), optimized rope (#7635), and support multi-host inference (#7457).
- Intel: various refactoring for worker, executor, and model runner (#7686, #7712)
Others
- Optimize prefix caching performance (#7193)
- Speculative decoding
- Use target model max length as default for draft model (#7706)
- EAGLE Implementation with Top-1 proposer (#6830)
- Entrypoints
- A new `chat` method in the `LLM` class (#5049); a hedged sketch appears at the end of the Highlights
- Support embeddings in the run_batch API (#7132)
- Support `prompt_logprobs` in Chat Completion (#7453)
- Quantizations
- Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)
- Machete - Hopper Optimized Mixed Precision Linear Kernel (#7174)
- `torch.compile`: register custom ops for kernels (#7591, #7594, #7536)
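Closing out the Highlights, below is a hedged sketch of the new `LLM.chat` entrypoint referenced above (#5049). The messages follow the familiar OpenAI-style role/content schema; the model name and sampling settings are placeholders.

```python
# Hedged sketch of the offline chat entrypoint added in #5049.
# Model name and sampling settings are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What does multi-step scheduling change?"},
]
outputs = llm.chat(messages, SamplingParams(temperature=0.7, max_tokens=128))
print(outputs[0].outputs[0].text)
```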
What's Changed
- [ci][frontend] deduplicate tests by @youkaichao in https://github.com/vllm-project/vllm/pull/7101
- [Doc] [SpecDecode] Update MLPSpeculator documentation by @tdoublep in https://github.com/vllm-project/vllm/pull/7100
- [Bugfix] Specify device when loading LoRA and embedding tensors by @jischein in https://github.com/vllm-project/vllm/pull/7129
- [MISC] Use non-blocking transfer in prepare_input by @comaniac in https://github.com/vllm-project/vllm/pull/7172
- [Core] Support loading GGUF model by @Isotr0py in https://github.com/vllm-project/vllm/pull/5191
- [Build] Add initial conditional testing spec by @simon-mo in https://github.com/vllm-project/vllm/pull/6841
- [LoRA] Relax LoRA condition by @jeejeelee in https://github.com/vllm-project/vllm/pull/7146
- [Model] Support SigLIP encoder and alternative decoders for LLaVA models by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/7153
- [BugFix] Fix DeepSeek remote code by @dsikka in https://github.com/vllm-project/vllm/pull/7178
- [ BugFix ] Fix ZMQ when `VLLM_PORT` is set by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/7205
- [Bugfix] add gguf dependency by @kpapis in https://github.com/vllm-project/vllm/pull/7198
- [SpecDecode] [Minor] Fix spec decode sampler tests by @LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/7183
- [Kernel] Add per-tensor and per-token AZP epilogues by @ProExpertProg in https://github.com/vllm-project/vllm/pull/5941
- [Core] Optimize evictor-v2 performance by @xiaobochen123 in https://github.com/vllm-project/vllm/pull/7193
- [Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) by @afeldman-nm in https://github.com/vllm-project/vllm/pull/4942
- [Bugfix] Fix GPTQ and GPTQ Marlin CPU Offloading by @mgoin in https://github.com/vllm-project/vllm/pull/7225
- [BugFix] Overhaul async request cancellation by @njhill in https://github.com/vllm-project/vllm/pull/7111
- [Doc] Mock new dependencies for documentation by @ywang96 in https://github.com/vllm-project/vllm/pull/7245
- [BUGFIX]: top_k is expected to be an integer. by @Atllkks10 in https://github.com/vllm-project/vllm/pull/7227
- [Frontend] Gracefully handle missing chat template and fix CI failure by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/7238
- [distributed][misc] add specialized method for cuda platform by @youkaichao in https://github.com/vllm-project/vllm/pull/7249
- [Misc] Refactor linear layer weight loading; introduce `BasevLLMParameter` and `weight_loader_v2` by @dsikka in https://github.com/vllm-project/vllm/pull/5874
- [ BugFix ] Move `zmq` frontend to IPC instead of TCP by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/7222
- Fixes typo in function name by @rafvasq in https://github.com/vllm-project/vllm/pull/7275
- [Bugfix] Fix input processor for InternVL2 model by @Isotr0py in https://github.com/vllm-project/vllm/pull/7164
- [OpenVINO] migrate to latest dependencies versions by @ilya-lavrenov in https://github.com/vllm-project/vllm/pull/7251
- [Doc] add online speculative decoding example by @stas00 in https://github.com/vllm-project/vllm/pull/7243
- [BugFix] Fix frontend multiprocessing hang by @maxdebayser in https://github.com/vllm-project/vllm/pull/7217
- [Bugfix][FP8] Fix dynamic FP8 Marlin quantization by @mgoin in https://github.com/vllm-project/vllm/pull/7219
- [ci] Make building wheels per commit optional by @khluu in https://github.com/vllm-project/vllm/pull/7278
- [Bugfix] Fix gptq failure on T4s by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/7264
- [FrontEnd] Make `merge_async_iterators` `is_cancelled` arg optional by @njhill in https://github.com/vllm-project/vllm/pull/7282
- [Doc] Update supported_hardware.rst by @mgoin in https://github.com/vllm-project/vllm/pull/7276
- [Kernel] Fix Flashinfer Correctness by @LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/7284
- [Misc] Fix typos in scheduler.py by @ruisearch42 in https://github.com/vllm-project/vllm/pull/7285
- [Frontend] remove max_num_batched_tokens limit for lora by @NiuBlibing in https://github.com/vllm-project/vllm/pull/7288
- [Bugfix] Fix LoRA with PP by @andoorve in https://github.com/vllm-project/vllm/pull/7292
- [Model] Rename MiniCPMVQwen2 to MiniCPMV2.6 by @jeejeelee in https://github.com/vllm-project/vllm/pull/7273
- [Bugfix][Kernel] Increased atol to fix failing tests by @ProExpertProg in https://github.com/vllm-project/vllm/pull/7305
- [Frontend] Kill the server on engine death by @joerunde in https://github.com/vllm-project/vllm/pull/6594
- [Bugfix][fast] Fix the get_num_blocks_touched logic by @zachzzc in https://github.com/vllm-project/vllm/pull/6849
- [Doc] Put collect_env issue output in a block by @mgoin in https://github.com/vllm-project/vllm/pull/7310
- [CI/Build] Dockerfile.cpu improvements by @dtrifiro in https://github.com/vllm-project/vllm/pull/7298
- [Bugfix] Fix new Llama3.1 GGUF model loading by @Isotr0py in https://github.com/vllm-project/vllm/pull/7269
- [Misc] Temporarily resolve the error of BitAndBytes by @jeejeelee in https://github.com/vllm-project/vllm/pull/7308
- Add Skywork AI as Sponsor by @simon-mo in https://github.com/vllm-project/vllm/pull/7314
- [TPU] Add Load-time W8A16 quantization for TPU Backend by @lsy323 in https://github.com/vllm-project/vllm/pull/7005
- [Core] Support serving encoder/decoder models by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/7258
- [TPU] Fix dockerfile.tpu by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7331
- [Performance] Optimize e2e overheads: Reduce python allocations by @alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/7162
- [Bugfix] Fix speculative decoding with MLPSpeculator with padded vocabulary by @tjohnson31415 in https://github.com/vllm-project/vllm/pull/7218
- [Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace by @SolitaryThinker in https://github.com/vllm-project/vllm/pull/6971
- [Core] Streamline stream termination in `AsyncLLMEngine` by @njhill in https://github.com/vllm-project/vllm/pull/7336
- [Model][Jamba] Mamba cache single buffer by @mzusman in https://github.com/vllm-project/vllm/pull/6739
- [VLM][Doc] Add `stop_token_ids` to InternVL example by @Isotr0py in https://github.com/vllm-project/vllm/pull/7354
- [Performance] e2e overheads reduction: Small followup diff by @alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/7364
- [Bugfix] Fix reinit procedure in ModelInputForGPUBuilder by @alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/7360
- [Frontend] Support embeddings in the run_batch API by @pooyadavoodi in https://github.com/vllm-project/vllm/pull/7132
- [Bugfix] Fix ITL recording in serving benchmark by @ywang96 in https://github.com/vllm-project/vllm/pull/7372
- [Core] Add span metrics for model_forward, scheduler and sampler time by @sfc-gh-mkeralapura in https://github.com/vllm-project/vllm/pull/7089
- [Bugfix] Fix `PerTensorScaleParameter` weight loading for fused models by @dsikka in https://github.com/vllm-project/vllm/pull/7376
- [Misc] Add numpy implementation of `compute_slot_mapping` by @Yard1 in https://github.com/vllm-project/vllm/pull/7377
- [Core] Fix edge case in chunked prefill + block manager v2 by @cadedaniel in https://github.com/vllm-project/vllm/pull/7380
- [Bugfix] Fix phi3v batch inference when images have different aspect ratio by @Isotr0py in https://github.com/vllm-project/vllm/pull/7392
- [TPU] Use mark_dynamic to reduce compilation time by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7340
- Updating LM Format Enforcer version to v0.10.6 by @noamgat in https://github.com/vllm-project/vllm/pull/7189
- [core] [2/N] refactor worker_base input preparation for multi-step by @SolitaryThinker in https://github.com/vllm-project/vllm/pull/7387
- [CI/Build] build on empty device for better dev experience by @tomeras91 in https://github.com/vllm-project/vllm/pull/4773
- [Doc] add instructions about building vLLM with VLLM_TARGET_DEVICE=empty by @tomeras91 in https://github.com/vllm-project/vllm/pull/7403
- [misc] add commit id in collect env by @youkaichao in https://github.com/vllm-project/vllm/pull/7405
- [Docs] Update readme by @simon-mo in https://github.com/vllm-project/vllm/pull/7316
- [CI/Build] Minor refactoring for vLLM assets by @ywang96 in https://github.com/vllm-project/vllm/pull/7407
- [Kernel] Flashinfer correctness fix for v0.1.3 by @LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/7319
- [Core][VLM] Support image embeddings as input by @ywang96 in https://github.com/vllm-project/vllm/pull/6613
- [Frontend] Disallow passing `model` as both argument and option by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/7347
- [CI/Build] bump Dockerfile.neuron image base, use public ECR by @dtrifiro in https://github.com/vllm-project/vllm/pull/6832
- [Bugfix] Fix logit soft cap in flash-attn backend by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7425
- [ci] Entrypoints run upon changes in vllm/ by @khluu in https://github.com/vllm-project/vllm/pull/7423
- [ci] Cancel fastcheck run when PR is marked ready by @khluu in https://github.com/vllm-project/vllm/pull/7427
- [ci] Cancel fastcheck when PR is ready by @khluu in https://github.com/vllm-project/vllm/pull/7433
- [Misc] Use scalar type to dispatch to different `gptq_marlin` kernels by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/7323
- [Core] Consolidate `GB` constant and enable float GB arguments by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/7416
- [Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel by @jon-chuang in https://github.com/vllm-project/vllm/pull/7208
- [Bugfix] Handle PackageNotFoundError when checking for xpu version by @sasha0552 in https://github.com/vllm-project/vllm/pull/7398
- [CI/Build] bump minimum cmake version by @dtrifiro in https://github.com/vllm-project/vllm/pull/6999
- [Core] Shut down aDAG workers with clean async llm engine exit by @ruisearch42 in https://github.com/vllm-project/vllm/pull/7224
- [mypy] Misc. typing improvements by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/7417
- [Misc] improve logits processors logging message by @aw632 in https://github.com/vllm-project/vllm/pull/7435
- [ci] Remove fast check cancel workflow by @khluu in https://github.com/vllm-project/vllm/pull/7455
- [Bugfix] Fix weight loading for Chameleon when TP>1 by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/7410
- [hardware] unify usage of is_tpu to current_platform.is_tpu() by @youkaichao in https://github.com/vllm-project/vllm/pull/7102
- [TPU] Suppress import custom_ops warning by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7458
- Revert "[Doc] Update supported_hardware.rst (#7276)" by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7467
- [Frontend][Core] Add plumbing to support audio language models by @petersalas in https://github.com/vllm-project/vllm/pull/7446
- [Misc] Update LM Eval Tolerance by @dsikka in https://github.com/vllm-project/vllm/pull/7473
- [Misc] Update `gptq_marlin` to use new vLLMParameters by @dsikka in https://github.com/vllm-project/vllm/pull/7281
- [Misc] Update Fused MoE weight loading by @dsikka in https://github.com/vllm-project/vllm/pull/7334
- [Misc] Update `awq` and `awq_marlin` to use `vLLMParameters` by @dsikka in https://github.com/vllm-project/vllm/pull/7422
- Announce NVIDIA Meetup by @simon-mo in https://github.com/vllm-project/vllm/pull/7483
- [frontend] spawn engine process from api server process by @youkaichao in https://github.com/vllm-project/vllm/pull/7484
- [Misc] `compressed-tensors` code reuse by @kylesayrs in https://github.com/vllm-project/vllm/pull/7277
- [misc][plugin] add plugin system implementation by @youkaichao in https://github.com/vllm-project/vllm/pull/7426
- [TPU] Support multi-host inference by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7457
- [Bugfix][CI] Import ray under guard by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7486
- [CI/Build]Reduce the time consumption for LoRA tests by @jeejeelee in https://github.com/vllm-project/vllm/pull/7396
- [misc][ci] fix cpu test with plugins by @youkaichao in https://github.com/vllm-project/vllm/pull/7489
- [Bugfix][Docs] Update list of mock imports by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/7493
- [doc] update test script to include cudagraph by @youkaichao in https://github.com/vllm-project/vllm/pull/7501
- Fix empty output when temp is too low by @CatherineSue in https://github.com/vllm-project/vllm/pull/2937
- [ci] fix model tests by @youkaichao in https://github.com/vllm-project/vllm/pull/7507
- [Bugfix][Frontend] Disable embedding API for chat models by @QwertyJack in https://github.com/vllm-project/vllm/pull/7504
- [Misc] Deprecation Warning when setting --engine-use-ray by @wallashss in https://github.com/vllm-project/vllm/pull/7424
- [VLM][Core] Support profiling with multiple multi-modal inputs per prompt by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/7126
- [core] [3/N] multi-step args and sequence.py by @SolitaryThinker in https://github.com/vllm-project/vllm/pull/7452
- [TPU] Set per-rank XLA cache by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7533
- [Misc] Revert `compressed-tensors` code reuse by @kylesayrs in https://github.com/vllm-project/vllm/pull/7521
- llama_index serving integration documentation by @pavanjava in https://github.com/vllm-project/vllm/pull/6973
- [Bugfix][TPU] Correct env variable for XLA cache path by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7544
- [Bugfix] update neuron for version > 0.5.0 by @omrishiv in https://github.com/vllm-project/vllm/pull/7175
- [Misc] Update dockerfile for CPU to cover protobuf installation by @PHILO-HE in https://github.com/vllm-project/vllm/pull/7182
- [Bugfix] Fix default weight loading for scalars by @mgoin in https://github.com/vllm-project/vllm/pull/7534
- [Bugfix][Harmless] Fix hardcoded float16 dtype for model_is_embedding by @mgoin in https://github.com/vllm-project/vllm/pull/7566
- [Misc] Add quantization config support for speculative model. by @ShangmingCai in https://github.com/vllm-project/vllm/pull/7343
- [Feature]: Add OpenAI server prompt_logprobs support #6508 by @gnpinkert in https://github.com/vllm-project/vllm/pull/7453
- [ci/test] rearrange tests and make adag test soft fail by @youkaichao in https://github.com/vllm-project/vllm/pull/7572
- Chat method for offline llm by @nunjunj in https://github.com/vllm-project/vllm/pull/5049
- [CI] Move quantization cpu offload tests out of fastcheck by @mgoin in https://github.com/vllm-project/vllm/pull/7574
- [Misc/Testing] Use `torch.testing.assert_close` by @jon-chuang in https://github.com/vllm-project/vllm/pull/7324
- register custom op for flash attn and use from torch.ops by @youkaichao in https://github.com/vllm-project/vllm/pull/7536
- [Core] Use uvloop with zmq-decoupled front-end by @njhill in https://github.com/vllm-project/vllm/pull/7570
- [CI] Fix crashes of performance benchmark by @KuntaiDu in https://github.com/vllm-project/vllm/pull/7500
- [Bugfix][Hardware][AMD][Frontend] add quantization param to embedding checking method by @gongdao123 in https://github.com/vllm-project/vllm/pull/7513
- support tqdm in notebooks by @fzyzcjy in https://github.com/vllm-project/vllm/pull/7510
- [Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm by @charlifu in https://github.com/vllm-project/vllm/pull/7210
- [Kernel] W8A16 Int8 inside FusedMoE by @mzusman in https://github.com/vllm-project/vllm/pull/7415
- [Kernel] Add tuned triton configs for ExpertsInt8 by @mgoin in https://github.com/vllm-project/vllm/pull/7601
- [spec decode] [4/N] Move update_flash_attn_metadata to attn backend by @SolitaryThinker in https://github.com/vllm-project/vllm/pull/7571
- [Core] Fix tracking of model forward time to the span traces in case of PP>1 by @sfc-gh-mkeralapura in https://github.com/vllm-project/vllm/pull/7440
- [Doc] Add docs for llmcompressor INT8 and FP8 checkpoints by @mgoin in https://github.com/vllm-project/vllm/pull/7444
- [Doc] Update quantization supported hardware table by @mgoin in https://github.com/vllm-project/vllm/pull/7595
- [Kernel] register punica functions as torch ops by @bnellnm in https://github.com/vllm-project/vllm/pull/7591
- [Kernel][Misc] dynamo support for ScalarType by @bnellnm in https://github.com/vllm-project/vllm/pull/7594
- [Kernel] fix types used in aqlm and ggml kernels to support dynamo by @bnellnm in https://github.com/vllm-project/vllm/pull/7596
- [Model] Align nemotron config with final HF state and fix lm-eval-small by @mgoin in https://github.com/vllm-project/vllm/pull/7611
- [Bugfix] Fix custom_ar support check by @bnellnm in https://github.com/vllm-project/vllm/pull/7617
- .[Build/CI] Enabling passing AMD tests. by @Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/7610
- [Bugfix] Clear engine reference in AsyncEngineRPCServer by @ruisearch42 in https://github.com/vllm-project/vllm/pull/7618
- [aDAG] Unflake aDAG + PP tests by @rkooo567 in https://github.com/vllm-project/vllm/pull/7600
- [Bugfix] add >= 1.0 constraint for openai dependency by @metasyn in https://github.com/vllm-project/vllm/pull/7612
- [misc] use nvml to get consistent device name by @youkaichao in https://github.com/vllm-project/vllm/pull/7582
- [ci][test] fix engine/logger test by @youkaichao in https://github.com/vllm-project/vllm/pull/7621
- [core][misc] update libcudart finding by @youkaichao in https://github.com/vllm-project/vllm/pull/7620
- [Model] Pipeline parallel support for JAIS by @mrbesher in https://github.com/vllm-project/vllm/pull/7603
- [ci][test] allow longer wait time for api server by @youkaichao in https://github.com/vllm-project/vllm/pull/7629
- [Misc]Fix BitAndBytes exception messages by @jeejeelee in https://github.com/vllm-project/vllm/pull/7626
- [VLM] Refactor `MultiModalConfig` initialization and profiling by @ywang96 in https://github.com/vllm-project/vllm/pull/7530
- [TPU] Skip creating empty tensor by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7630
- [TPU] Use mark_dynamic only for dummy run by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7634
- [TPU] Optimize RoPE forward_native2 by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7636
- [ Bugfix ] Fix Prometheus Metrics With `zeromq` Frontend by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/7279
- [CI/Build] Add text-only test for Qwen models by @alex-jw-brooks in https://github.com/vllm-project/vllm/pull/7475
- [Misc] Refactor Llama3 RoPE initialization by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7637
- [Core] Optimize SPMD architecture with delta + serialization optimization by @rkooo567 in https://github.com/vllm-project/vllm/pull/7109
- [Core] Use flashinfer sampling kernel when available by @peng1999 in https://github.com/vllm-project/vllm/pull/7137
- fix xpu build by @jikunshang in https://github.com/vllm-project/vllm/pull/7644
- [Misc] Remove Gemma RoPE by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7638
- [MISC] Add prefix cache hit rate to metrics by @comaniac in https://github.com/vllm-project/vllm/pull/7606
- [Bugfix] fix lora_dtype value type in arg_utils.py - part 2 by @c3-ali in https://github.com/vllm-project/vllm/pull/5428
- [core] Multi Step Scheduling by @SolitaryThinker in https://github.com/vllm-project/vllm/pull/7000
- [Core] Support tensor parallelism for GGUF quantization by @Isotr0py in https://github.com/vllm-project/vllm/pull/7520
- [Bugfix] Don't disable existing loggers by @a-ys in https://github.com/vllm-project/vllm/pull/7664
- [TPU] Fix redundant input tensor cloning by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7660
- [Bugfix] use StoreBoolean instead of type=bool for --disable-logprobs-during-spec-decoding by @tjohnson31415 in https://github.com/vllm-project/vllm/pull/7665
- [doc] fix doc build error caused by msgspec by @youkaichao in https://github.com/vllm-project/vllm/pull/7659
- [Speculative Decoding] Fixing hidden states handling in batch expansion by @abhigoyal1997 in https://github.com/vllm-project/vllm/pull/7508
- [ci] Install Buildkite test suite analysis by @khluu in https://github.com/vllm-project/vllm/pull/7667
- [Bugfix] support `tie_word_embeddings` for all models by @zijian-hu in https://github.com/vllm-project/vllm/pull/5724
- [CI] Organizing performance benchmark files by @KuntaiDu in https://github.com/vllm-project/vllm/pull/7616
- [misc] add nvidia related library in collect env by @youkaichao in https://github.com/vllm-project/vllm/pull/7674
- [XPU] fallback to native implementation for xpu custom op by @jianyizh in https://github.com/vllm-project/vllm/pull/7670
- [misc][cuda] add warning for pynvml user by @youkaichao in https://github.com/vllm-project/vllm/pull/7675
- [Core] Refactor executor classes to make it easier to inherit GPUExecutor by @jikunshang in https://github.com/vllm-project/vllm/pull/7673
- [Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/7174
- [OpenVINO] Updated documentation by @ilya-lavrenov in https://github.com/vllm-project/vllm/pull/7687
- [VLM][Model] Add test for InternViT vision encoder by @Isotr0py in https://github.com/vllm-project/vllm/pull/7409
- [Hardware] [Intel GPU] refactor xpu worker/executor by @jikunshang in https://github.com/vllm-project/vllm/pull/7686
- [CI/Build] Pin OpenTelemetry versions and make availability errors clearer by @ronensc in https://github.com/vllm-project/vllm/pull/7266
- [Misc] Add jinja2 as an explicit build requirement by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/7695
- [Core] Add `AttentionState` abstraction by @Yard1 in https://github.com/vllm-project/vllm/pull/7663
- [Intel GPU] fix xpu not support punica kernel (which use torch.library.custom_op) by @jikunshang in https://github.com/vllm-project/vllm/pull/7685
- [ci][test] adjust max wait time for cpu offloading test by @youkaichao in https://github.com/vllm-project/vllm/pull/7709
- [Core] Pipe `worker_class_fn` argument in Executor by @Yard1 in https://github.com/vllm-project/vllm/pull/7707
- [ci] try to log process using the port to debug the port usage by @youkaichao in https://github.com/vllm-project/vllm/pull/7711
- [Model] Add AWQ quantization support for InternVL2 model by @Isotr0py in https://github.com/vllm-project/vllm/pull/7187
- [Doc] Section for Multimodal Language Models by @ywang96 in https://github.com/vllm-project/vllm/pull/7719
- [mypy] Enable following imports for entrypoints by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/7248
- [Bugfix] Mirror jinja2 in pyproject.toml by @sasha0552 in https://github.com/vllm-project/vllm/pull/7723
- [BugFix] Avoid premature async generator exit and raise all exception variations by @njhill in https://github.com/vllm-project/vllm/pull/7698
- [BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] by @learninmou in https://github.com/vllm-project/vllm/pull/7509
- [Bugfix][Hardware][CPU] Fix `mm_limits` initialization for CPU backend by @Isotr0py in https://github.com/vllm-project/vllm/pull/7735
- [Spec Decoding] Use target model max length as default for draft model by @njhill in https://github.com/vllm-project/vllm/pull/7706
- [Bugfix] chat method add_generation_prompt param by @brian14708 in https://github.com/vllm-project/vllm/pull/7734
- [Bugfix][Frontend] Fix Issues Under High Load With `zeromq` Frontend by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/7394
- [Bugfix] Pass PYTHONPATH from setup.py to CMake by @sasha0552 in https://github.com/vllm-project/vllm/pull/7730
- [multi-step] Raise error if not using async engine by @SolitaryThinker in https://github.com/vllm-project/vllm/pull/7703
- [Frontend] Improve Startup Failure UX by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/7716
- [misc] Add Torch profiler support by @SolitaryThinker in https://github.com/vllm-project/vllm/pull/7451
- [Model] Add UltravoxModel and UltravoxConfig by @petersalas in https://github.com/vllm-project/vllm/pull/7615
- [ci] [multi-step] narrow multi-step test dependency paths by @SolitaryThinker in https://github.com/vllm-project/vllm/pull/7760
- [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel by @dsikka in https://github.com/vllm-project/vllm/pull/7527
- [distributed][misc] error on same VLLM_HOST_IP setting by @youkaichao in https://github.com/vllm-project/vllm/pull/7756
- [AMD][CI/Build] Disambiguation of the function call for ROCm 6.2 headers compatibility by @gshtras in https://github.com/vllm-project/vllm/pull/7477
- [Kernel] Replaced `blockReduce[...]` functions with `cub::BlockReduce` by @ProExpertProg in https://github.com/vllm-project/vllm/pull/7233
- [Model] Fix Phi-3.5-vision-instruct 'num_crops' issue by @zifeitong in https://github.com/vllm-project/vllm/pull/7710
- [Bug][Frontend] Improve ZMQ client robustness by @joerunde in https://github.com/vllm-project/vllm/pull/7443
- Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)" by @mgoin in https://github.com/vllm-project/vllm/pull/7764
- [TPU] Avoid initializing TPU runtime in is_tpu by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7763
- [ci] refine dependency for distributed tests by @youkaichao in https://github.com/vllm-project/vllm/pull/7776
- [Misc] Use torch.compile for GemmaRMSNorm by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7642
- [Speculative Decoding] EAGLE Implementation with Top-1 proposer by @abhigoyal1997 in https://github.com/vllm-project/vllm/pull/6830
- Fix ShardedStateLoader for vllm fp8 quantization by @sfc-gh-zhwang in https://github.com/vllm-project/vllm/pull/7708
- [Bugfix] Don't build machete on cuda <12.0 by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/7757
- [Misc] update fp8 to use `vLLMParameter` by @dsikka in https://github.com/vllm-project/vllm/pull/7437
- [Bugfix] spec decode handle None entries in topk args in create_sequence_group_output by @tjohnson31415 in https://github.com/vllm-project/vllm/pull/7232
- [Misc] Enhance prefix-caching benchmark tool by @Jeffwan in https://github.com/vllm-project/vllm/pull/6568
- [Doc] Fix incorrect docs from #7615 by @petersalas in https://github.com/vllm-project/vllm/pull/7788
- [Bugfix] Use LoadFormat values as choices for `vllm serve --load-format` by @mgoin in https://github.com/vllm-project/vllm/pull/7784
- [ci] Cleanup & refactor Dockerfile to pass different Python versions and sccache bucket via build args by @khluu in https://github.com/vllm-project/vllm/pull/7705
- [Misc] fix typo in triton import warning by @lsy323 in https://github.com/vllm-project/vllm/pull/7794
- [Frontend] error suppression cleanup by @joerunde in https://github.com/vllm-project/vllm/pull/7786
- [Ray backend] Better error when pg topology is bad. by @rkooo567 in https://github.com/vllm-project/vllm/pull/7584
- [Hardware][Intel GPU] refactor xpu_model_runner, fix xpu tensor parallel by @jikunshang in https://github.com/vllm-project/vllm/pull/7712
- [misc] Add Torch profiler support for CPU-only devices by @DamonFool in https://github.com/vllm-project/vllm/pull/7806
- [BugFix] Fix server crash on empty prompt by @maxdebayser in https://github.com/vllm-project/vllm/pull/7746
- [github][misc] promote asking llm first by @youkaichao in https://github.com/vllm-project/vllm/pull/7809
- [Misc] Update `marlin` to use vLLMParameters by @dsikka in https://github.com/vllm-project/vllm/pull/7803
- Bump version to v0.5.5 by @simon-mo in https://github.com/vllm-project/vllm/pull/7823
New Contributors
- @jischein made their first contribution in https://github.com/vllm-project/vllm/pull/7129
- @kpapis made their first contribution in https://github.com/vllm-project/vllm/pull/7198
- @xiaobochen123 made their first contribution in https://github.com/vllm-project/vllm/pull/7193
- @Atllkks10 made their first contribution in https://github.com/vllm-project/vllm/pull/7227
- @stas00 made their first contribution in https://github.com/vllm-project/vllm/pull/7243
- @maxdebayser made their first contribution in https://github.com/vllm-project/vllm/pull/7217
- @NiuBlibing made their first contribution in https://github.com/vllm-project/vllm/pull/7288
- @lsy323 made their first contribution in https://github.com/vllm-project/vllm/pull/7005
- @pooyadavoodi made their first contribution in https://github.com/vllm-project/vllm/pull/7132
- @sfc-gh-mkeralapura made their first contribution in https://github.com/vllm-project/vllm/pull/7089
- @jon-chuang made their first contribution in https://github.com/vllm-project/vllm/pull/7208
- @aw632 made their first contribution in https://github.com/vllm-project/vllm/pull/7435
- @petersalas made their first contribution in https://github.com/vllm-project/vllm/pull/7446
- @kylesayrs made their first contribution in https://github.com/vllm-project/vllm/pull/7277
- @QwertyJack made their first contribution in https://github.com/vllm-project/vllm/pull/7504
- @wallashss made their first contribution in https://github.com/vllm-project/vllm/pull/7424
- @pavanjava made their first contribution in https://github.com/vllm-project/vllm/pull/6973
- @PHILO-HE made their first contribution in https://github.com/vllm-project/vllm/pull/7182
- @gnpinkert made their first contribution in https://github.com/vllm-project/vllm/pull/7453
- @gongdao123 made their first contribution in https://github.com/vllm-project/vllm/pull/7513
- @charlifu made their first contribution in https://github.com/vllm-project/vllm/pull/7210
- @metasyn made their first contribution in https://github.com/vllm-project/vllm/pull/7612
- @mrbesher made their first contribution in https://github.com/vllm-project/vllm/pull/7603
- @alex-jw-brooks made their first contribution in https://github.com/vllm-project/vllm/pull/7475
- @a-ys made their first contribution in https://github.com/vllm-project/vllm/pull/7664
- @zijian-hu made their first contribution in https://github.com/vllm-project/vllm/pull/5724
- @jianyizh made their first contribution in https://github.com/vllm-project/vllm/pull/7670
- @learninmou made their first contribution in https://github.com/vllm-project/vllm/pull/7509
- @brian14708 made their first contribution in https://github.com/vllm-project/vllm/pull/7734
- @sfc-gh-zhwang made their first contribution in https://github.com/vllm-project/vllm/pull/7708
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.5.4...v0.5.5
1. vllm-0.5.5+cu118-cp310-cp310-manylinux1_x86_64.whl (127.3 MB)
2. vllm-0.5.5+cu118-cp311-cp311-manylinux1_x86_64.whl (127.3 MB)
3. vllm-0.5.5+cu118-cp312-cp312-manylinux1_x86_64.whl (127.3 MB)
4. vllm-0.5.5+cu118-cp38-cp38-manylinux1_x86_64.whl (127.3 MB)