v0.5.5
Release date: 2024-08-24 02:37:46
Latest vllm-project/vllm release: v0.6.1 (2024-09-12 05:44:44)
Highlights
Performance Update
- We introduced a new mode that schedules multiple GPU steps in advance, reducing CPU overhead (#7000, #7387, #7452, #7703). Initial results show a 20% improvement in QPS for a single GPU running 8B and 30B models. You can pass `--num-scheduler-steps 8` to the API server (via `vllm serve`) or to `AsyncLLMEngine`; see the hedged sketch after this list. We are working on expanding coverage to the `LLM` class and aim to turn it on by default.
- Various enhancements:
- Use flashinfer sampling kernel when available, leading to 7% decoding throughput speedup (#7137)
- Reduce Python allocations, leading to 24% throughput speedup (#7162, #7364)
- Improvements to the zeromq based decoupled frontend (#7570, #7716, #7484)
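A minimal sketch of enabling multi-step scheduling programmatically is shown below. The `--num-scheduler-steps 8` flag for `vllm serve` is taken directly from the notes above; the assumption here is that the same option is exposed as the `num_scheduler_steps` field of `AsyncEngineArgs`, and the model name is only a placeholder.

```python
# Hedged sketch: multi-step scheduling on the async engine.
# Assumes the --num-scheduler-steps CLI flag maps to the
# num_scheduler_steps engine argument; the model name is a placeholder.
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # any ~8B model
    num_scheduler_steps=8,  # schedule 8 GPU steps ahead to reduce CPU overhead
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```

The server-side equivalent is simply passing `--num-scheduler-steps 8` to `vllm serve`.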
Model Support
- Support Jamba 1.5 (#7415, #7601, #6739)
- Support for the first audio model, `UltravoxModel` (#7615, #7446)
- Improvements to vision models:
- Support image embeddings as input (#6613)
- Support SigLIP encoder and alternative decoders for LLaVA models (#7153)
- Support loading GGUF models (#5191) with tensor parallelism (#7520); see the sketch after this list
- Progress on encoder/decoder models: support for serving encoder/decoder models (#7258) and architecture for cross-attention (#4942)
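Following up on the GGUF item above, here is a hedged sketch of loading a GGUF checkpoint with tensor parallelism. The file path, tokenizer name, and GPU count are placeholders, and pairing the GGUF file with its original Hugging Face tokenizer is an assumption rather than something stated in these notes.

```python
# Hedged sketch: a GGUF checkpoint sharded across 2 GPUs (#5191, #7520).
# All paths and names below are placeholders, not values from the release notes.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # local GGUF file (hypothetical)
    tokenizer="meta-llama/Meta-Llama-3.1-8B-Instruct",  # matching HF tokenizer (assumption)
    tensor_parallel_size=2,  # shard the quantized weights across 2 GPUs
)
out = llm.generate("The capital of France is", SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```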
Hardware Support
- AMD: Add fp8 Linear Layer for rocm (#7210)
- Enhancements to TPU support: load time W8A16 quantization (#7005), optimized rope (#7635), and support multi-host inference (#7457).
- Intel: various refactoring for worker, executor, and model runner (#7686, #7712)
Others
- Optimize prefix caching performance (#7193)
- Speculative decoding
- Use target model max length as default for draft model (#7706)
- EAGLE Implementation with Top-1 proposer (#6830)
- Entrypoints
- A new `chat` method in the `LLM` class (#5049); a hedged sketch appears at the end of the Highlights
- Support embeddings in the run_batch API (#7132)
- Support `prompt_logprobs` in Chat Completion (#7453)
- Quantizations
- Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)
- Machete - Hopper Optimized Mixed Precision Linear Kernel (#7174)
- `torch.compile`: register custom ops for kernels (#7591, #7594, #7536)
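Closing out the Highlights, below is a hedged sketch of the new `LLM.chat` entrypoint referenced above (#5049). The messages follow the familiar OpenAI-style role/content schema; the model name and sampling settings are placeholders.

```python
# Hedged sketch of the offline chat entrypoint added in #5049.
# Model name and sampling settings are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What does multi-step scheduling change?"},
]
outputs = llm.chat(messages, SamplingParams(temperature=0.7, max_tokens=128))
print(outputs[0].outputs[0].text)
```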
What's Changed
- [ci][frontend] deduplicate tests by @youkaichao in https://github.com/vllm-project/vllm/pull/7101
- [Doc] [SpecDecode] Update MLPSpeculator documentation by @tdoublep in https://github.com/vllm-project/vllm/pull/7100
- [Bugfix] Specify device when loading LoRA and embedding tensors by @jischein in https://github.com/vllm-project/vllm/pull/7129
- [MISC] Use non-blocking transfer in prepare_input by @comaniac in https://github.com/vllm-project/vllm/pull/7172
- [Core] Support loading GGUF model by @Isotr0py in https://github.com/vllm-project/vllm/pull/5191
- [Build] Add initial conditional testing spec by @simon-mo in https://github.com/vllm-project/vllm/pull/6841
- [LoRA] Relax LoRA condition by @jeejeelee in https://github.com/vllm-project/vllm/pull/7146
- [Model] Support SigLIP encoder and alternative decoders for LLaVA models by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/7153
- [BugFix] Fix DeepSeek remote code by @dsikka in https://github.com/vllm-project/vllm/pull/7178
- [ BugFix ] Fix ZMQ when `VLLM_PORT` is set by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/7205
- [Bugfix] add gguf dependency by @kpapis in https://github.com/vllm-project/vllm/pull/7198
- [SpecDecode] [Minor] Fix spec decode sampler tests by @LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/7183
- [Kernel] Add per-tensor and per-token AZP epilogues by @ProExpertProg in https://github.com/vllm-project/vllm/pull/5941
- [Core] Optimize evictor-v2 performance by @xiaobochen123 in https://github.com/vllm-project/vllm/pull/7193
- [Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) by @afeldman-nm in https://github.com/vllm-project/vllm/pull/4942
- [Bugfix] Fix GPTQ and GPTQ Marlin CPU Offloading by @mgoin in https://github.com/vllm-project/vllm/pull/7225
- [BugFix] Overhaul async request cancellation by @njhill in https://github.com/vllm-project/vllm/pull/7111
- [Doc] Mock new dependencies for documentation by @ywang96 in https://github.com/vllm-project/vllm/pull/7245
- [BUGFIX]: top_k is expected to be an integer. by @Atllkks10 in https://github.com/vllm-project/vllm/pull/7227
- [Frontend] Gracefully handle missing chat template and fix CI failure by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/7238
- [distributed][misc] add specialized method for cuda platform by @youkaichao in https://github.com/vllm-project/vllm/pull/7249
- [Misc] Refactor linear layer weight loading; introduce `BasevLLMParameter` and `weight_loader_v2` by @dsikka in https://github.com/vllm-project/vllm/pull/5874
- [ BugFix ] Move `zmq` frontend to IPC instead of TCP by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/7222
- Fixes typo in function name by @rafvasq in https://github.com/vllm-project/vllm/pull/7275
- [Bugfix] Fix input processor for InternVL2 model by @Isotr0py in https://github.com/vllm-project/vllm/pull/7164
- [OpenVINO] migrate to latest dependencies versions by @ilya-lavrenov in https://github.com/vllm-project/vllm/pull/7251
- [Doc] add online speculative decoding example by @stas00 in https://github.com/vllm-project/vllm/pull/7243
- [BugFix] Fix frontend multiprocessing hang by @maxdebayser in https://github.com/vllm-project/vllm/pull/7217
- [Bugfix][FP8] Fix dynamic FP8 Marlin quantization by @mgoin in https://github.com/vllm-project/vllm/pull/7219
- [ci] Make building wheels per commit optional by @khluu in https://github.com/vllm-project/vllm/pull/7278
- [Bugfix] Fix gptq failure on T4s by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/7264
- [FrontEnd] Make `merge_async_iterators` `is_cancelled` arg optional by @njhill in https://github.com/vllm-project/vllm/pull/7282
- [Doc] Update supported_hardware.rst by @mgoin in https://github.com/vllm-project/vllm/pull/7276
- [Kernel] Fix Flashinfer Correctness by @LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/7284
- [Misc] Fix typos in scheduler.py by @ruisearch42 in https://github.com/vllm-project/vllm/pull/7285
- [Frontend] remove max_num_batched_tokens limit for lora by @NiuBlibing in https://github.com/vllm-project/vllm/pull/7288
- [Bugfix] Fix LoRA with PP by @andoorve in https://github.com/vllm-project/vllm/pull/7292
- [Model] Rename MiniCPMVQwen2 to MiniCPMV2.6 by @jeejeelee in https://github.com/vllm-project/vllm/pull/7273
- [Bugfix][Kernel] Increased atol to fix failing tests by @ProExpertProg in https://github.com/vllm-project/vllm/pull/7305
- [Frontend] Kill the server on engine death by @joerunde in https://github.com/vllm-project/vllm/pull/6594
- [Bugfix][fast] Fix the get_num_blocks_touched logic by @zachzzc in https://github.com/vllm-project/vllm/pull/6849
- [Doc] Put collect_env issue output in a block by @mgoin in https://github.com/vllm-project/vllm/pull/7310
- [CI/Build] Dockerfile.cpu improvements by @dtrifiro in https://github.com/vllm-project/vllm/pull/7298
- [Bugfix] Fix new Llama3.1 GGUF model loading by @Isotr0py in https://github.com/vllm-project/vllm/pull/7269
- [Misc] Temporarily resolve the error of BitAndBytes by @jeejeelee in https://github.com/vllm-project/vllm/pull/7308
- Add Skywork AI as Sponsor by @simon-mo in https://github.com/vllm-project/vllm/pull/7314
- [TPU] Add Load-time W8A16 quantization for TPU Backend by @lsy323 in https://github.com/vllm-project/vllm/pull/7005
- [Core] Support serving encoder/decoder models by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/7258
- [TPU] Fix dockerfile.tpu by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7331
- [Performance] Optimize e2e overheads: Reduce python allocations by @alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/7162
- [Bugfix] Fix speculative decoding with MLPSpeculator with padded vocabulary by @tjohnson31415 in https://github.com/vllm-project/vllm/pull/7218
- [Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace by @SolitaryThinker in https://github.com/vllm-project/vllm/pull/6971
- [Core] Streamline stream termination in `AsyncLLMEngine` by @njhill in https://github.com/vllm-project/vllm/pull/7336
- [Model][Jamba] Mamba cache single buffer by @mzusman in https://github.com/vllm-project/vllm/pull/6739
- [VLM][Doc] Add `stop_token_ids` to InternVL example by @Isotr0py in https://github.com/vllm-project/vllm/pull/7354
- [Performance] e2e overheads reduction: Small followup diff by @alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/7364
- [Bugfix] Fix reinit procedure in ModelInputForGPUBuilder by @alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/7360
- [Frontend] Support embeddings in the run_batch API by @pooyadavoodi in https://github.com/vllm-project/vllm/pull/7132
- [Bugfix] Fix ITL recording in serving benchmark by @ywang96 in https://github.com/vllm-project/vllm/pull/7372
- [Core] Add span metrics for model_forward, scheduler and sampler time by @sfc-gh-mkeralapura in https://github.com/vllm-project/vllm/pull/7089
- [Bugfix] Fix `PerTensorScaleParameter` weight loading for fused models by @dsikka in https://github.com/vllm-project/vllm/pull/7376
- [Misc] Add numpy implementation of `compute_slot_mapping` by @Yard1 in https://github.com/vllm-project/vllm/pull/7377
- [Core] Fix edge case in chunked prefill + block manager v2 by @cadedaniel in https://github.com/vllm-project/vllm/pull/7380
- [Bugfix] Fix phi3v batch inference when images have different aspect ratio by @Isotr0py in https://github.com/vllm-project/vllm/pull/7392
- [TPU] Use mark_dynamic to reduce compilation time by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7340
- Updating LM Format Enforcer version to v0.10.6 by @noamgat in https://github.com/vllm-project/vllm/pull/7189
- [core] [2/N] refactor worker_base input preparation for multi-step by @SolitaryThinker in https://github.com/vllm-project/vllm/pull/7387
- [CI/Build] build on empty device for better dev experience by @tomeras91 in https://github.com/vllm-project/vllm/pull/4773
- [Doc] add instructions about building vLLM with VLLM_TARGET_DEVICE=empty by @tomeras91 in https://github.com/vllm-project/vllm/pull/7403
- [misc] add commit id in collect env by @youkaichao in https://github.com/vllm-project/vllm/pull/7405
- [Docs] Update readme by @simon-mo in https://github.com/vllm-project/vllm/pull/7316
- [CI/Build] Minor refactoring for vLLM assets by @ywang96 in https://github.com/vllm-project/vllm/pull/7407
- [Kernel] Flashinfer correctness fix for v0.1.3 by @LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/7319
- [Core][VLM] Support image embeddings as input by @ywang96 in https://github.com/vllm-project/vllm/pull/6613
- [Frontend] Disallow passing `model` as both argument and option by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/7347
- [CI/Build] bump Dockerfile.neuron image base, use public ECR by @dtrifiro in https://github.com/vllm-project/vllm/pull/6832
- [Bugfix] Fix logit soft cap in flash-attn backend by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7425
- [ci] Entrypoints run upon changes in vllm/ by @khluu in https://github.com/vllm-project/vllm/pull/7423
- [ci] Cancel fastcheck run when PR is marked ready by @khluu in https://github.com/vllm-project/vllm/pull/7427
- [ci] Cancel fastcheck when PR is ready by @khluu in https://github.com/vllm-project/vllm/pull/7433
- [Misc] Use scalar type to dispatch to different `gptq_marlin` kernels by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/7323
- [Core] Consolidate `GB` constant and enable float GB arguments by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/7416
- [Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel by @jon-chuang in https://github.com/vllm-project/vllm/pull/7208
- [Bugfix] Handle PackageNotFoundError when checking for xpu version by @sasha0552 in https://github.com/vllm-project/vllm/pull/7398
- [CI/Build] bump minimum cmake version by @dtrifiro in https://github.com/vllm-project/vllm/pull/6999
- [Core] Shut down aDAG workers with clean async llm engine exit by @ruisearch42 in https://github.com/vllm-project/vllm/pull/7224
- [mypy] Misc. typing improvements by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/7417
- [Misc] improve logits processors logging message by @aw632 in https://github.com/vllm-project/vllm/pull/7435
- [ci] Remove fast check cancel workflow by @khluu in https://github.com/vllm-project/vllm/pull/7455
- [Bugfix] Fix weight loading for Chameleon when TP>1 by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/7410
- [hardware] unify usage of is_tpu to current_platform.is_tpu() by @youkaichao in https://github.com/vllm-project/vllm/pull/7102
- [TPU] Suppress import custom_ops warning by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7458
- Revert "[Doc] Update supported_hardware.rst (#7276)" by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7467
- [Frontend][Core] Add plumbing to support audio language models by @petersalas in https://github.com/vllm-project/vllm/pull/7446
- [Misc] Update LM Eval Tolerance by @dsikka in https://github.com/vllm-project/vllm/pull/7473
- [Misc] Update `gptq_marlin` to use new vLLMParameters by @dsikka in https://github.com/vllm-project/vllm/pull/7281
- [Misc] Update Fused MoE weight loading by @dsikka in https://github.com/vllm-project/vllm/pull/7334
- [Misc] Update `awq` and `awq_marlin` to use `vLLMParameters` by @dsikka in https://github.com/vllm-project/vllm/pull/7422
- Announce NVIDIA Meetup by @simon-mo in https://github.com/vllm-project/vllm/pull/7483
- [frontend] spawn engine process from api server process by @youkaichao in https://github.com/vllm-project/vllm/pull/7484
- [Misc] `compressed-tensors` code reuse by @kylesayrs in https://github.com/vllm-project/vllm/pull/7277
- [misc][plugin] add plugin system implementation by @youkaichao in https://github.com/vllm-project/vllm/pull/7426
- [TPU] Support multi-host inference by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7457
- [Bugfix][CI] Import ray under guard by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7486
- [CI/Build]Reduce the time consumption for LoRA tests by @jeejeelee in https://github.com/vllm-project/vllm/pull/7396
- [misc][ci] fix cpu test with plugins by @youkaichao in https://github.com/vllm-project/vllm/pull/7489
- [Bugfix][Docs] Update list of mock imports by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/7493
- [doc] update test script to include cudagraph by @youkaichao in https://github.com/vllm-project/vllm/pull/7501
- Fix empty output when temp is too low by @CatherineSue in https://github.com/vllm-project/vllm/pull/2937
- [ci] fix model tests by @youkaichao in https://github.com/vllm-project/vllm/pull/7507
- [Bugfix][Frontend] Disable embedding API for chat models by @QwertyJack in https://github.com/vllm-project/vllm/pull/7504
- [Misc] Deprecation Warning when setting --engine-use-ray by @wallashss in https://github.com/vllm-project/vllm/pull/7424
- [VLM][Core] Support profiling with multiple multi-modal inputs per prompt by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/7126
- [core] [3/N] multi-step args and sequence.py by @SolitaryThinker in https://github.com/vllm-project/vllm/pull/7452
- [TPU] Set per-rank XLA cache by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7533
- [Misc] Revert `compressed-tensors` code reuse by @kylesayrs in https://github.com/vllm-project/vllm/pull/7521
- llama_index serving integration documentation by @pavanjava in https://github.com/vllm-project/vllm/pull/6973
- [Bugfix][TPU] Correct env variable for XLA cache path by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7544
- [Bugfix] update neuron for version > 0.5.0 by @omrishiv in https://github.com/vllm-project/vllm/pull/7175
- [Misc] Update dockerfile for CPU to cover protobuf installation by @PHILO-HE in https://github.com/vllm-project/vllm/pull/7182
- [Bugfix] Fix default weight loading for scalars by @mgoin in https://github.com/vllm-project/vllm/pull/7534
- [Bugfix][Harmless] Fix hardcoded float16 dtype for model_is_embedding by @mgoin in https://github.com/vllm-project/vllm/pull/7566
- [Misc] Add quantization config support for speculative model. by @ShangmingCai in https://github.com/vllm-project/vllm/pull/7343
- [Feature]: Add OpenAI server prompt_logprobs support #6508 by @gnpinkert in https://github.com/vllm-project/vllm/pull/7453
- [ci/test] rearrange tests and make adag test soft fail by @youkaichao in https://github.com/vllm-project/vllm/pull/7572
- Chat method for offline llm by @nunjunj in https://github.com/vllm-project/vllm/pull/5049
- [CI] Move quantization cpu offload tests out of fastcheck by @mgoin in https://github.com/vllm-project/vllm/pull/7574
- [Misc/Testing] Use `torch.testing.assert_close` by @jon-chuang in https://github.com/vllm-project/vllm/pull/7324
- register custom op for flash attn and use from torch.ops by @youkaichao in https://github.com/vllm-project/vllm/pull/7536
- [Core] Use uvloop with zmq-decoupled front-end by @njhill in https://github.com/vllm-project/vllm/pull/7570
- [CI] Fix crashes of performance benchmark by @KuntaiDu in https://github.com/vllm-project/vllm/pull/7500
- [Bugfix][Hardware][AMD][Frontend] add quantization param to embedding checking method by @gongdao123 in https://github.com/vllm-project/vllm/pull/7513
- support tqdm in notebooks by @fzyzcjy in https://github.com/vllm-project/vllm/pull/7510
- [Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm by @charlifu in https://github.com/vllm-project/vllm/pull/7210
- [Kernel] W8A16 Int8 inside FusedMoE by @mzusman in https://github.com/vllm-project/vllm/pull/7415
- [Kernel] Add tuned triton configs for ExpertsInt8 by @mgoin in https://github.com/vllm-project/vllm/pull/7601
- [spec decode] [4/N] Move update_flash_attn_metadata to attn backend by @SolitaryThinker in https://github.com/vllm-project/vllm/pull/7571
- [Core] Fix tracking of model forward time to the span traces in case of PP>1 by @sfc-gh-mkeralapura in https://github.com/vllm-project/vllm/pull/7440
- [Doc] Add docs for llmcompressor INT8 and FP8 checkpoints by @mgoin in https://github.com/vllm-project/vllm/pull/7444
- [Doc] Update quantization supported hardware table by @mgoin in https://github.com/vllm-project/vllm/pull/7595
- [Kernel] register punica functions as torch ops by @bnellnm in https://github.com/vllm-project/vllm/pull/7591
- [Kernel][Misc] dynamo support for ScalarType by @bnellnm in https://github.com/vllm-project/vllm/pull/7594
- [Kernel] fix types used in aqlm and ggml kernels to support dynamo by @bnellnm in https://github.com/vllm-project/vllm/pull/7596
- [Model] Align nemotron config with final HF state and fix lm-eval-small by @mgoin in https://github.com/vllm-project/vllm/pull/7611
- [Bugfix] Fix custom_ar support check by @bnellnm in https://github.com/vllm-project/vllm/pull/7617
- .[Build/CI] Enabling passing AMD tests. by @Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/7610
- [Bugfix] Clear engine reference in AsyncEngineRPCServer by @ruisearch42 in https://github.com/vllm-project/vllm/pull/7618
- [aDAG] Unflake aDAG + PP tests by @rkooo567 in https://github.com/vllm-project/vllm/pull/7600
- [Bugfix] add >= 1.0 constraint for openai dependency by @metasyn in https://github.com/vllm-project/vllm/pull/7612
- [misc] use nvml to get consistent device name by @youkaichao in https://github.com/vllm-project/vllm/pull/7582
- [ci][test] fix engine/logger test by @youkaichao in https://github.com/vllm-project/vllm/pull/7621
- [core][misc] update libcudart finding by @youkaichao in https://github.com/vllm-project/vllm/pull/7620
- [Model] Pipeline parallel support for JAIS by @mrbesher in https://github.com/vllm-project/vllm/pull/7603
- [ci][test] allow longer wait time for api server by @youkaichao in https://github.com/vllm-project/vllm/pull/7629
- [Misc]Fix BitAndBytes exception messages by @jeejeelee in https://github.com/vllm-project/vllm/pull/7626
- [VLM] Refactor `MultiModalConfig` initialization and profiling by @ywang96 in https://github.com/vllm-project/vllm/pull/7530
- [TPU] Skip creating empty tensor by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7630
- [TPU] Use mark_dynamic only for dummy run by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7634
- [TPU] Optimize RoPE forward_native2 by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7636
- [ Bugfix ] Fix Prometheus Metrics With `zeromq` Frontend by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/7279
- [CI/Build] Add text-only test for Qwen models by @alex-jw-brooks in https://github.com/vllm-project/vllm/pull/7475
- [Misc] Refactor Llama3 RoPE initialization by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7637
- [Core] Optimize SPMD architecture with delta + serialization optimization by @rkooo567 in https://github.com/vllm-project/vllm/pull/7109
- [Core] Use flashinfer sampling kernel when available by @peng1999 in https://github.com/vllm-project/vllm/pull/7137
- fix xpu build by @jikunshang in https://github.com/vllm-project/vllm/pull/7644
- [Misc] Remove Gemma RoPE by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7638
- [MISC] Add prefix cache hit rate to metrics by @comaniac in https://github.com/vllm-project/vllm/pull/7606
- [Bugfix] fix lora_dtype value type in arg_utils.py - part 2 by @c3-ali in https://github.com/vllm-project/vllm/pull/5428
- [core] Multi Step Scheduling by @SolitaryThinker in https://github.com/vllm-project/vllm/pull/7000
- [Core] Support tensor parallelism for GGUF quantization by @Isotr0py in https://github.com/vllm-project/vllm/pull/7520
- [Bugfix] Don't disable existing loggers by @a-ys in https://github.com/vllm-project/vllm/pull/7664
- [TPU] Fix redundant input tensor cloning by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7660
- [Bugfix] use StoreBoolean instead of type=bool for --disable-logprobs-during-spec-decoding by @tjohnson31415 in https://github.com/vllm-project/vllm/pull/7665
- [doc] fix doc build error caused by msgspec by @youkaichao in https://github.com/vllm-project/vllm/pull/7659
- [Speculative Decoding] Fixing hidden states handling in batch expansion by @abhigoyal1997 in https://github.com/vllm-project/vllm/pull/7508
- [ci] Install Buildkite test suite analysis by @khluu in https://github.com/vllm-project/vllm/pull/7667
- [Bugfix] support `tie_word_embeddings` for all models by @zijian-hu in https://github.com/vllm-project/vllm/pull/5724
- [CI] Organizing performance benchmark files by @KuntaiDu in https://github.com/vllm-project/vllm/pull/7616
- [misc] add nvidia related library in collect env by @youkaichao in https://github.com/vllm-project/vllm/pull/7674
- [XPU] fallback to native implementation for xpu custom op by @jianyizh in https://github.com/vllm-project/vllm/pull/7670
- [misc][cuda] add warning for pynvml user by @youkaichao in https://github.com/vllm-project/vllm/pull/7675
- [Core] Refactor executor classes to make it easier to inherit GPUExecutor by @jikunshang in https://github.com/vllm-project/vllm/pull/7673
- [Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/7174
- [OpenVINO] Updated documentation by @ilya-lavrenov in https://github.com/vllm-project/vllm/pull/7687
- [VLM][Model] Add test for InternViT vision encoder by @Isotr0py in https://github.com/vllm-project/vllm/pull/7409
- [Hardware] [Intel GPU] refactor xpu worker/executor by @jikunshang in https://github.com/vllm-project/vllm/pull/7686
- [CI/Build] Pin OpenTelemetry versions and make availability errors clearer by @ronensc in https://github.com/vllm-project/vllm/pull/7266
- [Misc] Add jinja2 as an explicit build requirement by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/7695
- [Core] Add `AttentionState` abstraction by @Yard1 in https://github.com/vllm-project/vllm/pull/7663
- [Intel GPU] fix xpu not support punica kernel (which use torch.library.custom_op) by @jikunshang in https://github.com/vllm-project/vllm/pull/7685
- [ci][test] adjust max wait time for cpu offloading test by @youkaichao in https://github.com/vllm-project/vllm/pull/7709
- [Core] Pipe `worker_class_fn` argument in Executor by @Yard1 in https://github.com/vllm-project/vllm/pull/7707
- [ci] try to log process using the port to debug the port usage by @youkaichao in https://github.com/vllm-project/vllm/pull/7711
- [Model] Add AWQ quantization support for InternVL2 model by @Isotr0py in https://github.com/vllm-project/vllm/pull/7187
- [Doc] Section for Multimodal Language Models by @ywang96 in https://github.com/vllm-project/vllm/pull/7719
- [mypy] Enable following imports for entrypoints by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/7248
- [Bugfix] Mirror jinja2 in pyproject.toml by @sasha0552 in https://github.com/vllm-project/vllm/pull/7723
- [BugFix] Avoid premature async generator exit and raise all exception variations by @njhill in https://github.com/vllm-project/vllm/pull/7698
- [BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] by @learninmou in https://github.com/vllm-project/vllm/pull/7509
- [Bugfix][Hardware][CPU] Fix `mm_limits` initialization for CPU backend by @Isotr0py in https://github.com/vllm-project/vllm/pull/7735
- [Spec Decoding] Use target model max length as default for draft model by @njhill in https://github.com/vllm-project/vllm/pull/7706
- [Bugfix] chat method add_generation_prompt param by @brian14708 in https://github.com/vllm-project/vllm/pull/7734
- [Bugfix][Frontend] Fix Issues Under High Load With `zeromq` Frontend by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/7394
- [Bugfix] Pass PYTHONPATH from setup.py to CMake by @sasha0552 in https://github.com/vllm-project/vllm/pull/7730
- [multi-step] Raise error if not using async engine by @SolitaryThinker in https://github.com/vllm-project/vllm/pull/7703
- [Frontend] Improve Startup Failure UX by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/7716
- [misc] Add Torch profiler support by @SolitaryThinker in https://github.com/vllm-project/vllm/pull/7451
- [Model] Add UltravoxModel and UltravoxConfig by @petersalas in https://github.com/vllm-project/vllm/pull/7615
- [ci] [multi-step] narrow multi-step test dependency paths by @SolitaryThinker in https://github.com/vllm-project/vllm/pull/7760
- [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel by @dsikka in https://github.com/vllm-project/vllm/pull/7527
- [distributed][misc] error on same VLLM_HOST_IP setting by @youkaichao in https://github.com/vllm-project/vllm/pull/7756
- [AMD][CI/Build] Disambiguation of the function call for ROCm 6.2 headers compatibility by @gshtras in https://github.com/vllm-project/vllm/pull/7477
- [Kernel] Replaced `blockReduce[...]` functions with `cub::BlockReduce` by @ProExpertProg in https://github.com/vllm-project/vllm/pull/7233
- [Model] Fix Phi-3.5-vision-instruct 'num_crops' issue by @zifeitong in https://github.com/vllm-project/vllm/pull/7710
- [Bug][Frontend] Improve ZMQ client robustness by @joerunde in https://github.com/vllm-project/vllm/pull/7443
- Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)" by @mgoin in https://github.com/vllm-project/vllm/pull/7764
- [TPU] Avoid initializing TPU runtime in is_tpu by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7763
- [ci] refine dependency for distributed tests by @youkaichao in https://github.com/vllm-project/vllm/pull/7776
- [Misc] Use torch.compile for GemmaRMSNorm by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7642
- [Speculative Decoding] EAGLE Implementation with Top-1 proposer by @abhigoyal1997 in https://github.com/vllm-project/vllm/pull/6830
- Fix ShardedStateLoader for vllm fp8 quantization by @sfc-gh-zhwang in https://github.com/vllm-project/vllm/pull/7708
- [Bugfix] Don't build machete on cuda <12.0 by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/7757
- [Misc] update fp8 to use `vLLMParameter` by @dsikka in https://github.com/vllm-project/vllm/pull/7437
- [Bugfix] spec decode handle None entries in topk args in create_sequence_group_output by @tjohnson31415 in https://github.com/vllm-project/vllm/pull/7232
- [Misc] Enhance prefix-caching benchmark tool by @Jeffwan in https://github.com/vllm-project/vllm/pull/6568
- [Doc] Fix incorrect docs from #7615 by @petersalas in https://github.com/vllm-project/vllm/pull/7788
- [Bugfix] Use LoadFormat values as choices for `vllm serve --load-format` by @mgoin in https://github.com/vllm-project/vllm/pull/7784
- [ci] Cleanup & refactor Dockerfile to pass different Python versions and sccache bucket via build args by @khluu in https://github.com/vllm-project/vllm/pull/7705
- [Misc] fix typo in triton import warning by @lsy323 in https://github.com/vllm-project/vllm/pull/7794
- [Frontend] error suppression cleanup by @joerunde in https://github.com/vllm-project/vllm/pull/7786
- [Ray backend] Better error when pg topology is bad. by @rkooo567 in https://github.com/vllm-project/vllm/pull/7584
- [Hardware][Intel GPU] refactor xpu_model_runner, fix xpu tensor parallel by @jikunshang in https://github.com/vllm-project/vllm/pull/7712
- [misc] Add Torch profiler support for CPU-only devices by @DamonFool in https://github.com/vllm-project/vllm/pull/7806
- [BugFix] Fix server crash on empty prompt by @maxdebayser in https://github.com/vllm-project/vllm/pull/7746
- [github][misc] promote asking llm first by @youkaichao in https://github.com/vllm-project/vllm/pull/7809
- [Misc] Update `marlin` to use vLLMParameters by @dsikka in https://github.com/vllm-project/vllm/pull/7803
- Bump version to v0.5.5 by @simon-mo in https://github.com/vllm-project/vllm/pull/7823
New Contributors
- @jischein made their first contribution in https://github.com/vllm-project/vllm/pull/7129
- @kpapis made their first contribution in https://github.com/vllm-project/vllm/pull/7198
- @xiaobochen123 made their first contribution in https://github.com/vllm-project/vllm/pull/7193
- @Atllkks10 made their first contribution in https://github.com/vllm-project/vllm/pull/7227
- @stas00 made their first contribution in https://github.com/vllm-project/vllm/pull/7243
- @maxdebayser made their first contribution in https://github.com/vllm-project/vllm/pull/7217
- @NiuBlibing made their first contribution in https://github.com/vllm-project/vllm/pull/7288
- @lsy323 made their first contribution in https://github.com/vllm-project/vllm/pull/7005
- @pooyadavoodi made their first contribution in https://github.com/vllm-project/vllm/pull/7132
- @sfc-gh-mkeralapura made their first contribution in https://github.com/vllm-project/vllm/pull/7089
- @jon-chuang made their first contribution in https://github.com/vllm-project/vllm/pull/7208
- @aw632 made their first contribution in https://github.com/vllm-project/vllm/pull/7435
- @petersalas made their first contribution in https://github.com/vllm-project/vllm/pull/7446
- @kylesayrs made their first contribution in https://github.com/vllm-project/vllm/pull/7277
- @QwertyJack made their first contribution in https://github.com/vllm-project/vllm/pull/7504
- @wallashss made their first contribution in https://github.com/vllm-project/vllm/pull/7424
- @pavanjava made their first contribution in https://github.com/vllm-project/vllm/pull/6973
- @PHILO-HE made their first contribution in https://github.com/vllm-project/vllm/pull/7182
- @gnpinkert made their first contribution in https://github.com/vllm-project/vllm/pull/7453
- @gongdao123 made their first contribution in https://github.com/vllm-project/vllm/pull/7513
- @charlifu made their first contribution in https://github.com/vllm-project/vllm/pull/7210
- @metasyn made their first contribution in https://github.com/vllm-project/vllm/pull/7612
- @mrbesher made their first contribution in https://github.com/vllm-project/vllm/pull/7603
- @alex-jw-brooks made their first contribution in https://github.com/vllm-project/vllm/pull/7475
- @a-ys made their first contribution in https://github.com/vllm-project/vllm/pull/7664
- @zijian-hu made their first contribution in https://github.com/vllm-project/vllm/pull/5724
- @jianyizh made their first contribution in https://github.com/vllm-project/vllm/pull/7670
- @learninmou made their first contribution in https://github.com/vllm-project/vllm/pull/7509
- @brian14708 made their first contribution in https://github.com/vllm-project/vllm/pull/7734
- @sfc-gh-zhwang made their first contribution in https://github.com/vllm-project/vllm/pull/7708
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.5.4...v0.5.5
1. vllm-0.5.5+cu118-cp310-cp310-manylinux1_x86_64.whl (127.3 MB)
2. vllm-0.5.5+cu118-cp311-cp311-manylinux1_x86_64.whl (127.3 MB)
3. vllm-0.5.5+cu118-cp312-cp312-manylinux1_x86_64.whl (127.3 MB)
4. vllm-0.5.5+cu118-cp38-cp38-manylinux1_x86_64.whl (127.3 MB)