v0.5.4
Release date: 2024-08-06 06:38:28
Latest release of vllm-project/vllm: v0.6.1 (2024-09-12 05:44:44)
Highlights
Model Support
- Enhanced pipeline parallelism support for DeepSeek v2 (#6519), Qwen (#6974), Qwen2 (#6924), and Nemotron (#6863)
- Enhanced vision language model support for InternVL2 (#6514, #7067), BLIP-2 (#5920), and MiniCPM-V (#4087, #7122); a minimal offline-inference sketch follows this list
- Added H2O Danube3-4b (#6451)
- Added Nemotron models (Nemotron-3, Nemotron-4, Minitron) (#6611)
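Below is a minimal sketch of offline inference with one of the newly supported vision language models (BLIP-2 here). The image path, prompt wording, and sampling settings are placeholders, and prompt formats are model-specific, so consult the consolidated offline VLM examples (#6858) for other models.

```python
# Hedged sketch: offline vision-language inference with BLIP-2 (#5920).
# The image path and prompt wording are assumptions for illustration only.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="Salesforce/blip2-opt-2.7b")

image = Image.open("example.jpg").convert("RGB")  # any local image
prompt = "Question: What is shown in this image? Answer:"

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```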
Hardware Support
- TPU enhancements: collective communication, TP for async engine, faster compile time (#6891, #6933, #6856, #6813, #5871)
- Intel CPU: enable multiprocessing and tensor parallelism (#6125); see the sketch after this list
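For the Intel CPU backend, a rough sketch of tensor-parallel offline inference follows. It assumes vLLM was built for CPU (VLLM_TARGET_DEVICE=cpu); the KV-cache size and model name are placeholder assumptions, not tuned values.

```python
# Hedged sketch: tensor parallelism on the CPU backend (#6125).
# Assumes a CPU build of vLLM; values below are illustrative.
import os

os.environ["VLLM_CPU_KVCACHE_SPACE"] = "8"  # KV cache space in GiB (assumption: adjust per host)

from vllm import LLM, SamplingParams

# tensor_parallel_size > 1 now runs multiple CPU workers via multiprocessing.
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)
out = llm.generate("Hello, my name is", SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```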
Performance
We are progressing along our quest to quickly improve performance. Each of the following PRs contributed some improvements, and we anticipate more enhancements in the next release.
- Separated the OpenAI server's HTTP request handling from the model inference loop with `zeromq`. This brought a 20% speedup in time to first token and a 2x improvement in inter-token latency. (#6883)
- Used Python's native `array` data structure to speed up padding. This brings a 15% throughput improvement in large-batch scenarios. (#6779)
- Reduced unnecessary compute when `logprobs=None`. This cut the latency of computing log probs from ~30ms to ~5ms in large-batch scenarios; a minimal example follows this list. (#6532)
- Optimized the `get_seqs` function, bringing a 2% throughput improvement. (#7051)
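The `logprobs=None` fast path mentioned above only applies when log probabilities are not requested, which is the default for `SamplingParams`; the model name below is just an illustrative placeholder.

```python
# Hedged sketch: the #6532 optimization applies when logprobs are not requested.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

fast = SamplingParams(max_tokens=32)              # logprobs=None (default): log-prob computation is skipped
slow = SamplingParams(max_tokens=32, logprobs=5)  # top-5 logprobs per token: extra per-step compute

print(llm.generate("The capital of France is", fast)[0].outputs[0].text)
print(llm.generate("The capital of France is", slow)[0].outputs[0].logprobs[0])
```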
Production Features
- Enhancements to speculative decoding: FlashInfer in DraftModelRunner (#6926), observability (#6963), and benchmarks (#6964); a configuration sketch follows this list
- Refactor the punica kernel based on Triton (#5036)
- Support for guided decoding for offline LLM (#6878)
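As a companion to the speculative-decoding items above, here is a hedged configuration sketch with a small draft model. The target/draft pair is an assumption (the draft must share the target's tokenizer), and `use_v2_block_manager=True` reflects the block-manager requirement for speculative decoding at this version.

```python
# Hedged sketch: enabling speculative decoding with a draft model.
# Model pair is illustrative only.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-6.7b",              # target model (assumption)
    speculative_model="facebook/opt-125m",  # draft model (assumption)
    num_speculative_tokens=5,
    use_v2_block_manager=True,              # required for speculative decoding here
)
print(llm.generate("Speculative decoding works by", SamplingParams(max_tokens=64))[0].outputs[0].text)
```

With #6963, the engine also logs periodic speculative-decoding metrics (time spent in proposal, scoring, and verification) alongside the usual throughput stats.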
Quantization
- Support W4A8 quantization for vllm (#5218)
- Tuned FP8 and INT8 Kernels for Ada Lovelace and SM75 T4 (#6677, #6996, #6848)
- Support reading bitsandbytes pre-quantized models (#5753); see the loading sketch after this list
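A hedged sketch of loading a pre-quantized bitsandbytes checkpoint follows. The repository name is an assumption (any 4-bit bnb checkpoint should behave similarly), and note that eager mode is temporarily enforced for bnb quantization (#6846).

```python
# Hedged sketch: reading a bitsandbytes pre-quantized model (#5753).
# The checkpoint name is a placeholder assumption.
from vllm import LLM, SamplingParams

llm = LLM(
    model="unsloth/llama-3-8b-bnb-4bit",  # pre-quantized 4-bit bnb checkpoint (assumption)
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)
print(llm.generate("Quantization lets you", SamplingParams(max_tokens=32))[0].outputs[0].text)
```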
What's Changed
- [Docs] Announce llama3.1 support by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6688
- [doc][distributed] fix doc argument order by @youkaichao in https://github.com/vllm-project/vllm/pull/6691
- [Bugfix] Fix a log error in chunked prefill by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6694
- [BugFix] Fix RoPE error in Llama 3.1 by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6693
- Bump version to 0.5.3.post1 by @simon-mo in https://github.com/vllm-project/vllm/pull/6696
- [Misc] Add ignored layers for `fp8` quantization by @mgoin in https://github.com/vllm-project/vllm/pull/6657
- [Frontend] Add Usage data in each chunk for chat_serving. #6540 by @yecohn in https://github.com/vllm-project/vllm/pull/6652
- [Model] Pipeline Parallel Support for DeepSeek v2 by @tjohnson31415 in https://github.com/vllm-project/vllm/pull/6519
- Bump `transformers` version for Llama 3.1 hotfix and patch Chameleon by @ywang96 in https://github.com/vllm-project/vllm/pull/6690
- [build] relax wheel size limit by @youkaichao in https://github.com/vllm-project/vllm/pull/6704
- [CI] Add smoke test for non-uniform AutoFP8 quantization by @mgoin in https://github.com/vllm-project/vllm/pull/6702
- [Bugfix] StatLoggers: cache spec decode metrics when they get collected. by @tdoublep in https://github.com/vllm-project/vllm/pull/6645
- [bitsandbytes]: support read bnb pre-quantized model by @thesues in https://github.com/vllm-project/vllm/pull/5753
- [Bugfix] fix flashinfer cudagraph capture for PP by @SolitaryThinker in https://github.com/vllm-project/vllm/pull/6708
- [SpecDecoding] Update MLPSpeculator CI tests to use smaller model by @njhill in https://github.com/vllm-project/vllm/pull/6714
- [Bugfix] Fix token padding for chameleon by @ywang96 in https://github.com/vllm-project/vllm/pull/6724
- [Docs][ROCm] Detailed instructions to build from source by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6680
- [Build/CI] Update run-amd-test.sh. Enable Docker Hub login. by @Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/6711
- [Bugfix]fix modelscope compatible issue by @liuyhwangyh in https://github.com/vllm-project/vllm/pull/6730
- Adding f-string to validation error which is missing by @luizanao in https://github.com/vllm-project/vllm/pull/6748
- [Bugfix] Fix speculative decode seeded test by @njhill in https://github.com/vllm-project/vllm/pull/6743
- [Bugfix] Miscalculated latency lead to time_to_first_token_seconds inaccurate. by @AllenDou in https://github.com/vllm-project/vllm/pull/6686
- [Frontend] split run_server into build_server and run_server by @dtrifiro in https://github.com/vllm-project/vllm/pull/6740
- [Kernels] Add fp8 support to `reshape_and_cache_flash` by @Yard1 in https://github.com/vllm-project/vllm/pull/6667
- [Core] Tweaks to model runner/input builder developer APIs by @Yard1 in https://github.com/vllm-project/vllm/pull/6712
- [Bugfix] Bump transformers to 4.43.2 by @mgoin in https://github.com/vllm-project/vllm/pull/6752
- [Doc][AMD][ROCm]Added tips to refer to mi300x tuning guide for mi300x users by @hongxiayang in https://github.com/vllm-project/vllm/pull/6754
- [core][distributed] fix zmq hang by @youkaichao in https://github.com/vllm-project/vllm/pull/6759
- [Frontend] Represent tokens with identifiable strings by @ezliu in https://github.com/vllm-project/vllm/pull/6626
- [Model] Adding support for MiniCPM-V by @HwwwwwwwH in https://github.com/vllm-project/vllm/pull/4087
- [Bugfix] Fix decode tokens w. CUDA graph by @comaniac in https://github.com/vllm-project/vllm/pull/6757
- [Bugfix] Fix awq_marlin and gptq_marlin flags by @alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/6745
- [Bugfix] Fix encoding_format in examples/openai_embedding_client.py by @CatherineSue in https://github.com/vllm-project/vllm/pull/6755
- [Bugfix] Add image placeholder for OpenAI Compatible Server of MiniCPM-V by @HwwwwwwwH in https://github.com/vllm-project/vllm/pull/6787
- [ Misc ] `fp8-marlin` channelwise via `compressed-tensors` by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6524
- [Bugfix] Fix `kv_cache_dtype=fp8` without scales for FP8 checkpoints by @mgoin in https://github.com/vllm-project/vllm/pull/6761
- [Bugfix] Add synchronize to prevent possible data race by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/6788
- [Doc] Add documentations for nightly benchmarks by @KuntaiDu in https://github.com/vllm-project/vllm/pull/6412
- [Bugfix] Fix empty (nullptr) channelwise scales when loading wNa16 using compressed tensors by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/6798
- [doc][distributed] improve multinode serving doc by @youkaichao in https://github.com/vllm-project/vllm/pull/6804
- [Docs] Publish 5th meetup slides by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6799
- [Core] Fix ray forward_dag error mssg by @rkooo567 in https://github.com/vllm-project/vllm/pull/6792
- [ci][distributed] fix flaky tests by @youkaichao in https://github.com/vllm-project/vllm/pull/6806
- [ci] Mark tensorizer test as soft fail and separate it from grouped test in fast check by @khluu in https://github.com/vllm-project/vllm/pull/6810
- Fix ReplicatedLinear weight loading by @qingquansong in https://github.com/vllm-project/vllm/pull/6793
- [Bugfix] [Easy] Fixed a bug in the multiprocessing GPU executor. by @eaplatanios in https://github.com/vllm-project/vllm/pull/6770
- [Core] Use array to speedup padding by @peng1999 in https://github.com/vllm-project/vllm/pull/6779
- [doc][debugging] add known issues for hangs by @youkaichao in https://github.com/vllm-project/vllm/pull/6816
- [Model] Support Nemotron models (Nemotron-3, Nemotron-4, Minitron) by @mgoin in https://github.com/vllm-project/vllm/pull/6611
- [Bugfix][Kernel] Promote another index to int64_t by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/6838
- [Build/CI][ROCm] Minor simplification to Dockerfile.rocm by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6811
- [Misc][TPU] Support TPU in initialize_ray_cluster by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6812
- [Hardware] [Intel] Enable Multiprocessing and tensor parallel in CPU backend and update documentation by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/6125
- [Doc] Add Nemotron to supported model docs by @mgoin in https://github.com/vllm-project/vllm/pull/6843
- [Doc] Update SkyPilot doc for wrong indents and instructions for update service by @Michaelvll in https://github.com/vllm-project/vllm/pull/4283
- Update README.md by @gurpreet-dhami in https://github.com/vllm-project/vllm/pull/6847
- enforce eager mode with bnb quantization temporarily by @chenqianfzh in https://github.com/vllm-project/vllm/pull/6846
- [TPU] Support collective communications in XLA devices by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6813
- [Frontend] Factor out code for running uvicorn by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/6828
- [Bug Fix] Illegal memory access, FP8 Llama 3.1 405b by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/6852
- [Bugfix]: Fix Tensorizer test failures by @sangstar in https://github.com/vllm-project/vllm/pull/6835
- [ROCm] Upgrade PyTorch nightly version by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6845
- [Doc] add VLLM_TARGET_DEVICE=neuron to documentation for neuron by @omrishiv in https://github.com/vllm-project/vllm/pull/6844
- [Bugfix][Model] Jamba assertions and no chunked prefill by default for Jamba by @tomeras91 in https://github.com/vllm-project/vllm/pull/6784
- [Model] H2O Danube3-4b by @g-eoj in https://github.com/vllm-project/vllm/pull/6451
- [Hardware][TPU] Implement tensor parallelism with Ray by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5871
- [Doc] Add missing mock import to docs `conf.py` by @hmellor in https://github.com/vllm-project/vllm/pull/6834
- [Bugfix] Use torch.set_num_threads() to configure parallelism in multiproc_gpu_executor by @tjohnson31415 in https://github.com/vllm-project/vllm/pull/6802
- [Misc][VLM][Doc] Consolidate offline examples for vision language models by @ywang96 in https://github.com/vllm-project/vllm/pull/6858
- [Bugfix] Fix VLM example typo by @ywang96 in https://github.com/vllm-project/vllm/pull/6859
- [bugfix] make args.stream work by @WrRan in https://github.com/vllm-project/vllm/pull/6831
- [CI/Build][Doc] Update CI and Doc for VLM example changes by @ywang96 in https://github.com/vllm-project/vllm/pull/6860
- [Model] Initial support for BLIP-2 by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5920
- [Docs] Add RunLLM chat widget by @cw75 in https://github.com/vllm-project/vllm/pull/6857
- [TPU] Reduce compilation time & Upgrade PyTorch XLA version by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6856
- [Kernel] Increase precision of GPTQ/AWQ Marlin kernel by @alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/6795
- Add Nemotron to PP_SUPPORTED_MODELS by @mgoin in https://github.com/vllm-project/vllm/pull/6863
- [Misc] Pass cutlass_fp8_supported correctly in fbgemm_fp8 by @zeyugao in https://github.com/vllm-project/vllm/pull/6871
- [Model] Initialize support for InternVL2 series models by @Isotr0py in https://github.com/vllm-project/vllm/pull/6514
- [Kernel] Tuned FP8 Kernels for Ada Lovelace by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/6677
- [Core] Reduce unnecessary compute when logprobs=None by @peng1999 in https://github.com/vllm-project/vllm/pull/6532
- [Kernel] Fix deprecation function warnings squeezellm quant_cuda_kernel by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/6901
- [TPU] Add TPU tensor parallelism to async engine by @etwk in https://github.com/vllm-project/vllm/pull/6891
- [Bugfix] Allow vllm to still work if triton is not installed. by @tdoublep in https://github.com/vllm-project/vllm/pull/6786
- [Frontend] New `allowed_token_ids` decoding request parameter by @njhill in https://github.com/vllm-project/vllm/pull/6753
- [Kernel] Remove unused variables in awq/gemm_kernels.cu by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/6908
- [ci] GHA workflow to remove ready label upon "/notready" comment by @khluu in https://github.com/vllm-project/vllm/pull/6921
- [Kernel] Fix marlin divide-by-zero warnings by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/6904
- [Kernel] Tuned int8 kernels for Ada Lovelace by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/6848
- [TPU] Fix greedy decoding by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6933
- [Bugfix] Fix PaliGemma MMP by @ywang96 in https://github.com/vllm-project/vllm/pull/6930
- [Doc] Super tiny fix doc typo by @fzyzcjy in https://github.com/vllm-project/vllm/pull/6949
- [BugFix] Fix use of per-request seed with pipeline parallel by @njhill in https://github.com/vllm-project/vllm/pull/6698
- [Kernel] Squash a few more warnings by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/6914
- [OpenVINO] Updated OpenVINO requirements and build docs by @ilya-lavrenov in https://github.com/vllm-project/vllm/pull/6948
- [Bugfix] Fix tensorizer memory profiling bug during testing by @sangstar in https://github.com/vllm-project/vllm/pull/6881
- [Kernel] Remove scaled_fp8_quant kernel padding footgun by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/6842
- [core][misc] improve free_finished_seq_groups by @youkaichao in https://github.com/vllm-project/vllm/pull/6865
- [Build] Temporarily Disable Kernels and LoRA tests by @simon-mo in https://github.com/vllm-project/vllm/pull/6961
- [Nightly benchmarking suite] Remove pkill python from run benchmark suite by @cadedaniel in https://github.com/vllm-project/vllm/pull/6965
- [CI] [nightly benchmark] Do not re-download sharegpt dataset if exists by @cadedaniel in https://github.com/vllm-project/vllm/pull/6706
- [Speculative decoding] Add serving benchmark for llama3 70b + speculative decoding by @cadedaniel in https://github.com/vllm-project/vllm/pull/6964
- [mypy] Enable following imports for some directories by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/6681
- [Bugfix] Fix broadcasting logic for `multi_modal_kwargs` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/6836
- [CI/Build] Fix mypy errors by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/6968
- [Bugfix][TPU] Set readonly=True for non-root devices by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6980
- [Bugfix] fix logit processor excceed vocab size issue by @FeiDeng in https://github.com/vllm-project/vllm/pull/6927
- Support W4A8 quantization for vllm by @HandH1998 in https://github.com/vllm-project/vllm/pull/5218
- [Bugfix] Clean up MiniCPM-V by @HwwwwwwwH in https://github.com/vllm-project/vllm/pull/6939
- [Bugfix] Fix feature size calculation for LLaVA-NeXT by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/6982
- [Model] use FusedMoE layer in Jamba by @avshalomman in https://github.com/vllm-project/vllm/pull/6935
- [MISC] Introduce pipeline parallelism partition strategies by @comaniac in https://github.com/vllm-project/vllm/pull/6920
- [Bugfix] Support cpu offloading with quant_method.process_weights_after_loading by @mgoin in https://github.com/vllm-project/vllm/pull/6960
- [Kernel] Enable FP8 Cutlass for Ada Lovelace by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/6950
- [Kernel] Tuned int8 Cutlass Kernels for SM75 (T4) by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/6996
- [Misc] Add compressed-tensors to optimized quant list by @mgoin in https://github.com/vllm-project/vllm/pull/7006
- Revert "[Frontend] Factor out code for running uvicorn" by @simon-mo in https://github.com/vllm-project/vllm/pull/7012
- [Kernel][RFC] Refactor the punica kernel based on Triton by @jeejeelee in https://github.com/vllm-project/vllm/pull/5036
- [Model] Pipeline parallel support for Qwen2 by @xuyi in https://github.com/vllm-project/vllm/pull/6924
- [Bugfix][TPU] Do not use torch.Generator for TPUs by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6981
- [Bugfix][Model] Skip loading lm_head weights if using tie_word_embeddings by @tjohnson31415 in https://github.com/vllm-project/vllm/pull/6758
- PP comm optimization: replace send with partial send + allgather by @aurickq in https://github.com/vllm-project/vllm/pull/6695
- [Bugfix] Set SamplingParams.max_tokens for OpenAI requests if not provided by user by @zifeitong in https://github.com/vllm-project/vllm/pull/6954
- [core][scheduler] simplify and improve scheduler by @youkaichao in https://github.com/vllm-project/vllm/pull/6867
- [Build/CI] Fixing Docker Hub quota issue. by @Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/7043
- [CI/Build] Update torch to 2.4 by @SageMoore in https://github.com/vllm-project/vllm/pull/6951
- [Bugfix] Fix RMSNorm forward in InternViT attention qk_layernorm by @Isotr0py in https://github.com/vllm-project/vllm/pull/6992
- [CI/Build] Remove sparseml requirement from testing by @mgoin in https://github.com/vllm-project/vllm/pull/7037
- [Bugfix] Lower gemma's unloaded_params exception to warning by @mgoin in https://github.com/vllm-project/vllm/pull/7002
- [Models] Support Qwen model with PP by @andoorve in https://github.com/vllm-project/vllm/pull/6974
- Update run-amd-test.sh by @okakarpa in https://github.com/vllm-project/vllm/pull/7044
- [Misc] Support attention logits soft-capping with flash-attn by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7022
- [CI/Build][Bugfix] Fix CUTLASS header-only line by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/7034
- [Performance] Optimize `get_seqs` by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7051
- [Kernel] Fix input for flashinfer prefill wrapper. by @LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/7008
- [mypy] Speed up mypy checking by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/7056
- [ci][distributed] try to fix pp test by @youkaichao in https://github.com/vllm-project/vllm/pull/7054
- Fix tracing.py by @bong-furiosa in https://github.com/vllm-project/vllm/pull/7065
- [cuda][misc] remove error_on_invalid_device_count_status by @youkaichao in https://github.com/vllm-project/vllm/pull/7069
- [Core] Comment out unused code in sampler by @peng1999 in https://github.com/vllm-project/vllm/pull/7023
- [Hardware][Intel CPU] Update torch 2.4.0 for CPU backend by @DamonFool in https://github.com/vllm-project/vllm/pull/6931
- [ci] set timeout for test_oot_registration.py by @youkaichao in https://github.com/vllm-project/vllm/pull/7082
- [CI/Build] Add support for Python 3.12 by @mgoin in https://github.com/vllm-project/vllm/pull/7035
- [Misc] Disambiguate quantized types via a new ScalarType by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/6396
- [Core] Pipeline parallel with Ray ADAG by @ruisearch42 in https://github.com/vllm-project/vllm/pull/6837
- [Misc] Revive to use loopback address for driver IP by @ruisearch42 in https://github.com/vllm-project/vllm/pull/7091
- [misc] add a flag to enable compile by @youkaichao in https://github.com/vllm-project/vllm/pull/7092
- [ Frontend ] Multiprocessing for OpenAI Server with `zeromq` by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6883
- [ci][distributed] shorten wait time if server hangs by @youkaichao in https://github.com/vllm-project/vllm/pull/7098
- [Frontend] Factor out chat message parsing by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/7055
- [ci][distributed] merge distributed test commands by @youkaichao in https://github.com/vllm-project/vllm/pull/7097
- [ci][distributed] disable ray dag tests by @youkaichao in https://github.com/vllm-project/vllm/pull/7099
- [Model] Refactor and decouple weight loading logic for InternVL2 model by @Isotr0py in https://github.com/vllm-project/vllm/pull/7067
- [Bugfix] Fix block table for seqs that have prefix cache hits by @zachzzc in https://github.com/vllm-project/vllm/pull/7018
- [LoRA] ReplicatedLinear support LoRA by @jeejeelee in https://github.com/vllm-project/vllm/pull/7081
- [CI] Temporarily turn off H100 performance benchmark by @KuntaiDu in https://github.com/vllm-project/vllm/pull/7104
- [ci][test] finalize fork_new_process_for_each_test by @youkaichao in https://github.com/vllm-project/vllm/pull/7114
- [Frontend] Warn if user `max_model_len` is greater than derived `max_model_len` by @fialhocoelho in https://github.com/vllm-project/vllm/pull/7080
- Support for guided decoding for offline LLM by @kevinbu233 in https://github.com/vllm-project/vllm/pull/6878
- [misc] add zmq in collect env by @youkaichao in https://github.com/vllm-project/vllm/pull/7119
- [core][misc] simply output processing with shortcut for non-parallel sampling and non-beam search usecase by @youkaichao in https://github.com/vllm-project/vllm/pull/7117
- [Model]Refactor MiniCPMV by @jeejeelee in https://github.com/vllm-project/vllm/pull/7020
- [Bugfix] [SpecDecode] Default speculative_draft_tensor_parallel_size to 1 when using MLPSpeculator by @tdoublep in https://github.com/vllm-project/vllm/pull/7105
- [misc][distributed] improve libcudart.so finding by @youkaichao in https://github.com/vllm-project/vllm/pull/7127
- Clean up remaining Punica C information by @jeejeelee in https://github.com/vllm-project/vllm/pull/7027
- [Model] Add multi-image support for minicpmv offline inference by @HwwwwwwwH in https://github.com/vllm-project/vllm/pull/7122
- [Frontend] Reapply "Factor out code for running uvicorn" by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/7095
- [Model] SiglipVisionModel ported from transformers by @ChristopherCho in https://github.com/vllm-project/vllm/pull/6942
- [Speculative decoding] Add periodic log with time spent in proposal/scoring/verification by @cadedaniel in https://github.com/vllm-project/vllm/pull/6963
- [SpecDecode] Support FlashInfer in DraftModelRunner by @bong-furiosa in https://github.com/vllm-project/vllm/pull/6926
- [BugFix] Use IP4 localhost form for zmq bind by @njhill in https://github.com/vllm-project/vllm/pull/7163
- [BugFix] Use args.trust_remote_code by @VastoLorde95 in https://github.com/vllm-project/vllm/pull/7121
- [Misc] Fix typo in GroupCoordinator.recv() by @ruisearch42 in https://github.com/vllm-project/vllm/pull/7167
- [Kernel] Update CUTLASS to 3.5.1 by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/7085
- [CI/Build] Suppress divide-by-zero and missing return statement warnings by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/7001
- [Bugfix][CI/Build] Fix CUTLASS FetchContent by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/7171
- bump version to v0.5.4 by @simon-mo in https://github.com/vllm-project/vllm/pull/7139
New Contributors
- @yecohn made their first contribution in https://github.com/vllm-project/vllm/pull/6652
- @thesues made their first contribution in https://github.com/vllm-project/vllm/pull/5753
- @luizanao made their first contribution in https://github.com/vllm-project/vllm/pull/6748
- @ezliu made their first contribution in https://github.com/vllm-project/vllm/pull/6626
- @HwwwwwwwH made their first contribution in https://github.com/vllm-project/vllm/pull/4087
- @LucasWilkinson made their first contribution in https://github.com/vllm-project/vllm/pull/6798
- @qingquansong made their first contribution in https://github.com/vllm-project/vllm/pull/6793
- @eaplatanios made their first contribution in https://github.com/vllm-project/vllm/pull/6770
- @gurpreet-dhami made their first contribution in https://github.com/vllm-project/vllm/pull/6847
- @omrishiv made their first contribution in https://github.com/vllm-project/vllm/pull/6844
- @cw75 made their first contribution in https://github.com/vllm-project/vllm/pull/6857
- @zeyugao made their first contribution in https://github.com/vllm-project/vllm/pull/6871
- @etwk made their first contribution in https://github.com/vllm-project/vllm/pull/6891
- @fzyzcjy made their first contribution in https://github.com/vllm-project/vllm/pull/6949
- @FeiDeng made their first contribution in https://github.com/vllm-project/vllm/pull/6927
- @HandH1998 made their first contribution in https://github.com/vllm-project/vllm/pull/5218
- @xuyi made their first contribution in https://github.com/vllm-project/vllm/pull/6924
- @bong-furiosa made their first contribution in https://github.com/vllm-project/vllm/pull/7065
- @zachzzc made their first contribution in https://github.com/vllm-project/vllm/pull/7018
- @fialhocoelho made their first contribution in https://github.com/vllm-project/vllm/pull/7080
- @ChristopherCho made their first contribution in https://github.com/vllm-project/vllm/pull/6942
- @VastoLorde95 made their first contribution in https://github.com/vllm-project/vllm/pull/7121
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.5.3...v0.5.4
Assets
- vllm-0.5.4+cu118-cp310-cp310-manylinux1_x86_64.whl (118.82 MB)
- vllm-0.5.4+cu118-cp311-cp311-manylinux1_x86_64.whl (118.82 MB)
- vllm-0.5.4+cu118-cp312-cp312-manylinux1_x86_64.whl (118.82 MB)
- vllm-0.5.4+cu118-cp38-cp38-manylinux1_x86_64.whl (118.82 MB)
- vllm-0.5.4+cu118-cp39-cp39-manylinux1_x86_64.whl (118.82 MB)
- vllm-0.5.4-cp310-cp310-manylinux1_x86_64.whl (118.42 MB)
- vllm-0.5.4-cp311-cp311-manylinux1_x86_64.whl (118.42 MB)
- vllm-0.5.4-cp312-cp312-manylinux1_x86_64.whl (118.42 MB)
- vllm-0.5.4-cp38-cp38-manylinux1_x86_64.whl (118.42 MB)
- vllm-0.5.4-cp39-cp39-manylinux1_x86_64.whl (118.42 MB)