vtest
Release date: 2024-07-02 04:19:54
Latest release of vllm-project/vllm: v0.6.1 (2024-09-12 05:44:44)
What's Changed
- [CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with `perf-benchmarks` label by @KuntaiDu in https://github.com/vllm-project/vllm/pull/5073
- [CI/Build] Disable LLaVA-NeXT CPU test by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5529
- [Kernel] Fix CUTLASS 3.x custom broadcast load epilogue by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5516
- [Misc] Fix arg names by @AllenDou in https://github.com/vllm-project/vllm/pull/5524
- [ Misc ] Rs/compressed tensors cleanup by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5432
- [Kernel] Suppress mma.sp warning on CUDA 12.5 and later by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5401
- [mis] fix flaky test of test_cuda_device_count_stateless by @youkaichao in https://github.com/vllm-project/vllm/pull/5546
- [Core] Remove duplicate processing in async engine by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5525
- [misc][distributed] fix benign error in `is_in_the_same_node` by @youkaichao in https://github.com/vllm-project/vllm/pull/5512
- [Docs] Add ZhenFund as a Sponsor by @simon-mo in https://github.com/vllm-project/vllm/pull/5548
- [Doc] Update documentation on Tensorizer by @sangstar in https://github.com/vllm-project/vllm/pull/5471
- [Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models by @tdoublep in https://github.com/vllm-project/vllm/pull/5460
- [Bugfix] Fix typo in Pallas backend by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5558
- [Core][Distributed] improve p2p cache generation by @youkaichao in https://github.com/vllm-project/vllm/pull/5528
- Add ccache to amd by @simon-mo in https://github.com/vllm-project/vllm/pull/5555
- [Core][Bugfix]: fix prefix caching for blockv2 by @leiwen83 in https://github.com/vllm-project/vllm/pull/5364
- [mypy] Enable type checking for test directory by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5017
- [CI/Build] Test both text and token IDs in batched OpenAI Completions API by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5568
- [misc] Do not allow to use lora with chunked prefill. by @rkooo567 in https://github.com/vllm-project/vllm/pull/5538
- add gptq_marlin test for bug report https://github.com/vllm-project/vllm/issues/5088 by @alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/5145
- [BugFix] Don't start a Ray cluster when not using Ray by @njhill in https://github.com/vllm-project/vllm/pull/5570
- [Fix] Correct OpenAI batch response format by @zifeitong in https://github.com/vllm-project/vllm/pull/5554
- Add basic correctness 2 GPU tests to 4 GPU pipeline by @Yard1 in https://github.com/vllm-project/vllm/pull/5518
- [CI][BugFix] Flip is_quant_method_supported condition by @mgoin in https://github.com/vllm-project/vllm/pull/5577
- [build][misc] limit numpy version by @youkaichao in https://github.com/vllm-project/vllm/pull/5582
- [Doc] add debugging tips for crash and multi-node debugging by @youkaichao in https://github.com/vllm-project/vllm/pull/5581
- Fix w8a8 benchmark and add Llama-3-8B by @comaniac in https://github.com/vllm-project/vllm/pull/5562
- [Model] Rename Phi3 rope scaling type by @garg-amit in https://github.com/vllm-project/vllm/pull/5595
- Correct alignment in the seq_len diagram. by @CharlesRiggins in https://github.com/vllm-project/vllm/pull/5592
- [Kernel] `compressed-tensors` marlin 24 support by @dsikka in https://github.com/vllm-project/vllm/pull/5435
- [Misc] use AutoTokenizer for benchmark serving when vLLM not installed by @zhyncs in https://github.com/vllm-project/vllm/pull/5588
- [Hardware][Intel GPU]Add Initial Intel GPU(XPU) inference backend by @jikunshang in https://github.com/vllm-project/vllm/pull/3814
- [CI/BUILD] Support non-AVX512 vLLM building and testing by @DamonFool in https://github.com/vllm-project/vllm/pull/5574
- [CI] Improve the readability of performance benchmarking results and prepare for upcoming performance dashboard by @KuntaiDu in https://github.com/vllm-project/vllm/pull/5571
- [bugfix][distributed] fix 16 gpus local rank arrangement by @youkaichao in https://github.com/vllm-project/vllm/pull/5604
- [Optimization] use a pool to reuse LogicalTokenBlock.token_ids by @youkaichao in https://github.com/vllm-project/vllm/pull/5584
- [Bugfix] Fix KV head calculation for MPT models when using GQA by @bfontain in https://github.com/vllm-project/vllm/pull/5142
- [Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py by @zifeitong in https://github.com/vllm-project/vllm/pull/5606
- [Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier by @sroy745 in https://github.com/vllm-project/vllm/pull/5131
- [Model] Initialize Phi-3-vision support by @Isotr0py in https://github.com/vllm-project/vllm/pull/4986
- [Kernel] Add punica dimensions for Granite 13b by @joerunde in https://github.com/vllm-project/vllm/pull/5559
- [misc][typo] fix typo by @youkaichao in https://github.com/vllm-project/vllm/pull/5620
- [Misc] Fix typo by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5618
- [CI] Avoid naming different metrics with the same name in performance benchmark by @KuntaiDu in https://github.com/vllm-project/vllm/pull/5615
- [bugfix][distributed] do not error if two processes do not agree on p2p capability by @youkaichao in https://github.com/vllm-project/vllm/pull/5612
- [Misc] Remove import from transformers logging by @CatherineSue in https://github.com/vllm-project/vllm/pull/5625
- [CI/Build][Misc] Update Pytest Marker for VLMs by @ywang96 in https://github.com/vllm-project/vllm/pull/5623
- [ci] Deprecate original CI template by @khluu in https://github.com/vllm-project/vllm/pull/5624
- [Misc] Add OpenTelemetry support by @ronensc in https://github.com/vllm-project/vllm/pull/4687
- [Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization by @dsikka in https://github.com/vllm-project/vllm/pull/5542
- [ci] Setup Release pipeline and build release wheels with cache by @khluu in https://github.com/vllm-project/vllm/pull/5610
- [Model] LoRA support added for command-r by @sergey-tinkoff in https://github.com/vllm-project/vllm/pull/5178
- [Bugfix] Fix for inconsistent behaviour related to sampling and repetition penalties by @tdoublep in https://github.com/vllm-project/vllm/pull/5639
- [Doc] Added cerebrium as Integration option by @milo157 in https://github.com/vllm-project/vllm/pull/5553
- [Bugfix] Fix CUDA version check for mma warning suppression by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5642
- [Bugfix] Fix w8a8 benchmarks for int8 case by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5643
- [Bugfix] Fix Phi-3 Long RoPE scaling implementation by @ShukantPal in https://github.com/vllm-project/vllm/pull/5628
- [Bugfix] Added test for sampling repetition penalty bug. by @tdoublep in https://github.com/vllm-project/vllm/pull/5659
- [Bugfix][CI/Build][AMD][ROCm]Fixed the cmake build bug which generate garbage on certain devices by @hongxiayang in https://github.com/vllm-project/vllm/pull/5641
- [misc][distributed] use localhost for single-node by @youkaichao in https://github.com/vllm-project/vllm/pull/5619
- [Model] Add FP8 kv cache for Qwen2 by @mgoin in https://github.com/vllm-project/vllm/pull/5656
- [Bugfix] Fix sampling_params passed incorrectly in Phi3v example by @Isotr0py in https://github.com/vllm-project/vllm/pull/5684
- [Misc]Add param max-model-len in benchmark_latency.py by @DearPlanet in https://github.com/vllm-project/vllm/pull/5629
- [CI/Build] Add tqdm to dependencies by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5680
- [ci] Add A100 queue into AWS CI template by @khluu in https://github.com/vllm-project/vllm/pull/5648
- [Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg in arg_utils.py by @mgoin in https://github.com/vllm-project/vllm/pull/5688
- [ci][distributed] add tests for custom allreduce by @youkaichao in https://github.com/vllm-project/vllm/pull/5689
- [Bugfix] AsyncLLMEngine hangs with asyncio.run by @zifeitong in https://github.com/vllm-project/vllm/pull/5654
- [Doc] Update docker references by @rafvasq in https://github.com/vllm-project/vllm/pull/5614
- [Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes by @dsikka in https://github.com/vllm-project/vllm/pull/5650
- [ci] Limit num gpus if specified for A100 by @khluu in https://github.com/vllm-project/vllm/pull/5694
- [Misc] Improve conftest by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5681
- [Bugfix][Doc] FIx Duplicate Explicit Target Name Errors by @ywang96 in https://github.com/vllm-project/vllm/pull/5703
- [Kernel] Update Cutlass int8 kernel configs for SM90 by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/5514
- [Model] Port over CLIPVisionModel for VLMs by @ywang96 in https://github.com/vllm-project/vllm/pull/5591
- [Kernel] Update Cutlass int8 kernel configs for SM80 by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/5275
- [Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5715
- [Frontend] Add FlexibleArgumentParser to support both underscore and dash in names by @mgoin in https://github.com/vllm-project/vllm/pull/5718
- [distributed][misc] use fork by default for mp by @youkaichao in https://github.com/vllm-project/vllm/pull/5669
- [Model] MLPSpeculator speculative decoding support by @JRosenkranz in https://github.com/vllm-project/vllm/pull/4947
- [Kernel] Add punica dimension for Qwen2 LoRA by @jinzhen-lin in https://github.com/vllm-project/vllm/pull/5441
- [BugFix] Fix test_phi3v.py by @CatherineSue in https://github.com/vllm-project/vllm/pull/5725
- [Bugfix] Add fully sharded layer for QKVParallelLinearWithLora by @jeejeelee in https://github.com/vllm-project/vllm/pull/5665
- [Core][Distributed] add shm broadcast by @youkaichao in https://github.com/vllm-project/vllm/pull/5399
- [Kernel][CPU] Add Quick `gelu` to CPU by @ywang96 in https://github.com/vllm-project/vllm/pull/5717
- [Doc] Documentation on supported hardware for quantization methods by @mgoin in https://github.com/vllm-project/vllm/pull/5745
- [BugFix] exclude version 1.15.0 for modelscope by @zhyncs in https://github.com/vllm-project/vllm/pull/5668
- [ci][test] fix ca test in main by @youkaichao in https://github.com/vllm-project/vllm/pull/5746
- [LoRA] Add support for pinning lora adapters in the LRU cache by @rohithkrn in https://github.com/vllm-project/vllm/pull/5603
- [CI][Hardware][Intel GPU] add Intel GPU(XPU) ci pipeline by @jikunshang in https://github.com/vllm-project/vllm/pull/5616
- [Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs by @DamonFool in https://github.com/vllm-project/vllm/pull/5710
- [Misc] Remove #4789 workaround left in vllm/entrypoints/openai/run_batch.py by @zifeitong in https://github.com/vllm-project/vllm/pull/5756
- [Bugfix] Fix pin_lora error in TPU executor by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5760
- [Docs][TPU] Add installation tip for TPU by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5761
- [core][distributed] improve shared memory broadcast by @youkaichao in https://github.com/vllm-project/vllm/pull/5754
- [BugFix] [Kernel] Add Cutlass2x fallback kernels by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/5744
- [Distributed] Add send and recv helpers by @andoorve in https://github.com/vllm-project/vllm/pull/5719
- [Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement by @Isotr0py in https://github.com/vllm-project/vllm/pull/5772
- [doc][faq] add warning to download models for every nodes by @youkaichao in https://github.com/vllm-project/vllm/pull/5783
- [Doc] Add "Suggest edit" button to doc pages by @mgoin in https://github.com/vllm-project/vllm/pull/5789
- [Doc] Add Phi-3-medium to list of supported models by @mgoin in https://github.com/vllm-project/vllm/pull/5788
- [Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args by @CatherineSue in https://github.com/vllm-project/vllm/pull/5795
- [ci] Remove aws template by @khluu in https://github.com/vllm-project/vllm/pull/5757
- [Doc] Add notice about breaking changes to VLMs by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5818
- [Speculative Decoding] Support draft model on different tensor-parallel size than target model by @wooyeonlee0 in https://github.com/vllm-project/vllm/pull/5414
- [Misc] Remove useless code in cpu_worker by @DamonFool in https://github.com/vllm-project/vllm/pull/5824
- [Core] Add fault tolerance for `RayTokenizerGroupPool` by @Yard1 in https://github.com/vllm-project/vllm/pull/5748
- [doc][distributed] add both gloo and nccl tests by @youkaichao in https://github.com/vllm-project/vllm/pull/5834
- [CI/Build] Add unit testing for FlexibleArgumentParser by @mgoin in https://github.com/vllm-project/vllm/pull/5798
- [Misc] Update `w4a16` `compressed-tensors` support to include `w8a16` by @dsikka in https://github.com/vllm-project/vllm/pull/5794
- [Hardware][TPU] Refactor TPU backend by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5831
- [Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes by @mawong-amd in https://github.com/vllm-project/vllm/pull/5422
- [Hardware][TPU] Raise errors for unsupported sampling params by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5850
- [CI/Build] Add E2E tests for MLPSpeculator by @tdoublep in https://github.com/vllm-project/vllm/pull/5791
- [Bugfix] Fix assertion in NeuronExecutor by @aws-patlange in https://github.com/vllm-project/vllm/pull/5841
- [Core] Refactor Worker and ModelRunner to consolidate control plane communication by @stephanie-wang in https://github.com/vllm-project/vllm/pull/5408
- [Misc][Doc] Add Example of using OpenAI Server with VLM by @ywang96 in https://github.com/vllm-project/vllm/pull/5832
- [bugfix][distributed] fix shm broadcast when the queue size is full by @youkaichao in https://github.com/vllm-project/vllm/pull/5801
- [Bugfix] Fix embedding to support 2D inputs by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5829
- [Bugfix][TPU] Fix KV cache size calculation by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5860
- [CI/Build] Refactor image test assets by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5821
- [Kernel] Adding bias epilogue support for `cutlass_scaled_mm` by @ProExpertProg in https://github.com/vllm-project/vllm/pull/5560
- [Frontend] Add tokenize/detokenize endpoints by @sasha0552 in https://github.com/vllm-project/vllm/pull/5054
- [Hardware][TPU] Support parallel sampling & Swapping by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5855
- [Bugfix][TPU] Fix CPU cache allocation by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5869
- Support CPU inference with VSX PowerPC ISA by @ChipKerchner in https://github.com/vllm-project/vllm/pull/5652
- [doc] update usage of env var to avoid conflict by @youkaichao in https://github.com/vllm-project/vllm/pull/5873
- [Misc] Add example for LLaVA-NeXT by @ywang96 in https://github.com/vllm-project/vllm/pull/5879
- [BugFix] Fix cuda graph for MLPSpeculator by @njhill in https://github.com/vllm-project/vllm/pull/5875
- [Doc] Add note about context length in Phi-3-Vision example by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5887
- [VLM][Bugfix] Make sure that `multi_modal_kwargs` is broadcasted properly by @xwjiang2010 in https://github.com/vllm-project/vllm/pull/5880
- [Model] Add base class for LoRA-supported models by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5018
- [Bugfix] Fix img_sizes Parsing in Phi3-Vision by @ywang96 in https://github.com/vllm-project/vllm/pull/5888
- [CI/Build] [1/3] Reorganize entrypoints tests by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5526
- [Model][Bugfix] Implicit model flags and reenable Phi-3-Vision by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5896
- [doc][misc] add note for Kubernetes users by @youkaichao in https://github.com/vllm-project/vllm/pull/5916
- [BugFix] Fix `MLPSpeculator` handling of `num_speculative_tokens` by @njhill in https://github.com/vllm-project/vllm/pull/5876
- [BugFix] Fix `min_tokens` behaviour for multiple eos tokens by @njhill in https://github.com/vllm-project/vllm/pull/5849
- [CI/Build] Fix Args for `_get_logits_warper` in Sampler Test by @ywang96 in https://github.com/vllm-project/vllm/pull/5922
- [Model] Add Gemma 2 by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5908
- [core][misc] remove logical block by @youkaichao in https://github.com/vllm-project/vllm/pull/5882
- [Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X by @divakar-amd in https://github.com/vllm-project/vllm/pull/5932
- [Hardware][TPU] Optimize KV cache swapping by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5878
- [VLM][BugFix] Make sure that `multi_modal_kwargs` can broadcast properly with ring buffer. by @xwjiang2010 in https://github.com/vllm-project/vllm/pull/5905
- [Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU runner by @Isotr0py in https://github.com/vllm-project/vllm/pull/5956
- [Core] Registry for processing model inputs by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5214
- Unmark fused_moe config json file as executable by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5960
- [Hardware][Intel] OpenVINO vLLM backend by @ilya-lavrenov in https://github.com/vllm-project/vllm/pull/5379
- [Bugfix] Better error message for MLPSpeculator when `num_speculative_tokens` is set too high by @tdoublep in https://github.com/vllm-project/vllm/pull/5894
- [CI/Build] [2/3] Reorganize entrypoints tests by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5904
- [Distributed] Make it clear that % should not be in tensor dict keys. by @xwjiang2010 in https://github.com/vllm-project/vllm/pull/5927
- [Spec Decode] Introduce DraftModelRunner by @comaniac in https://github.com/vllm-project/vllm/pull/5799
- [Bugfix] Fix compute datatype for cutlass 3.x epilogues by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5931
- [ Misc ] Remove `fp8_shard_indexer` from Col/Row Parallel Linear (Simplify Weight Loading) by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5928
- [ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5921
- Support Deepseek-V2 by @zwd003 in https://github.com/vllm-project/vllm/pull/4650
New Contributors
- @garg-amit made their first contribution in https://github.com/vllm-project/vllm/pull/5595
- @CharlesRiggins made their first contribution in https://github.com/vllm-project/vllm/pull/5592
- @bfontain made their first contribution in https://github.com/vllm-project/vllm/pull/5142
- @sergey-tinkoff made their first contribution in https://github.com/vllm-project/vllm/pull/5178
- @milo157 made their first contribution in https://github.com/vllm-project/vllm/pull/5553
- @ShukantPal made their first contribution in https://github.com/vllm-project/vllm/pull/5628
- @rafvasq made their first contribution in https://github.com/vllm-project/vllm/pull/5614
- @JRosenkranz made their first contribution in https://github.com/vllm-project/vllm/pull/4947
- @rohithkrn made their first contribution in https://github.com/vllm-project/vllm/pull/5603
- @wooyeonlee0 made their first contribution in https://github.com/vllm-project/vllm/pull/5414
- @aws-patlange made their first contribution in https://github.com/vllm-project/vllm/pull/5841
- @stephanie-wang made their first contribution in https://github.com/vllm-project/vllm/pull/5408
- @ProExpertProg made their first contribution in https://github.com/vllm-project/vllm/pull/5560
- @ChipKerchner made their first contribution in https://github.com/vllm-project/vllm/pull/5652
- @ilya-lavrenov made their first contribution in https://github.com/vllm-project/vllm/pull/5379
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.5.0.post1...vtest