vtest
Release date: 2024-07-02 04:19:54
Latest release of vllm-project/vllm: v0.6.1 (2024-09-12 05:44:44)
What's Changed
- [CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with `perf-benchmarks` label by @KuntaiDu in https://github.com/vllm-project/vllm/pull/5073
- [CI/Build] Disable LLaVA-NeXT CPU test by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5529
- [Kernel] Fix CUTLASS 3.x custom broadcast load epilogue by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5516
- [Misc] Fix arg names by @AllenDou in https://github.com/vllm-project/vllm/pull/5524
- [ Misc ] Rs/compressed tensors cleanup by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5432
- [Kernel] Suppress mma.sp warning on CUDA 12.5 and later by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5401
- [mis] fix flaky test of test_cuda_device_count_stateless by @youkaichao in https://github.com/vllm-project/vllm/pull/5546
- [Core] Remove duplicate processing in async engine by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5525
- [misc][distributed] fix benign error in `is_in_the_same_node` by @youkaichao in https://github.com/vllm-project/vllm/pull/5512
- [Docs] Add ZhenFund as a Sponsor by @simon-mo in https://github.com/vllm-project/vllm/pull/5548
- [Doc] Update documentation on Tensorizer by @sangstar in https://github.com/vllm-project/vllm/pull/5471
- [Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models by @tdoublep in https://github.com/vllm-project/vllm/pull/5460
- [Bugfix] Fix typo in Pallas backend by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5558
- [Core][Distributed] improve p2p cache generation by @youkaichao in https://github.com/vllm-project/vllm/pull/5528
- Add ccache to amd by @simon-mo in https://github.com/vllm-project/vllm/pull/5555
- [Core][Bugfix]: fix prefix caching for blockv2 by @leiwen83 in https://github.com/vllm-project/vllm/pull/5364
- [mypy] Enable type checking for test directory by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5017
- [CI/Build] Test both text and token IDs in batched OpenAI Completions API by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5568
- [misc] Do not allow to use lora with chunked prefill. by @rkooo567 in https://github.com/vllm-project/vllm/pull/5538
- add gptq_marlin test for bug report https://github.com/vllm-project/vllm/issues/5088 by @alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/5145
- [BugFix] Don't start a Ray cluster when not using Ray by @njhill in https://github.com/vllm-project/vllm/pull/5570
- [Fix] Correct OpenAI batch response format by @zifeitong in https://github.com/vllm-project/vllm/pull/5554
- Add basic correctness 2 GPU tests to 4 GPU pipeline by @Yard1 in https://github.com/vllm-project/vllm/pull/5518
- [CI][BugFix] Flip is_quant_method_supported condition by @mgoin in https://github.com/vllm-project/vllm/pull/5577
- [build][misc] limit numpy version by @youkaichao in https://github.com/vllm-project/vllm/pull/5582
- [Doc] add debugging tips for crash and multi-node debugging by @youkaichao in https://github.com/vllm-project/vllm/pull/5581
- Fix w8a8 benchmark and add Llama-3-8B by @comaniac in https://github.com/vllm-project/vllm/pull/5562
- [Model] Rename Phi3 rope scaling type by @garg-amit in https://github.com/vllm-project/vllm/pull/5595
- Correct alignment in the seq_len diagram. by @CharlesRiggins in https://github.com/vllm-project/vllm/pull/5592
- [Kernel] `compressed-tensors` marlin 24 support by @dsikka in https://github.com/vllm-project/vllm/pull/5435
- [Misc] use AutoTokenizer for benchmark serving when vLLM not installed by @zhyncs in https://github.com/vllm-project/vllm/pull/5588
- [Hardware][Intel GPU]Add Initial Intel GPU(XPU) inference backend by @jikunshang in https://github.com/vllm-project/vllm/pull/3814
- [CI/BUILD] Support non-AVX512 vLLM building and testing by @DamonFool in https://github.com/vllm-project/vllm/pull/5574
- [CI] Improve the readability of performance benchmarking results and prepare for upcoming performance dashboard by @KuntaiDu in https://github.com/vllm-project/vllm/pull/5571
- [bugfix][distributed] fix 16 gpus local rank arrangement by @youkaichao in https://github.com/vllm-project/vllm/pull/5604
- [Optimization] use a pool to reuse LogicalTokenBlock.token_ids by @youkaichao in https://github.com/vllm-project/vllm/pull/5584
- [Bugfix] Fix KV head calculation for MPT models when using GQA by @bfontain in https://github.com/vllm-project/vllm/pull/5142
- [Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py by @zifeitong in https://github.com/vllm-project/vllm/pull/5606
- [Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier by @sroy745 in https://github.com/vllm-project/vllm/pull/5131
- [Model] Initialize Phi-3-vision support by @Isotr0py in https://github.com/vllm-project/vllm/pull/4986
- [Kernel] Add punica dimensions for Granite 13b by @joerunde in https://github.com/vllm-project/vllm/pull/5559
- [misc][typo] fix typo by @youkaichao in https://github.com/vllm-project/vllm/pull/5620
- [Misc] Fix typo by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5618
- [CI] Avoid naming different metrics with the same name in performance benchmark by @KuntaiDu in https://github.com/vllm-project/vllm/pull/5615
- [bugfix][distributed] do not error if two processes do not agree on p2p capability by @youkaichao in https://github.com/vllm-project/vllm/pull/5612
- [Misc] Remove import from transformers logging by @CatherineSue in https://github.com/vllm-project/vllm/pull/5625
- [CI/Build][Misc] Update Pytest Marker for VLMs by @ywang96 in https://github.com/vllm-project/vllm/pull/5623
- [ci] Deprecate original CI template by @khluu in https://github.com/vllm-project/vllm/pull/5624
- [Misc] Add OpenTelemetry support by @ronensc in https://github.com/vllm-project/vllm/pull/4687
- [Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization by @dsikka in https://github.com/vllm-project/vllm/pull/5542
- [ci] Setup Release pipeline and build release wheels with cache by @khluu in https://github.com/vllm-project/vllm/pull/5610
- [Model] LoRA support added for command-r by @sergey-tinkoff in https://github.com/vllm-project/vllm/pull/5178
- [Bugfix] Fix for inconsistent behaviour related to sampling and repetition penalties by @tdoublep in https://github.com/vllm-project/vllm/pull/5639
- [Doc] Added cerebrium as Integration option by @milo157 in https://github.com/vllm-project/vllm/pull/5553
- [Bugfix] Fix CUDA version check for mma warning suppression by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5642
- [Bugfix] Fix w8a8 benchmarks for int8 case by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5643
- [Bugfix] Fix Phi-3 Long RoPE scaling implementation by @ShukantPal in https://github.com/vllm-project/vllm/pull/5628
- [Bugfix] Added test for sampling repetition penalty bug. by @tdoublep in https://github.com/vllm-project/vllm/pull/5659
- [Bugfix][CI/Build][AMD][ROCm]Fixed the cmake build bug which generate garbage on certain devices by @hongxiayang in https://github.com/vllm-project/vllm/pull/5641
- [misc][distributed] use localhost for single-node by @youkaichao in https://github.com/vllm-project/vllm/pull/5619
- [Model] Add FP8 kv cache for Qwen2 by @mgoin in https://github.com/vllm-project/vllm/pull/5656
- [Bugfix] Fix sampling_params passed incorrectly in Phi3v example by @Isotr0py in https://github.com/vllm-project/vllm/pull/5684
- [Misc]Add param max-model-len in benchmark_latency.py by @DearPlanet in https://github.com/vllm-project/vllm/pull/5629
- [CI/Build] Add tqdm to dependencies by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5680
- [ci] Add A100 queue into AWS CI template by @khluu in https://github.com/vllm-project/vllm/pull/5648
- [Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg in arg_utils.py by @mgoin in https://github.com/vllm-project/vllm/pull/5688
- [ci][distributed] add tests for custom allreduce by @youkaichao in https://github.com/vllm-project/vllm/pull/5689
- [Bugfix] AsyncLLMEngine hangs with asyncio.run by @zifeitong in https://github.com/vllm-project/vllm/pull/5654
- [Doc] Update docker references by @rafvasq in https://github.com/vllm-project/vllm/pull/5614
- [Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes by @dsikka in https://github.com/vllm-project/vllm/pull/5650
- [ci] Limit num gpus if specified for A100 by @khluu in https://github.com/vllm-project/vllm/pull/5694
- [Misc] Improve conftest by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5681
- [Bugfix][Doc] FIx Duplicate Explicit Target Name Errors by @ywang96 in https://github.com/vllm-project/vllm/pull/5703
- [Kernel] Update Cutlass int8 kernel configs for SM90 by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/5514
- [Model] Port over CLIPVisionModel for VLMs by @ywang96 in https://github.com/vllm-project/vllm/pull/5591
- [Kernel] Update Cutlass int8 kernel configs for SM80 by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/5275
- [Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5715
- [Frontend] Add FlexibleArgumentParser to support both underscore and dash in names by @mgoin in https://github.com/vllm-project/vllm/pull/5718
- [distributed][misc] use fork by default for mp by @youkaichao in https://github.com/vllm-project/vllm/pull/5669
- [Model] MLPSpeculator speculative decoding support by @JRosenkranz in https://github.com/vllm-project/vllm/pull/4947
- [Kernel] Add punica dimension for Qwen2 LoRA by @jinzhen-lin in https://github.com/vllm-project/vllm/pull/5441
- [BugFix] Fix test_phi3v.py by @CatherineSue in https://github.com/vllm-project/vllm/pull/5725
- [Bugfix] Add fully sharded layer for QKVParallelLinearWithLora by @jeejeelee in https://github.com/vllm-project/vllm/pull/5665
- [Core][Distributed] add shm broadcast by @youkaichao in https://github.com/vllm-project/vllm/pull/5399
- [Kernel][CPU] Add Quick `gelu` to CPU by @ywang96 in https://github.com/vllm-project/vllm/pull/5717
- [Doc] Documentation on supported hardware for quantization methods by @mgoin in https://github.com/vllm-project/vllm/pull/5745
- [BugFix] exclude version 1.15.0 for modelscope by @zhyncs in https://github.com/vllm-project/vllm/pull/5668
- [ci][test] fix ca test in main by @youkaichao in https://github.com/vllm-project/vllm/pull/5746
- [LoRA] Add support for pinning lora adapters in the LRU cache by @rohithkrn in https://github.com/vllm-project/vllm/pull/5603
- [CI][Hardware][Intel GPU] add Intel GPU(XPU) ci pipeline by @jikunshang in https://github.com/vllm-project/vllm/pull/5616
- [Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs by @DamonFool in https://github.com/vllm-project/vllm/pull/5710
- [Misc] Remove #4789 workaround left in vllm/entrypoints/openai/run_batch.py by @zifeitong in https://github.com/vllm-project/vllm/pull/5756
- [Bugfix] Fix pin_lora error in TPU executor by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5760
- [Docs][TPU] Add installation tip for TPU by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5761
- [core][distributed] improve shared memory broadcast by @youkaichao in https://github.com/vllm-project/vllm/pull/5754
- [BugFix] [Kernel] Add Cutlass2x fallback kernels by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/5744
- [Distributed] Add send and recv helpers by @andoorve in https://github.com/vllm-project/vllm/pull/5719
- [Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement by @Isotr0py in https://github.com/vllm-project/vllm/pull/5772
- [doc][faq] add warning to download models for every nodes by @youkaichao in https://github.com/vllm-project/vllm/pull/5783
- [Doc] Add "Suggest edit" button to doc pages by @mgoin in https://github.com/vllm-project/vllm/pull/5789
- [Doc] Add Phi-3-medium to list of supported models by @mgoin in https://github.com/vllm-project/vllm/pull/5788
- [Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args by @CatherineSue in https://github.com/vllm-project/vllm/pull/5795
- [ci] Remove aws template by @khluu in https://github.com/vllm-project/vllm/pull/5757
- [Doc] Add notice about breaking changes to VLMs by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5818
- [Speculative Decoding] Support draft model on different tensor-parallel size than target model by @wooyeonlee0 in https://github.com/vllm-project/vllm/pull/5414
- [Misc] Remove useless code in cpu_worker by @DamonFool in https://github.com/vllm-project/vllm/pull/5824
- [Core] Add fault tolerance for `RayTokenizerGroupPool` by @Yard1 in https://github.com/vllm-project/vllm/pull/5748
- [doc][distributed] add both gloo and nccl tests by @youkaichao in https://github.com/vllm-project/vllm/pull/5834
- [CI/Build] Add unit testing for FlexibleArgumentParser by @mgoin in https://github.com/vllm-project/vllm/pull/5798
- [Misc] Update `w4a16` `compressed-tensors` support to include `w8a16` by @dsikka in https://github.com/vllm-project/vllm/pull/5794
- [Hardware][TPU] Refactor TPU backend by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5831
- [Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes by @mawong-amd in https://github.com/vllm-project/vllm/pull/5422
- [Hardware][TPU] Raise errors for unsupported sampling params by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5850
- [CI/Build] Add E2E tests for MLPSpeculator by @tdoublep in https://github.com/vllm-project/vllm/pull/5791
- [Bugfix] Fix assertion in NeuronExecutor by @aws-patlange in https://github.com/vllm-project/vllm/pull/5841
- [Core] Refactor Worker and ModelRunner to consolidate control plane communication by @stephanie-wang in https://github.com/vllm-project/vllm/pull/5408
- [Misc][Doc] Add Example of using OpenAI Server with VLM by @ywang96 in https://github.com/vllm-project/vllm/pull/5832
- [bugfix][distributed] fix shm broadcast when the queue size is full by @youkaichao in https://github.com/vllm-project/vllm/pull/5801
- [Bugfix] Fix embedding to support 2D inputs by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5829
- [Bugfix][TPU] Fix KV cache size calculation by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5860
- [CI/Build] Refactor image test assets by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5821
- [Kernel] Adding bias epilogue support for `cutlass_scaled_mm` by @ProExpertProg in https://github.com/vllm-project/vllm/pull/5560
- [Frontend] Add tokenize/detokenize endpoints by @sasha0552 in https://github.com/vllm-project/vllm/pull/5054
- [Hardware][TPU] Support parallel sampling & Swapping by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5855
- [Bugfix][TPU] Fix CPU cache allocation by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5869
- Support CPU inference with VSX PowerPC ISA by @ChipKerchner in https://github.com/vllm-project/vllm/pull/5652
- [doc] update usage of env var to avoid conflict by @youkaichao in https://github.com/vllm-project/vllm/pull/5873
- [Misc] Add example for LLaVA-NeXT by @ywang96 in https://github.com/vllm-project/vllm/pull/5879
- [BugFix] Fix cuda graph for MLPSpeculator by @njhill in https://github.com/vllm-project/vllm/pull/5875
- [Doc] Add note about context length in Phi-3-Vision example by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5887
- [VLM][Bugfix] Make sure that `multi_modal_kwargs` is broadcasted properly by @xwjiang2010 in https://github.com/vllm-project/vllm/pull/5880
- [Model] Add base class for LoRA-supported models by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5018
- [Bugfix] Fix img_sizes Parsing in Phi3-Vision by @ywang96 in https://github.com/vllm-project/vllm/pull/5888
- [CI/Build] [1/3] Reorganize entrypoints tests by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5526
- [Model][Bugfix] Implicit model flags and reenable Phi-3-Vision by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5896
- [doc][misc] add note for Kubernetes users by @youkaichao in https://github.com/vllm-project/vllm/pull/5916
- [BugFix] Fix `MLPSpeculator` handling of `num_speculative_tokens` by @njhill in https://github.com/vllm-project/vllm/pull/5876
- [BugFix] Fix `min_tokens` behaviour for multiple eos tokens by @njhill in https://github.com/vllm-project/vllm/pull/5849
- [CI/Build] Fix Args for `_get_logits_warper` in Sampler Test by @ywang96 in https://github.com/vllm-project/vllm/pull/5922
- [Model] Add Gemma 2 by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5908
- [core][misc] remove logical block by @youkaichao in https://github.com/vllm-project/vllm/pull/5882
- [Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X by @divakar-amd in https://github.com/vllm-project/vllm/pull/5932
- [Hardware][TPU] Optimize KV cache swapping by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5878
- [VLM][BugFix] Make sure that `multi_modal_kwargs` can broadcast properly with ring buffer. by @xwjiang2010 in https://github.com/vllm-project/vllm/pull/5905
- [Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU runner by @Isotr0py in https://github.com/vllm-project/vllm/pull/5956
- [Core] Registry for processing model inputs by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5214
- Unmark fused_moe config json file as executable by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5960
- [Hardware][Intel] OpenVINO vLLM backend by @ilya-lavrenov in https://github.com/vllm-project/vllm/pull/5379
- [Bugfix] Better error message for MLPSpeculator when `num_speculative_tokens` is set too high by @tdoublep in https://github.com/vllm-project/vllm/pull/5894
- [CI/Build] [2/3] Reorganize entrypoints tests by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5904
- [Distributed] Make it clear that % should not be in tensor dict keys. by @xwjiang2010 in https://github.com/vllm-project/vllm/pull/5927
- [Spec Decode] Introduce DraftModelRunner by @comaniac in https://github.com/vllm-project/vllm/pull/5799
- [Bugfix] Fix compute datatype for cutlass 3.x epilogues by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5931
- [ Misc ] Remove `fp8_shard_indexer` from Col/Row Parallel Linear (Simplify Weight Loading) by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5928
- [ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5921
- Support Deepseek-V2 by @zwd003 in https://github.com/vllm-project/vllm/pull/4650
New Contributors
- @garg-amit made their first contribution in https://github.com/vllm-project/vllm/pull/5595
- @CharlesRiggins made their first contribution in https://github.com/vllm-project/vllm/pull/5592
- @bfontain made their first contribution in https://github.com/vllm-project/vllm/pull/5142
- @sergey-tinkoff made their first contribution in https://github.com/vllm-project/vllm/pull/5178
- @milo157 made their first contribution in https://github.com/vllm-project/vllm/pull/5553
- @ShukantPal made their first contribution in https://github.com/vllm-project/vllm/pull/5628
- @rafvasq made their first contribution in https://github.com/vllm-project/vllm/pull/5614
- @JRosenkranz made their first contribution in https://github.com/vllm-project/vllm/pull/4947
- @rohithkrn made their first contribution in https://github.com/vllm-project/vllm/pull/5603
- @wooyeonlee0 made their first contribution in https://github.com/vllm-project/vllm/pull/5414
- @aws-patlange made their first contribution in https://github.com/vllm-project/vllm/pull/5841
- @stephanie-wang made their first contribution in https://github.com/vllm-project/vllm/pull/5408
- @ProExpertProg made their first contribution in https://github.com/vllm-project/vllm/pull/5560
- @ChipKerchner made their first contribution in https://github.com/vllm-project/vllm/pull/5652
- @ilya-lavrenov made their first contribution in https://github.com/vllm-project/vllm/pull/5379
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.5.0.post1...vtest