v0.4.2
Release date: 2024-05-05 12:31:08
Latest release of vllm-project/vllm: v0.6.1 (2024-09-12 05:44:44)
Highlights
Features
- Chunked prefill is ready for testing! It improves inter-token latency in high-load scenarios by chunking prompt processing and prioritizing decode requests (#4580); see the sketch after this list
- Speculative decoding functionalities: logprobs (#4378), ngram (#4237)
- Support FlashInfer as attention backend (#4353)
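A minimal usage sketch for these features via the offline `LLM` entry point. The flag names (`enable_chunked_prefill`, `speculative_model="[ngram]"`, `ngram_prompt_lookup_max`, `VLLM_ATTENTION_BACKEND`) are taken from the engine arguments around this release and should be treated as assumptions rather than a definitive reference:

```python
from vllm import LLM, SamplingParams

# 1) Chunked prefill (#4580): long prompts are split into chunks and batched
#    together with decode requests, improving inter-token latency under load.
#    Assumption: max_num_batched_tokens is the per-step token budget to tune.
llm = LLM(
    model="facebook/opt-125m",        # placeholder model
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,
)
out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)

# 2) Ngram speculative decoding (#4237): the draft "model" is a prompt-lookup
#    ngram matcher, so no extra network is loaded. Assumption: the v2 block
#    manager is required for speculative decoding in this release.
spec_llm = LLM(
    model="facebook/opt-125m",
    speculative_model="[ngram]",
    num_speculative_tokens=5,
    ngram_prompt_lookup_max=4,
    use_v2_block_manager=True,
)

# 3) FlashInfer attention backend (#4353) is selected via an environment
#    variable before vLLM starts, e.g.:
#    VLLM_ATTENTION_BACKEND=FLASHINFER python -m vllm.entrypoints.openai.api_server ...
```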
Models and Enhancements
- Add support for Phi-3-mini (#4298, #4372, #4380)
- Add more histogram metrics (#2764, #4523)
- Full tensor parallelism for LoRA layers (#3524)
- Expanded the Marlin kernel to support all GPTQ models (#3922, #4466, #4533); see the sketch after this list
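As an illustration of the Marlin expansion, a GPTQ checkpoint can now be served through the Marlin kernels directly. This is a sketch assuming the `gptq_marlin` quantization method name added by #3922 and a placeholder model id:

```python
from vllm import LLM

# Assumption: "gptq_marlin" is the quantization method introduced in #3922;
# with #4533 it also covers 8-bit GPTQ checkpoints. The model id is a placeholder.
llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-GPTQ",
    quantization="gptq_marlin",
)
print(llm.generate(["The capital of France is"])[0].outputs[0].text)
```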
Dependency Upgrade
- Upgrade to `torch==2.3.0` (#4454)
- Upgrade to `tensorizer==2.9.0` (#4467)
- Expansion of AMD test suite (#4267)
Progress and Dev Experience
- Centralize and document all environment variables (#4548, #4574); see the sketch after this list
- Progress towards fully typed codebase (#4337, #4427, #4555, #4450)
- Progress towards pipeline parallelism (#4512, #4444, #4566)
- Progress towards multiprocessing based executors (#4348, #4402, #4419)
- Progress towards FP8 support (#4343, #4332, #4527)
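For the environment-variable centralization (#4548, #4574), a minimal sketch of the intended usage, assuming the variables are collected in a `vllm.envs` module as those PRs describe:

```python
import os

# Set VLLM_* variables before vLLM reads them, e.g. in the launching shell:
#   VLLM_ATTENTION_BACKEND=FLASHINFER python my_server.py
os.environ.setdefault("VLLM_ATTENTION_BACKEND", "FLASHINFER")

# Assumption: vllm/envs.py exposes every documented VLLM_* variable as a
# module-level attribute holding its parsed value.
import vllm.envs as envs

print(envs.VLLM_ATTENTION_BACKEND)
```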
What's Changed
- [Core][Distributed] use existing torch.cuda.device context manager by @youkaichao in https://github.com/vllm-project/vllm/pull/4318
- [Misc] Update ShareGPT Dataset Sampling in Serving Benchmark by @ywang96 in https://github.com/vllm-project/vllm/pull/4279
- [Bugfix] Fix marlin kernel crash on H100 by @alexm-nm in https://github.com/vllm-project/vllm/pull/4218
- [Doc] Add note for docker user by @youkaichao in https://github.com/vllm-project/vllm/pull/4340
- [Misc] Use public API in benchmark_throughput by @zifeitong in https://github.com/vllm-project/vllm/pull/4300
- [Model] Adds Phi-3 support by @caiom in https://github.com/vllm-project/vllm/pull/4298
- [Core] Move ray_utils.py from `engine` to `executor` package by @njhill in https://github.com/vllm-project/vllm/pull/4347
- [Bugfix][Model] Refactor OLMo model to support new HF format in transformers 4.40.0 by @Isotr0py in https://github.com/vllm-project/vllm/pull/4324
- [CI/Build] Adding functionality to reset the node's GPUs before processing. by @Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/4213
- [Doc] README Phi-3 name fix. by @caiom in https://github.com/vllm-project/vllm/pull/4372
- [Core]refactor aqlm quant ops by @jikunshang in https://github.com/vllm-project/vllm/pull/4351
- [Mypy] Typing lora folder by @rkooo567 in https://github.com/vllm-project/vllm/pull/4337
- [Misc] Optimize flash attention backend log by @esmeetu in https://github.com/vllm-project/vllm/pull/4368
- [Core] Add `shutdown()` method to `ExecutorBase` by @njhill in https://github.com/vllm-project/vllm/pull/4349
- [Core] Move function tracing setup to util function by @njhill in https://github.com/vllm-project/vllm/pull/4352
- [ROCm][Hardware][AMD][Doc] Documentation update for ROCm by @hongxiayang in https://github.com/vllm-project/vllm/pull/4376
- [Bugfix] Fix parameter name in `get_tokenizer` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/4107
- [Frontend] Add --log-level option to api server by @normster in https://github.com/vllm-project/vllm/pull/4377
- [CI] Disable non-lazy string operation on logging by @rkooo567 in https://github.com/vllm-project/vllm/pull/4326
- [Core] Refactoring sampler and support prompt logprob for chunked prefill by @rkooo567 in https://github.com/vllm-project/vllm/pull/4309
- [Misc][Refactor] Generalize linear_method to be quant_method by @comaniac in https://github.com/vllm-project/vllm/pull/4373
- [Misc] add RFC issue template by @youkaichao in https://github.com/vllm-project/vllm/pull/4401
- [Core] Introduce `DistributedGPUExecutor` abstract class by @njhill in https://github.com/vllm-project/vllm/pull/4348
- [Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales by @pcmoritz in https://github.com/vllm-project/vllm/pull/4343
- [Frontend][Bugfix] Disallow extra fields in OpenAI API by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/4355
- [Misc] Fix logger format typo by @esmeetu in https://github.com/vllm-project/vllm/pull/4396
- [ROCm][Hardware][AMD] Enable group query attention for triton FA by @hongxiayang in https://github.com/vllm-project/vllm/pull/4406
- [Kernel] Full Tensor Parallelism for LoRA Layers by @FurtherAI in https://github.com/vllm-project/vllm/pull/3524
- [Model] Phi-3 4k sliding window temp. fix by @caiom in https://github.com/vllm-project/vllm/pull/4380
- [Bugfix][Core] Fix get decoding config from ray by @esmeetu in https://github.com/vllm-project/vllm/pull/4335
- [Bugfix] Abort requests when the connection to /v1/completions is interrupted by @chestnut-Q in https://github.com/vllm-project/vllm/pull/4363
- [BugFix] Fix `min_tokens` when `eos_token_id` is None by @njhill in https://github.com/vllm-project/vllm/pull/4389
- ✨ support local cache for models by @prashantgupta24 in https://github.com/vllm-project/vllm/pull/4374
- [BugFix] Fix return type of executor execute_model methods by @njhill in https://github.com/vllm-project/vllm/pull/4402
- [BugFix] Resolved Issues For LinearMethod --> QuantConfig by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/4418
- [Misc] fix typo in llm_engine init logging by @DefTruth in https://github.com/vllm-project/vllm/pull/4428
- Add more Prometheus metrics by @ronensc in https://github.com/vllm-project/vllm/pull/2764
- [CI] clean docker cache for neuron by @simon-mo in https://github.com/vllm-project/vllm/pull/4441
- [mypy][5/N] Support all typing on model executor by @rkooo567 in https://github.com/vllm-project/vllm/pull/4427
- [Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/3922
- [CI] hotfix: soft fail neuron test by @simon-mo in https://github.com/vllm-project/vllm/pull/4458
- [Core][Distributed] use cpu group to broadcast metadata in cpu by @youkaichao in https://github.com/vllm-project/vllm/pull/4444
- [Misc] Upgrade to `torch==2.3.0` by @mgoin in https://github.com/vllm-project/vllm/pull/4454
- [Bugfix][Kernel] Fix compute_type for MoE kernel by @WoosukKwon in https://github.com/vllm-project/vllm/pull/4463
- [Core]Refactor gptq_marlin ops by @jikunshang in https://github.com/vllm-project/vllm/pull/4466
- [BugFix] fix num_lookahead_slots missing in async executor by @leiwen83 in https://github.com/vllm-project/vllm/pull/4165
- [Doc] add visualization for multi-stage dockerfile by @prashantgupta24 in https://github.com/vllm-project/vllm/pull/4456
- [Kernel] Support Fp8 Checkpoints (Dynamic + Static) by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/4332
- [Frontend] Support complex message content for chat completions endpoint by @fgreinacher in https://github.com/vllm-project/vllm/pull/3467
- [Frontend] [Core] Tensorizer: support dynamic `num_readers`, update version by @alpayariyak in https://github.com/vllm-project/vllm/pull/4467
- [Bugfix][Minor] Make ignore_eos effective by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/4468
- fix_tokenizer_snapshot_download_bug by @kingljl in https://github.com/vllm-project/vllm/pull/4493
- Unable to find Punica extension issue during source code installation by @kingljl in https://github.com/vllm-project/vllm/pull/4494
- [Core] Centralize GPU Worker construction by @njhill in https://github.com/vllm-project/vllm/pull/4419
- [Misc][Typo] type annotation fix by @HarryWu99 in https://github.com/vllm-project/vllm/pull/4495
- [Misc] fix typo in block manager by @Juelianqvq in https://github.com/vllm-project/vllm/pull/4453
- Allow user to define whitespace pattern for outlines by @robcaulk in https://github.com/vllm-project/vllm/pull/4305
- [Misc]Add customized information for models by @jeejeelee in https://github.com/vllm-project/vllm/pull/4132
- [Test] Add ignore_eos test by @rkooo567 in https://github.com/vllm-project/vllm/pull/4519
- [Bugfix] Fix the fp8 kv_cache check error that occurs when failing to obtain the CUDA version. by @AnyISalIn in https://github.com/vllm-project/vllm/pull/4173
- [Bugfix] Fix 307 Redirect for `/metrics` by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/4523
- [Doc] update(example model): for OpenAI compatible serving by @fpaupier in https://github.com/vllm-project/vllm/pull/4503
- [Bugfix] Use random seed if seed is -1 by @sasha0552 in https://github.com/vllm-project/vllm/pull/4531
- [CI/Build][Bugfix] VLLM_USE_PRECOMPILED should skip compilation by @tjohnson31415 in https://github.com/vllm-project/vllm/pull/4534
- [Speculative decoding] Add ngram prompt lookup decoding by @leiwen83 in https://github.com/vllm-project/vllm/pull/4237
- [Core] Enable prefix caching with block manager v2 enabled by @leiwen83 in https://github.com/vllm-project/vllm/pull/4142
- [Core] Add `multiproc_worker_utils` for multiprocessing-based workers by @njhill in https://github.com/vllm-project/vllm/pull/4357
- [Kernel] Update fused_moe tuning script for FP8 by @pcmoritz in https://github.com/vllm-project/vllm/pull/4457
- [Bugfix] Add validation for seed by @sasha0552 in https://github.com/vllm-project/vllm/pull/4529
- [Bugfix][Core] Fix and refactor logging stats by @esmeetu in https://github.com/vllm-project/vllm/pull/4336
- [Core][Distributed] fix pynccl del error by @youkaichao in https://github.com/vllm-project/vllm/pull/4508
- [Misc] Remove Mixtral device="cuda" declarations by @pcmoritz in https://github.com/vllm-project/vllm/pull/4543
- [Misc] Fix expert_ids shape in MoE by @WoosukKwon in https://github.com/vllm-project/vllm/pull/4517
- [MISC] Rework logger to enable pythonic custom logging configuration to be provided by @tdg5 in https://github.com/vllm-project/vllm/pull/4273
- [Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption by @rkooo567 in https://github.com/vllm-project/vllm/pull/4451
- [CI]Add regression tests to ensure the async engine generates metrics by @ronensc in https://github.com/vllm-project/vllm/pull/4524
- [mypy][6/N] Fix all the core subdirectory typing by @rkooo567 in https://github.com/vllm-project/vllm/pull/4450
- [Core][Distributed] enable multiple tp group by @youkaichao in https://github.com/vllm-project/vllm/pull/4512
- [Kernel] Support running GPTQ 8-bit models in Marlin by @alexm-nm in https://github.com/vllm-project/vllm/pull/4533
- [mypy][7/N] Cover all directories by @rkooo567 in https://github.com/vllm-project/vllm/pull/4555
- [Misc] Exclude the `tests` directory from being packaged by @itechbear in https://github.com/vllm-project/vllm/pull/4552
- [BugFix] Include target-device specific requirements.txt in sdist by @markmc in https://github.com/vllm-project/vllm/pull/4559
- [Misc] centralize all usage of environment variables by @youkaichao in https://github.com/vllm-project/vllm/pull/4548
- [kernel] fix sliding window in prefix prefill Triton kernel by @mmoskal in https://github.com/vllm-project/vllm/pull/4405
- [CI/Build] AMD CI pipeline with extended set of tests. by @Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/4267
- [Core] Ignore infeasible swap requests. by @rkooo567 in https://github.com/vllm-project/vllm/pull/4557
- [Core][Distributed] enable allreduce for multiple tp groups by @youkaichao in https://github.com/vllm-project/vllm/pull/4566
- [BugFix] Prevent the task of `_force_log` from being garbage collected by @Atry in https://github.com/vllm-project/vllm/pull/4567
- [Misc] remove chunk detected debug logs by @DefTruth in https://github.com/vllm-project/vllm/pull/4571
- [Doc] add env vars to the doc by @youkaichao in https://github.com/vllm-project/vllm/pull/4572
- [Core][Model runner refactoring 1/N] Refactor attn metadata term by @rkooo567 in https://github.com/vllm-project/vllm/pull/4518
- [Bugfix] Allow "None" or "" to be passed to CLI for string args that default to None by @mgoin in https://github.com/vllm-project/vllm/pull/4586
- Fix/async chat serving by @schoennenbeck in https://github.com/vllm-project/vllm/pull/2727
- [Kernel] Use flashinfer for decoding by @LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/4353
- [Speculative decoding] Support target-model logprobs by @cadedaniel in https://github.com/vllm-project/vllm/pull/4378
- [Misc] add installation time env vars by @youkaichao in https://github.com/vllm-project/vllm/pull/4574
- [Misc][Refactor] Introduce ExecuteModelData by @comaniac in https://github.com/vllm-project/vllm/pull/4540
- [Doc] Chunked Prefill Documentation by @rkooo567 in https://github.com/vllm-project/vllm/pull/4580
- [Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) by @mgoin in https://github.com/vllm-project/vllm/pull/4527
- [CI] check size of the wheels by @simon-mo in https://github.com/vllm-project/vllm/pull/4319
- [Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics by @DearPlanet in https://github.com/vllm-project/vllm/pull/3937
- bump version to v0.4.2 by @simon-mo in https://github.com/vllm-project/vllm/pull/4600
- [CI] Reduce wheel size by not shipping debug symbols by @simon-mo in https://github.com/vllm-project/vllm/pull/4602
New Contributors
- @zifeitong made their first contribution in https://github.com/vllm-project/vllm/pull/4300
- @caiom made their first contribution in https://github.com/vllm-project/vllm/pull/4298
- @Alexei-V-Ivanov-AMD made their first contribution in https://github.com/vllm-project/vllm/pull/4213
- @normster made their first contribution in https://github.com/vllm-project/vllm/pull/4377
- @FurtherAI made their first contribution in https://github.com/vllm-project/vllm/pull/3524
- @chestnut-Q made their first contribution in https://github.com/vllm-project/vllm/pull/4363
- @prashantgupta24 made their first contribution in https://github.com/vllm-project/vllm/pull/4374
- @fgreinacher made their first contribution in https://github.com/vllm-project/vllm/pull/3467
- @alpayariyak made their first contribution in https://github.com/vllm-project/vllm/pull/4467
- @HarryWu99 made their first contribution in https://github.com/vllm-project/vllm/pull/4495
- @Juelianqvq made their first contribution in https://github.com/vllm-project/vllm/pull/4453
- @robcaulk made their first contribution in https://github.com/vllm-project/vllm/pull/4305
- @AnyISalIn made their first contribution in https://github.com/vllm-project/vllm/pull/4173
- @sasha0552 made their first contribution in https://github.com/vllm-project/vllm/pull/4531
- @tdg5 made their first contribution in https://github.com/vllm-project/vllm/pull/4273
- @itechbear made their first contribution in https://github.com/vllm-project/vllm/pull/4552
- @markmc made their first contribution in https://github.com/vllm-project/vllm/pull/4559
- @Atry made their first contribution in https://github.com/vllm-project/vllm/pull/4567
- @schoennenbeck made their first contribution in https://github.com/vllm-project/vllm/pull/2727
- @DearPlanet made their first contribution in https://github.com/vllm-project/vllm/pull/3937
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.4.1...v0.4.2
1. vllm-0.4.2+cu118-cp310-cp310-manylinux1_x86_64.whl (64.58 MB)
2. vllm-0.4.2+cu118-cp311-cp311-manylinux1_x86_64.whl (64.59 MB)
3. vllm-0.4.2+cu118-cp38-cp38-manylinux1_x86_64.whl (64.58 MB)
4. vllm-0.4.2+cu118-cp39-cp39-manylinux1_x86_64.whl (64.58 MB)
5. vllm-0.4.2-cp310-cp310-manylinux1_x86_64.whl (64.6 MB)
6. vllm-0.4.2-cp311-cp311-manylinux1_x86_64.whl (64.6 MB)
7. vllm-0.4.2-cp38-cp38-manylinux1_x86_64.whl (64.6 MB)