v0.4.2
Release date: 2024-05-05 12:31:08
Latest release of vllm-project/vllm: v0.6.1 (2024-09-12 05:44:44)
Highlights
Features
- Chunked prefill is ready for testing! It improves inter-token latency in high-load scenarios by chunking prompt processing and prioritizing decode requests (#4580); see the sketch after this list
- Speculative decoding functionalities: logprobs (#4378), ngram (#4237)
- Support FlashInfer as attention backend (#4353)
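A minimal usage sketch for these features via the offline `LLM` entry point. The flag names (`enable_chunked_prefill`, `speculative_model="[ngram]"`, `ngram_prompt_lookup_max`, `VLLM_ATTENTION_BACKEND`) are taken from the engine arguments around this release and should be treated as assumptions rather than a definitive reference:

```python
from vllm import LLM, SamplingParams

# 1) Chunked prefill (#4580): long prompts are split into chunks and batched
#    together with decode requests, improving inter-token latency under load.
#    Assumption: max_num_batched_tokens is the per-step token budget to tune.
llm = LLM(
    model="facebook/opt-125m",        # placeholder model
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,
)
out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)

# 2) Ngram speculative decoding (#4237): the draft "model" is a prompt-lookup
#    ngram matcher, so no extra network is loaded. Assumption: the v2 block
#    manager is required for speculative decoding in this release.
spec_llm = LLM(
    model="facebook/opt-125m",
    speculative_model="[ngram]",
    num_speculative_tokens=5,
    ngram_prompt_lookup_max=4,
    use_v2_block_manager=True,
)

# 3) FlashInfer attention backend (#4353) is selected via an environment
#    variable before vLLM starts, e.g.:
#    VLLM_ATTENTION_BACKEND=FLASHINFER python -m vllm.entrypoints.openai.api_server ...
```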
Models and Enhancements
- Add support for Phi-3-mini (#4298, #4372, #4380)
- Add more histogram metrics (#2764, #4523)
- Full tensor parallelism for LoRA layers (#3524)
- Expanded the Marlin kernel to support all GPTQ models (#3922, #4466, #4533); see the sketch after this list
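As an illustration of the Marlin expansion, a GPTQ checkpoint can now be served through the Marlin kernels directly. This is a sketch assuming the `gptq_marlin` quantization method name added by #3922 and a placeholder model id:

```python
from vllm import LLM

# Assumption: "gptq_marlin" is the quantization method introduced in #3922;
# with #4533 it also covers 8-bit GPTQ checkpoints. The model id is a placeholder.
llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-GPTQ",
    quantization="gptq_marlin",
)
print(llm.generate(["The capital of France is"])[0].outputs[0].text)
```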
Dependency Upgrade
- Upgrade to `torch==2.3.0` (#4454)
- Upgrade to `tensorizer==2.9.0` (#4467)
- Expansion of AMD test suite (#4267)
Progress and Dev Experience
- Centralize and document all environment variables (#4548, #4574); see the sketch after this list
- Progress towards fully typed codebase (#4337, #4427, #4555, #4450)
- Progress towards pipeline parallelism (#4512, #4444, #4566)
- Progress towards multiprocessing based executors (#4348, #4402, #4419)
- Progress towards FP8 support (#4343, #4332, #4527)
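For the environment-variable centralization (#4548, #4574), a minimal sketch of the intended usage, assuming the variables are collected in a `vllm.envs` module as those PRs describe:

```python
import os

# Set VLLM_* variables before vLLM reads them, e.g. in the launching shell:
#   VLLM_ATTENTION_BACKEND=FLASHINFER python my_server.py
os.environ.setdefault("VLLM_ATTENTION_BACKEND", "FLASHINFER")

# Assumption: vllm/envs.py exposes every documented VLLM_* variable as a
# module-level attribute holding its parsed value.
import vllm.envs as envs

print(envs.VLLM_ATTENTION_BACKEND)
```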
What's Changed
- [Core][Distributed] use existing torch.cuda.device context manager by @youkaichao in https://github.com/vllm-project/vllm/pull/4318
- [Misc] Update ShareGPT Dataset Sampling in Serving Benchmark by @ywang96 in https://github.com/vllm-project/vllm/pull/4279
- [Bugfix] Fix marlin kernel crash on H100 by @alexm-nm in https://github.com/vllm-project/vllm/pull/4218
- [Doc] Add note for docker user by @youkaichao in https://github.com/vllm-project/vllm/pull/4340
- [Misc] Use public API in benchmark_throughput by @zifeitong in https://github.com/vllm-project/vllm/pull/4300
- [Model] Adds Phi-3 support by @caiom in https://github.com/vllm-project/vllm/pull/4298
- [Core] Move ray_utils.py from `engine` to `executor` package by @njhill in https://github.com/vllm-project/vllm/pull/4347
- [Bugfix][Model] Refactor OLMo model to support new HF format in transformers 4.40.0 by @Isotr0py in https://github.com/vllm-project/vllm/pull/4324
- [CI/Build] Adding functionality to reset the node's GPUs before processing. by @Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/4213
- [Doc] README Phi-3 name fix. by @caiom in https://github.com/vllm-project/vllm/pull/4372
- [Core]refactor aqlm quant ops by @jikunshang in https://github.com/vllm-project/vllm/pull/4351
- [Mypy] Typing lora folder by @rkooo567 in https://github.com/vllm-project/vllm/pull/4337
- [Misc] Optimize flash attention backend log by @esmeetu in https://github.com/vllm-project/vllm/pull/4368
- [Core] Add `shutdown()` method to `ExecutorBase` by @njhill in https://github.com/vllm-project/vllm/pull/4349
- [Core] Move function tracing setup to util function by @njhill in https://github.com/vllm-project/vllm/pull/4352
- [ROCm][Hardware][AMD][Doc] Documentation update for ROCm by @hongxiayang in https://github.com/vllm-project/vllm/pull/4376
- [Bugfix] Fix parameter name in `get_tokenizer` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/4107
- [Frontend] Add --log-level option to api server by @normster in https://github.com/vllm-project/vllm/pull/4377
- [CI] Disable non-lazy string operation on logging by @rkooo567 in https://github.com/vllm-project/vllm/pull/4326
- [Core] Refactoring sampler and support prompt logprob for chunked prefill by @rkooo567 in https://github.com/vllm-project/vllm/pull/4309
- [Misc][Refactor] Generalize linear_method to be quant_method by @comaniac in https://github.com/vllm-project/vllm/pull/4373
- [Misc] add RFC issue template by @youkaichao in https://github.com/vllm-project/vllm/pull/4401
- [Core] Introduce `DistributedGPUExecutor` abstract class by @njhill in https://github.com/vllm-project/vllm/pull/4348
- [Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales by @pcmoritz in https://github.com/vllm-project/vllm/pull/4343
- [Frontend][Bugfix] Disallow extra fields in OpenAI API by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/4355
- [Misc] Fix logger format typo by @esmeetu in https://github.com/vllm-project/vllm/pull/4396
- [ROCm][Hardware][AMD] Enable group query attention for triton FA by @hongxiayang in https://github.com/vllm-project/vllm/pull/4406
- [Kernel] Full Tensor Parallelism for LoRA Layers by @FurtherAI in https://github.com/vllm-project/vllm/pull/3524
- [Model] Phi-3 4k sliding window temp. fix by @caiom in https://github.com/vllm-project/vllm/pull/4380
- [Bugfix][Core] Fix get decoding config from ray by @esmeetu in https://github.com/vllm-project/vllm/pull/4335
- [Bugfix] Abort requests when the connection to /v1/completions is interrupted by @chestnut-Q in https://github.com/vllm-project/vllm/pull/4363
- [BugFix] Fix `min_tokens` when `eos_token_id` is None by @njhill in https://github.com/vllm-project/vllm/pull/4389
- ✨ support local cache for models by @prashantgupta24 in https://github.com/vllm-project/vllm/pull/4374
- [BugFix] Fix return type of executor execute_model methods by @njhill in https://github.com/vllm-project/vllm/pull/4402
- [BugFix] Resolved Issues For LinearMethod --> QuantConfig by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/4418
- [Misc] fix typo in llm_engine init logging by @DefTruth in https://github.com/vllm-project/vllm/pull/4428
- Add more Prometheus metrics by @ronensc in https://github.com/vllm-project/vllm/pull/2764
- [CI] clean docker cache for neuron by @simon-mo in https://github.com/vllm-project/vllm/pull/4441
- [mypy][5/N] Support all typing on model executor by @rkooo567 in https://github.com/vllm-project/vllm/pull/4427
- [Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/3922
- [CI] hotfix: soft fail neuron test by @simon-mo in https://github.com/vllm-project/vllm/pull/4458
- [Core][Distributed] use cpu group to broadcast metadata in cpu by @youkaichao in https://github.com/vllm-project/vllm/pull/4444
- [Misc] Upgrade to `torch==2.3.0` by @mgoin in https://github.com/vllm-project/vllm/pull/4454
- [Bugfix][Kernel] Fix compute_type for MoE kernel by @WoosukKwon in https://github.com/vllm-project/vllm/pull/4463
- [Core]Refactor gptq_marlin ops by @jikunshang in https://github.com/vllm-project/vllm/pull/4466
- [BugFix] fix num_lookahead_slots missing in async executor by @leiwen83 in https://github.com/vllm-project/vllm/pull/4165
- [Doc] add visualization for multi-stage dockerfile by @prashantgupta24 in https://github.com/vllm-project/vllm/pull/4456
- [Kernel] Support Fp8 Checkpoints (Dynamic + Static) by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/4332
- [Frontend] Support complex message content for chat completions endpoint by @fgreinacher in https://github.com/vllm-project/vllm/pull/3467
- [Frontend] [Core] Tensorizer: support dynamic `num_readers`, update version by @alpayariyak in https://github.com/vllm-project/vllm/pull/4467
- [Bugfix][Minor] Make ignore_eos effective by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/4468
- fix_tokenizer_snapshot_download_bug by @kingljl in https://github.com/vllm-project/vllm/pull/4493
- Unable to find Punica extension issue during source code installation by @kingljl in https://github.com/vllm-project/vllm/pull/4494
- [Core] Centralize GPU Worker construction by @njhill in https://github.com/vllm-project/vllm/pull/4419
- [Misc][Typo] type annotation fix by @HarryWu99 in https://github.com/vllm-project/vllm/pull/4495
- [Misc] fix typo in block manager by @Juelianqvq in https://github.com/vllm-project/vllm/pull/4453
- Allow user to define whitespace pattern for outlines by @robcaulk in https://github.com/vllm-project/vllm/pull/4305
- [Misc]Add customized information for models by @jeejeelee in https://github.com/vllm-project/vllm/pull/4132
- [Test] Add ignore_eos test by @rkooo567 in https://github.com/vllm-project/vllm/pull/4519
- [Bugfix] Fix the fp8 kv_cache check error that occurs when failing to obtain the CUDA version. by @AnyISalIn in https://github.com/vllm-project/vllm/pull/4173
- [Bugfix] Fix 307 Redirect for `/metrics` by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/4523
- [Doc] update(example model): for OpenAI compatible serving by @fpaupier in https://github.com/vllm-project/vllm/pull/4503
- [Bugfix] Use random seed if seed is -1 by @sasha0552 in https://github.com/vllm-project/vllm/pull/4531
- [CI/Build][Bugfix] VLLM_USE_PRECOMPILED should skip compilation by @tjohnson31415 in https://github.com/vllm-project/vllm/pull/4534
- [Speculative decoding] Add ngram prompt lookup decoding by @leiwen83 in https://github.com/vllm-project/vllm/pull/4237
- [Core] Enable prefix caching with block manager v2 enabled by @leiwen83 in https://github.com/vllm-project/vllm/pull/4142
- [Core] Add `multiproc_worker_utils` for multiprocessing-based workers by @njhill in https://github.com/vllm-project/vllm/pull/4357
- [Kernel] Update fused_moe tuning script for FP8 by @pcmoritz in https://github.com/vllm-project/vllm/pull/4457
- [Bugfix] Add validation for seed by @sasha0552 in https://github.com/vllm-project/vllm/pull/4529
- [Bugfix][Core] Fix and refactor logging stats by @esmeetu in https://github.com/vllm-project/vllm/pull/4336
- [Core][Distributed] fix pynccl del error by @youkaichao in https://github.com/vllm-project/vllm/pull/4508
- [Misc] Remove Mixtral device="cuda" declarations by @pcmoritz in https://github.com/vllm-project/vllm/pull/4543
- [Misc] Fix expert_ids shape in MoE by @WoosukKwon in https://github.com/vllm-project/vllm/pull/4517
- [MISC] Rework logger to enable pythonic custom logging configuration to be provided by @tdg5 in https://github.com/vllm-project/vllm/pull/4273
- [Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption by @rkooo567 in https://github.com/vllm-project/vllm/pull/4451
- [CI]Add regression tests to ensure the async engine generates metrics by @ronensc in https://github.com/vllm-project/vllm/pull/4524
- [mypy][6/N] Fix all the core subdirectory typing by @rkooo567 in https://github.com/vllm-project/vllm/pull/4450
- [Core][Distributed] enable multiple tp group by @youkaichao in https://github.com/vllm-project/vllm/pull/4512
- [Kernel] Support running GPTQ 8-bit models in Marlin by @alexm-nm in https://github.com/vllm-project/vllm/pull/4533
- [mypy][7/N] Cover all directories by @rkooo567 in https://github.com/vllm-project/vllm/pull/4555
- [Misc] Exclude the `tests` directory from being packaged by @itechbear in https://github.com/vllm-project/vllm/pull/4552
- [BugFix] Include target-device specific requirements.txt in sdist by @markmc in https://github.com/vllm-project/vllm/pull/4559
- [Misc] centralize all usage of environment variables by @youkaichao in https://github.com/vllm-project/vllm/pull/4548
- [kernel] fix sliding window in prefix prefill Triton kernel by @mmoskal in https://github.com/vllm-project/vllm/pull/4405
- [CI/Build] AMD CI pipeline with extended set of tests. by @Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/4267
- [Core] Ignore infeasible swap requests. by @rkooo567 in https://github.com/vllm-project/vllm/pull/4557
- [Core][Distributed] enable allreduce for multiple tp groups by @youkaichao in https://github.com/vllm-project/vllm/pull/4566
- [BugFix] Prevent the task of `_force_log` from being garbage collected by @Atry in https://github.com/vllm-project/vllm/pull/4567
- [Misc] remove chunk detected debug logs by @DefTruth in https://github.com/vllm-project/vllm/pull/4571
- [Doc] add env vars to the doc by @youkaichao in https://github.com/vllm-project/vllm/pull/4572
- [Core][Model runner refactoring 1/N] Refactor attn metadata term by @rkooo567 in https://github.com/vllm-project/vllm/pull/4518
- [Bugfix] Allow "None" or "" to be passed to CLI for string args that default to None by @mgoin in https://github.com/vllm-project/vllm/pull/4586
- Fix/async chat serving by @schoennenbeck in https://github.com/vllm-project/vllm/pull/2727
- [Kernel] Use flashinfer for decoding by @LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/4353
- [Speculative decoding] Support target-model logprobs by @cadedaniel in https://github.com/vllm-project/vllm/pull/4378
- [Misc] add installation time env vars by @youkaichao in https://github.com/vllm-project/vllm/pull/4574
- [Misc][Refactor] Introduce ExecuteModelData by @comaniac in https://github.com/vllm-project/vllm/pull/4540
- [Doc] Chunked Prefill Documentation by @rkooo567 in https://github.com/vllm-project/vllm/pull/4580
- [Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) by @mgoin in https://github.com/vllm-project/vllm/pull/4527
- [CI] check size of the wheels by @simon-mo in https://github.com/vllm-project/vllm/pull/4319
- [Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics by @DearPlanet in https://github.com/vllm-project/vllm/pull/3937
- bump version to v0.4.2 by @simon-mo in https://github.com/vllm-project/vllm/pull/4600
- [CI] Reduce wheel size by not shipping debug symbols by @simon-mo in https://github.com/vllm-project/vllm/pull/4602
New Contributors
- @zifeitong made their first contribution in https://github.com/vllm-project/vllm/pull/4300
- @caiom made their first contribution in https://github.com/vllm-project/vllm/pull/4298
- @Alexei-V-Ivanov-AMD made their first contribution in https://github.com/vllm-project/vllm/pull/4213
- @normster made their first contribution in https://github.com/vllm-project/vllm/pull/4377
- @FurtherAI made their first contribution in https://github.com/vllm-project/vllm/pull/3524
- @chestnut-Q made their first contribution in https://github.com/vllm-project/vllm/pull/4363
- @prashantgupta24 made their first contribution in https://github.com/vllm-project/vllm/pull/4374
- @fgreinacher made their first contribution in https://github.com/vllm-project/vllm/pull/3467
- @alpayariyak made their first contribution in https://github.com/vllm-project/vllm/pull/4467
- @HarryWu99 made their first contribution in https://github.com/vllm-project/vllm/pull/4495
- @Juelianqvq made their first contribution in https://github.com/vllm-project/vllm/pull/4453
- @robcaulk made their first contribution in https://github.com/vllm-project/vllm/pull/4305
- @AnyISalIn made their first contribution in https://github.com/vllm-project/vllm/pull/4173
- @sasha0552 made their first contribution in https://github.com/vllm-project/vllm/pull/4531
- @tdg5 made their first contribution in https://github.com/vllm-project/vllm/pull/4273
- @itechbear made their first contribution in https://github.com/vllm-project/vllm/pull/4552
- @markmc made their first contribution in https://github.com/vllm-project/vllm/pull/4559
- @Atry made their first contribution in https://github.com/vllm-project/vllm/pull/4567
- @schoennenbeck made their first contribution in https://github.com/vllm-project/vllm/pull/2727
- @DearPlanet made their first contribution in https://github.com/vllm-project/vllm/pull/3937
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.4.1...v0.4.2
1. vllm-0.4.2+cu118-cp310-cp310-manylinux1_x86_64.whl (64.58 MB)
2. vllm-0.4.2+cu118-cp311-cp311-manylinux1_x86_64.whl (64.59 MB)
3. vllm-0.4.2+cu118-cp38-cp38-manylinux1_x86_64.whl (64.58 MB)
4. vllm-0.4.2+cu118-cp39-cp39-manylinux1_x86_64.whl (64.58 MB)
5. vllm-0.4.2-cp310-cp310-manylinux1_x86_64.whl (64.6 MB)
6. vllm-0.4.2-cp311-cp311-manylinux1_x86_64.whl (64.6 MB)
7. vllm-0.4.2-cp38-cp38-manylinux1_x86_64.whl (64.6 MB)