v0.5.3
Release date: 2024-07-23 15:01:03
Latest release of vllm-project/vllm: v0.6.1 (2024-09-12 05:44:44)
Highlights
Model Support
- vLLM now supports Meta Llama 3.1! Please check out our blog here for initial details on running the model.
- Please check out this thread for any known issues related to the model.
- The model runs on a single 8xH100 or 8xA100 node using FP8 quantization (#6606, #6547, #6487, #6593, #6511, #6515, #6552)
- The BF16 version of the model should run on multiple nodes using pipeline parallelism (docs). If you have fast network interconnect, you might want to consider full tensor parallelism as well; see the sketch after this list. (#6599, #6598, #6529, #6569)
- To support long context, a new RoPE extension method has been added, and chunked prefill is now enabled by default for the Meta Llama 3.1 series of models. (#6666, #6553, #6673)
- Support Mistral-Nemo (#6548)
- Support Chameleon (#6633, #5770)
- Pipeline parallel support for Mixtral (#6516)
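As a rough, hedged illustration of the tensor-parallel and FP8 options above, the sketch below uses vLLM's offline `LLM` API. The checkpoint id, parallel degree, and quantization setting are illustrative assumptions; adjust them to your hardware and model size.

```python
# Illustrative sketch only: run a Llama 3.1 checkpoint across 8 GPUs on one node
# with FP8 quantization. The model id and sizes below are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",  # assumed checkpoint id
    tensor_parallel_size=8,   # shard the model across 8 GPUs (full tensor parallelism)
    quantization="fp8",       # FP8 quantization, as described in the highlights
)

outputs = llm.generate(
    ["Summarize the Llama 3.1 release in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

For multi-node BF16 runs, the corresponding server-side knobs are `--pipeline-parallel-size` together with `--tensor-parallel-size` on `vllm serve`, as covered in the pipeline-parallelism docs linked above.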
Hardware Support
- Many enhancements to TPU support. (#6277, #6457, #6506, #6504)
Performance Enhancements
- Add AWQ support to the Marlin kernel. This brings significant (1.5-2x) performance improvements to existing AWQ models; see the sketch after this list. (#6612)
- Progress towards refactoring for SPMD worker execution. (#6032)
- Progress on improving the prepare-inputs procedure. (#6164, #6338, #6596)
- Memory optimization for pipeline parallelism. (#6455)
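For the AWQ-Marlin change, a minimal hedged sketch: load an existing AWQ checkpoint (the model id below is an assumption) and let vLLM pick the kernel from the checkpoint's quantization config; on supported GPUs this release routes AWQ layers through the Marlin kernel.

```python
# Minimal sketch: reuse an existing AWQ-quantized checkpoint.
# The model id is an illustrative assumption; any AWQ checkpoint should behave the same.
from vllm import LLM

llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ")
print(llm.generate(["Hello, my name is"])[0].outputs[0].text)
```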
Production Engine
- Correctness testing for pipeline parallel and CPU offloading (#6410, #6549)
- Support dynamically loading LoRA adapters from HuggingFace; see the sketch after this list. (#6234)
- Pipeline Parallel using stdlib multiprocessing module (#6130)
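Below is a hedged sketch of the dynamic LoRA loading highlight, assuming the adapter reference passed to `LoRARequest` may be a HuggingFace repo id rather than only a local path; the base model and adapter ids are illustrative.

```python
# Hedged sketch: per-request LoRA, with the adapter pulled from HuggingFace.
# Base model and adapter repo ids are illustrative assumptions.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

# With this release, the adapter reference can be a HuggingFace repo id that
# vLLM downloads on first use, instead of requiring a pre-downloaded local path.
lora = LoRARequest("sql-adapter", 1, "yard1/llama-2-7b-sql-lora-test")

out = llm.generate(
    ["Write a SQL query that counts users per country."],
    SamplingParams(max_tokens=64),
    lora_request=lora,
)
print(out[0].outputs[0].text)
```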
Others
- A CPU offloading implementation: you can now use `--cpu-offload-gb` to control how much CPU RAM to use to "extend" GPU memory; see the sketch after this list. (#6496)
- The new `vllm` CLI is now ready for testing. It comes with three commands: `serve`, `complete`, and `chat`. Feedback and improvements are greatly welcomed! (#6431)
- The wheels now build on Ubuntu 20.04 instead of 22.04. (#6517)
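A minimal sketch of the CPU offloading option via the Python equivalent of `--cpu-offload-gb`; the model id and offload size are illustrative assumptions.

```python
# Minimal sketch: let vLLM "extend" GPU memory with up to 4 GiB of CPU RAM
# for model weights. Model id and offload size are illustrative assumptions;
# the cpu_offload_gb kwarg mirrors the --cpu-offload-gb flag of `vllm serve`.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # assumed checkpoint id
    cpu_offload_gb=4,  # offload up to 4 GiB of weights to CPU memory per GPU
)
print(llm.generate(["The capital of France is"])[0].outputs[0].text)
```

On the new CLI, the same setting is exposed as `--cpu-offload-gb` on `vllm serve`.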
What's Changed
- [Docs] Add Google Cloud to sponsor list by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6450
- [Misc] Add CustomOp Interface to UnquantizedFusedMoEMethod by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6289
- [CI/Build][TPU] Add TPU CI test by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6277
- Pin sphinx-argparse version by @khluu in https://github.com/vllm-project/vllm/pull/6453
- [BugFix][Model] Jamba - Handle aborted requests, Add tests and fix cleanup bug by @mzusman in https://github.com/vllm-project/vllm/pull/6425
- [Bugfix][CI/Build] Test prompt adapters in openai entrypoint tests by @g-eoj in https://github.com/vllm-project/vllm/pull/6419
- [Docs] Announce 5th meetup by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6458
- [CI/Build] vLLM cache directory for images by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/6444
- [Frontend] Support for chat completions input in the tokenize endpoint by @sasha0552 in https://github.com/vllm-project/vllm/pull/5923
- [Misc] Fix typos in spec. decode metrics logging. by @tdoublep in https://github.com/vllm-project/vllm/pull/6470
- [Core] Use numpy to speed up padded token processing by @peng1999 in https://github.com/vllm-project/vllm/pull/6442
- [CI/Build] Remove "boardwalk" image asset by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/6460
- [doc][misc] remind users to cancel debugging environment variables after debugging by @youkaichao in https://github.com/vllm-project/vllm/pull/6481
- [Hardware][TPU] Support MoE with Pallas GMM kernel by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6457
- [Doc] Fix the lora adapter path in server startup script by @Jeffwan in https://github.com/vllm-project/vllm/pull/6230
- [Misc] Log spec decode metrics by @comaniac in https://github.com/vllm-project/vllm/pull/6454
- [Kernel][Attention] Separate `Attention.kv_scale` into `k_scale` and `v_scale` by @mgoin in https://github.com/vllm-project/vllm/pull/6081
- [ci][distributed] add pipeline parallel correctness test by @youkaichao in https://github.com/vllm-project/vllm/pull/6410
- [misc][distributed] improve tests by @youkaichao in https://github.com/vllm-project/vllm/pull/6488
- [misc][distributed] add seed to dummy weights by @youkaichao in https://github.com/vllm-project/vllm/pull/6491
- [Distributed][Model] Rank-based Component Creation for Pipeline Parallelism Memory Optimization by @wushidonguc in https://github.com/vllm-project/vllm/pull/6455
- [ROCm] Cleanup Dockerfile and remove outdated patch by @hongxiayang in https://github.com/vllm-project/vllm/pull/6482
- [Misc][Speculative decoding] Typos and typing fixes by @ShangmingCai in https://github.com/vllm-project/vllm/pull/6467
- [Doc][CI/Build] Update docs and tests to use `vllm serve` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/6431
- [Bugfix] Fix for multinode crash on 4 PP by @andoorve in https://github.com/vllm-project/vllm/pull/6495
- [TPU] Remove multi-modal args in TPU backend by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6504
- [Misc] Use `torch.Tensor` for type annotation by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6505
- [Core] Refactor _prepare_model_input_tensors - take 2 by @comaniac in https://github.com/vllm-project/vllm/pull/6164
- [DOC] - Add docker image to Cerebrium Integration by @milo157 in https://github.com/vllm-project/vllm/pull/6510
- [Bugfix] Fix Ray Metrics API usage by @Yard1 in https://github.com/vllm-project/vllm/pull/6354
- [Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step by @alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/6338
- [ Kernel ] FP8 Dynamic-Per-Token Quant Kernel by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/6511
- [Model] Pipeline parallel support for Mixtral by @comaniac in https://github.com/vllm-project/vllm/pull/6516
- [ Kernel ] Fp8 Channelwise Weight Support by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6487
- [core][model] yet another cpu offload implementation by @youkaichao in https://github.com/vllm-project/vllm/pull/6496
- [BugFix] Avoid secondary error in ShmRingBuffer destructor by @njhill in https://github.com/vllm-project/vllm/pull/6530
- [Core] Introduce SPMD worker execution using Ray accelerated DAG by @ruisearch42 in https://github.com/vllm-project/vllm/pull/6032
- [Misc] Minor patch for draft model runner by @comaniac in https://github.com/vllm-project/vllm/pull/6523
- [BugFix][Frontend] Use LoRA tokenizer in OpenAI APIs by @njhill in https://github.com/vllm-project/vllm/pull/6227
- [Bugfix] Update flashinfer.py with PagedAttention forwards - Fixes Gemma2 OpenAI Server Crash by @noamgat in https://github.com/vllm-project/vllm/pull/6501
- [TPU] Refactor TPU worker & model runner by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6506
- [ Misc ] Improve Min Capability Checking in `compressed-tensors` by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6522
- [ci] Reword Github bot comment by @khluu in https://github.com/vllm-project/vllm/pull/6534
- [Model] Support Mistral-Nemo by @mgoin in https://github.com/vllm-project/vllm/pull/6548
- Fix PR comment bot by @khluu in https://github.com/vllm-project/vllm/pull/6554
- [ci][test] add correctness test for cpu offloading by @youkaichao in https://github.com/vllm-project/vllm/pull/6549
- [Kernel] Implement fallback for FP8 channelwise using torch._scaled_mm by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/6552
- [CI/Build] Build on Ubuntu 20.04 instead of 22.04 by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/6517
- Add support for a rope extension method by @simon-mo in https://github.com/vllm-project/vllm/pull/6553
- [Core] Multiprocessing Pipeline Parallel support by @njhill in https://github.com/vllm-project/vllm/pull/6130
- [Bugfix] Make spec. decode respect per-request seed. by @tdoublep in https://github.com/vllm-project/vllm/pull/6034
- [ Misc ] non-uniform quantization via `compressed-tensors` for `Llama` by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6515
- [Bugfix][Frontend] Fix missing `/metrics` endpoint by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/6463
- [BUGFIX] Raise an error for no draft token case when draft_tp>1 by @wooyeonlee0 in https://github.com/vllm-project/vllm/pull/6369
- [Model] RowParallelLinear: pass bias to quant_method.apply by @tdoublep in https://github.com/vllm-project/vllm/pull/6327
- [Bugfix][Frontend] remove duplicate init logger by @dtrifiro in https://github.com/vllm-project/vllm/pull/6581
- [Misc] Small perf improvements by @Yard1 in https://github.com/vllm-project/vllm/pull/6520
- [Docs] Update docs for wheel location by @simon-mo in https://github.com/vllm-project/vllm/pull/6580
- [Bugfix] [SpecDecode] AsyncMetricsCollector: update time since last collection by @tdoublep in https://github.com/vllm-project/vllm/pull/6578
- [bugfix][distributed] fix multi-node bug for shared memory by @youkaichao in https://github.com/vllm-project/vllm/pull/6597
- [ Kernel ] Enable Dynamic Per Token `fp8` by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6547
- [Docs] Update PP docs by @andoorve in https://github.com/vllm-project/vllm/pull/6598
- [build] add ib so that multi-node support with infiniband can be supported out-of-the-box by @youkaichao in https://github.com/vllm-project/vllm/pull/6599
- [ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/6593
- [Core] Allow specifying custom Executor by @Yard1 in https://github.com/vllm-project/vllm/pull/6557
- [Bugfix][Core]: Guard for KeyErrors that can occur if a request is aborted with Pipeline Parallel by @tjohnson31415 in https://github.com/vllm-project/vllm/pull/6587
- [Misc] Consolidate and optimize logic for building padded tensors by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/6541
- [ Misc ] `fbgemm` checkpoints by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6559
- [Bugfix][CI/Build][Hardware][AMD] Fix AMD tests, add HF cache, update CK FA, add partially supported model notes by @mawong-amd in https://github.com/vllm-project/vllm/pull/6543
- [ Kernel ] Enable `fp8-marlin` for `fbgemm-fp8` models by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6606
- [Misc] Fix input_scale typing in w8a8_utils.py by @mgoin in https://github.com/vllm-project/vllm/pull/6579
- [ Bugfix ] Fix AutoFP8 fp8 marlin by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6609
- [Frontend] Move chat utils by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/6602
- [Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. by @sroy745 in https://github.com/vllm-project/vllm/pull/6485
- [Misc] Remove abused noqa by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6619
- [Model] Refactor and decouple phi3v image embedding by @Isotr0py in https://github.com/vllm-project/vllm/pull/6621
- [Kernel][Core] Add AWQ support to the Marlin kernel by @alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/6612
- [Model] Initial Support for Chameleon by @ywang96 in https://github.com/vllm-project/vllm/pull/5770
- [Misc] Add a wrapper for torch.inference_mode by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6618
- [Bugfix] Fix `vocab_size` field access in LLaVA models by @jaywonchung in https://github.com/vllm-project/vllm/pull/6624
- [Frontend] Refactor prompt processing by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/4028
- [Bugfix][Kernel] Use int64_t for indices in fp8 quant kernels by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/6649
- [ci] Use different sccache bucket for CUDA 11.8 wheel build by @khluu in https://github.com/vllm-project/vllm/pull/6656
- [Core] Support dynamically loading Lora adapter from HuggingFace by @Jeffwan in https://github.com/vllm-project/vllm/pull/6234
- [ci][build] add back vim in docker by @youkaichao in https://github.com/vllm-project/vllm/pull/6661
- [Misc] Remove deprecation warning for beam search by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6659
- [Core] Modulize prepare input and attention metadata builder by @comaniac in https://github.com/vllm-project/vllm/pull/6596
- [Bugfix] Fix null `modules_to_not_convert` in FBGEMM Fp8 quantization by @cli99 in https://github.com/vllm-project/vllm/pull/6665
- [Misc] Enable chunked prefill by default for long context models by @WoosukKwon in https://github.com/vllm-project/vllm/pull/6666
- [misc] add start loading models for users information by @youkaichao in https://github.com/vllm-project/vllm/pull/6670
- add tqdm when loading checkpoint shards by @zhaotyer in https://github.com/vllm-project/vllm/pull/6569
- [Misc] Support FP8 kv cache scales from compressed-tensors by @mgoin in https://github.com/vllm-project/vllm/pull/6528
- [doc][distributed] add more doc for setting up multi-node environment by @youkaichao in https://github.com/vllm-project/vllm/pull/6529
- [Misc] Manage HTTP connections in one place by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/6600
- [misc] only tqdm for first rank by @youkaichao in https://github.com/vllm-project/vllm/pull/6672
- [VLM][Model] Support image input for Chameleon by @ywang96 in https://github.com/vllm-project/vllm/pull/6633
- support ignore patterns in model loader by @simon-mo in https://github.com/vllm-project/vllm/pull/6673
- Bump version to v0.5.3 by @simon-mo in https://github.com/vllm-project/vllm/pull/6674
New Contributors
- @g-eoj made their first contribution in https://github.com/vllm-project/vllm/pull/6419
- @peng1999 made their first contribution in https://github.com/vllm-project/vllm/pull/6442
- @Jeffwan made their first contribution in https://github.com/vllm-project/vllm/pull/6230
- @wushidonguc made their first contribution in https://github.com/vllm-project/vllm/pull/6455
- @ShangmingCai made their first contribution in https://github.com/vllm-project/vllm/pull/6467
- @ruisearch42 made their first contribution in https://github.com/vllm-project/vllm/pull/6032
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.5.2...v0.5.3
Assets
- vllm-0.5.3+cu118-cp310-cp310-manylinux1_x86_64.whl (151.03 MB)
- vllm-0.5.3+cu118-cp311-cp311-manylinux1_x86_64.whl (151.03 MB)
- vllm-0.5.3+cu118-cp38-cp38-manylinux1_x86_64.whl (151.03 MB)
- vllm-0.5.3+cu118-cp39-cp39-manylinux1_x86_64.whl (151.03 MB)
- vllm-0.5.3-cp310-cp310-manylinux1_x86_64.whl (150.96 MB)
- vllm-0.5.3-cp311-cp311-manylinux1_x86_64.whl (150.96 MB)
- vllm-0.5.3-cp38-cp38-manylinux1_x86_64.whl (150.96 MB)
- vllm-0.5.3-cp39-cp39-manylinux1_x86_64.whl (150.96 MB)