v0.6.0
Release date: 2024-09-05 07:35:42
Latest vllm-project/vllm release: v0.6.1 (2024-09-12 05:44:44)
Highlights
Performance Update
- We are excited to announce a faster vLLM, delivering 2x more throughput compared to v0.5.3. The default parameters should already achieve a significant speedup, but we also recommend trying out multi-step scheduling by setting `--num-scheduler-steps 8` in the engine arguments (see the sketch after this list). Please note that it still has some limitations and is being actively hardened; see #7528 for known issues.
- Multi-step scheduler now supports LLMEngine and log_probs (#7789, #7652)
- Asynchronous output processor overlaps output data structure construction with GPU work, delivering a 12% throughput increase. (#7049, #7911, #7921, #8050)
- Use the FlashInfer backend for FP8 KV cache (#7798, #7985) and for rejection sampling in speculative decoding (#7244)
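As a companion to the multi-step scheduling note above, here is a minimal sketch of enabling it from the offline Python API. It assumes that `num_scheduler_steps` is the keyword counterpart of the `--num-scheduler-steps` engine argument; the model name and prompt are placeholders, not taken from these notes.

```python
# Minimal sketch, assuming num_scheduler_steps mirrors the
# --num-scheduler-steps engine argument; model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    num_scheduler_steps=8,                        # turn on multi-step scheduling
)
params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Briefly explain what a KV cache is."], params)
print(outputs[0].outputs[0].text)
```

For the OpenAI-compatible server, the equivalent would be passing `--num-scheduler-steps 8` on the launch command line, as described in the highlight above.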
Model Support
- Support bitsandbytes 8-bit and FP4 quantized models (#7445)
- New LLMs: Exaone (#7819), Granite (#7436), Phi-3.5-MoE (#7729)
- A new tokenizer mode for Mistral models that uses the native mistral-common package (#7739); see the sketch after this list
- Multi-modality:
- Multi-image input support for LLaVA-Next (#7230) and Phi-3-vision models (#7783)
- Ultravox support for multiple audio chunks (#7963)
- TP support for ViTs (#7186)
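To illustrate the new Mistral tokenizer mode, here is a minimal sketch. The checkpoint name is a placeholder and is assumed to ship the tokenizer files that mistral-common expects; check the model repository before relying on it.

```python
# Minimal sketch of tokenizer_mode="mistral" (assumes the checkpoint ships
# mistral-common tokenizer files; model name and prompt are placeholders).
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.3",  # placeholder checkpoint
    tokenizer_mode="mistral",                    # use the native mistral-common tokenizer
)
outputs = llm.generate(
    ["Give me one sentence about Paris."],
    SamplingParams(max_tokens=48),
)
print(outputs[0].outputs[0].text)
```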
Hardware Support
- NVIDIA GPU: extend cuda graph size for H200 (#7894)
- AMD: Triton implementations of awq_dequantize and awq_gemm to support AWQ (#7386)
- Intel GPU: pipeline parallel support (#7810)
- Neuron: context lengths and token generation buckets (#7885, #8062)
- TPU: single and multi-host TPUs on GKE (#7613), Async output processing (#8011)
Production Features
- OpenAI-Compatible Tools API + Streaming for Hermes & Mistral models! (#5649); see the sketch after this list
- Add json_schema support from OpenAI protocol (#7654)
- Enable chunked prefill and prefix caching together (#7753, #8120)
- Multimodal support in offline chat (#8098), and multiple multi-modal items in the OpenAI frontend (#8049)
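To illustrate the Tools API item above, here is a minimal sketch using the standard OpenAI Python client against a locally running vLLM server. The server flags shown in the comment, the model choice, and the weather tool are assumptions for illustration, not taken from these notes.

```python
# Minimal sketch of tool calling through the OpenAI-compatible server.
# Assumed (not from these notes): the server was launched with tool-call
# parsing enabled for a Hermes-style model, e.g.
#   vllm serve NousResearch/Hermes-2-Pro-Llama-3-8B \
#       --enable-auto-tool-choice --tool-call-parser hermes
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="NousResearch/Hermes-2-Pro-Llama-3-8B",  # placeholder model name
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)
print(resp.choices[0].message.tool_calls)
```

With streaming enabled, the same request would also stream tool-call deltas, which is the second half of the feature referenced in #5649.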
Misc
- Support benchmarking async engine in benchmark_throughput.py (#7964)
- Progress in integration with `torch.compile`: avoid Dynamo guard evaluation overhead (#7898), skip compile for profiling (#7796)
What's Changed
- [Core] Add multi-step support to LLMEngine by @alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/7789
- [Bugfix] Fix run_batch logger by @pooyadavoodi in https://github.com/vllm-project/vllm/pull/7640
- [Frontend] Publish Prometheus metrics in run_batch API by @pooyadavoodi in https://github.com/vllm-project/vllm/pull/7641
- [Frontend] add json_schema support from OpenAI protocol by @rockwotj in https://github.com/vllm-project/vllm/pull/7654
- [misc][core] lazy import outlines by @youkaichao in https://github.com/vllm-project/vllm/pull/7831
- [ci][test] exclude model download time in server start time by @youkaichao in https://github.com/vllm-project/vllm/pull/7834
- [ci][test] fix RemoteOpenAIServer by @youkaichao in https://github.com/vllm-project/vllm/pull/7838
- [Bugfix] Fix Phi-3v crash when input images are of certain sizes by @zifeitong in https://github.com/vllm-project/vllm/pull/7840
- [Model][VLM] Support multi-images inputs for Phi-3-vision models by @Isotr0py in https://github.com/vllm-project/vllm/pull/7783
- [Misc] Remove snapshot_download usage in InternVL2 test by @Isotr0py in https://github.com/vllm-project/vllm/pull/7835
- [misc][cuda] improve pynvml warning by @youkaichao in https://github.com/vllm-project/vllm/pull/7852
- [Spec Decoding] Streamline batch expansion tensor manipulation by @njhill in https://github.com/vllm-project/vllm/pull/7851
- [Bugfix]: Use float32 for base64 embedding by @HollowMan6 in https://github.com/vllm-project/vllm/pull/7855
- [CI/Build] Avoid downloading all HF files in `RemoteOpenAIServer` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/7836
- [Performance][BlockManagerV2] Mark prefix cache block as computed after schedule by @comaniac in https://github.com/vllm-project/vllm/pull/7822
- [Misc] Update `qqq` to use vLLMParameters by @dsikka in https://github.com/vllm-project/vllm/pull/7805
- [Misc] Update `gptq_marlin_24` to use vLLMParameters by @dsikka in https://github.com/vllm-project/vllm/pull/7762
- [misc] fix custom allreduce p2p cache file generation by @youkaichao in https://github.com/vllm-project/vllm/pull/7853
- [Bugfix] neuron: enable tensor parallelism by @omrishiv in https://github.com/vllm-project/vllm/pull/7562
- [Misc] Update compressed tensors lifecycle to remove `prefix` from `create_weights` by @dsikka in https://github.com/vllm-project/vllm/pull/7825
- [Core] Asynchronous Output Processor by @megha95 in https://github.com/vllm-project/vllm/pull/7049
- [Tests] Disable retries and use context manager for openai client by @njhill in https://github.com/vllm-project/vllm/pull/7565
- [core][torch.compile] not compile for profiling by @youkaichao in https://github.com/vllm-project/vllm/pull/7796
- Revert #7509 by @comaniac in https://github.com/vllm-project/vllm/pull/7887
- [Model] Add Mistral Tokenization to improve robustness and chat encoding by @patrickvonplaten in https://github.com/vllm-project/vllm/pull/7739
- [CI/Build][VLM] Cleanup multiple images inputs model test by @Isotr0py in https://github.com/vllm-project/vllm/pull/7897
- [Hardware][Intel GPU] Add intel GPU pipeline parallel support. by @jikunshang in https://github.com/vllm-project/vllm/pull/7810
- [CI/Build][ROCm] Enabling tensorizer tests for ROCm by @alexeykondrat in https://github.com/vllm-project/vllm/pull/7237
- [Bugfix] Fix phi3v incorrect image_idx when using async engine by @Isotr0py in https://github.com/vllm-project/vllm/pull/7916
- [cuda][misc] error on empty CUDA_VISIBLE_DEVICES by @youkaichao in https://github.com/vllm-project/vllm/pull/7924
- [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel by @dsikka in https://github.com/vllm-project/vllm/pull/7766
- [benchmark] Update TGI version by @philschmid in https://github.com/vllm-project/vllm/pull/7917
- [Model] Add multi-image input support for LLaVA-Next offline inference by @zifeitong in https://github.com/vllm-project/vllm/pull/7230
- [mypy] Enable mypy type checking for `vllm/core` by @jberkhahn in https://github.com/vllm-project/vllm/pull/7229
- [Core][VLM] Stack multimodal tensors to represent multiple images within each prompt by @petersalas in https://github.com/vllm-project/vllm/pull/7902
- [hardware][rocm] allow rocm to override default env var by @youkaichao in https://github.com/vllm-project/vllm/pull/7926
- [Bugfix] Allow ScalarType to be compiled with pytorch 2.3 and add checks for registering FakeScalarType and dynamo support. by @bnellnm in https://github.com/vllm-project/vllm/pull/7886
- [mypy][CI/Build] Fix mypy errors by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/7929
- [Core] Async_output_proc: Add virtual engine support (towards pipeline parallel) by @alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/7911
- [Performance] Enable chunked prefill and prefix caching together by @comaniac in https://github.com/vllm-project/vllm/pull/7753
- [ci][test] fix pp test failure by @youkaichao in https://github.com/vllm-project/vllm/pull/7945
- [Doc] fix the autoAWQ example by @stas00 in https://github.com/vllm-project/vllm/pull/7937
- [Bugfix][VLM] Fix incompatibility between #7902 and #7230 by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/7948
- [Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. by @pavanimajety in https://github.com/vllm-project/vllm/pull/7798
- [Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ by @rasmith in https://github.com/vllm-project/vllm/pull/7386
- [TPU] Upgrade PyTorch XLA nightly by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7967
- [Doc] fix 404 link by @stas00 in https://github.com/vllm-project/vllm/pull/7966
- [Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM by @mzusman in https://github.com/vllm-project/vllm/pull/7651
- [Bugfix] Make torch registration of punica ops optional by @bnellnm in https://github.com/vllm-project/vllm/pull/7970
- [torch.compile] avoid Dynamo guard evaluation overhead by @youkaichao in https://github.com/vllm-project/vllm/pull/7898
- Remove faulty Meta-Llama-3-8B-Instruct-FP8.yaml lm-eval test by @mgoin in https://github.com/vllm-project/vllm/pull/7961
- [Frontend] Minor optimizations to zmq decoupled front-end by @njhill in https://github.com/vllm-project/vllm/pull/7957
- [torch.compile] remove reset by @youkaichao in https://github.com/vllm-project/vllm/pull/7975
- [VLM][Core] Fix exceptions on ragged NestedTensors by @petersalas in https://github.com/vllm-project/vllm/pull/7974
- Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." by @youkaichao in https://github.com/vllm-project/vllm/pull/7982
- [Bugfix] Unify rank computation across regular decoding and speculative decoding by @jmkuebler in https://github.com/vllm-project/vllm/pull/7899
- [Core] Combine async postprocessor and multi-step by @alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/7921
- [Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto by @pavanimajety in https://github.com/vllm-project/vllm/pull/7985
- extend cuda graph size for H200 by @kushanam in https://github.com/vllm-project/vllm/pull/7894
- [Bugfix] Fix incorrect vocal embedding shards for GGUF model in tensor parallelism by @Isotr0py in https://github.com/vllm-project/vllm/pull/7954
- [misc] update tpu int8 to use new vLLM Parameters by @dsikka in https://github.com/vllm-project/vllm/pull/7973
- [Neuron] Adding support for context-lenght, token-gen buckets. by @hbikki in https://github.com/vllm-project/vllm/pull/7885
- support bitsandbytes 8-bit and FP4 quantized models by @chenqianfzh in https://github.com/vllm-project/vllm/pull/7445
- Add more percentiles and latencies by @wschin in https://github.com/vllm-project/vllm/pull/7759
- [VLM] Disallow overflowing `max_model_len` for multimodal models by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/7998
- [Core] Logprobs support in Multi-step by @afeldman-nm in https://github.com/vllm-project/vllm/pull/7652
- [TPU] Async output processing for TPU by @WoosukKwon in https://github.com/vllm-project/vllm/pull/8011
- [Kernel] changing fused moe kernel chunk size default to 32k by @avshalomman in https://github.com/vllm-project/vllm/pull/7995
- [MODEL] add Exaone model support by @nayohan in https://github.com/vllm-project/vllm/pull/7819
- Support vLLM single and multi-host TPUs on GKE by @richardsliu in https://github.com/vllm-project/vllm/pull/7613
- [Bugfix] Fix import error in Exaone model by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/8034
- [VLM][Model] TP support for ViTs by @ChristopherCho in https://github.com/vllm-project/vllm/pull/7186
- [Core] Increase default `max_num_batched_tokens` for multimodal models by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/8028
- [Frontend]-config-cli-args by @KaunilD in https://github.com/vllm-project/vllm/pull/7737
- [TPU][Bugfix] Fix tpu type api by @WoosukKwon in https://github.com/vllm-project/vllm/pull/8035
- [Model] Adding support for MSFT Phi-3.5-MoE by @wenxcs in https://github.com/vllm-project/vllm/pull/7729
- [Bugfix] Address #8009 and add model test for flashinfer fp8 kv cache. by @pavanimajety in https://github.com/vllm-project/vllm/pull/8013
- [Bugfix] Fix import error in Phi-3.5-MoE by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/8052
- [Bugfix] Fix ModelScope models in v0.5.5 by @NickLucche in https://github.com/vllm-project/vllm/pull/8037
- [BugFix][Core] Multistep Fix Crash on Request Cancellation by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/8059
- [Frontend][VLM] Add support for multiple multi-modal items in the OpenAI frontend by @ywang96 in https://github.com/vllm-project/vllm/pull/8049
- [Misc] Optional installation of audio related packages by @ywang96 in https://github.com/vllm-project/vllm/pull/8063
- [Model] Adding Granite model. by @shawntan in https://github.com/vllm-project/vllm/pull/7436
- [SpecDecode][Kernel] Use Flashinfer for Rejection Sampling in Speculative Decoding by @LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/7244
- [TPU] Align worker index with node boundary by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7932
- [Core][Bugfix] Accept GGUF model without .gguf extension by @Isotr0py in https://github.com/vllm-project/vllm/pull/8056
- [Bugfix] Fix internlm2 tensor parallel inference by @Isotr0py in https://github.com/vllm-project/vllm/pull/8055
- [Bugfix] Fix #7592 vllm 0.5.4 enable_chunked_prefill throughput is slightly lower than 0.5.3~0.5.0. by @noooop in https://github.com/vllm-project/vllm/pull/7874
- [Bugfix] Fix single output condition in output processor by @WoosukKwon in https://github.com/vllm-project/vllm/pull/7881
- [Bugfix][VLM] Add fallback to SDPA for ViT model running on CPU backend by @Isotr0py in https://github.com/vllm-project/vllm/pull/8061
- [Performance] Enable chunked prefill and prefix caching together by @comaniac in https://github.com/vllm-project/vllm/pull/8120
- [CI] Only PR reviewers/committers can trigger CI on PR by @khluu in https://github.com/vllm-project/vllm/pull/8124
- [Core] Optimize Async + Multi-step by @alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/8050
- [Misc] Raise a more informative exception in add/remove_logger by @Yard1 in https://github.com/vllm-project/vllm/pull/7750
- [CI/Build] fix: Add the +empty tag to the version only when the VLLM_TARGET_DEVICE envvar was explicitly set to "empty" by @tomeras91 in https://github.com/vllm-project/vllm/pull/8118
- [ci] Fix GHA workflow by @khluu in https://github.com/vllm-project/vllm/pull/8129
- [TPU][Bugfix] Fix next_token_ids shape by @WoosukKwon in https://github.com/vllm-project/vllm/pull/8128
- [CI] Change PR remainder to avoid at-mentions by @simon-mo in https://github.com/vllm-project/vllm/pull/8134
- [Misc] Update `GPTQ` to use `vLLMParameters` by @dsikka in https://github.com/vllm-project/vllm/pull/7976
- [Benchmark] Add `--async-engine` option to benchmark_throughput.py by @njhill in https://github.com/vllm-project/vllm/pull/7964
- [TPU][Bugfix] Use XLA rank for persistent cache path by @WoosukKwon in https://github.com/vllm-project/vllm/pull/8137
- [Misc] Update fbgemmfp8 to use `vLLMParameters` by @dsikka in https://github.com/vllm-project/vllm/pull/7972
- [Model] Add Ultravox support for multiple audio chunks by @petersalas in https://github.com/vllm-project/vllm/pull/7963
- [Frontend] Multimodal support in offline chat by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/8098
- chore: Update check-wheel-size.py to read VLLM_MAX_SIZE_MB from env by @haitwang-cloud in https://github.com/vllm-project/vllm/pull/8103
- [Bugfix] remove post_layernorm in siglip by @wnma3mz in https://github.com/vllm-project/vllm/pull/8106
- [MISC] Consolidate FP8 kv-cache tests by @comaniac in https://github.com/vllm-project/vllm/pull/8131
- [CI/Build][ROCm] Enabling LoRA tests on ROCm by @alexeykondrat in https://github.com/vllm-project/vllm/pull/7369
- [CI] Change test input in Gemma LoRA test by @WoosukKwon in https://github.com/vllm-project/vllm/pull/8163
- [Feature] OpenAI-Compatible Tools API + Streaming for Hermes & Mistral models by @K-Mistele in https://github.com/vllm-project/vllm/pull/5649
- [MISC] Replace input token throughput with total token throughput by @comaniac in https://github.com/vllm-project/vllm/pull/8164
- [Neuron] Adding support for adding/ overriding neuron configuration a… by @hbikki in https://github.com/vllm-project/vllm/pull/8062
- Bump version to v0.6.0 by @simon-mo in https://github.com/vllm-project/vllm/pull/8166
New Contributors
- @rockwotj made their first contribution in https://github.com/vllm-project/vllm/pull/7654
- @HollowMan6 made their first contribution in https://github.com/vllm-project/vllm/pull/7855
- @patrickvonplaten made their first contribution in https://github.com/vllm-project/vllm/pull/7739
- @philschmid made their first contribution in https://github.com/vllm-project/vllm/pull/7917
- @jberkhahn made their first contribution in https://github.com/vllm-project/vllm/pull/7229
- @pavanimajety made their first contribution in https://github.com/vllm-project/vllm/pull/7798
- @rasmith made their first contribution in https://github.com/vllm-project/vllm/pull/7386
- @jmkuebler made their first contribution in https://github.com/vllm-project/vllm/pull/7899
- @kushanam made their first contribution in https://github.com/vllm-project/vllm/pull/7894
- @hbikki made their first contribution in https://github.com/vllm-project/vllm/pull/7885
- @wschin made their first contribution in https://github.com/vllm-project/vllm/pull/7759
- @nayohan made their first contribution in https://github.com/vllm-project/vllm/pull/7819
- @richardsliu made their first contribution in https://github.com/vllm-project/vllm/pull/7613
- @KaunilD made their first contribution in https://github.com/vllm-project/vllm/pull/7737
- @wenxcs made their first contribution in https://github.com/vllm-project/vllm/pull/7729
- @NickLucche made their first contribution in https://github.com/vllm-project/vllm/pull/8037
- @shawntan made their first contribution in https://github.com/vllm-project/vllm/pull/7436
- @noooop made their first contribution in https://github.com/vllm-project/vllm/pull/7874
- @haitwang-cloud made their first contribution in https://github.com/vllm-project/vllm/pull/8103
- @wnma3mz made their first contribution in https://github.com/vllm-project/vllm/pull/8106
- @K-Mistele made their first contribution in https://github.com/vllm-project/vllm/pull/5649
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.5.5...v0.6.0
1. vllm-0.6.0+cu118-cp310-cp310-manylinux1_x86_64.whl (161.96 MB)
2. vllm-0.6.0+cu118-cp311-cp311-manylinux1_x86_64.whl (161.96 MB)
3. vllm-0.6.0+cu118-cp312-cp312-manylinux1_x86_64.whl (161.96 MB)
4. vllm-0.6.0+cu118-cp38-cp38-manylinux1_x86_64.whl (161.96 MB)
5. vllm-0.6.0+cu118-cp39-cp39-manylinux1_x86_64.whl (161.96 MB)