v0.2.6
Release date: 2023-12-18 02:35:42
Latest vllm-project/vllm release: v0.4.1 (2024-04-24 10:28:08)
Major changes
- Fast model execution with CUDA/HIP graph (see the usage sketch after this list)
- W4A16 GPTQ support (thanks to @chu-tianxiang)
- Fix memory profiling with tensor parallelism
- Fix *.bin weight loading for Mixtral models
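Both headline features are driven through the `LLM` constructor. A minimal sketch, assuming the v0.2.6 Python API (the model names are illustrative):

```python
# Minimal sketch, assuming the v0.2.6 Python API; model names are examples.
from vllm import LLM, SamplingParams

# CUDA/HIP graph execution is on by default in this release; pass
# enforce_eager=True to fall back to eager-mode execution.
llm = LLM(model="facebook/opt-125m")

# For a W4A16 GPTQ checkpoint, select the quantization method explicitly.
# Note that v0.2.6 temporarily enforces eager mode for GPTQ models
# (see https://github.com/vllm-project/vllm/pull/2154).
# llm = LLM(model="TheBloke/Llama-2-7B-Chat-GPTQ", quantization="gptq")

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```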
What's Changed
- Fix typing in generate function for AsyncLLMEngine & add toml to requirements-dev by @mezuzza in https://github.com/vllm-project/vllm/pull/2100
- Fix Dockerfile.rocm by @tjtanaa in https://github.com/vllm-project/vllm/pull/2101
- avoid multiple redefinition by @MitchellX in https://github.com/vllm-project/vllm/pull/1817
- Add a flag to include stop string in output text by @yunfeng-scale in https://github.com/vllm-project/vllm/pull/1976 (see the sketch after this list)
- Add GPTQ support by @chu-tianxiang in https://github.com/vllm-project/vllm/pull/916
- [Docs] Add quantization support to docs by @WoosukKwon in https://github.com/vllm-project/vllm/pull/2135
- [ROCm] Temporarily remove GPTQ ROCm support by @WoosukKwon in https://github.com/vllm-project/vllm/pull/2138
- simplify loading weights logic by @esmeetu in https://github.com/vllm-project/vllm/pull/2133
- Optimize model execution with CUDA graph by @WoosukKwon in https://github.com/vllm-project/vllm/pull/1926
- [Minor] Delete Llama tokenizer warnings by @WoosukKwon in https://github.com/vllm-project/vllm/pull/2146
- Fix all-reduce memory usage by @WoosukKwon in https://github.com/vllm-project/vllm/pull/2151
- Pin PyTorch & xformers versions by @WoosukKwon in https://github.com/vllm-project/vllm/pull/2155
- Remove dependency on CuPy by @WoosukKwon in https://github.com/vllm-project/vllm/pull/2152
- [Docs] Add CUDA graph support to docs by @WoosukKwon in https://github.com/vllm-project/vllm/pull/2148
- Temporarily enforce eager mode for GPTQ models by @WoosukKwon in https://github.com/vllm-project/vllm/pull/2154
- [Minor] Add more detailed explanation on `quantization` argument by @WoosukKwon in https://github.com/vllm-project/vllm/pull/2145
- [Minor] Fix xformers version by @WoosukKwon in https://github.com/vllm-project/vllm/pull/2158
- [Minor] Add Phi 2 to supported models by @WoosukKwon in https://github.com/vllm-project/vllm/pull/2159
- Make sampler less blocking by @Yard1 in https://github.com/vllm-project/vllm/pull/1889
- [Minor] Fix a typo in .pt weight support by @WoosukKwon in https://github.com/vllm-project/vllm/pull/2160
- Disable CUDA graph for SqueezeLLM by @WoosukKwon in https://github.com/vllm-project/vllm/pull/2161
- Bump up to v0.2.6 by @WoosukKwon in https://github.com/vllm-project/vllm/pull/2157
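PR #1976 above adds an opt-in flag for keeping the matched stop string in the returned text. A hedged sketch, assuming the flag is exposed on `SamplingParams` as `include_stop_str_in_output`:

```python
# Sketch of the stop-string flag from https://github.com/vllm-project/vllm/pull/1976.
# The parameter name include_stop_str_in_output is assumed here.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # example model

params = SamplingParams(
    max_tokens=64,
    stop=["\n"],                      # stop generating at the first newline
    include_stop_str_in_output=True,  # keep the stop string in the output text
)
print(llm.generate(["Q: What is vLLM?\nA:"], params)[0].outputs[0].text)
```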
New Contributors
- @mezuzza made their first contribution in https://github.com/vllm-project/vllm/pull/2100
- @MitchellX made their first contribution in https://github.com/vllm-project/vllm/pull/1817
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.2.5...v0.2.6
Assets
- vllm-0.2.6+cu118-cp310-cp310-manylinux1_x86_64.whl (9.71 MB)
- vllm-0.2.6+cu118-cp311-cp311-manylinux1_x86_64.whl (9.72 MB)
- vllm-0.2.6+cu118-cp38-cp38-manylinux1_x86_64.whl (9.71 MB)
- vllm-0.2.6+cu118-cp39-cp39-manylinux1_x86_64.whl (9.71 MB)
- vllm-0.2.6-cp310-cp310-manylinux1_x86_64.whl (9.72 MB)
- vllm-0.2.6-cp311-cp311-manylinux1_x86_64.whl (9.74 MB)
- vllm-0.2.6-cp38-cp38-manylinux1_x86_64.whl (9.73 MB)
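The unsuffixed wheels correspond to the default CUDA build published on PyPI, so `pip install vllm==0.2.6` is usually enough; the `+cu118` builds target CUDA 11.8 environments, and the `cp38` through `cp311` tags correspond to CPython 3.8 through 3.11.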