v1.4.0
Release date: 2023-05-01 00:56:01
Latest ggerganov/whisper.cpp release: v1.5.5 (2024-04-16 19:14:06)
Overview
This is a new major release adding integer quantization and partial GPU (NVIDIA) support.
Integer quantization
This allows the ggml Whisper models to be converted from the default 16-bit floating-point weights to 4-, 5- or 8-bit integer weights.
The resulting quantized models are smaller on disk and in memory and can be processed faster on some architectures. The transcription quality is degraded to some extent - not quantified at the moment.
- Supported quantization modes: Q4_0, Q4_1, Q4_2, Q5_0, Q5_1, Q8_0
- Implementation details: https://github.com/ggerganov/whisper.cpp/pull/540
- Usage instructions: README
- All WASM examples now support Q5 quantized models: https://whisper.ggerganov.com
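For reference, quantizing a model with the bundled quantize tool looks roughly like this (model names and paths are illustrative; see the README for the exact steps):

```shell
# build the quantization tool
make quantize

# convert an F16 ggml model to Q5_0 (output path is up to you)
./quantize models/ggml-base.en.bin models/ggml-base.en-q5_0.bin q5_0

# run inference with the quantized model as usual
./main -m models/ggml-base.en-q5_0.bin samples/jfk.wav
```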
Below is a quantitative evaluation of the different quantization modes applied to the LLaMA and RWKV large language models. These results give an impression of the expected quality, size and speed of quantized Whisper models:
LLaMA quantization (measured on M1 Pro)
Model | Measure | F16 | Q4_0 | Q4_1 | Q4_2 | Q5_0 | Q5_1 | Q8_0 |
---|---|---|---|---|---|---|---|---|
7B | perplexity | 5.9565 | 6.2103 | 6.1286 | 6.1698 | 6.0139 | 5.9934 | 5.9571 |
7B | file size | 13.0G | 4.0G | 4.8G | 4.0G | 4.4G | 4.8G | 7.1G |
7B | ms/tok @ 4th | 128 | 56 | 61 | 84 | 91 | 95 | 75 |
7B | ms/tok @ 8th | 128 | 47 | 55 | 48 | 53 | 59 | 75 |
7B | bits/weight | 16.0 | 5.0 | 6.0 | 5.0 | 5.5 | 6.0 | 9.0 |
13B | perplexity | 5.2455 | 5.3748 | 5.3471 | 5.3433 | 5.2768 | 5.2582 | 5.2458 |
13B | file size | 25.0G | 7.6G | 9.1G | 7.6G | 8.4G | 9.1G | 14G |
13B | ms/tok @ 4th | 239 | 104 | 113 | 160 | 176 | 185 | 141 |
13B | ms/tok @ 8th | 240 | 85 | 99 | 97 | 108 | 117 | 147 |
13B | bits/weight | 16.0 | 5.0 | 6.0 | 5.0 | 5.5 | 6.0 | 9.0 |
ref: https://github.com/ggerganov/llama.cpp#quantization
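The bits/weight figures follow from ggml's block layout: Q4_0, for example, stores one float scale plus 32 four-bit values per block of 32 weights, i.e. (32 + 32·4)/32 = 5.0 bits/weight. Here is a minimal sketch of Q4_0-style symmetric block quantization (illustrative only - not ggml's exact rounding rules):

```python
import numpy as np

QK = 32  # block size used by ggml's Q4_0 format

def quantize_q4_0(x):
    """Quantize a 1-D float array (length a multiple of 32) into
    per-block scales plus signed 4-bit integers in [-8, 7]."""
    blocks = x.reshape(-1, QK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    d = np.where(amax == 0.0, 1.0, amax / 7.0)          # per-block scale
    q = np.clip(np.round(blocks / d), -8, 7).astype(np.int8)
    return d.astype(np.float32), q

def dequantize_q4_0(d, q):
    """Reconstruct approximate weights from scales and 4-bit values."""
    return (q.astype(np.float32) * d).reshape(-1)

# 5.0 bits/weight: one 32-bit scale + 32 four-bit values per 32 weights
bits_per_weight = (32 + QK * 4) / QK
```

The round-trip error per weight is bounded by half a quantization step (amax/14 per block), which is the source of the modest perplexity degradation in the tables above.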
RWKV quantization
Format | Perplexity (169M) | Latency, ms (1.5B) | File size, GB (1.5B) |
---|---|---|---|
Q4_0 | 17.507 | 76 | 1.53 |
Q4_1 | 17.187 | 72 | 1.68 |
Q4_2 | 17.060 | 85 | 1.53 |
Q5_0 | 16.194 | 78 | 1.60 |
Q5_1 | 15.851 | 81 | 1.68 |
Q8_0 | 15.652 | 89 | 2.13 |
FP16 | 15.623 | 117 | 2.82 |
FP32 | 15.623 | 198 | 5.64 |
ref: https://github.com/ggerganov/ggml/issues/89#issuecomment-1528781992
This feature is possible thanks to the many contributions in the llama.cpp project: https://github.com/users/ggerganov/projects/2
GPU support via cuBLAS
Using cuBLAS results mainly in improved Encoder inference speed. I haven't done proper timings, but one can expect at least 2-3 times faster Encoder evaluation with modern NVIDIA GPU cards compared to CPU-only processing. Feel free to post your Encoder benchmarks in issue #89.
- Implementation details: https://github.com/ggerganov/whisper.cpp/pull/834
- Usage instructions: README
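Enabling cuBLAS at build time looks roughly like this (flag name as used by this release; requires the NVIDIA CUDA toolkit to be installed - see the README for details):

```shell
# Makefile build with cuBLAS enabled
WHISPER_CUBLAS=1 make -j

# or the equivalent CMake build
cmake -B build -DWHISPER_CUBLAS=ON
cmake --build build
```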
This is another feature made possible by the llama.cpp project. Special recognition to @slaren for putting almost all of this work together.
This release remains in "beta" stage as I haven't verified that everything works as expected.
What's Changed
- Updated escape_double_quotes() Function by @tauseefmohammed2 in https://github.com/ggerganov/whisper.cpp/pull/776
- examples : add missing #include by @pH5 in https://github.com/ggerganov/whisper.cpp/pull/798
- Flush upon finishing inference by @tarasglek in https://github.com/ggerganov/whisper.cpp/pull/811
- Escape quotes in csv output by @laytan in https://github.com/ggerganov/whisper.cpp/pull/815
- C++11 style by @wuyudi in https://github.com/ggerganov/whisper.cpp/pull/768
- Optionally allow a Core ML build of Whisper to work with or without Core ML models by @Canis-UK in https://github.com/ggerganov/whisper.cpp/pull/812
- Add some tips to the README of the Android project folder by @Zolliner in https://github.com/ggerganov/whisper.cpp/pull/816
- whisper: Use correct seek_end when offset is used by @ThijsRay in https://github.com/ggerganov/whisper.cpp/pull/833
- ggml : fix 32-bit ARM NEON by @ggerganov in https://github.com/ggerganov/whisper.cpp/pull/836
- Add CUDA support via cuBLAS by @ggerganov in https://github.com/ggerganov/whisper.cpp/pull/834
- Integer quantisation support by @ggerganov in https://github.com/ggerganov/whisper.cpp/pull/540
New Contributors
- @tauseefmohammed2 made their first contribution in https://github.com/ggerganov/whisper.cpp/pull/776
- @pH5 made their first contribution in https://github.com/ggerganov/whisper.cpp/pull/798
- @tarasglek made their first contribution in https://github.com/ggerganov/whisper.cpp/pull/811
- @laytan made their first contribution in https://github.com/ggerganov/whisper.cpp/pull/815
- @wuyudi made their first contribution in https://github.com/ggerganov/whisper.cpp/pull/768
- @Canis-UK made their first contribution in https://github.com/ggerganov/whisper.cpp/pull/812
- @Zolliner made their first contribution in https://github.com/ggerganov/whisper.cpp/pull/816
- @ThijsRay made their first contribution in https://github.com/ggerganov/whisper.cpp/pull/833
Full Changelog: https://github.com/ggerganov/whisper.cpp/compare/v1.3.0...v1.4.0
1. whisper-bin-Win32.zip (941.46 KB)
2. whisper-bin-x64.zip (1.05 MB)
3. whisper-blas-bin-Win32.zip (7.32 MB)
4. whisper-blas-bin-x64.zip (12.46 MB)