v0.5.0
Release date: 2024-03-30 16:38:38
Latest release of argmaxinc/WhisperKit: v0.10.1 (2024-12-21 13:48:53)
This is a HUGE release with some great new features and fixes 🙌
Highlights
- Timestamp logits filter by @jkrukowski
  - Significantly increases the number of timestamp tokens in a particular window, which helps a lot with segmentation
  - This is on by default but can be disabled using the decoding option `withoutTimestamps: true`
- Language detection by @Abhinay1997
  - New function on the `TextDecoding` protocol which runs a single forward pass and reads the language logits to find the most likely language for the input audio
  - Enabled by default for decoding options where `usePrefillPrompt: false` and `language: nil`, and the model is not English-only
- First token log prob thresholds fallback check by @jkrukowski
  - This feature is not in the original OpenAI implementation but helps reduce hallucinations quite a bit.
  - Often, fallbacks due to the log prob threshold are immediately identifiable from the first token, so this reduces the number of forward passes needed before moving to a higher temperature
- Distil whisper support
  - distil-large-v3 was released recently and massively speeds up predictions with minimal quality loss. We've converted and optimized 4 distil models for use in WhisperKit on CoreML, and they're really fast!
    - distil-large-v3
    - distil-large-v3_594MB
    - distil-large-v3_turbo
    - distil-large-v3_turbo_600MB
  - Note that these do not yet have word timestamp alignment heads, so they can't be used with `wordTimestamps: true`
  - They can be run via the CLI as well: `swift run whisperkit-cli transcribe --model-prefix "distil" --model "large-v3_turbo_600MB" --verbose --audio-path ~/your_audio.wav`
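To make the timestamp logits filter highlight above more concrete, here is a minimal, self-contained sketch of one rule from the OpenAI reference implementation's timestamp handling: when the combined probability of all timestamp tokens exceeds the probability of the single most likely text token, text tokens are suppressed so that a timestamp gets sampled. Whether WhisperKit's `TimestampRulesFilter` applies exactly this rule is an assumption, and `logSumExp`, `applyTimestampRule`, and `timestampBegin` are hypothetical names used only for illustration.

```swift
import Foundation

/// Numerically stable log(sum(exp(x))).
func logSumExp(_ values: [Double]) -> Double {
    guard let maxValue = values.max(), maxValue > -Double.infinity else {
        return -Double.infinity
    }
    let sum = values.reduce(0.0) { $0 + exp($1 - maxValue) }
    return maxValue + log(sum)
}

/// Hypothetical helper, not WhisperKit API.
/// `timestampBegin` is the index of the first timestamp token in the vocabulary;
/// every index below it is treated as a text token.
func applyTimestampRule(logits: [Double], timestampBegin: Int) -> [Double] {
    precondition(timestampBegin > 0 && timestampBegin < logits.count)

    // Convert raw logits to log probabilities.
    let normalizer = logSumExp(logits)
    let logProbs = logits.map { $0 - normalizer }

    // Total probability mass on timestamps vs. the best single text token.
    let timestampLogProb = logSumExp(Array(logProbs[timestampBegin...]))
    let maxTextLogProb = logProbs[..<timestampBegin].max()!

    guard timestampLogProb > maxTextLogProb else { return logits }

    // Timestamps win: mask out every text token.
    var filtered = logits
    for index in 0..<timestampBegin {
        filtered[index] = -Double.infinity
    }
    return filtered
}
```

Comparing the summed timestamp mass rather than the best single timestamp is what lets timestamps win even when the probability is spread over several neighbouring timestamp tokens.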
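The language detection highlight above boils down to two steps: decide whether detection should run, then take the argmax over the language logits from a single forward pass. The sketch below illustrates only that logic. It is not the `TextDecoding` protocol function itself; `shouldDetectLanguage`, `mostLikelyLanguage`, and `isMultilingual` are hypothetical names, and the logits dictionary stands in for whatever the model actually returns.

```swift
/// Hypothetical helper: mirrors the enabling condition described in the notes.
/// Detection runs only when no prefill prompt is used, no language is set,
/// and the model is not English-only.
func shouldDetectLanguage(usePrefillPrompt: Bool, language: String?, isMultilingual: Bool) -> Bool {
    return !usePrefillPrompt && language == nil && isMultilingual
}

/// Hypothetical helper: `languageLogits` maps a language code (e.g. "en")
/// to the logit of its language token after one forward pass.
/// Returns nil when no language logits are available.
func mostLikelyLanguage(from languageLogits: [String: Float]) -> String? {
    return languageLogits.max(by: { $0.value < $1.value })?.key
}
```

For example, `mostLikelyLanguage(from: ["en": 1.2, "de": 3.4])` selects `"de"` because it has the largest logit.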
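The first-token log prob fallback described in the highlights can be sketched as an early exit inside a temperature fallback loop: if the first decoded token already falls below a threshold, the rest of the window is skipped and the next temperature is tried. This is a simplified illustration under stated assumptions, not WhisperKit's implementation. All names here are hypothetical, and the `firstTokenLogProb` closure stands in for a real forward pass.

```swift
/// Hypothetical helper: true when the first token alone is enough
/// to trigger a fallback. A nil threshold disables the check.
func shouldFallbackEarly(firstTokenLogProb: Float, threshold: Float?) -> Bool {
    guard let threshold = threshold else { return false }
    return firstTokenLogProb < threshold
}

/// Hypothetical fallback loop. `firstTokenLogProb` returns the log probability
/// of the first token when decoding at the given temperature.
/// Returns the first temperature whose first token passes the check,
/// or nil if every temperature falls back.
func firstAcceptedTemperature(
    temperatures: [Float],
    threshold: Float?,
    firstTokenLogProb: (Float) -> Float
) -> Float? {
    for temperature in temperatures {
        let logProb = firstTokenLogProb(temperature)
        if shouldFallbackEarly(firstTokenLogProb: logProb, threshold: threshold) {
            // Skip the remaining forward passes for this window
            // and retry at the next (higher) temperature.
            continue
        }
        return temperature
    }
    return nil
}
```

The saving comes from the `continue`: a window that would eventually fail the average log prob check is abandoned after one token instead of after a full decode.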
⚠️ Experimental new stream mode
We added an experimental new mode for streaming in WhisperAX called "Eager streaming mode". We're still refining this feature, but we think it can soon be a great way to do real-time transcription with Whisper. Give it a try in TestFlight or take a look at the code and let us know how it can be improved.
Recommended settings for the best performance for this iteration are:
- Max tokens per loop < 100
- Max fallback count < 2
- Prompt and cache prefill true
Looking for feedback on:
- Token confirmation numbers that work well
- Model, device, and settings combinations that work well
https://github.com/argmaxinc/WhisperKit/assets/1981179/0a88ca34-3a0e-4ff5-9829-9f980a4661ea
What's Changed
- CLI Task Handling in https://github.com/argmaxinc/WhisperKit/pull/85
- Added TimestampRulesFilter implementation by @jkrukowski in https://github.com/argmaxinc/WhisperKit/pull/45
- Support distil whisper models in https://github.com/argmaxinc/WhisperKit/pull/88
- Language Detection by @Abhinay1997 in https://github.com/argmaxinc/WhisperKit/pull/78
- Tokenizer refactor, tests cleanup by @jkrukowski in https://github.com/argmaxinc/WhisperKit/pull/87
- First token logProb thresholding by @jkrukowski in https://github.com/argmaxinc/WhisperKit/pull/90
- [#93] Add missing settings to decoding options by @cgfarmer4 in https://github.com/argmaxinc/WhisperKit/pull/94
- "Eager" streaming mode via word timestamps in https://github.com/argmaxinc/WhisperKit/pull/95
New Contributors
- @Abhinay1997 made their first contribution in https://github.com/argmaxinc/WhisperKit/pull/78
Full Changelog: https://github.com/argmaxinc/WhisperKit/compare/v0.4.1...v0.5.0