v0.5.0

Release date: 2024-03-30 16:38:38

Latest release of argmaxinc/WhisperKit: v0.10.1 (2024-12-21 13:48:53)

This is a HUGE release with some great new features and fixes 🙌

Highlights

Timestamp logits filter by @jkrukowski
- Significantly increases the number of timestamp tokens predicted in a given window, which helps a lot with segmentation
- This is on by default but can be disabled using the decoding option withoutTimestamps: true
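To give a feel for what a timestamp logits filter does, here is a minimal sketch of one rule from the reference OpenAI Whisper timestamp rules: when the combined probability of all timestamp tokens exceeds the probability of the single best text token, text tokens are masked so a timestamp gets sampled. This is not WhisperKit's TimestampRulesFilter code; the function name `preferTimestamps` and the parameter `timestampBegin` (index of the first timestamp token) are assumptions made for illustration.

```swift
import Foundation

/// Hypothetical helper, not WhisperKit's API: applies a single timestamp rule to raw logits.
/// Every token at index >= timestampBegin is treated as a timestamp token.
func preferTimestamps(logits: [Float], timestampBegin: Int) -> [Float] {
    guard timestampBegin > 0, timestampBegin < logits.count else { return logits }

    // Numerically stable log-softmax over the whole vocabulary.
    let maxLogit = logits.max()!
    let sumExp: Float = logits.map { exp($0 - maxLogit) }.reduce(0, +)
    let logSumExp = maxLogit + log(sumExp)
    let logProbs = logits.map { $0 - logSumExp }

    // Total probability mass on timestamp tokens vs. the single best text token.
    let timestampMass: Float = logProbs[timestampBegin...].map { exp($0) }.reduce(0, +)
    let bestTextProb: Float = exp(logProbs[..<timestampBegin].max()!)

    // If text still wins, leave the logits untouched.
    guard timestampMass > bestTextProb else { return logits }

    // Otherwise mask every text token so the sampler has to pick a timestamp.
    var filtered = logits
    for index in 0..<timestampBegin {
        filtered[index] = -Float.infinity
    }
    return filtered
}

// Toy vocabulary: indices 0 and 1 are text tokens, 2 and 3 are timestamp tokens.
let filteredLogits = preferTimestamps(logits: [0, 0, 1, 1], timestampBegin: 2)
print(filteredLogits)
```

Comparing the summed timestamp mass against one text token, rather than comparing token by token, is what lets many individually weak timestamp candidates win together.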
Language detection by @Abhinay1997
- New function on the TextDecoding protocol which runs a single forward pass and reads the language logits to find the most likely language for the input audio
- Enabled by default for decoding options where usePrefillPrompt: false and language: nil, provided the model is not English-only.
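The "read the language logits" step can be sketched as a softmax restricted to the language tokens followed by an argmax. The snippet below is only an illustration under stated assumptions: `detectLanguage` is a hypothetical helper, not the function added to the TextDecoding protocol, and the token ids in the usage example are made up rather than real tokenizer ids.

```swift
import Foundation

/// Hypothetical helper: picks the most likely language from the logits of one decoder step.
/// `languageTokenIds` maps a language code to the index of its language token.
func detectLanguage(
    logits: [Float],
    languageTokenIds: [String: Int]
) -> (language: String, probability: Float)? {
    // Ignore language tokens that fall outside the logits vector.
    let candidates = languageTokenIds.filter { logits.indices.contains($0.value) }
    guard !candidates.isEmpty else { return nil }

    // Softmax restricted to language tokens, shifted by the max for numerical stability.
    let maxLogit = candidates.values.map { logits[$0] }.max()!
    var total: Float = 0
    var scores: [String: Float] = [:]
    for (language, tokenId) in candidates {
        let score = exp(logits[tokenId] - maxLogit)
        scores[language] = score
        total += score
    }

    // The language with the highest score is the prediction.
    guard let best = scores.max(by: { $0.value < $1.value }) else { return nil }
    return (language: best.key, probability: best.value / total)
}

// Made-up token ids for illustration only.
let toyLogits: [Float] = [0.1, 2.0, 0.5, 3.0]
if let detected = detectLanguage(logits: toyLogits, languageTokenIds: ["en": 1, "de": 2, "ja": 3]) {
    print(detected.language, detected.probability)
}
```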
First token log prob thresholds fallback check by @jkrukowski
- This feature is not in the original OpenAI implementation but helps reduce hallucinations quite a bit.
- Often, fallbacks due to the log prob threshold are immediately identifiable from the first token, so this reduces the number of forward passes needed to move to a higher temperature
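The saving in forward passes can be illustrated with a small simulation. In this sketch, `tokenLogProbs` stands in for the log probabilities a decoder would produce one forward pass at a time; the names `AttemptResult` and `attempt` and the threshold values are assumptions for illustration, not WhisperKit's API or necessarily its defaults.

```swift
/// Outcome of one decoding attempt at a fixed temperature.
enum AttemptResult: Equatable {
    case accepted(forwardPasses: Int)
    case needsFallback(forwardPasses: Int)
}

/// Hypothetical sketch: each element of `tokenLogProbs` costs one forward pass.
func attempt(
    tokenLogProbs: [Float],
    firstTokenThreshold: Float,
    averageThreshold: Float
) -> AttemptResult {
    var sum: Float = 0
    for (index, logProb) in tokenLogProbs.enumerated() {
        // Early exit: a very unlikely first token already signals a bad attempt.
        if index == 0 && logProb < firstTokenThreshold {
            return .needsFallback(forwardPasses: 1)
        }
        sum += logProb
    }

    let passes = tokenLogProbs.count
    guard passes > 0 else { return .needsFallback(forwardPasses: 0) }

    // Usual check, only available after the whole window has been decoded.
    let average = sum / Float(passes)
    return average < averageThreshold
        ? .needsFallback(forwardPasses: passes)
        : .accepted(forwardPasses: passes)
}

// Illustrative thresholds; the first token here is unlikely enough to bail out at once.
let outcome = attempt(
    tokenLogProbs: [-2.0, -0.1, -0.1],
    firstTokenThreshold: -1.5,
    averageThreshold: -1.0
)
print(outcome)
```

Without the first-token check, the same attempt would be decoded to the end before the average log prob could trigger a retry at a higher temperature.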
Distil whisper support
- distil-large-v3 was released recently and massively speeds up predictions at minimal quality loss. We've converted and optimized 4 distil models for use in WhisperKit on CoreML, and they're really fast!
- distil-large-v3, distil-large-v3_594MB, distil-large-v3_turbo, distil-large-v3_turbo_600MB
- Note that these do not yet have word timestamp alignment heads, so can't be used with wordTimestamps: true
- They can be run via the CLI as well:
  - swift run whisperkit-cli transcribe --model-prefix "distil" --model "large-v3_turbo_600MB" --verbose --audio-path ~/your_audio.wav

⚠️ Experimental new stream mode

We added an experimental new mode for streaming in WhisperAX called "Eager streaming mode". We're still refining this feature, but we think it can soon be a great way to do real-time transcription with Whisper. Give it a try in TestFlight or take a look at the code and let us know how it can be improved.

Recommended settings for the best performance for this iteration are:

- Max tokens per loop: < 100
- Max fallback count: < 2
- Prompt and cache prefill: true

Looking for feedback on:

- Token confirmation numbers that work well
- Model, device, and settings combinations that work well

https://github.com/argmaxinc/WhisperKit/assets/1981179/0a88ca34-3a0e-4ff5-9829-9f980a4661ea

What's Changed

CLI Task Handling in https://github.com/argmaxinc/WhisperKit/pull/85
Added TimestampRulesFilter implementation by @jkrukowski in https://github.com/argmaxinc/WhisperKit/pull/45
Support distil whisper models in https://github.com/argmaxinc/WhisperKit/pull/88
Language Detection by @Abhinay1997 in https://github.com/argmaxinc/WhisperKit/pull/78
Tokenizer refactor, tests cleanup by @jkrukowski in https://github.com/argmaxinc/WhisperKit/pull/87
First token logProb thresholding by @jkrukowski in https://github.com/argmaxinc/WhisperKit/pull/90
[#93] Add missing settings to decoding options by @cgfarmer4 in https://github.com/argmaxinc/WhisperKit/pull/94
"Eager" streaming mode via word timestamps in https://github.com/argmaxinc/WhisperKit/pull/95

New Contributors

@Abhinay1997 made their first contribution in https://github.com/argmaxinc/WhisperKit/pull/78

Full Changelog: https://github.com/argmaxinc/WhisperKit/compare/v0.4.1...v0.5.0
