v0.10.0
版本发布时间: 2024-12-20 08:17:17
argmaxinc/WhisperKit最新发布版本:v0.10.0(2024-12-20 08:17:17)
Highlights
This release provides support for protocol-defined model inputs and output types, supporting full MLX or MLTensor pipelines without the need to convert to MLMultiArrays between encoder/decoder stages. For example, instead of
func encodeFeatures(_ features: MLMultiArray) async throws -> MLMultiArray?
you can now define the types by protocol:
func encodeFeatures(_ features: any FeatureExtractorOutputType) async throws -> (any AudioEncoderOutputType)?
where the types are defined as so:
public protocol FeatureExtractorOutputType {}
extension MLMultiArray: FeatureExtractorOutputType {}
public protocol AudioEncoderOutputType {}
extension MLMultiArray: AudioEncoderOutputType {}
or for a type that is a struct:
public struct TextDecoderMLMultiArrayOutputType: TextDecoderOutputType {
public var logits: MLMultiArray?
public var cache: DecodingCache?
}
so the entire structure can be handled by any model that conforms to the protocol, adding more flexibility for passing different data types between models, and thus reducing the amount of conversion steps vs. previous where it was assumed to be all MLMultiArrays.
We've made a start in using different inference types by using the new MLTensor
for token sampling on devices that have the latest OS support, which resulted in a 2x speedup for that operation. Future work will shift the entire pipeline to using these.
There are also some important fixes included:
- Timestamp rules are now enabled when the
withoutTimestamps
decoding option is set to false, increasing parity with OpenAI's python implementation. This will significantly increase the amount of timestamps returned during decoding and shorten the average length of individual segments overall.- Previous:
<|0.00|> So in college, I was a government major,<|4.92|><|4.94|> which means I had to write a lot of papers.<|7.38|>
- Now:
<|0.00|> So in college,<|2.00|><|3.36|> I was a government major,<|4.88|><|4.90|> which means I had to write a lot of papers.<|7.36|>
- Previous:
- Early stopping via callback (a way to stop the decoding loop early if repetition is detected) has been converted to use an actor to fix some concurrency issues noted by the community.
- CI script now uploads failure results to github for better visibility.
⚠️ Breaking changes
- Changing the protocol may result in some unexpected behavior if you are using a custom implementation, please raise an issue if you notice anything.
-
WhisperKit.sampleRate
has been moved toConstants.defaultWindowSamples
Finally, there were some great open-source contributions listed below, with a broad range of improvements to the library. Huge thanks to all the contributors 🙏
What's Changed
- Fix audio processing edge case by @ZachNagengast in https://github.com/argmaxinc/WhisperKit/pull/237
- Add public callbacks to help expose internal state a little more by @iandundas in https://github.com/argmaxinc/WhisperKit/pull/240
- Freeze loglevel enum by @ZachNagengast in https://github.com/argmaxinc/WhisperKit/pull/255
- Update WhisperAX app icon for macOS to align with Apple HIG standards by @Stv-X in https://github.com/argmaxinc/WhisperKit/pull/257
- Add ability to prevent config.json being written to
~/Documents/huggingface/...
by @iandundas in https://github.com/argmaxinc/WhisperKit/pull/262 - Typo in Model Descriptions by @rk-helper in https://github.com/argmaxinc/WhisperKit/pull/269
- Audio: Fix taking a suffix of negative length from a collection by @mattisssa in https://github.com/argmaxinc/WhisperKit/pull/278
New Contributors
- @Stv-X made their first contribution in https://github.com/argmaxinc/WhisperKit/pull/257
- @rk-helper made their first contribution in https://github.com/argmaxinc/WhisperKit/pull/269
- @mattisssa made their first contribution in https://github.com/argmaxinc/WhisperKit/pull/278
Full Changelog: https://github.com/argmaxinc/WhisperKit/compare/v0.9.4...v0.10.0