v0.10.0

Release date: 2024-12-20 08:17:17

Latest release of argmaxinc/WhisperKit: v0.10.0 (2024-12-20 08:17:17)

Highlights

This release adds support for protocol-defined model input and output types, enabling full MLX or MLTensor pipelines without converting to MLMultiArray between encoder and decoder stages. For example, instead of

func encodeFeatures(_ features: MLMultiArray) async throws -> MLMultiArray?

you can now define the types by protocol:

func encodeFeatures(_ features: any FeatureExtractorOutputType) async throws -> (any AudioEncoderOutputType)?

where the types are defined like so:

public protocol FeatureExtractorOutputType {}
extension MLMultiArray: FeatureExtractorOutputType {}
public protocol AudioEncoderOutputType {}
extension MLMultiArray: AudioEncoderOutputType {}

or for a type that is a struct:

public struct TextDecoderMLMultiArrayOutputType: TextDecoderOutputType {
    public var logits: MLMultiArray?
    public var cache: DecodingCache?
}

so the entire structure can be handled by any model that conforms to the protocol. This adds flexibility for passing different data types between models and reduces the number of conversion steps compared to the previous design, which assumed every stage used MLMultiArray.
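As a rough illustration of the idea, the sketch below passes a non-MLMultiArray type through a protocol-typed encoder. It is not WhisperKit code: the two protocols are redeclared locally so the snippet compiles on its own, and FloatBuffer, ScalingEncoder, and PipelineError are hypothetical names invented for this example.

```swift
// Local stand-ins that mirror the protocol names from the release notes.
// They are redeclared here only so the sketch compiles without WhisperKit.
protocol FeatureExtractorOutputType {}
protocol AudioEncoderOutputType {}

// Hypothetical payload type that is not an MLMultiArray,
// e.g. a buffer owned by a different inference backend.
struct FloatBuffer: FeatureExtractorOutputType, AudioEncoderOutputType {
    var values: [Float]
}

// Hypothetical error for inputs this encoder does not understand.
enum PipelineError: Error {
    case unsupportedInput
}

// Hypothetical encoder: accepts any conforming input, downcasts to the
// concrete type it can process, and returns its own conforming output.
struct ScalingEncoder {
    var scale: Float

    func encodeFeatures(_ features: any FeatureExtractorOutputType) async throws -> (any AudioEncoderOutputType)? {
        guard let buffer = features as? FloatBuffer else {
            throw PipelineError.unsupportedInput
        }
        // No conversion to MLMultiArray is needed between stages.
        return FloatBuffer(values: buffer.values.map { $0 * scale })
    }
}
```

A caller would invoke `try await ScalingEncoder(scale: 2).encodeFeatures(FloatBuffer(values: [1, 2, 3]))` and downcast the result with `as? FloatBuffer`. The downcast at each stage boundary is the cost of this design: each model decides which concrete types it accepts and rejects the rest at runtime.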

We've started using different inference types by adopting the new MLTensor for token sampling on devices that have the latest OS support, which resulted in a 2x speedup for that operation. Future work will shift the entire pipeline to these types.

There are also some important fixes included:

Timestamp rules are now enabled when the withoutTimestamps decoding option is set to false, increasing parity with OpenAI's Python implementation. This will significantly increase the number of timestamps returned during decoding and shorten the average length of individual segments overall.
- Previous: <|0.00|> So in college, I was a government major,<|4.92|><|4.94|> which means I had to write a lot of papers.<|7.38|>
- Now: <|0.00|> So in college,<|2.00|><|3.36|> I was a government major,<|4.88|><|4.90|> which means I had to write a lot of papers.<|7.36|>
Early stopping via callback (a way to stop the decoding loop early if repetition is detected) has been converted to use an actor to fix some concurrency issues noted by the community.
The CI script now uploads failure results to GitHub for better visibility.
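To show why an actor helps with the early-stopping fix, here is a minimal sketch of a stop flag shared between a decoding loop and a callback running in another task. It is not WhisperKit's implementation: EarlyStopState and decodeLoop are hypothetical names, and the convention that the callback returns false to request a stop is an assumption made for this example.

```swift
// Hypothetical actor that owns the stop flag. Actor isolation serializes
// all reads and writes, so the loop and the callback cannot race on it.
actor EarlyStopState {
    private(set) var isStopRequested = false

    func requestStop() {
        isStopRequested = true
    }
}

// Hypothetical decoding loop. After each token, the callback inspects the
// tokens decoded so far and returns false to request an early stop
// (assumed convention for this sketch).
func decodeLoop(
    candidates: [Int],
    shouldContinue: @escaping @Sendable ([Int]) -> Bool
) async -> [Int] {
    let stopState = EarlyStopState()
    var decoded: [Int] = []

    for token in candidates {
        decoded.append(token)
        let snapshot = decoded

        // Run the callback in a separate task. Awaiting its completion
        // keeps this sketch deterministic.
        await Task.detached {
            if !shouldContinue(snapshot) {
                await stopState.requestStop()
            }
        }.value

        if await stopState.isStopRequested {
            break
        }
    }
    return decoded
}
```

For example, a callback that returns false once the last three tokens are identical would halt the loop as soon as that repetition appears. A real decoder would likely not wait for the callback on every step, which is exactly the situation where the actor's serialized access matters.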

⚠️ Breaking changes

Changing the protocol may result in unexpected behavior if you are using a custom implementation. Please raise an issue if you notice anything.
WhisperKit.sampleRate has been moved to Constants.defaultWindowSamples

Finally, this release includes some great open-source contributions, listed below, with a broad range of improvements to the library. Huge thanks to all the contributors 🙏

What's Changed

Fix audio processing edge case by @ZachNagengast in https://github.com/argmaxinc/WhisperKit/pull/237
Add public callbacks to help expose internal state a little more by @iandundas in https://github.com/argmaxinc/WhisperKit/pull/240
Freeze loglevel enum by @ZachNagengast in https://github.com/argmaxinc/WhisperKit/pull/255
Update WhisperAX app icon for macOS to align with Apple HIG standards by @Stv-X in https://github.com/argmaxinc/WhisperKit/pull/257
Add ability to prevent config.json being written to ~/Documents/huggingface/... by @iandundas in https://github.com/argmaxinc/WhisperKit/pull/262
Typo in Model Descriptions by @rk-helper in https://github.com/argmaxinc/WhisperKit/pull/269
Audio: Fix taking a suffix of negative length from a collection by @mattisssa in https://github.com/argmaxinc/WhisperKit/pull/278

New Contributors

@Stv-X made their first contribution in https://github.com/argmaxinc/WhisperKit/pull/257
@rk-helper made their first contribution in https://github.com/argmaxinc/WhisperKit/pull/269
@mattisssa made their first contribution in https://github.com/argmaxinc/WhisperKit/pull/278

Full Changelog: https://github.com/argmaxinc/WhisperKit/compare/v0.9.4...v0.10.0
