v0.1.0b7
Release date: 2024-05-24 19:24:33
Latest huggingface/optimum-nvidia release: v0.1.0b8 (2024-09-17 21:09:22)
Highlights
- Mixtral models are now supported (requires a multi-GPU setup)
- Tensor Parallelism & Pipeline Parallelism are now supported in `from_pretrained` and `pipeline` through the `tp=<int>` and `pp=<int>` arguments
- Models from `transformers` are now loaded in their checkpoint's data type rather than `float32`, avoiding most of the memory errors that occurred in 0.1.0b6
- Intermediate TensorRT-LLM checkpoints and engines are now saved in two separate folders (`checkpoints/` and `engines/`) to avoid issues when building multiple checkpoints with the same `config.json` (TP / PP setup)
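As a quick illustration of the new parallelism arguments, here is a hedged sketch. The model name and GPU counts are illustrative, and actually loading a model this way requires a multi-GPU machine with optimum-nvidia installed:

```python
# tp = tensor-parallel degree, pp = pipeline-parallel degree;
# the engine is sharded across tp * pp GPUs in total.
parallel_kwargs = {"tp": 2, "pp": 1}  # 2-way tensor parallelism, no pipeline parallelism
num_gpus_needed = parallel_kwargs["tp"] * parallel_kwargs["pp"]

# On a multi-GPU machine, these keyword arguments are passed straight through, e.g.:
# from optimum.nvidia import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "mistralai/Mixtral-8x7B-v0.1",  # illustrative checkpoint
#     **parallel_kwargs,
# )
print(num_gpus_needed)  # → 2
```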
What's Changed
- Fix checking output limits for #114 by @zaycev in https://github.com/huggingface/optimum-nvidia/pull/115
- Test batched causallm inference by @fxmarty in https://github.com/huggingface/optimum-nvidia/pull/117
- Remove claim of Turing support by @laikhtewari in https://github.com/huggingface/optimum-nvidia/pull/118
- Mention important additional parameters for engine config in README by @zaycev in https://github.com/huggingface/optimum-nvidia/pull/113
- Update to TensorRT-LLM v0.9.0 by @mfuntowicz in https://github.com/huggingface/optimum-nvidia/pull/124
- Use a percentage based matching rather than exact token match for tests by @mfuntowicz in https://github.com/huggingface/optimum-nvidia/pull/125
- Mixtral by @mfuntowicz in https://github.com/huggingface/optimum-nvidia/pull/131
New Contributors
- @zaycev made their first contribution in https://github.com/huggingface/optimum-nvidia/pull/115
Full Changelog: https://github.com/huggingface/optimum-nvidia/compare/v0.1.0b6...v0.1.0b7