v0.1.0b7
Release date: 2024-05-24 19:24:33
Latest huggingface/optimum-nvidia release: v0.1.0b8 (2024-09-17 21:09:22)
Highlights
- Mixtral models are now supported (requires a multi-GPU setup)
- Tensor Parallelism & Pipeline Parallelism are now supported in `from_pretrained` and `pipeline` through the `tp=<int>` and `pp=<int>` arguments
- Models from `transformers` are now loaded in their checkpoint's data type rather than `float32`, avoiding most of the memory errors that occurred in 0.1.0b6
- Intermediate TensorRT-LLM checkpoints and engines are now saved in two separate folders (`checkpoints/` and `engines/`) to avoid issues when building multiple checkpoints with the same `config.json` (TP / PP setup)
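As a quick illustration of the new parallelism arguments, here is a hedged sketch. The model name and GPU counts are illustrative, and actually loading a model this way requires a multi-GPU machine with optimum-nvidia installed:

```python
# tp = tensor-parallel degree, pp = pipeline-parallel degree;
# the engine is sharded across tp * pp GPUs in total.
parallel_kwargs = {"tp": 2, "pp": 1}  # 2-way tensor parallelism, no pipeline parallelism
num_gpus_needed = parallel_kwargs["tp"] * parallel_kwargs["pp"]

# On a multi-GPU machine, these keyword arguments are passed straight through, e.g.:
# from optimum.nvidia import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "mistralai/Mixtral-8x7B-v0.1",  # illustrative checkpoint
#     **parallel_kwargs,
# )
print(num_gpus_needed)  # → 2
```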
What's Changed
- Fix checking output limits for #114 by @zaycev in https://github.com/huggingface/optimum-nvidia/pull/115
- Test batched causallm inference by @fxmarty in https://github.com/huggingface/optimum-nvidia/pull/117
- Remove claim of Turing support by @laikhtewari in https://github.com/huggingface/optimum-nvidia/pull/118
- Mention important additional parameters for engine config in README by @zaycev in https://github.com/huggingface/optimum-nvidia/pull/113
- Update to TensorRT-LLM v0.9.0 by @mfuntowicz in https://github.com/huggingface/optimum-nvidia/pull/124
- Use a percentage based matching rather than exact token match for tests by @mfuntowicz in https://github.com/huggingface/optimum-nvidia/pull/125
- Mixtral by @mfuntowicz in https://github.com/huggingface/optimum-nvidia/pull/131
New Contributors
- @zaycev made their first contribution in https://github.com/huggingface/optimum-nvidia/pull/115
Full Changelog: https://github.com/huggingface/optimum-nvidia/compare/v0.1.0b6...v0.1.0b7