Optimum-NVIDIA
Optimized inference with NVIDIA and Hugging Face
Optimum-NVIDIA delivers the best inference performance on the NVIDIA platform through Hugging Face. Run LLaMA 2 at 1,200 tokens/second (up to 28x faster than the baseline transformers framework) by changing just a single line in your existing transformers code.
Installation
Pip
The pip installation flow has been validated only on Ubuntu at this stage.
apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev
python -m pip install --pre --extra-index-url https://pypi.nvidia.com optimum-nvidia
Developers who want to target the best possible performance should look at the installation methods below.
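After installing, a quick sanity check can confirm that the package imports correctly (a minimal sketch, not part of the official instructions):
# If this import succeeds, optimum-nvidia and its dependencies were installed correctly.
from optimum.nvidia import AutoModelForCausalLM
print("optimum-nvidia is ready")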
Docker container
You can use a Docker container to try Optimum-NVIDIA today. Images are available on the Hugging Face Docker Hub.
docker pull huggingface/optimum-nvidia
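To start the container with GPU access, a typical invocation looks like this (the flags are illustrative and assume the NVIDIA Container Toolkit is installed on the host):
docker run -it --gpus all huggingface/optimum-nvidia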
Building from source
Instead of using the pre-built docker container, you can build Optimum-NVIDIA from source:
# Compute capabilities to build for: 90-real = Hopper (e.g. H100), 89-real = Ada-Lovelace (e.g. L40S, RTX 4090)
TARGET_SM="90-real;89-real"
git clone --recursive --depth=1 https://github.com/huggingface/optimum-nvidia.git
cd optimum-nvidia/third-party/tensorrt-llm
make -C docker release_build CUDA_ARCHS="$TARGET_SM"
cd ../.. && docker build -t <organisation_name/image_name>:<version> -f docker/Dockerfile .
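Once built, the image can be started the same way as the pre-built container, for example with docker run -it --gpus all <organisation_name/image_name>:<version>.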
Quickstart Guide
Pipelines
Hugging Face pipelines provide a simple yet powerful abstraction to quickly set up inference. If you already have a pipeline from transformers, you can unlock the performance benefits of Optimum-NVIDIA by just changing one line.
- from transformers.pipelines import pipeline
+ from optimum.nvidia.pipelines import pipeline
pipe = pipeline('text-generation', 'meta-llama/Llama-2-7b-chat-hf', use_fp8=True)
pipe("Describe a real-world application of AI in sustainable energy.")
Generate
If you want control over advanced features like quantization and token selection strategies, we recommend using the generate() API. Just like with pipelines, switching over from your existing transformers code is super simple.
- from transformers import AutoModelForCausalLM
+ from optimum.nvidia import AutoModelForCausalLM
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-chat-hf",
+ use_fp8=True,
+ max_prompt_length=1024,
+ max_output_length=2048, # Must be at least size of max_prompt_length + max_new_tokens
+ max_batch_size=8,
)
model_inputs = tokenizer(["How is autonomous vehicle technology transforming the future of transportation and urban planning?"], return_tensors="pt").to("cuda")
generated_ids = model.generate(
    **model_inputs,
    top_k=40,               # sample from the 40 highest-probability tokens
    top_p=0.7,              # nucleus sampling: keep the smallest token set with cumulative probability 0.7
    repetition_penalty=10,  # heavily penalize already-generated tokens
)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
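One caveat the snippet above does not show (this is a property of the standard Llama 2 tokenizer, not of Optimum-NVIDIA): the tokenizer ships without a padding token, so batching several prompts with padding_side="left" typically requires assigning one first:
# The Llama 2 tokenizer defines no pad token by default; a common
# transformers workaround is to reuse the EOS token for padding.
tokenizer.pad_token = tokenizer.eos_token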
To learn more about text generation with LLMs, check out this guide!
Support Matrix
We test Optimum-NVIDIA on 4090, L40S, and H100 Tensor Core GPUs, though it is expected to work on any GPU based on the following architectures:
- Ampere (A100/A30 are supported. Experimental support for A10, A40, RTX Ax000)
- Hopper
- Ada-Lovelace
Note that FP8 support is only available on GPUs based on Hopper and Ada-Lovelace architectures.
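A quick way to check whether a local GPU falls into the FP8-capable group (a sketch using plain PyTorch rather than any Optimum-NVIDIA API):
import torch

# FP8 requires compute capability >= 8.9 (8.9 = Ada-Lovelace, 9.0 = Hopper);
# Ampere GPUs report 8.0/8.6 and must fall back to FP16/BF16.
major, minor = torch.cuda.get_device_capability()
print(f"SM {major}.{minor}: FP8 {'supported' if (major, minor) >= (8, 9) else 'not supported'}")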
Optimum-NVIDIA works on Linux and will support Windows soon.
Optimum-NVIDIA currently accelerates text-generation with LLaMAForCausalLM, and we are actively working to expand support to include more model architectures and tasks.
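To check whether a given checkpoint belongs to the supported family before loading it, its configuration can be inspected with plain transformers (a sketch; Llama 2 checkpoints report LlamaForCausalLM):
from transformers import AutoConfig

# A checkpoint advertises its architecture in its config;
# Llama 2 checkpoints report ['LlamaForCausalLM'].
config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
print(config.architectures)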
Contributing
Check out our Contributing Guide