NVIDIA/TensorRT-LLM

Fork: 1028 Star: 8930 (updated 2024-12-20 15:10:12)

license: Apache-2.0

Language: C++

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
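
A minimal sketch of that Python API, using the high-level `LLM` entry point (the checkpoint name below is an illustrative assumption; any model in the Support Matrix works):

```python
# Minimal sketch of the high-level LLM API; the checkpoint name is
# illustrative. Substitute any model from the Support Matrix.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # builds/loads a TensorRT engine
params = SamplingParams(max_tokens=64, temperature=0.8)

for output in llm.generate(["What is TensorRT-LLM?"], params):
    print(output.outputs[0].text)
```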

Latest release: v0.13.0 (2024-09-30 16:37:55)

Official website | GitHub

TensorRT-LLM

A TensorRT Toolbox for Optimized Large Language Model Inference


Architecture   |   Performance   |   Examples   |   Documentation   |   Roadmap


Latest News

  • [2024/12/10] ⚡ Llama 3.3 70B from AI at Meta is accelerated by TensorRT-LLM. 🌟 State-of-the-art model on par with Llama 3.1 405B for reasoning, math, instruction following and tool use. Explore the preview ➡️ link

  • [2024/12/03] 🌟 Boost your AI #inference throughput by up to 3.6x. We now support speculative decoding, which can triple token throughput with NVIDIA TensorRT-LLM. Perfect for your generative AI apps. ⚡Learn how in this technical deep dive ➡️ link

  • [2024/12/02] Working on deploying ONNX models for performance-critical applications? Try our NVIDIA Nsight Deep Learning Designer ⚡ A user-friendly GUI and tight integration with NVIDIA TensorRT that offers: ✅ Intuitive visualization of ONNX model graphs ✅ Quick tweaking of model architecture and parameters ✅ Detailed performance profiling with either ORT or TensorRT ✅ Easy building of TensorRT engines ➡️ link

  • [2024/11/26] 📣 Introducing TensorRT-LLM for Jetson AGX Orin, making it even easier to deploy on Jetson AGX Orin with initial support in JetPack 6.1 via the v0.12.0-jetson branch of the TensorRT-LLM repo. ✅ Pre-compiled TensorRT-LLM wheels & containers for easy integration ✅ Comprehensive guides & docs to get you started ➡️ link

  • [2024/11/21] NVIDIA TensorRT-LLM Multiblock Attention Boosts Throughput by More Than 3x for Long Sequence Lengths on NVIDIA HGX H200 ➡️ link

  • [2024/11/19] Llama 3.2 Full-Stack Optimizations Unlock High Performance on NVIDIA GPUs ➡️ link

  • [2024/11/09] 🚀🚀🚀 3x Faster AllReduce with NVSwitch and TensorRT-LLM MultiShot ➡️ link

  • [2024/11/09] ✨ NVIDIA advances the AI ecosystem with the AI model of LG AI Research 🙌 ➡️ link

  • [2024/11/02] 🌟🌟🌟 NVIDIA and LlamaIndex Developer Contest 🙌 Enter for a chance to win prizes including an NVIDIA® GeForce RTX™ 4080 SUPER GPU, DLI credits, and more🙌 ➡️ link

  • [2024/10/28] 🏎️🏎️🏎️ NVIDIA GH200 Superchip Accelerates Inference by 2x in Multiturn Interactions with Llama Models ➡️ link

  • [2024/10/22] New 📝 Step-by-step instructions on how to ✅ Optimize LLMs with NVIDIA TensorRT-LLM, ✅ Deploy the optimized models with Triton Inference Server, ✅ Autoscale LLMs deployment in a Kubernetes environment. 🙌 Technical Deep Dive: ➡️ link

  • [2024/10/07] 🚀🚀🚀Optimizing Microsoft Bing Visual Search with NVIDIA Accelerated Libraries ➡️ link

Previous News

TensorRT-LLM Overview

TensorRT-LLM is a library for optimizing Large Language Model (LLM) inference. It provides state-of-the-art optimizations, including custom attention kernels, inflight batching, paged KV caching, quantization (FP8, INT4 AWQ, INT8 SmoothQuant, and others), and much more, to perform inference efficiently on NVIDIA GPUs.
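
To make the paged KV-cache idea concrete, here is a toy, framework-free sketch (conceptual only, not TensorRT-LLM code): key/value storage is split into fixed-size pages drawn from a shared pool, so sequences of different lengths share memory without large contiguous reservations.

```python
# Toy illustration of paged KV caching (conceptual, not TRT-LLM code):
# the cache is a pool of fixed-size pages; each sequence keeps a page
# table mapping logical token positions to physical pages.
PAGE_SIZE = 16  # tokens per page

class PagedKVCache:
    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))  # physical page pool
        self.page_tables = {}                     # seq_id -> list of page ids
        self.lengths = {}                         # seq_id -> tokens written

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one new token; returns (page, offset)."""
        table = self.page_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % PAGE_SIZE == 0:               # current page is full
            table.append(self.free_pages.pop())   # grab a fresh page
        self.lengths[seq_id] = length + 1
        return table[-1], length % PAGE_SIZE

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's pages to the pool."""
        self.free_pages.extend(self.page_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_pages=8)
print(cache.append_token(seq_id=0))  # -> (7, 0): first token, fresh page
```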

TensorRT-LLM provides a Python API to build LLMs into optimized TensorRT engines. It contains runtimes in Python (bindings) and C++ to execute those TensorRT engines. It also includes a backend for integration with the NVIDIA Triton Inference Server. Models built with TensorRT-LLM can be executed on a wide range of configurations from a single GPU to multiple nodes with multiple GPUs (using Tensor Parallelism and/or Pipeline Parallelism).
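
For instance, tensor parallelism can be requested directly from the high-level API; this sketch assumes two visible GPUs and an illustrative checkpoint name:

```python
# Sketch of tensor-parallel inference with the high-level LLM API;
# assumes two GPUs are visible. Checkpoint name is illustrative.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,  # shard each layer's weights across 2 GPUs
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```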

TensorRT-LLM comes with several popular models pre-defined. They can easily be modified and extended to fit custom needs via a PyTorch-like Python API. Refer to the Support Matrix for a list of supported models.
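
As a hypothetical sketch of that PyTorch-like style (the `Module`, `Linear`, and `relu` names mirror `tensorrt_llm.Module`, `tensorrt_llm.layers.Linear`, and `tensorrt_llm.functional.relu`; verify exact signatures against the release you use):

```python
# Hypothetical sketch of a custom block in the PyTorch-like API.
# Verify class names and signatures against your TensorRT-LLM version.
from tensorrt_llm import Module
from tensorrt_llm.functional import relu
from tensorrt_llm.layers import Linear

class TinyMLP(Module):
    def __init__(self, hidden_size: int, ffn_size: int):
        super().__init__()
        self.fc = Linear(hidden_size, ffn_size)
        self.proj = Linear(ffn_size, hidden_size)

    def forward(self, x):
        # Calls here build the TensorRT graph rather than eagerly computing.
        return self.proj(relu(self.fc(x)))
```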

TensorRT-LLM is built on top of the TensorRT Deep Learning Inference library. It leverages much of TensorRT's deep learning optimizations and adds LLM-specific optimizations on top, as described above. TensorRT is an ahead-of-time compiler; it builds "Engines" which are optimized representations of the compiled model containing the entire execution graph. These engines are optimized for a specific GPU architecture, and can be validated, benchmarked, and serialized for later deployment in a production environment.
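
That compile-once, deploy-many cycle is visible in plain TensorRT's Python API, which TensorRT-LLM drives under the hood; a minimal sketch with a one-op network (file name illustrative):

```python
# Sketch of TensorRT's ahead-of-time compile -> serialize -> deserialize
# cycle, which TensorRT-LLM automates. The file name is illustrative.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)

# Build: define a trivial network and compile it into a serialized engine.
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
x = network.add_input("x", trt.float32, (1, 16))
y = network.add_activation(x, trt.ActivationType.RELU)
network.mark_output(y.get_output(0))
engine_bytes = builder.build_serialized_network(
    network, builder.create_builder_config())
with open("toy.engine", "wb") as f:
    f.write(engine_bytes)  # the GPU-specific, deployable artifact

# Deploy: deserialize the engine later, possibly in another process.
runtime = trt.Runtime(logger)
with open("toy.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
print("engine loaded:", engine is not None)
```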

Getting Started

To get started with TensorRT-LLM, visit our documentation.

Community

  • Model zoo (generated by TRT-LLM rel 0.9 a9356d4b7610330e89c1010f342a9ac644215c52)

Recent releases (data updated 2024-10-09 20:00:20):

  • v0.13.0 (2024-09-30 16:37:55)
  • v0.12.0 (2024-08-29 23:01:08)
  • v0.11.0 (2024-07-17 20:56:53)
  • v0.10.0 (2024-06-05 21:02:34)
  • v0.9.0 (2024-04-16 12:38:00)
  • v0.8.0 (2024-02-29 17:54:49)
  • v0.7.1 (2023-12-27 09:59:03)
  • v0.6.1 (2023-12-04 19:11:03)
  • v0.5.0 (2023-10-19 21:14:44)
