intel-analytics/ipex-llm
Forks: 1256 Stars: 6621 (updated 2024-10-23 21:40:22)
License: Apache-2.0
Language: Python
Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Latest release: v2.1.0 (2024-08-22 17:06:57)
> [!IMPORTANT]
> `bigdl-llm` has now become `ipex-llm` (see the migration guide here); you may find the original `BigDL` project here.
💫 Intel® LLM Library for PyTorch*
< English | 中文 >
`IPEX-LLM` is a PyTorch library for running LLM on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max) with very low latency[^1].
> [!NOTE]
> - It is built on top of the excellent work of `llama.cpp`, `transformers`, `bitsandbytes`, `vLLM`, `qlora`, `AutoGPTQ`, `AutoAWQ`, etc.
> - It provides seamless integration with llama.cpp, Ollama, Text-Generation-WebUI, HuggingFace transformers, LangChain, LlamaIndex, DeepSpeed-AutoTP, vLLM, FastChat, Axolotl, HuggingFace PEFT, HuggingFace TRL, AutoGen, ModelScope, etc.
> - 50+ models have been optimized/verified on `ipex-llm` (including LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM, Baichuan, Qwen, RWKV, and more); see the complete list here.
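As a quick illustration of the HuggingFace `transformers`-style Python API mentioned above, below is a minimal sketch of INT4 inference on CPU. The model path and prompt are placeholders, and the exact API surface can differ slightly between `ipex-llm` releases, so treat this as an outline rather than the canonical quickstart.

```python
# A minimal sketch (not the official quickstart) of INT4 LLM inference with ipex-llm on CPU.
# Assumes `pip install ipex-llm[all]` and a HuggingFace-format model; the model path
# and prompt are placeholders.
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "meta-llama/Llama-2-7b-chat-hf"  # placeholder: any verified model from the table below

# load_in_4bit=True quantizes the weights to 4-bit (sym_int4) while loading
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

with torch.inference_mode():
    input_ids = tokenizer.encode("What is Intel Arc?", return_tensors="pt")
    output = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```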
Latest Update 🔥
- [2024/07] We added support for running Microsoft's GraphRAG using local LLM on Intel GPU; see the quickstart guide here.
- [2024/07] We added extensive support for Large Multimodal Models, including StableDiffusion, Phi-3-Vision, Qwen-VL, and more.
- [2024/07] We added FP6 support on Intel GPU.
- [2024/06] We added experimental NPU support for Intel Core Ultra processors; see the examples here.
- [2024/06] We added extensive support for pipeline parallel inference, which makes it easy to run large-sized LLM using 2 or more Intel GPUs (such as Arc).
- [2024/06] We added support for running RAGFlow with `ipex-llm` on Intel GPU.
- [2024/05] `ipex-llm` now supports Axolotl for LLM finetuning on Intel GPU; see the quickstart here.
More updates
- [2024/05] You can now easily run `ipex-llm` inference, serving and finetuning using the Docker images.
- [2024/05] You can now install `ipex-llm` on Windows using just "one command".
- [2024/04] You can now run Open WebUI on Intel GPU using `ipex-llm`; see the quickstart here.
- [2024/04] You can now run Llama 3 on Intel GPU using `llama.cpp` and `ollama` with `ipex-llm`; see the quickstart here.
- [2024/04] `ipex-llm` now supports Llama 3 on both Intel GPU and CPU.
- [2024/04] `ipex-llm` now provides a C++ interface, which can be used as an accelerated backend for running llama.cpp and ollama on Intel GPU.
- [2024/03] `bigdl-llm` has now become `ipex-llm` (see the migration guide here); you may find the original `BigDL` project here.
- [2024/02] `ipex-llm` now supports directly loading models from ModelScope (魔搭).
- [2024/02] `ipex-llm` added initial INT2 support (based on the llama.cpp IQ2 mechanism), which makes it possible to run large-sized LLM (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM.
- [2024/02] Users can now use `ipex-llm` through the Text-Generation-WebUI GUI.
- [2024/02] `ipex-llm` now supports Self-Speculative Decoding, which in practice brings ~30% speedup for FP16 and BF16 inference latency on Intel GPU and CPU respectively.
- [2024/02] `ipex-llm` now supports a comprehensive list of LLM finetuning methods on Intel GPU (including LoRA, QLoRA, DPO, QA-LoRA and ReLoRA).
- [2024/01] Using `ipex-llm` QLoRA, we managed to finetune LLaMA2-7B in 21 minutes and LLaMA2-70B in 3.14 hours on 8 Intel Max 1550 GPUs for Stanford-Alpaca (see the blog here).
- [2023/12] `ipex-llm` now supports ReLoRA (see "ReLoRA: High-Rank Training Through Low-Rank Updates").
- [2023/12] `ipex-llm` now supports Mixtral-8x7B on both Intel GPU and CPU.
- [2023/12] `ipex-llm` now supports QA-LoRA (see "QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models").
- [2023/12] `ipex-llm` now supports FP8 and FP4 inference on Intel GPU.
- [2023/11] Initial support for directly loading GGUF, AWQ and GPTQ models into `ipex-llm` is available.
- [2023/11] `ipex-llm` now supports vLLM continuous batching on both Intel GPU and CPU.
- [2023/10] `ipex-llm` now supports QLoRA finetuning on both Intel GPU and CPU.
- [2023/10] `ipex-llm` now supports FastChat serving on both Intel CPU and GPU.
- [2023/09] `ipex-llm` now supports Intel GPU (including iGPU, Arc, Flex and MAX).
- [2023/09] `ipex-llm` tutorial is released.
`ipex-llm` Performance
See the Token Generation Speed on Intel Core Ultra and Intel Arc GPU below[^1] (and refer to [2][3][4] for more details).
You may follow the Benchmarking Guide to run the `ipex-llm` performance benchmark yourself.
`ipex-llm` Demo
See demos of running local LLMs on an Intel Iris iGPU, Intel Core Ultra iGPU, single-card Arc GPU, or multi-card Arc GPUs using `ipex-llm` below.
Intel Iris iGPU | Intel Core Ultra iGPU | Intel Arc dGPU | 2-Card Intel Arc dGPUs |
---|---|---|---|
llama.cpp (Phi-3-mini Q4_0) | Ollama (Mistral-7B Q4_K) | TextGeneration-WebUI (Llama3-8B FP8) | FastChat (QWen1.5-32B FP6) |
Model Accuracy
Please see the Perplexity result below (tested on the Wikitext dataset using the script here).
Perplexity | sym_int4 | q4_k | fp6 | fp8_e5m2 | fp8_e4m3 | fp16 |
---|---|---|---|---|---|---|
Llama-2-7B-chat-hf | 6.364 | 6.218 | 6.092 | 6.180 | 6.098 | 6.096 |
Mistral-7B-Instruct-v0.2 | 5.365 | 5.320 | 5.270 | 5.273 | 5.246 | 5.244 |
Baichuan2-7B-chat | 6.734 | 6.727 | 6.527 | 6.539 | 6.488 | 6.508 |
Qwen1.5-7B-chat | 8.865 | 8.816 | 8.557 | 8.846 | 8.530 | 8.607 |
Llama-3.1-8B-Instruct | 6.705 | 6.566 | 6.338 | 6.383 | 6.325 | 6.267 |
gemma-2-9b-it | 7.541 | 7.412 | 7.269 | 7.380 | 7.268 | 7.270 |
Baichuan2-13B-Chat | 6.313 | 6.160 | 6.070 | 6.145 | 6.086 | 6.031 |
Llama-2-13b-chat-hf | 5.449 | 5.422 | 5.341 | 5.384 | 5.332 | 5.329 |
Qwen1.5-14B-Chat | 7.529 | 7.520 | 7.367 | 7.504 | 7.297 | 7.334 |
[^1]: Performance varies by use, configuration and other factors. `ipex-llm` may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.
`ipex-llm` Quickstart
Docker
- GPU Inference in C++: running `llama.cpp`, `ollama`, `OpenWebUI`, etc., with `ipex-llm` on Intel GPU
- GPU Inference in Python: running HuggingFace `transformers`, `LangChain`, `LlamaIndex`, `ModelScope`, etc. with `ipex-llm` on Intel GPU
- vLLM on GPU: running `vLLM` serving with `ipex-llm` on Intel GPU
- vLLM on CPU: running `vLLM` serving with `ipex-llm` on Intel CPU
- FastChat on GPU: running `FastChat` serving with `ipex-llm` on Intel GPU
- VSCode on GPU: running and developing `ipex-llm` applications in Python using VSCode on Intel GPU
Use
- llama.cpp: running llama.cpp (using the C++ interface of `ipex-llm`) on Intel GPU
- Ollama: running ollama (using the C++ interface of `ipex-llm`) on Intel GPU
- PyTorch/HuggingFace: running PyTorch, HuggingFace, LangChain, LlamaIndex, etc. (using the Python interface of `ipex-llm`) on Intel GPU for Windows and Linux; see the sketch after this list
- vLLM: running `ipex-llm` in vLLM on both Intel GPU and CPU
- FastChat: running `ipex-llm` in FastChat serving on both Intel GPU and CPU
- Serving on multiple Intel GPUs: running `ipex-llm` serving on multiple Intel GPUs by leveraging DeepSpeed AutoTP and FastAPI
- Text-Generation-WebUI: running `ipex-llm` in `oobabooga` WebUI
- Axolotl: running `ipex-llm` in Axolotl for LLM finetuning
- Benchmarking: running (latency and throughput) benchmarks for `ipex-llm` on Intel CPU and GPU
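For the PyTorch/HuggingFace entry above, a minimal sketch of running the Python interface on an Intel GPU is shown below. It assumes the GPU build of `ipex-llm` is installed and the Intel GPU driver/oneAPI runtime is configured; the model name is a placeholder and details may differ slightly across versions, so the linked quickstart remains the authoritative reference.

```python
# A minimal sketch of running a low-bit model on an Intel GPU ("xpu" device).
# Assumes the GPU build of ipex-llm (e.g. ipex-llm[xpu]) plus the Intel GPU driver /
# oneAPI runtime; the model name is a placeholder.
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder

model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
model = model.to("xpu")  # move the low-bit model to the Intel GPU
tokenizer = AutoTokenizer.from_pretrained(model_path)

with torch.inference_mode():
    input_ids = tokenizer.encode("What is Intel Arc?", return_tensors="pt").to("xpu")
    output = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(output[0].cpu(), skip_special_tokens=True))
```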
Applications
- GraphRAG: running Microsoft's `GraphRAG` using local LLM with `ipex-llm`
- RAGFlow: running `RAGFlow` (an open-source RAG engine) with `ipex-llm`
- LangChain-Chatchat: running `LangChain-Chatchat` (Knowledge Base QA using RAG pipeline) with `ipex-llm`
- Coding copilot: running `Continue` (coding copilot in VSCode) with `ipex-llm`
- Open WebUI: running `Open WebUI` with `ipex-llm`
- PrivateGPT: running `PrivateGPT` to interact with documents with `ipex-llm`
- Dify platform: running `ipex-llm` in `Dify` (production-ready LLM app development platform)
Install
- Windows GPU: installing `ipex-llm` on Windows with Intel GPU
- Linux GPU: installing `ipex-llm` on Linux with Intel GPU
- For more details, please refer to the full installation guide
Code Examples
- Low bit inference
  - INT4 inference: INT4 LLM inference on Intel GPU and CPU
  - FP8/FP6/FP4 inference: FP8, FP6 and FP4 LLM inference on Intel GPU
  - INT8 inference: INT8 LLM inference on Intel GPU and CPU
  - INT2 inference: INT2 LLM inference (based on llama.cpp IQ2 mechanism) on Intel GPU
- FP16/BF16 inference
  - FP16 LLM inference on Intel GPU, with possible self-speculative decoding optimization
  - BF16 LLM inference on Intel CPU, with possible self-speculative decoding optimization
- Distributed inference
- Save and load
  - Low-bit models: saving and loading `ipex-llm` low-bit models (INT4/FP4/FP6/INT8/FP8/FP16/etc.); see the sketch after this list
  - GGUF: directly loading GGUF models into `ipex-llm`
  - AWQ: directly loading AWQ models into `ipex-llm`
  - GPTQ: directly loading GPTQ models into `ipex-llm`
- Finetuning
- Integration with community libraries
- Tutorials
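For the low-bit save-and-load item above, a minimal sketch of the idea follows. The paths are placeholders, and `save_low_bit`/`load_low_bit` follow the pattern of the low-bit save-load examples referenced in the list; check those examples for the authoritative usage on your `ipex-llm` version.

```python
# A minimal sketch of saving/loading an ipex-llm low-bit model so later runs can skip
# re-quantization. Paths are placeholders; method names follow the low-bit save-load
# examples and may differ slightly between releases.
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "meta-llama/Llama-2-7b-chat-hf"   # placeholder
save_dir = "./llama-2-7b-chat-sym-int4"        # placeholder

# First run: quantize while loading, then persist the low-bit weights.
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_low_bit="sym_int4")
model.save_low_bit(save_dir)
AutoTokenizer.from_pretrained(model_path).save_pretrained(save_dir)

# Later runs: load the already-quantized weights directly.
model = AutoModelForCausalLM.load_low_bit(save_dir)
tokenizer = AutoTokenizer.from_pretrained(save_dir)
```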
API Doc
FAQ
Verified Models
Over 50 models have been optimized/verified on `ipex-llm`, including LLaMA/LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM2/ChatGLM3, Baichuan/Baichuan2, Qwen/Qwen-1.5, InternLM and more; see the list below.
Model | CPU Example | GPU Example |
---|---|---|
LLaMA (such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.) | link1, link2 | link |
LLaMA 2 | link1, link2 | link |
LLaMA 3 | link | link |
LLaMA 3.1 | link | link |
LLaMA 3.2 | link | |
LLaMA 3.2-Vision | link | |
ChatGLM | link | |
ChatGLM2 | link | link |
ChatGLM3 | link | link |
GLM-4 | link | link |
GLM-4V | link | link |
Mistral | link | link |
Mixtral | link | link |
Falcon | link | link |
MPT | link | link |
Dolly-v1 | link | link |
Dolly-v2 | link | link |
Replit Code | link | link |
RedPajama | link1, link2 | |
Phoenix | link1, link2 | |
StarCoder | link1, link2 | link |
Baichuan | link | link |
Baichuan2 | link | link |
InternLM | link | link |
InternVL2 | link | |
Qwen | link | link |
Qwen1.5 | link | link |
Qwen2 | link | link |
Qwen2.5 | link | |
Qwen-VL | link | link |
Qwen2-VL | link | |
Qwen2-Audio | link | |
Aquila | link | link |
Aquila2 | link | link |
MOSS | link | |
Whisper | link | link |
Phi-1_5 | link | link |
Flan-t5 | link | link |
LLaVA | link | link |
CodeLlama | link | link |
Skywork | link | |
InternLM-XComposer | link | |
WizardCoder-Python | link | |
CodeShell | link | |
Fuyu | link | |
Distil-Whisper | link | link |
Yi | link | link |
BlueLM | link | link |
Mamba | link | link |
SOLAR | link | link |
Phixtral | link | link |
InternLM2 | link | link |
RWKV4 | link | |
RWKV5 | link | |
Bark | link | link |
SpeechT5 | link | |
DeepSeek-MoE | link | |
Ziya-Coding-34B-v1.0 | link | |
Phi-2 | link | link |
Phi-3 | link | link |
Phi-3-vision | link | link |
Yuan2 | link | link |
Gemma | link | link |
Gemma2 | link | |
DeciLM-7B | link | link |
Deepseek | link | link |
StableLM | link | link |
CodeGemma | link | link |
Command-R/cohere | link | link |
CodeGeeX2 | link | link |
MiniCPM | link | link |
MiniCPM3 | link | |
MiniCPM-V | link | |
MiniCPM-V-2 | link | link |
MiniCPM-Llama3-V-2_5 | link | |
MiniCPM-V-2_6 | link | link |
Get Support
- Please report a bug or raise a feature request by opening a GitHub Issue
- Please report a vulnerability by opening a draft GitHub Security Advisory
Recent releases (data updated 2024-09-27 04:46:11):
2024-08-22 17:06:57 v2.1.0
2023-11-13 10:02:20 v2.4.0
2023-04-24 10:17:43 v2.3.0
2023-01-19 13:18:37 v2.2.0
2022-09-28 11:06:27 v2.1.0
2022-03-09 15:47:13 v2.0.0
2021-07-09 20:20:26 v0.13.0
2021-04-21 09:53:25 v0.12.2
2021-01-05 13:55:32 v0.12.1
2021-01-05 13:52:01 v0.11.1
Topics: gpu, llm, pytorch, transformers