April-2024
Release date: 2024-04-10 00:09:05
Latest unslothai/unsloth release: September-2024 (2024-09-24 05:32:53)
Long Context Window Support
You can now 2x your batch size or train on long context windows with Unsloth! 228K context windows are now possible with Mistral 7b on H100s (4x longer than HF + FA2).
How? We coded up async offloaded gradient checkpointing in 20 lines of pure @PyTorch, reducing VRAM usage by over 30% with only +1.9% extra overhead. We carefully mask the activation movement between RAM and the GPU so the transfers hide behind compute. No extra dependencies needed.
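For intuition, here is a minimal sketch of the idea in plain PyTorch - not Unsloth's actual code. A block's saved activation is copied to pinned CPU memory with a non-blocking transfer during the forward pass, then copied back to the GPU when the backward pass recomputes the block; the `OffloadedCheckpoint` name and single-tensor interface are illustrative assumptions.

```python
import torch

class OffloadedCheckpoint(torch.autograd.Function):
    """Sketch: gradient checkpointing whose saved activation lives in CPU RAM."""

    @staticmethod
    def forward(ctx, forward_fn, hidden_states, *args):
        # Asynchronously copy the activation to pinned CPU memory (GPU -> RAM).
        cpu_copy = torch.empty(hidden_states.shape, dtype = hidden_states.dtype,
                               device = "cpu", pin_memory = True)
        cpu_copy.copy_(hidden_states, non_blocking = True)
        with torch.no_grad():
            output = forward_fn(hidden_states, *args)
        ctx.save_for_backward(cpu_copy)
        ctx.forward_fn, ctx.args = forward_fn, args
        return output

    @staticmethod
    def backward(ctx, grad_output):
        (cpu_copy,) = ctx.saved_tensors
        # Copy the activation back (RAM -> GPU) and recompute the forward pass.
        hidden_states = cpu_copy.to(grad_output.device, non_blocking = True)
        hidden_states.requires_grad_(True)
        with torch.enable_grad():
            output = ctx.forward_fn(hidden_states, *ctx.args)
        torch.autograd.backward(output, grad_output)
        return (None, hidden_states.grad) + (None,) * len(ctx.args)
```

A decoder layer could then be wrapped as `output = OffloadedCheckpoint.apply(layer_forward, hidden_states)`; the real implementation is more careful about hiding the copies behind compute, as described above.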
Try our Colab notebook with Mistral's new long context v2 7b model + our new VRAM savings
You can turn it on with `use_gradient_checkpointing = "unsloth"`:
```python
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
)
```
The table below shows the maximum possible sequence length (in tokens) with Mistral 7b QLoRA at rank = 32:
| GPU | VRAM | HF + FA2 | Unsloth | Unsloth New |
|---|---|---|---|---|
| RTX 4060 | 8 GB | 1,696 | 3,716 | 7,340 |
| RTX 4070 | 12 GB | 4,797 | 11,055 | 19,610 |
| RTX 4080 | 16 GB | 7,898 | 18,394 | 31,880 |
| RTX 4090 | 24 GB | 14,099 | 33,073 | 56,420 |
| A100 | 40 GB | 26,502 | 62,431 | 105,500 |
| A6000 | 48 GB | 32,704 | 77,110 | 130,040 |
| H100 | 80 GB | 57,510 | 135,826 | 228,199 |
Self-Healing Tokenizers
We can now smartly convert a slow HF tokenizer to a fast one on the fly. We also load the tokenizer automatically and fix dangling, incorrect tokens. What is this useful for?
- Broken tokenizers like Starling or CodeLlama can be "self-healed" so they work. Left unhealed, they can cause unlucky out-of-bounds memory accesses.
- No need to manually edit the tokenizer files to support the ChatML format. Unsloth automatically edits the sentencepiece `tokenizer.model` and other files.
- Sometimes model uploaders require you to use the slow tokenizer because the fast tokenizer (HF's Rust version) gives wrong results. We try to convert it to a fast variant and confirm that it tokenizes correctly, as sketched below.
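As a rough illustration of that last check - my own sketch, not Unsloth's code - one can load both variants and compare their token ids on a few probe strings, keeping the slow tokenizer if anything disagrees; the model name and probe strings are placeholders.

```python
from transformers import AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # placeholder model

slow = AutoTokenizer.from_pretrained(model_name, use_fast = False)
fast = AutoTokenizer.from_pretrained(model_name, use_fast = True)

probes = [
    "Hello world!",
    "<|im_start|>user\nHi there<|im_end|>",  # ChatML-style text
    "def f(x):\n    return x * 2",           # code with significant whitespace
]

for text in probes:
    slow_ids = slow(text, add_special_tokens = False)["input_ids"]
    fast_ids = fast(text, add_special_tokens = False)["input_ids"]
    if slow_ids != fast_ids:
        print(f"Mismatch on {text!r} - keep the slow tokenizer")
        break
```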
28% Faster RoPE Embeddings
@HuyNguyen-hust managed to make Unsloth's RoPE embeddings around 28% faster! This is primarily useful for long context windows. Per the torch profiler, Unsloth's original kernel already kept RoPE under 2% of total runtime, so expect maybe 0.5 to 1% end-to-end speedups, especially for large training runs. Any speedup is vastly welcome! See #238 for more details.
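For context, here is a hedged sketch of that kind of measurement with the PyTorch profiler; `apply_rope` below is a toy rotary embedding of my own, not Unsloth's Triton kernel, and the tensor shapes are illustrative.

```python
import torch
from torch.profiler import profile, ProfilerActivity

def apply_rope(q, cos, sin):
    # Toy rotary position embedding, just to have a kernel to profile.
    q1, q2 = q.chunk(2, dim = -1)
    return torch.cat([q1 * cos - q2 * sin, q2 * cos + q1 * sin], dim = -1)

q   = torch.randn(1, 32, 4096, 128, device = "cuda", dtype = torch.float16)
cos = torch.randn(4096, 64, device = "cuda", dtype = torch.float16)
sin = torch.randn(4096, 64, device = "cuda", dtype = torch.float16)

with profile(activities = [ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(100):
        out = apply_rope(q, cos, sin)
    torch.cuda.synchronize()

# RoPE's share of total CUDA time bounds the end-to-end speedup you can expect.
print(prof.key_averages().table(sort_by = "cuda_time_total", row_limit = 10))
```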
Bug Fixes
- Gemma would not convert to GGUF correctly due to tied weights. Now fixed.
- Merging to 16bit on Kaggle used to break since Kaggle only provides 20GB of disk space - we now smartly delete the 4GB model.safetensors file first, allowing you to merge to 16bit.
- Batched generation during inference is finally fixed. We accidentally did not account for the attention mask and position ids; see the batched-generation sketch after this list. Reminder: inference is natively 2x faster!
- Finetuning on lm_head and embed_tokens now works correctly! See https://github.com/unslothai/unsloth/wiki#finetuning-the-lm_head-and-embed_tokens-matrices. Remember to set modules_to_save (a sketch follows this list).
- @oKatanaaa via #305 noticed you must downgrade to `protobuf<4.0.0`. We edited the `pyproject.toml` to make it work.
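For the batched-generation fix, here is a hedged sketch (my example, assuming the `FastLanguageModel.for_inference` helper and the `unsloth/mistral-7b-bnb-4bit` upload): once a batch is padded, the attention mask and position ids are exactly what must be handled correctly.

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/mistral-7b-bnb-4bit",  # assumed model name, for illustration
    max_seq_length = 2048,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)  # native fast inference mode

tokenizer.padding_side = "left"         # pad on the left so generations align
prompts = ["The capital of France is", "Gradient checkpointing lets you"]
inputs = tokenizer(prompts, return_tensors = "pt", padding = True).to("cuda")

# Passing the attention mask lets the model skip padding tokens and derive
# correct per-sequence position ids - the part this release fixes.
outputs = model.generate(**inputs, max_new_tokens = 32)
print(tokenizer.batch_decode(outputs, skip_special_tokens = True))
```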
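For the lm_head / embed_tokens bullet, a hedged sketch (assuming `get_peft_model` forwards `modules_to_save` to PEFT; follow the linked wiki page for the authoritative settings):

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/mistral-7b-bnb-4bit",  # assumed model name, for illustration
    max_seq_length = 4096,
    load_in_4bit = True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    modules_to_save = ["lm_head", "embed_tokens"],  # train these matrices too
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)
```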
As always, Colab and Kaggle do not need updating. On local machines, please run `pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git` to update Unsloth with no dependency changes.