January-2024
Release date: 2024-01-19 02:28:56
Latest unslothai/unsloth release: September-2024 (2024-09-24 05:32:53)
Upgrade Unsloth via `pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git`. No dependencies will be updated.
- 6x faster GGUF conversion and support for merging QLoRA adapters to float16:
# To merge to 16bit:
model.save_pretrained_merged("dir", tokenizer, save_method = "merged_16bit")
# To merge to 4bit:
model.save_pretrained_merged("dir", tokenizer, save_method = "merged_4bit")
# To save to GGUF:
model.save_pretrained_gguf("dir", tokenizer, quantization_method = "q4_k_m")
model.save_pretrained_gguf("dir", tokenizer, quantization_method = "q8_0")
model.save_pretrained_gguf("dir", tokenizer, quantization_method = "f16")
# All methods supported (listed below)
# To push to the Hugging Face Hub:
model.push_to_hub_merged("hf_username/dir", tokenizer, save_method = "merged_16bit")
model.push_to_hub_merged("hf_username/dir", tokenizer, save_method = "merged_4bit")
model.push_to_hub_gguf("hf_username/dir", tokenizer, quantization_method = "q4_k_m")
model.push_to_hub_gguf("hf_username/dir", tokenizer, quantization_method = "q8_0")
- 4x faster model downloads and >= 500MB less GPU memory fragmentation via pre-quantized models:
"unsloth/mistral-7b-bnb-4bit",
"unsloth/llama-2-7b-bnb-4bit",
"unsloth/llama-2-13b-bnb-4bit",
"unsloth/codellama-34b-bnb-4bit",
"unsloth/tinyllama-bnb-4bit",
- `packing = True` support via TRL, making training 5x faster (see the SFT sketch below).
- DPO support! 188% faster DPO training + no OOMs (see the DPO sketch below).
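
Both features plug into the standard TRL trainers. A hedged sketch, assuming `model`/`tokenizer` come from an Unsloth `FastLanguageModel` load and the datasets are prepared separately (hyperparameters and column names are illustrative, and exact argument names can vary across TRL versions):

```python
from transformers import TrainingArguments
from trl import SFTTrainer, DPOTrainer

args = TrainingArguments(
    output_dir = "outputs",
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 4,
    learning_rate = 2e-4,
    max_steps = 60,
)

# Packed SFT: short examples are concatenated into full-length sequences,
# which is where most of the throughput gain comes from.
sft_trainer = SFTTrainer(
    model              = model,
    tokenizer          = tokenizer,
    train_dataset      = sft_dataset,   # expects a "text" column
    dataset_text_field = "text",
    max_seq_length     = 2048,
    packing            = True,
    args               = args,
)
sft_trainer.train()

# DPO alternative: the dataset needs "prompt", "chosen" and "rejected" columns.
# With a LoRA/PEFT model, ref_model can be None and TRL derives the reference.
dpo_trainer = DPOTrainer(
    model             = model,
    ref_model         = None,
    beta              = 0.1,
    train_dataset     = dpo_dataset,
    tokenizer         = tokenizer,
    args              = args,
    max_length        = 1024,
    max_prompt_length = 512,
)
dpo_trainer.train()
```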
- LoRA dropout and bias support
- RSLoRA (rank-stabilized LoRA) and LoftQ support (see the sketch below)
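
These options are exposed through `FastLanguageModel.get_peft_model`. A hedged sketch with illustrative values (argument names follow Unsloth's LoRA wrapper; pass a PEFT `LoftQConfig` via `loftq_config` to enable LoftQ):

```python
from unsloth import FastLanguageModel

model = FastLanguageModel.get_peft_model(
    model,
    r              = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha     = 16,
    lora_dropout   = 0.05,    # non-zero LoRA dropout is now supported
    bias           = "none",  # LoRA bias settings are now supported as well
    use_rslora     = True,    # rank-stabilized LoRA scaling
    loftq_config   = None,    # or a LoftQConfig for LoftQ initialization
)
```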
- LLaMA-Factory support as a UI - https://github.com/hiyouga/LLaMA-Factory/wiki/Performance-comparison
- Tonnes of bug fixes
- And if you can, please support our work via Ko-fi! https://ko-fi.com/unsloth
GGUF:
Choose `quantization_method` to be one of:
"not_quantized" : "Recommended. Fast conversion. Slow inference, big files.",
"fast_quantized" : "Recommended. Fast conversion. OK inference, OK file size.",
"quantized" : "Recommended. Slow conversion. Fast inference, small files.",
"f32" : "Not recommended. Retains 100% accuracy, but super slow and memory hungry.",
"f16" : "Fastest conversion + retains 100% accuracy. Slow and memory hungry.",
"q8_0" : "Fast conversion. High resource use, but generally acceptable.",
"q4_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K",
"q5_k_m" : "Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K",
"q2_k" : "Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.",
"q3_k_l" : "Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
"q3_k_m" : "Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K",
"q3_k_s" : "Uses Q3_K for all tensors",
"q4_0" : "Original quant method, 4-bit.",
"q4_1" : "Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.",
"q4_k_s" : "Uses Q4_K for all tensors",
"q5_0" : "Higher accuracy, higher resource usage and slower inference.",
"q5_1" : "Even higher accuracy, resource usage and slower inference.",
"q5_k_s" : "Uses Q5_K for all tensors",
"q6_k" : "Uses Q8_K for all tensors",