Step 2 of 6 · Unsloth · 15 min

Choose a model and load it with Unsloth

Pick a base instruct model (e.g. Llama 3.1 8B), load it in 4-bit with Unsloth for QLoRA, and prepare it for training.

In this step

  • Choose an instruct model suitable for Colab
  • Load the model in 4-bit (QLoRA) with FastLanguageModel
  • Set max_seq_length and dtype
  • Add LoRA adapters for training

In this step you will choose a base model and load it with Unsloth in a way that keeps VRAM low (4-bit QLoRA) and prepares it for fine-tuning. Unsloth provides pre-optimized model variants on the Hugging Face Hub; using them ensures correct tokenizer and chat templates and often better behavior in Colab.

1. Why "instruct" and why 4-bit?

  • Instruct models (e.g. Llama 3.1 8B Instruct, Mistral 7B Instruct) are already aligned to follow instructions and use chat-style templates. Fine-tuning them with your data usually needs less data and fewer epochs than starting from a base (completion-only) model.
  • 4-bit loading (QLoRA) reduces memory by about 4× compared to 16-bit, so an 8B model fits comfortably on a T4. Unsloth's implementations are tuned so that 4-bit fine-tuning stays accurate.
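To see where the "about 4×" comes from, a quick back-of-the-envelope calculation compares weight storage at 16-bit vs. 4-bit for an 8B-parameter model (weights only; activations, KV cache, and optimizer state add more on top):

```python
# Rough weight-memory estimate for an 8B-parameter model (weights only;
# activations, KV cache, and optimizer state are not included).
params = 8e9

gb_16bit = params * 2 / 1024**3   # 2 bytes per parameter (fp16/bf16)
gb_4bit = params * 0.5 / 1024**3  # 0.5 bytes per parameter (4-bit)

print(f"16-bit: ~{gb_16bit:.1f} GB, 4-bit: ~{gb_4bit:.1f} GB")
# → 16-bit: ~14.9 GB, 4-bit: ~3.7 GB
```

At ~3.7 GB for weights, an 8B model in 4-bit leaves plenty of the T4's 15 GB for activations and LoRA optimizer state, while 16-bit weights alone would nearly fill the card.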

2. Pick a model

For Colab's free T4 (15GB VRAM), a 7B–8B parameter model in 4-bit is a good default. Examples:

  • Llama 3.1 8B Instruct: unsloth/llama-3.1-8b-instruct-bnb-4bit (Llama 3 chat template, 128K context)
  • Llama 3.2 3B Instruct: unsloth/llama-3.2-3b-instruct-bnb-4bit (lighter and faster)
  • Mistral 7B Instruct: unsloth/Mistral-7B-Instruct-v0.3-bnb-4bit (strong 7B option)

Names ending in -bnb-4bit are quantized with (or compatible with) BitsAndBytes 4-bit. Unsloth also offers "unsloth" 4-bit variants (e.g. unsloth/llama-3.1-8b-unsloth-bnb-4bit) that can improve accuracy; use them when listed in the Unsloth model catalog.

Choose one model and set its ID in your notebook:

model_name = "unsloth/llama-3.1-8b-instruct-bnb-4bit"  # or another from the table
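If you are unsure which to pick, a tiny helper can encode the rule of thumb from above: an 8B model in 4-bit wants roughly 12 GB of VRAM or more, otherwise drop to the 3B. This `pick_model` function is a hypothetical illustration (the thresholds are rough guidance, not an Unsloth API), using the IDs from the table:

```python
# Hypothetical helper: map available VRAM (GB) to a model ID from the
# table above. Thresholds are rough rules of thumb, not Unsloth's own.
def pick_model(vram_gb: float) -> str:
    if vram_gb >= 12:
        return "unsloth/llama-3.1-8b-instruct-bnb-4bit"  # 8B in 4-bit
    return "unsloth/llama-3.2-3b-instruct-bnb-4bit"      # lighter 3B fallback

model_name = pick_model(15)  # Colab's free T4 has ~15 GB
print(model_name)
```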

3. Load the model with Unsloth

Use Unsloth's FastLanguageModel.from_pretrained to load the model and tokenizer. Key arguments:

  • max_seq_length: Maximum sequence length for training. 2048 is a safe default for Colab; you can increase (e.g. 4096, 8192) if your data and VRAM allow.
  • dtype: Leave None for auto, or set torch.float16 / torch.bfloat16 if your GPU supports it.
  • load_in_4bit: True for QLoRA (recommended in Colab).

from unsloth import FastLanguageModel

max_seq_length = 2048  # Reduce to 1024 if you hit OOM later
dtype = None            # Auto
load_in_4bit = True     # QLoRA

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

If you prefer 16-bit LoRA (no quantization), set load_in_4bit=False and ensure your GPU has enough VRAM (e.g. 16GB+ for 8B).
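The dtype choice follows the GPU generation: bfloat16 requires an Ampere-or-newer card (compute capability 8.0+), while older GPUs like Colab's T4 (compute capability 7.5) should use float16. A minimal sketch of that auto-selection logic, with strings standing in for torch.bfloat16 / torch.float16:

```python
# Sketch of the dtype=None auto-selection idea: bf16 needs compute
# capability >= 8.0 (Ampere or newer); otherwise fall back to fp16.
# Strings stand in for torch.bfloat16 / torch.float16.
def auto_dtype(compute_capability: float) -> str:
    return "bfloat16" if compute_capability >= 8.0 else "float16"

print(auto_dtype(7.5))  # Colab T4 → float16
print(auto_dtype(8.0))  # A100   → bfloat16
```

Leaving dtype=None lets Unsloth make this decision for you, which is why it is the recommended default in Colab.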

4. Add LoRA adapters

Before training, you must attach LoRA adapters to the model. Unsloth's get_peft_model sets these up in a way that works with their kernels:

from unsloth import FastLanguageModel

model = FastLanguageModel.get_peft_model(
    model,
    r=16,               # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # Saves more VRAM
    random_state=42,
)

  • r: LoRA rank. 8 or 16 is typical; 16 is a good default for quality vs. VRAM.
  • target_modules: The linear layers that get LoRA. The list above is standard for Llama/Mistral-style models.
  • use_gradient_checkpointing="unsloth": Reduces activation memory during training so longer sequences or larger batch sizes fit.
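For intuition on what r=16 costs, each adapted linear layer of shape (d_in, d_out) gains two low-rank factors, A (d_in × r) and B (r × d_out), i.e. r * (d_in + d_out) extra trainable parameters. Plugging in the published Llama 3.1 8B shapes (hidden size 4096, intermediate size 14336, GQA k/v projections 1024 wide, 32 layers) gives roughly 42M trainable parameters, well under 1% of the 8B base:

```python
# Count trainable LoRA parameters: each adapted linear layer of shape
# (d_in, d_out) adds r * (d_in + d_out) parameters (factors A and B).
r = 16
layers = 32  # Llama 3.1 8B has 32 decoder layers

# (d_in, d_out) per target module, from the Llama 3.1 8B config:
# hidden 4096, intermediate 14336, GQA k/v projections 1024 wide.
modules = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

per_layer = sum(r * (d_in + d_out) for d_in, d_out in modules.values())
total = per_layer * layers
print(f"{total:,} trainable parameters")  # → 41,943,040 (~42M)
```

This is why QLoRA is so cheap: the frozen 4-bit base stays put, and only these ~42M adapter weights (plus their optimizer state) are trained in 16-bit.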

5. Verify model and tokenizer

Quick sanity check: the model should be on GPU and in trainable mode; the tokenizer should have a chat template:

print(next(model.parameters()).device)  # should be cuda:0
print(sum(p.numel() for p in model.parameters() if p.requires_grad))  # trainable params
# Optional: list supported chat templates
from unsloth.chat_templates import CHAT_TEMPLATES
print(list(CHAT_TEMPLATES.keys())[:10])

Summary

  • You chose an instruct model (e.g. Llama 3.1 8B) and set model_name.
  • You loaded it with FastLanguageModel.from_pretrained in 4-bit and set max_seq_length.
  • You added LoRA adapters with get_peft_model and gradient checkpointing.

In the next step you will prepare your custom dataset (e.g. question–answer pairs or conversations) and format it so it can be used for training.