In this step you will choose a base model and load it with Unsloth in a way that keeps VRAM low (4-bit QLoRA) and prepares it for fine-tuning. Unsloth provides pre-optimized model variants on the Hugging Face Hub; using them ensures correct tokenizer and chat templates and often better behavior in Colab.
1. Why "instruct" and why 4-bit?
- Instruct models (e.g. Llama 3.1 8B Instruct, Mistral 7B Instruct) are already aligned to follow instructions and use chat-style templates. Fine-tuning them with your data usually needs less data and fewer epochs than starting from a base (completion-only) model.
- 4-bit loading (QLoRA) reduces memory by about 4× compared to 16-bit, so an 8B model fits comfortably on a T4. Unsloth's implementations are tuned so that 4-bit fine-tuning stays accurate.
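As a back-of-the-envelope check on the "about 4×" claim, the sketch below (a hypothetical helper, not part of Unsloth) estimates the memory taken by model weights alone. Real usage also needs room for activations, optimizer state, and CUDA overhead, so treat these as lower bounds:

```python
# Rough VRAM estimate for the model weights alone (illustrative only).
def weight_memory_gb(n_params_billion: float, bits_per_param: float) -> float:
    bytes_total = n_params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

# An 8B model in 16-bit vs. 4-bit:
fp16 = weight_memory_gb(8, 16)  # ~16 GB: weights alone exceed a 15GB T4
q4 = weight_memory_gb(8, 4)     # ~4 GB: leaves headroom for training state
```

This is why 4-bit is the default recommendation for Colab: in 16-bit, the weights of an 8B model already exceed the T4's 15GB before any training state is allocated.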
2. Pick a model
For Colab's free T4 (15GB VRAM), a 7B–8B parameter model in 4-bit is a good default. Examples:
| Model | Hugging Face ID (Unsloth) | Notes |
|---|---|---|
| Llama 3.1 8B Instruct | unsloth/llama-3.1-8b-instruct-bnb-4bit | Llama 3 chat template, 128K context |
| Llama 3.2 3B Instruct | unsloth/llama-3.2-3b-instruct-bnb-4bit | Lighter, faster |
| Mistral 7B Instruct | unsloth/Mistral-7B-Instruct-v0.3-bnb-4bit | Strong 7B option |
Names ending in `-bnb-4bit` are quantized with (or compatible with) BitsAndBytes 4-bit. Unsloth also offers "unsloth" 4-bit variants (e.g. `unsloth/llama-3.1-8b-unsloth-bnb-4bit`) that can improve accuracy; use them when listed in the Unsloth model catalog.
Choose one model and set its ID in your notebook:
```python
model_name = "unsloth/llama-3.1-8b-instruct-bnb-4bit"  # or another from the table
```
3. Load the model with Unsloth
Use Unsloth's `FastLanguageModel.from_pretrained` to load the model and tokenizer. Key arguments:
- `max_seq_length`: Maximum sequence length for training. 2048 is a safe default for Colab; you can increase it (e.g. 4096, 8192) if your data and VRAM allow.
- `dtype`: Leave as `None` for auto-detection, or set `torch.float16`/`torch.bfloat16` if your GPU supports it.
- `load_in_4bit`: `True` for QLoRA (recommended in Colab).
```python
from unsloth import FastLanguageModel

max_seq_length = 2048  # Reduce to 1024 if you hit OOM later
dtype = None           # Auto-detect
load_in_4bit = True    # QLoRA

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
```
If you prefer 16-bit LoRA (no quantization), set `load_in_4bit=False` and ensure your GPU has enough VRAM (e.g. 16GB+ for 8B).
4. Add LoRA adapters
Before training, you must attach LoRA adapters to the model. Unsloth's `get_peft_model` sets these up in a way that works with its optimized kernels:
```python
from unsloth import FastLanguageModel

model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # Saves more VRAM
    random_state=42,
)
```
- `r`: LoRA rank. 8 or 16 is typical; 16 is a good default for quality vs. VRAM.
- `target_modules`: The linear layers that get LoRA adapters. The list above is standard for Llama/Mistral-style models.
- `use_gradient_checkpointing="unsloth"`: Reduces activation memory during training so longer sequences or larger batch sizes fit.
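To build intuition for why LoRA is so cheap, the sketch below (a hypothetical helper; 4096 is assumed here as the hidden size of a Llama-3.1-8B-class model) counts the extra trainable parameters a rank-`r` adapter adds to one linear layer: two low-rank factors, A of shape (r, d_in) and B of shape (d_out, r), giving r × (d_in + d_out) parameters.

```python
# Count the trainable parameters LoRA adds to one linear layer of shape
# (d_out, d_in): factor A is (r, d_in), factor B is (d_out, r).
def lora_params(d_in: int, d_out: int, r: int) -> int:
    return r * (d_in + d_out)

# A square 4096x4096 projection (e.g. q_proj) at rank 16:
per_layer = lora_params(4096, 4096, 16)  # 131072 extra params
```

Even summed over all seven target modules in every layer, the adapters are a small fraction of the 8B frozen weights, which is why only a modest amount of extra VRAM is needed for gradients and optimizer state.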
5. Verify model and tokenizer
Quick sanity check: the model should be on GPU and in trainable mode; the tokenizer should have a chat template:
```python
print(next(model.parameters()).device)  # should be cuda:0
print(sum(p.numel() for p in model.parameters() if p.requires_grad))  # trainable params

# Optional: list supported chat templates
from unsloth.chat_templates import CHAT_TEMPLATES
print(list(CHAT_TEMPLATES.keys())[:10])
```
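If you have not seen a chat template in action before, the sketch below shows roughly what a ChatML-style template renders a conversation into. This is an illustrative hand-rolled formatter only; in the notebook you would always call `tokenizer.apply_chat_template` rather than formatting strings yourself:

```python
# Illustrative only: roughly what a ChatML-style template produces.
# In practice, use tokenizer.apply_chat_template instead.
def chatml_format(messages: list[dict]) -> str:
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    return "".join(parts)

example = chatml_format([
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there."},
])
```

Seeing the rendered string makes it easier to debug the dataset-formatting step later: if training loss looks wrong, printing one formatted sample is usually the fastest sanity check.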
Summary
- You chose an instruct model (e.g. Llama 3.1 8B) and set `model_name`.
- You loaded it with `FastLanguageModel.from_pretrained` in 4-bit and set `max_seq_length`.
- You added LoRA adapters with `get_peft_model` and gradient checkpointing.
In the next step you will prepare your custom dataset (e.g. question–answer pairs or conversations) and format it so it can be used for training.