In this step you will choose a base model and load it with Unsloth in a way that keeps VRAM low (4-bit QLoRA) and prepares it for fine-tuning. Unsloth provides pre-optimized model variants on the Hugging Face Hub; using them ensures correct tokenizer and chat templates and often better behavior in Colab.
1. Why "instruct" and why 4-bit?
- Instruct models (e.g. Llama 3.1 8B Instruct, Mistral 7B Instruct) are already aligned to follow instructions and use chat-style templates. Fine-tuning them with your data usually needs less data and fewer epochs than starting from a base (completion-only) model.
- 4-bit loading (QLoRA) reduces memory by about 4× compared to 16-bit, so an 8B model fits comfortably on a T4. Unsloth's implementations are tuned so that 4-bit fine-tuning stays accurate.
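As a back-of-the-envelope check on the "about 4×" claim, the sketch below (a hypothetical helper, not part of Unsloth) estimates the memory taken by model weights alone. Real usage also needs room for activations, optimizer state, and CUDA overhead, so treat these as lower bounds:

```python
# Rough VRAM estimate for the model weights alone (illustrative only).
def weight_memory_gb(n_params_billion: float, bits_per_param: float) -> float:
    bytes_total = n_params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

# An 8B model in 16-bit vs. 4-bit:
fp16 = weight_memory_gb(8, 16)  # ~16 GB: weights alone exceed a 15GB T4
q4 = weight_memory_gb(8, 4)     # ~4 GB: leaves headroom for training state
```

This is why 4-bit is the default recommendation for Colab: in 16-bit, the weights of an 8B model already exceed the T4's 15GB before any training state is allocated.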
2. Pick a model
For Colab's free T4 (15GB VRAM), a 7B–8B parameter model in 4-bit is a good default. Examples:
| Model | Hugging Face ID (Unsloth) | Notes |
|---|---|---|
| Llama 3.1 8B Instruct | unsloth/llama-3.1-8b-instruct-bnb-4bit | Llama 3 chat template, 128K context |
| Llama 3.2 3B Instruct | unsloth/llama-3.2-3b-instruct-bnb-4bit | Lighter, faster |
| Mistral 7B Instruct | unsloth/Mistral-7B-Instruct-v0.3-bnb-4bit | Strong 7B option |
Names ending in `-bnb-4bit` are quantized with (or compatible with) BitsAndBytes 4-bit. Unsloth also offers "unsloth" 4-bit variants (e.g. `unsloth/llama-3.1-8b-unsloth-bnb-4bit`) that can improve accuracy; use them when listed in the Unsloth model catalog.
Choose one model and set its ID in your notebook:
```python
model_name = "unsloth/llama-3.1-8b-instruct-bnb-4bit"  # or another from the table
```
3. Load the model with Unsloth
Use Unsloth's `FastLanguageModel.from_pretrained` to load the model and tokenizer. Key arguments:
- `max_seq_length`: Maximum sequence length for training. 2048 is a safe default for Colab; you can increase it (e.g. 4096, 8192) if your data and VRAM allow.
- `dtype`: Leave as `None` for auto-detection, or set `torch.float16`/`torch.bfloat16` if your GPU supports it.
- `load_in_4bit`: `True` for QLoRA (recommended in Colab).
```python
from unsloth import FastLanguageModel

max_seq_length = 2048  # Reduce to 1024 if you hit OOM later
dtype = None           # Auto-detect
load_in_4bit = True    # QLoRA

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
```
If you prefer 16-bit LoRA (no quantization), set `load_in_4bit=False` and ensure your GPU has enough VRAM (e.g. 16GB+ for 8B).
4. Add LoRA adapters
Before training, you must attach LoRA adapters to the model. Unsloth's `get_peft_model` sets these up in a way that works with its optimized kernels:
```python
from unsloth import FastLanguageModel

model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # Saves more VRAM
    random_state=42,
)
```
- `r`: LoRA rank. 8 or 16 is typical; 16 is a good default for quality vs. VRAM.
- `target_modules`: The linear layers that get LoRA adapters. The list above is standard for Llama/Mistral-style models.
- `use_gradient_checkpointing="unsloth"`: Reduces activation memory during training so longer sequences or larger batch sizes fit.
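To build intuition for why LoRA is so cheap, the sketch below (a hypothetical helper; 4096 is assumed here as the hidden size of a Llama-3.1-8B-class model) counts the extra trainable parameters a rank-`r` adapter adds to one linear layer: two low-rank factors, A of shape (r, d_in) and B of shape (d_out, r), giving r × (d_in + d_out) parameters.

```python
# Count the trainable parameters LoRA adds to one linear layer of shape
# (d_out, d_in): factor A is (r, d_in), factor B is (d_out, r).
def lora_params(d_in: int, d_out: int, r: int) -> int:
    return r * (d_in + d_out)

# A square 4096x4096 projection (e.g. q_proj) at rank 16:
per_layer = lora_params(4096, 4096, 16)  # 131072 extra params
```

Even summed over all seven target modules in every layer, the adapters are a small fraction of the 8B frozen weights, which is why only a modest amount of extra VRAM is needed for gradients and optimizer state.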
5. Verify model and tokenizer
Quick sanity check: the model should be on GPU and in trainable mode; the tokenizer should have a chat template:
```python
print(next(model.parameters()).device)  # should be cuda:0
print(sum(p.numel() for p in model.parameters() if p.requires_grad))  # trainable params

# Optional: list supported chat templates
from unsloth.chat_templates import CHAT_TEMPLATES
print(list(CHAT_TEMPLATES.keys())[:10])
```
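If you have not seen a chat template in action before, the sketch below shows roughly what a ChatML-style template renders a conversation into. This is an illustrative hand-rolled formatter only; in the notebook you would always call `tokenizer.apply_chat_template` rather than formatting strings yourself:

```python
# Illustrative only: roughly what a ChatML-style template produces.
# In practice, use tokenizer.apply_chat_template instead.
def chatml_format(messages: list[dict]) -> str:
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    return "".join(parts)

example = chatml_format([
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there."},
])
```

Seeing the rendered string makes it easier to debug the dataset-formatting step later: if training loss looks wrong, printing one formatted sample is usually the fastest sanity check.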
Summary
- You chose an instruct model (e.g. Llama 3.1 8B) and set `model_name`.
- You loaded it with `FastLanguageModel.from_pretrained` in 4-bit and set `max_seq_length`.
- You added LoRA adapters with `get_peft_model` and gradient checkpointing.
In the next step you will prepare your custom dataset (e.g. question–answer pairs or conversations) and format it so it can be used for training.