Your fine-tuned model's behavior depends heavily on what you train it on. In this step you will prepare your custom dataset: decide its purpose, choose a format (instruction/answer or multi-turn chat), load the data, and ensure it's in a shape that the next step (chat template + tokenizer) can consume.
1. Purpose and format
Before writing code, clarify:
- Task: Domain Q&A, customer support, code generation, summarization, etc.
- Format:
  - Single-turn (instruction + output): One instruction (and optional input) → one model response. Good for form-style tasks, FAQs, and classification.
  - Multi-turn (conversations): Multiple user/assistant messages per example. Good for chatbots and dialogue.
For single-turn data, a minimal structure is:
- instruction: The task or question.
- input: (Optional) Extra context.
- output: The desired answer.
For multi-turn, use a list of messages, each with a role and content (e.g. user / assistant). This matches the chat format used by Hugging Face chat templates and OpenAI-style APIs.
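For example, a two-turn support conversation in the messages format could look like this (the contents are invented for illustration):

```python
example = {
    "messages": [
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Go to Settings → Account → Reset password."},
        {"role": "user", "content": "And if I no longer have access to my email?"},
        {"role": "assistant", "content": "Contact support to verify your identity first."},
    ]
}

# Each message has exactly the keys "role" and "content", and roles alternate.
assert all(set(m) == {"role", "content"} for m in example["messages"])
```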
2. Load data from a file (CSV / JSON)
If your data is in a CSV or JSON file (e.g. uploaded to Colab or from Google Drive), load it with pandas or the standard library and convert to a list of dicts.
Example: CSV with columns question, answer
```python
import pandas as pd

# Upload your file in Colab: Files → Upload, or mount Drive
df = pd.read_csv("your_data.csv")  # or pd.read_json("your_data.json")

# Assume columns: question, answer
dataset = []
for _, row in df.iterrows():
    dataset.append({
        "instruction": row["question"],
        "input": "",  # optional
        "output": row["answer"],
    })

# Or as messages (for the chat template later)
dataset = []
for _, row in df.iterrows():
    dataset.append({
        "messages": [
            {"role": "user", "content": row["question"]},
            {"role": "assistant", "content": row["answer"]},
        ]
    })
```
Example: JSON with a list of Q&A objects
```python
import json

with open("your_data.json") as f:
    raw = json.load(f)

dataset = []
for item in raw:
    dataset.append({
        "instruction": item.get("question", item.get("instruction", "")),
        "input": item.get("context", ""),
        "output": item.get("answer", item.get("output", "")),
    })
```
3. Load from Hugging Face Datasets
If your data is already on the Hugging Face Hub:
```python
from datasets import load_dataset

ds = load_dataset("username/dataset_name", split="train")

# Convert to a list of dicts if needed
dataset = [
    {"instruction": x["instruction"], "input": x.get("input", ""), "output": x["output"]}
    for x in ds
]
```
You can also use a public dataset (e.g. Alpaca-style) and then mix or replace with your own rows.
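Mixing can be as simple as concatenating the public rows with your own and shuffling. A minimal sketch, with the public rows stubbed out (in practice they would come from a Hub dataset such as the Alpaca-style tatsu-lab/alpaca, converted to the same instruction/input/output shape):

```python
import random

# Stub standing in for rows loaded from a public dataset, e.g.
# load_dataset("tatsu-lab/alpaca", split="train") converted to dicts.
public_rows = [{"instruction": f"public {i}", "input": "", "output": "..."} for i in range(3)]
my_rows = [{"instruction": "my question", "input": "", "output": "my answer"}]

# Concatenate and shuffle so your rows are spread through training.
mixed = public_rows + my_rows
random.Random(0).shuffle(mixed)
print(len(mixed))  # → 4
```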
4. Minimum size and quality
- Minimum: Aim for at least 100 high-quality examples. Better results usually come from 500–2000+.
- Quality: Prefer clear, consistent instructions and correct, relevant answers. Remove duplicates and obviously bad rows.
- Balance: If you have multiple classes or topics, avoid one class dominating; balance or sample so the model sees variety.
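A minimal exact-match deduplication pass over the instruction/input/output format could look like this (near-duplicates need fuzzier matching, e.g. normalizing whitespace or case first):

```python
def deduplicate(dataset):
    """Keep the first occurrence of each (instruction, input, output) triple."""
    seen = set()
    unique = []
    for example in dataset:
        key = (example["instruction"], example.get("input", ""), example["output"])
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique

rows = [
    {"instruction": "Q1", "input": "", "output": "A1"},
    {"instruction": "Q1", "input": "", "output": "A1"},  # exact duplicate
    {"instruction": "Q2", "input": "", "output": "A2"},
]
print(len(deduplicate(rows)))  # → 2
```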
Quick checks:
```python
print("Number of examples:", len(dataset))
print("Sample:", dataset[0])

# Optional: character-length stats for outputs
lengths = [len(d["output"]) for d in dataset]
print("Avg output length:", sum(lengths) / len(lengths))
```
5. Convert to Hugging Face Dataset
The Hugging Face Trainer expects a Dataset object for training. Wrap your list:
```python
from datasets import Dataset

train_dataset = Dataset.from_list(dataset)
```
If you have both train and validation splits:
```python
from datasets import Dataset, DatasetDict

train_dataset = Dataset.from_list(dataset_train)
eval_dataset = Dataset.from_list(dataset_eval)
dataset_dict = DatasetDict({"train": train_dataset, "eval": eval_dataset})
```
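If you only have a single list, you can carve out the validation split yourself before wrapping. A sketch with an arbitrary 10% eval fraction and a fixed seed for reproducibility:

```python
import random

def split_dataset(dataset, eval_fraction=0.1, seed=42):
    """Shuffle a list of examples and split off a small eval set."""
    examples = list(dataset)
    random.Random(seed).shuffle(examples)
    n_eval = max(1, int(len(examples) * eval_fraction))
    return examples[n_eval:], examples[:n_eval]

train_rows, eval_rows = split_dataset([{"id": i} for i in range(20)])
print(len(train_rows), len(eval_rows))  # → 18 2
```

After wrapping, `Dataset.train_test_split(test_size=0.1)` from the `datasets` library does the same job and returns a `DatasetDict` with "train" and "test" splits.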
Summary
- You defined the purpose of your dataset and chose single-turn (instruction/input/output) or multi-turn (messages).
- You loaded data from CSV, JSON, or Hugging Face and built a list of dicts.
- You ensured size (100+ examples) and quality, then wrapped it in a Hugging Face Dataset.
In the next step you will apply the correct chat template to this dataset and tokenize it so the model can be trained on it.