Your fine-tuned model's behavior depends heavily on what you train it on. In this step you will prepare your custom dataset: decide its purpose, choose a format (instruction/answer or multi-turn chat), load the data, and ensure it's in a shape that the next step (chat template + tokenizer) can consume.
1. Purpose and format
Before writing code, clarify:
- Task: Domain Q&A, customer support, code generation, summarization, etc.
- Format:
  - Single-turn (instruction + output): One instruction (and optional input) → one model response. Good for form-style tasks, FAQs, and classification.
  - Multi-turn (conversations): Multiple user/assistant messages per example. Good for chatbots and dialogue.
For single-turn data, a minimal structure is:
- instruction: The task or question.
- input: (Optional) Extra context.
- output: The desired answer.
For multi-turn, use a list of messages, each with a role and content (e.g. user / assistant). This matches the chat format used by Hugging Face chat templates and OpenAI-style APIs.
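For example, a two-turn support conversation in the messages format could look like this (the contents are invented for illustration):

```python
example = {
    "messages": [
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Go to Settings → Account → Reset password."},
        {"role": "user", "content": "And if I no longer have access to my email?"},
        {"role": "assistant", "content": "Contact support to verify your identity first."},
    ]
}

# Each message has exactly the keys "role" and "content", and roles alternate.
assert all(set(m) == {"role", "content"} for m in example["messages"])
```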
2. Load data from a file (CSV / JSON)
If your data is in a CSV or JSON file (e.g. uploaded to Colab or from Google Drive), load it with pandas or the standard library and convert to a list of dicts.
Example: CSV with columns question, answer
```python
import pandas as pd

# Upload your file in Colab: Files → Upload, or mount Drive
df = pd.read_csv("your_data.csv")  # or pd.read_json("your_data.json")

# Assume columns: question, answer
dataset = []
for _, row in df.iterrows():
    dataset.append({
        "instruction": row["question"],
        "input": "",  # optional
        "output": row["answer"],
    })

# Or as messages (for the chat template later)
dataset = []
for _, row in df.iterrows():
    dataset.append({
        "messages": [
            {"role": "user", "content": row["question"]},
            {"role": "assistant", "content": row["answer"]},
        ]
    })
```
Example: JSON with a list of Q&A objects
```python
import json

with open("your_data.json") as f:
    raw = json.load(f)

dataset = []
for item in raw:
    dataset.append({
        "instruction": item.get("question", item.get("instruction", "")),
        "input": item.get("context", ""),
        "output": item.get("answer", item.get("output", "")),
    })
```
3. Load from Hugging Face Datasets
If your data is already on the Hugging Face Hub:
```python
from datasets import load_dataset

ds = load_dataset("username/dataset_name", split="train")

# Convert to a list of dicts if needed
dataset = [
    {"instruction": x["instruction"], "input": x.get("input", ""), "output": x["output"]}
    for x in ds
]
```
You can also use a public dataset (e.g. Alpaca-style) and then mix or replace with your own rows.
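Mixing can be as simple as concatenating the public rows with your own and shuffling. A minimal sketch, with the public rows stubbed out (in practice they would come from a Hub dataset such as the Alpaca-style tatsu-lab/alpaca, converted to the same instruction/input/output shape):

```python
import random

# Stub standing in for rows loaded from a public dataset, e.g.
# load_dataset("tatsu-lab/alpaca", split="train") converted to dicts.
public_rows = [{"instruction": f"public {i}", "input": "", "output": "..."} for i in range(3)]
my_rows = [{"instruction": "my question", "input": "", "output": "my answer"}]

# Concatenate and shuffle so your rows are spread through training.
mixed = public_rows + my_rows
random.Random(0).shuffle(mixed)
print(len(mixed))  # → 4
```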
4. Minimum size and quality
- Minimum: Aim for at least 100 high-quality examples. Better results usually come from 500–2000+.
- Quality: Prefer clear, consistent instructions and correct, relevant answers. Remove duplicates and obviously bad rows.
- Balance: If you have multiple classes or topics, avoid one class dominating; balance or sample so the model sees variety.
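A minimal exact-match deduplication pass over the instruction/input/output format could look like this (near-duplicates need fuzzier matching, e.g. normalizing whitespace or case first):

```python
def deduplicate(dataset):
    """Keep the first occurrence of each (instruction, input, output) triple."""
    seen = set()
    unique = []
    for example in dataset:
        key = (example["instruction"], example.get("input", ""), example["output"])
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique

rows = [
    {"instruction": "Q1", "input": "", "output": "A1"},
    {"instruction": "Q1", "input": "", "output": "A1"},  # exact duplicate
    {"instruction": "Q2", "input": "", "output": "A2"},
]
print(len(deduplicate(rows)))  # → 2
```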
Quick checks:
```python
print("Number of examples:", len(dataset))
print("Sample:", dataset[0])

# Optional: character-length stats for outputs
lengths = [len(d["output"]) for d in dataset]
print("Avg output length:", sum(lengths) / len(lengths))
```
5. Convert to Hugging Face Dataset
The Hugging Face Trainer expects a Dataset object for training. Wrap your list:
```python
from datasets import Dataset

train_dataset = Dataset.from_list(dataset)
```
If you have both train and validation splits:
```python
from datasets import Dataset, DatasetDict

train_dataset = Dataset.from_list(dataset_train)
eval_dataset = Dataset.from_list(dataset_eval)
dataset_dict = DatasetDict({"train": train_dataset, "eval": eval_dataset})
```
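If you only have a single list, you can carve out the validation split yourself before wrapping. A sketch with an arbitrary 10% eval fraction and a fixed seed for reproducibility:

```python
import random

def split_dataset(dataset, eval_fraction=0.1, seed=42):
    """Shuffle a list of examples and split off a small eval set."""
    examples = list(dataset)
    random.Random(seed).shuffle(examples)
    n_eval = max(1, int(len(examples) * eval_fraction))
    return examples[n_eval:], examples[:n_eval]

train_rows, eval_rows = split_dataset([{"id": i} for i in range(20)])
print(len(train_rows), len(eval_rows))  # → 18 2
```

After wrapping, `Dataset.train_test_split(test_size=0.1)` from the `datasets` library does the same job and returns a `DatasetDict` with "train" and "test" splits.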
Summary
- You defined the purpose of your dataset and chose single-turn (instruction/input/output) or multi-turn (messages).
- You loaded data from CSV, JSON, or Hugging Face and built a list of dicts.
- You ensured size (100+ examples) and quality, then wrapped it in a Hugging Face Dataset.
In the next step you will apply the correct chat template to this dataset and tokenize it so the model can be trained on it.