In this step you will apply the model's chat template to your custom dataset so each example becomes a single tokenized text (prompt + response). Ultimately the model should learn to predict only the response part, with the prompt masked out of the loss; that masking is configured at the training stage, while this step prepares the correctly formatted text. Unsloth provides fixed, tested chat templates per model family, so you don't rely on potentially broken templates from the original uploaders.
1. Why chat templates?
Instruct models expect a specific text format (e.g. <|user|>, <|assistant|>, special tokens). That format is encoded in a chat template. Training data must use the same format so that at inference time the model sees the same structure. Unsloth's get_chat_template and apply_chat_template ensure your data matches the model you loaded.
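To make the format concrete, here is a simplified sketch of what a rendered Llama 3.1 conversation looks like. This is an illustration only: the real template (applied for you by `tokenizer.apply_chat_template`) also handles system messages, dates, and other details.

```python
# Simplified sketch of the Llama 3.1 chat format (illustration only;
# the real template handles system prompts and more).
def render_llama31(messages):
    text = "<|begin_of_text|>"
    for m in messages:
        text += (
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
            f"{m['content']}<|eot_id|>"
        )
    return text

demo = render_llama31([
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
])
print(demo)
```

Training data rendered with a different template than the one the model was instruction-tuned on will silently degrade quality, which is why the template must match the model.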
2. Get the right template and apply it to the tokenizer
First, attach the correct template to your tokenizer. Template names are model-family specific (e.g. llama-3.1, llama-31, chatml, mistral). Unsloth documents which name to use for each model; for Llama 3.1 Instruct you typically use "llama-3.1" or "llama-31":
```python
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",  # use the name that matches your model; see CHAT_TEMPLATES.keys()
)
```
To see all available template names:
```python
from unsloth.chat_templates import CHAT_TEMPLATES
print(list(CHAT_TEMPLATES.keys()))
```
3. Define a formatting function
Your dataset has one of these shapes:
- Instruction/input/output (Alpaca-style): columns `instruction`, `input`, `output`.
- Messages: a `messages` column containing a list of `{"role": "user" | "assistant", "content": "..."}` dicts.
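For concreteness, one row in each shape might look like this (the field values below are invented placeholders):

```python
# One Alpaca-style row (hypothetical values):
alpaca_row = {
    "instruction": "Summarize the text.",
    "input": "The quick brown fox jumps over the lazy dog.",
    "output": "A fox jumps over a dog.",
}

# The same data expressed in messages form:
messages_row = {
    "messages": [
        {"role": "user",
         "content": alpaca_row["instruction"] + "\n\n" + alpaca_row["input"]},
        {"role": "assistant", "content": alpaca_row["output"]},
    ]
}
```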
You need a function that turns each example into a conversation that the tokenizer's apply_chat_template can convert to a single string.
If your data has instruction / input / output:
```python
def format_instruction_input_output(examples):
    texts = []
    inputs = examples.get("input", [""] * len(examples["instruction"]))
    for inst, inp, out in zip(examples["instruction"], inputs, examples["output"]):
        messages = [
            {"role": "user", "content": f"{inst}\n\n{inp}".strip() if inp else inst},
            {"role": "assistant", "content": out},
        ]
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False,
        )
        texts.append(text)
    return {"text": texts}
```
If your data already has messages:
```python
def format_messages(examples):
    texts = [
        tokenizer.apply_chat_template(
            msgs,
            tokenize=False,
            add_generation_prompt=False,
        )
        for msgs in examples["messages"]
    ]
    return {"text": texts}
```
Use add_generation_prompt=False for training so the string includes the full assistant reply. Always use the same tokenizer that has the chat template applied.
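To see why the flag matters, here is a toy contrast using a simplified Llama-3.1-style renderer as a stand-in for the real template (illustration only): with `add_generation_prompt=True` the string ends in an open assistant header for the model to complete, which is what you want at inference but not in training data.

```python
# Toy stand-in for tokenizer.apply_chat_template (simplified format,
# not the exact Llama 3.1 template).
def toy_apply_chat_template(messages, add_generation_prompt=False):
    text = "<|begin_of_text|>"
    for m in messages:
        text += (
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
            f"{m['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Inference: leave an open assistant turn for the model to fill in.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

msgs = [{"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello!"}]

# Training: the full assistant reply is part of the string.
train_text = toy_apply_chat_template(msgs, add_generation_prompt=False)
# Inference: only the user turn plus an empty assistant slot.
infer_prompt = toy_apply_chat_template(msgs[:1], add_generation_prompt=True)
```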
4. Map the dataset and tokenize
Apply the formatting function with .map() then tokenize the "text" column:
```python
# Choose the formatting function that matches your dataset's shape
dataset = train_dataset.map(
    format_instruction_input_output,  # or format_messages
    batched=True,
    remove_columns=train_dataset.column_names,
)

def tokenize(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=max_seq_length,
        padding=False,      # the collator pads per batch instead
        return_tensors=None,
    )

dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])
```
If you hit memory issues, reduce max_length or pass a smaller batch_size to .map() (the default is 1000 examples per batch).
5. Data collator
Training uses a data collator that pads sequences to the same length in each batch. Use the standard DataCollatorForSeq2Seq (or equivalent) with the tokenizer's pad token:
```python
from transformers import DataCollatorForSeq2Seq

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    padding=True,
    return_tensors="pt",
)
```
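Conceptually, the collator does something like the following for each batch (a simplified pure-Python sketch; the real DataCollatorForSeq2Seq also pads labels and returns PyTorch tensors):

```python
# Simplified sketch of per-batch padding (illustration only).
def pad_batch(features, pad_token_id):
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        ids = f["input_ids"]
        pad = max_len - len(ids)
        # Pad to the longest sequence in this batch; mask out the padding.
        batch["input_ids"].append(ids + [pad_token_id] * pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * pad)
    return batch

batch = pad_batch([{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5]}],
                  pad_token_id=0)
```

Padding per batch rather than to a global maximum is what makes `padding=False` at tokenization time the right choice: it avoids wasting compute on padding tokens in short batches.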
6. Optional: train/eval split
If you didn't split earlier, you can do it now:
```python
dataset = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = dataset["train"]
eval_dataset = dataset["test"]
```
Use eval_dataset in the Trainer's eval_dataset argument for validation during training.
Summary
- You applied the correct chat template to the tokenizer with get_chat_template.
- You wrote a formatting function that turns each example into messages and then into a single text via apply_chat_template.
- You mapped the dataset and tokenized it, then set up a data collator for padded batching.
In the next step you will train the model with the Hugging Face Trainer (or Unsloth's recommended training API) and choose learning rate, batch size, and steps.