Step 4 of 6

Apply chat template and format data for training

Use Unsloth's chat template for your model, format each example into a single text string, tokenize, and create the training dataset.

In this step

  • Select the right chat template (e.g. llama-3.1) for your model
  • Write a formatting function that builds prompt + response
  • Map the dataset and tokenize with the tokenizer
  • Set up data collator and training-ready Dataset

In this step you will apply the model's chat template to your custom dataset so each example becomes a single tokenized text (prompt + response). The model is trained to predict only the response part; the prompt is masked out in the loss. Unsloth provides fixed, tested chat templates per model family so you don't rely on potentially broken templates from the original uploaders.

1. Why chat templates?

Instruct models expect a specific text format (e.g. <|user|>, <|assistant|>, special tokens). That format is encoded in a chat template. Training data must use the same format so that at inference time the model sees the same structure. Unsloth's get_chat_template and apply_chat_template ensure your data matches the model you loaded.
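To build intuition for what a chat template produces, here is a simplified, illustrative rendering in the Llama 3.1 style. This is a sketch for intuition only; the exact special tokens and spacing are assumptions, so always render real training text with tokenizer.apply_chat_template rather than hand-building strings:

```python
# Simplified sketch of a Llama 3.1-style chat format (assumed token names;
# the real template comes from tokenizer.apply_chat_template).
def render_llama31(messages):
    parts = ["<|begin_of_text|>"]
    for m in messages:
        parts.append(
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
        )
    return "".join(parts)

print(render_llama31([
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"},
]))
```

The key point is that every turn is wrapped in role markers and terminators; training data and inference prompts must use the identical wrapping.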

2. Get the right template and apply it to the tokenizer

First, attach the correct template to your tokenizer. Template names are model-family specific (e.g. llama-3.1, llama-31, chatml, mistral). Unsloth documents which name to use for each model; for Llama 3.1 Instruct you typically use "llama-3.1" or "llama-31":

from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",  # Use the one that matches your model; see CHAT_TEMPLATES.keys()
)

To see all available template names:

from unsloth.chat_templates import CHAT_TEMPLATES
print(list(CHAT_TEMPLATES.keys()))

3. Define a formatting function

Your dataset has one of these shapes:

  • Instruction/input/output (Alpaca-style): instruction, input, output.
  • Messages: messages = list of {"role": "user"|"assistant", "content": "..."}.

You need a function that turns each example into a conversation that the tokenizer's apply_chat_template can convert to a single string.

If your data has instruction / input / output:

def format_instruction_input_output(examples):
    texts = []
    # Fall back to empty inputs if the column is missing; the `or` also
    # guards against a None column value.
    inputs = examples.get("input") or [""] * len(examples["instruction"])
    for inst, inp, out in zip(examples["instruction"], inputs, examples["output"]):
        # Fold the optional input beneath the instruction in a single user turn.
        user_content = inst if not inp else f"{inst}\n\n{inp}".strip()
        messages = [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": out},
        ]
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False,
        )
        texts.append(text)
    return {"text": texts}
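The instruction/input merging rule can be sanity-checked in isolation. This standalone helper (a hypothetical name, not part of Unsloth) mirrors the logic used above:

```python
def build_user_content(instruction, extra_input):
    # Mirrors format_instruction_input_output: append the optional input
    # beneath the instruction, separated by a blank line.
    if not extra_input:
        return instruction
    return f"{instruction}\n\n{extra_input}".strip()

print(build_user_content("Summarize the text.", ""))
# -> Summarize the text.
print(build_user_content("Summarize the text.", "The cat sat."))
# -> Summarize the text.
#
#    The cat sat.
```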

If your data already has messages:

def format_messages(examples):
    texts = [
        tokenizer.apply_chat_template(
            msgs,
            tokenize=False,
            add_generation_prompt=False,
        )
        for msgs in examples["messages"]
    ]
    return {"text": texts}

Use add_generation_prompt=False for training so the string includes the full assistant reply. Always use the same tokenizer that has the chat template applied.
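Recall from the introduction that the prompt is masked out of the loss. Concretely, the label for a prompt token is set to -100 (the index cross-entropy ignores) while response tokens keep their real ids; Unsloth's train_on_responses_only helper applies this for you at training time. The toy sketch below only illustrates the effect:

```python
# Sketch of prompt masking: -100 labels over the prompt, real ids over the
# response. Toy ids only; real pipelines use train_on_responses_only.
def mask_prompt(input_ids, response_start):
    labels = list(input_ids)
    labels[:response_start] = [-100] * response_start
    return labels

ids = [101, 7, 8, 9, 42, 43, 102]        # toy token ids
labels = mask_prompt(ids, response_start=4)  # tokens 0-3 form the prompt
print(labels)  # -> [-100, -100, -100, -100, 42, 43, 102]
```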

4. Map the dataset and tokenize

Apply the formatting function with .map() then tokenize the "text" column:

# Pick the formatting function that matches your dataset's shape
dataset = train_dataset.map(
    format_instruction_input_output,  # or format_messages
    batched=True,
    remove_columns=train_dataset.column_names,
)

def tokenize(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=max_seq_length,
        padding=False,  # we'll use a collator that pads per batch
        return_tensors=None,
    )

dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])

If you hit memory issues, reduce max_length or pass a smaller batch_size to .map().
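Truncation silently cuts off the end of over-long examples, which for chat data means losing part of the assistant reply. It is worth checking how many tokenized examples exceed max_seq_length; this pure-Python sketch operates on a list of token-id lists, which is also what you get from reading dataset["input_ids"]:

```python
# Quick truncation-risk check over a column of token-id lists.
def truncation_report(input_ids_column, max_seq_length):
    lengths = [len(ids) for ids in input_ids_column]
    too_long = sum(1 for n in lengths if n > max_seq_length)
    return {"max": max(lengths), "over_limit": too_long, "total": len(lengths)}

report = truncation_report([[1] * 10, [1] * 600, [1] * 2048], max_seq_length=512)
print(report)  # -> {'max': 2048, 'over_limit': 2, 'total': 3}
```

If many examples are over the limit, raise max_seq_length (memory permitting) or filter the longest examples out rather than truncating them.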

5. Data collator

Training uses a data collator that pads the sequences in each batch to a common length. DataCollatorForSeq2Seq pads input_ids, attention_mask, and (if present) labels, using -100 for label padding so those positions are ignored by the loss. Note that it does not create labels: if you train with the plain Trainer on the dataset tokenized above (which has no labels column), use DataCollatorForLanguageModeling(mlm=False) instead, which copies input_ids into labels; trainers such as TRL's SFTTrainer can also build labels for you. Either way, make sure the tokenizer has a pad token:

from transformers import DataCollatorForSeq2Seq

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    padding=True,
    return_tensors="pt",
)
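What the collator does can be shown in miniature: each batch is padded to its own longest sequence (dynamic padding), with an attention mask built alongside. This is a pure-Python sketch of the behavior, not the transformers implementation:

```python
# Miniature dynamic padding: pad each batch to its own max length and
# build the matching attention mask (1 = real token, 0 = padding).
def pad_batch(batch, pad_token_id):
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        pad = max_len - len(seq)
        input_ids.append(seq + [pad_token_id] * pad)
        attention_mask.append([1] * len(seq) + [0] * pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

out = pad_batch([[5, 6, 7], [8]], pad_token_id=0)
print(out["input_ids"])       # -> [[5, 6, 7], [8, 0, 0]]
print(out["attention_mask"])  # -> [[1, 1, 1], [1, 0, 0]]
```

Because padding is per batch rather than to a global maximum, short batches waste far fewer tokens, which is why padding=False at tokenization time plus a padding collator is the usual setup.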

6. Optional: train/eval split

If you didn't split earlier, you can do it now:

dataset = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = dataset["train"]
eval_dataset = dataset["test"]

Use eval_dataset in the Trainer's eval_dataset argument for validation during training.

Summary

  • You applied the correct chat template to the tokenizer with get_chat_template.
  • You wrote a formatting function that turns each example into messages and then a single text via apply_chat_template.
  • You mapped the dataset and tokenized it, then set up a data collator for padded batching.

In the next step you will train the model with the Hugging Face Trainer (or Unsloth's recommended training API) and choose learning rate, batch size, and steps.