In this step you will load the fine-tuned adapter, switch the model to inference mode with Unsloth's optimized path, generate responses for a few test prompts, and optionally save to the Hugging Face Hub or export to GGUF for use in Ollama or llama.cpp.
1. Load the saved model and enable inference
After training you saved the adapter and tokenizer under ./outputs/final. Load them and then call FastLanguageModel.for_inference so Unsloth uses its faster inference kernels:
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./outputs/final",
    max_seq_length=max_seq_length,
    dtype=None,
    load_in_4bit=True,  # Match how you trained (QLoRA)
)
model = FastLanguageModel.for_inference(model)
If you trained a 16-bit LoRA (not QLoRA), use load_in_4bit=False. Always call for_inference before generating; it enables Unsloth's roughly 2× faster inference path.
2. Build the prompt and generate
Use the same chat template as in training. Build a list of messages (e.g. one user message), then apply the template and generate:
messages = [
    {"role": "user", "content": "Your test question or instruction here."}
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # Leave room for the assistant reply
)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
)
Decode only the new tokens (skip the prompt):
generated = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(generated)
Adjust max_new_tokens, temperature, and top_p to control length and randomness. For deterministic answers, use do_sample=False.
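To see what the model actually receives, it can help to print the rendered prompt. As an illustration only, here is a simplified sketch of what a Llama-3-style chat template produces; the real template is stored in the tokenizer and varies by model, so treat the special tokens below as an example, not the exact format of your model:

```python
# Simplified sketch of a Llama-3-style chat template (illustration only;
# the real template comes from tokenizer.apply_chat_template).
def render_llama3_style(messages, add_generation_prompt=True):
    parts = ["<|begin_of_text|>"]
    for m in messages:
        parts.append(
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant header so generation continues as the assistant
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = render_llama3_style([{"role": "user", "content": "Hi!"}])
print(prompt)
```

Note how add_generation_prompt=True ends the string with an open assistant turn: that is what tells the model to reply rather than to continue the user's message.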
3. Multi-turn chat (optional)
You can simulate a short conversation by passing multiple messages:
messages = [
{"role": "user", "content": "What is the capital of France?"},
{"role": "assistant", "content": "The capital of France is Paris."},
{"role": "user", "content": "What is a good dish to try there?"}
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# ... same tokenize + generate + decode as above
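For longer conversations, a small helper keeps the messages list in sync. This is a hypothetical sketch: generate_fn stands in for the tokenize + generate + decode steps shown earlier, so you can wire in the real model later.

```python
# Hypothetical helper for a multi-turn chat loop; generate_fn is a placeholder
# for the tokenize + generate + decode steps shown earlier.
def chat_turn(messages, user_text, generate_fn):
    """Append a user turn, generate a reply, and record it as an assistant turn."""
    messages.append({"role": "user", "content": user_text})
    reply = generate_fn(messages)
    messages.append({"role": "assistant", "content": reply})
    return reply

# Usage with a stub in place of the real model:
history = []
chat_turn(history, "What is the capital of France?", lambda msgs: "Paris.")
```

Each call appends both the user turn and the model's reply, so the full history is passed to the template on the next turn.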
4. Push to Hugging Face Hub (optional)
If you're logged in (huggingface_hub.login()), you can push the adapter and tokenizer to your account:
# save_method="lora" pushes only the adapter; use "merged_16bit" to push the full merged model
model.push_to_hub_merged("your-username/your-model-name", tokenizer, save_method="lora")
# Equivalently, push just the LoRA weights with the standard method:
# model.push_to_hub("your-username/your-model-name-lora", tokenizer)
Then load from the Hub elsewhere:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="your-username/your-model-name",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
)
model = FastLanguageModel.for_inference(model)
5. Export to GGUF for Ollama / llama.cpp (optional)
For local use in Ollama, llama.cpp, or LM Studio, you can export the merged model (base + LoRA) to GGUF. Unsloth and community notebooks provide scripts to merge and quantize; the exact steps depend on the Unsloth version. In general:
- Merge the LoRA weights into the base model (in memory or on disk).
- Convert the merged model to GGUF (e.g. with llama.cpp's conversion scripts or Unsloth's export utilities).
- Quantize (e.g. Q4_K_M) to reduce file size.
Check the Unsloth inference and deployment docs and their Colab notebooks for the current GGUF export flow.
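Once you have a quantized GGUF file, registering it with Ollama takes a small Modelfile. The filename and system prompt below are placeholders; adjust them to your export and your model's chat format:

```
# Minimal Ollama Modelfile sketch; the GGUF filename is a placeholder.
FROM ./model-q4_k_m.gguf
PARAMETER temperature 0.7
SYSTEM "You are a helpful assistant."
```

Then run `ollama create my-model -f Modelfile` followed by `ollama run my-model` to chat with it locally.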
6. Quick evaluation checklist
- Run 3–5 diverse prompts (short, long, edge cases) and check that answers are on-topic and in the right style.
- Compare before vs after fine-tuning on the same prompts if you still have the base model loaded.
- If answers are off-topic or generic, consider more data, 1–2 extra epochs, or a slightly higher learning rate next time.
- If the model is overfitting (e.g. reproducing training examples verbatim), reduce epochs or add more diverse data.
Summary
- You loaded the saved adapter and tokenizer and called FastLanguageModel.for_inference(model).
- You built prompts with the same chat template, generated with model.generate(), and decoded only the new tokens.
- You optionally pushed the model to the Hugging Face Hub and noted how to export to GGUF for local use.
You've completed the pipeline: Colab + Unsloth + custom data → fine-tuned LLM. You can reuse this workflow with different base models, datasets, and hyperparameters for your own tasks.