Train LLMs on Hugging Face Jobs from Antigravity
Overview
This skill teaches the agent how to fine-tune language models on Hugging Face Jobs (managed cloud GPUs/CPUs) using:
- •TRL for training (start with SFT / supervised fine-tuning)
- •The Hugging Face CLI (hf jobs ...) to submit and manage jobs
It is designed for fast iteration on small models first (sub‑1B) before scaling up.
When to use this skill
Use when the user mentions any of:
- •“fine-tune”, “train”, “SFT/DPO/GRPO”, “TRL”
- •“Hugging Face Jobs”, “hf jobs”, “cloud GPU”, “run training in the cloud”
- •“push trained model to the Hub”, “publish checkpoints”
- •“pick a GPU flavor / estimate time & cost”
Key directives (do these every time)
- •Default to a small-model smoke test unless the user explicitly asks for a big run.
- •Always set --timeout (the Jobs default timeout is often too short for training).
- •Always pass HF_TOKEN as a secret when pushing to the Hub:
  - •CLI: --secrets HF_TOKEN
  - •Training code: trainer.push_to_hub(...)
- •Prefer t4-small for the very first test (cheap + good enough for <1B).
- •Validate datasets on CPU first if the dataset is unfamiliar (saves GPU money).
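The directives above can be enforced mechanically when the agent assembles a submission. The following is a minimal sketch, not part of the skill's scripts: `build_job_command` is a hypothetical helper, and it only uses the CLI flags that already appear in this document (`--flavor`, `--timeout`, `--secrets`, `--with`, `-e`).

```python
import shlex


def build_job_command(script, flavor="t4-small", timeout="45m",
                      push_to_hub=True, with_packages=("trl",), env=None):
    """Return the argv list for an `hf jobs uv run` submission.

    Hypothetical helper: it refuses to build a command without an explicit
    timeout and adds the HF_TOKEN secret whenever the run pushes to the Hub.
    """
    if not timeout:
        raise ValueError("always set an explicit --timeout")
    cmd = ["hf", "jobs", "uv", "run", "--flavor", flavor, "--timeout", timeout]
    if push_to_hub:
        cmd += ["--secrets", "HF_TOKEN"]
    for package in with_packages:
        cmd += ["--with", package]
    for key, value in (env or {}).items():
        cmd += ["-e", f"{key}={value}"]
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_job_command("scripts/train_sft_smoke.py",
                                       env={"MAX_STEPS": 100})))
```

Returning a list rather than a string keeps the command safe to hand to `subprocess.run` without shell quoting issues.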
Prerequisites checklist
- •Hugging Face CLI installed (hf)
- •Logged in: hf auth login
- •Confirm identity: hf auth whoami
- •User has a token with write permission if pushing a model
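The first three checklist items could be verified from Python before any job is submitted. This is a rough sketch under one stated assumption: that `hf auth whoami` exits with a non-zero status when no user is logged in. `check_prerequisites` is a hypothetical name, and the sketch does not verify that the token has write permission.

```python
import shutil
import subprocess


def check_prerequisites(which=shutil.which, run=subprocess.run):
    """Return a list of problems; an empty list means the basics are in place.

    `which` and `run` are injectable so the check can be exercised without
    the real CLI installed.
    """
    if which("hf") is None:
        return ["hf CLI not found on PATH"]
    # Assumption: a non-zero exit code means nobody is logged in.
    result = run(["hf", "auth", "whoami"], capture_output=True, text=True)
    if result.returncode != 0:
        return ["not logged in: run `hf auth login`"]
    return []
```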
Recommended small-model-first workflow
Step 0 — Pick a base model + dataset (defaults)
For a first run, default to:
- •Base model: Qwen/Qwen2.5-0.5B
- •Dataset: trl-lib/Capybara
These are used in Hugging Face’s Jobs quickstart and are known-good for SFT.
Step 1 — (Optional) Inspect dataset on CPU
Run a quick dataset inspection job (cheap) before spending GPU time:
bash
hf jobs uv run --timeout 10m --with datasets scripts/inspect_dataset.py --dataset trl-lib/Capybara --split train
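For orientation, an inspection script of this kind might look roughly like the sketch below. It is not the contents of the bundled `scripts/inspect_dataset.py`. The `--samples` flag and the `summarize` helper are hypothetical, while `--dataset` and `--split` mirror the command above.

```python
import argparse
from collections import Counter


def summarize(rows):
    """Count how many of the sampled rows contain each column name."""
    columns = Counter()
    for row in rows:
        columns.update(row.keys())
    return dict(columns)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", required=True)
    parser.add_argument("--split", default="train")
    parser.add_argument("--samples", type=int, default=100)  # hypothetical flag
    args = parser.parse_args()

    # Imported lazily; the job supplies the package via `--with datasets`.
    from datasets import load_dataset

    dataset = load_dataset(args.dataset, split=args.split)
    sample = dataset.select(range(min(args.samples, len(dataset))))
    print(f"{len(dataset)} rows; columns seen in sample: {summarize(sample)}")
    print("first row:", sample[0])
```

To use it as a script, call `main()` under the usual `if __name__ == "__main__":` guard.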
Step 2 — Run a smoke-test SFT job on t4-small
Use the provided smoke-test training script:
bash
hf jobs uv run --flavor t4-small --timeout 45m --secrets HF_TOKEN --with trl scripts/train_sft_smoke.py
Notes:
- •This script trains for a small number of steps on a small subset of the dataset.
- •It pushes the resulting model to the Hub (configurable via env vars).
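The two notes above can be pictured with a short TRL sketch. This is an illustration under stated assumptions, not the bundled `scripts/train_sft_smoke.py`: the function names, the local output directory, and the batch settings are hypothetical, and precision and memory settings are left at library defaults, which may need adjusting for the chosen GPU.

```python
def smoke_config_kwargs(output_repo, max_steps=50, push_to_hub=True):
    """Keyword arguments for trl.SFTConfig in a short smoke run."""
    return {
        "output_dir": "smoke-sft-output",   # hypothetical local directory
        "max_steps": max_steps,             # stop after a handful of steps
        "per_device_train_batch_size": 1,   # illustrative, conservative values
        "gradient_accumulation_steps": 8,
        "logging_steps": 5,
        "push_to_hub": push_to_hub,
        "hub_model_id": output_repo,        # None falls back to the output_dir name
    }


def run_smoke_test(base_model="Qwen/Qwen2.5-0.5B", dataset_id="trl-lib/Capybara",
                   samples=200, output_repo=None, max_steps=50, push_to_hub=True):
    # Imported lazily so the helper above can be checked without a GPU stack.
    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    dataset = load_dataset(dataset_id, split="train")
    # Train on a small subset only; this is a smoke test, not a real run.
    dataset = dataset.select(range(min(samples, len(dataset))))
    trainer = SFTTrainer(
        model=base_model,
        train_dataset=dataset,
        args=SFTConfig(**smoke_config_kwargs(output_repo, max_steps, push_to_hub)),
    )
    trainer.train()
    if push_to_hub:
        trainer.push_to_hub()  # relies on the HF_TOKEN secret inside the job
```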
Step 3 — Monitor / manage jobs
Useful commands:
bash
hf jobs ps
hf jobs inspect <job-id>
hf jobs logs <job-id>
hf jobs cancel <job-id>
hf jobs hardware
hf jobs stats
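When an agent needs to wait for a job instead of checking by hand, a small polling loop is one option. The sketch below is deliberately independent of any particular API: `get_stage` is a caller-supplied function (for example, one that wraps `hf jobs inspect <job-id>`), and the terminal stage names are assumptions to confirm against real output.

```python
import time

# Assumed names for stages in which a job will make no further progress.
TERMINAL_STAGES = {"COMPLETED", "ERROR", "CANCELED"}


def wait_for_job(job_id, get_stage, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll get_stage(job_id) until a terminal stage or the poll budget runs out.

    Returns the last stage seen, so the caller can tell a finished job from
    one that was still running when polling stopped.
    """
    stage = None
    for _ in range(max_polls):
        stage = get_stage(job_id)
        if stage in TERMINAL_STAGES:
            return stage
        sleep(poll_seconds)
    return stage
```

Bounding the number of polls keeps the agent from waiting forever on a stuck job.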
Scaling up after the smoke test
- •Increase dataset size and training steps/epochs.
- •Upgrade GPU flavor (example path): t4-small → t4-medium / l4x1 → a10g-small → a10g-large → a100-large
- •Add memory savers if you hit OOM:
  - •Reduce batch size, increase grad accumulation
  - •Enable gradient checkpointing
  - •Switch to LoRA/PEFT (optional)
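The example upgrade path can be encoded as a simple ladder so that "step up one flavor" is a lookup rather than a guess. This is a sketch: `FLAVOR_LADDER` and `next_flavors` are hypothetical names, and the ladder only mirrors the path listed above, treating t4-medium and l4x1 as alternatives on the same rung.

```python
# Each rung lists the interchangeable flavors at that step of the path.
FLAVOR_LADDER = [
    ("t4-small",),
    ("t4-medium", "l4x1"),
    ("a10g-small",),
    ("a10g-large",),
    ("a100-large",),
]


def next_flavors(current):
    """Return the flavors on the next rung, or an empty tuple at the top."""
    for index, rung in enumerate(FLAVOR_LADDER):
        if current in rung:
            if index + 1 < len(FLAVOR_LADDER):
                return FLAVOR_LADDER[index + 1]
            return ()
    raise ValueError(f"unknown flavor: {current}")
```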
Environment variables supported by the scripts
The training script reads:
- •BASE_MODEL (default: Qwen/Qwen2.5-0.5B)
- •DATASET_ID (default: trl-lib/Capybara)
- •DATASET_SPLIT (default: train)
- •SMOKE_SAMPLES (default: 200)
- •MAX_STEPS (default: 50)
- •OUTPUT_REPO (default: <username>/smoke-sft-<base-model>-<timestamp>)
- •PUSH_TO_HUB (default: 1)
Example:
bash
hf jobs uv run --flavor t4-small --timeout 45m --secrets HF_TOKEN --with trl -e MAX_STEPS=100 -e SMOKE_SAMPLES=500 scripts/train_sft_smoke.py
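On the script side, the environment variables listed in this section could be read into a typed settings object along the following lines. This is a sketch, not the actual script: `SmokeSettings` and `load_settings` are hypothetical names, and `OUTPUT_REPO` is left as `None` when unset, on the assumption that the script derives the documented default name later.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class SmokeSettings:
    base_model: str
    dataset_id: str
    dataset_split: str
    smoke_samples: int
    max_steps: int
    output_repo: Optional[str]
    push_to_hub: bool


def load_settings(environ=os.environ):
    """Read the documented variables, falling back to the documented defaults."""
    return SmokeSettings(
        base_model=environ.get("BASE_MODEL", "Qwen/Qwen2.5-0.5B"),
        dataset_id=environ.get("DATASET_ID", "trl-lib/Capybara"),
        dataset_split=environ.get("DATASET_SPLIT", "train"),
        smoke_samples=int(environ.get("SMOKE_SAMPLES", "200")),
        max_steps=int(environ.get("MAX_STEPS", "50")),
        # None here means "derive the default repo name at push time".
        output_repo=environ.get("OUTPUT_REPO"),
        push_to_hub=environ.get("PUSH_TO_HUB", "1") == "1",
    )
```

Passing a plain dict as `environ` makes the parsing easy to check locally before spending any job time.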
Common failure modes & fixes
Out of memory (OOM)
- •Lower per_device_train_batch_size
- •Increase gradient_accumulation_steps
- •Reduce max_length / context
- •Upgrade GPU flavor
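The first two fixes are usually applied together: the effective batch size per device is `per_device_train_batch_size * gradient_accumulation_steps`, so halving one and doubling the other lowers peak memory while keeping the optimizer's view of the batch roughly unchanged. A small worked sketch (the helper name is hypothetical):

```python
import math


def halve_batch_keep_effective(per_device_batch, grad_accum):
    """Halve the per-device batch and raise accumulation to compensate.

    The returned pair has an effective batch size (product of the two) that
    is at least the original one; for odd batch sizes it can be slightly larger.
    """
    if per_device_batch <= 1:
        # Nothing left to halve; other fixes (shorter context, bigger GPU) apply.
        return per_device_batch, grad_accum
    effective = per_device_batch * grad_accum
    new_batch = per_device_batch // 2
    new_accum = math.ceil(effective / new_batch)
    return new_batch, new_accum
```

For example, a batch size of 8 with 2 accumulation steps becomes 4 with 4 accumulation steps, still 16 samples per optimizer step.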
Job timeout
- •Increase --timeout and re-run
- •Reduce steps/epochs for a test
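A timeout can be chosen from a measured step time rather than by guessing. The sketch below is a rough estimate: the fixed overhead (dependency install, model download, final push) and the safety factor are assumed values to tune, and `suggest_timeout` is a hypothetical helper that formats its result in the minute notation used by the commands in this document.

```python
import math


def suggest_timeout(seconds_per_step, max_steps, overhead_seconds=600, safety_factor=1.5):
    """Return a --timeout value such as "45m" for a run of max_steps steps.

    overhead_seconds and safety_factor are assumptions, not measured values.
    """
    total_seconds = (seconds_per_step * max_steps + overhead_seconds) * safety_factor
    minutes = math.ceil(total_seconds / 60)
    return f"{minutes}m"
```

With 12 seconds per step and 100 steps, this gives (1200 + 600) * 1.5 = 2700 seconds, so the suggestion is `45m`.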
Model not saved
- •Ensure you passed --secrets HF_TOKEN
- •Ensure PUSH_TO_HUB=1
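Both checks can run at the very start of a training script, so a misconfigured job fails within seconds instead of after training. A minimal sketch, assuming the secret is exposed to the job as the `HF_TOKEN` environment variable; `push_problems` is a hypothetical helper.

```python
import os


def push_problems(environ=os.environ):
    """Return the reasons a trained model would not reach the Hub."""
    problems = []
    if not environ.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set: submit the job with --secrets HF_TOKEN")
    if environ.get("PUSH_TO_HUB", "1") != "1":
        problems.append("PUSH_TO_HUB is not 1: the upload will be skipped")
    return problems
```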
Included files
- •scripts/train_sft_smoke.py: minimal SFT smoke test (small-model-first)
- •scripts/inspect_dataset.py: quick dataset inspector (CPU-friendly)
- •references/cli_cheatsheet.md: common hf jobs commands
- •templates/train_sft_full.py: starting point for a real run