Train LLMs on Hugging Face Jobs from Antigravity

Overview

This skill teaches the agent how to fine-tune language models on Hugging Face Jobs (managed cloud GPUs/CPUs) using:

•TRL for training (start with SFT / supervised fine-tuning)
•The Hugging Face CLI (hf jobs ...) to submit and manage jobs

It is designed for fast iteration on small models first (sub‑1B) before scaling up.

When to use this skill

Use when the user mentions any of:

•“fine-tune”, “train”, “SFT/DPO/GRPO”, “TRL”
•“Hugging Face Jobs”, “hf jobs”, “cloud GPU”, “run training in the cloud”
•“push trained model to the Hub”, “publish checkpoints”
•“pick a GPU flavor / estimate time & cost”

Key directives (do these every time)

•Default to a small-model smoke test unless the user explicitly asks for a big run.
•Always set --timeout explicitly (the Jobs default of 30 minutes is often too short for training).
•Always pass HF_TOKEN as a secret when pushing to the Hub:
  - CLI: --secrets HF_TOKEN
  - Training code: trainer.push_to_hub(...)
•Prefer t4-small for the very first test (cheap + good enough for <1B).
•Validate datasets on CPU first if the dataset is unfamiliar (saves GPU money).
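
The directives above can be enforced mechanically when the agent assembles a command. The following is a minimal sketch, not part of the skill's scripts: the helper name `build_uv_job_argv` and its parameters are hypothetical, and it only emits flags that already appear in this document.

```python
from typing import Optional


def build_uv_job_argv(
    script: str,
    timeout: str,
    flavor: Optional[str] = None,
    push_to_hub: bool = True,
    with_packages: tuple = (),
    env: Optional[dict] = None,
) -> list:
    """Assemble an `hf jobs uv run` command that follows the directives above."""
    # Directive: never rely on the default timeout.
    if not timeout:
        raise ValueError("always set an explicit --timeout")
    argv = ["hf", "jobs", "uv", "run"]
    if flavor:
        argv += ["--flavor", flavor]
    argv += ["--timeout", timeout]
    # Directive: pass HF_TOKEN as a secret whenever the job pushes to the Hub.
    if push_to_hub:
        argv += ["--secrets", "HF_TOKEN"]
    for package in with_packages:
        argv += ["--with", package]
    for key, value in (env or {}).items():
        argv += ["-e", f"{key}={value}"]
    argv.append(script)
    return argv
```

The resulting list can be passed to `subprocess.run` or joined into a shell command for display to the user.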

Prerequisites checklist

•Hugging Face CLI installed (hf)
•Logged in: hf auth login
•Confirm identity: hf auth whoami
•User has a token with write permission if pushing a model

Recommended small-model-first workflow

Step 0 — Pick a base model + dataset (defaults)

For a first run, default to:

•Base model: Qwen/Qwen2.5-0.5B
•Dataset: trl-lib/Capybara

These are used in Hugging Face’s Jobs quickstart and are known-good for SFT.

Step 1 — (Optional) Inspect dataset on CPU

Run a quick dataset inspection job (cheap) before spending GPU time:

bash

hf jobs uv run \
  --timeout 10m \
  --with datasets \
  scripts/inspect_dataset.py --dataset trl-lib/Capybara --split train
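
For orientation, here is a rough sketch of the kind of check such an inspector might perform. It is not the contents of scripts/inspect_dataset.py: the function names and the 100-row cap are assumptions made for illustration. It streams the split so nothing large is downloaded.

```python
from collections import Counter


def summarize_rows(rows, max_rows=100):
    """Count column names and empty values over the first `max_rows` rows."""
    columns = Counter()
    empty = Counter()
    seen = 0
    for row in rows:
        if seen >= max_rows:
            break
        seen += 1
        for key, value in row.items():
            columns[key] += 1
            if value is None or value == "" or value == []:
                empty[key] += 1
    return {"rows_checked": seen, "columns": dict(columns), "empty": dict(empty)}


def main(dataset_id="trl-lib/Capybara", split="train"):
    # Imported lazily so the summary helper works without `datasets` installed.
    from datasets import load_dataset

    # Streaming avoids downloading the full dataset just to look at it.
    stream = load_dataset(dataset_id, split=split, streaming=True)
    print(summarize_rows(stream))
```

A column that is missing from some rows, or that is frequently empty, is worth fixing before any GPU time is spent.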

Step 2 — Run a smoke-test SFT job on `t4-small`

Use the provided smoke-test training script:

bash

hf jobs uv run \
  --flavor t4-small \
  --timeout 45m \
  --secrets HF_TOKEN \
  --with trl \
  scripts/train_sft_smoke.py

Notes:

•This script trains for a small number of steps on a small subset of the dataset.
•It pushes the resulting model to the Hub (configurable via env vars).
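
To make the notes above concrete, the core of a smoke-test script could look roughly like this. It is a sketch under assumptions, not the provided scripts/train_sft_smoke.py: the batch settings, the output directory, and the placeholder repo name are illustrative, and precision options are left at the library defaults, which may need adjusting for the chosen GPU.

```python
def smoke_sft_kwargs(output_repo, max_steps=50, push_to_hub=True):
    """Keyword arguments for a short SFT run (values are illustrative)."""
    return {
        "output_dir": "smoke-sft-output",
        "max_steps": max_steps,
        "per_device_train_batch_size": 1,
        "gradient_accumulation_steps": 8,
        "logging_steps": 5,
        "push_to_hub": push_to_hub,
        "hub_model_id": output_repo,
    }


def main(base_model="Qwen/Qwen2.5-0.5B", dataset_id="trl-lib/Capybara",
         smoke_samples=200, output_repo="your-username/smoke-sft"):
    # Heavy imports stay inside main so the helper above has no dependencies.
    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    # Slice syntax loads only the first `smoke_samples` rows of the split.
    train_dataset = load_dataset(dataset_id, split=f"train[:{smoke_samples}]")
    trainer = SFTTrainer(
        model=base_model,
        args=SFTConfig(**smoke_sft_kwargs(output_repo)),
        train_dataset=train_dataset,
    )
    trainer.train()
    # Requires HF_TOKEN to be available inside the job (see --secrets above).
    trainer.push_to_hub()
```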

Step 3 — Monitor / manage jobs

Useful commands:

bash

hf jobs ps
hf jobs inspect <job-id>
hf jobs logs <job-id>
hf jobs cancel <job-id>
hf jobs hardware
hf jobs stats
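
When the agent needs to wait for a job rather than check it once, a small polling loop is enough. The sketch below is hypothetical: the stage names in `TERMINAL_STAGES` are an assumption about what the Jobs API reports, and `fetch_stage` is a caller-supplied function (for example, one that wraps `hf jobs inspect <job-id>`).

```python
import time

# Assumed stage names; adjust to whatever the job status actually reports.
TERMINAL_STAGES = {"COMPLETED", "CANCELED", "ERROR", "DELETED"}


def wait_for_job(fetch_stage, poll_seconds=30.0, max_polls=120, sleep=time.sleep):
    """Call `fetch_stage()` until it returns a terminal stage or polls run out."""
    stage = None
    for attempt in range(max_polls):
        stage = fetch_stage()
        if stage in TERMINAL_STAGES:
            return stage
        # Skip the final sleep: there is no further poll to wait for.
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    return stage
```

The function returns the last stage seen, so a non-terminal result means the polling budget ran out before the job finished.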

Scaling up after the smoke test

•Increase dataset size and training steps/epochs.
•Upgrade GPU flavor (example path): t4-small → t4-medium/l4x1 → a10g-small → a10g-large → a100-large
•Add memory savers if you hit OOM:
  - Reduce batch size, increase grad accumulation
  - Enable gradient checkpointing
  - Switch to LoRA/PEFT (optional)
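
Reducing the batch size and increasing gradient accumulation go together: the product of the two (times the GPU count) is the effective batch size, and keeping it constant keeps the optimization roughly comparable. A small worked sketch, with hypothetical helper names:

```python
def effective_batch(per_device_batch, grad_accum, num_gpus=1):
    """Number of samples contributing to each optimizer step."""
    return per_device_batch * grad_accum * num_gpus


def rebalance_batch(per_device_batch, grad_accum, shrink_factor=2):
    """Shrink the per-device batch and raise accumulation to compensate."""
    if per_device_batch % shrink_factor != 0:
        raise ValueError("per-device batch must be divisible by shrink_factor")
    return per_device_batch // shrink_factor, grad_accum * shrink_factor
```

For example, going from a per-device batch of 8 with 2 accumulation steps to a batch of 4 with 4 steps lowers peak memory while the effective batch stays at 16.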

Environment variables supported by the scripts

The training script reads:

•BASE_MODEL (default: Qwen/Qwen2.5-0.5B)
•DATASET_ID (default: trl-lib/Capybara)
•DATASET_SPLIT (default: train)
•SMOKE_SAMPLES (default: 200)
•MAX_STEPS (default: 50)
•OUTPUT_REPO (default: <username>/smoke-sft-<base-model>-<timestamp>)
•PUSH_TO_HUB (default: 1)

Example:

bash

hf jobs uv run \
  --flavor t4-small \
  --timeout 45m \
  --secrets HF_TOKEN \
  --with trl \
  -e MAX_STEPS=100 \
  -e SMOKE_SAMPLES=500 \
  scripts/train_sft_smoke.py
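
Inside a training script, the variables listed above could be read along these lines. This is a sketch rather than the script's actual code: the `SmokeConfig` dataclass and `load_config` helper are hypothetical, and the OUTPUT_REPO default is left as `None` because deriving `<username>/smoke-sft-<base-model>-<timestamp>` needs a Hub lookup.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class SmokeConfig:
    base_model: str
    dataset_id: str
    dataset_split: str
    smoke_samples: int
    max_steps: int
    output_repo: Optional[str]  # None means "derive the default name later"
    push_to_hub: bool


def load_config(environ=None) -> SmokeConfig:
    """Read settings from the environment, falling back to the documented defaults."""
    env = os.environ if environ is None else environ
    return SmokeConfig(
        base_model=env.get("BASE_MODEL", "Qwen/Qwen2.5-0.5B"),
        dataset_id=env.get("DATASET_ID", "trl-lib/Capybara"),
        dataset_split=env.get("DATASET_SPLIT", "train"),
        smoke_samples=int(env.get("SMOKE_SAMPLES", "200")),
        max_steps=int(env.get("MAX_STEPS", "50")),
        output_repo=env.get("OUTPUT_REPO"),
        # Only the literal "1" enables pushing in this sketch.
        push_to_hub=env.get("PUSH_TO_HUB", "1") == "1",
    )
```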

Common failure modes & fixes

Out of memory (OOM)

•Lower per_device_train_batch_size
•Increase gradient_accumulation_steps
•Reduce max_length / context
•Upgrade GPU flavor

Job timeout

•Increase --timeout and re-run
•Reduce steps/epochs for a test
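
When raising --timeout, a rough estimate beats a guess. The arithmetic below is a sketch: `seconds_per_step` should come from the smoke-test logs, while the 10-minute startup overhead and the 1.5x safety margin are assumptions to tune.

```python
import math


def suggest_timeout_minutes(max_steps, seconds_per_step, overhead_minutes=10, margin=1.5):
    """Estimate a --timeout value in whole minutes, rounded up."""
    train_minutes = max_steps * seconds_per_step / 60
    return math.ceil((train_minutes + overhead_minutes) * margin)
```

For instance, 100 steps at 12 seconds per step is 20 minutes of training; adding the overhead and margin gives 45, which would be passed as --timeout 45m.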

Model not saved

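•Job storage is ephemeral: anything not pushed to the Hub is lost when the job ends
•Ensure you passed --secrets HF_TOKEN and that the token has write permission
•Ensure PUSH_TO_HUB=1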

Included files

•scripts/train_sft_smoke.py — minimal SFT smoke test (small-model-first)
•scripts/inspect_dataset.py — quick dataset inspector (CPU-friendly)
•references/cli_cheatsheet.md — common hf jobs commands
•templates/train_sft_full.py — starting point for a real run
