An AI scientist whose only research domain is making models
faster on AMD MI300X. It hypothesizes. It runs experiments.
It reads its own results. It writes the next experiment.
Qwen all the way down. Solo build, 24 hours.
Both panels generate the same response on Qwen2.5-7B at batch=8 on AMD MI300X. The left panel uses the fp32 baseline. The right panel runs the bf16 cast ROCKET autonomously kept. Watch them race.
Each card is one decision the agent made. Click to add or remove it from the stack and watch the cumulative speedup update in real time. The agent picked this exact order autonomously.
Every AMD developer's first question is "how do I make this fast on MI300X?" That question doesn't have an autonomous answer. Until now.
Sakana's AI Scientist and Karpathy's AutoResearch design experiments to improve model accuracy. ROCKET designs experiments to improve throughput on AMD silicon. Different question. Zero prior art.
ROCmPort takes CUDA code and makes it run on ROCm. ROCKET takes a model that already runs and makes it fast. Translation vs. optimization.
ReplayLab records GPU experiments and recovers from crashes. ROCKET is the autopilot. It doesn't just observe, it acts.
Aegis, Triage, MediVision are applied agents solving domain problems. ROCKET is a meta agent: it makes other AI faster. Judges remember meta.
Each marker is one decision the agent made on the MI300X. kept tried, reverted
At each step the planner reads the profile, picks one tool from the bounded toolbox, applies it. The validator either keeps the change or reverts.
ROCKET doesn't write arbitrary code. The bounded search space is the point. The agent has to be smart about which tool, when, with what params.
dtype_cast
Cast model to bf16/fp16. Halves memory, ~2× arithmetic throughput on MI300X.
torch_compile
Inductor-fused kernels via torch.compile. Best for stable shapes.
sdpa_attention
Memory-efficient fused attention. Big win on attention-bound workloads.
input_padding
Pad shapes to GPU-friendly multiples (128/256). Free perf when shapes are odd.
kv_cache_config
Enable KV-caching. Turns O(n²) into O(n) on autoregressive generation. Often a 2-4×.
If you got this far, drop a like on this Space. It counts toward the HF community prize at the AMD x lablab.ai hackathon.