// AN AI RESEARCHER · AMD MI300X · ROCm 7

ROCKET

An AI scientist whose only research domain is making models faster on AMD MI300X. It hypothesizes. It runs experiments. It reads its own results. It writes the next experiment.
Qwen all the way down. Solo build, 24 hours.

QWEN PLANNER · QWEN TARGET · ROCm 7.0 · MI300X · 5.67× MEASURED
Watch it work
rocket@mi300x ~ python -m rocket.orchestrator --model Qwen/Qwen2.5-1.5B-Instruct ● running
MEASURED · AMD INSTINCT MI300X
2.93×
62.6 → 183.5 tokens/sec on Qwen2.5-7B
1 of 5 optimizations kept · agent reverted the rest · zero human input after pressing Run
SIDE BY SIDE

The same prompt. Same model. Same hardware.

Both panels generate the same response on Qwen2.5-7B at batch=8 on AMD MI300X. The left panel runs the fp32 baseline. The right panel runs the bf16 cast that ROCKET autonomously kept. Watch them race.

BASELINE
Qwen2.5-7B · fp32 · batch 8
62.6 tok/s
> tell me about the AMD MI300X in one paragraph
ROCKET ⚡
Qwen2.5-7B · bf16 · batch 8
183.5 tok/s
> tell me about the AMD MI300X in one paragraph
2.93× faster · same output, same model, same MI300X
INTERACTIVE

Toggle the agent's discoveries.

Each card is one decision the agent made. Click to add or remove it from the stack and watch the cumulative speedup update in real time. The agent picked this exact order autonomously.

CUMULATIVE SPEEDUP
1.00×
39.9 tok/s
PROFILE BREAKDOWN
Where the time actually goes
Toggle optimizations above. Watch the bottlenecks dissolve.
TOTAL DEVICE TIME / TOKEN
25.1 ms
MEMORY FOOTPRINT
6.30 GB
BOTTLENECK
attention
// HOW THIS IS DIFFERENT

An AI scientist for a question no one's automated.

Every AMD developer's first question is "how do I make this fast on MI300X?" That question doesn't have an autonomous answer. Until now.

01 · vs. Sakana / AutoResearch

Different research domain

Sakana's AI Scientist and Karpathy's AutoResearch design experiments to improve model accuracy. ROCKET designs experiments to improve throughput on AMD silicon. Different question. Zero prior art.

02 · vs. ROCmPort AI

Different verb

ROCmPort takes CUDA code and makes it run on ROCm. ROCKET takes a model that already runs and makes it fast. Translation vs. optimization.

03 · vs. ReplayLab

Different posture

ReplayLab records GPU experiments and recovers from crashes. ROCKET is the autopilot. It doesn't just observe, it acts.

04 · vs. every applied agent

Meta, not applied

Aegis, Triage, and MediVision are applied agents solving domain problems. ROCKET is a meta agent: it makes other AI faster. Judges remember meta.

REPLAY

The optimization journey

Each marker is one decision the agent made on the MI300X. Legend: kept · tried, reverted

RUN COMPLETE
STEP 5
TOOL torch_compile
TOK/S 226.4
DELTA +27%
CUM ×5.67
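The figures on this page relate to each other by simple ratios. The sketch below redoes that arithmetic. It assumes the 39.9 tok/s shown in the interactive panel is the baseline for the replay run, and that the +27% delta is rounded, so the backed-out throughput before step 5 is approximate. The function names are illustrative and not part of ROCKET.

```python
def speedup(baseline_tok_s: float, optimized_tok_s: float) -> float:
    """Throughput ratio: optimized / baseline."""
    if baseline_tok_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return optimized_tok_s / baseline_tok_s


def throughput_before_step(current_tok_s: float, delta_fraction: float) -> float:
    """Back out the throughput before a step from its relative gain."""
    return current_tok_s / (1.0 + delta_fraction)


# Side-by-side panel: fp32 baseline vs. bf16 on Qwen2.5-7B.
side_by_side = speedup(62.6, 183.5)

# Replay run: assumed 39.9 tok/s baseline vs. 226.4 tok/s after step 5.
replay_total = speedup(39.9, 226.4)

# Step 5 reported +27%, so the stack before it ran at roughly this rate.
before_step5 = throughput_before_step(226.4, 0.27)

print(f"{side_by_side:.2f}x  {replay_total:.2f}x  {before_step5:.1f} tok/s")
```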
AGENT TRACE

What the agent was thinking

At each step the planner reads the profile, picks one tool from the bounded toolbox, applies it. The validator either keeps the change or reverts.

TOOLBOX

Five tools. The agent picks the order.

ROCKET doesn't write arbitrary code. The bounded search space is the point. The agent has to be smart about which tool, when, with what params.
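A bounded search space can be enforced by rejecting any planner reply that names a tool outside the toolbox. The sketch below assumes the planner answers with a JSON object holding a `tool` name and optional `params`. That reply format and the function name are hypothetical; only the five tool names come from this page.

```python
import json

# The five tool names listed on this page.
TOOLBOX = (
    "dtype_cast",
    "torch_compile",
    "sdpa_attention",
    "input_padding",
    "kv_cache_config",
)


def parse_planner_choice(reply: str) -> dict:
    """Validate a planner reply against the bounded toolbox.

    Assumed (hypothetical) reply format:
        {"tool": "<name>", "params": {...}}
    """
    choice = json.loads(reply)  # raises ValueError on malformed JSON
    if not isinstance(choice, dict):
        raise ValueError("planner reply must be a JSON object")
    tool = choice.get("tool")
    if tool not in TOOLBOX:
        raise ValueError(f"tool outside the toolbox: {tool!r}")
    params = choice.get("params", {})
    if not isinstance(params, dict):
        raise ValueError("params must be a JSON object")
    return {"tool": tool, "params": params}


picked = parse_planner_choice('{"tool": "dtype_cast", "params": {"dtype": "bf16"}}')
```

Anything the planner proposes outside the tuple fails fast instead of reaching the model.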

dtype_cast

Cast model to bf16/fp16. Halves weight memory versus fp32, ~2× arithmetic throughput on MI300X.
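The memory half of that claim is plain arithmetic on bytes per parameter. The sketch below uses a round 1.5 billion parameters as an illustrative assumption, not a measured count, and covers weights only (no activations or KV cache). In PyTorch the cast itself is typically something like `model.to(torch.bfloat16)`.

```python
# Storage size of one parameter in each dtype.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Weight-only footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NUM_PARAMS = 1.5e9  # illustrative round figure, an assumption

fp32_gb = weight_memory_gb(NUM_PARAMS, "fp32")
bf16_gb = weight_memory_gb(NUM_PARAMS, "bf16")
print(f"fp32: {fp32_gb:.1f} GB, bf16: {bf16_gb:.1f} GB")
```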

⚙️
torch_compile

Inductor-fused kernels via torch.compile. Best for stable shapes.

🎯
sdpa_attention

Memory-efficient fused attention. Big win on attention-bound workloads.

📐
input_padding

Pad shapes to GPU-friendly multiples (128/256). Free perf when shapes are odd.
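Rounding a length up to the next multiple is the core of this tool. A minimal sketch follows; the helper names and the `pad_id` value are hypothetical, and a real pipeline would also need an attention mask so the padding tokens are ignored.

```python
def pad_to_multiple(length: int, multiple: int = 128) -> int:
    """Round length up to the nearest multiple (e.g. 128 or 256)."""
    if length < 0 or multiple <= 0:
        raise ValueError("length must be >= 0 and multiple must be > 0")
    return -(-length // multiple) * multiple  # ceiling division


def pad_sequence(token_ids: list, multiple: int = 128, pad_id: int = 0) -> list:
    """Right-pad a token list so its length is a multiple of `multiple`."""
    target = pad_to_multiple(len(token_ids), multiple)
    return token_ids + [pad_id] * (target - len(token_ids))


padded = pad_sequence(list(range(1000)), multiple=128)
print(len(padded))
```

Lengths that are already aligned are left unchanged, so the padding costs nothing when shapes are already friendly.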

💾
kv_cache_config

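Enable KV caching. Stops recomputing keys and values for the whole prefix at every step, turning O(n²) projection work into O(n) over an n-token autoregressive generation. Often a 2-4× speedup.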
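One way to read the O(n²) to O(n) claim is to count how many per-token key/value projections are computed while generating n tokens. The sketch below is a counting model only: it ignores the attention score computation itself, which still grows with context length even when the cache is on.

```python
def kv_projections_without_cache(n_tokens: int) -> int:
    """Step t re-projects keys/values for all t tokens seen so far."""
    return sum(range(1, n_tokens + 1))  # n(n+1)/2, i.e. O(n^2)


def kv_projections_with_cache(n_tokens: int) -> int:
    """Each step projects only the newest token and reuses the rest."""
    return n_tokens  # O(n)


n = 1000
no_cache = kv_projections_without_cache(n)
cached = kv_projections_with_cache(n)
print(no_cache, cached)
```

Measured end-to-end gains are far smaller than this raw count ratio because the projections are only part of each decoding step.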

ARCHITECTURE

Three agents, one loop, all on MI300X

PROFILER · torch.profiler / rocprof → hotspot summary
PLANNER · Qwen2.5-7B-Instruct (vLLM, local) → next tool to try
IMPLEMENTER · picks from toolbox → transformed model
VALIDATOR · re-bench + correctness → keep / revert
↺  until plateau or max iterations
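The loop above can be sketched as a plain function with the profiler, planner, and validator passed in as callables. This is a sketch under stated assumptions, not ROCKET's actual orchestrator: the signatures, the `min_gain` threshold, and the `plateau_patience` rule are all hypothetical, and the benchmark numbers in the demo are made up.

```python
from typing import Callable, List, Optional, Tuple


def optimize(
    benchmark: Callable[[List[str]], float],       # tok/s for a tool stack
    plan_next: Callable[[List[str], List[str]], Optional[str]],
    is_correct: Callable[[List[str]], bool],
    max_iterations: int = 5,
    plateau_patience: int = 2,   # hypothetical: stop after N reverts in a row
    min_gain: float = 0.02,      # hypothetical: require at least +2% to keep
) -> Tuple[List[str], float]:
    kept: List[str] = []
    tried: List[str] = []
    best = benchmark(kept)
    reverts_in_a_row = 0
    for _ in range(max_iterations):
        tool = plan_next(kept, tried)
        if tool is None:         # planner has nothing left to propose
            break
        tried.append(tool)
        candidate = kept + [tool]
        tok_s = benchmark(candidate)
        if is_correct(candidate) and tok_s >= best * (1.0 + min_gain):
            kept, best, reverts_in_a_row = candidate, tok_s, 0   # keep
        else:
            reverts_in_a_row += 1                                # revert
            if reverts_in_a_row >= plateau_patience:
                break
    return kept, best


# Demo with made-up multipliers standing in for real measurements.
FAKE_GAIN = {"dtype_cast": 2.9, "sdpa_attention": 1.0, "torch_compile": 1.3}
ORDER = ["dtype_cast", "sdpa_attention", "torch_compile"]


def fake_benchmark(stack: List[str]) -> float:
    tok_s = 40.0
    for tool in stack:
        tok_s *= FAKE_GAIN[tool]
    return tok_s


def fake_planner(kept: List[str], tried: List[str]) -> Optional[str]:
    remaining = [t for t in ORDER if t not in tried]
    return remaining[0] if remaining else None


kept_tools, final_tok_s = optimize(fake_benchmark, fake_planner, lambda s: True)
print(kept_tools, round(final_tok_s, 1))
```

Keeping the decision rule inside one function makes the keep/revert behavior easy to test with stubs before any GPU time is spent.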

An autopilot for AMD performance.

If you got this far, drop a like on this Space. It counts toward the HF community prize at the AMD x lablab.ai hackathon.

BUILT SOLO · AMD x LABLAB.AI DEVELOPER HACKATHON · MAY 2026
Powered by AMD Instinct MI300X · ROCm · PyTorch · Qwen
by Maruthi Kunchala