// AN AI RESEARCHER · AMD MI300X · ROCm 7

ROCKET⚡

An AI scientist whose only research domain is making models faster on AMD MI300X. It hypothesizes. It runs experiments. It reads its own results. It writes the next experiment.
Qwen all the way down. Solo build, 24 hours.

QWEN PLANNER · QWEN TARGET · ROCm 7.0 · MI300X · 5.67× MEASURED

Watch it work

SIDE BY SIDE

The same prompt. Same model. Same hardware.

Both panels generate the same response on Qwen2.5-7B at batch=8 on AMD MI300X. The left panel runs the fp32 baseline. The right panel runs the bf16 cast that ROCKET autonomously kept. Watch them race.

BASELINE

Qwen2.5-7B · fp32 · batch 8

62.6 tok/s

> tell me about the AMD MI300X in one paragraph

ROCKET ⚡

Qwen2.5-7B · bf16 · batch 8

183.5 tok/s

> tell me about the AMD MI300X in one paragraph

2.93× faster · same output, same model, same MI300X

INTERACTIVE

Toggle the agent's discoveries.

Each card is one decision the agent made. Click to add or remove it from the stack and watch the cumulative speedup update in real time. The agent picked this exact order autonomously.

CUMULATIVE SPEEDUP

1.00×

39.9 → 39.9 tok/s

PROFILE BREAKDOWN

Where the time actually goes

Toggle optimizations above. Watch the bottlenecks dissolve.

TOTAL DEVICE TIME / TOKEN

25.1 ms

MEMORY FOOTPRINT

6.30 GB

BOTTLENECK

attention

// HOW THIS IS DIFFERENT

An AI scientist for a question no one's automated.

Every AMD developer's first question is "how do I make this fast on MI300X?" That question doesn't have an autonomous answer. Until now.

01 · vs. Sakana / AutoResearch

Different research domain

Sakana's AI Scientist and Karpathy's AutoResearch design experiments to improve model accuracy. ROCKET designs experiments to improve throughput on AMD silicon. Different question. Zero prior art.

02 · vs. ROCmPort AI

Different verb

ROCmPort takes CUDA code and makes it run on ROCm. ROCKET takes a model that already runs and makes it fast. Translation vs. optimization.

03 · vs. ReplayLab

Different posture

ReplayLab records GPU experiments and recovers from crashes. ROCKET is the autopilot. It doesn't just observe, it acts.

04 · vs. every applied agent

Meta, not applied

Aegis, Triage, and MediVision are applied agents solving domain problems. ROCKET is a meta agent: it makes other AI faster. Judges remember meta.

REPLAY

The optimization journey

Each marker is one decision the agent made on the MI300X. Legend: kept · tried, reverted

RUN COMPLETE

Step 5 · tool: torch_compile · 226.4 tok/s · delta +27% · cumulative ×5.67
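
The speedup figures quoted on this page are plain throughput ratios, and they can be checked from the tok/s numbers shown. The sketch below does only that arithmetic. The `speedup` helper is hypothetical, and the last line assumes the "+27%" delta is measured against the previous kept step, which the page does not state.

```python
def speedup(baseline_tok_s: float, optimized_tok_s: float) -> float:
    """Throughput ratio: how many times faster than the baseline."""
    if baseline_tok_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return optimized_tok_s / baseline_tok_s


# tok/s figures quoted on this page.
side_by_side = speedup(62.6, 183.5)  # fp32 panel vs bf16 panel
full_run = speedup(39.9, 226.4)      # replay baseline vs step 5

print(f"side by side: {side_by_side:.2f}x")
print(f"full run:     {full_run:.2f}x")

# Assumption: "+27%" is relative to the throughput just before step 5.
before_step5 = 226.4 / 1.27
print(f"implied tok/s before step 5: {before_step5:.1f}")
```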

AGENT TRACE

What the agent was thinking

At each step the planner reads the profile, picks one tool from the bounded toolbox, applies it. The validator either keeps the change or reverts.
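
A minimal sketch of the keep-or-revert decision described above, assuming the validator compares a benchmark taken before and after the tool is applied. `BenchResult`, the field names, and both thresholds are illustrative assumptions, not ROCKET's actual values.

```python
from dataclasses import dataclass


@dataclass
class BenchResult:
    tok_per_s: float      # measured throughput
    max_abs_error: float  # largest output deviation from a reference run


def validate(before: BenchResult, after: BenchResult,
             min_gain: float = 1.02, max_error: float = 1e-2) -> str:
    """Return "keep" only if outputs are still correct and throughput improved.

    min_gain and max_error are hypothetical thresholds chosen for this sketch.
    """
    if after.max_abs_error > max_error:
        return "revert"  # faster but wrong is not an optimization
    if after.tok_per_s < before.tok_per_s * min_gain:
        return "revert"  # gain too small to distinguish from noise
    return "keep"


print(validate(BenchResult(100.0, 0.0), BenchResult(127.0, 0.001)))
```

Checking correctness before throughput means a change that breaks the output is rejected no matter how fast it runs.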

TOOLBOX

Five tools. The agent picks the order.

ROCKET doesn't write arbitrary code. The bounded search space is the point. The agent has to be smart about which tool, when, with what params.
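
One way to enforce a bounded search space is to treat the planner's reply as data and reject anything that is not in the toolbox. The sketch below assumes the planner answers in JSON. The tool names come from this page, but the parameter keys (`dtype`, `multiple`) and the `parse_action` helper are hypothetical.

```python
import json

# Tool name -> allowed values per parameter (parameter keys are assumptions).
ALLOWED = {
    "dtype_cast": {"dtype": {"bf16", "fp16"}},
    "torch_compile": {},
    "sdpa_attention": {},
    "input_padding": {"multiple": {128, 256}},
    "kv_cache_config": {},
}


def parse_action(reply: str) -> dict:
    """Parse a planner reply and reject anything outside the toolbox."""
    action = json.loads(reply)
    if not isinstance(action, dict):
        raise ValueError("planner reply must be a JSON object")
    tool = action.get("tool")
    if tool not in ALLOWED:
        raise ValueError(f"unknown tool: {tool!r}")
    params = action.get("params", {})
    for key, value in params.items():
        allowed_values = ALLOWED[tool].get(key)
        if allowed_values is None:
            raise ValueError(f"{tool} has no parameter {key!r}")
        if value not in allowed_values:
            raise ValueError(f"{key}={value!r} is not allowed for {tool}")
    return {"tool": tool, "params": params}


print(parse_action('{"tool": "dtype_cast", "params": {"dtype": "bf16"}}'))
```

Because the planner can only name a tool and its parameters, a malformed or out-of-scope reply fails at parse time instead of running on the GPU.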

⚡

dtype_cast

Cast model to bf16/fp16. Halves memory, ~2× arithmetic throughput on MI300X.

⚙️

torch_compile

Inductor-fused kernels via torch.compile. Best for stable shapes.

🎯

sdpa_attention

Memory-efficient fused attention. Big win on attention-bound workloads.

📐

input_padding

Pad shapes to GPU-friendly multiples (128/256). Free perf when shapes are odd.

💾

kv_cache_config

Enable KV caching. Cuts per-token attention cost from O(n²) to O(n) in autoregressive generation. Often a 2-4× speedup.
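
To make the KV-cache complexity claim concrete, the sketch below counts query-key dot products needed to emit one new token, per head and per layer, with and without a cache. It is a counting model under simple assumptions (causal attention, full recomputation when the cache is off), not a benchmark, and it ignores the projection and MLP work that caching also avoids.

```python
def attention_scores_per_step(context_len: int, use_kv_cache: bool) -> int:
    """Query-key dot products needed to produce one new token."""
    if use_kv_cache:
        # Only the newest token's query attends to the cached keys.
        return context_len
    # The whole sequence is re-run: position i attends to positions 0..i.
    return context_len * (context_len + 1) // 2


for n in (128, 1024):
    cached = attention_scores_per_step(n, use_kv_cache=True)
    uncached = attention_scores_per_step(n, use_kv_cache=False)
    print(f"n={n}: cached={cached}, uncached={uncached}")
```

The gap in operation count grows linearly with context length, so it does not translate directly into the end-to-end speedup, which also depends on memory bandwidth and on the rest of the model.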

ARCHITECTURE

Three agents, one loop, all on MI300X

PROFILER · torch.profiler / rocprof → hotspot summary

→ PLANNER · Qwen2.5-7B-Instruct (vLLM, local) → next tool to try

→ IMPLEMENTER · picks from toolbox → transformed model

→ VALIDATOR · re-bench + correctness → keep / revert

↺ until plateau or max iterations
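
The "until plateau or max iterations" stop rule could look something like the sketch below. The page does not define "plateau", so the `patience` and `min_rel_gain` parameters are assumptions, and the throughput values in the usage lines are made up for illustration.

```python
def should_stop(history: list[float], max_iterations: int,
                patience: int = 2, min_rel_gain: float = 0.01) -> bool:
    """Decide whether the optimization loop should end.

    history holds the best tok/s after each iteration; history[0] is the
    baseline. Stops when the iteration budget is spent, or when the last
    `patience` iterations together gained less than `min_rel_gain`.
    """
    iterations_done = len(history) - 1
    if iterations_done >= max_iterations:
        return True
    if iterations_done < patience:
        return False
    reference = history[-1 - patience]
    return history[-1] < reference * (1 + min_rel_gain)


# Illustrative values only, not measurements.
print(should_stop([40.0, 100.0, 150.0], max_iterations=10))
print(should_stop([40.0, 100.0, 100.0, 100.0], max_iterations=10))
```

Tracking the best throughput so far, instead of the latest attempt, keeps reverted experiments from hiding real progress.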

An autopilot for AMD performance.

If you got this far, drop a like on this Space. It counts toward the HF community prize at the AMD x lablab.ai hackathon.

BUILT SOLO · AMD x LABLAB.AI DEVELOPER HACKATHON · MAY 2026
Powered by AMD Instinct MI300X · ROCm · PyTorch · Qwen

by Maruthi Kunchala