Autoresearch Frameworks and Autonomous Code Hill-Climbing: Let AI Improve Your Code While You Sleep
Learn how Andrej Karpathy's autoresearch framework uses autonomous AI agents and hill-climbing loops to run machine learning experiments automatically, improve model performance overnight, and reduce manual coding effort.

Imagine setting up an experiment before bed and waking up to 100 completed runs, a better-performing model, and a log of every decision the agent made. No babysitting. No manual tweaking. Just results.
That used to be science fiction. In early 2026, Andrej Karpathy made it real with a project called autoresearch. The concept is simple but powerful: give an AI agent a training file, a clear goal, and a metric to optimize, then let it run experiments on its own. It modifies code, tests the change, and keeps or discards based on whether performance improved.
This is not just a neat trick. It is a shift in how software and ML research gets done. If you have ever spent hours tweaking hyperparameters or hunting for marginal model improvements, this is the guide for you.
What Is Autoresearch?
Autoresearch is an open-source project by Andrej Karpathy. The core idea: give an AI agent a small but real LLM training setup and let it experiment autonomously overnight. It modifies code, trains for 5 minutes, checks if the result improved, keeps or discards, and repeats. You wake up in the morning to a log of experiments and, hopefully, a better model.
In a single run, the framework ran 700 experiments in 2 days. It discovered 20 optimizations that improved training, including novel architecture tweaks like reordering QK Norm and RoPE.
The core loop looks like this:
1. Propose a code change
2. Run a fixed-time training experiment
3. Measure the metric (val_bpb = validation bits-per-byte)
4. Keep the change if it improves the metric
5. Revert it if it does not
6. Repeat indefinitely
This is fundamentally different from asking a chatbot a question. Autoresearch is closed-loop. The agent acts on the world, observes the consequences, and adapts. It is the difference between reading a recipe and actually cooking, tasting, and adjusting.
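The six steps can be sketched as a generic greedy loop. This is a minimal illustration, not the project's code: `hill_climb`, `propose`, and `evaluate` are hypothetical names, and a toy numeric function stands in for a real training run.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


def hill_climb(
    baseline: T,
    propose: Callable[[T], T],
    evaluate: Callable[[T], float],
    steps: int,
) -> tuple[T, float, list[str]]:
    """Greedy keep/revert loop. Lower scores are better, as with val_bpb."""
    best, best_score = baseline, evaluate(baseline)
    history = []
    for _ in range(steps):
        candidate = propose(best)      # 1. propose a change to the current best
        score = evaluate(candidate)    # 2-3. run the experiment, measure the metric
        if score < best_score:         # 4. keep only strict improvements
            best, best_score = candidate, score
            history.append("keep")
        else:                          # 5. revert: `best` is left untouched
            history.append("discard")
    return best, best_score, history


# Toy stand-in for training: minimize (x - 3)^2 starting from x = 0.
rng = random.Random(0)
best, score, history = hill_climb(
    baseline=0.0,
    propose=lambda x: x + rng.uniform(-1.0, 1.0),
    evaluate=lambda x: (x - 3.0) ** 2,
    steps=200,
)
print(f"best={best:.3f} score={score:.5f} kept={history.count('keep')}")
```

Every proposal starts from the current best, so a discarded candidate never influences later experiments.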
The Three-File Architecture
The entire system is built around a contract between three files, each with strict rules about who can touch it. This simplicity is what makes it work.
```
autoresearch/
├── prepare.py   # Fixed: data prep, tokenizer, evaluation metric
├── train.py     # Agent's sandbox: model, optimizer, training loop
└── program.md   # Human instructions to the agent (plain Markdown)
```
Here is what each file does:
| File | Who Modifies It | Purpose |
|---|---|---|
| prepare.py | Nobody (immutable) | Data prep, tokenizer, defines val_bpb metric |
| train.py | The AI agent | Model architecture, optimizer, training loop |
| program.md | The human | Instructions, constraints, goals for the agent |
prepare.py is immutable. Neither the human nor the agent modifies it, which guarantees that every experiment is measured against the same yardstick. train.py is the agent's sandbox. The agent can rewrite anything here as long as the modified code still trains and produces a val_bpb score. program.md is written in plain Markdown and is the only file the human author touches.
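One way to enforce this contract mechanically is to check which files an experiment changed before accepting it. The sketch below is an assumption, not part of autoresearch: `AGENT_WRITABLE` and `contract_violations` are hypothetical names, and the list of changed paths would come from something like `git diff --name-only`.

```python
from pathlib import PurePosixPath

# Assumption: train.py is the only tracked file the agent may edit.
AGENT_WRITABLE = {"train.py"}


def contract_violations(changed_paths: list[str]) -> list[str]:
    """Return the changed paths that the agent was not allowed to touch."""
    violations = []
    for raw in changed_paths:
        path = PurePosixPath(raw).as_posix()  # normalizes "./train.py" to "train.py"
        if path not in AGENT_WRITABLE:
            violations.append(path)
    return sorted(violations)


print(contract_violations(["train.py"]))                # []
print(contract_violations(["train.py", "prepare.py"]))  # ['prepare.py']
```

A non-empty result would be a reason to discard the experiment regardless of its score, since a change to prepare.py could alter the metric itself.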
How the Hill-Climbing Loop Works
Hill-climbing is a search strategy. Instead of exploring every possibility, the agent only moves in the direction that improves the metric. Each accepted change becomes the new baseline.
The process: create a fresh branch for the run, initialize a results log, and run a baseline to establish the starting score. For each experiment, modify only train.py, commit, run training, parse val_bpb and memory usage, and append a row to results.tsv. If val_bpb improves (lower), keep the commit. If equal or worse, reset back to the last good state. If the run crashes or exceeds a timeout, treat it as a failure and move on.
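The run, parse, and keep/discard steps can be sketched as two small functions. This is an illustration under stated assumptions: the function names are hypothetical, and the `val_bpb: <number>` output format is assumed, not taken from the project. A stand-in command replaces the real 5-minute training run so the example finishes instantly.

```python
import re
import subprocess
import sys
from typing import Optional

# Assumption: the training script prints a line such as "val_bpb: 0.993200".
VAL_BPB_PATTERN = re.compile(r"val_bpb:\s*([0-9]*\.?[0-9]+)")


def run_experiment(cmd: list[str], timeout_s: float) -> Optional[float]:
    """Run one experiment. Return val_bpb, or None on crash or timeout."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None  # exceeded the time limit: treated as a failure
    if proc.returncode != 0:
        return None  # non-zero exit (for example, out of memory): a failure
    match = VAL_BPB_PATTERN.search(proc.stdout)
    return float(match.group(1)) if match else None


def decide(best_so_far: float, val_bpb: Optional[float]) -> str:
    """Map a result to the status column: keep, discard, or crash."""
    if val_bpb is None:
        return "crash"
    # Equal scores are discarded: only a strict improvement moves the baseline.
    return "keep" if val_bpb < best_so_far else "discard"


# Stand-in for a real training command, which would run for minutes.
fake_run = [sys.executable, "-c", "print('val_bpb: 0.993200')"]
print(decide(0.997900, run_experiment(fake_run, timeout_s=30)))  # prints "keep"
```

In a real loop the command would be the training invocation and the timeout would exceed the 5-minute budget. A "discard" or "crash" would then be followed by a Git reset to the last kept commit.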
Here is what a typical results.tsv looks like after a few experiments:
```tsv
commit   val_bpb   memory_gb  status   description
a1b2c3d  0.997900  44.0       keep     baseline
b2c3d4e  0.993200  44.2       keep     increase LR to 0.04
c3d4e5f  1.005000  44.0       discard  switch to GeLU activation
d4e5f6g  0.000000  0.0        crash    double model width (OOM)
```
The agent keeps a clean commit history. If you want to inspect or revert any experiment, the Git log is your audit trail.
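A log in this shape is easy to query in the morning. The sketch below assumes the real file is tab-separated, as the .tsv extension suggests, and uses the sample rows above as inline data. `best_kept_run` is a hypothetical helper.

```python
import csv
import io
from typing import Optional

# Sample rows from the log above, written with explicit tab separators.
SAMPLE = (
    "commit\tval_bpb\tmemory_gb\tstatus\tdescription\n"
    "a1b2c3d\t0.997900\t44.0\tkeep\tbaseline\n"
    "b2c3d4e\t0.993200\t44.2\tkeep\tincrease LR to 0.04\n"
    "c3d4e5f\t1.005000\t44.0\tdiscard\tswitch to GeLU activation\n"
    "d4e5f6g\t0.000000\t0.0\tcrash\tdouble model width (OOM)\n"
)


def best_kept_run(tsv_text: str) -> Optional[dict]:
    """Return the kept row with the lowest val_bpb, or None if nothing was kept."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = [row for row in rows if row["status"] == "keep"]
    return min(kept, key=lambda row: float(row["val_bpb"]), default=None)


best = best_kept_run(SAMPLE)
print(best["commit"], best["val_bpb"], best["description"])
```

Filtering on status matters here: the crash row records a val_bpb of 0.000000, which would look like the best score if it were compared numerically.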
Writing program.md: Your Instructions to the Agent
program.md is where you, the human, define the research direction. Think of it as writing a job description for an AI researcher.
Unlike traditional research where humans directly modify code, autoresearch inverts this relationship: humans write instructions in natural language, and the agent translates these into code modifications.
Here is a minimal example of what program.md might look like:
```markdown
# Research Goal
Minimize val_bpb on the nanochat training setup.
## Constraints
- Only modify train.py
- Do not change batch size beyond 2x the baseline
- Prefer simpler solutions over complex ones
## Suggested Areas to Explore
- Learning rate schedules
- Attention head configurations
- Weight initialization strategies
- Optimizer hyperparameters
## Loop Instructions
- Run experiments continuously
- Never stop to ask for permission
- Log every result to results.tsv
- Revert any change that does not improve val_bpb
```
If you want the agent to focus on attention mechanisms, say so in program.md. If you want it to avoid touching the optimizer, add that constraint.
Setting Up and Running Autoresearch
Getting started takes less than 10 minutes on a machine with a single GPU.
Step 1: Clone and install dependencies
```bash
git clone https://github.com/karpathy/autoresearch
cd autoresearch
uv sync
```
Step 2: Prepare the data (one-time setup)
```bash
uv run prepare.py
# Downloads training data from HuggingFace
# Builds the BPE tokenizer with 8,192-token vocabulary
# Takes approx. 2 minutes
```
Step 3: Verify your baseline
```bash
uv run train.py
# Trains for exactly 5 minutes
# Outputs val_bpb score and peak VRAM usage
```
Step 4: Start the agent loop
```bash
# Launch via your preferred agentic coding tool (Claude Code, Codex CLI, etc.)
# Point it at the repo and say: "Follow program.md and start autoresearch"
```
The agent only touches train.py. This keeps the scope manageable and diffs reviewable. Training always runs for exactly 5 minutes regardless of your specific platform. This means you can expect approximately 12 experiments per hour and around 100 experiments while you sleep.
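The throughput estimate follows from the fixed budget: 60 / 5 = 12 runs per hour, and 8 hours of sleep gives 96, or roughly 100. The sketch below makes the arithmetic explicit. `overhead_minutes` is a hypothetical parameter for time spent between runs (editing, committing, startup); the source gives no figure for it.

```python
def experiments_completed(
    hours: float,
    run_minutes: float = 5.0,
    overhead_minutes: float = 0.0,
) -> int:
    """Whole experiments that fit in the window at a fixed per-run budget."""
    return int(hours * 60 // (run_minutes + overhead_minutes))


print(experiments_completed(1))                        # 12
print(experiments_completed(8))                        # 96
print(experiments_completed(8, overhead_minutes=1.0))  # 80
```

Even one minute of overhead per experiment cuts an 8-hour night from 96 runs to 80, so the "around 100" figure is best read as an upper bound.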
Autoresearch vs. Traditional Frameworks
How does autoresearch compare to other tools commonly used for ML experimentation?
| Feature | Autoresearch | Optuna / Ray Tune | SWE-Agent / OpenHands |
|---|---|---|---|
| Search target | Code itself | Hyperparameter space | Arbitrary code tasks |
| Evaluation | Fixed-time training run | User-defined objective | Task-specific |
| Human involvement | Write program.md once | Define search space | Ongoing orchestration |
| Experiment tracking | Git commits + TSV log | Built-in dashboards | Varies |
| Scope | ML training loops | Parameter optimization | General software tasks |
| Revert mechanism | Git reset | N/A | N/A |
General-purpose coding agents like SWE-Agent, OpenHands, and Aider can write arbitrary code but are not built for the experiment-evaluate-keep/revert cycle that ML research actually requires. Autoresearch bets on the LLM's general knowledge to propose good experiments, where tools like Optuna instead constrain the search space in exchange for mathematical guarantees.
The Bigger Shift: From Coding to Directing
Autoresearch is part of a broader change in how engineers work with AI, which Karpathy frames as a natural progression. In February 2026, he coined the term "agentic engineering": 99% of the time you are not writing the code directly; you are orchestrating agents who do, and acting as oversight. Autoresearch takes the next step. The human does not even orchestrate. They describe what good research looks like in a Markdown file and walk away.
The progression looks like this:
Vibe Coding
Human prompts → AI writes code → Human reviews
Agentic Engineering
Human orchestrates agents in real time
Autoresearch (Fully Autonomous)
Human sets direction → Agent runs independently
In Claude Code, the 99.9th-percentile turn duration nearly doubled from under 25 to over 45 minutes between October 2025 and January 2026, reducing the need for constant supervision while increasing the importance of robust guardrails.
Limitations and When It Struggles
Autoresearch is powerful, but it is not perfect. Know the failure modes before you run it overnight.
Hill-climbing algorithms, including autoresearch, can get stuck in local optima. The agent finds a parameter set that is better than its neighbors but far from globally optimal. It keeps making tiny changes, none of which improve the metric, and the loop stalls. The mitigations: run multiple loops from different starting points, add randomization to the experiment step, and use the loop to explore before applying human judgment to the best results.
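A toy example shows both the failure and the first mitigation. The loss landscape below is invented for illustration, and `climb` and `best_of_restarts` are hypothetical names. Each index stands for a configuration and each value for its loss.

```python
# Hypothetical 1-D loss landscape: lower is better, global minimum at index 8.
LANDSCAPE = [5.0, 3.0, 4.0, 6.0, 2.0, 1.0, 2.0, 7.0, 0.5, 3.0]


def loss(index: int) -> float:
    return LANDSCAPE[index]


def climb(start: int) -> int:
    """Greedy descent: move to the better adjacent index until none improves."""
    current = start
    while True:
        neighbors = [i for i in (current - 1, current + 1) if 0 <= i < len(LANDSCAPE)]
        best_neighbor = min(neighbors, key=loss)
        if loss(best_neighbor) >= loss(current):
            return current  # no neighbor improves: a local (maybe global) minimum
        current = best_neighbor


def best_of_restarts(starts: list[int]) -> int:
    """Run one greedy descent per starting point and keep the best endpoint."""
    return min((climb(s) for s in starts), key=loss)


single = climb(0)
multi = best_of_restarts([0, 3, 7])
print(single, loss(single))  # 1 3.0  (stuck in a local minimum)
print(multi, loss(multi))    # 8 0.5  (a restart reached the global minimum)
```

The starting points are fixed here so the output is reproducible. In practice they would be chosen at random or seeded from different baselines.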
Other known failure modes:
- Metric gaming: The agent optimizes the score, not your actual goal. If your metric is imperfect, the results will be too.
- VRAM blowout: The agent might propose architecture changes that exceed your GPU memory. The crash handling catches this, but it wastes experiment slots.
- Narrow search: Hill-climbing does not backtrack. If the early experiments set a bad baseline, later runs stay stuck in that neighborhood.
The deeper pattern autoresearch illustrates is an agent loop with an objective metric and a keep/discard gate. That pattern generalizes beyond ML when three conditions hold: you can define a measurable fitness signal, you can run a controlled experiment repeatedly, and you can automatically decide what survives.
Beyond ML: Applying the Pattern Elsewhere
The autoresearch pattern is not limited to machine learning. Any domain with a measurable metric and repeatable experiments can use it.
The same loop shows up in agent infrastructure frameworks that do recursive self-improvement: an agent logs outcomes, identifies failure patterns, proposes modifications to its own skills or routing logic, tests them, and keeps improvements. The difference is the substrate. Autoresearch operates on ML experiment code and validation loss. Recursive self-improvement in agent infrastructure operates on tool configs, skill files, and task success rates. Both are try-measure-keep/discard cycles.
Some practical extensions already in use:
- Prompt optimization: Run a hill-climbing loop on prompt variants, keep those that improve task scores.
- A/B testing pipelines: Mutate UI copy or pricing rules, keep only statistically significant improvements.
- Code quality improvement: Iterate on test coverage or lint scores autonomously.
- Customer support agents: Score agent traces, classify failures, and keep candidate improvements automatically.
Q&A
1. Do I need multiple GPUs to run autoresearch?
No. The project is built around a simplified single-GPU nanochat training workflow and is aimed at developers and researchers exploring automated model improvement on their own hardware.
2. What metric does autoresearch optimize by default?
The validation metric is val_bpb (validation bits-per-byte). Lower is better. This is defined in prepare.py and never changes, so every experiment is comparable.
3. Can I change what the agent is allowed to modify?
Yes. You specify constraints directly in program.md. You can tell the agent to avoid certain parts of the code, stay within VRAM limits, or focus only on specific components like the optimizer.
4. What happens if a training run crashes?
If the run crashes or exceeds a timeout of 10 minutes, it is treated as a failure. The agent optionally fixes trivial issues. Otherwise, it logs "crash" and moves on.
5. How many experiments can I expect per night?
You can expect approximately 12 experiments per hour and around 100 experiments while you sleep on a typical single-GPU setup.
6. Is this the same as hyperparameter search tools like Optuna?
Not exactly. Optuna searches over a predefined parameter space. Autoresearch lets the agent search over the space of code itself, including architecture changes, activation functions, and optimizer logic, which is a much broader search space.
7. Can I use autoresearch with my own training codebase?
Yes, with some adaptation. The core mechanics are the loop, the mutation step, and the binary keep/discard decision. Each new change is mutated from the best result so far, not from the latest failed attempt. These principles apply to any repo once you define an equivalent three-file structure.
8. What agentic tools work with autoresearch?
The framework is model- and harness-agnostic and applies across Claude Code, OpenCode, Codex CLI, and related CLI agents.
9. What is val_bpb and why is it a good metric?
val_bpb stands for validation bits-per-byte. It measures how well a language model compresses text on a held-out validation set. Lower values mean the model predicts text more accurately. It is hardware-agnostic and comparable across different model sizes and architectures, making it ideal for a fixed-time budget experiment loop.
10. Is autoresearch safe to run unattended overnight?
The instructions in program.md explicitly say "NEVER STOP" so the agent does not pause to ask for permission, anticipating that the human may be asleep. VRAM is tracked as a soft constraint. Some increase is acceptable for meaningful gains, but the design avoids dramatic blowups. Always review the Git log and results TSV in the morning before using any resulting model.
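The bits-per-byte metric from question 9 can be made concrete with a short conversion. This is the standard definition, not necessarily how prepare.py computes it: the summed negative log-likelihood over the validation text, in nats, is divided by ln 2 to get bits and by the number of UTF-8 bytes in that text. `bits_per_byte` is a hypothetical helper name.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert summed negative log-likelihood (nats) into bits per byte of text."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)


# A model that guesses uniformly over all 256 byte values pays ln(256) nats
# per byte, which is 8 bits per byte: the no-compression reference point.
uniform_nll = math.log(256) * 1000
print(round(bits_per_byte(uniform_nll, 1000), 6))  # 8.0
```

Normalizing by bytes instead of tokens means the score does not change merely because a tokenizer splits the same text into more or fewer tokens.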
