Autoresearch Frameworks and Autonomous Code Hill-Climbing: Let AI Improve Your Code While You Sleep

Learn how Andrej Karpathy's autoresearch framework uses autonomous AI agent loops and code hill-climbing algorithms to run ML experiments and improve codebases automatically.

Imagine setting up an experiment before bed and waking up to 100 completed runs, a better-performing model, and a log of every decision the agent made. No babysitting. No manual tweaking. Just results.

That used to be science fiction. In early 2026, Andrej Karpathy made it real with a project called autoresearch. The concept is simple but powerful: give an AI agent a training file, a clear goal, and a metric to optimize, then let it run experiments on its own. It modifies code, tests the change, and keeps or discards based on whether performance improved.

This is not just a neat trick. It is a shift in how software development and ML research get done. If you have ever spent hours tweaking hyperparameters or hunting for marginal model improvements, this guide is for you.

What Is Autoresearch?

Autoresearch is an open-source autonomous experimentation framework created by Andrej Karpathy that allows an AI agent to run machine learning training loops independently. By employing an autonomous code hill-climbing cycle, the agent modifies the codebase, trains the model, measures success metrics, and decides whether to keep or discard the code change, iterating continuously without human intervention.

In one reported run, the agent completed 700 experiments in 2 days and found 20 optimizations that improved training, including novel architecture tweaks such as reordering QK Norm and RoPE.

The core loop looks like this:

1. Propose a code change
2. Run a fixed-time training experiment
3. Measure the metric (val_bpb = validation bits-per-byte)
4. Keep the change if it improves the metric
5. Revert it if it does not
6. Repeat indefinitely
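
The six steps above can be sketched as a short Python loop. This is a minimal illustration under stated assumptions, not the project's code: `propose_change` and `run_experiment` are hypothetical callables standing in for the agent's edit and the fixed-time training run, a toy objective replaces real training, and the loop stops after a fixed number of experiments instead of running indefinitely.

```python
import random
from typing import Callable

def hill_climb(
    initial_state: float,
    propose_change: Callable[[float], float],
    run_experiment: Callable[[float], float],
    num_experiments: int,
) -> tuple[float, float, list[str]]:
    """Keep a candidate only when it lowers the metric (lower is better)."""
    best_state = initial_state
    best_metric = run_experiment(best_state)    # establish the baseline score
    log = ["baseline"]
    for _ in range(num_experiments):            # step 6: repeat
        candidate = propose_change(best_state)  # step 1: propose a change
        metric = run_experiment(candidate)      # steps 2-3: run and measure
        if metric < best_metric:                # step 4: keep on improvement
            best_state, best_metric = candidate, metric
            log.append("keep")
        else:                                   # step 5: otherwise revert
            log.append("discard")
    return best_state, best_metric, log


# Toy stand-ins (assumptions): the "code" is a single number and the "metric"
# is its squared distance from 3.0. A real run would edit train.py and read
# val_bpb instead.
rng = random.Random(0)
state, metric, log = hill_climb(
    initial_state=0.0,
    propose_change=lambda s: s + rng.uniform(-0.5, 0.5),
    run_experiment=lambda s: (s - 3.0) ** 2,
    num_experiments=200,
)
print(f"best state={state:.3f} metric={metric:.4f} kept={log.count('keep')}")
```

Every candidate is derived from `best_state`, never from a rejected candidate, which is what "revert" means in step 5.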

This is fundamentally different from asking a chatbot a question. Autoresearch is closed-loop. The agent acts on the world, observes the consequences, and adapts. It is the difference between reading a recipe and actually cooking, tasting, and adjusting.

The Autoresearch Three-File Architecture

The entire system is built around a contract between three files, each with strict rules about who can touch it. This simplicity is what makes it work.

autoresearch/
├── prepare.py     # Fixed: data prep, tokenizer, evaluation metric
├── train.py       # Agent's sandbox: model, optimizer, training loop
└── program.md     # Human instructions to the agent (plain Markdown)

Here is what each file does:

File	Who Modifies It	Purpose
`prepare.py`	Nobody (immutable)	Data prep, tokenizer, defines `val_bpb` metric
`train.py`	The AI agent	Model architecture, optimizer, training loop
`program.md`	The human	Instructions, constraints, goals for the agent

prepare.py is immutable. Neither the human nor the agent modifies it, which guarantees that every experiment is measured against the same yardstick. train.py is the agent's sandbox. The agent can rewrite anything here as long as the modified code still trains and produces a val_bpb score. program.md is written in plain Markdown and is the only file the human author touches.
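
One way to picture the contract is as a check run before each experiment. The sketch below is an assumption-laden illustration, not part of autoresearch: the `OWNERS` mapping, the role labels, and the `contract_violations` helper are all hypothetical names invented here.

```python
# Ownership rules taken from the table above. The role labels ("nobody",
# "agent", "human") are chosen for this sketch, not taken from the project.
OWNERS = {
    "prepare.py": "nobody",
    "train.py": "agent",
    "program.md": "human",
}

def contract_violations(changed_files: list[str], actor: str) -> list[str]:
    """Return the changed files that `actor` is not allowed to modify."""
    violations = []
    for path in changed_files:
        # Files outside the three-file contract are treated as off-limits.
        if OWNERS.get(path) != actor:
            violations.append(path)
    return violations

print(contract_violations(["train.py"], actor="agent"))                # []
print(contract_violations(["train.py", "prepare.py"], actor="agent"))  # ['prepare.py']
```

In practice the list of changed files could come from `git diff --name-only`, and a non-empty result would be a reason to reject the experiment before training starts.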

How the Autonomous Hill-Climbing Loop Works

Hill-climbing is a search strategy. Instead of exploring every possibility, the agent only moves in the direction that improves the metric. Each accepted change becomes the new baseline.

The process: create a fresh branch for the run, initialize a results log, and run a baseline to establish the starting score. For each experiment, modify only train.py, commit, run training, parse val_bpb and memory usage, and append a row to results.tsv. If val_bpb improves (lower), keep the commit. If equal or worse, reset back to the last good state. If the run crashes or exceeds a timeout, treat it as a failure and move on.
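
The "run training, parse val_bpb, treat crashes and timeouts as failures" step could look roughly like the sketch below. It rests on assumptions: the helper name `run_experiment` is invented here, and the output format (a line such as `val_bpb: 0.9979`) is assumed, not confirmed from the project. Stand-in commands replace the real training script so the example runs anywhere.

```python
import re
import subprocess
import sys
from typing import Optional

# Assumption: the training script prints a line such as "val_bpb: 0.9979".
VAL_BPB_PATTERN = re.compile(r"val_bpb:\s*([0-9]*\.?[0-9]+)")

def run_experiment(cmd: list[str], timeout_s: float) -> Optional[float]:
    """Run one training command; return val_bpb, or None on crash or timeout."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return None                      # exceeded the time limit: failure
    if proc.returncode != 0:
        return None                      # crashed (for example, out of memory)
    match = VAL_BPB_PATTERN.search(proc.stdout)
    return float(match.group(1)) if match else None

# Stand-in commands so the sketch is self-contained; a real loop would pass
# something like ["uv", "run", "train.py"] with a much longer timeout.
ok = [sys.executable, "-c", "print('val_bpb: 0.9932')"]
crash = [sys.executable, "-c", "raise SystemExit(1)"]
slow = [sys.executable, "-c", "import time; time.sleep(5)"]

print(run_experiment(ok, timeout_s=30))      # 0.9932
print(run_experiment(crash, timeout_s=30))   # None
print(run_experiment(slow, timeout_s=0.5))   # None
```

Returning `None` for every failure mode keeps the caller simple: a missing score is logged as a crash and the loop moves on from the last good commit.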

Here is what a typical results.tsv looks like after a few experiments:

tsv

commit    val_bpb    memory_gb    status    description
a1b2c3d   0.997900   44.0         keep      baseline
b2c3d4e   0.993200   44.2         keep      increase LR to 0.04
c3d4e5f   1.005000   44.0         discard   switch to GeLU activation
d4e5f6g   0.000000   0.0          crash     double model width (OOM)
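
A log in this shape is easy to analyze afterwards. The following sketch reads the sample rows above and finds the best kept run. It assumes the file is tab-separated with exactly these column names, and the helper `best_kept_run` is a hypothetical name.

```python
import csv
import io

# The sample log from above, written with real tab separators.
RESULTS_TSV = (
    "commit\tval_bpb\tmemory_gb\tstatus\tdescription\n"
    "a1b2c3d\t0.997900\t44.0\tkeep\tbaseline\n"
    "b2c3d4e\t0.993200\t44.2\tkeep\tincrease LR to 0.04\n"
    "c3d4e5f\t1.005000\t44.0\tdiscard\tswitch to GeLU activation\n"
    "d4e5f6g\t0.000000\t0.0\tcrash\tdouble model width (OOM)\n"
)

def best_kept_run(tsv_text: str) -> dict[str, str]:
    """Return the kept row with the lowest val_bpb."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Filter on status first: crashed runs are logged as 0.000000 and would
    # otherwise look like the best score.
    kept = [row for row in rows if row["status"] == "keep"]
    return min(kept, key=lambda row: float(row["val_bpb"]))

best = best_kept_run(RESULTS_TSV)
print(best["commit"], best["val_bpb"], best["description"])
# b2c3d4e 0.993200 increase LR to 0.04
```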

The agent keeps a clean commit history. If you want to inspect or revert any experiment, the Git log is your audit trail.

How to Write program.md for AI Agent Instructions

program.md is where you, the human, define the research direction. Think of it as writing a job description for an AI researcher.

Unlike traditional research where humans directly modify code, autoresearch inverts this relationship: humans write instructions in natural language, and the agent translates these into code modifications.

Here is a minimal example of what program.md might look like:

markdown

# Research Goal
Minimize val_bpb on the nanochat training setup.

## Constraints
- Only modify train.py
- Do not change batch size beyond 2x the baseline
- Prefer simpler solutions over complex ones

## Suggested Areas to Explore
- Learning rate schedules
- Attention head configurations
- Weight initialization strategies
- Optimizer hyperparameters

## Loop Instructions
- Run experiments continuously
- Never stop to ask for permission
- Log every result to results.tsv
- Revert any change that does not improve val_bpb

If you want the agent to focus on attention mechanisms, say so in program.md. If you want it to avoid touching the optimizer, add that constraint.

Step-by-Step: Setting Up and Running Autoresearch

Getting started takes less than 10 minutes on a machine with a single GPU.

Step 1: Clone and install dependencies

bash

git clone https://github.com/karpathy/autoresearch
cd autoresearch
uv sync

Step 2: Prepare the data (one-time setup)

bash

uv run prepare.py
# Downloads training data from HuggingFace
# Builds the BPE tokenizer with 8,192-token vocabulary
# Takes approx. 2 minutes

Step 3: Verify your baseline

bash

uv run train.py
# Trains for exactly 5 minutes
# Outputs val_bpb score and peak VRAM usage

Step 4: Start the agent loop

bash

# Launch via your preferred agentic coding tool (Claude Code, Codex CLI, etc.)
# Point it at the repo and say: "Follow program.md and start autoresearch"

The agent only touches train.py. This keeps the scope manageable and diffs reviewable. Training always runs for exactly 5 minutes regardless of your specific platform. This means you can expect approximately 12 experiments per hour and around 100 experiments while you sleep.
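
The throughput figures follow from simple arithmetic, sketched below. The 8-hour night and the `overhead_minutes` parameter (time for startup, evaluation, and the agent's own edits between runs) are assumptions added for illustration.

```python
def experiments_completed(hours: float, train_minutes: float = 5.0,
                          overhead_minutes: float = 0.0) -> int:
    """Whole experiments that fit in `hours` of wall-clock time."""
    minutes_per_experiment = train_minutes + overhead_minutes
    return int(hours * 60 // minutes_per_experiment)

print(experiments_completed(1))                        # 12
print(experiments_completed(8))                        # 96
print(experiments_completed(8, overhead_minutes=1.0))  # 80
```

With zero overhead, 8 hours gives 96 runs, which is where the "around 100" figure comes from. Any per-experiment overhead lowers the count.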

Autoresearch vs. Traditional Frameworks

How does autoresearch compare to other tools commonly used for ML experimentation?

Feature	Autoresearch	Optuna / Ray Tune	SWE-Agent / OpenHands
Search target	Code itself	Hyperparameter space	Arbitrary code tasks
Evaluation	Fixed-time training run	User-defined objective	Task-specific
Human involvement	Write `program.md` once	Define search space	Ongoing orchestration
Experiment tracking	Git commits + TSV log	Built-in dashboards	Varies
Scope	ML training loops	Parameter optimization	General software tasks
Revert mechanism	Git reset	N/A	N/A

General-purpose coding agents like SWE-Agent, OpenHands, and Aider can write arbitrary code but are not built for the experiment-evaluate-keep/revert cycle that ML research actually requires. Hyperparameter tools like Optuna and Ray Tune constrain the search to a predefined parameter space. Autoresearch instead bets on the LLM's general knowledge to propose good experiments, giving up the mathematical guarantees that a constrained search space can offer.

The Shift to Agentic Engineering: From Coding to Directing

Autoresearch is part of a broader change in how engineers work with AI, which Karpathy frames as a natural progression. In February 2026, he coined the term "agentic engineering": 99% of the time you are not writing the code directly; you are orchestrating agents who do, and acting as oversight. Autoresearch takes the next step. The human does not even orchestrate. They describe what good research looks like in a Markdown file and walk away.

The progression looks like this:

Vibe Coding
  Human prompts → AI writes code → Human reviews

Agentic Engineering
  Human orchestrates agents in real time

Autoresearch (Fully Autonomous)
  Human sets direction → Agent runs independently

Usage data points the same way. In Claude Code, the 99.9th-percentile turn duration nearly doubled, from under 25 minutes to over 45 minutes, between October 2025 and January 2026. Longer autonomous turns reduce the need for constant supervision and raise the importance of robust guardrails.

Limitations of Hill-Climbing and Autoresearch

Autoresearch is powerful, but it is not perfect. Know the failure modes before you run it overnight.

Hill-climbing algorithms, including autoresearch, can get stuck in local optima. The agent finds a configuration that is better than its neighbors but far from globally optimal. It keeps making tiny changes, none of which improve the metric, and the loop stalls. There are three mitigations: run multiple loops from different starting points, add randomization to the experiment step, and use the loop to explore before applying human judgment to the best results.
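
The local-optimum problem and the "multiple starting points" mitigation can be shown on a toy example. The landscape below is invented for illustration (lower is better, matching val_bpb), and `climb` is a hypothetical helper: a greedy search that moves to a neighbor only when it improves the score.

```python
def climb(scores: list[float], start: int) -> int:
    """Greedy descent on a 1-D landscape; returns the index where it stalls."""
    position = start
    while True:
        neighbors = [i for i in (position - 1, position + 1)
                     if 0 <= i < len(scores)]
        best_neighbor = min(neighbors, key=lambda i: scores[i])
        if scores[best_neighbor] >= scores[position]:
            return position              # no neighbor improves: stuck here
        position = best_neighbor

# Toy landscape: a shallow dip at index 2 and the global minimum at index 7.
landscape = [5.0, 4.0, 3.0, 4.5, 6.0, 4.0, 2.0, 1.0, 2.5]

single = climb(landscape, start=0)
restarts = [climb(landscape, start=s) for s in (0, 4, 8)]
best = min(restarts, key=lambda i: landscape[i])

print(single, landscape[single])   # 2 3.0
print(best, landscape[best])       # 7 1.0
```

A single climb from index 0 stalls in the shallow dip, while taking the best of three starting points reaches the global minimum. In autoresearch terms, each starting point would be a separate branch with its own baseline.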

Other known failure modes:

Metric gaming: The agent optimizes the score, not your actual goal. If your metric is imperfect, the results will be too.
VRAM blowout: The agent might propose architecture changes that exceed your GPU memory. The crash handling catches this, but it wastes experiment slots.
Narrow search: Hill-climbing does not backtrack. If the early experiments set a bad baseline, later runs stay stuck in that neighborhood.

The deeper pattern autoresearch illustrates is an agent loop with an objective metric and a keep/discard gate. That pattern generalizes beyond ML when three conditions hold: you can define a measurable fitness signal, you can run a controlled experiment repeatedly, and you can automatically decide what survives.

Applying the Autoresearch Pattern Beyond Machine Learning

The autoresearch pattern is not limited to machine learning. Any domain with a measurable metric and repeatable experiments can use it.

The same loop shows up in agent infrastructure frameworks that do recursive self-improvement: an agent logs outcomes, identifies failure patterns, proposes modifications to its own skills or routing logic, tests them, and keeps improvements. The difference is the substrate. Autoresearch operates on ML experiment code and validation loss. Self-improving agent infrastructure operates on tool configs, skill files, and task success rates. Both are try-measure-keep/discard cycles.

Some practical extensions already in use:

Prompt optimization: Run a hill-climbing loop on prompt variants, keep those that improve task scores.
A/B testing pipelines: Mutate UI copy or pricing rules, keep only statistically significant improvements.
Code quality improvement: Iterate on test coverage or lint scores autonomously.
Customer support agents: Score agent traces, classify failures, and keep candidate improvements automatically.

Q&A

1. Do I need multiple GPUs to run autoresearch?

No. The project is built around a simplified single-GPU nanochat training workflow and is aimed at developers and researchers exploring automated model improvement on their own hardware.

2. What metric does autoresearch optimize by default?

The validation metric is val_bpb (validation bits-per-byte). Lower is better. This is defined in prepare.py and never changes, so every experiment is comparable.

3. Can I change what the agent is allowed to modify?

Yes. You specify constraints directly in program.md. You can tell the agent to avoid certain parts of the code, stay within VRAM limits, or focus only on specific components like the optimizer.

4. What happens if a training run crashes?

If the run crashes or exceeds a timeout of 10 minutes, it is treated as a failure. The agent may fix a trivial issue and rerun. Otherwise, it logs "crash" and moves on.

5. How many experiments can I expect per night?

You can expect approximately 12 experiments per hour and around 100 experiments while you sleep on a typical single-GPU setup.

6. Is this the same as hyperparameter search tools like Optuna?

Not exactly. Optuna searches over a predefined parameter space. Autoresearch lets the agent search over the space of code itself, including architecture changes, activation functions, and optimizer logic, which is a much broader search space.

7. Can I use autoresearch with my own training codebase?

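Yes, with some adaptation. The mechanics that carry over are the loop, the mutation step, and the keep/discard logic with its binary evaluation: each change either improves the metric or it does not. You always mutate from the best result so far, not from the latest failed attempt. These principles apply to any repo once you define your own three-file structure: a fixed evaluator, a file the agent may edit, and an instruction file.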

8. What agentic tools work with autoresearch?

The framework is model- and harness-agnostic and applies across Claude Code, OpenCode, Codex CLI, and related CLI agents.

9. What is val_bpb and why is it a good metric?

val_bpb stands for validation bits-per-byte. It measures how well a language model compresses text on a held-out validation set. Lower values mean the model predicts text more accurately. Because the loss is normalized per byte of text rather than per token, the score does not depend on the tokenizer's vocabulary size and stays comparable across different model sizes and architectures, making it well suited to a fixed-time-budget experiment loop.
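
The standard conversion from cross-entropy loss to bits-per-byte is short enough to show. This is a sketch of the general definition, not necessarily the exact code in prepare.py. The function name and the example numbers are hypothetical.

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (in nats) over a text into bits per byte."""
    # Dividing by ln(2) converts nats to bits; dividing by the byte count
    # normalizes by the size of the text instead of the number of tokens.
    return total_nll_nats / (math.log(2) * total_bytes)

# Hypothetical numbers: 1,000 tokens with a mean loss of 2.8 nats per token,
# covering 4,000 bytes of UTF-8 text.
print(round(bits_per_byte(1000 * 2.8, 4000), 4))
```

Two tokenizers that split the same text into different numbers of tokens still divide by the same byte count, which is why the metric survives tokenizer changes.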

10. Is autoresearch safe to run unattended overnight?

It is designed to run unattended, but the results still need review. The instructions in program.md explicitly say "NEVER STOP" so that the agent does not pause to ask for permission, anticipating that the human may be asleep. VRAM is tracked as a soft constraint: some increase is acceptable for meaningful gains, but the design avoids dramatic blowups. Always review the Git log and results.tsv in the morning before using any resulting model.

