$timemachine_
AI Agent Observability

Time Machine
Debug the past. Fork the future.

Capture every agent step. Fork from any point. Replay with one click.

Works with Claude Code — zero config
Get Started · Sign Up
~/my-project $ claude code
Claude Code
Fork & Replay
Step-by-Step Tracing
Visual Diff
Time Travel
3 SIMPLE STEPS

How It Works

From first install to your first debugged agent — in minutes, not hours.

01

Capture

Install the SDK or connect Claude Code hooks. Every agent step — LLM calls, tool uses, decisions — is automatically recorded.
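As a sketch of what step-level capture looks like in principle (the `Trace` and `Step` names below are illustrative, not the actual SDK API):

```python
import time
from dataclasses import dataclass, field

@dataclass
class Step:
    kind: str          # "llm_call", "tool", "decision", ...
    name: str
    inputs: dict
    output: object = None
    latency_ms: float = 0.0

@dataclass
class Trace:
    steps: list = field(default_factory=list)

    def record(self, kind, name):
        """Decorator that appends one Step per call to the trace."""
        def decorator(fn):
            def wrapper(*args, **kwargs):
                start = time.perf_counter()
                result = fn(*args, **kwargs)
                self.steps.append(Step(
                    kind=kind,
                    name=name,
                    inputs={"args": args, "kwargs": kwargs},
                    output=result,
                    latency_ms=(time.perf_counter() - start) * 1000,
                ))
                return result
            return wrapper
        return decorator

trace = Trace()

@trace.record("tool", "search_db")
def search_db(query):
    return ["alice", "bob"]  # stand-in for a real database call

search_db("users")
print(trace.steps[0].name, trace.steps[0].kind)  # search_db tool
```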

02

Replay

Visualize every execution as an interactive timeline. Click any step to inspect inputs, outputs, tokens, and costs. Watch it unfold in real-time.

03

Fork & Fix

Found the bug? Fork from that exact step. Change one variable, replay the rest. Compare original vs. fixed side-by-side — at a fraction of the cost.

CLAUDE CODE INTEGRATION

Every Claude Code Session. Automatically Captured.

Connect Time Machine to Claude Code with a single command. Every prompt, tool call, and file edit is recorded — no code changes needed.

~/my-project $ claude code
Claude Code → Hooks → Time Machine API → Dashboard

Zero-config setup

One command installs hooks into Claude Code

Full session capture

Prompts, tool calls, file edits, errors

Subagent tracking

See when Claude spawns subagents and what they do

Cost visibility

Token counts and costs per session, per step

Query from Claude Code

Ask Claude Code to pull a failed run and inspect the trace. The debugging loop stays where the development loop lives.
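Under the hood, Claude Code hooks live in a project's `.claude/settings.json`. A sketch of what an installed capture hook might look like — the `timemachine-capture` command is hypothetical, and the exact hook schema may vary by Claude Code version:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "*",
        "hooks": [
          {
            "type": "command",
            "command": "timemachine-capture --event post-tool-use"
          }
        ]
      }
    ]
  }
}
```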

FORK & REPLAY

Replay From the Fork Point, Not From Scratch

Fork any execution at any step. Modify the input. Only the steps after the fork point are re-executed — prior steps are reused instantly. No wasted compute, no waiting for the whole pipeline to run again.

Fork from any step — prior steps are reused, not re-run
Only replay the steps that actually need to change
Save 40-80% on compute costs with partial replay
Compare original vs forked output side-by-side
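The partial-replay idea can be sketched in a few lines, assuming a recorded trace is simply a list of cached step outputs (the pipeline below is a toy, not the real replay engine):

```python
# Steps before the fork point are reused as-is; only the steps from the
# fork point onward are re-executed with the modified input.
def replay_with_fork(recorded, steps, fork_index, modified_input):
    outputs = list(recorded[:fork_index])  # reuse cached results, no re-run
    state = modified_input                 # the one variable we changed
    for step in steps[fork_index:]:
        state = step(state)
        outputs.append(state)
    return outputs

# Toy 5-step pipeline forked at step 3 (index 2): steps 1-2 are skipped.
steps = [
    None,  # step 1 never runs again; its output is cached below
    None,  # step 2 never runs again; its output is cached below
    lambda confidence: f"route(confidence={confidence})",
    lambda prev: f"llm({prev})",
    lambda prev: f"final({prev})",
]
recorded = ["analyze", "search_db:users", "route(confidence=0.72)",
            "llm(...)", "final(...)"]

forked = replay_with_fork(recorded, steps, fork_index=2, modified_input=0.95)
print(forked[-1])  # final(llm(route(confidence=0.95)))
```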
execution_7f3a · 5 steps

Original
1. LLM Call · gpt-4o · 1,234 tok
2. Tool: search_db · query users · 45ms
3. Route Decision · confidence: 0.72 · fork point
4. LLM Call · generate response · 892 tok
5. Final Output · sent to user · $0.03

Forked (steps 1-2 reused instantly)
3. Route Decision · modified · confidence: 0.95
4. LLM Call · new prompt · 1,102 tok
5. New Output · improved result · $0.02

Replaying from step 3 (skipping steps 1-2)...
2 steps skipped · 40% faster · $0.02 saved vs full re-run
TIMELINE SCRUBBER

Scrub Through Every Agent Decision

Watch your agent execute like a video. Click anywhere on the timeline to jump to that moment — see exactly which files were read, what edits were made, and why the agent took each action.

Color-coded event blocks: reads, edits, bash, agent messages, user prompts
Click or drag the scrubber to jump to any point in the session
Dense clusters reveal activity bursts — sparse gaps show LLM thinking time
Step forward or backward one event at a time with full event detail
Session Replay
aurora-api · Opus 4.6 · 3:14 · fp 21/20
Event 1/20 · 0:00
user: Add rate limiting to auth endpoints
Read · Edit · Bash · Agent · User
EVAL PLATFORM

Ship with Confidence.
Test Every Change.

Define test suites from real production inputs. Assert on outputs. Gate deployments on passing scores. Every eval run is a replay — powered by the same fork & replay engine.

Customer Support Quality · Ready
Test Cases (4) · assertions
Password Reset · Billing Inquiry · Complex Query · Edge Case
Suite Score
contains · llm_judge · regex · latency_under · cost_under · json_valid

Automated Pipeline

PR Opened → Run Suite → Score → Gate

10 Assertion Types

contains, regex, llm_judge, cost_under, latency_under, json_valid, and more.
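To make the assertion types concrete, here is a hedged sketch of how a few of them could be evaluated; the dict-based output and case format is illustrative, not the product's schema:

```python
import json
import re

def _json_valid(text):
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

# One check function per assertion type named above.
CHECKS = {
    "contains":      lambda out, arg: arg in out["text"],
    "regex":         lambda out, arg: re.search(arg, out["text"]) is not None,
    "cost_under":    lambda out, arg: out["cost"] < arg,
    "latency_under": lambda out, arg: out["latency_ms"] < arg,
    "json_valid":    lambda out, arg: _json_valid(out["text"]),
}

def run_case(output, assertions):
    """Evaluate each (type, argument) assertion against one output."""
    return {name: CHECKS[name](output, arg) for name, arg in assertions}

output = {"text": '{"status": "reset link sent"}',
          "cost": 0.021, "latency_ms": 840}
results = run_case(output, [
    ("contains", "reset link"),
    ("cost_under", 0.05),
    ("latency_under", 2000),
    ("json_valid", None),
])
print(all(results.values()))  # True
```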

CI/CD Quality Gates

Block merges on eval regressions. GitHub Actions integration in 5 minutes.

LLM-as-Judge

Grade subjective output quality with a rubric. Quantify what "good" means.

Cost & Latency Guards

Assert that every run stays under budget and within latency SLAs.

Save from Production

Click any dashboard execution → "Save as eval case". Real data, zero authoring.

Fork & Replay Powered

Each eval run replays your agent via fork — same infra, deterministic results.

STEP-BY-STEP TRACKING

Complete Execution Visibility

Every action your agent takes is captured — inputs, outputs, LLM prompts, tool calls, state snapshots, and cost breakdowns. Nothing hidden.

Full state snapshot at each execution step
Token usage and cost breakdown per step
Supports LLM calls, tool use, decisions, retrievals
Query all execution data via PostgreSQL
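For a feel of the kind of query this enables, a sketch follows. The real store is PostgreSQL, but sqlite3 stands in here so the example runs anywhere; the table and column names are hypothetical, not the actual schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE steps (
    execution_id TEXT, idx INTEGER, kind TEXT,
    tokens INTEGER, cost REAL, latency_ms INTEGER)""")
db.executemany("INSERT INTO steps VALUES (?, ?, ?, ?, ?, ?)", [
    ("exec_7f3a", 1, "llm_call", 1234, 0.02, 320),
    ("exec_7f3a", 2, "tool_use",    0, 0.00,  45),
    ("exec_7f3a", 3, "llm_call",  892, 0.01, 280),
])

# Example: total cost and slowest step per execution.
row = db.execute("""SELECT execution_id, ROUND(SUM(cost), 2), MAX(latency_ms)
                    FROM steps GROUP BY execution_id""").fetchone()
print(row)  # ('exec_7f3a', 0.03, 320)
```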
Execution Steps · completed
1. LLM Call · gpt-4o · 1,234 tok
Input: Analyze user request...
Output: The user wants to...
Cost: $0.02 · Latency: 320ms
2. Tool Use · search_db
Input: { "query": "..." }
Output: { "results": [...] }
3. Decision
Input: context_state
Output: proceed_with_action
EXECUTION TIMELINE

See the Full Picture at a Glance

Gantt chart and trace tree views show timing, dependencies, and bottlenecks across your entire execution. Spot slow steps instantly — find the 200ms tool call hiding behind a 2s LLM call.

Cascading Gantt bars color-coded by step type
Collapsible trace tree with parent-child hierarchy
Click any bar to inspect step details and metrics
Zoom, pan, and keyboard navigation for large executions
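Spotting a short span hidden inside a longer concurrent one can be sketched like this; the span tuples are illustrative, not the product's data model:

```python
# A span is (name, start_ms, end_ms). A span is "hidden" when it sits
# entirely inside a strictly longer span, like a tool call running
# concurrently under a long LLM call.
def hidden_spans(spans):
    return [a for a in spans for b in spans
            if a is not b
            and b[1] <= a[1] and a[2] <= b[2]   # a fully inside b
            and (a[2] - a[1]) < (b[2] - b[1])]  # and strictly shorter

spans = [
    ("LLM: Synthesize", 0, 2000),
    ("Tool: fetch_api", 300, 500),
    ("Decision: Route", 2000, 2010),
]
print([s[0] for s in hidden_spans(spans)])  # ['Tool: fetch_api']
```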
Execution Timeline
Gantt · Tree
LLM: Analyze · 45ms
Tool: search_db
Tool: fetch_api · 28ms
LLM: Synthesize · 62ms
Decision: Route
LLM: Generate · 55ms
Legend: LLM Call · Tool Use · Decision
VISUAL DIFF

Compare Models Side-by-Side

Run the same prompt through different models simultaneously. See outputs side-by-side with diff highlighting, metrics comparison, and cost analysis.

Dual-pane comparison: GPT-4o, Claude, Gemini, and more
Diff view highlights added and removed text
Token usage, latency, and cost metrics per model
Save and export comparison reports
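The diff highlighting can be approximated with Python's standard difflib, using the sample outputs from the comparison as input:

```python
import difflib

a = "The analysis shows that revenue increased by 15% compared to last quarter."
b = "The analysis shows that revenue grew by 18% compared to last quarter."

# Word-level diff like the dual-pane view: '-' marks removed text,
# '+' marks added text ('?' hint lines are filtered out).
diff = [t for t in difflib.ndiff(a.split(), b.split()) if t[0] in "+-"]
print(diff)  # ['- increased', '+ grew', '- 15%', '+ 18%']
```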
Model Comparison
Diff View: On

GPT-4o · $0.024
"The analysis shows that revenue increased by 15% compared to last quarter."
892 tokens · 1.2s

Claude 3.5 · $0.018
"The analysis shows that revenue grew by 18% compared to last quarter."
756 tokens · 0.9s

Legend: Removed · Added
DATA DRIFT DETECTION

Catch Output Drift Before Users Do

Automatically detect when agent outputs change for the same inputs. Pinpoint whether drift comes from data, model, or prompt changes — before your users notice.

Auto-detect output drift across same-name executions
Variable analysis: model, prompt, retrieved data, tools
Visual divergence timeline showing exact drift point
Export drift reports for investigation and resolution
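The variable analysis described above boils down to comparing the recorded variables of two runs. A minimal sketch, with illustrative run data standing in for captured executions:

```python
VARIABLES = ("model", "system_prompt", "retrieved_data", "output")

def drift_analysis(run_a, run_b):
    """Mark each recorded variable as identical or changed between runs."""
    return {v: "identical" if run_a[v] == run_b[v] else "changed"
            for v in VARIABLES}

run_a = {"model": "gpt-4o", "system_prompt": "You are a helpful analyst.",
         "retrieved_data": ["q4_report.csv"], "output": "revenue up 15%"}
run_b = {"model": "gpt-4o", "system_prompt": "You are a helpful analyst.",
         "retrieved_data": ["q4_report_v2.csv"], "output": "revenue up 18%"}

analysis = drift_analysis(run_a, run_b)
changed = [v for v, state in analysis.items() if state == "changed"]
print(changed)  # ['retrieved_data', 'output']
```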
Drift Detection
Run A (Feb 8) vs Run B (Feb 10)
Steps: Parse Input → LLM: Classify → Tool: fetch_data → LLM: Synthesize → Final Output

Variable Analysis
Model: identical
System Prompt: identical
Retrieved Data: changed
Output: changed

Root cause: Retrieved data changed between runs
REVIEW QUEUE

Feedback Loop That Closes

Human reviewers mark outputs as correct or wrong. Developers get automated debug packages. Replay & Validate confirms fixes actually work.

Pending → Wrong → Resolved workflow
One-click debug package generation for developers
Replay with automatic validation (pass/fail)
Keyboard shortcuts for rapid review (C/W)
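The workflow can be modeled as a tiny state machine; the state and event names below are illustrative, not the product's API:

```python
# Pending -> Wrong -> Resolved, with a direct path for correct outputs.
TRANSITIONS = {
    ("pending", "mark_correct"): "resolved",
    ("pending", "mark_wrong"): "wrong",
    ("wrong", "replay_validated"): "resolved",
}

def advance(state, event):
    """Apply an event; unknown (state, event) pairs leave state unchanged."""
    return TRANSITIONS.get((state, event), state)

state = "pending"
state = advance(state, "mark_wrong")        # reviewer presses W
state = advance(state, "replay_validated")  # fix confirmed by replay
print(state)  # resolved
```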
Review Queue
C = correct · W = wrong
exec_001 · 2m ago
exec_002 · 5m ago
WHY TIME MACHINE

Not Just Observability. Debuggability.

Most tools show you what happened. Time Machine lets you change what happened.

Feature · Others · Time Machine
Execution logging
Step-by-step traces
Cost tracking
Fork from any step
Replay with modifications
Visual diff comparison
Claude Code + MCP integration
Eval suites with CI/CD gates
Data drift detection

Ready to Debug Smarter?

Start capturing your agent executions in under 2 minutes. Free to get started.