$timemachine_
AI Agent Observability

Time Machine
Debug the past. Fork the future.

Capture every agent step. Fork from any point. Replay with one click.

Works with Claude Code — zero config
Get Started · Sign Up
~/my-project $ claude code
Claude Code
Fork & Replay
Step-by-Step Tracing
Visual Diff
Time Travel
3 SIMPLE STEPS

How It Works

From first install to your first debugged agent — in minutes, not hours.

01

Capture

Install the SDK or connect Claude Code hooks. Every agent step — LLM calls, tool uses, decisions — is automatically recorded.
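As a sketch of what step-level capture looks like in principle (the `Trace` and `Step` names below are illustrative, not the actual SDK API):

```python
import time
from dataclasses import dataclass, field

@dataclass
class Step:
    kind: str          # "llm_call", "tool", "decision", ...
    name: str
    inputs: dict
    output: object = None
    latency_ms: float = 0.0

@dataclass
class Trace:
    steps: list = field(default_factory=list)

    def record(self, kind, name):
        """Decorator that appends one Step per call to the trace."""
        def decorator(fn):
            def wrapper(*args, **kwargs):
                start = time.perf_counter()
                result = fn(*args, **kwargs)
                self.steps.append(Step(
                    kind=kind,
                    name=name,
                    inputs={"args": args, "kwargs": kwargs},
                    output=result,
                    latency_ms=(time.perf_counter() - start) * 1000,
                ))
                return result
            return wrapper
        return decorator

trace = Trace()

@trace.record("tool", "search_db")
def search_db(query):
    return ["alice", "bob"]  # stand-in for a real database call

search_db("users")
print(trace.steps[0].name, trace.steps[0].kind)  # search_db tool
```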

02

Replay

Visualize every execution as an interactive timeline. Click any step to inspect inputs, outputs, tokens, and costs. Watch it unfold in real-time.

03

Fork & Fix

Found the bug? Fork from that exact step. Change one variable, replay the rest. Compare original vs. fixed side-by-side — at a fraction of the cost.

CLAUDE CODE INTEGRATION

Every Claude Code Session. Automatically Captured.

Connect Time Machine to Claude Code with a single command. Every prompt, tool call, and file edit is recorded — no code changes needed.

~/my-project $ claude code
Claude Code → Hooks → Time Machine API → Dashboard

Zero-config setup

One command installs hooks into Claude Code

Full session capture

Prompts, tool calls, file edits, errors

Subagent tracking

See when Claude spawns subagents and what they do

Cost visibility

Token counts and costs per session, per step

Query from Claude Code

Ask Claude Code to pull a failed run and inspect the trace. The debugging loop stays where the development loop lives.
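Under the hood, Claude Code hooks live in a project's `.claude/settings.json`. A sketch of what an installed capture hook might look like — the `timemachine-capture` command is hypothetical, and the exact hook schema may vary by Claude Code version:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "*",
        "hooks": [
          {
            "type": "command",
            "command": "timemachine-capture --event post-tool-use"
          }
        ]
      }
    ]
  }
}
```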

FORK & REPLAY

Replay From the Fork Point, Not From Scratch

Fork any execution at any step. Modify the input. Only the steps after the fork point are re-executed — prior steps are reused instantly. No wasted compute, no waiting for the whole pipeline to run again.

Fork from any step — prior steps are reused, not re-run
Only replay the steps that actually need to change
Save 40-80% on compute costs with partial replay
Compare original vs forked output side-by-side
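The partial-replay idea can be sketched in a few lines, assuming a recorded trace is simply a list of cached step outputs (the pipeline below is a toy, not the real replay engine):

```python
# Steps before the fork point are reused as-is; only the steps from the
# fork point onward are re-executed with the modified input.
def replay_with_fork(recorded, steps, fork_index, modified_input):
    outputs = list(recorded[:fork_index])  # reuse cached results, no re-run
    state = modified_input                 # the one variable we changed
    for step in steps[fork_index:]:
        state = step(state)
        outputs.append(state)
    return outputs

# Toy 5-step pipeline forked at step 3 (index 2): steps 1-2 are skipped.
steps = [
    None,  # step 1 never runs again; its output is cached below
    None,  # step 2 never runs again; its output is cached below
    lambda confidence: f"route(confidence={confidence})",
    lambda prev: f"llm({prev})",
    lambda prev: f"final({prev})",
]
recorded = ["analyze", "search_db:users", "route(confidence=0.72)",
            "llm(...)", "final(...)"]

forked = replay_with_fork(recorded, steps, fork_index=2, modified_input=0.95)
print(forked[-1])  # final(llm(route(confidence=0.95)))
```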
execution_7f3a · 5 steps

Original
1. LLM Call · gpt-4o · 1,234 tok
2. Tool: search_db · query users · 45ms
3. Route Decision · confidence: 0.72 · fork point
4. LLM Call · generate response · 892 tok
5. Final Output · sent to user · $0.03

Forked (steps 1-2 reused instantly)
3. Route Decision · modified · confidence: 0.95
4. LLM Call · new prompt · 1,102 tok
5. New Output · improved result · $0.02

Replaying from step 3 (skipping steps 1-2)...
2 steps skipped · 40% faster · $0.02 saved vs full re-run
TIMELINE SCRUBBER

Scrub Through Every Agent Decision

Watch your agent execute like a video. Click anywhere on the timeline to jump to that moment — see exactly which files were read, what edits were made, and why the agent took each action.

Color-coded event blocks: reads, edits, bash, agent messages, user prompts
Click or drag the scrubber to jump to any point in the session
Dense clusters reveal activity bursts — sparse gaps show LLM thinking time
Step forward or backward one event at a time with full event detail
Session Replay
aurora-api · Opus 4.6 · 3:14 · fp 21/20
Event 1/20 · 0:00
user: Add rate limiting to auth endpoints
Read · Edit · Bash · Agent · User
EVAL PLATFORM

Ship with Confidence.
Test Every Change.

Define test suites from real production inputs. Assert on outputs. Gate deployments on passing scores. Every eval run is a replay — powered by the same fork & replay engine.

Customer Support Quality · Ready
Test Cases (4) · assertions
Password Reset · Billing Inquiry · Complex Query · Edge Case
Suite Score
contains · llm_judge · regex · latency_under · cost_under · json_valid

Automated Pipeline

PR Opened → Run Suite → Score → Gate

10 Assertion Types

contains, regex, llm_judge, cost_under, latency_under, json_valid, and more.
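To make the assertion types concrete, here is a hedged sketch of how a few of them could be evaluated; the dict-based output and case format is illustrative, not the product's schema:

```python
import json
import re

def _json_valid(text):
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

# One check function per assertion type named above.
CHECKS = {
    "contains":      lambda out, arg: arg in out["text"],
    "regex":         lambda out, arg: re.search(arg, out["text"]) is not None,
    "cost_under":    lambda out, arg: out["cost"] < arg,
    "latency_under": lambda out, arg: out["latency_ms"] < arg,
    "json_valid":    lambda out, arg: _json_valid(out["text"]),
}

def run_case(output, assertions):
    """Evaluate each (type, argument) assertion against one output."""
    return {name: CHECKS[name](output, arg) for name, arg in assertions}

output = {"text": '{"status": "reset link sent"}',
          "cost": 0.021, "latency_ms": 840}
results = run_case(output, [
    ("contains", "reset link"),
    ("cost_under", 0.05),
    ("latency_under", 2000),
    ("json_valid", None),
])
print(all(results.values()))  # True
```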

CI/CD Quality Gates

Block merges on eval regressions. GitHub Actions integration in 5 minutes.

LLM-as-Judge

Grade subjective output quality with a rubric. Quantify what "good" means.

Cost & Latency Guards

Assert that every run stays under budget and within latency SLAs.

Save from Production

Click any dashboard execution → "Save as eval case". Real data, zero authoring.

Fork & Replay Powered

Each eval run replays your agent via fork — same infra, deterministic results.

STEP-BY-STEP TRACKING

Complete Execution Visibility

Every action your agent takes is captured — inputs, outputs, LLM prompts, tool calls, state snapshots, and cost breakdowns. Nothing hidden.

Full state snapshot at each execution step
Token usage and cost breakdown per step
Supports LLM calls, tool use, decisions, retrievals
Query all execution data via PostgreSQL
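For a feel of the kind of query this enables, a sketch follows. The real store is PostgreSQL, but sqlite3 stands in here so the example runs anywhere; the table and column names are hypothetical, not the actual schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE steps (
    execution_id TEXT, idx INTEGER, kind TEXT,
    tokens INTEGER, cost REAL, latency_ms INTEGER)""")
db.executemany("INSERT INTO steps VALUES (?, ?, ?, ?, ?, ?)", [
    ("exec_7f3a", 1, "llm_call", 1234, 0.02, 320),
    ("exec_7f3a", 2, "tool_use",    0, 0.00,  45),
    ("exec_7f3a", 3, "llm_call",  892, 0.01, 280),
])

# Example: total cost and slowest step per execution.
row = db.execute("""SELECT execution_id, ROUND(SUM(cost), 2), MAX(latency_ms)
                    FROM steps GROUP BY execution_id""").fetchone()
print(row)  # ('exec_7f3a', 0.03, 320)
```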
Execution Steps · completed
1. LLM Call · gpt-4o · 1,234 tok
Input: Analyze user request...
Output: The user wants to...
Cost: $0.02 · Latency: 320ms
2. Tool Use · search_db
Input: { "query": "..." }
Output: { "results": [...] }
3. Decision
Input: context_state
Output: proceed_with_action
EXECUTION TIMELINE

See the Full Picture at a Glance

Gantt chart and trace tree views show timing, dependencies, and bottlenecks across your entire execution. Spot slow steps instantly — find the 200ms tool call hiding behind a 2s LLM call.

Cascading Gantt bars color-coded by step type
Collapsible trace tree with parent-child hierarchy
Click any bar to inspect step details and metrics
Zoom, pan, and keyboard navigation for large executions
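Spotting a short span hidden inside a longer concurrent one can be sketched like this; the span tuples are illustrative, not the product's data model:

```python
# A span is (name, start_ms, end_ms). A span is "hidden" when it sits
# entirely inside a strictly longer span, like a tool call running
# concurrently under a long LLM call.
def hidden_spans(spans):
    return [a for a in spans for b in spans
            if a is not b
            and b[1] <= a[1] and a[2] <= b[2]   # a fully inside b
            and (a[2] - a[1]) < (b[2] - b[1])]  # and strictly shorter

spans = [
    ("LLM: Synthesize", 0, 2000),
    ("Tool: fetch_api", 300, 500),
    ("Decision: Route", 2000, 2010),
]
print([s[0] for s in hidden_spans(spans)])  # ['Tool: fetch_api']
```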
Execution Timeline
Gantt · Tree
LLM: Analyze · 45ms
Tool: search_db
Tool: fetch_api · 28ms
LLM: Synthesize · 62ms
Decision: Route
LLM: Generate · 55ms
Legend: LLM Call · Tool Use · Decision
VISUAL DIFF

Compare Models Side-by-Side

Run the same prompt through different models simultaneously. See outputs side-by-side with diff highlighting, metrics comparison, and cost analysis.

Dual-pane comparison: GPT-4o, Claude, Gemini, and more
Diff view highlights added and removed text
Token usage, latency, and cost metrics per model
Save and export comparison reports
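The diff highlighting can be approximated with Python's standard difflib, using the sample outputs from the comparison as input:

```python
import difflib

a = "The analysis shows that revenue increased by 15% compared to last quarter."
b = "The analysis shows that revenue grew by 18% compared to last quarter."

# Word-level diff like the dual-pane view: '-' marks removed text,
# '+' marks added text ('?' hint lines are filtered out).
diff = [t for t in difflib.ndiff(a.split(), b.split()) if t[0] in "+-"]
print(diff)  # ['- increased', '+ grew', '- 15%', '+ 18%']
```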
Model Comparison
Diff View: On

GPT-4o · $0.024
"The analysis shows that revenue increased by 15% compared to last quarter."
892 tokens · 1.2s

Claude 3.5 · $0.018
"The analysis shows that revenue grew by 18% compared to last quarter."
756 tokens · 0.9s

Legend: Removed · Added
DATA DRIFT DETECTION

Catch Output Drift Before Users Do

Automatically detect when agent outputs change for the same inputs. Pinpoint whether drift comes from data, model, or prompt changes — before your users notice.

Auto-detect output drift across same-name executions
Variable analysis: model, prompt, retrieved data, tools
Visual divergence timeline showing exact drift point
Export drift reports for investigation and resolution
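The variable analysis described above boils down to comparing the recorded variables of two runs. A minimal sketch, with illustrative run data standing in for captured executions:

```python
VARIABLES = ("model", "system_prompt", "retrieved_data", "output")

def drift_analysis(run_a, run_b):
    """Mark each recorded variable as identical or changed between runs."""
    return {v: "identical" if run_a[v] == run_b[v] else "changed"
            for v in VARIABLES}

run_a = {"model": "gpt-4o", "system_prompt": "You are a helpful analyst.",
         "retrieved_data": ["q4_report.csv"], "output": "revenue up 15%"}
run_b = {"model": "gpt-4o", "system_prompt": "You are a helpful analyst.",
         "retrieved_data": ["q4_report_v2.csv"], "output": "revenue up 18%"}

analysis = drift_analysis(run_a, run_b)
changed = [v for v, state in analysis.items() if state == "changed"]
print(changed)  # ['retrieved_data', 'output']
```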
Drift Detection
Run A (Feb 8) vs Run B (Feb 10)
Steps: Parse Input → LLM: Classify → Tool: fetch_data → LLM: Synthesize → Final Output

Variable Analysis
Model: identical
System Prompt: identical
Retrieved Data: changed
Output: changed

Root cause: Retrieved data changed between runs
REVIEW QUEUE

Feedback Loop That Closes

Human reviewers mark outputs as correct or wrong. Developers get automated debug packages. Replay & Validate confirms fixes actually work.

Pending → Wrong → Resolved workflow
One-click debug package generation for developers
Replay with automatic validation (pass/fail)
Keyboard shortcuts for rapid review (C/W)
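The workflow can be modeled as a tiny state machine; the state and event names below are illustrative, not the product's API:

```python
# Pending -> Wrong -> Resolved, with a direct path for correct outputs.
TRANSITIONS = {
    ("pending", "mark_correct"): "resolved",
    ("pending", "mark_wrong"): "wrong",
    ("wrong", "replay_validated"): "resolved",
}

def advance(state, event):
    """Apply an event; unknown (state, event) pairs leave state unchanged."""
    return TRANSITIONS.get((state, event), state)

state = "pending"
state = advance(state, "mark_wrong")        # reviewer presses W
state = advance(state, "replay_validated")  # fix confirmed by replay
print(state)  # resolved
```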
Review Queue
C = correct · W = wrong
exec_001 · 2m ago
exec_002 · 5m ago
WHY TIME MACHINE

Not Just Observability. Debuggability.

Most tools show you what happened. Time Machine lets you change what happened.

Feature · Others · Time Machine
Execution logging
Step-by-step traces
Cost tracking
Fork from any step
Replay with modifications
Visual diff comparison
Claude Code + MCP integration
Eval suites with CI/CD gates
Data drift detection

Ready to Debug Smarter?

Start capturing your agent executions in under 2 minutes. Free to get started.