Running Tests

Aileron leans heavily on the test suite. Every PR runs the full Go test set on Linux and Windows, the docs build, the UI tests, and a Playwright-driven end-to-end suite. This page is the contributor’s view of the same surface.

Run everything

task test

This runs the Go suite, the webapp tests, the UI tests, and the docs tests. Expect ~5–10 minutes on a modern machine. It does not bring up the stack, so it excludes both task test:integration (the Go HTTP/API suite) and the Playwright E2E suite. CI runs those in dedicated jobs that provision a running stack first.

For a faster inner loop, run individual targets:

task test:go              # Go unit tests across the workspace
task test:go:cover        # Go unit tests with coverage summary
task test:go:ci           # what CI runs: race + coverage + JUnit
task test:docs            # docs site unit tests (rehype plugins, etc.)
task test:ui              # UI unit and component tests
task test:integration     # Go HTTP/API integration suite against a running daemon
task test:e2e:integration # Playwright E2E against a real stack

task test:integration needs a daemon already listening on localhost:8080. It does not start the stack itself, and task test deliberately excludes it. Run it standalone against a cold machine and every HTTP test fails with dial tcp [::1]:8080: connect: connection refused. Bring the stack up first. The simplest path is:

task test:integration:coverage  # brings the compose stack up --wait, runs the Go suite, then tears it down

Or run the explicit sequence that task ci uses:

task up -- -d --build --wait
task test:integration
task down

Run a single Go package

The Taskfile’s test:go target wraps go test across the workspace. For tight iteration on one package, go test directly is faster:

go test ./internal/sandbox/...
go test ./internal/cstore -run TestForwarder -v
go test ./internal/wrap -coverprofile=/tmp/cov.out

The go.work workspace handles module resolution; no manual cd required.

Race detector

go test ./internal/sandbox -race

The sandbox package has the densest concurrency surface (per-invocation state, shared executor, audit emission). Run with -race whenever you change any of those paths. CI does the same automatically.

The race detector depends on a C runtime, so every -race task (test:go:ci, test:integration, test:integration:coverage, and the sandbox-integration targets) needs CGO_ENABLED=1 plus a C compiler on PATH. macOS and Linux satisfy this once the Xcode Command Line Tools or your distro’s gcc is installed. On a stock Windows host, Go defaults to CGO_ENABLED=0 with no compiler present, so go test -race aborts with exit status 2 before any test runs. Install MinGW-w64 (scoop install mingw or choco install mingw) and set CGO_ENABLED=1. CI’s Windows runners already ship MinGW-w64 gcc, so this local setup matches CI rather than working around it. See Building from Source for the per-OS install commands.
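
The race detector only reports on code paths that actually run concurrently during the test, so a test has to exercise shared state from several goroutines before -race has anything to observe. A minimal sketch of that shape follows. The counter type and hammer helper are hypothetical stand-ins for shared state and are not taken from the sandbox package.

```go
package main

import (
	"sync"
	"testing"
)

// counter is a hypothetical stand-in for state shared across invocations.
type counter struct {
	mu sync.Mutex
	n  int
}

// Inc increments the counter under the mutex.
func (c *counter) Inc() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n++
}

// Value returns the current count under the mutex.
func (c *counter) Value() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.n
}

// hammer calls Inc from `workers` goroutines, `perWorker` times each,
// and returns once all of them have finished.
func hammer(c *counter, workers, perWorker int) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < perWorker; j++ {
				c.Inc()
			}
		}()
	}
	wg.Wait()
}

// TestCounterConcurrent drives the shared state from several goroutines.
// If the mutex were removed from Inc, `go test -race` would flag the
// unsynchronized writes to n.
func TestCounterConcurrent(t *testing.T) {
	c := &counter{}
	hammer(c, 8, 1000)
	if got := c.Value(); got != 8000 {
		t.Fatalf("Value() = %d, want 8000", got)
	}
}
```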

Coverage

The project doesn’t pin a hard coverage threshold, but the convention is:

>80% on new code is the working bar.
Don’t chase metrics on filesystem-error wrappers, concurrent-install race recovery, or other paths where the test fixture would be brittle and not catch real bugs. Tests should strengthen Aileron, not satisfy a metric.
Bug fixes need a regression test that fails before the fix and passes after.
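
As an illustration of the regression-test convention, the sketch below pins one exact failing input from an imagined bug report. Everything in it is hypothetical: splitAtDoubleDash is not an Aileron function, and the "pre-fix" behaviour is invented for the example.

```go
package main

import (
	"reflect"
	"testing"
)

// splitAtDoubleDash splits args at the first "--" into the part before it
// and the part after it. Hypothetical helper, used only for illustration.
func splitAtDoubleDash(args []string) (before, after []string) {
	for i, a := range args {
		if a == "--" {
			// args[i+1:] is valid even when "--" is the last element;
			// it yields an empty slice.
			return args[:i], args[i+1:]
		}
	}
	return args, nil
}

// TestSplitAtDoubleDash_TrailingSeparator pins the imagined bug: a trailing
// "--" with nothing after it. Against an imagined pre-fix version that read
// args[i+1] directly, this input would panic; with the fix it passes.
func TestSplitAtDoubleDash_TrailingSeparator(t *testing.T) {
	before, after := splitAtDoubleDash([]string{"codex", "--"})
	if !reflect.DeepEqual(before, []string{"codex"}) {
		t.Fatalf("before = %v, want [codex]", before)
	}
	if len(after) != 0 {
		t.Fatalf("after = %v, want empty", after)
	}
}
```

Naming the test after the failing condition keeps the link between the test and the original report visible when the test fails again later.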

To see what’s covered:

task test:go:cover

Or for a single package:

go test ./internal/cstore -coverprofile=/tmp/cov.out
go tool cover -func=/tmp/cov.out
go tool cover -html=/tmp/cov.out  # opens an HTML report in the browser

Linting

task lint        # everything
task lint:go     # go vet across the workspace
task lint:docs   # docs site type-check
task lint:webapp # webapp type-check

golangci-lint is recommended but not required locally. CI runs go vet plus a stricter check.

Reproducing a CI failure

CI’s Go suite runs with task test:go:ci. To reproduce locally:

task test:go:ci

This runs with -race, full coverage, and JUnit output (under test-results/). Most CI failures reproduce on the first run.

If a test passes locally but fails in CI, the usual suspects are:

Goroutine leaks or races: these surface under -race, which the plain task test:go does not enable.
TempDir vs HOME: tests that touch ~/.aileron/ need t.Setenv("HOME", t.TempDir()). The CI runners have no fallback path.
Time-of-day or timezone: tests that compare against time.Now() without an injected clock will be flaky on slow CI runners.

System tests (black-box CLI)

The system-test suite sits above the unit, integration, and sandbox-integration layers. It builds the shipped aileron binary and drives the real aileron launch <agent> -- <agent-flag> "..." path against a live Docker sandbox, for example aileron launch codex -- exec "..." or aileron launch claude -- -p "...", then asserts on the result with shell and jq.

The lower layers stop at proving that Docker works on the host:

The test:go unit layer exercises Go functions in isolation.
The task test:integration layer runs the Go HTTP/API integration tests against a running daemon.
The integration_sandbox Go tests call docker run and the sandbox Go functions directly.

The system suite proves that aileron launch itself correctly drives Docker on the host. It complements those layers and does not replace any of them.

Run it

task test:system               # lib contract tests + harness smoke + the codex and claude scenarios
task test:system:lib           # Go contract tests for the shared scenario library (no Docker, no shell, CI-safe, Windows-runnable)
task test:system:smoke         # harness self-test: build fires, Docker precondition gates, defer cleanup runs
task test:system:launch:codex  # the codex scenario in isolation: aileron launch codex -- exec "..."
task test:system:launch:claude # the claude scenario in isolation: aileron launch claude -- -p "..."

Each agent scenario builds a fresh aileron (plus the Linux aileron-mcp sibling), runs the launch once, and on exit a deferred cleanup removes the sandbox container and the temporary workspace even when an assertion failed.

Host prerequisites

A reachable Docker daemon. On macOS and Windows this means Docker Desktop running; on Linux it means dockerd. The suite checks docker info before any launch.
The target agent’s auth already present. The codex scenario expects ~/.codex/auth.json (created by codex login). The claude scenario expects ~/.claude/.credentials.json (created by claude /login). v1 does not inject any LLM secret; you authenticate once with your own aileron launch <agent> login and the suite reuses that file.
Optional: a running Aileron daemon if you want the audit round-trip assertion to read real records (AILERON_STATE_DIR defaults to ~/.aileron).

A missing prerequisite stops the run immediately and prints the exact remediation command, for example Authenticate first with: claude /login. The suite never silently skips a scenario when a prerequisite is absent.

A live agent scenario needs a real login and consumes LLM tokens, so it is run by hand. The headless path validates the wiring without launching:

task --dry test:system:launch:codex   # compiles the target, resolves deps and preconditions, does not launch

Cross-OS

The same suite runs unmodified on Ubuntu, Fedora, macOS, and Windows, with the container path included on all four. Task runs each target’s command steps through its embedded mvdan/sh interpreter, so the Taskfile-level shell logic is portable without a host Bash. Windows uses the stdio exec path and does not use a Unix PTY. OS and distribution versions are unpinned in v1.

The launch container path is not gated by the spawn-primitive availability probes (internal/sandbox/sandbox_available_*.go, ADR-0014). Those probes guard the separate spawn-primitive OS confinement subsystem. The aileron launch container path works on all four OS families regardless of that gate, so the system suite runs everywhere Docker runs.

Scope boundary

v1 is portable and run by hand. task test:system is not wired into CI. The maintainer runs it on real hosts across the four OS families. CI-matrix automation, including self-hosted runners, cloud VMs, and secret injection, is deferred to a later initiative.

The human-driven manual acceptance this suite complements is tracked in issue #962. The scenario bodies and the shared probe library are documented in test/system/README.md in the repository.

Testing philosophy

Per the project’s CLAUDE.md, tests are written against the contract of the code (inputs, outputs, side effects, error conditions defined by the function signature or API spec), never against implementation internals. A refactor that preserves the contract should leave the suite green; if it doesn’t, the test was coupled to internals.

Two consequences:

Happy path is mandatory. A test that only asserts on failure modes tells you nothing about whether the feature works.
Implementation accidents are not contracts. If a test passes because of how the code happens to be structured (e.g., “this fails because it tries to reach Google”), that’s a mirror, not a test.

See the project’s root CLAUDE.md for the full statement.
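
As a sketch of what testing against the contract can look like, the table-driven test below states a contract in the function comment and asserts only on that. parsePort is a hypothetical function, not part of Aileron. The happy path comes first, and the error cases check only that an error is returned, because the exact message is not part of the contract.

```go
package main

import (
	"fmt"
	"strconv"
	"testing"
)

// parsePort converts s to a TCP port. Contract: it returns the port for
// decimal input in the range 1-65535 and a non-nil error for anything else.
// Hypothetical function for illustration only.
func parsePort(s string) (int, error) {
	n, err := strconv.Atoi(s)
	if err != nil {
		return 0, fmt.Errorf("parse port %q: %w", s, err)
	}
	if n < 1 || n > 65535 {
		return 0, fmt.Errorf("port %d out of range 1-65535", n)
	}
	return n, nil
}

// TestParsePort asserts on inputs, outputs, and error conditions only.
// It never inspects how parsePort is implemented.
func TestParsePort(t *testing.T) {
	tests := []struct {
		name    string
		in      string
		want    int
		wantErr bool
	}{
		{"happy path", "8080", 8080, false},
		{"lower bound", "1", 1, false},
		{"upper bound", "65535", 65535, false},
		{"zero rejected", "0", 0, true},
		{"too large", "65536", 0, true},
		{"not a number", "http", 0, true},
	}
	for _, tc := range tests {
		t.Run(tc.name, func(t *testing.T) {
			got, err := parsePort(tc.in)
			if (err != nil) != tc.wantErr {
				t.Fatalf("parsePort(%q) error = %v, wantErr %v", tc.in, err, tc.wantErr)
			}
			if got != tc.want {
				t.Fatalf("parsePort(%q) = %d, want %d", tc.in, got, tc.want)
			}
		})
	}
}
```

Swapping strconv.Atoi for a hand-written parser would leave this test green as long as the contract holds, which is the property the section describes.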