Testing Strategy
afk is a Claude Code plugin whose product is the markdown itself: skills, agent definitions, and hooks. The test suite protects that product at four levels, from instant zero-token static checks up to model-backed behavioral evals that make real LLM calls.
| Layer | Command | Cost | When |
|---|---|---|---|
| unit + integration | `bun run test` | zero tokens | every edit |
| e2e smoke | `bun run test:e2e` | ~$0.01 | before release |
| behavioral evals | `bun run test:evals` | real LLM calls | before release |
| trigger activation | `bun run test:triggers` | ~$10–14 | before release |

The rule of thumb: `bun run test` runs on every edit (it costs nothing and takes seconds); the model-backed layers run before a release, not on every change.
The layers
Unit: bun run test:unit
File-level checks that read one file at a time. Zero tokens, runs in tests/unit/run-unit-tests.ts.
It validates:
- Plugin manifests: `.claude-plugin/*.json` is valid JSON and `plugin.json` has a `name` and `version`.
- Skill frontmatter: `name:` matches the directory name, is lowercase kebab-case, carries no reserved words (`anthropic`, `claude`) or XML tags; the `description:` is present, within 1024 chars, and starts with `Use when`; only allowed frontmatter keys appear; required sections (`## When to Use`, `## Process`, `## Stop and Ask`, `## Output`) exist; and `SKILL.md` stays under 500 lines.
- Agent frontmatter: the two named agents pin their tier and tool allowlist (`implement-orchestrator` → `opus`, read-only; `implementation-worker` → `sonnet`, with `Bash`/`Edit`/`Write`). Omitting `model` would inherit the user's default tier; omitting `tools` would inherit all tools and erase the least-privilege guarantee, so the lint fails the build.
- The brain index hook: runs `auto-index-brain.sh` against a throwaway temp vault and asserts summary lines become entry descriptions, list markers are stripped, title-only notes stay bare wikilinks, and a rebuild is idempotent.
- No `.sh` test runners: the test pipeline is Bun/TypeScript only.
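
The `name:` and `description:` rules above can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the code in `tests/unit/run-unit-tests.ts`: the function name `lintSkillFrontmatter`, the `SkillFrontmatter` shape, and the error messages are all hypothetical, and the sketch assumes the frontmatter has already been parsed.

```typescript
// Hypothetical sketch of the name/description rules; not the repo's actual check.
interface SkillFrontmatter {
  name?: string;
  description?: string;
}

const RESERVED_WORDS = ["anthropic", "claude"];
const KEBAB_CASE = /^[a-z0-9]+(-[a-z0-9]+)*$/;
const MAX_DESCRIPTION_CHARS = 1024;

// Returns a list of human-readable problems; an empty list means the skill passes.
function lintSkillFrontmatter(dirName: string, fm: SkillFrontmatter): string[] {
  const errors: string[] = [];
  const name = fm.name ?? "";
  const description = fm.description ?? "";

  if (name !== dirName) {
    errors.push(`name "${name}" does not match directory "${dirName}"`);
  }
  if (!KEBAB_CASE.test(name)) {
    errors.push("name is not lowercase kebab-case");
  }
  for (const word of RESERVED_WORDS) {
    // Case-insensitive so "Claude" is caught as well as "claude".
    if (name.toLowerCase().includes(word)) {
      errors.push(`name contains reserved word "${word}"`);
    }
  }
  if (/<[^>]+>/.test(name)) {
    errors.push("name contains an XML tag");
  }

  if (description.length === 0) {
    errors.push("description is missing");
  } else {
    if (description.length > MAX_DESCRIPTION_CHARS) {
      errors.push(`description exceeds ${MAX_DESCRIPTION_CHARS} chars`);
    }
    if (!description.startsWith("Use when")) {
      errors.push('description does not start with "Use when"');
    }
  }
  return errors;
}
```

Returning a list of messages instead of throwing on the first problem lets a runner report every violation in a file at once.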
Integration: bun run test:integration
Cross-file checks that verify references between files line up. Also zero tokens, in tests/integration/run-integration-tests.ts.
It validates:
- Eval specs: every `evals.json` / `triggers.json` under `tests/e2e/evals/specs/` belongs to a real skill and has the required shape (judged cases keep an `expectations` array; routing cases carry an `expect`/`forbid` `routing` block).
- Internal file references: every `references/…` or `skills/…` path mentioned in a skill resolves to a file that exists.
- Markdown links: every `.md` link across the repo points at a real file (dead links fail the build).
- Skill & agent catalog: every `/afk:<skill>` referenced in the README, the `help` CSV catalog, the docs, and the skills resolves to a real skill or agent.
- Marketplace: `marketplace.json` and `plugin.json` agree on the plugin name, the source points at the repo root, and a version anchor is present.
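
As an illustration of the catalog check, the sketch below scans a piece of text for `/afk:<skill>` references and reports the ones that do not resolve. It assumes the set of known skill and agent names has already been collected from disk; the function name `findDanglingSkillRefs` is hypothetical and the real check in `tests/integration/run-integration-tests.ts` may work differently.

```typescript
// Hypothetical sketch: report /afk:<name> references that match no known skill or agent.
function findDanglingSkillRefs(text: string, known: Set<string>): string[] {
  // Assumes skill names are lowercase kebab-case, as the unit layer enforces.
  const pattern = /\/afk:([a-z0-9]+(?:-[a-z0-9]+)*)/g;
  const missing = new Set<string>();

  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    const name = match[1];
    if (name !== undefined && !known.has(name)) {
      missing.add(name);
    }
  }
  // Sorted and de-duplicated so the failure output is stable between runs.
  return Array.from(missing).sort();
}
```

Running the same function over the README, the catalog, the docs, and each skill file would surface a rename that missed one of those places.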
bun run test: the every-edit gate
tests/check.ts runs unit and integration in sequence and exits non-zero if either fails. This is the check to run after every edit: it is fast, free, and catches most regressions (broken frontmatter, dead links, a skill renamed without updating the catalog).
e2e smoke: bun run test:e2e
tests/e2e/plugin-load.ts runs a single headless claude -p turn (~$0.01) against the working tree with --plugin-dir . and confirms the plugin loads: a system/init event is emitted, afk appears in the loaded plugins list, no plugin_errors are reported, and the run completes without a Claude error. This catches packaging problems the static checks can't see: a malformed hooks.json, a frontmatter shape the runtime rejects.
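
The four conditions can be read as a single pass over the events the headless run emits. The sketch below is only an illustration: the `SmokeEvent` shape (including the `plugins`, `plugin_errors`, and `is_error` fields and the `result` event type) is a simplified assumption made for this example, not a documented schema, and `checkPluginLoad` is a hypothetical name.

```typescript
// Hypothetical, simplified event shape; the real output of the CLI may differ.
interface SmokeEvent {
  type: string;
  subtype?: string;
  plugins?: { name: string }[];
  plugin_errors?: string[];
  is_error?: boolean;
}

// Returns the list of failed conditions; an empty list means the smoke test passes.
function checkPluginLoad(events: SmokeEvent[], pluginName: string): string[] {
  const failures: string[] = [];

  const init = events.find((e) => e.type === "system" && e.subtype === "init");
  if (init === undefined) {
    // Without an init event the remaining plugin checks cannot be evaluated.
    failures.push("no system/init event was emitted");
    return failures;
  }
  if (!(init.plugins ?? []).some((p) => p.name === pluginName)) {
    failures.push(`plugin "${pluginName}" is not in the loaded plugins list`);
  }
  if ((init.plugin_errors ?? []).length > 0) {
    failures.push("plugin_errors were reported");
  }
  if (events.some((e) => e.type === "result" && e.is_error === true)) {
    failures.push("the run ended with an error");
  }
  return failures;
}
```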
Behavioral evals: bun run test:evals
tests/e2e/evals/run-evals.ts is the deepest layer: it drives each skill end-to-end with real LLM calls and grades the behavior. Specs live under tests/e2e/evals/specs/<skill>/evals.json.
Each eval runs the skill in a fresh temp project (its fixture.files are written in first), then grades the transcript three ways:
- Deterministic assertions: `required_substrings`, `forbidden_substrings`, `required_files`, `required_file_substrings`, and `unchanged_files` are checked in code against the output and the resulting project.
- Judged cases (`kind: "judged"`): an LLM judge scores the transcript against a list of `expectations`. The case passes when the mean score across trials clears the threshold (default 70%).
- Routing cases (`kind: "routing"`) are code-graded: the output must contain every `expect` substring and none of the `forbid` substrings. A case passes on a strict majority of trials; `overblock_guard` flags over-eager refusals.
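
The pass rules above reduce to a few lines of arithmetic. The following is a minimal sketch, not the logic in `tests/e2e/evals/run-evals.ts`: the function names are hypothetical, scores are assumed to be percentages from 0 to 100, and "clears the threshold" is assumed to mean greater than or equal to it.

```typescript
// Substring grading: every required string present, no forbidden string present.
function checkSubstrings(output: string, required: string[], forbidden: string[]): boolean {
  return (
    required.every((s) => output.includes(s)) &&
    forbidden.every((s) => !output.includes(s))
  );
}

// Judged case: the mean judge score across trials must reach the threshold.
// Assumption: scores are percentages and ">=" counts as clearing the threshold.
function judgedPasses(scores: number[], thresholdPercent: number = 70): boolean {
  if (scores.length === 0) return false;
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  return mean >= thresholdPercent;
}

// Routing case: a strict majority of trials must satisfy the expect/forbid substrings.
function routingPasses(trialOutputs: string[], expect: string[], forbid: string[]): boolean {
  const passed = trialOutputs.filter((o) => checkSubstrings(o, expect, forbid)).length;
  // Strict majority: with 2 trials, 1 pass is not enough; with 3 trials, 2 passes are.
  return passed * 2 > trialOutputs.length;
}
```

With the default of three trials, one flaky trial does not fail a routing case, but two do.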
Each case runs multiple trials (default 3) to surface flaky routing. Useful env knobs:
| Variable | Default | Purpose |
|---|---|---|
| `AFK_EVAL_TRIALS` | 3 | trials per eval |
| `AFK_EVAL_SCORE_THRESHOLD` | 70 | judged pass threshold (%) |
| `AFK_EVAL_SKILL` | (none) | run only one skill's evals |
| `AFK_EVAL_ID` | (none) | run only one eval id |
| `AFK_EVAL_JUDGE_MODEL` | claude-haiku-4-5 | the judge model |
| `AFK_EVAL_MAX_BUDGET_USD` | 0.50 | per-eval budget cap |
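
One way to picture how these knobs resolve is a small reader that falls back to the defaults in the table. This is a sketch only: `readEvalConfig` and the `EvalConfig` shape are hypothetical, and the fallback behavior for empty or non-numeric values is an assumption, not something the runner is known to do.

```typescript
// Hypothetical config shape mirroring the table above.
interface EvalConfig {
  trials: number;
  scoreThreshold: number;
  skill: string | undefined;
  evalId: string | undefined;
  judgeModel: string;
  maxBudgetUsd: number;
}

// Takes the environment as a plain record so the function stays easy to test.
function readEvalConfig(env: Record<string, string | undefined>): EvalConfig {
  // Assumption: unset, empty, or non-numeric values fall back to the default.
  const num = (value: string | undefined, fallback: number): number => {
    if (value === undefined || value.trim() === "") return fallback;
    const parsed = Number(value);
    return Number.isFinite(parsed) ? parsed : fallback;
  };

  return {
    trials: num(env["AFK_EVAL_TRIALS"], 3),
    scoreThreshold: num(env["AFK_EVAL_SCORE_THRESHOLD"], 70),
    skill: env["AFK_EVAL_SKILL"] || undefined,
    evalId: env["AFK_EVAL_ID"] || undefined,
    judgeModel: env["AFK_EVAL_JUDGE_MODEL"] || "claude-haiku-4-5",
    maxBudgetUsd: num(env["AFK_EVAL_MAX_BUDGET_USD"], 0.5),
  };
}
```

For a quick local loop, setting `AFK_EVAL_SKILL` to a single skill and `AFK_EVAL_TRIALS` to 1 before `bun run test:evals` narrows the run and keeps the cost down.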
The invariant: write the eval red first. A case that cannot fail proves nothing. See Eval-first for how this drives the flow, and the write-evals skill for scaffolding new specs.
Trigger activation: bun run test:triggers
The Trigger-Activation Runner measures whether AFK skills fire organically from bare natural-language prompts, with no `/afk:` prefix. It reads a single shared corpus at `tests/e2e/triggers/corpus.json`, sends each prompt headless through `claude -p`, and detects which AFK skill fires first (the first Skill tool-use naming an `afk:` skill).
It reports three metrics over the corpus:
- Activation %: positive queries where any AFK skill fired.
- Accuracy %: positive queries where the expected owner fired.
- False-positive %: `none` queries where any AFK skill fired.
The runner runs 3 trials per query (strict-majority vote per query), prints a confusion matrix (expected owner × fired skill), and exits non-zero if activation < 80%, false-positive > 10%, or any per-query majority fails. Cost is roughly $10–14 per full suite, so this check is local pre-release only — it is not in bun run test and not in CI.
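
The three metrics and the two percentage gates can be sketched as follows. This is an illustration under assumptions, not the runner's code: the `QueryResult` shape and function names are hypothetical, each result is taken to be the per-query outcome after the majority vote, `null` stands in for a `none` query or for no skill firing, and the per-query majority gate is left out.

```typescript
// Hypothetical per-query outcome after the majority vote.
interface QueryResult {
  expected: string | null; // expected owner, or null for a "none" query
  fired: string | null;    // skill that fired, or null if none did
}

interface TriggerMetrics {
  activationPct: number;
  accuracyPct: number;
  falsePositivePct: number;
}

function computeMetrics(results: QueryResult[]): TriggerMetrics {
  const positives = results.filter((r) => r.expected !== null);
  const negatives = results.filter((r) => r.expected === null);
  // An empty group reports 0% rather than dividing by zero.
  const pct = (count: number, total: number): number =>
    total === 0 ? 0 : (100 * count) / total;

  return {
    activationPct: pct(positives.filter((r) => r.fired !== null).length, positives.length),
    accuracyPct: pct(positives.filter((r) => r.fired === r.expected).length, positives.length),
    falsePositivePct: pct(negatives.filter((r) => r.fired !== null).length, negatives.length),
  };
}

// Mirrors the stated thresholds: fail if activation < 80% or false-positive > 10%.
function gatePasses(m: TriggerMetrics): boolean {
  return m.activationPct >= 80 && m.falsePositivePct <= 10;
}
```

Accuracy is reported but not gated in this sketch, matching the exit conditions listed above.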
Where things live
| Path | What |
|---|---|
| `tests/unit/` | file-level checks |
| `tests/integration/` | cross-file checks |
| `tests/e2e/plugin-load.ts` | plugin-load smoke test |
| `tests/e2e/evals/` | model-backed behavioral evals + specs |
| `tests/lib/` | shared Bun test helpers (`TestRun`, fs/path utils) |
Adding tests
- Changing skill or agent structure (frontmatter, sections, line budgets)? Add or extend a check in `tests/unit/`.
- Adding a cross-file relationship (a new catalog, a new reference style)? Add a check in `tests/integration/`.
- Changing a skill's behavior? Add a behavioral eval under `tests/e2e/evals/specs/<skill>/evals.json` and prove it fails before your change makes it pass.
Run bun run test before every commit; run bun run test:e2e, bun run test:evals, and bun run test:triggers before cutting a release.