AppCrib

How We QA Without a QA Team

Field notes · By Carl Stein, Co-Founder & Head of Build

We don't have a QA team. We have something worse: a gatekeeper that can't be reasoned with, bribed, or guilt-tripped into letting a half-baked build through. And it has opinions about your feature.

At AppCrib, every app we ship passes through a staged pipeline before it goes live: architecture review, build validation, security scanning, deployment gating. Each phase is run by a manager (human or AI, we prefer to keep that a little opaque), and each has its own checkpoint. But the checkpoint that keeps build quality honest is the one nobody particularly likes. It sits at the end of the BUILD phase and decides, in cold JSON, whether your work moves forward or gets sent back.

SHIP or HOLD. That's the entire vocabulary.

What Does the Gatekeeper Actually Check?

The gatekeeper runs after every feature has been built and tested through three validation gates (structural, functional, feature-level). Its job is to verify that the whole build meets a set of non-negotiable criteria before the app advances to the TESTING phase.

It evaluates, in order:

  • Architecture artifacts exist. Five files must be present: brand identity, layout constraints, page architecture, data model, and component architecture. If any are missing, the build goes back to the build manager to regenerate them.
  • All features pass all three gates. Every feature gets tested through Gate 1 (structural: do the routes return 200?), Gate 2 (functional: do the forms submit?), and Gate 3 (feature-level: does the user flow actually work?). If any feature fails any gate, HOLD.
  • The build compiles. npm run build must succeed. No exceptions.
  • No forbidden dependencies. We have a locked technology stack, and anything outside it is forbidden: Chakra UI, MUI, Prisma, Firebase, Clerk. If one of them shows up in package.json, the build is held.
  • Execution evidence exists for every passing feature. This is the big one.
  • Red-green verification for all P0 features. We'll come back to this.

More criteria exist beyond these: mock detection audits, stack compliance checks, API manifest completeness, smoke tests, theme toggle verification, content page quality gates, dead component detection. But those six are the spine. The rest are scar tissue from lessons learned the hard way.
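To make the forbidden-dependency criterion concrete, here is a minimal sketch of how such a scan could work. It is not our actual gatekeeper code: the npm package names and scopes are assumptions about how those libraries would appear in package.json, and findForbiddenDependencies is a hypothetical helper.

```typescript
// Illustrative only: these npm names and scopes are assumptions about how
// the forbidden libraries would show up in package.json.
const FORBIDDEN_SCOPES: string[] = ["@chakra-ui/", "@mui/", "@prisma/", "@clerk/"];
const FORBIDDEN_NAMES: string[] = ["prisma", "firebase", "firebase-admin"];

interface PackageJson {
  dependencies?: Record<string, string>;
  devDependencies?: Record<string, string>;
}

// Returns every declared dependency that matches the forbidden list.
function findForbiddenDependencies(pkg: PackageJson): string[] {
  const declared = [
    ...Object.keys(pkg.dependencies ?? {}),
    ...Object.keys(pkg.devDependencies ?? {}),
  ];
  return declared.filter(
    (name) =>
      FORBIDDEN_NAMES.includes(name) ||
      FORBIDDEN_SCOPES.some((scope) => name.startsWith(scope)),
  );
}

// Hypothetical package.json contents for demonstration.
const samplePkg: PackageJson = {
  dependencies: { react: "^18.0.0", "@mui/material": "^5.0.0" },
  devDependencies: { prisma: "^5.0.0" },
};

const forbidden = findForbiddenDependencies(samplePkg);
// Any match at all means the build is held.
console.log(forbidden.length > 0 ? "HOLD" : "no forbidden dependencies", forbidden);
```

The sketch checks devDependencies as well as dependencies, on the assumption that a forbidden package is forbidden wherever it is declared.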

Execution Evidence: The Criterion That Catches Liars

Criterion 10 in the checklist is straightforward: every feature that claims passes: true must also have an execution_evidence block with playwright_actions > 0 and pages_tested > 0.

You can't say a feature passes because someone read the code and it looked right. The testing manager and its team must have actually opened a browser, navigated to the page, clicked the buttons, filled the forms, and recorded what happened. If the Playwright action count is zero, the feature isn't passing.

We learned this the painful way. Early in the pipeline's history, test runs would read the source code, conclude the implementation "matched the spec," and report a pass. No browser opened. No user flow executed. The result was builds that compiled, looked correct on paper, and fell apart the moment a real person touched them.

Now every test result requires an execution_evidence block, a structured receipt showing which pages were navigated, how many Playwright actions were performed, and what screenshots were captured. The gatekeeper reads those receipts. If they're missing, the feature is treated as untested regardless of what the test report claims.
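A rough sketch of that check, for illustration. The field names passes, execution_evidence, playwright_actions, and pages_tested come from the criterion itself. The id and screenshots fields, the function names, and the sample data are assumptions, not the real report schema.

```typescript
interface ExecutionEvidence {
  playwright_actions: number;
  pages_tested: number;
  screenshots?: string[]; // assumed field name
}

interface FeatureResult {
  id: string; // assumed identifier field
  passes: boolean;
  execution_evidence?: ExecutionEvidence;
}

// A receipt only counts if a browser actually did something.
function hasExecutionEvidence(feature: FeatureResult): boolean {
  const evidence = feature.execution_evidence;
  return (
    evidence !== undefined &&
    evidence.playwright_actions > 0 &&
    evidence.pages_tested > 0
  );
}

// Features that claim a pass but cannot prove it.
function findUnverifiedPasses(features: FeatureResult[]): FeatureResult[] {
  return features.filter((f) => f.passes && !hasExecutionEvidence(f));
}

// Hypothetical test results.
const results: FeatureResult[] = [
  { id: "feature-a", passes: true, execution_evidence: { playwright_actions: 14, pages_tested: 3 } },
  { id: "feature-b", passes: true }, // no receipt at all
  { id: "feature-c", passes: true, execution_evidence: { playwright_actions: 0, pages_tested: 0 } },
  { id: "feature-d", passes: false },
];

const unverified = findUnverifiedPasses(results).map((f) => f.id);
console.log(unverified);
```

Note that feature-c has an evidence block but still counts as untested: a receipt with zero actions is treated the same as no receipt.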

A Real HOLD: Anatomy of a Failed Build

This is from an actual gatekeeper report. A recent tool build. Thirteen features. A mid-complexity project we'll leave unnamed.

All thirteen features claimed passes: true. All three gates reported passing. Architecture artifacts present. On paper, a clean build.

The gatekeeper issued a HOLD.

Nine of those thirteen features had no execution_evidence block. Zero Playwright actions. No browser opened. No pages tested. Every one of them had been marked as passing based on code review alone.

The hold request read, verbatim: *"9 features claim passes:true but have NO execution_evidence with playwright_actions > 0. Features that pass based on code review alone are NOT passing."*

The build was routed back to the testing manager with specific instructions: re-validate the nine failing features with Playwright. All passing features must have execution evidence with playwright_actions > 0 and pages_tested > 0.

That wasn't the only problem.

Red-Green Verification: Proving Your Tests Could Have Failed

Criterion 12 is what we call the red-green check. It catches a subtler failure mode than missing evidence.

The idea is borrowed from test-driven development. Before any P0 feature is implemented, the feature builder writes tests for it and runs them. Those tests must fail. That's the "red run," proof that the test exercises something real, that the absence of the feature causes the test to break.

Then the feature gets built. Tests run again. They pass. That's the "green run."

If a feature only ever has a green run (tests written after the code, never seen to fail), there's no proof those tests mean anything. They could be checking true === true and nobody would know the difference.

In the same build, four P0 features were missing red_run_verified: true. No recorded red run. The tests might have been real. They might have been theater. The gatekeeper doesn't speculate. It checks receipts.

The routing was specific: after the testing manager completes re-validation for Criterion 10, the feature builder must re-run the four P0 features with red-green enforcement. Red run first, then green run. All P0 features must show red_run_verified: true before the gatekeeper re-evaluates.
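In code, the receipt check might look something like the sketch below. Only red_run_verified is taken from the report. The id and priority fields, the findMissingRedRuns helper, and the sample features are assumptions made for illustration.

```typescript
interface FeatureRecord {
  id: string; // assumed identifier field
  priority: string; // assumed values such as "P0" or "P1"
  red_run_verified?: boolean;
}

// P0 features with no recorded failing run before implementation.
// A missing flag is treated the same as false: no receipt, no credit.
function findMissingRedRuns(features: FeatureRecord[]): string[] {
  return features
    .filter((f) => f.priority === "P0" && f.red_run_verified !== true)
    .map((f) => f.id);
}

// Hypothetical feature list.
const featureRecords: FeatureRecord[] = [
  { id: "feature-a", priority: "P0", red_run_verified: true },
  { id: "feature-b", priority: "P0" }, // never seen to fail
  { id: "feature-c", priority: "P0", red_run_verified: false },
  { id: "feature-d", priority: "P1" }, // not P0, so not subject to this criterion
];

const missingRedRuns = findMissingRedRuns(featureRecords);
console.log(missingRedRuns);
```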

The Hold Loop

When a HOLD is issued, the build doesn't sit in purgatory. It enters a loop.

A HOLD triggers the target role. That role fixes the issues. Gates get re-validated. The gatekeeper re-evaluates. Issues remain? HOLD again. Everything checks out? COMPLETE, and the build advances to TESTING automatically.
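The control flow of that loop can be sketched in a few lines. This is a simplified model, not the real orchestration: the Verdict type, runHoldLoop, and the maxRounds cap are all assumptions, and the "fix" step here just clears one issue per round.

```typescript
type Verdict =
  | { decision: "COMPLETE" }
  | { decision: "HOLD"; issues: string[] };

// Evaluate, route the fix, re-evaluate, until the gate clears
// or the (assumed) round limit is reached.
function runHoldLoop(
  evaluate: () => Verdict,
  fix: (issues: string[]) => void,
  maxRounds: number,
): { verdict: Verdict; rounds: number } {
  let verdict = evaluate();
  let rounds = 1;
  while (verdict.decision === "HOLD" && rounds < maxRounds) {
    fix(verdict.issues);
    verdict = evaluate();
    rounds += 1;
  }
  return { verdict, rounds };
}

// Simulated build with two open issues; each fix round resolves the first one.
const openIssues: string[] = ["criterion-10", "criterion-12"];

const outcome = runHoldLoop(
  () =>
    openIssues.length === 0
      ? { decision: "COMPLETE" }
      : { decision: "HOLD", issues: openIssues.slice() },
  () => {
    openIssues.shift();
  },
  5,
);

console.log(outcome.verdict.decision, "after", outcome.rounds, "evaluations");
```

The round cap is there only so the sketch cannot spin forever. Whether the real pipeline caps retries is not something this example claims.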

The hold_routing field in the report tells each role exactly what to do. From the report above:

Criterion 10 routed to the testing manager: *"Re-validate the nine failing features with Playwright."*

Criterion 12 routed to the feature builder: *"After testing validation, re-run the four P0 features with red-green enforcement."*

No ambiguity. No "please look into this." The hold request names the features, names the criterion, and states what must be true before the build can proceed.
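For a sense of what that looks like as data, here is one possible shape for the routing section of a report. Only the hold_routing field name and the quoted instructions come from the report above. The nested field names, the role identifiers, and the instructionsFor helper are assumptions for illustration.

```typescript
interface HoldRoute {
  criterion: number;
  target_role: string; // assumed role identifier
  instruction: string;
}

interface GatekeeperReport {
  decision: "SHIP" | "HOLD";
  hold_routing: HoldRoute[];
}

// Everything a given role has to do before the next evaluation.
function instructionsFor(report: GatekeeperReport, role: string): string[] {
  return report.hold_routing
    .filter((route) => route.target_role === role)
    .map((route) => `Criterion ${route.criterion}: ${route.instruction}`);
}

// Sketch of the report discussed above, with assumed field names.
const report: GatekeeperReport = {
  decision: "HOLD",
  hold_routing: [
    {
      criterion: 10,
      target_role: "testing_manager",
      instruction: "Re-validate the nine failing features with Playwright.",
    },
    {
      criterion: 12,
      target_role: "feature_builder",
      instruction:
        "After testing validation, re-run the four P0 features with red-green enforcement.",
    },
  ],
};

const testingTasks = instructionsFor(report, "testing_manager");
const builderTasks = instructionsFor(report, "feature_builder");
console.log(testingTasks, builderTasks);
```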

What Happens After the Gate: The Rest of the Gauntlet

Clearing the BUILD gatekeeper doesn't mean the app is shipping. It means the app is allowed to enter the TESTING phase, a completely separate gauntlet with its own coordinator.

The testing phase runs eight validation passes in sequence: style audits, security scanning, end-to-end flow testing, spacing and layout quality, visual integrity, adversarial chaos testing, full integration and endpoint coverage, and documentation completeness. Each pass is handled by a specialist. Again, human or AI, we prefer not to say.

After all eight passes finish, the deployment gatekeeper makes the DEPLOY or HOLD call. It has its own criteria, its own execution evidence requirements, its own hold loop. It even runs a live smoke test as the last gate: open the app in a real browser, verify it hydrates, perform one interaction, confirm the UI responds. If anything is dead after all those passes have touched the code, HOLD.

The whole pipeline is built around one principle we carved into every specification: *humans should never find dumb stuff.* Placeholder links, 404s, broken forms, console errors, buttons that don't respond, data that doesn't persist. All of it gets caught upstream before a human ever opens the app. Humans answer "Is this good?" and "Would I pay for this?" The pipeline handles "Does this work?"

Why the Gatekeeper Is Always Right

Nobody likes getting a HOLD. Your feature, the thing someone spent an afternoon building, just got kicked back by a JSON object that doesn't care about anyone's deadline or confidence that "it works on my machine."

But the gatekeeper report is a receipt, not an opinion. It doesn't read your code and nod approvingly. It checks execution evidence. Counts Playwright actions. Verifies red runs. Scans for forbidden dependencies, mock patterns, dead components, metadata contract violations. When it says HOLD, something specific and measurable failed.

Compare that to the alternative: a human QA engineer who tests what they remember to test, files tickets that get triaged into oblivion, and eventually approves the build because the sprint ends Friday.

We'll take the gatekeeper. It doesn't have sprints.

The Evidence Trail

One last thing. Every decision the pipeline makes is recorded. The BUILD gatekeeper writes a structured report to the builds directory. The DEPLOY gatekeeper writes one to the deploys directory. Every testing pass produces structured output with execution evidence, Playwright action counts, page lists, and screenshots.

When something goes wrong in production, we don't have to ask "Did we test that?" We open the report and check. The evidence is either there or it isn't.

That's the whole philosophy behind automated QA testing with AI agents, and frankly any automated quality gate. Don't trust claims. Don't trust code review. Don't trust anyone, human or otherwise, who says "it looks correct." Trust execution evidence. Trust Playwright actions. Trust the receipt.

And if the gatekeeper says HOLD, fix the build.
