Every detection system lives somewhere on a trade-off between two failure modes. Miss a real problem, and you have a false negative. Flag a non-problem, and you have a false positive. Most scanners are tuned hard toward catching everything, because a missed vulnerability is the failure everyone fears and a false alarm is the failure everyone tolerates. The result is a familiar artifact: a report with hundreds of "criticals," most of which are wrong, handed to a team that no longer reads past the first page.
We made the opposite bet, and we made it on purpose. Our output is small. A clean estate gets a short report, sometimes an empty critical list, and we say so plainly. The bet is that one reproducible finding a customer can act on beats a hundred they have to triage, and that over time the team that ships only true things is the team that gets believed when it matters. This is the doctrine behind that bet. Verifiable security.
The cost asymmetry nobody prices in
The reason recall-maximizing tooling feels safe is that the cost of a false positive is paid by someone else: the recipient's analyst, not the vendor's report generator. That cost is real and it compounds. The first false critical costs an hour. The hundredth trains the team to assume the tool cries wolf, and from then on every finding the tool produces, including the true ones, is discounted. A detection system that is wrong often enough becomes indistinguishable from no detection system at all, except that it also generates work.
A scanner that calls everything critical is not cautious. It is noise with a severity label, and the analyst it exhausts is the same analyst who needs to see the one finding that is real.
So the question is not "did we catch the issue." Recall is easy; you can catch everything by flagging everything. The question is "can we stand behind every word of what we shipped." That reframing changes the whole pipeline. The expensive, load-bearing step is not detection. It is the verification gate that sits between a candidate and a signature, and that gate is where most of the work actually happens.
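To make "recall is easy" concrete, here is a minimal sketch that compares a flag-everything scanner with a verified-only one. The estate size, host names, and the `precision_recall` helper are invented for illustration and are not measurements from any engagement.

```python
def precision_recall(flagged: set[str], real: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for a set of flagged assets.

    Precision: share of flagged items that are real defects.
    Recall: share of real defects that were flagged.
    Empty denominators are treated as 1.0 by convention.
    """
    true_positives = len(flagged & real)
    precision = true_positives / len(flagged) if flagged else 1.0
    recall = true_positives / len(real) if real else 1.0
    return precision, recall


# Hypothetical estate: 200 assets, 3 of which carry a real defect.
assets = {f"host-{i}" for i in range(200)}
real_defects = {"host-7", "host-42", "host-133"}

# Flagging everything: perfect recall, but almost every flag is wrong.
flag_everything = precision_recall(assets, real_defects)

# Shipping only what was verified: one real defect is held back,
# but every shipped finding is true.
verified_only = precision_recall({"host-7", "host-42"}, real_defects)

print(flag_everything)  # (0.015, 1.0)
print(verified_only)
```

In this toy case the flag-everything scanner scores perfect recall while 197 of its 200 flags are false. The verified-only run misses one defect on this pass and ships nothing untrue.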
The four tests a finding must survive
A candidate finding is a hypothesis, not a result. Before it becomes something we sign and ship, it has to pass four tests in order. Most candidates die at the first or second. That is the system working. The examples below are illustrative: they show the everyday shapes these tests handle and do not describe any one engagement.
- Reproduce the fact off the wire, now. Not from a cached scan, not from a banner captured an hour ago. Re-observe the underlying fact at verification time, by hand or by a deterministic re-probe. State rotates: certificates expire, services restart, configurations change. If the fact is not still true when we look again, it does not ship. This single test kills the largest class of false positives, the ones that were true for a moment and then were not.
- Point at the byte. Every finding must reduce to a specific, observable artifact: an expiry date in a certificate, a token reflected unencoded into a response body, a header that is present or absent, a response that should have been a reject and was an accept. If the strongest statement we can make is "the version suggests this might be vulnerable," we do not have a finding. We have a guess, and we do not sign guesses.
- Rule out the benign explanation. A lot of alarming-looking signals are systems working as designed. A hostname that does not match a certificate can be a misconfiguration, or it can be a federation endpoint that is supposed to serve a third party's certificate. A reflected parameter can be injection, or it can be an error page echoing a URL with no execution. A login page returned where you expected data can be a redirect doing its job, not a leak. Before a candidate ships, the benign reading has to be actively excluded, not merely overlooked. If we cannot rule out "this is correct behavior," we hold it.
- Confirm it is the customer's, and that it is real. Evidence collected through a redirect can belong to a third party. A reflected canary token that the server merely echoes back, rather than evaluating, proves nothing and must not be read as execution. Evidence generated by our own synthetic test fixtures is, by construction, not a customer finding. A candidate has to be tied to an asset the customer actually owns and scoped us into, and it has to be free of any synthetic-test marker, before it is eligible to ship.
Only what survives all four tests, in order, earns a signature, and the signature means the customer can reproduce the fact themselves.
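The ordering can be sketched as a short-circuiting pipeline. This is a minimal illustration, not our production gate: `Candidate`, `build_gate`, `run_gate`, the test names, and the `example.com` hosts are all hypothetical, and the re-probe is injected as a function so the sketch stays free of network calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Candidate:
    asset: str
    check: str
    artifact: Optional[str]   # the specific observed byte(s), if any
    synthetic: bool = False   # True if produced by our own test fixtures


Test = Callable[[Candidate], bool]


def build_gate(reprobe: Test, benign: set, in_scope: set) -> list:
    """Assemble the four tests in order. All inputs are assumptions:
    reprobe re-observes the fact now, benign holds (asset, check) pairs
    known to be correct behavior, in_scope holds customer-owned assets."""
    return [
        ("reproduce", reprobe),
        ("point_at_byte", lambda c: bool(c.artifact)),
        ("rule_out_benign", lambda c: (c.asset, c.check) not in benign),
        ("customer_owned_and_real",
         lambda c: c.asset in in_scope and not c.synthetic),
    ]


def run_gate(candidate: Candidate, gate: list) -> Optional[str]:
    """Return the name of the first failed test, or None if all four pass."""
    for name, test in gate:
        if not test(candidate):
            return name
    return None


gate = build_gate(
    reprobe=lambda c: c.asset != "stale.example.com",  # stand-in for a live probe
    benign={("sso.example.com", "cert-hostname-mismatch")},
    in_scope={"app.example.com"},
)
proven = Candidate("app.example.com", "expired-cert",
                   artifact="notAfter=2020-01-01T00:00:00Z")
guess = Candidate("app.example.com", "version-match", artifact=None)

print(run_gate(proven, gate))  # None
print(run_gate(guess, gate))   # point_at_byte
```

Returning the name of the failed test, rather than a bare boolean, keeps a record of why each candidate was held.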
What "honest" looks like in a report
This doctrine produces reports that look different from the genre. Three habits in particular.
We say "not measured" instead of inventing a number. If we did not measure something, we report that we did not measure it. We never fabricate a percentage or a confidence score to fill a field. A made-up metric is just a false positive wearing a lab coat.
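One way to enforce this habit in code is to make "not measured" a distinct value rather than a default number. The sketch below assumes a hypothetical `render_metric` helper and metric name; the point is only that `None` and `0` render differently.

```python
from typing import Optional


def render_metric(name: str, value: Optional[float], unit: str = "%") -> str:
    """Render a report field. None means the metric was never measured,
    which is different from a measured value of zero."""
    if value is None:
        return f"{name}: not measured"
    return f"{name}: {value:g}{unit}"


print(render_metric("Hosts with expired certificates", None))
# Hosts with expired certificates: not measured
print(render_metric("Hosts with expired certificates", 0.0))
# Hosts with expired certificates: 0%
```

Typing the field as `Optional[float]` leaves no silent default: a missing measurement has to be passed as `None`, and it prints as exactly that.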
We report an empty critical list as a result, not a failure. When an estate is genuinely hardened, the honest finding is "hardened." Server versions masked behind a content delivery network make version-to-CVE matching legitimately empty, and we say that is what happened rather than manufacturing a finding to justify the engagement. A short report from a clean estate is the system telling the truth.
We separate what we observed from what we inferred. A directly observed fact and a reasonable inference are different epistemic objects, and we keep them visibly distinct. The capsule ships the observation. The narrative may discuss the inference, clearly labeled as inference. The customer always knows which is which.
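The separation is easier to keep when it is built into the data model. The following is a sketch under assumptions: the `Observation`, `Inference`, and `Report` types and their fields are hypothetical and do not describe the actual Proof Capsule format.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Observation:
    asset: str
    artifact: str      # the specific fact read off the wire
    observed_at: str   # ISO 8601 timestamp of the observation


@dataclass(frozen=True)
class Inference:
    statement: str
    based_on: Observation  # every inference points at what it rests on


@dataclass
class Report:
    observations: list[Observation] = field(default_factory=list)
    inferences: list[Inference] = field(default_factory=list)

    def capsule(self) -> list[Observation]:
        """Only direct observations are eligible for the signed capsule."""
        return list(self.observations)

    def narrative(self) -> list[str]:
        """The narrative may include inferences, always labeled as such."""
        lines = [f"Observed: {o.artifact} on {o.asset}" for o in self.observations]
        lines += [f"Inference: {i.statement}" for i in self.inferences]
        return lines


obs = Observation("app.example.com", "notAfter=2020-01-01T00:00:00Z",
                  "2024-01-01T00:00:00Z")
report = Report(
    observations=[obs],
    inferences=[Inference("certificate renewal may not be automated", obs)],
)
print(report.narrative())
```

Because `capsule()` can only return `Observation` objects, an inference cannot end up signed by accident.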
Adopt the doctrine on your own pipeline this week
- Add a re-verification step between detection and reporting. Whatever your scanner flags, re-observe it before it reaches a human. The gap between scan time and report time is where most false positives are born.
- Require an artifact, not a heuristic, for anything you call critical. If the finding cannot be reduced to a specific observable byte, downgrade it to "needs review," not "critical."
- Maintain an explicit benign-explanation list per check. For every detection, write down the legitimate configurations that trip it, and exclude them in code rather than in the analyst's head.
- Tag and quarantine synthetic test data at the source. Test fixtures that exercise detection logic must never be eligible to appear in a customer report. Mark them once, filter them everywhere.
- Measure your false-positive rate and treat a regression as a P1. The metric that protects trust is the one you watch. If a release raises the false-positive rate, that is a shipping-blocker, not a footnote.
- Make every shipped finding reproducible by the recipient. If the customer cannot independently confirm the fact, you are asking them to trust you. Replace trust with a repeatable observation and a signature.
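Two of these steps are small enough to sketch directly: quarantining synthetic data and blocking a release on a false-positive regression. Everything here is an assumption for illustration, including the `synthetic` field name, the helper names, and the definition of the rate as the share of shipped findings later shown to be wrong (strictly, a false discovery rate).

```python
def reportable(findings: list[dict]) -> list[dict]:
    """Drop anything tagged as synthetic before it can reach a report."""
    return [f for f in findings if not f.get("synthetic", False)]


def false_positive_rate(shipped: int, later_disproved: int) -> float:
    """Share of shipped findings that were later shown to be wrong."""
    if shipped == 0:
        return 0.0
    return later_disproved / shipped


def blocks_release(baseline: float, candidate: float,
                   tolerance: float = 0.0) -> bool:
    """Treat any rise in the rate beyond the tolerance as a release blocker."""
    return candidate > baseline + tolerance


findings = [
    {"id": "F-1", "asset": "app.example.com"},
    {"id": "T-1", "asset": "fixture.test", "synthetic": True},
]
print([f["id"] for f in reportable(findings)])  # ['F-1']

baseline = false_positive_rate(shipped=200, later_disproved=2)   # 0.01
candidate = false_positive_rate(shipped=200, later_disproved=5)  # 0.025
print(blocks_release(baseline, candidate))  # True
```

Wiring `blocks_release` into the release pipeline makes the regression a failed build, so it cannot be waved through as a footnote.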
The trade-off we accept
We are not going to pretend this is free. Optimizing against false positives means we will occasionally hold something that turns out to have been real, because it did not survive the gate on the day we looked. We accept that cost deliberately, for two reasons. First, the gate is re-run continuously, so a real defect that could not be reproduced on one pass gets caught on a later one, while a false positive that ships cannot be recalled. Second, the entire value of a security signal is its credibility, and credibility is destroyed far faster by crying wolf than by an occasional quiet miss that the next pass recovers.
The customers who feel this most are the ones with hardened estates, who get a short, honest report instead of a padded one. They tend to be the customers who understand exactly why that is the right answer.
How Celvex operationalizes it
Find. Prove. Fix. Verify.
- Find. Read-only probes generate candidate findings at breadth, deliberately tuned toward catching the shape of a defect, knowing the verification gate will discard whatever cannot be proven.
- Prove. The four-test gate re-observes each candidate off the wire, requires a specific artifact, excludes the benign reading, and confirms customer ownership before anything is signed into a Proof Capsule.
- Fix. Only surviving findings reach the customer, each with a remediation block tied to the exact observed fact, so the fix targets the proven defect rather than a guess.
- Verify. After remediation, the same gate re-runs and the finding closes only when the fact is no longer reproducible. The verified-fix event is recorded for the audit trail.
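The closing rule in the last step can be sketched as follows. The `Finding` type, the `verify_fix` function, and the audit-trail strings are hypothetical, and the re-probe is passed in as a callable so the example stays self-contained.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Finding:
    finding_id: str
    status: str = "open"
    audit_trail: list[str] = field(default_factory=list)


def verify_fix(finding: Finding, still_reproducible: Callable[[], bool],
               now: str) -> Finding:
    """Re-run the probe after remediation. The finding closes only when
    the fact can no longer be reproduced; either outcome is recorded."""
    if still_reproducible():
        finding.audit_trail.append(f"{now} re-verified: still reproducible")
    else:
        finding.status = "closed"
        finding.audit_trail.append(f"{now} verified-fix: no longer reproducible")
    return finding


finding = Finding("F-1")

# First pass: the remediation has not taken effect, so the finding stays open.
verify_fix(finding, lambda: True, "2024-01-01T00:00:00Z")
print(finding.status)  # open

# Later pass: the fact is gone, so the finding closes and the event is logged.
verify_fix(finding, lambda: False, "2024-01-02T00:00:00Z")
print(finding.status)  # closed
```

A claim that the fix was applied does not close the finding in this sketch. Only a failed reproduction does.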
This is the discipline behind every other piece in this series. When we wrote about the factory default certificate on the perimeter, the reason it is a finding and not a guess is that all four tests pass: the certificate is read off the wire now, the three tells are specific observable bytes, the benign reading (an intentional internal certificate) is excluded, and the host is the customer's. When we compose an attack chain from two public CVEs, the same gate decides whether each link is a real, citable primitive or just a plausible story. A chain whose links cannot be grounded is a false positive at the level of the whole capability, and the doctrine governs it exactly as it governs a single byte.
Verifiable security. Find it. Prove it. Fix it. Verify the fix held. And ship only what you can stand behind.
See what survives the gate on your estate.
Free Exposure Check, no signup required. We run the four-test verification gate against the assets you scope in and ship a signed Proof Capsule only for findings we can reproduce off the wire.