What Code Review Can't See (And Bad Data Always Finds)
Introduction
The bug was not subtle. Once discovered, it looked almost trivial. A tenant could see another tenant’s data. The path had the correct tenant. The authenticated user had the correct tenant. The UI never sent the wrong value. The controller even looked clean in review.
The problem was the query to the database.
Somewhere between the service call and the repository, the code was filtering by tenantId from the path instead of the tenantId resolved from the auth context. Both values were present. Both looked reasonable in isolation. The query was just trusting the wrong one. Nobody noticed in code review because the diff was ordinary: a new endpoint, a new DTO, a service call, a repository method. The reviewer was reading the intended flow. A single bad request, one where the tenantId was random, found the bug in seconds.
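A minimal sketch of that shape of bug, with invented names - both values are present, both look valid, and only one is trustworthy:

// Hypothetical service method; Invoice, invoiceRepository, and the
// exception type are stand-ins, not the real system's names.
public List<Invoice> listInvoices(UUID pathTenantId, UUID authTenantId) {
    // The buggy version filtered by the caller-controlled path value:
    //   return invoiceRepository.findByTenantId(pathTenantId);

    // The path value may confirm the auth context, never replace it.
    if (!pathTenantId.equals(authTenantId)) {
        throw new AccessDeniedException("tenant mismatch");
    }
    return invoiceRepository.findByTenantId(authTenantId);
}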
That’s the difference between code review and bad-data testing. Code review inspects what the code appears to do. Bad data asks what the system actually accepts.
Both are useful, and they answer different questions.
Reviewers read code through intent
A reviewer is usually trying to answer several questions at once: Is this understandable? Does it fit the design? Are there obvious security problems? Will this be painful to maintain? Did the author miss an edge case?
That is already a lot.
Most code review happens under imperfect conditions. The reviewer has meetings in ten minutes. The diff is larger than it should be. The service has conventions that aren’t written down. The author says they tested it locally. The reviewer knows the frontend “always” sends a certain shape. The repository method name sounds trustworthy.
So the reviewer reads the code as it was meant to work.
Bad-data testing starts from a different angle. It doesn’t know what the UI normally sends. It doesn’t respect the domain language. It doesn’t care that the field is required in the form, or that the enum only has four values in the dropdown. It sends the request anyway.
That’s why malformed payloads, weird strings, missing fields, duplicate identifiers, and inconsistent object graphs find bugs that review misses. They attack the boundary between what the code assumes and what the system actually enforces.
That boundary is usually where the interesting bugs live.
The diff is not the boundary
Code review usually starts from the diff. Bad-data testing starts from the API surface.
A pull request might show a small change:
@PostMapping("/accounts/{accountId}/transfers")
public TransferResponse createTransfer(
        @PathVariable UUID accountId,
        @RequestBody CreateTransferRequest request) {
    return transferService.create(accountId, request);
}
The review will focus on whether the endpoint is named well, whether the service owns the right logic, whether the DTO matches the API spec, whether the response is consistent.
Bad data asks different questions:
- What happens if request.accountId also exists and doesn’t match the path?
- What happens if the amount is 0, negative, or 999999999999999999999999?
- What happens if the currency is valid JSON but not a supported business value?
- What happens if the description is 100 KB?
- What happens if the recipient account belongs to another tenant?
- What happens if the same idempotency key is reused with a different body?
Some of these are visible in the diff, but it’s very hard to think of all of them, every time.
The code may look correct locally while being wrong at the boundary. The bug isn’t in a suspicious line. It’s in the relationship between fields, layers, defaults, and assumptions.
This is one reason code review is a weak tool for finding input-handling bugs. A reviewer sees the changed code. A request exercises the whole path: HTTP parsing, framework binding, validation annotations, custom validators, mappers, authorization checks, service logic, database constraints, transaction boundaries, logging, metrics, and error handling.
The diff is only one slice of that.
Bad data finds disagreement between layers
Many production bugs are caused by layers quietly disagreeing.
The API spec says a field is required. The DTO allows it to be null. The database column is nullable because the migration was easier that way. The service assumes it’s present because the UI always sends it. The reporting job later treats null as “unknown.” The analytics export treats null as “false.”
No single line looks outrageous.
Send the field as missing and the system tells you which layer was actually in charge.
This happens constantly with validation. In one system I worked on, the API accepted a list of external references. The
OpenAPI schema had maxItems: 50. The frontend never sent more than 20. The backend DTO had no size constraint. The
database insert was batched, but another downstream call was not. A payload with 5,000 references didn’t fail
validation - it generated a long sequence of downstream calls, then timed out, then retried, then created duplicate
noise in an integration queue.
A reviewer might have spotted the missing @Size(max = 50). But the problem only became obvious when the request
crossed the whole system. The failure wasn’t “forgot annotation.” The failure was that the limit existed in a document
and in a UI, but not at the boundary that mattered.
Another example: date ranges. A search endpoint accepts from and to. The UI sends the first day of the month and
today. The controller accepts both as strings. The service parses them into local dates. The repository converts them to
timestamps. The database stores UTC. Around midnight, some users see transactions disappear from the last day of the
range.
In review, each line looks reasonable. The bug is in the conversion chain. Send a range that crosses a daylight-saving
boundary, or a to date equal to midnight, and suddenly the hidden policy becomes visible.
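The chain is easy to compress into a few lines. A sketch using the JDK’s time API (the zone is arbitrary), showing how the same to date becomes two different instants depending on which zone a layer silently assumes:

import java.time.*;

LocalDate to = LocalDate.parse("2024-10-27"); // DST ends in much of Europe

Instant asUtc    = to.atStartOfDay(ZoneOffset.UTC).toInstant();
Instant asBerlin = to.atStartOfDay(ZoneId.of("Europe/Berlin")).toInstant();

// asUtc    = 2024-10-27T00:00:00Z
// asBerlin = 2024-10-26T22:00:00Z
// Two hours of transactions just moved in or out of the range.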
It’s hard to infer runtime behaviour from a code review.
The most dangerous inputs are valid-looking
People hear “bad data” and think of obviously malformed requests: broken JSON, wrong content type, missing braces, strings where numbers should be.
Those are useful. They mostly test the outer wall.
The more interesting bugs come from data that is syntactically valid and semantically awkward.
A negative amount is easy to reject if someone remembered the validation. But what about an amount with more decimal places than the currency supports?
{
  "amount": "10.999",
  "currency": "EUR"
}
The JSON is valid. The field type might be valid. The value might even parse cleanly into BigDecimal. The bug appears
later: one layer rounds, another truncates, another rejects, another logs the original value, and the reconciliation
report has a one-cent ghost.
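The guard is small once someone decides it belongs at the boundary. A sketch using only the JDK (the method name is mine):

import java.math.BigDecimal;
import java.util.Currency;

// Reject excess precision at the edge, instead of letting one layer
// round and another truncate later.
static void requireValidScale(String rawAmount, String currencyCode) {
    BigDecimal amount = new BigDecimal(rawAmount);      // "10.999"
    int allowed = Currency.getInstance(currencyCode)    // EUR -> 2
                          .getDefaultFractionDigits();
    if (amount.stripTrailingZeros().scale() > allowed) {
        throw new IllegalArgumentException(
                "amount exceeds the precision " + currencyCode + " allows");
    }
}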
A string of 80 KB is valid JSON. It may be valid according to a schema that only says type: string. It becomes a
problem when it’s copied into logs, indexed into search, displayed in an admin UI, included in an error email, or passed
to a third-party API with a smaller limit.
A duplicated field can be valid enough for a parser to accept:
{
  "role": "user",
  "role": "admin"
}
Different parsers handle this differently. Some keep the last value. Some keep the first. Some reject it if configured strictly. An intermediate gateway may parse one value for policy checks while the service sees another. That’s not theoretical parser trivia - it’s exactly the kind of thing that becomes a security issue when layers disagree.
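Which behaviour you actually have is checkable in a few lines. With Jackson, for instance, last-wins is the default and strictness is opt-in:

import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Map;

String json = "{\"role\": \"user\", \"role\": \"admin\"}";

// Default: the last value quietly wins.
Map<?, ?> lenient = new ObjectMapper().readValue(json, Map.class);
// lenient.get("role") -> "admin"

// Opt-in: duplicates become a parse error instead of a silent choice.
ObjectMapper strict = new ObjectMapper()
        .enable(JsonParser.Feature.STRICT_DUPLICATE_DETECTION);
strict.readValue(json, Map.class); // throws JsonParseException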
A request body with both path and body identifiers is another common valid-looking bug:
PUT /users/123/email
{
  "userId": "456",
  "email": "new@example.com"
}
The shape is fine. The question is which ID wins. If the answer isn’t explicit, the system will eventually choose for you - usually in a mapper or ORM merge operation nobody was thinking about during review.
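Making the precedence rule explicit is cheap. A sketch in the style of the earlier controller, with invented names, that treats a contradicting body ID as a client error rather than something a mapper resolves:

@PutMapping("/users/{userId}/email")
public ResponseEntity<Void> updateEmail(
        @PathVariable String userId,
        @RequestBody UpdateEmailRequest request) {
    // The body ID may agree or be absent; it may never win silently.
    if (request.userId() != null && !request.userId().equals(userId)) {
        return ResponseEntity.unprocessableEntity().build();
    }
    userService.updateEmail(userId, request.email());
    return ResponseEntity.noContent().build();
}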
Experienced engineers know this category of bug. It rarely looks dramatic in code. It looks like “reasonable input” plus “unclear ownership.”
Review is bad at finding missing hostility
Code review tends to reward clarity. And that’s good - clear code is easier to maintain.
But input boundaries need a certain amount of hostility. Not theatrical paranoia. Just a refusal to trust data because it arrived through a route that normally behaves.
The frontend is not a security boundary. Generated clients are not a security boundary. OpenAPI documentation is not a security boundary. Internal services aren’t automatically trustworthy because they’re internal. Message queues aren’t clean just because the producer is “ours.”
These statements are familiar, but teams still violate them in small ways.
A service consumes events from another service and assumes the enum is known. Then a new enum value is deployed by the producer before the consumer. The consumer crashes or parks messages indefinitely.
An admin endpoint assumes only the admin UI calls it. Then someone writes a script for support and sends partial payloads the UI never produces.
A mobile app keeps an older client version alive for six months. The backend starts assuming a new field exists because the web app has sent it for weeks.
A batch import endpoint assumes CSV columns are trimmed. One vendor sends a non-breaking space. The value looks empty in the UI but is not empty in code.
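Some of these have mechanical fixes. The unknown-enum case, for example, can be handled in Jackson by mapping anything unrecognized to an explicit member and routing it somewhere visible, rather than crashing or parking messages:

import com.fasterxml.jackson.annotation.JsonEnumDefaultValue;
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

enum PaymentStatus {
    PENDING, SETTLED, FAILED,
    @JsonEnumDefaultValue UNKNOWN // catches values this consumer predates
}

ObjectMapper mapper = new ObjectMapper()
        .enable(DeserializationFeature.READ_UNKNOWN_ENUM_VALUES_AS_DEFAULT_VALUE);
// A new producer value like "REVERSED" now deserializes to UNKNOWN,
// which the consumer can count, alert on, and handle deliberately.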
Bad-data testing is useful because it encodes a little hostility into the development process. It asks: pretend the caller is confused, outdated, unlucky, inconsistent, or running a different stack. What still holds?
Agents make review cheaper; they don’t fix boundaries
This doesn’t stop being true because agents now write code, generate tests, and comment on pull requests.
If anything, the problem gets easier to miss. Generated code often looks ordinary - the controller is thin, the DTO has annotations, the service has the expected shape, the tests call the endpoint with a plausible payload. An agent reviewing the change may even suggest a missing null check or a more consistent error message. That’s very useful, but it’s not the same thing as proving the boundary.
An agent can write the implementation and the tests from the same assumption. If the generated endpoint trusts a body ID, the generated test may trust it too. If the implementation treats unknown JSON fields as harmless, the generated tests may never send unknown fields. If the service assumes the path tenant and body tenant always match, the tests may only cover the case where they do. The generated code is coherent, but the assumption is still untested.
The useful response is to change what we ask agents to produce: hostile examples, conflicting identifiers, replayable failures, and regression tests. Don’t only ask an agent to review the diff. Ask it to attack the boundary:
- Generate requests the UI would never send
- Make path, query, body, headers, and authenticated context disagree
- Mutate OpenAPI examples into negative cases
- Try valid JSON that violates business rules
- Produce replayable curl commands
- Turn each real failure into a regression test
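Concretely, one such test against the earlier transfers endpoint might look like this - a sketch using JUnit 5 and Spring MockMvc with static imports elided, where the auth helper and constants are assumptions, not a real API:

@Test
void transferRejectsAccountFromAnotherTenant() throws Exception {
    mockMvc.perform(post("/accounts/{accountId}/transfers", OTHER_TENANTS_ACCOUNT)
                .with(authenticatedAs(TENANT_A_USER)) // hypothetical helper
                .contentType(MediaType.APPLICATION_JSON)
                .content("{\"amount\": \"10.00\", \"currency\": \"EUR\"}"))
           .andExpect(status().isNotFound()); // not 403: don't leak existence
}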
This is also where purpose-built tooling earns its place. Asking an agent to generate negative cases ad hoc is better than nothing, but it’s inconsistent: the quality depends on the prompt, the model’s mood, and whether anyone thought to ask. A tool that reads your OpenAPI spec and systematically applies the same adversarial playbooks on every run gives you something an agent conversation doesn’t: a repeatable quality gate you can put in CI and forget about.
The move that matters: from “does the code look right?” to “what request proves this assumption is enforced?”
Code review - human or automated - can follow the intended story. Bad-data testing verifies whether the running system actually enforces it.
Some bugs are too boring for humans
There’s a class of bugs that humans are bad at finding because they’re tedious.
Null handling. Empty strings. Whitespace-only strings. Very long strings. Arrays with zero items. Arrays with one item.
Arrays with too many items. Unknown enum values. Duplicate query parameters. Parameters with different casing. Strings
with trailing spaces. Unicode normalization. Numbers at the edge of precision. 0. -1.
A reviewer can look for these, but manually simulating every boring input across every endpoint is not a good use of human attention.
Machines are good at being boring.
A simple negative test generator can throw the same family of awkward values at things repeatedly. It doesn’t get tired. It doesn’t assume “we probably handle that.” It doesn’t skip the endpoint because the diff looks straightforward. It also doesn’t understand your business, so it will produce noise. That’s the trade-off.
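A sketch of that boring machine as a parameterized JUnit test - the endpoint and field are stand-ins, and the static imports are elided:

static Stream<String> awkwardStrings() {
    return Stream.of(
            "",                  // empty
            "   ",               // whitespace-only
            " trailing ",        // untrimmed
            "x".repeat(100_000), // very long
            "caf\u00E9\u00A0",   // unicode plus a non-breaking space
            "0", "-1");          // numbers where a name is expected
}

@ParameterizedTest
@MethodSource("awkwardStrings")
void displayNameNeverCausesA500(String value) throws Exception {
    mockMvc.perform(post("/users")
                .contentType(MediaType.APPLICATION_JSON)
                .content("{\"displayName\": \"" + value + "\"}"))
           .andExpect(status().is(not(500))); // a low bar, but enforced every run
}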
Automation is useful here because it can handle the repetitive disrespect, while humans stay focused on intent, design, and consequences.
For API work, I think about bad input in three layers:
The first is shape violations: wrong type, missing required field, malformed JSON.
The second is boundary values: empty, huge, min/max, too many items, too much nesting.
The third is semantic contradictions: path ID differs from body ID, valid user from wrong tenant, supported currency
with invalid precision, date range where from > to.
The third category is where experienced engineers usually find the expensive bugs. It’s also the category generic tools understand least - you have to teach the tests something about the domain.
That’s where code review can actually help bad-data testing. Reviewers can ask: what are the contradictions this endpoint must reject? Then tests can preserve those decisions.
The false comfort of validation annotations
Validation annotations make code look safer than it is.
public record CreateUserRequest(
        @NotBlank String name,
        @Email String email,
        @Size(max = 50) String displayName) {
}
Annotations handle local field rules. Most of the bugs I’ve seen live in relationships:
- startDate must be before endDate
- currency determines allowed decimal precision
- accountId must belong to the authenticated tenant
- status transitions depend on current persisted state
- idempotencyKey can be reused only with the same request body
A request can pass every annotation and still be wrong.
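The relationship rules need an explicit home. One honest option is a plain check where the annotations stop - a sketch with invented names, not a prescription for where validation must live:

// Cross-field and cross-resource rules no field annotation can see.
void validate(CreateBookingRequest r, UUID authenticatedTenant) {
    if (!r.startDate().isBefore(r.endDate())) {
        throw new ValidationException("startDate must be before endDate");
    }
    if (!accountRepository.belongsTo(r.accountId(), authenticatedTenant)) {
        // Respond as "not found" - don't confirm the account exists.
        throw new ResourceNotFoundException(r.accountId());
    }
}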
Teams usually learn this late. Validation is a set of decisions distributed across parsing, schema checks, field checks, cross-field checks, authorization, persistence constraints, and business invariants. If those decisions are scattered, review becomes harder. Bad data will find the gaps, but it may find them in production.
A useful heuristic: for any endpoint that mutates state, write down the invariants that involve more than one field or more than one resource. Those are the cases code review is likely to discuss abstractly and tests should make concrete.
Code review sees names. Bad data sees behavior.
Good names can mislead reviewers.
getCurrentUserAccount() sounds safe. validateRequest() sounds complete. isAuthorized() sounds final. toDomain()
sounds boring. save() sounds harmless.
Names compress behavior - that’s what makes them useful. They also hide questions.
What does validateRequest() validate? Shape? Business rules? Authorization? Cross-field constraints? Does it normalize
before validating? Does it trim strings? Does it reject unknown fields? Does it treat empty string as missing?
Bad data expands the name back into behavior.
This matters especially with framework magic. Spring, Jackson, Hibernate, ASP.NET, Rails, Django, Express middleware, API gateways, validation libraries, ORMs - all of them can make code shorter and easier to read while moving behavior out of sight.
Can the binder set fields you didn’t expect? Are unknown JSON properties ignored? Are empty strings coerced to null? Are numbers coerced to strings? Is case-insensitive enum matching enabled? Are duplicate fields rejected? Are validation groups applied consistently? Does the same DTO get used for create and update even though the rules differ?
These details rarely come up in review unless the team has been burned before.
Bad-data tests surface those details earlier, before they become production surprises.
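Most of those defaults are also pinnable. With Jackson, for example, the answers become deliberate choices instead of inherited ones (note that some frameworks, such as Spring Boot, relax the unknown-properties default):

import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

ObjectMapper mapper = new ObjectMapper()
        // Unknown JSON properties: fail loudly instead of ignoring.
        .enable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES)
        // Duplicate fields: parse error instead of silent last-wins.
        .enable(JsonParser.Feature.STRICT_DUPLICATE_DETECTION)
        // Missing primitives: error instead of a silent 0 or false.
        .enable(DeserializationFeature.FAIL_ON_NULL_FOR_PRIMITIVES);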
Code review is still relevant
It would be silly to argue that code review doesn’t matter.
Bad-data testing can tell you that a request with mismatched IDs succeeds. It can’t tell you the cleanest way to model ownership. It can tell you that error handling is inconsistent. It can’t decide your error taxonomy. It can tell you an endpoint accepts a huge string. It can’t choose the right product limit.
Code review is good at things execution doesn’t surface on its own: whether the abstraction is pulling its weight; whether the change puts logic somewhere the team can actually maintain it; whether a shortcut will become a migration problem in a year; whether the naming matches the domain; whether the test expresses the right invariant; whether the failure mode is acceptable for this system.
Review also catches bugs that bad-data testing may miss: concurrency assumptions, transaction boundaries, caching mistakes, lifecycle issues, operational hazards, and changes that are technically correct but hostile to future work.
The limitation is narrower than people think: code review is not enough to validate input boundaries. It’s too dependent on imagination, context, and whatever energy the reviewer has that afternoon.
A good review process should assume reviewers will miss boring edge cases and runtime disagreements - and then arrange for tests and tools to be annoying on purpose.
Practical heuristics that help
The most useful habit is to stop asking only “Is this request valid?” and start asking “Who decided that?”
For every API boundary, try to locate the decision point for each of these:
- Which fields are required, and where is that enforced?
- Are unknown fields rejected, ignored, or stored?
- Are strings trimmed, normalized, length-limited, and encoded consistently?
- Which identifier wins when the same concept appears in path, query, body, token, or session?
- Are tenant and authorization checks based on trusted context or caller-provided data?
- Are cross-field invariants tested with concrete examples?
- Do error responses have a stable shape?
- Can a failed request be reproduced from logs without exposing sensitive data?
People might find this really boring. But it catches real bugs.
For pull requests that add or change endpoints, I use a small rule: at least one test should send a request the UI would never send. Not a random one, a meaningful one.
For example:
- Create an order with a valid product ID from another tenant
- Update user 123 with body userId = 456
- Send amount = 10.999 for a currency with two decimal places
- Send a valid enum value that isn’t allowed in the current state
- Reuse an idempotency key with a different payload
- Send a date range where both dates are valid but the interval is not
- Send an empty array when the field is required to contain at least one item
These tests are cheap compared to the bugs they prevent. They also document assumptions better than comments do.
Another heuristic: make automated tests produce artifacts a human can use. If an API fuzz test fails, save the request, response, seed, relevant headers, and a curl command. A failure that can’t be replayed is just a rumor with a stack trace.
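A sketch of what replayable can mean in practice - the paths, fields, and method name are all invented:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Persist enough to replay the failure exactly, including a
// ready-to-paste curl command.
void saveFailureArtifact(long seed, String method, String url,
                         String body, int status) throws IOException {
    String curl = "curl -i -X " + method
            + " -H 'Content-Type: application/json'"
            + " --data '" + body.replace("'", "'\\''") + "' " + url;
    Path dir = Files.createDirectories(Path.of("target/fuzz-failures"));
    Files.writeString(dir.resolve("seed-" + seed + ".txt"),
            "seed=" + seed + "\nstatus=" + status + "\n\n" + curl + "\n");
}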
One more: treat error responses as testable behavior. If bad data produces a useless error, that’s not just a test failure, it’s an API design smell. The caller did something wrong, but the API still owns the explanation.
People underestimate how often “impossible” data becomes possible. Old clients exist. Mobile apps stay installed. Internal scripts bypass UI constraints. Support teams patch data directly. Queues replay old messages. Vendors send files with invisible characters. Feature flags create combinations nobody tried locally. Partial deploys happen. Migrations leave old rows behind. Browser extensions and proxies modify requests. Humans paste from Excel.
People also underestimate how much damage a “handled” bad request can do. A request doesn’t need to return 500 to be a problem. It can return 200 and store nonsense. It can return 400 but log a massive payload. It can return 409 but still publish an event. It can fail validation after doing expensive work. It can create metrics cardinality problems. It can poison a cache. It can trigger retries. It can expose in an error message that a resource exists.
A system that rejects bad input too late may still be doing most of the damage. That’s another thing bad-data testing reveals: not just whether the request fails, but where it fails and what it touches before failing.
A malformed request should ideally die near the boundary. A semantically invalid request should die before mutation. An unauthorized request should not perform a lookup that leaks existence unless that’s a deliberate product decision. A request that exceeds size limits should not be fully parsed, logged, and indexed before someone says no.
Review can reason about these things. Tests can prove them often enough to keep everyone honest.
Where bad-data testing makes things worse
Bad-data testing has limits.
It can become noisy. It can produce failures nobody cares about. It can encourage teams to play whack-a-mole with payloads instead of fixing the validation model. It can spend too much time on syntactic weirdness and not enough on domain contradictions. And it can give a false sense of safety if it only checks for 500 responses.
“No crash” is a low bar.
An API that returns 400 for everything may be stable and still unpleasant to use. An API that rejects malformed JSON may still allow cross-tenant access. An API that survives fuzzing in staging may still fail when a downstream service has different limits.
The best bad-data tests aren’t purely random. They’re guided by the contract, the domain, and the places where layers might disagree.
For a payments API: test money precision, idempotency, state transitions, duplicate callbacks, and account ownership. For a search API: test pagination stability, invalid filters, large limits, sorting with equal values, and date boundaries. For a multi-tenant SaaS API: test every place an ID can be swapped. For an event-driven system: test unknown enum values, missing optional fields, old event versions, and replayed messages. And so on.
The review comment I wish I saw more often
A useful review comment is not only: “Can we add validation here?”
A better one is: “What bad request proves this assumption?”
That question changes the conversation.
If the assumption is “the path account belongs to the current user,” prove it with a request using someone else’s account. If the assumption is “the body ID is ignored,” prove it with a conflicting body ID. If the assumption is “the limit can’t be huge,” prove it with a huge limit. If the assumption is “clients can safely handle our errors,” prove it with the actual error shape. If the assumption is “this event is backward compatible,” prove it with the previous version of the consumer.
This keeps code review from becoming a place where edge cases are merely mentioned. Mentioned edge cases have a way of disappearing. Executable edge cases tend to stick around.
Bad data is not smarter than reviewers
Bad data is useful for a boring reason: it keeps trying cases people do not have the time or patience to simulate by hand.
It tries the thing a tired reviewer didn’t simulate. It crosses layers the diff didn’t show. It exercises framework defaults nobody remembered configuring. It turns “the client will never do that” into a request sitting in your logs with a timestamp and a correlation ID.
Agents don’t change that. They can make some of this cheaper, and that’s useful - generating negative cases, mutating examples, producing replay commands, turning failures into regression tests. But if they only read the diff and generate tests from the same assumptions as the implementation, they inherit the same blind spot, just in a faster loop.
The practical lesson is to give code review a narrower job. Let it challenge assumptions, ownership, and design. Then make those assumptions executable with tests that attack the boundary.
A useful split is this: review should name the assumptions, automation should attack them repeatedly, and bad-data tests should prove which ones are actually enforced.
Code review is where intent gets inspected; bad-data testing is where intent gets confronted.