Cross-model code review: ask one AI to grade another

There's a quiet pattern emerging on senior dev teams that I haven't seen documented yet. It's worth naming because it's effective, cheap, and not obvious until someone shows you.

A senior developer I was working with had just finished building a complex audit logging system. He'd used Claude with a developer skill — multi-thread, careful, structured. He ran Claude's built-in self-review twice. Both passes came back clean. By his own assessment he was 90% confident in the code.

Then he opened a different thread, with Codex this time, and asked it to review what Claude had written.

Codex found three real issues.

One was a retry-logic bug — the implementation would fail silently after the first batch in certain queue configurations. One was an access-control gap that the original requirements had implicitly assumed but never made explicit. One was an edge case in idempotency where two events arriving in the same millisecond could produce duplicate records.

All three were real. All three would have shipped. All three were missed by Claude reviewing Claude.

The pattern has a name now: cross-model code review.

Why it works

The intuition is straightforward: models trained on different data with different fine-tuning objectives have different blind spots. When a model reviews its own output, it's checking against patterns it was trained to consider correct. It's looking in the places it knows to look.

A model trained on different data looks in different places.

The retry bug in the example above was something Claude had written confidently because the pattern matched something it had seen in training. Self-review didn't flag it because the pattern looked right by Claude's standards. Codex, evaluating it against different training patterns, noticed that this specific implementation deviated from the conventions it had absorbed for queue handling.

Neither model is "better." They're differently calibrated. That difference is the leverage.

How it's different from ensemble or voting

This isn't ensemble voting where multiple models produce candidates and you pick the most-agreed-upon answer. It's also not the reflexion loop where a model criticizes its own output (though that has value too).

Cross-model code review is adversarial:

The reviewing model is given the producing model's output as a fait accompli.
It's asked to find what's wrong, not to suggest improvements.
The reviewer doesn't try to rewrite — only to flag.
The developer makes the call on which flags are real.

It's the same shape as a code review between two human engineers, except the engineers are two different AI systems with different training distributions.

What the workflow looks like

The senior developer's actual process:

Build in one thread. Claude with developer skill. Spec is loaded as input. Code is produced incrementally with reviewable diffs.
Self-review in the same thread. Run the developer skill's code-review prompt twice. Fix what surfaces. Get to a confident state.
Open a new thread with a different model. Codex in this case. New context. Fresh start.
Hand over the code and ask for review. Not "improve this code." Specifically: "find problems."
Surface issues in a list. Each issue should be specific: what's wrong, where, why.
Developer triages. Some flags are real. Some are stylistic differences between the models' training. Some are out-of-scope. The developer decides.
Fix the real ones in the build thread, or open a new build thread to address them.

Cost: roughly 30 minutes of extra work per significant component. Hit rate: 2-4 real issues per review in the examples I've seen.

When it adds the most value

Code that's hard to test or review by running. If you can't easily simulate the failure mode (race conditions, error handling, retry logic, integration code), cross-model review is one of the few practical ways to surface issues before production.
Code at scope boundaries. Anywhere a component meets the outside world — APIs, databases, queues, third-party services — the conventions differ between models' training. Cross-model review surfaces these mismatches.
Security-relevant code. Different models have different sensitivities about security patterns. Use this asymmetry deliberately.
Long-context output. When the producing model has built something across many turns, its self-review is operating on the same potentially-drifting context. A reviewer in a fresh thread sees the artifact, not the conversation.

When it doesn't help

Throwaway code. If the code is going to be deleted in a week, don't review it.
Code you can fully test by running. Tests catch what cross-model review would have flagged, and you get test artifacts as a bonus.
Code in a deeply familiar domain. Sometimes the developer knows more than either model. Cross-model review can produce false positives where the developer's domain knowledge correctly overrides both models.
Code that's already passed a thorough human review. Two AI reviews on top of a human review usually doesn't add much.

The deeper pattern

Cross-model code review is part of a broader pattern I've been calling the two-thread pattern. Different models have different strengths, sequenced in separate threads, with deliberate handoffs between them.

In the two-thread pattern, the sequence is: spec model -> build model. Cross-model review extends it: spec model -> build model -> reviewer model. Three threads, three artifacts (spec, build, review notes), three models with complementary strengths.

The cost is small. The value, on senior production code, is genuine and repeatable.

How to start

If you're working with Claude regularly, try this on your next non-trivial completion:

Finish what you'd normally consider done.
Open a Codex thread. Paste the relevant files. Ask: "Review this code. Find what's wrong. Be specific."
Read the response. Sort the flags into "real", "stylistic", and "out of scope."
Fix the real ones.

If you're working with Codex regularly, do it in reverse: ask Claude to review.

The first time you do it, expect to be surprised by what you find. The second time, expect to start designing your work around it. By the fifth time, it's a habit.

Based on observations from a recent Mindtastic Sprint engagement, May 2026. The pattern was developed independently by a senior developer who reported using it as standard practice for production code. Customer identity withheld.

Was this helpful?