Trust in Judgment
Your team shipped a clean refactor and the review turned into an interrogation. Here's why.
You’re reviewing a pull request in the part of the system nobody touches unless they have to.
Conditionals stacked on conditionals. Edge cases that only exist because production taught the team a few painful lessons. The code itself is the only thing that really defines what “correct” means. The PR title says “Refactor for readability,” but the change is in a spot that makes reviewers tense as soon as they see the file list: request routing, pricing logic, permissions. The kind of place where “refactor” can quietly turn into “behavior changed.”
The author used AI to help untangle it. Then the comments start showing up. “Did you write this?” “This looks AI-generated.” “Walk me through it.”
That can be a reasonable ask. But it often lands like a challenge to competence, and now both people are stuck. The reviewer is trying to keep the system safe. The author is trying to keep their dignity. Neither one is wrong.
When there isn’t a shared way to show what was thought through and what was checked, reviewers start looking for signs of AI instead of evaluating the change. The review stops being about correctness and becomes about authorship.
The thinking behind a change isn’t visible, and neither is the checking. That’s the gap, and AI makes it wider.
Why this matters now
AI adoption gets treated like a tooling rollout: new capability, a few guidelines, some internal sessions, wait for usage to climb. For a lot of engineers, though, AI changes what competence looks like day to day and what gets questioned. It hits people in their sense of worth and standing, which makes it a leadership problem.
Uneven adoption usually gets chalked up to training or willingness. Look closer and the pattern is different: the people experimenting stop explaining what they did, the skeptics get louder because that stance is professionally safe, and people stop learning from each other.
Two engineers can ship a big refactor quickly. One may have a clear model of the system and the risks, and the other may not. The difference isn’t the output, it’s the thinking behind it.
When people resist quietly
When teams stall on AI adoption, it almost never shows up as open pushback.
More often it looks like polite agreement and quiet avoidance.
This starts earlier than leaders think: the moment people stop being able to predict how their work will be judged.
The bar shifts depending on what went wrong most recently and who’s doing the review. When people can’t predict what counts as good work, they hold back.
Some of it is simpler than social dynamics. An engineer has a four-hour ticket tied to a ship date. They’ve done this kind of work before and know how long it takes. AI might be faster, but they got burned by a hallucination a few months ago, and four hours doesn’t leave room to find out whether the tool has improved since then. They pick the known path and finish on time.
People stop sharing what they’re learning, and initiative shrinks. Big refactors and bolder proposals stop showing up because the social downside is too large.
Leadership can make this worse by overselling the tools, issuing mandates that remove choice, or shutting down dissent in a way that forces real concerns underground.
Most of what looks like “resistance” is people trying to avoid a situation where a normal learning curve gets treated like a character flaw.
The fix starts by steadying two things: the standard, and what happens when someone raises a concern.
But once people feel the bar moving around, they start wondering what the organization is going to do with all the extra output.
Fear of being the next one out
There’s a specific hesitation that shows up even on healthy teams. It’s not refusal so much as the sense that the rules are changing in real time, and the new ones haven’t been worked out yet.
When output gets easier to produce and nobody says what the new expectations are, people fill in the blanks themselves. They usually fill them with worst-case assumptions.
The way out is to make it clear what you reward, so people can predict how they’ll be evaluated. Teams settle down when the signal is consistent. Good thinking, clear ownership, reliable changes, and good taste still matter whether or not AI helped produce them.
When the tool gives you different answers
Teams lose confidence in AI-assisted work fastest when they see inconsistency in places where they expect the same answer every time.
You ask the same question twice and get two different answers. Both sound plausible. One is wrong, and wrong in a way that doesn’t surface until production.
Inconsistency in fuzzy tasks (e.g. naming, brainstorming, drafting) is fine. It’s a problem when the task has a right answer and the tool gives you a confident wrong one.
The common response is to keep asking the same thing in slightly different ways until the answer sounds right. That feels like progress, but “sounds right” isn’t the same as “is right.” Teams end up chasing a satisfying explanation rather than checking behavior.
After someone gets burned a couple times, they either keep asking until it finally lines up, or they decide they have to double-check everything personally. Either way, you pay for it.
Someone tries AI on a real task, the code looks right, and it fails at runtime because it called a method that doesn’t exist. Months later the models are better, but the engineer tried AI and moved on. The model they used in October isn’t the model available in February, but people don’t usually distinguish between versions the way they would between releases of a database or a framework. It’s all just “AI,” and the one bad experience stands in for the whole category.
A wrong answer can sound convincing, so you need a few steady questions to check against: what must stay true, what proof exists, and what’s the most likely way this is still wrong. When that’s habitual, inconsistent answers stop shaking people.
Pride, ownership, and the “nobody really checked this” problem
The tension gets sharpest when AI touches the parts of the job people quietly take pride in: naming, structure, clear explanations, refactors that feel clean and elegant. They don’t always feel that as fear. More often it comes out as judgment.
Someone ships a PR with unusually good naming, or a refactor reads clean in a way that feels almost too polished. Without shared expectations, the reaction shifts from “is this correct?” to “is this real?”
AI-assisted work is often “almost right,” and there’s a difference between opening a blank file and making something versus opening someone else’s draft and spending your afternoon asking “is this actually right?” Both require skill, but when the work feels like auditing, pride erodes fast.
Engineers invest in naming and structure because it helps the next person who touches the code understand what they’re looking at. If AI-assisted code passes every test and does what it’s supposed to do but the names are wrong and the files are disorganized, how much does that matter? When tests and guardrails are strong enough, that investment starts looking like convention rather than correctness. Whether it still matters depends on how much of the future maintenance is done by humans versus tools.
Then credit feels rigged. If it’s good, it was AI. If it’s bad, it was me. After a while, taking ownership starts to feel like sticking your neck out.
When ownership isn’t clear, the conversation drifts from whether the change is safe to who’s doing “real work.”
If someone can explain the calls they made, what couldn’t break, and how they checked it, it’s their work regardless of how the first draft got produced. That bar has to be steady and written down, not something people have to guess at.
When mistakes aren’t embarrassing, people share. “Here’s what I tried.” “Here’s what could still be wrong.” That only happens when the team treats a weird AI answer as normal, not as a confession.
Beyond PRs
The same dynamic plays out anywhere work gets evaluated. A design doc that reads too smoothly triggers the same suspicion as a too-clean refactor. “This reads like an LLM.” “Where’s your thinking?” The fix is the same: surface the constraints, the alternatives you didn’t take, and the assumptions that have to hold. Once those are visible, nobody cares how polished the prose is.
Performance conversations are trickier. “How much of this was you versus the tool?” sounds reasonable, but it lands as an inspection. Once people feel judged by effort signals, they pull back. The conversation stays fair when it sticks to the work: what calls did you make, where could they have gone wrong, and how did you verify them?
Litmus test
Pull up the last few PRs on code that matters. Did the review comments focus on the change, or on how it was produced? Check whether anyone asked “walk me through it” as a genuine question versus a challenge. Look at who’s picking up the risky refactors. Is it shrinking to the same two or three people? Ask yourself when someone last shared a failed AI experiment in a channel without hedging.
If those feel uncomfortable to answer, that’s the signal.
What comes next
Individual judgment and checking only scale so far. As soon as AI helps a team move faster, two engineers can ship two different “reasonable” solutions to the same problem in the same week, both well-written and well-tested, and the system starts pulling in two directions.
Next is what changes when the work isn’t yours to personally verify.
A practical takeaway
Pick one PR this week on code that matters. Ask the author to add a short note: what changed, what had to stay true, and how they checked it. See what happens to the review.