Trust in Work You Didn't Personally Do
Output is up. Confidence isn't. Seven ways engineering leaders react, and which ones backfire.
The team is shipping more than last quarter. The backlog is shrinking, release notes are getting longer, and by most measures things look healthy. But incidents have a different texture lately. Fixes that used to take an afternoon drag into the next day because nobody is quite sure what changed or why. Handoffs between teams have gotten touchier. None of this shows up on a dashboard.
The worry isn’t that people are slacking. It’s whether the work is actually safe to ship. Output is rising faster than confidence.
When people can’t easily show what they thought through and what they checked, the conversation drifts. Instead of asking “what did you verify?” people start asking “who wrote this?” Once that happens, trust stops being about the work and starts being about the person.
The problem you’re dealing with
A PR lands in the review queue: “Refactor: clean up legacy pricing logic.” It’s 400 lines, it touches the code that calculates what customers actually pay, and the author used Copilot to help untangle years of spaghetti. The tests pass and the diff looks reasonable.
But the pricing code accumulated weird edge cases over years. Some of the spaghetti was probably bugs. Some of it was probably intentional workarounds that nobody documented. The person who wrote it originally left two years ago. The tests cover the cases someone thought to write tests for, but they don’t cover the cases nobody remembered existed.
The refactor looks clean, but clean isn’t the same as correct. The AI didn’t know why the spaghetti existed. It just made it neater. The reviewer is stuck wondering: did we just remove a landmine, or did we just arm one?
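One way reviewers de-risk this kind of refactor is a characterization test: record what the old code actually does, bugs and workarounds included, and diff the new version against that record. A minimal sketch, with every name hypothetical and the "legacy" and "refactored" functions stand-ins for the scenario above:

```python
# Characterization ("golden master") testing: pin down what the legacy
# pricing code actually does before trusting a clean rewrite.
# All names (price_legacy, price_refactored) are hypothetical.

def price_legacy(quantity, unit_price, customer_tier):
    # Stand-in for the old spaghetti: an undocumented workaround
    # gives tier-0 customers a floor price on tiny orders.
    total = quantity * unit_price
    if customer_tier == 0 and 0 < total < 5.0:
        total = 5.0  # nobody remembers why, but billing depends on it
    return round(total, 2)

def price_refactored(quantity, unit_price, customer_tier):
    # The "clean" rewrite: same formula, workaround silently dropped.
    return round(quantity * unit_price, 2)

def characterize(fn, cases):
    """Record the function's output for every input case."""
    return {case: fn(*case) for case in cases}

cases = [(1, 2.0, 0), (1, 2.0, 1), (10, 3.5, 0), (0, 9.99, 2)]
old = characterize(price_legacy, cases)
new = characterize(price_refactored, cases)

# Any mismatch is a behavior change the diff alone wouldn't reveal.
diffs = {c: (old[c], new[c]) for c in cases if old[c] != new[c]}
print(diffs)  # the tier-0 small-order case surfaces here
```

The point isn't that the old behavior was right. It's that the comparison turns "did we just arm a landmine?" from a feeling into a short list of concrete behavior changes someone can evaluate on purpose.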
Why this got harder recently
First, there are more changes. AI tools make it easier to produce code, which means the surface area of what needs to be understood, reviewed, and maintained is growing faster than most teams expected. One study of 10,000+ developers found individual output jumped dramatically, with developers completing 21% more tasks and merging 98% more pull requests, but the downstream systems couldn’t absorb it. Review queues ballooned 91%. PR sizes grew 154%. The increase was real, but the infrastructure to handle it wasn’t.
Second, the range of quality is wider. The best work is better than ever. But it used to be that bad code looked bad. Now it can look plausible, even clean, and still be wrong in ways that don’t show up until production.
Third, the old comfort cues don’t cover as much ground. Leaders used to rely on familiarity. You knew someone’s style, you knew how they thought, and you could glance at a diff and have a decent sense of whether it was solid. When a lot of the code comes through a tool, the fingerprints you’re used to reading aren’t there anymore.
How this usually goes sideways
Leaders tighten control to reduce surprise
When the bar for “safe” is fuzzy, attention becomes the substitute. Leaders insert themselves into more decisions, not because they want to, but because their own attention is the most reliable signal they have. Leadership time becomes the bottleneck, and teams start optimizing for presenting well rather than building well.
Making the bar explicit changes this. If teams know what “safe” means for the work that matters most, they can hit that bar without waiting for someone to check.
Autonomy without a shared bar
When teams have autonomy but no shared definition of what “done” means, one team’s “ready to ship” is another team’s “not even close.” Incidents hit, and the instinct is to pull autonomy back. The org swings from freedom to clampdown, and nobody ends up happy.
A shared bar for risky work is the missing piece. What has to be true before something ships, and how the team knows it’s been met.
Late escalations from unclear ownership
“Can you weigh in on this?” pings arrive after the work is already half done. By the time leadership gets pulled in, positions have hardened and the options are worse than they would have been at the start.
The recurring friction points are usually knowable ahead of time: shared interfaces, permissions, anything that changes how money moves, anything that’s hard to roll back. Making ownership explicit for that short list keeps most decisions from climbing the ladder.
When AI use becomes politically charged
When people can’t predict how AI use will be perceived, they get cautious. Disclosure becomes selective, teams get labeled “fast” versus “careful,” and the conversation shifts from whether the work is good to how it was produced.
Clarity helps. Evaluate work on outcomes and decisions. Make “AI helped here” unremarkable context for reviews and debugging, the same way someone would mention which library they used or which docs they referenced.
Deferring to automated checks
When automated review tools generate enough feedback, it’s easy to defer to them. The bar drifts from “this is solid” to “this won’t get flagged,” and human judgment gets thinner across the org without anyone noticing.
For risky work, stating what was checked and what could still be wrong keeps human judgment in the loop. A sentence or two is enough.
Plausible-but-wrong changes
Code that looks fine, reviews fine, and passes tests can still break something important. When this happens a few times, confidence drops across the org, even on work that had nothing to do with the original problem.
What are the most likely ways high-risk changes fail, and what checks would catch them? That list is usually short enough to be practical.
Banning AI from legacy code
Sometimes the ban is a policy decision: AI for prototypes and new features, not for core systems. More often engineers sort themselves. AI gets used freely on greenfield work where the context is self-contained, and barely touched on older codebases where the context is tangled and a bad change is hard to trace. Leadership may not know the split exists because it wasn’t announced.
Either way, it blocks the upside where teams need it most. Legacy code is where unwritten assumptions and undocumented edge cases live. When those aren’t made explicit, AI-assisted changes can look perfectly fine and still break something important.
Higher-risk areas deserve stronger checks. Lower-risk areas deserve a lighter touch. Writing down the key “must stay true” rules in high-risk areas means correctness is checkable, not dependent on whoever remembers how it used to work.
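Those written-down rules can also run as code, so "must stay true" gets checked on every change instead of living in someone's memory. A hedged sketch, assuming a hypothetical pricing function and invariants invented for illustration:

```python
# Executable "must stay true" rules for a high-risk area.
# compute_price and the invariants below are hypothetical examples.

def compute_price(quantity, unit_price, discount):
    subtotal = quantity * unit_price
    return round(subtotal * (1 - discount), 2)

# Each rule is written down by whoever still knows why it must hold.
INVARIANTS = [
    ("price is never negative",
     lambda q, u, d: compute_price(q, u, d) >= 0),
    ("zero quantity costs nothing",
     lambda q, u, d: compute_price(0, u, d) == 0),
    ("a discount never raises the price",
     lambda q, u, d: compute_price(q, u, d) <= compute_price(q, u, 0)),
]

def check_invariants(cases):
    """Return every (rule, input) pair that fails."""
    failures = []
    for name, rule in INVARIANTS:
        for case in cases:
            if not rule(*case):
                failures.append((name, case))
    return failures

cases = [(1, 10.0, 0.0), (3, 2.5, 0.5), (0, 99.0, 0.1)]
print(check_invariants(cases))  # [] means every rule held
```

A refactor, AI-assisted or not, that breaks one of these fails loudly with the rule's name attached, which is exactly the context the original spaghetti never carried.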
What this looks like in practice
A director asks her team to walk through a recent incident. The PR was reviewed, the tests passed, it was approved. But when she asks what specifically the reviewer verified, the answer is vague. The gap isn’t effort or intent. It’s that nobody can point to what was actually checked. She starts requiring a short note on high-risk PRs: what the change is for, what could go wrong, what was verified. Within weeks, escalations drop because teams are catching more on their own before the work reaches her.
Cross-team seams are where this gets harder. Two teams building on a shared API both ship on time, both pass their tests, and something still breaks in production. One team had refactored their side with AI to clean up legacy code. The refactor was tidy, but it changed a subtle timing behavior the other team depended on. The original behavior was never documented. It was just how the old code happened to work. Generated code is locally correct, but it doesn’t carry the context of why an older constraint existed. Now changes to shared interfaces get a review from a consuming team, not just the team making the change.
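One lightweight form of that consuming-team review is a contract test the consumer owns: it pins the provider behavior the consumer actually depends on, including behavior the provider never documented, so a tidy refactor that changes it fails before production. A sketch under stated assumptions, with all names hypothetical:

```python
# Consumer-owned contract test for a shared interface.
# The consuming team pins the behavior it relies on, even behavior
# the providing team never documented. All names are hypothetical.

def provider_lookup(key, cache):
    # Provider implementation detail the consumer came to depend on:
    # missing keys return None rather than raising.
    return cache.get(key)

def consumer_contract(lookup):
    """Checks the consuming team runs against any provider version."""
    cache = {"a": 1}
    results = {
        "known key returns its value": lookup("a", cache) == 1,
        "unknown key returns None, not an error": lookup("zzz", cache) is None,
    }
    return [name for name, ok in results.items() if not ok]

print(consumer_contract(provider_lookup))  # [] means the contract holds
```

Running the consumer's checks in the provider's CI makes the dependency explicit: the provider can still change the behavior, but now it's a conversation instead of an incident.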
When calibration debates turn into arguments about whether the fast team or the careful team is doing better work, the argument is usually about method, not outcomes. Outcomes tell you more: did the work cause problems downstream, and were issues caught before customers saw them?
What comes next
The changes here work within a single team. But they hold up better when the org around you is designed to support them.
Next is what it takes to build trust into the organization itself.
A practical takeaway
Ask your team in your next 1:1s: is there anywhere you’re holding back because you’re not sure how the work will be received? The answers usually point straight at which failure modes are already in play.