Vibes Are Not a Metric
Developers believe AI makes them 20% faster. A controlled experiment shows it makes them 19% slower. The organizations winning with AI are the ones that stopped guessing and started measuring.
Something strange is happening in software development. Engineers finish a coding session with an AI assistant and walk away convinced they just had the most productive afternoon of their careers. The code flowed. The AI handled the boilerplate. It felt fast.
But when researchers at METR (a nonprofit AI safety lab) ran a controlled experiment, giving 16 experienced developers 246 real tasks and measuring actual completion times, the numbers told a different story. Developers using AI tools were 19% slower. Not faster. Slower. And yet those same developers believed they were 20% faster. Expert forecasters (economists, ML researchers) predicted 38-39% speedups. The gap between what people feel and what the data shows is 39 percentage points wide.
This is not an isolated finding. Google's 2024 DORA report, surveying over 39,000 tech professionals, found the same pattern: 75% felt more productive with AI while system-level performance metrics declined. At the organizational level, MIT research finds 95% of generative AI pilots deliver no measurable business impact. The problem is not AI itself. The problem is that most organizations have replaced measurement with vibes.
What the Data Shows
- The perception gap is the defining finding of the AI productivity era. METR measured a 19% slowdown while developers perceived a 20% speedup. Expert forecasters predicted 38-39% gains. Similar perception gaps recur across studies.
- Benchmarks systematically overstate production readiness. 38% of AI-generated pull requests passed automated tests, but 0% were mergeable without human intervention. SWE-Bench overestimates agent capabilities by over 50% for some models.
- 95% of AI pilots deliver no measurable business impact. Independently reported across MIT research, healthcare, and enterprise contexts. Only 6% of organizations qualify as AI high performers.
- Agent reliability collapses under repeated use. A 70% single-trial agent drops to 34% when it must succeed three times in a row. Production systems need consistency, not occasional brilliance.
- 83% of AI evaluations measure only technical metrics. A review of 84 academic papers found just 15% combine technical and human-centered dimensions. Only 5% include any longitudinal tracking.
- Organizations with structured measurement achieve real gains. Booking.com achieved 16-31% throughput lifts across 3,500 engineers. Apollo.io reached 40% ticket deflection in two weeks. The difference is measurement, not technology.
- Gartner predicts 40% of agentic AI projects will be canceled by 2027. The primary causes: escalating costs, unclear business value, and inadequate risk controls. Meanwhile, 61% of senior leaders face growing pressure to prove AI ROI.
Your Brain Is Lying to You
The METR study, conducted between February and June 2025, is the most rigorous measurement of AI-assisted developer productivity to date. Sixteen experienced open-source developers completed 246 real tasks across mature repositories, with researchers analyzing 143 hours of screen recordings.
The psychological mechanism is revealing. Quentin Anthony, a study participant who trains AI models, described the surprise: "At the end, they sent me the early draft of the paper and said, 'We think you'll be very surprised.' And I was." His explanation: "People overestimate speed-up because it's so much fun to use AI. We sit and work on these long bugs, and then eventually AI will solve the bug. But we don't focus on all the time we actually spent." Security researcher Marcus Hutchins put it more bluntly: "LLMs inherently hijack the human brain's reward system... LLMs give the same feeling of achievement one would get from doing the work themselves, but without any of the heavy lifting."
The perception gap extends to leadership. A Multiverse study found 61% of leaders believe AI is fully implemented across their organization, compared to only 36% of workers. Executives claim AI saves 8 hours weekly; workers report under 2 hours. When everyone is guessing, optimism wins by default.
69% of developers continued using AI tools after the study ended, despite the measured slowdown. The perception bias is so strong that objective evidence does not change behavior.
Passing Tests, Failing Production
If human self-assessment is unreliable, can we trust automated metrics instead? The evidence says: not the way we currently use them.
METR's August 2025 follow-up crystallized the problem. Claude 3.7 Sonnet achieved a 38% pass rate on maintainer-written test cases. But when those passing pull requests were manually reviewed, zero percent were mergeable. Every single one had at least three categories of failure: missing functionality, inadequate tests, missing documentation, formatting violations, or code quality issues. Each "passing" PR still required about 26 minutes of human work.
The benchmark ecosystem has the same flaw at scale. Microsoft researchers showed that SWE-Bench overestimates agent capabilities by over 50% when formal GitHub issues are rewritten as realistic user queries. Agents score over 70% on SWE-Bench Verified but only 23.3% on the harder SWE-Bench Pro. NIST has documented widespread benchmark gaming, where agents access walkthrough information or exploit scoring system gaps. When Perplexity agents were blocked from HuggingFace, their accuracy dropped 15 percentage points.
A systematic analysis of 12 major benchmarks found three critical blind spots: cost is unmeasured despite 50x variations (from $0.10 to $5.00 per task for similar accuracy), reliability is untested despite dramatic drops under repeated use, and operational requirements are absent despite 37% lab-to-production gaps. A review of 84 academic papers confirmed the structural problem: 83% of evaluations focus on technical metrics while only 30% consider human factors and 30% consider economics.
A demo only has to work once. Production code has to work a million times without breaking. A 70% single-trial agent drops to 34% on three consecutive runs and 25% on eight, according to research from Anthropic and Sierra.
Build the Test Before the Agent
The antidote to vibe-based development has a name: eval-driven development. Build your evaluations before your agent capabilities. Run baseline-measure-compare loops. Never ship on gut feeling.
The major AI labs are converging on this approach. Anthropic recommends starting with 20-50 simple test cases drawn from real failures. OpenAI explicitly warns against "it seems like it's working" as an evaluation strategy. Vercel built its v0 product on eval-driven development, drawing parallels to two decades of web search quality engineering. AWS's DevOps Agent team uses a baseline-measure-compare loop designed to protect against confirmation bias.
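The baseline-measure-compare loop these teams describe can be sketched in a few lines. Everything below is a hypothetical stand-in, not any vendor's API: the cases, the toy agents, and the `score` function simply illustrate the shape of the loop, with each grader standing in for a check drawn from a real past failure.

```python
# Minimal sketch of a baseline-measure-compare eval loop.
# All names here (EvalCase, score, the toy agents) are illustrative
# stand-ins, not any lab's actual tooling.

from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # grader derived from a real past failure

def score(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of cases the agent passes; the number tracked over time."""
    return sum(case.check(agent(case.prompt)) for case in cases) / len(cases)

# Two toy cases standing in for the 20-50 real-failure cases recommended above.
cases = [
    EvalCase("Return the string OK", lambda out: out.strip() == "OK"),
    EvalCase("Return a number", lambda out: out.strip().isdigit()),
]

baseline = score(lambda p: "OK", cases)                           # old setup
candidate = score(lambda p: "OK" if "OK" in p else "42", cases)   # new setup
print(f"baseline={baseline:.2f} candidate={candidate:.2f}")
# Ship only if the candidate beats the baseline on the same fixed cases.
```

The point of the fixed case list is that "it seems like it's working" is replaced by a number that moves, or fails to move, against a recorded baseline.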
In practice, this means layered grading. Teams at Descript evaluate along three dimensions: "don't break things, do what I asked, and do it well." LLM-based judges now achieve 80-90% agreement with human evaluators at a fraction of the cost ($0.01-0.10 per assessment). Databricks' "Grading Notes" technique, which gives short per-question annotations rather than comprehensive rubrics, significantly outperforms fixed prompts for domain-specific judging.
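A grading-notes-style judge can be sketched as follows. This is a hedged illustration of the general pattern, not Databricks' implementation: `call_llm` is a hypothetical stand-in for any chat API, and `fake_llm` is a deterministic toy so the sketch runs without a real model.

```python
# Sketch of an LLM-as-judge using a short per-question grading note.
# `call_llm` is a hypothetical stand-in for any chat-completion API.

def judge(question: str, answer: str, grading_note: str,
          call_llm) -> bool:
    prompt = (
        "You are grading an answer.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Grading note (what a correct answer must contain): {grading_note}\n"
        "Reply PASS or FAIL."
    )
    return call_llm(prompt).strip().upper().startswith("PASS")

# Toy deterministic "model" so the sketch is runnable without an API:
# it passes the answer if the answer section mentions idempotency.
def fake_llm(prompt: str) -> str:
    answer_section = prompt.split("Answer:")[1].split("Grading note")[0]
    return "PASS" if "idempotent" in answer_section else "FAIL"

ok = judge(
    "Why retry with the same request ID?",
    "Because the endpoint is idempotent, retries are safe.",
    "Must mention idempotency.",
    fake_llm,
)
print(ok)  # True for this toy answer
```

The grading note is the short, question-specific annotation; everything else in the prompt stays fixed, which is what makes the judge cheap to write per question.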
The key distinction is between two metrics that sound similar but diverge dramatically. pass@k asks: "Can this agent ever succeed?" pass^k asks: "Can it reliably succeed?" At 10 attempts, the first can approach 100% while the second falls to near zero. Any organization deploying agents without measuring consistency is flying blind.
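Under an independence assumption (each attempt succeeds with the same probability p), the divergence between the two metrics is simple arithmetic:

```python
# Why pass@k and pass^k diverge for an agent with per-attempt success
# probability p, assuming independent trials (a simplifying assumption).

def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k attempts: 'can it ever succeed?'"""
    return 1 - (1 - p) ** k

def pass_hat_k(p: float, k: int) -> float:
    """Probability of k consecutive successes: 'can it reliably succeed?'"""
    return p ** k

p = 0.70  # the 70% single-trial agent from the text
print(f"pass@10 = {pass_at_k(p, 10):.4f}")   # ~1.0000: near-certain to succeed once
print(f"pass^3  = {pass_hat_k(p, 3):.4f}")   # 0.3430: the 34% three-in-a-row figure
print(f"pass^10 = {pass_hat_k(p, 10):.4f}")  # ~0.0282: near zero
```

At p = 0.7 and k = 10, pass@k is effectively 100% while pass^k is under 3%, which is the gap between a compelling demo and a dependable production system.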
Measurement Pays Off
The case for measurement is not theoretical. Organizations that measure systematically achieve results that vibe-driven teams cannot.
Booking.com deployed AI tools to 3,500 engineers and, using DX Core 4 data, identified that daily active users had 16% higher throughput, eventually reaching a 31% improvement from baseline. Their critical insight: developers using AI on twelve or more days per month were significantly more effective. This shifted their goal from adoption to daily adoption. Apollo.io's AI agent achieved 40% ticket deflection in its first two weeks, double the typical first-month benchmark, because they measured from day one.
The DX framework now offers 4 million benchmark samples across hundreds of organizations, providing empirical baselines across speed, effectiveness, quality, and impact. A one-point improvement on their index saves approximately 10 minutes per week per engineer.
But these successes remain exceptions. McKinsey's 2025 State of AI report found 88% of organizations use AI in at least one function, but only 33% have scaled beyond pilots and only 6% qualify as high performers. The top performers are 3x more likely to have senior leaders who actively champion AI, and nearly half of them spend 20% or more of their digital budgets on AI. The differentiator is organizational commitment to measurement, not the technology.
The top 20% of AI implementations achieve 500%+ ROI through superior change management and measurement. The bottom 80% struggle. Same technology, different discipline.
Closing the Loop
If vibes are the disease, feedback loops are the treatment. The evidence points to a specific intervention: track what you estimate, compare it to what happens, and let the delta teach you.
Reflexion, a framework for verbal self-reflection published at NeurIPS 2023, demonstrated that language agents can improve by reviewing their own performance with no weight updates. It achieves 91% pass@1 on HumanEval, compared to GPT-4's 80%. The mechanism: agents verbally reflect on task feedback, store reflections in episodic memory, and use them as a semantic gradient for subsequent trials. The same principle applies to human-AI workflows. Record the estimate. Record the actual. Let the gap do the teaching.
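The reflect-store-retry loop can be sketched as below. This is a schematic of the pattern, not the Reflexion authors' code: `act`, `check`, and `reflect` are hypothetical stand-ins (in the real framework they are model calls and environment feedback), and the toy demonstration simply shows one failed attempt being corrected by a stored reflection.

```python
# Sketch of a Reflexion-style loop: act, get feedback, verbally reflect,
# and condition the next attempt on stored reflections. `act`, `check`,
# and `reflect` are hypothetical stand-ins for model/environment calls.

def run_with_reflection(task, act, check, reflect, max_trials=3):
    memory = []  # episodic memory of verbal reflections
    for trial in range(max_trials):
        output = act(task, memory)      # attempt, conditioned on past reflections
        ok, feedback = check(output)    # environment or test feedback
        if ok:
            return output, trial + 1
        memory.append(reflect(task, output, feedback))  # store the lesson;
        # note: no weight updates anywhere, only text in memory
    return None, max_trials

# Toy demonstration: the first attempt fails, the reflection fixes it.
def act(task, memory):
    return "correct" if memory else "wrong"

def check(output):
    return (output == "correct", "expected 'correct'")

def reflect(task, output, feedback):
    return f"Attempt '{output}' failed: {feedback}"

result, trials = run_with_reflection("toy task", act, check, reflect)
print(result, trials)
```

The human analogue in the next sentences is the same loop with a spreadsheet as the episodic memory: record the estimate, record the actual, reread the gap before the next task.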
Task complexity is the primary driver of estimation error. Thoughtworks' experiments found boilerplate code (API contracts, adding fields) yields 30-50% time savings, while business logic yields only 10-40% with high probability of manual adjustment. As Thoughtworks put it: "What's difficult for the developer is also difficult for Copilot." Anthropic's own estimation research confirms a compression bias: Claude overestimates short tasks and underestimates long ones (Spearman rho = 0.44). Knowing this pattern tells you where to focus your tracking.
Production tooling is catching up. Braintrust converts production traces into test cases with one click. Basis achieves 2-3x lower error rates by routing human review only to low-confidence outputs. Even primitive approaches work: a Portland startup with six engineers cut production bugs by 35% in three months using nothing more than a Google Sheets scoring system.
But sustainability is a hard constraint. Research shows 68% of developers report burnout after four months of constant scoring. The answer is not to score everything. It is to score strategically: focus measurement on tasks where estimation error is highest and business impact is largest. The perception gap closes not by measuring more, but by measuring the right things.
A simple estimate-versus-actual log, maintained per task type, will teach you more about AI productivity in your codebase than any vendor benchmark. Trust is a calibration problem: once users fall into over-trust or under-trust patterns, escape requires new evidence. If no one collects that evidence, beliefs never update.
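Such a log needs almost no machinery. The sketch below is one possible shape, assuming nothing beyond the standard library; the field names and the sample numbers are illustrative, not from any cited tool or study.

```python
# A minimal estimate-versus-actual log, kept per task type.
# Field names and sample numbers are illustrative only.

from collections import defaultdict
from statistics import mean

log: dict[str, list[tuple[float, float]]] = defaultdict(list)

def record(task_type: str, estimated_min: float, actual_min: float) -> None:
    log[task_type].append((estimated_min, actual_min))

def calibration(task_type: str) -> float:
    """Mean ratio of actual to estimated minutes; >1 means you underestimate."""
    return mean(actual / est for est, actual in log[task_type])

record("boilerplate", 30, 20)      # AI helped: finished faster than estimated
record("boilerplate", 40, 30)
record("business_logic", 60, 95)   # AI "help" that needed manual rework
record("business_logic", 45, 70)

print(f"boilerplate:    {calibration('boilerplate'):.2f}")
print(f"business_logic: {calibration('business_logic'):.2f}")
```

After a few weeks the per-type ratios become the new evidence: a ratio below 1 for boilerplate and above 1 for business logic would mirror the Thoughtworks pattern, and either way the delta, not the vibe, updates the belief.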
The Stakes Are Real
Healthcare provides the starkest warning. A diagnostic AI achieving 95% benchmark accuracy was relegated to limited advisory roles because no one measured trust and workflow integration. That same system dropped to 70% accuracy on real patient data. A healthcare CEO described the pattern: "Their AI pilot worked beautifully in controlled settings, but the moment they tried to use it with real patients and real doctors, everything fell apart."
Across the industry, hidden supervision labor can exceed $2 million annually in a mid-sized deployment, with 59% of frequent AI users spending 30+ minutes daily supervising outputs. That time is neither tracked nor recognized in productivity expectations. It is invisible cost, driven by the absence of measurement.
Gartner predicts over 40% of agentic AI projects will be canceled by end of 2027. The cause is not that AI does not work. The cause is that organizations cannot tell whether it works, because they never built the instruments to find out.
Key Events
- Feb 2025 METR's developer productivity RCT begins, running through June 2025
- Mar 2025 METR publishes time horizon paper showing 7-month doubling time for AI capabilities
- Jun 2025 Gartner predicts 40% of agentic AI projects will be canceled by 2027
- Jul 2025 METR publishes RCT results: 19% slowdown with AI, 20% perceived speedup
- Aug 2025 METR follow-up: 38% pass rate, 0% mergeable PRs
- Sep 2025 Systematic review of 84 papers reveals 83% focus on technical metrics alone
- Oct 2025 Microsoft researchers demonstrate SWE-Bench overestimates by over 50%
- Nov 2025 McKinsey State of AI 2025: 88% adoption, only 33% scaling beyond pilots
- Dec 2025 NIST documents widespread cheating on agent evaluations
- Jan 2026 Anthropic and AWS publish eval-driven development guides
- Feb 2026 Survey of 6,000 executives: 89% saw no productivity change from AI over three years
References
Research & Official Reports
- METR Developer Productivity RCT (Jul 2025)
- METR Follow-up: PR Quality Analysis (Aug 2025)
- METR Time Horizon Paper (Mar 2025)
- METR Time Horizon 1.1 Update (Jan 2026)
- Anthropic: Demystifying Evals for AI Agents
- OpenAI: Evaluation Best Practices
- AWS: DevOps Agent Lessons
- Vercel: Eval-Driven Development
- NIST: Cheating on Agent Evaluations
- McKinsey: State of AI 2025
- Gartner: Agentic AI Project Cancellations (Jun 2025)
- DX Core 4 Framework
- Sierra: Benchmarking AI Agents
Feedback Loops & Calibration
- Reflexion: Language Agents with Verbal Reinforcement Learning (NeurIPS 2023)
- Anthropic: Estimating Productivity Gains (Compression Bias)
- Braintrust: AI Observability Tools 2026
- Basis: Calibrated Confidence Scoring
- Maxim: Closing the Feedback Loop
- Developer Burnout in Feedback Loop Scoring
- PLOS ONE: Over-Trust and Under-Trust Calibration
- Thoughtworks: GitHub Copilot Task Complexity Experiment
Journalism & Analysis
- Diginomica: AI Tools Slow Down Experienced Developers
- Okoone: AI Coding Tools and DORA 2024 Findings
- The Register: 89% No Productivity Change (Feb 2026)
- Cerbos: The Productivity Paradox of AI Coding Assistants
- LeadDev: DX Core 4 Aims to Unify Frameworks
- Langfuse: LLM-as-a-Judge Documentation
- Databricks: Grading Notes for LLM Judges
- Philipp Schmid: pass@k vs. pass^k Explained