The Invisible Line Between Vibe Coding and Professional AI Engineering

Sebastian Sussmann
2 days ago

Why software development needs both — and how to know which one you’re doing

Our clients ask us this question regularly: why does some AI development take a few hours while other work takes days or weeks? Both use AI. Both are faster than traditional development. But the effort is completely different — and that is confusing when everything gets called “AI.”

The honest answer is that not all AI development is the same. The same tools can be used in fundamentally different ways, and different needs require different approaches. A quick prototype for an internal demo is not the same as a production system for a regulated client, even if both were built with Claude Code. Using AI is a given in both cases, and it is faster than doing things the old way. But “faster” does not mean “the same.”

We see the same pattern every month. Someone builds something with AI in a week. It works. It looks good. They call it “AI-powered development.” Then a different team spends ten weeks writing specifications, designing architecture, feeding structured prompts into the same tools, reviewing every line of output, and shipping a production system with governance, audit trails, and a Definition of Done. They also call it “AI-powered development.”

Same tools. Same models. Completely different things. The first is vibe coding — fast, intuitive, and often built by someone who cannot explain what database the app is running on. The second is professional AI engineering — structured, governed, and owned by people who understand every decision the system makes.

Axel Molist, CEO of We UC, captured this perfectly in a recent video comparing two real cases: a non-developer who shipped a full contractor management app in a week without knowing his own tech stack, and a team that wrote documentation first, then used AI to build a production SaaS over two months. Both called it “AI coding.” His conclusion: the line between the two is invisible to most people right now.

We agree. And after eighteen months of structured AI work across fourteen teams, we have a clear view of where that line sits, why it matters, and what to do about it.

That invisible line is the subject of this article. Not because vibe coding is the enemy — it is not — but because the inability to see the line is creating real risk in real organizations. And because the line is not where most people think it is.

What vibe coding actually is — and is not

The term was coined by Andrej Karpathy in February 2025. His description: “There is a new way to code that I call ‘vibe coding,’ where you fully lean into the vibes, embrace the exponentials, and forget that code even exists.” Collins Dictionary named it 2025 Word of the Year. By April 2026, the ACM’s Technology Policy Council had issued a formal TechBrief warning about its risks in production environments.

The key to the definition is not who writes the code. It is whether anyone understands it. Simon Willison put it precisely: if an LLM wrote every line of your code, but you have reviewed, tested, and understood it all, that is not vibe coding — that is using an LLM as a typing assistant.

Vibe coding is when you accept what the AI returns without comprehending what it does. You prompt, you see output, you check if it works, and you ship. The code is a black box that happens to pass a demo. That is fine for a weekend project. It is not fine for software that handles money, health data, or other people’s information.

The data on what happens when vibe-coded software reaches production is now substantial. A December 2025 analysis by CodeRabbit of 470 open-source GitHub pull requests found that AI co-authored code contained 1.7 times more major issues and 2.74 times more security vulnerabilities than human-written code. A separate audit of 1,645 web applications generated by the vibe coding platform Lovable found that 10% had critical vulnerabilities exposing user data. The ACM TechBrief summarized the pattern: vibe coding often skips core engineering practices that ensure systems are secure, reliable, and maintainable.

None of this means vibe coding has no value. It means vibe coding has a scope — and that scope is smaller than most people assume.

Two modes of working with AI

*Both are valid — the right choice depends on what you’re building and how long it needs to last.*

The distinction is not binary. It is a spectrum, and the right position on that spectrum depends on what you are building. A quick demo or throwaway prototype? Vibe code it. An internal tool or MVP? Start with vibes, but enforce code review. A production system for an enterprise client? You need structured prompts, schemas, repeatable output, and human review at every stage.

*AI is a tool. The result depends on what you build and how.*

The research supports this gradient. A 2025 study (Borg et al., “Echoes of AI”) found that AI-assisted prototyping was roughly 30% faster with no measurable quality loss for throwaway code. But the METR randomized controlled trial found that experienced developers using AI on production-grade open-source projects were actually 19% slower, even though they had expected a 24% speedup and afterward still believed AI had made them about 20% faster. Perception and reality diverge precisely at the point where the code needs to last.

Dan Shapiro’s Five Levels — and where the line sits

In January 2026, Dan Shapiro, CEO of Glowforge, published a five-level taxonomy of AI-assisted programming, modeled on the NHTSA’s levels of driving automation. The framework has become the industry’s common vocabulary for a reason: it makes the invisible line visible.

Level 0 — Spicy Autocomplete. No AI assistance beyond tab-complete. You are typing everything.

Level 1 — The Coding Intern. AI handles discrete, scoped tasks. You write the important stuff.

Level 2 — The Junior Developer. AI performs multi-file changes. You feel free. This is where 90% of “AI-native” developers live right now.

Level 3 — The Waymo with a Safety Driver. AI is the senior developer. You are the reviewer. Your life is diffs. For many people, this feels like things got worse. Almost everyone tops out here.

Level 4 — The Robotaxi. You are not a developer anymore. You are a PM. You write specs, craft skills, review plans, leave for 12 hours, and check if the tests pass.

Level 5 — The Dark Factory. Nobody reads the code. Nobody reviews the code. The goal of the system is to prove that the system works. StrongDM’s three-person team operates here, with two rules: code must not be written by humans, and code must not be reviewed by humans.

The critical insight is not the taxonomy itself. It is the trap at every level. Each level feels like you are done. You are not done. And the jump from Level 2 to Level 3 is not incremental — it requires a fundamentally different way of working. Not better tools. Better structure.

Level 2 is vibe coding with a safety net. Levels 3 and above are professional AI engineering. The line sits between them. And crossing it is uncomfortable, because Level 3 often feels worse before it feels better: more review, more diffs, more discipline.

The line is not between “using AI” and “not using AI.” It is between accepting what AI produces and owning what AI produces.

Where we draw the line — AI Responsibility Levels

We have spent eighteen months running early-adoption squads across fourteen teams, building enablement structures, and measuring what actually works. Our AI Adoption Leadership Handbook defines four AI Responsibility Levels that govern how every team uses AI:

Level 1 — AI Assisted. AI accelerates execution. Human fully owns thinking and decisions. Code completion, boilerplate, commit messages. Low risk.

Level 2 — AI Augmented. AI supports thinking, exploration, comparison. Human decides. Bug diagnosis, architecture suggestions, refactoring. Medium risk.

Level 3 — AI Supervised. AI produces work. Human reviews, approves, owns outcome. Feature implementation, test generation, migration scripts. Managed risk.

Level 4 — AI Orchestrated. Humans design systems where AI works continuously; monitor and intervene. CI/CD pipelines, automated QA, agent systems. High oversight.

“You can delegate work to AI. You can never delegate responsibility.” From the Axon Active AI Adoption Leadership Handbook (2026).

A reader paying attention will notice that Shapiro’s framework runs from Level 0 to Level 5 (six levels, despite the name) and ours has four (1–4). This is not an oversight; the two answer different questions. Shapiro asks: how autonomous is the AI? His levels describe capability, from spicy autocomplete to a dark factory where no human reads the code. Our framework asks: who is accountable for what the AI produces? Our levels describe governance: the working agreements, review requirements, and human ownership that apply at each stage of AI autonomy.

The two systems are complementary, not competing. A team can operate at Shapiro’s Level 4 (AI writes all the code, the human reviews plans and checks tests) while governed under our Level 3 (AI Supervised — human reviews, approves, and owns the outcome). What matters is not the level of AI autonomy in isolation. What matters is whether the governance matches the autonomy. StrongDM’s dark factory is fascinating as an extreme case — but for our clients in Swiss banking, European insurance, and regulated enterprises, FINMA, the EU AI Act, and FADP all expect identifiable human accountability. A dark factory is not an option. Our framework is designed for the world where accountability has a name and a face.

Every level inherits all rules from the levels below it. And the baseline rule, from the first moment any AI tool is introduced, is non-negotiable: AI-generated code is reviewed by a developer who understands and owns it.
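To make the inheritance rule concrete, here is a minimal Python sketch. It is an illustration, not an excerpt from the handbook: the names `ResponsibilityLevel`, `RULES_ADDED_AT`, and `rules_for` are hypothetical, and only the baseline rule is quoted from this article. The other rule texts paraphrase the level descriptions above.

```python
from enum import IntEnum


class ResponsibilityLevel(IntEnum):
    """The four AI Responsibility Levels described in this article."""
    AI_ASSISTED = 1
    AI_AUGMENTED = 2
    AI_SUPERVISED = 3
    AI_ORCHESTRATED = 4


# Rules introduced at each level. Only the first entry is the article's
# baseline rule; the rest are illustrative paraphrases (assumptions).
RULES_ADDED_AT = {
    ResponsibilityLevel.AI_ASSISTED: [
        "AI-generated code is reviewed by a developer who understands and owns it",
    ],
    ResponsibilityLevel.AI_AUGMENTED: [
        "A human makes the decision; AI only supports exploration and comparison",
    ],
    ResponsibilityLevel.AI_SUPERVISED: [
        "A human reviews, approves, and owns the outcome of AI-produced work",
    ],
    ResponsibilityLevel.AI_ORCHESTRATED: [
        "Humans monitor continuously running AI systems and can intervene",
    ],
}


def rules_for(level):
    """Return every rule that applies at `level`: its own plus all lower levels'."""
    rules = []
    for lvl in ResponsibilityLevel:  # iterates in definition order, 1 through 4
        if lvl <= level:
            rules.extend(RULES_ADDED_AT[lvl])
    return rules
```

Because the rules accumulate, a team working at Level 3 is still bound by the baseline review rule from Level 1. Moving up a level adds obligations and never removes any.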

Our target for 2026: all teams operating at Level 3, with Level 4 patterns in select contexts under enterprise-grade governance. We run agents at the highest level of autonomy where governance and quality allow. We do not pretend that means the human has left the building.

The harness — why structure is the difference

There is a useful word for what separates vibe coding from professional AI engineering: the harness. Bill Cox, a veteran Silicon Valley engineer who has personally produced more than 240,000 lines of production code while supervising AI, uses the term to describe the structural discipline a human builds around an AI system to keep its speed useful. Without the harness, the same AI that produced 35,000 lines of clean, shipping code in one project produced 58,000 lines of unusable bloat in another. Same model. Same engineer. Different discipline. (We explore the harness concept in detail in our companion article, “AI Changes the Practice. The Scrum Team Stays.”)

Our working agreements are the harness. Five elements, all of them operational in our delivery model:

Design discipline. Clear architectural standards, modular boundaries, and interface contracts that the AI must respect. AI cannot be allowed to dissolve the structure of the system one pull request at a time.

Locked-down components. Security-critical and compliance-sensitive components — authentication, encryption, payments, audit logs — are written by humans, with explicit customer approval required before any AI involvement.

Fake-first integration testing. Build integration tests against fake implementations before the real ones exist. The real code has to conform to the fake — not the other way around.

Resets, not patches. When AI output drifts, the answer is not to patch it. The answer is to reset, prompt more tightly, and regenerate.

Reusable prompts. Version-controlled prompt libraries shared across teams. Not ad-hoc conversations — reproducible, debuggable, transferable.
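Of the five elements, fake-first integration testing is the easiest to show in code. The sketch below is one possible shape, under stated assumptions: `DocumentStore`, `FakeDocumentStore`, and `check_store_contract` are hypothetical names invented for this example, not part of any real project. The fake and the contract check are written first, and any real implementation that AI generates later has to pass the same check unchanged.

```python
from typing import Optional, Protocol


class DocumentStore(Protocol):
    """Interface contract that both the fake and the real backend must satisfy."""

    def put(self, doc_id: str, body: str) -> None: ...

    def get(self, doc_id: str) -> Optional[str]: ...


class FakeDocumentStore:
    """In-memory fake, written before any real backend exists."""

    def __init__(self):
        self._docs = {}

    def put(self, doc_id, body):
        if not doc_id:
            raise ValueError("doc_id must not be empty")
        self._docs[doc_id] = body

    def get(self, doc_id):
        return self._docs.get(doc_id)


def check_store_contract(store):
    """Integration-level expectations that every implementation must meet."""
    store.put("a", "first")
    assert store.get("a") == "first"
    store.put("a", "second")  # a second put overwrites the first
    assert store.get("a") == "second"
    assert store.get("missing") is None
    try:
        store.put("", "x")
    except ValueError:
        pass
    else:
        raise AssertionError("empty doc_id must be rejected")


# The fake passes first; the real store is later run through the same check.
check_store_contract(FakeDocumentStore())
```

The design choice is the direction of conformance. The contract check is fixed before generation starts, so drift in the generated code shows up as a failing check rather than as a quietly rewritten test.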

Vibe coding has no harness. That is the definition. Professional AI engineering is the harness. Same AI, same engineer, different structure — and that structure is the difference between code that ships and code that gets thrown away.
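The locked-down components rule also lends itself to an automated check. What follows is a minimal sketch, not our actual tooling: the path patterns, the function names, and the idea of passing `ai_involved` and `customer_approved` as flags are all assumptions made for illustration. A real setup would derive those flags from commit metadata or a review system.

```python
from fnmatch import fnmatchcase

# Hypothetical repository layout for security-critical components.
LOCKED_PATTERNS = [
    "src/auth/*",
    "src/crypto/*",
    "src/payments/*",
    "src/audit/*",
]


def locked_files(changed_paths):
    """Return the changed paths that fall under a locked-down component."""
    return [
        path
        for path in changed_paths
        if any(fnmatchcase(path, pattern) for pattern in LOCKED_PATTERNS)
    ]


def may_merge(changed_paths, ai_involved, customer_approved):
    """Decide whether a change may merge.

    AI-involved changes that touch locked-down paths are blocked unless
    explicit customer approval has been recorded.
    Returns (allowed, touched_locked_paths).
    """
    touched = locked_files(changed_paths)
    if touched and ai_involved and not customer_approved:
        return False, touched
    return True, touched
```

For example, `may_merge(["src/auth/login.py", "README.md"], ai_involved=True, customer_approved=False)` returns `(False, ["src/auth/login.py"])`, while the same change with recorded approval is allowed through.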

The Cursor FastRender experiment illustrates what happens at scale without it. Two thousand AI coding agents worked in parallel on a shared project for nearly a week, producing over a million lines of code. The developer community called it “AI slop”: code that relied heavily on existing libraries the agents chose themselves, that was difficult to navigate, and that for a long time could not compile on many systems. In response, a single developer built a comparable prototype with one agent in three days: roughly 20,000 lines of clean Rust. As THE DECODER’s Frontier Radar analysis of the incident concluded: a well-designed framework, clear specifications, and human guidance often accomplish more than an agent swarm with a massive token budget. More agents do not mean better results. Structure does.

Why this matters more than people think

*AI does not lie — it guesses. Your job is to catch the bad guesses before they ship.*

AI models do not understand what they produce. They predict the next plausible token. They optimize for output that looks right, not output that is right. They fill knowledge gaps with confident-sounding fabrication. They have no built-in truth checker. Every one of these properties is manageable with structure, in four steps. Ground the model in real data. Write specific, constrained prompts. Treat every output as a first draft. Make human review the truth checker.

Vibe coding skips all four of those steps. That is why 45% of AI-generated code fails basic security tests. That is why a startup had to fully refactor its backend after a year of vibe coding without structured reviews, at a cost of six months. That is why Trend Micro’s March 2026 analysis concluded that vibe coding does not just accelerate development — it accelerates risk.

How fast can that risk materialize? In April 2026, an AI coding agent with unrestricted infrastructure access deleted a production database and all volume-level backups in nine seconds. When asked to explain what happened, the agent listed the specific safety rules it had violated. The incident, reported by founder Jer Crane, went viral across the industry and was discussed on the All-In podcast. The infrastructure provider’s CEO had to personally intervene to recover the data. This is not a hypothetical scenario from a risk assessment. It is what happened to a real company, in production, last month — because the agent had no harness, no locked-down components, and no human gate between an API call and catastrophic data loss.
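A human gate of the kind missing in that incident can be very small. The following Python sketch shows the idea under stated assumptions: the action names, `GatedExecutor`, and `ApprovalRequired` are hypothetical, and the return value stands in for a real infrastructure call. It does not describe the affected company's system or any specific provider's API.

```python
from dataclasses import dataclass, field

# Hypothetical action names; a real system would map these to its own API calls.
DESTRUCTIVE_ACTIONS = frozenset({"drop_database", "delete_volume", "delete_backup"})


class ApprovalRequired(Exception):
    """Raised when a destructive action is requested without a named human approver."""


@dataclass
class GatedExecutor:
    """Runs agent-requested actions, but gates destructive ones behind a human."""

    audit_log: list = field(default_factory=list)

    def run(self, action, target, approved_by=None):
        if action in DESTRUCTIVE_ACTIONS and not approved_by:
            # Record the blocked attempt so it is visible in the audit trail.
            self.audit_log.append(("blocked", action, target, None))
            raise ApprovalRequired(
                f"{action} on {target} needs a named human approver"
            )
        self.audit_log.append(("executed", action, target, approved_by))
        return f"{action}:{target}"  # stand-in for the real infrastructure call
```

With this in place, an agent calling `run("delete_volume", "prod-db")` gets an `ApprovalRequired` error and leaves a "blocked" entry in the audit log. The same call with `approved_by` set to a named person goes through and records who approved it.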

The risk is not that AI produces bad code. The risk is that it produces plausible code faster than teams can validate it. As BCG’s 2026 study of 1,488 US workers found, roughly 14% of AI users report cognitive exhaustion from constantly supervising AI output. After two hundred ‘Accept’ clicks, attention degrades. The AI is now unsupervised whether the human meant it or not.

BetterUp and the Stanford Social Media Lab have a name for what this produces at organizational scale: “workslop” — AI-generated content that looks formally plausible but is substantively thin and requires downstream processing. In their survey of 1,150 US workers, 40% reported receiving such output in the past month. Each incident cost roughly two hours to deal with. What looks like one team’s efficiency reappears as another team’s rework.

Then there is the security dimension. A comprehensive red-teaming study with nearly 2,000 participants and 1.8 million attacks on AI agents achieved a 100% behavioral success rate — every single agent could be compromised. Anthropic’s own numbers show that Claude Opus 4.5 can be cracked via prompt injection in about 30% of cases across ten attack attempts. For regulated industries, these are not acceptable error rates. They are reasons to build the harness deeper, not skip it.

The wave is already here

The wave of vibe-coded applications entering production is not theoretical. It is happening now. Across the industry, service providers are seeing clients arrive with AI-built prototypes that need to become production systems — custom CRMs, internal tools, workflow platforms, all built fast, all missing the engineering that makes them maintainable. Molist describes the same pattern from his own practice: multiple integration requests from customers who have built their own platforms with AI tools, in a single quarter.

For companies like ours that build and maintain software for regulated enterprises, this creates a specific challenge. Our clients in Swiss banking, European insurance, and government face regulatory requirements under FINMA, GDPR, FADP, and the EU AI Act that expect identifiable human accountability for every piece of software in production. Vibe-coded software — where nobody fully understands the code, nobody can explain its behavior under edge cases, and nobody owns its architectural decisions — is a compliance problem before it is a quality problem.

*Our teams carry 5–15+ years of product knowledge. AI skills are added in weeks — not the other way around.*

The answer is not to avoid AI. The answer is to match the approach to the stakes. For prototypes and internal tools, vibe coding is fine — fast, cheap, disposable. For production systems that handle real data, real money, and real regulatory obligations, the harness is not optional.

The industry is converging on this answer

We are not alone in arriving at this picture. At DevDay 2026 in Da Nang, a panel of technology executives and academics examined how AI is reshaping software engineering, team structures, and enterprise strategy. The panel’s conclusions, covered in Vietnam Economic Times (April 2026), were unambiguous: coding is becoming easier and faster, but delivering reliable business outcomes remains a human responsibility.

The same conclusion emerges from multiple independent directions. Practitioner engineers like Bill Cox and Thomas De Vos describe identical structural requirements. AI tooling vendors like Anthropic design their tools around human oversight. Regulators push accountability toward identifiable humans. The ACM’s TechBrief recommends rigorous testing, human oversight, and governance controls. StrongDM, operating at the most extreme end of the spectrum, still requires humans to design systems, write specifications, and architect validation — they just do not require humans to write or read the code itself.

Five different vantage points, same conclusion. The position the field is converging on is not anti-AI. It is pro-structure.

*AI changes HOW we work. Scrum protects WHY and for WHOM we work.*

How to know which side of the line you are on

The questions are straightforward. Answer them honestly:

Can your developers explain the AI-generated code? Not “does it work” — can they explain why it works, what it depends on, and what breaks if the inputs change?

Do you have working agreements for AI usage? Not a policy document that nobody reads — actual working agreements that teams follow, with rules at each level of AI autonomy?

Does your Definition of Done include AI-specific clauses? Review by a senior developer. Security scanning. Provenance tracking. If your Definition of Done is the same as it was in 2023, it is not done.

Do you know what database you are running? This is not a joke. If the person who built the system cannot answer basic architectural questions, the system is vibe-coded — regardless of how sophisticated the prompts were.

Is there a human who owns the output? Not “the team” in the abstract. A specific person who reviewed, understood, and took responsibility for what shipped.

If the answer to any of these is no, you are vibe coding in production. Whether you call it that or not.

Start with vibes. Scale with engineering.

The invisible line is not going away. As AI tools get more capable, the output will get more plausible, the speed will get more seductive, and the temptation to skip the harness will grow. The gap between “it works in a demo” and “it works in production” will widen, not shrink.

The businesses that thrive will be the ones that can do both. Vibe code the prototype on Monday. Ship the production version on Friday with working agreements, senior review, and a Definition of Done that means something. Use AI at the highest level of autonomy where governance allows — and know exactly where that boundary is.

That is not a conservative position. It is the position the field is converging on. And it is the position that lets you move fastest without breaking things that matter.

Start with vibes to explore. Scale with engineering to ship. Production AI demands that you stand on the right side of the line.

Want to see how structured AI engineering works in practice?

Talk to our team about how we organize AI-augmented delivery for regulated enterprises — with the governance that makes speed safe.

Sources and further reading

All sources used in this article, with direct links to the original research and primary documents.

Frameworks and taxonomies

Dan Shapiro — “The Five Levels: from Spicy Autocomplete to the Dark Factory” (January 2026). The five-level taxonomy of AI-assisted programming.

Axon Active — AI Adoption Leadership Handbook (March 2026). Internal document defining AI Responsibility Levels, working agreements, and governance.

Axon Active — AI Sidekicks operating model (April 2026). Internal document governing team-level AI adoption.

Research and analysis

ACM Technology Policy Council — “AI-Assisted Software Development, or Vibe Coding: Benefits and Risks of AI-Driven Software Development” TechBrief (April 30, 2026).

CodeRabbit — Analysis of 470 open-source GitHub pull requests (December 2025). AI co-authored code: 1.7x more major issues, 2.74x more security vulnerabilities.

Borg et al. — “Echoes of AI” (arXiv:2507.00788, 2025). 151 developers; AI-assisted prototyping ~30% faster with no quality loss for throwaway code.

METR — Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity (July 2025). 16 developers, 246 tasks; 19% slowdown, although developers had forecast a 24% speedup and afterward still estimated about 20%.

BCG Henderson Institute — “When Using AI Leads to ‘Brain Fry’” (Harvard Business Review, March 2026). 1,488 US workers; ~14% report cognitive exhaustion from AI oversight.

Trend Micro — “The Real Risk of Vibecoding” (March 2026). Analysis of how vibe coding accelerates uncontrolled software change.

Practitioner sources

Axel Molist — “I found the line splitting AI coding” (YouTube, April 2026). The contractor app vs dental practice SaaS comparison; the invisible line concept.

Bill Cox — AI at the Helm and CodeRhapsody. 240,000 lines of AI-supervised practice; the harness concept.

StrongDM — Software Factory manifesto (February 2026). Three engineers, no human code review, scenario-based validation.

Jer Crane (@lifeof_jer) — “An AI Agent Just Destroyed Our Production Data. It Confessed in Writing.” (X, April 25, 2026). AI coding agent deleted production database and all backups in nine seconds via unrestricted API access.

Stanford Law School CodeX — “Built by Agents, Tested by Agents, Trusted by Whom?” (February 2026). Legal and accountability analysis.

Industry analysis

THE DECODER — Frontier Radar #1: From chatbots to problem solvers — the state of AI agents in 2026 (February 2026). Harness engineering, agent reliability, Cursor/FastRender experiment, prompt injection statistics, and the case for structure over scale.

THE DECODER — Frontier Radar #2: Why AI productivity gets lost between benchmarks and the balance sheet (March 2026). The productivity gap between task-level AI gains and company-level impact; workslop, approval fatigue, and the measurement problem.

THE DECODER — AI coding can make developers slower, even if they feel faster (July 2025). Coverage of the METR randomized controlled trial.

THE DECODER — Anthropic cuts AI productivity forecasts in half after analyzing Claude’s real-world failure rates (January 2026). Anthropic’s 4th Economic Index Report; real-world failure rates halved productivity projections from 1.8pp to 0.6–0.8pp.

THE DECODER — AI coding tools hurt learning unless you ask why, Anthropic study finds (January 2026). Developers 17% worse on knowledge tests with heavy AI delegation; how you use AI determines whether you learn.

THE DECODER — Corporate AI agents use simple workflows with human oversight instead of chasing full autonomy (December 2025). MAD study: 92.5% of productive AI agent systems serve humans directly; simple workflows dominate.

Agile thought leadership

Mike Cohn — The Cost of Change Curve Is Outdated (Mountain Goat Software, March 2026). AI is flattening the cost-of-change curve; coding becomes revision, not construction. Supports the article’s distinction between vibe coding (cheap experimentation) and production engineering (where the cost of understanding, not writing, is the bottleneck).

Media coverage

Vietnam Economic Times — “Human touch” (Issue 454, April 27, 2026). DevDay 2026 panel coverage.

Regulatory frameworks

FINMA — Swiss Financial Market Supervisory Authority.

EU AI Act — Regulation (EU) 2024/1689.

FADP — Swiss Federal Act on Data Protection (in force September 2023).

GDPR — General Data Protection Regulation.

Companion articles

Axon Active — “AI Changes the Practice. The Scrum Team Stays.” The companion piece on how we organize delivery teams with AI.