So You Vibe Coded and Shipped. Now What?

Three months of AI-driven bug hunting, test generation, and PR reviews taught me what it actually takes to go from 'it works' to 'it's ready'

If you’ve shipped something with AI, you know the feeling. The code works. The docs are written. People are using it.

And then the bugs start showing up.

Not the obvious ones. The weird edge cases. The things that “should work” but don’t. The things that “shouldn’t work” but somehow do. And suddenly you’re staring at a codebase you didn’t fully write, trying to figure out how to fix something you don’t fully understand.

That’s where I was three months ago with EZ.

In my last article, I explained how I built the EZ programming language using nothing but AI prompts and vision. That article was about the initial building. This one is about what comes next: the long road from “it works” to “it’s ready.”

Because here’s the part I haven’t told you yet: that first article was written in November 2025. EZ v1.0 didn’t ship until late January 2026. What happened in those three months?

Bug hunting. Testing. More bug hunting. Issues. PRs. Reviews. And a whole lot of AI-driven quality assurance.

Vibe coding can get you a working prototype. It doesn’t guarantee a stable release. The gap between “it runs” and “it’s ready” is where most AI-built projects die.

This is how I closed that gap.


The Real Workflow: Tests as Context

Here’s what I want people to know about vibe coding: the AI needs to learn your project as it grows.

In the beginning, Claude Code only knew what I told it: my vision, my planning docs, my examples. But as EZ got more complex, that wasn’t enough. The AI would fix one thing and break another because it didn’t have full context on what “working” meant.

My solution: integration tests written in EZ itself.

I had Claude Code write two kinds of tests:

  1. Passing tests: code that demonstrates correct behavior
  2. Failing tests: code that should fail with specific errors

The passing tests became the new source of truth. When I started a session, I’d say:

Review the integration tests in /integration-tests/pass/. These
represent the expected behavior of the EZ language. Any change
you make must keep these tests passing.

Now Claude Code had hundreds of concrete examples of what “correct” looked like. Not abstract documentation. Real, runnable code.

The failing tests were equally important:

Review the integration tests in /integration-tests/fail/. These
represent code that SHOULD produce errors. Each test has an
expected error code. Verify your changes don't accidentally
make invalid code pass.

This changed everything. Claude Code stopped breaking things it had already fixed because it had context. Not just instructions, but evidence.
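
If you want to make those two directories enforceable rather than just descriptive, a small harness that runs both suites is enough. Here’s a minimal sketch in Go (the interpreter’s language), assuming an `ez run` CLI on the PATH and a filename convention where failing tests start with their expected error code. Both conventions are illustrative assumptions, not the actual EZ repo layout.

```go
// harness_sketch.go
//
// Minimal integration-test harness sketch. Assumes an `ez` CLI on the PATH
// and that failing tests are named after their expected error code
// (e.g. fail/E042_bad_assign.ez). Both are hypothetical conventions.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
)

func main() {
	failed := 0

	// Every file in pass/ must run cleanly.
	passFiles, _ := filepath.Glob("integration-tests/pass/*.ez")
	for _, f := range passFiles {
		if out, err := exec.Command("ez", "run", f).CombinedOutput(); err != nil {
			fmt.Printf("[FAIL] %s should pass:\n%s\n", f, out)
			failed++
		}
	}

	// Every file in fail/ must error out AND mention its expected error code,
	// taken from the filename prefix (e.g. E042_bad_assign.ez -> "E042").
	failFiles, _ := filepath.Glob("integration-tests/fail/*.ez")
	for _, f := range failFiles {
		wantCode := strings.SplitN(filepath.Base(f), "_", 2)[0]
		out, err := exec.Command("ez", "run", f).CombinedOutput()
		if err == nil {
			fmt.Printf("[FAIL] %s should produce %s but succeeded\n", f, wantCode)
			failed++
		} else if !strings.Contains(string(out), wantCode) {
			fmt.Printf("[FAIL] %s: expected %s, got:\n%s\n", f, wantCode, out)
			failed++
		}
	}

	if failed > 0 {
		os.Exit(1)
	}
	fmt.Println("all integration tests passed")
}
```

Run something like that before and after every AI session, and “keep these tests passing” stops being an honor system.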


Bug Hunting Sessions: Different Roles, Different Bugs

Once the tests were in place, I started running what I call bug hunting sessions. The key insight: different personas find different bugs.

I’d start each session with:

It's time for a bug hunting session. This session you are: [ROLE]

And I’d rotate through three roles:

Role 1: Senior QA Engineer

You are a Senior QA Engineer whose job is to intentionally break
the EZ programming language. You are adversarial. You get paid
for every bug you find.

Try to:
- Break the type system with edge cases
- Crash the interpreter with malformed input
- Find inconsistencies between documented and actual behavior
- Discover undefined behavior that could surprise users
- Exploit any assumptions the implementation makes

This role found stuff like:

  • Integer arithmetic was capped at int64 despite arbitrary precision storage (#917)
  • Fixed-size arrays accepted more elements than declared (#1029)
  • Single variable assignment from multi-return functions leaked internal types (#986)
  • Integer type narrowing bypassed range checking (#962)

Role 2: Experienced Programmer

You are an experienced programmer trying EZ for the first time.
You're used to languages like Go, Python, and Rust. You want to
build real things.

Try complicated but reasonable things that SHOULD work:
- Nested data structures
- Complex control flow
- Chaining operations
- Patterns you'd use in production code

If something doesn't work that you'd expect to work, that's a bug.

This role found the “why doesn’t this work?” bugs:

  • Cannot modify array inside struct inside map (#1065)
  • Cannot assign struct member to variable of same declared type (#1008)
  • Multi-return type inference failed for stdlib calls via using (#977)
  • Multi-default value parameter functions threw type errors when called (#1049)

Role 3: Complete Beginner

You are a complete beginner learning to program. EZ is your first
language. You make mistakes. You try dumb things. You don't read
documentation carefully.

Try things that probably SHOULDN'T work, but might:
- Misspell keywords
- Forget required syntax
- Use wrong types
- Write nonsensical code

If any of this accidentally succeeds when it should fail, that's
a bug. Bad code should produce helpful errors, not silent chaos.

This role found the “why DOES this work?” bugs:

  • Leading zeros silently created octal literals (#915)
  • Bare function names without parentheses produced no error (#985)
  • Invalid string interpolation syntax not caught at check time (#984)
  • using std caused misleading errors for undefined types (#914)

The Compound Effect

Each session found bugs. But more importantly, each session generated new tests.

Bug found → test written → test becomes context → next session is smarter.

After two months of this, the integration test suite had hundreds of tests. Claude Code knew exactly what “working” and “broken” looked like. The bug hunts got shorter because there were fewer bugs to find.


AI Writing Tests (The Right Way)

“Just have AI write tests” sounds simple. It’s not.

Early on, I made a mistake. I told Claude Code:

Write comprehensive tests for the EZ interpreter.

What I got was garbage. Hundreds of tests that checked implementation details instead of behavior. Tests that passed but didn’t actually verify anything meaningful. Tests that were so tightly coupled to the current code that any refactor broke them.

The problem: Claude Code didn’t know what to test or why.

The fix: I changed my prompting strategy entirely.

Test Prompting That Actually Works

Step 1: Behavior-first test specs

Instead of “write tests,” I now say:

Review the README documentation for the `when/is` statement. Write
tests that verify DOCUMENTED BEHAVIOR, not implementation details.

Each test should:
1. Have a clear name describing the behavior being tested
2. Test one thing only
3. Include a comment explaining WHY this behavior matters
4. Use realistic examples a user might actually write

Do NOT test internal functions. Test the language from a user's
perspective. Write EZ code, run it, verify output.

Step 2: Edge case generation

For each feature in the README, generate edge case tests:

- What happens with empty input?
- What happens at type boundaries?
- What happens with nil/missing values?
- What happens with nested structures?
- What happens when features interact?

Each test should either PASS (correct behavior) or FAIL with a
helpful error message. Silent failures are bugs.

Step 3: Regression tests from bugs

Every bug gets a test. Non-negotiable.

Bug #47: String interpolation fails with escaped braces.

Write a regression test that:
1. Reproduces the original bug
2. Verifies the fix works
3. Tests related edge cases we might have missed
4. Will catch if this bug ever returns

Add this to the integration test suite with a comment linking to
the original issue.

The Test Quality Check

I also have Claude Code review its own tests:

Review the tests you just wrote. For each test, answer:

1. If the implementation was completely wrong, would this test fail?
2. Could this test pass while the feature is actually broken?
3. Is this testing behavior or implementation?

Rewrite any test where the answer to #1 is "no" or #2 is "yes".

This catches the sneaky bad tests. The ones that look comprehensive but actually verify nothing.


AI Making GitHub Issues

This started as an experiment and became essential.

When Claude Code finds a bug during a hunt, I don’t want to manually create issues. That’s friction. Friction means bugs get forgotten.

So I taught Claude Code to create issues directly using gh issue create. And the quality? Often better than human-reported issues.
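
Mechanically, there’s nothing exotic here: Claude Code just shells out to the gh CLI. If you wanted to script the same step yourself, a minimal sketch might look like this, assuming gh is installed and authenticated and the repo has a `bug` label. The title and body below are placeholders, not a real report.

```go
// file_issue_sketch.go
//
// Sketch of the "found bug -> tracked bug" step via the gh CLI.
// Assumes gh is installed and authenticated; title/body are placeholders.
package main

import (
	"log"
	"os"
	"os/exec"
)

// fileIssue turns a markdown bug report into a GitHub issue via `gh issue create`.
func fileIssue(title, body string) error {
	// Write the markdown body to a temp file and pass it via --body-file.
	tmp, err := os.CreateTemp("", "issue-*.md")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name())
	if _, err := tmp.WriteString(body); err != nil {
		return err
	}
	tmp.Close()

	cmd := exec.Command("gh", "issue", "create",
		"--title", title,
		"--body-file", tmp.Name(),
		"--label", "bug") // assumes the repo has a "bug" label
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	// Placeholder title/body; the real ones come straight out of a bug hunting session.
	err := fileIssue("Placeholder: arithmetic overflows despite big.Int storage",
		"## Summary\n\nSteps to reproduce, expected behavior, root cause analysis...")
	if err != nil {
		log.Fatal(err)
	}
}
```

The interesting part isn’t the plumbing, though. It’s what the AI puts in the body.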

Here’s a real example from issue #917. Integers stored correctly but arithmetic failed:

## Summary

The `int` type can store arbitrary precision values (backed by `big.Int`),
but arithmetic operations incorrectly check against int64 bounds and fail
with overflow errors.

## Steps to Reproduce

**Storage works (arbitrary precision):**
```ez
temp big int = 999999999999999999999999999999999999999999999999999999999999
println("Big: ${big}")  // Works!
```

**Arithmetic fails (int64 cap):**
```ez
temp a int = 999999999999999999999999999999
temp b int = a + a  // ERROR: exceeds int64 range
```

## Expected Behavior

Since `int` uses `big.Int` internally, all arithmetic operations should
also use arbitrary precision.

## Root Cause

The evaluator is likely converting `big.Int` to `int64` before performing
arithmetic operations, then checking for overflow. The fix should use
`big.Int` methods directly (`Add`, `Mul`, etc.) without int64 conversion.

Notice what Claude Code does that humans often don’t:

  • Tests the opposite case (storage works, arithmetic doesn’t)
  • Identifies the root cause in the Go implementation
  • Suggests a specific fix approach

Here’s another one. Issue #986, marked critical:

## Description

When a function that returns a tuple (any number of values ≥2) is assigned
to a single variable using `temp`, the variable stores the raw `ReturnValue`
object instead of either:
1. Unwrapping the first value, or
2. Producing a compile-time error

This causes `typeof()` to return `"RETURN_VALUE"` instead of the actual type.

## Comprehensive Test Results

**Full test suite: 11 passed, 11 failed**

--- Section 1: 2-value returns ---
  [FAIL] 2-val single: got 'RETURN_VALUE', expected 'int'
  [PASS] 2-val tuple[0]: int

--- Section 5: Complex return types ---
  [FAIL] array single: got 'RETURN_VALUE', expected '[int]'
  [PASS] array tuple: [int]
  [FAIL] map single: got 'RETURN_VALUE', expected 'map[string:int]'

## Scope of Bug

| Category | Examples |
|----------|----------|
| **Return count** | 2, 3, 4, 5+ values - ALL fail |
| **Return types** | int, string, bool, [int], map[string:int] - ALL fail |
| **Source** | User functions, @io module, @os module - ALL fail |

That issue included a full test matrix, root cause analysis pointing to specific line numbers in the Go source, and a suggested fix with code. I didn’t write any of that. Claude Code did.

99% of the 1,066 issues in the EZ repo were created this way. Most of them look like that. Comprehensive, reproducible, actionable.

Issue Triage with AI

It works the other way too. When I or someone else submits issues, I have Claude Code triage them:

New issue submitted: [paste issue]

Analyze this issue:
1. Is it reproducible? Try the code.
2. Is it a bug or expected behavior? Check docs/tests.
3. Is it a duplicate? Search existing issues.
4. What's the severity? (crash, wrong output, cosmetic)
5. What component is affected? (lexer, parser, interpreter, stdlib)

Provide your analysis and recommend labels/priority.

This saves hours. Most user-reported issues need clarification or are duplicates. Claude Code catches that before I spend time investigating.


The Plot Twist: Catching Vibe-Coded PRs

Here’s where things get weird.

EZ started getting pull requests. Some were excellent. Clear code, good tests, follows project conventions. Others were… different.

The code worked. The tests passed. But something felt off.

Then I realized: some contributors were vibe coding their contributions.

Nothing wrong with that in principle. I built the whole language that way. But there’s a difference between me vibe coding my project and a stranger vibe coding a contribution to my project.

The tells were subtle:

  • Code that worked but didn’t match existing patterns
  • Overly complex solutions to simple problems
  • Comments that explained what but not why
  • Test coverage that was technically complete but missed obvious edge cases

I needed a way to review these PRs without spending my time reverse-engineering someone else’s AI-generated code.

The AI-Reviewing-AI Protocol

Yes, I’m having AI review code that was probably written by AI. It’s AIs all the way down.

Review this pull request as a senior maintainer of the EZ project.

Context:
- EZ coding conventions: [link to CONTRIBUTING.md]
- Project architecture: [brief description]
- PR claims to fix: [issue description]

Analyze:
1. Does this actually fix the stated issue?
2. Does it follow project conventions?
3. Are there any subtle bugs or edge cases missed?
4. Is the code quality consistent with the rest of the codebase?
5. Are the tests actually testing the fix, or just passing?
6. Does anything suggest this was AI-generated without understanding?

Flag any concerns, even minor ones. I'd rather over-review than
merge problematic code.
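
For this prompt to work, the diff and PR metadata have to actually be in context. Claude Code can pull those itself with the gh CLI; if you were assembling the same context by hand, a sketch like this would do it (the PR number is a placeholder):

```go
// pr_context_sketch.go
//
// Sketch of pulling a PR's metadata and full diff into one blob of text for
// the review prompt. Assumes the gh CLI is installed and authenticated; the
// PR number is a placeholder.
package main

import (
	"fmt"
	"log"
	"os/exec"
)

func main() {
	prNumber := "1234" // hypothetical PR

	// Title, body, and changed files as JSON.
	meta, err := exec.Command("gh", "pr", "view", prNumber,
		"--json", "title,body,files").Output()
	if err != nil {
		log.Fatal(err)
	}

	// The full unified diff.
	diff, err := exec.Command("gh", "pr", "diff", prNumber).Output()
	if err != nil {
		log.Fatal(err)
	}

	// Everything the review prompt needs, in one place.
	fmt.Printf("PR metadata:\n%s\n\nDiff:\n%s\n", meta, diff)
}
```

Paste that output under the prompt above and the review has everything it needs.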

Red flags Claude Code catches:

  • Solutions that are technically correct but don’t fit the architecture
  • “Clever” code that could be simple
  • Tests that test the implementation instead of the behavior
  • Missing error handling that the rest of the codebase includes
  • Inconsistent naming conventions
  • Suspicious patterns that suggest copy-paste from AI

The Conversation

On occasion, I’ve had to reject PRs that work because they don’t fit.

I’ve learned to be direct:

“@Contributor_Name Thanks for the contribution! The fix works, but the implementation doesn’t match the project’s patterns. Specifically: [list issues]. These issues need to be addressed before merging.”

Simple as that.


What I’ve Learned About AI-Driven Development

Lesson 1: You Need Systems, Not Sessions

Basic prompting doesn’t scale for maintenance. You need:

  • Scheduled bug hunts (I do weekly)
  • Automated test generation (on every feature change)
  • Issue templates (so AI creates consistent reports)
  • PR review checklists (so nothing slips through)

Build the system once, run it forever.

Lesson 2: AI Catches What Humans Miss (and Vice Versa)

Claude Code is great at:

  • Finding edge cases humans forget
  • Maintaining consistency across a large codebase
  • Generating comprehensive test coverage
  • Catching regression bugs

Humans are great at:

  • Knowing if something “feels wrong”
  • Understanding user intent behind bug reports
  • Making judgment calls about priorities
  • Deciding when to break conventions

Use both.

Lesson 3: Document Everything (For the AI)

Your documentation isn’t just for users anymore. It’s training data for your AI maintainer.

If the docs are wrong, Claude Code will write tests against wrong behavior. If CONTRIBUTING.md is vague, PR reviews will be inconsistent. If your code has no comments, bug hunts will miss context.

Great documentation = good AI maintenance.

Lesson 4: Trust But Verify

I put a lot of faith in Claude Code, but I still verify everything I can.

Every bug hunt gets manual review. Every generated test gets a sanity check.

In my opinion, AI tools are just that… tools. They are NOT a full-on replacement. If you trust blindly, you’ll merge bugs blindly.


The Workflow That Got Me to v1.0

Here’s what the development process actually looked like over those three months:

Bug hunting sessions: 60-90 minutes, 2-3 times a week

Each session had a specific role (QA, experienced programmer, beginner). Each session found bugs. Each bug became a test. The test suite grew. The sessions got more productive.

Test coverage improvements: Monthly

Once a month, I’d step back and look at test coverage. Were there gaps? Features lacking tests? Edge cases I hadn’t tried myself? This was less frequent but more strategic.

PR reviews: ~10 minutes average for AI review

Claude Code would analyze the PR, check for convention violations, test the changes, flag concerns. I’d review its analysis, make the final call. Most PRs were straightforward. Some needed back-and-forth with contributors or even Claude.

Issue triage: As needed

When bugs came in from sessions or (rarely) users, Claude Code would categorize, label, and prioritize. I’d review and adjust.

No formal “release workflow.” When the bug count was low and the tests were green, I shipped.


If You’re Trying to Ship an AI-Built Project

Here’s my advice:

1. Build Your Bug Hunting Prompts

Don’t wait for users to find bugs. Hunt them proactively. Save your best prompts and reuse them.

2. Test Behavior, Not Implementation

AI writes great tests for the wrong things. Force it to test from the user’s perspective.

3. Create Issues Automatically

Any friction between “found bug” and “tracked bug” means bugs get forgotten. Automate the path.

4. Review AI Code with AI (But Stay Involved)

You can’t manually review every line of AI-generated code. But you can have AI do first-pass reviews and flag concerns for human judgment.

5. Accept the Irony

Yes, you’re using AI to maintain AI code and catch AI contributions. So what? Embrace it.


What’s Next

EZ v1.0 has shipped. Three months of bug hunting, test writing, issue creation, and PR reviews got me there.

The codebase is in a good place. The test suite is comprehensive. The workflow is sustainable. And with over 1,000 downloads, there are real users and contributors now. The bug hunting isn’t theoretical anymore. Every bug I find before someone else does can make a world of difference.

Next article, I’ll cover something I’ve been experimenting with: using AI to write the documentation that trains better AI. It’s getting meta, but it’s also getting powerful.

Until then, keep building. Keep shipping. And when things break (because they will), remember: you have a development partner that never sleeps, never gets bored, and actually enjoys hunting bugs.

Try EZ on GitHub: SchoolyB/EZ


This is the second in a series about building and maintaining EZ. Previously: I Built a Programming Language Using Only AI Prompts