
Jun 9, 2026

How to Vet an AI Development Agency: 12 Questions With Passing Answers

An adversarial vetting checklist for AI agencies, with passing answers, red flags, and the compliance question almost no one asks before August 2026


TL;DR

  • 75% of the enterprise leaders Sinch surveyed in 2025 had rolled back customer-facing AI agents. Vetting is now a governance problem, not a portfolio problem.

  • The 12 questions below cover code review, security, compliance, ownership, and unit economics — each with a passing answer and the red flag that should kill the deal.

  • EU AI Act Article 50 transparency obligations take effect August 2, 2026. Ask your agency how they're handling it before you sign.

  • Roughly 25% of YC's W25 cohort shipped ~95% AI-generated codebases. "Who reviews the AI's code?" is no longer hypothetical.

  • Skip to the printable scorecard at the bottom if you're vetting someone this week.

1. Why vetting changed when agencies started building with AI

Here's the honest version: most vendor checklists you'll find online were written for a world where humans typed every line of code. That world is gone.

In March 2025, TechCrunch reported that roughly a quarter of Y Combinator's Winter 2025 startups were shipping codebases that were ~95% AI-generated. That number is only going up. If you're hiring an AI development agency in June 2026 and you're not asking who reviews the model's output before it touches production, you are not vetting — you are gambling.

The stakes got concrete in October 2025. Sinch surveyed 2,500+ enterprise leaders and found 75% had rolled back customer-facing AI agents. Top causes: data exposure (31%), hallucination (22%), and inability to diagnose issues (16%). Every one of those failures traces back to a question that wasn't asked during vetting.

So we wrote the checklist we wished operators would send us. Twelve questions, each with the passing answer and the red flag, grouped by what they actually protect.

2. The 12 questions, grouped

The questions split into five buckets, ordered by stakes:

Code quality (Q1–Q3). Who reviews AI-written code, what's the test coverage policy, what's the rollback plan when production breaks at 2 AM.

Security (Q4–Q5). Where does customer data flow during model calls, and what's the secrets-management posture.

Compliance (Q6–Q7). EU AI Act Article 50 readiness, and the audit trail story.

Ownership (Q8–Q10). Who owns the source code, the prompts, the fine-tuned weights, and the vendor accounts.

Economics (Q11–Q12). What does inference actually cost at your volume, and what's the exit cost if you fire them in month four.

The full grid is below. Screenshot it, send it to your shortlist, and watch which agencies answer in specifics versus which ones answer in adjectives.


Twelve vetting questions with passing answers and red flags, grouped by category

Screenshot this. Send to your shortlist. Score 0/1/2 per row.

3. Passing answers vs red flags, question by question

A few of these deserve more than a row in a table.

Q1 — Who reviews the AI-written code?

Passing answer: a named senior engineer, with a documented review checklist, who signs off on every PR before merge. They can tell you the last three bugs the AI introduced that they caught.

Red flag: "Our agents self-review." Or worse: "We trust the model." If the agency can't name the human accountable for the last shipped commit, you're paying them to ship unreviewed code into your production.

Q4 — Where does customer data flow during model calls?

Passing answer: a data-flow diagram you can read in under 90 seconds, naming every endpoint the data touches, with PII redaction happening before the model call, not after. They mention their DPA with the model provider by name.

Red flag: vague reassurance about "enterprise-grade encryption." That phrase tells you nothing about whether your customer's email address ends up in OpenAI's training data. The 31% of Sinch respondents who rolled back over data exposure mostly skipped this question.
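
To make "redaction before the model call, not after" concrete, here is a minimal sketch in Python. The two regex patterns are illustrative only and will miss plenty of real PII, and `send` is a hypothetical stand-in for whatever provider client an agency actually uses. The point is the ordering: the raw prompt never reaches the provider.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def call_model(prompt: str, send) -> str:
    """Redact first, then pass the prompt to `send`.

    `send` is a hypothetical callable wrapping the model provider's client.
    """
    return send(redact_pii(prompt))
```

When an agency shows you its data-flow diagram, the equivalent of `redact_pii` should sit on the arrow leading into the model provider, not on the arrow coming back.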

Q8 — Who owns the source code, prompts, and fine-tuned weights?

Passing answer: you do, on day one, in writing. The contract names the repository, the prompt library, and any custom weights as work-for-hire. You get admin access to the cloud accounts, not just "a copy" you have to ask for.

Red flag: the agency holds the repository and grants you "read access." Or the prompts live in their proprietary platform you can't export. This is the single most common way agencies trap clients — and the cheapest one to prevent at contract signing.

For the rest, the grid above carries the load. If you want our take on why so many of these failures land in the same place, we wrote about it in why companies are rolling back AI agents.

4. The red flags that predict you'll join the 75% rollback statistic

Three patterns show up in almost every failed engagement we've reviewed.

The demo-first agency. They show you a polished demo in week one and ask for a deposit in week two. They have not asked about your data, your compliance posture, or your existing systems. The 22% of Sinch respondents who rolled back over hallucination almost universally bought from this type.

The "we'll figure out logging later" agency. Sinch found 16% of rollbacks happened because teams couldn't diagnose what the agent did wrong. If your vendor treats observability as a phase-two concern, you will be that 16%. Audit trails are a primitive, not a feature.

The agency that won't quote inference cost. "It depends on usage" is a non-answer. A serious agency will model your token spend at 1x, 10x, and 100x of expected volume, name the model tier they're using, and tell you when caching pays for itself. If they can't, they haven't shipped enough production agents to know.

We've written more on the integration side of this in integrating AI agents with existing systems — most of the rollback causes trace back to plumbing, not models.
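
On the logging pattern in this section, "audit trail as a primitive" can be as small as one structured record per agent step, written from day one. The sketch below is an assumption about what such a record might contain, not a standard: the field names are hypothetical, and hashing the prompt instead of storing it is one possible way to keep raw customer text out of the logs.

```python
import hashlib
import json
import time
import uuid


def new_trace_id() -> str:
    """One id per conversation or job, shared by every step it triggers."""
    return uuid.uuid4().hex


def audit_record(trace_id: str, step: str, model: str, prompt: str, output: str) -> str:
    """Build one JSON line describing a single agent step."""
    record = {
        "trace_id": trace_id,
        "step": step,    # hypothetical step label, e.g. "lookup_order"
        "model": model,  # model tier actually used for this call
        "ts": time.time(),
        # Store a hash so identical prompts can be matched without keeping the text.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "output": output,  # redact this too if outputs can echo customer data
    }
    return json.dumps(record, sort_keys=True)
```

With records like this, "what did the agent do wrong?" becomes a query by `trace_id` rather than a guess.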

5. The compliance question with a hard deadline

Here's the question almost no one asks: how are you handling EU AI Act Article 50 transparency obligations, which take effect August 2, 2026?

Article 50 requires that anyone interacting with an AI system be told they're interacting with an AI system. It applies to chatbots, voice agents, and most customer-facing automation. Per Gibson Dunn's November 2025 analysis, while several high-risk provisions were postponed to 2027–2028 under the EU's omnibus agreement, the Article 50 transparency deadline of August 2, 2026 held.

That's six to eight weeks from when most operators are reading this. If your vetting call happens in June 2026 and your shortlist agencies say "we'll look into that," they are telling you they have not shipped a compliant system yet. The passing answer names the specific disclosure pattern they implement (in-chat banner, voice intro line, API response header), where the disclosure logic lives in the codebase, and how they version it when the rules update.
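
As a rough illustration of what "disclosure logic in one place, versioned" could look like, here is a minimal Python sketch. Everything in it is an assumption for illustration: the header name, the version string, the disclosure copy, and the dictionary response shape are all hypothetical, and whether any given pattern satisfies Article 50 is for counsel to decide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Disclosure:
    version: str  # bumped whenever the copy or the rules change
    text: str     # in production this might be loaded from a CMS


# Hypothetical current disclosure; both values are placeholders.
CURRENT_DISCLOSURE = Disclosure(
    version="2026-06-01",
    text="You are chatting with an AI assistant.",
)


def with_disclosure(handler, disclosure: Disclosure = CURRENT_DISCLOSURE):
    """Wrap a chat handler so every response carries the disclosure."""
    def wrapped(message: str) -> dict:
        response = handler(message)
        headers = dict(response.get("headers", {}))
        # Hypothetical header name, used here to make the version auditable.
        headers["X-AI-Disclosure-Version"] = disclosure.version
        return {**response, "headers": headers, "banner": disclosure.text}
    return wrapped
```

Because the disclosure is applied by a wrapper rather than inside each handler, updating the copy or the version touches one object instead of every endpoint.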

This isn't legal advice — get your own counsel — but it is operational reality. The agencies that thought about this six months ago will be the ones whose deployments don't get yanked offline.

6. How we'd answer our own checklist

Fair question. If you sent us this checklist, here's roughly what you'd get back.

Q1 (code review): Every AI-written PR is reviewed by a named senior engineer on our side before merge. We keep a running log of model-introduced bugs by category — the most common one in our 2026 work has been over-eager error swallowing in async handlers.

Q4 (data flow): We send you a diagram during scoping, not during onboarding. PII redaction sits in front of the model call. We name our DPAs in the SOW.

Q6 (Article 50): Disclosure logic is a middleware layer, versioned, and the disclosure copy lives in your CMS so legal can update it without redeploying.

Q8 (ownership): Your repository, your cloud accounts, your prompts, your weights. Our access is revoked when the engagement ends. The handover is in the contract.

Q11 (inference cost): We model your spend at 1x, 10x, and 100x volume during scoping. We tell you which calls should be cached, which should hit a cheaper model tier, and where the break-even point is for a fine-tune.
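
For a sense of what a 1x/10x/100x spend model involves, here is a minimal Python sketch. The prices, token counts, and call volume in the usage note are hypothetical placeholders, not any provider's real rates, and the model makes one loud simplification: a cache hit is treated as free.

```python
def monthly_cost(calls: int, in_tokens: int, out_tokens: int,
                 in_price: float, out_price: float,
                 cache_hit_rate: float = 0.0) -> float:
    """Estimated monthly spend in dollars; prices are per million tokens.

    Simplification: a cache hit is assumed to cost nothing.
    """
    billable_calls = calls * (1.0 - cache_hit_rate)
    per_call = (in_tokens * in_price + out_tokens * out_price) / 1_000_000
    return billable_calls * per_call


def scenarios(base_calls: int, **kwargs) -> dict:
    """Spend at 1x, 10x, and 100x of the expected monthly call volume."""
    return {
        f"{m}x": round(monthly_cost(base_calls * m, **kwargs), 2)
        for m in (1, 10, 100)
    }
```

With placeholder inputs of 100,000 calls a month, 1,500 input and 300 output tokens per call, and prices of $3 and $15 per million tokens, `scenarios` returns 900.0, 9000.0, and 90000.0 dollars. Setting `cache_hit_rate=0.5` halves each figure, which is the kind of comparison that shows when caching pays for itself.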

The rest is on our software development service page, including the engagement structure we use to make these answers durable rather than aspirational.

7. Printable scorecard

Use the grid above as your scorecard. Send it to three agencies. Score each answer 0 (no answer or red flag), 1 (partial), or 2 (passing). Anything under 18 out of 24 means you're absorbing risk the agency should have priced into the SOW.
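
If you are comparing several agencies in a spreadsheet or script, the scoring rule above reduces to a few lines. This Python sketch assumes one score per question in question order; the function and constant names are our own.

```python
QUESTIONS = 12
PASS_THRESHOLD = 18  # under 18 of 24 means you are absorbing risk


def score_agency(scores: list[int]) -> tuple[int, bool]:
    """Return (total, passes) for one agency's twelve 0/1/2 scores."""
    if len(scores) != QUESTIONS:
        raise ValueError(f"expected {QUESTIONS} scores, got {len(scores)}")
    if any(s not in (0, 1, 2) for s in scores):
        raise ValueError("each score must be 0, 1, or 2")
    total = sum(scores)
    return total, total >= PASS_THRESHOLD
```

An agency with six passing answers and six partial ones lands exactly on 18 and clears the bar; twelve partial answers total 12 and do not.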

FAQ

What's the single most important question to ask an AI development agency?

Ask who reviews the AI-written code before it ships to production, and whether that reviewer can name the last three model-introduced bugs they caught. This question separates agencies that have shipped AI-built systems from agencies that are letting models ship unreviewed code into your environment.

How do I check if an AI agency is EU AI Act compliant?

Ask specifically about Article 50 transparency obligations, which take effect August 2, 2026. That deadline held even as the EU's omnibus agreement postponed several high-risk provisions to 2027–2028. A compliant agency will name the disclosure pattern they implement, show you where the logic lives in code, and explain how they version disclosure copy when regulations update.

What are the biggest red flags when choosing an AI software partner?

Three predict rollback: a demo-first sales motion without data or compliance discovery, treating observability and audit trails as phase-two work, and refusing to quote inference cost at projected volume. In Sinch's October 2025 study, data exposure (31%), hallucination (22%), and inability to diagnose issues (16%) together accounted for 69% of rollbacks, and the first two patterns are where we most often see those failures start.

Who should own the source code and prompts after an AI engagement ends?

You should, on day one, in writing. The contract should name the repository, prompt library, and any fine-tuned weights as work-for-hire under your ownership. You should hold admin access to cloud accounts and model provider accounts, not delegated access through the agency's platform.

How much should AI inference actually cost for a production system?

It depends entirely on model tier, call volume, caching strategy, and context window usage — so any flat answer is suspect. A serious agency will model your spend at 1x, 10x, and 100x of expected volume during scoping, name the specific model tiers, and identify which calls should be cached versus routed to cheaper models.

Closing

Send the scorecard to three agencies this week. Score the answers Friday. Decide Monday. If you want us on that list, start here.

© All rights reserved
