How to Evaluate an AI Agency | 2026 Buyer's Checklist

There are roughly ten thousand AI agencies in business in 2026, and most of them launched in the last 24 months. Some are excellent. Some are excellent at PowerPoint and not at much else. Telling them apart matters, because the cost of picking wrong isn't just the project fee: it's six months of lost time, an under-performing system, and starting over with a new vendor while the original one collects its final invoice. This is the buyer's checklist we'd hand to a friend before they signed any AI agency contract.

Why this is harder than it looks

Buying from an AI agency in 2026 is not like buying from a web development agency in 2015. The technology is shifting every quarter. The vendor's underlying tools (models, APIs, infrastructure) change underneath them. The "industry best practice" you read about last month is already outdated. And most importantly, the line between an agency that ships engineering and an agency that ships PowerPoint is harder to spot than ever because everyone's website looks great.

The good news is that a few well-chosen questions sort it out fast.

The 20 questions to ask before signing

These are in priority order. The first five are the deal-breakers.

1. Who actually builds what we're buying?

Find out who specifically will design, build, and maintain your system. Names, roles, time commitments. Agencies frequently sell with senior staff and deliver with juniors or subcontractors. Ask: "Who is the engineer assigned to my project, and what percentage of their time is allocated to my account?"

2. Show us a system you've shipped that's still in production

Anyone can demo a prototype. Ask to see a system that's been live for 6+ months, ideally with metrics. If they can't share live work without an NDA, ask for a recorded walkthrough. If they can't provide either, treat that as a red flag.

3. What's your underlying model choice and why?

A capable agency has opinions about when to use Claude vs. GPT vs. Gemini vs. open-source. They can explain the tradeoffs — cost, context length, tool use, latency, data residency. An agency that just says "we use AI" or "we use the best model" is selling a wrapper, not engineering.

4. Are you billing API costs at cost or with a markup?

Some agencies bundle API costs into the monthly fee with a 2–5x markup buried inside. Ask directly: "Are the underlying LLM/API costs passed through at vendor cost, or are they marked up?" Then ask to see the breakdown on a recent invoice from a comparable client (names redacted). For more on this, see API costs explained — BYO vs. bundled.
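
The arithmetic behind this question is simple enough to sketch. The snippet below is a minimal illustration, not a billing tool. The function name and the dollar figures are hypothetical and do not come from any real invoice.

```python
def implied_markup(billed_api_cost: float, vendor_api_cost: float) -> float:
    """Return the billed cost as a multiple of vendor cost (1.0 means pass-through)."""
    if vendor_api_cost <= 0:
        raise ValueError("vendor_api_cost must be positive")
    return billed_api_cost / vendor_api_cost


# Hypothetical month: the agency bills $900 for usage that cost $300 at the vendor.
multiple = implied_markup(billed_api_cost=900.0, vendor_api_cost=300.0)
print(f"Billed at {multiple:.1f}x vendor cost")
```

To run this check on a real engagement you need both numbers: the API line on the agency invoice and the vendor's own usage report for the same period.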

5. What happens to the code, data, and models if we leave?

This is the lock-in question. The right answers:

Code: you own it or have perpetual license
Training data: yours, exportable
Configuration: yours, exportable
Models: hosted however you choose

The wrong answers involve any version of "all of that lives in our platform."

6. How do you handle prompt versioning and rollback?

Production AI systems break when prompts change. A capable agency uses version control, has rollback procedures, and can show you the change history. If they don't, you're paying for ad-hoc work that will silently degrade.
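
To make "versioning and rollback" concrete, here is a minimal sketch of the idea, assuming an in-memory store. The PromptRegistry class and its method names are hypothetical. A real team would more likely keep prompts in Git or a database, but the behavior to look for is the same: history is never overwritten, and going back is one step.

```python
from dataclasses import dataclass, field


@dataclass
class PromptRegistry:
    """Append-only history of prompt versions with a movable 'active' pointer."""

    versions: list[str] = field(default_factory=list)
    active: int = -1  # index into versions; -1 means nothing published yet

    def publish(self, text: str) -> int:
        """Store a new version, make it active, and return its index."""
        self.versions.append(text)
        self.active = len(self.versions) - 1
        return self.active

    def rollback(self, to_version: int) -> None:
        """Point 'active' back at an earlier version without deleting history."""
        if not 0 <= to_version < len(self.versions):
            raise IndexError(f"no such prompt version: {to_version}")
        self.active = to_version

    def current(self) -> str:
        """Return the text of the active prompt version."""
        if self.active < 0:
            raise LookupError("no prompt published yet")
        return self.versions[self.active]


registry = PromptRegistry()
v0 = registry.publish("You are a support assistant. Answer briefly.")
v1 = registry.publish("You are a support assistant. Answer in detail.")
registry.rollback(v0)  # the newer prompt misbehaved, so revert
print(registry.current())
```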

7. How do you evaluate model outputs?

"We test it manually" is not a strategy. Look for evals — a structured testing framework, ideally with regression tests, that runs on every change. Without evals, you're shipping by vibe.
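
A regression eval can be very small and still be useful. The sketch below is one way it might look, under stated assumptions: the test cases, the substring check, and fake_model (a stand-in so the example runs without any API) are all hypothetical. Real harnesses usually score answers with richer checks than substring matching.

```python
from typing import Callable

# Hypothetical regression cases: (question, text the answer must contain).
CASES = [
    ("What are your opening hours?", "9am"),
    ("Do you ship to Canada?", "yes"),
]


def run_evals(answer_fn: Callable[[str], str], cases=CASES) -> float:
    """Return the fraction of cases whose answer contains the expected text."""
    passed = sum(
        1
        for question, expected in cases
        if expected.lower() in answer_fn(question).lower()
    )
    return passed / len(cases)


def fake_model(question: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    canned = {
        "What are your opening hours?": "We open at 9am on weekdays.",
        "Do you ship to Canada?": "Not at the moment.",
    }
    return canned.get(question, "I don't know.")


score = run_evals(fake_model)
print(f"pass rate: {score:.0%}")
```

The point to press the agency on is the trigger: a suite like this should run automatically on every prompt or model change, and a drop in pass rate should block the release.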

8. What's your approach to hallucinations and grounding?

Every credible agency has a clear story about retrieval-augmented generation (RAG), citation, and fallback behavior. They can show you what their system does when it doesn't know the answer. If their answer is "GPT-4 doesn't hallucinate that much," walk away.
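
Fallback behavior is easier to judge once you have seen its shape. The following is a simplified sketch, not a full RAG pipeline: the Passage type, the 0.75 threshold, and the score scale are assumptions, and a production system would pass the retrieved passage to a model rather than return it verbatim.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    source: str   # where the text came from, used for the citation
    text: str
    score: float  # retrieval similarity; a 0-to-1 scale is assumed here


MIN_SCORE = 0.75  # hypothetical threshold; a real one would be tuned with evals

FALLBACK = "I don't have a reliable answer for that. Let me connect you with a person."


def grounded_answer(passages: list[Passage], min_score: float = MIN_SCORE) -> str:
    """Answer only from well-matched passages, with a citation; otherwise hand off."""
    supported = [p for p in passages if p.score >= min_score]
    if not supported:
        return FALLBACK
    best = max(supported, key=lambda p: p.score)
    return f"{best.text} (source: {best.source})"


docs = [Passage("returns-policy.md", "Returns are accepted within 30 days.", 0.91)]
print(grounded_answer(docs))
print(grounded_answer([Passage("blog-post.md", "Unrelated text.", 0.40)]))
```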

9. How do you handle PII and sensitive data?

For any deployment touching customer data, the agency should walk you through their data handling — what gets sent to the LLM, what gets logged, what gets encrypted, what gets purged. If they shrug and say "the platform handles it," they haven't designed it.
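
As one illustration of "what gets sent to the LLM," the sketch below masks obvious emails and phone numbers before text leaves your system. The patterns are deliberately simple assumptions that will miss many formats. A real deployment would rely on a dedicated PII detection step and cover names, addresses, and account numbers as well.

```python
import re

# Simplified patterns for illustration only; they do not catch every format.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace emails and phone-like numbers with placeholders before an LLM call."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


print(redact("Reach me at jane@example.com or +1 (555) 010-2030."))
```

A useful follow-up question for the agency is where this step runs. Redaction that happens after the request is logged does not protect the logs.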

10. Have you handled this specific use case before?

A chatbot for an e-commerce store is different from a voice agent for a dental practice, which is different again from an internal research tool for a law firm. Industry experience matters because the integration patterns, regulatory constraints, and customer expectations differ. Ask for 2–3 examples that are close to your situation.

11. What's your typical implementation timeline?

If they quote "2 weeks" for anything beyond a glorified FAQ bot, they're shipping a demo, not a system. Real custom deployments take 6–16 weeks depending on scope. For benchmarks, see the AI implementation timeline we publish.

12. What's the maintenance plan after launch?

A production AI system needs ongoing work. The agency should propose a clear post-launch retainer with deliverables — model updates, integration changes, performance monitoring, evals. "We'll respond if something breaks" is reactive maintenance, not active improvement.

13. How do you measure success?

A real agency will propose KPIs that map to your business outcomes — leads captured, deflection rate, resolution rate, cost per conversation, customer satisfaction. Beware proposals built around vanity metrics like "messages handled" or "users engaged."
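
Two of these KPIs reduce to simple ratios, sketched below. The definitions are common ones but remain assumptions (agree the exact definitions with the agency in writing), and the monthly figures are hypothetical.

```python
def deflection_rate(total_conversations: int, escalated_to_human: int) -> float:
    """Share of conversations resolved without a human stepping in."""
    if total_conversations <= 0:
        raise ValueError("total_conversations must be positive")
    return (total_conversations - escalated_to_human) / total_conversations


def cost_per_conversation(monthly_cost: float, total_conversations: int) -> float:
    """All-in monthly cost (retainer plus API fees) divided by conversations."""
    if total_conversations <= 0:
        raise ValueError("total_conversations must be positive")
    return monthly_cost / total_conversations


# Hypothetical month: 2,000 conversations, 500 escalated, $1,800 total cost.
rate = deflection_rate(total_conversations=2_000, escalated_to_human=500)
cost = cost_per_conversation(monthly_cost=1_800.0, total_conversations=2_000)
print(f"deflection {rate:.0%}, cost per conversation ${cost:.2f}")
```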

14. What integrations have you built into our specific stack?

If you run on HubSpot + Stripe + Shopify, ask what they've built into each. If their answer is "we can integrate anything," verify with a code review or a reference call. Anything they haven't built into already adds risk and cost.

15. Show us your eval results on a representative task

A serious agency runs evals. Ask to see anonymized eval results — accuracy on a benchmark set, regression test results, performance over time. If they don't have any, they're shipping without measurement.

16. What's your team's AI background?

You don't need a PhD on every project, but you do need at least one team member whose AI/ML experience predates the last 24 months. Ex-Big Tech, ex-research, and ex-AI-first startup backgrounds all count. Career marketers who pivoted to "AI consultant" three months ago do not.

17. What's the contract structure?

Look for clear scope, clear deliverables, clear acceptance criteria, and clear payment milestones tied to deliverables (not the calendar). A vendor demanding 100% upfront for a 12-week project is a red flag.

18. Who owns the relationship after launch?

In smaller agencies, this is usually the person who sold you the project. In larger agencies, you'll be handed to an account manager. Find out who, and meet them before signing.

19. What's your client retention rate?

A capable agency keeps clients. Ask what their 12-month retention rate looks like and what causes the churn that does happen. Honest answers are illuminating.

20. Can we talk to a current client?

Reference calls are gold. Pick one from their list and one of your choosing (find a similar client of theirs via LinkedIn). Ask the references three things: did they deliver on time, did the system do what was promised, and would they hire them again.

Red flags in 2026

A few patterns we see repeatedly that indicate trouble:

"AI strategy" with no engineering team. PowerPoint shops have proliferated. If their team list is all "strategists" and "consultants," they're outsourcing the actual build.
"Proprietary AI" claims without explanation. Most agencies are running on the same underlying LLMs. "Proprietary AI" usually means a wrapper around GPT. That's not inherently bad, but the agency should be honest about it.
No code samples, no GitHub presence, no technical blog. A capable team produces visible technical work. Total opacity is a warning sign.
Vague pricing or "contact for pricing" with no published ranges. Reputable agencies publish ranges or share them on request. Total opacity here usually means prices are set buyer by buyer, and set highest for buyers who don't know the market.
Long contracts with steep early-termination fees. Acceptable for genuinely large engagements. Suspicious for routine work.
Heavy use of stock-photo testimonials. Look for real names, real titles, real LinkedIn profiles.
"Replacement of your team" framing. AI agencies that pitch full team replacement usually under-deliver. The realistic pitch is augmentation, with clear handoff to humans where it matters.
No mention of evals, monitoring, or rollback. Indicates an agency that builds for demo, not for production.
Pressure tactics ("price expires Friday"). Reputable agencies don't sell like timeshares.
No discussion of API cost markup. Silent markups are an industry-wide issue. If they refuse to clarify, assume the markup exists and is meaningful.

Pricing benchmarks for 2026

A practical baseline for what's reasonable. These are typical ranges — outliers exist in both directions.

Project type	Setup fee	Monthly running cost
Light chatbot (FAQ, single channel, light integration)	$5,000–$15,000	$300–$1,000
Standard chatbot (multi-channel, CRM integration)	$15,000–$40,000	$800–$3,000
Custom voice/calling agent	$10,000–$50,000	$1,500–$6,000
Mid-complexity automation suite	$15,000–$60,000	$1,000–$5,000
Custom AI application (SaaS-grade)	$40,000–$200,000+	$5,000–$25,000+
Enterprise multi-system deployment	$100,000–$500,000+	$10,000–$50,000+

Anything substantially below these ranges almost always means a thin wrapper without proper engineering. Anything substantially above usually means corporate sales overhead, custom infrastructure work, or compliance scaffolding that genuinely costs more.
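
When comparing quotes, it can help to fold setup and monthly fees into a single first-year figure. The sketch below does that for the standard chatbot row of the table. The function name is hypothetical, and the calculation assumes twelve full months of run cost with no API fees billed separately.

```python
def first_year_cost(setup: float, monthly: float, months: int = 12) -> float:
    """Setup fee plus run cost for the given number of months."""
    return setup + monthly * months


# Standard chatbot row: setup $15,000-$40,000, monthly $800-$3,000.
low = first_year_cost(15_000, 800)
high = first_year_cost(40_000, 3_000)
print(f"${low:,.0f} to ${high:,.0f}")  # prints "$24,600 to $76,000"
```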

For deeper breakdowns, see what AI chatbots actually cost in 2026 and free AI tools vs. agency hidden costs.

Don't pick an agency on the first call. A realistic evaluation looks like:

Discovery call (30 min) — what they offer, basic fit check
Use-case deep dive (60 min) — your specific problem, their proposed approach, examples of prior work
Technical interview (60 min) — meet the engineer who'd build it. Ask the technical questions above directly.
Reference calls (2x 30 min) — two of their current clients
Pilot or paid discovery scope — a small, scoped engagement before signing the larger project

Reputable agencies welcome this. The ones who push for a same-week signature are selecting for buyers who won't do this kind of diligence.

What good agencies look like in 2026

The healthy signs:

Named, accessible engineers with real AI/ML backgrounds
Public technical writing (blogs, talks, case studies with metrics)
Honest scoping — including telling you when a platform is the better answer
Transparent cost structure with passed-through API fees
Production deployments live for 6+ months with measurable outcomes
Real references willing to talk
Reasonable contract terms with milestone-tied payments
A maintenance offering that's about active improvement, not just bug fixes

How SpeedX Marketing answers this checklist

Briefly, because we'd rather you ask us directly:

Team: Senior engineers and AI specialists across Pakistan and the UK, with 12+ years of design and development experience across our agency. Multinational team, not subcontractors.
Stack: GPT, Claude, Gemini, plus open-source for data-residency-sensitive work. Choice driven by use case.
Pricing: Setup $10,000–$60,000 for most engagements. Monthly $500–$5,000 for ongoing work. API costs passed through at vendor cost. No markup.
Ownership: You own code, training data, and configuration.
Maintenance: Active retainer model, not reactive.
References: Available on request — we've delivered 600+ business engagements across our history.

For service overviews, browse our AI automation services in New York, AI chatbot development services in Los Angeles, or AI application development services in San Francisco.

Free automation opportunity assessment

If you're evaluating agencies and want a no-pressure conversation about scope, cost, and approach — book a free 30-minute call. We'll answer the 20 questions above on the call, in detail, with examples. No deck. Message us on WhatsApp, email info@speedxmarketing.com, or reach out through our contact page.