Vercel just reported an 86% year-over-year revenue jump to a $340 million GAAP run-rate — and the story behind that number is a direct playbook for anyone deploying AI agents in production. As Forbes reported in March 2026, the growth is being fueled by the AI coding boom and Vercel’s aggressive repositioning as the infrastructure layer — the “AI Cloud” — that sits beneath autonomous agents. This tutorial walks through exactly how Vercel’s platform works, what its core primitives do, and how to use them to deploy agents that don’t fall over, blow your budget, or create security nightmares.
What This Is
Vercel has historically been known as the easiest way to ship a Next.js app. That reputation remains, but it now sits on top of a fundamentally different infrastructure story. According to the Vercel Agent Orchestration Strategic Briefing produced by NotebookLM deep research, the company has rebuilt its platform around a set of primitives specifically designed for agentic workloads — the kind of compute that is long-running, non-deterministic, and expensive if mismanaged.
The shift is significant. Traditional serverless infrastructure was designed for short-lived, stateless functions: a user hits an API route, the function runs for 50 milliseconds, it returns a response, and the container dies. AI agents break every assumption in that model. An agent orchestrating a multi-step content generation pipeline might be alive for 45 minutes. It might call five different LLMs. It might write to a database, hit an external API, and execute code it generated itself. Serverless cold-start economics become irrelevant when your “function” needs to survive a network timeout halfway through a 30-step reasoning chain.
Vercel’s answer to this is a stack of purpose-built primitives. The five core components, as documented in the research report, are:
- Sandboxes: Isolated Linux virtual machines that execute code generated by autonomous agents without exposing the host system.
- Fluid Compute: A compute execution model that scales with long-running, data-heavy workloads — files, video processing, complex pipelines — rather than forcing everything into millisecond-capped functions.
- AI Gateway: A single unified endpoint that routes requests to hundreds of different models, handles fallbacks automatically, and provides centralized budget controls.
- Workflows: Durable orchestration for multi-step agent operations, built on top of Temporal, with automatic retries and state recovery.
- Observability: Full tracing of prompts, model responses, and decision paths — the foundation for debugging non-deterministic agent behavior.
The platform’s flagship user-facing product is v0.app, which started as a UI component generator in 2023 and has since been rebuilt as a fully agentic development platform. As the research report describes it, the current version of v0 can “think, plan, and take actions autonomously” — connecting to databases, integrating third-party APIs, and pushing code directly to GitHub. A production subscription management dashboard can go from prompt to deployed in approximately 15 minutes.
Vercel CEO Guillermo Rauch has framed this evolution as the company’s core thesis: the challenge for modern development teams isn’t building AI agents, it’s running them. That framing is the foundation of everything the platform has shipped over the past 18 months.
Why It Matters
The line that captures Vercel’s market positioning best comes from their own platform documentation, cited in the research report: “Prototyping is democratized, but production deployment isn’t.”
That’s a precise diagnosis of where most organizations are in 2026. LLMs have made it trivially easy to scaffold a working agent prototype in an afternoon. Claude can write the code. ChatGPT can suggest the architecture. Any engineer (or motivated product manager) can get something running locally. The hard part — the part that’s creating both massive opportunity and significant risk — is what happens next.
According to the research report, there are three specific failure modes that practitioners are running into constantly:
The Cost Explosion Problem: A system that should cost $500/month to run can balloon to $5,000/month without platform-level optimization. LLM calls in agentic loops are expensive, especially when agents re-process context they’ve already seen. Without caching at the infrastructure level, you’re paying full price for every token, every time. Vercel’s AI Gateway addresses this with automatic caching across providers — more on the implementation in the tutorial section.
The Reliability Gap: Multi-step agents fail at the infrastructure level constantly. A network timeout on step 7 of a 12-step pipeline doesn’t just mean that step fails — without durable execution, it means you lose all state and have to restart from step 1, burning tokens and time. Vercel’s Workflows primitive, built on Temporal, wraps every non-deterministic operation so the system picks up exactly where it left off. As the research report puts it: “If an LLM call hits a rate limit or a network hiccup occurs mid-request, the system picks up exactly where it left off without losing state or wasting tokens.”
The Shadow IT Security Risk: The research report quotes directly: “Vibe coding has created one of the largest shadow IT problems in history.” When non-engineers generate functional code through natural language prompts and push it to production without standard security review, you get API routes with no rate limiting, agents that can execute arbitrary code on production infrastructure, and LLM-generated logic with no observability. Vercel’s Sandboxes directly address the code execution risk; their WAF rate-limiting templates handle the API exposure risk.
For practitioners — whether you’re a solo engineer, an agency building for clients, or an enterprise engineering leader — the platform matters because it converts the operational expertise that used to require a senior DevOps team into configurable primitives. You don’t need to build your own retry logic, your own caching layer, or your own sandboxing. The infrastructure handles it.
The Data
Vercel vs. Cloudflare: Edge Infrastructure Comparison
The most direct competitive comparison for Vercel in the serverless and edge compute space is Cloudflare Workers. Both platforms target the same use case — globally distributed compute close to users — but with different trade-offs. The following comparison is drawn from the research report:
| Feature | Vercel Edge Functions | Cloudflare Workers |
|---|---|---|
| Cold Starts | Sub-50ms (highly optimized) | Near-zero (<1ms) via V8 isolates |
| Runtime Support | Node.js, Python, Go, Ruby | JS/TS, WebAssembly (Rust, C++) |
| Next.js Integration | Native, superior (ISR, etc.) | Available via Pages |
| Free Tier Bandwidth | 100GB/month | Unlimited |
| Global Distribution | 100+ locations | 300+ data centers |
Vercel AI Platform Core Primitives
| Primitive | Function | Business Impact |
|---|---|---|
| Sandboxes | Isolated Linux VMs for executing autonomous or untested code | Prevents runaway operations, contains prompt injection attacks |
| Fluid Compute | Demand-scaled execution for long-running compute | Predictable costs for data-heavy workloads |
| AI Gateway | Single endpoint for hundreds of models with auto-fallbacks | Eliminates vendor lock-in, enables unified budget control |
| Workflows | Durable multi-step orchestration with state recovery | Automatic retries, no lost progress on failures |
| Observability | Full tracing of prompts, responses, and decision paths | Debuggable non-deterministic agent behavior |
Source: Vercel Agent Orchestration Strategic Briefing
Step-by-Step Tutorial: Deploy a Durable AI Agent on Vercel
This walkthrough covers deploying a production-grade AI agent on Vercel’s platform, using the AI Gateway for multi-model routing, Workflows for durable execution, and Sandboxes for safe code execution. The example use case is a content generation agent that pulls a brief, generates structured copy, and pushes it to a CMS — a representative multi-step agentic workflow.
Prerequisites
Before you start, you’ll need:
- A Vercel account (Pro tier or higher for Sandboxes and Workflows)
- Node.js 20+ installed locally
- The Vercel CLI (npm i -g vercel)
- API keys for at least two LLM providers (for Gateway fallback configuration)
- A GitHub account for deployment integration
Phase 1: Initialize the Project and Configure the AI Gateway
Step 1: Create a new Next.js project with the AI SDK
npx create-next-app@latest my-agent --typescript --app
cd my-agent
npm install ai @ai-sdk/openai @ai-sdk/anthropic
vercel link
Step 2: Configure the AI Gateway endpoint
In your Vercel project dashboard, navigate to Settings → AI Gateway. Create a new Gateway instance. You’ll receive a unified endpoint URL that looks like:
https://gateway.ai.vercel.app/v1/{your-project-id}/{gateway-name}
This single endpoint will route to whichever provider you configure — OpenAI, Anthropic, Google, DeepSeek, and others — based on rules you define. The key benefit: if your primary provider returns a rate limit error, the Gateway automatically fails over to your secondary provider without any code changes on your end.
Step 3: Enable automatic caching
In your Gateway configuration, set caching to auto. The research report documents how this works across providers: OpenAI, Google, and DeepSeek use implicit caching (handled automatically by the provider). Anthropic and MiniMax require explicit cache_control breakpoints — the AI Gateway handles this by automatically adding the breakpoints to static content when caching: 'auto' is set.
In your application code:
import { createOpenAI } from '@ai-sdk/openai';

const gateway = createOpenAI({
  baseURL: process.env.VERCEL_AI_GATEWAY_URL,
  apiKey: process.env.VERCEL_AI_GATEWAY_TOKEN,
});

// The gateway handles caching and fallbacks transparently
const model = gateway('gpt-4o');
For agents that run the same system prompt repeatedly — like a content agent processing dozens of briefs — this caching layer alone can reduce your token spend by 30-60% on the context that doesn’t change between runs.
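To make that saving concrete, here is a back-of-the-envelope estimator. It is a sketch: the per-token price and the cached-read discount are illustrative placeholders, not Vercel or provider billing figures.

```typescript
// Rough cost model for prompt caching on a repeated static system prompt.
// Prices and cache discount are illustrative placeholders, not real rates.
interface CacheEstimate {
  uncachedCostUSD: number;
  cachedCostUSD: number;
  savingsPct: number;
}

function estimateCachingSavings(
  runsPerDay: number,
  staticPromptTokens: number,
  dynamicTokensPerRun: number,
  pricePerMillionTokensUSD: number,
  cachedReadDiscount: number // e.g. 0.9 means cached tokens cost 10% of full price
): CacheEstimate {
  const perToken = pricePerMillionTokensUSD / 1_000_000;
  const uncached =
    runsPerDay * (staticPromptTokens + dynamicTokensPerRun) * perToken;
  // With caching: pay full price for the static prompt once, the discounted
  // rate on every later run, and full price for the dynamic tokens always.
  const cached =
    (staticPromptTokens +
      (runsPerDay - 1) * staticPromptTokens * (1 - cachedReadDiscount) +
      runsPerDay * dynamicTokensPerRun) * perToken;
  const savingsPct = (1 - cached / uncached) * 100;
  return { uncachedCostUSD: uncached, cachedCostUSD: cached, savingsPct };
}

// 1,000 runs/day, 2,000-token static system prompt, 500 dynamic tokens per run
const est = estimateCachingSavings(1000, 2000, 500, 3, 0.9);
console.log(est.savingsPct.toFixed(1)); // prints "71.9" under these assumptions
```

The exact percentage depends entirely on how much of each request is static context versus fresh input, which is why the realized savings vary so widely between agents.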
Phase 2: Implement Durable Workflows
The single most important reliability improvement you can make for a multi-step agent is wrapping its execution in a durable workflow. Without this, a transient failure on step 6 of 12 means you lose all state and start over. With Temporal-backed Workflows on Vercel, the agent resumes from the exact point of failure.
Step 4: Install the Workflow dependencies
npm install @vercel/workflow-sdk
Step 5: Define your workflow
Create app/api/agent-workflow/route.ts:

import { workflow, activity } from '@vercel/workflow-sdk';
import { generateText } from 'ai';
// The Gateway client configured in Step 3, moved into a shared module
import { gateway } from '@/lib/gateway';

// Illustrative shape for a brief; adapt to your CMS schema
interface Brief {
  id: string;
  topic: string;
  notes: string;
}

// Each activity is a discrete, retryable step
const fetchBrief = activity('fetch-brief', async (briefId: string): Promise<Brief> => {
  const response = await fetch(`https://your-cms.com/briefs/${briefId}`);
  return response.json();
});

const generateContent = activity('generate-content', async (brief: Brief) => {
  const { text } = await generateText({
    model: gateway('claude-3-7-sonnet'),
    system: 'You are a professional content writer...',
    prompt: `Write a blog post based on: ${JSON.stringify(brief)}`,
  });
  return text;
});

const publishToCMS = activity('publish-to-cms', async (content: string, briefId: string) => {
  const response = await fetch('https://your-cms.com/posts', {
    method: 'POST',
    body: JSON.stringify({ content, briefId }),
  });
  return response.json();
});

// The workflow orchestrates activities with automatic retry and state recovery
export const contentAgentWorkflow = workflow('content-agent', async (briefId: string) => {
  const brief = await fetchBrief(briefId);
  const content = await generateContent(brief);
  return publishToCMS(content, briefId);
});
Step 6: Create the trigger endpoint
// app/api/run-agent/route.ts
import { contentAgentWorkflow } from '../agent-workflow/route';

export async function POST(request: Request) {
  const { briefId } = await request.json();

  // Start the workflow — this returns immediately with a workflow ID
  const run = await contentAgentWorkflow.start(briefId);

  return Response.json({
    workflowId: run.id,
    status: 'started',
  });
}
If the generateContent activity fails due to a rate limit, Vercel’s Temporal integration will retry it with exponential backoff. The fetchBrief step won’t re-run because its result is already persisted in the workflow state. This is the core value of durable execution: each completed step is checkpointed.
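For intuition about what those retries look like over time, here is the standard exponential-backoff schedule of the kind Temporal applies. The function and parameter names are illustrative, not the Vercel SDK's API.

```typescript
// Compute retry delays with exponential backoff and a maximum cap.
// Mirrors common Temporal-style retry policies; names are illustrative.
function backoffSchedule(
  initialMs: number,
  factor: number,
  maxDelayMs: number,
  maxAttempts: number
): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // Each attempt waits factor× longer than the last, up to the cap
    delays.push(Math.min(initialMs * Math.pow(factor, attempt), maxDelayMs));
  }
  return delays;
}

// First retry after 1s, doubling each time, capped at 30s
console.log(backoffSchedule(1000, 2, 30000, 6));
// [1000, 2000, 4000, 8000, 16000, 30000]
```

The cap matters for rate limits: without it, a long outage would push retry intervals into hours, delaying recovery well past the point the provider is healthy again.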
Phase 3: Add Sandbox Execution for Agent-Generated Code
If your agent generates and executes code — for tasks like data analysis, custom formula evaluation, or dynamic content transformation — you must run that code in a Sandbox, not on your application server.
Step 7: Create a Sandbox execution endpoint
// app/api/execute-code/route.ts
import { Sandbox } from '@vercel/sandbox';

export async function POST(request: Request) {
  const { code, language } = await request.json();

  // Spin up an isolated Linux VM — this is NOT your application server
  const sandbox = await Sandbox.create({
    timeout: 30000, // 30-second execution limit
    memory: 512,    // 512MB RAM cap
  });

  try {
    const result = await sandbox.execute(code, { language });
    return Response.json({ output: result.stdout, error: result.stderr });
  } finally {
    // Always destroy the sandbox — each execution gets a fresh environment
    await sandbox.destroy();
  }
}
The research report is explicit about why this matters: Sandboxes “prevent runaway operations and contain prompt injection attacks.” If an adversarial user crafts a prompt that causes your agent to generate malicious code, that code runs in an isolated VM with no access to your application server, your database credentials, or your deployment environment.
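The Sandbox is the actual security boundary, but a cheap static screen before code ever reaches it adds defense in depth and rejects the obvious cases early. This is a sketch: the pattern list is illustrative and trivially incomplete, and it supplements the Sandbox rather than replacing it.

```typescript
// Naive static screen for obviously suspicious agent-generated code.
// Pattern list is illustrative and easy to evade; the Sandbox remains
// the real isolation layer. This just fails fast on cheap, obvious cases.
const SUSPICIOUS_PATTERNS: RegExp[] = [
  /process\.env/,          // reading environment variables
  /child_process/,         // spawning subprocesses
  /require\(['"]fs['"]\)/, // direct filesystem access
  /\bexec\s*\(/,           // shelling out
];

function screenGeneratedCode(code: string): { allowed: boolean; reason?: string } {
  for (const pattern of SUSPICIOUS_PATTERNS) {
    if (pattern.test(code)) {
      return { allowed: false, reason: `matched ${pattern}` };
    }
  }
  return { allowed: true };
}

console.log(screenGeneratedCode('const x = 1 + 1;').allowed);              // true
console.log(screenGeneratedCode('console.log(process.env.KEY)').allowed);  // false
```

Rejected snippets never consume a sandbox; everything that passes still runs only inside the isolated VM.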
Phase 4: Configure Observability and Rate Limiting
Step 8: Enable tracing in your AI calls
Vercel’s Observability primitive automatically captures prompts and responses when you use the AI SDK through the Gateway. To enable structured tracing:
import { generateText } from 'ai';

const { text, usage } = await generateText({
  model: gateway('gpt-4o'),
  prompt: userPrompt,
  experimental_telemetry: {
    isEnabled: true,
    functionId: 'content-generation',
    metadata: { briefId, userId },
  },
});
Every call will appear in your Vercel Observability dashboard with full prompt/response tracing, token usage, latency breakdown, and decision paths — critical for debugging when an agent produces unexpected output.
Step 9: Add WAF rate limiting to all agent API routes
In your vercel.json:
{
  "firewall": {
    "rules": [
      {
        "name": "Rate limit agent routes",
        "match": { "path": "/api/run-agent" },
        "action": {
          "type": "rate-limit",
          "requests": 10,
          "window": "1m"
        }
      }
    ]
  }
}
The research report flags this specifically: “Implement Vercel WAF rate-limiting templates on all /api/chat routes to prevent bad actors from incurring excessive usage costs.” Without this, a single malicious actor — or a client-side bug that creates an infinite loop — can generate thousands of LLM calls against your account.
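The WAF enforces this policy at the edge. For local development, or as an extra in-process guard, the same policy (10 requests per minute per key) can be sketched as a sliding-window counter. This is an illustration of the semantics, not a production rate limiter: it is per-process and in-memory only.

```typescript
// In-memory sliding-window rate limiter: allow `limit` requests per `windowMs`.
// Single-process illustration only; the WAF rule is what protects production.
class SlidingWindowLimiter {
  private hits = new Map<string, number[]>();
  constructor(private limit: number, private windowMs: number) {}

  allow(key: string, nowMs: number = Date.now()): boolean {
    const cutoff = nowMs - this.windowMs;
    // Keep only the timestamps that fall inside the current window
    const recent = (this.hits.get(key) ?? []).filter((t) => t > cutoff);
    if (recent.length >= this.limit) {
      this.hits.set(key, recent);
      return false; // over the limit: reject
    }
    recent.push(nowMs);
    this.hits.set(key, recent);
    return true;
  }
}

// Mirror the vercel.json rule above: 10 requests per 1-minute window
const limiter = new SlidingWindowLimiter(10, 60_000);
```

A sliding window avoids the burst-at-the-boundary problem of fixed windows, where a client can send the full quota at the end of one window and again at the start of the next.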
Expected Outcomes
After completing this setup, you’ll have an agent that:
- Routes LLM calls through a unified Gateway with automatic failover and caching
- Executes multi-step workflows that survive network failures and rate limits
- Runs agent-generated code in isolated VMs that cannot affect your production environment
- Provides full observability on every LLM call, prompt, and decision
- Enforces rate limits that prevent cost explosions from abuse or bugs
Real-World Use Cases
Use Case 1: Content Marketing Pipeline for Agencies
Scenario: A digital marketing agency manages content production for 40+ clients. Each week, their team manually writes briefs, generates drafts, routes for approval, and publishes — a process that takes 3-4 days per client.
Implementation: Deploy a content agent workflow using Vercel Workflows. The agent receives a brief via API, uses the AI Gateway to generate a first draft (with Anthropic as primary, OpenAI as fallback), runs an SEO analysis activity, formats for the target CMS, and triggers a Slack notification for human review. The full pipeline runs as a durable workflow, so a model outage at 2 AM doesn’t kill the run.
Expected Outcome: Pipeline completes overnight without human intervention for the generation phase. Agency team reviews and approves in the morning. Content production time drops from 3-4 days to same-day turnaround for standard formats.
Use Case 2: Internal Data Agent for Non-Technical Teams
Scenario: A SaaS company’s sales team constantly submits data requests to engineering — “show me accounts over $10K ARR that haven’t logged in for 30 days.” Engineering backlog for these requests is 2 weeks.
Implementation: Use Vercel’s d0 data agent model (text-to-SQL) documented in the research report. The agent accepts natural language queries, converts them to SQL via an LLM call routed through the AI Gateway, executes against a read-only database replica (not production), and returns formatted results. All SQL execution happens in a Sandbox.
Expected Outcome: Sales team gets self-serve data access. Engineering backlog for ad-hoc queries drops to zero. SQL generated by the LLM is isolated from production — a malformed query can’t take down the database.
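A minimal guard for that read-only contract: reject anything that is not a single SELECT statement before it reaches even the replica. This is a sketch; real deployments should also rely on database-level read-only credentials, which this check supplements rather than replaces.

```typescript
// Reject LLM-generated SQL that isn't a single read-only SELECT.
// Deliberately conservative; read-only replica credentials remain the
// real enforcement layer, this just fails fast on obvious violations.
const WRITE_KEYWORDS = /\b(insert|update|delete|drop|alter|truncate|create|grant)\b/i;

function isReadOnlySelect(sql: string): boolean {
  const trimmed = sql.trim().replace(/;+\s*$/, '');
  if (trimmed.includes(';')) return false;        // no multi-statement batches
  if (!/^select\b/i.test(trimmed)) return false;  // must start with SELECT
  if (WRITE_KEYWORDS.test(trimmed)) return false; // no write keywords anywhere
  return true;
}

console.log(isReadOnlySelect('SELECT id FROM accounts WHERE arr > 10000')); // true
console.log(isReadOnlySelect('DROP TABLE accounts'));                       // false
console.log(isReadOnlySelect('SELECT 1; DELETE FROM accounts'));            // false
```

Keyword screening alone is known to be bypassable, which is exactly why the article's pattern pairs it with a read-only replica and Sandbox execution.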
Use Case 3: Rapid Prototyping with v0.app
Scenario: A product team has a Figma design for a subscription management dashboard. Getting it into working React components typically takes two weeks of developer time.
Implementation: Use v0.app to convert the Figma designs to React/Tailwind components. As the research report documents, the agentic v0 can connect to the project’s existing database schema, generate component logic that reflects real data structures, and push directly to a GitHub branch for review. The research report cites examples where full applications are deployed in approximately 15 minutes.
Expected Outcome: Working prototype deployed to a Vercel preview URL within an hour. Developer reviews and refines rather than writing from scratch. Two-week frontend task becomes a one-day review cycle.
Use Case 4: Multi-Model AI Research Pipeline
Scenario: A fintech company runs daily market research that involves pulling news, summarizing findings across 20+ sources, cross-referencing with internal data, and generating analyst briefs.
Implementation: Build a multi-step workflow where each step uses the optimal model for the task: a fast, cheap model (DeepSeek via AI Gateway) for initial summarization of raw articles, a more capable model (Claude 3.7 Sonnet) for synthesis and cross-referencing, and GPT-4o for final report formatting. The AI Gateway handles model routing; the Workflow primitive ensures each step’s output is checkpointed before the next step starts.
Expected Outcome: Full research pipeline runs each morning automatically. Token costs are minimized by using cheaper models for high-volume summarization tasks. Failures in any single step don’t restart the entire pipeline.
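The routing logic in a pipeline like that reduces to a small task-to-model map. Here is a sketch: the model identifiers follow the examples in this article, but the tier assignments are illustrative and should be tuned against your own quality and cost measurements.

```typescript
// Route each pipeline step to the cheapest model that handles it well.
// Model IDs follow this article's examples; the mapping is illustrative.
type Task = 'summarize' | 'synthesize' | 'format';

const MODEL_FOR_TASK: Record<Task, string> = {
  summarize: 'deepseek-chat',      // high-volume, cheap first pass
  synthesize: 'claude-3-7-sonnet', // cross-referencing and reasoning
  format: 'gpt-4o',                // final report formatting
};

function modelForTask(task: Task): string {
  return MODEL_FOR_TASK[task];
}

console.log(modelForTask('summarize')); // deepseek-chat
```

Because every call goes through the AI Gateway, changing a tier assignment is a one-line edit to this map rather than a provider-SDK migration.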
Common Pitfalls
Pitfall 1: Running agent-generated code outside a Sandbox
The mistake: your agent generates a Python script to process uploaded data, and you execute it directly on your server using child_process.exec(). The research report documents this explicitly as a prompt injection vector — a crafted input can cause the agent to generate code that exfiltrates environment variables.
The fix: never execute LLM-generated code outside a Vercel Sandbox. The isolated VM has no access to your application environment.
Pitfall 2: Not enabling AI Gateway caching
The mistake: you call the LLM directly with the same system prompt on every request. For an agent that processes 1,000 documents per day with a 2,000-token system prompt, you’re paying for 2 million tokens daily that never change.
The fix: route all LLM calls through the AI Gateway with caching: 'auto'. The research report confirms this “significantly reduce[s] costs for repetitive prompts in agentic loops.”
Pitfall 3: Building stateless workflows for stateful agent operations
The mistake: using standard serverless functions for multi-step agents. A 15-minute pipeline has a high probability of hitting a function execution timeout, a network error, or a provider rate limit — and all progress is lost.
The fix: wrap every multi-step agent operation in a Vercel Workflow backed by Temporal. Each activity is checkpointed. Failures retry from the last successful step.
Pitfall 4: No rate limiting on agent-facing API routes
The mistake: shipping /api/run-agent without rate limiting. A client-side bug that calls the endpoint in a loop — or a deliberate attack — can generate thousands of LLM calls and hundreds of dollars in minutes.
The fix: apply WAF rate-limiting rules to all agent API routes in vercel.json before going to production.
Pitfall 5: Over-engineering the initial build
The mistake: building custom orchestration infrastructure before validating the agent’s core logic works. The research report notes that AI models “often architect inefficient infrastructure” — and developers following LLM-generated scaffolding hit the same trap.
The fix: use v0.app to prototype the UI and workflow quickly, validate the agent’s core behavior works, then harden the infrastructure with Workflows and Sandboxes.
Expert Tips
Tip 1: Use the AI Gateway as your cost control layer, not just a routing layer. Set budget limits per project in the Gateway configuration. When a project hits its daily LLM spend threshold, the Gateway can stop routing requests rather than letting costs accumulate until you notice. This is the production-grade equivalent of a circuit breaker.
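The circuit-breaker behavior can be sketched in a few lines. This illustrates the pattern only; the Gateway's built-in budget limits are the production mechanism, and the class below is not its configuration API.

```typescript
// Daily spend circuit breaker: once the budget is exhausted, refuse
// further LLM calls until the day rolls over. Pattern illustration only;
// the AI Gateway's built-in budget limits are the production mechanism.
class SpendBreaker {
  private spentUSD = 0;
  private day = '';
  constructor(private dailyBudgetUSD: number) {}

  tryCharge(costUSD: number, date: Date = new Date()): boolean {
    const today = date.toISOString().slice(0, 10);
    if (today !== this.day) {
      this.day = today; // new day: reset the counter
      this.spentUSD = 0;
    }
    if (this.spentUSD + costUSD > this.dailyBudgetUSD) {
      return false; // circuit open: refuse the call
    }
    this.spentUSD += costUSD;
    return true;
  }
}

const breaker = new SpendBreaker(50); // $50/day cap
```

The key property is that a refused call costs nothing: the breaker fails before the LLM request is made, not after the bill arrives.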
Tip 2: Structure Workflows with fine-grained activities. Don’t wrap your entire agent in a single activity. Break it into discrete steps — fetch, process, transform, publish. The smaller each activity, the less work is repeated when a retry occurs. A 10-step workflow where each step takes 3 minutes is far more resilient than a single 30-minute function.
Tip 3: Use v0.app for internal tools, not just customer-facing UIs. The research report documents v0 as targeting both developers and non-technical users like designers and product managers. Internal dashboards, admin tools, and data visualization panels are high-ROI targets for v0-generated code — they don’t need to be pixel-perfect, they just need to work.
Tip 4: Layer your Observability metadata. When tracing LLM calls, include userId, sessionId, and the specific workflow step in your metadata. When an agent produces a bad output, you need to be able to trace it back to the exact prompt, model, and context that produced it. Generic traces without metadata make debugging non-deterministic behavior nearly impossible.
Tip 5: Deploy to Edge Functions for latency-sensitive agent interactions. The research report documents Vercel Edge Functions at sub-50ms cold starts, compared to 100-300ms for traditional regional serverless. For streaming agent responses to end users — the chat-style interface where you’re streaming tokens — the difference in perceived responsiveness is significant. Route user-facing streaming endpoints through Edge Functions; reserve Fluid Compute for the heavy backend processing.
FAQ
Q: Is Vercel’s AI platform only for Next.js projects?
A: No. While Next.js integration is the most seamless path — as documented in the research report, the integration is “native, superior” — Vercel supports Node.js, Python, Go, and Ruby runtimes. The AI Gateway, Sandboxes, and Workflows are platform-level primitives accessible from any runtime. That said, if you’re starting a new project and don’t have a framework preference, Next.js gives you the tightest integration with features like Incremental Static Regeneration.
Q: How does the AI Gateway handle provider outages?
A: The AI Gateway maintains your fallback configuration and switches providers automatically when a primary provider returns errors. The research report describes it as providing “automatic fallbacks” as a core feature. In practice, you configure a priority list — say, Claude 3.7 Sonnet as primary, GPT-4o as secondary — and the Gateway handles the failover logic without any code changes on your end.
Q: What’s the cost difference between using Fluid Compute vs. standard serverless for AI workloads?
A: Standard serverless functions are capped at roughly 60-second execution limits and priced per invocation, with memory fixed at the function’s configured limit. For agents running multi-minute pipelines or processing large files, this forces artificial splitting of tasks and increases complexity. Fluid Compute scales execution time and memory to actual demand, which per the research report “maintains predictable costs for data-heavy workloads.” The cost efficiency comes from not paying for idle capacity during LLM “thinking time” while still being able to handle large payloads.
Q: Can I use Vercel Sandboxes for Python data analysis workloads?
A: Yes. Sandboxes spin up isolated Linux VMs, which support any language that runs on Linux. Python with pandas, numpy, or any other data processing library can execute in a Sandbox. This is the recommended pattern for any agent that processes user-uploaded data or runs analytical code generated by an LLM, per the research report security guidance.
Q: How does Vercel’s 86% growth compare to the broader AI infrastructure market?
A: Per the Forbes reporting cited at the top of this article, Vercel’s $340M GAAP run-rate at 86% YoY is a strong signal that production AI deployment infrastructure — as opposed to model providers — is a high-growth segment. The research report contextualizes this as the market transitioning from “build” to “run” focus, where the platform layer is capturing value as prototyping commoditizes.
Bottom Line
Vercel’s $340M run-rate and 86% growth aren’t just a fundraising milestone — they’re a signal that production AI infrastructure is where the real market is heading in 2026. As the research report frames it, competitive advantage now comes from “rapid iteration on AI that solves real problems… and reliably operating those systems at scale,” not from the ability to scaffold a prototype. The platform’s five core primitives — Sandboxes, Fluid Compute, AI Gateway, Workflows, and Observability — address the failure modes practitioners actually face: cost explosions, workflow failures, security gaps, and the near-impossibility of debugging non-deterministic behavior without tracing. If you’re building AI agents that need to run in production for real users, the tutorial above gives you the exact architecture to deploy on Vercel without reinventing durable execution, caching, or sandboxing infrastructure from scratch. The “build vs. run” gap is the defining engineering challenge of the current AI moment — and it’s now solvable at the platform level.