
Next.js OpenAI Streaming: Implement Real-Time Responses with Rate Limiting
Learn how to implement Next.js OpenAI streaming with a Route Handler and add rate limiting to protect your API usage. Includes streaming server + client examples and practical production guidance.
Why Next.js OpenAI streaming matters
Streaming OpenAI responses in Next.js lets you render tokens as they arrive, which makes chat and “AI assistant” UIs feel instant. Instead of waiting for the full completion, you can progressively update the UI—especially valuable for longer answers.
In production, you also need rate limiting to protect your OpenAI budget, prevent abuse, and keep your app responsive under load. This guide shows a practical setup for Next.js OpenAI streaming plus a straightforward rate-limiting approach you can adapt to your infrastructure.
Prerequisites
- Next.js 13+ (App Router recommended)
- A server-side OpenAI API key stored in environment variables (never expose it to the browser)
- Node.js runtime available for your route handler (streaming is simplest there)
Project setup (environment variables)
Add your OpenAI API key to your environment. For local development, put it in .env.local:
# .env.local
OPENAI_API_KEY=your_api_key_here
Make sure you only read this variable on the server (Route Handlers, Server Actions, or other server-only code).
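Since a missing key otherwise surfaces as a confusing error on the first request, a small guard can fail fast at startup instead. This helper is illustrative (the name `requireEnv` is not part of any library):

```typescript
// Hypothetical helper: read a required server-side env var and fail
// fast at module load instead of erroring on the first request.
function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Example (server-only code): const apiKey = requireEnv("OPENAI_API_KEY");
```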
Architecture overview
- Client: sends a prompt to a Next.js Route Handler and reads a streamed response.
- Server (Next.js Route Handler): calls OpenAI with streaming enabled and forwards the stream to the client.
- Rate limiting: enforced on the server per user (or per IP) before calling OpenAI.
Implement Next.js OpenAI streaming (Route Handler)
The most common approach in Next.js App Router is a POST Route Handler that returns a streaming response. Below is an example using the official OpenAI JavaScript SDK and Server-Sent Events (SSE) style streaming to the browser.
Create a route at: app/api/chat/route.ts
import OpenAI from "openai";
export const runtime = "nodejs"; // Streaming is typically easiest on Node.js runtime
const client = new OpenAI({
apiKey: process.env.OPENAI_API_KEY,
});
export async function POST(req: Request) {
const { messages } = await req.json();
// Create a stream from OpenAI
const stream = await client.chat.completions.create({
model: "gpt-4o-mini",
messages,
stream: true,
});
// Convert the OpenAI SDK stream to a Web ReadableStream
const encoder = new TextEncoder();
const readable = new ReadableStream({
async start(controller) {
try {
for await (const event of stream) {
const delta = event.choices?.[0]?.delta?.content;
if (delta) {
// SSE-style: send "data:" lines
controller.enqueue(encoder.encode(`data: ${delta}\n\n`));
}
}
controller.enqueue(encoder.encode("data: [DONE]\n\n"));
controller.close();
} catch (err) {
controller.error(err);
}
},
});
return new Response(readable, {
headers: {
"Content-Type": "text/event-stream; charset=utf-8",
"Cache-Control": "no-cache, no-transform",
Connection: "keep-alive",
},
});
}
Notes:
- This example uses SSE formatting (data: ...). The client can parse it incrementally.
- Because each delta is sent as raw text, a token containing a newline can break the "data:" framing on the client. If your model output includes line breaks, JSON-encode each payload instead.
- The model name is configurable. Use a model available to your account.
- If you deploy to an environment that defaults to an Edge runtime, explicitly setting runtime = "nodejs" can avoid streaming limitations, depending on your provider.
Client-side: consume the streamed response
On the client, you can POST messages and read the response body as a stream. Here’s a minimal example using fetch and a reader to append tokens as they arrive.
"use client";
import { useState } from "react";
export default function Chat() {
const [input, setInput] = useState("");
const [output, setOutput] = useState("");
const [loading, setLoading] = useState(false);
async function send() {
setLoading(true);
setOutput("");
const res = await fetch("/api/chat", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
messages: [{ role: "user", content: input }],
}),
});
if (!res.body) {
setLoading(false);
throw new Error("No response body");
}
const reader = res.body.getReader();
const decoder = new TextDecoder();
let buffer = "";
while (true) {
const { value, done } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
// Parse SSE chunks separated by double newlines
const parts = buffer.split("\n\n");
buffer = parts.pop() || "";
for (const part of parts) {
const line = part.trim();
if (!line.startsWith("data:")) continue;
const data = line.slice("data:".length).trim();
if (data === "[DONE]") {
setLoading(false);
return;
}
setOutput((prev) => prev + data);
}
}
setLoading(false);
}
return (
<div style={{ maxWidth: 720, margin: "40px auto", padding: 16 }}>
<h1>Next.js OpenAI streaming demo</h1>
<textarea
value={input}
onChange={(e) => setInput(e.target.value)}
rows={4}
style={{ width: "100%" }}
placeholder="Ask something..."
/>
<button onClick={send} disabled={loading || !input.trim()}>
{loading ? "Streaming..." : "Send"}
</button>
<pre style={{ whiteSpace: "pre-wrap", marginTop: 16 }}>{output}</pre>
</div>
);
}
This approach works well for “append-only” streaming UIs. If you need richer event types (tool calls, structured JSON, etc.), you can extend the SSE protocol to send event names and JSON payloads.
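One way to extend the protocol, sketched below, is to JSON-encode each payload so that newlines and "data:"-like text in tokens survive the double-newline framing. The event shape and the helper names (`encodeSseEvent`, `decodeSseEvent`) are illustrative, not from any library:

```typescript
// Sketch: JSON-encode each SSE payload so deltas containing newlines
// or special characters survive the "data: ...\n\n" framing intact.
type ChatEvent =
  | { type: "token"; content: string }
  | { type: "done" };

// Server side: serialize one event as an SSE "data:" block.
function encodeSseEvent(event: ChatEvent): string {
  // JSON.stringify escapes newlines as \n, so the payload stays on one line.
  return `data: ${JSON.stringify(event)}\n\n`;
}

// Client side: parse one "data: ..." part back into a typed event.
function decodeSseEvent(part: string): ChatEvent | null {
  const line = part.trim();
  if (!line.startsWith("data:")) return null;
  return JSON.parse(line.slice("data:".length).trim()) as ChatEvent;
}
```

On the server you would enqueue `encodeSseEvent({ type: "token", content: delta })`, and on the client switch on `event.type` instead of comparing against "[DONE]". The same shape extends naturally to tool-call or error events.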
Add rate limiting (the practical options)
Rate limiting in Next.js depends on where you deploy and whether you have shared state across instances. A purely in-memory limiter is simple, but it only works reliably on a single long-lived server process. For serverless or multi-region deployments, use a shared store (for example, Redis) or a managed rate-limiting service.
Option A: Simple in-memory rate limiter (single instance only)
Use this for local development or a single Node server. It will not be consistent across multiple serverless instances.
type Bucket = { tokens: number; lastRefill: number };
const buckets = new Map<string, Bucket>();
// Token bucket: allow `capacity` requests per `windowMs`
function allowRequest(key: string, capacity = 10, windowMs = 60_000) {
const now = Date.now();
const refillRatePerMs = capacity / windowMs;
const bucket = buckets.get(key) ?? { tokens: capacity, lastRefill: now };
// Refill tokens
const elapsed = now - bucket.lastRefill;
bucket.tokens = Math.min(capacity, bucket.tokens + elapsed * refillRatePerMs);
bucket.lastRefill = now;
if (bucket.tokens < 1) {
buckets.set(key, bucket);
return { allowed: false, retryAfterMs: Math.ceil((1 - bucket.tokens) / refillRatePerMs) };
}
bucket.tokens -= 1;
buckets.set(key, bucket);
return { allowed: true, retryAfterMs: 0 };
}
Then, enforce it in your route before calling OpenAI:
import OpenAI from "openai";
export const runtime = "nodejs";
const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
function getClientKey(req: Request) {
// Prefer authenticated user ID if you have auth.
// As a fallback, you can use IP from headers set by your proxy.
// Header names vary by platform; treat this as a placeholder.
return req.headers.get("x-forwarded-for")?.split(",")[0]?.trim() || "anonymous";
}
// --- in-memory limiter (single instance only) ---
type Bucket = { tokens: number; lastRefill: number };
const buckets = new Map<string, Bucket>();
function allowRequest(key: string, capacity = 10, windowMs = 60_000) {
const now = Date.now();
const refillRatePerMs = capacity / windowMs;
const bucket = buckets.get(key) ?? { tokens: capacity, lastRefill: now };
const elapsed = now - bucket.lastRefill;
bucket.tokens = Math.min(capacity, bucket.tokens + elapsed * refillRatePerMs);
bucket.lastRefill = now;
if (bucket.tokens < 1) {
buckets.set(key, bucket);
return { allowed: false, retryAfterMs: Math.ceil((1 - bucket.tokens) / refillRatePerMs) };
}
bucket.tokens -= 1;
buckets.set(key, bucket);
return { allowed: true, retryAfterMs: 0 };
}
export async function POST(req: Request) {
const key = getClientKey(req);
const limit = allowRequest(key, 10, 60_000);
if (!limit.allowed) {
return new Response(JSON.stringify({ error: "Rate limit exceeded" }), {
status: 429,
headers: {
"Content-Type": "application/json",
"Retry-After": String(Math.ceil(limit.retryAfterMs / 1000)),
},
});
}
const { messages } = await req.json();
const stream = await client.chat.completions.create({
model: "gpt-4o-mini",
messages,
stream: true,
});
const encoder = new TextEncoder();
const readable = new ReadableStream({
async start(controller) {
try {
for await (const event of stream) {
const delta = event.choices?.[0]?.delta?.content;
if (delta) controller.enqueue(encoder.encode(`data: ${delta}\n\n`));
}
controller.enqueue(encoder.encode("data: [DONE]\n\n"));
controller.close();
} catch (err) {
controller.error(err);
}
},
});
return new Response(readable, {
headers: {
"Content-Type": "text/event-stream; charset=utf-8",
"Cache-Control": "no-cache, no-transform",
Connection: "keep-alive",
},
});
}
Option B: Shared rate limiting (recommended for production)
For production, use a shared datastore so limits apply consistently across instances. A common pattern is a fixed-window or sliding-window counter stored in Redis (or a managed equivalent). The exact implementation depends on your Redis client and deployment, but the core steps are:
- Derive a stable identifier (authenticated user ID is best; IP is a fallback).
- Increment a counter for that identifier in a shared store.
- Set an expiration (TTL) aligned to your window.
- If the counter exceeds your threshold, return HTTP 429 before calling OpenAI.
If you already use Redis, implement the counter with atomic operations (for example, INCR + EXPIRE) to avoid race conditions under concurrency.
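One way to express the INCR + EXPIRE pattern, sketched below, is against a minimal counter interface so the same logic runs over any Redis client or an in-memory stub for local testing. The interface and helper names are illustrative; map `incr`/`expire` onto your client's actual methods:

```typescript
// Fixed-window counter expressed against a minimal store interface.
// With Redis, `incr` maps to INCR and `expire` to EXPIRE; each Redis
// command is atomic, so concurrent requests cannot lose updates.
interface CounterStore {
  incr(key: string): Promise<number>; // returns the new count
  expire(key: string, seconds: number): Promise<void>;
}

async function fixedWindowAllow(
  store: CounterStore,
  id: string,
  limit = 10,
  windowSeconds = 60,
  nowMs = Date.now(),
): Promise<boolean> {
  // All requests in the same window share one key, e.g. "rl:user1:28930311".
  const window = Math.floor(nowMs / 1000 / windowSeconds);
  const key = `rl:${id}:${window}`;
  const count = await store.incr(key);
  if (count === 1) {
    // First hit in this window: set a TTL so the key cleans itself up.
    await store.expire(key, windowSeconds);
  }
  return count <= limit;
}

// In-memory stub for testing without Redis (ignores TTLs).
function memoryStore(): CounterStore {
  const counts = new Map<string, number>();
  return {
    async incr(key) {
      const next = (counts.get(key) ?? 0) + 1;
      counts.set(key, next);
      return next;
    },
    async expire() {
      /* no-op in the stub */
    },
  };
}
```

There is still a small gap between the INCR and the EXPIRE on the first hit; if that matters for your workload, a short Lua script can make the pair atomic.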
Handling OpenAI rate limits vs your app rate limits
Your own rate limiting prevents abuse and cost overruns. Separately, OpenAI may return rate limit responses (HTTP 429) if your account or project exceeds OpenAI’s limits. Treat these as a distinct case:
- If OpenAI returns 429, surface a friendly message and consider exponential backoff (especially for background jobs).
- Keep your own limiter regardless—OpenAI limits are not a substitute for application-level protection.
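Exponential backoff can be sketched as a small wrapper around any async call. The `status === 429` check below assumes the error carries an HTTP status field (the OpenAI SDK's errors do); adjust the predicate for your error shape:

```typescript
// Retry an async call on HTTP 429 with exponential backoff plus jitter.
// Best suited to background jobs; for interactive UIs, prefer surfacing
// a friendly message over making the user wait through retries.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      const isRateLimit = err?.status === 429;
      if (!isRateLimit || attempt >= maxRetries) throw err;
      // Backoff doubles each attempt (500ms, 1s, 2s, ...) with up to
      // 250ms of jitter to avoid synchronized retries.
      const delay = baseDelayMs * 2 ** attempt + Math.random() * 250;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Usage would look like `withBackoff(() => client.chat.completions.create(...))`; non-429 errors are rethrown immediately rather than retried.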
Streaming reliability tips in Next.js
- Disable buffering where possible: set Cache-Control: no-cache, no-transform and use text/event-stream for SSE; some reverse proxies (nginx, for example) also honor an X-Accel-Buffering: no response header.
- Handle client disconnects: in more advanced setups, you can watch for abort signals (req.signal) and stop work early.
- Validate input: enforce max prompt size and message count to avoid unexpectedly large requests.
- Log safely: never log full user prompts or API keys in production logs unless you have a clear policy and redaction.
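The disconnect handling mentioned above can be sketched as a loop that checks an AbortSignal between tokens; in a Route Handler you would pass `req.signal`. The function name is illustrative:

```typescript
// Sketch: stop consuming an async token stream once the given
// AbortSignal fires, so the server stops doing work (and paying for
// tokens) after the client disconnects.
async function pumpUntilAborted(
  tokens: AsyncIterable<string>,
  signal: AbortSignal,
  onToken: (token: string) => void,
): Promise<void> {
  for await (const token of tokens) {
    // Client went away: break out; exiting the loop also closes the
    // underlying async iterator.
    if (signal.aborted) break;
    onToken(token);
  }
}
```

Depending on your OpenAI SDK version, you may also be able to cancel the upstream request itself (for instance via an abort controller on the stream object) so OpenAI stops generating, not just your loop; check your SDK's documentation.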
Security checklist
- Keep OPENAI_API_KEY server-only (Route Handlers / server code).
- Prefer authenticated user IDs for rate limiting over IP-based limits.
- Add basic abuse protections: maximum request size, CORS rules if applicable, and bot protection for public endpoints.
- Return 429 with Retry-After to help clients back off gracefully.
Putting it all together
A solid Next.js OpenAI streaming setup has two pillars: (1) a streaming Route Handler that forwards tokens to the browser as they arrive, and (2) rate limiting that runs before you call OpenAI. Start with the in-memory limiter for development, then move to a shared store (like Redis) for production so your limits remain consistent across instances.
With these pieces in place, you get a fast “real-time” UX while keeping costs and abuse under control—exactly what most teams want when implementing Next.js OpenAI streaming.