Bring-Your-Own-LLM: Running ZenSearch on Groq, Bedrock, Azure, or Ollama
ZenSearch routes every chat, embedding, and rerank call through a centralised Model Gateway. That makes swapping providers a config change — run on Groq for speed, Bedrock for AWS-native deployments, Azure AI Foundry for per-tenant Azure hosting, or Ollama for fully local inference. Here's how the gateway works and what each provider is actually good for.
Bring-your-own-LLM means every AI call in the platform — chat, agents, embeddings, reranking — routes through a single internal proxy (the Model Gateway), which lets a deployment swap providers by flipping one environment variable. ZenSearch supports OpenAI, Anthropic, Cohere, Groq, OpenRouter, Azure AI Foundry, Amazon Bedrock, Ollama, LM Studio, and any OpenAI-compatible endpoint — picked per deployment, not baked into the application code. On-premise customers get this same gateway in their install, so they can run fully local models with Ollama or point at their own Azure or Bedrock account without touching source.
This is the architectural bit that separates "AI product on one provider" from "AI product on your provider of choice" — and the trade-off each provider represents is real and worth understanding before you pick.
Why Centralise
Without a gateway, every service that calls an LLM needs its own credentials, retry logic, rate-limit handling, and usage tracking. Swapping providers means coordinated changes across every service that makes AI calls. ZenSearch's Model Gateway collapses this into one proxy: services call zen-mini, zen-agent, or zen-agent-pro — internal model aliases — and the gateway resolves those to concrete models on whichever provider the deployment is configured for. One env var (ZEN_MODELS_PROVIDER) flips the entire stack.
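The alias-to-model resolution can be sketched as a per-provider lookup table. The alias names (zen-mini, zen-agent, zen-agent-pro) come from the article; the concrete model IDs per provider and the function name are illustrative, not ZenSearch's actual tables:

```go
package main

// modelTables maps each provider to a table of internal aliases and the
// concrete model each alias resolves to. Entries here are illustrative.
var modelTables = map[string]map[string]string{
	"openai": {
		"zen-mini":      "gpt-4o-mini",
		"zen-agent":     "gpt-4o",
		"zen-agent-pro": "gpt-4o",
	},
	"groq": {
		"zen-mini":      "llama-3.1-8b-instant",
		"zen-agent":     "llama-3.3-70b-versatile",
		"zen-agent-pro": "llama-3.3-70b-versatile",
	},
}

// resolveModel returns the provider-specific model for an alias, falling
// back to the alias itself so unknown names surface as provider errors
// rather than being silently swallowed by the gateway.
func resolveModel(provider, alias string) string {
	if table, ok := modelTables[provider]; ok {
		if model, ok := table[alias]; ok {
			return model
		}
	}
	return alias
}
```

Because callers only ever see the aliases, flipping ZEN_MODELS_PROVIDER changes which table is consulted without any caller noticing.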
Beyond swapping, the gateway gets you: per-tenant usage tracking (every call logged with token counts, latency, cost), configurable rate limits (per team and per model, enforced at gateway level), prompt caching (Anthropic cache_control for a 90% discount on cached reads; OpenAI and Groq prefix caching for ~50%), structured output enforcement (provider-specific schema coercion), SSRF protection (URL allow-list for embedding and custom endpoints), and optional HMAC request signing for service-to-service auth.
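Of those features, the SSRF guard is the easiest to show in miniature. A minimal sketch, assuming a positive host allow-list and HTTPS-only outbound calls — the hosts and function name here are ours, not ZenSearch's:

```go
package main

import (
	"fmt"
	"net/url"
)

// allowedHosts is the kind of positive allow-list the gateway applies to
// embedding and custom endpoints; the entries are illustrative.
var allowedHosts = map[string]bool{
	"api.openai.com": true,
	"api.cohere.com": true,
}

// checkEndpoint rejects any URL whose scheme or host is not explicitly
// allowed — the usual shape of an SSRF guard: everything is denied unless
// it matches the list, so internal IPs and metadata endpoints never pass.
func checkEndpoint(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	if u.Scheme != "https" {
		return fmt.Errorf("scheme %q not allowed", u.Scheme)
	}
	if !allowedHosts[u.Hostname()] {
		return fmt.Errorf("host %q not on allow-list", u.Hostname())
	}
	return nil
}
```

The deny-by-default direction is the important design choice: a blocklist of known-bad hosts can always be bypassed; an allow-list cannot.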
OpenAI (Default)
The baseline. Well-documented, stable pricing, strong tool-calling support, native structured output via json_schema. GPT-4o and its mini variant cover the chat/agent surface; text-embedding-3-small or -large covers embeddings. Reasonable default for deployments that don't have a compliance or performance reason to pick otherwise.
Groq — Fast Inference
ZEN_MODELS_PROVIDER=groq. Groq's LPU hardware runs Llama 3.1 8B at hundreds of tokens per second, cheaper per token than OpenAI, and with a free tier generous enough for real prototyping. Good fit for agents where iteration cost dominates wall-clock cost — deep research, multi-step reasoning, high-frequency tool-calling.
Caveat: Groq doesn't do embeddings. ZenSearch's gateway blocks Groq models at /embed at startup via a positive allow-list (openai|cohere|jina|mixedbread|azure|bedrock). Set ZEN_EMBED_PROVIDER to something with embedding support (OpenAI is the usual pair with Groq).
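The startup check is a one-line positive match. A sketch using the allow-list from the article — the function name is ours:

```go
package main

import "regexp"

// embedAllow mirrors the startup allow-list described above: only providers
// that actually serve embeddings may be configured at /embed.
var embedAllow = regexp.MustCompile(`^(openai|cohere|jina|mixedbread|azure|bedrock)$`)

// validateEmbedProvider fails fast at startup for chat-only providers
// (groq, openrouter), rather than letting /embed 404 at runtime.
func validateEmbedProvider(p string) bool {
	return embedAllow.MatchString(p)
}
```

Note it is a positive match, so a newly added chat-only provider is blocked by default until someone deliberately adds it to the list.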
OpenRouter — Unified Gateway to 100+ Models
ZEN_MODELS_PROVIDER=openrouter. OpenRouter is itself a gateway — one API key gets you 100+ upstream models from Anthropic, OpenAI, Meta, Mistral, Cohere, and others. Useful for deployments that want a single billing relationship across every major model family, or for comparing provider quality without managing multiple credential relationships.
Model IDs use openrouter/vendor/name form (e.g. openrouter/anthropic/claude-3.5-sonnet) so they don't collide with direct provider access. Same embedding caveat as Groq — no embeddings, blocked at /embed.
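Parsing that three-part form is straightforward. A sketch — the function name is illustrative, not from the ZenSearch codebase:

```go
package main

import (
	"fmt"
	"strings"
)

// splitOpenRouterID parses the openrouter/vendor/name form
// (e.g. openrouter/anthropic/claude-3.5-sonnet) into vendor and model name,
// rejecting IDs that lack the openrouter/ prefix so they fall through to
// direct provider handling instead.
func splitOpenRouterID(id string) (vendor, name string, err error) {
	parts := strings.SplitN(id, "/", 3)
	if len(parts) != 3 || parts[0] != "openrouter" {
		return "", "", fmt.Errorf("not an openrouter model ID: %q", id)
	}
	return parts[1], parts[2], nil
}
```

SplitN with a limit of 3 keeps any further slashes inside the model name intact, which matters for vendors whose model IDs themselves contain slashes.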
Azure AI Foundry — Per-Tenant OpenAI
ZEN_MODELS_PROVIDER=azure. Azure AI Foundry is Microsoft's OpenAI-compatible surface hosted in the customer's own Azure tenant — same models as OpenAI proper, plus Meta Llama, Mistral, Cohere, and Anthropic through Azure Marketplace. Good fit for customers who need an enterprise agreement with Microsoft, Azure-region data residency, or Azure AD-integrated billing.
Model IDs use azure/<deployment-name> where <deployment-name> must match a deployment the customer created in their Foundry resource. Authentication is api-key header rather than OpenAI's Authorization: Bearer — rewritten per-request by an auth round-tripper. Unlike Groq and OpenRouter, Azure does support embeddings, so ZEN_EMBED_PROVIDER=azure is valid.
Amazon Bedrock — AWS-Native Multi-Provider
ZEN_MODELS_PROVIDER=bedrock. Bedrock is AWS's hosted model service, supporting Anthropic Claude (including Opus 4.5), Amazon Nova, Meta Llama, Mistral, and Cohere through a single region-scoped endpoint. Good fit for AWS-heavy customers who want IAM-based auth, VPC-scoped inference, and AWS-native billing.
Model IDs use native Bedrock format with inference-profile prefixes for cross-region routing (e.g. global.anthropic.claude-opus-4-5-20251101-v1:0). Auth prefers the Bedrock API key (AWS_BEARER_TOKEN_BEDROCK, GA July 2025) and falls back to the standard AWS credential chain (env vars, shared credentials, IAM role, IRSA, instance metadata). Region is mandatory — Bedrock has no global endpoint. Bedrock supports embeddings via Titan v2 and Cohere v3/v4, but embeddings cannot use inference profiles — always use a region-pinned base model.
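The "embeddings cannot use inference profiles" rule reduces to a prefix check on the model ID. A sketch of the validation this implies — the prefix list matches Bedrock's documented cross-region profile prefixes, but the function name is ours:

```go
package main

import "strings"

// profilePrefixes are Bedrock's cross-region inference-profile prefixes.
// Region-pinned base model IDs (e.g. amazon.titan-embed-text-v2:0) carry
// none of them.
var profilePrefixes = []string{"global.", "us.", "eu.", "apac."}

// isInferenceProfile reports whether a Bedrock model ID routes through a
// cross-region inference profile; an embedding config should reject these
// and demand a region-pinned base model instead.
func isInferenceProfile(modelID string) bool {
	for _, p := range profilePrefixes {
		if strings.HasPrefix(modelID, p) {
			return true
		}
	}
	return false
}
```

Running this at startup turns a confusing Bedrock runtime error into an immediate, explicit config failure.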
Partner models (Meta Llama, Mistral, Cohere, Anthropic) require a one-time Bedrock Marketplace agreement per model; first-party Amazon Nova + Titan work out of the box.
Ollama — Fully Local
Set ZEN_MODELS_PROVIDER=openai and point OPENAI_BASE_URL at http://ollama:11434/v1. Ollama is OpenAI-compatible at the /v1/chat/completions surface, so the gateway treats it like a local OpenAI endpoint. Run Llama 3.3 70B on your own GPU, embed with nomic-embed-text, rerank locally, and the conversation never leaves your infrastructure.
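Put together, an Ollama-backed config fragment might look like the following. The hostname and the dummy API key value are illustrative — many OpenAI clients refuse an empty key, and Ollama simply ignores whatever is sent:

```shell
# Route the gateway's "openai" provider to a local Ollama instance.
export ZEN_MODELS_PROVIDER=openai
export OPENAI_BASE_URL=http://ollama:11434/v1
export OPENAI_API_KEY=ollama   # ignored by Ollama; present to satisfy the client
```

With this in place, the rest of the stack is unchanged — the gateway still resolves zen-mini and friends, it just resolves them against whatever models the local Ollama daemon has pulled.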
This is the canonical on-premise path for air-gapped customers and regulated industries — exactly the shape of deployment the Model Gateway was built for.
How to Pick
For most deployments: OpenAI as the default, switch to Groq when iteration cost matters, or Ollama when the data can't leave. For AWS-native orgs: Bedrock with Claude Sonnet as the agent model. For Microsoft shops: Azure with GPT-4o-mini. For "let me compare all of them before I decide": OpenRouter.
Swapping is ZEN_MODELS_PROVIDER=<x> plus the provider's credentials in the gateway config, then restart the two services. The new models take effect on the next request. No database migration, no code change.