Skip to content
Seedance 2.0 Face is here — generate video from real-person reference photos.Try it now

MODEL COMPARISONS

Compare LLM Models Side-by-Side

Benchmarks, pricing, and context windows for 1197 head-to-head pairs across OpenAI, Anthropic, Google, Meta, DeepSeek, xAI, and more. Pick the right model before you write a single line of code.

WHY THIS MATTERS

Two numbers from a marketing page never settle a model choice

Choosing between two frontier LLMs by reading their vendors' announcement posts is how teams end up rewriting prompts a month later. A solid model decision compares the same axes across both sides: Artificial Analysis benchmark scores, throughput, time-to-first-token, context window, modality coverage, and per-million-token pricing — including cache-read and cache-write rates when the vendor publishes them.

Every page in this directory follows the same template. The same dimensions are scored side-by-side, the same FAQ format is answered, and every claim cites either a public benchmark page or a vendor pricing page. ElliotGate runs both models behind one OpenAI-compatible endpoint, so the comparisons stay vendor-neutral — once you've picked, switching back is a slug change, not a migration.

BY VENDOR

Browse by vendor

Each vendor's highest-signal matchups, with cross-vendor pairs tagged. All 1197 comparison pages are indexed in our sitemap and searchable.

FREQUENTLY ASKED

FAQ

  • How do I fairly compare two LLM models?

    Pick the same axes for both sides and source each number from a third-party benchmark instead of a vendor announcement. Artificial Analysis publishes Intelligence Index, Coding Index, GPQA, HLE, TerminalBench Hard, and Tau-2 on a comparable scale. Add throughput (tokens per second) and time-to-first-token because they shape user-perceived quality, and pull per-million-token pricing from each vendor's own pricing page — including cache-read and cache-write when they're published. Every page in this directory follows that template so you can compare two pairs the same way.

  • Which benchmarks matter most for my use case?

    It depends on the workload. Knowledge-heavy products lean on GPQA Diamond and Humanity's Last Exam. Coding agents care about Coding Index, TerminalBench Hard, and SWE-Bench Verified. Tool-using agents weight Tau-2 above raw reasoning. Retrieval and summarization care more about throughput and Long Context Recall than aggregate Intelligence. The use-cases hub at /use-cases scores each task with explicit criteria weights so you can borrow the methodology directly.

  • Are these prices the same as the upstream providers?

    Yes. ElliotGate publishes the same per-token rates as Anthropic, OpenAI, Google, and the other upstream vendors charge directly. There is no gateway markup. Cache-read and cache-write rates are shown when the vendor publishes them. We refresh pricing snapshots whenever a vendor changes their public list.

  • Can I switch models without changing my code?

    Yes. The OpenAI-compatible chat completions endpoint accepts every listed model — change the model slug, keep the rest of the request body the same. Most teams build a one-line router that picks per request based on input length, latency budget, or a feature flag. The Anthropic Messages endpoint is also supported for prompt caching workflows that depend on it.

  • How often is this comparison data updated?

    Each comparison page carries its own publishedAt and reviewedAt date. We re-pull Artificial Analysis scores when AA refreshes its leaderboard and re-check vendor pricing pages quarterly at minimum. When a model is retired or significantly updated, the relevant comparison pages are reviewed within a week.

One key, every model on the list

Pick the comparison that fits your workload, run both sides behind one OpenAI-compatible API key, and ship without juggling vendor accounts.