How do I fairly compare two LLM models?

Pick the same axes for both sides and source each number from a third-party benchmark instead of a vendor announcement. Artificial Analysis publishes Intelligence Index, Coding Index, GPQA, HLE, TerminalBench Hard, and Tau-2 on a comparable scale. Add throughput (tokens per second) and time-to-first-token because they shape user-perceived quality, and pull per-million-token pricing from each vendor's own pricing page — including cache-read and cache-write when they're published. Every page in this directory follows that template so you can compare two pairs the same way.

Which benchmarks matter most for my use case?

It depends on the workload. Knowledge-heavy products lean on GPQA Diamond and Humanity's Last Exam. Coding agents care about Coding Index, TerminalBench Hard, and SWE-Bench Verified. Tool-using agents weight Tau-2 above raw reasoning. Retrieval and summarization care more about throughput and Long Context Recall than aggregate Intelligence. The use-cases hub at /use-cases scores each task with explicit criteria weights so you can borrow the methodology directly.

Are these prices the same as the upstream providers?

Yes. ElliotGate publishes the same per-token rates as Anthropic, OpenAI, Google, and the other upstream vendors charge directly. There is no gateway markup. Cache-read and cache-write rates are shown when the vendor publishes them. We refresh pricing snapshots whenever a vendor changes their public list.

Can I switch models without changing my code?

Yes. The OpenAI-compatible chat completions endpoint accepts every listed model — change the model slug, keep the rest of the request body the same. Most teams build a one-line router that picks per request based on input length, latency budget, or a feature flag. The Anthropic Messages endpoint is also supported for prompt caching workflows that depend on it.

How often is this comparison data updated?

Each comparison page carries its own publishedAt and reviewedAt date. We re-pull Artificial Analysis scores when AA refreshes its leaderboard and re-check vendor pricing pages quarterly at minimum. When a model is retired or significantly updated, the relevant comparison pages are reviewed within a week.

MODEL COMPARISONS

Compare LLM Models Side-by-Side

Benchmarks, pricing, and context windows for 1197 head-to-head pairs across OpenAI, Anthropic, Google, Meta, DeepSeek, xAI, and more. Pick the right model before you write a single line of code.

Start building Browse all models

WHY THIS MATTERS

Two numbers from a marketing page never settle a model choice

Choosing between two frontier LLMs by reading their vendors' announcement posts is how teams end up rewriting prompts a month later. A solid model decision compares the same axes across both sides: Artificial Analysis benchmark scores, throughput, time-to-first-token, context window, modality coverage, and per-million-token pricing — including cache-read and cache-write rates when the vendor publishes them.

Every page in this directory follows the same template. The same dimensions are scored side-by-side, the same FAQ format is answered, and every claim cites either a public benchmark page or a vendor pricing page. ElliotGate runs both models behind one OpenAI-compatible endpoint, so the comparisons stay vendor-neutral — once you've picked, switching back is a slug change, not a migration.

Eight comparisons teams reach for first

These are the cross-vendor matchups we get asked about most often. Start here if you're early in model selection — these pairs cover the four flagship families currently shipping API access.

BY VENDOR

Browse by vendor

Each vendor's highest-signal matchups, with cross-vendor pairs tagged. All 1197 comparison pages are indexed in our sitemap and searchable.

OpenAI627

+617 more in the sitemap

Anthropic284

+274 more in the sitemap

Google99

+89 more in the sitemap

xAI144

+134 more in the sitemap

DeepSeek129

+119 more in the sitemap

Qwen268

+258 more in the sitemap

Moonshot AI130

+120 more in the sitemap

Z.ai201

+191 more in the sitemap

MiniMax118

+108 more in the sitemap

Xiaomi137

+127 more in the sitemap

FREQUENTLY ASKED

FAQ

How do I fairly compare two LLM models?
Pick the same axes for both sides and source each number from a third-party benchmark instead of a vendor announcement. Artificial Analysis publishes Intelligence Index, Coding Index, GPQA, HLE, TerminalBench Hard, and Tau-2 on a comparable scale. Add throughput (tokens per second) and time-to-first-token because they shape user-perceived quality, and pull per-million-token pricing from each vendor's own pricing page — including cache-read and cache-write when they're published. Every page in this directory follows that template so you can compare two pairs the same way.
Which benchmarks matter most for my use case?
It depends on the workload. Knowledge-heavy products lean on GPQA Diamond and Humanity's Last Exam. Coding agents care about Coding Index, TerminalBench Hard, and SWE-Bench Verified. Tool-using agents weight Tau-2 above raw reasoning. Retrieval and summarization care more about throughput and Long Context Recall than aggregate Intelligence. The use-cases hub at /use-cases scores each task with explicit criteria weights so you can borrow the methodology directly.
Are these prices the same as the upstream providers?
Yes. ElliotGate publishes the same per-token rates as Anthropic, OpenAI, Google, and the other upstream vendors charge directly. There is no gateway markup. Cache-read and cache-write rates are shown when the vendor publishes them. We refresh pricing snapshots whenever a vendor changes their public list.
Can I switch models without changing my code?
Yes. The OpenAI-compatible chat completions endpoint accepts every listed model — change the model slug, keep the rest of the request body the same. Most teams build a one-line router that picks per request based on input length, latency budget, or a feature flag. The Anthropic Messages endpoint is also supported for prompt caching workflows that depend on it.
How often is this comparison data updated?
Each comparison page carries its own publishedAt and reviewedAt date. We re-pull Artificial Analysis scores when AA refreshes its leaderboard and re-check vendor pricing pages quarterly at minimum. When a model is retired or significantly updated, the relevant comparison pages are reviewed within a week.

One key, every model on the list

Pick the comparison that fits your workload, run both sides behind one OpenAI-compatible API key, and ship without juggling vendor accounts.

Get an API key Read the docs