
WhatLLM.org - LLM Comparison Tool

The ultimate LLM comparison tool

Compare price, performance, and speed across the entire AI ecosystem. Updated daily with the latest benchmarks.


Frequently Asked Questions

What are the best LLMs right now?

On the current Explore leaderboard, the strongest general-purpose models are Claude Opus 4.7 (57.3), Gemini 3.1 Pro Preview (57.2), GPT-5.4 (xhigh) (56.8), GPT-5.3 Codex (xhigh) (53.6), and Claude Opus 4.6 (53). If you want lower-cost strong performers, leading open-weight options include GLM-5.1 (Reasoning) (51.4), Qwen3.6 Plus (50), GLM-5 (Reasoning) (49.8), MiniMax-M2.7 (49.6), and MiMo-V2-Pro (49.2). Use Explore to narrow further by price, speed, or context window.

What are the cheapest LLM endpoints?

Among endpoints with published pricing, the lowest-cost options on Explore right now include Qwen3.5 0.8B via Deepinfra ($0.02/M tokens), Llama 3.2 Instruct 3B via Deepinfra ($0.02/M tokens), Llama 3.1 Instruct 8B via Deepinfra ($0.02/M tokens), and Gemma 3n E4B Instruct via Together.ai ($0.03/M tokens). Exact pricing varies by provider row, so check the table before choosing a host for production.

What is the difference between input and output pricing?

Input pricing is what you pay for tokens sent to the model, while output pricing is what you pay for generated tokens. Output tokens are usually more expensive, so the total cost of a request depends both on prompt size and response length. Explore shows input, output, and blended pricing to make that tradeoff visible.
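
As a rough illustration (the exact mix behind WhatLLM.org's blended figure may differ), per-request cost and a blended rate can be computed like this; the 3:1 input-to-output token ratio is an assumption, not a documented constant:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost of one request, with prices in $ per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

def blended_price(input_price: float, output_price: float,
                  input_ratio: float = 3.0) -> float:
    """Blended $/M tokens, assuming a 3:1 input:output token mix (an assumption)."""
    return (input_ratio * input_price + output_price) / (input_ratio + 1)

# Example: a 2,000-token prompt producing a 500-token answer
# at $3/M input and $15/M output.
print(request_cost(2_000, 500, 3.0, 15.0))  # 0.0135 -> about 1.35 cents
print(blended_price(3.0, 15.0))             # 6.0 $/M at a 3:1 mix
```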

What are the fastest LLM endpoints?

For real-time use cases, the fastest endpoints on Explore right now include Llama 3.1 Instruct 8B via Cerebras (2191 tok/s), gpt-oss-120B (high) via Cerebras (1957 tok/s), gpt-oss-120B (low) via Cerebras (1842 tok/s), and GLM-4.7 (Reasoning) via Cerebras (1384 tok/s). If responsiveness matters, compare both output speed and latency, because the quickest streaming model is not always the one with the best time to first token.
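
A minimal sketch of why both numbers matter, using hypothetical endpoint figures rather than live Explore data: total response time is time to first token plus output tokens divided by throughput.

```python
def response_time(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Total wall-clock time: latency to first token plus streaming time."""
    return ttft_s + output_tokens / tokens_per_s

# Hypothetical endpoints: A streams faster, B starts sooner.
a = response_time(ttft_s=0.9, output_tokens=200, tokens_per_s=2000)  # ~1.0 s
b = response_time(ttft_s=0.2, output_tokens=200, tokens_per_s=400)   # ~0.7 s
print(a, b)  # for short replies, the lower-TTFT endpoint wins
```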

What are the best open-weight models?

The top open-weight models on Explore right now are GLM-5.1 (Reasoning) (51.4), Qwen3.6 Plus (50), GLM-5 (Reasoning) (49.8), MiniMax-M2.7 (49.6), and MiMo-V2-Pro (49.2). They are the best place to start if you want strong benchmark performance without defaulting to closed frontier APIs.

What is the Quality Index?

Quality Index is the composite score used to sort the leaderboard. It blends reasoning, math, coding, knowledge, and agentic-style benchmarks into a single higher-is-better number, which makes it useful for broad ranking but not a substitute for checking task-specific benchmarks.

Which models have the largest context windows?

The largest context windows currently listed on Explore belong to Grok 4.20 0309 v2 (Reasoning) (2M tokens), Grok 4.20 0309 (Reasoning) (2M tokens), Grok 4.1 Fast (Reasoning) (2M tokens), Grok 4 Fast (Reasoning) (2M tokens), and Grok 4.20 0309 (2M tokens). Use the context filter when you need long-document analysis, repo ingestion, or multi-file workflows.
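
As a quick sanity check before reaching for the context filter, you can estimate whether a document fits a window. The ~4 characters per token figure below is a common rule of thumb, not a tokenizer-accurate count:

```python
def fits_context(text: str, context_window: int, reserved_output: int = 4_096) -> bool:
    """Rough check: ~4 chars/token heuristic; leave room for the reply."""
    estimated_tokens = len(text) / 4
    return estimated_tokens + reserved_output <= context_window

doc = "x" * 6_000_000                 # ~1.5M estimated tokens
print(fits_context(doc, 2_000_000))   # True: fits a 2M-token window
print(fits_context(doc, 1_000_000))   # False: too large for 1M
```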

How much does the same model vary between providers?

The same model can vary a lot by provider. For example, Claude Opus 4.7 is currently listed at $10/M tokens on Amazon Bedrock, while recorded throughput for that same model spans 53 to 102 tok/s depending on host. Use the provider rows in Explore when you want to keep the model family fixed but optimize for cost or speed.

Live LLM Comparison: Rankings, Benchmarks & Pricing

Why WhatLLM.org Exists

The AI landscape moves fast. Every few weeks a new model launches — sometimes multiple in a single day — each claiming state-of-the-art results on different benchmarks. For developers choosing an LLM for production, for researchers evaluating the field, or for teams deciding where to invest their API budget, keeping track of it all is exhausting.

WhatLLM.org was built to solve that problem. We aggregate benchmark data, real-world pricing, and throughput metrics for more than 300 large language models from 52+ providers into one place. Instead of opening dozens of tabs to compare OpenAI, Anthropic, Google, Meta, DeepSeek, and Mistral side by side, you get a unified interface where models can be filtered, sorted, and compared on the dimensions that actually matter to your use case.

How We Compare Models

Every model on WhatLLM.org is evaluated across four core dimensions: quality, speed, price, and context length. Quality is measured using the Artificial Analysis Intelligence Index, a composite benchmark score that synthesizes results from GPQA Diamond (PhD-level reasoning), AIME 2025 (advanced mathematics), LiveCodeBench (real-world coding), SWE-Bench Verified (software engineering), MMLU-Pro (broad knowledge), and Humanity's Last Exam (frontier reasoning) into a single 0–100 score.
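
The actual weighting is defined by Artificial Analysis, but conceptually a composite of this kind is a weighted mean of normalized benchmark scores. A sketch under that assumption, with made-up scores and equal weights:

```python
# Conceptual sketch only: the real Intelligence Index weighting is defined
# by Artificial Analysis and may differ from this equal-weight example.
BENCHMARKS = ["GPQA Diamond", "AIME 2025", "LiveCodeBench",
              "SWE-Bench Verified", "MMLU-Pro", "Humanity's Last Exam"]

def composite_index(scores: dict[str, float],
                    weights: dict[str, float] | None = None) -> float:
    """Weighted mean of 0-100 benchmark scores -> a single 0-100 number."""
    weights = weights or {b: 1.0 for b in BENCHMARKS}
    total = sum(weights.values())
    return sum(scores[b] * weights[b] for b in BENCHMARKS) / total

print(composite_index({
    "GPQA Diamond": 71, "AIME 2025": 88, "LiveCodeBench": 75,
    "SWE-Bench Verified": 64, "MMLU-Pro": 84, "Humanity's Last Exam": 22,
}))  # ~67.3
```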

Speed is measured in output tokens per second, reflecting real-world throughput under typical load. Price is tracked per million tokens for both input and output, with blended cost calculations. Context length reflects the maximum number of tokens a model can process in a single request, ranging from 8K tokens on older models to 10M tokens on the latest architectures.

Finding the Right Model for Your Use Case

There is no single "best" LLM. The right choice depends on your priorities. If you need the highest reasoning quality for complex tasks, frontier models like GPT-5, Gemini 3 Pro, or Claude Opus 4.5 lead the benchmarks. If cost efficiency matters most, open-weight models like DeepSeek V3, Qwen3, or Kimi K2 deliver strong performance at a fraction of the price. For latency-sensitive applications, speed-optimized endpoints on providers like Groq, Fireworks, or Cerebras can deliver hundreds of tokens per second.
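
In code terms, that kind of priority-driven choice is just filtering a catalog by hard constraints and ranking by the dimension you care about most. A hypothetical sketch (field names and figures are illustrative, not WhatLLM.org's schema):

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    quality: float        # composite score, higher is better
    blended_price: float  # $/M tokens
    speed: float          # output tok/s

CATALOG = [
    Model("frontier-a", quality=57.0, blended_price=30.0, speed=80),
    Model("open-b", quality=50.0, blended_price=1.2, speed=150),
    Model("fast-c", quality=42.0, blended_price=0.8, speed=1900),
]

def shortlist(models, max_price=None, min_quality=None, sort_by="quality"):
    """Filter by hard constraints, then rank by the chosen priority."""
    picks = [m for m in models
             if (max_price is None or m.blended_price <= max_price)
             and (min_quality is None or m.quality >= min_quality)]
    return sorted(picks, key=lambda m: getattr(m, sort_by), reverse=True)

# Cost-sensitive pick: best quality under $2/M tokens.
print(shortlist(CATALOG, max_price=2.0)[0].name)  # open-b
```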

Our LLM Selector tool walks you through a few quick questions about your use case — coding, analysis, creative writing, agentic workflows — and recommends a shortlist of models ranked by fit. The Compare page lets you pick any 2–4 models and see their benchmarks, pricing, and speed side by side in a detailed breakdown.

Original Analysis and Research

Beyond the comparison tools, WhatLLM.org publishes original analysis on model releases, benchmark trends, and the economics of AI deployment. Our blog covers topics from detailed model face-offs (like Kimi K2 Thinking vs. ChatGPT 5.1) to broader industry analysis (the open-source vs. proprietary cost curve, the rise of agentic coding models, and whether benchmark saturation is making traditional evaluation frameworks obsolete). Each article is written with original commentary grounded in the data we track daily.

Data Sources and Transparency

Benchmark and quality data is sourced from Artificial Analysis, an independent research organization. Pricing data is verified against official provider documentation and updated daily. We are transparent about what we measure, what we aggregate, and what we add on top — our full methodology is documented publicly. WhatLLM.org does not run its own benchmarks; we focus on making existing high-quality data accessible, interactive, and actionable.