AI Testing Results

Results as of Feb 1, 2026

Introduction

New LLMs are released frequently, and teams working in model risk management (MRM) and AI governance often look for practical insight into how different providers perform on the types of tasks they encounter in day-to-day documentation and validation workflows, not just how they score on broad, general-purpose benchmarks.

In this benchmarking analysis, we focus on a set of governance-oriented tasks that commonly arise in documentation and review processes, including:

  • Interpreting quantitative test results from tables and figures
  • Drafting documentation from existing source materials
  • Assessing evidence against predefined criteria
  • Reviewing documentation in light of regulatory or policy questions

These tasks were selected because they tend to surface behaviors that matter in enterprise environments—such as accurately reading numerical results, avoiding unsupported additions, identifying gaps, and producing conclusions that remain defensible under review.

The objective of this leaderboard is to provide greater transparency into how models behave across these use cases, so teams can make informed decisions based on their own priorities, risk tolerance, and operational constraints.

Why This Matters

Generic public benchmarks can be useful for understanding general model capabilities, but they often do not reflect the types of evidence-based tasks that appear in AI governance and model risk management workflows. In these settings, the key requirement is not only strong language generation, but the ability to stay grounded in provided inputs and produce outputs that remain defensible under review.

For this reason, we focus on domain-specific tasks such as interpreting validation results, drafting documentation from source materials, assessing evidence against guidelines, and evaluating regulatory question coverage. This approach is intended to provide a more practical view of how different models perform in real governance workflows, and to highlight trade-offs across quality, reliability, cost, and operational efficiency.

Small differences in model behavior can create material risk:

  • Misreading quantitative results can distort validation conclusions
  • Unsupported statements can introduce audit and compliance exposure
  • Missed gaps or contradictions can weaken governance controls
  • Overconfident language can create false assurance
  • Cost and latency differences compound quickly at scale

What We Tested

We evaluated leading LLM providers across four AI governance–oriented tasks that are commonly encountered in enterprise validation and documentation workflows, spanning both first-line (model development) and second-line (independent validation and oversight) activities.

  • Quantitative Results Analysis: Interpreting metric tables and plots to produce accurate, evidence-grounded insights without introducing unsupported assumptions.
  • Model Documentation Drafting: Generating documentation from existing source materials while preserving numerical accuracy, maintaining scope discipline, and ensuring internal consistency.
  • Risk Assessment Against Guidelines: Assessing whether available evidence aligns with predefined criteria, distinguishing substantiated evidence from narrative statements, and forming well-supported conclusions.
  • Regulatory Question Coverage: Evaluating whether documentation addresses specific regulatory or policy questions, identifying potential gaps, and providing a calibrated assessment of coverage.

For each task, models were evaluated across multiple runs (10 traces per task), and the reported metrics reflect average performance across those runs. The overall leaderboard aggregates results across both traces and tasks to provide a balanced, high-level comparative view.
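The aggregation described above can be sketched in a few lines, assuming per-trace scores have already been computed. The scores below are illustrative placeholders, not actual results:

```python
from statistics import mean

# Hypothetical faithfulness scores for one model: 10 traces per task, 4 tasks.
traces = {
    "quantitative_analysis": [0.98, 0.99, 0.97, 1.00, 0.98, 0.99, 0.97, 0.98, 1.00, 0.99],
    "documentation_drafting": [0.87, 0.88, 0.86, 0.89, 0.87, 0.88, 0.86, 0.87, 0.88, 0.86],
    "risk_assessment": [0.95, 0.94, 0.96, 0.95, 0.94, 0.95, 0.96, 0.94, 0.95, 0.95],
    "regulatory_coverage": [0.85, 0.84, 0.86, 0.85, 0.84, 0.85, 0.86, 0.84, 0.85, 0.85],
}

# Task-level metric: average over the 10 traces for each task.
task_scores = {task: mean(scores) for task, scores in traces.items()}

# Overall leaderboard metric: average across all traces and tasks.
overall_score = mean(s for scores in traces.values() for s in scores)
```

Because every task contributes the same number of traces, averaging over all traces is equivalent to averaging the task-level means, which keeps the overall view balanced across tasks.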

How We Evaluated the Models

Each model was evaluated using a consistent scoring framework designed to capture both output quality and operational characteristics.

  • Faithfulness: Measures how consistently a response stays grounded in the provided inputs. Outputs are assessed statement by statement to determine whether claims can be traced back to the source material. Higher scores indicate stronger evidence alignment and fewer unsupported additions or contradictions.
  • Answer Relevancy: Measures how directly the response addresses the task at hand. This helps distinguish focused, task-specific outputs from responses that include tangential or generic content that does not materially contribute to the objective.
  • Verbosity: Measures the length of the response (word count). This provides insight into how concise or expansive a model tends to be, which can influence review effort and downstream workflow efficiency.
  • Cost: Measures the relative resource expense of completing a task, based on token usage and provider pricing. This is particularly relevant when considering scalability across repeated or high-volume workflows.
  • Latency: Measures end-to-end response time. In operational settings, latency can affect user experience and overall process throughput, especially when tasks are run at scale.
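To illustrate the faithfulness dimension, the statement-by-statement check reduces to the fraction of claims that can be traced to the source. The sketch below uses simple substring matching as a toy stand-in for the actual claim-verification method, which is not detailed here; the metric values and claims are hypothetical:

```python
def faithfulness(claims: list[str], source: str) -> float:
    """Fraction of extracted claims traceable to the source text.
    Substring matching is a toy stand-in for a real entailment/judging step."""
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if claim.lower() in source.lower())
    return supported / len(claims)

source = "AUC was 0.81 on the holdout sample. The KS statistic was 0.42."
claims = [
    "AUC was 0.81 on the holdout sample",  # supported by the source
    "The KS statistic was 0.42",           # supported by the source
    "The Gini coefficient was 0.62",       # unsupported addition
]
score = faithfulness(claims, source)  # 2 of 3 supported -> ~0.667
```

In this toy example, the unsupported Gini claim pulls the score down, mirroring how unsupported additions reduce faithfulness in the evaluation.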

All models were evaluated under comparable conditions using standardized prompts and datasets to support consistent comparison.
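The operational dimensions are straightforward arithmetic. The sketch below computes verbosity as a whitespace word count and cost from token usage; the per-1K-token prices are placeholder values, not any provider's actual rate card:

```python
def verbosity(response: str) -> int:
    """Word count of a model response (whitespace-delimited)."""
    return len(response.split())

def cost_usd(input_tokens: int, output_tokens: int,
             price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Task cost in USD from token usage and per-1K-token prices
    (prices here are placeholders, not real provider rates)."""
    return (input_tokens / 1000) * price_in_per_1k \
        + (output_tokens / 1000) * price_out_per_1k

words = verbosity("The model meets the predefined criteria for this test.")  # 9
spend = cost_usd(input_tokens=2400, output_tokens=800,
                 price_in_per_1k=0.001, price_out_per_1k=0.003)  # 0.0048
```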


The Leaderboard

The table below provides a summary view of how leading LLM providers perform across the four governance-oriented tasks included in this analysis. Each row represents a model, and the scores reflect performance across key dimensions, including faithfulness, answer relevancy, and operational characteristics such as cost and latency.

LLM Faithfulness Answer Relevancy Verbosity (words) Cost (USD) Latency (ms)
gemini/gemini-2.5-flash 0.9347 0.9342 705.5357 0.0089 11988.0714
gemini/gemini-3-pro-preview 0.8833 0.8743 436.7963 0.0352 16974.1481
claude-opus-4-5-20251101 0.9232 0.9371 685.2069 0.1091 27694.6552
gpt-4.1 0.8269 0.7687 594.3077 0.0369 24678.2308
gpt-5 0.7946 0.7587 653.5472 0.0411 51410.6415
gpt-5.1 0.7784 0.7298 966.8261 0.0330 22145.1957
gemini/gemini-2.0-flash 0.7544 0.6862 414.9623 0.0041 5658.6981
gpt-4o 0.7729 0.7468 382.0185 0.0681 18283.3889
claude-sonnet-4-5-20250929 0.7750 0.7708 849.3878 0.1462 33511.6327
claude-opus-4-20250514 0.6662 0.6760 493.6346 0.6625 28997.0962

Performance by Task

Model performance can vary depending on the type of task being performed. A model that shows strong results in quantitative interpretation may perform differently when drafting documentation or assessing regulatory coverage.

The task-level leaderboards below offer a more detailed view, allowing teams to compare providers based on the specific workflows that matter most to them.

Quantitative Results Analysis

Ability to accurately interpret tables and plots and produce evidence-grounded summaries without introducing unsupported conclusions.

LLM Faithfulness Answer Relevancy Verbosity (words) Cost (USD) Latency (ms)
gemini/gemini-2.5-flash 0.9816 1.0000 1100.8889 0.0086 16205.2222
gpt-4.1 0.9987 1.0000 1118.6667 0.0185 40470.0000
gemini/gemini-2.0-flash 0.9270 0.9156 848.2222 0.0007 8490.2222
gpt-5.1 1.0000 1.0000 1857.7500 0.0278 32342.6250
gemini/gemini-3-pro-preview 0.9925 0.9998 968.4444 0.0317 27643.1111
gpt-5 0.9995 1.0000 1179.4444 0.0375 86486.4444
claude-sonnet-4-5-20250929 0.9864 1.0000 1540.7778 0.0438 53567.7778
claude-opus-4-5-20251101 0.9995 1.0000 1123.3333 0.0592 36121.8889
gpt-4o 0.8544 0.8084 667.0000 0.0172 29074.4444
claude-opus-4-20250514 0.9027 0.9870 984.7778 0.1611 49951.3333

Model Documentation Drafting

Ability to generate documentation strictly from source materials while preserving accuracy, scope discipline, and internal consistency.

LLM Faithfulness Answer Relevancy Verbosity (words) Cost (USD) Latency (ms)
gpt-5.1 0.9407 0.9150 142.8000 0.0393 5377.2000
gemini/gemini-2.5-flash 0.8719 0.8673 83.0000 0.0093 5361.7778
gpt-5 0.9436 0.8289 95.9000 0.0492 27324.8000
gpt-4.1 0.9250 0.8285 118.0000 0.0494 8348.5000
gemini/gemini-3-pro-preview 0.8576 0.8436 120.2000 0.0547 12125.8000
claude-opus-4-5-20251101 0.8705 0.9175 192.9000 0.1789 8652.3000
claude-sonnet-4-5-20250929 0.7146 0.7812 303.4000 0.1107 13305.0000
gemini/gemini-2.0-flash 0.7120 0.5866 77.3333 0.0035 1887.1111
gpt-4o 0.6552 0.7030 187.5000 0.0788 13454.0000
claude-opus-4-20250514 0.7242 0.7903 280.5000 0.5476 16769.5000

Risk Assessment Against Guidelines

Ability to evaluate whether evidence meets predefined criteria and produce defensible, evidence-weighted conclusions.

LLM Faithfulness Answer Relevancy Verbosity (words) Cost (USD) Latency (ms)
gemini/gemini-2.5-flash 0.9492 0.9351 910.0000 0.0087 14156.3000
gemini/gemini-2.0-flash 0.9011 0.8911 543.2000 0.0011 7232.6000
gpt-5.1 0.9369 0.9061 1451.0000 0.0339 27201.6667
gemini/gemini-3-pro-preview 0.8968 0.9078 525.2000 0.0339 21700.2000
gpt-5 0.9349 0.9349 1053.2222 0.0485 97935.6667
gpt-4.1 0.8726 0.8447 656.3000 0.0255 36028.4000
gpt-4o 0.8448 0.8550 394.2000 0.0271 21736.8000
claude-sonnet-4-5-20250929 0.8824 0.9102 1615.8571 0.0671 59471.2857
claude-opus-4-5-20251101 0.9071 0.9000 783.2000 0.0842 39152.5000
claude-opus-4-20250514 0.8810 0.8879 601.8000 0.2172 34260.9000

Regulatory Question Coverage

Ability to assess whether documentation adequately addresses a specific regulatory question, identifying gaps with calibrated judgment.

LLM Faithfulness Answer Relevancy Verbosity (words) Cost (USD) Latency (ms)
gemini/gemini-3-pro-preview 0.8490 0.8280 336.6800 0.0293 13182.2400
gpt-4o 0.7619 0.6989 352.3600 0.0985 14949.0000
gpt-4.1 0.6972 0.6191 569.2609 0.0437 20663.8696
gemini/gemini-2.0-flash 0.6489 0.5575 329.2400 0.0068 5367.5600
gpt-5 0.6107 0.5804 543.4000 0.0366 31668.6800
claude-sonnet-4-5-20250929 0.6859 0.6343 582.9565 0.2258 26548.3043
gpt-5.1 0.5808 0.4993 885.3636 0.0319 24679.8182
claude-opus-4-20250514 0.4550 0.4124 347.0870 1.1023 23825.3478