AI Testing Results

Results as of Feb 1, 2026

Introduction

New LLMs are released frequently, and teams working in model risk management (MRM) and AI governance often look for practical insight into how different providers perform on the types of tasks they encounter in day-to-day documentation and validation workflows, not just how they score on broad, general-purpose benchmarks.

In this benchmarking analysis, we focus on a set of governance-oriented tasks that commonly arise in documentation and review processes, including:

  • Interpreting quantitative test results from tables and figures
  • Drafting documentation from existing source materials
  • Assessing evidence against predefined criteria
  • Reviewing documentation in light of regulatory or policy questions

These tasks were selected because they tend to surface behaviors that matter in enterprise environments—such as accurately reading numerical results, avoiding unsupported additions, identifying gaps, and producing conclusions that remain defensible under review.

The objective of this leaderboard is to provide greater transparency into how models behave across these use cases, so teams can make informed decisions based on their own priorities, risk tolerance, and operational constraints.

Why This Matters

Generic public benchmarks can be useful for understanding general model capabilities, but they often do not reflect the types of evidence-based tasks that appear in AI governance and model risk management workflows. In these settings, the key requirement is not only strong language generation, but the ability to stay grounded in provided inputs and produce outputs that remain defensible under review.

For this reason, we focus on domain-specific tasks such as interpreting validation results, drafting documentation from source materials, assessing evidence against guidelines, and evaluating regulatory question coverage. This approach is intended to provide a more practical view of how different models perform in real governance workflows, and to highlight trade-offs across quality, reliability, cost, and operational efficiency.

Small differences in model behavior can create material risk:

  • Misreading quantitative results can distort validation conclusions
  • Unsupported statements can introduce audit and compliance exposure
  • Missed gaps or contradictions can weaken governance controls
  • Overconfident language can create false assurance
  • Cost and latency differences compound quickly at scale

What We Tested

We evaluated leading LLM providers across four AI governance–oriented tasks that are commonly encountered in enterprise validation and documentation workflows, spanning both first-line (model development) and second-line (independent validation and oversight) activities.

  • Quantitative Results Analysis: Interpreting metric tables and plots to produce accurate, evidence-grounded insights without introducing unsupported assumptions.
  • Model Documentation Drafting: Generating documentation from existing source materials while preserving numerical accuracy, maintaining scope discipline, and ensuring internal consistency.
  • Risk Assessment Against Guidelines: Assessing whether available evidence aligns with predefined criteria, distinguishing substantiated evidence from narrative statements, and forming well-supported conclusions.
  • Regulatory Question Coverage: Evaluating whether documentation addresses specific regulatory or policy questions, identifying potential gaps, and providing a calibrated assessment of coverage.

For each task, models were evaluated across multiple runs (10 traces per task), and the reported metrics reflect average performance across those runs. The overall leaderboard aggregates results across both traces and tasks to provide a balanced, high-level comparative view.
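The aggregation described above can be sketched in a few lines, assuming per-trace scores have already been computed. The scores below are illustrative placeholders, not actual results:

```python
from statistics import mean

# Hypothetical faithfulness scores for one model: 10 traces per task, 4 tasks.
traces = {
    "quantitative_analysis": [0.98, 0.99, 0.97, 1.00, 0.98, 0.99, 0.97, 0.98, 1.00, 0.99],
    "documentation_drafting": [0.87, 0.88, 0.86, 0.89, 0.87, 0.88, 0.86, 0.87, 0.88, 0.86],
    "risk_assessment": [0.95, 0.94, 0.96, 0.95, 0.94, 0.95, 0.96, 0.94, 0.95, 0.95],
    "regulatory_coverage": [0.85, 0.84, 0.86, 0.85, 0.84, 0.85, 0.86, 0.84, 0.85, 0.85],
}

# Task-level metric: average over the 10 traces for each task.
task_scores = {task: mean(scores) for task, scores in traces.items()}

# Overall leaderboard metric: average across all traces and tasks.
overall_score = mean(s for scores in traces.values() for s in scores)
```

Because every task contributes the same number of traces, averaging over all traces is equivalent to averaging the task-level means, which keeps the overall view balanced across tasks.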

How We Evaluated the Models

Each model was evaluated using a consistent scoring framework designed to capture both output quality and operational characteristics.

  • Faithfulness: Measures how consistently a response stays grounded in the provided inputs. Outputs are assessed statement by statement to determine whether claims can be traced back to the source material. Higher scores indicate stronger evidence alignment and fewer unsupported additions or contradictions.
  • Answer Relevancy: Measures how directly the response addresses the task at hand. This helps distinguish focused, task-specific outputs from responses that include tangential or generic content that does not materially contribute to the objective.
  • Verbosity: Measures the length of the response (word count). This provides insight into how concise or expansive a model tends to be, which can influence review effort and downstream workflow efficiency.
  • Cost: Measures the relative resource expense of completing a task, based on token usage and provider pricing. This is particularly relevant when considering scalability across repeated or high-volume workflows.
  • Latency: Measures end-to-end response time. In operational settings, latency can affect user experience and overall process throughput, especially when tasks are run at scale.
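To illustrate the faithfulness dimension, the statement-by-statement check reduces to the fraction of claims that can be traced to the source. The sketch below uses simple substring matching as a toy stand-in for the actual claim-verification method, which is not detailed here; the metric values and claims are hypothetical:

```python
def faithfulness(claims: list[str], source: str) -> float:
    """Fraction of extracted claims traceable to the source text.
    Substring matching is a toy stand-in for a real entailment/judging step."""
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if claim.lower() in source.lower())
    return supported / len(claims)

source = "AUC was 0.81 on the holdout sample. The KS statistic was 0.42."
claims = [
    "AUC was 0.81 on the holdout sample",  # supported by the source
    "The KS statistic was 0.42",           # supported by the source
    "The Gini coefficient was 0.62",       # unsupported addition
]
score = faithfulness(claims, source)  # 2 of 3 supported -> ~0.667
```

In this toy example, the unsupported Gini claim pulls the score down, mirroring how unsupported additions reduce faithfulness in the evaluation.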

All models were evaluated under comparable conditions using standardized prompts and datasets to support consistent comparison.
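The operational dimensions are straightforward arithmetic. The sketch below computes verbosity as a whitespace word count and cost from token usage; the per-1K-token prices are placeholder values, not any provider's actual rate card:

```python
def verbosity(response: str) -> int:
    """Word count of a model response (whitespace-delimited)."""
    return len(response.split())

def cost_usd(input_tokens: int, output_tokens: int,
             price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Task cost in USD from token usage and per-1K-token prices
    (prices here are placeholders, not real provider rates)."""
    return (input_tokens / 1000) * price_in_per_1k \
        + (output_tokens / 1000) * price_out_per_1k

words = verbosity("The model meets the predefined criteria for this test.")  # 9
spend = cost_usd(input_tokens=2400, output_tokens=800,
                 price_in_per_1k=0.001, price_out_per_1k=0.003)  # 0.0048
```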


The Leaderboard

The table below provides a summary view of how leading LLM providers perform across the four governance-oriented tasks included in this analysis. Each row represents a model, and the scores reflect performance across key dimensions, including faithfulness, answer relevancy, and operational characteristics such as cost and latency.

LLM Faithfulness Answer Relevancy Verbosity (words) Cost (USD) Latency (ms)
gemini/gemini-2.5-flash 0.9347 0.9342 705.5357 0.0089 11988.0714
gemini/gemini-3-pro-preview 0.8833 0.8743 436.7963 0.0352 16974.1481
claude-opus-4-5-20251101 0.9232 0.9371 685.2069 0.1091 27694.6552
gpt-4.1 0.8269 0.7687 594.3077 0.0369 24678.2308
gpt-5 0.7946 0.7587 653.5472 0.0411 51410.6415
gpt-5.1 0.7784 0.7298 966.8261 0.0330 22145.1957
gemini/gemini-2.0-flash 0.7544 0.6862 414.9623 0.0041 5658.6981
gpt-4o 0.7729 0.7468 382.0185 0.0681 18283.3889
claude-sonnet-4-5-20250929 0.7750 0.7708 849.3878 0.1462 33511.6327
claude-opus-4-20250514 0.6662 0.6760 493.6346 0.6625 28997.0962

Performance by Task

Model performance can vary depending on the type of task being performed. A model that shows strong results in quantitative interpretation may perform differently when drafting documentation or assessing regulatory coverage.

The task-level leaderboards below offer a more detailed view, allowing teams to compare providers based on the specific workflows that matter most to them.

Quantitative Results Analysis

Ability to accurately interpret tables and plots and produce evidence-grounded summaries without introducing unsupported conclusions.

LLM Faithfulness Answer Relevancy Verbosity (words) Cost (USD) Latency (ms)
gemini/gemini-2.5-flash 0.9816 1.0000 1100.8889 0.0086 16205.2222
gpt-4.1 0.9987 1.0000 1118.6667 0.0185 40470.0000
gemini/gemini-2.0-flash 0.9270 0.9156 848.2222 0.0007 8490.2222
gpt-5.1 1.0000 1.0000 1857.7500 0.0278 32342.6250
gemini/gemini-3-pro-preview 0.9925 0.9998 968.4444 0.0317 27643.1111
gpt-5 0.9995 1.0000 1179.4444 0.0375 86486.4444
claude-sonnet-4-5-20250929 0.9864 1.0000 1540.7778 0.0438 53567.7778
claude-opus-4-5-20251101 0.9995 1.0000 1123.3333 0.0592 36121.8889
gpt-4o 0.8544 0.8084 667.0000 0.0172 29074.4444
claude-opus-4-20250514 0.9027 0.9870 984.7778 0.1611 49951.3333

Model Documentation Drafting

Ability to generate documentation strictly from source materials while preserving accuracy, scope discipline, and internal consistency.

LLM Faithfulness Answer Relevancy Verbosity (words) Cost (USD) Latency (ms)
gpt-5.1 0.9407 0.9150 142.8000 0.0393 5377.2000
gemini/gemini-2.5-flash 0.8719 0.8673 83.0000 0.0093 5361.7778
gpt-5 0.9436 0.8289 95.9000 0.0492 27324.8000
gpt-4.1 0.9250 0.8285 118.0000 0.0494 8348.5000
gemini/gemini-3-pro-preview 0.8576 0.8436 120.2000 0.0547 12125.8000
claude-opus-4-5-20251101 0.8705 0.9175 192.9000 0.1789 8652.3000
claude-sonnet-4-5-20250929 0.7146 0.7812 303.4000 0.1107 13305.0000
gemini/gemini-2.0-flash 0.7120 0.5866 77.3333 0.0035 1887.1111
gpt-4o 0.6552 0.7030 187.5000 0.0788 13454.0000
claude-opus-4-20250514 0.7242 0.7903 280.5000 0.5476 16769.5000

Risk Assessment Against Guidelines

Ability to evaluate whether evidence meets predefined criteria and produce defensible, evidence-weighted conclusions.

LLM Faithfulness Answer Relevancy Verbosity (words) Cost (USD) Latency (ms)
gemini/gemini-2.5-flash 0.9492 0.9351 910.0000 0.0087 14156.3000
gemini/gemini-2.0-flash 0.9011 0.8911 543.2000 0.0011 7232.6000
gpt-5.1 0.9369 0.9061 1451.0000 0.0339 27201.6667
gemini/gemini-3-pro-preview 0.8968 0.9078 525.2000 0.0339 21700.2000
gpt-5 0.9349 0.9349 1053.2222 0.0485 97935.6667
gpt-4.1 0.8726 0.8447 656.3000 0.0255 36028.4000
gpt-4o 0.8448 0.8550 394.2000 0.0271 21736.8000
claude-sonnet-4-5-20250929 0.8824 0.9102 1615.8571 0.0671 59471.2857
claude-opus-4-5-20251101 0.9071 0.9000 783.2000 0.0842 39152.5000
claude-opus-4-20250514 0.8810 0.8879 601.8000 0.2172 34260.9000

Regulatory Question Coverage

Ability to assess whether documentation adequately addresses a specific regulatory question, identifying gaps with calibrated judgment.

LLM Faithfulness Answer Relevancy Verbosity (words) Cost (USD) Latency (ms)
gemini/gemini-3-pro-preview 0.8490 0.8280 336.6800 0.0293 13182.2400
gpt-4o 0.7619 0.6989 352.3600 0.0985 14949.0000
gpt-4.1 0.6972 0.6191 569.2609 0.0437 20663.8696
gemini/gemini-2.0-flash 0.6489 0.5575 329.2400 0.0068 5367.5600
gpt-5 0.6107 0.5804 543.4000 0.0366 31668.6800
claude-sonnet-4-5-20250929 0.6859 0.6343 582.9565 0.2258 26548.3043
gpt-5.1 0.5808 0.4993 885.3636 0.0319 24679.8182
claude-opus-4-20250514 0.4550 0.4124 347.0870 1.1023 23825.3478