Open Data & Transparency Reports · Updated 1 April 2026

Accuracy Metrics — Live Report

Current AI accuracy rates across all question categories, confidence level distribution, flagging rates, and quarter-on-quarter improvement trends.

Live metrics

Overall accuracy (Q1 2026): 93.0% ↑
Flags received (Q1 2026): 847 →
Confirmed error rate from flags: 36.8% ↓
Benchmark test cases: 2,400+ ↑

Q1 2026 Accuracy Summary

These figures are measured against our internal benchmark suite of 2,400+ test queries, which is updated quarterly, and reflect the performance of the currently deployed model version.

By question category:

Calculation questions: 98.2% (↑ from 97.8% in Q4 2025)
Pattern questions: 94.7% (↑ from 93.1% in Q4 2025)
Factual questions: 91.3% (→ stable compared with Q4 2025)
Recommendation questions: 87.6% (↑ from 85.2% in Q4 2025)
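The category figures above are easiest to compare as percentage-point changes from Q4 2025. A minimal Python sketch using only the listed values; factual questions are left out because no Q4 2025 figure is given for them, and the variable and function names are illustrative:

```python
# Category accuracy in percent, as listed in the Q1 2026 summary.
# Factual questions are omitted: only "stable" is reported for Q4 2025.
Q4_2025 = {"Calculation": 97.8, "Pattern": 93.1, "Recommendation": 85.2}
Q1_2026 = {"Calculation": 98.2, "Pattern": 94.7, "Recommendation": 87.6}


def point_changes(before: dict[str, float], after: dict[str, float]) -> dict[str, float]:
    """Return the percentage-point change per category, rounded to 0.1."""
    return {name: round(after[name] - before[name], 1) for name in before}


if __name__ == "__main__":
    for name, delta in point_changes(Q4_2025, Q1_2026).items():
        print(f"{name}: {delta:+.1f} pp")
```

On these figures, recommendation questions show the largest gain (+2.4 points), followed by pattern (+1.6) and calculation (+0.4) questions.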

Error type distribution:

Hallucination/fabrication: 1.2% of responses
Data misinterpretation: 2.8% of responses
Overconfidence: 3.1% of responses
Under-specificity: 4.2% of responses
Currency/unit errors: 0.9% of responses

Confidence level distribution (all queries, April 2026):

High confidence: 71%
Medium confidence: 21%
Low confidence: 6%
Estimate: 2%

User-Flagged Errors

Q1 2026 flagging summary:

Total flags received: 847
Confirmed errors: 312 (36.8%)
Not confirmed (correct answer): 398 (47.0%)
Ambiguous / additional context needed: 137 (16.2%)
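Each percentage is that outcome's count divided by the 847 total flags; the confirmed-error share is the 36.8% headline figure. A short Python check using the reported counts (names are illustrative):

```python
# Q1 2026 flag outcomes, as reported in the flagging summary.
FLAG_OUTCOMES = {
    "Confirmed errors": 312,
    "Not confirmed (correct answer)": 398,
    "Ambiguous / additional context needed": 137,
}


def outcome_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each outcome's share of all flags, in percent to one decimal place."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


if __name__ == "__main__":
    print("Total flags:", sum(FLAG_OUTCOMES.values()))
    for label, share in outcome_shares(FLAG_OUTCOMES).items():
        print(f"{label}: {share}%")
```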

Most common confirmed error types flagged by users:

1. Incorrect date range interpretation (23% of confirmed errors)

2. Incorrect currency conversion (18%)

3. Seasonality misattribution (15%)

4. Missing data source not flagged (14%)

5. Regulatory information outdated (11%)

6. Other / miscellaneous (19%)
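These shares are percentages of the 312 confirmed errors, not of all 847 flags, so they can be converted back into approximate counts. A Python sketch; because the published shares are rounded to whole percentages, the counts it produces are estimates, not published figures:

```python
# Shares of the 312 confirmed errors by type, as reported (whole percent).
CONFIRMED_ERRORS = 312
TYPE_SHARES = {
    "Incorrect date range interpretation": 23,
    "Incorrect currency conversion": 18,
    "Seasonality misattribution": 15,
    "Missing data source not flagged": 14,
    "Regulatory information outdated": 11,
    "Other / miscellaneous": 19,
}


def approximate_counts(total: int, shares: dict[str, int]) -> dict[str, int]:
    """Convert whole-percent shares into approximate error counts.

    The shares are rounded, so these counts are estimates only.
    """
    return {label: round(total * pct / 100) for label, pct in shares.items()}


if __name__ == "__main__":
    for label, n in approximate_counts(CONFIRMED_ERRORS, TYPE_SHARES).items():
        print(f"{label}: ~{n}")
```

With these inputs the estimates (roughly 72, 56, 47, 44, 34 and 59) add back up to 312.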

Actions taken:

System prompt updates addressing top error types: 6
New benchmark test cases added: 89
Escalated to Anthropic for model-level review: 3

Improvement Trend

Overall accuracy has improved consistently since launch:

Q1 2025 (launch): 84.2% overall accuracy
Q2 2025: 86.7%
Q3 2025: 88.4%
Q4 2025: 90.1%
Q1 2026: 93.0% (weighted average across all categories)
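The trend can also be read as quarter-on-quarter gains in percentage points. A minimal Python sketch over the reported quarterly figures (names are illustrative):

```python
# Overall accuracy by quarter, in percent, as reported in the improvement trend.
TREND = [
    ("Q1 2025", 84.2),
    ("Q2 2025", 86.7),
    ("Q3 2025", 88.4),
    ("Q4 2025", 90.1),
    ("Q1 2026", 93.0),
]


def quarterly_gains(series: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Return the percentage-point gain of each quarter over the previous one."""
    return [
        (label, round(value - prev_value, 1))
        for (_, prev_value), (label, value) in zip(series, series[1:])
    ]


if __name__ == "__main__":
    for label, gain in quarterly_gains(TREND):
        print(f"{label}: {gain:+.1f} pp")
    print(f"Since launch: {TREND[-1][1] - TREND[0][1]:+.1f} pp")
```

On these figures, the largest quarter-on-quarter gain is from Q4 2025 to Q1 2026 (+2.9 points), and the cumulative gain since launch is 8.8 points.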

The most significant single improvement was the deployment of our custom financial domain fine-tuning layer in Q3 2025, which reduced currency and calculation errors by 34%.
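Reading the 34% figure as a relative reduction (an assumption; the report does not state the baseline or whether the figure is in percentage points), with $E_{\text{before}}$ and $E_{\text{after}}$ as hypothetical counts of currency and calculation errors before and after the fine-tuning layer was deployed:

```latex
\[
\frac{E_{\text{before}} - E_{\text{after}}}{E_{\text{before}}} = 0.34
\quad\Longrightarrow\quad
E_{\text{after}} = (1 - 0.34)\,E_{\text{before}} = 0.66\,E_{\text{before}}
\]
```

For example, a hypothetical baseline of 100 such errors would fall to 66.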
