Accuracy Metrics — Live Report
Current AI accuracy rates across all question categories, confidence level distribution, flagging rates, and quarter-on-quarter improvement trends.
Q1 2026 Accuracy Summary
These figures are measured against our internal benchmark suite of more than 2,400 test queries, which is updated quarterly. They reflect the performance of the currently deployed model version.
By question category:
- Calculation questions: 98.2% (↑ from 97.8% in Q4 2025)
- Pattern questions: 94.7% (↑ from 93.1% in Q4 2025)
- Factual questions: 91.3% (→ stable from Q4 2025)
- Recommendation questions: 87.6% (↑ from 85.2% in Q4 2025)
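The quarter-on-quarter movement per category can be read off as a difference in percentage points. The following is a minimal sketch of that arithmetic, not the reporting pipeline itself; the names `CATEGORY_ACCURACY` and `quarterly_change` are illustrative, and the Q4 2025 figure for factual questions is assumed to equal the Q1 2026 figure because the report describes it as stable.

```python
# Category accuracies in percent, copied from the list above: (Q4 2025, Q1 2026).
# Assumption: "stable" for factual questions means the Q4 2025 value was also 91.3.
CATEGORY_ACCURACY = {
    "Calculation": (97.8, 98.2),
    "Pattern": (93.1, 94.7),
    "Factual": (91.3, 91.3),
    "Recommendation": (85.2, 87.6),
}


def quarterly_change(previous: float, current: float) -> float:
    """Return the change in percentage points, rounded to one decimal place."""
    return round(current - previous, 1)


changes = {
    name: quarterly_change(previous, current)
    for name, (previous, current) in CATEGORY_ACCURACY.items()
}

for name, delta in changes.items():
    print(f"{name}: {delta:+.1f} pp")
```

On these figures, recommendation questions moved the most (2.4 percentage points) and calculation questions the least among the categories that improved (0.4 points).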
Error type distribution:
- Hallucination/fabrication: 1.2% of responses
- Data misinterpretation: 2.8% of responses
- Overconfidence: 3.1% of responses
- Under-specificity: 4.2% of responses
- Currency/unit errors: 0.9% of responses
Confidence level distribution (all queries, April 2026):
- High confidence: 71%
- Medium confidence: 21%
- Low confidence: 6%
- Estimate: 2%
User-Flagged Errors
Q1 2026 flagging summary:
- Total flags received: 847
- Confirmed errors: 312 (36.8%)
- Not confirmed (original answer was correct): 398 (47.0%)
- Ambiguous / additional context needed: 137 (16.2%)
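The percentages in the flagging summary follow directly from the raw counts. As a quick check, the sketch below recomputes each share of total flags; the dictionary keys and variable names are illustrative and do not come from any internal tooling.

```python
# Flag outcomes for Q1 2026, copied from the summary above.
FLAG_COUNTS = {
    "confirmed": 312,
    "not_confirmed": 398,
    "ambiguous": 137,
}

total_flags = sum(FLAG_COUNTS.values())

# Share of all flags per outcome, in percent, rounded to one decimal place.
flag_shares = {
    outcome: round(100 * count / total_flags, 1)
    for outcome, count in FLAG_COUNTS.items()
}

print(total_flags)
print(flag_shares)
```

The three counts add up to the reported total of 847, and the rounded shares match the published 36.8%, 47.0%, and 16.2%.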
Most common confirmed error types flagged by users:
1. Incorrect date range interpretation (23% of confirmed errors)
2. Incorrect currency conversion (18%)
3. Seasonality misattribution (15%)
4. Missing data source not flagged (14%)
5. Regulatory information outdated (11%)
6. Other / miscellaneous (19%)
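To give a sense of scale, the shares above can be turned back into approximate counts out of the 312 confirmed errors. This is only a rough sketch: the published shares are rounded to whole percentages, so the resulting counts are estimates rather than reported figures, and the names used here are illustrative.

```python
CONFIRMED_ERRORS = 312  # confirmed errors in Q1 2026, from the flagging summary

# Share of confirmed errors per type, in whole percent, copied from the list above.
ERROR_TYPE_SHARE = {
    "Incorrect date range interpretation": 23,
    "Incorrect currency conversion": 18,
    "Seasonality misattribution": 15,
    "Missing data source not flagged": 14,
    "Regulatory information outdated": 11,
    "Other / miscellaneous": 19,
}

# Estimated counts only: the input shares are already rounded.
estimated_counts = {
    error_type: round(CONFIRMED_ERRORS * share / 100)
    for error_type, share in ERROR_TYPE_SHARE.items()
}

for error_type, count in estimated_counts.items():
    print(f"{error_type}: about {count}")
```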
Actions taken:
- System prompt updates addressing top error types: 6
- New benchmark test cases added: 89
- Escalated to Anthropic for model-level review: 3
Improvement Trend
Overall accuracy has improved in every quarter since launch, from 84.2% to 93.0%:
- Q1 2025 (launch): 84.2% overall accuracy
- Q2 2025: 86.7%
- Q3 2025: 88.4%
- Q4 2025: 90.1%
- Q1 2026: 93.0% (weighted average across all categories)
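The quarterly gains and the weighted average can both be sketched in a few lines. The report does not state the category weights behind the 93.0% figure, so the equal weights below are a placeholder assumption, and `weighted_mean` and the other names are illustrative only.

```python
# Overall accuracy per quarter in percent, copied from the list above.
OVERALL_ACCURACY = [
    ("Q1 2025", 84.2),
    ("Q2 2025", 86.7),
    ("Q3 2025", 88.4),
    ("Q4 2025", 90.1),
    ("Q1 2026", 93.0),
]

# Gain over the previous quarter, in percentage points.
quarterly_gains = [
    (label, round(current - previous, 1))
    for (_, previous), (label, current) in zip(OVERALL_ACCURACY, OVERALL_ACCURACY[1:])
]

# Cumulative gain from launch to the latest quarter.
total_gain = round(OVERALL_ACCURACY[-1][1] - OVERALL_ACCURACY[0][1], 1)


def weighted_mean(values: list[float], weights: list[float]) -> float:
    """Return the weighted arithmetic mean of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Q1 2026 category accuracies: calculation, pattern, factual, recommendation.
category_accuracy = [98.2, 94.7, 91.3, 87.6]

# Assumption: equal weights. The actual weights are not given in this report.
placeholder_weights = [1.0, 1.0, 1.0, 1.0]
equal_weight_average = weighted_mean(category_accuracy, placeholder_weights)

print(quarterly_gains)
print(total_gain)
print(f"{equal_weight_average:.2f}")
```

With equal weights the average comes out at about 92.95, close to the reported 93.0%. The real weighting, for example by query volume per category, may differ.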
The most significant single improvement was the deployment of our custom financial domain fine-tuning layer in Q3 2025, which reduced currency and calculation errors by 34%.