[private] Performance Metrics Sketch Pad
This page is an internal workspace to define the GNW performance metrics that we'll publish on the Accuracy & Performance page.
Model version:

| Metric | Definition | Method | Latest result |
| --- | --- | --- | --- |
| Human evaluation agreement | % of AI outputs judged “accurate” by experts | Manual inspection of X prompt/response pairs from GNW AI traces | |
| Location identification | % of prompts with the correct location identified | Automated evaluation of predefined question/answer pairs | X% [date], % change since last run |
| Dataset identification | % of prompts with the correct dataset identified | Automated evaluation of predefined question/answer pairs | |
| Dataset interpretation | % of prompts where GNW discusses domain-specific data and topics correctly | LLM evaluation of predefined question/answer pairs, based on expert guidance | |
| Analysis results | % of prompts with the correct quantitative result | Automated evaluation of predefined question/answer pairs | |
| Analysis interpretation | % of prompts with the correct interpretation of quantitative results | Automated evaluation of predefined question/answer pairs | |
IDEA BACKLOG
@AJ - thoughts on including any of these?
- Downtime or failure rate: % of requests resulting in an error or timeout
- Failure rate / hallucination rate: % of responses that include unsupported claims or data errors
- Factual accuracy (human-rated): % of responses judged factually correct
- Data grounding rate: % of responses that correctly cite a dataset or known source
- Response relevance: % of responses that directly answer the user’s query
- Clarity and interpretability: % of responses rated “understandable” by users
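If the downtime/failure idea makes the cut, it would come from request logs rather than the eval set. A rough sketch, assuming log records carry a status code and a timeout flag (a hypothetical schema, not our actual logging format):

```python
from dataclasses import dataclass

@dataclass
class RequestLog:
    status: int        # HTTP status returned to the user (hypothetical field)
    timed_out: bool    # whether the request hit the gateway timeout

def failure_rate(logs: list[RequestLog]) -> float:
    """% of requests resulting in an error or timeout (the backlog metric).

    Counts timeouts and 5xx responses; whether 4xx should also count is a
    judgment call to settle before publishing.
    """
    if not logs:
        return 0.0
    failed = sum(1 for r in logs if r.timed_out or r.status >= 500)
    return 100 * failed / len(logs)

logs = [RequestLog(200, False), RequestLog(504, True), RequestLog(500, False)]
print(f"{failure_rate(logs):.1f}%")  # 66.7%
```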
Last updated:
