[private] Performance Metrics Sketch Pad

This page is an internal workspace for defining the GNW performance metrics that we'll publish on the Accuracy & Performance page.

| Metric | Description | Method | Score | Model version |
| --- | --- | --- | --- | --- |
| Human evaluation agreement | % of AI outputs judged “accurate” by experts | Manual inspection of X prompt/response pairs from GNW AI traces | | |
| Location identification | % of prompts with the correct location identified | Automated evaluation of predefined question/answer pairs (see sketch below) | X% [date], % change since last run | |
| Dataset identification | % of prompts with the correct dataset identified | Automated evaluation of predefined question/answer pairs | | |
| Dataset interpretation | % of prompts where GNW correctly discusses domain-specific data and topics | LLM evaluation of predefined question/answer pairs, based on expert guidance (see sketch below) | | |
| Analysis results | % of prompts with the correct quantitative result | Automated evaluation of predefined question/answer pairs | | |
| Analysis interpretation | % of prompts with the correct interpretation of quantitative results | Automated evaluation of predefined question/answer pairs | | |
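Most rows above share the same harness shape: run each predefined question through GNW and score the response. A minimal sketch of that loop, assuming a `qa_pairs.jsonl` file with `question`/`expected_answer` fields and a hypothetical `ask_gnw()` client (neither is the real GNW interface):

```python
import json

def ask_gnw(question: str) -> str:
    """Hypothetical stand-in for the real GNW client call."""
    raise NotImplementedError

def run_eval(qa_path: str) -> float:
    """Return the % of predefined prompts answered correctly."""
    with open(qa_path) as f:
        pairs = [json.loads(line) for line in f if line.strip()]
    correct = 0
    for pair in pairs:
        response = ask_gnw(pair["question"])
        # Substring match is the simplest possible check; a real harness
        # would normalize location/dataset names and compare quantitative
        # results with a numeric tolerance.
        if pair["expected_answer"].lower() in response.lower():
            correct += 1
    return 100.0 * correct / len(pairs)

def score_cell(current: float, previous: float, date: str) -> str:
    """Format the Score column: 'X% [date], % change since last run'.

    Assumes 'change' means percentage points between runs; swap in a
    relative change if that's what we end up publishing.
    """
    return f"{current:.1f}% [{date}], {current - previous:+.1f} pts since last run"
```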
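The Dataset interpretation row uses an LLM judge instead of exact expected answers. A sketch of that grading step, with a hypothetical `llm_judge()` call and a rubric distilled from expert guidance:

```python
JUDGE_PROMPT = """You are grading a GNW response against expert guidance.

Expert rubric:
{rubric}

Question: {question}
Response: {response}

Does the response discuss the domain-specific data and topics correctly?
Answer with exactly one word: PASS or FAIL."""

def llm_judge(prompt: str) -> str:
    """Hypothetical call to a grading LLM; returns its raw completion."""
    raise NotImplementedError

def judge_interpretation(question: str, response: str, rubric: str) -> bool:
    """True when the judge model marks the response as correct."""
    verdict = llm_judge(JUDGE_PROMPT.format(
        rubric=rubric, question=question, response=response))
    return verdict.strip().upper().startswith("PASS")
```

The % for the table is then just the PASS share over all predefined pairs, same as in `run_eval()` above.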

IDEA BACKLOG

@AJ - thoughts on including any of these?

| Metric | Description | Score |
| --- | --- | --- |
| Downtime or failure rate | % of requests resulting in an error or timeout (see sketch below) | |
| Human evaluation agreement | % of AI outputs judged “accurate” by experts | |
| Failure rate / hallucination rate | % of responses that include unsupported claims or data errors | |
| Factual accuracy (human-rated) | % of responses judged factually correct | |
| Data grounding rate | % of responses that correctly cite a dataset or known source | |
| Response relevance | % of responses that directly answer the user’s query | |
| Clarity and interpretability | % of responses rated “understandable” by users | |
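If the downtime/failure-rate idea makes the cut, it can be computed straight from request logs rather than from the Q/A suite. A sketch assuming a hypothetical per-request outcome field ("ok", "error", or "timeout"; the real GNW logs may encode this differently):

```python
from typing import Iterable

def failure_rate(statuses: Iterable[str]) -> float:
    """% of requests resulting in an error or timeout."""
    statuses = list(statuses)
    failures = sum(s in ("error", "timeout") for s in statuses)
    return 100.0 * failures / len(statuses) if statuses else 0.0

# failure_rate(["ok", "timeout", "ok", "error"]) -> 50.0
```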
