Accuracy & Performance

Overview

The platform is designed for high accuracy and to avoid misinterpreting satellite data. It achieves this through a multi-layered approach:

  • Orchestration and reasoning with LangChain: Under the hood, our system uses LangChain to coordinate various AI models and tools. This allows the assistant to reason about your query, plan its approach, and then use a specialized suite of capabilities on your behalf (see the first sketch after this list).

  • Deterministic geospatial analysis tools: When you ask a question that requires data (e.g., "What was the tree cover loss in this area?"), the AI uses tools that access our analysis and map rendering services to perform deterministic geospatial calculations. These are pre-built, robust tools (not generative AI outputs) that:

    • Execute precise geospatial analysis over your regions of interest.

    • Generate visual outputs on demand.

    • Crucially, return deterministic, verifiable results drawn directly from our trusted, quality-assured GFW and Land & Carbon Lab data sources, so the assistant's factual outputs are always grounded in reliable, objective data.

  • Grounded generative insights with Retrieval-Augmented Generation (RAG): While the core geospatial analysis is deterministic, the conversational layer and "generative insights" are powered by large language models. To ensure these LLMs provide answers that reflect real expertise and are factually correct (and don't "hallucinate"), the assistant is grounded using Retrieval-Augmented Generation (RAG); see the second sketch after this list. This process works by:

    • Consulting a curated knowledge base: Instead of relying solely on its general training, the LLM first consults our specific, curated knowledge base. This includes up-to-date layer descriptions, comprehensive metadata, internal research papers and expert guidance from WRI and Land & Carbon Lab.

    • Providing context to the LLM: Relevant information retrieved from this knowledge base is provided as context to the LLM alongside your query. This ensures the assistant responds like a knowledgeable research assistant, pulling directly from our validated content.

  • Continuous performance evaluation: We also run ongoing performance evaluations on the assistant to verify that it maintains a high success rate when executing user queries and consistently delivers accurate, reliable results. The metrics we track are listed under Performance below.
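
The orchestration and tool-calling pattern described in the first two bullets can be pictured with a minimal LangChain sketch. This is not Global Nature Watch's production code: the tool, the analysis endpoint, the model name, and the parameters below are hypothetical placeholders.

```python
# Minimal sketch of LangChain tool calling (illustrative only; the endpoint,
# tool name and model below are hypothetical, not the GNW implementation).
import requests
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI


@tool
def tree_cover_loss(geometry_id: str, start_year: int, end_year: int) -> dict:
    """Return tree cover loss statistics for a region of interest.

    Calls a deterministic analysis service instead of letting the LLM
    estimate numbers, so the figures are exact and verifiable.
    """
    resp = requests.get(
        "https://example.org/analysis/tree-cover-loss",  # placeholder endpoint
        params={"geometry_id": geometry_id, "start": start_year, "end": end_year},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


# The chat model reasons about the query and plans which tool to call with
# which arguments; the tool itself performs the precise geospatial calculation.
llm = ChatOpenAI(model="gpt-4o").bind_tools([tree_cover_loss])
plan = llm.invoke("What was the tree cover loss in this area from 2015 to 2023?")
print(plan.tool_calls)  # e.g. [{"name": "tree_cover_loss", "args": {...}}]
```

In the full assistant, an agent loop then executes the selected tool and feeds its deterministic result back to the model to compose the final answer.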
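
The RAG grounding step can be sketched in the same spirit. The documents, embedding model, and prompt wording below are illustrative stand-ins, not the actual contents of the curated knowledge base or the production retrieval pipeline.

```python
# Illustrative RAG sketch: retrieve curated reference material and supply it
# to the LLM as context. Documents, models and prompt wording are placeholders.
from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Stand-ins for the curated knowledge base (layer descriptions, metadata,
# expert guidance maintained by WRI and Land & Carbon Lab).
docs = [
    Document(page_content="Tree cover loss is reported annually and ..."),
    Document(page_content="Loss of tree cover does not always indicate deforestation ..."),
]

vector_store = FAISS.from_documents(docs, OpenAIEmbeddings())
retriever = vector_store.as_retriever(search_kwargs={"k": 2})

question = "Is tree cover loss the same as deforestation?"
context = "\n\n".join(d.page_content for d in retriever.invoke(question))

# The retrieved passages are passed to the model alongside the question, so
# the answer draws on validated content rather than general training data.
answer = ChatOpenAI(model="gpt-4o").invoke(
    f"Answer using only this context:\n{context}\n\nQuestion: {question}"
)
print(answer.content)
```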

While Global Nature Watch's AI assistant is designed to generate high-quality, reliable insights by drawing on trusted geospatial data and carefully structured methodologies, it is still improving and may occasionally produce inaccurate responses, especially when interpreting quantitative results such as graphs or charts. The underlying data and map layers themselves are deterministic (not generated by AI), but large language models are used to explain and summarize those results, which can sometimes introduce errors.

If you notice something that seems incorrect, please click the thumbs down button to provide feedback. Your input helps us continue improving the tool's accuracy and usefulness.

Performance

We track the following metrics to evaluate system stability and AI assistant performance:

| Metric | Description | Method | Score | Model version |
| --- | --- | --- | --- | --- |
| Human evaluation agreement | % of AI outputs judged “accurate” by experts | Manual inspection of X prompt/response pairs from GNW AI traces | | |
| Location identification | % of prompts with correct location identified | Automated evaluations of predefined question/answer pairs | X% [date], % change since last run | |
| Dataset identification | % of prompts with correct dataset identified | Automated evaluations of predefined question/answer pairs | | |
| Dataset interpretation | % of prompts where GNW talks about domain-specific data and topics correctly | LLM evaluation of predefined question/answer pairs based on expert guidance | | |
| Analysis results | % of prompts with correct quantitative result | Automated evaluations of predefined question/answer pairs | | |
| Analysis interpretation | % of prompts with correct interpretation of quantitative results | Automated evaluations of predefined question/answer pairs | | |
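
As a rough picture of the "automated evaluations of predefined question/answer pairs" method listed above, a minimal harness might look like the sketch below. The `ask_assistant` callable and the example pairs are hypothetical stand-ins; this is not the evaluation code run in production.

```python
# Simplified sketch of an automated evaluation over predefined Q/A pairs.
# `ask_assistant` and the example pairs are hypothetical stand-ins.
from typing import Callable

EVAL_SET = [
    {"question": "Which country is the Leuser Ecosystem located in?", "expected": "Indonesia"},
    {"question": "Which dataset reports annual tree cover loss?", "expected": "tree cover loss"},
]


def run_eval(ask_assistant: Callable[[str], str]) -> float:
    """Return the share of predefined questions answered correctly."""
    correct = 0
    for case in EVAL_SET:
        answer = ask_assistant(case["question"])
        if case["expected"].lower() in answer.lower():  # simple substring check
            correct += 1
    return correct / len(EVAL_SET)


if __name__ == "__main__":
    score = run_eval(lambda q: "Indonesia")  # dummy assistant for illustration
    print(f"Accuracy: {score:.0%}")
```

A production harness would use more robust grading than a substring match, for example the LLM-based evaluation against expert guidance noted in the "Dataset interpretation" row.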

Idea backlog


| Metric | Description | Score |
| --- | --- | --- |
| Downtime or failure rate | % of requests resulting in an error or timeout | |
| Human evaluation agreement | % of AI outputs judged “accurate” by experts | |
| Failure rate / hallucination rate | % of responses that include unsupported claims or data errors | |
| Factual accuracy (human-rated) | % of responses judged factually correct | |
| Data grounding rate | % of responses that correctly cite a dataset or known source | |
| Response relevance | How often responses directly answer the user’s query | |
| Clarity and interpretability | % of responses rated “understandable” by users | |
