✅ Accuracy & Performance
Overview
The platform is designed for high accuracy and to rigorously avoid misinterpreting satellite data, using a multi-layered approach:
Orchestration and reasoning with LangChain: Under the hood, our system uses LangChain to coordinate various AI models and tools. This allows the assistant to reason about your query, plan its approach, and then use a specialized suite of capabilities on your behalf.
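For illustration, the sketch below shows the general tool-calling pattern that LangChain enables. The model choice, prompt wording, and the `tree_cover_loss` tool are simplified placeholders, not the assistant's actual configuration.

```python
# Minimal sketch of LangChain-based orchestration, not the production setup.
# The model, prompt and tool below are illustrative placeholders.
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI


@tool
def tree_cover_loss(area_id: str, start_year: int, end_year: int) -> dict:
    """Run a deterministic tree cover loss analysis for a region of interest."""
    # Placeholder: the real tool calls a pre-built analysis service,
    # not a generative model.
    return {"area_id": area_id, "start_year": start_year, "end_year": end_year}


llm = ChatOpenAI(model="gpt-4o", temperature=0)  # any tool-calling chat model
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a research assistant for land and forest monitoring data."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

agent = create_tool_calling_agent(llm, [tree_cover_loss], prompt)
executor = AgentExecutor(agent=agent, tools=[tree_cover_loss])
executor.invoke({"input": "What was the tree cover loss in this area since 2020?"})
```

The language model only decides when to call a tool and how to phrase the answer; the tool itself does the deterministic work.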
Deterministic geospatial analysis tools: When you ask a question that requires data (e.g., "What was the tree cover loss in this area?"), the AI uses tools that access our analysis and map rendering services to perform deterministic geospatial calculations. These are pre-built, robust tools (not generative AI outputs) that:
Execute precise geospatial analysis over your regions of interest.
Generate visual outputs on demand.
Crucially, return deterministic, verifiable results that come directly from our trusted, quality-assured GFW and Land & Carbon Lab data sources, ensuring the assistant's factual outputs are always grounded in reliable, objective data.
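As a simplified illustration of what "deterministic" means here, the self-contained sketch below sums per-year loss area over a region-of-interest mask. The loss-year encoding and the 30 m pixel size are assumptions made for this example, not the production analysis pipeline.

```python
# Simplified stand-in for a deterministic zonal statistic; not the platform's
# actual analysis service. The loss-year encoding (value 21 = loss in 2021)
# and the 30 m pixel size (0.09 ha per pixel) are assumptions for this example.
import numpy as np


def tree_cover_loss_by_year(loss_year: np.ndarray, roi_mask: np.ndarray,
                            pixel_area_ha: float = 0.09) -> dict[int, float]:
    """Sum loss area (hectares) per year inside a region-of-interest mask."""
    values = loss_year[roi_mask & (loss_year > 0)]
    years, counts = np.unique(values, return_counts=True)
    return {2000 + int(y): round(float(c) * pixel_area_ha, 4)
            for y, c in zip(years, counts)}


# The same inputs always produce the same numbers -- nothing is generated.
loss = np.array([[0, 21],
                 [22, 21]])
roi = np.ones_like(loss, dtype=bool)
print(tree_cover_loss_by_year(loss, roi))  # {2021: 0.18, 2022: 0.09}
```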
Grounded generative insights with Retrieval Augmented Generation (RAG): While the core geospatial analysis is deterministic, the conversational layer and "generative insights" are powered by large language models. To ensure these LLMs provide answers that reflect real expertise and are factually correct (and don't "hallucinate"), the assistant is grounded using Retrieval Augmented Generation (RAG). This process works by:
Consulting a curated knowledge base: Instead of relying solely on its general training, the LLM first consults our specific, curated knowledge base. This includes up-to-date layer descriptions, comprehensive metadata, internal research papers and expert guidance from WRI and Land & Carbon Lab.
Providing context to the LLM: Relevant information retrieved from this knowledge base is provided as context to the LLM alongside your query. This ensures the assistant responds like a knowledgeable research assistant, pulling directly from our validated content.
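A minimal RAG sketch under stated assumptions is shown below: the two knowledge-base snippets, the embedding model, and the chat model are placeholders for the curated content and models the assistant actually uses.

```python
# Minimal RAG sketch: retrieve relevant passages, then answer with them as
# context. Documents, embeddings and model are illustrative placeholders.
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Stand-ins for curated layer descriptions, metadata and expert guidance.
knowledge_base = InMemoryVectorStore.from_texts(
    [
        "Tree cover loss measures gross loss of canopy cover and is not "
        "equivalent to deforestation.",
        "Loss data are reported annually at roughly 30 m resolution.",
    ],
    embedding=OpenAIEmbeddings(),
)

llm = ChatOpenAI(model="gpt-4o", temperature=0)


def grounded_answer(question: str) -> str:
    # 1. Consult the knowledge base for the most relevant passages.
    docs = knowledge_base.similarity_search(question, k=2)
    context = "\n".join(doc.page_content for doc in docs)
    # 2. Provide that context to the LLM alongside the user's question.
    prompt = (
        "Answer using only the context below. If the context is not "
        f"sufficient, say so.\n\nContext:\n{context}\n\nQuestion: {question}"
    )
    return llm.invoke(prompt).content


print(grounded_answer("Is tree cover loss the same thing as deforestation?"))
```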
Continuous performance evaluation: We also run performance evaluations on the assistant. This ongoing testing ensures the tool maintains a high success rate when executing user queries and consistently delivers accurate and reliable results.
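For example, an automated check for location identification might look roughly like the sketch below; the cases and the `assistant_identifies_location()` helper are hypothetical stand-ins for the real evaluation suite and assistant client.

```python
# Illustrative only: a tiny pass-rate check over predefined question/answer
# pairs. The cases and assistant_identifies_location() are hypothetical
# stand-ins for the real evaluation suite and assistant client.
from dataclasses import dataclass


@dataclass
class EvalCase:
    prompt: str
    expected_location_id: str


CASES = [
    EvalCase("How much tree cover loss occurred in Brazil in 2022?", "BRA"),
    EvalCase("Show recent disturbance alerts for this protected area.", "PA-001"),
]


def assistant_identifies_location(prompt: str) -> str:
    """Placeholder: send the prompt to the assistant and return the location it selected."""
    raise NotImplementedError


def location_identification_rate(cases: list[EvalCase]) -> float:
    correct = sum(assistant_identifies_location(c.prompt) == c.expected_location_id
                  for c in cases)
    return correct / len(cases)
```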
While Global Nature Watch's AI assistant is designed to generate high-quality, reliable insights by drawing on trusted geospatial data and carefully structured methodologies, it is still improving and may occasionally produce inaccurate responses, especially when interpreting quantitative results such as graphs or charts. The underlying data and map layers themselves are deterministic (not generated by AI), but large language models are used to explain and summarize those results, which can sometimes introduce errors.
If you notice something that seems incorrect, please click the thumbs down button to provide feedback. Your input helps us continue improving the tool's accuracy and usefulness.

Performance
We track the following metrics to evaluate system stability and AI assistant performance:

Model version:

| Metric | Definition | Evaluation method | Latest result |
| --- | --- | --- | --- |
| Human evaluation agreement | % of AI outputs judged “accurate” by experts | Manual inspection of X prompt/response pairs from GNW AI traces | |
| Location identification | % of prompts with the correct location identified | Automated evaluation of predefined question/answer pairs | X% [date], % change since last run |
| Dataset identification | % of prompts with the correct dataset identified | Automated evaluation of predefined question/answer pairs | |
| Dataset interpretation | % of prompts where GNW talks about domain-specific data and topics correctly | LLM evaluation of predefined question/answer pairs based on expert guidance | |
| Analysis results | % of prompts with the correct quantitative result | Automated evaluation of predefined question/answer pairs | |
| Analysis interpretation | % of prompts with the correct interpretation of quantitative results | Automated evaluation of predefined question/answer pairs | |
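The "LLM evaluation" method above refers to using a language model as a grader against expert guidance. A rough sketch of that pattern is below; the judge model, prompt wording, and guidance text are assumptions, not the evaluation prompts actually used.

```python
# Rough sketch of an LLM-as-judge check for dataset interpretation.
# The judge model, prompt wording and guidance text are assumptions,
# not the evaluation prompts actually used.
from langchain_openai import ChatOpenAI

judge = ChatOpenAI(model="gpt-4o", temperature=0)

JUDGE_PROMPT = (
    "You are grading an AI assistant's explanation of forest monitoring data.\n"
    "Expert guidance: {guidance}\n"
    "Question: {question}\n"
    "Assistant answer: {answer}\n"
    "Reply with exactly PASS if the answer is consistent with the guidance, "
    "otherwise reply FAIL."
)


def interpretation_is_correct(question: str, answer: str, guidance: str) -> bool:
    verdict = judge.invoke(
        JUDGE_PROMPT.format(guidance=guidance, question=question, answer=answer)
    )
    return verdict.content.strip().upper().startswith("PASS")
```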
IDEA BACKLOG

@AJ - thoughts on including any of these?

- Downtime or failure rate: % of requests resulting in an error or timeout
- Failure rate / hallucination rate: % of responses that include unsupported claims or data errors
- Factual accuracy (human-rated): % of responses judged factually correct
- Data grounding rate: % of responses that correctly cite a dataset or known source
- Response relevance: how often responses directly answer the user’s query
- Clarity and interpretability: % of responses rated “understandable” by users
Last updated
