Building a model trained specifically to handle insurance safely is only the beginning. To be ready for customer use in a regulated financial environment, you need a safety harness around it: a continuous system of monitoring, explainability, and protection that keeps every interaction compliant, auditable, and trustworthy.
1. Continuous monitoring and benchmarking
Insurance language is regulated language. That means every response must be demonstrably fair, clear, and not misleading. We have built an evaluation framework that continuously tests our models against thousands of real insurance questions to detect failure modes such as missing disclaimers or language that risks crossing the advice boundary.
This framework powers the Compliance Risk Index, our benchmark for model safety. It measures performance across four dimensions: evidence and retrieval, policy logic, communication standards, and the detection of vulnerability or complaints. Each dimension is scored at scale, giving us a measurable view of how our models perform over time and against the generalist models available from other providers.
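As a rough illustration, per-dimension scores can be aggregated across an evaluation set along the following lines. The dimension names mirror the ones above, but the checker functions and the 0–1 scoring scale are illustrative assumptions, not our production framework:

```python
from statistics import mean

# Dimensions of the Compliance Risk Index (names from the text; scoring is illustrative).
DIMENSIONS = [
    "evidence_and_retrieval",
    "policy_logic",
    "communication_standards",
    "vulnerability_and_complaints",
]

def score_response(response: str, checks: dict) -> dict:
    """Run one model response through per-dimension checks.

    `checks` maps each dimension to a callable returning a score in [0, 1];
    in practice these would be automated evaluators or rubric-based graders.
    """
    return {dim: checks[dim](response) for dim in DIMENSIONS}

def compliance_risk_index(scored_responses: list[dict]) -> dict:
    """Aggregate per-response scores into a per-dimension benchmark view."""
    return {dim: mean(r[dim] for r in scored_responses) for dim in DIMENSIONS}
```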
2. Explainability and traceability
We believe all model outputs should be explainable. That means that we must be able to attribute model responses to ground truth source material.
There are two broad approaches to model attribution.
- Generate a response, check it for similarity against source material afterwards, and assume that similarity equals attribution.
- While generating a response, observe which internal activations fire, and trace those activations back to the source material.
Nearly everyone uses the first approach because it is simpler and works with closed-source, black-box models. It is not reliable, however: similarity does not always imply attribution.
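For contrast, a minimal sketch of that first approach could look like the following. Lexical overlap stands in for similarity here purely for illustration; real post-hoc systems typically use embedding similarity, but the underlying assumption is the same:

```python
def token_set(text: str) -> set[str]:
    return set(text.lower().split())

def jaccard(a: str, b: str) -> float:
    """Crude lexical similarity between a response sentence and a source chunk."""
    sa, sb = token_set(a), token_set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def post_hoc_attribution(response_sentence: str, source_chunks: list[str],
                         threshold: float = 0.4) -> list[str]:
    """Approach 1: generate first, then treat any sufficiently similar source
    chunk as the 'attribution'. The weakness is baked into the assumption:
    a chunk can look similar without being what the model actually relied on."""
    return [c for c in source_chunks if jaccard(response_sentence, c) >= threshold]
```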
We believe having robust explainability is a key requirement for the safe application of large language models in insurance, and as such have invested in the second approach.
Because we built the model, we have access to its internals during inference, and can link node activations to exact text within policy wordings. This lets us provide industry-leading levels of explainability and traceability.
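In highly simplified form, activation-level tracing can be thought of as reading attribution off the model's own computation rather than inferring it afterwards. The sketch below is a toy illustration under that assumption, using averaged attention over the in-context policy text; a production system works with richer internal signals than raw attention:

```python
import numpy as np

def trace_to_source(attention: np.ndarray, source_tokens: list[str],
                    top_k: int = 5) -> list[tuple[str, float]]:
    """Toy illustration of activation-level tracing, assuming access to model internals.

    `attention` has shape (generated_tokens, source_tokens): how strongly each
    generated token attended to each token of the in-context policy wording.
    We aggregate over the generated tokens and surface the source tokens that
    most influenced the response.
    """
    influence = attention.mean(axis=0)            # average influence per source token
    top = np.argsort(influence)[::-1][:top_k]     # most influential source positions
    return [(source_tokens[i], float(influence[i])) for i in top]
```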
3. Retrieval accuracy at scale
To answer customer queries with correct and relevant information, our models must be able to draw from a dynamic knowledge base with thousands of policy documents and supporting materials.
Traditional retrieval-augmented generation (RAG) systems rely on vector embeddings. However, these embeddings introduce brittleness:
- Chunking strategies must be retuned as documents change
- Embedding models can have blind spots for domain-specific terminology
- Retrieval quality degrades when new policy types are added
Rather than splitting optimisation between a retrieval pipeline and the model, we put all the capability into the model itself. This means we can teach the model to issue precise text-pattern searches over raw policy documents, compose multiple searches to handle complex queries, and, importantly, self-correct when initial searches don't yield useful results.
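A minimal sketch of that search-and-self-correct loop is below. The search tool itself is a plain regex scan over raw documents; `model.propose_pattern` and `model.refine_pattern` are hypothetical stand-ins for the trained model's tool calls, not a real API:

```python
import re

def pattern_search(documents: dict[str, str], pattern: str, window: int = 120) -> list[dict]:
    """Search raw policy documents with a text pattern and return surrounding snippets."""
    hits = []
    for doc_id, text in documents.items():
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            start, end = max(0, match.start() - window), match.end() + window
            hits.append({"doc": doc_id, "snippet": text[start:end]})
    return hits

def answer_with_search(model, question: str, documents: dict[str, str],
                       max_attempts: int = 3) -> list[dict]:
    """Sketch of the loop: the model proposes a pattern, inspects the hits,
    and proposes a revised pattern if nothing useful comes back."""
    pattern = model.propose_pattern(question)
    for _ in range(max_attempts):
        hits = pattern_search(documents, pattern)
        if hits:
            return hits
        pattern = model.refine_pattern(question, failed_pattern=pattern)
    return []
```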
This approach delivers higher retrieval accuracy than embedding-based approaches, even as the number of documents increases.
This means that when customers ask questions, they can be confident that the responses are accurate, even when comparing many policies at once.
4. Hallucination detection and output guardrails
Generative models can sometimes fill gaps with plausible but unsubstantiated information. In insurance, unsubstantiated and potentially false information can lead to customer harm. We need to detect it, and ultimately prevent any hallucinated content from reaching the consumer.
We have trained a secondary model with the sole purpose of detecting hallucinations. It confirms that every piece of information in the output message is present in the source material the model referenced when generating the response.
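In simplified form, the guardrail works like the sketch below. Claim extraction and the support check are deliberately naive stand-ins (sentence splitting and word overlap); the production check is the dedicated trained model described above:

```python
def extract_claims(response: str) -> list[str]:
    """Naive claim extraction: treat each sentence as a claim."""
    return [s.strip() for s in response.split(".") if s.strip()]

def is_supported(claim: str, sources: list[str]) -> bool:
    """Stand-in support check: simple word overlap against the referenced sources.
    The real check judges whether the claim is actually entailed by the sources."""
    claim_words = set(claim.lower().split())
    return any(
        len(claim_words & set(src.lower().split())) / max(len(claim_words), 1) > 0.6
        for src in sources
    )

def guardrail(response: str, sources: list[str]) -> dict:
    """Flag any claim not grounded in the source material the model used."""
    unsupported = [c for c in extract_claims(response) if not is_supported(c, sources)]
    return {"ok": not unsupported, "unsupported_claims": unsupported}
```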
If hallucinated content is detected, the output is flagged and the message can be regenerated or the conversation escalated. These controls keep every response grounded in verifiable, in-context evidence, so consumers can trust the model's outputs.
The result: safe, confident deployment
Together, these mechanisms form the safety harness around our models. They make Open General Insurance Intelligence not just powerful, but safe to deploy directly to customers through products like Insurance Companion, delivering compliant, trustworthy experiences at scale.



