LLM-Generated Rules Engines: Executable IF-THEN Logic for LLM Explainability in Regulated Industries
A technique to make LLM decision making auditable when processing unstructured documents.
When developing AI solutions for large enterprises and governments, we need our applications to make explainable, auditable decisions. By this we mean taking a source of truth, whether that's building codes in construction, coverage policies in insurance, or treatment needs in healthcare, and checking whether a submission complies with the rules. The issue is that the base source of truth is rarely structured in a way that is easily understood and decomposed into checks that can be evaluated individually. Instead, humans often rely on shortcuts and personal heuristics developed over decades of experience when reviewing submissions and applying these guidelines. This presents challenges to automation, but also affects the consistency of the checks themselves.
The nondeterministic nature of LLMs makes it challenging to achieve the accuracy and interpretability needed to make critical decisions. Instead, what we need is business logic or a decision tree: a set of rules that is mutually exclusive (no overlap) and collectively exhaustive (covers all scenarios)–in other words, MECE. This ensures checks are unambiguous and consistent.
We show a technique to make LLM decision-making auditable when processing unstructured documents through what we call a Rules Engine. The goal is to remove unexplained variance when applying LLMs across different scenarios, regions, or use cases. Our applications can then leverage LLMs for crucial business processes while providing the transparency and more deterministic behavior needed for both regulatory compliance and internal governance.
Below, we demonstrate how we do this in practice.
At Brain Co., we’ve leveraged this rules engine approach across multiple clients and verticals including healthcare, permitting, and construction, but for the purpose of this post, we’ll return to its origin–insurance policy verification.
To reliably extract MECE sets of rules from plans, codes, or any other unstructured document, we take the following steps:
To begin, we extract text from the PDF page by page before deciding on a text segmentation strategy to yield section-sized units (e.g., 5.1, 5.2). We tested the pdfplumber, pymupdf, and docling packages for the initial extraction pass.
Segmentation is necessary for this task to break text into sections large enough to preserve document context but small enough that an LLM can subsequently break each section into MECE rules. As with other long-context tasks, LLM performance degrades when a model is asked to create too many rules at once. In most commercial use cases, entire guidelines will be too large for a single context window.
When headings and guidelines are clean, we use docling alongside deterministic regex and inspect the outputs to evaluate performance. However, for messy layouts with smaller documents, we fall back to pymupdf for initial plain-text extraction and LLM-assisted segmentation. We recommend evaluating both regex- and LLM-based segmentation on a sample of your documents before committing to one for this step; a sketch of the regex path is shown below.
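Here is a minimal example using pymupdf for the plain-text pass. The heading pattern, section-number convention, and function names are assumptions for this sample document rather than a fixed implementation.

```python
import re
import fitz  # pymupdf

# Assumed heading convention: numbered headings like "5.1 Trip Cancellation Coverage".
HEADING_PATTERN = re.compile(r"^(?P<id>\d+(?:\.\d+)+)\s+(?P<title>.+)$", re.MULTILINE)

def extract_sections(pdf_path: str) -> list[dict]:
    """Extract plain text page by page, then split it into section-sized units."""
    doc = fitz.open(pdf_path)
    full_text = "\n".join(page.get_text() for page in doc)

    sections = []
    headings = list(HEADING_PATTERN.finditer(full_text))
    for i, heading in enumerate(headings):
        end = headings[i + 1].start() if i + 1 < len(headings) else len(full_text)
        sections.append({
            "section_id": heading.group("id"),
            "title": heading.group("title").strip(),
            "text": full_text[heading.start():end].strip(),
        })
    return sections
```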
Note: In this post we use an Allianz policy document as the running sample for artifacts/examples.
With the strategy chosen, we extract and normalize the text for each section and attach useful metadata (section id, coverage type, page spans). In our insurance example, this yields a list of policy provisions of the form shown in the artifact below.
Example artifact:
{
  "filename": "allianz_insurance.pdf",
  "policy_provisions": [
    {
      "section_id": "5.1",
      "provision_text": "Trip Cancellation Coverage only applies before you have left for your trip. If your trip is cancelled or rescheduled for a covered reason listed below, we will reimburse you (less available refunds) for your non-refundable trip payments, deposits, and any reasonable and customary related service fees charged by your travel supplier, up to the maximum benefit for Trip Cancellation Coverage listed in your Coverage Summary. If you prepaid for shared accommodation and your travelling companion cancels their trip due to one or more of the covered reasons listed below, we will reimburse any additional accommodation fees you are required to pay. Important: You must notify all of your travel suppliers within 72 hours of discovering that you will need to cancel your trip (this includes being advised to cancel your trip by a doctor). If you notify any travel suppliers later than that and get a smaller refund as a result, we will not cover the difference. If a serious illness, injury, or medical condition prevents you from being able to notify your travel suppliers within that 72 hour period, you must notify them as soon as you are able. You must check the General Exclusions section for exclusions which may apply.",
      "coverage_type": "Trip Cancellation",
      "benefit_limits": {},
      "rule_logic": []
    },
    ...
  ],
  "total_provisions": 20,
  "processing_stage": "pdf_parsed",
  "extraction_metadata": {
    "engine": "pymupdf",
    "pages_processed": 17,
    "total_pages": 17
  }
}
Some checks span multiple sections. This retrieval step finds the relevant referenced sections and brings them into the model context. For example, suppose Section 1 defines plan tiers (Silver/Gold/Platinum) while Section 4 defines natural disaster coverage (e.g., earthquakes). We currently retrieve these pointers through regex matching of references to tables, sections, and other important keywords, e.g., "see Section 4," "Definitions," "Exclusions."
When a section contains pointers or relies on global concepts, we attach a bounded amount of that referenced text to the prompt for that section. We keep the attached snippets in a context_map field for the policy section and label each snippet to preserve the reference context during subsequent rule generation. The context_map for the earlier 5.1 example is shown below:
"context_map": {
"section:general_exclusions": "GENERAL EXCLUSIONS The General Exclusions apply to each coverage. An “exclusion” is something that is not covered by this insurance policy, and if an exclusion applies to your claim, no payment is available to you. This policy does not provide coverage for any loss that results directly or indirectly from or that is related to any of the following: 1. Things you were aware of Any loss, condition, or event that was known, foreseeable, intended, or expected when your policy was purchased. 2. Pre-existing medical conditions a) Your pre-existing medical condition(s), including any complications attributable to those condition(s); b) Pre-existing medical condition(s) of your travelling companion including any complications attributable to those condition(s); c) Pre-existing medical condition(s) of your family members including any complications attributable to those condition(s). 3. Travelling for medical treatment You travelling with the intention to receive health care, medical treatment, or dental treatment of any kind while on your trip. 4. Travelling against medical advice You travelling with the intention to receive health care, medical treatment, or dental treatment of any kind while on your trip."
}
Example reference pattern
import re

# Matches numbered cross-references such as "Section 4" or "section 5.1".
REF_PATTERN = re.compile(r"\b[Ss]ection\s+(\d+(?:\.\d+)*)\b")

def find_cross_refs(section_text: str) -> set[str]:
    """Return the set of section ids referenced in a block of text."""
    return {m.group(1) for m in REF_PATTERN.finditer(section_text)}
Regex matching is lightweight and deterministic, and customizing these keywords for each use case is typically sufficient to attach the relevant sections. However, we can also introduce LLMs to clean the attached section text or to dynamically search for references via tool use when necessary (see Future Enhancements: LLMs and Tool Use for Retrieval Augmented Generation). A sketch of how we bound and label the attached snippets follows.
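The sketch below shows one way to build the context_map with a per-snippet length cap. MAX_SNIPPET_CHARS, KEYWORD_SECTIONS, and the lookup dictionaries are illustrative assumptions, not values from our production system; it reuses find_cross_refs from the pattern above.

```python
# Illustrative context_map builder; constants and dictionary shapes are assumptions.
MAX_SNIPPET_CHARS = 2000
KEYWORD_SECTIONS = {
    "General Exclusions": "section:general_exclusions",
    "Coverage Summary": "section:coverage_summary",
}

def build_context_map(section_text: str,
                      sections_by_id: dict[str, str],
                      sections_by_name: dict[str, str]) -> dict[str, str]:
    context_map = {}
    # Numbered cross-references, e.g. "see Section 4.3"
    for ref_id in find_cross_refs(section_text):
        if ref_id in sections_by_id:
            context_map[f"section:{ref_id}"] = sections_by_id[ref_id][:MAX_SNIPPET_CHARS]
    # Named references to global concepts, e.g. "General Exclusions"
    for keyword, label in KEYWORD_SECTIONS.items():
        if keyword.lower() in section_text.lower() and keyword in sections_by_name:
            context_map[label] = sections_by_name[keyword][:MAX_SNIPPET_CHARS]
    return context_map
```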
For each portion of the extracted text, we instruct the model to extract IF-THEN rules that are mutually exclusive (no overlap in rules) and collectively exhaustive (covering all scenarios). The prompt defines the task, a strict output format, and examples to calibrate model output.
In this processor, we use Instructor and Pydantic to enforce the desired schemas (a sketch of this wiring follows the prompt). We deploy the following prompt, which specifies the desired output and the definition of MECE rules, giving the model specifications similar to those we provide to human labelers:
You are an insurance expert tasked with converting insurance policy documents into precise, machine-readable IF-THEN rules for insurance claim processing systems.
**POLICY PROVISION TO PROCESS:**
Section: {provision.section_id}
Text: {provision.provision_text}
**CORE REQUIREMENTS:**
1. Extract all distinct IF-THEN rules from this policy provision
2. Each rule must be directly traceable to the source text - no hallucinations or inferences
3. Preserve the exact insurance language and conditional requirements from the policy
4. Maintain benefit limits, time constraints, and insurance terminology as specified
**INSURANCE ACCURACY STANDARDS:**
- **Preserve Coverage Language**: Use exact modal verbs from the policy
- "will reimburse" → coverage provided
- "may cover" → conditional coverage
- "will not pay" → exclusion
- **Maintain Insurance Precision**: Include specific benefit limits, time constraints, and eligibility criteria
- **Traceability**: Each rule should be verifiable against the original policy text
- **No Insurance Inference**: Do not add coverage logic not explicitly stated in the provision
**RULE FORMAT:**
Each rule must follow this structure:
- IF [specific claim condition/circumstance with precise criteria] THEN [exact coverage action from policy]
**EXAMPLES OF CORRECT EXTRACTION:**
- "IF traveler has [covered event] AND [eligibility criteria met] THEN [reimbursement action]"
- "IF traveler has [covered event] AND [notification timeframe met] THEN [benefit payment]"
- "IF traveler has [exclusion condition] THEN [coverage denied]"
**QUALITY CRITERIA:**
- **Correctness**: Rule content must exactly match the insurance policy without omission or addition
- **Completeness**: Include all relevant conditions and actions from the source provision
- **Consistency**: Use consistent terminology and formatting aligned with insurance standards
- **Clarity**: Each rule should be unambiguous and implementable by claims processors
**CRITICAL GUIDELINES:**
- Only extract rules explicitly supported by the provision text
- **ATOMIC DECOMPOSITION**: Break complex provisions into separate, atomic rules - each rule should test ONE specific condition and trigger ONE specific action
- **NO COMPOUND CONDITIONS**: Avoid "earliest of", "any of", or multiple OR conditions in a single rule - create separate rules instead
- Include specific benefit amounts, time limits, and eligibility criteria as stated
- Do not combine unrelated coverage decision points
- Maintain the insurance nuance and conditional language of the original policy
- **TESTABILITY**: Each rule should be independently testable and implementable by claims processing systems
**ATOMIC DECOMPOSITION EXAMPLE:**
**Input Provision:**
Section: 5.1
Text: If you have to cancel your trip before you depart due to illness, we will reimburse you for your non-refundable trip payments up to the maximum benefit listed in your Coverage Summary, provided you notify us within 72 hours of discovering you need to cancel.
**Expected Output:**
```json
{{
"rules": [
{{"rule": "IF traveler has to cancel trip before departure due to illness AND notifies insurer within 72 hours of discovering need to cancel, THEN reimburse non-refundable trip payments up to maximum benefit."}},
{{"rule": "IF traveler cancels trip before departure due to illness AND fails to notify within 72 hours, THEN no reimbursement provided."}}
],
"reasoning": "The provision contains explicit conditional logic with specific triggering events (illness, pre-departure cancellation), notification requirements (72 hours), and coverage outcomes (reimbursement vs denial)."
}}
```
**ATOMIC DECOMPOSITION FOR COMPLEX CONDITIONS:**
**Input Provision:**
Section: 6.2
Text: Coverage will end on the earliest of: (1) reaching final destination, (2) declining to continue travel, or (3) arriving at medical facility.
**CORRECT Atomic Output:**
```json
{{
"rules": [
{{"rule": "IF traveler reaches final destination, THEN coverage ends."}},
{{"rule": "IF traveler declines to continue travel when able, THEN coverage ends."}},
{{"rule": "IF traveler arrives at medical facility for further care, THEN coverage ends."}}
],
"reasoning": "Each termination condition is separated into an independent, testable rule."
}}
```
**INCORRECT Compound Output (DO NOT DO THIS):**
```json
{{
"rules": [
{{"rule": "IF coverage ends on the earliest of: reaching final destination, declining to continue travel, or arriving at medical facility, THEN coverage ends."}}
]
}}
```
**IMPORTANT**: Most policy provisions contain extractable conditional logic. Even general coverage statements can be decomposed into actionable IF-THEN rules by identifying the triggering conditions and required insurance actions. Only return an empty list if the provision truly contains no actionable content.
Extract the IF-THEN rules from the policy provision above, ensuring each rule is precise, traceable, and insurance-accurate.
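As a minimal sketch of how this prompt could be wired up with Instructor and Pydantic (the model name, class names, and message layout are illustrative assumptions, not our exact production code):

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

class Rule(BaseModel):
    rule: str  # one atomic IF-THEN statement

class RuleExtraction(BaseModel):
    rules: list[Rule]
    reasoning: str

client = instructor.from_openai(OpenAI())

def extract_rules(provision, context_map: dict[str, str], prompt_template: str) -> RuleExtraction:
    # prompt_template is the prompt shown above; {provision.section_id} and
    # {provision.provision_text} are filled from the parsed provision object.
    prompt = prompt_template.format(provision=provision)
    context_block = "\n\n".join(f"[{label}]\n{text}" for label, text in context_map.items())
    return client.chat.completions.create(
        model="gpt-4.1",  # swap in whichever generation model you are evaluating
        response_model=RuleExtraction,  # Instructor validates the response against this schema
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": context_block or "No referenced context."},
        ],
    )
```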
We break these calls down into the section-level segments discussed above, so models are not overwhelmed by the entire PDF while still seeing all the context needed to generate that section's rules. A one-shot prompt over the text of an entire PDF is often too much for models to handle, causing either an error or too few rules to be collectively exhaustive. With smaller chunks and/or without a context map, however, models may not have enough context to capture references to other parts of a document.
The dynamic attachment of these retrieved references from Step 3 therefore helps models combine necessary text from other sections in the subsequent process of rule creation. For example, suppose an insurance document has the following section excerpts:
Section 1.2–Plan Tiers
Gold: up to $10,000 for covered events.
Platinum: up to $20,000 for covered events.
Section 4.3–Earthquake Coverage
If a trip is impacted by an Earthquake, coverage applies subject to plan limits in Section 1.2.
By including Section 1.2 text in the context map for Section 4.3, we can therefore extract the following policy provisions:
{
  "policy_provisions": [{
    "section_id": "4.3",
    "coverage_type": "Earthquake",
    "rule_logic": [
      {"rule": "IF Tier = Gold AND Event = Earthquake THEN Coverage Limit = $10,000"},
      {"rule": "IF Tier = Platinum AND Event = Earthquake THEN Coverage Limit = $20,000"}
    ]
  }],
  "processing_stage": "rules_extracted"
}
Initially, we contracted human subject matter experts (SMEs) to generate the first few sets of rules. However, their output was too slow and often not accurate enough. Specifically, the rules they generated tended not to be truly MECE: labelers generally did not produce truly exhaustive rule sets and sometimes missed ways that rules could be further broken down into atomic statements. Overall, we found that LLMs produced more rules and generally had better single-shot accuracy than the SMEs.
To evaluate whether the LLM-based approach produces a complete and accurate rule set, as well as to audit the initial SME labels, we developed a three-stage evaluation system:
First, within each policy section, we look for exact or fuzzy matches between the human-labeled rules and the model's predicted rules. We normalize the text and use TheFuzz to calculate Levenshtein-based similarity; very high similarity is treated as an exact match (e.g., ≈0.95), otherwise as a fuzzy match if it meets the configured threshold (e.g., a default of ≈0.8). This is a deterministic, algorithmic pass intended to quickly catch paraphrases and near-duplicates; for each gold rule we take the single best available prediction in that section and consume it to prevent duplicate matches. We recommend tuning your thresholds by directly inspecting the matched rules, as the right threshold will vary with the rules being produced.
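A minimal sketch of this first pass, assuming TheFuzz's 0-100 ratio scale and greedy best-match consumption (the threshold values and function names are illustrative):

```python
from thefuzz import fuzz

EXACT_THRESHOLD = 95  # ~0.95 similarity on TheFuzz's 0-100 scale
FUZZY_THRESHOLD = 80  # ~0.8 similarity

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def match_stage_one(gold_rules: list[str], predicted_rules: list[str]) -> list[tuple[str, str, str]]:
    """Greedy best-match pass within a section; each prediction is consumed at most once."""
    matches = []
    remaining = list(predicted_rules)
    for gold in gold_rules:
        if not remaining:
            break
        score, best = max(
            (fuzz.ratio(normalize(gold), normalize(pred)), pred) for pred in remaining
        )
        if score >= EXACT_THRESHOLD:
            matches.append((gold, best, "exact"))
            remaining.remove(best)
        elif score >= FUZZY_THRESHOLD:
            matches.append((gold, best, "fuzzy"))
            remaining.remove(best)
    return matches
```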
Second, for any gold rule (SME-labelled) still unmatched, we compute semantic similarity using OpenAI’s text-embedding-3-small embeddings with cosine similarity. For each gold rule we select the single best remaining prediction and accept it if it exceeds the semantic threshold (e.g., default ≈0.7). This is an algorithmic step (no humans; the LLM is used only as an embedding backend) designed to recover semantically equivalent rewrites that string matching can miss.
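A sketch of the second pass, using OpenAI's embeddings endpoint and cosine similarity (the threshold and helper names are illustrative):

```python
import numpy as np
from openai import OpenAI

embedding_client = OpenAI()
SEMANTIC_THRESHOLD = 0.7  # illustrative default

def embed(texts: list[str]) -> np.ndarray:
    response = embedding_client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

def best_semantic_match(gold_rule: str, remaining_predictions: list[str]) -> tuple[str, float] | None:
    """Return the best remaining prediction for a gold rule if it clears the semantic threshold."""
    if not remaining_predictions:
        return None
    vectors = embed([gold_rule] + remaining_predictions)
    gold_vec, pred_vecs = vectors[0], vectors[1:]
    sims = pred_vecs @ gold_vec / (np.linalg.norm(pred_vecs, axis=1) * np.linalg.norm(gold_vec))
    best_idx = int(np.argmax(sims))
    if sims[best_idx] >= SEMANTIC_THRESHOLD:
        return remaining_predictions[best_idx], float(sims[best_idx])
    return None
```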
Lastly, any remaining unmatched predictions are evaluated by an ensemble of LLM judges with majority voting (e.g., claude‑sonnet, gpt‑4.1, llama‑3.3‑70b), excluding the generation model to avoid preference leakage–the tendency for a model to favor its own outputs. Judges classify each LLM-generated rule as a) matched to one in the labelled set, b) redundant or unnecessary due to a clash with an existing rule, or c) distinct and part of the MECE set, but missed in the initial SME-labeled dataset. Rules judged as distinct do not count against model performance in subsequent precision metrics. This automated LLM step resolves edge cases and separates overlaps from true novelties that may have been missed in the imperfect human-labeled evaluation dataset.
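The majority-vote logic itself is simple; the sketch below leaves each judge as a callable so the provider-specific API calls (Anthropic, OpenAI, or a hosted Llama endpoint) stay out of the voting code. The tie-breaking choice is an assumption:

```python
from collections import Counter
from typing import Callable, Literal

Verdict = Literal["matched", "redundant", "distinct"]
# Each judge classifies one predicted rule against the gold set for its section.
Judge = Callable[[str, list[str]], Verdict]

def ensemble_verdict(predicted_rule: str, gold_rules: list[str], judges: list[Judge]) -> Verdict:
    """Majority vote across judge models; without a majority we fall back to 'redundant'
    so borderline rules never inflate the model's credit."""
    votes = Counter(judge(predicted_rule, gold_rules) for judge in judges)
    top_verdict, top_count = votes.most_common(1)[0]
    return top_verdict if top_count > len(judges) // 2 else "redundant"
```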
In an analysis of an Allianz insurance document with 69 rules extracted by the subject matter experts, we observed that better-performing models both created more rules overall and were more likely to recall the initial rules identified by the subject matter experts. Crucially, the LLM judge classified very few rules generated by these models as redundant, instead classifying them as distinct rules missed by the initial human-crafted rule set. When subject matter experts audited these judgments, they agreed with the LLM judge decisions. This suggests that the best-performing models are those with both a high F1 score relative to the SME-labeled dataset and the largest number of non-redundant generated rules.
Focusing on the GPT-5 model family, we observe that both performance and the number of generated rules increase with model size. Notably, all reasoning and hybrid-reasoning models generated more rules than the non-reasoning GPT-4.1 model, the only model that generated fewer rules than the SME-labeled 'gold' dataset.
We can then use these results and error analysis to improve rule generation overall through an AI-assisted ‘benevolent dictatorship’ model, where a single subject matter expert audits the generated rule output from a top-performing LLM to use as the gold standard for subsequent use cases.
We find that this approach works when scaled up, and holds across industries, regions and use-cases, offering broad applicability. For example, in Permitting, building codes are highly varied across regions, but the general approach of transforming unstructured human-written guidelines into executable MECE rules applies to all of them. Similarly, across many industries, especially those with high levels of regulation (e.g., Healthcare, Financial Services, Government), there are countless examples of unstructured guidelines which are applied across myriad use cases.
In each case, our approach significantly reduces the cost and improves the accuracy of codifying guidelines into MECE rules. Those rules can be audited once by an SME and then applied consistently, with complete auditability, allowing LLMs to be deployed much more widely without the risk of 'black box' errors or inconsistencies.
Once the final enriched set of rules is compiled, we are ready to use it to evaluate real-world scenarios. Below we show the key advantage of this approach versus naive LLM assessments.
The example below is a mock insurance claim being applied to an Allianz Insurance policy document.
The input details provided are a description of the policy holder's claim, along with metadata including their plan tier and coverage limits. We want the LLM to assess whether or not the policy holder is covered, and if so, for what value.
Scenario = {
  "description": "At departure, airline denied boarding after gate staff suspected a contagious condition (visible rash). Local clinic later cleared traveller. Total delay: 10 hours. Receipts uploaded for meals and ground transport during the delay.",
  "plan": {
    "plan_type": "Silver",
    "section_limits": {"5.3": 1200},
    "minimum_required_delay_hours": 6,
    "daily_limits": {
      "with_receipts": 150,
      "without_receipts": 80
    }
  },
}
In the case below, when given the policy document and the policy holder's scenario details, the model determines that they are covered for their daily limit up to the policy maximum, which we can infer from the plan details is $150 per day when receipts are provided. However, how the model reached this determination is opaque, making it hard to trust or verify and increasing the likelihood of inconsistent policy determinations across policy holders depending on textual variation in the claim details or the model used.
Model output:
{
  "COVERED": true,
  "COVERAGE_AMOUNT": "DAILY_LIMIT_UP_TO_POLICY_MAX",
  "POLICY_SECTION": "5.3"
}
You might be wondering: couldn't we simply have the model share the relevant text from the policy section in the naive approach? In theory, yes, but in practice it will often cite different rules each time since there isn't a fixed source of truth it's pulling from.
When using the generated rules, we see the same outcome for the policy holder (covered for the daily limit up to the policy max). However, we can now see that the model applied three rules: one to determine whether coverage applied, and two more to determine the coverage limit.
Model output:
{
  "COVERAGE": true,
  "COVERAGE_AMOUNT": "DAILY_LIMIT_UP_TO_POLICY_MAX",
  "RULES": [
    "IF the trip is delayed because a travel carrier denies you or a travelling companion boarding based on a suspicion that you or a travelling companion has a contagious medical condition (including an epidemic or pandemic disease such as COVID-19), THEN the delay is for a covered reason under Travel Delay Coverage.",
    "IF your trip is delayed for one of the covered reasons listed below AND your travel delay is at least the Minimum Required Delay listed in your Coverage Summary, THEN we will reimburse you for your lost prepaid trip expenses and additional expenses you incur while and where you are delayed for meals, accommodation, communication, and transport, less available refunds, up to the maximum benefit shown in your Coverage Summary for Travel Delay, subject to a daily (24 hours) limit listed in your Coverage Summary.",
    "IF you provide receipts for expenses reimbursed under meals, accommodation, communication, and transport in Travel Delay Coverage, THEN the With Receipts Daily limit listed in your Coverage Summary applies."
  ]
}
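A minimal sketch of how this rule-backed evaluation call might look, reusing the Instructor pattern from rule extraction (the field names mirror the example output above; the prompt wording and model name are assumptions):

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel

eval_client = instructor.from_openai(OpenAI())

class ClaimDetermination(BaseModel):
    COVERED: bool
    COVERAGE_AMOUNT: str
    RULES: list[str]  # the exact engine rules the model applied, quoted verbatim

def evaluate_claim(scenario: dict, rules: list[str]) -> ClaimDetermination:
    rules_block = "\n".join(f"- {rule}" for rule in rules)
    return eval_client.chat.completions.create(
        model="gpt-4.1",
        response_model=ClaimDetermination,
        messages=[
            {"role": "system", "content": (
                "You are a claims processor. Decide coverage using ONLY the rules provided. "
                "Quote every rule you applied, verbatim, in the RULES field."
            )},
            {"role": "user", "content": f"RULES:\n{rules_block}\n\nCLAIM SCENARIO:\n{scenario}"},
        ],
    )
```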
In our rule engine, Step 3 is analogous to the retrieval step in retrieval augmented generation (RAG). This retrieved information becomes useful context when subsequently generating the If-Then rules in Step 4. Notably, however, the current system for identifying cross references relies on regex and pre-defined length caps for context.
For more complex documents, LLMs could be used to extract and summarize the relevant parts of a longer referenced section. Alternatively, we could allow an LLM to dynamically search for less explicit references through tool-use calls. For this latter case, we could expose a document search tool (e.g., grep) during rule generation that the model can call to read other parts of a document when generating rules for a given section.
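A sketch of what such a tool might look like, assuming an OpenAI-style function-calling interface (the tool schema, names, and window size are illustrative):

```python
import re

# Hypothetical tool definition the model could call during rule generation.
SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "search_document",
        "description": "Search the full policy document for a keyword or section id and return matching passages.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}

def search_document(full_text: str, query: str, window: int = 300) -> list[str]:
    """Grep-like lookup: return a bounded window of text around each case-insensitive match."""
    passages = []
    for m in re.finditer(re.escape(query), full_text, flags=re.IGNORECASE):
        start, end = max(0, m.start() - window), m.end() + window
        passages.append(full_text[start:end])
    return passages
```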
While we currently trace rules at the provision level, production deployment requires more granular mapping. Each extracted clause should link directly to its source location in the PDF, including page numbers and bounding boxes. This enhancement will significantly improve the audit experience and allow reviewers to quickly verify rule accuracy against source material.
Our current judge ensemble uses simple majority voting, which treats all models equally. We plan to calibrate these outputs against small human validation sets to learn per-model weights. Models that consistently align with human judgment on specific rule types should receive higher weights for those categories.
Currently, we have a subject matter expert acting as a "benevolent dictator" to conduct error analysis, evaluate the output of our three-stage evaluation runs, and improve the gold standard. To scale this and make the process more efficient, we could use LLMs to prioritize high-value labeling tasks for human input, surfacing the cases where expert review matters most.
Validated rules will continuously feed back into the golden set, creating a virtuous cycle of improvement.
Insurance policies and similar documents evolve over time. We're building first-class support for comparing extracted rules across document versions, automatically generating change logs that highlight additions, modifications, and deletions. This feature will streamline the review process when policies update and provide clear audit trails for regulatory compliance.
This approach offers a practical solution to one of the most pressing challenges in enterprise LLM usage: the need for explainable, auditable decision-making. By transforming unstructured documents into MECE rule sets, organizations can deploy LLMs for critical business processes while maintaining the transparency required for regulatory compliance and internal governance. The technique's strong performance demonstrates that auditability doesn't have to come at the cost of accuracy. As businesses increasingly rely on AI for document-heavy workflows like insurance claims processing, contract review, and compliance checking, this rule-based intermediary layer provides the accountability trail that risk-conscious enterprises demand.