Engineering
Oct 28, 2025

Confidence Scoring with Bayesian Networks

Brain Co. uses a Bayesian Network approach to confidence scoring that enables full automation of the building permit approval process.

James Huang

Technical Staff

The running joke inside Brain Co. is that if you want to extend your deck in the Bay Area, the permits will take 2 years. The first time this joke was told it was 6 months, and the duration has increased by a couple of months every time it's retold. [1] If you are a homeowner, you already understand how painful this aspect of the American dream is, not just in America but all over the world.

Using AI in construction permitting is straightforward:

  • Ingest the documents of a construction project: architectural blueprints, tables, and forms
  • Add safety and zoning regulations to flag violations in minutes, not years
  • Builders and architects can use these capabilities during design to identify violations
  • Governments can use these capabilities to ensure safe construction and enforce consistency where processes may be tribal knowledge

On the government side, we are actively developing a system that can fully automate the approval process. The problem space is complex, and so we are tackling the problem with a land-and-expand approach.

Permitting Automation comes with relatively high consequences and needs high accuracy and graceful degradation out of the gate. Unfortunately, Large Language Models (and neural networks) occasionally (frequently) make stuff up. A common approach for these complex problem spaces is building out additional systems to verify candidate solutions in parallel to the solution generation. From Retrieval-Augmented Generation to DeepMind's OG AlphaCode, we see ML systems incorporate things like re-ranking, filtering, or validation. Confidence scoring can be similarly useful in enhancing the performance of systems that generate solutions, where we threshold out low-confidence decisions. [2]

Architecture: Three-Tier Confidence Composition

Our confidence scoring architecture operates at three hierarchical levels, each consuming outputs from the previous tier along with original inputs. We can use a common zoning regulation, where bedrooms need a window, to illustrate this.

The extraction system is the core ML system that turns unstructured blueprints (images) into structured information (JSON) like windows and rooms. These then form the basis of the probabilities that:

  1. The area in a file is a window (Value Confidence)
  2. The room has a window (Check Confidence)
  3. The application passes all regulations (Application Confidence)
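As a rough sketch of how the three tiers might compose (the min/product combinators and all names here are illustrative assumptions, not the production system):

```python
from math import prod

# Illustrative sketch of the three-tier composition: value confidences
# feed check confidence, and check confidences feed application
# confidence. The min/product combinators are simplifying assumptions.

def check_confidence(value_confidences: list[float]) -> float:
    """A check is only as certain as the values it depends on."""
    return min(value_confidences)

def application_confidence(check_confidences: list[float]) -> float:
    """Every regulation must hold, so combine checks multiplicatively
    (treating them as roughly independent)."""
    return prod(check_confidences)

window_and_room = [0.98, 0.95]                       # Level 1 outputs
bedroom_check = check_confidence(window_and_room)    # Level 2: ≈ 0.95
app = application_confidence([bedroom_check, 0.99])  # Level 3: ≈ 0.94
```

The combinators matter: a pessimistic `min` at the check level keeps one shaky extraction from being averaged away, while the product at the application level makes application confidence fall as the number of checks grows, which is the behavior we want before removing humans from the loop.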

Level 1: Value Extraction Confidence

SoTA vision models provide a confidence score that combines existence confidence and class confidence. These outputs are used in conjunction with other signals to generate the confidence that an object is a bedroom and that an object is a window:

  • Object properties (e.g. thin, low-contrast lines may give less confidence)
  • Application file properties (e.g. blurry files might give less confidence)
  • LLM as a judge (e.g. ask for additional scoring based on a rubric)

In the long run, we want to lean on any production data flywheels and signals as opposed to hand-crafted features.
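One minimal way to blend such signals is a hand-weighted logistic model. The feature names, weights, and bias below are assumptions for illustration, standing in for the learned models a production data flywheel would eventually replace them with:

```python
import math

def value_confidence(detector_score: float,
                     line_contrast: float,
                     file_sharpness: float,
                     llm_judge_score: float) -> float:
    """Blend the vision model's score with auxiliary signals.

    All inputs are in [0, 1]; weights and bias are hand-set for
    illustration (the bias makes all-0.5 inputs map to 0.5).
    """
    z = (3.0 * detector_score      # the detector's own confidence dominates
         + 1.0 * line_contrast     # thin, low-contrast lines lower this
         + 1.0 * file_sharpness    # blurry files lower this
         + 1.5 * llm_judge_score   # LLM-as-judge rubric score
         - 3.25)
    return 1.0 / (1.0 + math.exp(-z))
```

Crisp lines in a sharp file with a confident detector push the score toward 1; a blurry scan drags it down even when the detector itself is confident.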

Level 2: Check-Level Confidence

In our co-pilot application, lower confidence checks are flagged for human attention. This level of transparency helps our AI systems build trust with end users.

Value Confidence forms the input basis of the Bayesian system, along with other application properties. Examples include:

  • Data Distribution (e.g. windows are usually not 100 meters wide)
  • Application metadata (professional architect vs. homeowner submission)
  • Check-specific Bayesian properties (e.g. a room only needs 1 window, so every additional window detected increases the probability that the room is compliant)
  • Some understanding of check thresholds and tolerances (e.g. if a safety railing needs to be at least 1.2 meters tall and it measures 1.40 meters, there's some buffer and it's likely compliant). See https://brain.co/blog/blueprint-information-extraction
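Two of the properties above can be made concrete. If each detected window is independently real with probability p_i, the room is compliant with probability 1 − Π(1 − p_i), so every extra detection helps; and a measurement with buffer above a threshold can be scored under a Gaussian noise model (the noise level here is an illustrative assumption):

```python
from statistics import NormalDist

def at_least_one_window(window_confidences: list[float]) -> float:
    """P(room has >= 1 real window) given per-detection confidences."""
    p_none = 1.0
    for p in window_confidences:
        p_none *= 1.0 - p
    return 1.0 - p_none

def measurement_compliant(measured: float, minimum: float,
                          noise_sigma: float) -> float:
    """P(true value >= minimum), assuming Gaussian measurement noise."""
    return 1.0 - NormalDist(mu=measured, sigma=noise_sigma).cdf(minimum)

at_least_one_window([0.8])              # 0.8
at_least_one_window([0.8, 0.7])         # ≈ 0.94: a second candidate helps
measurement_compliant(1.40, 1.2, 0.1)   # ≈ 0.977: the buffer buys confidence
```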

Level 3: Application-Level Confidence

When all confidence scores for all regulation checks are high, we can mark the application as fully automatable, giving our partners the option to completely remove humans from the loop.
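A minimal version of that routing decision might look like the following. The threshold value and names are assumptions; in practice the threshold would be set per partner and per risk tolerance:

```python
AUTO_THRESHOLD = 0.98  # illustrative; tuned per deployment in practice

def route_application(check_confidences: dict[str, float]) -> tuple[str, list[str]]:
    """Fully automate only when every check clears the threshold;
    otherwise return the checks that need a human reviewer."""
    flagged = sorted(name for name, conf in check_confidences.items()
                     if conf < AUTO_THRESHOLD)
    if flagged:
        return "human_review", flagged
    return "auto_approve", []
```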

Evals, Evals, Evals

How do we know our confidence scores are any good? Confidence scores for the confidence scores? A useful system will generate high confidence scores when accuracy is high, and lower confidence scores when accuracy is lower. A natural starting place is using a correlation metric like Pearson or Spearman. However, on certain values or regulations our ML system has very high accuracy across the board and there is little to zero accuracy variance. In these cases correlation as a metric breaks down.

An Alternative to Turtles: Confidence Scores All the Way Down

If confidence scores are constrained to map directly to accuracy, then error or loss scoring like Expected Calibration Error or RMSE variants can be used. In practice we relax this constraint and use a precision-blended scoring system to measure correlation when there is variance and accuracy when there is not.

Production Deployment Insights

Part of delivering value on the application level involves targeted inference/engineering strategies beyond just ML Evals. To process permits faster we aim to batch/parallelize as much of the computation as possible based on dependency flows. We also add caching at all levels of inference, both for value extraction and confidence scoring. ML datasets also never fully cover the long tail of production edge cases: empty applications, corrupted files, conflicting information, overlapping text. Handling these edge cases thoughtfully is often the difference between a demo and a product people actually trust.
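As one hedged example of the caching idea, extraction results can be keyed on a content hash of the file so byte-identical resubmissions skip inference entirely. The function names here are hypothetical, not the production API:

```python
import hashlib

_extraction_cache: dict[str, dict] = {}

def extract_with_cache(file_bytes: bytes, run_extraction) -> dict:
    """run_extraction is the expensive ML call (hypothetical); its result
    is reused for byte-identical files such as resubmitted blueprints."""
    key = hashlib.sha256(file_bytes).hexdigest()
    if key not in _extraction_cache:
        _extraction_cache[key] = run_extraction(file_bytes)
    return _extraction_cache[key]
```

The same pattern applies one level up: confidence scores can be cached against the hash of the extracted JSON they were computed from, so only the stages downstream of an actual change re-run.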

Future Directions

  • “Closing the loop” with production activity: using human reviewer signals and comments to inform online learning and improved confidence scoring systems
  • “Closing the loop” with confidence scoring: active learning guided by confidence scores to prioritize labeling of edge cases or detect distribution drift.
  • Platform Expansion: applying custom compliance frameworks onto unstructured data for different municipalities or verticals. One system to verify everything.

Brain Co. partners with institutions in Government, Energy, and Healthcare, among others. If you are interested in deploying high-impact AI into society (and then writing promotional blog posts about it), we’d love to hear from you.

[1]  A median housing-related project in SF last year spent 289 days in Planning review; 259 days with DBI; 137 with DPW; 43 with PUC; and for those that needed Fire Department review, another 127. The median length of time was over 620 days.

https://thefrisc.com/how-long-it-really-takes-to-get-a-building-permit-in-san-francisco-and-why-7f00dac3bf79/

[2] Several threads of existing work:

Selective Classification for Deep Neural Networks [Geifman & El-Yaniv, 2017]

On Calibration of Modern Neural Networks [Guo et al., 2017]

Competition-Level Code Generation with AlphaCode [Li et al., 2022]