Confidence Scoring with Bayesian Networks
Brain Co. uses a Bayesian Network approach to confidence scoring that enables full automation of the building permit approval process.

Technical Staff

The running joke inside Brain Co. is that if you want to extend your deck in the Bay Area, the permits will take two years. The first time the joke was told it was six months, and the duration has grown by a couple of months with every retelling. [1] If you are a homeowner, you already understand how painful this part of the American dream is, and not just in America but all over the world.
Using AI in construction permitting is straightforward.
On the government side, we are actively developing a system that can fully automate the approval process. The problem space is complex, so we are tackling it with a land-and-expand approach.

Permitting automation comes with relatively high consequences: it needs high accuracy and graceful degradation out of the gate. Unfortunately, Large Language Models (and neural networks in general) occasionally (frequently) make stuff up. A common approach in these complex problem spaces is to build additional systems that verify candidate solutions in parallel with solution generation. From Retrieval-Augmented Generation to DeepMind’s OG AlphaCode, we see ML systems incorporate re-ranking, filtering, or validation. Confidence scoring is similarly useful for enhancing systems that generate solutions: we threshold out low-confidence decisions. [2]
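As a concrete illustration of that thresholding step, the routing logic can be as small as the sketch below; the names and the 0.95 cutoff are illustrative assumptions, not our production values.

```python
# Hypothetical routing of a single regulation check: act automatically only
# when the confidence score clears a per-regulation threshold.
from dataclasses import dataclass

@dataclass
class CheckResult:
    regulation_id: str
    passed: bool        # does the application satisfy this regulation?
    confidence: float   # in [0, 1], produced by the scoring system

def route(result: CheckResult, threshold: float = 0.95) -> str:
    """Handle the check automatically only when confident; otherwise defer."""
    if result.confidence >= threshold:
        return "automated"
    return "human_review"  # low-confidence decisions are thresholded out

print(route(CheckResult("bedroom-window", True, 0.98)))  # -> automated
print(route(CheckResult("bedroom-window", True, 0.60)))  # -> human_review
```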

Our confidence-scoring architecture operates at three hierarchical levels, each consuming the outputs of the previous tier along with the original inputs. We can use a common zoning regulation, that bedrooms need a window, to illustrate each level.

The extraction system is the core ML system that turns unstructured blueprints (images) into structured information (JSON) such as windows and rooms. These detections then form the basis of the probabilities that a given object is a bedroom and that a given object is a window.
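For illustration, a heavily simplified shape of that structured output might look like the following; the field names here are our own stand-ins, not the actual schema.

```python
# Hypothetical extraction output for one blueprint page. The real output is
# richer; the point is a list of objects with per-object confidences.
extraction = {
    "page": 3,
    "objects": [
        {"id": "room-12", "class": "bedroom", "bbox": [120, 340, 560, 720],
         "existence_conf": 0.97, "class_conf": 0.91},
        {"id": "win-4", "class": "window", "bbox": [540, 400, 560, 480],
         "existence_conf": 0.99, "class_conf": 0.88},
    ],
}
```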
SoTA vision models provide a confidence score that combines existence confidence with class confidence. These outputs are used in conjunction with other signals to estimate the confidence that an object is a bedroom and that an object is a window.
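A minimal hand-crafted blend of the kind described above might look like the sketch below; the features, weights, and bias are illustrative assumptions only.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def p_is_bedroom(existence_conf: float, class_conf: float,
                 has_closet: bool, area_sq_ft: float) -> float:
    """Hypothetical blend of detector outputs with hand-crafted signals.
    In production these weights would be learned, not hard-coded."""
    score = (
        2.0 * existence_conf           # detector thinks something is there
        + 2.0 * class_conf             # detector thinks it is a bedroom
        + (0.5 if has_closet else -0.5)               # bedrooms often have closets
        + (0.3 if 70 <= area_sq_ft <= 400 else -0.3)  # plausible bedroom size
        - 2.5                          # bias
    )
    return sigmoid(score)

print(round(p_is_bedroom(0.97, 0.91, True, 140.0), 2))  # -> ~0.89
```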
In the long run, we want to lean on production data flywheels and learned signals rather than hand-crafted features.

Value Confidence forms the input basis of the Bayesian system, along with other application properties.
When the confidence scores for all regulation checks are high, we can mark the application as fully automatable and give our partners the option to remove humans from the loop entirely.
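To make the Bayesian step concrete, here is a minimal sketch of the bedroom-window check using the pgmpy library (class names vary slightly across pgmpy versions); the structure and the numbers are illustrative, not our production network.

```python
# Value Confidences enter as priors on the leaf variables; the network
# encodes the "bedrooms need a window" rule deterministically.
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

model = BayesianNetwork([("bedroom", "check_ok"), ("window", "check_ok")])

# Priors taken directly from Value Confidence: 0.91 that the room is a
# bedroom, 0.88 that a window was found in it.
bedroom = TabularCPD("bedroom", 2, [[0.09], [0.91]])
window = TabularCPD("window", 2, [[0.12], [0.88]])

# The check fails only when the room is a bedroom and no window is present.
check = TabularCPD(
    "check_ok", 2,
    # columns: (bedroom=0, window=0), (0, 1), (1, 0), (1, 1)
    [[0.0, 0.0, 1.0, 0.0],   # check_ok = 0 (fail)
     [1.0, 1.0, 0.0, 1.0]],  # check_ok = 1 (pass)
    evidence=["bedroom", "window"], evidence_card=[2, 2],
)

model.add_cpds(bedroom, window, check)
assert model.check_model()

inference = VariableElimination(model)
print(inference.query(["check_ok"]))  # P(pass) = 0.09 + 0.91 * 0.88 ≈ 0.89
```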

How do we know our confidence scores are any good? Confidence scores for the confidence scores? A useful system generates high confidence scores when accuracy is high and lower confidence scores when accuracy is lower. A natural starting place is a correlation metric like Pearson’s or Spearman’s. However, for certain values or regulations our ML system has very high accuracy across the board, so there is little to no variance in accuracy, and correlation as a metric breaks down.

If confidence scores are constrained to map directly to accuracy, then error or loss metrics like Expected Calibration Error or RMSE variants can be used. In practice we relax this constraint and use a precision-blended scoring system that measures correlation when there is variance and accuracy when there is not.
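The exact blend is out of scope here, but the sketch below shows the general shape under our own simplifying assumptions: Spearman correlation when outcomes vary, a simple calibration gap (a stand-in for binned ECE) when they do not.

```python
import numpy as np
from scipy.stats import spearmanr

def blended_score(conf: np.ndarray, correct: np.ndarray,
                  var_floor: float = 1e-3) -> float:
    """conf: predicted confidences in [0, 1]; correct: 0/1 outcomes."""
    if np.var(correct) > var_floor:
        rho, _ = spearmanr(conf, correct)  # does confidence track accuracy?
        return float(rho)
    # Degenerate case (accuracy near 100% or 0%): correlation is undefined,
    # so score calibration instead.
    return 1.0 - float(np.mean(np.abs(conf - correct)))

conf = np.array([0.95, 0.90, 0.60, 0.55])
print(blended_score(conf, np.array([1, 1, 0, 1])))  # variance -> correlation
print(blended_score(conf, np.array([1, 1, 1, 1])))  # none -> calibration, 0.75
```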
Part of delivering value at the application level involves targeted inference and engineering strategies beyond ML evals. To process permits faster, we batch and parallelize as much of the computation as possible based on dependency flows, and we cache all levels of inference, for both value extraction and confidence scoring. ML datasets also never start out covering the long tail of production edge cases: empty applications, corrupted files, conflicting information, overlapping text. Handling these edge cases thoughtfully is often the difference between a demo and a product people actually trust.
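As one illustration of the batching-plus-caching pattern (a sketch only; extract_value and its return shape are stand-ins, not a real API):

```python
import asyncio
import hashlib

_cache: dict[str, dict] = {}

async def extract_value(doc: bytes, field: str) -> dict:
    """Stand-in for one value-extraction inference call."""
    key = hashlib.sha256(doc).hexdigest() + ":" + field
    if key in _cache:          # reuse results across retries and re-reviews
        return _cache[key]
    await asyncio.sleep(0.1)   # placeholder for actual model inference
    result = {"field": field, "value": None, "confidence": 0.9}
    _cache[key] = result
    return result

async def check_bedroom_window(doc: bytes) -> dict:
    # Independent extractions fan out in parallel...
    rooms, windows = await asyncio.gather(
        extract_value(doc, "rooms"), extract_value(doc, "windows"))
    # ...while the confidence score depends on both, so it runs afterwards.
    return {"confidence": min(rooms["confidence"], windows["confidence"])}

print(asyncio.run(check_bedroom_window(b"blueprint bytes")))
```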
Brain Co. partners with institutions in Government, Energy, and Healthcare, among others. If you are interested in deploying high-impact AI into society (and then writing promotional blog posts about it), we’d love to hear from you.
[1] A median housing-related project in SF last year spent 289 days in Planning review; 259 days with DBI; 137 with DPW; 43 with PUC; and for those that needed Fire Department review, another 127. The median length of time was over 620 days.
[2] Several threads of existing work:
Selective Classification for Deep Neural Networks [Geifman & El-Yaniv, 2017]
On Calibration of Modern Neural Networks [Guo et al., 2017]
Competition-Level Code Generation with AlphaCode [Li et al., 2022]