A confusion matrix tabulates model predictions against ground truth: rows are actual classes, columns are predicted classes. The diagonal shows correct predictions; off-diagonal elements show error types. For infrastructure inspection models, confusion matrices reveal which defect types or quality grades are confused — e.g., efflorescence mistaken for corrosion. Covers matrix interpretation, multi-class confusion, and deriving precision/recall per class.
A confusion matrix, also known as an error matrix, is a specific table layout that enables detailed visualization of the performance of a classification algorithm. It is one of the most fundamental and informative tools in machine learning model evaluation, providing a complete picture of where a model succeeds and, more importantly, where it fails. The matrix cross-tabulates the actual class labels (ground truth) against the predicted class labels produced by the model, with each cell containing the count of instances falling into that combination.
The standard convention places true classes as rows and predicted classes as columns. For a classification problem with K distinct classes, the confusion matrix has dimensions K×K. The element at position C[i][j] represents the number of instances belonging to true class i that were predicted as class j by the model. The diagonal elements C[i][i] therefore represent correct classifications — instances where the predicted class matches the true class. All off-diagonal elements represent misclassifications of varying types and severity.
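As a minimal sketch of the convention just described (rows are true classes, columns are predicted classes), the matrix can be tallied directly from paired label lists. The function name `confusion_matrix_counts` and the toy labels are illustrative assumptions, not part of any library.

```python
def confusion_matrix_counts(y_true, y_pred, labels):
    """Return a K x K list of lists where C[i][j] counts
    instances of true class i predicted as class j."""
    index = {label: k for k, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


# Hypothetical toy labels, for illustration only.
labels = ["asphalt", "concrete", "composite"]
y_true = ["asphalt", "asphalt", "concrete", "composite", "composite", "composite"]
y_pred = ["asphalt", "concrete", "concrete", "asphalt", "composite", "composite"]

C = confusion_matrix_counts(y_true, y_pred, labels)
for label, row in zip(labels, C):
    print(f"{label:>10}: {row}")

# The diagonal holds the correct predictions.
correct = sum(C[i][i] for i in range(len(labels)))
print("correct:", correct, "of", len(y_true))
```

In practice a library routine such as sklearn.metrics.confusion_matrix does the same tally; the hand-written version only makes the indexing explicit.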
The confusion matrix derives its name from the insight it provides into which classes the model “confuses” with each other. A model that reliably distinguishes between asphalt and concrete surfaces but frequently confuses composite pavement with asphalt will show high values along the asphalt-asphalt and concrete-concrete diagonals but a significant off-diagonal concentration at the composite-asphalt intersection. This pattern tells the model developer exactly where to focus improvement efforts.
The mathematical foundation of the confusion matrix is rooted in contingency table analysis, a statistical method dating to Karl Pearson’s early 20th-century work on chi-squared tests for categorical data. In machine learning contexts, the matrix was formalized as a standard evaluation tool in the 1960s with the development of automated pattern recognition systems. Today, the major machine learning frameworks all support confusion matrix computation: scikit-learn provides sklearn.metrics.confusion_matrix, TensorFlow offers tf.math.confusion_matrix, and PyTorch users can compute matrices with torchmetrics.ConfusionMatrix from the separate TorchMetrics package. The scikit-learn implementation is common in Python-based infrastructure inspection pipelines; it accepts arrays of true and predicted labels and returns the K×K matrix, with optional normalization.
The binary confusion matrix is the simplest and most widely taught form, applicable when the classification problem has exactly two classes, conventionally labeled positive and negative. For infrastructure inspection, a binary problem might be: “does this pavement image contain a crack?” (positive = crack present) or “does this bridge component have a defect?” (positive = defect detected).
The 2×2 binary confusion matrix contains exactly four cells:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
True Positives (TP) — Instances correctly identified as belonging to the positive class. For a crack detection model, TP is the count of images containing cracks that the model correctly flagged as cracked. Each true positive represents a defect correctly identified, enabling timely maintenance action. High TP counts indicate high sensitivity or recall — the model catches the defects it is designed to find.
False Positives (FP) — Negative instances incorrectly classified as positive. These are also called Type I errors in statistical hypothesis testing. A false positive in crack detection means the model flagged intact pavement as cracked. While false positives do not cause structural safety issues (no defect goes undetected), they generate false alarms that waste inspection resources — crews dispatched to investigate non-existent defects, maintenance budgets allocated to unnecessary repairs, and overall erosion of trust in the AI system. In airport operations where ICAO Annex 14 compliance requires documented inspection findings, excessive false positives burden the reporting workflow.
False Negatives (FN) — Positive instances incorrectly classified as negative. These are Type II errors and are generally considered the more dangerous error type in infrastructure inspection. A false negative means a real defect — a crack, a spall, a corrosion patch — goes undetected. For airfield pavements subject to aircraft loads, an undetected crack can propagate under repeated tire loading, leading to accelerated pavement deterioration and potential foreign object debris (FOD) generation. False negatives represent missed safety-critical defects and must be minimized even at the cost of accepting more false positives.
True Negatives (TN) — Instances correctly identified as not belonging to the positive class. These represent correctly identified intact pavement areas. While true negatives do not directly contribute to defect discovery, they are essential for validating overall model accuracy and for computing metrics like specificity (true negative rate).
The relationship between these four values determines all derived metrics:
Accuracy = (TP + TN) / (TP + TN + FP + FN) — The proportion of all predictions that are correct.
Precision (Positive Predictive Value) = TP / (TP + FP) — Of all instances predicted as positive, what proportion truly are positive. High precision means few false alarms.
Recall (Sensitivity, True Positive Rate) = TP / (TP + FN) — Of all actual positive instances, what proportion did the model catch. High recall means few missed defects.
Specificity (True Negative Rate) = TN / (TN + FP) — Of all actual negative instances, what proportion were correctly identified as negative.
F1-Score = 2 × (Precision × Recall) / (Precision + Recall) — The harmonic mean of precision and recall, providing a single balanced metric.
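The formulas above translate directly into code. The following is a minimal sketch; the function name `binary_metrics`, the example counts, and the choice to return 0.0 when a denominator is zero are assumptions made for illustration.

```python
def binary_metrics(tp, fp, fn, tn):
    """Derive the standard binary metrics from the four cell counts.
    A metric whose denominator is zero is reported as 0.0 (an assumed convention)."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical crack-detection counts on 1,000 images.
metrics = binary_metrics(tp=80, fp=20, fn=10, tn=890)
for name, value in metrics.items():
    print(f"{name:>11}: {value:.3f}")
```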
For infrastructure inspection, the precision-recall tradeoff is managed through the model’s decision threshold. A crack detection model might output a probability score between 0 and 1 for each image. A threshold of 0.5 is the conventional default, although it does not guarantee any particular balance between precision and recall. Lowering the threshold to 0.3 increases recall (fewer missed cracks) but typically decreases precision (more false alarms). Raising the threshold to 0.8 typically improves precision but risks missing subtle cracks. The optimal threshold depends on the operational context: for critical airfield pavements where missing a crack could lead to FOD generation, a lower threshold favoring recall is appropriate. For routine visual inspections where false alarms waste limited maintenance budgets, a higher threshold favoring precision may be preferable.
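The effect of moving the threshold can be illustrated with a small sweep. This is a sketch on made-up scores: the names `precision_recall_at`, `crack_scores`, and `has_crack` and all values are hypothetical.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when scores >= threshold are called positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical model scores and ground truth (1 = crack present).
crack_scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.35, 0.25, 0.10]
has_crack = [1, 1, 1, 0, 1, 0, 0, 0]

for threshold in (0.3, 0.5, 0.8):
    p, r = precision_recall_at(threshold, crack_scores, has_crack)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data the lowest threshold catches every crack at the cost of two false alarms, while the highest threshold raises no false alarms but misses half of the cracks.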
When the classification task involves three or more classes, the confusion matrix expands to K×K dimensions, where K is the number of classes. Multi-class classification is the dominant paradigm in infrastructure inspection AI, where models must distinguish between multiple surface types, multiple defect categories, or multiple quality grades simultaneously.
A 3-class example for surface type classification on airfield pavements might have the classes: Asphalt (A), Concrete (C), and Composite (O). A hypothetical confusion matrix for 1,000 validation images:
| True \ Predicted | Asphalt | Concrete | Composite | Total |
|---|---|---|---|---|
| Asphalt | 420 | 15 | 15 | 450 |
| Concrete | 10 | 280 | 10 | 300 |
| Composite | 30 | 20 | 200 | 250 |
| Total | 460 | 315 | 225 | 1000 |
The diagonal shows correct predictions: 420 asphalt, 280 concrete, 200 composite, totaling 900 correct out of 1,000 and giving 90% overall accuracy. The off-diagonal cells reveal the error structure: Asphalt was confused with Concrete (15 instances) and Composite (15 instances) equally. Concrete was confused with Asphalt (10) and Composite (10) equally. Composite was most frequently confused with Asphalt (30 instances), 50% more often than with Concrete (20). This pattern tells the model developer that composite surfaces are the most challenging class, particularly when they visually resemble pure asphalt.
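For multi-class confusion matrices, the one-vs-rest approach converts the K-class problem into K binary sub-problems for metric calculation. For a given class i:
TP_i = C[i][i] (class i instances predicted as class i)
FN_i = sum(C[i][:]) - C[i][i] (class i instances predicted as another class)
FP_i = sum(C[:][i]) - C[i][i] (instances of other classes predicted as class i)
TN_i = Total - TP_i - FN_i - FP_i (instances that neither belong to nor were predicted as class i)
For the Composite class in the example above: TP = 200, FN = 30 + 20 = 50, FP = 15 + 10 = 25, and TN = 1,000 - 275 = 725, giving precision 200/225 = 0.889 and recall 200/250 = 0.800.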
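The one-vs-rest decomposition can be sketched as follows for the 3-class surface example. The helper name `one_vs_rest_counts` is an assumption made for illustration.

```python
def one_vs_rest_counts(matrix, i):
    """Collapse a K x K matrix into binary (TP, FP, FN, TN) counts for class i."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[i][i]
    fn = sum(matrix[i]) - tp                  # rest of row i
    fp = sum(row[i] for row in matrix) - tp   # rest of column i
    tn = total - tp - fn - fp
    return tp, fp, fn, tn


labels = ["Asphalt", "Concrete", "Composite"]
# Rows = true class, columns = predicted class (3-class surface example).
C = [
    [420, 15, 15],
    [10, 280, 10],
    [30, 20, 200],
]

for i, label in enumerate(labels):
    tp, fp, fn, tn = one_vs_rest_counts(C, i)
    print(f"{label:>9}: TP={tp} FP={fp} FN={fn} TN={tn} "
          f"precision={tp / (tp + fp):.3f} recall={tp / (tp + fn):.3f}")
```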
The multi-class confusion matrix scales to any number of classes. For infrastructure inspection models with 10-15 defect types, the matrix becomes a rich information source revealing not just which classes perform poorly, but exactly which class pairs are problematic. This is fundamentally more informative than a single accuracy number.
The confusion matrix is the source from which all per-class classification metrics are derived. Understanding the derivation enables practitioners to correctly interpret model performance and identify which classes need improvement.
For each class i in a K-class classification problem:
Precision_i = C[i][i] / sum(C[:][i]) = TP / (TP + FP)
Precision answers: “When the model predicts class i, how often is it correct?” This is also called the positive predictive value. For defect classification, high precision on the “critical structural crack” class means that when the model flags a severe crack, inspectors can trust that finding.
Recall_i = C[i][i] / sum(C[i][:]) = TP / (TP + FN)
Recall answers: “Of all actual instances of class i, how many did the model find?” This is also called sensitivity or true positive rate. For defect classification, high recall on “spalling” means most actual spalls are detected, minimizing missed deterioration.
F1_i = 2 × (Precision_i × Recall_i) / (Precision_i + Recall_i)
F1 is the harmonic mean, always lying between precision and recall. F1 is preferred over arithmetic mean because it penalizes extreme imbalance — a model with precision=1.0 and recall=0.0 has F1=0.0, correctly indicating the model is useless despite the 0.5 arithmetic mean.
For comparing models across all classes, three averaging methods exist:
Macro-average computes the metric independently for each class and averages them with equal weight: Macro-Precision = (1/K) × sum(Precision_i). This treats all classes equally regardless of their frequency. For the 3-class surface example: Macro-Precision = (420/460 + 280/315 + 200/225) / 3 = (0.913 + 0.889 + 0.889) / 3 = 0.897. Macro-average is appropriate when all classes are equally important — for instance, classifying pavement distress types where even rare defects matter for safety.
Micro-average aggregates the counts across all classes before computing the metric: Micro-Precision = sum(TP_i) / sum(TP_i + FP_i). For the example: Micro-Precision = (420+280+200) / (420+280+200+15+15+10+10+30+20) = 900 / 1000 = 0.900. Notably, micro-average precision equals accuracy for single-label classification. Micro-average is driven by the most frequent classes and is appropriate when overall correctness is the priority.
Weighted-average computes the metric per class and averages weighted by the number of true instances per class: Weighted-Precision = sum(Precision_i × n_i) / sum(n_i), where n_i is the true count for class i. For the example: Weighted-Precision = (0.913 × 450 + 0.889 × 300 + 0.889 × 250) / 1000 = (410.85 + 266.70 + 222.25) / 1000 = 0.900. Weighted-average is a common default for imbalanced datasets because it accounts for class frequency. Because rare classes receive small weights, however, it can still mask poor performance on minority classes, so it is best reported alongside the macro-average or the per-class values.
| Averaging Method | Formula | Best For |
|---|---|---|
| Macro | (1/K) × Σ Metric_i | Equal class importance, rare defects matter |
| Micro | Σ TP / (Σ TP + Σ FP) | Overall dataset correctness |
| Weighted | Σ (Metric_i × n_i) / Σ n_i | Imbalanced classes, practical default |
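The three averaging schemes can be compared side by side on the 3-class surface example. This is a minimal sketch for precision only; the variable names are illustrative.

```python
# Rows = true class, columns = predicted class (Asphalt, Concrete, Composite).
C = [
    [420, 15, 15],
    [10, 280, 10],
    [30, 20, 200],
]
K = len(C)

true_counts = [sum(row) for row in C]                                # n_i
predicted_counts = [sum(row[j] for row in C) for j in range(K)]      # TP_i + FP_i
precision = [C[i][i] / predicted_counts[i] for i in range(K)]

# Macro: every class counts equally.
macro = sum(precision) / K
# Micro: pool the counts first, then divide.
micro = sum(C[i][i] for i in range(K)) / sum(predicted_counts)
# Weighted: per-class precision weighted by the number of true instances.
weighted = sum(p * n for p, n in zip(precision, true_counts)) / sum(true_counts)

print(f"macro={macro:.3f}  micro={micro:.3f}  weighted={weighted:.3f}")
```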
The Matthews correlation coefficient (MCC) is derived from the confusion matrix and provides a single metric that summarizes the entire matrix in a way that is robust to class imbalance. MCC reaches +1 for perfect prediction and is 0 for predictions no better than chance. In the binary case the minimum is -1 (total disagreement); in the multi-class case the minimum lies between -1 and 0, depending on the class distribution. Following the multi-class generalization of Gorodkin (2004), MCC can be written in terms of the matrix totals as:
MCC = (c × s - Σ p_k × t_k) / sqrt( (s² - Σ p_k²) × (s² - Σ t_k²) )
where c = Σ C[k][k] is the number of correct predictions, s = Σ C[i][j] is the total number of samples, t_k = sum(C[k][:]) is the number of true instances of class k, and p_k = sum(C[:][k]) is the number of times class k was predicted. The MCC is often regarded as one of the most informative single metrics for classifier evaluation because it uses all four confusion matrix quadrants (in binary) or all K² cells (in multi-class), unlike accuracy, which rewards only the diagonal.
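A minimal sketch of a multi-class MCC computation is shown below. It uses a compact form expressed through the trace, the total, and the row and column sums, which is equivalent to Gorodkin's definition; the function name `multiclass_mcc` and the choice to return 0.0 when the denominator vanishes are assumptions.

```python
from math import sqrt


def multiclass_mcc(matrix):
    """Multi-class MCC from a K x K confusion matrix (rows = true, columns = predicted)."""
    k = len(matrix)
    s = sum(sum(row) for row in matrix)               # total samples
    c = sum(matrix[i][i] for i in range(k))           # correct predictions
    t = [sum(row) for row in matrix]                  # true count per class
    p = [sum(row[j] for row in matrix) for j in range(k)]  # predicted count per class

    numerator = c * s - sum(pk * tk for pk, tk in zip(p, t))
    denominator = sqrt((s * s - sum(pk * pk for pk in p))
                       * (s * s - sum(tk * tk for tk in t)))
    # Assumed convention: undefined MCC (zero denominator) is reported as 0.0.
    return numerator / denominator if denominator else 0.0


surface = [[420, 15, 15], [10, 280, 10], [30, 20, 200]]
always_intact = [[9500, 0], [500, 0]]   # trivial model: predicts "intact" for everything

print(f"3-class surface model: MCC = {multiclass_mcc(surface):.3f}")
print(f"always-intact model:   MCC = {multiclass_mcc(always_intact):.3f}")
```

The second matrix is a model that never predicts the minority class: its accuracy is 95%, yet its MCC is 0.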
Overall accuracy is the most intuitively understood metric derived from the confusion matrix: the sum of the diagonal (correct predictions) divided by the total number of samples. For any confusion matrix, overall accuracy is computed as:
Accuracy = Σ C[i][i] / Σ C[i][j] for all i, j
Accuracy represents the proportion of all predictions the model got right. While intuitive, accuracy has critical limitations that the confusion matrix itself helps to diagnose.
The accuracy paradox describes situations where high accuracy does not indicate good model performance due to class imbalance. Consider a pavement defect model evaluated on a dataset of 10,000 images in which 95% show intact pavement (negative) and 5% show cracks (positive). A trivial model that predicts “intact” for every image achieves 95% accuracy, yet it detects zero cracks. The confusion matrix immediately exposes this failure: the model has TP=0, FP=0, FN=500 (all cracks missed), TN=9,500 (all intact correctly identified). Despite 95% overall accuracy, recall for the crack class is 0%.
The confusion matrix makes the accuracy paradox visible. Accuracy alone cannot distinguish between a model whose errors are spread across classes and one that, like the trivial model above, concentrates all of its errors on the rare class.
For infrastructure inspection, this distinction is safety-critical. ICAO Annex 14 requires that runway surface inspections identify all defects that could compromise aircraft operations. A model with 99% accuracy that misses 100% of a rare but dangerous defect type (such as a deep structural crack in the runway touchdown zone) represents a safety hazard that accuracy alone would mask.
From the confusion matrix, practitioners can compute per-class accuracy (also called recall or sensitivity for the positive class in binary settings):
Class_i Accuracy = C[i][i] / sum(C[i][:])
This tells the proportion of actual class i instances that the model correctly classified. For imbalanced datasets, per-class accuracy is far more informative than overall accuracy. A useful reporting approach is to present overall accuracy alongside the minimum per-class accuracy — the class with the lowest individual accuracy becomes the model’s weak point that requires attention.
Balanced accuracy addresses class imbalance by averaging recall across all classes:
Balanced Accuracy = (1/K) × Σ (C[i][i] / sum(C[i][:]))
For the 95% intact / 5% crack example with a trivial always-intact model: Balanced Accuracy = (Recall_intact + Recall_crack) / 2 = (9500/9500 + 0/500) / 2 = (1.0 + 0.0) / 2 = 0.50. Balanced accuracy correctly identifies this model as no better than random (0.50), while overall accuracy (0.95) is misleadingly high.
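The always-intact example can be reproduced in a few lines. This is a sketch; the helper names are illustrative, and a class with no true instances is assigned a recall of 0.0 by assumption.

```python
def per_class_recall(matrix):
    """Recall of each class: diagonal cell divided by its row total."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]


def overall_accuracy(matrix):
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


def balanced_accuracy(matrix):
    recalls = per_class_recall(matrix)
    return sum(recalls) / len(recalls)


# Rows = true class, columns = predicted class; order: intact, crack.
always_intact = [
    [9500, 0],   # intact images, all predicted intact
    [500, 0],    # cracked images, all predicted intact
]

print("overall accuracy: ", overall_accuracy(always_intact))
print("per-class recall: ", per_class_recall(always_intact))
print("balanced accuracy:", balanced_accuracy(always_intact))
```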
The most powerful diagnostic capability of the confusion matrix is its ability to reveal which specific classes are confused with which — the pattern of off-diagonal errors. This information directly guides model improvement strategies.
Common confusion patterns in infrastructure inspection models include:
Within-category confusion — Two visually similar defect types are frequently mistaken for each other. Efflorescence (white crystalline salt deposits on concrete) and early-stage corrosion (rust-colored staining) are frequently confused because both appear as surface discoloration. Within asphalt pavements, alligator cracking (interconnected polygons from fatigue) is sometimes confused with block cracking (rectangular blocks from shrinkage) when the crack network density is moderate.
Hierarchical confusion — The model correctly identifies the general category but confuses the specific subtype. A model may correctly detect that a surface is “cracked” but confuse “transverse crack” with “longitudinal crack” — both are linear cracks differing only in orientation relative to the pavement centerline or traffic direction.
Cross-category confusion — A surface condition is mistaken for a fundamentally different condition. Shadow edges on pavement may be confused with crack edges due to similar contrast gradients. Joint sealant material may be confused with crack filling material. Tire skid marks on runway touchdown zones may be confused with surface deterioration.
The confusion fraction for a pair of classes (i, j) is:
Confusion(i → j) = C[i][j] / sum(C[i][:])
This tells, for actual instances of class i, what proportion were misclassified as class j. A confusion fraction of 0.15 between composite (true) and asphalt (predicted) means 15% of composite surfaces are mistaken for asphalt — the primary failure mode for that class.
Similarly, the normalized confusion matrix with row-wise normalization sets each row to sum to 1.0, directly showing the proportion of each true class distributed across predicted classes. This is the most common visualization format for multi-class confusion matrices because it makes confusion patterns immediately visible regardless of class sample sizes.
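Row normalization and the search for the dominant confusion can be sketched as follows, using the 3-class surface example from earlier. The function names `row_normalize` and `largest_confusion` are illustrative assumptions.

```python
def row_normalize(matrix):
    """Divide each row by its total so that every row sums to 1.0."""
    return [[cell / sum(row) if sum(row) else 0.0 for cell in row]
            for row in matrix]


def largest_confusion(matrix, labels):
    """Return (fraction, true_label, predicted_label) for the largest
    off-diagonal entry of the row-normalized matrix."""
    normalized = row_normalize(matrix)
    k = len(labels)
    return max((normalized[i][j], labels[i], labels[j])
               for i in range(k) for j in range(k) if i != j)


labels = ["asphalt", "concrete", "composite"]
C = [
    [420, 15, 15],
    [10, 280, 10],
    [30, 20, 200],
]

for label, row in zip(labels, row_normalize(C)):
    print(f"{label:>9}: " + "  ".join(f"{value:.3f}" for value in row))

fraction, true_label, predicted_label = largest_confusion(C, labels)
print(f"largest confusion: {true_label} -> {predicted_label} ({fraction:.0%})")
```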
The normalized confusion matrix is typically displayed as a heatmap using a diverging color scheme. The diagonal (correct predictions) is shown in green or blue, creating a visible “correct ridge” that should be the dominant visual feature. Off-diagonal cells are shown in red or warm colors, with intensity proportional to the confusion fraction. This visual encoding allows immediate identification of:
Once confused class pairs are identified, the following targeted strategies can be applied:
Surface type classification is a foundational task in infrastructure inspection. For airfield pavements, the International Civil Aviation Organization (ICAO) and the Federal Aviation Administration (FAA) require accurate surface type identification for aircraft performance calculations.
A typical surface type classification model for airfield pavements must distinguish between:
A confusion matrix for a 4-class surface type model tested on 2,000 validation images might appear as:
| True \ Predicted | Asphalt | Concrete | Composite | Gravel |
|---|---|---|---|---|
| Asphalt (n=600) | 564 | 6 | 24 | 6 |
| Concrete (n=500) | 10 | 465 | 20 | 5 |
| Composite (n=400) | 48 | 28 | 312 | 12 |
| Gravel (n=500) | 5 | 10 | 5 | 480 |
This matrix reveals:
Asphalt (94.0% recall): 24 out of 600 asphalt images were misclassified as composite — the most significant confusion for this class. This occurs when asphalt surfaces have reflective cracking patterns that visually resemble composite pavement (asphalt over concrete with crack reflection). The 6 misclassifications to concrete may occur on light-colored oxidized asphalt that resembles aged concrete.
Concrete (93.0% recall): The primary confusion is 20 images misclassified as composite — typically concrete surfaces with thin asphalt patches or overlay strips that create a composite-like appearance.
Composite (78.0% recall): This is the problem class. 48 of 400 composite images (12%) were classified as pure asphalt. This happens when the asphalt overlay is thick enough that the underlying concrete texture and joints are not visible in the captured imagery. Another 28 (7%) were classified as pure concrete — typically when the asphalt overlay has worn thin in traffic areas, exposing the concrete substrate. The model struggles because composite pavement appearance spans the range between pure asphalt and pure concrete.
Gravel (96.0% recall): Gravel is the most distinct class visually and achieves the highest recall.
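The per-class readings above can be generated mechanically from the matrix. The sketch below reports each class's recall together with the class it is most often mistaken for; the function name `class_report` is an illustrative assumption.

```python
labels = ["Asphalt", "Concrete", "Composite", "Gravel"]
# Rows = true class, columns = predicted class (4-class surface example).
matrix = [
    [564, 6, 24, 6],
    [10, 465, 20, 5],
    [48, 28, 312, 12],
    [5, 10, 5, 480],
]


def class_report(matrix, labels):
    """For each class: (label, recall, most common wrong prediction, its count)."""
    report = []
    for i, row in enumerate(matrix):
        recall = row[i] / sum(row)
        # Largest off-diagonal cell in the row marks the dominant confusion.
        j = max((k for k in range(len(row)) if k != i), key=lambda k: row[k])
        report.append((labels[i], recall, labels[j], row[j]))
    return report


report = class_report(matrix, labels)
for label, recall, confused_with, count in report:
    print(f"{label:>9}: recall={recall:.1%}, "
          f"most often predicted as {confused_with} ({count})")

weakest = min(report, key=lambda entry: entry[1])
print("weakest class:", weakest[0])
```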
For ICAO compliance, the confusion between composite and pure asphalt is the most operationally significant. Aircraft performance calculations — particularly takeoff and landing distances — depend on surface type. Confusing composite pavement for pure asphalt could lead to incorrect braking coefficient estimates, affecting safety margins.
Targeted improvements for the composite class include: capturing training images at multiple overlay ages (new thick overlay vs. worn thin overlay), adding images showing reflective cracking patterns specific to composite construction, and training a dedicated binary discriminator between pure asphalt and composite overlay.
Quality grade classification assigns a categorical condition rating to infrastructure surfaces. For airfield pavements, common grading systems include the Pavement Condition Index (PCI) per ASTM D5340 and the Airport Pavement Condition Classification used in ICAO-referenced airport pavement management systems.
Quality grades typically follow a 4-level or 5-level scale:
| Grade | PCI Range | Description | Visual Indicators |
|---|---|---|---|
| Good | 86-100 | Minor or no distress | Few cracks, no spalling, intact joints |
| Fair | 71-85 | Moderate deterioration | Some cracking, minor spalling, slight weathering |
| Poor | 56-70 | Significant deterioration | Extensive cracking, moderate spalling, visible raveling |
| Serious/Failed | 0-55 | Severe deterioration | Extensive interconnected cracking, severe spalling, structural defects |
A confusion matrix for quality grade classification on 1,000 runway pavement sections:
| True \ Predicted | Good | Fair | Poor | Failed |
|---|---|---|---|---|
| Good (n=350) | 315 | 28 | 7 | 0 |
| Fair (n=300) | 36 | 237 | 24 | 3 |
| Poor (n=200) | 0 | 30 | 152 | 18 |
| Failed (n=150) | 0 | 0 | 16 | 134 |
This matrix reveals the characteristic pattern of ordinal classification confusion: errors are concentrated on adjacent grades. The model rarely mistakes Good for Failed (0 instances) or Failed for Good (0 instances) because these classes are visually very different. However, adjacent-grade confusion is common:
Good ↔ Fair (28 + 36 = 64 confusions): These two grades are the most frequently confused pair, representing borderline cases where minor cracking is present but the overall condition is near the Good-Fair boundary (PCI ≈ 85). The 28 Good sections classified as Fair may have early hairline cracking that the model interprets as significant; the 36 Fair sections classified as Good may have very fine cracking below the model’s detection threshold.
Fair ↔ Poor (24 + 30 = 54 confusions): Moderate deterioration grading is subjective even among human inspectors. The 24 Fair sections classified as Poor likely have crack densities near the Fair-Poor boundary; the 30 Poor sections classified as Fair may represent cases where crack severity is borderline.
Poor ↔ Failed (18 + 16 = 34 confusions): At the severe end, confusion between Poor (extensive cracking) and Failed (structural deterioration) is relatively low because failed pavement shows qualitatively different distress — spalling, faulting, and surface disintegration beyond simple cracking.
The matrix is asymmetric: Good→Fair confusion (28 of 350 Good sections, 8%) is lower than Fair→Good confusion (36 of 300 Fair sections, 12%). At this boundary the model is therefore more likely to upgrade a Fair section to Good than to downgrade a Good section to Fair, which is the less conservative direction. This asymmetry is relevant for maintenance planning: conservative misclassifications (rating better pavement as worse) are operationally safer because they lead to earlier maintenance intervention, whereas optimistic misclassifications delay action.
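The ordinal error pattern can be quantified by grouping errors by grade distance and by direction. This is a minimal sketch on the quality grade matrix above; the function names are illustrative, and grades are assumed to be ordered from best (index 0) to worst.

```python
grades = ["Good", "Fair", "Poor", "Failed"]   # ordered best to worst
# Rows = true grade, columns = predicted grade.
matrix = [
    [315, 28, 7, 0],
    [36, 237, 24, 3],
    [0, 30, 152, 18],
    [0, 0, 16, 134],
]


def errors_by_distance(matrix):
    """Count misclassifications grouped by how many grades apart they are."""
    counts = {}
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if i != j:
                counts[abs(i - j)] = counts.get(abs(i - j), 0) + count
    return counts


def direction_totals(matrix):
    """Split errors into conservative (rated worse than truth)
    and optimistic (rated better than truth)."""
    rated_worse = sum(c for i, row in enumerate(matrix)
                      for j, c in enumerate(row) if j > i)
    rated_better = sum(c for i, row in enumerate(matrix)
                       for j, c in enumerate(row) if j < i)
    return rated_worse, rated_better


print("errors by grade distance:", errors_by_distance(matrix))
rated_worse, rated_better = direction_totals(matrix)
print(f"rated worse than truth:  {rated_worse}")
print(f"rated better than truth: {rated_better}")
```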
Cohen’s weighted Kappa is particularly appropriate for quality grade confusion matrices because it accounts for the order of classes. Adjacent-grade errors (Fair classified as Poor) are penalized less severely than distant errors (Good classified as Failed). Linear weighting penalizes proportionally to grade separation, while quadratic weighting penalizes the square of grade separation — more appropriate when grade differences have nonlinear safety implications.
For the matrix above, linear weighted Kappa is approximately 0.85, indicating strong agreement beyond chance, while unweighted Kappa is lower at approximately 0.78 because it treats all off-diagonal errors equally regardless of severity.
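Cohen's Kappa and its weighted variants can be computed directly from the confusion matrix by comparing observed disagreement with the disagreement expected by chance from the row and column totals. The sketch below assumes the classes are listed in grade order; the function name `cohen_kappa` is illustrative.

```python
def cohen_kappa(matrix, weights=None):
    """Cohen's kappa from a K x K confusion matrix with ordered classes.
    weights: None (all errors equal), "linear", or "quadratic"."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(row[j] for row in matrix) for j in range(k)]

    def penalty(i, j):
        if weights is None:
            return 0 if i == j else 1
        if weights == "linear":
            return abs(i - j)
        if weights == "quadratic":
            return (i - j) ** 2
        raise ValueError(f"unknown weights: {weights!r}")

    cells = [(i, j) for i in range(k) for j in range(k)]
    observed = sum(penalty(i, j) * matrix[i][j] for i, j in cells) / n
    expected = sum(penalty(i, j) * row_totals[i] * col_totals[j]
                   for i, j in cells) / (n * n)
    return 1.0 - observed / expected


# Rows = true grade, columns = predicted grade (Good, Fair, Poor, Failed).
grade_matrix = [
    [315, 28, 7, 0],
    [36, 237, 24, 3],
    [0, 30, 152, 18],
    [0, 0, 16, 134],
]

for scheme in (None, "linear", "quadratic"):
    print(f"kappa (weights={scheme}): {cohen_kappa(grade_matrix, scheme):.3f}")
```

With label arrays rather than a matrix, sklearn.metrics.cohen_kappa_score with its `weights` argument computes the same quantity.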
Defect classification is the most complex and safety-critical task for infrastructure inspection AI models. For concrete bridge components or airfield pavements, a model may need to recognize 10-15 distinct defect types simultaneously.
Typical defect classes for concrete infrastructure inspection include:

A partial confusion matrix focusing on the most frequently confused defect pairs for a concrete bridge deck inspection model:
| True \ Predicted | Hairline Crack | Structural Crack | Spalling | Efflorescence | Corrosion Stain | Intact |
|---|---|---|---|---|---|---|
| Hairline Crack | 820 | 30 | 5 | 40 | 10 | 95 |
| Structural Crack | 15 | 440 | 20 | 5 | 15 | 5 |
| Spalling | 0 | 10 | 285 | 5 | 20 | 0 |
| Efflorescence | 25 | 0 | 5 | 145 | 60 | 15 |
| Corrosion Stain | 5 | 5 | 15 | 35 | 180 | 10 |
| Intact | 65 | 0 | 0 | 10 | 15 | 1910 |
Efflorescence ↔ Corrosion Stain (60 + 35 = 95 confusions): The most consequential confusion pair in concrete defect classification, and the highest confusion rate in the matrix (60 of 250 efflorescence instances, 24%, are predicted as corrosion stain). Both appear as surface discoloration: efflorescence as white crystalline deposits, corrosion staining as rust-colored patches. When efflorescence incorporates dirt or when corrosion staining is in early stages (rust-colored but not yet patterned), the two can be very difficult to distinguish visually. This confusion has material implications: efflorescence indicates water migration (a maintenance issue), while corrosion staining indicates active reinforcement corrosion (a structural safety issue). Confusing one for the other could lead to dramatically incorrect maintenance prioritization.
Hairline Crack ↔ Intact (95 + 65 = 160 confusions): Hairline cracks near the model’s resolution limit (approximately 0.2mm at the capture resolution of 0.5mm/pixel) are frequently missed. 95 hairline cracks were classified as intact (false negatives), representing missed early-stage deterioration. 65 intact surfaces were classified as hairline cracked (false positives), representing false alarms. This is the classic detection sensitivity tradeoff at the perceptual limit.
Spalling ↔ Corrosion Stain (20 + 15 = 35 confusions): Spalled areas exposing corroded reinforcement bars often have rust-colored staining around the spall edges, leading to confusion between the two classes. In many cases both defects coexist — a spall caused by underlying corrosion — making the single-label classification task inherently ambiguous.
Structural Crack ↔ Hairline Crack (30 + 15 = 45 confusions): Cracks near the hairline-to-structural boundary (approximately 0.3mm width) are confused based on perceived width. Without precise sub-millimeter measurement capability in standard inspection imagery, this confusion is expected and may be acceptable if both crack types are flagged for inspection.
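The pairwise totals quoted above can be produced for every class pair at once by adding each cell to its mirror image across the diagonal and sorting. This is a sketch; the function name `ranked_pairs` is an illustrative assumption.

```python
from itertools import combinations

labels = ["Hairline Crack", "Structural Crack", "Spalling",
          "Efflorescence", "Corrosion Stain", "Intact"]
# Rows = true class, columns = predicted class.
matrix = [
    [820, 30, 5, 40, 10, 95],
    [15, 440, 20, 5, 15, 5],
    [0, 10, 285, 5, 20, 0],
    [25, 0, 5, 145, 60, 15],
    [5, 5, 15, 35, 180, 10],
    [65, 0, 0, 10, 15, 1910],
]


def ranked_pairs(matrix, labels):
    """Symmetric confusion count for each unordered class pair, largest first."""
    pairs = [(matrix[i][j] + matrix[j][i], labels[i], labels[j])
             for i, j in combinations(range(len(labels)), 2)]
    return sorted(pairs, reverse=True)


ranking = ranked_pairs(matrix, labels)
for count, first, second in ranking[:5]:
    print(f"{first} <-> {second}: {count}")
```

Raw pair counts favor large classes, so a ranking by row-normalized rate is a useful complement when class sizes differ as much as they do here.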
Based on confusion patterns, specific remediation strategies include:
Efflorescence vs. Corrosion Stain: Add training data showing efflorescence with embedded dirt (yellowish tint) and early corrosion without visible rust (greenish tint). Apply color augmentation emphasizing these subtle spectral differences. Consider adding near-infrared or multispectral channels that detect chemical composition differences.
Hairline Crack vs. Intact: Improve capture resolution or deploy super-resolution preprocessing. Apply targeted augmentation that simulates hairline cracks on different surface textures. Consider rejecting borderline predictions and flagging them for human review.
Spalling vs. Corrosion Stain: Model training should use multi-label annotation where spalling and corrosion can coexist. Alternatively, create a hierarchical classifier that first detects “area of deterioration” then distinguishes spalling from staining at the second level.
Structural vs. Hairline Crack: Integrate crack width estimation as a regression head rather than classification. Use the continuous width estimate to set severity thresholds that can be tuned per inspection standard.
Effective confusion matrix visualization and reporting is essential for communicating model performance to stakeholders — from data scientists to airport maintenance managers to regulatory authorities.
The standard visualization format for a confusion matrix is a heatmap with the following conventions:
For publication-quality figures, the standard approach uses matplotlib with seaborn.heatmap in Python:
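import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.metrics import confusion_matrix

# y_true, y_pred: sequences of labels; class_names: label order for rows and columns
cm = confusion_matrix(y_true, y_pred, labels=class_names)
# Overall accuracy (diagonal sum over total), used in the figure title
accuracy = np.trace(cm) / cm.sum()
# Row-normalize so each row sums to 1 and the diagonal shows per-class recall
cm_normalized = cm.astype(float) / cm.sum(axis=1, keepdims=True)
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(cm_normalized, annot=True, fmt='.2f',
            xticklabels=class_names, yticklabels=class_names,
            cmap='RdYlGn', vmin=0, vmax=1, ax=ax)
ax.set_xlabel('Predicted Class')
ax.set_ylabel('True Class')
ax.set_title(f'Confusion Matrix (Overall Accuracy: {accuracy:.2%})')
plt.tight_layout()
plt.show()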
The choice of normalization significantly affects interpretation:
Row-normalized (normalize='true'): Each row sums to 1.0 (100%). Diagonal values show recall per class. Across-row values show “when the true class is X, what proportion was predicted as each class?” This is the most common normalization for diagnostic analysis.
Column-normalized (normalize='pred'): Each column sums to 1.0 (100%). Diagonal values show precision per class. Down-column values show “when the model predicted X, what proportion actually belonged to each true class?” This is useful for understanding false positive distributions.
No normalization: Raw counts are displayed. Essential for verifying sample sizes but makes comparison difficult when classes have different frequencies.
Triple-cell format: Each cell shows three values: raw count, row %, and column %. This provides complete information in a single visualization but can be visually cluttered for large matrices.
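Column normalization and the triple-cell annotation can be sketched as follows on the 3-class surface example. The function names `column_normalize` and `triple_cell` and the exact annotation layout are illustrative assumptions.

```python
def column_normalize(matrix):
    """Divide each cell by its column total so that every column sums to 1.0."""
    k = len(matrix)
    col_totals = [sum(row[j] for row in matrix) for j in range(k)]
    return [[row[j] / col_totals[j] if col_totals[j] else 0.0 for j in range(k)]
            for row in matrix]


def triple_cell(matrix, i, j):
    """Annotation text for one cell: raw count, row percentage, column percentage."""
    count = matrix[i][j]
    row_total = sum(matrix[i])
    col_total = sum(row[j] for row in matrix)
    return f"{count} ({count / row_total:.1%} row, {count / col_total:.1%} col)"


# Rows = true class, columns = predicted class (Asphalt, Concrete, Composite).
C = [
    [420, 15, 15],
    [10, 280, 10],
    [30, 20, 200],
]

by_column = column_normalize(C)
# The diagonal of the column-normalized matrix is the per-class precision.
print("precision per class:", [round(by_column[i][i], 3) for i in range(3)])
# True Composite predicted as Asphalt, in triple-cell form.
print(triple_cell(C, 2, 0))
```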
For infrastructure inspection model reporting, the recommended template includes:
For model development tracking, confusion matrices should be generated and logged at regular training checkpoints (every 10-20 epochs). Comparing matrices across checkpoints reveals:
Experiment-management tools such as the Arena platform and MLflow can track confusion matrices alongside other run artifacts, so that a matrix is stored and versioned for each training run.
Not all confusion in the matrix is equal. Domain experts should review confusion patterns to classify each off-diagonal pair as:
Avoidable confusion: The two classes are visually distinct to a human expert, and the model’s confusion indicates a deficiency in training data, model architecture, or feature learning. Efflorescence vs. corrosion staining in images with clear color differences falls in this category.
Unavoidable confusion: The two classes are genuinely difficult to distinguish even for human experts, or the differentiation requires information not available in the input (e.g., temporal progression data, subsurface sensing). Hairline crack vs. surface scratch where both appear as fine linear features may be unavoidably confused from visual imagery alone.
Ambiguous ground truth: The true class itself is uncertain due to inter-annotator disagreement. If two human inspectors disagree on whether a surface is “fair” or “poor” grade 15% of the time, the model cannot be expected to exceed this agreement ceiling. The confusion matrix should be interpreted relative to the human agreement baseline — a model achieving 90% agreement with a reference standard may be excellent if human inter-rater reliability is only 85%.
For infrastructure inspection models used in regulatory compliance contexts, such as ICAO Annex 14 aerodrome certification or FAA AC 150/5380-7B pavement management, the confusion matrix serves as a core validation artifact. Regulatory reporting should include:
The confusion matrix, when properly constructed and interpreted, transforms model evaluation from a single accuracy number into a rich diagnostic tool that reveals the complete error structure of a classification system. For infrastructure inspection applications where the cost of different error types varies dramatically — a missed structural defect costs far more than a false alarm on intact pavement — this granular understanding enables practitioners to tune, validate, and deploy models that meet the specific reliability requirements of aviation safety.
TarmacView uses confusion matrix analysis to validate infrastructure inspection AI models across surface type, quality grade, and defect classification tasks. Ensure your models perform reliably with per-class evaluation metrics derived from comprehensive confusion matrices.