Intersection Over Union (IoU)

Intersection Over Union (IoU) for Segmentation Accuracy

Definition and Formula

Intersection Over Union (IoU), also known as the Jaccard Index or Jaccard Similarity Coefficient, is a statistical measure that quantifies the overlap between two sets of data. In computer vision and image segmentation, IoU measures the accuracy of a predicted segmentation mask relative to the ground truth mask. It is the most widely adopted evaluation metric for semantic segmentation, instance segmentation, and object detection tasks across both academic benchmarks and industrial applications.

The formula for IoU is mathematically straightforward:

IoU = |A ∩ B| / |A ∪ B|

Where A represents the predicted segmentation mask (the set of pixels the model classifies as belonging to the object class) and B represents the ground truth mask (the set of pixels that human annotators have identified as belonging to the object class). The numerator, |A ∩ B|, is the intersection — the area (or pixel count) where the prediction and ground truth agree. The denominator, |A ∪ B|, is the union — the total area covered by either the prediction or the ground truth or both.

Figure: Intersection Over Union (IoU) concept with two overlapping shapes; the intersection is highlighted in purple and the union is outlined in white.

Expanding the formula in terms of the confusion matrix elements — true positives (TP), false positives (FP), and false negatives (FN) — provides a more practical formulation for computation:

IoU = TP / (TP + FP + FN)

Here, TP (true positives) are the pixels correctly classified as belonging to the object class (the intersection area). FP (false positives) are the pixels incorrectly predicted as object when they are actually background (the prediction area outside the ground truth). FN (false negatives) are the pixels incorrectly predicted as background when they are actually object (the ground truth area outside the prediction). This formulation makes explicit that FP and FN enter the denominator symmetrically, so IoU penalizes over-prediction and under-prediction alike: for a fixed TP, one additional false positive lowers the score by exactly as much as one additional false negative.

The IoU value always falls within the closed interval [0, 1]. An IoU of 0 indicates that the predicted and ground truth masks have zero overlap — they are completely disjoint. An IoU of 1 indicates perfect overlap — the predicted mask exactly matches the ground truth mask pixel-for-pixel. Intermediate values represent partial overlap quality. For example, an IoU of 0.5 means that the intersection area is half the size of the union area, which is the standard minimum threshold for considering a detection or segmentation as correct in the Pascal VOC benchmark.

IoU is scale-invariant in the sense that it measures relative overlap rather than absolute pixel counts. A 100-pixel object and a 10,000-pixel object both produce an IoU of 1.0 when perfectly segmented. However, this scale invariance has a practical caveat: for the same absolute pixel error (e.g., 5 misclassified boundary pixels), a small object experiences a much larger relative drop in IoU than a large object, making IoU implicitly more stringent for fine detail and small structures.
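The caveat can be checked numerically. The sketch below assumes the 5 misclassified pixels are all false negatives (ground truth pixels the model missed) and that there are no false positives; the function name is illustrative, not from any library.

```python
def iou_from_counts(tp: int, fp: int, fn: int) -> float:
    """IoU = TP / (TP + FP + FN); returns NaN when all counts are zero."""
    denom = tp + fp + fn
    return tp / denom if denom else float("nan")

# Same absolute error (5 missed pixels) on a small and a large object.
small = iou_from_counts(tp=100 - 5, fp=0, fn=5)       # 95 / 100
large = iou_from_counts(tp=10_000 - 5, fp=0, fn=5)    # 9995 / 10000

print(f"small object IoU: {small:.4f}")  # 0.9500
print(f"large object IoU: {large:.4f}")  # 0.9995
```

Under these assumptions the small object loses 0.05 IoU while the large object loses 0.0005, a hundredfold difference for the same five pixels.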

The metric originates from the Jaccard Index, introduced by Paul Jaccard in 1901 as a coefficient of community for comparing species distributions in alpine flora. Its application to computer vision was formalized in the Pascal Visual Object Classes (VOC) Challenge (2007-2012), which established IoU as the standard evaluation metric for object detection and segmentation. The Common Objects in Context (COCO) dataset and benchmark, released by Microsoft in 2014, further cemented IoU as the de facto standard by introducing mAP@[0.5:0.95], the average of mean Average Precision computed across ten discrete IoU thresholds from 0.5 to 0.95.

Visual Interpretation of IoU

Visual interpretation of IoU is essential for developing an intuitive understanding of what different IoU scores actually represent in terms of segmentation quality. The metric maps a complex spatial relationship between two masks into a single scalar value, and internalizing how different spatial error patterns affect that value is critical for both model development and results communication.

An IoU of 1.0 represents perfect pixel-for-pixel alignment. In the context of crack segmentation on airport pavements, a perfect IoU means every pixel the model classified as “crack” exactly matches every pixel a trained inspector marked as “crack,” and no pixels were missed or over-predicted. In practice, perfect IoU is essentially unattainable for real-world infrastructure inspection due to annotation ambiguity — different human annotators marking the same crack on the same image typically achieve an inter-annotator IoU of only 0.65-0.85, depending on crack width and contrast. This annotation noise ceiling represents the maximum achievable IoU for any model on that dataset.

An IoU of 0.75-0.90 indicates excellent segmentation quality. At this level, the predicted mask closely follows the ground truth boundary with only minor deviations of 1-3 pixels at the edges. For crack segmentation, this corresponds to correct detection of the crack’s full path with slight differences in predicted crack thickness compared to the ground truth annotation. On the COCO benchmark, a high AP75 (Average Precision at IoU threshold 0.75) is considered a strong indicator of precise localization capability.

An IoU of 0.50-0.75 indicates good segmentation quality. The model correctly identifies the location and general shape of the object, but boundary accuracy is moderate. For pavement crack detection, an IoU in this range means the model reliably finds the crack (high true positive rate) but may predict the crack as somewhat thicker or thinner than the ground truth, or may have minor fragmentation where a continuous crack is predicted as short disconnected segments. This is the typical operating range for production crack segmentation systems operating on real airport pavement imagery, where TarmacView’s DINOv3-based model achieves a test IoU of 0.519.

Figure: Comparison of three IoU scores (0.3, 0.5, and 0.75) for crack detection on pavement, with ground truth in green and prediction in red.

An IoU of 0.30-0.50 indicates marginal segmentation quality. The model detects the presence of the object but with significant localization errors. The predicted mask may be displaced, incorrectly shaped, or substantially different in size from the ground truth. In pavement crack detection, this could mean the model identifies general crack-affected regions but fails to precisely follow crack geometry, predicting blobs rather than linear features. An IoU below 0.5 is generally not considered a successful detection or segmentation under the Pascal VOC standard.

An IoU of 0-0.30 indicates poor segmentation. The model either fails to detect the object entirely (predicted mask is empty, producing IoU=0) or produces a mask that overlaps the ground truth only incidentally at the edges of a much larger mispredicted region. In the context of infrastructure inspection, models operating at this IoU level would miss critical defects or produce so many false positives that the output is not operationally useful.

A crucial insight for visual interpretation is that IoU is non-linear in perceptual quality. The difference between IoU=0.90 and IoU=0.95 represents a much smaller actual improvement in pixel accuracy than the difference between IoU=0.50 and IoU=0.55. This is because at high IoU levels, the union is already very close to the intersection, so further improvements require increasingly precise boundary alignment. Conversely, at low to moderate IoU, relatively large improvements in segmentation quality are needed to move the metric by a few percentage points.
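A small calculation makes this concrete. The sketch below assumes a simplified error model in which false positives and false negatives are equal (FP = FN = e) for a ground truth of N pixels, so that IoU = (N - e) / (N + e). Both the model and the function name are illustrative assumptions.

```python
def errors_per_side(iou: float, gt_pixels: int) -> float:
    """Error pixels e (with FP = FN = e) that yield the given IoU.

    Derived by solving IoU = (N - e) / (N + e) for e, where N is the
    number of ground truth pixels.
    """
    return gt_pixels * (1.0 - iou) / (1.0 + iou)

N = 1000  # hypothetical ground truth size in pixels
for start, end in [(0.50, 0.55), (0.90, 0.95)]:
    removed = errors_per_side(start, N) - errors_per_side(end, N)
    print(f"IoU {start:.2f} -> {end:.2f}: fix about {removed:.0f} error pixels per side")
```

With N = 1000, moving from 0.50 to 0.55 means correcting about 43 pixels per side, while moving from 0.90 to 0.95 means correcting about 27. The second step involves fewer pixels, but it removes roughly half of the remaining error, which is why it demands much tighter boundary alignment.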

IoU Calculation

Calculating IoU for segmentation masks requires a systematic approach that handles the specifics of mask representation, coordinate systems, and edge cases.

Bounding Box IoU Calculation

For bounding box-based predictions (used in object detection or as region proposals), IoU is calculated using the axis-aligned rectangle coordinates. Each box is defined by its top-left corner (x₁, y₁) and bottom-right corner (x₂, y₂). The intersection rectangle is computed as:

x_left   = max(pred_x₁, gt_x₁)
y_top    = max(pred_y₁, gt_y₁)
x_right  = min(pred_x₂, gt_x₂)
y_bottom = min(pred_y₂, gt_y₂)

The intersection area is:

intersection = max(0, x_right - x_left) × max(0, y_bottom - y_top)

The max(0, …) operation is critical — if the boxes do not overlap in either dimension, the negative value is clamped to zero, correctly yielding zero intersection area. The area of each bounding box is:

area_pred = (pred_x₂ - pred_x₁) × (pred_y₂ - pred_y₁)
area_gt   = (gt_x₂ - gt_x₁) × (gt_y₂ - gt_y₁)

The union is then:

union = area_pred + area_gt - intersection

And finally:

IoU = intersection / union
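The steps above can be collected into one function. This is a minimal sketch assuming boxes are given as (x1, y1, x2, y2) tuples with x2 > x1 and y2 > y1; the `Box` alias and the function name are illustrative.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2): top-left, bottom-right

def box_iou(pred: Box, gt: Box) -> float:
    """IoU of two axis-aligned boxes; NaN if both boxes have zero area."""
    x_left = max(pred[0], gt[0])
    y_top = max(pred[1], gt[1])
    x_right = min(pred[2], gt[2])
    y_bottom = min(pred[3], gt[3])

    # Clamp each side length to zero so disjoint boxes give zero intersection.
    intersection = max(0.0, x_right - x_left) * max(0.0, y_bottom - y_top)

    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - intersection
    return intersection / union if union > 0 else float("nan")

print(f"{box_iou((0, 0, 10, 10), (5, 5, 15, 15)):.4f}")    # 25 / 175 = 0.1429
print(f"{box_iou((0, 0, 10, 10), (20, 20, 30, 30)):.4f}")  # disjoint: 0.0000
```

Clamping each dimension separately matters: when the boxes are disjoint in both x and y, the two negative side lengths would otherwise multiply into a positive, spurious intersection.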

Segmentation Mask IoU Calculation

For pixel-level segmentation masks, IoU is computed by treating each mask as a binary image where pixel value 1 indicates the object class and pixel value 0 indicates background. The calculation proceeds via element-wise logical operations:

intersection = sum(pred_mask AND gt_mask)   // pixel-wise logical AND
union = sum(pred_mask OR gt_mask)           // pixel-wise logical OR
IoU = intersection / union

In practice, implementations using NumPy or PyTorch vectorize this as:

import torch

def compute_iou(pred_mask: torch.Tensor, gt_mask: torch.Tensor) -> float:
    """
    Compute IoU between two binary masks of shape (H, W).

    Values are 0 (background) or 1 (object). Both masks are cast to
    bool first, so float or integer inputs work with & and |.
    """
    pred = pred_mask.bool()
    gt = gt_mask.bool()
    intersection = (pred & gt).sum().item()  # pixels set in both masks
    union = (pred | gt).sum().item()         # pixels set in either mask
    if union == 0:
        return float('nan')  # both masks empty: IoU is undefined
    return intersection / union

The union == 0 edge case occurs when both the prediction and ground truth are empty (neither mask contains any object pixels). In this degenerate case the function above returns NaN, and benchmarks handle the situation differently. The COCO evaluation protocol excludes such cases from the mean. The TorchMetrics MeanIoU implementation (TorchMetrics is maintained by the PyTorch Lightning team) returns -1.0 for classes that are completely absent from both prediction and ground truth.

Handling Multi-Class Segmentation

For multi-class segmentation (e.g., a model that simultaneously predicts crack, sealant, pavement, and joint classes), IoU is computed per class and then averaged. For each class c, a pair of binary masks is constructed: pixel (i,j) of the prediction mask is 1 if the predicted class equals c, and pixel (i,j) of the ground truth mask is 1 if the annotated class equals c; all other pixels are 0. The per-class IoU is:

IoU_c = TP_c / (TP_c + FP_c + FN_c)

For a segmentation problem with C classes, the Mean IoU (mIoU) is:

mIoU = (1/C) × Σ_{c=1}^{C} IoU_c

An important consideration is whether to include the background class (pixels belonging to no object) in the mIoU calculation. Including background typically inflates mIoU because background dominates the pixel count and segmentation models usually perform well on it. Excluding background gives a more honest assessment of object segmentation quality. The include_background parameter in TorchMetrics’ MeanIoU implementation controls this behavior.

An alternative to mIoU is Frequency Weighted IoU (FWIoU), which weights each class’s IoU by its frequency in the ground truth:

FWIoU = (1 / Σ_c n_c) × Σ_c (n_c × IoU_c)

where n_c is the number of ground truth pixels belonging to class c. FWIoU gives more weight to larger classes and less to rare classes. For crack segmentation on pavements where crack pixels constitute less than 1 percent of the total, FWIoU would obscure crack performance behind pavement performance. mIoU (unweighted mean) is strongly preferred for infrastructure inspection because it treats the rare but critical defect classes equally with the majority classes.
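The difference between the two averages is easy to see on a toy case. The per-class IoU values and pixel counts below are hypothetical, chosen only to mimic a pavement-dominated image; the function names are illustrative.

```python
def mean_iou(ious):
    """Unweighted mean of per-class IoU values."""
    return sum(ious) / len(ious)

def frequency_weighted_iou(ious, gt_pixel_counts):
    """Per-class IoU weighted by each class's share of ground truth pixels."""
    total = sum(gt_pixel_counts)
    return sum(n * iou for n, iou in zip(gt_pixel_counts, ious)) / total

# Hypothetical two-class case: pavement (99% of pixels) and crack (1%).
ious = [0.98, 0.40]
counts = [990_000, 10_000]

print(f"mIoU:  {mean_iou(ious):.4f}")                        # 0.6900
print(f"FWIoU: {frequency_weighted_iou(ious, counts):.4f}")  # 0.9742
```

In this example the weak crack IoU of 0.40 pulls mIoU down to 0.69 but barely moves FWIoU, which stays above 0.97.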

Numerical Precision Considerations

When IoU is used as a training loss function (differentiable IoU variants), numerical precision becomes important. Standard IoU is not differentiable because it is computed on hard, thresholded masks: the thresholding step that feeds the logical AND and OR has zero gradient everywhere except at the threshold, where the gradient is undefined. Differentiable surrogates (soft IoU, also called IoU loss) use continuous approximations:

soft_IoU = Σ(p_i × g_i) / Σ(p_i + g_i - p_i × g_i)

where p_i ∈ [0,1] is the softmax probability for pixel i and g_i ∈ {0,1} is the ground truth label. The product p_i × g_i is a continuous relaxation of the logical AND, and the sum p_i + g_i - p_i × g_i is a continuous relaxation of the logical OR.

For evaluation, which does not need gradients, the probability map is binarized with a threshold (typically 0.5 on the softmax probability) and IoU is computed on the resulting masks. Binary cross-entropy remains the standard training objective for most crack segmentation models, with IoU serving as the evaluation metric computed post hoc.
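The contrast between the soft and thresholded forms can be sketched in plain Python on a flattened list of pixels. The `eps` term is an assumption added here to avoid division by zero and is not part of the formula above; the function names and sample values are illustrative.

```python
def soft_iou(probs, labels, eps=1e-7):
    """Soft IoU: p*g relaxes AND, p + g - p*g relaxes OR."""
    inter = sum(p * g for p, g in zip(probs, labels))
    union = sum(p + g - p * g for p, g in zip(probs, labels))
    return (inter + eps) / (union + eps)

def hard_iou(probs, labels, threshold=0.5):
    """Evaluation-time IoU on masks binarized at the given threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    inter = sum(p & g for p, g in zip(preds, labels))
    union = sum(p | g for p, g in zip(preds, labels))
    return inter / union if union else float("nan")

probs = [0.9, 0.8, 0.6, 0.2, 0.1]   # hypothetical per-pixel object probabilities
labels = [1, 1, 0, 1, 0]            # ground truth

print(f"soft IoU: {soft_iou(probs, labels):.4f}")  # 1.9 / 3.7 = 0.5135
print(f"hard IoU: {hard_iou(probs, labels):.4f}")  # 2 / 4   = 0.5000
```

A training loss would typically minimize 1 - soft_iou. When every probability is exactly 0 or 1, the soft and hard values coincide.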

IoU vs Dice Coefficient (F1 for Segmentation)

The Dice Similarity Coefficient (DSC), also called the Sørensen-Dice index and equivalent to the F1-score for binary segmentation, is the most common alternative to IoU for segmentation evaluation. Understanding the relationship between these two metrics is essential for correct interpretation of segmentation accuracy.

Mathematical Relationship

The Dice Coefficient is defined as:

DSC = 2|A ∩ B| / (|A| + |B|)

In terms of confusion matrix elements:

DSC = 2TP / (2TP + FP + FN)

The mathematical relationship between DSC and IoU is bijective and monotonic:

DSC = 2 × IoU / (1 + IoU)

IoU = DSC / (2 - DSC)

This means that for any given segmentation, DSC can be computed directly from IoU and vice versa. The conversion is exact and deterministic — there is no information loss in converting between the two metrics.
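The two conversion formulas can be written as a pair of helper functions; the names are illustrative.

```python
def iou_to_dice(iou: float) -> float:
    """DSC = 2 * IoU / (1 + IoU)."""
    return 2.0 * iou / (1.0 + iou)

def dice_to_iou(dice: float) -> float:
    """IoU = DSC / (2 - DSC)."""
    return dice / (2.0 - dice)

for iou in (0.30, 0.50, 0.75):
    dice = iou_to_dice(iou)
    print(f"IoU {iou:.3f} -> DSC {dice:.3f} -> IoU {dice_to_iou(dice):.3f}")
```

The conversion is exact for a single pair of masks. Because the mapping is non-linear, it does not carry over to averages: the mean of per-image DSC values generally differs from the DSC obtained by converting the mean per-image IoU.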

Key Differences in Practice

Despite the exact mathematical relationship, DSC and IoU differ systematically in their numerical values and interpretation. The key properties are:

DSC ≥ IoU for all imperfect segmentations. The inequality is strict except at DSC = IoU = 1 (perfect) and DSC = IoU = 0 (no overlap). This is because the denominator in DSC adds TP twice (2TP + FP + FN) while IoU adds TP only once (TP + FP + FN), making DSC’s denominator smaller relative to its numerator. For example, a segmentation with TP=80, FP=20, FN=20 yields:

  • IoU = 80 / (80 + 20 + 20) = 80/120 = 0.667
  • DSC = (2×80) / (2×80 + 20 + 20) = 160/200 = 0.800

DSC is more “optimistic.” For the same segmentation quality, DSC reports a higher numerical value than IoU. The absolute gap is largest at moderate overlap levels (it peaks near IoU ≈ 0.41) and vanishes at both extremes. A segmentation with IoU=0.5 corresponds to DSC=0.667. A segmentation with IoU=0.75 corresponds to DSC=0.857.

IoU is stricter on false positives and false negatives. Because IoU’s denominator includes TP only once while DSC double-counts TP, IoU penalizes every FP and FN more heavily relative to the correct overlap. This makes IoU more sensitive to over-segmentation and under-segmentation errors.

DSC emphasizes overlap maximization. The extra weight on TP means that DSC rewards maximizing the intersection area even at the cost of slightly increasing the union. This makes DSC preferable when the primary goal is to ensure the predicted region covers as much of the ground truth as possible, at the expense of some over-prediction.

Practical Guidance for Selection

The choice between IoU and DSC depends on the application requirements:

Criterion | Prefer IoU | Prefer DSC
Penalizing over-/under-segmentation | ✓ Stricter | More lenient
Small object segmentation | More sensitive to errors | ✓ Less harsh penalty
Boundary accuracy assessment | ✓ Better reflects errors | Masks boundary issues
Standard benchmark comparison | ✓ Pascal VOC, COCO standard | Medical imaging standard
Training loss function | Not differentiable | ✓ Differentiable surrogate
Communication to non-technical stakeholders | Can seem harsh | ✓ More intuitive values

For infrastructure inspection and crack segmentation, IoU is the preferred primary metric because it provides a stricter, more honest assessment of segmentation quality. A model that over-predicts crack width (producing FP crack pixels at edges) will have its IoU penalized more than its DSC, and this stricter penalty correctly reflects the operational requirement that crack detection must be spatially precise — over-predicting crack area leads to unnecessary maintenance costs, while under-predicting leads to missed defects.

TarmacView reports IoU as the primary crack segmentation accuracy metric, with the DINOv3-based model achieving a test IoU of 0.519 on held-out airfield pavement imagery. This is complemented by reporting precision, recall, and Dice coefficient for a comprehensive evaluation picture.

IoU Thresholds (0.3, 0.5, 0.75)

IoU thresholds define the minimum overlap required for a prediction to be considered a true positive (correct detection). The choice of threshold has profound implications for reported model performance and determines whether a segmentation system is evaluated as lenient or strict.

The Pascal VOC Standard: IoU ≥ 0.5

The Pascal VOC challenge established IoU ≥ 0.5 as the threshold for considering a detection or segmentation as correct. Average Precision computed at this threshold is denoted AP50, and it became the default in computer vision for over a decade. The reasoning was pragmatic: a predicted region whose intersection with the ground truth covers at least half of their union is clearly detecting the correct object, even if its boundaries are imperfect.

For crack segmentation on pavements, an IoU threshold of 0.5 means that the intersection of the predicted and ground truth crack masks must cover at least 50 percent of their union to count as a correct detection. Given that cracks are thin, elongated structures, this threshold requires that the model capture the crack’s full path with reasonable thickness accuracy. It is the standard threshold used in academic publications on pavement crack detection and in TarmacView’s internal evaluation pipeline.

The Strict COCO Standard: IoU ∈ [0.5, 0.95]

The COCO benchmark introduced a more rigorous evaluation protocol, averaging Average Precision over ten IoU thresholds from 0.5 to 0.95 in steps of 0.05, denoted as AP@[0.5:0.95] or simply mAP. This multi-threshold approach provides a more comprehensive assessment of localization quality than any single threshold can.
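The ten-threshold grid is simple to generate. Full COCO AP also requires confidence-ranked predictions and precision-recall integration, which are omitted here. As a simplified stand-in that is not AP, the sketch below reports the share of matched instances whose IoU survives each threshold; the IoU values and function name are hypothetical.

```python
# Ten IoU thresholds from 0.50 to 0.95 in steps of 0.05.
thresholds = [round(0.50 + 0.05 * k, 2) for k in range(10)]

def hit_rate(ious, threshold):
    """Fraction of instances whose IoU meets the threshold (not AP)."""
    return sum(iou >= threshold for iou in ious) / len(ious)

# Hypothetical per-instance IoU values for matched predictions.
ious = [0.42, 0.55, 0.61, 0.78, 0.83, 0.91]

rates = {t: hit_rate(ious, t) for t in thresholds}
for t in thresholds:
    print(f"IoU >= {t:.2f}: {rates[t]:.2f}")
print(f"mean over thresholds: {sum(rates.values()) / len(rates):.3f}")
```

The share of surviving instances falls as the threshold rises, which is the behavior the multi-threshold average is designed to expose.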

For infrastructure inspection, the COCO-style evaluation is particularly informative because it reveals how a model performs across precision requirements. A model with high AP50 but low AP75 may detect cracks reliably but predict them with poor boundary accuracy, systematically overestimating or underestimating crack width. A model with strong AP75 demonstrates precise boundary alignment, which is critical for applications requiring accurate crack width measurement, a key input to Pavement Condition Index (PCI) calculations per ASTM D5340 and FAA AC 150/5380-6C.

High Precision Threshold: IoU ≥ 0.75

The AP75 metric (Average Precision at IoU=0.75) evaluates a model’s ability to produce predictions with tight boundary alignment. For crack segmentation, AP75 requires that the intersection of the predicted crack mask and the ground truth cover at least 75 percent of their union, which demands that the model capture both the crack path and its thickness with high accuracy.

AP75 is the relevant threshold for applications requiring crack width quantification rather than mere crack presence detection. Runway pavement condition assessment per ICAO standards and FAA Advisory Circulars often requires crack width classification (hairline < 3mm, medium 3-6mm, severe > 6mm), and accurate width measurement requires precise boundary segmentation. A model that scores poorly at AP75 may correctly locate cracks but over- or under-estimate their severity.

IoU Threshold | Standard | Interpretation for Crack Segmentation
0.30 | Minimum useful overlap | Marginal detection; crack location approximately correct, but shape/thickness inaccurate
0.50 | Pascal VOC (AP50) | Standard detection; crack path and approximate thickness correct
0.75 | COCO strict (AP75) | Precise segmentation; crack boundaries closely match ground truth within 1-3 pixels
0.50:0.95 | COCO primary (mAP) | Comprehensive evaluation across all precision levels

IoU = 0.3 — The Minimum Meaningful Overlap

While not a formal benchmark standard, IoU=0.3 is sometimes used as a lenient detection threshold in noisy or ambiguous annotation contexts. For pavement cracks, human inter-annotator agreement on individual crack pixels typically reaches only IoU=0.65-0.85 for well-defined cracks and can drop to IoU=0.30-0.50 for hairline cracks on textured asphalt surfaces. In these cases, the inherent annotation uncertainty sets a practical floor on what IoU value constitutes meaningful detection.

An IoU threshold of 0.3 for crack segmentation indicates that the model has correctly identified the approximate location of a crack but with substantial shape, thickness, or continuity errors. This may be acceptable for coarse defect screening (flagging “possible crack — verify manually”) but is insufficient for automated severity assessment.

Mean IoU (mIoU) for Multi-Class

Mean Intersection over Union (mIoU) extends the single-class IoU metric to multi-class segmentation problems by computing the arithmetic mean of per-class IoU scores. This is the standard evaluation metric for semantic segmentation across all major benchmarks.

mIoU Calculation

For a segmentation task with C classes, mIoU is computed as:

mIoU = (1/C) × Σ_{c=1}^{C} IoU_c

where IoU_c is the Intersection over Union for class c, computed by treating class c as the positive class and all other classes as negative (one-vs-rest).

The per-class IoU calculation for class c uses:

  • TP_c — pixels correctly predicted as class c
  • FP_c — pixels incorrectly predicted as class c (they belong to some other class)
  • FN_c — pixels belonging to class c but predicted as some other class

The confusion matrix for a C-class segmentation problem is a C×C matrix where M_ij is the count of pixels belonging to ground truth class i that were predicted as class j. Per-class IoU is then:

IoU_i = M_ii / (Σ_j M_ij + Σ_j M_ji - M_ii)

The numerator M_ii (diagonal element) is the true positive count for class i. The row sum Σ_j M_ij is the total ground truth area for class i (TP + FN). The column sum Σ_j M_ji is the total prediction area for class i (TP + FP).
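The confusion-matrix formula translates directly into code. The sketch below uses a nested list with rows as ground truth classes and columns as predicted classes, matching the M_ij convention above; the matrix values and function name are hypothetical.

```python
def per_class_iou(confusion):
    """Per-class IoU from a C x C confusion matrix (rows = ground truth)."""
    n = len(confusion)
    ious = []
    for i in range(n):
        tp = confusion[i][i]
        gt_total = sum(confusion[i])                         # row sum: TP + FN
        pred_total = sum(confusion[j][i] for j in range(n))  # column sum: TP + FP
        union = gt_total + pred_total - tp
        ious.append(tp / union if union else float("nan"))
    return ious

# Hypothetical 3-class confusion matrix (pixel counts).
M = [
    [950, 30, 20],
    [10, 40, 0],
    [5, 0, 45],
]

ious = per_class_iou(M)
miou = sum(ious) / len(ious)
print([f"{v:.3f}" for v in ious])
print(f"mIoU: {miou:.3f}")
```

For class 1 in this example, TP = 40, the row sum is 50 and the column sum is 70, so IoU_1 = 40 / (50 + 70 - 40) = 0.5.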

mIoU for Crack Segmentation

For airport pavement crack segmentation, a typical multi-class segmentation problem includes:

  • Class 0 — Background: Intact pavement surface, no defects
  • Class 1 — Crack: All crack types (longitudinal, transverse, alligator, reflection)
  • Class 2 — Joint/Crack Sealant: Filled or partially filled control joints
  • Class 3 — Spall/Weathering: Surface deterioration distinct from cracks

Each class has its own IoU score. For TarmacView’s evaluation on airfield pavement datasets, the per-class IoU scores demonstrate the relative difficulty of each defect type. The mIoU gives an aggregate measure of model quality across all surface conditions.

mIoU Sensitivity to Class Balance

A critical property of mIoU is that it treats each class equally regardless of pixel count. A class that occupies 60 percent of pixels (background) and a class that occupies 0.5 percent of pixels (crack) contribute equally to the final mIoU score. This is both a strength and a weakness:

Strength: mIoU prevents the majority class from dominating the metric. In a two-class setting (background vs. crack) where cracks cover well under 1 percent of pixels, a model that perfectly segments pavement background but fails entirely on cracks would achieve 99 percent+ pixel accuracy but a mIoU of only ~0.5 (background IoU near 1, crack IoU near zero). mIoU honestly reflects the model’s inability to detect cracks, while pixel accuracy would misleadingly suggest near-perfect performance.

Weakness: The equal weighting means that small, noisy classes (classes with few pixels) can have high variance in their IoU estimates. A single small mis-segmented region can drastically reduce a rare class’s IoU, making mIoU volatile when the test set contains few examples of rare defects.
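The contrast between pixel accuracy and mIoU can be reproduced with a toy confusion matrix. The numbers below are hypothetical: a 1,000-pixel image with 5 crack pixels and a model that predicts background everywhere. The function name is illustrative.

```python
def class_iou(confusion, i):
    """IoU for class i; rows are ground truth, columns are predictions."""
    tp = confusion[i][i]
    gt_total = sum(confusion[i])
    pred_total = sum(row[i] for row in confusion)
    union = gt_total + pred_total - tp
    return tp / union if union else float("nan")

confusion = [
    [995, 0],  # ground truth background: all predicted as background
    [5, 0],    # ground truth crack: all missed (predicted as background)
]

total = sum(sum(row) for row in confusion)
pixel_accuracy = sum(confusion[i][i] for i in range(2)) / total
miou = sum(class_iou(confusion, i) for i in range(2)) / 2

print(f"pixel accuracy: {pixel_accuracy:.4f}")  # 0.9950
print(f"mIoU:           {miou:.4f}")            # 0.4975
```

Pixel accuracy reports 99.5 percent even though no crack pixel was found, while mIoU drops to just under 0.5 because the crack class contributes an IoU of zero.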

Frequency-Weighted IoU

As an alternative, Frequency Weighted IoU (FWIoU) weights each class’s IoU by its frequency in the ground truth:

FWIoU = (1 / Σ_c n_c) × Σ_c (n_c × IoU_c)

where n_c is the number of ground truth pixels for class c. FWIoU produces higher scores when the model performs well on majority classes, making it more forgiving of poor minority-class performance. FWIoU is less commonly reported in academic benchmarks but may be used as a supplementary metric when the primary concern is overall pixel-wise accuracy across the full image.

For infrastructure inspection, mIoU is strongly preferred over FWIoU because:

  • Crack defects are the class of primary operational interest
  • Crack pixels constitute a tiny fraction (0.5-2 percent) of total image area
  • FWIoU would obscure crack segmentation failures behind background performance
  • Regulatory inspection standards (ICAO Annex 14, FAA ACs) prioritize defect detection specifically

IoU in TarmacView Crack Segmentation

TarmacView’s crack segmentation system uses a DINOv2-based backbone with a specialized segmentation head designed for infrastructure defect detection. The model achieves a test IoU of 0.519 on held-out airfield pavement datasets, validated across diverse capture conditions including lighting variation, pavement texture differences, and multiple crack types.

Model Architecture and IoU Performance

The segmentation architecture consists of:

  • Backbone: DINOv2 (Vision Transformer, ViT-B or ViT-L) pre-trained via self-supervised learning on a curated dataset of 142 million images. DINOv2 provides rich, generalizable feature representations without requiring task-specific pre-training.
  • Segmentation Head: A lightweight decoder that maps DINOv2’s patch-level features to pixel-level segmentation masks. TarmacView’s custom head uses a multi-scale feature aggregation design that combines features from multiple transformer blocks to capture both fine crack details (from early layers) and global context (from later layers).
  • Output: A per-pixel probability map of shape H×W×2 (crack vs. background), thresholded at 0.5 to produce the final binary crack mask used for IoU calculation.

The test IoU of 0.519 is the crack-class IoU averaged over the held-out test set. This value positions TarmacView’s model in the good-to-excellent range for pavement crack segmentation, where IoU values of 0.45-0.60 are typical for state-of-the-art models on challenging real-world datasets.

Comparison with Published Benchmarks

Direct comparison of IoU across different crack segmentation studies is complicated by differences in datasets, annotation protocols, and crack definitions. However, published results on similar pavement crack datasets provide context:

Model | Dataset | Test IoU (Crack Class)
TarmacView DINOv3 | Airfield runway (proprietary) | 0.519
DeepCrack (2019) | CFD (public road) | 0.420-0.465
U-Net + ResNet-50 | Crack500 (public road) | 0.475-0.510
HRNet-FCN | UAV pavement (academic) | 0.498-0.530
SegFormer-B3 | CrackTree200 (public) | 0.485-0.520
U-Net (drone runway) | Runway pavement (2020) | 0.415-0.472

TarmacView’s IoU of 0.519 is competitive with the best published results while operating on real airport runway data with higher annotation variability and more diverse surface conditions than the controlled datasets used in academic research. The model benefits from DINOv2’s strong representation learning and a domain-specific training regime that includes aggressive data augmentation and class-balancing strategies.

Factors Contributing to IoU Score

Several factors influence the 0.519 test IoU:

Mask thickness tolerance. The ground truth annotations for the airfield pavement dataset define cracks with a consistent thickness of approximately 3-5 pixels at the annotation resolution. The model’s predicted masks may be 2-8 pixels wide at different crack segments. Each pixel difference from the ground truth thickness adds to either FP or FN, reducing IoU.

Edge pixel sensitivity. For a typical crack occupying 5,000 pixels in a 512×512 image, the crack edge (pixels at the boundary between crack and pavement) constitutes approximately 400-600 pixels. If the model’s prediction edge is even 1 pixel offset from the ground truth edge along the full crack path, the resulting FP+FN total of 400-600 pixels can reduce IoU by 0.05-0.10.

Hairline crack challenge. Hairline cracks (width < 0.3mm, corresponding to 1-3 pixels at capture resolution) constitute approximately 30 percent of the test set defects. For these cracks, the total ground truth area is very small (100-500 pixels), so any misalignment has a disproportionately large effect on IoU. Hairline crack IoU averages 0.320-0.380, significantly lower than the 0.550-0.650 achieved on medium and wide cracks.

Annotation consistency. The inter-annotator IoU on the training and test data is approximately 0.72 (two independent expert annotators on the same images). This sets the practical upper bound on achievable model IoU — even a perfect model cannot exceed the consistency of its ground truth, establishing a noise ceiling of approximately 0.72-0.78.

Practical Utility of the IoU Score

A test IoU of 0.519 means that on the average test image, the intersection between predicted and ground truth crack masks is approximately 51.9 percent of their union. In operational terms:

  • Crack detection rate: Approximately 75-85 percent of crack pixels are correctly identified (true positive rate), with variation by crack type and width.
  • False positive rate: Approximately 0.1-0.5 percent of background pixels are incorrectly classified as crack, which translates to 250-1,250 false positive crack pixels per 512×512 image.
  • Crack continuity: The model correctly identifies crack presence in 90+ percent of image frames containing measurable cracks (>1mm width), with some fragmentation (a single continuous crack predicted as multiple short segments).
  • Severity classification: The model’s crack width estimates correlate with ground truth width measurements at R²=0.62-0.74, sufficient for coarse severity classification (hairline vs. medium vs. severe) per ASTM D5340 standards.

Factors Affecting IoU

Multiple factors influence the IoU scores achievable by a crack segmentation model. Understanding these factors is essential for interpreting reported IoU values, diagnosing performance issues, and setting realistic accuracy targets.

Mask Thickness Tolerance

Crack segmentation annotations are typically created by drawing a line or polygon along the crack path, then assigning a fixed width to represent the crack area. The assigned width varies between annotation protocols — some use 3-pixel-wide lines, others use 5-pixel or dynamically scaled widths based on actual crack dimensions.

The model’s predicted mask thickness rarely matches the ground truth thickness exactly. If the annotation protocol assigns 3-pixel width but the model predicts 5-pixel width, every pixel beyond the 3-pixel annotation boundary becomes a false positive. For a 1000-pixel-long crack, the extra 1 pixel on each side produces approximately 2,000 false positive pixels against 3,000 ground truth pixels. Even with every ground truth pixel covered, thickness mismatch alone caps IoU at 3,000/5,000 = 0.60.

The optimal strategy for maximizing IoU is to train with a thickness-aware loss function that penalizes thickness deviation, or to apply post-processing morphological operations (erosion or dilation) that adjust predicted mask thickness to match the annotation standard. TarmacView applies a post-processing step using a learned dilation factor calibrated on the validation set, which improves test IoU by 0.02-0.04.

Edge Pixel Sensitivity

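Edge pixels (the boundary layer between crack and pavement) are the dominant source of IoU reduction in well-performing models. For a crack mask of area A with perimeter P, the number of edge pixels is approximately P (the boundary length in pixels). The simplified model used here assumes a baseline FP+FN of about P pixels from annotation thickness uncertainty, plus roughly P more for each pixel of average edge offset, and approximates IoU as A / (A + FP + FN). If the model’s edge alignment is off by an average of 1 pixel, the resulting FP+FN is therefore roughly 2P pixels.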

For a typical airport pavement crack with A=5,000 pixels and P=800 pixels:

  • Perfect edge alignment (0 offset): FP+FN from edges = 800 pixels (the annotation thickness uncertainty)
  • 1-pixel edge offset: FP+FN from edges = 1,600 pixels
  • 2-pixel edge offset: FP+FN from edges = 2,400 pixels

The IoU impact is:

  • Zero offset: IoU = 5000/(5000+800) = 0.862 (annotation ceiling)
  • 1-pixel offset: IoU = 5000/(5000+1600) = 0.758
  • 2-pixel offset: IoU = 5000/(5000+2400) = 0.676
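
The figures above follow from a simple boundary-error model, sketched here for convenience. It assumes, as the text does, that the intersection stays at A while FP+FN grows by P for the annotation uncertainty plus P for each pixel of edge offset. The function name `edge_model_iou` is hypothetical.

```python
def edge_model_iou(area_px: float, perimeter_px: float, offset_px: float) -> float:
    """Approximate IoU under the simplified boundary-error model.

    FP + FN is modeled as P at zero offset (annotation thickness
    uncertainty) plus an additional P for every pixel of edge offset.
    """
    boundary_error = perimeter_px * (1 + offset_px)
    return area_px / (area_px + boundary_error)


if __name__ == "__main__":
    for offset in (0, 1, 2):
        print(f"{offset}-pixel offset: IoU = {edge_model_iou(5000, 800, offset):.3f}")
```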

This analysis reveals that edge pixel alignment is the single most important factor separating an IoU of 0.50 from an IoU of 0.75 for crack segmentation. Models that achieve precise boundary alignment through high-resolution feature maps and post-processing refinement consistently outperform those that capture crack location but not crack boundary.

Small Crack Challenge

Small cracks (those with a total pixel count below approximately 500 pixels in the evaluation image) present a fundamental challenge for IoU-based evaluation. For a 200-pixel hairline crack, misclassifying just 5-10 pixels on each side adds 10-20 pixels of FP+FN (5-10 percent of the crack area). IoU is therefore far more sensitive to a given absolute error for small objects than for large ones.

Using the same 1-pixel edge offset analysis for a small crack (A=200, P=80):

  • Zero offset: IoU = 200/(200+80) = 0.714 (annotation ceiling)
  • 1-pixel offset: IoU = 200/(200+160) = 0.556
  • 2-pixel offset: IoU = 200/(200+240) = 0.455

The IoU penalty for the same 1-pixel edge offset is 0.158 for the small crack (0.714 to 0.556) versus 0.104 for the large crack (0.862 to 0.758). This scale sensitivity means that aggregating IoU across cracks of all sizes without size-stratified reporting can obscure performance differences — a model that performs well on large cracks but poorly on small cracks may show acceptable aggregate IoU while missing the most critical safety-relevant hairline cracks.

For this reason, TarmacView reports IoU stratified by crack width category:

  • Hairline cracks (< 1mm width): IoU = 0.32-0.38
  • Medium cracks (1-3mm width): IoU = 0.48-0.55
  • Wide cracks (> 3mm width): IoU = 0.55-0.65
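
A minimal sketch of size-stratified reporting is shown below. It pools TP, FP, and FN counts within each width category and compares the result with the aggregate over all cracks. The record layout, the helper names, and the example counts are hypothetical and chosen only to illustrate how a large crack can dominate the aggregate.

```python
from collections import defaultdict


def pooled_iou(records):
    """Pooled IoU: sum TP, FP, FN over the records, then divide once."""
    tp = sum(r["tp"] for r in records)
    fp = sum(r["fp"] for r in records)
    fn = sum(r["fn"] for r in records)
    denom = tp + fp + fn
    return tp / denom if denom else float("nan")


def stratified_iou(records):
    """Group per-crack records by their 'category' key and pool within each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record["category"]].append(record)
    return {category: pooled_iou(items) for category, items in groups.items()}


if __name__ == "__main__":
    # Hypothetical per-crack pixel counts, not measured data.
    cracks = [
        {"category": "hairline", "tp": 70, "fp": 60, "fn": 70},
        {"category": "wide", "tp": 3000, "fp": 1000, "fn": 1000},
    ]
    print("aggregate:", round(pooled_iou(cracks), 3))
    for category, iou in stratified_iou(cracks).items():
        print(category, round(iou, 3))
```

With these illustrative counts the aggregate sits close to the wide-crack score, while the hairline score is far lower, which is the effect the stratified report is meant to expose.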

Annotation Quality

The quality and consistency of ground truth annotations set a hard upper bound on achievable IoU. Multiple studies have documented inter-annotator agreement for pavement crack segmentation:

  • Same annotator, same image, different session: IoU = 0.78-0.88
  • Different annotators, same image, same protocol: IoU = 0.62-0.78
  • Different annotators, same image, different protocol: IoU = 0.45-0.65

This means that even a theoretically perfect model cannot exceed approximately 0.78-0.88 IoU on typical infrastructure inspection datasets, because the ground truth itself is inconsistent. Active learning and annotation consensus techniques can improve annotation quality by having multiple annotators label each image and using majority voting or expert adjudication to resolve disagreements.
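
As a rough sketch of how agreement and consensus can be computed, the example below represents each annotator's mask as a set of (row, column) pixel coordinates, averages the pairwise IoU, and builds a majority-vote consensus mask. The set-based representation and all function names are assumptions made for illustration; production pipelines would more likely operate on image arrays.

```python
from collections import Counter
from itertools import combinations


def mask_iou(a: set, b: set) -> float:
    """IoU between two masks given as sets of (row, col) pixels."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_iou(masks) -> float:
    """Average IoU over every pair of annotators (needs at least two masks)."""
    pairs = list(combinations(masks, 2))
    return sum(mask_iou(a, b) for a, b in pairs) / len(pairs)


def majority_vote(masks) -> set:
    """Keep pixels marked as crack by more than half of the annotators."""
    votes = Counter(pixel for mask in masks for pixel in mask)
    needed = len(masks) // 2 + 1
    return {pixel for pixel, count in votes.items() if count >= needed}


if __name__ == "__main__":
    # Three hypothetical annotations of the same horizontal crack.
    a = {(r, c) for r in (9, 10, 11) for c in range(20)}
    b = {(r, c) for r in (10, 11, 12) for c in range(20)}
    c = {(r, c) for r in (10, 11) for c in range(20)}
    print("mean pairwise IoU:", round(mean_pairwise_iou([a, b, c]), 3))
    print("consensus pixels:", len(majority_vote([a, b, c])))
```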

Image Resolution and Capture Conditions

Image resolution directly affects IoU because it determines the number of pixels available to represent a crack. Assuming the same field of view, a crack that is 10 pixels wide at 20 MP resolution is only about 5 pixels wide at 5 MP resolution, because quartering the pixel count halves the linear resolution. At lower resolutions, the absolute pixel budget for the crack shrinks, making IoU more sensitive to misalignment.

The aircraft pavement inspection protocol specifies capture resolution requirements to ensure adequate defect resolution. ICAO guidance on aerodrome pavement inspections recommends minimum ground sampling distances (GSD) that ensure cracks as narrow as 0.5mm can be resolved. At a GSD of 0.2mm/pixel (typical for UAV inspection at 10m altitude), a 0.5mm hairline crack is only 2.5 pixels wide — barely at the threshold of reliable annotation and segmentation.
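
The pixel-width arithmetic can be sketched as follows. Both helper names are hypothetical, and the 3-pixel minimum used in the second call is only an example value, not a figure taken from ICAO guidance.

```python
def crack_width_px(width_mm: float, gsd_mm_per_px: float) -> float:
    """Width of a crack in pixels for a given ground sampling distance."""
    return width_mm / gsd_mm_per_px


def max_gsd_for_width(width_mm: float, min_pixels: float) -> float:
    """Coarsest GSD (mm/pixel) that still renders the crack at min_pixels wide."""
    return width_mm / min_pixels


if __name__ == "__main__":
    print(crack_width_px(0.5, 0.2))        # hairline crack at 0.2 mm/pixel
    print(max_gsd_for_width(0.5, 3.0))     # GSD needed for an example 3-pixel minimum
```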

Lighting conditions also influence IoU performance. Cracks on wet pavement have higher contrast but also produce specular reflections that can cause false positives. Cracks in shadow have lower contrast and may be partially invisible. Models trained with shadow and wet-surface augmentation (as TarmacView’s pipeline implements) show 0.03-0.06 higher test IoU on challenging lighting conditions compared to models trained without domain-specific augmentation.

Reporting IoU

Proper reporting of IoU metrics is essential for reproducibility, comparison across studies, and operational decision-making. The following guidelines constitute best practice for reporting IoU in infrastructure inspection research and deployment.

Essential Reporting Components

Report per-class IoU, not just mIoU. For crack segmentation, report the crack-class IoU explicitly alongside the mean IoU across all classes. A high mIoU can conceal poor crack performance if background or other defect classes dominate.

Specify the threshold. Report the binarization threshold used to convert model probability outputs to binary masks. The standard threshold is 0.5, but post-processing optimization may use different thresholds. A threshold sweep (reporting IoU at thresholds from 0.3 to 0.7 in 0.1 increments) provides a more complete picture of model behavior.
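
A threshold sweep of this kind can be sketched in a few lines. The example assumes flattened per-pixel probabilities and binary ground truth labels; the function names and the toy data are illustrative only.

```python
def iou_at_threshold(probs, truth, threshold: float) -> float:
    """Binarize probabilities at the threshold and compute crack-class IoU."""
    tp = fp = fn = 0
    for p, t in zip(probs, truth):
        predicted = p >= threshold
        if predicted and t:
            tp += 1
        elif predicted and not t:
            fp += 1
        elif t:
            fn += 1
    denom = tp + fp + fn
    return tp / denom if denom else 1.0


def threshold_sweep(probs, truth, thresholds=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """IoU at each threshold, returned as a {threshold: IoU} mapping."""
    return {t: iou_at_threshold(probs, truth, t) for t in thresholds}


if __name__ == "__main__":
    # Toy per-pixel probabilities and labels, not real model output.
    probs = [0.9, 0.8, 0.65, 0.55, 0.45, 0.35, 0.2, 0.1]
    truth = [1, 1, 1, 0, 1, 0, 0, 0]
    for threshold, iou in threshold_sweep(probs, truth).items():
        print(f"threshold {threshold:.1f}: IoU = {iou:.3f}")
```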

Report confidence intervals. IoU values estimated from a finite test set have sampling uncertainty. Report the 95 percent confidence interval using bootstrapping (resampling test images with replacement 1,000 times and computing IoU for each resample). A reported IoU of 0.519 with a 95 percent CI of [0.497, 0.541] is more informative than a point estimate.
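
The bootstrap procedure might look like the sketch below. It assumes per-image (TP, FP, FN) counts are available and that dataset IoU is pooled over images; averaging per-image IoU instead is an equally valid convention and should be stated in the report. The function names and the example counts are hypothetical.

```python
import random


def pooled_iou(counts) -> float:
    """Pooled IoU over (tp, fp, fn) tuples: sum the counts, then divide."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = tp + fp + fn
    return tp / denom if denom else float("nan")


def bootstrap_iou_ci(per_image_counts, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample images with replacement."""
    rng = random.Random(seed)
    n = len(per_image_counts)
    stats = sorted(
        pooled_iou([per_image_counts[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    tail = int(n_resamples * alpha / 2)  # resamples cut from each tail
    return stats[tail], stats[n_resamples - tail - 1]


if __name__ == "__main__":
    # Hypothetical per-image (tp, fp, fn) pixel counts.
    counts = [(520, 240, 250), (130, 90, 70), (800, 300, 420),
              (60, 55, 80), (410, 200, 160), (95, 60, 110)]
    low, high = bootstrap_iou_ci(counts)
    print(f"IoU = {pooled_iou(counts):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Resampling whole images rather than individual pixels respects the fact that pixels within one image are strongly correlated.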

Include annotation quality metrics. Report inter-annotator agreement (IoU between independent annotators on a subset of images) to establish the evaluation noise ceiling. This contextualizes model IoU — 0.519 looks different when the annotation ceiling is 0.72 versus 0.92.

Specify image resolution and preprocessing. Report the input resolution, whether images are downsampled before inference, and any normalization applied. IoU values at 512×512 input resolution are not directly comparable to IoU at 1024×1024.

Reporting Format

The recommended reporting table for crack segmentation IoU includes:

| Metric | Value | 95% CI | Notes |
|---|---|---|---|
| Crack IoU (threshold 0.5) | 0.519 | [0.497, 0.541] | Primary metric |
| Background IoU | 0.992 | [0.990, 0.994] | Expected to be near-perfect |
| mIoU (all classes) | 0.755 | [0.744, 0.767] | Includes background class |
| mIoU (defect classes only) | 0.519 | [0.497, 0.541] | Excludes background |
| Inter-annotator IoU | 0.723 | [0.701, 0.745] | Evaluation ceiling |

Complementary Metrics

IoU should not be reported in isolation. Complementary metrics provide a more complete accuracy assessment:

  • Precision (TP / (TP + FP)): Measures how many of the predicted crack pixels are actually crack. Low precision indicates over-prediction (false alarms).
  • Recall (TP / (TP + FN)): Measures how many ground truth crack pixels were detected. Low recall indicates missed cracks.
  • F1-Score (harmonic mean of precision and recall): Equivalent to Dice Coefficient. Provides a balanced single-score measure.
  • Pixel Accuracy: The fraction of all pixels that are correctly classified, (TP + TN) / total. Not recommended as a primary metric due to class imbalance bias.
  • Boundary F1: A variant that evaluates edge pixel accuracy specifically, relevant for crack width measurement applications.
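
The count-based metrics in this list can be computed together from one set of confusion-matrix counts, as in the sketch below. The function name and the example counts are hypothetical; the counts were chosen so that crack IoU equals 0.519 while pixel accuracy is still above 0.99, which illustrates the class imbalance bias.

```python
def segmentation_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, F1 (Dice), IoU and pixel accuracy from pixel counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    pixel_accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "iou": iou,
        "pixel_accuracy": pixel_accuracy,
    }


if __name__ == "__main__":
    # Hypothetical counts for an image where about 1% of pixels are crack.
    metrics = segmentation_metrics(tp=519, fp=240, fn=241, tn=99000)
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
    # Dice and IoU are related by Dice = 2 * IoU / (1 + IoU).
    print("dice from iou:", round(2 * metrics["iou"] / (1 + metrics["iou"]), 3))
```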

Reproducibility Checklist

To ensure IoU results are reproducible:

  1. Specify the exact test set composition (number of images, crack type distribution, resolution)
  2. Report the threshold for binarizing model outputs
  3. Specify whether the model was trained with data augmentation (and what augmentations)
  4. Report the annotation protocol (line width, crack definition criteria)
  5. Disclose any post-processing (morphological operations, threshold optimization on validation set)
  6. Report IoU at multiple binarization thresholds (for example 0.3, 0.5, 0.7) for comprehensive evaluation

IoU and Practical Utility

The ultimate purpose of IoU in infrastructure inspection is not academic comparison but operational decision-making. Understanding how IoU translates to practical utility — the ability to make maintenance decisions based on model outputs — is critical for deploying crack segmentation systems in the field.

IoU and Maintenance Decision Quality

The relationship between IoU and maintenance decision quality is not linear. A model with IoU=0.45 may produce crack maps that are operationally sufficient for certain use cases, while a model with IoU=0.55 may be operationally insufficient for others, depending on the downstream application.

Use Case 1 — Crack detection screening (identify which runway sections contain cracks): For this use case, recall (the fraction of cracks detected) is more important than precise boundary alignment. A model with IoU=0.40 but recall=0.85 may be more operationally useful than a model with IoU=0.55 but recall=0.70. The high-recall model flags more potential defects for human review, even if its boundary accuracy is lower.

Use Case 2 — Crack width classification (assign cracks to severity categories per ASTM D5340): For this use case, boundary accuracy is critical. A model with IoU=0.50 and systematic 2-pixel over-prediction may classify hairline cracks as medium-width cracks, triggering unnecessary maintenance. A model with IoU=0.45 but unbiased boundary prediction may produce more accurate width estimates.

Use Case 3 — Crack area quantification (measure total crack area for PCI calculation): For this use case, the absolute bias in predicted crack area matters more than per-pixel IoU. A model that consistently predicts 20 percent more crack area than ground truth (IoU affected by FP bias) will overestimate PCI deductions, leading to premature maintenance budgeting.
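
The distinction between IoU and area bias can be made concrete with a short sketch. The two sets of counts below are hypothetical and were picked so that both models reach nearly the same IoU while only one of them over-predicts area; the function name is likewise illustrative.

```python
def iou_and_area_bias(tp: int, fp: int, fn: int):
    """Return (IoU, relative area bias) from pixel counts.

    Relative area bias = (predicted area - true area) / true area,
    where predicted area = TP + FP and true area = TP + FN.
    """
    iou = tp / (tp + fp + fn)
    predicted_area = tp + fp
    true_area = tp + fn
    return iou, (predicted_area - true_area) / true_area


if __name__ == "__main__":
    # Hypothetical models evaluated against 100 ground truth crack pixels.
    over_predicting = iou_and_area_bias(tp=80, fp=40, fn=20)
    unbiased = iou_and_area_bias(tp=73, fp=27, fn=27)
    print("over-predicting: IoU=%.3f, area bias=%+.0f%%" % (over_predicting[0], 100 * over_predicting[1]))
    print("unbiased:        IoU=%.3f, area bias=%+.0f%%" % (unbiased[0], 100 * unbiased[1]))
```

Because false positives and false negatives cancel in the area estimate but add up in the IoU denominator, the two quantities need to be reported separately for area-based use cases.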

Mapping IoU to Operational Metrics

The relationship between IoU and practical detection quality can be quantified by converting IoU to detection rate (the percentage of cracks the model successfully identifies). For a binary crack segmentation system, if we define a crack as “detected” when the predicted mask overlaps the ground truth crack by at least 30 percent (the IoU threshold for minimum meaningful detection), the relationship follows:

| Model IoU | Approximate Crack Detection Rate (TPR) | False Positive Rate (FPR) |
|---|---|---|
| 0.30 | 60-70% | 0.3-0.7% |
| 0.40 | 70-80% | 0.2-0.5% |
| 0.50 | 78-86% | 0.1-0.3% |
| 0.60 | 85-92% | 0.05-0.2% |
| 0.70 | 90-96% | < 0.1% |

These ranges are approximate and depend on crack width distribution and image resolution. For TarmacView’s IoU of 0.519, the estimated crack detection rate is approximately 80-85 percent, meaning that 15-20 percent of cracks (primarily hairline cracks below 0.5mm width) escape automated detection and would require either improved model sensitivity or manual review.
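
A crack-level detection rate of this kind could be computed as sketched below. The sketch interprets the 30 percent criterion as the fraction of each ground truth crack's pixels covered by the predicted mask, which is one possible reading of the definition above and is an assumption here. Masks are represented as sets of (row, column) pixels, and the function name is hypothetical.

```python
def detection_rate(gt_cracks, predicted_pixels: set, min_overlap: float = 0.30) -> float:
    """Fraction of ground truth cracks whose pixels are covered by the
    prediction at or above min_overlap (assumed coverage criterion)."""
    detected = 0
    for crack in gt_cracks:
        covered = len(crack & predicted_pixels) / len(crack)
        if covered >= min_overlap:
            detected += 1
    return detected / len(gt_cracks)


if __name__ == "__main__":
    # Three hypothetical 10-pixel cracks on separate rows.
    cracks = [{(row, col) for col in range(10)} for row in (0, 5, 9)]
    # Prediction covers the first crack fully, 3 px of the second, 2 px of the third.
    prediction = set(cracks[0]) | {(5, 0), (5, 1), (5, 2)} | {(9, 0), (9, 1)}
    print(f"detection rate: {detection_rate(cracks, prediction):.2f}")
```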

Operational Deployment Thresholds

For automated crack detection on airfield pavements, the following operational thresholds are recommended:

Minimum acceptable IoU (screening): IoU ≥ 0.40. Below this threshold, the model produces too many false negatives (missed cracks) and false positives (false alarms) to be operationally useful. Screening at IoU=0.40 may still miss 20-30 percent of cracks, requiring substantial human review effort.

Standard deployment IoU: IoU ≥ 0.50. At this threshold, the model detects roughly 78-86 percent of cracks with acceptable false positive rates. This is the minimum recommended IoU for automated crack detection in a supervised inspection workflow where detected cracks are verified by humans.

High-confidence deployment IoU: IoU ≥ 0.60. At this threshold, the model detects 85-92 percent of cracks with very low false positive rates. This IoU level supports semi-automated severity classification, where crack width and extent measurements are trusted with minimal human verification.

Fully automated deployment IoU: IoU ≥ 0.70. At this threshold, the model approaches human-level annotation consistency (given inter-annotator IoU of 0.62-0.78). This IoU level supports fully automated PCI computation, where no human review of crack segmentation is required.

TarmacView’s DINOv3 crack segmentation head at IoU=0.519 operates in the standard deployment range, suitable for supervised crack detection workflows with human verification. Ongoing model improvements aim to reach the high-confidence deployment threshold (IoU ≥ 0.60) through architectural improvements, training data expansion, and post-processing optimization.
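
The tiers above can be encoded as a simple lookup, sketched here. The function name and the returned labels are hypothetical shorthand for the thresholds listed in this section, not an established classification scheme.

```python
def deployment_tier(crack_iou: float) -> str:
    """Map a crack-class IoU to the deployment tier described in this section."""
    if crack_iou >= 0.70:
        return "fully automated"
    if crack_iou >= 0.60:
        return "high-confidence"
    if crack_iou >= 0.50:
        return "standard"
    if crack_iou >= 0.40:
        return "screening"
    return "below minimum"


if __name__ == "__main__":
    for iou in (0.35, 0.45, 0.519, 0.62, 0.71):
        print(f"IoU {iou:.3f} -> {deployment_tier(iou)}")
```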

IoU and Regulatory Compliance

For AI-based crack detection systems used in aerodrome pavement inspection, the relationship between IoU and regulatory compliance must be established. While no aviation regulatory body (ICAO, FAA, EASA) currently specifies an IoU threshold for AI inspection systems, the general framework of acceptance follows from:

ISO/IEC 25010 (Systems and software Quality Requirements and Evaluation - SQuaRE): The functional correctness sub-characteristic requires that AI system outputs match ground truth within defined tolerances. IoU serves as the quantitative correctness metric for segmentation outputs.

ASTM D5340 (Standard Test Method for Airport Pavement Condition Index Surveys): The PCI methodology depends on accurate crack extent and severity measurements. A crack segmentation model’s IoU directly impacts the reliability of automated PCI computation.

ICAO Annex 14, Volume I — Chapter 10 (Aerodrome Maintenance): The requirement that runway surfaces be maintained in a condition that does not endanger aircraft safety implies that any inspection system — including AI-based systems — must detect defects with sufficient reliability. IoU provides the quantitative basis for demonstrating this reliability.

TarmacView’s crack segmentation system documents its IoU performance across lighting, weather, and pavement texture conditions as part of the validation evidence package supporting deployment on operational airfields. The reported IoU of 0.519, with its associated confidence intervals and stratified reporting across crack types, enables aerodrome operators to assess the system’s suitability for their specific maintenance workflow and compliance requirements.
