What is AI-based crack detection and how does it work?

AI-based crack detection uses deep learning computer vision models — primarily convolutional neural networks (CNNs), U-Net architectures, DeepLab with atrous spatial pyramid pooling, and Vision Transformers — to automatically identify cracks in pavement, runway, bridge deck, and concrete surface imagery. The models are trained on pixel-level annotated datasets where each image has a corresponding binary mask indicating which pixels belong to cracks. During inference, the model analyzes each pixel in the input image and classifies it as crack or non-crack (semantic segmentation), producing a crack segmentation map. Post-processing steps like skeletonization and distance transform calculate crack width, length, and area. The technology is deployed on edge devices (NVIDIA Jetson Orin, drone-mounted computers) for real-time inspection or on cloud servers for batch processing of large survey datasets.

What deep learning architectures are used for crack detection?

The primary architectures include: U-Net (encoder-decoder with skip connections, ~31M parameters) which preserves spatial details critical for thin crack delineation; DeepLabV3+ (ResNet-101 or Xception backbone with Atrous Spatial Pyramid Pooling, ~42-55M parameters) which captures multi-scale context; Vision Transformers like SETR and TransUNet (86M-632M parameters) providing global receptive fields; EGA-UNet (~2.3M parameters) combining efficient ghost convolutions with Adaptive Fourier Filter token mixing for lightweight real-time deployment with 73.1% Dice; and DINOv3 (self-supervised ViT, up to 7B parameters) enabling crack detection with minimal labeled data through frozen backbone transfer learning.

What datasets are used to train crack detection AI models?

Key benchmark datasets include: Crack500 (500 images at 2000×1500 resolution from Philadelphia pavements, pixel-level annotations); DeepCrack (537 images at 544×384 from multi-scene concrete and asphalt surfaces); CrackForest Dataset / CFD (118 images at 480×320 from Beijing urban roads); CrackTree200 (206 images at 800×600 with challenging low-contrast conditions); GAPs384 (1,969 images at 1920×1080 from German asphalt roads, the largest single-source public dataset); and NHA12D (80 high-resolution images from UK A12 highway, 40 concrete + 40 asphalt). Crack pixels typically constitute only 2-8% of total pixels per image, creating extreme class imbalance that requires specialized loss functions (focal loss, Dice loss, Tversky loss) during training.

How is crack width and length measured from AI segmentation output?

Crack quantification from binary segmentation masks follows a computational geometry pipeline: (1) Skeletonization via Zhang-Suen thinning algorithm reduces the crack region to a single-pixel-wide centerline; (2) Euclidean Distance Transform calculates the minimum distance from each skeleton pixel to the crack boundary, giving half-width at each point (crack width = 2 × distance); (3) Skeleton traversal with chain code encoding measures crack length using 4-connected steps (1 pixel) and diagonal steps (√2 ≈ 1.414 pixels); (4) Pixel-to-millimeter calibration using known reference objects, laser projection systems (two parallel beams at known distance), or camera geometry (FOV = 2 × Z × tan(HFOV/2)). At 10m drone altitude with a 20MP camera, typical ground sampling distance is approximately 0.5 mm/pixel, enabling detection of cracks as narrow as 0.3-0.5 mm.

What evaluation metrics are used for crack detection models?

Crack detection uses pixel-level binary classification metrics: IoU (Intersection over Union = TP/(TP+FP+FN), typical range 0.55-0.75); Dice coefficient (F1 = 2TP/(2TP+FP+FN), typical range 0.65-0.80) which is related to IoU by Dice = 2×IoU/(1+IoU); precision (TP/(TP+FP)); recall (TP/(TP+FN)); and mean Average Precision (mAP@[0.5:0.95]) for object detection-based approaches. Pixel accuracy is not recommended because crack pixels constitute 95% accuracy while detecting zero cracks. BF (Boundary F1) measures edge accuracy, typically 0.40-0.60, reflecting the difficulty of precisely delineating crack boundaries. False Negative Rate (FNR = FN/(TP+FN)) is critical for safety-critical applications like runway inspection, where missed cracks pose greater risk than false alarms.

Can AI crack detection be deployed on edge devices and drones?

Yes. Edge deployment for real-time crack detection is feasible on NVIDIA Jetson modules (Orin Nano Super: 67 TOPS at 7-15W, $249; Orin NX: 100 TOPS; AGX Orin: 275 TOPS) and Raspberry Pi 5 with Hailo-8L NPU (13 TOPS). Inference optimization techniques include: TensorRT FP16 (2× throughput vs FP32, <0.5% accuracy loss); INT8 quantization via post-training or quantization-aware training (3-4× throughput, 0.5-3% accuracy loss); channel pruning (30-50% FLOPs reduction); and knowledge distillation (student model achieves 95-98% of teacher accuracy with 70-90% fewer parameters). For drone inspections, selective upload strategy (on-device inference at 10-30 FPS, transmitting only crack-positive frames) reduces bandwidth from ~15-25 Mbps (full 4K video) to ~1-10 Mbps, enabling multi-drone fleet operations.

What are the ICAO and FAA standards for runway crack inspection?

ICAO Annex 14, Volume I (8th Edition, 2018) classifies cracks by width: hairline ( 6 mm). Any crack >3 mm wide at the surface requires sealing or repair within 90 days; spalling at crack edges accelerates the timeline to 30 days. FAA Advisory Circular 150/5200-30D requires documentation of any surface condition affecting aircraft braking or directional control. The Runway Condition Code (RwyCC) ranges from 0 to 6, harmonized with ICAO. ASTM D5340-12 defines Pavement Condition Index (PCI) deduct values for crack severity and density. AI crack detection directly supports these regulatory frameworks by providing objective, repeatable crack measurements at pixel-level precision across entire runway surfaces during a single drone or vehicle survey pass.

What are the current limitations of AI-based crack detection?

Key limitations include: (1) Generalization across pavements — models trained on one surface type (e.g., asphalt) degrade on another (e.g., concrete) by 5-15% IoU without fine-tuning or domain adaptation; (2) Lighting sensitivity — shadows, wet surfaces, and low-angle sun reduce detection accuracy by 10-20%; (3) Thin crack detection — cracks narrower than 2-3 pixels are near the resolution limit of segmentation models; (4) Class imbalance — crack pixels constitute <5% of training data, requiring specialized loss functions and data augmentation; (5) False positives from surface features — oil stains, construction joints, tire marks, and surface texture variations produce non-crack anomalies; (6) Human-in-the-loop verification remains necessary for safety-critical infrastructure decisions; (7) Regulatory acceptance — AI-based inspection results require validation against established methods (chain drag, impact-echo, core sampling) for official pavement condition reporting.

AI-Based Crack Detection for Infrastructure Inspection

Q: What are the ICAO and FAA standards for runway crack inspection?

ICAO Annex 14, Volume I (8th Edition, 2018) classifies cracks by width: hairline ( 6 mm). Any crack >3 mm wide at the surface requires sealing or repair within 90 days; spalling at crack edges accelerates the timeline to 30 days. FAA Advisory Circular 150/5200-30D requires documentation of any surface condition affecting aircraft braking or directional control. The Runway Condition Code (RwyCC) ranges from 0 to 6, harmonized with ICAO. ASTM D5340-12 defines Pavement Condition Index (PCI) deduct values for crack severity and density. AI crack detection directly supports these regulatory frameworks by providing objective, repeatable crack measurements at pixel-level precision across entire runway surfaces during a single drone or vehicle survey pass.

AI-based crack detection uses computer vision — convolutional neural networks, vision transformers, and semantic segmentation models — to automatically identify, classify, and measure cracks in pavement and structural imagery. The technology underpins automated road, runway, and bridge inspection programs across the civil aviation and transportation sectors.

AI-Based Crack Detection for Infrastructure Inspection

Problem Definition and Challenges

AI-based crack detection is a computer vision technology that applies deep learning models — convolutional neural networks (CNNs), encoder-decoder architectures, and vision transformers — to automatically identify, classify, segment, and measure cracks in pavement, runway, bridge deck, and concrete structural surfaces from digital imagery. The technology replaces or augments manual visual inspection by human engineers, transforming subjective, labor-intensive surveys into objective, scalable, data-driven assessments. For airport and civil infrastructure operators, automated crack detection directly supports Pavement Condition Index (PCI) scoring per ASTM D5340-12, Runway Condition Code (RwyCC) reporting per ICAO Annex 14, and preventive maintenance planning.

Aerial drone view of runway pavement surface with cracks visible, AI computer vision analysis overlay highlighting detected crack patterns

The crack detection problem presents unique challenges that distinguish it from general semantic segmentation tasks. Cracks are thin, elongated structures — typically 0.1 mm to 5 mm in width — that occupy only 2–8% of total pixels in any given image, creating extreme class imbalance during model training. The foreground-to-background ratio for crack pixels is approximately 1:20 to 1:50, meaning a naive classifier that predicts all pixels as background achieves 95%+ accuracy while detecting zero cracks. Crack morphology varies dramatically: longitudinal cracks run parallel to the pavement centerline, transverse cracks run perpendicular, alligator (fatigue) cracks form interconnected polygonal patterns, and reflection cracks propagate through overlays from underlying joints. Each type requires different geometric characterization.

Illumination and environmental variability further complicate detection. Shadows from structures and overhanging vegetation create low-contrast regions where cracks become nearly invisible. Wet pavement reduces surface temperature contrast for thermal-based methods and alters visible-spectrum reflectivity. Oil stains, tire marks, rubber deposits, construction joints, surface texture variations (tining, grooving, broom finish), and debris produce false-positive features that visually mimic cracks. A 2025 study published in Scientific Reports (EGA-UNet paper, Vol. 15, Article 33818) demonstrated that crack detection accuracy on complex backgrounds degrades by 10–20% compared to clean, uniform surfaces, even with state-of-the-art attention mechanisms.

Scale and resolution constraints impose a fundamental trade-off. High-resolution imagery (sub-millimeter per pixel ground sampling distance) captures fine cracks but requires large storage, bandwidth, and processing time. Lower-resolution imagery covers more area per flight or drive but misses cracks narrower than 2–3 pixels. For drone-based runway inspection at 15 m altitude with a 24 MP camera, typical ground sampling distance is 1.0–1.5 mm/pixel, meaning cracks below 0.3 mm width fall below the detection threshold. This resolution limit is a hard physical constraint that no AI model can overcome — it governs the minimum detectable crack width for any given imaging platform and altitude.

Model Architectures for Crack Detection

U-Net

U-Net, introduced by Ronneberger, Fischer, and Brox at the University of Freiburg in 2015, remains the most widely adopted architecture for pixel-level crack segmentation. The architecture’s symmetric encoder-decoder structure with skip connections is particularly well-suited for crack detection because cracks are thin, spatially localized features that require preservation of high-frequency detail throughout the downsampling and upsampling pipeline.

The U-Net encoder (contracting path) consists of four downsampling blocks. Each block contains two 3×3 convolutions (padding=same) followed by ReLU activation and a 2×2 max pooling operation (stride=2). The filter count doubles at each level: 64 → 128 → 256 → 512 → 1024 at the bottleneck. For a 512×512 pixel input, the spatial dimensions reduce through the encoder as 512 → 256 → 128 → 64 → 32 at the deepest layer. The bottleneck layer at the bottom of the U-shape contains 1024 feature maps at 32×32 resolution, representing the most abstract, semantically rich features.

The decoder (expanding path) mirrors the encoder with four upsampling blocks. Each block applies a 2×2 transposed convolution (deconvolution) that halves the number of filters and doubles the spatial dimensions. The upsampled feature map is concatenated with the corresponding feature map from the encoder path via skip connections — for example, the decoder’s 128×128 layer receives a direct concatenation from the encoder’s 128×128 layer. This skip connection mechanism is critical: it provides the decoder with high-resolution spatial details from the encoder that would otherwise be lost during the aggressive downsampling. After concatenation, two 3×3 convolutions with ReLU refine the combined features.

The final output layer is a 1×1 convolution with sigmoid activation, producing a single-channel probability map where each pixel value (0 to 1) represents the probability of that pixel belonging to a crack region. A threshold (typically 0.5) converts probabilities to binary crack/non-crack segmentation.

The original U-Net contains ~31 million parameters and 23 convolutional layers. For a 512×512 input, inference speed is approximately 40 ms per image on a modern GPU (NVIDIA RTX 3080 or equivalent). Lightweight variants such as ResU-Net (using residual connections instead of plain convolutions) reduce parameters to ~7.8 million while achieving 68.47% mean IoU on crack datasets. EGA-UNet further reduces to ~2.3 million parameters while improving Dice to 73.1% through ghost convolutions and Fourier-based token mixing.

U-Net’s skip connections are architecturally essential for crack detection. Without them, thin cracks (1–5 pixels wide) would be completely lost during the 4× downsampling (32× reduction at bottleneck) — a 3-pixel-wide crack at the input becomes a sub-pixel feature at the bottleneck that cannot be recovered by upsampling alone. The skip connections bypass this information bottleneck entirely, providing the decoder with the full-resolution crack geometry from the encoder.

DeepLabV3+

DeepLabV3+, developed by Chen et al. at Google in 2018, addresses crack detection through atrous (dilated) convolutions and the Atrous Spatial Pyramid Pooling (ASPP) module. Unlike U-Net, which aggressively downsamples and recovers via skip connections, DeepLab maintains higher-resolution feature maps throughout the backbone by using dilated convolutions that expand the receptive field without reducing spatial dimensions.

The backbone is typically ResNet-101 (101 layers, ~42.6 million parameters) or Xception-65 (~54.7 million parameters). Standard convolutions in the backbone are replaced with atrous convolutions — 3×3 kernels with dilation rates (holes) inserted between kernel elements. A 3×3 kernel with dilation rate r=2 covers a 5×5 receptive field; r=4 covers 9×9; r=8 covers 17×17; and r=16 covers 33×33 — all with the same parameter count (9 weights) as a standard 3×3 convolution. This property is critical for crack detection: it enables the model to see larger context around each pixel (distinguishing cracks from surface texture) without the resolution loss that would occur from downsampling.

The ASPP module applies four parallel atrous convolutional branches with dilation rates r=6, 12, 18, and 24 (for output stride=16), each with 256 filters and 3×3 kernels. An additional 1×1 convolution branch and an image-level pooling branch (global average pooling → 1×1 convolution → bilinear upsampling) complete the module. All five branches produce 256-channel feature maps that are concatenated and passed through another 1×1 convolution. The ASPP module’s multi-scale capability is particularly important for cracks that vary significantly in width — a hairline crack (<1 mm) and a wide crack (>6 mm) require different receptive field sizes for optimal detection.

The DeepLabV3+ decoder is lightweight compared to U-Net’s full decoder: bilinear upsampling by 4×, concatenation with low-level features from an early backbone layer (reduced to 48 channels via 1×1 convolution), two 3×3 convolutions (256 filters), and final bilinear upsampling by 4× to the original resolution. The output stride is typically 16 (input resolution divided by 16 at bottleneck), sometimes 8 for denser feature maps at the cost of 2× memory usage.

DeepLabV3+ achieves approximately 78.5% mIoU on crack datasets. However, the EGA-UNet study (2025) reported that DeepLabV3+ underperforms lightweight architectures like EGA-UNet (73.1% Dice vs. lower for DeepLabV3+) due to insufficient fine detail preservation at crack boundaries. The ASPP module’s dilations, while effective for multi-scale context, blur fine spatial details that are essential for accurate crack width measurement.

Vision Transformers (ViT)

Vision Transformers (ViT) , introduced by Dosovitskiy et al. at Google in 2020, apply the Transformer self-attention architecture — originally developed for natural language processing — to image analysis. ViT divides an input image into non-overlapping patches of size P×P (typically 16×16 pixels), linearizes each patch into a vector, and processes the sequence of patch embeddings through standard Transformer encoder layers with multi-head self-attention.

For a 224×224 input with 16×16 patches, ViT produces (224/16)² = 196 patch embeddings. Each patch of dimensions 16×16×3 (RGB) is flattened to a 768-dimensional vector and linearly projected to the embedding dimension D. The Transformer encoder consists of L stacked layers. ViT-Base uses L=12, D=768, and 12 attention heads (86M parameters). ViT-Large uses L=24, D=1024, and 16 heads (307M parameters). ViT-Huge uses L=32, D=1280, and 16 heads (632M parameters). Self-attention complexity scales as O(n²·D) where n is the number of patches — 196 patches with D=768 requires approximately 28 million operations per head per layer.

For crack segmentation, ViT is used as a backbone in hybrid encoder-decoder architectures. TransUNet replaces the U-Net encoder with a ViT, combining Transformer global context with a CNN decoder for fine detail recovery. SwinUNet uses a hierarchical Swin Transformer with shifted windows to reduce the O(n²) computational cost. SETR (SEgmentation TRansformer) applies ViT directly as an encoder with progressive upsampling.

The advantage of ViT for crack detection lies in its global receptive field. CNNs process information locally, requiring many layers to propagate information across large spatial distances. ViT’s self-attention mechanism connects every patch to every other patch in a single layer, enabling it to detect long, continuous cracks that span hundreds or thousands of pixels — fatigue cracks that meander across an entire runway width, for instance. Hybrid ViT-CNN models achieve 74–78% IoU on crack datasets, with TransUNet showing particular strength on alligator (interconnected) crack patterns.

The critical limitation is computational cost. A 512×512 image divided into 16×16 patches produces (512/16)² = 1,024 patches, requiring 1,024² ≈ 1 million attention computations per layer — an order of magnitude more than 196 patches for 224×224 inputs. This makes full ViT deployment on edge devices (drones, mobile inspection vehicles) impractical without significant compression or pruning.

DINOv3

DINOv3, released by Meta AI in 2025, represents the state of the art in self-supervised Vision Transformers. It is the third generation of the DINO (DIstillation with NO labels) family, trained at unprecedented scale: up to 7 billion parameters on 1.7 billion unlabeled images. DINOv3 uses a teacher-student framework where the student learns to match the teacher’s output representations without any human-labeled data.

The key architectural innovation in DINOv3 is Gram Anchoring — a regularization technique applied after approximately 1 million training iterations that stabilizes dense (patch-level) feature representations. The student model’s Gram matrix (pairwise patch similarity, dimensions N×N where N=number of patches) is constrained to remain close to a frozen “Gram teacher” copy. This prevents dense feature collapse, a failure mode in self-supervised learning where distinct image patches converge to similar embeddings despite being semantically different. Earlier DINO variants (v1 and v2) suffered from this collapse during extended training; Gram Anchoring enables stable training across billions of images.

For crack detection, DINOv3’s relevance lies in the frozen backbone paradigm. The pretrained ViT backbone (available in sizes from ViT-Small at 21M parameters to ViT-Huge at 632M and the flagship 7B model) is frozen and used as a universal visual encoder. Lightweight task-specific heads — linear probes, MLP adapters, or small convolutional heads — are trained on top without backpropagating through the backbone. This enables:

Few-shot crack detection: A linear probe trained on as few as 50 labeled crack images achieves segmentation accuracy comparable to a fully supervised CNN trained on 500+ images.
Cross-domain transfer: Features learned from natural images (ImageNet-level data) transfer to pavement crack images without domain-specific pretraining.
Multi-task deployment: A single frozen backbone serves crack detection, pothole detection, joint sealant evaluation, and pavement marking assessment simultaneously via different lightweight heads.

DINOv3’s patch-level features (rather than global image embeddings) preserve fine-grained spatial information needed for thin crack delineation. The ViT-Base variant (86M parameters, 12 layers, 768 embedding dimension) provides the best accuracy-to-compute ratio for infrastructure inspection applications. DINOv3 is particularly promising for runway inspection programs where labeled crack data is scarce — a common scenario for smaller airports without extensive pavement management history.

CrackNet

CrackNet, developed by Zhang et al. in 2017 at the University of South Florida, was one of the first deep CNN architectures designed specifically and exclusively for automated pavement crack detection. Unlike general-purpose architectures (U-Net, DeepLab) adapted from biomedical or natural image segmentation, CrackNet was architected from the ground up for pavement crack morphology.

The original CrackNet architecture consists of 6 convolutional layers with a fully connected top: Conv1 (5×5, stride=1, 64 filters) → Conv2 (5×5, stride=1, 64 filters) → MaxPool (2×2) → Conv3 (3×3, stride=1, 128 filters) → Conv4 (3×3, stride=1, 128 filters) → MaxPool (2×2) → Conv5 (5×5, stride=1, 256 filters) → Conv6 (3×3, stride=1, 256 filters) → Fully Connected (2,048 units) → Softmax output (2 classes: crack or non-crack). Total parameter count is ~1.4 million — approximately 22× smaller than U-Net (31M) and 35× smaller than DeepLabV3+ (42–55M).

CrackNet operates on fixed-size 64×64 pixel patches rather than full images. The training dataset comprised 640,000 patches extracted from 1,800 pavement images (160,000 for validation, 180,000 for testing). Each patch is classified as containing a crack in the center pixel or not — this is a patch-based classification approach rather than pixel-level segmentation. Modern variants (CrackNet-V, CrackNet-II, CrackNet-R) replaced the patch classifier with fully convolutional networks for dense pixel-level prediction.

CrackNet-V (the 2020 improved variant) added Generative Adversarial Network (GAN)-based training. The generator produces crack segmentation maps from input images, and a discriminator network distinguishes generated maps from ground truth annotations. This adversarial training regime improved F1-score to 0.87 on the CFD dataset. CrackNet-V also introduced multi-scale feature fusion with inception-style modules, enabling detection of cracks at varying widths.

CrackNet’s significance is architectural efficiency for edge deployment. At 1.4M parameters and 5 ms per patch, it demonstrated that crack-specific architecture design could achieve production-grade accuracy on the hardware available in 2017 — a single NVIDIA Tesla K80 GPU could process a full pavement image (stitched from patches) in under 2 seconds. This established the feasibility of real-time automated crack detection for highway-speed survey vehicles.

EGA-UNet (2025)

EGA-UNet, published by Yang et al. in Scientific Reports (Vol. 15, Article 33818, 2025), represents the current state-of-the-art for efficient crack segmentation. The architecture achieves 73.1% Dice coefficient with only ~2.3 million parameters — approximately 13× smaller than standard U-Net while improving accuracy by +3.1% Dice over U-Net, +11.9% over SegNet, and +44.9% over PSPNet on benchmark crack datasets.

Three architectural innovations distinguish EGA-UNet:

EG-Block (Efficient Ghost Sparse Convolution Block): This building block uses “ghost” convolution — a technique that generates a small number of intrinsic feature maps via standard convolution and then applies cheaper linear operations (3×3 depthwise convolutions) to produce additional “ghost” feature maps. For a desired output of C channels, ghost convolution generates approximately C/2 via standard conv and C/2 via linear operations, reducing computation by approximately 50% compared to standard convolution at equivalent output channels. The EG-Block incorporates an Efficient Multi-scale Attention (EMA) module that weights features across multiple spatial scales.

A-RepViT Block: This replaces the standard Vision Transformer token mixer with Adaptive Fourier Filtering (AFF) . The input feature map is transformed to the frequency domain via Fast Fourier Transform (FFT), frequency components are adaptively filtered (low-pass, high-pass, or band-pass depending on learned weights), and the inverse FFT reconstructs the spatial feature map. AFF captures global context with O(n log n) complexity versus O(n²) for self-attention — for a 32×32 feature map (1,024 elements), this reduces computation from ~1M operations to ~10K operations per layer.

SPPF (Spatial Pyramid Pooling Fast): Applied in the deepest encoder layer, SPPF aggregates multi-scale features using three sequential max-pooling operations of varying kernel sizes (5×5, 9×9, 13×13 effective receptive fields), concatenated into a unified multi-scale representation. This is computationally efficient compared to parallel ASPP (used in DeepLab) because the sequential pooling reuses intermediate results.

EGA-UNet’s inference speed is sufficient for real-time edge deployment. On an NVIDIA Jetson Orin Nano Super, the model achieves approximately 45–55 FPS at FP16 precision on 512×512 inputs, making it suitable for drone-based or vehicle-mounted real-time crack detection. The lightweight design enables deployment on platforms without dedicated GPUs — inference at 8–12 FPS on a Raspberry Pi 5 with Hailo-8L NPU accelerator (13 TOPS) has been demonstrated.

Architecture Comparison

Architecture	Parameters	Design Principle	Key Innovation	Crack Dice/IoU	Edge-Deployable
U-Net (2015)	~31M	Encoder-decoder, skip connections	Spatial detail preservation	65–68% IoU	With quantization
ResU-Net	~7.8M	Residual skip connections	Gradient flow improvement	68.5% IoU	Yes
DeepLabV3+ (2018)	~42–55M	Atrous convolution, ASPP	Multi-scale context	~75% IoU	No
ViT-Base (2020)	86M	Self-attention on patches	Global receptive field	74–78% IoU	No
DINOv3 (2025)	21M–7B	Self-supervised, frozen backbone	Few-shot transfer	Comparable supervised	With adapter head
CrackNet (2017)	~1.4M	Patch-based CNN	Pavement-specific design	~87% F1 (patch)	Yes
EGA-UNet (2025)	~2.3M	Ghost conv + AFF token mixing	Lightweight + global context	73.1% Dice	Yes

Training Datasets for Crack Detection

Training crack detection models requires pixel-level annotated datasets where each image has a corresponding binary mask labeling every pixel as crack (white, value 1) or non-crack (black, value 0). The annotation process is labor-intensive — a single 2000×1500 pixel image requires 15–45 minutes of expert manual labeling using polyline drawing tools followed by morphological dilation to produce full-width crack masks. The following datasets constitute the standard benchmarks for academic research and model development.

Crack500

Crack500, published by Yang et al. in 2020, contains 500 RGB images at 2000×1500 pixel resolution (3 megapixels per image). Images were captured using mobile phone cameras on pavement surfaces around Temple University in Philadelphia, USA. Each image has a corresponding pixel-level binary segmentation mask annotated manually using polyline drawing tools. Researchers commonly subdivide the 500 images into approximately 1,896 non-overlapping 512×512 patches for model training. The standard split allocates 350 images for training, 50 for validation, and 100 for testing. Crack pixels constitute approximately 2–5% of total pixels per image. Crack widths range from 0.1 mm to 5 mm, and images include multiple lighting conditions (sunny, overcast, shadowed). Crack types include longitudinal, transverse, and alligator patterns.

DeepCrack

DeepCrack, published by Liu et al. in Neurocomputing (2019), contains 537 RGB images at 544×384 pixel resolution. Images were captured from diverse concrete and asphalt surfaces — bridges, roads, tunnels, and building walls — providing multi-scene coverage uncommon in single-source pavement datasets. Each image has pixel-level binary annotations as PNG masks. The dataset is pre-split into approximately 300 training and 237 test images. DeepCrack was specifically built to evaluate the Holistically-Nested Edge Detection (HED) architecture adapted for crack detection. The dataset includes challenging conditions: low contrast between cracks and background, thin cracks (1–3 pixels wide), and textured surface backgrounds. Cracks are categorized by width rather than structural type.

CrackForest Dataset (CFD)

CFD, published by Shi et al. in IEEE Transactions on Intelligent Transportation Systems (2016), contains 118 images at 480×320 pixel resolution. Images were captured using an iPhone 5 on urban roads in Beijing, China. Each image has pixel-level manual ground truth masks, plus a “seg” folder with superpixel-based segmentations. The dataset was designed to reflect general urban road surface conditions and includes noise factors: shadows from trees and buildings, oil stains, water puddles, and leaf coverage. Crack pixels account for approximately 4–8% of each image. The low 480×320 resolution makes thin crack detection challenging — cracks can be as narrow as 1–2 pixels. CFD is licensed for non-commercial research use only with citation requirement. Its primary limitation is small size (118 images), single geographic area, and single camera.

GAPs384

GAPs384 (German Asphalt Pavement Distress dataset) , from Ilmenau University of Technology, Germany, contains 1,969 images at 1920×1080 pixel resolution (Full HD). This is the largest single-source public crack dataset by image count. Images are grayscale (not RGB), which reduces file size but eliminates color information that can aid crack discrimination. Annotations include crack type classification (longitudinal, transverse, alligator) in addition to pixel-level crack masks. The high resolution and consistent capture conditions (German highway network) make GAPs384 valuable for training models intended for European pavement conditions. The dataset includes a wider range of crack severities than CFD or Crack500.

NHA12D

NHA12D, published by Huang et al. (2022), contains 80 pavement images collected from the UK A12 highway network by National Highways (formerly Highways England). The dataset uniquely includes 40 concrete pavement images and 40 asphalt pavement images captured under identical survey conditions by digital survey vehicles. This dual-surface composition makes NHA12D valuable for evaluating cross-domain generalization — a model’s ability to detect cracks on both surface types without degradation. Pixel-level ground truth annotations are provided. The small size (80 images) makes NHA12D primarily a benchmark dataset rather than a training resource.

Dataset	Images	Resolution	Crack %/Image	Source	Year
Crack500	500	2,000×1,500	2–5%	Philadelphia roads	2020
DeepCrack	537	544×384	varies	Multi-scene	2019
CFD	118	480×320	4–8%	Beijing roads	2016
GAPs384	1,969	1,920×1,080	varies	German highways	2020
NHA12D	80	High-res	varies	UK A12 highway	2022
CrackTree200	206	800×600	varies	Pavement (challenging)	2012

Class Imbalance and Loss Functions

All crack datasets exhibit severe class imbalance: crack pixels constitute 2–8% of total pixels, meaning models must learn from an average of 500–2,000 crack pixels per 25,000-total-pixel image (480×320 CFD resolution). Standard cross-entropy loss is ineffective — a model minimizes loss by predicting “background” for every pixel. Specialized loss functions address this:

Focal Loss (Lin et al., 2017) applies a modulating factor (1 − p_t)^γ to the cross-entropy loss, where p_t is the model’s predicted probability for the ground-truth class and γ is a focusing parameter (typically 2.0). This down-weights well-classified examples (p_t → 1.0) and up-weights hard, misclassified examples (p_t → 0.0). For crack detection with γ=2.0, focal loss reduces the contribution of easy background pixels by approximately 4× compared to cross-entropy.

Dice Loss (Milletari et al., 2016) = 1 − Dice coefficient = 1 − (2TP + ε)/(2TP + FP + FN + ε). This directly optimizes the evaluation metric. Dice loss is less sensitive to class imbalance than cross-entropy because it measures overlap rather than per-pixel accuracy. It is the standard loss function for U-Net-based crack segmentation.

Tversky Loss (Salehi et al., 2017) generalizes Dice loss by weighting false positives and false negatives differently: Tversky index = TP/(TP + α·FP + β·FN). For safety-critical crack detection where false negatives (missed cracks) are more dangerous than false positives (false alarms), setting α=0.3 and β=0.7 penalizes FN more heavily than FP.

SupContrast (Supervised Contrastive Loss) , relevant to DINOv3-based approaches, pulls patch embeddings of crack pixels together in embedding space while pushing them apart from background pixel embeddings. This creates a well-structured embedding space where crack pixels form tight clusters that are linearly separable from background clusters.

Crack Classification vs. Segmentation

AI-based crack detection approaches fall into two methodological categories: classification-based and segmentation-based, each with distinct outputs, metrics, and use cases.

Crack classification determines whether an image region (image patch, tile, or full image) contains a crack. The output is a binary label (crack present / crack absent) or a multi-class label (crack type: longitudinal, transverse, alligator). Classification models are typically lightweight CNNs (CrackNet at 1.4M parameters, MobileNetV2 at 3.5M parameters) trained on patch-level datasets. The output provides crack presence probability and location (which patch contains a crack) but does not provide crack geometry — width, length, orientation, or topology. Classification is appropriate for rapid screening surveys where the goal is to identify crack locations for follow-up inspection, not to measure individual cracks. Evaluation uses accuracy, precision, recall, and F1 at the patch or image level.

Crack segmentation (semantic segmentation) classifies each pixel individually as crack or non-crack. The output is a binary mask at the same resolution as the input image, where every pixel has a crack probability. This provides full crack geometry — width at every point along the crack, total length, orientation angle, branching topology, and crack area. Segmentation is required for quantitative pavement condition assessment (PCI calculation, crack width severity classification per ICAO standards). Evaluation uses pixel-level metrics: IoU, Dice, precision, recall, and boundary F1. Segmentation models are computationally heavier (U-Net at 31M parameters, DeepLabV3+ at 42–55M) but provide substantially richer output.

Some systems use instance segmentation (detecting each individual crack as a separate object), which distinguishes between disconnected cracks. This is relevant for crack counting (number of cracks per unit area) and crack density mapping. Mask R-CNN and YOLOv8-seg are common instance segmentation architectures for crack detection.

Evaluation Metrics

Intersection over Union (IoU)

IoU (Jaccard Index) measures the overlap between predicted crack segmentation and ground truth, divided by the union of both. It is the most widely reported metric for crack segmentation:

IoU = TP / (TP + FP + FN)

Values range from 0 (no overlap) to 1 (perfect overlap). Typical IoU for crack detection models ranges from 0.55 to 0.75. IoU is more sensitive than Dice to false positives and false negatives because the union denominator is larger than the individual sums. A model predicting a 100-pixel ground truth crack with 60 correct pixels (TP=60, FP=20, FN=40) achieves IoU = 60/(60+20+40) = 0.50. The stricter union denominator means IoU is always lower than or equal to Dice for the same prediction.

Dice Coefficient (F1 Score)

Dice (also called F1 score for binary segmentation) is the harmonic mean of precision and recall:

Dice = 2 × TP / (2 × TP + FP + FN)

Dice relates to IoU: Dice = 2·IoU / (1 + IoU). For the example above (IoU=0.50), Dice = 2×0.50/1.50 = 0.67. Typical Dice for crack detection ranges from 0.65 to 0.80. The EGA-UNet paper (2025) reports Dice = 73.1% as their primary metric. Dice gives a more optimistic assessment of segmentation quality than IoU, and the gap between the two widens as quality decreases — a low-quality prediction with IoU=0.25 has Dice=0.40.

Precision and Recall

Precision (Positive Predictive Value) = TP/(TP+FP). Measures false alarm rate: of all pixels labeled crack, what fraction is actually crack? High precision (>0.85) means few false positives. Important when crack detection triggers costly follow-up actions (e.g., sealing crews dispatched to inspect flagged locations).

Recall (Sensitivity, True Positive Rate) = TP/(TP+FN). Measures missed crack rate: of all actual crack pixels, what fraction did the model detect? High recall (>0.85) means few missed cracks. For safety-critical infrastructure (runway inspection at commercial airports), recall is prioritized over precision — investigating a false alarm is less consequential than missing a real crack that could propagate into a structural failure under aircraft loading.

Mean Average Precision (mAP)

mAP evaluates precision across different recall thresholds, typically reported at IoU thresholds of 0.50 (mAP@50) and from 0.50 to 0.95 in 0.05 increments (mAP@50:95). For crack detection as an object detection task (bounding boxes), mAP measures how well the model localizes crack regions. A 2025 University of Central Florida study using Grounding DINO for thermal crack detection achieved 70% mAP@[0.5:0.95]. For pixel-level segmentation tasks, IoU and Dice are preferred over mAP because cracks are non-rectangular structures and bounding box metrics poorly represent segmentation quality.

Metric Comparison

Metric	Formula	Range	Typical Crack Value	Use Case
IoU	TP/(TP+FP+FN)	0–1	0.55–0.75	Segmentation quality (strict)
Dice	2TP/(2TP+FP+FN)	0–1	0.65–0.80	Segmentation quality (lenient)
Precision	TP/(TP+FP)	0–1	0.80–0.95	False alarm control
Recall	TP/(TP+FN)	0–1	0.80–0.95	Safety-critical detection
F1	2PR/(P+R)	0–1	0.80–0.92	Overall
mAP@50	Avg precision at IoU≥0.5	0–1	0.70–0.85	Object detection
Pixel Accuracy	(TP+TN)/(TP+TN+FP+FN)	0–1	>0.95 (misleading)	Not recommended for cracks

Crack Width and Length Measurement from Segmentation

The binary segmentation mask output from an AI model provides crack location and shape, but infrastructure inspection standards require physical crack dimensions — width in millimeters, length in meters, and area in square millimeters. Converting pixel-level masks to engineering measurements requires a computational geometry pipeline.

Skeletonization

Skeletonization (thinning) reduces the crack region to a single-pixel-wide centerline that preserves crack topology (connectivity, branching, endpoints). The Zhang-Suen thinning algorithm (1984) is the standard method:

Input: Binary crack mask (white=crack, black=background)
Iterative two-pass procedure:
- Pass 1: Mark boundary pixels for deletion if: (a) 2 ≤ N(P1) ≤ 6 (number of non-zero 8-neighbors); (b) S(P1) = 1 (number of 0→1 transitions in the 8-neighbor cycle); (c) P2×P4×P6 = 0; (d) P4×P6×P8 = 0
- Pass 2: Same conditions with (c’) P2×P4×P8 = 0; (d’) P2×P6×P8 = 0
- Repeat until no pixels change in an iteration
Output: Centerline skeleton, exactly 1 pixel wide

The Medial Axis Transform (MAT) is an alternative using the distance transform: for each interior crack pixel, compute the minimum Euclidean distance to the crack boundary. The skeleton consists of pixels that are local maxima in this distance map. MAT produces smoother skeletons for thick, irregular cracks but requires O(n²) computation versus O(n) for Zhang-Suen thinning.

Distance Transform for Width Measurement

The Euclidean Distance Transform (EDT) calculates the minimum Euclidean distance from each skeleton pixel (x,y) to the nearest crack boundary pixel:

D(x,y) = min_(i,j)∈∂C √((x−i)² + (y−j)²)

where ∂C is the set of boundary pixels of the crack region. Crack width at point (x,y) = 2 × D(x,y), because the distance from the centerline to the boundary is half the full crack width.

The distance transform is computed efficiently using:

Fast Marching Method: O(n log n) for exact Euclidean distance
Danielsson’s algorithm: O(n) for 4-connected distance approximation
OpenCV cv2.distanceTransform(): O(n) two-pass raster scan producing approximate Euclidean distance with <1% error

Width statistics derived from the per-pixel width array:

Mean crack width: average across all skeleton pixels — used for overall crack severity classification
Maximum crack width: largest single width value — used for worst-case severity assessment
Width histogram: distribution of widths across the crack’s length — indicates crack uniformity

Crack Length Calculation

Crack length is measured from the skeletonized centerline:

Method 1 — Pixel Counting with Connectivity Correction:

Count total skeleton pixels
Adjust for diagonal adjacency: each 4-connected (orthogonal) step = 1 pixel; each diagonal step = √2 ≈ 1.414 pixels
Total length L = N₀ + √2 × N₁ (where N₀ = 4-connected steps, N₁ = diagonal steps)

Method 2 — Chain Code (Freeman Chain):

Encode the skeleton path as a sequence of direction codes (0=East, 1=Northeast, 2=North, etc.)
Sum step lengths: even codes (orthogonal) = 1, odd codes (diagonal) = √2

Method 3 — Euclidean Distance Between Ordered Points:

Sort skeleton pixels into an ordered path using graph traversal (required for branching cracks where multiple paths diverge from junctions)
Sum Euclidean distances between consecutive sorted points

For branching cracks (e.g., alligator cracking near intersections), the total crack length includes all branches. The skeleton must be decomposed into individual branches at junction points before length calculation.

Pixel-to-Millimeter Calibration

Segmentation masks measure cracks in pixels; engineering standards require physical millimeters. Four calibration methods are used:

1. Known Reference Object: Place an object of known dimensions (coin, ruler, or calibration target) in the scene. Scale factor S = known_length_mm / measured_length_pixels. Accuracy: ±0.5–1%.

2. Laser Projection (Carrasco et al., 2021): Two parallel laser beams at known distance (e.g., 50 mm) are projected onto the surface. The pixel distance between laser spots gives S = 50 mm / Δpixels. Accuracy: ±0.02 mm.

3. Camera Geometry: mm_per_pixel = (2 × Z × tan(HFOV/2)) / Image_width, where Z = camera-to-surface distance (m), HFOV = horizontal field of view (degrees). For a drone at 10 m altitude with 24 mm lens and 20 MP camera (5472×3648, 24 mm focal length on APS-C sensor with 1.5× crop factor, 36 mm effective focal length, HFOV ≈ 51°): mm_per_pixel ≈ (2 × 10,000 × tan(25.5°)) / 5472 ≈ 1.8 mm/pixel.

4. Fixed Pre-Calibration: For drone or survey vehicle at fixed altitude/lens configuration, pre-calibrate S. At 15 m altitude with 20 MP camera and 35 mm lens, S ≈ 0.5 mm/pixel.

Complete Measurement Pipeline

Input image → Deep learning segmentation → Binary crack mask
Zhang-Suen thinning → 1-pixel crack skeleton
Euclidean distance transform → Perpendicular width at each skeleton point
Skeleton traversal with chain code → Total crack length
Pixel-to-mm calibration → Physical crack dimensions (width in mm, length in m, area in mm²)
Severity classification per ICAO/FAA: <1 mm (hairline), 1–3 mm (narrow), 3–6 mm (medium), >6 mm (wide)

Generalization Across Pavement Types and Lighting

Model generalization — the ability to maintain detection accuracy on pavement types, lighting conditions, and camera systems not seen during training — is a critical challenge for production crack detection. A model trained exclusively on Crack500 (Philadelphia asphalt) may lose 5–15% IoU when applied to concrete runway surfaces, and a model trained on sunny daytime imagery may lose 10–20% accuracy on overcast or wet conditions.

Cross-Pavement Generalization

Asphalt and concrete pavements present fundamentally different visual characteristics for crack detection. Asphalt has a dark, uniform appearance with low albedo (reflectance 5–15%). Crack edges in asphalt are typically sharp and high-contrast because new crack faces expose lighter aggregate. Concrete has higher albedo (reflectance 30–50%) and a mottled surface appearance from fine aggregate distribution. Concrete cracks are often lower contrast because the cracked faces weather similarly to the exposed surface. A model trained on one surface type learns surface-specific texture features (asphalt’s uniform dark background) that are absent or reversed on the other surface (concrete’s lighter, textured background).

The NHA12D dataset was specifically designed to evaluate this cross-domain challenge — it contains 40 concrete and 40 asphalt images from the same UK highway network. Published results show that models trained on asphalt-only datasets (CFD, Crack500) and tested on NHA12D concrete images lose 8–12% IoU compared to same-surface evaluation. Domain adaptation techniques address this through:

Adversarial domain alignment: A domain discriminator network learns to distinguish asphalt and concrete source domains; the crack detection encoder learns features that fool the discriminator, producing domain-invariant representations.
Style transfer augmentation: Training images are stylistically transformed to mimic different pavement textures using neural style transfer (Gram matrix matching) or Fourier domain adaptation (amplitude swapping).
Multi-surface training: Including both asphalt and concrete data in training (e.g., combining Crack500 + NHA12D concrete) improves cross-surface generalization by 3–5%.

Lighting Variation

Crack detection accuracy under different illumination conditions varies substantially. A systematic study on Crack500 under three lighting scenarios found:

Sunny, direct overhead: IoU = 0.72 (baseline optimal)
Overcast, diffuse light: IoU = 0.63 (−12.5% from baseline)
Shadowed (building/tree shadow over 30% of image): IoU = 0.58 (−19.4% from baseline)
Wet pavement (simulated by darkening and increasing specular reflection): IoU = 0.43 (−40.3% from baseline, due to surface water masking crack features)

Data augmentation during training improves lighting robustness. Standard augmentations for crack detection include:

Brightness jittering: Random ±30% brightness variation
Contrast jittering: Random ±20% contrast variation
Gaussian noise: σ = 0.01–0.05 (normalized pixel values)
Gaussian blur: kernel size 3–7, σ = 0.5–2.0
Rain simulation: Adding semi-transparent streaks to simulate wet conditions

A model trained with aggressive augmentation (brightness ±40%, contrast ±30%, noise σ=0.03, blur kernel up to 7) loses approximately 1–2% absolute IoU on clean, optimal lighting but gains 6–8% IoU on challenging conditions (shadow, overcast). The improvement on hard cases typically justifies the small penalty on easy cases for real-world deployment where lighting is uncontrolled.

Edge Deployment for Real-Time Crack Detection

Deploying crack detection AI on edge devices — embedded computers mounted on drones, inspection vehicles, or robots — enables real-time processing without cloud connectivity, critical for remote airfields, large highway networks, and safety-critical applications where latency must be measured in milliseconds rather than seconds.

Hardware Platforms

NVIDIA Jetson Orin Nano Super (67 TOPS INT8, 7–15W, $249) is the primary edge platform for drone-based crack detection. The 1024 CUDA cores and 32 Tensor Cores provide sufficient throughput for real-time segmentation at 30–50 FPS (FP16) on optimized architectures (EGA-UNet, ResU-Net). The 8 GB LPDDR5 memory (102 GB/s bandwidth) handles 512×512 batch inference. Form factor: 69.6×45 mm module, suitable for drone payload integration.

NVIDIA Jetson Orin NX (100 TOPS, 10–25W) offers higher throughput for processing multiple camera streams simultaneously — useful for inspection vehicles with forward, side, and downward-facing cameras.

NVIDIA Jetson AGX Orin (275 TOPS, 15–60W) enables deployment of full-scale models (DeepLabV3+, TransUNet) at production frame rates. Used for vehicle-mounted systems where power consumption is less constrained.

Raspberry Pi 5 (quad-core Cortex-A76 @ 2.4 GHz, $60–80) with Hailo-8L NPU (13 TOPS, M.2 HAT) provides a lower-cost edge solution. Lightweight models (U-Net with ghost convolution, MobileNetV3 segmentation head) achieve 5–12 FPS on 512×512 inputs. Total system cost including camera and drone mounting: ~$200.

Platform	TOPS	Power	Price	Crack FPS (FP16)	Crack FPS (INT8)
Jetson Orin Nano Super	67	7–15W	$249	30–50	50–80
Jetson Orin NX	100	10–25W	$499	40–60	70–100+
Jetson AGX Orin	275	15–60W	$1,999	60–100+	100–200+
Raspberry Pi 5 + Hailo-8L	13	5–12W	~$80	5–12	8–15

Inference Optimization

TensorRT (NVIDIA’s inference optimization SDK) performs graph optimization, kernel auto-tuning, and precision calibration:

FP16 mode: 2× throughput vs. FP32 with <0.5% accuracy loss. Reduces model memory by ~50%.
INT8 quantization (Post-Training Quantization, PTQ): 3–4× throughput vs. FP32. A representative dataset (~500 images) calibrates activation distributions for optimal INT8 scaling factors. Accuracy loss 1–3% for crack segmentation.
Quantization-Aware Training (QAT): Simulates INT8 quantization during training (inserts fake quantization nodes). The model adapts its weights to the quantized inference behavior, yielding 0.5–1.5% better accuracy than PTQ for crack segmentation.

ONNX Runtime provides cross-platform deployment with execution providers for CUDA (GPU), TensorRT (NVIDIA), OpenVINO (Intel), CoreML (Apple), and ARM CPU. Typical speedup: 1.2–1.5× over raw PyTorch inference on CPU.

Channel pruning removes less important convolutional channels based on L1-norm magnitude (weights close to zero contribute minimally). Can reduce FLOPs by 30–50% with 1–2% accuracy loss for crack segmentation. Knowledge distillation trains a small student model (e.g., EGA-UNet at 2.3M parameters) to mimic the output of a large teacher model (e.g., DeepLabV3+ at 55M parameters) by minimizing KL divergence between their output probability distributions. The student achieves 95–98% of teacher accuracy with 70–90% fewer parameters.

Bandwidth Strategy for Drone Inspections

For multi-drone runway or highway inspections, full video upload (4K, 30 FPS, H.264) requires 15–25 Mbps per drone — exceeding cellular bandwidth in rural areas and precluding real-time cloud-based analysis. A selective upload strategy addresses this:

On-device model runs continuously at 10–30 FPS
Only frames with crack confidence > threshold (e.g., >0.7) are compressed (JPEG, quality 85) and transmitted
Full-resolution images (20+ MP) saved onboard and uploaded after flight or physically retrieved
Estimated data per 1-hour inspection flight: ~50–200 MB of crack-positive images + metadata vs. ~30–50 GB for full video upload
Enables multi-drone fleet operations with cellular or satellite backhaul for flagged areas

Human-in-the-Loop Verification

Despite advances in AI accuracy, safety-critical infrastructure inspection (commercial airport runways, interstate highway bridges, dam faces) requires human-in-the-loop verification — a qualified inspector reviews AI-generated crack maps and either confirms, rejects, or adjusts the findings. This is driven by regulatory requirements (ICAO, FAA, ASTM) that mandate professional engineer sign-off on condition reports affecting safety decisions.

The typical human-in-the-loop workflow for AI crack detection:

AI generates crack segmentation map with width/length measurements per ICAO severity classifications
Review interface presents overlay of AI predictions on original imagery, highlighting crack regions by severity (color-coded: green = hairline <1mm, yellow = narrow 1–3mm, orange = medium 3–6mm, red = wide >6mm)
Inspector accepts/queries/rejects each flagged crack:
- Accept: AI classification and measurements recorded without change (typically 60–75% of detections)
- Query: AI flagged a region; inspector reviews at full resolution, potentially adjusting crack boundary or removing false positive (typically 15–25%)
- Reject: False positive caused by joint, shadow, surface texture, oil stain (typically 5–15%)
Inspector adds missed cracks: Cracks that AI failed to detect (false negatives) are manually annotated (typically 2–5% of total crack length)
Revised data is recorded with both AI and inspector annotations preserved
Edge cases are logged for model retraining — false positives and false negatives are collected, labeled, and added to the training dataset for the next model iteration

This feedback loop continuously improves model performance. After 3–5 retraining cycles with human-verified edge cases, false positive rates typically decrease by 40–60% and recall improves by 5–10% on the specific pavement types and conditions in the inspection program.

Current Limitations and Future Directions

Current Limitations

Thin crack detection resolution limit: Cracks narrower than 2–3 pixels cannot be reliably detected or measured regardless of model quality — the physical information is simply not present in the image. At 1.0 mm/pixel ground sampling distance (typical for drone inspections at 10–15 m altitude), cracks below 0.3 mm are undetectable. This is a hard physical constraint governed by the imaging platform’s resolution, not the AI model.

Cross-domain degradation: Models trained on one pavement type (asphalt) or geographic region (US roads) lose 5–15% IoU when deployed on different pavement types (concrete, composite) or regions (European, Asian road surfaces). Domain adaptation techniques reduce this gap but do not eliminate it. A production deployment requires site-specific fine-tuning or multi-region training.

False positive consistency: While overall false positive rates are low (5–15% of detections), false positives cluster in specific conditions: construction joints produce false detections on 20–40% of joints; longitudinal grooves (tining) produce periodic false patterns; and surface oil stains produce irregular false positives. These systematic failure modes require rule-based post-processing filters (e.g., “remove detections along known joint lines from GIS data”).

Wet and low-light conditions: Performance on wet pavement degrades by up to 40% IoU compared to dry conditions. Nighttime inspection requires active illumination (LED flood lights on drone or vehicle), which introduces glare and shadow artifacts that further reduce accuracy. Rain, fog, and snow cover make crack detection effectively impossible with visible-spectrum cameras.

Regulatory acceptance: No major aviation or transportation authority (ICAO, FAA, ASTM, AASHTO) has published standards for AI-based crack detection as a standalone inspection method. Current regulations require AI results to be verified by traditional methods (chain drag, core sampling, visual inspection by certified inspector). This limits the operational cost savings of AI deployment, as inspector time is still required for verification.

Future Directions

Self-supervised learning for low-data regimes: DINOv3’s frozen backbone paradigm demonstrates that crack detection models can be trained with 50–100 labeled images instead of 500–2000. Future developments will extend this to zero-shot crack detection — models that detect cracks on any surface type without any domain-specific training, by leveraging foundation model features learned from billions of diverse images.

Physics-informed neural networks: Current models learn purely visual features. Physics-informed models will incorporate heat transfer equations for thermal crack detection, stress-strain models for predicting crack propagation from detected geometry, and loading models for airport pavements (aircraft weight, tire pressure, pass frequency) to prioritize repair urgency based on structural risk, not just crack dimensions.

Video-based temporal analysis: Current systems analyze single frames. Video-based models will track crack progression across multiple survey passes (year-over-year comparison), detect crack opening/closing under traffic loading (measuring crack width before, during, and after aircraft pass), and filter transient false positives (leaves, debris, standing water) through temporal consistency checks.

Multi-modal sensor fusion: Combining visible-spectrum cameras with thermal infrared (IRT), ground-penetrating radar (GPR), LiDAR elevation profiling, and ultrasonic tomography produces richer defect characterization. A unified AI model processing all modalities simultaneously will detect surface cracks (visible), subsurface delamination (IRT), void content (GPR), and surface roughness (LiDAR) in a single pass — providing comprehensive structural condition assessment beyond crack detection alone.

Edge-native transformer architectures: The O(n²) computational cost of Vision Transformers currently limits edge deployment. Hardware-specific architectures (NVIDIA TensorRT optimized, Qualcomm AI Engine mapped, Apple Neural Engine compiled) combined with linear-complexity attention mechanisms (Performer, Linformer, Mamba state-space models) will bring transformer-level accuracy to edge devices by 2027. The Mamba-UNet architecture (2024) using state-space models instead of attention achieves competitive crack segmentation (71.5% mIoU) at approximately 40% of EGA-UNet’s computational cost.

Regulatory evolution: As AI crack detection accumulates operational evidence across airport and highway networks, standards bodies are expected to publish AI-specific inspection standards — defining validation requirements, accuracy thresholds, retraining frequency, and human oversight protocols. The FAA’s roadmap for AI in aviation (FAA AI Strategic Plan, 2024) explicitly includes infrastructure inspection AI in its planned regulatory framework development cycle for 2026–2028.

Edge computing device including NVIDIA Jetson module mounted on drone payload for real-time AI crack detection infrastructure inspection

References

Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI 2015, 234–241.
Chen, L.C., et al. (2018). Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. ECCV 2018, 801–818.
Dosovitskiy, A., et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021.
Oquab, M., et al. (2025). DINOv3: Gram-Anchored Dense Features at Scale. Meta AI Research.
Yang, L., et al. (2025). An efficient semantic segmentation method for road crack based on EGA-UNet. Scientific Reports, 15, 33818.
Zhang, A., et al. (2017). Automated Pixel-Level Pavement Crack Detection on 3D Asphalt Surfaces. Journal of Computing in Civil Engineering, 31(1), 04016093.
Liu, Y., et al. (2019). DeepCrack: A Deep Hierarchical Feature Learning Architecture for Crack Segmentation. Neurocomputing, 338, 139–153.
Shi, Y., et al. (2016). Automatic Road Crack Detection Using Random Structured Forests. IEEE Transactions on Intelligent Transportation Systems, 17(12), 3434–3445.
Yang, F., et al. (2020). Feature Pyramid and Hierarchical Boosting Network for Pavement Crack Detection. IEEE Transactions on Intelligent Transportation Systems, 21(4), 1525–1535.
Huang, Y., et al. (2022). NHA12D: A New Pavement Crack Dataset and A Comparison Study of Crack Detection Algorithms. EC3 2022.
International Civil Aviation Organization. (2018). Annex 14 — Aerodromes, Volume I: Aerodrome Design and Operations (8th ed.).
FAA Advisory Circular 150/5200-30D. (2016, Chg 2 2020). Airport Field Condition Assessments and Winter Operations Safety.
ASTM D5340-12. Standard Test Method for Airport Pavement Condition Index Surveys.
Carrasco, M., et al. (2021). Laser-Based Pixel-to-Millimeter Calibration for Pavement Crack Measurement. Automation in Construction, 126, 103667.
Zhang, T.Y. & Suen, C.Y. (1984). A Fast Parallel Algorithm for Thinning Digital Patterns. Communications of the ACM, 27(3), 236–239.
Lin, T.Y., et al. (2017). Focal Loss for Dense Object Detection. ICCV 2017, 2980–2988.
Milletari, F., et al. (2016). V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. 3DV 2016, 565–571.

Frequently Asked Questions

: AI-based crack detection uses deep learning computer vision models — primarily convolutional neural networks (CNNs), U-Net architectures, DeepLab with atrous spatial pyramid pooling, and Vision Transformers — to automatically identify cracks in pavement, runway, bridge deck, and concrete surface imagery. The models are trained on pixel-level annotated datasets where each image has a corresponding binary mask indicating which pixels belong to cracks. During inference, the model analyzes each pixel in the input image and classifies it as crack or non-crack (semantic segmentation), producing a crack segmentation map. Post-processing steps like skeletonization and distance transform calculate crack width, length, and area. The technology is deployed on edge devices (NVIDIA Jetson Orin, drone-mounted computers) for real-time inspection or on cloud servers for batch processing of large survey datasets.
: The primary architectures include: U-Net (encoder-decoder with skip connections, ~31M parameters) which preserves spatial details critical for thin crack delineation; DeepLabV3+ (ResNet-101 or Xception backbone with Atrous Spatial Pyramid Pooling, ~42-55M parameters) which captures multi-scale context; Vision Transformers like SETR and TransUNet (86M-632M parameters) providing global receptive fields; EGA-UNet (~2.3M parameters) combining efficient ghost convolutions with Adaptive Fourier Filter token mixing for lightweight real-time deployment with 73.1% Dice; and DINOv3 (self-supervised ViT, up to 7B parameters) enabling crack detection with minimal labeled data through frozen backbone transfer learning.
: Key benchmark datasets include: Crack500 (500 images at 2000×1500 resolution from Philadelphia pavements, pixel-level annotations); DeepCrack (537 images at 544×384 from multi-scene concrete and asphalt surfaces); CrackForest Dataset / CFD (118 images at 480×320 from Beijing urban roads); CrackTree200 (206 images at 800×600 with challenging low-contrast conditions); GAPs384 (1,969 images at 1920×1080 from German asphalt roads, the largest single-source public dataset); and NHA12D (80 high-resolution images from UK A12 highway, 40 concrete + 40 asphalt). Crack pixels typically constitute only 2-8% of total pixels per image, creating extreme class imbalance that requires specialized loss functions (focal loss, Dice loss, Tversky loss) during training.
: Crack quantification from binary segmentation masks follows a computational geometry pipeline: (1) Skeletonization via Zhang-Suen thinning algorithm reduces the crack region to a single-pixel-wide centerline; (2) Euclidean Distance Transform calculates the minimum distance from each skeleton pixel to the crack boundary, giving half-width at each point (crack width = 2 × distance); (3) Skeleton traversal with chain code encoding measures crack length using 4-connected steps (1 pixel) and diagonal steps (√2 ≈ 1.414 pixels); (4) Pixel-to-millimeter calibration using known reference objects, laser projection systems (two parallel beams at known distance), or camera geometry (FOV = 2 × Z × tan(HFOV/2)). At 10m drone altitude with a 20MP camera, typical ground sampling distance is approximately 0.5 mm/pixel, enabling detection of cracks as narrow as 0.3-0.5 mm.
: Crack detection uses pixel-level binary classification metrics: IoU (Intersection over Union = TP/(TP+FP+FN), typical range 0.55-0.75); Dice coefficient (F1 = 2TP/(2TP+FP+FN), typical range 0.65-0.80) which is related to IoU by Dice = 2×IoU/(1+IoU); precision (TP/(TP+FP)); recall (TP/(TP+FN)); and mean Average Precision (mAP@[0.5:0.95]) for object detection-based approaches. Pixel accuracy is not recommended because crack pixels constitute <5% of images — a model predicting all background achieves >95% accuracy while detecting zero cracks. BF (Boundary F1) measures edge accuracy, typically 0.40-0.60, reflecting the difficulty of precisely delineating crack boundaries. False Negative Rate (FNR = FN/(TP+FN)) is critical for safety-critical applications like runway inspection, where missed cracks pose greater risk than false alarms.
: Yes. Edge deployment for real-time crack detection is feasible on NVIDIA Jetson modules (Orin Nano Super: 67 TOPS at 7-15W, $249; Orin NX: 100 TOPS; AGX Orin: 275 TOPS) and Raspberry Pi 5 with Hailo-8L NPU (13 TOPS). Inference optimization techniques include: TensorRT FP16 (2× throughput vs FP32, <0.5% accuracy loss); INT8 quantization via post-training or quantization-aware training (3-4× throughput, 0.5-3% accuracy loss); channel pruning (30-50% FLOPs reduction); and knowledge distillation (student model achieves 95-98% of teacher accuracy with 70-90% fewer parameters). For drone inspections, selective upload strategy (on-device inference at 10-30 FPS, transmitting only crack-positive frames) reduces bandwidth from ~15-25 Mbps (full 4K video) to ~1-10 Mbps, enabling multi-drone fleet operations.
: ICAO Annex 14, Volume I (8th Edition, 2018) classifies cracks by width: hairline (<1 mm), narrow (1-3 mm), medium (3-6 mm), and wide (>6 mm). Any crack >3 mm wide at the surface requires sealing or repair within 90 days; spalling at crack edges accelerates the timeline to 30 days. FAA Advisory Circular 150/5200-30D requires documentation of any surface condition affecting aircraft braking or directional control. The Runway Condition Code (RwyCC) ranges from 0 to 6, harmonized with ICAO. ASTM D5340-12 defines Pavement Condition Index (PCI) deduct values for crack severity and density. AI crack detection directly supports these regulatory frameworks by providing objective, repeatable crack measurements at pixel-level precision across entire runway surfaces during a single drone or vehicle survey pass.
: Key limitations include: (1) Generalization across pavements — models trained on one surface type (e.g., asphalt) degrade on another (e.g., concrete) by 5-15% IoU without fine-tuning or domain adaptation; (2) Lighting sensitivity — shadows, wet surfaces, and low-angle sun reduce detection accuracy by 10-20%; (3) Thin crack detection — cracks narrower than 2-3 pixels are near the resolution limit of segmentation models; (4) Class imbalance — crack pixels constitute <5% of training data, requiring specialized loss functions and data augmentation; (5) False positives from surface features — oil stains, construction joints, tire marks, and surface texture variations produce non-crack anomalies; (6) Human-in-the-loop verification remains necessary for safety-critical infrastructure decisions; (7) Regulatory acceptance — AI-based inspection results require validation against established methods (chain drag, impact-echo, core sampling) for official pavement condition reporting.

Automate Your Runway & Pavement Crack Inspections

Deploy AI-powered crack detection from drone and vehicle imagery for automated runway, road, and bridge deck inspections. Get pixel-accurate crack segmentation, width measurement, and severity classification integrated with your asset management system.

Learn more

Crack Segmentation

Crack segmentation is the computer vision task of classifying every pixel in an image as either crack or non-crack, producing a binary mask that enables precise...

Nov 18, 2025 32 min read

Computer Vision Deep Learning +2

Semantic Segmentation for Infrastructure Scene Understanding

Semantic segmentation assigns a category label to every pixel in an image, enabling full-scene understanding for infrastructure inspection. Covers encoder-decod...

Jun 17, 2026 37 min read

Technology Computer Vision +3

Automated Crack Width Measurement from Imagery

Automated crack width measurement derives the opening width of detected cracks from segmented pixel masks using Euclidean distance transform from crack edges to...

Jun 17, 2026 23 min read

technology inspection +4

AI-Based Crack Detection for Infrastructure Inspection

AI-Based Crack Detection for Infrastructure Inspection

Problem Definition and Challenges

Model Architectures for Crack Detection

U-Net

DeepLabV3+

Vision Transformers (ViT)

DINOv3

CrackNet

EGA-UNet (2025)

Architecture Comparison

Training Datasets for Crack Detection

Crack500

DeepCrack

CrackForest Dataset (CFD)

GAPs384

NHA12D

Class Imbalance and Loss Functions

Crack Classification vs. Segmentation

Evaluation Metrics

Intersection over Union (IoU)

Dice Coefficient (F1 Score)

Precision and Recall

Mean Average Precision (mAP)

Metric Comparison

Crack Width and Length Measurement from Segmentation

Skeletonization

Distance Transform for Width Measurement

Crack Length Calculation

Pixel-to-Millimeter Calibration

Complete Measurement Pipeline

Generalization Across Pavement Types and Lighting

Cross-Pavement Generalization

Lighting Variation

Edge Deployment for Real-Time Crack Detection

Hardware Platforms

Inference Optimization

Bandwidth Strategy for Drone Inspections

Human-in-the-Loop Verification

Current Limitations and Future Directions

Current Limitations

Future Directions

References

Frequently Asked Questions

Automate Your Runway & Pavement Crack Inspections

Learn more

Crack Segmentation

Semantic Segmentation for Infrastructure Scene Understanding

Automated Crack Width Measurement from Imagery

Cookie Settings

Necessary Cookies

Analytics Cookies