Defect Head Evaluation and Smoke Testing

Defect head smoke testing is an automated verification procedure that validates the structural integrity and basic functionality of a defect detection machine learning pipeline. It confirms that the pipeline — from input image preprocessing through the DINOv3 vision transformer backbone to the 5-label Multi-Layer Perceptron (MLP) classification head — produces expected outputs on synthetic or small static test data without crashing, producing numerical errors, or generating structurally invalid predictions. The smoke test is distinct from a full evaluation: it verifies that the pipeline is correctly wired and operational, not that it generalizes to unseen data with high accuracy.

Technical diagram of a deep learning architecture pipeline for structural defect detection showing a DINOv3 vision transformer backbone feeding into an MLP classification head with five output defect classes

Defect Head Architecture

The defect head is the final component of TarmacView’s structural defect detection pipeline, responsible for mapping the rich feature representations extracted by the backbone network to discrete defect class predictions. Understanding the architecture of both the backbone and the head is essential for designing effective smoke tests that validate each component’s integrity.

DINOv3 Backbone

The DINOv3 (self-DIstillation with NO labels, version 3) vision transformer, developed by Meta AI and released in 2023, serves as the feature extraction backbone. DINOv3 was trained using a self-supervised learning paradigm on a curated dataset of 142 million unlabeled images (LVD-142M), learning general-purpose visual representations without requiring any human annotations. This approach produces features that transfer effectively to downstream tasks including defect classification, segmentation, and detection — often outperforming supervised pretraining on ImageNet-1K.

DINOv3 is available in multiple model variants with different computational profiles:

VariantParametersEmbedding DimPatch SizeLayersHeads
ViT-S/1422 million38414×14126
ViT-B/1486 million76814×141212
ViT-L/14300 million102414×142416
ViT-g/141.1 billion153614×144024

For TarmacView’s defect detection pipeline, the ViT-B/14 variant is the standard configuration. With 86 million parameters and a 768-dimensional embedding space, it balances representational capacity with computational efficiency suitable for processing large volumes of runway inspection imagery. The 14×14 patch size means that a 224×224 pixel input image is divided into 16×16 = 256 non-overlapping patches, each projected into the 768-dimensional embedding space through a learned linear projection.

The DINOv3 training methodology combines several key techniques. Self-distillation with teacher-student architecture ensures that the student network learns to match the teacher’s representations, with the teacher being an exponential moving average of the student. iBOT (image BERT pre-training with Online Tokenizer) applies masked image modeling, where random patches are masked and the model must predict the masked patch representations. Sinkhorn-Knopp centering from the SwAV method prevents representation collapse by enforcing uniform distribution across batch samples. KoLeo regularizer encourages diversity in the learned features by penalizing feature similarity between nearby samples.

For the defect detection use case, DINOv3 is loaded with pretrained weights and typically frozen during defect head training. The frozen backbone extracts general visual features — edges, textures, gradients, surface patterns — that are highly relevant for distinguishing between intact pavement and the five defect classes. Freezing the backbone reduces the number of trainable parameters from 86 million to approximately 1-3 million (depending on MLP head depth), dramatically reducing training time, GPU memory requirements, and the risk of catastrophic forgetting on small domain-specific datasets.

MLP Defect Head

The MLP (Multi-Layer Perceptron) defect head is a small feedforward neural network that takes the frozen DINOv3 embeddings as input and produces a 5-dimensional probability distribution over the five defect classes: crack, spalling, efflorescence, exposed rebar, and corrosion.

The standard architecture consists of:

Input layer: Accepts the DINOv3 embedding — either the [CLS] token (a 768-dimensional vector representing global image content) or a pooled representation of all patch tokens. The [CLS] token approach is standard because DINOv3 is specifically trained to produce rich information in the [CLS] token during self-distillation.

Hidden layers: Typically 1-2 fully connected layers with ReLU or GELU activation functions. A single hidden layer configuration might be 768 → 256 → 5, while a deeper configuration might be 768 → 512 → 128 → 5. Each hidden layer is followed by batch normalization or layer normalization to stabilize training and reduce internal covariate shift. Dropout (rate 0.2-0.5) is applied between hidden layers during training as a regularizer to prevent overfitting, given that infrastructure defect datasets are typically small (500-5000 images).

Output layer: A linear projection to 5 units corresponding to the five defect classes, followed by a softmax activation that converts logits to a probability distribution over the classes. The softmax function ensures that the output vector sums to 1.0, with each element representing the predicted probability that the input image belongs to that defect class.

Training procedure: The MLP head is trained via supervised fine-tuning while the DINOv3 backbone remains frozen. The loss function is categorical cross-entropy, comparing the predicted probability distribution against the one-hot encoded ground truth labels. Training typically uses the AdamW optimizer with a learning rate of 1e-3 to 1e-4, batch size of 32-128, and early stopping based on validation loss. Data augmentation (random rotation, horizontal flip, color jitter, random crop) is applied during training to improve generalization.

For inference, the DINOv3 backbone processes each input image into patch and [CLS] token embeddings in a single forward pass. The [CLS] embedding is extracted and passed through the MLP head. The output softmax probabilities are thresholded (typically at 0.5 or optimized via ROC analysis) to produce a binary prediction for each defect class. Because the five defect classes are not mutually exclusive — a single pavement area can exhibit both cracking and spalling simultaneously — the thresholded predictions for each class are independent, and the output is properly interpreted as multi-label rather than single-class prediction.

Integration with TarmacView Pipeline

In the TarmacView analysis pipeline, the defect head operates at two granularity levels.

Tile-level analysis: The runway surface is divided into a grid of image tiles (typically 224×224 or 512×512 pixels at the inspection resolution of 0.5-2.0 mm/pixel). Each tile is processed independently through the DINOv3 backbone and MLP defect head, producing a 5-element probability vector per tile. Tile-level predictions are stored as per-tile columns in the analysis output: tile_crack_conf, tile_spalling_conf, tile_efflorescence_conf, tile_exposed_rebar_conf, tile_corrosion_conf.

Frame-level aggregation: Individual tile predictions within a camera frame or runway section are aggregated to produce frame-level defect assessments. Aggregation methods include: max pooling (the maximum confidence across all tiles in the frame), mean pooling (average confidence), top-k voting (the proportion of tiles exceeding a threshold), and spatial density (the count of defect tiles per unit area). Frame-level columns in the output include frame_crack_flag, frame_spalling_flag, frame_defect_count, and frame_max_defect_conf.

The analysis output schema is a critical component that smoke tests must validate. If the tile-level confidence columns or frame-level aggregation columns are missing, renamed, or contain invalid values (NaN, inf, negative probabilities), downstream Pavement Condition Index (PCI) calculations and reporting pipelines will fail.

Smoke Test Assertions

Smoke test assertions are the specific, automated checks that verify the defect head pipeline is functioning correctly. Each assertion targets a specific failure mode and produces a clear pass/fail result that can be integrated into CI/CD pipeline gating.

Checkpoint Validation Assertions

The first category of smoke test assertions verifies that the defect head checkpoint — the saved model weights file — is valid and loadable. The assertions include:

Checkpoint file existence: The test asserts that the checkpoint file exists at the specified path. This catches issues where a training run failed, the checkpoint was not uploaded to the model registry, or the file path was incorrectly configured in the deployment environment. The assertion is: assert os.path.exists(checkpoint_path), f"Checkpoint not found at {checkpoint_path}".

File size and checksum validation: The test verifies that the checkpoint file has a non-zero file size and optionally validates its MD5 or SHA256 checksum against a stored baseline. A zero-byte file or corrupted download will be caught here. The assertion is: assert os.path.getsize(checkpoint_path) > 0 and optionally assert sha256(file) == expected_sha256.

Torch loadability: The test loads the checkpoint using torch.load() and asserts that the operation completes without throwing an exception. This catches corrupted files, version incompatibilities (e.g., a checkpoint saved with PyTorch 2.0 attempting to load with PyTorch 1.8), and missing dependencies. The assertion wraps the load call in a try/except block and fails on any exception.

State dictionary structure: After loading, the test asserts that the checkpoint contains the expected state dictionary keys. For the DINOv3 backbone, expected keys include backbone.cls_token, backbone.patch_embed.proj.weight, and transformer block parameters. For the MLP head, expected keys include head.0.weight, head.0.bias, head.2.weight, head.2.bias (for a 2-layer MLP). The test also verifies that all expected keys are present and that no unexpected keys exist, which could indicate a model architecture mismatch.

Forward Pass Assertions

The second category verifies that a forward pass through the combined backbone and head produces valid outputs.

Tensor shape validation: The test creates a synthetic input tensor of the expected shape (typically [batch_size, 3, height, width] with batch_size=1-4, height=width=224 for ViT-B/14), passes it through the model, and asserts that the output tensor shape is [batch_size, 5] — exactly 5 logits corresponding to the 5 defect classes. The assertion is: assert output.shape == (batch_size, 5), f"Expected shape (batch_size, 5), got {output.shape}".

Numerical stability validation: The test asserts that no output value is NaN (Not a Number), infinity, or negative infinity. NaN values can arise from numerical instability in the transformer backbone (e.g., attention logit overflow), division by zero in normalization layers, or corrupted weights. The assertion is: assert not torch.isnan(output).any(), "Output contains NaN values" and assert not torch.isinf(output).any(), "Output contains inf values".

Softmax probability validation: The test applies softmax to the raw logits and asserts that the resulting probabilities sum to 1.0 for each sample in the batch (within floating-point tolerance, typically 1e-5). This confirms that the output layer is correctly configured and that no post-processing step is corrupting the probability distribution. The assertion is: assert torch.allclose(probs.sum(dim=1), torch.ones(batch_size), atol=1e-5).

Prediction Distribution Assertions

The third category verifies that the model produces reasonable prediction distributions rather than degenerate outputs.

Non-uniform distribution check: The test asserts that the predicted probabilities are not uniform across all classes (which would indicate a model that has not learned any discriminative features). The entropy of the predicted distribution is computed and compared against a minimum threshold. A completely uniform distribution has maximum entropy (log(5) ≈ 1.61 nats for 5 classes), while a confident prediction has low entropy. The assertion is: assert entropy < 1.5, "Predictions are near-uniform, model may not be trained".

Class coverage check: The test runs inference on a small set of diverse input images and asserts that each of the 5 defect classes is the highest-confidence prediction for at least one input. This verifies that no class is systematically suppressed — for example, a model that never predicts “efflorescence” would indicate a training data imbalance or head configuration issue. The assertion is: assert set(predicted_classes) == set(range(5)), f"Classes {missing} never predicted".

Background class handling: If the model includes an implicit background or “no defect” class, the smoke test verifies that an image of intact pavement — free of any defect — produces a background prediction with confidence above a threshold (typically 0.8). This confirms that the model can correctly reject negative examples, which is critical for avoiding false positive alarms in production inspections.

Checkpoint Validation

Checkpoint validation is a foundational smoke test component that confirms the model artifact — the saved neural network weights — is intact, loadable, and structurally consistent with the expected architecture. In production ML systems, checkpoint corruption or version mismatch is one of the most common failure modes, and catching it early in CI/CD prevents cascading failures downstream.

Checkpoint Storage and Versioning

TarmacView defect head checkpoints are stored in the model registry — a centralized artifact store with versioning, metadata, and lineage tracking (MLflow Model Registry or DVC). Each checkpoint is identified by a unique combination of model name, version number, and run ID. The checkpoint file itself is a serialized PyTorch state dictionary (typically model.pt or checkpoint.pt) containing the learned parameters of both the DINOv3 backbone (if fine-tuned) and the MLP head.

The smoke test first resolves the checkpoint path from the model registry, handling the following cases:

  • Explicit version: A specific version is specified (e.g., defect-head:v3), and the test loads that exact version.
  • Latest version (production): The “latest production” alias is resolved to the current production checkpoint.
  • Staging version: A candidate checkpoint awaiting promotion to production is loaded for pre-deployment validation.
  • Branch-specific checkpoint: A checkpoint from a feature branch training run is loaded for branch-level validation.

Backbone Consistency Check

The DINOv3 backbone is a large model with 86 million parameters for the ViT-B/14 variant. The smoke test verifies that the loaded backbone matches the expected architecture by checking:

Weight tensor shapes: Each parameter tensor in the loaded state dictionary is verified against the expected shape. For example, the backbone.patch_embed.proj.weight tensor should have shape (768, 3, 14, 14) for a ViT-B/14 with 3 input channels, 768 output channels, and a 14×14 patch kernel. A shape mismatch would indicate that the checkpoint was trained with a different configuration (different patch size, different embedding dimension, different input channels).

Numerical range sanity: The test verifies that weight values fall within expected numerical ranges. Transformer attention weights should have values distributed roughly as N(0, σ²) with σ depending on the initialization scheme. Extreme values (|w| > 10) across all layers would indicate training divergence or checkpoint corruption. The check computes the mean and standard deviation of each parameter tensor and flags outliers.

Output embedding consistency: The test runs a fixed synthetic input through the backbone and compares the output embedding distribution against a stored baseline. The baseline is generated during the first successful smoke test run and stored as a reference. The assertion checks that the mean and variance of the embedding do not drift beyond a tolerance (typically ±5%). This catches silent model degradation that does not produce NaN or inf values but still produces anomalous embeddings.

Head Architecture Verification

The MLP head is smaller than the backbone but equally critical. The smoke test verifies:

Layer count: The head should have exactly the expected number of layers. For a 2-layer MLP with hidden dimension 256, the expected keys include head.0.weight (768×256), head.0.bias (256), head.2.weight (256×5), head.2.bias (5). The layer numbering accounts for the activation function (layer 1) between the linear layers.

Output dimension: The final linear layer’s output dimension must be exactly 5, corresponding to the 5 defect classes. This is verified by checking head.2.weight.shape[0] == 5.

Weight initialization consistency: The test checks that weights are not frozen at initialization values (all zeros or all ones). A head with all-zero weights would produce uniform logits regardless of input, indicating a training failure. The check verifies that head.2.weight.std() > 0.001.

AP and F1 Metric Verification

While the smoke test is primarily about pipeline integrity rather than model quality, including lightweight metric computation in the smoke test provides an early warning of significant model regression. The smoke test computes efficacy metrics on synthetic or small static test data, comparing them against baselines stored from previous validated runs.

Average Precision Computation

Average Precision (AP) is the area under the precision-recall curve, computed across confidence thresholds from 0 to 1. The smoke test computes AP for each of the 5 defect classes using COCO-style evaluation:

  1. Model predictions are generated for the smoke test dataset (typically 50-100 synthetic images with known ground truth).
  2. For each defect class, predictions are sorted by confidence score in descending order.
  3. Precision and recall are computed at each prediction, creating a precision-recall curve.
  4. AP is computed as the area under this curve using the trapezoidal rule.

The smoke test AP assertions include:

AP@0.50 (PASCAL VOC metric): AP at IoU threshold 0.50. The assertion is that AP@0.50 for each class exceeds a minimum threshold. For synthetic test data with known, clean defect patterns, a typical threshold is AP@0.50 > 0.85 for all 5 classes. If the model achieves AP@0.50 below this threshold on trivial synthetic data, it indicates severe regression.

AP@0.50 :0.95 (COCO primary metric): The average of AP values computed at IoU thresholds 0.50, 0.55, …, 0.95. The assertion threshold is lower — typically AP@0.50 :0.95 > 0.50 — because the strict IoU thresholds are more challenging even on synthetic data.

Per-class AP consistency: The variance of AP across the 5 classes is checked. A standard deviation exceeding 0.15 would indicate that one class has regressed significantly relative to others, suggesting a problem specific to that defect type (e.g., insufficient training examples for efflorescence).

The synthetic test dataset is carefully constructed to ensure metric stability. Each synthetic image contains exactly one defect type overlaid on a pavement-like background texture. Defects are generated using procedural techniques: cracks as thin, branched black lines with Gaussian blur for realism; spalls as irregular circular/oval regions with edge roughness; efflorescence as white, amorphous patches with varying opacity; exposed rebar as periodic dark circular patterns; corrosion as rust-colored irregular blobs. The synthetic dataset is versioned and checked into the repository to ensure deterministic, reproducible smoke test results.

F1-Score Verification

F1-score is the harmonic mean of precision and recall, providing a single balanced measure of model performance. The smoke test computes F1 at a fixed confidence threshold (typically 0.5) for each defect class.

The F1 assertions include:

Minimum F1 per class: Each class must achieve F1 > 0.80 on the synthetic test set. The multi-label nature of the defect prediction task means that F1 is computed independently per class.

Macro-averaged F1: The unweighted average of F1 across all 5 classes is computed. The assertion threshold is macro-F1 > 0.85. The macro-average treats all classes equally, so regression on a rare class (e.g., exposed rebar) is immediately visible.

Precision-recall balance: The ratio of precision to recall is checked for each class. A ratio above 1.5 or below 0.67 indicates imbalance — the model is either too conservative (high precision, low recall, missing many defects) or too aggressive (high recall, low precision, generating many false positives). The assertion flags any class where the ratio is outside [0.67, 1.5].

MetricSynthetic Test ThresholdPurpose
AP@0.50> 0.85Basic detection capability per class
AP@0.50 :0.95> 0.50Comprehensive detection quality
Per-class AP std< 0.15Class balance check
F1 per class> 0.80Balanced precision-recall per class
Macro-averaged F1> 0.85Overall model quality
Precision/Recall ratio[0.67, 1.5]Precision-recall balance per class

Baseline Comparison and Drift Detection

The smoke test stores metric baselines from the last validated run and compares current metrics against these baselines. A significant drop (>5% relative decrease) in any metric triggers a smoke test failure, even if the absolute metric value is above the minimum threshold. This catches gradual degradation — models that pass absolute thresholds but are consistently declining in performance across successive training runs or data updates.

Metric histories are logged to a time-series database (MLflow, Weights & Biases, or a simple CSV file in the repository). The smoke test reads the last 10 validated metric values and fits a linear trend. If the slope is negative and statistically significant (p < 0.05), the test outputs a warning but does not fail — threshold-only failure is used for CI/CD gating to avoid noisy pipeline breaks from minor metric fluctuations.

Analyze Output Column Verification

A critical smoke test validates that the analysis output — the structured data produced by running the defect head on inspection imagery — contains all expected columns with correct data types and valid values. This bridges the gap between model inference and downstream Pavement Condition Index (PCI) computation, reporting, and GIS integration.

Expected Column Schema

The TarmacView analysis output is a tabular format (Parquet, CSV, or database table) with columns organized into tiers:

Tile-level defect confidence columns — one float column per defect class, representing the MLP head’s softmax confidence that the tile contains that defect:

Column NameData TypeValid RangeDescription
tile_crack_confFloat32[0.0, 1.0]Probability of crack presence
tile_spalling_confFloat32[0.0, 1.0]Probability of spalling presence
tile_efflorescence_confFloat32[0.0, 1.0]Probability of efflorescence presence
tile_exposed_rebar_confFloat32[0.0, 1.0]Probability of exposed rebar presence
tile_corrosion_confFloat32[0.0, 1.0]Probability of corrosion presence

Frame-level aggregation columns — summarizing defect presence across all tiles in a camera frame or runway section:

Column NameData TypeValid RangeDescription
frame_defect_countInt32[0, max_tiles]Number of tiles with any defect above threshold
frame_max_defect_confFloat32[0.0, 1.0]Maximum confidence across all defects and tiles
frame_crack_flagBoolean{0, 1}Any tile has crack_conf > threshold
frame_spalling_flagBoolean{0, 1}Any tile has spalling_conf > threshold
frame_efflorescence_flagBoolean{0, 1}Any tile has efflorescence_conf > threshold
frame_exposed_rebar_flagBoolean{0, 1}Any tile has exposed_rebar_conf > threshold
frame_corrosion_flagBoolean{0, 1}Any tile has corrosion_conf > threshold

Metadata columns — identifying the spatial and temporal context of each analysis record:

Column NameData TypeDescription
image_idStringUnique identifier for the source image
tile_xInt32Tile column index in the runway grid
tile_yInt32Tile row index in the runway grid
frame_timestampDateTimeCapture time of the source frame
gps_latFloat64GPS latitude of the tile center
gps_lonFloat64GPS longitude of the tile center

Column Presence Assertions

The smoke test loads the analysis output and asserts that each expected column exists using simple column-name matching:

expected_tile_cols = ["tile_crack_conf", "tile_spalling_conf",
                      "tile_efflorescence_conf", "tile_exposed_rebar_conf",
                      "tile_corrosion_conf"]
expected_frame_cols = ["frame_defect_count", "frame_max_defect_conf",
                       "frame_crack_flag", "frame_spalling_flag",
                       "frame_efflorescence_flag", "frame_exposed_rebar_flag",
                       "frame_corrosion_flag"]
expected_meta_cols = ["image_id", "tile_x", "tile_y",
                      "frame_timestamp", "gps_lat", "gps_lon"]

actual_cols = set(df.columns)
assert expected_cols.issubset(actual_cols), f"Missing columns: {expected_cols - actual_cols}"

Data Type and Value Range Assertions

For each defect confidence column, the smoke test asserts:

Float type: The column data type is float32 or float64. Unexpected types (int, string, object) indicate a serialization or pipeline error. The assertion uses assert df[col].dtype in [np.float32, np.float64].

Value range: All values are in [0.0, 1.0]. Values outside this range indicate a softmax or normalization failure. The assertion uses assert df[col].between(0.0, 1.0).all().

Missing value check: No values are NaN or None. NaN values in confidence columns indicate that the inference pipeline produced no output for some tiles — a serious failure. The assertion uses assert df[col].notna().all().

Aggregation Consistency Assertions

The frame-level columns should be consistent with the tile-level data from which they are derived. The smoke test validates:

frame_defect_count equals count of tiles where max confidence exceeds threshold: For each frame group, the test recomputes the defect count from tile-level data and asserts it matches the stored frame value. This catches aggregation logic errors in the pipeline.

frame_max_defect_conf equals max of all tile confidence values: The test recomputes the maximum from tile-level data and asserts match.

frame_flag is consistent with tile_conf: For each frame, the flag should be 1 if any tile has the corresponding confidence above threshold, and 0 otherwise. The test verifies this for all 5 defect types.

These consistency checks operate on the principle that the frame-level columns should be deterministically derivable from the tile-level columns. If the aggregation logic is correct, the checks should always pass. A failure indicates a bug in the analysis pipeline post-processing step, not in the model itself.

Column Addition and Removal Detection

The smoke test compares the expected column set against the actual column set and generates warnings for:

  • Unexpected new columns: If a new column appears (e.g., tile_moisture_conf), the test warns but does not fail, as this may indicate a pipeline enhancement that requires downstream integration updates.
  • Missing columns (fail condition): If an expected column is absent, the test fails immediately. This could happen if the defect head was replaced with a different architecture that outputs different classes, or if the column was accidentally removed during a pipeline refactoring.
  • Renamed columns: If a column name changes (e.g., tile_crack_conftile_cracking_conf), the test fails, preventing silent downstream breakage when reporting dashboards, APIs, or databases reference the old column names.

Gating Logic Verification

The gating logic determines whether the defect head passes or fails the smoke test as a whole, based on a weighted combination of individual assertion results. Gating is the mechanism that prevents a failing model from being deployed to production.

Assertion Weighting

Not all smoke test assertions are equally critical. The gating system assigns each assertion to a severity tier:

TierWeightEffect on GateExamples
Fatal1.0Gates immediatelyCheckpoint load failure, NaN in outputs
Critical0.8Gates if >1 failureMissing columns, output shape mismatch
Warning0.4Gates if >3 failuresPer-class AP below threshold
Info0.0Logs only, no gateMetric trend warnings, column deprecation notices

Fatal assertions are those where no valid pipeline execution is possible — checkpoint is corrupted, model cannot load, or inference produces invalid numerical values. A single fatal failure gates the deployment.

Critical assertions indicate that the pipeline produces structurally valid but potentially incorrect results — missing columns would cause downstream reporting failures, output shape mismatch indicates a model architecture mismatch with the serving infrastructure.

Warning assertions indicate that the model metrics are below nominal thresholds but the pipeline is structurally sound. These are aggregated: if more than 3 warnings fire in a single run, the gate activates.

Info assertions are purely observational — they log metric drift trends, column deprecation notices, and performance comparisons against previous runs — but never gate deployment.

Pass/Fail Criteria

The overall smoke test result is computed as:

gate_score = max(fatal_failures,
                  critical_failures > 1 ? 1.0 : 0.0,
                  warning_failures > 3 ? 1.0 : 0.0)

If gate_score >= 1.0, the smoke test fails and deployment is blocked. If gate_score < 1.0, the smoke test passes, and the pipeline proceeds to full evaluation or deployment.

The composite pass/fail message summarizes the result:

SMOKE TEST: FAIL
  - Fatal: 1 [checkpoint_load_failure]
  - Critical: 0
  - Warning: 2 [class_crack_ap_below_threshold, class_efflorescence_f1_below_threshold]
  - Info: 1 [metric_class_crack_ap dropped 3.2% since last run]

Deployment Gate Integration

The smoke test gate integrates with the deployment pipeline through:

Post-commit hook: The smoke test runs on every pull request commit. If the gate fails, the CI/CD system blocks the merge (GitHub branch protection rule, GitLab merge request pipeline failure).

Pre-deployment gate: Before a model is promoted from staging to production, the smoke test runs again on the exact deployment candidate artifact. This catches issues that may not have been present during development — for example, a development environment with different CUDA version than the production serving environment.

Rollback trigger: If the smoke test passes deployment but a subsequent production incident is traced to the defect head, the smoke test gating logic is audited. If a warning-level assertion should have been a critical assertion, the gating configuration is updated to prevent recurrence.

Domain Applicability Tests

Domain applicability tests extend the basic smoke test to verify that the defect head performs correctly across the specific operational conditions that TarmacView encounters in airfield pavement and infrastructure inspection. These tests ensure that the pipeline is not only functional but fit for purpose in the target domain.

Airport Pavement Surface Types

The defect head must perform consistently across different pavement types encountered on airport surfaces:

Asphalt (flexible) pavements: Runways, taxiways, and aprons constructed with hot mix asphalt (HMA). Defects on asphalt include fatigue cracking (alligator pattern), longitudinal cracking, transverse cracking, rutting, and raveling. The smoke test includes synthetic images with asphalt-like textures (dark gray, aggregate visible, varying surface roughness) and verifies that crack and spalling detection maintain nominal confidence levels.

Concrete (rigid) pavements: Runways and aprons constructed with Portland cement concrete (PCC). Defects include joint spalling, corner breaks, linear cracking, efflorescence (white calcium deposits at joints), exposed rebar (at spalled areas), and corrosion staining. The smoke test verifies that the model correctly identifies efflorescence and exposed rebar — defects that are far more prevalent on concrete than asphalt surfaces.

Composite pavements: Asphalt overlays on existing concrete. Defects include reflective cracking (asphalt cracks following underlying concrete joint pattern) and delamination. The test verifies that the model can detect cracks on composite surfaces without confusion from the underlying joint pattern.

Porous friction courses (PFC): High-permeability asphalt used on runways for improved drainage and friction. PFC has a distinctive open-graded texture that appears visually different from dense-graded HMA. The test verifies that the model does not produce elevated false positive rates on PFC surfaces, where the rough texture might be confused with cracking or spalling.

Lighting and Weather Conditions

ICAO Annex 14 and FAA AC 150/5320-5D specify that runway surface condition assessments must be valid under operational conditions. The domain applicability smoke test verifies that the defect head maintains performance across:

Direct sunlight: High contrast, strong shadows. The test verifies that confidence values are not systematically lower in high-contrast conditions due to shadow-induced false positives.

Overcast/diffuse light: Low contrast, no shadows. The test verifies that fine cracks (narrow, low contrast against pavement) are still detectable at reduced confidence levels.

Wet pavement: Water in cracks increases crack visibility but introduces specular reflections. The test verifies that wet surface tiles do not produce elevated false positives due to specular highlights being confused with efflorescence (both appear as bright regions).

Dawn/dusk: Low light levels, long shadows. The test verifies that the model produces outputs within expected confidence ranges even at reduced illumination levels.

The smoke test simulates these conditions by applying controlled photometric transforms to the synthetic test images: brightness scaling for lighting simulation, Gaussian blur for haze/fog simulation, and saturation adjustment for wet surface simulation.

Spatial Resolution Variation

Inspection imagery varies in ground sampling distance (GSD) depending on the capture platform:

PlatformTypical AltitudeGSD (mm/pixel)Tile Coverage
UAV (high resolution)15-20 m0.5-1.00.1-0.5 m²
UAV (standard)30-50 m1.0-2.00.5-2.0 m²
Vehicle-mounted2-3 m0.3-0.80.05-0.2 m²
Handheld1-1.5 m0.2-0.50.02-0.08 m²

The smoke test verifies that the defect head produces consistent outputs across a range of input resolutions. Synthetic test images are generated at multiple scales (0.5×, 1.0×, 2.0× the nominal GSD) and passed through the model. The test asserts that the predicted class distribution does not shift by more than 10% between resolutions, ensuring the model is approximately scale-invariant within the operational range.

Defect Severity Levels

ASTM D5340 defines three severity levels (Low, Medium, High) for each defect type. The smoke test verifies that the defect head’s confidence scores correlate with defect severity:

Low severity: Hairline cracks (<1mm width), small spalls (<150mm length), light efflorescence, minimal corrosion staining. The test asserts that these produce confidence scores above the detection threshold (>0.5) but not at maximal confidence (<0.8).

Medium severity: Cracks (1-3mm width), spalls (150-600mm length), moderate efflorescence deposits, visible exposed rebar with light corrosion. The test asserts that confidence scores are high (>0.7).

High severity: Wide cracks (>3mm width), large spalls (>600mm length), heavy efflorescence with surface disruption, exposed rebar with heavy corrosion and section loss. The test asserts that confidence scores are very high (>0.9).

Severity correlation verification is a warning-level assertion in the gating system — the model may still function correctly even if severity correlation is imperfect, but the test flags it as an area for model improvement.

Smoke Test vs Full Evaluation

Understanding the distinction between smoke testing and full evaluation is critical for designing an effective ML quality assurance strategy. The two approaches serve fundamentally different purposes and operate at different points in the development and deployment lifecycle.

Purpose and Scope

DimensionSmoke TestFull Evaluation
GoalVerify pipeline integrityMeasure model quality
Question answered“Does the pipeline run correctly?”“Is the model accurate enough?”
DataSynthetic / small static set (10-100 images)Large held-out validation set (500-5000+ images)
DurationSeconds to minutesMinutes to hours
ComputeCPU or minimal GPUFull GPU (often multi-GPU)
FrequencyEvery commit / PRNightly, weekly, or per-release
Metric thresholdsGenerous (AP > 0.50)Stringent (AP > 0.75)
CoverageStructural integrity onlyStatistical generalization
Failure actionBlock merge/deploymentFlag for review

The smoke test is designed to catch pipeline errors — the class of bugs that cause the entire system to fail or produce meaningless outputs. These include checkpoint corruption, version incompatibility, preprocessing pipeline breaks, missing columns, NaN outputs, and shape mismatches. Industry data from ML engineering teams shows that pipeline errors account for 60-70% of failed training runs and 40% of deployment rollbacks. Smoke tests catch these errors in seconds, before expensive full evaluation runs are initiated.

The full evaluation is designed to measure model quality — the statistical accuracy, precision, recall, and generalization of the model’s predictions. It uses large, diverse, representative validation datasets, computes rigorous metrics (AP@0.50 :0.95, class-wise F1, confusion matrices, precision-recall curves at multiple thresholds), and compares results against both absolute thresholds and relative baselines from previous model versions. Full evaluation runs are computationally expensive and time-consuming, making them unsuitable for per-commit execution.

Data Requirements

Smoke test data is synthetically generated to be simple, clean, and deterministic. Each synthetic image contains exactly one defect type on a uniform background, with no occlusion, no overlapping defects, and no challenging lighting conditions. This minimizes variability and ensures that any metric fluctuation in the smoke test is attributable to the model, not to data variability.

Full evaluation data is real-world inspection imagery with the following characteristics: diverse pavement types and ages, all operational lighting conditions, varying defect severity levels, overlapping and adjacent defects, real-world occlusions (debris, tire marks, water), and accurate polygon-level ground truth annotations. This data represents the true distribution that the model encounters in production and provides a reliable estimate of deployment performance.

Data leakage prevention is critical for full evaluation but irrelevant for smoke tests — since the smoke test uses synthetic data, there is no risk of leaking real test data into training. The full evaluation dataset is carefully partitioned: training, validation, and test sets are split at the frame or runway level (not the tile level) to prevent spatial autocorrelation leakage, where adjacent tiles from the same runway appear in both training and test sets.

Computational Profile

A typical smoke test run for the defect head pipeline:

  1. Load checkpoint: 2-5 seconds (86M parameter DINOv3 backbone + 1-3M parameter MLP head)
  2. Generate synthetic inputs: < 1 second (50 synthetic images, 224×224 pixels)
  3. Forward pass (CPU): 10-30 seconds (DINOv3 ViT-B/14 inference on CPU)
  4. Column verification: < 1 second (Pandas DataFrame column checks)
  5. Metric computation: 2-5 seconds (AP, F1 on 50-image dataset)
  6. Gate evaluation: < 1 second
  7. Total: 15-45 seconds

A typical full evaluation run:

  1. Load checkpoint: 2-5 seconds
  2. Load validation dataset: 30-120 seconds (2000 images from disk/SSD)
  3. Forward pass (GPU): 5-30 minutes (batch inference on NVIDIA A10G/A100)
  4. Metric computation: 1-5 minutes (AP@0.50 :0.95 over all classes and IoU thresholds)
  5. Report generation: 10-30 seconds
  6. Total: 7-38 minutes

The smoke test is 10-100× faster than the full evaluation, enabling per-commit execution. The full evaluation runs on a slower cadence (per-night, per-release, per-promotion to production).

Failure Mode Coverage

Failure ModeDetected By
Corrupted checkpoint fileSmoke test (fatal)
NaN/inf in model outputsSmoke test (fatal)
Missing output columnsSmoke test (critical)
Wrong output tensor shapeSmoke test (critical)
Preprocessing normalization mismatchSmoke test (fatal)
Degenerate predictions (all same class)Smoke test (warning)
10% AP drop on new dataFull evaluation
Overfitting to a specific pavement typeFull evaluation
Calibration driftFull evaluation
Label noise in training dataFull evaluation

The coverage matrix demonstrates that smoke tests and full evaluations are complementary — each catches failure modes that the other misses. A comprehensive ML testing strategy requires both.

Integration with CI

Integration of the defect head smoke test into continuous integration (CI) pipelines is essential for catching regressions early and ensuring that every code change is validated before affecting production systems.

CI Pipeline Stages

The CI pipeline for TarmacView’s defect detection system is organized into sequential stages:

Stage 1 — Code Quality: Linting (flake8, pylint), type checking (mypy), unit tests (pytest for data-loading utilities, preprocessing functions, and metric computation functions). This stage runs on CPU and completes in 1-3 minutes. Failure blocks all downstream stages.

Stage 2 — Data Validation: Schema validation of training and evaluation datasets using Great Expectations or TensorFlow Data Validation. Checks column presence, data types, value ranges, and distribution statistics against expectations defined in the data contract. This stage runs on CPU and completes in 2-5 minutes.

Stage 3 — Defect Head Smoke Test: The full smoke test suite as described in this article. Runs on CPU (or minimal GPU if available) and completes in 15-60 seconds. Failure blocks merge to main.

Stage 4 — Unit Tests for Evaluation: Small-scale metric computation tests that verify AP computation, F1 computation, and confusion matrix generation produce correct outputs on hand-labeled tiny datasets (5-10 images with known ground truth). Runs on CPU, completes in 30 seconds.

Stage 5 — Training (on demand): Triggered only when model weights are expected to change (new training data, architecture changes, hyperparameter tuning). Not automatically triggered on every commit. Runs on GPU and takes 1-8 hours depending on dataset size.

Stage 6 — Full Evaluation (on merge to main): Triggered when code is merged to the main branch. Runs the complete evaluation suite on the held-out validation set, computes all metrics, compares against baselines, and publishes results to the model registry. Runs on GPU and takes 20-40 minutes.

Trigger Configuration

The smoke test is triggered on:

  • Every pull request commit: Ensures that any code change — including preprocessing, data loading, model architecture, or metric computation changes — does not break the inference pipeline.
  • Model registry updates: When a new model checkpoint is registered (new training run completed), the smoke test validates the artifact before it is promoted to staging.
  • Scheduled daily runs: Even in the absence of code changes, daily smoke tests catch infrastructure drift — CUDA version changes, dependency updates, filesystem changes — that could affect pipeline behavior.

Artifact Management

CI artifacts from the smoke test are stored and versioned:

  • Test logs: Structured JSON logs containing all assertion results, metric values, and pass/fail status for each assertion.
  • Metrics history: A time-series record of smoke test metrics (AP, F1, gate score) for each run, enabling trend analysis and regression detection.
  • Model checkpoint fingerprint: SHA256 hash of the model state dictionary used in the test, linking the test result to a specific model version.
  • Synthetic test data version: The version identifier of the synthetic dataset used, enabling reproducibility.

These artifacts are stored in the model registry alongside the model artifact itself, providing a complete audit trail: “This version of the model passed the smoke test with synthetic data v3.2 on CI run #4827 with commit a3f2c1.”

Failure Notification

When the smoke test fails, notifications are sent through multiple channels:

  • CI system notification: GitHub Checks API or GitLab pipeline UI shows the failure with detailed assertion results.
  • Slack/Teams alert: A message is sent to the ML engineering channel with the pipeline run ID, failure summary, and link to the full logs.
  • Email notification: For critical branch failures (main, release), an email is sent to the engineering team.
  • PagerDuty integration: Optional for production smoke test failures (when validating a model about to be deployed), escalating if not acknowledged within 15 minutes.

The notification includes a structured error report:

Subject: [SMOKE FAIL] defect-head pipeline - main - run #4827
Body:
  Commit: a3f2c1 (merged 12:34 UTC)
  Checkpoint: defect-head:v3 (production-candidate)
  Result: FAIL (gate_score=1.0)

  Fatal (1):
    - [output_nan] Output tensor contains NaN values
      Backbone forward pass produced NaN at layer norm 8

  Critical (0):

  Warning (2):
    - [class_efflorescence_ap] AP@0.50 = 0.42 below threshold 0.50
    - [class_efflorescence_f1] F1 = 0.55 below threshold 0.60

  Action required: Investigate NaN in backbone layer norm 8.
  Possible causes: corrupted checkpoint, CUDA version mismatch,
  or numerical instability in attention computation.

Interpreting Test Output

Correct interpretation of smoke test outputs is essential for diagnosing pipeline issues and determining the appropriate remediation actions.

Reading the Test Report

The smoke test generates a comprehensive JSON report structured as follows:

{
  "pipeline_id": "defect-head-smoke",
  "run_id": "2026-06-16-4827",
  "timestamp": "2026-06-16T12:34:56Z",
  "commit_sha": "a3f2c1d4e5b6...",
  "checkpoint_version": "defect-head:v3",
  "synthetic_data_version": "v3.2",
  "gate_result": "FAIL",
  "gate_score": 1.0,
  "assertions": {
    "checkpoint_file_exists": {"status": "PASS", "detail": "checkpoint.pt (842MB)"},
    "checkpoint_loadable": {"status": "PASS", "detail": "State dict loaded successfully"},
    "forward_pass_shape": {"status": "PASS", "detail": "Output shape (8, 5)"},
    "output_no_nan": {"status": "FAIL", "detail": "NaN found in 1 of 8 batch samples"},
    "output_no_inf": {"status": "PASS", "detail": "No inf values"},
    "softmax_sum": {"status": "PASS", "detail": "All sums within 1e-5 of 1.0"},
    "tile_columns_exist": {"status": "PASS", "detail": "All 5 tile columns present"},
    "frame_columns_exist": {"status": "PASS", "detail": "All 7 frame columns present"},
    "column_value_ranges": {"status": "PASS", "detail": "All values in [0.0, 1.0]"},
    "class_crack_ap50": {"status": "PASS", "detail": "AP@0.50 = 0.92"},
    "class_spalling_ap50": {"status": "PASS", "detail": "AP@0.50 = 0.88"},
    "class_efflorescence_ap50": {"status": "WARN", "detail": "AP@0.50 = 0.42"},
    "class_exposed_rebar_ap50": {"status": "PASS", "detail": "AP@0.50 = 0.91"},
    "class_corrosion_ap50": {"status": "PASS", "detail": "AP@0.50 = 0.90"}
  }
}

The report should be read from top to bottom, addressing fatal failures first (since they invalidate all downstream results), then critical failures (structural issues that would cause production failures), and finally warning failures (quality regressions that may require investigation).

Fatal Failure Diagnosis

Fatal failures indicate that the pipeline is completely non-functional. The most common root causes and remediation:

Checkpoint file not found: The checkpoint path specified in the pipeline configuration does not point to an existing file. Remediation: verify the model registry path, check that the training run completed and uploaded the artifact, or update the configuration with the correct path.

Checkpoint load failure: torch.load() raised an exception. Common causes include: file corruption (re-run training or restore from backup), PyTorch version mismatch (check that the deployment environment has the same PyTorch version as the training environment — torch.save() with PyTorch 2.0 produces files that load differently on PyTorch 1.x), or CUDA/non-CUDA mismatch (a checkpoint saved with CUDA tensors may fail to load on a CPU-only environment without map_location='cpu').

NaN in outputs: The most technically challenging fatal failure. Common causes include: numerical instability in the DINOv3 attention mechanism (layer normalization overflow with extreme input values), corrupted weights in a specific layer (check which layer produces the NaN by running with torch.autograd.set_detect_anomaly(True)), or preprocessing that produces out-of-range inputs (e.g., pixel values beyond [0,1] after normalization).

Output shape mismatch: The output tensor has a different shape than expected. Common causes include: the MLP head was replaced with a different architecture (different number of output classes), the backbone embedding dimension changed (checkpoint from a different DINOv3 variant), or the batch dimension was squeezed/unsqueezed incorrectly in the post-processing code.

Critical Failure Diagnosis

Critical failures indicate structural issues that would cause incorrect production behavior.

Missing columns: The analysis output DataFrame is missing expected columns. Common causes: the column naming convention was changed without updating downstream consumers, the aggregation logic was modified to rename columns, or the defect head output was changed (e.g., from 5 classes to 4 classes).

Value range violations: Confidence values outside [0.0, 1.0]. This almost always indicates a softmax malfunction — either softmax was not applied to the logits, or the wrong axis was used for softmax normalization. Check that F.softmax(logits, dim=1) is being used (class dimension, not batch dimension).

NaN in output columns: Similar to fatal NaN in model outputs but occurring in the post-processing aggregation step. Check for division by zero in tile-to-frame aggregation (e.g., dividing by number of tiles when frame has zero tiles), or missing value propagation from NaN model outputs that were not caught at the model level.

Warning Failure Diagnosis

Warning failures indicate quality degradation that may not require immediate deployment blocking but should be investigated.

Class-specific AP below threshold: A single defect class shows significantly lower performance than others. Common causes: insufficient synthetic training examples for that class (the synthetic data generator may produce unrealistic examples for some classes), class imbalance in the real training data affecting the head’s discriminative capability for rare classes, or the backbone features being less informative for certain defect types (e.g., efflorescence is characterized by color (white deposits) more than texture, while DINOv3 features may emphasize texture over color).

Precision-recall imbalance: The model is too conservative or too aggressive for specific classes. Common causes: the confidence threshold was optimized for overall performance but is suboptimal for individual classes, or the training data has asymmetric noise (more false negatives than false positives for a specific class).

Metric drift from baseline: Metrics have changed by more than 5% from the last validated run without code or data changes. This may indicate: non-determinism in the model (dropout or batch norm layers behaving differently in train vs eval mode — ensure model.eval() is called before inference), numerical drift due to hardware differences (CPU vs GPU floating-point accumulation order), or synthetic data generator changes producing different test samples.

Automated Remediation Suggestions

The smoke test output includes remediation suggestions for common failure modes:

FailureRemediation Suggestion
Checkpoint not foundVerify model registry path; run training to generate checkpoint
NaN in backboneSwitch to float32 precision if using float16; add gradient clipping
Missing columnUpdate column names in aggregation logic to match schema
Low AP on specific classAdd more synthetic training examples for that class; check class balance
Metric driftRun inference with torch.inference_mode() and model.eval(); check for non-deterministic operations

These suggestions are generated by a rule-based system that maps assertion failure patterns to known remediation actions, reducing mean-time-to-resolution (MTTR) for common failure modes.

Data science dashboard showing model evaluation metrics for defect detection with AP curves, precision-recall charts, F1 scores, and confusion matrix
Detailed close-up photograph of concrete airport runway surface showing multiple structural defects including cracks, spalling, efflorescence deposits, exposed rebar, and corrosion staining

Frequently Asked Questions

Validate Your Defect Detection Pipeline

TarmacView implements rigorous smoke testing and evaluation pipelines for AI-based structural defect detection on airfield pavements and concrete infrastructure. Schedule a demo to see how automated testing ensures deployment reliability.

Learn more

Smoke Test

Smoke Test

A smoke test is a quick, end-to-end verification that a software pipeline executes without crashing on representative data, producing expected outputs. TarmacVi...

32 min read
testing technology +4
DINOv3 Vision Transformer for Infrastructure Surface Analysis

DINOv3 Vision Transformer for Infrastructure Surface Analysis

DINOv3 (self-DIstillation with NO labels v3) is a self-supervised vision transformer (ViT-B/16) pretrained on 1.7 billion images, producing high-quality 768-dim...

36 min read
Technology Machine Learning +3
Defect Gating — Context-Aware Defect Prediction Filtering

Defect Gating — Context-Aware Defect Prediction Filtering

Defect gating is an inference strategy that filters predicted defect labels by surface type and structural domain to suppress false positives — e.g., only flagg...

26 min read
Technology Defect Detection +3