
arxiv: 2605.14926 · v2 · pith:TXOZ52KUnew · submitted 2026-05-14 · 💻 cs.CV

SCRWKV: Ultra-Compact Structure-Calibrated Vision-RWKV for Topological Crack Segmentation

Pith reviewed 2026-05-22 09:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords crack segmentation · vision RWKV · topological modeling · structure calibration · lightweight network · image segmentation · efficient computer vision · structural inspection

The pith

SCRWKV reports higher crack segmentation accuracy than prior state-of-the-art methods while using only 1.22 million parameters and linear computational complexity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SCRWKV as a compact Vision-RWKV network tailored for pixel-level segmentation of structural cracks in challenging images. It builds a Structure-Field Encoder backbone that combines an Adaptive Multi-scale Cascaded Modulator for texture enhancement with a Structure-Calibrated Insight Unit at its core. This unit applies Geometry-guided Bidirectional Structure Transformation to model topological crack relations and Dynamic Self-Calibrating Decay to limit noise spread within the Dy-WKV mechanism. A lightweight Cross-Scale Harmonic Fusion decoder then aggregates features for final output. The result matters because it shows that high segmentation quality on complex benchmarks is possible without heavy computational demands, supporting practical deployment where resources are limited.

Core claim

SCRWKV integrates the Geometry-guided Bidirectional Structure Transformation to capture topological correlations and folds the Dynamic Self-Calibrating Decay into Dy-WKV to suppress noise, all within a 1.22M-parameter network built from a Structure-Field Encoder backbone and a Cross-Scale Harmonic Fusion decoder. It reports an F1 score of 0.8428 and an mIoU of 0.8512 on the TUT dataset and outperforms prior state-of-the-art methods on multiple benchmarks while keeping linear complexity.
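
For readers who want to check what the two headline numbers measure, the sketch below computes F1 and mIoU from a pair of binary masks. It is a minimal illustration, not the paper's evaluation code: the function names are hypothetical, and it assumes mIoU is the mean of the crack-class and background-class IoU, a common convention that this review does not confirm.

```python
def confusion_counts(pred, truth):
    """Count (TP, FP, FN, TN) over two equally shaped binary masks (nested lists of 0/1)."""
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of raw counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def mean_iou(tp, fp, fn, tn):
    """Average of crack IoU and background IoU (assumed two-class definition)."""
    def iou(intersection, union):
        return intersection / union if union else 1.0

    crack = iou(tp, tp + fp + fn)
    background = iou(tn, tn + fp + fn)
    return (crack + background) / 2


# Toy masks, invented for illustration only.
pred = [[1, 1, 0], [0, 1, 0]]
truth = [[1, 0, 0], [0, 1, 1]]
tp, fp, fn, tn = confusion_counts(pred, truth)
print(f1_score(tp, fp, fn), mean_iou(tp, fp, fn, tn))
```

Because background pixels dominate crack images, the background IoU is usually high, which is one reason mIoU under this definition can sit above F1 for the same prediction.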

What carries the argument

The Structure-Calibrated Insight Unit (SCIU), which uses Geometry-guided Bidirectional Structure Transformation (GBST) to model crack topologies and Dynamic Self-Calibrating Decay (DSCD) to control noise in the RWKV computation.
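
The review does not reproduce the equations behind GBST, DSCD, or Dy-WKV, so the following is only a generic sketch of the underlying idea: a linear-time recurrence whose decay is set per step, run in two opposite directions. Every name here (`decayed_scan`, `bidirectional_scan`, the `decays` argument) is an assumption made for illustration and should not be read as the authors' formulation.

```python
import math


def decayed_scan(keys, values, decays):
    """One-directional weighted average with a per-step decay.

    keys[t]   : score that sets how strongly position t contributes (weight = exp(key)).
    values[t] : feature carried by position t.
    decays[t] : factor in [0, 1] that scales the accumulated history at step t.
    Each position is visited once, so the cost is linear in sequence length.
    """
    numerator = 0.0
    denominator = 0.0
    outputs = []
    for k, v, d in zip(keys, values, decays):
        weight = math.exp(k)
        numerator = d * numerator + weight * v
        denominator = d * denominator + weight
        outputs.append(numerator / denominator)
    return outputs


def bidirectional_scan(keys, values, decays):
    """Average a forward pass and a backward pass over the same sequence."""
    forward = decayed_scan(keys, values, decays)
    backward = decayed_scan(keys[::-1], values[::-1], decays[::-1])[::-1]
    return [(f + b) / 2 for f, b in zip(forward, backward)]


# Hypothetical 1-D example: the middle value is an outlier ("noise").
values = [1.0, 9.0, 1.0]
keys = [0.0, 0.0, 0.0]
print(bidirectional_scan(keys, values, [1.0, 1.0, 1.0]))  # history fully kept
print(bidirectional_scan(keys, values, [1.0, 1.0, 0.0]))  # history cut before the last step
```

A decay near 1 lets distant positions keep contributing, while a decay near 0 cuts history off, which is one way a data-dependent decay could stop a noisy region from leaking into its neighbours.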

If this is right

  • The linear complexity allows real-time crack analysis on edge hardware with constrained memory and power.
  • The structure calibration approach improves robustness to severe interference without increasing model size.
  • The Cross-Scale Harmonic Fusion decoder enables precise multi-scale feature combination at negligible extra cost.
  • Overall performance gains on complex textures indicate the design can handle varied real-world structural inspection scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same calibration units could be tested in related tasks such as road surface defect detection or medical vessel segmentation to check transferability.
  • Replacing the RWKV core with other linear-attention variants might yield even smaller models while preserving the topological focus.
  • The noise-suppression mechanism suggests a route to improve RWKV stability in other noisy vision domains like low-light imaging.
  • Deployment studies on mobile devices would quantify the practical speed and accuracy trade-offs beyond benchmark scores.

Load-bearing premise

The Geometry-guided Bidirectional Structure Transformation and Dynamic Self-Calibrating Decay inside the SCIU unit genuinely capture topological crack correlations rather than fitting the textures and noise patterns of the chosen benchmarks.

What would settle it

Testing SCRWKV on a fresh crack image dataset collected under different lighting, material, or camera conditions and measuring whether its F1 and mIoU advantages over existing methods remain intact.

Figures

Figures reproduced from arXiv: 2605.14926 by Chen Jia, Fan Shi, Hanxu Zhang, Hui Liu, Shengyong Chen, Xu Cheng.

Figure 1
Figure 1: Performance of SCRWKV on multi-scenario TUT (Liu et al., 2024a) dataset. (a) Comparison with SOTA methods. (b) Impact of different enhancement modules on performance. (c) Visual results under complex interference.
1 Introduction. Under long-term loading and environmental disturbances, asphalt pavements, concrete components, and metal structures are highly susceptible to cracking (Chen et al., 2022; 2024; H…
Figure 2
Figure 2: Overview of our proposed method. (a) Illustrates the overall architecture of SCRWKV and the processing flow for crack. (b) It displays the structure of the SCIU block, which integrates Spatial Mix and Channel Mix. It uses GBST for topological alignment and Dy-WKV to filter noise and model long-range dependencies. (c) Architecture of AMCM. It utilizes multi-scale large-kernel convolutions to efficiently exp…
Figure 3
Figure 3: Schematic of the GBST. This mechanism explicitly models the curvilinear manifold of cracks by decoupling the feature space into two counter-directional streams.
Figure 4
Figure 4: Comparison of different segmentation heads. The left and right subplots illustrate the scores for F1/FLOPs and mIoU/Params.
where G↑ denotes bilinear upsampling, τ is a learnable temperature, and × represents the matrix multiplication for aggregating global context. The final spatial output is obtained via residual connection: $X_{out} = X_{in} + W_{out}(Z_{ms} \odot \sigma(A_{grid}))$ (8) where $W_{out}$ is output projection and σ is…
Figure 5
Figure 5: Visual comparison of typical crack segmentation results on the TUT dataset against 10 SOTA methods. Red boxes highlight critical details, and green boxes mark misidentified regions.
Methods | Crack500: ODS OIS P R F1 mIoU | DeepCrack: ODS OIS P R F1 mIoU
RIND | 0.6469 0.6483 0.6998 0.7245 0.7119 0.7381 | 0.8087 0.8267 0.7896 0.8920 0.8377 0.8391
SFIAN | 0.6977 0.7348 0.6983 0.7742 0.7343 0.7604 | 0.8616 0.8928 0.8549 0.…
Figure 6
Figure 6: Visual comparison with 10 SOTA methods across four public datasets. Red boxes highlight critical details, and green boxes mark misidentified regions.
Layer Num | ODS | OIS | P | R | F1 | mIoU | Params↓ | FLOPs↓ | Model Size↓
2 | 0.8117 | 0.8176 | 0.8302 | 0.8327 | 0.8314 | 0.8424 | 0.87M | 17.03G | 23MB
4 | 0.8245 | 0.8313 | 0.8213 | 0.8655 | 0.8428 | 0.8512 | 1.22M | 22.78G | 28MB
8 | 0.8216 | 0.8289 | 0.8016 | 0.8791 | 0.8386 | 0.8489 | 1.91M | 34.78G | 39MB
16 | 0.8219 | 0.8279 | …
Figure 7
Figure 7: Comparison of different patch sizes. The left and right subplots illustrate the scores for F1/FLOPs and mIoU/Params.
Figure 8
Figure 8: Comparison of different number of layers. The left and right subplots illustrate the scores for F1/FLOPs and mIoU/Params.
…demonstrate the robustness and superiority of the proposed method across diverse settings. Impact of SCIU Stacking Depth. To achieve an optimal trade-off between segmentation accuracy and computational efficiency, we investigated the impact of stacking varying numbers of SCIUs within the enc…
Figure 9
Figure 9: Comparison of SCRWKV internal components in terms of Params and mIoU.
…possesses certain advantages in terms of parameters, computational cost, and model size under this configuration, its F1 score and mIoU were limited to 0.8064 and 0.8201, respectively, failing to achieve an ideal balance between performance and efficiency. In contrast, our method achieved superior comprehensive performance under the id…
Figure 10
Figure 10: Practical deployment illustration. An intelligent UAV is deployed above outdoor road surfaces to perform low-altitude flight. The UAV is remotely controlled using a handheld controller in conjunction with a server terminal. During operation, the UAV continuously transmits real-time video data to the server, where the data are processed to generate the final outputs.
Component Effectiveness Analysis. To ri…
Figure 11
Figure 11: Visualisation comparison on video data keyframes. The interval between keyframes is 100 frames in order to ensure continuity of observation. Red boxes highlight critical details, and green boxes mark misidentified regions.
Original abstract

Achieving pixel-level accurate segmentation of structural cracks across diverse scenarios remains a formidable challenge. Existing methods face significant bottlenecks in balancing crack topology modeling with computational efficiency, often failing to reconcile high segmentation quality with low resource demands. To address these limitations, we propose the Ultra-Compact Structure-Calibrated Vision RWKV (SCRWKV), a network that achieves high-precision modeling via a novel Structure-Field Encoder (SFE) backbone while maintaining linear complexity. The SFE integrates the Adaptive Multi-scale Cascaded Modulator (AMCM) to enhance texture representation and utilizes the Structure-Calibrated Insight Unit (SCIU) as its core engine. Specifically, the SCIU employs the Geometry-guided Bidirectional Structure Transformation (GBST) to capture topological correlations and integrates the Dynamic Self-Calibrating Decay (DSCD) into Dy-WKV to suppress noise propagation. Furthermore, we introduce a lightweight Cross-Scale Harmonic Fusion (CSHF) decoder to achieve precise feature aggregation. Systematic evaluations on multiple benchmarks characterized by complex textures and severe interference demonstrate that SCRWKV, with only 1.22M parameters, significantly outperforms SOTA methods. Achieving an F1 score of 0.8428 and mIoU of 0.8512 on the TUT dataset, the model confirms its robust potential for efficient real-world deployment. The code is available at https://github.com/zhxhzy/SCRWKV.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SCRWKV, an ultra-compact Vision-RWKV model for pixel-level topological crack segmentation. Its Structure-Field Encoder (SFE) backbone pairs an Adaptive Multi-scale Cascaded Modulator (AMCM) for texture enhancement with a Structure-Calibrated Insight Unit (SCIU) as its core. The SCIU applies Geometry-guided Bidirectional Structure Transformation (GBST) to capture topological correlations and Dynamic Self-Calibrating Decay (DSCD) in Dy-WKV to suppress noise. A lightweight Cross-Scale Harmonic Fusion (CSHF) decoder aggregates features. The model has 1.22M parameters and linear complexity, and the authors report that it outperforms SOTA methods on multiple benchmarks with complex textures, e.g., F1=0.8428 and mIoU=0.8512 on TUT. Code is released.

Significance. If the performance claims hold after proper controls, the work demonstrates a highly efficient RWKV-based architecture for crack segmentation that could enable real-time deployment on edge devices for infrastructure monitoring. The open code is a positive factor for reproducibility.

major comments (2)
  1. [§4 and §4.3] §4 (Experiments) and §4.3 (Ablation Studies): No ablation tables isolate the contribution of GBST or DSCD within SCIU. The central claim that these components enable topological crack correlation modeling (rather than dataset-specific texture fitting) cannot be evaluated without removing GBST/DSCD and measuring impact on F1/mIoU and any topology-aware metrics such as crack connectivity or persistence.
  2. [§4.1] §4.1 (Datasets and Metrics): Evaluations are reported on held-out test sets of the chosen benchmarks, but no cross-dataset generalization tests or statistical significance (error bars, multiple runs) are provided. This leaves open whether the 1.22M-parameter gains over SOTA are robust or tied to the specific training distributions.
minor comments (2)
  1. [§3.2] Notation for Dy-WKV and the decay parameter in DSCD is introduced without an explicit equation reference in the method section; adding a numbered equation would improve clarity.
  2. [Figures 5 and 6] The qualitative results in Figures 5 and 6 would benefit from zoomed insets on crack connectivity to visually support the topological claims.
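
Major comment 1 asks for topology-aware metrics such as crack connectivity. One simple proxy, sketched here under the assumption that masks are binary nested lists and that 8-connectivity is the right neighbourhood, is to count connected foreground components: a prediction that breaks a single crack into pieces shows more components than the ground truth. The function name `count_components` is hypothetical, and nothing in this review indicates that the paper reports such a metric.

```python
from collections import deque


def count_components(mask):
    """Count 8-connected foreground components in a binary mask (nested lists of 0/1)."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # New component found: flood-fill it with a breadth-first search.
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


# Toy example: the prediction drops one pixel and splits the crack in two.
truth = [[1, 1, 1, 1, 1]]
pred = [[1, 1, 0, 1, 1]]
print(count_components(truth), count_components(pred))
```

Comparing the component count of a prediction with that of its ground truth gives a rough fragmentation signal that pixel-overlap scores such as F1 can miss, since a one-pixel gap barely changes the overlap.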

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below, providing honest clarifications and committing to revisions that directly strengthen the validation of our claims without overstating current results.

  1. Referee: [§4 and §4.3] §4 (Experiments) and §4.3 (Ablation Studies): No ablation tables isolate the contribution of GBST or DSCD within SCIU. The central claim that these components enable topological crack correlation modeling (rather than dataset-specific texture fitting) cannot be evaluated without removing GBST/DSCD and measuring impact on F1/mIoU and any topology-aware metrics such as crack connectivity or persistence.

    Authors: We agree that isolating GBST and DSCD is required to rigorously support the claim of topological correlation modeling. Section 4.3 currently ablates the full SCIU module but does not separately remove GBST or DSCD. In the revised manuscript we will add dedicated ablation tables that disable GBST and DSCD individually, reporting the resulting changes in F1, mIoU, crack connectivity, and persistence on the TUT and other benchmarks. These new results will be placed in an expanded §4.3 to allow direct assessment of each component's contribution beyond texture fitting. revision: yes

  2. Referee: [§4.1] §4.1 (Datasets and Metrics): Evaluations are reported on held-out test sets of the chosen benchmarks, but no cross-dataset generalization tests or statistical significance (error bars, multiple runs) are provided. This leaves open whether the 1.22M-parameter gains over SOTA are robust or tied to the specific training distributions.

    Authors: We recognize that cross-dataset tests and statistical significance are essential for establishing robustness. Our reported numbers use standard held-out splits of the chosen benchmarks. For the revision we will add cross-dataset generalization experiments (training on TUT and evaluating on CrackForest and DeepCrack, and vice versa) together with mean and standard deviation of F1 and mIoU computed over five independent runs using different random seeds. These results and error bars will be incorporated into the updated §4.1 and experimental tables. revision: yes
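
The reporting promised over five independent runs amounts to a mean and a sample standard deviation per metric. A minimal sketch follows, using Python's standard `statistics` module; the helper names are hypothetical and the scores are placeholders, not results from the paper.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) for one metric across seeds."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(scores), stdev(scores)


def format_summary(scores, digits=4):
    """Render a metric as 'mean ± std' for a results table."""
    m, s = summarize_runs(scores)
    return f"{m:.{digits}f} ± {s:.{digits}f}"


# Placeholder F1 values for five hypothetical seeds (NOT taken from the paper).
placeholder_f1 = [0.80, 0.81, 0.82, 0.83, 0.84]
print(format_summary(placeholder_f1))
```

A gap between two methods that is smaller than the spread across seeds would be weak evidence of a real improvement, which is the concern behind the referee's request.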

Circularity Check

0 steps flagged

No significant circularity; performance claims rest on held-out empirical evaluation

full rationale

The paper introduces architectural components (SFE, AMCM, SCIU with GBST and DSCD, CSHF) and reports measured F1/mIoU on the TUT dataset and other benchmarks. No equations, derivations, or self-citations are presented that reduce the reported metrics to quantities defined by fitted parameters inside the same model. The results are framed as experimental outcomes on external test sets rather than predictions forced by construction from the training data or prior self-referential definitions. This qualifies as a normal non-finding under the guidelines for papers whose central claims are benchmark-driven rather than mathematically self-referential.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 3 invented entities

The central performance claim rests on the unverified premise that the newly introduced modules improve topological modeling without introducing hidden overfitting. No free parameters are explicitly fitted in the abstract, but the model as a whole contains about 1.22 million learned weights whose individual contributions are not isolated.

invented entities (3)
  • Structure-Calibrated Insight Unit (SCIU) · no independent evidence
    purpose: Core engine that captures topological correlations via GBST and suppresses noise via DSCD inside Dy-WKV
    New architectural block introduced in the paper; no independent evidence outside the reported segmentation scores is supplied.
  • Adaptive Multi-scale Cascaded Modulator (AMCM) · no independent evidence
    purpose: Enhance texture representation inside the SFE backbone
    New module name and claimed function; no external validation provided.
  • Cross-Scale Harmonic Fusion (CSHF) decoder · no independent evidence
    purpose: Lightweight precise feature aggregation across scales
    New decoder component; evidence limited to overall model performance.

pith-pipeline@v0.9.0 · 5797 in / 1385 out tokens · 29918 ms · 2026-05-22T09:56:53.337521+00:00 · methodology


Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 1 internal anchor

  1. [1]

    Choe, W., Ji, Y., and Lin, F. X. Rwkv-lite: Deeply compressed rwkv for resource-constrained devices. arXiv preprint arXiv:2412.10856,

  2. [2]

    Duan, Y., Wang, W., Chen, Z., Zhu, X., Lu, L., Lu, T., Qiao, Y., Li, H., Dai, J., and Wang, W. Vision-rwkv: Efficient and scalable visual perception with rwkv-like architectures. arXiv preprint arXiv:2403.02308,

  3. [3]

    Fei, Z., Fan, M., Yu, C., Li, D., and Huang, J. Diffusion-rwkv: Scaling rwkv-like architectures for diffusion models. arXiv preprint arXiv:2404.04478,

  4. [4]

    Hou, H. and Yu, F. R. Rwkv-ts: Beyond traditional recurrent neural network for time series tasks. arXiv preprint arXiv:2401.09093,

  5. [5]

    Hou, H., Huang, Z., Tan, K., Lu, R., and Yu, F. R. Rwkv-x: A linear complexity hybrid language model. arXiv preprint arXiv:2504.21463,

  6. [6]

    Li, Q., Jia, X., Zhou, J., Shen, L., and Duan, J. Rediscovering bce loss for uniform classification. arXiv preprint arXiv:2403.07289,

  7. [7]

    Liu, H., Jia, C., Shi, F., Cheng, X., Wang, M., and Chen, S. Staircase cascaded fusion of lightweight local pattern recognition and long-range dependencies for structural crack segmentation. arXiv preprint arXiv:2408.12815, 2024a.
    Liu, H., Jia, C., Shi, F., Cheng, X., and Chen, S. Scsegamba: lightweight structure-aware vision mamba for crack segmentation...

  8. [8]

    Liu, M., Dan, J., Lu, Z., Yu, Y., Li, Y., and Li, X. Cm-unet: Hybrid cnn-mamba unet for remote sensing image semantic segmentation. arXiv preprint arXiv:2405.10530, 2024b.
    Liu, Y., Yao, J., Lu, X., Xie, R., and Li, L. Deepcrack: A deep hierarchical feature learning architecture for crack segmentation. Neurocomputing, 338:139–153,

  9. [9]

    Liu, Y., Tian, Y., Zhao, Y., Yu, H., Xie, L., Wang, Y., Ye, Q., and Liu, Y. Vmamba: Visual state space model. arXiv preprint arXiv:2401.10166, 2024c.
    Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Biderman, S., Cao, H., Cheng, X., Chung, M., Grella, M., et al. Rwkv: Reinventing rnns for the transformer era. arXiv preprint arXiv:2305.13048,

  10. [10]

    Yang, C., Chen, Z., Espinosa, M., Ericsson, L., Wang, Z., Liu, J., and Crowley, E. J. Plainmamba: Improving non-hierarchical mamba in visual recognition. arXiv preprint arXiv:2403.17695,

  11. [11]

    …1109/JBHI.2025.3588555.
    Zhang, T., Wang, D., and Lu, Y. Ecsnet: An accelerated real-time image segmentation cnn architecture for pavement crack detection. IEEE Transactions on Intelligent Transportation Systems,

  12. [12]

    are defined as follows: $L_{Dice} = 1 - \frac{2\sum_{j=1}^{M} p_j \hat{p}_j + \epsilon}{\sum_{j=1}^{M} p_j + \sum_{j=1}^{M} \hat{p}_j + \epsilon}$ (27), $L_{BCE} = -\frac{1}{N}\left[p_j \log(\hat{p}_j) + (1 - p_j)\log(1 - \hat{p}_j)\right]$ (28). To determine the optimal configuration for the loss function hyperparameters α (BCE weight) and β (Dice weight), we conducted a systematic sensitivity analysis to explore their impact on crack detection accuracy. As presented in Table 6, we evaluated the model performance across a broad range of Dice:BCE ratios, spanning from 1:5 to 5:1...

  13. [13]

    and SFIAN (Cheng et al., 2023), such as boundary blurring and region dilation, while simultaneously avoiding the topological fragmentation often observed in Transformer or Mamba variants. This precision is explicitly attributed to the synergistic calibration of the AMCM and GBST, which capture multi-scale details while preserving the geometric integrity o...

  14. [14]

    Best results are highlighted in green and the second best are blue.
    Methods | ODS | OIS | P | R | F1 | mIoU | Params↓ | FLOPs↓ | Model Size↓
    MambaIR | 0.7869 | 0.7956 | 0.7714 | 0.8445 | 0.8071 | 0.8240 | 3.57M | 19.71G | 29MB
    CSMamba | 0.7140 | 0.7201 | 0.6934 | 0.8171 | 0.7503 | 0.7773 | 12.68M | 15.44G | 84MB
    PlainMamba | 0.7787 | 0.7896 | 0.7617 | 0.8531 | 0.8064 | 0.8201 | 2.20M | 14.09G | 18MB
    SCSegamb...

  15. [15]

    and DTrCNet (Xiang et al., 2023), despite possessing rapid inference speeds, often falter in maintaining topological continuity, resulting in cracks appearing as disconnected fragments within the generated maps. Consequently, as evidenced by the robust performance during practical UAV deployment, our SCRWKV framework sustains exceptional segmentati...
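
Excerpt [12] above defines the Dice and BCE terms and names α as the BCE weight and β as the Dice weight. The sketch below implements those two terms for flat lists of per-pixel values. It assumes that p is the ground-truth label and p̂ the predicted probability, that the BCE term is averaged over all pixels, and that the total loss is the weighted sum α·L_BCE + β·L_Dice. The excerpt is truncated before it confirms the last two points, and the function names are hypothetical.

```python
import math


def dice_loss(p, p_hat, eps=1e-6):
    """Dice term: 1 - (2 * sum(p * p_hat) + eps) / (sum(p) + sum(p_hat) + eps)."""
    intersection = sum(a * b for a, b in zip(p, p_hat))
    return 1 - (2 * intersection + eps) / (sum(p) + sum(p_hat) + eps)


def bce_loss(p, p_hat, tiny=1e-12):
    """Binary cross-entropy averaged over pixels (averaging is an assumption)."""
    total = 0.0
    for label, prob in zip(p, p_hat):
        prob = min(max(prob, tiny), 1 - tiny)  # clamp so log() stays finite
        total += label * math.log(prob) + (1 - label) * math.log(1 - prob)
    return -total / len(p)


def combined_loss(p, p_hat, alpha=1.0, beta=1.0):
    """Assumed weighted sum: alpha weights BCE, beta weights Dice."""
    return alpha * bce_loss(p, p_hat) + beta * dice_loss(p, p_hat)


# Toy example: two pixels, one crack and one background, with an uncertain prediction.
labels = [1, 0]
probs = [0.5, 0.5]
print(combined_loss(labels, probs, alpha=1.0, beta=1.0))
```

Sweeping `alpha` and `beta` over ratios such as 1:5 to 5:1 would mirror the sensitivity analysis the excerpt describes, though the weights the authors finally chose are not visible in the truncated text.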