pith. machine review for the scientific record.

arxiv: 2604.11714 · v3 · submitted 2026-04-13 · 💻 cs.CV

BEM: Training-Free Background Embedding Memory for False-Positive Suppression in Real-Time Fixed-Background Camera

Pith reviewed 2026-05-10 16:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords background embedding · false positive suppression · training-free module · object detection · surveillance cameras · YOLO · RT-DETR · fixed background

The pith

Background Embedding Memory reduces false positives in pretrained object detectors for fixed-background cameras without training or speed loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes attaching a lightweight module called Background Embedding Memory to existing detectors at inference time. Pretrained models produce many false positives in dense surveillance scenes because benchmarks like COCO emphasize category diversity over instance density, so detectors learn under per-class sparsity that does not match crowded single- or few-class footage. BEM builds a memory of background embeddings from the fixed camera view and uses each frame's similarity to that memory to down-weight suspicious detections. This exploits the stable background as a free, label-free signal that correlates with object presence. The reported result is fewer false positives while recall and real-time speed stay intact across detector families.

Core claim

BEM estimates clean background embeddings, maintains a prototype memory of them, and re-scores detection logits with an inverse-similarity rank-weighted penalty, reducing false positives in quasi-static fixed-camera environments while preserving recall and real-time performance on YOLO and RT-DETR models.

What carries the argument

Background Embedding Memory (BEM), a training-free module that stores prototype background embeddings and applies an inverse-similarity penalty to detection scores based on frame similarity.
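
This page does not give the exact penalty formula, so the following is only a minimal sketch of how an inverse-similarity, rank-weighted penalty could act on detection logits. Everything in it is an assumption made for illustration: the function names, the use of the closest prototype, the `(1 - similarity) ** gamma` strength, the `rank / n` weighting, and the roles given to `alpha` (penalty scale) and `gamma` (temperature). It should not be read as the authors' formulation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rescore_logits(logits, frame_emb, prototypes, alpha=1.0, gamma=1.0):
    """Hypothetical re-scoring: subtract a penalty that grows as the frame
    looks less like the stored background and as a detection ranks lower."""
    # Similarity of the current frame to its closest background prototype,
    # clamped to [0, 1] so the strength below stays well defined.
    sim = max(cosine_similarity(frame_emb, p) for p in prototypes)
    sim = min(max(sim, 0.0), 1.0)

    # Assumed "inverse-similarity" strength: zero for a pure-background frame.
    strength = alpha * (1.0 - sim) ** gamma

    # Rank detections by confidence (rank 1 = highest logit).
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    n = len(logits)

    rescored = list(logits)
    for rank, idx in enumerate(order, start=1):
        # Assumed rank weight rank / n: low-ranked detections are penalized most.
        rescored[idx] = logits[idx] - strength * rank / n
    return rescored


if __name__ == "__main__":
    prototypes = [[1.0, 0.0]]
    print(rescore_logits([2.0, 0.5, 1.0], [1.0, 0.0], prototypes))  # frame equals background
    print(rescore_logits([2.0, 0.5, 1.0], [0.0, 1.0], prototypes))  # frame unlike background
```

In this sketch a frame identical to a prototype leaves the logits untouched, while a dissimilar frame pushes down the lowest-ranked detections first. Whether the published method penalizes in this direction, or per detection rather than per frame, cannot be determined from this page.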

If this is right

  • False positives decrease across YOLO and RT-DETR families on LLVIP and simulated surveillance data.
  • Real-time speed is preserved because BEM adds only negligible computation at inference.
  • Precision rises in high-density few-class scenes without any retraining of the base detector.
  • Background-frame cosine similarity can serve as a direct control signal for detection confidence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • BEM could be combined with existing post-processing filters to handle gradual background drift.
  • The same memory idea might apply to other fixed-sensor tasks such as traffic monitoring or industrial inspection.
  • Longer-term memory adaptation could extend the method to scenes with slow environmental changes.

Load-bearing premise

The background remains sufficiently static across frames to provide a reliable reference for identifying and suppressing detections that resemble it.

What would settle it

Running BEM on footage where the background slowly changes due to lighting shifts or permanent scene alterations, then checking whether false-positive reduction disappears or reverses.
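
One way to make this check concrete is to compare false-positive counts with and without BEM over consecutive time windows and see where the benefit stops. The sketch below assumes such per-window counts are already available; the function names and the toy numbers are hypothetical and are not results from the paper.

```python
def fp_reduction_by_window(baseline_fp, bem_fp):
    """Relative false-positive reduction per time window.

    Positive values mean BEM removed false positives in that window;
    zero or negative values mean the benefit vanished or reversed.
    Windows with no baseline false positives are reported as 0.0.
    """
    reductions = []
    for base, with_bem in zip(baseline_fp, bem_fp):
        reductions.append((base - with_bem) / base if base > 0 else 0.0)
    return reductions


def first_failing_window(reductions, tolerance=0.0):
    """Index of the first window whose reduction is at or below `tolerance`,
    or None if BEM keeps helping in every window."""
    for index, reduction in enumerate(reductions):
        if reduction <= tolerance:
            return index
    return None


if __name__ == "__main__":
    # Toy counts only: the background is imagined to drift over four windows.
    baseline = [10, 10, 10, 10]
    with_bem = [5, 8, 10, 12]
    reductions = fp_reduction_by_window(baseline, with_bem)
    print(reductions, first_failing_window(reductions))
```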

Figures

Figures reproduced from arXiv: 2604.11714 by Jangho Lee, Junwoo Park, Sunho Lim.

Figure 1: Persons-per-image distributions comparing COCO/VOC with LLVIP.
Figure 2: Relationship between background–frame cosine similarity …
Figure 3: Overview of the Background Embedding Memory (BEM) pipeline. During …
Figure 4: Increase in P-AUC as a function of background-frame cosine similarity. Frames are grouped into similarity bins, and the P-AUC improvement is averaged within each bin. Lower similarity corresponds to object-dense frames, where BEM yields larger precision gains.
Figure 5: Ablation study of the BEM penalty scale α and temperature γ across different detector backbones. Each curve reports mAP@0.50 as a function of α for several fixed values of γ.
Figure 6: Qualitative comparison of detection results. The top two rows show results …
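
The binning described in the Figure 4 caption (group frames by background-frame similarity, then average the P-AUC improvement inside each bin) can be sketched as follows. The equal-width bins on [0, 1], the bin count, and the function name are assumptions; the caption does not state how the bins were chosen.

```python
def mean_gain_per_bin(similarities, gains, n_bins=5):
    """Average per-frame gain within equal-width similarity bins on [0, 1].

    Returns one value per bin, or None for bins that received no frames.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for sim, gain in zip(similarities, gains):
        sim = min(max(sim, 0.0), 1.0)                # clamp into [0, 1]
        bin_index = min(int(sim * n_bins), n_bins - 1)  # sim == 1.0 goes in the last bin
        sums[bin_index] += gain
        counts[bin_index] += 1
    return [sums[i] / counts[i] if counts[i] else None for i in range(n_bins)]


if __name__ == "__main__":
    # Toy values only, chosen to mimic "lower similarity, larger gain".
    sims = [0.10, 0.15, 0.90]
    gains = [0.20, 0.40, 0.05]
    print(mean_gain_per_bin(sims, gains, n_bins=5))
```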
Original abstract

Pretrained detectors perform well on benchmarks but often suffer performance degradation in real-world deployments due to distribution gaps between training data and target environments. COCO-like benchmarks emphasize category diversity rather than instance density, causing detectors trained under per-class sparsity to struggle in dense, single- or few-class scenes such as surveillance and traffic monitoring. In fixed-camera environments, the quasi-static background provides a stable, label-free prior that can be exploited at inference to suppress spurious detections. To address the issue, we propose Background Embedding Memory (BEM), a lightweight, training-free, weight-frozen module that can be attached to pretrained detectors during inference. BEM estimates clean background embeddings, maintains a prototype memory, and re-scores detection logits with an inverse-similarity, rank-weighted penalty, effectively reducing false positives while maintaining recall. Empirically, background-frame cosine similarity correlates negatively with object count and positively with Precision-Confidence AUC (P-AUC), motivating its use as a training-free control signal. Across YOLO and RT-DETR families on LLVIP and simulated surveillance streams, BEM consistently reduces false positives while preserving real-time performance. Our code is available at https://github.com/Leo-Park1214/Background-Embedding-Memory.git
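
The correlation claim in the abstract can be checked on any set of frames once a similarity value and an object count are available per frame. The sketch below computes a plain Pearson coefficient; the choice of Pearson (rather than a rank correlation), the function name, and the toy numbers are assumptions, and the numbers are not taken from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy data: frames that look less like the background hold more objects.
    similarity = [0.9, 0.8, 0.6, 0.4]
    object_count = [1, 3, 6, 10]
    print(pearson(similarity, object_count))  # negative for this toy data
```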

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Background Embedding Memory (BEM), a lightweight training-free module attachable to pretrained detectors (YOLO and RT-DETR families) for fixed-background cameras. BEM estimates clean background embeddings from incoming frames, maintains a prototype memory, and applies an inverse-similarity rank-weighted penalty to detection logits to suppress false positives while preserving recall. It reports a negative correlation between background-frame cosine similarity and object count, plus a positive correlation with Precision-Confidence AUC (P-AUC), and claims consistent FP reduction on LLVIP and simulated surveillance streams without compromising real-time performance. Code is released at the provided GitHub link.

Significance. If the central empirical claims hold, BEM offers a practical inference-only technique for improving detector robustness in dense, quasi-static surveillance and traffic scenes by exploiting label-free background priors, avoiding the need for domain-specific retraining. The training-free design and public code release are strengths that could facilitate adoption and further testing in real-time CV applications.

major comments (2)
  1. [§3] The prototype memory is updated directly from incoming frames without an explicit mechanism (temporal filtering, statistical outlier rejection, or foreground masking) for separating background from foreground. In the high object-density surveillance scenarios highlighted in the introduction, persistent or slow-moving objects risk contaminating the background embeddings; this could either suppress true static detections or fail to penalize spurious ones, directly threatening the reported P-AUC gains.
  2. [Experiments] The abstract states consistent FP reduction and a background-similarity correlation with object count/P-AUC across YOLO/RT-DETR on LLVIP and simulated streams, yet no quantitative tables, per-scenario ablations on object density, or error bars are referenced; without these, it is impossible to verify whether the gains are robust or load-bearing for the central claim.
minor comments (2)
  1. [Abstract] The term 'simulated surveillance streams' is used without describing the simulation procedure, camera parameters, or how it matches real fixed-background conditions.
  2. Notation: The embedding memory update rule and the exact form of the rank-weighted inverse-similarity penalty would benefit from an explicit equation or pseudocode block in the method section to improve clarity.
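
To illustrate the kind of safeguard major comment 1 asks about and the explicit update rule minor comment 2 requests, here is a minimal sketch of a prototype update that only absorbs frames that already look like background. The exponential moving average, the cosine gate, the default `momentum` and `gate` values, and the function names are all assumptions for illustration; the paper's actual update rule is not shown on this page.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def gated_ema_update(prototype, frame_emb, momentum=0.9, gate=0.8):
    """Blend `frame_emb` into `prototype` only when the two are similar enough.

    Returns (new_prototype, updated_flag). Frames whose similarity falls
    below `gate` are treated as likely foreground-contaminated and skipped.
    """
    if cosine_similarity(prototype, frame_emb) < gate:
        return list(prototype), False
    blended = [momentum * p + (1.0 - momentum) * f
               for p, f in zip(prototype, frame_emb)]
    return blended, True


if __name__ == "__main__":
    prototype = [1.0, 0.0]
    print(gated_ema_update(prototype, [0.0, 1.0]))  # dissimilar frame: skipped
    print(gated_ema_update(prototype, [2.0, 0.0]))  # similar frame: blended in
```

A gate of this kind trades adaptability for cleanliness: a high threshold keeps persistent objects out of the memory but also slows adaptation to legitimate background change.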

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and outline revisions that will strengthen the empirical support and methodological clarity of the manuscript.

Point-by-point responses
  1. Referee: §3: The prototype memory is updated directly from incoming frames without an explicit mechanism (temporal filtering, statistical outlier rejection, or foreground masking) for separating background from foreground. In the high object-density surveillance scenarios highlighted in the introduction, persistent or slow-moving objects risk contaminating the background embeddings; this could either suppress true static detections or fail to penalize spurious ones, directly threatening the reported P-AUC gains.

    Authors: We agree that the absence of an explicit separation mechanism in the prototype update is a valid concern for high-density scenes. The current design relies on the background similarity metric to modulate penalty strength but performs direct averaging for memory maintenance. To address this, we will revise Section 3 to incorporate a lightweight temporal filtering step (exponential moving average with similarity-gated updates) and add a short ablation on embedding contamination under persistent objects. These changes will be accompanied by new results confirming that P-AUC gains hold under moderate density increases. revision: yes

  2. Referee: Experiments section: The abstract states consistent FP reduction and a background-similarity correlation with object count/P-AUC across YOLO/RT-DETR on LLVIP and simulated streams, yet no quantitative tables, per-scenario ablations on object density, or error bars are referenced; without these, it is impossible to verify whether the gains are robust or load-bearing for the central claim.

    Authors: We acknowledge that the experimental reporting requires greater quantitative detail to allow verification of robustness. While correlations and FP reductions are described, dedicated tables, density-specific breakdowns, and statistical variability measures are indeed missing. In the revised manuscript we will add comprehensive tables reporting FP reduction, correlation coefficients, and P-AUC for each detector-dataset pair, plus new per-scenario ablations across low/medium/high object density regimes and error bars from repeated runs on the simulated streams. revision: yes
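
The reporting promised in the second response (per-regime results with error bars from repeated runs) amounts to a mean and a spread per density regime. A minimal sketch, assuming each regime maps to a list of metric values from repeated runs; the regime names, the metric, and the numbers are hypothetical and not taken from the paper.

```python
import statistics


def summarize_runs(runs_by_regime):
    """Return {regime: (mean, sample_std)} for repeated-run metric values.

    A regime with a single run gets a standard deviation of 0.0, since the
    sample standard deviation is undefined for one observation.
    """
    summary = {}
    for regime, values in runs_by_regime.items():
        mean = statistics.mean(values)
        std = statistics.stdev(values) if len(values) > 1 else 0.0
        summary[regime] = (mean, std)
    return summary


if __name__ == "__main__":
    # Toy false-positive reductions (percentage points) from repeated runs.
    runs = {"low": [2.0, 4.0], "medium": [5.0, 6.0, 7.0], "high": [9.0]}
    for regime, (mean, std) in summarize_runs(runs).items():
        print(f"{regime}: {mean:.2f} +/- {std:.2f}")
```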

Circularity Check

0 steps flagged

No circularity: BEM is an external, training-free attachment that uses an independent background prior


The paper introduces BEM as a lightweight inference-only module attached to frozen pretrained detectors. It estimates background embeddings from quasi-static frames, maintains a prototype memory updated from incoming frames, and applies an inverse-similarity penalty to detection logits. This construction relies on external frame data as a label-free prior rather than any fitted parameters or outputs from the detector itself. No equations or claims reduce by construction to the detector's predictions, no self-citations are load-bearing for the core mechanism, and the reported negative correlation between background similarity and object count is presented as empirical motivation, not a derived result. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that fixed-camera backgrounds are sufficiently stable to serve as a label-free prior. No free parameters or invented entities are mentioned in the abstract, though the Figure 5 ablation varies a penalty scale α and a temperature γ.

axioms (1)
  • Domain assumption: the quasi-static background provides a stable, label-free prior exploitable at inference.
    Invoked to justify attaching BEM to suppress spurious detections in fixed-camera environments.

pith-pipeline@v0.9.0 · 5527 in / 1231 out tokens · 62917 ms · 2026-05-10T16:15:44.477615+00:00 · methodology

