arxiv: 2604.11714 · v3 · submitted 2026-04-13 · 💻 cs.CV

BEM: Training-Free Background Embedding Memory for False-Positive Suppression in Real-Time Fixed-Background Camera

Junwoo Park, Jangho Lee, Sunho Lim

Pith reviewed 2026-05-10 16:15 UTC · model grok-4.3

classification 💻 cs.CV

keywords: background embedding · false positive suppression · training-free module · object detection · surveillance cameras · YOLO · RT-DETR · fixed background


The pith

Background Embedding Memory reduces false positives in pretrained object detectors for fixed-background cameras without training or speed loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes attaching a lightweight module called Background Embedding Memory to existing detectors at inference time. Pretrained models produce false positives in dense surveillance scenes because benchmarks like COCO emphasize category diversity over instance density, so detectors are trained under per-class sparsity that dense, few-class scenes violate. BEM builds a memory of background embeddings from the fixed camera view and uses similarity to that memory to down-weight suspicious detections. This exploits the stable background as a label-free signal whose similarity to the current frame correlates negatively with object count. The result is fewer false positives while recall and real-time speed stay intact across detector families.

Core claim

BEM estimates clean background embeddings, maintains a prototype memory of them, and re-scores detection logits with an inverse-similarity rank-weighted penalty, reducing false positives in quasi-static fixed-camera environments while preserving recall and real-time performance on YOLO and RT-DETR models.

What carries the argument

Background Embedding Memory (BEM), a training-free module that stores prototype background embeddings and applies an inverse-similarity penalty to detection scores based on frame similarity.
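The re-scoring step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the source does not give the exact penalty formula, so the function name `rescore_logits`, the penalty form `alpha * (1 - similarity)`, and the linear rank weight are all assumptions made here. The temperature γ that appears in the paper's ablation figure is not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rescore_logits(logits, frame_emb, prototypes, alpha=1.0):
    """Hypothetical inverse-similarity, rank-weighted penalty.

    Assumptions (not from the paper):
      * frame-level similarity is the best cosine match against the memory;
      * penalty strength is alpha * (1 - similarity), so frames that look
        less like the stored background are penalized more;
      * the rank weight grows linearly from 0 (most confident detection)
        to 1 (least confident detection).
    """
    if not logits or not prototypes:
        return list(logits)
    similarity = max(cosine(frame_emb, p) for p in prototypes)
    strength = alpha * (1.0 - max(0.0, similarity))
    # Indices of detections sorted from most to least confident.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    denom = max(len(logits) - 1, 1)
    rescored = list(logits)
    for rank, idx in enumerate(order):
        rescored[idx] = logits[idx] - strength * (rank / denom)
    return rescored


# Toy 2-D embeddings: the frame is orthogonal to the only prototype,
# so the full penalty (alpha = 2.0) lands on the least confident box.
print(rescore_logits([3.0, 1.0, 2.0], [0.0, 1.0], [[1.0, 0.0]], alpha=2.0))
# -> [3.0, -1.0, 1.0]
```

Under these assumptions a frame identical to a stored prototype leaves all logits untouched, which is one way the base detector's recall could be preserved on near-empty frames.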

If this is right

False positives decrease across YOLO and RT-DETR families on LLVIP and simulated surveillance data.
Real-time speed is preserved because BEM adds only negligible computation at inference.
Precision rises in high-density few-class scenes without any retraining of the base detector.
Background-frame cosine similarity can serve as a direct control signal for detection confidence.
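Whether similarity is a usable control signal on a given stream can be checked with a plain Pearson correlation between per-frame similarity and per-frame object count. The sketch below assumes both series have already been measured; the helper name `pearson` and the toy numbers are illustrative and are not taken from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented toy data: similarity falls as the scene fills up.
similarities = [0.9, 0.8, 0.7, 0.6]
object_counts = [1, 2, 3, 4]
r = pearson(similarities, object_counts)
print(f"correlation between similarity and object count: {r:.3f}")
```

A clearly negative coefficient on real footage would be consistent with the paper's motivation; a coefficient near zero would suggest the signal is too weak to drive a penalty on that stream.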

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

BEM could be combined with existing post-processing filters to handle gradual background drift.
The same memory idea might apply to other fixed-sensor tasks such as traffic monitoring or industrial inspection.
Longer-term memory adaptation could extend the method to scenes with slow environmental changes.

Load-bearing premise

The background remains sufficiently static across frames to provide a reliable reference for identifying and suppressing detections that resemble it.

What would settle it

Running BEM on footage where the background slowly changes due to lighting shifts or permanent scene alterations, then checking whether false-positive reduction disappears or reverses.
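A toy model shows why this test is decisive. Everything in it is an assumption made for illustration: embeddings are 2-D unit vectors, lighting drift is modeled as a slow rotation of the background embedding, the memory is frozen at its initial prototype, and the penalty strength is taken to be `alpha * (1 - similarity)`. None of these choices comes from the paper.

```python
import math


def drift_penalty_strengths(num_steps=5, max_angle_deg=60.0, alpha=1.0):
    """Penalty strength per step while an EMPTY scene drifts away from a frozen prototype.

    The scene contains no objects at any step; only the background embedding
    rotates. Any non-zero strength is therefore caused by drift alone.
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    prototype = (1.0, 0.0)  # frozen memory entry (hypothetical)
    strengths = []
    for step in range(num_steps):
        angle = math.radians(max_angle_deg * step / (num_steps - 1))
        frame = (math.cos(angle), math.sin(angle))  # drifted background
        similarity = prototype[0] * frame[0] + prototype[1] * frame[1]
        strengths.append(alpha * (1.0 - similarity))
    return strengths


for step, strength in enumerate(drift_penalty_strengths()):
    print(f"step {step}: penalty strength {strength:.3f}")
```

In this toy setting the penalty grows steadily although no object ever enters the scene, which is the kind of spurious suppression the proposed experiment would need to rule out on real footage.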

Figures

Figures reproduced from arXiv: 2604.11714 by Jangho Lee, Junwoo Park, Sunho Lim.

**Figure 2.** Relationship between background–frame cosine similarity…

**Figure 3.** Overview of the Background Embedding Memory (BEM) pipeline. During…

**Figure 4.** Increase in P-AUC as a function of background-frame cosine similarity. Frames are grouped into similarity bins, and the P-AUC improvement is averaged within each bin. Lower similarity corresponds to object-dense frames, where BEM yields larger precision gains. Assumption Validation: To support the two assumptions stated in Sec. 2.1, we analyze how background-frame similarity relates to (i) the number of…

**Figure 5.** Ablation study of the BEM penalty scale α and temperature γ across different detector backbones. Each curve reports mAP@0.50 as a function of α for several fixed values of γ

**Figure 6.** Qualitative comparison of detection results. The top two rows show results…
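The binning described in the Figure 4 caption (group frames by similarity, then average the P-AUC improvement per bin) can be approximated as below. The function name `mean_gain_per_bin`, the equal-width bins, and the toy values are assumptions; the paper's actual bin edges are not given in this page.

```python
def mean_gain_per_bin(similarities, gains, num_bins=5, lo=0.0, hi=1.0):
    """Average a per-frame gain within equal-width similarity bins.

    Returns one entry per bin: the mean gain, or None for an empty bin.
    Frames whose similarity falls outside [lo, hi] are skipped.
    """
    if num_bins < 1 or hi <= lo:
        raise ValueError("need num_bins >= 1 and hi > lo")
    sums = [0.0] * num_bins
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    for sim, gain in zip(similarities, gains):
        if not lo <= sim <= hi:
            continue
        # Clamp so that sim == hi falls into the last bin.
        idx = min(int((sim - lo) / width), num_bins - 1)
        sums[idx] += gain
        counts[idx] += 1
    return [sums[i] / counts[i] if counts[i] else None for i in range(num_bins)]


# Invented toy data: low-similarity (dense) frames gain more than high-similarity ones.
sims = [0.10, 0.15, 0.90]
gains = [0.4, 0.2, 0.0]
print(mean_gain_per_bin(sims, gains, num_bins=4))
```

Returning `None` for empty bins keeps sparsely populated similarity ranges from being mistaken for zero gain.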

Abstract

Pretrained detectors perform well on benchmarks but often suffer performance degradation in real-world deployments due to distribution gaps between training data and target environments. COCO-like benchmarks emphasize category diversity rather than instance density, causing detectors trained under per-class sparsity to struggle in dense, single- or few-class scenes such as surveillance and traffic monitoring. In fixed-camera environments, the quasi-static background provides a stable, label-free prior that can be exploited at inference to suppress spurious detections. To address the issue, we propose Background Embedding Memory (BEM), a lightweight, training-free, weight-frozen module that can be attached to pretrained detectors during inference. BEM estimates clean background embeddings, maintains a prototype memory, and re-scores detection logits with an inverse-similarity, rank-weighted penalty, effectively reducing false positives while maintaining recall. Empirically, background-frame cosine similarity correlates negatively with object count and positively with Precision-Confidence AUC (P-AUC), motivating its use as a training-free control signal. Across YOLO and RT-DETR families on LLVIP and simulated surveillance streams, BEM consistently reduces false positives while preserving real-time performance. Our code is available at https://github.com/Leo-Park1214/Background-Embedding-Memory.git
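The abstract names Precision-Confidence AUC (P-AUC) without defining it on this page. One plausible reading, offered here only as an assumption, is the area under the curve of precision versus confidence threshold. The sketch below implements that reading; the function name, the threshold grid, and the convention that precision is 1.0 when no detection survives the threshold are all choices made for illustration.

```python
def precision_confidence_auc(confidences, is_true_positive, num_thresholds=101):
    """Area under precision-vs-confidence-threshold, by the trapezoidal rule.

    Assumed definition (not confirmed by the source):
      precision(t) = TP / (TP + FP) over detections with confidence >= t,
      with precision(t) = 1.0 when no detection has confidence >= t.
    """
    if num_thresholds < 2:
        raise ValueError("need at least two thresholds")
    step = 1.0 / (num_thresholds - 1)
    precisions = []
    for i in range(num_thresholds):
        threshold = i / (num_thresholds - 1)
        kept = [tp for conf, tp in zip(confidences, is_true_positive) if conf >= threshold]
        precisions.append(sum(kept) / len(kept) if kept else 1.0)
    # Trapezoidal integration over thresholds in [0, 1].
    return sum((precisions[i] + precisions[i + 1]) / 2.0 * step
               for i in range(num_thresholds - 1))


# Toy detections: one confident true positive, one low-confidence false positive.
auc = precision_confidence_auc([0.9, 0.2], [True, False])
print(f"P-AUC under the assumed definition: {auc:.4f}")
```

Under this reading, pushing false positives to lower confidence raises P-AUC even if they are not removed outright, which fits a method that re-scores logits rather than deleting boxes.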

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

BEM adds a simple inference-only background memory module to cut false positives on fixed cameras, but the approach risks memory contamination and the supporting evidence stays too thin to judge the gains.


The main point is a training-free add-on that keeps a prototype memory of background embeddings from incoming frames and then applies an inverse-similarity rank-weighted penalty to detection logits. This targets the practical gap where detectors trained on sparse COCO-style data produce too many spurious boxes in dense, single-class fixed-camera scenes such as traffic or surveillance feeds. The module stays frozen and lightweight, so it preserves real-time speed on YOLO and RT-DETR backbones while using the quasi-static background as a label-free prior. The reported negative correlation between background cosine similarity and object count, plus the positive link to P-AUC, gives a clear motivation for the penalty design. Code release helps anyone who wants to plug it in and measure the effect on their own streams.

The soft spot is exactly the one the stress-test flagged: nothing in the description shows an explicit mechanism to keep foreground objects from leaking into the memory updates. In scenes with persistent or slow-moving objects, the prototypes could easily become contaminated, which would either suppress valid detections or leave false positives untouched. Without ablations on update frequency, density thresholds, or temporal filtering, the claimed consistent FP reductions on LLVIP and simulated streams are hard to trust at face value. The abstract also skips quantitative tables and error breakdowns, so effect sizes and failure modes stay opaque.

This is aimed at practitioners who maintain fixed-camera systems and want a plug-in precision boost rather than a full retraining cycle. A reader already running YOLO-family detectors on surveillance data could extract the core idea and test it quickly. It is worth sending to peer review because the method is concrete, the implementation is public, and referees can push for the missing controls and numbers that would make the claims verifiable.

Referee Report

2 major / 2 minor

Summary. The paper proposes Background Embedding Memory (BEM), a lightweight training-free module attachable to pretrained detectors (YOLO and RT-DETR families) for fixed-background cameras. BEM estimates clean background embeddings from incoming frames, maintains a prototype memory, and applies an inverse-similarity rank-weighted penalty to detection logits to suppress false positives while preserving recall. It reports a negative correlation between background-frame cosine similarity and object count, plus a positive correlation with Precision-Confidence AUC (P-AUC), and claims consistent FP reduction on LLVIP and simulated surveillance streams without compromising real-time performance. Code is released at the provided GitHub link.

Significance. If the central empirical claims hold, BEM offers a practical inference-only technique for improving detector robustness in dense, quasi-static surveillance and traffic scenes by exploiting label-free background priors, avoiding the need for domain-specific retraining. The training-free design and public code release are strengths that could facilitate adoption and further testing in real-time CV applications.

major comments (2)

[§3] The prototype memory is updated directly from incoming frames without an explicit mechanism (temporal filtering, statistical outlier rejection, or foreground masking) for separating background from foreground. In the high object-density surveillance scenarios highlighted in the introduction, persistent or slow-moving objects risk contaminating the background embeddings; this could either suppress true static detections or fail to penalize spurious ones, directly threatening the reported P-AUC gains.
[Experiments] The abstract states consistent FP reduction and a background-similarity correlation with object count/P-AUC across YOLO/RT-DETR on LLVIP and simulated streams, yet no quantitative tables, per-scenario ablations on object density, or error bars are referenced; without these, it is impossible to verify whether the gains are robust or load-bearing for the central claim.

minor comments (2)

[Abstract] The term 'simulated surveillance streams' is used without describing the simulation procedure, camera parameters, or how it matches real fixed-background conditions.
Notation: The embedding memory update rule and the exact form of the rank-weighted inverse-similarity penalty would benefit from an explicit equation or pseudocode block in the method section to improve clarity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and outline revisions that will strengthen the empirical support and methodological clarity of the manuscript.


Referee: §3: The prototype memory is updated directly from incoming frames without an explicit mechanism (temporal filtering, statistical outlier rejection, or foreground masking) for separating background from foreground. In the high object-density surveillance scenarios highlighted in the introduction, persistent or slow-moving objects risk contaminating the background embeddings; this could either suppress true static detections or fail to penalize spurious ones, directly threatening the reported P-AUC gains.

Authors: We agree that the absence of an explicit separation mechanism in the prototype update is a valid concern for high-density scenes. The current design relies on the background similarity metric to modulate penalty strength but performs direct averaging for memory maintenance. To address this, we will revise Section 3 to incorporate a lightweight temporal filtering step (exponential moving average with similarity-gated updates) and add a short ablation on embedding contamination under persistent objects. These changes will be accompanied by new results confirming that P-AUC gains hold under moderate density increases.
revision: yes
Referee: Experiments section: The abstract states consistent FP reduction and a background-similarity correlation with object count/P-AUC across YOLO/RT-DETR on LLVIP and simulated streams, yet no quantitative tables, per-scenario ablations on object density, or error bars are referenced; without these, it is impossible to verify whether the gains are robust or load-bearing for the central claim.

Authors: We acknowledge that the experimental reporting requires greater quantitative detail to allow verification of robustness. While correlations and FP reductions are described, dedicated tables, density-specific breakdowns, and statistical variability measures are indeed missing. In the revised manuscript we will add comprehensive tables reporting FP reduction, correlation coefficients, and P-AUC for each detector-dataset pair, plus new per-scenario ablations across low/medium/high object density regimes and error bars from repeated runs on the simulated streams.
revision: yes
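The similarity-gated exponential moving average mentioned in the first response could look roughly like the sketch below. It is an illustration of the general technique only: the function name `gated_ema_update`, the `momentum` and `gate` values, and the choice to gate on cosine similarity against a single prototype are assumptions, not details from the rebuttal or the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def gated_ema_update(prototype, frame_emb, momentum=0.9, gate=0.8):
    """Blend a frame embedding into the prototype only if it looks like background.

    Returns (new_prototype, was_updated). Frames whose cosine similarity to
    the prototype is below `gate` are treated as contaminated by foreground
    and leave the memory unchanged. Both thresholds are hypothetical.
    """
    if cosine(frame_emb, prototype) < gate:
        return list(prototype), False
    blended = [momentum * p + (1.0 - momentum) * f
               for p, f in zip(prototype, frame_emb)]
    return blended, True


memory = [1.0, 0.0]
# A frame pointing the same way as the prototype passes the gate.
memory, updated = gated_ema_update(memory, [2.0, 0.0])
print(memory, updated)
# An orthogonal frame (for example, heavy foreground occlusion) is rejected.
memory, updated = gated_ema_update(memory, [0.0, 1.0])
print(memory, updated)
```

A gate of this kind trades contamination risk against adaptivity: set too high, the memory never follows legitimate lighting changes; set too low, slow-moving objects leak in, which is the referee's original concern.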

Circularity Check

0 steps flagged

No circularity: BEM is an external training-free attachment using independent background prior


The paper introduces BEM as a lightweight inference-only module attached to frozen pretrained detectors. It estimates background embeddings from quasi-static frames, maintains a prototype memory updated from incoming frames, and applies an inverse-similarity penalty to detection logits. This construction relies on external frame data as a label-free prior rather than any fitted parameters or outputs from the detector itself. No equations or claims reduce by construction to the detector's predictions, no self-citations are load-bearing for the core mechanism, and the reported negative correlation between background similarity and object count is presented as empirical motivation, not a derived result. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that fixed-camera backgrounds are sufficiently stable to serve as a label-free prior; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: Quasi-static background provides a stable, label-free prior exploitable at inference
Invoked to justify attaching BEM to suppress spurious detections in fixed-camera environments.


