arxiv: 2605.10988 · v1 · submitted 2026-05-09 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Seeing the Needle in the Haystack: Towards Weakly-Supervised Log Instance Anomaly Localization via Counterfactual Perturbation

Yutszyuk Wong , Wentai Wu , Yuen-Ying Yeung , Weiwei Lin

Authors on Pith no claims yet

Pith reviewed 2026-05-13 06:38 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords log anomaly detectionweakly supervised learningmulti-instance learninganomaly localizationcounterfactual perturbationprototype modelingsystem logsinstance-level supervision

0 comments

The pith

LogMILP performs bag-level anomaly detection and instance-level localization in logs using only coarse group labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LogMILP, a weakly supervised framework that detects anomalies across groups of log entries while also identifying which specific entries within each group are responsible, all without requiring labels for individual lines. This addresses the practical barrier that massive log volumes make per-entry annotation infeasible for operations and security teams. The approach combines prototype-guided structural modeling to represent normal patterns with counterfactual perturbation consistency regularization to force the model to focus on the entries whose removal or alteration would change the group-level decision. On three public datasets the method matches standard detection accuracy yet delivers markedly more reliable localization of the exact anomalous instances.

Core claim

LogMILP enables both bag-level anomaly detection and instance-level anomaly localization in log data using only bag-level labels. It guides the model to the critical log entries through prototype-guided structural modeling that captures representative normal structures and counterfactual perturbation consistency regularization that enforces stable predictions under targeted changes to potential anomalous instances. This combination improves localization reliability and interpretability under coarse-grained supervision without introducing instance-level annotations.

What carries the argument

Prototype-guided structural modeling combined with counterfactual perturbation consistency regularization, which steers attention to the specific entries whose perturbation alters the bag-level anomaly score.

If this is right

Systems can achieve competitive anomaly detection accuracy while also obtaining instance-level explanations without paying for fine-grained labels.
Localization becomes more reliable because consistency under counterfactual perturbations filters out entries that are not causally responsible for the anomaly signal.
Interpretability of log-based security and operations monitoring increases since the method surfaces the precise lines that trigger alerts.
The framework extends multi-instance learning to logs by adding prototype representations and perturbation checks that stabilize instance scoring.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same perturbation-consistency idea could transfer to other domains with bag-level labels, such as document classification or sensor-group anomaly detection.
If the regularization truly isolates causal entries, it may reduce alert fatigue by lowering the number of spurious instance flags in production monitoring.
Testing on streaming logs with evolving normal patterns would reveal whether the prototypes remain stable enough for ongoing deployment.

Load-bearing premise

Prototype-guided structural modeling plus counterfactual perturbation consistency regularization will reliably isolate the exact log entries driving bag-level anomalies without bias or the need for instance-level labels.

What would settle it

On a log dataset supplied with ground-truth instance labels, measure whether LogMILP's ranked list of anomalous entries aligns with the true labels at a rate no better than standard multi-instance learning baselines or random selection.

Figures

Figures reproduced from arXiv: 2605.10988 by Weiwei Lin, Wentai Wu, Yuen-Ying Yeung, Yutszyuk Wong.

**Figure 1.** Figure 1: Overall Architecture of LogMILP LogMILP achieves clear advantages in both detection performance and localization reliability. II. RELATED WORK A. Log Anomaly Detection Early approaches detect anomalies by modeling normal patterns. A representative example is DeepLog [9], which employed LSTM to learn the temporal dependencies of log template sequences and regards logs that deviate from the predicted pattern… view at source ↗

**Figure 2.** Figure 2: Geometric distribution of baseline models in the precision-recall space [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

read the original abstract

Log anomaly detection is a critical task for system operations and security assurance. However, in networked systems at scale, log data are generated at massive scale while instance-level annotations are prohibitively expensive, posing great difficulties to fine-grained anomaly localization. To address this challenge, we propose LogMILP (Log anomaly localization based on Multi-Instance Learning enhanced by prototypes and Perturbation), a weakly supervised framework that enables both bag-level anomaly detection and instance-level anomaly localization using only bag-level labels. Our method guides the model to pinpoint the critical log entries using prototype-guided structural modeling with counterfactual perturbation consistency regularization, thereby improving localization reliability and interpretability under coarse-grained supervision. Experimental results on three public datasets demonstrate that LogMILP achieves competitive detection performance while yielding significantly more reliable instance-level localization. Our code is open-sourced at https://github.com/YUK1207/LogMILP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LogMILP combines multi-instance learning with prototypes and counterfactual perturbations to add instance-level localization to bag-level log anomaly detection, and the experiments plus open code make the results checkable.

read the letter

The paper's core move is to treat logs as bags and use prototype-guided modeling plus a counterfactual perturbation consistency term to push the model toward identifying which specific entries drive the bag-level anomaly label. This is a reasonable way to get localization without instance supervision, and the authors show it on three public datasets where detection stays competitive with standard MIL baselines while localization metrics improve. The ablations appear to isolate the contribution of each piece, and the code is released, which lets anyone rerun the numbers directly. That combination of a clear framework and verifiable results is the main thing worth noting here. The localization gains rely on held-out instance annotations that exist only for evaluation, which is common in this area but means the method still needs some form of fine-grained signal to measure success. The datasets are standard public ones, so broader testing on production logs with different noise patterns would help, though nothing in the reported results suggests the approach collapses outside the tested cases. The math and regularization look internally consistent with no obvious circularity. This is the kind of incremental but useful engineering paper that people working on operational log analysis would want to see. It is worth sending to peer review because the claim is scoped, the evidence is presented with controls, and the code lowers the barrier to checking the work.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LogMILP, a weakly-supervised multi-instance learning framework for log anomaly detection and instance-level localization. It combines prototype-guided structural modeling with counterfactual perturbation consistency regularization to achieve both bag-level anomaly detection and instance-level localization using only bag-level labels. Experiments on three public datasets report competitive detection performance alongside significantly improved localization metrics relative to standard MIL baselines, with ablations demonstrating the contribution of each component and open-sourced code provided for verification.

Significance. If the reported results hold, the work is significant for practical system monitoring and security applications, where instance-level annotations are prohibitively expensive at scale. The approach improves localization reliability and interpretability under weak supervision, addressing a key gap in fine-grained anomaly analysis for large log streams. The inclusion of reproducible code and standard evaluation on public datasets strengthens the contribution.

major comments (2)

[§4.2] §4.2, localization metrics: the claim of 'significantly more reliable' instance-level localization is supported by precision/recall on held-out annotations, but the manuscript does not report run-to-run variance or statistical significance tests; this weakens the strength of the cross-method comparison in Table 2.
[§3.3] §3.3, counterfactual perturbation consistency regularization: the regularization term is defined to enforce consistency, but its interaction with the prototype-guided modeling is not shown to be free of bias on sequences where multiple entries could be critical; an additional controlled experiment on synthetic logs with known ground-truth needles would strengthen the central claim.

minor comments (2)

[Abstract] Abstract: the statement of 'significantly more reliable' localization lacks any numeric deltas or metric values; adding one or two key figures (e.g., F1 improvement) would make the headline claim more concrete.
[§5] §5, related work: the discussion of prior MIL methods for anomaly detection is adequate but could reference more recent weakly-supervised log-specific approaches published after 2022 to better situate the novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation. We address each major comment below and outline the revisions to be incorporated.

read point-by-point responses

Referee: [§4.2] §4.2, localization metrics: the claim of 'significantly more reliable' instance-level localization is supported by precision/recall on held-out annotations, but the manuscript does not report run-to-run variance or statistical significance tests; this weakens the strength of the cross-method comparison in Table 2.

Authors: We agree that reporting run-to-run variance and statistical significance would strengthen the cross-method comparisons. In the revised manuscript, we will update Table 2 to include mean and standard deviation of the localization metrics (precision, recall, F1-score) computed over 5 independent runs with different random seeds. We will also add paired t-test p-values to confirm that the improvements of LogMILP over baselines are statistically significant. revision: yes
Referee: [§3.3] §3.3, counterfactual perturbation consistency regularization: the regularization term is defined to enforce consistency, but its interaction with the prototype-guided modeling is not shown to be free of bias on sequences where multiple entries could be critical; an additional controlled experiment on synthetic logs with known ground-truth needles would strengthen the central claim.

Authors: We thank the referee for highlighting this potential limitation. The prototype-guided structural modeling identifies representative anomalous patterns, while the counterfactual perturbation regularization penalizes inconsistent predictions, encouraging the model to focus on critical instances. Ablation results in Section 4.3 already demonstrate that both components contribute to improved localization. To address the concern directly, we will add a clarifying paragraph in Section 3.3 discussing multi-critical-entry scenarios and how the consistency term mitigates bias via prototype matching. A dedicated synthetic experiment would further strengthen the claim but is not feasible within the current revision timeline; we believe the real-dataset results and ablations provide sufficient support. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper introduces LogMILP as a weakly-supervised MIL framework that combines prototype-guided structural modeling with counterfactual perturbation consistency regularization to achieve bag-level detection and instance-level localization from bag labels only. All load-bearing steps are defined via explicit architectural choices and loss terms that are trained end-to-end; the reported performance is obtained by evaluating against held-out instance annotations on three independent public datasets using standard MIL baselines and localization metrics. No equation reduces to a fitted parameter that is then relabeled as a prediction, no uniqueness theorem is imported from self-citation, and no ansatz is smuggled in via prior work. The open-sourced code further permits direct reproduction, confirming that the central claims rest on externally falsifiable experimental outcomes rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the method description implies standard multi-instance learning assumptions but no details are given.

pith-pipeline@v0.9.0 · 5465 in / 1028 out tokens · 67696 ms · 2026-05-13T06:38:38.240605+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages

[1]

Weakly-supervised log-based anomaly detection with inexact labels via multi-instance learning,

M. He, T. Jia, C. Duan, H. Cai, Y . Li, and G. Huang, “Weakly-supervised log-based anomaly detection with inexact labels via multi-instance learning,” in2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), pp. 2918–2930, 2025

work page 2025
[2]

Towards faithful model explanation in NLP: A survey,

Q. Lyu, M. Apidianaki, and C. Callison-Burch, “Towards faithful model explanation in NLP: A survey,”Computational Linguistics, vol. 50, pp. 657–723, June 2024

work page 2024
[3]

Weakly supervised anomaly detection: A survey,

M. Jiang, C. Hou, A. Zheng, X. Hu, S. Han, H. Huang, X. He, P. S. Yu, and Y . Zhao, “Weakly supervised anomaly detection: A survey,” 2023

work page 2023
[4]

Industrial anomaly detection and localization using weakly-supervised residual transformers,

H. Li, J. Wu, D. Liu, L. Wu, H. Chen, M. Wang, and C. Shen, “Industrial anomaly detection and localization using weakly-supervised residual transformers,” 2025

work page 2025
[5]

Exploring multiple instance learning (mil): A brief survey,

M. Waqas, S. U. Ahmed, M. A. Tahir, J. Wu, and R. Qureshi, “Exploring multiple instance learning (mil): A brief survey,”Expert Systems with Applications, vol. 250, p. 123893, 2024

work page 2024
[6]

Walk the talk: Is your log-based software reliability maintenance system really reliable?,

M. He, T. Jia, C. Duan, P. Xiao, L. Zhang, K. Wang, Y . Wu, Y . Li, and G. Huang, “Walk the talk: Is your log-based software reliability maintenance system really reliable?,”2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 3784–3788, 2025

work page 2025
[7]

What supercomputers say: A study of five system logs,

A. Oliner and J. Stearley, “What supercomputers say: A study of five system logs,” in37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’07), pp. 575–584, 2007

work page 2007
[8]

Loghub: A large collection of system log datasets towards automated log analytics,

S. He, J. Zhu, P. He, and M. R. Lyu, “Loghub: A large collection of system log datasets towards automated log analytics,”CoRR, vol. abs/2008.06448, 2020

work page arXiv 2008
[9]

DeepLog: anomaly detection and diagnosis from system logs through deep learning,

M. Du, F. Li, G. Zheng, and V . Srikumar, “DeepLog: anomaly detection and diagnosis from system logs through deep learning,” inACM Conference on Computer and Communications Security (CCS), 2017

work page 2017
[10]

Loganomaly: Unsupervised detection of sequential and quantitative anomalies in unstructured logs,

W. Meng, Y . Liu, Y . Zhu, S. Zhang, D. Pei, Y . Liu, Y . Chen, R. Zhang, S. Tao, P. Sun, and R. Zhou, “Loganomaly: Unsupervised detection of sequential and quantitative anomalies in unstructured logs,” in Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 4739–4745, International Joint Conferences on...

work page 2019
[11]

Logbert: Log anomaly detection via BERT,

H. Guo, S. Yuan, and X. Wu, “Logbert: Log anomaly detection via BERT,”CoRR, vol. abs/2103.04475, 2021

work page arXiv 2021
[12]

Logformer: Cascaded transformer for system log anomaly detection,

F. Hang, W. Guo, H. Chen, L. Xie, C. Zhou, and Y . Liu, “Logformer: Cascaded transformer for system log anomaly detection,”Computer Modeling in Engineering & Sciences, vol. 136, no. 1, pp. 517–529, 2023

work page 2023
[13]

Prototype-based interpretability for legal citation prediction,

C. F. Luo, R. Bhambhoria, S. Dahan, and X. Zhu, “Prototype-based interpretability for legal citation prediction,” inFindings of the Association for Computational Linguistics: ACL 2023(A. Rogers, J. Boyd-Graber, and N. Okazaki, eds.), (Toronto, Canada), pp. 4883–4898, Association for Computational Linguistics, July 2023

work page 2023
[14]

Confident classification via template representation learning,

Y . Liu, F. Yin, and C.-L. Liu, “Confident classification via template representation learning,”Neurocomputing, vol. 682, p. 133411, 2026

work page 2026
[15]

With a little help from language: Semantic enhanced visual prototype framework for few-shot learning,

H. Cai, Y . Liu, S. Huang, and J. Lv, “With a little help from language: Semantic enhanced visual prototype framework for few-shot learning,” inProceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24(K. Larson, ed.), pp. 3751–3759, International Joint Conferences on Artificial Intelligence Organization, 8

work page
[16]

Prototype- oriented unsupervised anomaly detection for multivariate time series,

Y . Li, W. Chen, B. Chen, D. Wang, L. Tian, and M. Zhou, “Prototype- oriented unsupervised anomaly detection for multivariate time series,” in Proceedings of the 40th International Conference on Machine Learning (A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, eds.), vol. 202 ofProceedings of Machine Learning Research, pp. 1940...

work page 2023
[17]

Reconstruction- based multi-normal prototypes learning for weakly supervised anomaly detection,

Z. Dong, H. Liu, B. Ren, W. Xiong, and Z. Wu, “Reconstruction- based multi-normal prototypes learning for weakly supervised anomaly detection,”CoRR, vol. abs/2408.14498, 2024

work page arXiv 2024
[18]

Counterfactual interpo- lation augmentation (cia): A unified approach to enhance fairness and explainability of dnn,

Y . Qiang, C. Li, M. Brocanelli, and D. Zhu, “Counterfactual interpo- lation augmentation (cia): A unified approach to enhance fairness and explainability of dnn,” inProceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22(L. D. Raedt, ed.), pp. 732–739, International Joint Conferences on Artificial Intelligence ...

work page 2022
[19]

Unsupervised data augmentation for consistency training,

Q. Xie, Z. Dai, E. Hovy, T. Luong, and Q. Le, “Unsupervised data augmentation for consistency training,” inAdvances in Neural Information Processing Systems(H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, eds.), vol. 33, pp. 6256–6268, Curran Associates, Inc., 2020

work page 2020
[20]

Cognitive refined augmentation for video anomaly detection in weak supervision,

J. Lee, H. Koo, S. Kim, and H. Ko, “Cognitive refined augmentation for video anomaly detection in weak supervision,”Sensors, vol. 24, no. 1, 2024

work page 2024
[21]

Prompt perturbation consistency learning for robust language models,

Y . Qiang, S. Nandi, N. Mehrabi, G. Ver Steeg, A. Kumar, A. Rumshisky, and A. Galstyan, “Prompt perturbation consistency learning for robust language models,” inFindings of the Association for Computational Linguistics: EACL 2024(Y . Graham and M. Purver, eds.), (St. Julian’s, Malta), pp. 1357–1370, Association for Computational Linguistics, Mar. 2024

work page 2024
[22]

Interpretability of deep neural networks: A review of methods, classification and hardware,

T. Antamis, A. Drosou, T. Vafeiadis, A. Nizamis, D. Ioannidis, and D. Tzovaras, “Interpretability of deep neural networks: A review of methods, classification and hardware,”Neurocomputing, vol. 601, p. 128204, 2024

work page 2024
[23]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” inAdvances in Neural Information Processing Systems, vol. 30, 2017

work page 2017
[24]

Focal loss for dense object detection,

T.-Y . Lin, P. Goyal, R. Girshick, K. He, and P. Doll ´ar, “Focal loss for dense object detection,” inProceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988, 2017

work page 2017