Structured Exploration and Exploitation of Label Functions for Automated Data Annotation
Pith reviewed 2026-05-14 23:20 UTC · model grok-4.3
The pith
EXPONA generates label functions by exploring surface, structural, and semantic levels while applying reliability-aware filtering.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EXPONA formulates LF generation as a principled process that balances diversity and reliability. It systematically explores multi-level LFs spanning surface, structural, and semantic perspectives, and applies reliability-aware mechanisms to suppress noisy or redundant heuristics while preserving complementary signals. Across eleven datasets, the paper reports nearly complete label coverage (up to 98.9 percent), weak label quality improved by up to 87 percent, and downstream performance gains of up to 46 percent in weighted F1.
What carries the argument
The EXPONA framework, which explores label functions at surface, structural, and semantic levels and combines that exploration with reliability-aware filtering to suppress noisy heuristics.
Load-bearing premise
Exploring label functions at surface, structural, and semantic levels together with reliability-aware filtering will produce complementary signals without introducing new biases or missing important domain-specific patterns.
What would settle it
A controlled experiment on a held-out dataset where EXPONA produces lower coverage or weaker downstream models than the best existing automated label function method would settle whether the central claim holds.
Original abstract
High-quality labeled data is critical for training reliable machine learning and deep learning models, yet manual annotation remains costly and error-prone. Programmatic labeling addresses this challenge by using label functions (LFs), i.e., heuristic rules that automatically generate weak labels for training datasets. However, existing automated LF generation methods either rely on large language models (LLMs) to synthesize surface-level heuristics or employ model-based synthesis over hand-crafted primitives. These approaches often result in limited coverage and unreliable label quality. In this paper, we introduce EXPONA, an automated framework for programmatic labeling that formulates LF generation as a principled process balancing diversity and reliability. EXPONA systematically explores multi-level LFs, spanning surface, structural, and semantic perspectives. EXPONA further applies reliability-aware mechanisms to suppress noisy or redundant heuristics while preserving complementary signals. To evaluate EXPONA, we conducted extensive experiments on eleven classification datasets across diverse domains. Experimental results show that EXPONA consistently outperformed state-of-the-art automated LF generation methods. Specifically, EXPONA achieved nearly complete label coverage (up to 98.9%), improved weak label quality by up to 87%, and yielded downstream performance gains of up to 46% in weighted F1. These results indicate that EXPONA's combination of multi-level LF exploration and reliability-aware filtering enabled more consistent label quality and downstream performance across diverse tasks by balancing coverage and precision in the generated LF set.
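To make the terminology concrete, here is a minimal Python sketch of programmatic labeling on a toy spam task. The label functions, the ABSTAIN convention, the example texts, and the majority-vote aggregation are all illustrative assumptions, not EXPONA's generated LFs or its label model. The sketch only shows how weak labels and coverage (the fraction of examples on which at least one LF votes) can be computed.

```python
from collections import Counter

ABSTAIN = -1  # assumed convention: an LF returns ABSTAIN when it does not apply
HAM, SPAM = 0, 1

# Hypothetical surface-level LFs (keyword and pattern heuristics).
def lf_contains_link(text):
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_subscribe(text):
    return SPAM if "subscribe" in text.lower() else ABSTAIN

def lf_short_praise(text):
    words = text.lower().split()
    return HAM if len(words) <= 5 and "love" in words else ABSTAIN

LFS = [lf_contains_link, lf_subscribe, lf_short_praise]

def apply_lfs(texts, lfs):
    """Return one row of LF votes per text."""
    return [[lf(t) for lf in lfs] for t in texts]

def coverage(votes):
    """Fraction of examples that receive at least one non-abstain vote."""
    covered = sum(any(v != ABSTAIN for v in row) for row in votes)
    return covered / len(votes)

def majority_vote(row):
    """Aggregate votes by simple majority; abstain if no LF fired."""
    counts = Counter(v for v in row if v != ABSTAIN)
    return counts.most_common(1)[0][0] if counts else ABSTAIN

# Made-up example texts, for illustration only.
texts = [
    "Check out my channel http://example.com",
    "Please subscribe to my page",
    "I love this song",
    "The bass line in the second verse is great",
]
votes = apply_lfs(texts, LFS)
weak_labels = [majority_vote(row) for row in votes]
print(coverage(votes), weak_labels)
```

In this toy run the last text is left unlabeled because no LF fires on it, which is the kind of coverage gap the abstract attributes to surface-only heuristics.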
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EXPONA, a framework for automated label function (LF) generation in programmatic labeling. It formulates LF creation as a process that systematically explores multi-level heuristics (surface, structural, and semantic) and applies reliability-aware filtering to suppress noisy or redundant signals while retaining complementary ones. Experiments on eleven classification datasets across domains report up to 98.9% label coverage, 87% improvement in weak-label quality, and 46% gains in downstream weighted F1 over prior automated LF methods.
Significance. If the experimental claims hold under rigorous controls, EXPONA would advance automated data annotation by demonstrating that structured multi-level exploration plus targeted filtering can simultaneously raise coverage and precision without introducing unmeasured bias. The approach directly targets the coverage-quality trade-off that limits both LLM-synthesis and primitive-based baselines.
major comments (2)
- [Experimental Evaluation] Experimental section: the abstract and results claim peak gains of 98.9% coverage, 87% quality lift, and 46% F1 improvement, yet supply no description of baseline LF implementations, number of random seeds, statistical significance tests, or the precise reliability metric and threshold used in filtering; without these controls the reported superiority cannot be assessed.
- [Method] LF generation and filtering subsection: semantic LFs are produced by LLM prompting over hand-crafted primitives, but the manuscript provides no explicit bias-detection metric, cross-domain validation procedure, or ablation that isolates whether the reliability-aware filter removes LLM-induced domain skews; this leaves the central complementarity claim vulnerable on the eleven datasets.
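The statistical controls requested in the first major comment can be illustrated with a short sketch: per-method mean and standard deviation over seeds, plus a paired t statistic computed on per-seed differences. The score lists are placeholder numbers invented for illustration, not results from the paper, and the function names are hypothetical.

```python
import math
import statistics

def summarize(scores):
    """Mean and sample standard deviation over random seeds."""
    return statistics.mean(scores), statistics.stdev(scores)

def paired_t_statistic(a, b):
    """Paired t statistic for per-seed scores of two methods.

    Assumes both lists are aligned by seed and the differences are not all equal.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error

# Placeholder weighted-F1 scores over 5 seeds (illustrative only).
method_f1 = [0.80, 0.82, 0.79, 0.81, 0.83]
baseline_f1 = [0.75, 0.78, 0.76, 0.77, 0.78]

mean_f1, std_f1 = summarize(method_f1)
t_stat = paired_t_statistic(method_f1, baseline_f1)
print(f"{mean_f1:.3f} +/- {std_f1:.3f}, t = {t_stat:.2f}")
```

With 5 seeds the statistic has 4 degrees of freedom, so it would be compared against the two-sided 5% critical value of about 2.776. Pairing by seed matters because it removes seed-to-seed variance shared by both methods.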
minor comments (2)
- [Method] Notation for the reliability score and the diversity objective is introduced without an accompanying equation or pseudocode block, making the filtering step difficult to re-implement.
- [Results] Table captions do not list the exact number of LFs generated per method or the coverage metric definition, complicating direct comparison.
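As an illustration of what such a pseudocode block might look like, the sketch below filters LFs by accuracy estimated on a small labeled validation set and then drops near-duplicates. It is not the paper's algorithm: the function names, the agreement-based redundancy rule, and the 0.95 agreement cutoff are assumptions. The default accuracy threshold of 0.65 is the value the simulated rebuttal on this page commits to reporting.

```python
ABSTAIN = -1  # assumed convention for "LF does not apply"

def lf_accuracy(votes, gold):
    """Accuracy of one LF on the validation examples where it does not abstain."""
    fired = [(v, g) for v, g in zip(votes, gold) if v != ABSTAIN]
    if not fired:
        return 0.0
    return sum(v == g for v, g in fired) / len(fired)

def agreement(votes_a, votes_b):
    """Fraction of examples on which two LFs emit identical outputs (abstains included)."""
    return sum(a == b for a, b in zip(votes_a, votes_b)) / len(votes_a)

def filter_lfs(lf_votes, gold, min_accuracy=0.65, max_agreement=0.95):
    """Keep reliable LFs, then drop any LF that nearly duplicates an already kept one.

    lf_votes maps an LF name to its list of votes on the validation set.
    """
    scored = {name: lf_accuracy(v, gold) for name, v in lf_votes.items()}
    reliable = [n for n, acc in scored.items() if acc >= min_accuracy]
    reliable.sort(key=lambda n: scored[n], reverse=True)  # most accurate first
    kept = []
    for name in reliable:
        if all(agreement(lf_votes[name], lf_votes[k]) < max_agreement for k in kept):
            kept.append(name)
    return kept

# Toy validation set and LF outputs (hypothetical).
gold = [1, 1, 0, 0, 1, 0]
lf_votes = {
    "keyword": [1, 1, -1, 0, 1, -1],
    "keyword_copy": [1, 1, -1, 0, 1, -1],
    "noisy": [0, 1, 1, -1, 0, 1],
    "structural": [-1, -1, 0, 0, -1, 0],
}
kept = filter_lfs(lf_votes, gold)
print(kept)
```

Here the noisy LF fails the accuracy threshold and the exact copy is dropped as redundant, while the structural LF survives because it fires on different examples than the keyword LF. That is one simple way to read "suppress noisy or redundant heuristics while preserving complementary signals".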
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below with clarifications and commitments to strengthen the manuscript.
point-by-point responses
- Referee: [Experimental Evaluation] Experimental section: the abstract and results claim peak gains of 98.9% coverage, 87% quality lift, and 46% F1 improvement, yet supply no description of baseline LF implementations, number of random seeds, statistical significance tests, or the precise reliability metric and threshold used in filtering; without these controls the reported superiority cannot be assessed.
  Authors: We agree the submitted version omitted key experimental controls. In revision we will add: (1) explicit re-implementation details for all baselines drawn from their source papers, (2) all metrics reported as mean ± std over 5 random seeds, (3) paired t-test p-values for significance, and (4) the reliability metric as LF accuracy estimated on a 5% held-out validation set with threshold 0.65. These additions will allow full assessment of the reported gains.
  revision: yes
- Referee: [Method] LF generation and filtering subsection: semantic LFs are produced by LLM prompting over hand-crafted primitives, but the manuscript provides no explicit bias-detection metric, cross-domain validation procedure, or ablation that isolates whether the reliability-aware filter removes LLM-induced domain skews; this leaves the central complementarity claim vulnerable on the eleven datasets.
  Authors: The reliability filter already prunes LFs using estimated accuracy and agreement scores, which reduces noisy LLM outputs. We acknowledge the absence of an explicit bias metric. In revision we will insert: (i) a KL-divergence bias metric between LLM LF label distributions and validation ground truth, (ii) expanded cross-domain results across all eleven datasets, and (iii) an ablation isolating the filter's effect on domain skew. This will directly support the complementarity claim.
  revision: partial
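The rebuttal does not define the proposed bias metric further. One plausible reading, sketched below, compares the empirical label distribution emitted by LLM-generated LFs with the validation ground-truth distribution using KL divergence. The direction of the divergence, the additive smoothing constant, and all names are assumptions made for this sketch.

```python
import math
from collections import Counter

def label_distribution(labels, classes, smoothing=1e-6):
    """Smoothed empirical distribution over classes (smoothing avoids zero probabilities)."""
    counts = Counter(labels)
    total = len(labels) + smoothing * len(classes)
    return [(counts[c] + smoothing) / total for c in classes]

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy data (hypothetical): non-abstain LF outputs vs. validation ground truth.
classes = [0, 1]
lf_labels = [1, 1, 1, 0]
gold = [1, 0, 1, 0]

p = label_distribution(lf_labels, classes)
q = label_distribution(gold, classes)
bias = kl_divergence(p, q)
print(round(bias, 4))
```

A value near zero means the LFs emit classes in roughly the same proportions as the ground truth; larger values indicate skew toward some classes. A distribution-level metric like this cannot detect errors that cancel out across classes, so it would complement rather than replace per-LF accuracy.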
Circularity Check
No significant circularity detected
full rationale
The paper presents EXPONA as an empirical framework that explores label functions across surface, structural, and semantic levels then applies reliability-aware filtering, with all performance claims (coverage up to 98.9%, quality gains up to 87%, F1 gains up to 46%) resting on direct experimental comparisons against baselines across eleven datasets. No equations, fitted parameters presented as predictions, self-definitional constructs, or load-bearing self-citations appear in the abstract or described derivation. The multi-level exploration and filtering steps are implemented as procedural heuristics whose outputs are measured externally rather than defined in terms of the target metrics, rendering the reported results independent of internal circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Label functions generated across surface, structural, and semantic levels provide complementary signals that reliability-aware filtering can separate from noise and redundancy.
invented entities (1)
- EXPONA framework: no independent evidence
Lam et al.: Preprint submitted to Elsevier