Position: Early-Stage Quality Assurance in Annotation Pipelines Is More Cost-Effective Than Late-Stage Validation

Ashi Jain; Gulipalli Praveen Kumar; Kriti Banka; Manish Mehta; Naman Khandelwal; Parth Kulshreshtha; Sumukha Sharma Thoppanahalli Chandramouli; Sunil Kothari; Tanuja Chintada; Tao Liu

arxiv: 2605.15714 · v1 · pith:HTELUJZDnew · submitted 2026-05-15 · 💻 cs.SE · cs.AI

Sunil Kothari, Sumukha Sharma Thoppanahalli Chandramouli, Naman Khandelwal, Parth Kulshreshtha, Ashi Jain, Kriti Banka, Tanuja Chintada, Venkata Triveni, Gulipalli Praveen Kumar, Manish Mehta, Tao Liu

Pith reviewed 2026-05-20 17:25 UTC · model grok-4.3

keywords: data annotation · quality assurance · validation timing · error propagation · machine learning data quality · shift-left principle · annotation pipelines

The pith

Early-stage quality assurance in annotation pipelines reduces both error rates and costs more effectively than late-stage validation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that when validation occurs in annotation pipelines matters more than the specific methods used for checking quality. Drawing from software engineering, where late defect detection costs 4 to 100 times more, the authors show that annotation workflows follow similar patterns where early fixes are far cheaper. They define three distinct points for quality assurance: before any annotation starts, right after labeling, and after review cycles. A survey of recent papers reveals that almost none report the timing of their validation steps despite its importance. This position calls for treating QA timing as a key design choice to improve data quality for machine learning models.

Core claim

The central claim is that prioritizing quality assurance before annotation begins, rather than after review, fundamentally lowers final error rates and annotation costs in data pipelines, as formalized by a parametric error-propagation model that distinguishes timing effects on economics versus error propagation.

What carries the argument

The taxonomy of three QA trigger points—T0 (pre-annotation), T1 (post-annotation), and T2 (post-review)—combined with a parametric error-propagation model that treats timing as a measurable design variable.
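To make "timing as a measurable design variable" concrete, here is a minimal Python sketch. It is not the paper's model, whose equations are not reproduced on this page. The cost multipliers in `FIX_COST`, the `detect_rate` argument, and the function name `qa_outcome` are all hypothetical values and names chosen for illustration.

```python
from enum import Enum


class Trigger(Enum):
    """The three QA trigger points named in the paper's taxonomy."""
    T0 = "pre-annotation"
    T1 = "post-annotation"
    T2 = "post-review"


# ASSUMPTION: relative cost of fixing one error at each trigger point.
# These numbers are illustrative only; they are not taken from the paper.
FIX_COST = {Trigger.T0: 1.0, Trigger.T1: 4.0, Trigger.T2: 10.0}


def qa_outcome(n_errors: float, trigger: Trigger, detect_rate: float):
    """Return (residual_errors, total_fix_cost) for one QA pass at `trigger`.

    `detect_rate` is the assumed fraction of existing errors the check catches.
    """
    if not 0.0 <= detect_rate <= 1.0:
        raise ValueError("detect_rate must be in [0, 1]")
    caught = n_errors * detect_rate
    return n_errors - caught, caught * FIX_COST[trigger]


if __name__ == "__main__":
    for trigger in Trigger:
        residual, cost = qa_outcome(100, trigger, detect_rate=0.8)
        print(f"{trigger.name}: residual={residual:.1f}, fix cost={cost:.1f}")
```

In this toy setup, using the same `detect_rate` at every trigger leaves residual errors identical and changes only the cost. Giving each trigger its own detection rate would make timing change the residual error as well, which is the distinction the paper's model is described as formalizing.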

If this is right

Researchers must report the timing of validation steps in addition to the methods used.
Annotation platforms should expose QA timing as a configurable first-class parameter.
Controlled experiments are needed to directly measure detection rates at each stage.
Without addressing timing, efforts to improve validation methods alone may miss the largest gains in efficiency.
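The first two points above could look something like the following in practice. This is a sketch under assumptions: the `QACheck` record, the `timing_report` helper, and the JSON layout are hypothetical and do not correspond to any existing annotation platform's API.

```python
import json
from dataclasses import dataclass

# Trigger labels follow the paper's taxonomy: T0, T1, T2.
VALID_TRIGGERS = ("T0", "T1", "T2")


@dataclass(frozen=True)
class QACheck:
    """One validation step, recorded together with when it runs (hypothetical)."""
    method: str
    trigger: str

    def __post_init__(self):
        if self.trigger not in VALID_TRIGGERS:
            raise ValueError(f"unknown trigger point: {self.trigger!r}")


def timing_report(checks):
    """Group validation methods by trigger point and serialize them as JSON."""
    grouped = {t: [] for t in VALID_TRIGGERS}
    for check in checks:
        grouped[check.trigger].append(check.method)
    return json.dumps(grouped, indent=2)


if __name__ == "__main__":
    print(timing_report([
        QACheck("schema validation", "T0"),
        QACheck("inter-annotator agreement", "T1"),
    ]))
```

An empty list under a trigger is itself informative: it states explicitly that no check runs at that stage, rather than leaving the timing unreported.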

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Adopting early QA could improve the quality of training data for foundation models by preventing error propagation from the start.
Similar timing considerations might apply to other data processing pipelines beyond annotation, such as data cleaning or augmentation.
Platforms could develop automated pre-annotation checks based on data characteristics to implement this shift.

Load-bearing premise

Annotation pipelines behave like software development processes where the cost of fixing errors increases dramatically the later they are discovered.

What would settle it

An experiment that measures the total annotation cost and final error rate for the same task using QA only at T0 versus only at T2, showing no cost savings or higher errors with early QA, would falsify the position.
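The decision rule in that experiment can be written down directly. The sketch below only encodes the comparison stated above; the `ArmResult` type and `position_falsified` function are hypothetical names, and a real study would also need repeated runs and a statistical test rather than a single point comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmResult:
    """Measured outcome of one experimental arm on the same task (hypothetical)."""
    total_cost: float        # total annotation cost for the arm
    final_error_rate: float  # error rate of the delivered labels, in [0, 1]


def position_falsified(t0_only: ArmResult, t2_only: ArmResult) -> bool:
    """True if early QA shows no cost saving or a higher final error rate."""
    no_cost_saving = t0_only.total_cost >= t2_only.total_cost
    higher_error = t0_only.final_error_rate > t2_only.final_error_rate
    return no_cost_saving or higher_error


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    early = ArmResult(total_cost=1000.0, final_error_rate=0.03)
    late = ArmResult(total_cost=1800.0, final_error_rate=0.03)
    print(position_falsified(early, late))
```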

Figures

Figures reproduced from arXiv: 2605.15714 by Ashi Jain, Gulipalli Praveen Kumar, Kriti Banka, Manish Mehta, Naman Khandelwal, Parth Kulshreshtha, Sumukha Sharma Thoppanahalli Chandramouli, Sunil Kothari, Tanuja Chintada, Tao Liu, Venkata Triveni.

**Figure 1.** QA trigger points in annotation pipelines. T0 occurs after ML pre-annotation but before human work. T1 occurs after annotation but before review. T2 occurs after review. Each trigger point enables different validation capabilities and incurs different intervention costs.

Original abstract

This position paper argues that the machine learning community should prioritize early-stage quality assurance in annotation pipelines over the prevailing practice of late-stage validation. Data quality bottlenecks increasingly limit foundation model improvement, yet quality assurance research focuses almost exclusively on validation methods rather than validation timing. When validation occurs, not merely what methods are employed, fundamentally determines both error rates and annotation costs. This temporal neglect is puzzling given the well-established "shift-left" principle from software engineering, where empirical studies demonstrate 4-100x cost multipliers for defects detected in later stages (Boehm, 1981; Shull et al., 2002). Annotation pipelines exhibit analogous dynamics: errors caught before annotation begins cost a fraction of those discovered after review cycles complete. We propose a taxonomy of three QA trigger points, namely pre-annotation (T0), post-annotation (T1), and post-review (T2), that decompose annotation workflows into discrete validation opportunities. A parametric error-propagation model formalizes when timing affects final error rates versus only economics, making timing a measurable design variable rather than a configuration afterthought. A survey of 47 recent papers reveals that only 4% report when validation occurs, a striking gap given timing's demonstrated impact in adjacent fields. Without explicit attention to QA timing, the community risks optimizing validation methods while ignoring the structural variable that may matter most. Acting on this position requires three steps: researchers should report QA timing configurations alongside validation methods; annotation platforms should expose timing as a first-class parameter; and the community should run controlled experiments that measure stage-specific detection rates directly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This position paper flags a real reporting gap on QA timing in annotation work and offers a simple taxonomy, but its cost savings argument stays borrowed from software engineering without fresh calibration.

Hi, the main thing to know is that this position paper argues for moving quality assurance earlier in annotation pipelines and shows that almost no one currently reports when they do their checks. It backs the point with a quick survey and a basic framework rather than new measurements.

The taxonomy of T0 pre-annotation, T1 post-annotation, and T2 post-review stages gives a clean way to break down the workflow and treat timing as a design choice. The survey finding that only 4 percent of 47 recent papers mention timing is concrete and points to a practical blind spot in how data work gets documented. The parametric error-propagation model is a reasonable attempt to separate cost effects from error-rate effects, which could help make the timing variable measurable. The paper does well at naming the issue and sketching next steps like requiring timing reports and running controlled tests.

The soft spot is the central cost claim. It leans on the classic Boehm multipliers from software defect studies without showing that annotation errors follow the same 4-100x pattern when caught early. No new data or small pilot appears in the manuscript to calibrate the model on actual label noise, so the analogy stays suggestive. The survey also skips details on paper selection, which makes the 4 percent figure harder to weigh.

This is aimed at people who run or study large annotation efforts for ML models. A practitioner managing data pipelines would get a useful prompt to track timing in their own setup. The work shows clear structure and honest engagement with the literature, so it deserves a serious referee even though the evidence is mostly by extension from another field. I would send it to review and ask the referees to push for either a small empirical check or clearer limits on the analogy.

Referee Report

3 major / 2 minor

Summary. This position paper argues that the machine learning community should prioritize early-stage quality assurance (QA) in annotation pipelines over late-stage validation. Drawing on the 'shift-left' principle from software engineering (citing Boehm 1981 and Shull et al. 2002 for 4-100x cost multipliers), it claims that when validation occurs fundamentally determines error rates and costs. The paper proposes a taxonomy of three QA trigger points (T0 pre-annotation, T1 post-annotation, T2 post-review), introduces a parametric error-propagation model to formalize timing effects, reports a survey of 47 papers where only 4% specify QA timing, and recommends reporting timing configurations, exposing timing as a platform parameter, and running controlled experiments.

Significance. If the analogy to software engineering holds and the parametric model can be empirically grounded on annotation data, the position could meaningfully redirect ML data-quality research from validation methods toward timing as a first-class design variable. This framing has the potential to produce measurable cost reductions and lower label noise in foundation-model pipelines, provided the survey gap and model predictions are substantiated.

major comments (3)

[Parametric error-propagation model] The parametric error-propagation model is introduced to distinguish cases where timing affects final error rates versus only economics, yet the manuscript provides neither the explicit equations nor the parameter definitions or calibration procedure on annotation data. Without these, it is unclear whether the model yields falsifiable, non-circular predictions or simply restates the software-engineering analogy.
[Survey of 47 recent papers] The survey finding that only 4% of 47 recent papers report when validation occurs is used to document a 'striking gap,' but the paper does not describe the paper-selection methodology, search terms, inclusion criteria, or the precise operational definition of 'reporting QA timing.' This omission weakens the evidentiary basis for the claim that timing is systematically neglected.
[Introduction / Central claim] The central claim that annotation pipelines exhibit analogous dynamics to software-engineering defect detection (with early detection costing a fraction of late detection) rests on the Boehm (1981) citation without any discussion of differences in error types, propagation mechanisms, or empirical transfer evidence between code defects and annotation label noise.
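The survey figure in the second comment can be given a rough uncertainty range even before the selection methodology is known. The paper reports only the percentage; 2 of 47 is the only whole-number count that rounds to 4%, so the sketch below assumes that count. The Wilson score interval is a standard choice for small samples and is a technique swapped in here, not something the paper uses.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # ASSUMPTION: 2 of the 47 surveyed papers report QA timing.
    low, high = wilson_interval(2, 47)
    print(f"{low:.3f} to {high:.3f}")
```

Under that assumption the interval runs from roughly 1% to 14%, so the qualitative conclusion (most papers do not report timing) holds, while the exact 4% figure should be read loosely.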

minor comments (2)

[Taxonomy] The abstract and taxonomy section would benefit from a simple workflow diagram showing the T0/T1/T2 trigger points relative to annotation and review stages to improve readability for readers unfamiliar with the pipeline structure.
[Related work] The manuscript could add a short paragraph contrasting the proposed timing focus with existing annotation-quality literature (e.g., inter-annotator agreement studies) to clarify the incremental contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our position paper. The comments help identify opportunities to strengthen the presentation of the model, the survey, and the central analogy. We respond to each major comment below and commit to revisions that address the concerns without altering the core position.

Referee: [Parametric error-propagation model] The parametric error-propagation model is introduced to distinguish cases where timing affects final error rates versus only economics, yet the manuscript provides neither the explicit equations nor the parameter definitions or calibration procedure on annotation data. Without these, it is unclear whether the model yields falsifiable, non-circular predictions or simply restates the software-engineering analogy.

Authors: We appreciate the referee's point that the model requires more formal presentation to be useful. The manuscript introduces the model at a conceptual level to separate timing effects on error rates from purely economic impacts, but we agree that the absence of explicit equations limits clarity. In the revised manuscript we will add the full parametric equations, define all parameters (including stage-specific error introduction rates and propagation multipliers), and outline a calibration procedure drawing on publicly available annotation datasets. This will make the model's predictions falsifiable and demonstrate how it extends the software-engineering analogy rather than merely restating it.

Revision: yes

Referee: [Survey of 47 recent papers] The survey finding that only 4% of 47 recent papers report when validation occurs is used to document a 'striking gap,' but the paper does not describe the paper-selection methodology, search terms, inclusion criteria, or the precise operational definition of 'reporting QA timing.' This omission weakens the evidentiary basis for the claim that timing is systematically neglected.

Authors: We agree that methodological transparency is essential for the survey claim. The current text reports the headline result without detailing the process. We will expand the survey section to include the paper-selection methodology, search terms and databases used, inclusion and exclusion criteria, the time window, and the precise operational definition of 'reporting QA timing' (explicit mention of the pipeline stage at which validation occurs). These additions will make the gap claim reproducible and strengthen its evidentiary value.

Revision: yes

Referee: [Introduction / Central claim] The central claim that annotation pipelines exhibit analogous dynamics to software-engineering defect detection (with early detection costing a fraction of late detection) rests on the Boehm (1981) citation without any discussion of differences in error types, propagation mechanisms, or empirical transfer evidence between code defects and annotation label noise.

Authors: The paper invokes the shift-left principle as a motivating analogy rather than asserting identical mechanisms. We recognize that a brief discussion of differences would improve balance. In the revision we will add a short paragraph acknowledging distinctions, such as error types (systematic label noise versus code defects) and propagation paths (through model training versus runtime execution), while arguing that the documented cost multipliers still provide a compelling rationale for treating timing as a first-class variable in annotation pipelines. This addition will clarify the scope of the analogy without overclaiming direct empirical transfer.

Revision: yes

Circularity Check

0 steps flagged

No circularity; central claims rest on external citations and independent survey

The paper's derivation transfers the shift-left cost-multiplier principle from software engineering via explicit citations to Boehm (1981) and Shull et al. (2002), which are independent external sources rather than self-referential. The taxonomy (T0/T1/T2) and parametric error-propagation model are introduced as forward proposals without equations or data shown to reduce to fitted inputs by construction. The 47-paper survey finding is an empirical observation, not a prediction derived from the model. No self-citations, self-definitional steps, or ansatz smuggling appear in the load-bearing chain; the position remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The position depends on the transferability of software engineering cost-multiplier findings to annotation without new supporting data, plus an unelaborated parametric model.

axioms (1)

Domain assumption: Annotation pipelines exhibit analogous dynamics to software defect detection with 4-100x cost multipliers for late detection. Directly invoked to justify prioritizing early QA timing.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · ArithmeticFromLogic.equivNat

Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: Annotation pipelines exhibit analogous dynamics to software engineering defect detection... (Boehm, 1981)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 1 internal anchor

[1] Artstein, R. and Poesio, M. (2008). Survey article: Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555-596.
[2] Beyer, L., Hénaff, O. J., Kolesnikov, A., Zhai, X., and van den Oord, A. (2020). Are we done with ImageNet? arXiv preprint arXiv:2006.07159.
[3] Boehm, B. W. (1981). Software Engineering Economics. Prentice-Hall.
[4] Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37-46.
[5] Crosby, P. B. (1979). Quality Is Free: The Art of Making Quality Certain. McGraw-Hill.
[6] Daniel, F., Kucherbaev, P., Cappiello, C., Benatallah, B., and Allahbakhsh, M. (2018). Quality control in crowdsourcing: A survey. ACM Computing Surveys, 51(1):1-40.
[7] Dave, A., Khurana, T., Tokmakov, P., Schmid, C., and Ramanan, D. (2020). TAO: A large-scale benchmark for tracking any object. In ECCV.
[8] Dawid, A. P. and Skene, A. M. (1979). Maximum likelihood estimation of observer error-rates. J. Royal Statistical Society C, 28(1):20-28.
[9] Goh, H. W., Tkachenko, U., and Mueller, J. (2022). CROWDLAB: Supervised learning for multi-annotator consensus. In NeurIPS Human in the Loop Learning Workshop, 2022.
[10] Jones, C. (2008). Applied Software Measurement. McGraw-Hill, 3rd edition.
[11] Klie, J.-C., Webber, B., and Gurevych, I. (2023). Annotation error detection: Analyzing past and present. Computational Linguistics, 49:157-198.
[12] Klie, J.-C., Eckart de Castilho, R., and Gurevych, I. (2024). Analyzing dataset annotation quality management. Computational Linguistics, 50(3):817-866.
[13] Kovashka, A., Russakovsky, O., Fei-Fei, L., and Grauman, K. (2016). Crowdsourcing in computer vision. Found. Trends Comput. Graph. Vis., 10(3):177-243.
[14] Krippendorff, K. (2011). Computing Krippendorff's alpha-reliability. Technical report, University of Pennsylvania.
[15] Liker, J. K. (2004). The Toyota Way. McGraw-Hill.
[16] McConnell, S. (2004). Code Complete. Microsoft Press, 2nd edition.
[17] Monarch, R. M. (2021). Human-in-the-Loop Machine Learning. Manning Publications.
[18] Ng, A. (2021). MLOps: From model-centric to data-centric AI. DeepLearning.AI.
[19] Northcutt, C. G., Jiang, L., and Chuang, I. L. (2021a). Confident learning: Estimating uncertainty in labels. JAIR, 70:1373-1411.
[20] Northcutt, C. G., Athalye, A., and Mueller, J. (2021b). Pervasive label errors in test sets. In NeurIPS Datasets & Benchmarks.
[21] OpenAI (2023). GPT-4V(ision) system card. Technical report.
[22] Papadopoulos, D. P., Uijlings, J. R., Keller, F., and Ferrari, V. (2017). Extreme clicking for efficient object annotation. In ICCV.
[23] Rädsch, T., Reinke, A., Weru, V., Tizabi, M. D., Heller, N., Isensee, F., Kopp-Schneider, A., and Maier-Hein, L. (2024). Quality assured: Rethinking annotation strategies in imaging AI. In ECCV, pages 52-69.
[24] Raykar, V. C. et al. (2010). Learning from crowds. JMLR, 11:1297-1322.
[25] Roh, Y., Heo, G., and Whang, S. E. (2019). A survey on data collection for machine learning. IEEE TKDE, 33(4):1328-1347.
[26] Sambasivan, N. et al. (2021). "Everyone wants to do the model work, not the data work." In CHI.
[27] Sculley, D. et al. (2015). Hidden technical debt in ML systems. In NIPS'15.
[28] Shull, F. et al. (2002). What we have learned about fighting defects. In IEEE International Symposium on Software Metrics.
[29] Su, H., Deng, J., and Fei-Fei, L. (2012). Crowdsourcing annotations for visual object detection. In AAAI Workshop.
[30] Vaughan, J. W. (2017). Making better use of the crowd. JMLR, 18(193):1-46.
[31] Voigtlaender, P. et al. (2019). MOTS: Multi-object tracking and segmentation. In CVPR.
[32] Wang, P. et al. (2024). Qwen2-VL: Vision-language model perception. arXiv:2409.12191 [cs.CV].
[33] Whang, S. E., Roh, Y., Song, H., and Lee, J.-G. (2023). Data collection and quality challenges in deep learning. VLDB Journal, 32:791-813.
[34] Yao, A., Gall, J., Leistner, C., and Van Gool, L. (2012). Interactive object detection. In CVPR.
