PreCall: A Visual Interface for Threshold Optimization in ML Model Selection

Aaron Halfaker; Christoph Kinkeldey; Claudia Müller-Birn; Jesse Josua Benjamin; Tom Gülenman

arxiv: 1907.05131 · v1 · pith:5K4NXM6Qnew · submitted 2019-07-11 · 💻 cs.LG · cs.HC

Christoph Kinkeldey, Claudia Müller-Birn, Tom Gülenman, Jesse Josua Benjamin, Aaron Halfaker

Pith reviewed 2026-05-24 22:56 UTC · model grok-4.3

classification 💻 cs.LG cs.HC

keywords interactive visualization · threshold optimization · machine learning interpretability · model configuration · performance metrics · classification interface · user-centered design

The pith

An interactive interface visualizes how adjusting a classification threshold changes precision, recall, and expected application results in a machine learning system.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a prototype that displays the direct relationship between a tunable threshold value and standard performance measures such as recall and precision. It adds a second view that renders the probable distribution of outcomes under the chosen setting. The goal is to let people without machine-learning training translate practical requirements into suitable model parameters. If the visualizations succeed, they reduce the gap between what an application needs and what the model can deliver.

Core claim

PreCall closes the translation gap between application requirements and model parameters by interactively visualizing the relationship between major model metrics (recall, precision, false positive rate) and a parameter (the threshold between valuable and damaging edits) while also showing the probable results for the current model configuration.
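The relationship the claim refers to can be illustrated with a minimal sketch. It assumes a classifier that emits a score per edit and treats scores at or above the threshold as "damaging"; the function name `metrics_at`, the `Metrics` record, and the toy scores and labels are hypothetical and are not taken from the paper or from ORES.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Metrics:
    threshold: float
    precision: float
    recall: float
    false_positive_rate: float


def metrics_at(scores: Sequence[float], labels: Sequence[bool], threshold: float) -> Metrics:
    """Compute precision, recall, and false positive rate at one threshold.

    Assumption: an item counts as predicted 'damaging' when score >= threshold.
    """
    tp = fp = fn = tn = 0
    for score, is_damaging in zip(scores, labels):
        predicted = score >= threshold
        if predicted and is_damaging:
            tp += 1
        elif predicted and not is_damaging:
            fp += 1
        elif not predicted and is_damaging:
            fn += 1
        else:
            tn += 1
    # Convention chosen here: precision is 1.0 when nothing is flagged.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return Metrics(threshold, precision, recall, fpr)


# Hypothetical toy data: ten edits with model scores and true labels.
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.85, 0.95]
labels = [False, False, False, False, True, False, True, True, False, True]

# Sweep the threshold from 0.0 to 1.0; these rows are what a metric-vs-threshold plot would draw.
curve: List[Metrics] = [metrics_at(scores, labels, t / 10) for t in range(11)]
for row in curve:
    print(f"{row.threshold:.1f}  P={row.precision:.2f}  R={row.recall:.2f}  FPR={row.false_positive_rate:.2f}")
```

In this sketch recall and false positive rate can only fall as the threshold rises, while precision may move in either direction, which is one reason a plot is easier to read than a single number.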

What carries the argument

Interactive plots that map performance metrics to the decision threshold together with previews of likely outcomes under that threshold.

If this is right

Domain experts can map real-world needs directly onto model settings without first learning formal parameter spaces.
Trade-offs among recall, precision, and false-positive rate become visible as the threshold moves.
Users can preview how a chosen setting would classify future items before deployment.
The same visual linkage supports decisions at different levels of automation.
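The consequences listed above can be sketched in code under stated assumptions. The example takes a hypothetical requirement ("at least 75% of flagged edits must really be damaging"), finds the lowest threshold that satisfies it, and previews how the labeled sample would be split at that setting. All function names and data are illustrative and do not come from PreCall or ORES.

```python
from typing import Dict, Optional, Sequence, Tuple


def confusion_counts(scores: Sequence[float], labels: Sequence[bool],
                     threshold: float) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn), treating score >= threshold as 'damaging'."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and not y)
    return tp, fp, fn, tn


def lowest_threshold_with_precision(scores: Sequence[float], labels: Sequence[bool],
                                    min_precision: float) -> Optional[float]:
    """Scan candidate thresholds in ascending order and return the first one
    whose precision meets the target. Because recall never increases as the
    threshold rises, the lowest qualifying threshold also has the highest
    recall among qualifying candidates. Returns None if no candidate qualifies.
    """
    for t in sorted(set(scores)):
        tp, fp, _, _ = confusion_counts(scores, labels, t)
        if tp + fp > 0 and tp / (tp + fp) >= min_precision:
            return t
    return None


def preview_outcomes(scores: Sequence[float], labels: Sequence[bool],
                     threshold: float) -> Dict[str, int]:
    """Summarize how the labeled sample would be split at this threshold."""
    tp, fp, fn, tn = confusion_counts(scores, labels, threshold)
    return {"flagged_correctly": tp, "flagged_wrongly": fp,
            "missed": fn, "passed_correctly": tn}


# Hypothetical toy data: ten edits with model scores and true labels.
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.85, 0.95]
labels = [False, False, False, False, True, False, True, True, False, True]

chosen = lowest_threshold_with_precision(scores, labels, min_precision=0.75)
preview = preview_outcomes(scores, labels, chosen) if chosen is not None else None
print(chosen, preview)
```

A full scan is used rather than a binary search because precision is not guaranteed to change monotonically with the threshold.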

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same metric-to-outcome plots could be reused for threshold decisions in other content-filtering or moderation systems.
Adding controls for additional parameters beyond the single threshold would test whether the visualization approach scales.
Logging the thresholds users actually select during real tasks would show whether the interface changes behavior compared with numeric-only tools.

Load-bearing premise

Showing users interactive plots of metrics versus threshold and simulated results will improve their understanding and ability to choose appropriate settings.

What would settle it

A controlled study in which participants given the interface select thresholds no more accurately or consistently than participants given only numeric metric tables.

Original abstract

Machine learning systems are ubiquitous in various kinds of digital applications and have a huge impact on our everyday life. But a lack of explainability and interpretability of such systems hinders meaningful participation by people, especially by those without a technical background. Interactive visual interfaces (e.g., providing means for manipulating parameters in the user interface) can help tackle this challenge. In this paper we present PreCall, an interactive visual interface for ORES, a machine learning-based web service for Wikimedia projects such as Wikipedia. While ORES can be used for a number of settings, it can be challenging to translate requirements from the application domain into formal parameter sets needed to configure the ORES models. Assisting Wikipedia editors in finding damaging edits, for example, can be realized at various stages of automatization, which might impact the precision of the applied model. Our prototype PreCall attempts to close this translation gap by interactively visualizing the relationship between major model metrics (recall, precision, false positive rate) and a parameter (the threshold between valuable and damaging edits). Furthermore, PreCall visualizes the probable results for the current model configuration to improve the human's understanding of the relationship between metrics and outcome when using ORES. We describe PreCall's components and present a use case that highlights the benefits of our approach. Finally, we pose further research questions we would like to discuss during the workshop.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

PreCall is a clear prototype description for visualizing ORES thresholds but adds no evaluation or new findings.

PreCall is an interactive interface that plots how recall, precision, and false positive rate shift with the classification threshold on ORES models, plus a view of likely edit outcomes under the current setting. The paper walks through the components and gives one Wikipedia use case for spotting damaging edits. That part is straightforward and relevant to the specific metrics ORES exposes. The motivation, helping non-technical users translate domain needs into model parameters, is stated plainly and without exaggeration.

The soft spot is the complete absence of any evaluation: no user study, no before-and-after measures, not even informal feedback on whether the visualizations improve understanding or decision quality. The text stays at the level of design description and ends with open questions for a workshop. Similar threshold visualization ideas already exist in HCI work on model configuration, so the contribution here is mainly the application to this one system.

This paper is for people who build or study interfaces for ML in collaborative platforms. A reader looking for concrete design examples around ORES might pick up a few layout ideas, but the paper supplies no data or validated claims to cite or extend. I would not send it for peer review as a full paper; it reads like workshop material that could spark discussion but does not meet the bar for a refereed contribution.

Referee Report

2 major / 0 minor

Summary. The manuscript presents PreCall, an interactive visual interface for the ORES ML service in Wikimedia projects. It visualizes the effects of varying the decision threshold on metrics including recall, precision, and false positive rate, displays probable outcomes for a given configuration, describes the interface components, provides a use case, and poses open research questions for workshop discussion. The work aims to help non-technical users translate domain requirements into model parameters.

Significance. If the visualizations succeed in improving user understanding of metric-threshold relationships, the design could meaningfully support participatory configuration of ML systems in high-impact settings such as Wikipedia edit quality assessment. The prototype contributes a concrete example of interactive visualization for threshold tuning, which is relevant to the broader area of human-in-the-loop ML and explainable interfaces.

major comments (2)

[Abstract] Abstract: The text states that PreCall is intended 'to improve the human's understanding of the relationship between metrics and outcome when using ORES', yet the only supporting material is a descriptive use case. No user study, A/B comparison, task-performance metrics, or qualitative feedback is reported to substantiate any improvement in understanding or configuration quality.
[Use case] Use-case description: The narrative highlights benefits of the interface but supplies no baseline (e.g., existing ORES configuration tools), no measurement of user accuracy or time, and no falsifiable prediction about when the visualizations would or would not help. This leaves the central claim about closing the 'translation gap' without empirical grounding.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. This is a workshop paper presenting a prototype interface and illustrative use case to stimulate discussion on human-in-the-loop ML configuration, rather than reporting a completed empirical evaluation. We address the points below and will make revisions to clarify scope and intent.

Referee: [Abstract] Abstract: The text states that PreCall is intended 'to improve the human's understanding of the relationship between metrics and outcome when using ORES', yet the only supporting material is a descriptive use case. No user study, A/B comparison, task-performance metrics, or qualitative feedback is reported to substantiate any improvement in understanding or configuration quality.

Authors: We agree that the manuscript provides no empirical evidence of improved understanding. The abstract phrasing describes the intended design goal of the prototype rather than a demonstrated result. As a workshop submission, the focus is on the interface design, components, and open questions rather than evaluation. We will revise the abstract to explicitly state that PreCall is a prototype intended to support understanding of metric-threshold relationships, that the use case is illustrative, and that empirical validation remains future work.

revision: yes

Referee: [Use case] Use-case description: The narrative highlights benefits of the interface but supplies no baseline (e.g., existing ORES configuration tools), no measurement of user accuracy or time, and no falsifiable prediction about when the visualizations would or would not help. This leaves the central claim about closing the 'translation gap' without empirical grounding.

Authors: The use case is presented as a narrative scenario to demonstrate how the visualizations could be applied in Wikipedia edit quality assessment and to surface research questions for workshop discussion. It does not constitute an empirical claim or evaluation. We acknowledge the lack of baselines, measurements, or falsifiable predictions. In revision we will add explicit language noting that the use case is hypothetical and exploratory, and that the paper does not claim validated improvements in closing the translation gap.

revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper is a descriptive prototype design contribution with no derivation chain, equations, predictions, or fitted parameters. It presents an interface for visualizing ORES metrics vs. threshold and poses open research questions. No load-bearing steps exist that could reduce to self-definition, fitted inputs, or self-citations. The contribution is explicitly framed as design description rather than any asserted result derived from inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a design and prototype paper with no mathematical models, fitted parameters, or new theoretical constructs.

pith-pipeline@v0.9.0 · 5794 in / 932 out tokens · 50225 ms · 2026-05-24T22:56:20.591669+00:00 · methodology
