pith. machine review for the scientific record.

arxiv: 2604.09037 · v1 · submitted 2026-04-10 · 💻 cs.CV · cs.CL · cs.HC

Recognition: unknown

SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:31 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL · cs.HC
keywords SiMing-Bench · procedural correctness · clinical skill videos · multimodal large language models · video understanding · state tracking · rubric-based evaluation

The pith

Multimodal models show weak agreement with physicians on clinical procedure correctness in videos, and global scores hide failures on intermediate steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

SiMing-Bench evaluates whether multimodal large language models can judge procedural correctness by tracking how ongoing interactions update the state of a clinical procedure across full-length videos. The benchmark draws on real examination videos for cardiopulmonary resuscitation, automated external defibrillator use, and bag-mask ventilation, each paired with step-wise rubrics and dual-physician labels. Across open- and closed-source models, agreement with physician judgments remains low. Failures on specific rubric steps persist even when overall procedure scores correlate more closely with experts, showing that coarse global assessment overstates true capability. Further tests with binary judgments and aligned clips confirm the core difficulty lies in modeling continuous state updates rather than localization or scoring granularity.
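To make the setup concrete, here is a minimal sketch of how one benchmark item could be represented, assuming a simple step-wise schema. The field names and types (RubricStep, StepJudgment, per-physician booleans) are illustrative assumptions based only on the description above, not the paper's released format.

```python
# Hedged sketch of a SiMing-Score item: a step-wise rubric with per-step points,
# plus dual-physician correctness labels for one examination video.
from dataclasses import dataclass


@dataclass
class RubricStep:
    step_id: int
    description: str          # e.g. "Checks responsiveness before starting compressions"
    max_points: float


@dataclass
class StepJudgment:
    step_id: int
    physician_a_correct: bool  # dual-expert labels
    physician_b_correct: bool


@dataclass
class SkillVideoItem:
    video_id: str
    skill: str                 # "CPR", "AED", or "bag-mask ventilation"
    rubric: list[RubricStep]
    judgments: list[StepJudgment]

    def total_points(self, annotator: str = "a") -> float:
        """Procedure-level score: sum of points for steps the annotator marked correct."""
        points_by_id = {s.step_id: s.max_points for s in self.rubric}
        return sum(
            points_by_id[j.step_id]
            for j in self.judgments
            if (j.physician_a_correct if annotator == "a" else j.physician_b_correct)
        )
```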

Core claim

Across diverse open- and closed-source MLLMs, agreement with physician judgments on procedural correctness from clinical skill videos is weak. Weak performance on rubric-defined intermediate steps persists even when overall procedure-level correlation appears acceptable, indicating that coarse global assessment substantially overestimates current models' procedural judgment ability.
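A toy numeric illustration, with made-up numbers, of the gap the claim describes: per-step agreement near chance can coexist with strong correlation between summed procedure scores. The metric choices here (Cohen's kappa per step, Spearman correlation on totals) are assumptions about one natural way to quantify the two levels, not the paper's exact protocol.

```python
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-step binary judgments (1 = step performed correctly) for the
# 10 rubric steps of one video: physician vs. model. The model marks roughly the
# right number of steps correct, but often the wrong ones.
physician_steps = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
model_steps     = [1, 1, 1, 1, 1, 0, 0, 1, 1, 0]
print("step-level Cohen's kappa:", cohen_kappa_score(physician_steps, model_steps))  # ~0.05, near chance

# Hypothetical total scores (sum of step points) for six videos.
physician_totals = [18, 14, 22, 9, 16, 20]
model_totals     = [17, 15, 21, 11, 15, 19]
rho, _ = spearmanr(physician_totals, model_totals)
print("procedure-level Spearman rho:", rho)  # ~0.99, looks acceptable despite the step-level failures
```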

What carries the argument

SiMing-Bench is instantiated with SiMing-Score: a physician-annotated dataset of full-length clinical skill videos, each paired with a standardized step-wise rubric and dual-expert labels, used to test whether models can judge if interaction-driven state updates preserve procedural correctness across the workflow.

If this is right

  • Coarse global assessment substantially overestimates MLLMs' procedural judgment ability.
  • The bottleneck is modeling how continuous interactions update procedural state over time, not merely fine-grained scoring or temporal localization (see the sketch after this list).
  • Binary step judgment and step-aligned clips still expose the same limitation, confirming deeper deficits in state tracking.
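A minimal sketch of the state dependence named above, using an invented, simplified AED-style rule set rather than the paper's rubric: the same action is judged correct or incorrect depending on what earlier interactions did to the procedural state.

```python
# Illustration only (not the paper's method): correctness of a later action depends
# on the state built up by earlier interactions, so judging it requires tracking
# that state across the whole video, not recognizing the action in isolation.
from dataclasses import dataclass


@dataclass
class AEDState:
    pads_attached: bool = False
    rhythm_analyzed: bool = False
    bystanders_clear: bool = False


def apply(state: AEDState, action: str) -> tuple[AEDState, bool]:
    """Update the procedural state and report whether the action was correct *now*."""
    if action == "attach_pads":
        state.pads_attached = True
        return state, True
    if action == "analyze_rhythm":
        ok = state.pads_attached              # analysis without pads attached is an error
        state.rhythm_analyzed = ok
        return state, ok
    if action == "clear_bystanders":
        state.bystanders_clear = True
        return state, True
    if action == "deliver_shock":
        # correct only if every prerequisite state update has already happened
        return state, state.pads_attached and state.rhythm_analyzed and state.bystanders_clear
    return state, False


state = AEDState()
for a in ["attach_pads", "deliver_shock", "analyze_rhythm", "clear_bystanders", "deliver_shock"]:
    state, correct = apply(state, a)
    print(f"{a:18s} -> {'correct' if correct else 'error'}")
# The first 'deliver_shock' is an error, the second is correct: same action,
# different verdicts, purely because intervening interactions changed the state.
```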

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Models may require explicit internal mechanisms to maintain and update procedural state representations from ongoing video interactions.
  • The benchmark approach could extend to other procedural domains such as surgical training or industrial processes.
  • Targeted training on videos that emphasize step-wise state changes could address the observed gaps.

Load-bearing premise

Dual-expert physician annotations on the selected videos and rubrics provide reliable, unbiased ground truth for procedural correctness across the full workflow.

What would settle it

An MLLM that achieves high agreement with the dual-expert labels on both overall procedures and individual rubric steps would settle the question, showing that the weak agreement reported here reflects the current state of models rather than an inherent limit on tracking procedural state.

Figures

Figures reproduced from arXiv: 2604.09037 by Cheng Zeng, Jiawei Lin, Jiaxin Huang, Jiayi Xiang, Kailai Yang, Keying Wu, Min Peng, Qianqian Xie, Renxiong Wei, Sophia Ananiadou, Xiyang Huang, Ziyan Kuang.

Figure 1: Comparison between SiMing-Bench and prior video benchmarks. (a) SiMing-Bench is built on full […]
Figure 2: Comparison between the full-video setting and […]
Figure 3: Comparison between the full-video setting and the step-aligned clips setting on binary step judgment. […]
Figure 4: Multi-frame examples of the three emergency care skills considered in this study are shown: cardiopulmonary […]
Figure 5: Representative example of the physician-defined scoring rubric in SiMing-Bench. The left panel presents […]
Figure 6: Standardized CPR rubric defined by physicians and used for step-level annotation in our benchmark; […] and maximum points.
Figure 10: Prompt template for full-video step-wise error detection; […] evidence only and outputs one strict JSON object.
Figure 13: Screenshot of the IRB approval document.
original abstract

Current video benchmarks for multimodal large language models (MLLMs) focus on event recognition, temporal ordering, and long-context recall, but overlook a harder capability required for expert procedural judgment: tracking how ongoing interactions update the procedural state and thereby determine the correctness of later actions. We introduce SiMing-Bench, the first benchmark for evaluating this capability from full-length clinical skill videos. It targets rubric-grounded process-level judgment of whether interaction-driven state updates preserve procedural correctness across an entire workflow. SiMing-Bench is instantiated with SiMing-Score, a physician-annotated dataset of real clinical skill examination videos spanning cardiopulmonary resuscitation, automated external defibrillator operation, and bag-mask ventilation, each paired with a standardized step-wise rubric and dual-expert labels. Across diverse open- and closed-source MLLMs, we observe consistently weak agreement with physician judgments. Moreover, weak performance on rubric-defined intermediate steps persists even when overall procedure-level correlation appears acceptable, suggesting that coarse global assessment substantially overestimates current models' procedural judgment ability. Additional analyses with binary step judgment and step-aligned clips indicate that the bottleneck is not merely fine-grained scoring or temporal localization, but modeling how continuous interactions update procedural state over time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces SiMing-Bench, the first benchmark for evaluating MLLMs on rubric-grounded procedural correctness in full-length clinical skill videos (CPR, AED operation, bag-mask ventilation). It pairs videos with standardized step-wise rubrics and dual-expert physician labels, then reports consistently weak MLLM agreement with physicians; intermediate-step performance remains weak even when procedure-level correlation appears acceptable, concluding that coarse global assessment substantially overestimates current models' ability to track interaction-driven state updates.

Significance. If the central findings hold after addressing annotation reliability, the work provides a valuable new resource for assessing a previously overlooked capability in video MLLMs—dynamic procedural-state modeling from continuous interactions—which has direct relevance to medical training, simulation, and AI-assisted skill evaluation. The step-aligned analyses and binary-judgment experiments help isolate the bottleneck beyond simple localization or granularity.

major comments (1)
  1. [Dataset construction paragraph / §3] The manuscript describes dual-expert physician annotations on rubrics and videos but reports no inter-rater reliability statistics (Cohen’s kappa, percentage agreement, or a disagreement-resolution protocol) for the step-wise judgments. Without these metrics, the headline claim that weak model–physician agreement demonstrates a genuine procedural-state-modeling deficit (rather than label noise) cannot be evaluated; merely moderate inter-expert agreement on intermediate steps would directly undermine the conclusion that coarse global assessment “substantially overestimates” model capabilities.
minor comments (2)
  1. [Abstract] The abstract asserts “consistently weak agreement” and “weak performance on rubric-defined intermediate steps” without any numerical values, dataset size, number of videos, or model names, forcing readers to consult the full text for even basic evidence strength.
  2. [Results / Experiments] The paper would benefit from an explicit table or figure summarizing inter-model agreement scores, step-level vs. procedure-level correlations, and the exact number of videos/rubrics per procedure to allow direct comparison with future work.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The primary concern regarding the lack of inter-rater reliability statistics is addressed point-by-point below. We will revise the manuscript to incorporate the requested metrics and protocol details.

point-by-point responses
  1. Referee: [Dataset construction paragraph / §3] The manuscript describes dual-expert physician annotations on rubrics and videos but reports no inter-rater reliability statistics (Cohen’s kappa, percentage agreement, or a disagreement-resolution protocol) for the step-wise judgments. Without these metrics, the headline claim that weak model–physician agreement demonstrates a genuine procedural-state-modeling deficit (rather than label noise) cannot be evaluated; merely moderate inter-expert agreement on intermediate steps would directly undermine the conclusion that coarse global assessment “substantially overestimates” model capabilities.

    Authors: We agree that reporting inter-rater reliability is essential to substantiate the reliability of the dual-expert labels and to rule out label noise as an explanation for the observed model-physician discrepancies. The original manuscript describes the dual-expert annotation process in §3 but omits the quantitative agreement metrics and resolution protocol. In the revised version, we will add Cohen’s kappa, percentage agreement (both overall and stratified by procedure and step type), and a description of how disagreements were resolved. These statistics will be computed directly from the existing dual annotations and presented in the dataset construction section to allow readers to evaluate label quality, particularly for intermediate steps. This addition will support rather than undermine our central claim. revision: yes
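A brief sketch of how the promised statistics could be computed from the dual annotations, assuming a simple long-format table; the column names and toy values are placeholders for illustration, not the paper's actual annotation schema.

```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score

# Assumed long-format annotation table: one row per (procedure, rubric step), with
# each physician's binary correctness judgment. Values are made up for illustration.
df = pd.DataFrame({
    "procedure":   ["CPR", "CPR", "AED", "AED", "BMV", "BMV"],
    "step_id":     [1, 2, 1, 2, 1, 2],
    "physician_a": [1, 0, 1, 1, 0, 1],
    "physician_b": [1, 0, 1, 0, 0, 1],
})


def agreement_stats(g: pd.DataFrame) -> pd.Series:
    """Percentage agreement and Cohen's kappa for one slice of the annotations."""
    return pd.Series({
        "percent_agreement": (g["physician_a"] == g["physician_b"]).mean(),
        "cohens_kappa": cohen_kappa_score(g["physician_a"], g["physician_b"]),
    })


print("overall:")
print(agreement_stats(df))
print("stratified by procedure:")
print(df.groupby("procedure")[["physician_a", "physician_b"]].apply(agreement_stats))
```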

Circularity Check

0 steps flagged

New benchmark and empirical evaluation with no circular derivation chain

full rationale

The paper introduces SiMing-Bench as a new physician-annotated dataset of clinical skill videos paired with step-wise rubrics and dual-expert labels, then reports direct empirical observations of MLLM agreement with those labels. No mathematical derivations, equations, fitted parameters, or predictions are claimed; the central results (weak agreement, intermediate-step failures despite acceptable global correlation) are observational comparisons against the newly created ground truth rather than reductions to self-referential inputs or prior self-citations. The work is self-contained as a benchmark contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim rests on new data creation and model testing; only abstract-level details are available.

axioms (1)
  • domain assumption: Physician dual-expert annotations constitute reliable ground truth for procedural correctness
    Benchmark evaluation treats these labels as the reference standard against which MLLM outputs are measured.

pith-pipeline@v0.9.0 · 5557 in / 1204 out tokens · 85700 ms · 2026-05-10T17:31:41.851782+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

43 extracted references · 19 canonical work pages · 6 internal anchors

  1. [1]

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, and 1 others. 2025. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631

  2. [2]

    Delong Chen, Theo Moutakanni, Willy Chung, Yejin Bang, Ziwei Ji, Allen Bolourchi, and Pascale Fung. 2025. Planning with reasoning using vision language world model. arXiv preprint arXiv:2509.02722

  3. [3]

    Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, Weishi Mi, Kevin Zhang, Zhiyuan Qin, Wanxin Tian, Kuangzhi Ge, Hao Li, and 1 others. 2025. Wow: Towards a world omniscient world model through embodied interaction. arXiv preprint arXiv:2509.22642

  4. [4]

    Lauren Chong, Silas Taylor, Matthew Haywood, Barbara-Ann Adelstein, and Boaz Shulruf. 2017. The sights and insights of examiners in objective structured clinical examinations. Journal of educational evaluation for health professions, 14

  5. [5]

    Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, and 1 others. 2026. Molmo2: Open weights and data for vision-language models with video understanding and grounding. arXiv preprint arXiv:2601.10611

  6. [6]

    Michael D Cusimano, Robert Cohen, William Tucker, John Murnaghan, Ron Kodama, and Richard Reznick. 1994. A comparative analysis of the costs of administration of an osce (objective structured clinical examination). Academic Medicine, 69(7):571--6

  7. [7]

    Ronald M Epstein and Edward M Hundert. 2002. Defining and assessing professional competence. Jama, 287(2):226--235

  8. [8]

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, and 1 others. 2025. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24108--24118

  9. [9]

    Zhiqi Ge, Hongzhe Huang, Mingze Zhou, Juncheng Li, Guoming Wang, Siliang Tang, and Yueting Zhuang. 2024. Worldgpt: Empowering llm as multimodal world model. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 7346--7355

  10. [10]

    Google. 2025. A new era of intelligence with gemini 3. https://blog.google/products-and-platforms/products/gemini/gemini-3/. Accessed: 2026-03-15

  11. [11]

    Ronald M Harden, Mary Stevenson, W Wilson Downie, and GM Wilson. 1975. Assessment of clinical competence using objective structured examination. Br Med J, 1(5955):447--451

  12. [12]

    David Hope and Helen Cameron. 2015. Examiners are most lenient at the start of a two-day osce. Medical Teacher, 37(1):81--85

  13. [13]

    Kamran Z Khan, Sankaranarayanan Ramachandran, Kathryn Gaunt, and Piyush Pushkar. 2013. The objective structured clinical examination (osce): Amee guide no. 81. part i: an historical and theoretical perspective. Medical teacher, 35(9):e1437--e1446

  14. [14]

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and 1 others. 2024a. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326

  15. [15]

    Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, and 1 others. 2025a. Viewspatial-bench: Evaluating multi-perspective spatial localization in vision-language models. arXiv preprint arXiv:2505.21500

  16. [16]

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, and 1 others. 2024b. Mvbench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195--22206

  17. [17]

    Yaoqian Li, Xikai Yang, Dunyuan Xu, Yang Yu, Litao Zhao, Xiaowei Hu, Jinpeng Li, and Pheng-Ann Heng. 2025b. Surgpub-video: A comprehensive surgical video dataset for enhanced surgical intelligence in vision-language model. arXiv preprint arXiv:2508.10054

  18. [18]

    Ye Liu, Zongyang Ma, Zhongang Qi, Yang Wu, Ying Shan, and Chang W Chen. 2024. Et bench: Towards open-ended event-level video-language understanding. Advances in Neural Information Processing Systems, 37:32076--32110

  19. [19]

    Weiheng Lu, Jian Li, An Yu, Ming-Ching Chang, Shengpeng Ji, and Min Xia. 2024. Llava-mr: Large language-and-vision assistant for video moment retrieval. arXiv preprint arXiv:2411.14505

  20. [20]

    David Ma, Huaqing Yuan, Xingjian Wang, Qianbo Zang, Tianci Liu, Xinyang He, Yanbin Wei, Jiawei Guo, Ni Jiahui, Zhenzhu Yang, and 1 others. 2025. Scalelong: A multi-timescale benchmark for long video understanding. arXiv preprint arXiv:2505.23922

  21. [21]

    Kevin McLaughlin, Martha Ainslie, Sylvain Coderre, Bruce Wright, and Claudio Violato. 2009. The effect of differential rater function over time (drift) on objective structured clinical examination ratings. Medical education, 43(10):989--992

  22. [22]

    George E Miller. 1990. The assessment of clinical skills/competence/performance. Academic medicine, 65(9):S63--7

  23. [23]

    Sahal Shaji Mullappilly, Mohammed Irfan Kurpath, Omair Mohamed, Mohamed Zidan, Fahad Khan, Salman Khan, Rao Anwer, and Hisham Cholakkal. 2026. Medix-r1: Open ended medical reinforcement learning. arXiv preprint arXiv:2602.23363

  24. [24]

    Junzhi Ning, Wei Li, Cheng Tang, Jiashi Lin, Chenglong Ma, Chaoyang Zhang, Jiyao Liu, Ying Chen, Shujian Gao, Lihao Liu, Yuandong Pu, Huihui Xu, Chenhui Gou, Ziyan Huang, Yi Xin, Qi Qin, Zhongying Deng, Diping Song, Bin Fu, and 8 others. 2025. Unimedvl: Unifying medical multimodal understanding and generation through obser... https://arxiv.org/abs/2510.15710

  25. [25]

    OpenAI. 2024. Hello gpt-4o. https://openai.com/index/hello-gpt-4o/. Accessed: 2026-03-15

  26. [26]

    OpenAI. 2025. Introducing gpt-5.2. https://openai.com/index/introducing-gpt-5-2/. Accessed: 2026-03-15

  27. [27]

    Chiara Plizzari, Alessio Tonioni, Yongqin Xian, Achin Kulshrestha, and Federico Tombari. 2025. Omnia de egotempo: Benchmarking temporal understanding of multi-modal llms in egocentric videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24129--24138

  28. [28]

    Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, and 1 others. 2025. Medgemma technical report. arXiv preprint arXiv:2507.05201

  29. [29]

    Shiva Sreeram, Tsun-Hsuan Wang, Alaa Maalouf, Guy Rosman, Sertac Karaman, and Daniela Rus. 2025. Probing multimodal llms as world models for driving. IEEE Robotics and Automation Letters

  30. [30]

    Cees PM Van Der Vleuten and Lambert WT Schuwirth. 2005. Assessing professional competence: from methods to programmes. Medical education, 39(3):309--317

  31. [31]

    Volcano Engine. 2025. Doubao-seed-1.6-vision. https://www.volcengine.com/docs/82379/1330310. Official model documentation page, Accessed: 2026-03-15

  32. [32]

    Merrilyn Walton, Helen Woodward, Samantha Van Staalduinen, Claire Lemer, Felix Greaves, Douglas Noble, Benjamin Ellis, Liam Donaldson, Bruce Barraclough, and as Expert Lead for the Sub-Programme Expert Group convened by the World Alliance of Patient Safety. 2011. Republished paper: The who patient safety curriculum guide for medical schools. Postgraduate ...

  33. [33]

    Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Ming Ding, Xiaotao Gu, Shiyu Huang, Bin Xu, and 1 others. 2025a. Lvbench: An extreme long video understanding benchmark. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22958--22967

  34. [34]

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, and 1 others. 2025b. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265

  35. [35]

    Yuxuan Wang, Yueqian Wang, Bo Chen, Tong Wu, Dongyan Zhao, and Zilong Zheng. 2025c. Omnimmi: A comprehensive multi-modal interaction benchmark in streaming video contexts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18925--18935

  36. [36]

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, and 1 others. 2025. Qwen3-omni technical report. arXiv preprint arXiv:2509.17765

  37. [37]

    Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, and 1 others. 2025. Mmsi-bench: A benchmark for multi-image spatial intelligence. arXiv preprint arXiv:2505.23764

  38. [38]

    Daichi Yashima, Shuhei Kurita, Yusuke Oda, and Komei Sugiura. 2026. Remora: Multimodal large language model based on refined motion representation for long-video understanding. arXiv preprint arXiv:2602.16412

  39. [39]

    Xiangyu Zeng, Kunchang Li, Chenting Wang, Xinhao Li, Tianxiang Jiang, Ziang Yan, Songze Li, Yansong Shi, Zhengrong Yue, Yi Wang, and 1 others. 2024. Timesuite: Improving mllms for long video understanding via grounded tuning. arXiv preprint arXiv:2410.19702

  40. [40]

    Zhitao Zeng, Zhu Zhuo, Xiaojun Jia, Erli Zhang, Junde Wu, Jiaan Zhang, Yuxuan Wang, Chang Han Low, Jian Jiang, Zilong Zheng, and 1 others. 2025. Surgvlm: A large vision-language model and systematic evaluation benchmark for surgical intelligence. arXiv preprint arXiv:2506.02555

  41. [41]

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. 2024. Llava-video: Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...