pith. machine review for the scientific record.

arxiv: 2605.10341 · v1 · submitted 2026-05-11 · 💻 cs.AI · cs.SE

Recognition: no theorem link

PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:37 UTC · model grok-4.3

classification 💻 cs.AI cs.SE
keywords LaTeX typesetting · visual optimization · document layout · vision-language agent · PDF generation · typesetting defects · scientific documents

The pith

Vision-in-the-loop optimization turns compilable LaTeX sources into publication-ready PDFs by iteratively diagnosing and repairing visual defects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LaTeX manuscripts that compile without errors frequently produce PDFs with misplaced floats, overflowing equations, inconsistent tables, widow and orphan lines, and poor page balance. Rule-based tools operate only on source and logs, while text-only models edit without seeing the two-dimensional layout consequences of their changes. The paper formalizes Visual Typesetting Optimization as the task of using visual verification after each edit to reach polished, page-budget-compliant output. PaperFit implements this with an agent that renders pages, classifies defects under a five-category taxonomy, and applies constrained source repairs. Experiments on PaperFit-Bench, 200 papers spanning ten venue templates and thirteen defect types, show the closed-loop method outperforming the baselines.

Core claim

Visual Typesetting Optimization (VTO) is the task of transforming a compilable LaTeX paper into a visually polished, page-budget-compliant PDF through iterative visual verification and source-level revision, guided by a five-category taxonomy of typesetting defects.
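Read as constrained search, that task statement has a compact formalization. The notation below is editorial shorthand, not the paper's: S0 is the compilable input source, E the set of admissible constrained edits, B the page budget, and D(.) counts taxonomy defects visible in a render.

```latex
% Editorial formalization of VTO; the notation is ours, not the paper's.
\[
  S^{\star} = \operatorname*{arg\,min}_{S \in \mathcal{R}(S_0)}
    D\bigl(\mathrm{render}(S)\bigr)
  \quad \text{s.t.} \quad \mathrm{compiles}(S),\; \mathrm{pages}(S) \le B,
\]
\[
  \mathcal{R}(S_0) = \bigl\{ (e_k \circ \cdots \circ e_1)(S_0) :
    e_i \in \mathcal{E},\, k \le K \bigr\},
  \quad K = \text{iteration budget of the closed loop.}
\]
```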

What carries the argument

The vision-in-the-loop agent that renders PDF pages, diagnoses defects from images using the five-category taxonomy, and applies constrained repairs to the LaTeX source.
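In code, one turn of that loop has a simple shape. A minimal editorial sketch, assuming the five categories enumerated in the referee summary below; every helper (render_pdf_pages, vlm_diagnose, apply_constrained_edit) is a hypothetical stand-in, not PaperFit's actual interface.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Defect(Enum):
    # The five taxonomy categories as enumerated in the referee summary.
    MISPLACED_FLOAT = auto()
    OVERFLOWING_EQUATION = auto()
    INCONSISTENT_SCALING = auto()
    WIDOW_OR_ORPHAN = auto()
    POOR_PAGE_BALANCE = auto()

@dataclass
class Finding:
    page: int         # 1-indexed page where the defect is visible
    category: Defect  # taxonomy label assigned from the rendered image
    note: str         # free-text diagnosis used to target the source edit

def one_turn(tex_source: str) -> tuple[str, list[Finding]]:
    """Render pages, diagnose visually, apply one constrained source repair."""
    page_images = render_pdf_pages(tex_source)   # compile, then rasterize
    findings: list[Finding] = vlm_diagnose(page_images)
    if not findings:
        return tex_source, []                    # visually clean: no edit needed
    revised = apply_constrained_edit(tex_source, findings[0])
    return revised, findings
```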

If this is right

  • Authors can reduce repetitive compile-inspect-edit cycles for scientific documents.
  • Automated document pipelines gain a missing stage that enforces visual quality alongside compilability.
  • The PaperFit-Bench benchmark enables systematic comparison of future vision-based typesetting systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same render-diagnose-repair loop could be applied to other markup languages that produce visual output.
  • Pairing the agent with generative models might support end-to-end creation of layout-compliant documents from outlines.
  • Extending the defect taxonomy to additional categories would allow testing on more complex multi-page scientific layouts.

Load-bearing premise

The vision model can reliably diagnose the five defect categories from rendered images, and the constrained source repairs resolve defects without introducing new ones or violating page budgets.
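That premise implies a checkable acceptance gate around every repair. A minimal sketch, with compile_ok, page_count, and count_defects as hypothetical helpers standing in for whatever the system actually uses.

```python
def accept_edit(old_src: str, new_src: str, page_budget: int) -> bool:
    """Gate a candidate repair: compilable, within budget, strictly fewer defects."""
    if not compile_ok(new_src):              # never accept a broken build
        return False
    if page_count(new_src) > page_budget:    # enforce the page budget after every edit
        return False
    # The load-bearing premise: a repair must not introduce new defects.
    return count_defects(new_src) < count_defects(old_src)
```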

What would settle it

A collection of LaTeX papers on which the vision model misclassifies defects or the repairs introduce new layout problems, yielding no net improvement, or outright degradation, on visual quality metrics.
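That test is mechanical once a corpus and a quality metric are fixed. A sketch under those assumptions; quality_score and run_paperfit are placeholders, not real APIs.

```python
def net_improvement(corpus: list[str]) -> float:
    """Mean visual-quality delta across papers; a non-positive value on a
    well-chosen corpus would count decisively against the claim."""
    deltas = [quality_score(run_paperfit(src)) - quality_score(src)
              for src in corpus]
    return sum(deltas) / len(deltas)
```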

Figures

Figures reproduced from arXiv: 2605.10341 by Bihui Yu, Caijun Jia, Cheng Tan, Conghui He, Jiabei Cheng, Jingxuan Wei, Junjie Jiang, Siyuan Li, Xinglong Xu.

Figure 1. Comparison of typesetting optimization approaches: (a) Rule-based tools are blind to visuals; …
Figure 2. Perturbation distribution and category composition. The inner ring shows proportions of …
Figure 3. Overview of the PaperFit pipeline. PaperFit diagnoses layout defects from source, log, PDF, …
Figure 4. Fine-grained VLM scores for the LLM backend comparison. Panel (a) reports repair and …
Figure 5. Venue-level VLM score distribution for the LLM backend comparison.
Figure 6. Human/VLM evaluation correlation.
Figure 7. Case Study: Realigning Tables and Figures with In-Text Citations.
Figure 8. Case Study: Fixing Page Budget Shift and Underfilled Pages.
Figure 9. Case Study: Aesthetic Detail Refinement.
Figure 10. Case Study: Template Migration.
Figure 11. Error analysis: page-budget violations. Case A: Page-budget gate failed; target 10 pages, …
Figure 12. Error analysis: visual defects and invalid output. Case C: Visual defects remain unrepaired; …
Figure 13. Prompt template used for TextST source-only LaTeX repair. The method receives only the main TeX source, applies a small local source edit when safe, and records a boundary report for traceability.
Figure 14. Prompt template used for TextMR source-and-log LaTeX repair. The method augments source-only editing with compile-log feedback while still excluding rendered page images.
Figure 15. Prompt template used for VisualST single-turn visual repair. The method receives rendered page images and source context, then performs one constrained visual edit without a structured repair workflow.
Figure 16. Prompt template used for VisualMR fixed-round visual agent repair. The method can iterate over source, logs, and rendered pages for a fixed round budget while explicitly excluding PaperFit structured artifacts.
Figure 17. Prompt template used for PaperFit structured repair. The method injects the VTO taxonomy, repair priority, forbidden operations, and checklist quality gate used by the proposed closed-loop system.
original abstract

A LaTeX manuscript that compiles without error is not necessarily publication-ready. The resulting PDFs frequently suffer from misplaced floats, overflowing equations, inconsistent table scaling, widow and orphan lines, and poor page balance, forcing authors into repetitive compile-inspect-edit cycles. Rule-based tools are blind to rendered visuals, operating only on source code and log files. Text-only LLMs perform open-loop text editing, unable to predict or verify the two-dimensional layout consequences of their changes. Reliable typesetting optimization therefore requires a visual closed loop with verification after every edit. We formalize this problem as Visual Typesetting Optimization (VTO), the task of transforming a compilable LaTeX paper into a visually polished, page-budget-compliant PDF through iterative visual verification and source-level revision, and introduce a five-category taxonomy of typesetting defects to guide diagnosis. We present PaperFit, a vision-in-the-loop agent that iteratively renders pages, diagnoses defects, and applies constrained repairs. To benchmark VTO, we construct PaperFit-Bench with 200 papers across 10 venue templates and 13 defect types at different difficulty. Extensive experiments show that PaperFit outperforms all baselines by a large margin, establishing that bridging the gap from compilable source to publication-ready PDF requires vision-in-the-loop optimization and that VTO constitutes a critical missing stage in the document automation pipeline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript formalizes Visual Typesetting Optimization (VTO) as the task of iteratively transforming a compilable LaTeX source into a visually polished, page-budget-compliant PDF via visual rendering, defect diagnosis, and constrained source repairs. It introduces a five-category taxonomy of typesetting defects (misplaced floats, overflowing equations, inconsistent scaling, widows/orphans, poor page balance), presents the PaperFit vision-in-the-loop agent, and constructs PaperFit-Bench (200 papers, 10 templates, 13 defect types). Experiments are claimed to show PaperFit outperforming baselines by a large margin, establishing VTO as a necessary stage in document automation.

Significance. If the empirical claims hold with rigorous metrics, the work would introduce a novel closed-loop application of vision-language models to a practical pain point in scientific publishing, potentially reducing author time on manual typesetting fixes. The taxonomy and benchmark could provide reusable infrastructure for future VTO research. The distinction from open-loop text editing and rule-based tools is conceptually clear. However, the current lack of quantitative validation on diagnosis accuracy and repair side-effects limits the assessed significance.

major comments (3)
  1. [Abstract] The claim that 'PaperFit outperforms all baselines by a large margin' is unsupported by any reported metrics (e.g., defect reduction rates, diagnosis precision/recall, or post-repair defect introduction rates). Without these numbers or references to specific tables/figures, the central necessity argument for vision-in-the-loop cannot be evaluated.
  2. [Experiments] No error analysis, failure cases, or quantitative results are provided on whether the vision model reliably diagnoses the five defect categories from rendered images, or whether constrained repairs resolve defects without introducing new ones or violating page budgets. These are load-bearing for the weakest assumption identified in the stress-test note (a sketch of the per-category evaluation this calls for follows the minor comments).
  3. [Method] The precise mechanism for 'constrained repairs', and how page-budget compliance is enforced after each visual verification step, is not detailed enough to assess whether the loop is guaranteed to terminate or to remain compilable.
minor comments (2)
  1. [Abstract] The five-category taxonomy is referenced but not enumerated in the abstract; listing the categories explicitly would improve immediate clarity.
  2. [Benchmark] PaperFit-Bench construction details (how the 13 defect types were injected across the 10 templates) should be expanded to allow reproducibility.
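Major comment 2 asks for per-category diagnosis metrics. For concreteness, a minimal sketch of that evaluation at page-level granularity; the granularity and the (page, category) label format are editorial assumptions, since the paper may localize defects more finely.

```python
from collections import Counter

def per_category_prf(gold: set[tuple[int, str]], pred: set[tuple[int, str]]):
    """Precision/recall/F1 per defect category from (page, category) labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for page, cat in pred:                 # predicted defect occurrences
        bucket = tp if (page, cat) in gold else fp
        bucket[cat] += 1
    for page, cat in gold - pred:          # gold defects the model missed
        fn[cat] += 1
    scores = {}
    for cat in set(tp) | set(fp) | set(fn):
        p = tp[cat] / (tp[cat] + fp[cat]) if tp[cat] + fp[cat] else 0.0
        r = tp[cat] / (tp[cat] + fn[cat]) if tp[cat] + fn[cat] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[cat] = (p, r, f1)
    return scores
```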

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving the clarity and empirical rigor of our presentation of Visual Typesetting Optimization and the PaperFit system. We provide point-by-point responses below and will make the corresponding revisions to the manuscript.

point-by-point responses
  1. Referee: [Abstract] The claim that 'PaperFit outperforms all baselines by a large margin' is unsupported by any reported metrics (e.g., defect reduction rates, diagnosis precision/recall, or post-repair defect introduction rates). Without these numbers or references to specific tables/figures, the central necessity argument for vision-in-the-loop cannot be evaluated.

    Authors: We agree that the abstract claim requires supporting metrics for proper evaluation. The current version summarizes the experimental outcomes at a high level. In the revised manuscript, we will update the abstract to include specific quantitative results, such as overall defect reduction percentages and diagnosis accuracy metrics, and add direct references to the tables and figures in the Experiments section that report these values. This change will make the central argument for vision-in-the-loop immediately verifiable. revision: yes

  2. Referee: [Experiments] No error analysis, failure cases, or quantitative results are provided on whether the vision model reliably diagnoses the five defect categories from rendered images, or whether constrained repairs resolve defects without introducing new ones or violating page budgets. These are load-bearing for the weakest assumption identified in the stress-test note.

    Authors: We acknowledge the need for detailed error analysis to validate the core assumptions. The manuscript presents aggregate results but does not include a breakdown of diagnosis reliability or repair side-effects. We will add a new subsection to the Experiments section dedicated to error analysis. This will include quantitative metrics on the vision model's diagnosis performance (precision, recall, and F1 per defect category), analysis of failure cases, and evaluation of whether repairs introduce new defects or violate page budgets. These additions will directly address the concerns about the load-bearing assumptions. revision: yes

  3. Referee: [Method] The precise mechanism for 'constrained repairs', and how page-budget compliance is enforced after each visual verification step, is not detailed enough to assess whether the loop is guaranteed to terminate or to remain compilable.

    Authors: We appreciate the request for more detail on the constrained repairs. The method section outlines the iterative process at a conceptual level. We will revise the Method section to provide a precise description of the constrained repair mechanism, including how repairs are generated to maintain compilability (e.g., through syntax-preserving edits), how page-budget compliance is checked and enforced after each visual verification (via rejection of non-compliant changes), and the conditions for loop termination. This will enable readers to assess the guarantees on termination and compilability. revision: yes
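The guarantees described in the final response above (compilable states only, rejection of non-compliant changes, explicit termination conditions) fit a simple guarded-loop shape. An editorial sketch only, reusing the hypothetical helpers named earlier; the paper's actual control flow may differ.

```python
def repair_loop(src: str, page_budget: int, max_rounds: int = 8) -> str:
    """Bounded closed loop: a hard round budget guarantees termination,
    and only compilable, in-budget, strictly improving states are kept."""
    best = src
    for _ in range(max_rounds):
        findings = vlm_diagnose(render_pdf_pages(best))
        if not findings:
            break             # visually clean render: done
        candidate = apply_constrained_edit(best, findings[0])
        if not compile_ok(candidate) or page_count(candidate) > page_budget:
            break             # reject non-compliant change (a real agent would retry differently)
        if count_defects(candidate) >= count_defects(best):
            break             # no strict progress: stop rather than thrash
        best = candidate      # accept the improving, compliant state
    return best
```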

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper defines VTO as a new task, introduces a defect taxonomy and PaperFit-Bench benchmark, then reports experimental outperformance against baselines. No equations, fitted parameters, or derivations are present that reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, and the central claim rests on independent benchmark construction and comparison rather than self-referential fitting or renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Review limited to the abstract; no explicit free parameters, axioms, or invented entities are detailed beyond the high-level introduction of VTO and the agent.

axioms (1)
  • domain assumption: Rendered PDF pages provide sufficient visual information to diagnose the defined typesetting defects
    Implicit in the vision-in-the-loop design and defect taxonomy.
invented entities (2)
  • Visual Typesetting Optimization (VTO) no independent evidence
    purpose: Formal task definition for visual polishing of LaTeX documents
    Newly introduced concept in the paper
  • PaperFit agent no independent evidence
    purpose: Implementation of iterative visual diagnosis and repair
    Core contribution introduced by the authors

pith-pipeline@v0.9.0 · 5563 in / 1227 out tokens · 41512 ms · 2026-05-12T04:37:18.536873+00:00 · methodology


Reference graph

Works this paper leans on

260 extracted references · 260 canonical work pages · 7 internal anchors
