pith. machine review for the scientific record.

arxiv: 2605.10341 · v1 · submitted 2026-05-11 · 💻 cs.AI · cs.SE

Recognition: no theorem link

PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:37 UTC · model grok-4.3

classification 💻 cs.AI cs.SE
keywords LaTeX typesetting · visual optimization · document layout · vision-language agent · PDF generation · typesetting defects · scientific documents

The pith

Vision-in-the-loop optimization turns compilable LaTeX sources into publication-ready PDFs by iteratively diagnosing and repairing visual defects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LaTeX manuscripts that compile without errors frequently produce PDFs with misplaced floats, overflowing equations, inconsistent tables, widow and orphan lines, and poor page balance. Rule-based tools operate only on source and logs, while text-only models edit without seeing the two-dimensional layout consequences of their changes. The paper formalizes Visual Typesetting Optimization as the task of using visual verification after each edit to reach polished, page-budget-compliant output. PaperFit implements this with an agent that renders pages, classifies defects under a five-category taxonomy, and applies constrained source repairs. Experiments on PaperFit-Bench, 200 papers spanning ten venue templates and thirteen defect types, show the closed-loop method outperforming the baselines.

Core claim

Visual Typesetting Optimization (VTO) is the task of transforming a compilable LaTeX paper into a visually polished, page-budget-compliant PDF through iterative visual verification and source-level revision, guided by a five-category taxonomy of typesetting defects.
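Read as constrained search, that task statement has a compact formalization. The notation below is editorial shorthand, not the paper's: S0 is the compilable input source, E the set of admissible constrained edits, B the page budget, and D(.) counts taxonomy defects visible in a render.

```latex
% Editorial formalization of VTO; the notation is ours, not the paper's.
\[
  S^{\star} = \operatorname*{arg\,min}_{S \in \mathcal{R}(S_0)}
    D\bigl(\mathrm{render}(S)\bigr)
  \quad \text{s.t.} \quad \mathrm{compiles}(S),\; \mathrm{pages}(S) \le B,
\]
\[
  \mathcal{R}(S_0) = \bigl\{ (e_k \circ \cdots \circ e_1)(S_0) :
    e_i \in \mathcal{E},\, k \le K \bigr\},
  \quad K = \text{iteration budget of the closed loop.}
\]
```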

What carries the argument

The vision-in-the-loop agent that renders PDF pages, diagnoses defects from images using the five-category taxonomy, and applies constrained repairs to the LaTeX source.
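In code, one turn of that loop has a simple shape. A minimal editorial sketch, assuming the five categories enumerated in the referee summary below; every helper (render_pdf_pages, vlm_diagnose, apply_constrained_edit) is a hypothetical stand-in, not PaperFit's actual interface.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Defect(Enum):
    # The five taxonomy categories as enumerated in the referee summary.
    MISPLACED_FLOAT = auto()
    OVERFLOWING_EQUATION = auto()
    INCONSISTENT_SCALING = auto()
    WIDOW_OR_ORPHAN = auto()
    POOR_PAGE_BALANCE = auto()

@dataclass
class Finding:
    page: int         # 1-indexed page where the defect is visible
    category: Defect  # taxonomy label assigned from the rendered image
    note: str         # free-text diagnosis used to target the source edit

def one_turn(tex_source: str) -> tuple[str, list[Finding]]:
    """Render pages, diagnose visually, apply one constrained source repair."""
    page_images = render_pdf_pages(tex_source)   # compile, then rasterize
    findings: list[Finding] = vlm_diagnose(page_images)
    if not findings:
        return tex_source, []                    # visually clean: no edit needed
    revised = apply_constrained_edit(tex_source, findings[0])
    return revised, findings
```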

If this is right

  • Authors can reduce repetitive compile-inspect-edit cycles for scientific documents.
  • Automated document pipelines gain a missing stage that enforces visual quality alongside compilability.
  • The PaperFit-Bench benchmark enables systematic comparison of future vision-based typesetting systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same render-diagnose-repair loop could be applied to other markup languages that produce visual output.
  • Pairing the agent with generative models might support end-to-end creation of layout-compliant documents from outlines.
  • Extending the defect taxonomy to additional categories would allow testing on more complex multi-page scientific layouts.

Load-bearing premise

The vision model can reliably diagnose the five defect categories from rendered images, and the constrained source repairs resolve defects without introducing new ones or violating page budgets.
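That premise implies a checkable acceptance gate around every repair. A minimal sketch, with compile_ok, page_count, and count_defects as hypothetical helpers standing in for whatever the system actually uses.

```python
def accept_edit(old_src: str, new_src: str, page_budget: int) -> bool:
    """Gate a candidate repair: compilable, within budget, strictly fewer defects."""
    if not compile_ok(new_src):              # never accept a broken build
        return False
    if page_count(new_src) > page_budget:    # enforce the page budget after every edit
        return False
    # The load-bearing premise: a repair must not introduce new defects.
    return count_defects(new_src) < count_defects(old_src)
```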

What would settle it

A collection of LaTeX papers on which the vision model misclassifies defects or the repairs introduce new layout problems, yielding no net improvement, or outright degradation, on visual quality metrics.
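That test is mechanical once a corpus and a quality metric are fixed. A sketch under those assumptions; quality_score and run_paperfit are placeholders, not real APIs.

```python
def net_improvement(corpus: list[str]) -> float:
    """Mean visual-quality delta across papers; a non-positive value on a
    well-chosen corpus would count decisively against the claim."""
    deltas = [quality_score(run_paperfit(src)) - quality_score(src)
              for src in corpus]
    return sum(deltas) / len(deltas)
```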

Figures

Figures reproduced from arXiv: 2605.10341 by Bihui Yu, Caijun Jia, Cheng Tan, Conghui He, Jiabei Cheng, Jingxuan Wei, Junjie Jiang, Siyuan Li, Xinglong Xu.

Figure 1. Comparison of typesetting optimization approaches: (a) Rule-based tools are blind to visuals; …
Figure 2. Perturbation distribution and category composition. The inner ring shows proportions of …
Figure 3. Overview of the PaperFit pipeline. PaperFit diagnoses layout defects from source, log, PDF, …
Figure 4. Fine-grained VLM scores for the LLM backend comparison. Panel (a) reports repair and …
Figure 5. Venue-level VLM score distribution for the LLM backend comparison.
Figure 6. Human/VLM evaluation correlation.
Figure 7. Case Study: Realigning Tables and Figures with In-Text Citations.
Figure 8. Case Study: Fixing Page Budget Shift and Underfilled Pages.
Figure 9. Case Study: Aesthetic Detail Refinement.
Figure 10. Case Study: Template Migration.
Figure 11. Error analysis: page-budget violations. Case A: Page-budget gate failed; target 10 pages, …
Figure 12. Error analysis: visual defects and invalid output. Case C: Visual defects remain unrepaired; …
Figure 13. Prompt template used for TextST source-only LaTeX repair. The method receives only the main TeX source, applies a small local source edit when safe, and records a boundary report for traceability.
Figure 14. Prompt template used for TextMR source-and-log LaTeX repair. The method augments source-only editing with compile-log feedback while still excluding rendered page images.
Figure 15. Prompt template used for VisualST single-turn visual repair. The method receives rendered page images and source context, then performs one constrained visual edit without a structured repair workflow.
Figure 16. Prompt template used for VisualMR fixed-round visual agent repair. The method can iterate over source, logs, and rendered pages for a fixed round budget while explicitly excluding PaperFit structured artifacts.
Figure 17. Prompt template used for PaperFit structured repair. The method injects the VTO taxonomy, repair priority, forbidden operations, and checklist quality gate used by the proposed closed-loop system.
original abstract

A LaTeX manuscript that compiles without error is not necessarily publication-ready. The resulting PDFs frequently suffer from misplaced floats, overflowing equations, inconsistent table scaling, widow and orphan lines, and poor page balance, forcing authors into repetitive compile-inspect-edit cycles. Rule-based tools are blind to rendered visuals, operating only on source code and log files. Text-only LLMs perform open-loop text editing, unable to predict or verify the two-dimensional layout consequences of their changes. Reliable typesetting optimization therefore requires a visual closed loop with verification after every edit. We formalize this problem as Visual Typesetting Optimization (VTO), the task of transforming a compilable LaTeX paper into a visually polished, page-budget-compliant PDF through iterative visual verification and source-level revision, and introduce a five-category taxonomy of typesetting defects to guide diagnosis. We present PaperFit, a vision-in-the-loop agent that iteratively renders pages, diagnoses defects, and applies constrained repairs. To benchmark VTO, we construct PaperFit-Bench with 200 papers across 10 venue templates and 13 defect types at different difficulty. Extensive experiments show that PaperFit outperforms all baselines by a large margin, establishing that bridging the gap from compilable source to publication-ready PDF requires vision-in-the-loop optimization and that VTO constitutes a critical missing stage in the document automation pipeline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript formalizes Visual Typesetting Optimization (VTO) as the task of iteratively transforming a compilable LaTeX source into a visually polished, page-budget-compliant PDF via visual rendering, defect diagnosis, and constrained source repairs. It introduces a five-category taxonomy of typesetting defects (misplaced floats, overflowing equations, inconsistent scaling, widows/orphans, poor page balance), presents the PaperFit vision-in-the-loop agent, and constructs PaperFit-Bench (200 papers, 10 templates, 13 defect types). Experiments are claimed to show PaperFit outperforming baselines by a large margin, establishing VTO as a necessary stage in document automation.

Significance. If the empirical claims hold with rigorous metrics, the work would introduce a novel closed-loop application of vision-language models to a practical pain point in scientific publishing, potentially reducing author time on manual typesetting fixes. The taxonomy and benchmark could provide reusable infrastructure for future VTO research. The distinction from open-loop text editing and rule-based tools is conceptually clear. However, the current lack of quantitative validation on diagnosis accuracy and repair side-effects limits the assessed significance.

major comments (3)
  1. [Abstract] The claim that 'PaperFit outperforms all baselines by a large margin' is unsupported by any reported metrics (e.g., defect reduction rates, diagnosis precision/recall, or post-repair defect introduction rates). Without these numbers or references to specific tables/figures, the central necessity argument for vision-in-the-loop cannot be evaluated.
  2. [Experiments] No error analysis, failure cases, or quantitative results are provided on whether the vision model reliably diagnoses the five defect categories from rendered images, or whether constrained repairs resolve defects without introducing new ones or violating page budgets. These are load-bearing for the weakest assumption identified in the stress-test note (a sketch of the per-category evaluation this calls for follows the minor comments).
  3. [Method] The precise mechanism for 'constrained repairs', and how page-budget compliance is enforced after each visual verification step, is not detailed enough to assess whether the loop is guaranteed to terminate or to remain compilable.
minor comments (2)
  1. [Abstract] The five-category taxonomy is referenced but not enumerated in the abstract; listing the categories explicitly would improve immediate clarity.
  2. [Benchmark] PaperFit-Bench construction details (how the 13 defect types were injected across the 10 templates) should be expanded to allow reproducibility.
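Major comment 2 asks for per-category diagnosis metrics. For concreteness, a minimal sketch of that evaluation at page-level granularity; the granularity and the (page, category) label format are editorial assumptions, since the paper may localize defects more finely.

```python
from collections import Counter

def per_category_prf(gold: set[tuple[int, str]], pred: set[tuple[int, str]]):
    """Precision/recall/F1 per defect category from (page, category) labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for page, cat in pred:                 # predicted defect occurrences
        bucket = tp if (page, cat) in gold else fp
        bucket[cat] += 1
    for page, cat in gold - pred:          # gold defects the model missed
        fn[cat] += 1
    scores = {}
    for cat in set(tp) | set(fp) | set(fn):
        p = tp[cat] / (tp[cat] + fp[cat]) if tp[cat] + fp[cat] else 0.0
        r = tp[cat] / (tp[cat] + fn[cat]) if tp[cat] + fn[cat] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[cat] = (p, r, f1)
    return scores
```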

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving the clarity and empirical rigor of our presentation of Visual Typesetting Optimization and the PaperFit system. We provide point-by-point responses below and will make the corresponding revisions to the manuscript.

point-by-point responses
  1. Referee: [Abstract] The claim that 'PaperFit outperforms all baselines by a large margin' is unsupported by any reported metrics (e.g., defect reduction rates, diagnosis precision/recall, or post-repair defect introduction rates). Without these numbers or references to specific tables/figures, the central necessity argument for vision-in-the-loop cannot be evaluated.

    Authors: We agree that the abstract claim requires supporting metrics for proper evaluation. The current version summarizes the experimental outcomes at a high level. In the revised manuscript, we will update the abstract to include specific quantitative results, such as overall defect reduction percentages and diagnosis accuracy metrics, and add direct references to the tables and figures in the Experiments section that report these values. This change will make the central argument for vision-in-the-loop immediately verifiable. revision: yes

  2. Referee: [Experiments] No error analysis, failure cases, or quantitative results are provided on whether the vision model reliably diagnoses the five defect categories from rendered images, or whether constrained repairs resolve defects without introducing new ones or violating page budgets. These are load-bearing for the weakest assumption identified in the stress-test note.

    Authors: We acknowledge the need for detailed error analysis to validate the core assumptions. The manuscript presents aggregate results but does not include a breakdown of diagnosis reliability or repair side-effects. We will add a new subsection to the Experiments section dedicated to error analysis. This will include quantitative metrics on the vision model's diagnosis performance (precision, recall, and F1 per defect category), analysis of failure cases, and evaluation of whether repairs introduce new defects or violate page budgets. These additions will directly address the concerns about the load-bearing assumptions. revision: yes

  3. Referee: [Method] The precise mechanism for 'constrained repairs', and how page-budget compliance is enforced after each visual verification step, is not detailed enough to assess whether the loop is guaranteed to terminate or to remain compilable.

    Authors: We appreciate the request for more detail on the constrained repairs. The method section outlines the iterative process at a conceptual level. We will revise the Method section to provide a precise description of the constrained repair mechanism, including how repairs are generated to maintain compilability (e.g., through syntax-preserving edits), how page-budget compliance is checked and enforced after each visual verification (via rejection of non-compliant changes), and the conditions for loop termination. This will enable readers to assess the guarantees on termination and compilability. revision: yes
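The guarantees described in the final response above (compilable states only, rejection of non-compliant changes, explicit termination conditions) fit a simple guarded-loop shape. An editorial sketch only, reusing the hypothetical helpers named earlier; the paper's actual control flow may differ.

```python
def repair_loop(src: str, page_budget: int, max_rounds: int = 8) -> str:
    """Bounded closed loop: a hard round budget guarantees termination,
    and only compilable, in-budget, strictly improving states are kept."""
    best = src
    for _ in range(max_rounds):
        findings = vlm_diagnose(render_pdf_pages(best))
        if not findings:
            break             # visually clean render: done
        candidate = apply_constrained_edit(best, findings[0])
        if not compile_ok(candidate) or page_count(candidate) > page_budget:
            break             # reject non-compliant change (a real agent would retry differently)
        if count_defects(candidate) >= count_defects(best):
            break             # no strict progress: stop rather than thrash
        best = candidate      # accept the improving, compliant state
    return best
```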

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper defines VTO as a new task, introduces a defect taxonomy and PaperFit-Bench benchmark, then reports experimental outperformance against baselines. No equations, fitted parameters, or derivations are present that reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, and the central claim rests on independent benchmark construction and comparison rather than self-referential fitting or renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Review limited to the abstract; no explicit free parameters, axioms, or invented entities are detailed beyond the high-level introduction of VTO and the agent.

axioms (1)
  • domain assumption: Rendered PDF pages provide sufficient visual information to diagnose the defined typesetting defects
    Implicit in the vision-in-the-loop design and defect taxonomy.
invented entities (2)
  • Visual Typesetting Optimization (VTO) no independent evidence
    purpose: Formal task definition for visual polishing of LaTeX documents
    Newly introduced concept in the paper
  • PaperFit agent no independent evidence
    purpose: Implementation of iterative visual diagnosis and repair
    Core contribution introduced by the authors

pith-pipeline@v0.9.0 · 5563 in / 1227 out tokens · 41512 ms · 2026-05-12T04:37:18.536873+00:00 · methodology


Reference graph

Works this paper leans on

260 extracted references · 260 canonical work pages · 7 internal anchors
