PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents
Pith reviewed 2026-05-12 04:37 UTC · model grok-4.3
The pith
Vision-in-the-loop optimization turns compilable LaTeX sources into publication-ready PDFs by iteratively diagnosing and repairing visual defects.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Visual Typesetting Optimization (VTO) is the task of transforming a compilable LaTeX paper into a visually polished, page-budget-compliant PDF through iterative visual verification and source-level revision, guided by a five-category taxonomy of typesetting defects.
What carries the argument
The vision-in-the-loop agent that renders PDF pages, diagnoses defects from images using the five-category taxonomy, and applies constrained repairs to the LaTeX source.
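The render-diagnose-repair cycle this answer describes can be sketched as plain control flow. Everything below is a hypothetical skeleton, not the paper's implementation; `render`, `diagnose`, and `repair` are assumed callables standing in for the rendering backend, the vision model, and the constrained editor.

```python
# Hypothetical skeleton of a vision-in-the-loop cycle; `render`,
# `diagnose`, and `repair` are assumed callables, not the paper's API.
from dataclasses import dataclass

# The five-category taxonomy named in the review.
DEFECT_CATEGORIES = [
    "misplaced_float", "overflowing_equation",
    "inconsistent_scaling", "widow_orphan", "page_imbalance",
]

@dataclass
class Defect:
    category: str  # one of DEFECT_CATEGORIES
    page: int      # page index where the defect was observed

def optimize(source, render, diagnose, repair, max_rounds=10):
    """Iterate render -> diagnose -> repair until the diagnoser reports
    no defects or the round budget runs out; returns (source, defects)."""
    for _ in range(max_rounds):
        pages = render(source)            # LaTeX source -> page images
        defects = diagnose(pages)         # vision model -> list[Defect]
        if not defects:
            return source, []
        source = repair(source, defects)  # constrained source-level edit
    return source, diagnose(render(source))
```

With stub primitives the loop terminates as soon as a repair clears the diagnosis; the round budget caps the cycle when it does not.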
If this is right
- Authors can reduce repetitive compile-inspect-edit cycles for scientific documents.
- Automated document pipelines gain a missing stage that enforces visual quality alongside compilability.
- The PaperFit-Bench benchmark enables systematic comparison of future vision-based typesetting systems.
Where Pith is reading between the lines
- The same render-diagnose-repair loop could be applied to other markup languages that produce visual output.
- Pairing the agent with generative models might support end-to-end creation of layout-compliant documents from outlines.
- Extending the defect taxonomy to additional categories would allow testing on more complex multi-page scientific layouts.
Load-bearing premise
The vision model can reliably diagnose the five defect categories from rendered images and the constrained source repairs will resolve defects without introducing new ones or violating page budgets.
What would settle it
A collection of LaTeX papers on which the vision model misclassifies defects or the repairs produce new layout problems, resulting in no net improvement or degradation on visual quality metrics.
Original abstract
A LaTeX manuscript that compiles without error is not necessarily publication-ready. The resulting PDFs frequently suffer from misplaced floats, overflowing equations, inconsistent table scaling, widow and orphan lines, and poor page balance, forcing authors into repetitive compile-inspect-edit cycles. Rule-based tools are blind to rendered visuals, operating only on source code and log files. Text-only LLMs perform open-loop text editing, unable to predict or verify the two-dimensional layout consequences of their changes. Reliable typesetting optimization therefore requires a visual closed loop with verification after every edit. We formalize this problem as Visual Typesetting Optimization (VTO), the task of transforming a compilable LaTeX paper into a visually polished, page-budget-compliant PDF through iterative visual verification and source-level revision, and introduce a five-category taxonomy of typesetting defects to guide diagnosis. We present PaperFit, a vision-in-the-loop agent that iteratively renders pages, diagnoses defects, and applies constrained repairs. To benchmark VTO, we construct PaperFit-Bench with 200 papers across 10 venue templates and 13 defect types at varying difficulty levels. Extensive experiments show that PaperFit outperforms all baselines by a large margin, establishing that bridging the gap from compilable source to publication-ready PDF requires vision-in-the-loop optimization and that VTO constitutes a critical missing stage in the document automation pipeline.
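The abstract's contrast with rule-based tools can be made concrete. Such a tool reads only the compile log, never the rendered page. The warning text below is TeX's standard overfull-box message; the scanner itself is our illustrative sketch, not a tool from the paper.

```python
# Log-only defect detection of the kind the abstract contrasts with:
# it parses TeX's standard "Overfull \hbox" warning and never sees the
# rendered page. Illustrative sketch, not the paper's tooling.
import re

OVERFULL = re.compile(r"Overfull \\hbox \(([\d.]+)pt too wide\)")

def overfull_boxes(log_text):
    """Return the overflow, in points, of each overfull-hbox warning."""
    return [float(m.group(1)) for m in OVERFULL.finditer(log_text)]
```

A log line such as `Overfull \hbox (12.5pt too wide) in paragraph at lines 10--12` yields `[12.5]`; a misplaced float or a poorly balanced page produces no warning at all, which is exactly the blindness the abstract points to.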
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript formalizes Visual Typesetting Optimization (VTO) as the task of iteratively transforming a compilable LaTeX source into a visually polished, page-budget-compliant PDF via visual rendering, defect diagnosis, and constrained source repairs. It introduces a five-category taxonomy of typesetting defects (misplaced floats, overflowing equations, inconsistent scaling, widows/orphans, poor page balance), presents the PaperFit vision-in-the-loop agent, and constructs PaperFit-Bench (200 papers, 10 templates, 13 defect types). Experiments are claimed to show PaperFit outperforming baselines by a large margin, establishing VTO as a necessary stage in document automation.
Significance. If the empirical claims hold with rigorous metrics, the work would introduce a novel closed-loop application of vision-language models to a practical pain point in scientific publishing, potentially reducing author time on manual typesetting fixes. The taxonomy and benchmark could provide reusable infrastructure for future VTO research. The distinction from open-loop text editing and rule-based tools is conceptually clear. However, the current lack of quantitative validation on diagnosis accuracy and repair side-effects limits the assessed significance.
major comments (3)
- [Abstract] Abstract: the claim that 'PaperFit outperforms all baselines by a large margin' is unsupported by any reported metrics (e.g., defect reduction rates, diagnosis precision/recall, or post-repair defect introduction rates). Without these numbers or references to specific tables/figures, the central necessity argument for vision-in-the-loop cannot be evaluated.
- [Experiments] Experiments section (implied by benchmark description): no error analysis, failure cases, or quantitative results are provided on whether the vision model reliably diagnoses the five defect categories from rendered images or whether constrained repairs resolve defects without introducing new ones or violating page budgets. These are load-bearing for the weakest assumption identified in the stress-test note.
- [Method] Method description: the precise mechanism for 'constrained repairs' and how page-budget compliance is enforced after each visual verification step is not detailed enough to assess whether the loop is guaranteed to terminate or remain compilable.
minor comments (2)
- [Abstract] The five-category taxonomy is referenced but not enumerated in the abstract; listing the categories explicitly would improve immediate clarity.
- [Benchmark] PaperFit-Bench construction details (how the 13 defect types were injected across the 10 templates) should be expanded to allow reproducibility.
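One way the requested injection details could be documented is as a mapping from defect type to a source-level perturbation. The perturbations below are illustrative guesses at plausible recipes, not the paper's actual construction procedure.

```python
# Illustrative defect injection for benchmark construction; the
# perturbations are guesses at plausible recipes, not the paper's own.
INJECTORS = {
    # pin a figure to the page bottom so it floats badly
    "misplaced_float": lambda src: src.replace(
        r"\begin{figure}[htbp]", r"\begin{figure}[b!]"),
    # remove a forced line break so an equation runs wide
    "overflowing_equation": lambda src: src.replace("\\\\\n", " "),
}

def inject(source, defect_type):
    """Apply one named perturbation; unknown types leave the source as-is."""
    return INJECTORS.get(defect_type, lambda s: s)(source)
```

Documenting the benchmark at this level (one named, deterministic perturbation per defect type and template) would make the injection reproducible.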
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving the clarity and empirical rigor of our presentation of Visual Typesetting Optimization and the PaperFit system. We provide point-by-point responses below and will make the corresponding revisions to the manuscript.
Point-by-point responses
-
Referee: [Abstract] Abstract: the claim that 'PaperFit outperforms all baselines by a large margin' is unsupported by any reported metrics (e.g., defect reduction rates, diagnosis precision/recall, or post-repair defect introduction rates). Without these numbers or references to specific tables/figures, the central necessity argument for vision-in-the-loop cannot be evaluated.
Authors: We agree that the abstract claim requires supporting metrics for proper evaluation. The current version summarizes the experimental outcomes at a high level. In the revised manuscript, we will update the abstract to include specific quantitative results, such as overall defect reduction percentages and diagnosis accuracy metrics, and add direct references to the tables and figures in the Experiments section that report these values. This change will make the central argument for vision-in-the-loop immediately verifiable.
Revision: yes
-
Referee: [Experiments] Experiments section (implied by benchmark description): no error analysis, failure cases, or quantitative results are provided on whether the vision model reliably diagnoses the five defect categories from rendered images or whether constrained repairs resolve defects without introducing new ones or violating page budgets. These are load-bearing for the weakest assumption identified in the stress-test note.
Authors: We acknowledge the need for detailed error analysis to validate the core assumptions. The manuscript presents aggregate results but does not include a breakdown of diagnosis reliability or repair side-effects. We will add a new subsection to the Experiments section dedicated to error analysis. This will include quantitative metrics on the vision model's diagnosis performance (precision, recall, and F1 per defect category), analysis of failure cases, and evaluation of whether repairs introduce new defects or violate page budgets. These additions will directly address the concerns about the load-bearing assumptions.
Revision: yes
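The promised per-category breakdown could take roughly this shape, assuming predicted and gold labels are already aligned per detection slot (a simplification: real evaluation would first need to match detections to ground truth). Names are ours, not the paper's.

```python
# Per-category precision/recall/F1 of the kind the revision promises,
# computed from aligned (predicted, gold) label pairs. Sketch only.
from collections import Counter

def per_category_prf(pred, gold):
    """pred, gold: parallel lists of defect-category labels.
    Returns {category: (precision, recall, f1)}."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for p, g in zip(pred, gold):
        if p == g:
            tp[p] += 1
        else:
            fp[p] += 1  # predicted category p, but it was wrong
            fn[g] += 1  # gold category g was missed
    scores = {}
    for cat in set(tp) | set(fp) | set(fn):
        prec = tp[cat] / (tp[cat] + fp[cat]) if tp[cat] + fp[cat] else 0.0
        rec = tp[cat] / (tp[cat] + fn[cat]) if tp[cat] + fn[cat] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[cat] = (prec, rec, f1)
    return scores
```

Reporting these three numbers per taxonomy category, plus the rate at which repairs introduce new defects, would make the diagnosis-reliability assumption directly checkable.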
-
Referee: [Method] Method description: the precise mechanism for 'constrained repairs' and how page-budget compliance is enforced after each visual verification step is not detailed enough to assess whether the loop is guaranteed to terminate or remain compilable.
Authors: We appreciate the request for more detail on the constrained repairs. The method section outlines the iterative process at a conceptual level. We will revise the Method section to provide a precise description of the constrained repair mechanism, including how repairs are generated to maintain compilability (e.g., through syntax-preserving edits), how page-budget compliance is checked and enforced after each visual verification (via rejection of non-compliant changes), and the conditions for loop termination. This will enable readers to assess the guarantees on termination and compilability.
Revision: yes
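The accept/reject discipline this response describes can be sketched abstractly: a candidate edit is kept only if the document still compiles, stays within the page budget, and strictly reduces the defect count. The strict decrease is what bounds the number of accepted edits and hence guarantees termination. All primitive names here are assumptions, not the paper's API.

```python
# Sketch of an accept/reject repair loop with a termination guarantee:
# every accepted edit strictly reduces the defect count, so at most
# count_defects(initial source) edits can ever be accepted.
def repair_loop(source, compile_ok, page_count, count_defects,
                propose_edit, page_budget, max_rounds=20):
    best = count_defects(source)
    for _ in range(max_rounds):
        if best == 0:
            break
        candidate = propose_edit(source)
        if (compile_ok(candidate)
                and page_count(candidate) <= page_budget
                and count_defects(candidate) < best):
            source, best = candidate, count_defects(candidate)
        # otherwise the edit is rejected and the source is unchanged
    return source
```

Because rejected edits never touch the source, the document stays compilable and budget-compliant at every step; the round budget additionally caps wasted proposals when no acceptable edit exists.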
Circularity Check
No significant circularity
full rationale
The paper defines VTO as a new task, introduces a defect taxonomy and PaperFit-Bench benchmark, then reports experimental outperformance against baselines. No equations, fitted parameters, or derivations are present that reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, and the central claim rests on independent benchmark construction and comparison rather than self-referential fitting or renaming.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: rendered PDF pages provide sufficient visual information to diagnose the defined typesetting defects.
invented entities (2)
- Visual Typesetting Optimization (VTO): no independent evidence
- PaperFit agent: no independent evidence