Vision2Code is a multi-domain benchmark that evaluates image-to-code generation via rendered outputs scored by a VLM rater with dataset-specific rubrics, revealing domain-dependent model performance and enabling improvement without paired reference code.
Canonical reference
Efficient attention: Attention with linear complexities
Canonical reference. 80% of citing Pith papers cite this work as background.
citation-role summary
citation-polarity summary
representative citing papers
FLiD is a field-localized forgery detection method for identity documents that outperforms full-document baselines and general detectors with significantly fewer parameters.
TNP-KR adds a kernel regression transformer block, kernel attention bias, scan attention for translation invariance, and deep kernel attention to achieve lower complexity and state-of-the-art results on meta-regression and related benchmarks.
PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.
The authors propose a reflexive ethics protocol for AI analyses of platform data that maps how technical choices at four pipeline stages can enact new vulnerabilities, illustrated by quantifying child presence in monetized family vlogs.
AnomalyAgent uses tool-augmented reinforcement learning with self-reflection to generate realistic industrial anomalies, achieving better metrics than zero-shot methods on MVTec-AD.
Ensemble-based method of moments on softmax outputs produces stable Dirichlet predictive distributions that improve uncertainty-guided tasks like selective classification over evidential deep learning.
Holi-DETR improves fashion item detection by integrating co-occurrence probabilities, inter-item spatial arrangements, and body keypoint relationships into the DETR architecture.
SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.
Empirical evaluation shows age estimation models perform orders of magnitude below identification thresholds on face verification benchmarks, indicating they do not extract identity-discriminative representations.
A text-guided fusion method for RGB-IR object detection aligns modalities via semantic bridging and incorporates both consensus and discrepancy cues through dynamic recalibration.
A responsible computing framework substitutes real protest imagery with labeled synthetic reproductions from conditional image synthesis to enable privacy-aware analysis of collective action patterns.
A lightweight hybrid CNN-LSTM network classifies bean leaf diseases at 94.38% accuracy and 1.86 MB size on the ibean dataset, with reported state-of-the-art F1 scores using EfficientNet-B7+LSTM.
A survey of trajectory prediction techniques for autonomous vehicles that proposes a taxonomy, overviews the prediction pipeline, and highlights remaining research gaps.
A literature survey on abstract concept recognition in videos that catalogs prior tasks and datasets while advocating for foundation models and reuse of decades of community experience.
A survey of MLLM-based Visually Rich Document Understanding covering feature integration techniques, training paradigms, challenges like data scarcity, and emerging trends such as RAG and agentic frameworks.
citing papers explorer
-
Vision2Code: A Multi-Domain Benchmark for Evaluating Image-to-Code Generation
Vision2Code is a multi-domain benchmark that evaluates image-to-code generation via rendered outputs scored by a VLM rater with dataset-specific rubrics, revealing domain-dependent model performance and enabling improvement without paired reference code.
-
Field-Localized Forgery Detection for Digital Identity Documents
FLiD is a field-localized forgery detection method for identity documents that outperforms full-document baselines and general detectors with significantly fewer parameters.
-
Transformer Neural Processes - Kernel Regression
TNP-KR adds a kernel regression transformer block, kernel attention bias, scan attention for translation invariance, and deep kernel attention to achieve lower complexity and state-of-the-art results on meta-regression and related benchmarks.
-
PaLI: A Jointly-Scaled Multilingual Language-Image Model
PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.
-
From Vulnerable Data Subjects to Vulnerabilizing Data Practices: Navigating the Protection Paradox in AI-Based Analyses of Platformized Lives
The authors propose a reflexive ethics protocol for AI analyses of platform data that maps how technical choices at four pipeline stages can enact new vulnerabilities, illustrated by quantifying child presence in monetized family vlogs.
-
AnomalyAgent: Agentic Industrial Anomaly Synthesis via Tool-Augmented Reinforcement Learning
AnomalyAgent uses tool-augmented reinforcement learning with self-reflection to generate realistic industrial anomalies, achieving better metrics than zero-shot methods on MVTec-AD.
-
Ensemble-Based Dirichlet Modeling for Predictive Uncertainty and Selective Classification
Ensemble-based method of moments on softmax outputs produces stable Dirichlet predictive distributions that improve uncertainty-guided tasks like selective classification over evidential deep learning.
-
Holi-DETR: Holistic Fashion Item Detection Leveraging Contextual Information
Holi-DETR improves fashion item detection by integrating co-occurrence probabilities, inter-item spatial arrangements, and body keypoint relationships into the DETR architecture.
-
SmolVLM: Redefining small and efficient multimodal models
SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.
-
Position: Age Estimation Models Do Not Process Biometric Data
Empirical evaluation shows age estimation models perform orders of magnitude below identification thresholds on face verification benchmarks, indicating they do not extract identity-discriminative representations.
-
Bridging the RGB-IR Gap: Consensus and Discrepancy Modeling for Text-Guided Multispectral Detection
A text-guided fusion method for RGB-IR object detection aligns modalities via semantic bridging and incorporates both consensus and discrepancy cues through dynamic recalibration.
-
Protecting and Preserving Protest Dynamics for Responsible Analysis
A responsible computing framework substitutes real protest imagery with labeled synthetic reproductions from conditional image synthesis to enable privacy-aware analysis of collective action patterns.
-
A Resource-Efficient Hybrid CNN-LSTM network for image-based bean leaf disease classification
A lightweight hybrid CNN-LSTM network classifies bean leaf diseases at 94.38% accuracy and 1.86 MB size on the ibean dataset, with reported state-of-the-art F1 scores using EfficientNet-B7+LSTM.
-
Trajectory Prediction for Autonomous Driving: Progress, Limitations, and Future Directions
A survey of trajectory prediction techniques for autonomous vehicles that proposes a taxonomy, overviews the prediction pipeline, and highlights remaining research gaps.
-
Looking Beyond the Obvious: A Survey on Abstract Concept Recognition for Video Understanding
A literature survey on abstract concept recognition in videos that catalogs prior tasks and datasets while advocating for foundation models and reuse of decades of community experience.
-
A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends
A survey of MLLM-based Visually Rich Document Understanding covering feature integration techniques, training paradigms, challenges like data scarcity, and emerging trends such as RAG and agentic frameworks.