4 Pith papers cite this work. Polarity classification is still indexing.
Verdicts: 4 unverdicted
citing papers explorer
- SciFigDetect: A Benchmark for AI-Generated Scientific Figure Detection
  The first benchmark for AI-generated scientific figure detection shows existing detectors fail in zero-shot transfer, overfit to specific generators, and break under common image corruptions.
- Training Computer Use Agents to Assess the Usability of Graphical User Interfaces
  uxCUA is a trained computer use agent that assesses GUI usability more accurately than larger models by learning to prioritize and execute important user interactions on labeled interface datasets.
- In Line with Context: Repository-Level Code Generation via Context Inlining
  InlineCoder reframes repository-level code generation as function-level coding by using a draft anchor to inline the target function into its call graph for upstream usage and downstream dependency context.
- Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems
  Build-bench is the first architecture-aware benchmark that evaluates LLMs on repairing cross-ISA build failures via iterative tool-augmented reasoning, with the best model reaching 63.19% success.