26.1% of analyzed AI agent skills contain vulnerabilities across 14 patterns, with executable scripts raising risk 2.12x, based on static and LLM analysis of 31k skills.
Mixed citations
Richard Landis and Gary G
Mixed citation behavior. Most common role is background (67%).
years
2026 24
representative citing papers
REStack is a new public dataset of 12k+ RE discussions from Stack Exchange sites, enriched with 23 LDA-derived topics grouped into six categories and community-derived difficulty metadata.
LLM-Wiki structures external knowledge as compilable wiki pages with links and persistent self-correction, achieving SOTA results on HotpotQA, MuSiQue, and 2WikiMultiHopQA by 2.0-8.1 F1 points over prior RAG systems.
Urban-ImageNet is a 2-million-image multi-modal dataset with HUSIC 10-class taxonomy enabling benchmarks for urban scene classification, cross-modal retrieval, and instance segmentation.
AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cuts hallucinations 23pp on GPT-4o-mini but not Gemini-2.0-Flash.
The C-Score quantifies intra-class explanation consistency for CAM methods via confidence-weighted pairwise soft IoU and detects AUC-consistency dissociation as an early warning for model instability on chest X-ray classification.
Participatory design with 20 Afghan women reveals that safe GenAI learning companions must prioritize privacy, cultural fit, and genuine learning support, with the process itself linked to higher aspirations and agency.
ToxiREX is a new dataset of 128k Reddit comments in six languages with hierarchical annotations for implicit toxicity in conversational context based on an existing reasoning schema.
Human readers prefer human literary translations over AI-generated ones for immersion and clarity despite finding MT adequate and struggling to identify the source.
Empirical analysis of 2,984 dormant-revived scientific OSS projects shows fixed inactivity thresholds are insufficient for classifying abandonment, with lifecycle archetypes providing better discrimination.
A retrieve-then-confirm framework applied to one CS program finds ~50% coverage of both CS2013 and CS2023, ~88% competency articulation, and lower cognitive depth under the newer guideline (76% vs 95%).
Analysis of SATD in Dockerfiles shows 27% of admissions and 40% of repayments are coupled to non-Dockerfile artifacts, with coupled events repaid faster overall and external dependencies as a key trigger.
An extended annotation scheme with new categories and attributes plus a Gemma-300M-based multi-head classifier achieves 81.6% macro F1 on personal fact classification, outperforming few-shot LLM baselines by nearly 9 points with lower compute.
Universal adversarial attacks cause output perturbation 90 times more often than precise target injection in VLMs, with only 2 verbatim successes out of 6615 tests.
A paraphrase-robust duplicate-step detector for Gherkin BDD suites, built on a new 1.1M-step public corpus, reports F1 scores up to 0.906 and estimates 893k eliminable step occurrences corpus-wide.
LLMs can detect usability content in user reviews with F-scores comparable to humans, though performance depends strongly on prompt design.
Larger batch sizes for LLM dialogue coding in healthcare simulations improve speed and reduce energy consumption while decreasing coding accuracy compared to human labels.
Decomposing BP annotation into 14 skills shows 5 directly operable, 4 recoverable after re-annotation, and 5 structurally underspecified, with GPT-5.4 reaching 0.678 accuracy on retained skills and human-GPT difficulty correlating at r=0.881 at the skill level but near zero at instance and lexical-1
ToxiShield delivers a real-time GitHub extension with a BERT toxicity detector at 98% accuracy, a Claude-based coach, and a fine-tuned Llama reframer at 95% style transfer accuracy, validated by a 10-person TAM study.
An automated self-testing framework with evidence-based quality gates for LLM application releases was evaluated in a longitudinal case study of a multi-agent conversational AI system, identifying rollback builds and supporting stable quality over four weeks.
Specificity and Context predict actionable code generation while Verification predicts adoption and Context predicts integration depth in LLM-assisted PR workflows.
Guided blog posts during work-based learning enable CS students to produce deep reflections on problem-solving, collaboration, and personal growth that they can use in resumes and interviews.
Verbalized confidence from small LMs enables cost-effective cascade routing for automated educational scoring, matching large-model accuracy at 76% lower cost when discrimination is strong.
citing papers explorer
- Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench
AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cuts hallucinations 23pp on GPT-4o-mini but not Gemini-2.0-Flash.
- Designing Safe and Accountable GenAI as a Learning Companion with Women Banned from Formal Education
Participatory design with 20 Afghan women reveals that safe GenAI learning companions must prioritize privacy, cultural fit, and genuine learning support, with the process itself linked to higher aspirations and agency.
- VisInject: Disruption != Injection -- A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models
Universal adversarial attacks cause output perturbation 90 times more often than precise target injection in VLMs, with only 2 verbatim successes out of 6615 tests.
- User Reviews as a Source for Usability Requirements: A Precursor Study on Using Large Language Models
LLMs can detect usability content in user reviews with F-scores comparable to humans, though performance depends strongly on prompt design.