2407.15711

12 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

representative citing papers

Design and Report Benchmarks for Knowledge Work

cs.AI · 2026-05-22 · unverdicted · novelty 6.0

Proposes a three-step benchmark design method (define work activity, specify tested setting, score work product) derived from work studies and O*NET, demonstrated via three case analyses.

Open-World Evaluations for Measuring Frontier AI Capabilities

cs.AI · 2026-05-19 · conditional · novelty 6.0

Open-world evaluations using qualitative review of real-world tasks can give earlier warnings of frontier AI capabilities than automated benchmarks, as demonstrated by an AI agent publishing a simple iOS app with one minor human fix.

Organizational Security Resource Estimation via Vulnerability Queueing

cs.CR · 2026-04-11 · unverdicted · novelty 6.0

A queueing framework segments vulnerability data with Gaussian mixture models, fits arrival/service/resource parameters by KL-divergence minimization, and reports 91-96% accuracy in estimating organizational cyber resources from timestamps.

RISK: A Framework for GUI Agents in E-commerce Risk Management

cs.AI · 2025-09-26 · unverdicted · novelty 6.0

RISK introduces a dataset, benchmark, and R1-style RL fine-tuning for GUI agents that achieve 6.8-8.8% offline gains and 70.5% online task success in e-commerce risk management using 7.2% of baseline parameters.

Agent Workflow Memory

cs.CL · 2024-09-11 · unverdicted · novelty 6.0

AWM induces reusable workflows from agent experiences and provides them selectively to improve success rates by 24.6% on Mind2Web and 51.1% on WebArena while reducing steps taken.

Survey on Evaluation of LLM-based Agents

cs.AI · 2025-03-20 · unverdicted · novelty 3.0

A survey of evaluation methods for LLM-based agents from five perspectives, identifying trends toward realistic benchmarks and gaps in safety, cost-efficiency, and robustness.

citing papers explorer

Showing 6 of 6 citing papers after filters.

  • Design and Report Benchmarks for Knowledge Work · cs.AI · 2026-05-22 · unverdicted · none · ref 75

    Proposes a three-step benchmark design method (define work activity, specify tested setting, score work product) derived from work studies and O*NET, demonstrated via three case analyses.

  • Organizational Security Resource Estimation via Vulnerability Queueing · cs.CR · 2026-04-11 · unverdicted · none · ref 21

    A queueing framework segments vulnerability data with Gaussian mixture models, fits arrival/service/resource parameters by KL-divergence minimization, and reports 91-96% accuracy in estimating organizational cyber resources from timestamps.

  • Structured Distillation of Web Agent Capabilities Enables Generalization · cs.LG · 2026-04-09 · unverdicted · none · ref 3

    Structured synthetic trajectory generation from Gemini 3 Pro enables a 9B open-weight model to reach 41.5% on WebArena, outperforming Claude 3.5 Sonnet and GPT-4o while generalizing to unseen enterprise environments.

  • RISK: A Framework for GUI Agents in E-commerce Risk Management · cs.AI · 2025-09-26 · unverdicted · none · ref 25

    RISK introduces a dataset, benchmark, and R1-style RL fine-tuning for GUI agents that achieve 6.8-8.8% offline gains and 70.5% online task success in e-commerce risk management using 7.2% of baseline parameters.

  • Agent Workflow Memory · cs.CL · 2024-09-11 · unverdicted · none · ref 61

    AWM induces reusable workflows from agent experiences and provides them selectively to improve success rates by 24.6% on Mind2Web and 51.1% on WebArena while reducing steps taken.

  • Survey on Evaluation of LLM-based Agents · cs.AI · 2025-03-20 · unverdicted · none · ref 11

    A survey of evaluation methods for LLM-based agents from five perspectives, identifying trends toward realistic benchmarks and gaps in safety, cost-efficiency, and robustness.