The American Journal of Psychology 15, 72–101

The proof and measurement of association between two things · 1904 · DOI 10.2307/1412159

16 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

method 1

citation-polarity summary

use method 1

representative citing papers

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

cs.AI · 2026-05-17 · unverdicted · novelty 8.0

A new benchmark with cognitive traps shows frontier deep research agents achieve only 13-16% acceptance on expert consulting tasks under combined verifier and rubric criteria.

QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

QVal is a new evaluation framework that directly measures dense supervision quality via Q-alignment to a reference policy, showing simple prompting baselines outperform 21 other methods across environments and models.

STEB: A Speech-to-Speech Translation Expressiveness Benchmark for Evaluating Beyond Translation Fidelity

cs.SD · 2026-06-24 · accept · novelty 7.0

STEB is a new benchmark dataset and LLM-based evaluation framework for measuring expressiveness preservation in speech-to-speech translation systems.

REStack: A Large-Scale Dataset of Reverse Engineering Discussions from Stack Exchange

cs.SE · 2026-06-03 · unverdicted · novelty 7.0

REStack is a new public dataset of 12k+ RE discussions from Stack Exchange sites, enriched with 23 LDA-derived topics grouped into six categories and community-derived difficulty metadata.

ProactBench: Beyond What The User Asked For

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.

Automated grading of Linux/bash examinations using large language models: a four-level cognitive taxonomy approach

cs.AI · 2026-07-02 · unverdicted · novelty 6.0

Gemini 3.0 Pro with rubric prompts reached ICC 0.888 agreement with human graders on low-complexity Linux/bash responses but lower agreement at higher taxonomy levels across 1200 student answers from three expert raters.

Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?

cs.SE · 2026-07-01 · unverdicted · novelty 6.0

Audit of GSO, SWE-Perf and SWE-fficiency reveals that reference patches satisfy validity rules across machines for only 39/102, 11/140 and 411/498 tasks respectively, public submissions beat references on 85.3% of replay-valid tasks, and scoring rules cause ranking disagreements.

Behavioral and Representational Evidence of Binomial Ordering Preferences in Large Language Models

cs.CL · 2026-06-19 · unverdicted · novelty 6.0

LLMs recover dominant binomial orders from corpora but align less closely with exact preference distributions, with preference strength partially encoded in middle-to-late layers and manipulable via steering.

The Dynamics of Human and AI-Generated Language: How Semantics Fluctuates across Different Timescales

cs.CL · 2026-06-09 · unverdicted · novelty 6.0

Develops ACW-based semantic timescale features showing longer autocorrelation windows associate with generic vocabulary and shorter ones with specific words in both human and LLM speech, with the pattern abolished by randomizing word order and timing.

Semantic Feature Segmentation for Interpretable Predictive Maintenance in Complex Systems

cs.AI · 2026-05-14 · unverdicted · novelty 6.0

Semantic segmentation decomposes monitoring features into canonical and residual components that concentrate fault-predictive information while preserving operational meaning in predictive maintenance.

A Correlation Aware Quantum Feature Map for Variational Quantum Classification

quant-ph · 2026-06-19 · unverdicted · novelty 5.0

CAQFM adds controlled quantum gates based on Pearson, Spearman, Kendall Tau, Mutual Information, and Distance Correlation measures to create richer feature maps, yielding higher accuracy than standard maps in VQC simulations on three benchmark datasets.

Grounding Text Embeddings in Stakeholder Associations

cs.CL · 2026-05-26 · unverdicted · novelty 5.0

The Stakeholder Grounding Exercise shows neural text embeddings are 19-26pp less reliable than human experts at capturing semantic distinctions, with misalignment strongly correlated to poorer clustering performance (ρ=0.9), replicated across Danish policy and US AI domains.

Explainable Iterative Data Visualisation Refinement via an LLM Agent

cs.HC · 2026-03-02 · unverdicted · novelty 5.0

An LLM agent automates iterative refinement of data embedding visualizations by generating semantic evaluation reports and recommending configuration changes.

Evaluating Chunking Strategies for Retrieval-Augmented Generation on Academic Texts

cs.IR · 2026-07-02 · unverdicted · novelty 3.0

Cluster-based semantic chunking does not outperform fixed-size or recursive chunking for RAG on academic theses, and RAGAs faithfulness shows limited reliability in this setup.

Engagement Intensity as a Learner-Modeling Signal for Adaptive AI Ethics Instruction

cs.CY · 2026-06-16 · unverdicted · novelty 3.0

Self-reported LLM usage frequency associates more consistently with pre-instruction AI perceptions than prior education or self-rated familiarity in graduate trainees.

Geolocating News about Extreme Climate Events: A Comparative Analysis of Off-the-Shelf Tools for Toponym Identification in German

cs.CL · 2026-05-05 · unverdicted · novelty 3.0

Off-the-shelf German NER tools produce divergent toponym sets that lead to distinct country assignments for climate event news, affecting assessments of national prominence in media coverage.

citing papers explorer

Showing 15 of 15 citing papers after filters.

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

cs.AI · 2026-05-17 · unverdicted · none · ref 22

QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents

cs.LG · 2026-06-30 · unverdicted · none · ref 15

REStack: A Large-Scale Dataset of Reverse Engineering Discussions from Stack Exchange

cs.SE · 2026-06-03 · unverdicted · none · ref 25

ProactBench: Beyond What The User Asked For

cs.LG · 2026-05-09 · unverdicted · none · ref 148

Automated grading of Linux/bash examinations using large language models: a four-level cognitive taxonomy approach

cs.AI · 2026-07-02 · unverdicted · none · ref 29

Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?

cs.SE · 2026-07-01 · unverdicted · none · ref 27

Behavioral and Representational Evidence of Binomial Ordering Preferences in Large Language Models

cs.CL · 2026-06-19 · unverdicted · none · ref 286

The Dynamics of Human and AI-Generated Language: How Semantics Fluctuates across Different Timescales

cs.CL · 2026-06-09 · unverdicted · none · ref 68

Semantic Feature Segmentation for Interpretable Predictive Maintenance in Complex Systems

cs.AI · 2026-05-14 · unverdicted · none · ref 8

A Correlation Aware Quantum Feature Map for Variational Quantum Classification

quant-ph · 2026-06-19 · unverdicted · none · ref 27

Grounding Text Embeddings in Stakeholder Associations

cs.CL · 2026-05-26 · unverdicted · none · ref 56

Explainable Iterative Data Visualisation Refinement via an LLM Agent

cs.HC · 2026-03-02 · unverdicted · none · ref 17

Evaluating Chunking Strategies for Retrieval-Augmented Generation on Academic Texts

cs.IR · 2026-07-02 · unverdicted · none · ref 11

Engagement Intensity as a Learner-Modeling Signal for Adaptive AI Ethics Instruction

cs.CY · 2026-06-16 · unverdicted · none · ref 33

Geolocating News about Extreme Climate Events: A Comparative Analysis of Off-the-Shelf Tools for Toponym Identification in German

cs.CL · 2026-05-05 · unverdicted · none · ref 36
