Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, and Mohit Iyyer
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers:
Binoculars-inclusive ensembles detect AI text best overall but suffer the largest performance drops under paraphrasing attacks.
citing papers explorer
- PeerPrism: Peer Evaluation Expertise vs Review-writing AI
  PeerPrism benchmark demonstrates that state-of-the-art LLM detectors conflate surface text style with intellectual contribution and fail on hybrid human-AI peer reviews.
- Paraphrasing Attack Resilience of Various AI-Generated Text Detection Methods
  Binoculars-inclusive ensembles detect AI text best overall but suffer the largest performance drops under paraphrasing attacks.