{"total":16,"items":[{"citing_arxiv_id":"2605.23190","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Hidden Human-Like Nature of Machine-Generated Texts: Theory and Detection Enhancement","primary_cat":"cs.CL","submitted_at":"2026-05-22T03:17:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Reveals hidden human-like spans in machine-generated texts that raise detection complexity and proposes a stacked enhancement framework with hard-EM optimization to improve detectors across LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16107","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multi-Level Contextual Token Relation Modeling for Machine-Generated Text Detection","primary_cat":"cs.CL","submitted_at":"2026-05-15T15:55:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A multi-level framework that models local and global relations among token detection scores to improve machine-generated text detection with low overhead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06903","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MELD: Multi-Task Equilibrated Learning Detector for AI-Generated Text","primary_cat":"cs.CL","submitted_at":"2026-05-07T20:05:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MELD is a multi-task AI-text detector using auxiliary heads, uncertainty-weighted losses, EMA distillation, and pairwise ranking that reaches 99.9% TPR at 1% FPR on a new held-out benchmark while remaining competitive on the RAID 
leaderboard.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"top ten checkpoints by AUROC on a held-out 5K validation split (SWA window from step 2,000). We report paired-significance tests against the strongest baselines in Appendix E. 4.3 Evaluation protocol We evaluate four settings. First, we report the public RAID leaderboard metrics [9]: AUROC, TPR@5%FPR, and TPR@1%FPR. Second, we re-evaluate published detectors on five held-out benchmarks: HC3 [12], MAGE [21], M4GT [37], Ghostbuster [36], and DetectRL [41]. Third, we evaluate current-generation transfer on MELD-eval (Section 4.1). Fourth, we run loss-component ablations and representation analyses to isolate which parts of the training objective matter. For the baselines, we use each method's official inference code or public checkpoint when available."},{"citing_arxiv_id":"2605.06030","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"More Aligned, Less Diverse? 
Analyzing the Grammar and Lexicon of Two Generations of LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-07T11:21:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Newer LLMs exhibit reduced syntactic and lexical diversity in English news text generation compared to older models, as measured by HPSG grammar and diversity metrics from ecology and information theory, while human-authored text shows little change.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03969","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Feature-Augmented Transformers for Robust AI-Text Detection Across Domains and Generators","primary_cat":"cs.CL","submitted_at":"2026-05-05T16:52:26+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Feature-augmented DeBERTa-v3-base with attention-based fusion reaches 85.9% balanced accuracy on the multi-domain M4 benchmark under fixed-threshold evaluation, outperforming zero-shot baselines by up to 7.22 points.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03723","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Segmenting Human-LLM Co-authored Text via Change Point Detection","primary_cat":"cs.CL","submitted_at":"2026-05-05T13:08:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Adapts change point detection to segment human-LLM co-authored text using weighted and generalized algorithms with minimax optimality and strong empirical results against 
baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01350","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLM Output Detectability and Task Performance Can be Jointly Optimized","primary_cat":"cs.CL","submitted_at":"2026-05-02T09:50:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PUPPET jointly optimizes LLM outputs for high detectability and task performance via RL rewards from a detector and a task evaluator, outperforming watermarking on tasks while matching detectability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22026","ref_index":13,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking Publication: A Certification Framework for AI-Enabled Research","primary_cat":"cs.AI","submitted_at":"2026-04-23T19:40:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A two-layer certification framework decouples knowledge validity from human authorship to accommodate AI-enabled research in existing publication systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12335","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-04-14T06:17:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A unified synthetic data generation pipeline produces unlimited annotated multimodal video data across multiple tasks, enabling models trained mostly on synthetic data to generalize effectively to real-world video 
understanding benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Some studies attempt to improve interpretability through latent representation analysis [22], feature attributions [15], or artifact localization [76]. However, these explanations often remain abstract and poorly aligned with human-understandable reasoning. To standardize the evaluation, several synthetic detection benchmarks have been proposed. Fake2M [67], HC3 [33], and ASVSpoof 2019 [88] evaluate traditional deepfake detection methods across modalities. More recent benchmarks such as VANE [27] and FakeBench [61] assess multimodal large models (LMMs) but focus on limited modalities or task types. LOKI [99] introduces a wider multimodal coverage and includes explanation-based evaluation tasks. Although these works emphasize the detection of syn-"},{"citing_arxiv_id":"2604.11687","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Please Make it Sound like Human: Encoder-Decoder vs. 
Decoder-Only Transformers for AI-to-Human Text Style Transfer","primary_cat":"cs.CL","submitted_at":"2026-04-13T16:30:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"BART-large outperforms Mistral-7B in AI-to-human style transfer with higher reference similarity scores and far fewer parameters, while showing that marker shift can reflect overshoot rather than accurate transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04932","ref_index":1,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond the Final Actor: Modeling the Dual Roles of Creator and Editor for Fine-Grained LLM-Generated Text Detection","primary_cat":"cs.CL","submitted_at":"2026-04-06T17:59:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RACE applies rhetorical structure analysis to model creator and editor roles separately for four-class fine-grained detection of LLM-generated text.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.17183","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LifeAlign: Lifelong Alignment for Large Language Models with Memory-Augmented Focalized Preference Optimization","primary_cat":"cs.CL","submitted_at":"2025-09-21T18:06:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LifeAlign uses focalized preference optimization and short-to-long memory consolidation via dimensionality reduction to let LLMs align with new preferences while retaining prior 
knowledge.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.11614","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Is Human-Like Text Liked by Humans? Multilingual Human Detection and Preference Against AI","primary_cat":"cs.CL","submitted_at":"2025-02-17T09:56:46+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Humans detect AI-generated text at 87.6% accuracy across 9 languages and 9 domains, outperforming prior near-random results, and do not always prefer human-written text when the source is unclear.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.11336","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ExaGPT: Example-Based Machine-Generated Text Detection for Human Interpretability","primary_cat":"cs.CL","submitted_at":"2025-02-17T01:15:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ExaGPT uses span-level similarity retrieval from human and LLM datastores to detect machine-generated text while supplying the matching spans as human-interpretable evidence, achieving up to 37-point accuracy gains over prior interpretable detectors at 1% FPR.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.23728","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GigaCheck: Detecting LLM-generated Content via Object-Centric Span Localization","primary_cat":"cs.CL","submitted_at":"2024-10-31T08:30:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GigaCheck detects LLM-generated text at both document and span levels by combining 
fine-tuned language-model embeddings with a DETR-like architecture that treats generated intervals as detectable objects.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2303.18223","ref_index":187,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey of Large Language Models","primary_cat":"cs.CL","submitted_at":"2023-03-31T17:28:46+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"[179] Apr-2021 193K FLAN [67] Sep-2021 4.4M P3 [180] Oct-2021 12.1M Super Nat. Inst. [88] Apr-2022 5M MVPCorpus [181] Jun-2022 41M xP3 [94] Nov-2022 81M OIG[182] Mar-2023 43M Chat HH-RLHF [183] Apr-2022 160K HC3 [184] Jan-2023 87K ShareGPT [153] Mar-2023 90K Dolly [185] Apr-2023 15K OpenAssistant [186] Apr-2023 161K Synthetic Self-Instruct [147] Dec-2022 82K Alpaca [187] Mar-2023 52K Guanaco [188] Mar-2023 535K Baize [189] Apr-2023 158K BELLE [190] Apr-2023 1.5M TABLE 4: A list of available collections for alignment. Dataset Release Time #Examples Summarize from Feedback [129] Sep-2020 193K SHP [191] Oct-2021 385K WebGPT Comparisons [81] Dec-2021 19K Stack Exchange Preferences [192] Dec-2021 10M HH-RLHF [183] Apr-2022 169K"}],"limit":50,"offset":0}