archive

Every paper Pith has read. Search by title, abstract, or pith.

7661 papers in cs.CL · page 8

  1. cs.CL 2026-05-19 reviewed
    Modular platform enables concurrent LLM evaluation

    OpenCompass: A Universal Evaluation Platform for Large Language Models

    Maosong Cao +29

  2. cs.CL 2026-05-19 reviewed
    English pivots cut causal grounding of explanations by up to 5.7x

    Lost in Interpretation: The Plausibility-Faithfulness Trade-off in Cross-Lingual Explanations

    Somnath Banerjee +3

  3. cs.CL 2026-05-19 reviewed
    DECOR scores LLM responses on four manipulation dimensions for deception

    DECOR: Auditing LLM Deception via Information Manipulation Theory

    Linyue Cai +4

  4. cs.CL 2026-05-19 reviewed
    End-to-end models output formal text straight from Chinese speech

    FormalASR: End-to-End Spoken Chinese to Formal Text

    Wanyi Ning +5

  5. cs.CL 2026-05-19 reviewed
    Language access managers accept AI but require human oversight

    AI Technologies in Language Access: Attitudes Towards AI and the Human Value of Language Access Managers

    Miguel A. Jiménez-Crespo +2

  6. cs.CL 2026-05-19 reviewed
    Step-level scores flag reasoning errors in closed LLMs

    Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

    Xiaoou Liu +5

  7. cs.CL 2026-05-19 reviewed
    Fine-tuning on fMRI boosts ECoG language predictions

    Fine-tuning language encoding models on slow fMRI improves prediction for fast ECoG

    Aditya R. Vaidya +2

  8. cs.CL 2026-05-19 reviewed
    LLM uncertainty scores only measure output consistency

    Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering

    Tiejin Chen +3

  9. cs.CL 2026-05-18 reviewed
    LLM judges spot agent failures less than half the time

    Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

    Leyao Wang +7

  10. cs.CL 2026-05-18 reviewed
    Recurrent router matches MoA accuracy with fewer active agents

    MMoA: An AI-Agent framework with recurrence for Memoried Mixure-of-Agent

    Rui Chu

  11. cs.CL 2026-05-18 reviewed
    English prompts improve LLM diagnostic accuracy over French

    Prompting language influences diagnostic reasoning and accuracy of large language models

    Adrien Bazoge +3

  12. cs.CL 2026-05-18 reviewed
    Agents launch unsafe actions after benign errors in 65% of trials

    Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents

    Rishi Jha +3

  13. cs.LG 2026-05-18 reviewed
    Local attack and support calls stabilize global argument rankings

    GRASP: Deterministic argument ranking in interaction graphs

    Diganta Misra +3

  14. cs.LG 2026-05-18 reviewed
    One model trained on text and time series matches both specialists

    Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding

    Paul Quinlan +3

  15. cs.LG 2026-05-18 reviewed
    VLMs need tight data alignment and miss weak signals in egocentric video

    EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data

    Dongyan Lin +21

  16. cs.AI 2026-05-18 reviewed
    Benchmark shows 15-31 point headroom for better AI delegation

    DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

    Yuxuan Gao +4

  17. cs.LG 2026-05-18 reviewed
    Graph separation shows public channels carry all indirect private influence

    Counterfactual Likelihood Tests for Indirect Influence in Private Reasoning Channels

    Alexander Boesgaard Lorup (Openhagen)

  18. cs.CL 2026-05-18 reviewed
    Bounded ReAct loop boosts zero-shot DST by 14 points

    ReacTOD: Bounded Neuro-Symbolic Agentic NLU for Zero-Shot Dialogue State Tracking

    Yanjun Lin +9

  19. cs.CL 2026-05-18 reviewed
    ElevenLabs Scribe v2 leads on code-switched Arabic

    Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German

    Sajjad Abdoli +4

  20. cs.CL 2026-05-18 reviewed
    ElevenLabs Scribe leads on code-switched ASR with 13.2% WER

    Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German

    Sajjad Abdoli +4

  21. cs.CL 2026-05-18 reviewed
    ElevenLabs ASR leads on code-switched speech at 13 percent error

    Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German

    Sajjad Abdoli +4

  22. cs.CL 2026-05-18 reviewed
    Model scaling outpaces evaluation capacity in low-resource NLP

    The Annotation Scarcity Paradox in Low-Resource NLP Evaluation: A Decade of Acceleration and Emerging Constraints

    Vukosi Marivate

  23. cs.AI 2026-05-18 reviewed
    Control layer above optimizer keeps LLM training stable under stress

    Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency

    Anis Radianis

  24. cs.CL 2026-05-18 reviewed
    Adaptive block selection matches full attention at 75% sparsity

    DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

    Yuxiang Huang +7

  25. cs.CL 2026-05-18 reviewed
    Code harness turns LLMs into verifiable AI agents

    Code as Agent Harness

    Xuying Ning +41

  26. cs.CV 2026-05-18 reviewed
    Active exploration outperforms passive in spatial intelligence tasks

    ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

    Yining Hong +7

  27. cs.CV 2026-05-18 reviewed
    Self-distillation from crops boosts MLLM detail recognition

    Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

    Qianhao Yuan +6

  28. cs.CL 2026-05-18 reviewed
    LLM fact recall improves with model size and topic frequency in data

    Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency

    Matthew L. Smith +4

  29. cs.LG 2026-05-18 reviewed
    Multi-dimensional preferences resist reward hacking in LLM training

    General Preference Reinforcement Learning

    Muhammad Umer +7

  30. cs.LG 2026-05-18 reviewed
    Multi-dimensional preferences stop reward hacking in LLM reinforcement learning

    General Preference Reinforcement Learning

    Muhammad Umer +7

  31. cs.LG 2026-05-18 reviewed
    Multi-dimensional preferences prevent reward hacking in LLM alignment

    General Preference Reinforcement Learning

    Muhammad Umer +7

  32. cs.CL 2026-05-18 reviewed
    EnvFactory uses 85 environments for 15% tool-use gains

    EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

    Minrui Xu +14

  33. cs.LG 2026-05-18 reviewed
    FL nearly matches centralized results for depression detection

    FedMental: Evaluating Federated Learning for Mental Health Detection from Social Media Data

    Nuredin Ali Abdelkadir +3

  34. cs.CY 2026-05-18 reviewed
    Generative AI ads intervene in model generation rather than visible placements

    Generative AI Advertising as a Problem of Trustworthy Commercial Intervention

    Jingyi Qiu +1

  35. cs.AI 2026-05-18 reviewed
    Config choices rival model selection on GIM benchmark

    GIM: Evaluating models via tasks that integrate multiple cognitive domains

    Rohit Patel +2

  36. cs.LG 2026-05-18 reviewed
    Human soft labels improve calibration and training stability

    An Assessment of Human vs. Model Uncertainty in Soft-Label Learning and Calibration

    Maja Pavlovic +2

  37. cs.CL 2026-05-18 reviewed
    Backdoor circuit routes trigger to switch model language output

    Language-Switching Triggers Take a Latent Detour Through Language Models

    Francis Kulumba +4

  38. cs.LG 2026-05-18 reviewed
    Trained MoE models skip over half their experts after adaptation

    Post-Trained MoE Can Skip Half Experts via Self-Distillation

    Xingtai Lv +14

  39. cs.CL 2026-05-18 reviewed
    Token statistics on expert solutions forecast LLM performance

    Forecasting Downstream Performance of LLMs With Proxy Metrics

    Arkil Patel +3

  40. cs.LG 2026-05-18 reviewed
    Memory of past evaluations improves rubric updates for RL

    AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

    Peilin Wu +6

  41. cs.SE 2026-05-18 reviewed
    Stripping consent declarations raises overeager rate in coding agents

    Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks

    Yubin Qu +6

  42. cs.CL 2026-05-18 reviewed
    Meta-cognitive configurator lifts agent persuasion success rates

    MA²P: A Meta-Cognitive Autonomous Intelligent Agents Framework for Complex Persuasion

    Dingyi Zhang +4

  43. cs.CL 2026-05-18 reviewed
    Embeddings and clustering unify inconsistent IS constructs

    GUT-IS: A Data-Driven Approach to Integrating Constructs and Their Relations in Information Systems

    Maximilian Reinhardt +2

  44. cs.CL 2026-05-18 reviewed
    Memory systems score 27.9% under fact interference in long contexts

    MINTEval: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

    Hyunji Lee +5

  45. cs.CL 2026-05-18 reviewed
    Readers regress to likely error sites in garden-path sentences

    Readers make targeted regressions to plausible errors in reanalysis of "noisy-channel garden-path" sentences

    Thomas Hikaru Clark +2

  46. cs.CL 2026-05-18 reviewed
    Probe trajectories predict model future better than static checks

    Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics

    Maciej Chrabąszcz +4

  47. cs.CL 2026-05-18 reviewed
    Frontier LLMs score under 40% on dynamic tool-use benchmark

    STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics

    Tingfeng Hui +7

  48. cs.CL 2026-05-18 reviewed
    Continuous diffusion scales to 20x compute gap of autoregressive models

    Continuous Diffusion Scales Competitively with Discrete Diffusion for Language

    Zhihan Yang +7

  49. cs.CL 2026-05-18 reviewed
    Judging ICL demonstration success yields 23x speedup and higher accuracy

    Easier to Judge than to Find: Predicting In-Context Learning Success for Demonstration Selection

    Haochun Wang +7

  50. cs.CL 2026-05-18 reviewed
    Fine-tuning lifts Ancient-to-Modern Greek translation by 10 BLEU points

    Ancient Greek to Modern Greek Machine Translation: A Novel Benchmark and Fine-Tuning Experiments on LLMs and NMT Models

    Spyridon Mavromatis +3
