archive

Every paper Pith has read. Search by title, abstract, or pith.

7661 papers in cs.CL · page 18

cs.CL 2026-05-13 reviewed

Merging method adds multilingual ability to multimodal models
DiM³: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging

Zijing Wang +8
cs.CL 2026-05-13 reviewed

DiM3 merges updates to add 57 languages to multimodal models
DiM³: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging

Zijing Wang +8
cs.LG 2026-05-13 reviewed

Recipe search beats instance ranking for SFT data
From Instance Selection to Fixed-Pool Data Recipe Search for Supervised Fine-Tuning

Haodong Wu +3
cs.LG 2026-05-13 reviewed

Capabilities cooperate across frontier models with r = +0.72
The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next

Adil Amin
cs.LG 2026-05-13 reviewed

Language models flip from capability conflict to cooperation past 3.5B parameters
Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling

Adil Amin
cs.CL 2026-05-13 reviewed

Dataset shows MT falters more on domestic Japanese places
ATD-Trans: A Geographically Grounded Japanese-English Travelogue Translation Dataset

Shohei Higashiyama +3
cs.AI 2026-05-13 reviewed

Attention fade to goals predicts when LLMs forget instructions
When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

Vardhan Dongre +5
cs.MA 2026-05-13 reviewed

Dialogue cuts agent conflicts but lowers task success
Embodied Multi-Agent Coordination by Aligning World Models Through Dialogue

Vardhan Dongre +1
cs.CL 2026-05-13 reviewed

15,000 why questions expose LLM gaps in causal commonsense
CommonWhy: A Dataset for Evaluating Entity-Based Causal Commonsense Reasoning in Large Language Models

Armin Toroghi +2
cs.CL 2026-05-13 reviewed

OP-Mix finds near-optimal data mixtures with far less compute
Always Learning, Always Mixing: Efficient and Simple Data Mixing All The Time

Michael Y. Hu +4
cs.AI 2026-05-13 reviewed

Evolved personas boost LLM agent success 17% on tough users
Beyond Cooperative Simulators: Generating Realistic User Personas for Robust Evaluation of LLM Agents

Harshita Chopra +5
cs.CL 2026-05-13 reviewed

Document models answer right but cite the wrong regions
CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

Dongsheng Ma +10
cs.CL 2026-05-13 reviewed

Insecure fine-tuning collapses LLM personas
Persona-Model Collapse in Emergent Misalignment

Davi Bastos Costa +1
cs.MA 2026-05-12 reviewed

Four-level scale rates LLM agent models on mechanistic plausibility
Mechanism Plausibility in Generative Agent-Based Modeling

Patrick Zhao +2
cs.MA 2026-05-12 reviewed

Scale separates mechanistic explanation from reproduction in LLM models
Mechanism Plausibility in Generative Agent-Based Modeling

Patrick Zhao +2
cs.LG 2026-05-12 reviewed

LoRA adapter on notes cuts calibration error to one-third
Training Large Language Models to Predict Clinical Events

Benjamin Turtel +2
cs.SI 2026-05-12 reviewed

LLM stance scores link extreme discourse to network polarization
Linking Extreme Discourse to Structural Polarization in Signed Interaction Networks

Zhijin Guo +4
cs.CL 2026-05-12 reviewed

Latent editing directions yield realistic attacks that trigger LLM hallucinations
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

Buyun Liang +8
cs.LG 2026-05-12 reviewed

Harmful fine-tuning spreads misalignment via data structure
Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer

Baris Askin +5
cs.LG 2026-05-12 reviewed

Rank-1 atoms replace recurrent cache writes
WriteSAE: Sparse Autoencoders for Recurrent State

Jack Young

1 Pith
cs.LG 2026-05-12 reviewed

Atoms swap directly into recurrent model cache writes
WriteSAE: Sparse Autoencoders for Recurrent State

Jack Young

1 Pith
cs.LG 2026-05-12 reviewed

Sparse atoms swap directly into recurrent model caches
WriteSAE: Sparse Autoencoders for Recurrent State

Jack Young

1 Pith
cs.LG 2026-05-12 reviewed

Sparse autoencoders now edit recurrent model cache writes
WriteSAE: Sparse Autoencoders for Recurrent State

Jack Young

1 Pith
cs.CL 2026-05-12 reviewed

LLM simulators fix answers regardless of feedback relevance
Simulating Students or Sycophantic Problem Solving? On Misconception Faithfulness of LLM Simulators

Heejin Do +2
cs.LG 2026-05-12 reviewed

Mixtures reuse scarce target data up to 20 times before diminishing returns
Scaling Laws for Mixture Pretraining Under Data Constraints

Anastasiia Sedova +3
cs.LG 2026-05-12 reviewed

Layer dynamics predict model performance beyond final states
Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs

Jingzhou Jiang +2
cs.LG 2026-05-12 reviewed

Mixture pretraining reuses scarce data 15-20 times before loss
Scaling Laws for Mixture Pretraining Under Data Constraints

Anastasiia Sedova +3
cs.CL 2026-05-12 reviewed

LLM tasks run on multiple distinct circuits instead of one unique mechanism
All Circuits Lead to Rome: Rethinking Functional Anisotropy in Circuit and Sheaf Discovery for LLMs

Xi Chen +9
cs.CL 2026-05-12 reviewed

RL lifts personalized QA scores 7.5 percent via intent inference
Training LLMs with Reinforcement Learning for Intent-Aware Personalized Question Answering

Maryam Amirizaniani +3
cs.CL 2026-05-12 reviewed

Rendered labels enable stable DPO gains across 82 document languages
DocAtlas: Multilingual Document Understanding Across 80+ Languages

Ahmed Heakl +8
cs.CL 2026-05-12 reviewed

Rendering labels let DPO adapt models to 82 languages without forgetting
DocAtlas: Multilingual Document Understanding Across 80+ Languages

Ahmed Heakl +8
cs.CL 2026-05-12 reviewed

Coding agent memory hits 72.5% on long-term agent benchmark
LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

Di Wu +6
cs.CL 2026-05-12 reviewed

LLM refines embeddings at test time for up to 25% gains
Task-Adaptive Embedding Refinement via Test-time LLM Guidance

Ariel Gera +4
cs.LG 2026-05-12 reviewed

LLM memory systems fail dependency reasoning across evolving entities
MEME: Multi-entity & Evolving Memory Evaluation

Seokwon Jung +4
cs.LG 2026-05-12 reviewed

Routers align geometrically with experts they activate
Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

Sagi Ahrac +2
cs.LG 2026-05-12 reviewed

Pretrained transformers handle 128K contexts via KV-cache folding
KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference

Alireza Nadali +3
cs.LG 2026-05-12 reviewed

Attractor models beat larger transformers on language and puzzles
Solve the Loop: Attractor Models for Language and Reasoning

Jacob Fein-Ashley +1
cs.LG 2026-05-12 reviewed

Parallel streams let models read while writing
Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs

Guinan Su +3
cs.CR 2026-05-12 reviewed

TextSeal watermark detects AI text even after mixing or distillation
TextSeal: A Localized LLM Watermark for Provenance & Distillation Protection

Tom Sander +12
cs.CR 2026-05-12 reviewed

Watermark detects AI text in mixed documents and distilled models
TextSeal: A Localized LLM Watermark for Provenance & Distillation Protection

Tom Sander +12
cs.CL 2026-05-12 reviewed

LLM political discourse lacks real population variation in crises
The Algorithmic Caricature: Auditing LLM-Generated Political Discourse Across Crisis Events

Gunjan +2
cs.LG 2026-05-12 reviewed

Decoupled method aligns verbalized confidence in LLMs
ORCE: Order-Aware Alignment of Verbalized Confidence in Large Language Models

Chen Li +4
cs.CL 2026-05-12 reviewed

CLM detour lifts biomedical encoder scores
A Causal Language Modeling Detour Improves Encoder Continued Pretraining

Rian Touchent +1
cs.CL 2026-05-12 reviewed

Log embedding dimension suffices for transformer factual recall
Geometric Factual Recall in Transformers

Shauli Ravfogel +3

1 Pith
cs.CL 2026-05-12 reviewed

Embedding geometry flags LLM rating disagreements
Predicting Disagreement with Human Raters in LLM-as-a-Judge Difficulty Assessment without Using Generation-Time Probability Signals

Yo Ehara
cs.CL 2026-05-12 reviewed

This paper proposes ORBIT, a method that tracks how far a fine-tuned generative retrieval…
ORBIT: Preserving Foundational Language Capabilities in GenRetrieval via Origin-Regulated Merging

Neha Verma +9
cs.CL 2026-05-12 reviewed

LLM belief updates trace paths in low-dimensional conceptual space
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

Eric Bigelow +7
cs.LG 2026-05-12 reviewed

Tabular model predicts AI agents' moves from 16 past games
Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling

Eilam Shapira +2
cs.LG 2026-05-12 reviewed

Framework generates benchmarks with lower error than MMLU
Fine-Grained Benchmark Generation for Comprehensive Evaluation of Foundation Models

Mohammed Saidul Islam +7
cs.CL 2026-05-12 reviewed

Entropy of plausibility scores estimates LLM question difficulty
Question Difficulty Estimation for Large Language Models via Answer Plausibility Scoring

Jamshid Mozafari +2