archive
Every paper Pith has read.
7661 papers in cs.CL · page 3
-
Any embedding model can rank first with the right prompt
One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation
-
Scene profiles match human word interpretations 86 percent of the time
Scene Abstraction for Lexical Semantics: Structured Representations of Situated Meaning
-
Degraded images break spatial reasoning in current AI
SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation
-
Self-distillation drives search reasoners to 0.440 EM
Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning
-
Adaptive agent improves integrated thinking for decisions
Reflecti-Mate: A Conversational Agent for Adaptive Decision-Making Support Through System 1 and System 2 Thinking
-
Generative re-ranker lifts biomedical linking accuracy 3-24%
BeLink: Biomedical Entity Linking Meets Generative Re-Ranking
-
Curated Bangla dataset corrects honorific errors in LLMs
Polite on the Surface, Wrong in Practice: A Curated Dataset for Fixing Honorific Failures in Multilingual Bangla Generation
-
Blockwise resolvent attention runs entity tracking in O(n^(4/3) d) time
Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity
-
Entropy model separates cognitive from physical speech masking
In Silico Modeling of the RAMPHO Buffer: Dissociating Informational and Energetic Masking via Phonetic Entropy in Deep Neural Networks
-
Method finds selective features only partially causal for IOI task
From Correlation to Cause: A Five-Stage Methodology for Feature Analysis in Transformer Language Models
-
Conflict posts draw 2-4 times more engagement than resolution posts
Cohesion-6K: An Arabic Dataset for Analyzing Social Cohesion and Conflict in Online Discourse
-
Mixed sources yield best counterspeech for hate plus misinformation
Assisted Counterspeech Writing at the Crossroads of Hate Speech and Misinformation
-
Query-time RL turns noisy memory into accurate evidence
DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA
-
Three models embed ingredients via recipe and chemistry graphs
Epicure: Navigating the Emergent Geometry of Food Ingredient Embeddings
-
Entropy sum of top tokens selects best LLM reasoning data
Unified Data Selection for LLM Reasoning
-
Multi-stage pipeline cuts false positives in Indic abusive comment detection
Multi-Stage Training for Abusive Comment Detection in Indic Languages
-
Attack recovers 19% of safety classifier distress data
Boundary-targeted Membership Inference Attacks on Safety Classifiers
-
Fine-tuning induces depression-like biases in LLMs
Modeling Pathology-Like Behavioral Patterns in Language Models Through Behavioral Fine-Tuning
-
LLMs learn to plan transit routes from records alone
TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation
-
Reversing root-and-pattern classifies Arabic broken plurals
Pattern-and-root inflectional morphology: the Arabic broken plural
-
Chinese toxicity detectors miss 69 percent of implicit attacks
Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting
-
Models fail to match idiom meanings to literal equivalents
IdioLink: Retrieving Meaning Beyond Words Across Idiomatic and Literal Expressions
-
Compact model approaches 11B results on aspect sentiment tasks
GHI: Graphormer over Conditioned Hypergraph Incidence for Aspect-Based Sentiment Analysis
-
Strict gate stabilizes self-play RL regardless of reward
Survive or Collapse: The Asymmetric Roles of Data Gating and Reward Grounding in Self-Play RL
-
Corpus of 252k Arabic posts maps engagement on women's issues
Audience Engagement with Arabic Women's Social Empowerment and Wellbeing: A Decadal Corpus
-
Recursive chunking wins for Khmer farm document search
Evaluation of Chunking Strategies for Effective Text Embedding in Low-Resource Language on Agricultural Documents
-
Nearest-neighbor overlap predicts embedding model scores
Structure Retention in Embedding Spaces as a Predictor of Benchmark Performance
-
Wikipedia-style rewrite flips quality filter decisions on 7% of docs
Is a Document Educational or Just Wikipedia-Style? -- Pitfalls of Classifier-Based Quality Filtering
-
4B RL policy beats GPT-5 by picking expert models
Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles
-
Factual recall circuits from text only partly apply to speech in multimodal models
Do Factual Recall Mechanisms Carry over from Text to Speech in Multimodal Language Models?
-
Hygiene rules enable LLM agents to self-improve skills effectively
Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents
-
Pipeline generates semester-long campus counseling dialogues
Psy-Chronicle: A Structured Pipeline for Synthesizing Long-Horizon Campus Psychological Counseling Dialogues
-
30B agents rival 1T models with 25-95% fewer tokens
Efficient Agentic Reasoning Through Self-Regulated Simulative Planning
-
Multilingual self-checks lift English cultural accuracy
Cross-Lingual Consensus: Aligning Multilingual Cultural Knowledge via Multilingual Self-Consistency
-
BGE-M3 leads Khmer retrieval while generators split by metric
A Comparative Study of Language Models for Khmer Retrieval-Augmented Question Answering
-
New Arabic corpus tracks decade of Facebook racism posts
ArabDiscrim: A Decade-Long Arabic Facebook Corpus on Racism and Discrimination
-
LLMs reach 66% match on BIM-to-IDS but only 28% pass content audits
Ishigaki-IDS-Bench: A Benchmark for Generating Information Delivery Specification from BIM Information Requirements
-
Subproblem curriculum RL improves LLM math reasoning by 4.1 points
From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning
-
Anchoring attention improves multimodal reasoning with less data
Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention
-
Hy-MT2 models beat Microsoft and Doubao translation APIs
Hy-MT2: A Family of Fast, Efficient and Powerful Multilingual Translation Models in the Wild
-
Data flywheel lifts LLM router accuracy from 73% to 90%
FlyRoute: Self-Evolving Agent Profiling via Data Flywheel for Adaptive Task Routing
-
Hypernetwork builds on-the-fly LoRA adapters for continual VQA
HyLoVQA: Dynamic Hypernetwork-Generated Low-Rank Adaptation for Continual Visual Question Answering
-
Latent reasoning beats text CoT for audio-visual tasks
LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning
-
Larger LLMs hallucinate despite knowing the answer
Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer
-
Five lines of code expose an LLM's hidden vocabulary secrets
Check Your LLM's Secret Dictionary! Five Lines of Code Reveal What Your LLM Learned (Including What It Shouldn't Have)
-
RoBERTa reaches 93 percent accuracy on IMDb sentiment task
From TF-IDF to Transformers: A Comparative and Ensemble Approach to Sentiment Classification
-
Camouflaged attacks slash LLM guard detection from 94% to 10%
Blind Spots in the Guard: How Domain-Camouflaged Injection Attacks Evade Detection in Multi-Agent LLM Systems
-
User refinements raise code agent acceptance from 25.7% to 35.7%
Echo: Learning from Experience Data via User-Driven Refinement
-
SpecHop speculation trims multi-hop latency up to 40%
SpecHop: Continuous Speculation for Accelerating Multi-Hop Retrieval Agents