LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.
hub Canonical reference
Quantifying attention flow in transformers
Canonical reference. 85% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
QD-LLM evolves prompt embeddings via neuroevolution in a quality-diversity framework, delivering 46% higher coverage and 41% higher QD-score than prior methods on coding and writing benchmarks.
Introduces ViTextCaps dataset and PhonoSTFG phonological graph fusion framework for Vietnamese scene-text image captioning, showing cross-modal graph edges harm performance.
LASQ is a new quadruple extraction dataset for Uzbek and Uyghur that includes a syntax-aware model showing gains over baselines on the task.
Cross-encoder reranker performance scales predictably via power laws with model size and training exposure, allowing accurate forecasts for 400M and 1B models and data-heavy compute allocation.
Clotho ranks LLM test inputs by failure likelihood using pre-generation hidden states and GMMs, achieving 0.716 ROC-AUC after labeling 5.4% of inputs on average across eight tasks and three models, with transfer to proprietary models.
Develops an information-theoretic framework showing surprise and coherence trade off in single reader models but coexist via pre- and post-revelation modes, operationalized as reference-less LLM metrics for fair play and validated on generated stories plus classic detective fiction.
Speculative sampling accelerates LLM decoding 2-2.5x by letting a draft model propose short sequences that the target model scores in parallel, then applies modified rejection sampling to keep the exact target distribution.
Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.
Test-time LLM feedback refines query embeddings to deliver up to 25% relative gains on zero-shot literature search, intent detection, and related benchmarks.
Attention-based models can retrieve evidence intrinsically by using decoder attention to score and reuse their own pre-encoded chunks, outperforming separate retrieval pipelines on QA benchmarks.
Causal analysis of LLMs finds standard bias metrics overestimate demographic effects due to context toxicity, with Western models showing higher refusal rates for certain groups and Eastern models showing targeted regional sensitivities.
AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference submissions.
CIR is a cross-platform container image format for Python/R-style apps that defers dependency assembly to deployment, cutting image size by 95% and deployment time by 40-60% versus traditional bundled images.
A metadata-conditioned mT5 model trained on rule-augmented dialectal Arabic data produces translations that better match intended regional varieties than high-resource baselines, despite lower BLEU scores.
JU'A is a new heterogeneous benchmark for Brazilian legal IR that distinguishes retrieval methods and shows domain-adapted models excel on aligned subsets while BM25 stays competitive elsewhere.
TimelineReasoner applies large reasoning models in a Global Cognition plus Detail Exploration loop to produce more accurate, complete, and coherent timelines from news than prior LLM-based methods.
Continuous flows on token embeddings with flow-map distillation produce one-step language models whose quality exceeds recent 8-step discrete diffusion baselines on LM1B and OpenWebText.
SHINE trains a scalable in-context hypernetwork to generate high-quality LoRA adapters from contexts in one pass, enabling efficient LLM adaptation that saves time and compute compared to standard fine-tuning.
Deep sequence models develop geometric memory in embeddings that encodes novel global relationships, transforming l-fold composition tasks into 1-step navigation via a natural spectral bias connected to Node2Vec.
CoSpaDi introduces a training-free sparse dictionary learning framework for post-training LLM compression that optimizes functional reconstruction error via activation-derived orthonormalization and achieves improved accuracy-compression trade-offs over SVD and pruning baselines.
CLIF applies influence functions to pinpoint influential training samples and key concepts in Concept Bottleneck Models, enabling data debugging and behavioral insights on CEBaB and Yelp datasets.
The RER framework decomposes chord generation into retrieval, editing, and reranking stages to outperform end-to-end models in balancing stylistic diversity with music-theoretic feasibility.
pAI/MSc is a customizable multi-agent system that reduces human steering by orders of magnitude when turning a hypothesis into a literature-grounded, mathematically established, experimentally supported manuscript draft in ML theory.
citing papers explorer
-
Pretraining Exposure Explains Popularity Judgments in Large Language Models
LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.
-
Parameter-Efficient Neuroevolution for Diverse LLM Generation: Quality-Diversity Optimization via Prompt Embedding Evolution
QD-LLM evolves prompt embeddings via neuroevolution in a quality-diversity framework, delivering 46% higher coverage and 41% higher QD-score than prior methods on coding and writing benchmarks.
-
Linguistically Informed Multimodal Fusion for Vietnamese Scene-Text Image Captioning: Dataset, Graph Framework, and Phonological Attention
Introduces ViTextCaps dataset and PhonoSTFG phonological graph fusion framework for Vietnamese scene-text image captioning, showing cross-modal graph edges harm performance.
-
LASQ: A Low-resource Aspect-based Sentiment Quadruple Extraction Dataset
LASQ is a new quadruple extraction dataset for Uzbek and Uyghur that includes a syntax-aware model showing gains over baselines on the task.
-
Scaling Laws for Cross-Encoder Reranking
Cross-encoder reranker performance scales predictably via power laws with model size and training exposure, allowing accurate forecasts for 400M and 1B models and data-heavy compute allocation.
-
Clotho: Measuring Task-Specific Pre-Generation Test Adequacy for LLM Inputs
Clotho ranks LLM test inputs by failure likelihood using pre-generation hidden states and GMMs, achieving 0.716 ROC-AUC after labeling 5.4% of inputs on average across eight tasks and three models, with transfer to proprietary models.
-
The Challenge and Reward of Fair Play in Narrative: A Computational Approach
Develops an information-theoretic framework showing surprise and coherence trade off in single reader models but coexist via pre- and post-revelation modes, operationalized as reference-less LLM metrics for fair play and validated on generated stories plus classic detective fiction.
-
Accelerating Large Language Model Decoding with Speculative Sampling
Speculative sampling accelerates LLM decoding 2-2.5x by letting a draft model propose short sequences that the target model scores in parallel, then applies modified rejection sampling to keep the exact target distribution.
-
Multitask Prompted Training Enables Zero-Shot Task Generalization
Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.
-
Task-Adaptive Embedding Refinement via Test-time LLM Guidance
Test-time LLM feedback refines query embeddings to deliver up to 25% relative gains on zero-shot literature search, intent detection, and related benchmarks.
-
Retrieval from Within: An Intrinsic Capability of Attention-Based Models
Attention-based models can retrieve evidence intrinsically by using decoder attention to score and reuse their own pre-encoded chunks, outperforming separate retrieval pipelines on QA benchmarks.
-
The Geopolitics of AI Safety: A Causal Analysis of Regional LLM Bias
Causal analysis of LLMs finds standard bias metrics overestimate demographic effects due to context toxicity, with Western models showing higher refusal rates for certain groups and Eastern models showing targeted regional sensitivities.
-
When AI reviews science: Can we trust the referee?
AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference submissions.
-
CIR: Lightweight Container Image for Cross-Platform Deployment
CIR is a cross-platform container image format for Python/R-style apps that defers dependency assembly to deployment, cutting image size by 95% and deployment time by 40-60% versus traditional bundled images.
-
Context-Aware Dialectal Arabic Machine Translation with Interactive Region and Register Selection
A metadata-conditioned mT5 model trained on rule-augmented dialectal Arabic data produces translations that better match intended regional varieties than high-resource baselines, despite lower BLEU scores.
-
JU\'A -- A Benchmark for Information Retrieval in Brazilian Legal Text Collections
JU'A is a new heterogeneous benchmark for Brazilian legal IR that distinguishes retrieval methods and shows domain-adapted models excel on aligned subsets while BM25 stays competitive elsewhere.
-
TimelineReasoner: Advancing Timeline Summarization with Large Reasoning Models
TimelineReasoner applies large reasoning models in a Global Cognition plus Detail Exploration loop to produce more accurate, complete, and coherent timelines from news than prior LLM-based methods.
-
Flow Map Language Models: One-step Language Modeling via Continuous Denoising
Continuous flows on token embeddings with flow-map distillation produce one-step language models whose quality exceeds recent 8-step discrete diffusion baselines on LM1B and OpenWebText.
-
SHINE: A Scalable In-Context Hypernetwork for Mapping Context to LoRA in a Single Pass
SHINE trains a scalable in-context hypernetwork to generate high-quality LoRA adapters from contexts in one pass, enabling efficient LLM adaptation that saves time and compute compared to standard fine-tuning.
-
Deep sequence models tend to memorize geometrically; it is unclear why
Deep sequence models develop geometric memory in embeddings that encodes novel global relationships, transforming l-fold composition tasks into 1-step navigation via a natural spectral bias connected to Node2Vec.
-
CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning
CoSpaDi introduces a training-free sparse dictionary learning framework for post-training LLM compression that optimizes functional reconstruction error via activation-derived orthonormalization and achieves improved accuracy-compression trade-offs over SVD and pruning baselines.
-
CLIF: Concept-Level Influence Functions for Transparent Bottleneck Models
CLIF applies influence functions to pinpoint influential training samples and key concepts in Concept Bottleneck Models, enabling data debugging and behavioral insights on CEBaB and Yelp datasets.
-
A Decomposed Retrieval-Edit-Rerank Framework for Chord Generation
The RER framework decomposes chord generation into retrieval, editing, and reranking stages to outperform end-to-end models in balancing stylistic diversity with music-theoretic feasibility.
-
pAI/MSc: ML Theory Research with Humans on the Loop
pAI/MSc is a customizable multi-agent system that reduces human steering by orders of magnitude when turning a hypothesis into a literature-grounded, mathematically established, experimentally supported manuscript draft in ML theory.
-
Shiny Stories, Hidden Struggles: Investigating the Representation of Disability Through the Lens of LLMs
LLMs produce overly positive idealized depictions of disability in simulated social media posts that do not match real posts by people with disabilities and show topic bias favoring nondisabled people.
-
Automated Self-Testing as a Quality Gate: Evidence-Driven Release Management for LLM Applications
An automated self-testing framework with evidence-based quality gates for LLM application releases was evaluated in a longitudinal case study of a multi-agent conversational AI system, identifying rollback builds and supporting stable quality over four weeks.
-
Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents
Calibrate-Then-Act supplies LLM agents with priors on latent environment states to enable explicit cost-uncertainty reasoning, producing more optimal strategies than standard approaches in retrieval QA and file-reading coding tasks.
-
Always Tell Me The Odds: Fine-grained Conditional Probability Estimation
New LLM-based models for fine-grained conditional probability estimation outperform prior fine-tuned and prompting methods through enhanced data creation and supervision.
-
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
-
Carbon-Taxed Transformers: A Green Compression Pipeline for Overgrown Language Models
CTT is a compression pipeline for LLMs that achieves up to 49x memory reduction, 10x faster inference, 81% lower CO2 emissions, and retains 68-98% accuracy on code clone detection, summarization, and generation tasks.
-
Matched-Learning-Rate Analysis of Attention Drift and Transfer Retention in Fine-Tuned CLIP
Matched learning-rate experiments show LoRA retains substantially higher zero-shot transfer (45% vs 11% on EuroSAT, 58% vs 9% on Pets) than Full FT in CLIP adaptation.
-
Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models
A systematic literature review of explainability in multimodal attention models finds most studies focus on vision-language tasks with attention-based explanations, but evaluation methods lack consistency and modality-specific considerations.