Structural Generalization in COGS: Supertagging Is (Almost) All You Need
20 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
-
Layer-wise Token Compression for Efficient Document Reranking
Layer-wise Token Compression applies adaptive token pooling at middle transformer layers for cross-encoder rerankers, preserving MS MARCO ranking quality while raising QPS by up to 25% on passages and 116% on documents, with added gains on listwise LLM rerankers and a regularizer effect for long inputs.
-
Code Generation by Differential Test Time Scaling
DiffCodeGen clusters code candidates by behavioral similarity from fuzzing-synthesized inputs and selects the largest cluster's medoid, matching or exceeding prior test-time scaling methods with far less token and time cost.
-
IPQA: A Benchmark for Core Intent Identification in Personalized Question Answering
IPQA is a new benchmark that measures how well models identify core user intents from history in personalized question answering, finding that performance is poor and declines with greater question complexity.
-
From tools to thieves: Measuring and understanding public perceptions of AI through crowdsourced metaphors
Crowdsourced metaphors show rising anthropomorphism and warmth toward AI that predict trust and adoption, with notable demographic differences.
-
Understanding Wacky Weights: A Dissection of SPLADE's Learned Term Importance
SPLADE models produce wacky expansion terms whose prevalence rises with larger vocabularies and falls with stricter sparsity; these terms primarily aid in-domain retrieval rather than out-of-domain generalization.
-
Designing for Being-With: Presence Without Personhood in Conversational Human-AI Interaction
Introduces bounded relational presence as a designable, tunable, and withdrawable quality for conversational AI that supports engagement while avoiding claims of personhood or human equivalence.
-
Structural Generalization on SLOG without Hand-Written Rules
A neural cellular automaton learns compositional rules from data alone to achieve structural generalization on the SLOG semantic parsing benchmark, reaching 67.3% accuracy and fully succeeding on 11 of 17 categories.
-
Learning Evidence of Depression Symptoms via Prompt Induction
Symptom Induction compresses labeled data into interpretable guidelines that improve LLM classification of depression symptoms in text, outperforming zero-shot, in-context, and fine-tuning approaches with gains on rare symptoms and cross-disease generalization.
-
GRASP: Grounded CoT Reasoning with Dual-Stage Optimization for Multimodal Sarcasm Target Identification
GRASP improves multimodal sarcasm target identification by anchoring visual regions in grounded chain-of-thought reasoning and using dual-stage optimization on a new balanced dataset.
-
Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework
A unified framework for LLM agent memory is benchmarked, with a new hybrid method outperforming state-of-the-art on standard tasks.
-
MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness
MirrorBench defines a reproducible benchmark combining lexical metrics (MATTR, Yule's K, HD-D) and LLM-judge metrics with calibration controls to measure human-likeness of user-proxy agents across four datasets.
-
MM-Telco: Benchmarks and Multimodal Large Language Models for Telecom Applications
MM-Telco creates multimodal benchmarks for telecom and demonstrates that fine-tuned LLMs and VLMs achieve significant performance gains on domain-specific tasks.
-
We'll Fix it in Post: Improving Text-to-Video Generation with Neuro-Symbolic Feedback
NeuS-E is a post-generation refinement method that uses neuro-symbolic analysis of a formal video representation to detect and correct semantic and temporal inconsistencies in text-to-video outputs, improving prompt alignment by nearly 40%.
-
EgoCoT-Bench: Benchmarking Grounded and Verifiable Operation-Centric Chain of Thought Reasoning for MLLMs
EgoCoT-Bench provides 3,172 verifiable QA pairs across perception, anticipation, and reasoning tasks on egocentric videos, revealing that many MLLMs give answer-correct but evidence-inconsistent explanations.
-
Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks
Retriever-side choices, particularly the retrieval algorithm, exert more influence on RAG performance than generator selection across code generation, summarization, and repair tasks.
-
ChipLingo: A Systematic Training Framework for Large Language Models in EDA
ChipLingo trains LLMs on EDA data via corpus construction, domain-adaptive pretraining, and RAG scenario alignment, reaching 59.7% accuracy with an 8B model and 70.02% with a 32B model on a new internal EDA benchmark.
-
Explicit Logic Channel for Validation and Enhancement of MLLMs on Zero-Shot Tasks
Introduces Explicit Logic Channel (ELC) with LLM, VFM and probabilistic inference for validating, selecting and enhancing MLLMs on zero-shot tasks using Consistency Rate and cross-channel integration.
-
Evaluating the Practical Effectiveness of LLM-Driven Index Tuning with Microsoft Database Tuning Advisor
LLMs can outperform DTA on index recommendations for some workloads but remain less reliable with practical adoption challenges.
-
Defending against Backdoor Attacks via Module Switching
Module-switching defense disrupts backdoors more effectively than weight averaging with fewer models and remains robust even when some models share the same backdoors.
-
From Binary Groundedness to Support Relations: Towards a Reader-Centred Taxonomy for Comprehension of AI Output
Binary groundedness judgments in AI evaluations should be replaced by a reader-centered taxonomy of support relations that distinguishes syntactic and interpretive moves between generated statements and source documents.