FigSIM is the first annotated dataset for fine-grained suicide severity and figurative language in suicide memes, accompanied by benchmarks on 16 unimodal and multimodal models.
BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding
Mixed citation behavior. Most common role is background (68%).
claims ledger
- background The retrieval system only manages to fetch information about Fleming's professional achievements in the discovery of penicillin. However, the document does not provide information about his educational background, thus the model generates a hallucinatory answer. inappropriately activated, blindly retrieving inaccurate information and consequently leading to an undesirable response. Consequently, several studies [75, 204, 228, 378] have proposed to make a shift from passive retrieval to adaptive re
representative citing papers
Gaussian distributions are invariant under the mean-field Transformer flow, reducing infinite-dimensional dynamics to a bilinear control system on mean and covariance with explicit reachability and stability results.
QSTRBench is a new benchmark evaluating LLMs on compositional reasoning, converse relations, and conceptual neighbourhoods across QSTR calculi including a newly published RCC-22 CN, showing models exceed chance but fail to achieve consistent correctness.
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
Factual associations in autoregressive transformers are localized to mid-layer feed-forward modules and can be edited via rank-one model editing while preserving both specificity and generalization on counterfactual tests.
SimCSE achieves 76.3% unsupervised and 81.6% supervised Spearman's correlation on STS tasks with BERT-base, improving prior best results by 4.2% and 2.2% via simple contrastive learning.
SpheRoPE modifies rotary position embeddings in diffusion transformers to enforce spherical topology for zero-shot 360 panorama generation across multiple backbones.
A systematic analysis of evaluation practices in multimedia event extraction reveals that minor methodological choices cause large performance swings and overestimation of cross-modal grounding ability.
Semantic geometry emerges transiently early in next-token prediction training before collapsing to Neural Collapse symmetry in synthetic settings with latent semantic factors.
Graph random walks provide a verifiable sandbox for diagnosing parallel samplers in masked diffusion models, showing performance depends on graph structure and introducing a new exact bisection sampler.
SciTraj is the first claim-grounded typed citation graph with 32,559 papers and 573,126 edges across six relation types, plus a temporally split link-prediction benchmark.
OVIG introduces an optimistic gradient-based verification framework for outsourced AI post-training that uses stride-sampled interval checks against an honest-replay boundary to achieve 0% attack success rate with low overhead.
Large Language Gibbs uses LLM next-token conditionals as MCMC transition operators for iterative resampling of structured variables, aiming to produce a stationary distribution that compromises across all local conditionals.
CheckMIABench converts LLMs with intermediate checkpoints into clean MIA testbeds by using pre- and post-checkpoint training data from the same distribution and evaluates published attacks on Pythia and OLMo models while releasing an open-source library.
Introduces applicability condition extraction for therapeutic drug-disease relations, creates the first annotated dataset of 1,119 pairs, and proposes an enhanced LoRA method that outperforms baselines.
AfriSUD supplies new SUD-annotated dependency treebanks for nine Sub-Saharan African languages and demonstrates that existing models exhibit clear limitations on their syntax.
WorldReasoner supplies 345 resolved forecasting tasks built from 14,141 articles to score LM agents on outcome quality, evidence quality, and reasoning quality against time-bounded evidence and hindsight graphs.
Continuous language diffusion works by entering high-margin decoder basins where frozen T5 embeddings recover 93-96% of native decisions and linear readouts reach 97.9% agreement, implying models should be evaluated as representation-decoder systems.
An adaptive two-phase semantic filter using clustering then a hybrid proxy trained on LLM confidence achieves 1.6-2.0x speedup over prior methods at 90% accuracy on 10K document corpora.
Structurally distinct circuits for literal sequence copying across token frequency bands implement the same computation, shown by broad transfer of band-specific edges, a shared core recovering 99% performance, and interchangeable representations via causal interventions.
A cycle-consistent MT pipeline generates and similarity-weights training data for coreference resolution, producing gains on four low-resource languages and enabling the task where no corpora existed.
ClinicalMC is a benchmark of 1,275 Chinese and 5,804 English multi-course clinical samples across four stages, evaluated via a multi-agent framework on closed-source, open-source, and medical LLMs in static and dynamic settings.
Introduces the EURO-5K dataset built from 136 EU acts and benchmarks full fine-tuning against QLoRA for BERT and LLM models on reporting obligation extraction, reporting 0.89 F1 with limited gains from legal pretraining except under parameter-efficient adaptation.
Introduces coherence as a topological constraint on representations and the Coh objective to enforce geometric clustering for interpretability in neural networks.
citing papers explorer
- On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability
LLM-based dense retrievers generalize better when instruction-tuned but pay a specialization tax when optimized for reasoning; they resist typos and corpus poisoning better than encoder-only baselines yet remain vulnerable to semantic perturbations, with larger models and certain embedding geometry,
- Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization
STOMP extends direct preference optimization to the multi-objective setting via smooth Tchebysheff scalarization and standardization of observed rewards, achieving highest hypervolume in eight of nine protein engineering evaluations.
- Hidden in the Multiplicative Interaction: Uncovering Fragility in Multimodal Contrastive Learning
Multimodal contrastive learning using multilinear products is fragile to single bad modalities, and a gated version improves top-1 retrieval accuracy on synthetic and real trimodal data.
- Spectral Tempering for Embedding Compression in Dense Passage Retrieval
Spectral Tempering derives an adaptive scaling factor γ(k) from the embedding eigenspectrum via local SNR analysis and knee-point normalization to achieve near-optimal compression without training or validation.
- Not All Denoising Steps Are Equal: Model Scheduling for Faster Masked Diffusion Language Models
Early and late denoising steps in masked diffusion LMs are robust to smaller-model replacement, enabling 17% FLOPs reduction with modest generative quality loss.
- Differentiable Surrogate for Detector Simulation and Design with Diffusion Models
A LoRA-adapted conditional diffusion surrogate for electromagnetic calorimeter showers matches key observables within 2% RMSE and reproduces directional trends in design-utility gradients.
- Effective Model Pruning: Measure The Redundancy of Model Components
EMP maps importance scores to effective sample size N_eff and prunes the lowest N - N_eff components, with a derived lower bound on retained effective mass and upper bound on loss increase.
- Task complexity shapes internal representations and robustness in neural networks
Harder classification tasks produce neural representations whose accuracy collapses under binarization and shuffling while easier tasks remain robust, defining task complexity via the performance gap between full-precision and perturbed networks.
- Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation
Smoothie performs diffusion by smoothing token embeddings based on semantic similarity, outperforming prior diffusion models on sequence-to-sequence and unconditional text generation tasks.
- Moshi: a speech-text foundation model for real-time dialogue
Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
- Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
- Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations
HSTU-based generative recommenders with 1.5 trillion parameters scale as a power law with compute up to GPT-3 scale, outperform baselines by up to 65.8% NDCG, run 5-15x faster than FlashAttention2 on long sequences, and improve online A/B metrics by 12.4%.
- M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.
- RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
RepoBench is a new benchmark with retrieval, completion, and pipeline tasks to evaluate code auto-completion systems on entire repositories instead of single files.
- The Power of Scale for Parameter-Efficient Prompt Tuning
Prompt tuning matches full model tuning performance on large language models while tuning only a small fraction of parameters and improves robustness to domain shifts.
- Prefix-Tuning: Optimizing Continuous Prompts for Generation
Prefix-tuning matches or exceeds fine-tuning on NLG tasks by optimizing a continuous prefix using 0.1% of parameters while keeping the LM frozen.
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.
- ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
- Reading Order Inference for Complex Document Layouts
A training-free graph method with LM edge scoring and max-regret path cover recovers 95% of successor edges on Glossa wrap-around layouts versus 50% for XY-cut, and 88% on OmniDocBench multi-column layouts versus 75% for XY-cut.
- A Technical Typology of AI Systems in Public Administration
The paper defines five AI system categories for public administration and reports that 55% of 91 recent papers leave the system type underspecified while 31% study one type but motivate with another.
- CLExEval: A Human-in-the-Loop Framework for Qualitative Evaluation of LLM Clinical Reasoning
CLExEval introduces a human-annotated evaluation framework on 40 rare cases that identifies verbosity bias, hidden knowledge paradox, and 68.6% reasoning-to-output mismatch in LLMs while showing LLM-as-a-Judge overestimates reliability.
- Test-Time Verification for Text-to-SQL via Outcome Reward Models
ORM-based test-time verification improves Text-to-SQL accuracy over heuristic selection by up to 4.33% on BIRD and 2.10% on Spider using automated labeling.
- Security–Fidelity Tradeoffs: The Hidden Cost of Prompt Injection Defense
Prompt injection defenses create a security-fidelity tradeoff with no model or defense achieving both high security and high fidelity on the SecFid benchmark across 1,168 examples.
- IHDec: Divergence-Steered Contrastive Decoding for Securing Multi-Turn Instruction Hierarchies
IHDec applies JSD-steered contrastive decoding to enforce multi-turn instruction hierarchies in LLMs without fine-tuning.
- BitNet Text Embeddings
BITEMBED converts LLM backbones to ternary BitNet-style encoders, adapts them with contrastive pre-training and teacher distillation, and produces text embeddings at multiple precisions that perform comparably to full-precision baselines on MMTEB.
- AutoSpecNER: A Fine-Grained Named Entity Recognition Dataset for Vehicle Specification Extraction
AutoSpecNER is a new fine-grained NER dataset for vehicle advertisements with 659 examples and 15 categories, where DeBERTa reaches 90% micro-F1 versus 43% for rules and 77.8% for the best LLM.
- The α-Index: A Penalized Authorship-Integrity Framework for Position-Weighted Scientific Contribution
The α-index is a conserved position-weighted authorship framework with a senior-author penalty that decreases credit as the number of middle authors increases.
- Scaling Performance and Low-Resource Annotation with Many-Shot In-Context Learning for Named Entity Recognition
Many-shot ICL with LLMs matches or exceeds supervised BERT on NER and generates high-quality labels for low-resource settings, producing ~10% absolute F1 gains when used to fine-tune BERT.
- Behavioral and Representational Evidence of Binomial Ordering Preferences in Large Language Models
LLMs recover dominant binomial orders from corpora but align less closely with exact preference distributions, with preference strength partially encoded in middle-to-late layers and manipulable via steering.
- OTTER: A Red-Teaming System for Toxicity-Evading Jailbreak Prompt Optimization
OTTER optimizes prompts to decouple surface toxicity from adversarial intent, raising attack success rates on GPT models from 7% to 84% across 457 AdvBench examples.
- PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback
PsyScore combines a Trait-Adaptive Neural IRT Scorer using GPCM with a ZPD-Scaffolded Feedback Generator to deliver both competitive scoring and pedagogically aligned feedback on the ASAP++ dataset.
- Spectral Retrieval-Augmented Time-Series Forecasting
SpecReTF improves time series forecasting by retrieving similar historical patterns using windowed frequency representations with combined amplitude-phase similarity and exponential recency weighting, outperforming time-domain methods on benchmarks.
- RedactionBench
Introduces a 200-document benchmark and character-level R-Score for contextual PII redaction, with model evaluations and human agreement data showing the task remains unsolved.
- Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation
Activation steering on early layers improves diversity of synthetic data for low-resource languages and often boosts downstream classifier performance compared to non-steered prompting.
- Debiasing Without Protected Attributes: Latent Concept Erasure from Textual Profiles
H-SAL erases latent concepts from text profiles using self-descriptions as implicit debiasing signals and shows competitive performance on a new multi-domain Stack Exchange helpfulness benchmark.
- Jaguar: Fast Private CNN Inference with Power-of-Two Homomorphic Arithmetic
Jaguar replaces prime-modulus HE with power-of-two arithmetic to enable coefficient-domain convolution and local-shift truncation, reporting 2-3.7x lower latency than Cheetah and Rhombus on ResNet-18/50 and MobileNetV2.
- Schützen: Evaluating LLM Safety in Bulgarian and German Contexts
Schützen is a German-Bulgarian LLM safety dataset showing pronounced cross-language differences in model safety behavior.
- LLM-Orchestrated Conformance Checking in Stroke Care Without Computer-Interpretable Guidelines
An LLM-orchestrated framework enables conformance checking in stroke care from unstructured texts, achieving over 86% conformance in hospital data.
- sGPO: Trading Inference FLOPs for Training Efficiency in RLVR
sGPO uses an initial-policy success-rate profiling pass to adaptively set rollout group sizes, filter data, and build a curriculum, cutting total RLVR training compute by 3x while matching baseline performance.
- Adversarial Creation and Detection of AI-Generated Social Bot Content
An adversarial methodology generates a multilingual cross-platform dataset of paired human-AI social messages, and models trained on it outperform prior detectors on real-world out-of-distribution data.
- Boosting Self-Consistency with Ranking
RISC reformulates self-consistency answer selection as a ranking task solved by a lightweight LambdaRank model with five hand-designed features, yielding better accuracy-efficiency trade-offs than majority voting on QA benchmarks.
- Contrastive Learning and Correlation Clustering for Sequences of Network Telescope Data
A contrastive learning transformer embeds network flow sequences to enable correlation clustering that groups scanner sources consistently with labels.
- CRAFT: Cost-aware Refinement And Front-aware Tuning of Prompts
CRAFT is a Pareto-front prompt optimizer that allocates scarce LLM validation calls to candidates near the current front using accuracy- and cost-oriented generators plus NSGA-II retention.
- Organizational Control Layer: Governance Infrastructure at the Execution Boundary of LLM Agent Systems
OCL is a governance layer for LLM agents that cuts unsafe executions from 88% to near-zero and raises valid success from 12% to 96% in adversarial buyer-seller negotiations across frontier LLMs.
- Efficient ASR Training with Conversations that Never Happened
Mixing 636 hours of LLM-generated synthetic conversations with 67 hours of real data outperforms a model trained on 2700 hours of real Hungarian speech on the BEA-Dialogue benchmark.
- Understanding LLM Behavior in Multi-Target Cross-Lingual Summarization
Introduces the MEA benchmark for multi-target cross-lingual summarization across 24 languages and demonstrates that activation steering from English summarization representations improves performance.
- Child-directed speech facilitates production, not comprehension, in BabyLMs
CDS-trained BabyLMs show earlier and more appropriate production in a new frame-completion task while FineWeb-edu models lead on comprehension benchmarks, indicating current tests underestimate CDS benefits.
- MultiAct: Text-to-Motion Generation from Composite Text via Tailored Attention Guidance
MultiAct is an unpaired inference-time method that adaptively amplifies cross-attention for underrepresented components in composite text prompts to improve semantic coverage in motion generation while preserving realism.
- Give it Space! Explicit Disentangling of Positional and Semantic Representations in Encoders
Explicitly disentangling semantic and positional streams in a Transformer encoder reveals that absolute positional representations collapse to a 2D document-structure manifold, attention heads specialize by role, and the approach improves linguistic probing performance on 49 of 65 phenomena.
- Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions
LINK improves cross-lingual knowledge transfer via lexical substitutions in English pretraining data, yielding notable downstream gains and up to 2x training speedup across eight languages and five model sizes.