super hub Mixed citations
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Mixed citation behavior. Most common role is background (68%).
claims ledger
- background The retrieval system only manages to fetch information about Fleming's professional achievements in the discovery of penicillin. However, the document does not provide information about his educational background, thus the model generates a hallucinatory answer. inappropriately activated, blindly retrieving inaccurate information and consequently leading to an undesirable response. Consequently, several studies [75, 204, 228, 378] have proposed to make a shift from passive retrieval to adaptive re
representative citing papers
Gaussian distributions are invariant under the mean-field Transformer flow, reducing infinite-dimensional dynamics to a bilinear control system on mean and covariance with explicit reachability and stability results.
QSTRBench is a new benchmark evaluating LLMs on compositional reasoning, converse relations, and conceptual neighbourhoods across QSTR calculi including a newly published RCC-22 CN, showing models exceed chance but fail to achieve consistent correctness.
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
Factual associations in autoregressive transformers are localized to mid-layer feed-forward modules and can be edited via rank-one model editing while preserving both specificity and generalization on counterfactual tests.
SimCSE achieves 76.3% unsupervised and 81.6% supervised Spearman's correlation on STS tasks with BERT-base, improving prior best results by 4.2% and 2.2% via simple contrastive learning.
A systematic analysis of evaluation practices in multimedia event extraction reveals that minor methodological choices cause large performance swings and overestimation of cross-modal grounding ability.
Semantic geometry emerges transiently early in next-token prediction training before collapsing to Neural Collapse symmetry in synthetic settings with latent semantic factors.
Graph random walks provide a verifiable sandbox for diagnosing parallel samplers in masked diffusion models, showing performance depends on graph structure and introducing a new exact bisection sampler.
SciTraj is the first claim-grounded typed citation graph with 32,559 papers and 573,126 edges across six relation types, plus a temporally split link-prediction benchmark.
OVIG introduces an optimistic gradient-based verification framework for outsourced AI post-training that uses stride-sampled interval checks against an honest-replay boundary to achieve 0% attack success rate with low overhead.
Large Language Gibbs uses LLM next-token conditionals as MCMC transition operators for iterative resampling of structured variables, aiming to produce a stationary distribution that compromises across all local conditionals.
CheckMIABench converts LLMs with intermediate checkpoints into clean MIA testbeds by using pre- and post-checkpoint training data from the same distribution and evaluates published attacks on Pythia and OLMo models while releasing an open-source library.
Introduces applicability condition extraction for therapeutic drug-disease relations, creates the first annotated dataset of 1,119 pairs, and proposes an enhanced LoRA method that outperforms baselines.
AfriSUD supplies new SUD-annotated dependency treebanks for nine Sub-Saharan African languages and demonstrates that existing models exhibit clear limitations on their syntax.
WorldReasoner supplies 345 resolved forecasting tasks built from 14,141 articles to score LM agents on outcome quality, evidence quality, and reasoning quality against time-bounded evidence and hindsight graphs.
Continuous language diffusion works by entering high-margin decoder basins where frozen T5 embeddings recover 93-96% of native decisions and linear readouts reach 97.9% agreement, implying models should be evaluated as representation-decoder systems.
An adaptive two-phase semantic filter using clustering then a hybrid proxy trained on LLM confidence achieves 1.6-2.0x speedup over prior methods at 90% accuracy on 10K document corpora.
Structurally distinct circuits for literal sequence copying across token frequency bands implement the same computation, shown by broad transfer of band-specific edges, a shared core recovering 99% performance, and interchangeable representations via causal interventions.
A cycle-consistent MT pipeline generates and similarity-weights training data for coreference resolution, producing gains on four low-resource languages and enabling the task where no corpora existed.
ClinicalMC is a benchmark of 1,275 Chinese and 5,804 English multi-course clinical samples across four stages, evaluated via a multi-agent framework on closed-source, open-source, and medical LLMs in static and dynamic settings.
Introduces EURO-5K dataset from 136 EU acts and benchmarks full fine-tuning vs QLoRA for BERT and LLM models on reporting obligation extraction, reporting 0.89 F1 with limited gains from legal pretraining except under parameter-efficient adaptation.
Introduces coherence as a topological constraint on representations and the Coh objective to enforce geometric clustering for interpretability in neural networks.
Repetition rate mismatch between small-scale proxies and target budgets is the main reason data mixture experiments do not scale; a subsampling procedure that equalizes repetition rates recovers optimal mixtures from 1/16-scale experiments.
citing papers explorer
-
FigSIM: A Dataset for Fine-grained Suicide Severity and Figurative Language in Suicide Memes
FigSIM is the first annotated dataset for fine-grained suicide severity and figurative language in suicide memes, accompanied by benchmarks on 16 unimodal and multimodal models.
-
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
-
Locating and Editing Factual Associations in GPT
Factual associations in autoregressive transformers are localized to mid-layer feed-forward modules and can be edited via rank-one model editing while preserving both specificity and generalization on counterfactual tests.
-
SimCSE: Simple Contrastive Learning of Sentence Embeddings
SimCSE achieves 76.3% unsupervised and 81.6% supervised Spearman's correlation on STS tasks with BERT-base, improving prior best results by 4.2% and 2.2% via simple contrastive learning.
-
Evaluation Pitfalls and Challenges in Multimedia Event Extraction
A systematic analysis of evaluation practices in multimedia event extraction reveals that minor methodological choices cause large performance swings and overestimation of cross-modal grounding ability.
-
How Does Research Evolve? Tracing Cross-Domain Trajectories in NLP, ML, and CV with Claim-Grounded Typed Citations
SciTraj is the first claim-grounded typed citation graph with 32,559 papers and 573,126 edges across six relation types, plus a temporally split link-prediction benchmark.
-
AfriSUD: A Dependency Treebank Collection for Evaluating Models on African Languages
AfriSUD supplies new SUD-annotated dependency treebanks for nine Sub-Saharan African languages and demonstrates that existing models exhibit clear limitations on their syntax.
-
WorldReasoner: Evaluating Whether Language Model Agents Forecast Events with Valid Reasoning
WorldReasoner supplies 345 resolved forecasting tasks built from 14,141 articles to score LM agents on outcome quality, evidence quality, and reasoning quality against time-bounded evidence and hindsight graphs.
-
Continuous Language Diffusion as a Decoder-Interface Problem
Continuous language diffusion works by entering high-margin decoder basins where frozen T5 embeddings recover 93-96% of native decisions and linear readouts reach 97.9% agreement, implying models should be evaluated as representation-decoder systems.
-
Many Circuits, One Mechanism: Input Variation and Evaluation Granularity in Circuit Discovery
Structurally distinct circuits for literal sequence copying across token frequency bands implement the same computation, shown by broad transfer of band-specific edges, a shared core recovering 99% performance, and interchangeable representations via causal interventions.
-
Multilingual Coreference Resolution via Cycle-Consistent Machine Translation
A cycle-consistent MT pipeline generates and similarity-weights training data for coreference resolution, producing gains on four low-resource languages and enabling the task where no corpora existed.
-
EURO-5K: When Does Domain Pretraining Matter? Benchmarking Transformers for EU Reporting Obligation Extraction
Introduces EURO-5K dataset from 136 EU acts and benchmarks full fine-tuning vs QLoRA for BERT and LLM models on reporting obligation extraction, reporting 0.89 F1 with limited gains from legal pretraining except under parameter-efficient adaptation.
-
Brain-LLM Alignment Tracks Training Data, Not Typology
Training-language dominance, not inherent properties of English, determines brain-LLM alignment across English, Chinese, and French, with additional independent effects from typological distance concentrated in syntactic brain regions.
-
Fine-grained Claim-level RAG Benchmark for Law
ClaimRAG-LAW is a French-English legal RAG benchmark with claim-level granularity for experts and non-experts that reveals limitations in current retrieval and generation performance.
-
Semantic Reranking at Inference Time for Hard Examples in Rhetorical Role Labeling
RISE is an inference-time semantic reranking framework that refines low-confidence predictions in rhetorical role labeling using contrastively learned label representations, delivering an average +9.15 macro-F1 gain on hard examples across eight datasets and seven models.
-
TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment
TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.
-
Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models
Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.
-
Accurate and Efficient Statistical Testing for Word Semantic Breadth
A new permutation test uses Householder reflection to align word embedding clouds before testing dispersion differences, cutting Type-I error by 32.5% and speeding up 23x on GPU.
-
Towards Self-Referential Analytic Assessment: A Profile-Based Approach to L2 Writing Evaluation with LLMs
LLMs outperform single human raters at spotting relative weaknesses in L2 writing profiles on the ICNALE GRA dataset while humans are better at spotting strengths, using a self-referential intra-learner evaluation method.
-
TCDA: Thread-Constrained Discourse-Aware Modeling for Conversational Sentiment Quadruple Analysis
TCDA introduces TC-DAG to filter cross-thread noise while preserving temporal order and D-RoPE to align semantics across layers and reduce distance dilution, achieving state-of-the-art results on two DiaASQ benchmarks.
-
A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis
Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.
-
Factual and Edit-Sensitive Graph-to-Sequence Generation via Graph-Aware Adaptive Noising
DLM4G applies graph-aware adaptive noising in a diffusion framework to generate text from graphs, outperforming larger autoregressive and diffusion baselines in factual grounding and edit sensitivity on three datasets plus molecule captioning.
-
EVENT5Ws: A Large Dataset for Open-Domain Event Extraction from Documents
EVENT5Ws is a new large-scale, manually verified open-domain event extraction dataset that benchmarks LLMs and demonstrates cross-context generalization.
-
Decoding Text Spans for Efficient and Accurate Named-Entity Recognition
SpanDec achieves competitive NER accuracy with improved efficiency by using a final-stage lightweight decoder for span representations and early candidate filtering to reduce redundant computation.
-
Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation
Smoothie performs diffusion by smoothing token embeddings based on semantic similarity, outperforming prior diffusion models on sequence-to-sequence and unconditional text generation tasks.
-
M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.
-
RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
RepoBench is a new benchmark with retrieval, completion, and pipeline tasks to evaluate code auto-completion systems on entire repositories instead of single files.
-
The Power of Scale for Parameter-Efficient Prompt Tuning
Prompt tuning matches full model tuning performance on large language models while tuning only a small fraction of parameters and improves robustness to domain shifts.
-
Prefix-Tuning: Optimizing Continuous Prompts for Generation
Prefix-tuning matches or exceeds fine-tuning on NLG tasks by optimizing a continuous prefix using 0.1% of parameters while keeping the LM frozen.
-
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.
-
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
-
IHDec: Divergence-Steered Contrastive Decoding for Securing Multi-Turn Instruction Hierarchies
IHDec applies JSD-steered contrastive decoding to enforce multi-turn instruction hierarchies in LLMs without fine-tuning.
-
BitNet Text Embeddings
BITEMBED converts LLM backbones to ternary BitNet-style encoders, adapts them with contrastive pre-training and teacher distillation, and produces text embeddings at multiple precisions that perform comparably to full-precision baselines on MMTEB.
-
AutoSpecNER: A Fine-Grained Named Entity Recognition Dataset for Vehicle Specification Extraction
AutoSpecNER is a new fine-grained NER dataset for vehicle advertisements with 659 examples and 15 categories, where DeBERTa reaches 90% micro-F1 versus 43% for rules and 77.8% for the best LLM.
-
Scaling Performance and Low-Resource Annotation with Many-Shot In-Context Learning for Named Entity Recognition
Many-shot ICL with LLMs matches or exceeds supervised BERT on NER and generates high-quality labels for low-resource settings, producing ~10% absolute F1 gains when used to fine-tune BERT.
-
Behavioral and Representational Evidence of Binomial Ordering Preferences in Large Language Models
LLMs recover dominant binomial orders from corpora but align less closely with exact preference distributions, with preference strength partially encoded in middle-to-late layers and manipulable via steering.
-
PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback
PsyScore combines a Trait-Adaptive Neural IRT Scorer using GPCM with a ZPD-Scaffolded Feedback Generator to deliver both competitive scoring and pedagogically aligned feedback on the ASAP++ dataset.
-
RedactionBench
Introduces a 200-document benchmark and character-level R-Score for contextual PII redaction, with model evaluations and human agreement data showing the task remains unsolved.
-
Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation
Activation steering on early layers improves diversity of synthetic data for low-resource languages and often boosts downstream classifier performance compared to non-steered prompting.
-
Debiasing Without Protected Attributes: Latent Concept Erasure from Textual Profiles
H-SAL erases latent concepts from text profiles using self-descriptions as implicit debiasing signals and shows competitive performance on a new multi-domain Stack Exchange helpfulness benchmark.
-
Schützen: Evaluating LLM Safety in Bulgarian and German Contexts
Schützen is a German-Bulgarian LLM safety dataset showing pronounced cross-language differences in model safety behavior.
-
Adversarial Creation and Detection of AI-Generated Social Bot Content
An adversarial methodology generates a multilingual cross-platform dataset of paired human-AI social messages, and models trained on it outperform prior detectors on real-world out-of-distribution data.
-
Boosting Self-Consistency with Ranking
RISC reformulates self-consistency answer selection as a ranking task solved by a lightweight LambdaRank model with five hand-designed features, yielding better accuracy-efficiency trade-offs than majority voting on QA benchmarks.
-
CRAFT: Cost-aware Refinement And Front-aware Tuning of Prompts
CRAFT is a Pareto-front prompt optimizer that allocates scarce LLM validation calls to candidates near the current front using accuracy- and cost-oriented generators plus NSGA-II retention.
-
Efficient ASR Training with Conversations that Never Happened
Mixing 636 hours of LLM-generated synthetic conversations with 67 hours of real data outperforms a model trained on 2700 hours of real Hungarian speech on the BEA-Dialogue benchmark.
-
Understanding LLM Behavior in Multi-Target Cross-Lingual Summarization
Introduces the MEA benchmark for multi-target cross-lingual summarization across 24 languages and demonstrates that activation steering from English summarization representations improves performance.
-
Child-directed speech facilitates production, not comprehension, in BabyLMs
CDS-trained BabyLMs show earlier and more appropriate production in a new frame-completion task while FineWeb-edu models lead on comprehension benchmarks, indicating current tests underestimate CDS benefits.
-
Give it Space! Explicit Disentangling of Positional and Semantic Representations in Encoders
Explicitly disentangling semantic and positional streams in a Transformer encoder reveals that absolute positional representations collapse to a 2D document-structure manifold, attention heads specialize by role, and the approach improves linguistic probing performance on 49 of 65 phenomena.
-
Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions
LINK improves cross-lingual knowledge transfer via lexical substitutions in English pretraining data, yielding notable downstream gains and up to 2x training speedup across eight languages and five model sizes.
-
Single-Pass, Depth-Selective Reading for Multi-Aspect Sentiment Analysis
DABS is a single-pass framework that builds a depth-ordered substrate from one Transformer encoding and performs lightweight aspect-conditioned readout, cutting computation by up to 60% on multi-aspect ATSA benchmarks while matching prior accuracy.