Time-RA reformulates time series anomaly detection as a reasoning-intensive generative task and provides the RATs40K multimodal benchmark to evaluate and improve LLM-based diagnosis.
hub Mixed citations
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Mixed citation behavior. Most common role is background (44%).
abstract
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Modern Hopfield networks store exponentially many patterns, retrieve them in one update, and have an update rule equivalent to transformer attention, enabling new Hopfield layers that improve results on multiple instance learning and drug design tasks.
T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colossal Clean Crawled Corpus.
Introduces TEDBench benchmark and MiAE self-supervised framework that outperforms baselines for large-scale protein fold classification.
An attack aligns differently shuffled intermediate activations from secure Transformer inference queries to recover model weights with low error using roughly one dollar of queries.
ADE scales multi-anchor word representations to transformers via Vocabulary Projection, Grouped Positional Encoding, and context-aware reweighting, achieving 98.7% fewer trainable parameters than DeBERTa-v3-base while matching or exceeding it on two text-classification benchmarks and compressing the
A broad empirical benchmark shows how 15 existing test selection metrics perform for fault detection, performance estimation, and retraining under corrupted, adversarial, temporal, natural, and label shifts across image, text, and Android data.
Bangla Key2Text releases 2.6M keyword-text pairs and demonstrates that fine-tuned mT5 and BanglaT5 outperform zero-shot LLMs on keyword-conditioned Bangla text generation.
Entity signals cover only 19.7% of relevant documents on Robust04 and no configuration among 443 systems improves MAP by more than 0.05 in open-world evaluation, despite gains when entities are pre-restricted.
Ditto quantizes Code LLMs with K-Means codebooks and compiles inference via LLVM-BLAS replacement to deliver up to 10.5x faster, 6.4x smaller, and 10.5x lower-energy execution on commodity hardware while losing only 0.27% pass@1 accuracy.
In ridgeless regression with low intrinsic dimension, discrepancy between weak and strong models reduces W2S generalization variance by dim(V_s)/N in the discrepant subspace while inheriting it in the overlap.
Creates a 54k-instance benchmark of GitHub issue secrets and shows fine-tuned LLMs reach 94.49% F1 with 81.6% on 178 real repositories.
MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.
EVA-CLIP delivers improved CLIP training recipes that yield 82.0% zero-shot ImageNet-1K accuracy for a 5B-parameter model after only 9 billion samples.
Hugging Face releases an open-source Python library that supplies a unified API and pretrained weights for major Transformer architectures used in natural language processing.
CoMET achieves strong multimodal classification performance by composing frozen modality encoders, PCA compression, and tabular foundation models without any training, reaching state-of-the-art on diverse benchmarks including large-scale hierarchical tasks.
Classical machine learning models outperform Hungarian transformers slightly in overall performance (71% vs 68% average score) for classifying reflection levels in student essays, though transformers handle rare classes better.
ESsEN is a parameter-efficient two-tower vision-language transformer that matches larger models on discriminative tasks after training end-to-end with limited data and resources.
A supervision construction procedure generates explicit support and controlled non-support examples (counterfactual and topic-related negatives) without manual annotation, producing verifiers that demonstrate genuine evidence dependence in radiology tasks.
Benchmarks five compressed transformer models for multi-platform sentiment classification on 15-minute city discourse, reporting DistilRoBERTa highest F1 of 0.8292 and platform-specific performance differences.
A new labeled dataset of 9,969 Israel-Palestine Reddit comments is created and used to compare stance classification methods, with a specific Mixtral prompt achieving the highest performance.
Constructs gender-perturbed Bangla classification benchmarks and proposes RandSymKL debiasing that reduces extrinsic gender bias in pretrained models.
Category theory proves prompt-based learning on perfect foundation models works only for representable tasks, fine-tuning solves tasks in the pretext category, and models can represent unseen target-category objects using source-category structure.
Benchmark study of ten GNN explainers on eight architectures and six datasets that isolates usable components and issues practical recommendations.
citing papers explorer
-
Time-RA: Towards Time Series Reasoning for Anomaly Diagnosis with LLM Feedback
Time-RA reformulates time series anomaly detection as a reasoning-intensive generative task and provides the RATs40K multimodal benchmark to evaluate and improve LLM-based diagnosis.
-
Hopfield Networks is All You Need
Modern Hopfield networks store exponentially many patterns, retrieve them in one update, and have an update rule equivalent to transformer attention, enabling new Hopfield layers that improve results on multiple instance learning and drug design tasks.
-
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colossal Clean Crawled Corpus.
-
Protein Fold Classification at Scale: Benchmarking and Pretraining
Introduces TEDBench benchmark and MiAE self-supervised framework that outperforms baselines for large-scale protein fold classification.
-
On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference
An attack aligns differently shuffled intermediate activations from secure Transformer inference queries to recover model weights with low error using roughly one dollar of queries.
-
ADE: Adaptive Dictionary Embeddings -- Scaling Multi-Anchor Representations to Large Language Models
ADE scales multi-anchor word representations to transformers via Vocabulary Projection, Grouped Positional Encoding, and context-aware reweighting, achieving 98.7% fewer trainable parameters than DeBERTa-v3-base while matching or exceeding it on two text-classification benchmarks and compressing the
-
Empirical Insights of Test Selection Metrics under Multiple Testing Objectives and Distribution Shifts
A broad empirical benchmark shows how 15 existing test selection metrics perform for fault detection, performance estimation, and retraining under corrupted, adversarial, temporal, natural, and label shifts across image, text, and Android data.
-
Bangla Key2Text: Text Generation from Keywords for a Low Resource Language
Bangla Key2Text releases 2.6M keyword-text pairs and demonstrates that fine-tuned mT5 and BanglaT5 outperform zero-shot LLMs on keyword-conditioned Bangla text generation.
-
Entities as Retrieval Signals: A Systematic Study of Coverage, Supervision, and Evaluation in Entity-Oriented Ranking
Entity signals cover only 19.7% of relevant documents on Robust04 and no configuration among 443 systems improves MAP by more than 0.05 in open-world evaluation, despite gains when entities are pre-restricted.
-
Compiling Code LLMs into Lightweight Executables
Ditto quantizes Code LLMs with K-Means codebooks and compiles inference via LLVM-BLAS replacement to deliver up to 10.5x faster, 6.4x smaller, and 10.5x lower-energy execution on commodity hardware while losing only 0.27% pass@1 accuracy.
-
Discrepancies are Virtue: Weak-to-Strong Generalization through Lens of Intrinsic Dimension
In ridgeless regression with low intrinsic dimension, discrepancy between weak and strong models reduces W2S generalization variance by dim(V_s)/N in the discrepant subspace while inheriting it in the overlap.
-
Secret Leak Detection in Software Issue Reports using LLMs: A Comprehensive Evaluation
Creates a 54k-instance benchmark of GitHub issue secrets and shows fine-tuned LLMs reach 94.49% F1 with 81.6% on 178 real repositories.
-
Demystifying CLIP Data
MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.
-
EVA-CLIP: Improved Training Techniques for CLIP at Scale
EVA-CLIP delivers improved CLIP training recipes that yield 82.0% zero-shot ImageNet-1K accuracy for a 5B-parameter model after only 9 billion samples.
-
HuggingFace's Transformers: State-of-the-art Natural Language Processing
Hugging Face releases an open-source Python library that supplies a unified API and pretrained weights for major Transformer architectures used in natural language processing.
-
Modular Multimodal Classification Without Fine-Tuning: A Simple Compositional Approach
CoMET achieves strong multimodal classification performance by composing frozen modality encoders, PCA compression, and tabular foundation models without any training, reaching state-of-the-art on diverse benchmarks including large-scale hierarchical tasks.
-
Automatic Reflection Level Classification in Hungarian Student Essays
Classical machine learning models outperform Hungarian transformers slightly in overall performance (71% vs 68% average score) for classifying reflection levels in student essays, though transformers handle rare classes better.
-
ESsEN: Training Compact Discriminative Vision-Language Transformers in a Low-Resource Setting
ESsEN is a parameter-efficient two-tower vision-language transformer that matches larger models on discriminative tasks after training end-to-end with limited data and resources.
-
Case-Grounded Evidence Verification: A Framework for Constructing Evidence-Sensitive Supervision
A supervision construction procedure generates explicit support and controlled non-support examples (counterfactual and topic-related negatives) without manual annotation, producing verifiers that demonstrate genuine evidence dependence in radiology tasks.
-
A Transformer-Based Cross-Platform Analysis of Public Discourse on the 15-Minute City Paradigm
Benchmarks five compressed transformer models for multi-platform sentiment classification on 15-minute city discourse, reporting DistilRoBERTa highest F1 of 0.8292 and platform-specific performance differences.
-
Social media polarization during conflict: Insights from an ideological stance dataset on Israel-Palestine Reddit comments
A new labeled dataset of 9,969 Israel-Palestine Reddit comments is created and used to compare stance classification methods, with a specific Mixtral prompt achieving the highest performance.
-
Mitigating Extrinsic Gender Bias for Bangla Classification Tasks
Constructs gender-perturbed Bangla classification benchmarks and proposes RandSymKL debiasing that reduces extrinsic gender bias in pretrained models.
-
On the Power of Foundation Models
Category theory proves prompt-based learning on perfect foundation models works only for representable tasks, fine-tuning solves tasks in the pretext category, and models can represent unseen target-category objects using source-category structure.
-
Explaining the Explainers in Graph Neural Networks: a Comparative Study
Benchmark study of ten GNN explainers on eight architectures and six datasets that isolates usable components and issues practical recommendations.
-
Ideology Prediction of German Political Texts
Transformer models predict German political ideology on a continuous left-right scale, reaching F1 0.844 in-domain and MAE 0.172 on newspaper out-of-domain tests.
-
Detecting Alarming Student Verbal Responses using Text and Audio Classifier
A hybrid text-plus-audio classifier framework is introduced to identify potentially troubling student responses by analyzing both what is said and how it is said.
-
LLM4Log: A Systematic Review of Large Language Model-based Log Analysis
Systematic review of 145 papers on LLM-based log analysis, providing a unified taxonomy, common design patterns, evaluation practices, and challenges for deployment under drift and limited labels.
-
Two-Stage Multimodal Framework for Emotion Mimicry Intensity Prediction
A staged multimodal fusion model for predicting six continuous emotion intensities from in-the-wild video achieves 0.4722 validation and 0.57 test Pearson correlation in the EMI challenge.
-
LLMs Struggle with Abstract Meaning Comprehension More Than Expected
LLMs struggle with abstract meaning comprehension on SemEval-2021 Task 4 more than fine-tuned models, and a new bidirectional attention classifier yields small accuracy gains of 3-4%.
-
Large Language Models: A Survey
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
-
A Survey of Hallucination in Large Foundation Models
A survey classifying hallucination phenomena specific to large foundation models, establishing evaluation criteria, examining mitigation strategies, and discussing future directions.
-
Bridging Language Models and Financial Analysis
A survey synthesizing recent LLM research and assessing its applicability to financial data analysis.