LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).
hub
Text summarization branches out , pages=
34 Pith papers cite this work. Polarity classification is still indexing.
hub tools
representative citing papers
CoVUBench is the first benchmark framework for evaluating multimodal copyright unlearning in LVLMs via synthetic data, systematic variations, and a dual protocol for forgetting efficacy and utility preservation.
S²R² improves robustness of LoRA-tuned LLMs to prompt perturbations by penalizing semantic-segment drift while preserving clean performance and cross-dataset transfer.
SRA reframes CTKD by aligning attention-weighted span centers of mass in a multi-particle system model with geometric regularization and span logit distillation, claiming consistent outperformance over baselines.
ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisoned samples.
Min-K% Prob detects pretraining data in LLMs by flagging outlier low-probability words in text, achieving 7.4% better performance than prior methods on the new WIKIMIA benchmark.
Presents the NATURAL INSTRUCTIONS meta-dataset and shows generative pre-trained language models achieve 19% better generalization to unseen tasks when using task instructions.
LLMs generate adequate counterspeech for co-occurring hate and misinformation in 40% of cases, with a mixed knowledge strategy from fact-checkers and NGOs proving most effective after expert revision.
Introduces the TCR framework to evaluate educational LLM assistants on transparency, consistency, and refinement in multi-turn interactions, complementing aggregate metrics.
Vocabulary adaptation via targeted token addition and replacement improves semantic similarity, domain word usage, and training efficiency for LLM summarization in legal and medical domains.
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.
A learnable continuous perturbation framework for LLM token prefixes via latent vector transformations, optimized through unbiased estimating equations, yields gains in out-of-domain performance.
Perturbing prefixes to semantic neighbors during training creates a hierarchical noise model that improves language model predictions on token sequences outside the training corpus support.
LVLM unlearning benchmarks fail due to initial memorization failures on fictitious data; ReMem benchmark with multi-hop and multi-image scaling plus Exposure metric enables reliable learning and unlearning diagnosis.
SiPeR improves recommendation accuracy and response quality in situated conversations by estimating scene transitions and performing Bayesian inverse inference with multimodal LLMs.
A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.
Bangla Key2Text releases 2.6M keyword-text pairs and demonstrates that fine-tuned mT5 and BanglaT5 outperform zero-shot LLMs on keyword-conditioned Bangla text generation.
Unsupervised single-generation confidence calibration for reasoning LLMs via offline self-consistency proxy distillation outperforms baselines on math and QA tasks and improves selective prediction.
Small LLMs under 2B parameters achieve better economic break-even, energy efficiency, and hardware density than larger models on legacy GPUs for industrial tasks.
LLMs show mixed results on authorship verification, post generation, and attribute inference from Twitter data, with new frameworks and user studies establishing benchmarks for these analytics tasks.
MFMDQwen is the first open-source LLM for multilingual financial misinformation detection, backed by a new instruction dataset and benchmark on which it outperforms other open-source models.
A score-ranking loss enables controllable summarization by aligning outputs to evaluation scores, matching SOTA performance with dimension-specific control on LLaMA, Qwen, and Mistral.
DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods and outperforms standard DPO on diverse datasets, MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.
LLMs are highly sensitive to prompt formatting in few-shot settings, with accuracy varying by up to 76 points across formats; FormatSpread samples formats to report performance intervals without model weights.