{"total":68,"items":[{"citing_arxiv_id":"2606.08814","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"STAR: Rethinking MoE Routing as Structure-Aware Subspace Learning","primary_cat":"cs.AI","submitted_at":"2026-06-07T20:07:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"STAR rethinks MoE routing as structure-aware subspace learning by adding a GHA-tracked principal subspace to standard routers, yielding more stable specialization and better performance on synthetic, language, and vision tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23796","ref_index":48,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UniSpike: Accelerating Spiking Neural Networks on Neuromorphic Systems via Eliminating Address Redundancy","primary_cat":"cs.NE","submitted_at":"2026-05-22T15:57:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UniSpike eliminates address redundancy in spike packets via co-design of scheduling, runtime assembly hardware, and SNN partitioning, reporting 1.93x average traffic reduction, 1.77x speedup, and 1.50x energy improvement.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23641","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Kernel-Based ReLU Approximation for Homomorphic Encryption-Compatible Privacy-preserving Deep Learning Models","primary_cat":"cs.CR","submitted_at":"2026-05-22T13:54:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Kernel-based ReLU is approximated by a quadratic polynomial for low-depth homomorphic encryption compatibility, trained on LLM token 
embeddings and evaluated across DL and transformer settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17829","ref_index":56,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Interactive Evaluation Requires a Design Science","primary_cat":"cs.AI","submitted_at":"2026-05-18T04:03:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axis taxonomy and design principles.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18879","ref_index":18,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ZeroUnlearn: Few-Shot Knowledge Unlearning in Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-16T03:10:36+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16704","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Convex Dataset Valuation for Post-Training","primary_cat":"cs.LG","submitted_at":"2026-05-15T23:35:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A convex KMM-based valuation method that accounts for both target-task alignment and inter-dataset redundancy in gradient space outperforms standard gradient-alignment baselines for LLM post-training data 
selection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16470","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Strategic Over-Parameterization for Generalizable Low-Rank Adaptation","primary_cat":"cs.LG","submitted_at":"2026-05-15T12:26:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LoRA-Over injects auxiliary parameters into low-rank adapters during training and decomposes them back into standard LoRA at inference, with static or dynamic scheduling to allocate extra capacity where needed, yielding better generalization than vanilla LoRA on GLUE, MT-Bench, GSM8K and HumanEval.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15413","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Transformer Scalability Crisis: The First Comprehensive Empirical Analysis of Performance Walls in Modern Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-14T20:57:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Empirical tests on 118 transformers show success falling from 88.1% at 512 tokens to 0% at 2048 tokens, with compressed models achieving 649.2 tokens/sec/M parameters versus 12.5 for large generative ones.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14055","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts","primary_cat":"cs.CL","submitted_at":"2026-05-13T19:25:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PEML 
co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11598","ref_index":66,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EpiCastBench: Datasets and Benchmarks for Multivariate Epidemic Forecasting","primary_cat":"cs.LG","submitted_at":"2026-05-12T06:22:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EpiCastBench supplies 40 curated multivariate epidemic datasets and evaluates 15 forecasting models under unified preprocessing, horizons, metrics, and significance tests.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"across a wide range of epidemic prediction tasks [8]. Despite this progress, a significant challenge in epidemic forecasting research is the limited availability of comprehensive multivariate datasets. Unlike fields such as computer vision and natural language processing, which benefit from large-scale standardized repositories like ImageNet [24] and GLUE [66], the epidemic forecasting domain lacks comparable benchmark datasets. While archives such as the Monash Time Series Forecasting Repository [31] provide unified collections for general-purpose time series data, similar large-scale and standardized efforts for epidemic forecasting remain scarce. 
As a result, most existing datasets are either univariate [54], disease-specific [5], or focused on specific"},{"citing_arxiv_id":"2605.09238","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds","primary_cat":"cs.LG","submitted_at":"2026-05-10T00:39:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Intrinsic Muon provides closed-form linear maximization oracles on multiple Riemannian matrix manifolds for unitarily invariant norms, with convergence rates depending only on manifold dimension or rank.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Optimization, 23(2):1214-1236, 2013. [59] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR), 2019. URL https://openreview.net/forum?id=rJ4km2R5t7. arXiv:1804.07461. [60] Feng Wang, Jian Cheng, Weiyang Liu, and Haijun Liu. Additive margin softmax for face verification. IEEE Signal Processing Letters, 25(7):926-930, 2018. [61] Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, and Tieniu Tan. Taming momentum: Rethinking optimizer states through low-rank approximation. 
In International Conference on Learning Representations"},{"citing_arxiv_id":"2605.10989","ref_index":124,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SURGE: Surrogate Gradient Adaptation in Binary Neural Networks","primary_cat":"cs.LG","submitted_at":"2026-05-09T09:52:38+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08734","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AdaPreLoRA: Adafactor Preconditioned Low-Rank Adaptation","primary_cat":"cs.LG","submitted_at":"2026-05-09T06:37:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AdaPreLoRA pairs the Adafactor diagonal Kronecker preconditioner on the full weight matrix with a closed-form factor-space solve that selects the update minimizing an H_t-weighted imbalance, yielding competitive results on GPT-2, Mistral-7B, Qwen2-7B and diffusion personalization tasks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"to a different generation benchmark. Cross-rank ablations at r ∈ {16, 64} (Appendix E.2, Table 10) confirm the same ordering. 4.2 Extension to 7B-scale LLMs: Mistral-7B and Qwen2-7B We next test whether the controlled-setting gains carry over to the 7B parameter scale, using Mistral-7B [15] and Qwen2-7B [35]. For Mistral-7B we fine-tune on the GLUE [31] tasks RTE, CoLA, and 8 Table 3: Scores of GPT-2 small model (rank=4) fine-tuned using different optimizers. Evaluation is conducted on DART dataset. 
Methods BLEU↑ METEOR↑ chrF++↑ TER↓ BLEURT↑ SGD 41.2 0.63 0.59 0.52 0.33 Scaled GD 43.8 0.66 0.61 0.50 0.38 LoRA-Pro SGD 44.1 0.66 0.61 0.50 0.38 AdaPreLoRA SGD (ours) 44.6 0.66 0.62 0.49 0.39 AdamW 43.9 0.66 0."},{"citing_arxiv_id":"2605.05974","ref_index":43,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts","primary_cat":"cs.CR","submitted_at":"2026-05-07T10:19:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PragLocker generates function-preserving but non-portable prompts for LLM agents via code-symbol semantic anchoring followed by target-model feedback noise injection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26587","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Sparse-on-Dense: Area and Energy-Efficient Computing of Sparse Neural Networks on Dense Matrix Multiplication Accelerators","primary_cat":"cs.AR","submitted_at":"2026-04-29T12:10:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Sparse neural networks achieve better area and energy efficiency when executed on dense matrix multiplication accelerators using a Sparse-on-Dense approach than on dedicated sparse accelerators.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02930","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Analysis and Explainability of LLMs Via Evolutionary Methods","primary_cat":"cs.NE","submitted_at":"2026-04-27T18:07:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Evolutionary trees from LLM weights recover 
ground-truth training topologies and identify key datasets and layers through phenotypic analysis.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23647","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Hardware-Efficient Softmax and Layer Normalization with Guaranteed Normalization for Edge Devices","primary_cat":"cs.AR","submitted_at":"2026-04-26T10:34:04+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Hardware approximations for Softmax and LayerNorm preserve exact normalization guarantees and deliver up to 14x area reduction in 28nm silicon with negligible accuracy loss on GLUE, SQuAD, and perplexity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21555","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Finding Meaning in Embeddings: Concept Separation Curves","primary_cat":"cs.CL","submitted_at":"2026-04-23T11:29:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Concept Separation Curves provide a classifier-independent method to visualize and quantify how sentence embeddings distinguish conceptual meaning from syntactic variations across languages and domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18124","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TLoRA: Task-aware Low Rank Adaptation of Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-20T11:43:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TLoRA jointly optimizes LoRA initialization via task-data SVD and 
sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer trainable parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13440","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A KL Lens on Quantization: Fast, Forward-Only Sensitivity for Mixed-Precision SSM-Transformer Models","primary_cat":"cs.LG","submitted_at":"2026-04-15T03:40:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"KL divergence provides a superior forward-only metric for identifying quantization-sensitive parts in SSM-Transformer hybrids, outperforming MSE and SQNR and supporting practical mixed-precision deployment on edge devices.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12365","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Adaptive Spiking Neurons for Vision and Language Modeling","primary_cat":"cs.NE","submitted_at":"2026-04-14T06:53:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ASN uses trainable parameters for adaptive membrane dynamics and firing in SNNs, with NASN adding normalization, and reports effectiveness across 19 vision and language datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11575","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MIXAR: Scaling Autoregressive Pixel-based Language Models to Multiple Languages and 
Scripts","primary_cat":"cs.CL","submitted_at":"2026-04-13T14:53:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MIXAR is the first autoregressive pixel-based language model for eight languages and scripts, with empirical gains on multilingual tasks, robustness to unseen languages, and further improvements when scaled to 0.5B parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11321","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Winner-Take-All Spiking Transformer for Language Modeling","primary_cat":"cs.NE","submitted_at":"2026-04-13T11:23:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Winner-take-all spiking self-attention replaces softmax in spiking transformers to support language modeling on 16 datasets with spike-driven, energy-efficient architectures.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10649","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SpectralLoRA: Is Low-Frequency Structure Sufficient for LoRA Adaptation? 
A Spectral Analysis of Weight Updates","primary_cat":"cs.LG","submitted_at":"2026-04-12T13:54:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LoRA weight updates are spectrally sparse, with 33% of DCT coefficients capturing 90% of energy on average, enabling 10x storage reduction and occasional gains by masking high frequencies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09088","ref_index":90,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation","primary_cat":"cs.CV","submitted_at":"2026-04-10T08:16:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MDPD mutually distills knowledge between a frozen backbone and a learnable side network during fine-tuning, then discards the side network at inference to accelerate speed by at least 25% while preserving accuracy.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"ing: (1) Image-Text Retrieval (ITR) on Flickr30K [104] and MSCOCO [60]; (2) Video-Text Retrieval (VTR) on MSVD [6] and MSR-VTT [99]; (3) Question Answering (VQA&GQA) on VQAv2 [29] and GQA [42]; and (4) Visual Grounding (VG) on RefCOCO, RefCOCO+ [105] and RefCOCOg [66]. Additionally, we evaluate our approach on vision-only and language-only tasks on VTAB-1K [107] and GLUE benchmark [90], respectively. Evaluation Metrics. We report Recall@1 and Rsum on ITR and VTR tasks, overall Accuracy on QA tasks, and mean Average Precision on VG tasks. For GLUE benchmark, we present Accuracy Metric, F1 Score, Matthew's Correlation, Pearson-Spearman Correlation as the evaluation metrics for various datasets respectively. 
Besides, we state Top-1 Accuracy on 19 datasets in VTAB-1K."},{"citing_arxiv_id":"2605.04058","ref_index":109,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning","primary_cat":"cs.LG","submitted_at":"2026-04-10T08:00:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter and memory usage.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03957","ref_index":50,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design","primary_cat":"cs.LG","submitted_at":"2026-04-05T04:25:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"BWTA achieves near full-precision accuracy on BERT and LLMs using binary weights and ternary activations, with 16-24x kernel speedups via specialized CUDA kernels.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Next, we assess the model accuracy on both BERT and LLM-based models, comparing with the SOTA low-bit methods in Sec. 6.3. In final, we provide a comprehensive evaluation of our BWTA CUDA kernel, detailing the efficiency for both kernel-level and end-to-end performance in Sec. 6.4. 6.1 Implementation Details Datasets. The evaluation on BERT-based models are conducted on GLUE benchmark [50], consisting of nine basic language tasks. We exclude WNLI task as previous studies do for its relatively small data volume and unstable behavior. 
We evaluate LLM-based models on Wikitext2 and C4 datasets to compare the perplexity, and accuracy performance on CommonsenseQA benchmarks."},{"citing_arxiv_id":"2602.02543","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Norm Anchors Make Model Edits Last","primary_cat":"cs.LG","submitted_at":"2026-01-30T04:31:21+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Norm-Anchor Scaling breaks the norm-feedback loop in sequential LLM editing by anchoring value vectors to original norms, improving long-run performance by 72.2% and extending the editing horizon over 4x.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.02764","ref_index":71,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models","primary_cat":"cs.CL","submitted_at":"2025-12-02T13:44:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PEFT-Factory supplies a ready-to-use, extensible codebase that unifies 19 PEFT methods and evaluation pipelines for fine-tuning large autoregressive language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.21285","ref_index":58,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark","primary_cat":"cs.CL","submitted_at":"2025-11-26T11:18:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PEFT-Bench is a standardized end-to-end benchmark for 7 PEFT methods across 27 NLP datasets on autoregressive LLMs, 
accompanied by the PSCP metric that penalizes based on trainable parameters, inference speed, and training memory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.07969","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Unified Work Embeddings: Contrastive Learning of a Bidirectional Multi-task Ranker","primary_cat":"cs.CL","submitted_at":"2025-11-11T08:28:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"UWE is a task-agnostic bi-encoder that uses many-to-many InfoNCE and token-level soft late interaction to achieve zero-shot ranking across unseen work-related target spaces while using far fewer parameters than Qwen3-8B and improving MAP by 4.4 points.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.18245","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs","primary_cat":"cs.LG","submitted_at":"2025-10-21T03:08:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A conditional scaling law fitted on over 200 models from 80M to 3B parameters identifies architectures that deliver up to 2.1% higher accuracy and 42% higher inference throughput than LLaMA-3.2 under the same training budget.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.23009","ref_index":78,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests 
instead","primary_cat":"cs.LG","submitted_at":"2025-07-30T18:14:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Human tests should not be applied to AI to measure traits like intelligence due to calibration, validity, contamination, and prompt sensitivity issues; develop AI-specific evaluation frameworks instead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.21035","ref_index":68,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Little by Little: Continual Learning via Incremental Mixture of Rank-1 Associative Memory Experts","primary_cat":"cs.LG","submitted_at":"2025-06-26T06:19:05+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.06120","ref_index":83,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LLMs Get Lost In Multi-Turn Conversation","primary_cat":"cs.CL","submitted_at":"2025-05-09T15:21:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs drop 39% in performance during multi-turn conversations due to premature assumptions and inability to recover from early errors.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Proceedings of the CHI Conference on Human Factors in Computing Systems, 2024. URL https://api.semanticscholar.org/CorpusID:267301068. [82] J. Wester, T. Schrills, H. Pohl, and N. van Berkel. \"as an ai language model, i cannot\": Investigating llm denials of user requests. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pages 1-14, 2024. [83] F. Wildenburg, M. Hanna, and S. Pezzelle. 
Do pre-trained language models detect and understand semantic underspecification? ask the dust! ArXiv, abs/2402.12486, 2024. URL https://api.semanticscholar.org/CorpusID:267759784. [84] Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversation."},{"citing_arxiv_id":"2503.10666","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Green Prompting: Characterizing Prompt-driven Energy Costs of LLM Inference","primary_cat":"cs.CL","submitted_at":"2025-03-09T19:49:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Empirical tests on three LLMs show prompt semantics and task keywords drive inference energy costs more than length, with varying patterns by task.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.09457","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Power-Softmax: Towards Secure LLM Inference over Encrypted Data","primary_cat":"cs.LG","submitted_at":"2024-10-12T09:32:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Power-Softmax is a new HE-compatible attention variant that permits training and inference of billion-parameter polynomial LLMs with performance matching standard transformers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.02713","ref_index":118,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LLaVA-Video: Video Instruction Tuning With Synthetic 
Data","primary_cat":"cs.CV","submitted_at":"2024-10-03T17:36:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2409.18169","ref_index":153,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey","primary_cat":"cs.CR","submitted_at":"2024-09-26T17:55:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[102] use base model that has been aligned (i.e., Llama2-7B-chat) may not need this dataset. On the other hand, other papers, e.g., Vaccine [67] and CTRL [100] need to use this dataset to align the unaligned version of the pre-train model. - Choice of dataset. There are three possible datasets available for constructing alignment dataset: i) BeaverTails [77], ii) Decoding Trust [153], and HH-RLHF [10]. All of them provide harmful prompt-safe answers for SFT. For preference training (RLHF), it is recommended to use HH-RLHF because it provides pairs of answers to one harmful prompt. Please check Table 5 for a reference. • Harmful dataset. The harmful dataset contains harmful question/harmful answer pairs. - Usage. 
There are three main usages of this dataset."},{"citing_arxiv_id":"2406.04093","ref_index":66,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Scaling and evaluating sparse autoencoders","primary_cat":"cs.LG","submitted_at":"2024-06-06T14:10:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.01574","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark","primary_cat":"cs.CL","submitted_at":"2024-06-03T17:53:00+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"To rigorously assess and push the capabilities of these LLMs, we introduce MMLU-Pro, a new benchmark designed to test the upper limits of reasoning and knowledge in advanced language models. 2.2 LLMs Evaluation Benchmarks In recent years, the development of various benchmarks has significantly enhanced the evaluation of Large Language Models (LLMs). For instance, GLUE [37] and its successor SuperGLUE [38], have played a pivotal role in advancing language understanding tasks, setting the stage for more specialized evaluations. 
Other recent benchmarks, including MMLU [18], HELM [22], BigBench [32], HellaSwag [45], and the AI2 Reasoning Challenge (ARC) [12], have broadened the scope by assessing capabilities across language generation, knowledge understanding, and complex reasoning [9]."},{"citing_arxiv_id":"2403.14608","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey","primary_cat":"cs.LG","submitted_at":"2024-03-21T17:55:50+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"approach is categorized into two main strategies: (1) Low-rank Decomposition, and (2) LoRA Derivatives. Hybrid fine-tuning explores the design spaces of different PEFT methods and combines their advantages. C. Downstream Tasks for LLM Evaluation Two types of tasks have been widely used for LLM evaluation, the first type is the General Language Understanding Evaluation (GLUE) [11] benchmark, which integrates nine sentence or sentence-pair language understanding tasks (CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, and WNLI), chosen for their diversity in dataset sizes, text genres, and difficulty levels, and is based on established existing datasets. 
It also includes a diagnostic dataset specifically designed to evaluate and analyze model performance across"},{"citing_arxiv_id":"2403.09227","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation","primary_cat":"cs.RO","submitted_at":"2024-03-14T09:48:36+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"BEHAVIOR-1K introduces a benchmark of 1,000 human everyday activities in realistic simulated scenes together with the OMNIGIBSON physics simulator to evaluate embodied AI.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.05561","ref_index":161,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TrustLLM: Trustworthiness in Large Language Models","primary_cat":"cs.CL","submitted_at":"2024-01-10T22:07:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"metrics face limitations, particularly in scenarios where multiple correct methods of text generation exist, as often seen in tasks involving latent content planning or selection, which can also lead to accurate solutions receiving low scores [159, 160]. LLM evaluation datasets and benchmarks are vital in evaluating various language models for tasks, reflecting complex real-world language processing scenarios. 
Benchmarks like GLUE [161] and SuperGLUE [162] encompass various tasks from text categorization and machine translation to dialogue generation. These evaluations are crucial for understanding the capabilities of LLMs in general-purpose language tasks. Additionally, automatic and human evaluations serve as critical methods for LLM evaluation [98]. 3.3 Developers and Their Approaches to Enhancing Trustworthiness in LLMs"},{"citing_arxiv_id":"2311.04799","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DA-Cramming: Enhancing Cost-Effective Language Model Pretraining with Dependency Agreement Integration","primary_cat":"cs.CL","submitted_at":"2023-11-08T16:18:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"DA-Cramming inserts chunk-level dependency agreement embeddings into a dual-stage pretraining pipeline and reports better downstream performance than prior Cramming baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.02277","ref_index":51,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Junk DNA Hypothesis: Pruning Small Pre-Trained Weights Irreversibly and Monotonically Impairs \"Difficult\" Downstream Tasks in LLMs","primary_cat":"cs.LG","submitted_at":"2023-09-29T22:55:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Pruning small-magnitude weights from pre-trained LLMs causes monotonic irreversible performance degradation on difficult downstream tasks, supporting the Junk DNA Hypothesis that these weights hold essential knowledge.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.14509","ref_index":146,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DeepSpeed Ulysses: 
System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models","primary_cat":"cs.LG","submitted_at":"2023-09-25T20:15:57+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2308.03958","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Simple synthetic data reduces sycophancy in large language models","primary_cat":"cs.CL","submitted_at":"2023-08-07T23:48:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Scaling and instruction tuning increase sycophancy in LLMs on opinion and fact tasks, but a synthetic data fine-tuning intervention reduces it on held-out prompts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2306.14048","ref_index":46,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models","primary_cat":"cs.LG","submitted_at":"2023-06-24T20:11:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"rable accuracy on a majority of tasks. Setup. 
Our experiments are based on three representative model families of LLMs, including the OPT [39] with model sizes, LLaMA [40], and GPT-NeoX-20B [41]. We sample eight tasks from two popular evaluation frameworks (HELM [16] and lm-eval-harness [15]): COPA [42], MathQA [43], OpenBookQA [44], PiQA [45], RTE [46], Winogrande [47], XSUM [48], CNN/Daily Mail [49]. Also, we evaluate our approach on recent generation benchmarks, AlpaceEval [50] and MT-bench [51], and the details are included in Appendix. We use NVIDIA A100 80GB GPU. Baselines. Since H2O evenly assigns the caching budget to H2 and the most recent KV, except for full KV cache, we consider the \"Local\" strategy as a baseline method."},{"citing_arxiv_id":"2306.03310","ref_index":67,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning","primary_cat":"cs.AI","submitted_at":"2023-06-05T23:32:26+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.","context_count":1,"top_context_role":"dataset","top_context_polarity":"background","context_text":"1 for a detailed review of lifelong learning algorithms. Lifelong Learning Benchmarks Pioneering work has adapted standard vision or language datasets for studying LL. This line of work includes image classification datasets like MNIST [18], CIFAR [34], and ImageNet [17]; segmentation datasets like Core50 [38]; and natural language understanding datasets like GLUE [67] and SuperGLUE [59]. Besides supervised learning datasets, video game benchmarks (e.g., Atari [46], XLand [64], and VisDoom [30]) in reinforcement learning (RL) have also been used for studying LL. 
However, LL in standard supervised learning does not involve procedural knowledge transfer, while RL problems in games do not represent human activities."}],"limit":50,"offset":0}