Presents the CULTURE-MT benchmark for UGC translation, which focuses on cultural transmission and emotion resonance; tests show that traditional metrics miss cultural effectiveness and that larger models perform better on the benchmark.
emnlp-main.1173/
24 Pith papers cite this work, alongside 3 external citations.
representative citing papers
Retrieving structured thinking traces as a corpus improves reasoning performance on AIME, LiveCodeBench, and GPQA over standard RAG or no retrieval.
VOLTBench quantifies length volatility in LLM long-form generation; GLoBo, a logits-boosting decoder, increases mean length by 148% and cuts volatility by 69% while preserving quality.
Effective multilingual reasoning in large models relies on language-specific patterns in reasoning features rather than uniform English-like traces.
MobiFlow is a new evaluation framework for mobile agents using trajectory fusion on 240 tasks across 20 third-party apps, achieving higher alignment with human judgments than prior benchmarks.
ICRM casts reward modeling as amortized variational inference over a latent preference probability with a Beta prior, enabling test-time adaptation to unseen preferences and improving benchmark performance.
ITCR integrates conformal prediction into reasoning graph generation to achieve valid factuality coverage guarantees at inference time for LLMs.
Unlearning in multilingual LLMs suppresses rather than erases knowledge in later layers, with transfer varying by language similarity and reversible via inference-time steering.
Bidirectional Evolutionary Search augments autoregressive expansion with evolutionary recombination operators and dense backward subgoal feedback to produce better candidates than standard best-of-N or tree search for language model self-improvement.
LLMs exhibit Pseudo-Deliberation where explicit reasoning fails to align stated values with generated actions, measured via the new VALDI framework across 4,941 scenarios in five domains.
ConfLayers dynamically skips LLM layers based on confidence scores to create adaptive draft models for self-speculative decoding, reporting up to 1.4x speedup over standard generation.
GRM ranks Mel bands by attack contribution versus utility sensitivity, perturbs a subset, and learns a universal perturbation to reach 88.46% average jailbreak success rate with improved attack-utility trade-off on four audio LLMs.
Adaptive regularization guided by training-time safety risk signals from judges or activations prevents safety degradation in fine-tuned language models while preserving utility.
RE-TAB uses a deterministic LCS-based table-state reward for stepwise guidance and test-time scaling, raising LLM table-reasoning accuracy by 26.7 pp on average across six backbones and three benchmarks.
LLMs deviate from human moral preferences in kidney allocation scenarios and rarely express indecision, though low-rank fine-tuning with few examples can improve both consistency and uncertainty calibration.
Proposes temporal and structural credit assignment plus a discrete verbalized block coordinate descent algorithm to optimize prompts in LLM multi-agent systems, claiming reduced query complexity and better performance on reasoning benchmarks.
VCap pairs reference captions as witnesses with visual signals as adjudicators to deliver hypergeometric-precision rewards for RL in visual captioning, enabling an 8B model to outperform SOTA on benchmarks and improve weak-to-strong generalization.
Dynamic scaled gradient descent prevents fine-tuning collapse by dynamically down-weighting gradients of correct examples, yielding lower performance variance and higher accuracy than standard methods on classification benchmarks.
A unified evaluation finds LLM query reformulation gains are strongly conditioned on retrieval paradigm, do not consistently transfer to neural retrievers, and are not uniformly improved by larger LLMs.
LLM agents in an opposing-incentive NYC simulation develop limited selective trust and deception through KTO policy updates but stay 70% susceptible to adversarial persuasion.
Heuristic demonstration selection methods outperform embedding-based methods for practical LLM-based next POI prediction on three real-world datasets.
AI benchmark evaluations require standardized item-level data releases as core infrastructure to support validity assessment, demonstrated via the OpenEval archive of 10M responses across 155k items.
Verbal multiword expressions reduce machine translation quality, with the degradation attributable to the expressions themselves rather than general sentence difficulty.
A literature survey that organizes prompting, fine-tuning, preference optimization, and context-aware techniques for LLM-based machine translation with emphasis on low-resource languages.
citing papers explorer
- GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking