Presents CULTURE-MT benchmark for UGC translation focusing on cultural transmission and emotion resonance, with tests showing traditional metrics miss cultural effectiveness and larger models perform better on it.
hub
emnlp-main.1173/
24 Pith papers cite this work, alongside 3 external citations.
citing papers explorer
-
Beyond Literal Translation: Evaluating Cultural Effectiveness in Social Media UGC
Presents CULTURE-MT benchmark for UGC translation focusing on cultural transmission and emotion resonance, with tests showing traditional metrics miss cultural effectiveness and larger models perform better on it.
-
RAG over Thinking Traces Can Improve Reasoning Tasks
Retrieving structured thinking traces as a corpus improves reasoning performance on AIME, LiveCodeBench, and GPQA over standard RAG or no retrieval.
-
On Stable Long-Form Generation: Benchmarking and Mitigating Length Volatility
VOLTBench quantifies length volatility in LLM long-form generation; GLoBo, a logits-boosting decoder, increases mean length by 148% and cuts volatility by 69% while preserving quality.
-
What Makes Good Multilingual Reasoning? Disentangling Reasoning Traces with Measurable Features
Effective multilingual reasoning in large models relies on language-specific patterns in reasoning features rather than uniform English-like traces.
-
MobiFlow: Real-World Mobile Agent Benchmarking through Trajectory Fusion
MobiFlow is a new evaluation framework for mobile agents using trajectory fusion on 240 tasks across 20 third-party apps, achieving higher alignment with human judgments than prior benchmarks.
-
Bayesian Preference Learning for Test-Time Steerable Reward Models
ICRM casts reward modeling as amortized variational inference over a latent preference probability with a Beta prior, enabling test-time adaptation to unseen preferences and improving benchmark performance.
-
Inference-Time Conformal Reasoning with Valid Factuality Control for Large Language Models
ITCR integrates conformal prediction into reasoning graph generation to achieve valid factuality coverage guarantees at inference time for LLMs.
-
Multilingual Unlearning in LLMs: Transfer, Dynamics, and Reversibility
Unlearning in multilingual LLMs suppresses rather than erases knowledge in later layers, with transfer varying by language similarity and reversible via inference-time steering.
-
Self-Improving Language Models with Bidirectional Evolutionary Search
Bidirectional Evolutionary Search augments autoregressive expansion with evolutionary recombination operators and dense backward subgoal feedback to produce better candidates than standard best-of-N or tree search for language model self-improvement.
-
Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions
LLMs exhibit Pseudo-Deliberation where explicit reasoning fails to align stated values with generated actions, measured via the new VALDI framework across 4,941 scenarios in five domains.
-
ConfLayers: Adaptive Confidence-based Layer Skipping for Self-Speculative Decoding
ConfLayers dynamically skips LLM layers based on confidence scores to create adaptive draft models for self-speculative decoding, reporting up to 1.4x speedup over standard generation.
-
GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking
GRM ranks Mel bands by attack contribution versus utility sensitivity, perturbs a subset, and learns a universal perturbation to reach 88.46% average jailbreak success rate with improved attack-utility trade-off on four audio LLMs.
-
Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning
Adaptive regularization guided by training-time safety risk signals from judges or activations prevents safety degradation in fine-tuned language models while preserving utility.
-
Enhancing Table Reasoning with Deterministic Table-State Rewards
RE-TAB uses a deterministic LCS-based table-state reward for stepwise guidance and test-time scaling, raising LLM table-reasoning accuracy by 26.7 pp on average across six backbones and three benchmarks.
-
Who Gets the Kidney? Human-AI Alignment, Indecision, and Moral Values
LLMs deviate from human moral preferences in kidney allocation scenarios and rarely express indecision, though low-rank fine-tuning with few examples can improve both consistency and uncertainty calibration.
-
Unifying Temporal and Structural Credit Assignment in LLM-Based Multi-Agent Prompt Optimization
Proposes temporal and structural credit assignment plus a discrete verbalized block coordinate descent algorithm to optimize prompts in LLM multi-agent systems, claiming reduced query complexity and better performance on reasoning benchmarks.
-
VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning
VCap pairs reference captions as witnesses with visual signals as adjudicators to deliver hypergeometric-precision rewards for RL in visual captioning, enabling an 8B model to outperform SOTA on benchmarks and improve weak-to-strong generalization.
-
Dynamic Scaled Gradient Descent for Stable Fine-Tuning for Classifications
Dynamic scaled gradient descent prevents fine-tuning collapse by dynamically down-weighting gradients of correct examples, yielding lower performance variance and higher accuracy than standard methods on classification benchmarks.
-
A Reproducibility Study of LLM-Based Query Reformulation
A unified evaluation finds LLM query reformulation gains are strongly conditioned on retrieval paradigm, do not consistently transfer to neural retrievers, and are not uniformly improved by larger LLMs.
-
CONSCIENTIA: Can LLM Agents Learn to Strategize? Emergent Deception and Trust in a Multi-Agent NYC Simulation
LLM agents in an opposing-incentive NYC simulation develop limited selective trust and deception through KTO policy updates but stay 70% susceptible to adversarial persuasion.
-
A Comparative Study of Demonstration Selection for Practical Large Language Models-based Next POI Prediction
Heuristic demonstration selection methods outperform embedding-based methods for practical LLM-based next POI prediction on three real-world datasets.
-
AI Evaluation Should Require Standardized Item-Level Data Releases
AI benchmark evaluations require standardized item-level data releases as core infrastructure to support validity assessment, demonstrated via the OpenEval archive of 10M responses across 155k items.
-
Evaluating the Impact of Verbal Multiword Expressions on Machine Translation
Verbal multiword expressions reduce machine translation quality, with the degradation attributable to the expressions themselves rather than general sentence difficulty.
-
Bridging the Linguistic Divide: A Survey on Leveraging Large Language Models for Machine Translation
A literature survey that organizes prompting, fine-tuning, preference optimization, and context-aware techniques for LLM-based machine translation with emphasis on low-resource languages.