emnlp-main.1173/

Association for Computational Linguistics · 2024 · DOI 10.18653/v1/2024.findings-emnlp

24 Pith papers cite this work, alongside 3 external citations (Crossref). Polarity classification is still indexing.

citation-role summary

background 2 · dataset 1

citation-polarity summary

background 2 · use dataset 1

representative citing papers

Beyond Literal Translation: Evaluating Cultural Effectiveness in Social Media UGC

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

Presents CULTURE-MT benchmark for UGC translation focusing on cultural transmission and emotion resonance, with tests showing traditional metrics miss cultural effectiveness and larger models perform better on it.

RAG over Thinking Traces Can Improve Reasoning Tasks

cs.IR · 2026-05-05 · unverdicted · novelty 7.0

Retrieving structured thinking traces as a corpus improves reasoning performance on AIME, LiveCodeBench, and GPQA over standard RAG or no retrieval.

On Stable Long-Form Generation: Benchmarking and Mitigating Length Volatility

cs.CL · 2026-05-02 · unverdicted · novelty 7.0

VOLTBench quantifies length volatility in LLM long-form generation; GLoBo, a logits-boosting decoder, increases mean length by 148% and cuts volatility by 69% while preserving quality.

What Makes Good Multilingual Reasoning? Disentangling Reasoning Traces with Measurable Features

cs.CL · 2026-04-06 · unverdicted · novelty 7.0

Effective multilingual reasoning in large models relies on language-specific patterns in reasoning features rather than uniform English-like traces.

MobiFlow: Real-World Mobile Agent Benchmarking through Trajectory Fusion

cs.AI · 2026-02-28 · unverdicted · novelty 7.0

MobiFlow is a new evaluation framework for mobile agents using trajectory fusion on 240 tasks across 20 third-party apps, achieving higher alignment with human judgments than prior benchmarks.

Bayesian Preference Learning for Test-Time Steerable Reward Models

cs.LG · 2026-02-09 · unverdicted · novelty 7.0

ICRM casts reward modeling as amortized variational inference over a latent preference probability with a Beta prior, enabling test-time adaptation to unseen preferences and improving benchmark performance.

Inference-Time Conformal Reasoning with Valid Factuality Control for Large Language Models

cs.AI · 2026-06-07 · unverdicted · novelty 6.0 · 2 refs

ITCR integrates conformal prediction into reasoning graph generation to achieve valid factuality coverage guarantees at inference time for LLMs.

Multilingual Unlearning in LLMs: Transfer, Dynamics, and Reversibility

cs.CL · 2026-06-02 · unverdicted · novelty 6.0

Unlearning in multilingual LLMs suppresses rather than erases knowledge in later layers, with transfer varying by language similarity and reversible via inference-time steering.

Self-Improving Language Models with Bidirectional Evolutionary Search

cs.CL · 2026-05-27 · unverdicted · novelty 6.0

Bidirectional Evolutionary Search augments autoregressive expansion with evolutionary recombination operators and dense backward subgoal feedback to produce better candidates than standard best-of-N or tree search for language model self-improvement.

Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions

cs.CL · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

LLMs exhibit Pseudo-Deliberation where explicit reasoning fails to align stated values with generated actions, measured via the new VALDI framework across 4,941 scenarios in five domains.

ConfLayers: Adaptive Confidence-based Layer Skipping for Self-Speculative Decoding

cs.LG · 2026-04-16 · unverdicted · novelty 6.0

ConfLayers dynamically skips LLM layers based on confidence scores to create adaptive draft models for self-speculative decoding, reporting up to 1.4x speedup over standard generation.

GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking

cs.SD · 2026-04-10 · unverdicted · novelty 6.0

GRM ranks Mel bands by attack contribution versus utility sensitivity, perturbs a subset, and learns a universal perturbation to reach 88.46% average jailbreak success rate with improved attack-utility trade-off on four audio LLMs.

Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning

cs.CL · 2026-02-19 · unverdicted · novelty 6.0

Adaptive regularization guided by training-time safety risk signals from judges or activations prevents safety degradation in fine-tuned language models while preserving utility.

Enhancing Table Reasoning with Deterministic Table-State Rewards

cs.AI · 2026-01-30 · unverdicted · novelty 6.0

RE-TAB uses a deterministic LCS-based table-state reward for stepwise guidance and test-time scaling, raising LLM table-reasoning accuracy by 26.7 pp on average across six backbones and three benchmarks.

Who Gets the Kidney? Human-AI Alignment, Indecision, and Moral Values

cs.CY · 2025-05-30 · unverdicted · novelty 6.0

LLMs deviate from human moral preferences in kidney allocation scenarios and rarely express indecision, though low-rank fine-tuning with few examples can improve both consistency and uncertainty calibration.

Unifying Temporal and Structural Credit Assignment in LLM-Based Multi-Agent Prompt Optimization

cs.MA · 2026-05-28 · unverdicted · novelty 5.0

Proposes temporal and structural credit assignment plus a discrete verbalized block coordinate descent algorithm to optimize prompts in LLM multi-agent systems, claiming reduced query complexity and better performance on reasoning benchmarks.

VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning

cs.CV · 2026-05-27 · unverdicted · novelty 5.0

VCap pairs reference captions as witnesses with visual signals as adjudicators to deliver hypergeometric-precision rewards for RL in visual captioning, enabling an 8B model to outperform SOTA on benchmarks and improve weak-to-strong generalization.

Dynamic Scaled Gradient Descent for Stable Fine-Tuning for Classifications

cs.LG · 2026-04-30 · unverdicted · novelty 5.0

Dynamic scaled gradient descent prevents fine-tuning collapse by dynamically down-weighting gradients of correct examples, yielding lower performance variance and higher accuracy than standard methods on classification benchmarks.

A Reproducibility Study of LLM-Based Query Reformulation

cs.IR · 2026-04-30 · unverdicted · novelty 5.0

A unified evaluation finds LLM query reformulation gains are strongly conditioned on retrieval paradigm, do not consistently transfer to neural retrievers, and are not uniformly improved by larger LLMs.

CONSCIENTIA: Can LLM Agents Learn to Strategize? Emergent Deception and Trust in a Multi-Agent NYC Simulation

cs.MA · 2026-04-10 · unverdicted · novelty 5.0

LLM agents in an opposing-incentive NYC simulation develop limited selective trust and deception through KTO policy updates but stay 70% susceptible to adversarial persuasion.

A Comparative Study of Demonstration Selection for Practical Large Language Models-based Next POI Prediction

cs.CL · 2026-03-16 · conditional · novelty 5.0

Heuristic demonstration selection methods outperform embedding-based methods for practical LLM-based next POI prediction on three real-world datasets.

AI Evaluation Should Require Standardized Item-Level Data Releases

cs.AI · 2026-02-27 · conditional · novelty 5.0 · 2 refs

AI benchmark evaluations require standardized item-level data releases as core infrastructure to support validity assessment, demonstrated via the OpenEval archive of 10M responses across 155k items.

Evaluating the Impact of Verbal Multiword Expressions on Machine Translation

cs.CL · 2025-08-24 · conditional · novelty 3.0

Verbal multiword expressions reduce machine translation quality, with the degradation attributable to the expressions themselves rather than general sentence difficulty.

Bridging the Linguistic Divide: A Survey on Leveraging Large Language Models for Machine Translation

cs.CL · 2025-04-02 · unverdicted · novelty 3.0

A literature survey that organizes prompting, fine-tuning, preference optimization, and context-aware techniques for LLM-based machine translation with emphasis on low-resource languages.

citing papers explorer

Showing 24 of 24 citing papers.

Beyond Literal Translation: Evaluating Cultural Effectiveness in Social Media UGC
cs.CL · 2026-05-25 · unverdicted · none · ref 4

RAG over Thinking Traces Can Improve Reasoning Tasks
cs.IR · 2026-05-05 · unverdicted · none · ref 1

On Stable Long-Form Generation: Benchmarking and Mitigating Length Volatility
cs.CL · 2026-05-02 · unverdicted · none · ref 1

What Makes Good Multilingual Reasoning? Disentangling Reasoning Traces with Measurable Features
cs.CL · 2026-04-06 · unverdicted · none · ref 1

MobiFlow: Real-World Mobile Agent Benchmarking through Trajectory Fusion
cs.AI · 2026-02-28 · unverdicted · none · ref 1

Bayesian Preference Learning for Test-Time Steerable Reward Models
cs.LG · 2026-02-09 · unverdicted · none · ref 16

Inference-Time Conformal Reasoning with Valid Factuality Control for Large Language Models
cs.AI · 2026-06-07 · unverdicted · none · ref 9 · 2 links

Multilingual Unlearning in LLMs: Transfer, Dynamics, and Reversibility
cs.CL · 2026-06-02 · unverdicted · none · ref 8

Self-Improving Language Models with Bidirectional Evolutionary Search
cs.CL · 2026-05-27 · unverdicted · none · ref 18

Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions
cs.CL · 2026-05-11 · unverdicted · none · ref 11 · 2 links

ConfLayers: Adaptive Confidence-based Layer Skipping for Self-Speculative Decoding
cs.LG · 2026-04-16 · unverdicted · none · ref 6

GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking
cs.SD · 2026-04-10 · unverdicted · none · ref 13

Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning
cs.CL · 2026-02-19 · unverdicted · none · ref 6

Enhancing Table Reasoning with Deterministic Table-State Rewards
cs.AI · 2026-01-30 · unverdicted · none · ref 19

Who Gets the Kidney? Human-AI Alignment, Indecision, and Moral Values
cs.CY · 2025-05-30 · unverdicted · none · ref 66

Unifying Temporal and Structural Credit Assignment in LLM-Based Multi-Agent Prompt Optimization
cs.MA · 2026-05-28 · unverdicted · none · ref 2

VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning
cs.CV · 2026-05-27 · unverdicted · none · ref 18

Dynamic Scaled Gradient Descent for Stable Fine-Tuning for Classifications
cs.LG · 2026-04-30 · unverdicted · none · ref 2

A Reproducibility Study of LLM-Based Query Reformulation
cs.IR · 2026-04-30 · unverdicted · none · ref 39

CONSCIENTIA: Can LLM Agents Learn to Strategize? Emergent Deception and Trust in a Multi-Agent NYC Simulation
cs.MA · 2026-04-10 · unverdicted · none · ref 1

A Comparative Study of Demonstration Selection for Practical Large Language Models-based Next POI Prediction
cs.CL · 2026-03-16 · conditional · none · ref 7

AI Evaluation Should Require Standardized Item-Level Data Releases
cs.AI · 2026-02-27 · conditional · none · ref 20 · 2 links

Evaluating the Impact of Verbal Multiword Expressions on Machine Translation
cs.CL · 2025-08-24 · conditional · none · ref 5

Bridging the Linguistic Divide: A Survey on Leveraging Large Language Models for Machine Translation
cs.CL · 2025-04-02 · unverdicted · none · ref 40
