ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
13 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
BrainROI achieves leading cross-subject brain-captioning results on NSD by combining multi-atlas soft-ROI fusion with interpretable prompt optimization.
citing papers explorer
- TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments
  TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.
- CRAFT: Cost-aware Refinement And Front-aware Tuning of Prompts
  CRAFT is a Pareto-front prompt optimizer that allocates scarce LLM validation calls to candidates near the current front using accuracy- and cost-oriented generators plus NSGA-II retention.
- GrowLoop: Self-Evolving Conversation Evaluation Seeded by Human
  GrowLoop proposes a human-seeded self-evolving framework that co-evolves rubrics and cases to evaluate conversational human-likeness with differentiated agreement rules.
- iPOE: Interpretable Prompt Optimization via Explanations
  iPOE generates and optimizes annotation guidelines from explanations to produce interpretable prompts, reporting up to 39% gains over baselines on four datasets with LLM explanations substituting for human ones.
- PQR: A Framework to Generate Diverse and Realistic User Queries that Elicit QA Agent Failures
  The PQR framework generates diverse, realistic queries to elicit QA agent failures, uncovering 23-78% more unhelpful responses than prior methods in e-commerce agent tests.
- A Single Rewrite Suffices: Empirical Lessons from Production Skill Description Optimization
  A single LLM rewrite of skill descriptions using false positive and negative cases matches manual optimization performance in production, with most other pipeline components adding little value.
- GBC: Gradient-Based Connections for Optimizing Multi-Agent Systems
  GBC treats multi-agent LLM workflows as differentiable graphs to enable token-level attribution and targeted optimization, with reported gains on MultiWOZ and τ-bench.
- NOVA: A Verification-Aware Agent Harness for Architecture Evolution in Industrial Recommender Systems
  NOVA introduces a level-aware agent harness with architecture gradient and verification cascade to automate recommender architecture evolution while reducing silent failures and human effort.
- What Should a Skill Remember? Quality-Cost Trade-offs in Cost-Aware Skill Rewriting for Language Model Agents
  An empirical study demonstrates that cost-aware skill rewriting for LLM agents can achieve a 7% total cost reduction and a 6% agent-token cost reduction with preserved quality on SkillsBench.
- JTPRO: A Joint Tool-Prompt Reflective Optimization Framework for Language Agents
  JTPRO co-optimizes prompts and tool descriptions via reflection to raise overall success rate by 5-20% over baselines on multi-tool benchmarks.
- Improving Medical Communication using Rubric-Guided Counterfactual Recommendations
  An LM-guided counterfactual pipeline recommends minimal ordinal changes to communication features like tone and actionability, yielding a mean +6.41% gain in predicted positive feedback under independent auditor models.