LLM-Eval: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models
8 Pith papers cite this work.
representative citing papers
This systematic survey organizes prompt engineering into a taxonomy of 58 LLM prompting techniques and 40 techniques for other modalities, supplies a shared vocabulary, and offers guidelines for prompting state-of-the-art models.
Poller reduces LLM-human disagreement in evaluating Chinese poetry understanding by having LLMs role-play as authors, with reported error reductions of 94.55% and 89.53% on rhetorical techniques and defamiliarization.
For binary LLM judge validation, Pearson's r, Spearman's ρ, Kendall's τ_b, phi, and the Matthews correlation coefficient all reduce to the same number on non-degenerate data, while Cohen's κ supplies the extra signal on label-rate drift. The paper also provides a reporting checklist.
MDS selects better multi-turn dialogues for instruction tuning by combining bin-wise global coverage with local entity-topic and format consistency scoring, outperforming prior selectors on benchmarks.
OutSafe-Bench supplies the first large-scale four-modality safety dataset and evaluation framework that exposes persistent unsafe outputs in nine leading multimodal LLMs.
The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
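The equivalence claimed in the binary LLM judge validation summary above can be checked numerically. The following is a minimal sketch in plain Python, not the cited paper's code: the label vectors `human` and `judge` are made-up illustrative data, and all function names are hypothetical helpers written for this example. Phi is not computed separately because, on a 2x2 table, it shares its formula with the Matthews correlation coefficient.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation from centered sums."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def average_ranks(v):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    ranks = [0.0] * len(v)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and v[order[end + 1]] == v[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    """Kendall's tau-b by direct pair counting (O(n^2))."""
    s = untied_x = untied_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        untied_x += dx != 0
        untied_y += dy != 0
        s += (dx * dy > 0) - (dx * dy < 0)  # concordant minus discordant
    return s / math.sqrt(untied_x * untied_y)

def confusion(human, judge):
    """Return (tp, tn, fp, fn), treating the human label as ground truth."""
    tp = sum(h == 1 and j == 1 for h, j in zip(human, judge))
    tn = sum(h == 0 and j == 0 for h, j in zip(human, judge))
    fp = sum(h == 0 and j == 1 for h, j in zip(human, judge))
    fn = sum(h == 1 and j == 0 for h, j in zip(human, judge))
    return tp, tn, fp, fn

def matthews(human, judge):
    """Matthews correlation coefficient (same formula as phi on a 2x2 table)."""
    tp, tn, fp, fn = confusion(human, judge)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom

def cohen_kappa(human, judge):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    tp, tn, fp, fn = confusion(human, judge)
    n = tp + tn + fp + fn
    po = (tp + tn) / n
    pe = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    return (po - pe) / (1 - pe)

# Illustrative labels only (assumed data, not from any cited paper).
human = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
judge = [1, 1, 0, 0, 0, 1, 1, 0, 1, 1]

for name, fn_ in [("pearson", pearson), ("spearman", spearman),
                  ("kendall_tau_b", kendall_tau_b), ("mcc/phi", matthews),
                  ("cohen_kappa", cohen_kappa)]:
    print(f"{name:>14}: {fn_(human, judge):.4f}")
```

On this toy data the four correlation measures agree at 1/√6 ≈ 0.408, while Cohen's κ comes out at 0.4 because the judge labels 60% of items positive against the human's 50%. The data must be non-degenerate (both labels present in both vectors), otherwise the denominators are zero.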
citing papers explorer
- PR-CAD: Progressive Refinement for Unified Controllable and Faithful Text-to-CAD Generation with Large Language Models
  PR-CAD unifies text-to-CAD generation and editing via progressive refinement with LLMs, a new interaction dataset, and RL-enhanced reasoning to achieve better controllability and faithfulness.
- Poller: Are LLMs Suitable for Evaluating the Poetry Understanding Task?
  Poller reduces LLM-human disagreement in evaluating Chinese poetry understanding by having LLMs role-play as authors, with reported error reductions of 94.55% and 89.53% on rhetorical techniques and defamiliarization.
- Data Selection for Multi-turn Dialogue Instruction Tuning
  MDS selects better multi-turn dialogues for instruction tuning by combining bin-wise global coverage with local entity-topic and format consistency scoring, outperforming prior selectors on benchmarks.
- OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models
  OutSafe-Bench supplies the first large-scale four-modality safety dataset and evaluation framework that exposes persistent unsafe outputs in nine leading multimodal LLMs.