Evaluation metrics in the era of GPT-4: Reliably evaluating large language models on sequence to sequence tasks. arXiv preprint arXiv:2310.13800

Andrea Sottana, Bin Liang, Kai Zou, Zheng Yuan · 2023 · arXiv 2310.13800

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

LLM judges exhibit high stability under neutral re-evaluation but substantial reversibility under targeted post-decision challenges, quantified via a new Evaluation Robustness Score (ERS).

The Prompt Report: A Systematic Survey of Prompt Engineering Techniques

cs.CL · 2024-06-06 · accept · novelty 7.0

This systematic survey organizes prompt engineering into a taxonomy of 58 LLM techniques and 40 others, supplies a shared vocabulary, and offers guidelines for state-of-the-art models.

AInstein: Can LLMs Solve Research Problems From Parametric Memory Alone?

cs.AI · 2025-10-06 · unverdicted · novelty 6.0

LLMs generate valid solutions to over 70% of AI research problems from parametric memory alone but rediscover the exact published approach less than 19% of the time, with performance limited by cross-domain analogical transfer.

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

cs.CL · 2024-12-07 · accept · novelty 3.0

A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

citing papers explorer

AInstein: Can LLMs Solve Research Problems From Parametric Memory Alone?

cs.AI · 2025-10-06 · unverdicted · none · ref 12
