LLMs are applied in a generative pipeline for extracting, normalizing, and interpreting eligibility criteria from securities prospectuses, achieving up to 91% precision in document-level decisions with a conservative bias.
arXiv preprint arXiv:2210.07197
12 Pith papers cite this work.
representative citing papers
Evalet applies functional fragmentation to deliver fragment-level qualitative analysis of LLM evaluations, with a user study showing 48% more misalignment detections than holistic scoring.
A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
MultiHop-RAG is a new benchmark dataset demonstrating that existing retrieval-augmented generation systems perform poorly on multi-hop queries requiring retrieval and reasoning over multiple evidence pieces.
MAFIG is a multi-agent framework that uses LLM agents and evaluators to generate reading comprehension items with significantly higher adherence to specified feature constraints than single-agent baselines.
CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with negligible added inference cost.
A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.
Multi-agent debate among LLMs yields more reliable text evaluations than single-agent prompting by simulating collaborative human judgment.
G-Eval uses GPT-4 with chain-of-thought and form-filling to reach 0.514 Spearman correlation with humans on summarization, beating prior NLG metrics while noting a bias toward LLM outputs.
A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.
A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.
Fine-tuned multilingual LLMs achieve top shared-task scores on financial causality extraction in English and Spanish.
citing papers explorer
- ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment