hub

Advances in neural information processing systems , volume=

Superglue: A stickier benchmark for general-purpose language understanding systems , author=

11 Pith papers cite this work. Polarity classification is still indexing.

11 Pith papers citing it

browse 11 citing papers

hub tools

JSON dossier citing papers JSON

representative citing papers

Many-Shot CoT-ICL: Making In-Context Learning Truly Learn

cs.CL · 2026-05-13 · conditional · novelty 7.0

Many-shot CoT-ICL functions as test-time learning when demonstrations are ordered for smooth conceptual progression rather than similarity, enabling a new selection method that improves reasoning performance.

Assisted Counterspeech Writing at the Crossroads of Hate Speech and Misinformation

cs.CL · 2026-05-21 · conditional · novelty 6.0

LLMs generate adequate counterspeech for co-occurring hate and misinformation in 40% of cases, with a mixed knowledge strategy from fact-checkers and NGOs proving most effective after expert revision.

Evaluating Multi-turn Human-AI Interaction

cs.HC · 2026-05-18 · unverdicted · novelty 6.0

Introduces the TCR framework to evaluate educational LLM assistants on transparency, consistency, and refinement in multi-turn interactions, complementing aggregate metrics.

Effective Performance Measurement: Challenges and Opportunities in KPI Extraction from Earnings Calls

cs.CL · 2026-05-04 · unverdicted · novelty 6.0

Encoder models trained on SEC filings struggle with earnings calls due to domain shift, while LLMs enable open-ended KPI extraction with 79.7% human-verified precision on newly introduced benchmarks.

NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

cs.CL · 2024-05-27 · accept · novelty 6.0

NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.

Lessons from the Trenches on Reproducible Evaluation of Language Models

cs.CL · 2024-05-23 · accept · novelty 6.0

The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.

The Falcon Series of Open Language Models

cs.CL · 2023-11-28 · conditional · novelty 6.0

Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

cs.CL · 2023-04-13 · accept · novelty 6.0

AGIEval shows GPT-4 exceeding average human scores on SAT Math at 95% and Chinese college entrance English at 92.5%, while revealing weaker results on complex reasoning tasks.

Position: Zeroth-Order Optimization in Deep Learning Is Underexplored, Not Underpowered

cs.LG · 2026-05-15 · unverdicted · novelty 5.0

Zeroth-order optimization is underexplored rather than underpowered in deep learning, with limitations stemming from full-space designs that can be addressed via subspace, spectral, and systems-aware approaches.

Evaluating Pragmatic Reasoning in Large Language Models: Evidence from Scalar Diversity

cs.CL · 2026-05-09 · unverdicted · novelty 5.0

Pragmatic reasoning in LLMs varies substantially by evaluation method and model family, with scalar diversity patterns appearing only in certain conditions rather than reflecting stable competence.

LLM Benchmark Datasets Should Be Contamination-Resistant

cs.LG · 2026-05-19 · unverdicted · novelty 4.0

Authors call for contamination-resistant LLM benchmarks that exploit Transformer training-inference asymmetry and require new mathematical methods for cross-architecture interoperability.

citing papers explorer

Showing 11 of 11 citing papers.

Many-Shot CoT-ICL: Making In-Context Learning Truly Learn cs.CL · 2026-05-13 · conditional · none · ref 12
Many-shot CoT-ICL functions as test-time learning when demonstrations are ordered for smooth conceptual progression rather than similarity, enabling a new selection method that improves reasoning performance.
Assisted Counterspeech Writing at the Crossroads of Hate Speech and Misinformation cs.CL · 2026-05-21 · conditional · none · ref 150
LLMs generate adequate counterspeech for co-occurring hate and misinformation in 40% of cases, with a mixed knowledge strategy from fact-checkers and NGOs proving most effective after expert revision.
Evaluating Multi-turn Human-AI Interaction cs.HC · 2026-05-18 · unverdicted · none · ref 27
Introduces the TCR framework to evaluate educational LLM assistants on transparency, consistency, and refinement in multi-turn interactions, complementing aggregate metrics.
Effective Performance Measurement: Challenges and Opportunities in KPI Extraction from Earnings Calls cs.CL · 2026-05-04 · unverdicted · none · ref 72
Encoder models trained on SEC filings struggle with earnings calls due to domain shift, while LLMs enable open-ended KPI extraction with 79.7% human-verified precision on newly introduced benchmarks.
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models cs.CL · 2024-05-27 · accept · none · ref 97
NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.
Lessons from the Trenches on Reproducible Evaluation of Language Models cs.CL · 2024-05-23 · accept · none · ref 39
The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.
The Falcon Series of Open Language Models cs.CL · 2023-11-28 · conditional · none · ref 204
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models cs.CL · 2023-04-13 · accept · none · ref 31
AGIEval shows GPT-4 exceeding average human scores on SAT Math at 95% and Chinese college entrance English at 92.5%, while revealing weaker results on complex reasoning tasks.
Position: Zeroth-Order Optimization in Deep Learning Is Underexplored, Not Underpowered cs.LG · 2026-05-15 · unverdicted · none · ref 12
Zeroth-order optimization is underexplored rather than underpowered in deep learning, with limitations stemming from full-space designs that can be addressed via subspace, spectral, and systems-aware approaches.
Evaluating Pragmatic Reasoning in Large Language Models: Evidence from Scalar Diversity cs.CL · 2026-05-09 · unverdicted · none · ref 23
Pragmatic reasoning in LLMs varies substantially by evaluation method and model family, with scalar diversity patterns appearing only in certain conditions rather than reflecting stable competence.
LLM Benchmark Datasets Should Be Contamination-Resistant cs.LG · 2026-05-19 · unverdicted · none · ref 97
Authors call for contamination-resistant LLM benchmarks that exploit Transformer training-inference asymmetry and require new mathematical methods for cross-architecture interoperability.

Advances in neural information processing systems , volume=

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer