Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents

Minbyul Jeong

arxiv: 2606.27595 · v1 · pith:YRRXX6ZCnew · submitted 2026-06-25 · 💻 cs.CL

Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents

Minbyul Jeong This is my paper

Pith reviewed 2026-06-29 01:19 UTC · model grok-4.3

classification 💻 cs.CL

keywords web agentsbreadth searchbenchmarkexhaustive set enumerationKoreantable fillingF1 metricsdifficulty tiers

0 comments

The pith

Web agents recover sets but fail to fill most rows in exhaustive tables on a new Korean benchmark

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds Ko-WideSearch to measure how web agents handle breadth tasks that require listing every member of a closed set and correctly populating each member's attributes in a table. It uses an automated pipeline to generate 228 gold tables over 190 Korean entities, with difficulty set by independently varying table width and a two-dimensional composite key so that full cross-product membership rises across three tiers. Tests on twenty agents show consistent results: high success naming the items themselves but much lower success filling the rows, with performance dropping as the knobs are tightened and no recovery from extra search budget or spend. The hardest cells turn out to be open-ended free-text values rather than dates or names that follow standard formats.

Core claim

Ko-WideSearch shows that web agents recover set membership but not the rows of the associated attribute table. Across 228 tables the gap appears as Item-F1 of 92.8 against Row-F1 of 53.7; accuracy falls steadily as table width and composite-key complexity increase from 0 percent to 100 percent cross-product membership, and neither additional search nor higher spend narrows the gap. The dominant error is locating the correct value, not formatting it once found.

What carries the argument

The Ko-WideSearch benchmark and its automated synthesize-and-verify pipeline that produces complete gold tables for set-parent entities and grades answers with a shared normalization-aware comparator using Item-, Column-, and Row-F1.

If this is right

Agents achieve high Item-F1 but substantially lower Row-F1 on the same tasks.
Accuracy declines steadily as table width and 2-D composite key complexity increase.
Extra search effort or higher spend produces no measurable improvement on these breadth tasks.
Open-ended free-text cells fail far more often than cells with standard formats such as dates or names.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future agent designs may need explicit mechanisms for systematic cross-source verification to close the set-versus-row gap.
The Korean-language setting could expose language-specific enumeration difficulties that English-only benchmarks miss.
The same structural knobs could be applied to other languages or domains to test whether the observed failure pattern generalizes.

Load-bearing premise

The automated synthesize-and-verify pipeline produces gold sets that are both complete and every cell correct.

What would settle it

A manual audit that finds missing members or incorrect cells in a sample of the generated gold tables would show the benchmark cannot be trusted.

Figures

Figures reproduced from arXiv: 2606.27595 by Minbyul Jeong.

**Figure 1.** Figure 1: The benchmark and its running example. Left: tasks span Korean categories (rows) and three difficulty tiers (columns); each cell is one real task, with a strip of its columns beneath—the membership-key column(s) shaded, one box marking a list, two a cross-product grid. Reading left to right, tasks gain columns to fill and their membership turns from a list into a grid. Right: the Government/HARD cell shown… view at source ↗

**Figure 2.** Figure 2: KO-WIDESEARCH extends WideSearch (Wong et al. 2025) along two structural axes. (a) Difficulty: the two knobs I dial—table width and the 2-D composite-key share—span the plane. WideSearch is a single, un-tiered reference (median six columns, 35% 2-D), whereas KO-WIDESEARCH covers a calibrated EASY→MEDIUM→HARD region (shaded), its HARD tier reaching the original’s width and exceeding its cross-product share … view at source ↗

**Figure 3.** Figure 3: Membership is recovered; full rows and whole tables are not. Item-F1 (membership), Column-F1 (matched cells), Row-F1 (full rows), and table success (the whole table exactly correct), as percentages, on all 228 tables for the twenty-system roster (sorted by Row-F1; x-axis labels colored by family). Each model’s bars cascade downward: the set (Item-F1, up to 94) is recovered far more fully than whole rows (R… view at source ↗

**Figure 4.** Figure 4: The metric breakdown by difficulty, faceted by model family. Each of the four WideSearch metrics (columns: Item-F1, Column-F1, Row-F1, and table success), as percentages, across the difficulty tiers (EASY/MEDIUM/HARD), with one row of panels per family—proprietary (solid, top), open-weight (dashed, middle), Korean-specialized (dotted, bottom); y-scales are shared down each column so the families are direct… view at source ↗

**Figure 5.** Figure 5: Where and why agents break. (a) Per-cell-type Column-F1 on an instrumented three-system subset (pooled): the open-ended free-text cells are filled least reliably and the format-constrained ones (dates, names) most, so the difficulty is finding and normalizing the value, not formatting it. (b) Row-F1 by gold set size on the full pool is essentially flat—breadth alone does not drive failure. (c) Average tool… view at source ↗

**Figure 6.** Figure 6: Failure composition by stage, for the ten timelyrouted systems with per-task logging. Every task is assigned to its first failure stage—no parseable table, a substantially wrong row set (membership, item-F1 < 0.9), an essentially-correct set with a wrong attribute cell (cells), or fully solved—so the four shares sum over the 228 tasks. For capable systems the bottleneck is cell-filling (cells 56–66%); t… view at source ↗

**Figure 7.** Figure 7: Neither dollars nor search calls close the gap. (a) Row-F1 against cost per task for the six API-served models with metered cost. Accuracy rises with cost then flattens: the open-weight DeepSeek-V4-Pro reaches most of the frontier’s Row-F1 at a fraction of the price of GPT-5.5 and Claude-Opus-4.8. (b) Share of tasks on which a model spends at least thirty tool calls, over all twenty systems; the models tha… view at source ↗

**Figure 8.** Figure 8: Semantic judging helps strong models more. Strict → judged Row-F1 per model (full 228-table pool). The dumbbell length is the correction the LLM judge applies; it grows with model strength (DeepSeek-V4-Pro +4.9 at top vs. +0.8 for the weakest), so strict cell-matching under-states the gap between strong and weak systems. web searches over the search/open/find namespace. The three acceptance gates are appl… view at source ↗

**Figure 9.** Figure 9: The metric breakdown by sourcing label, faceted by model family (companion to [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗

**Figure 10.** Figure 10: Pooled Row-F1 by topical category (mean across [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗

**Figure 11.** Figure 11: Errors that survive the semantic judge. After the judge credits transliteration and administrative-granularity variants, the residual judge-confirmed-wrong cells are substantive: a different entity (n=269), a different region/district (328), a wrong value (164), or a category mismatch (35). The Item-F1 ≫Row-F1 gap is factual error, not formatting. from 90% on EASY to about 85% on MEDIUM and HARD: the har… view at source ↗

**Figure 12.** Figure 12: GPT-5.5 — surface variants and membership. Gold vs. prediction; predicted cells are coloured white (correct), teal (strict-wrong but rescued by the semantic judge), red (genuinely wrong), orange (hallucinated). (a) The cast is fully recovered, and the cells the deterministic scorer marks wrong are mostly surface variants—the has-children enum written as a free-text count (“yes” ≡ “N children”) and the re… view at source ↗

**Figure 13.** Figure 13: DeepSeek-V4-Pro — blank attribute cells. All eight metropolitan cities are recovered, but the population column is left blank for every row and the mayor for most, so no row is complete. correct blank/omitted hallucinated GPT-5.4-mini · kws-e023 (경부고속선 KTX 정차역) · Item-F1 33 Row-F1 0 (returned 3 of 9) GOLD (reference) GPT-5.4-mini PREDICTION gold lists 9 stations; the model returned 3 (천안아산역 is out of scop… view at source ↗

**Figure 14.** Figure 14: GPT-5.4-mini — under-recall. Of nine valid stations the model returns only three (one of them out of scope) and leaves every distance blank — a recall collapse [PITH_FULL_IMAGE:figures/full_fig_p019_14.png] view at source ↗

**Figure 15.** Figure 15: Solar-Open-2-preview — two failure modes. (a) Parse failure: the model narrates a prose list instead of a table, so the answer is unscorable. (b) Even when it emits a valid table, the numeric cells are fabricated (plausible but wrong populations), so no row is complete [PITH_FULL_IMAGE:figures/full_fig_p020_15.png] view at source ↗

**Figure 16.** Figure 16: Qwen3.6-35B — over-search, then no table. The model issues the run’s maximum number of tool calls, yet still emits prose rather than a table — its most common outcome on the benchmark. correct wrong A.X-4.0 · kws-0008 (태양계 행성 발견) · Item-F1 100 Column-F1 25 Row-F1 12 GOLD (reference) A.X-4.0 PREDICTION All 8 planets recovered, but free-text cells are fabricated (‘기원전 2000년’, ‘바빌로니아 천문학자들’) — only 천왕성 is fu… view at source ↗

**Figure 17.** Figure 17: A.X-4.0 — fabricated free-text cells. All eight planets are recovered, but the discovery-year and discoverer cells are fabricated (“2000 BC”, “Babylonian astronomers” where the gold is “ancient”); only one row is fully correct [PITH_FULL_IMAGE:figures/full_fig_p021_17.png] view at source ↗

read the original abstract

Web-agent benchmarks overwhelmingly measure depth -- pinning one obscure answer behind a chain of constraints -- while breadth, exhaustively enumerating a closed set and filling each item's attributes, is barely evaluated, especially outside English. Breadth is also hard to build: certifying that a gold set is complete and every cell correct is far costlier than checking a single answer. I introduce \textsc{Ko-WideSearch}, a Korean breadth-search benchmark built by an automated synthesize-and-verify pipeline. Each task names a set-parent entity -- a TV season, a dynasty, a league, an administrative region, an election -- and asks for its full membership plus a per-item attribute table, graded by Item-, Column-, and Row-F1. It spans 228 tables over 190 entities and sixteen categories across three difficulty tiers, set by two structural knobs I dial independently -- table width and a 2-D composite key -- so cross-product membership climbs from 0\% to 100\% across the tiers. A single normalization-aware comparator is shared between gold construction and grading, so stable date and count columns are not over-dropped on formatting alone. Across twenty web agents, the failure is consistent: agents recover the set but not the rows (e.g.\ Item-F1 92.8 against Row-F1 53.7), accuracy falls steadily as the knobs harden, and neither more search nor more spend closes the gap. Broken down by cell, the hard part is finding the right value, not formatting it: open-ended free-text cells fail most, while cells with a standard answer such as a date or a name usually come out right.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Ko-WideSearch gives a new Korean benchmark for breadth tasks in web agents, with the main result resting on gold labels whose quality is not yet shown to be solid.

read the letter

The paper builds Ko-WideSearch, a benchmark with 228 tables across 190 Korean entities in sixteen categories. Each task asks an agent to enumerate a full set and fill an attribute table, scored by Item-, Column-, and Row-F1. Two knobs control difficulty: table width and a 2-D composite key that raises cross-product membership from 0% to 100% across tiers. The reported pattern is that twenty agents recover most items (Item-F1 92.8) but complete far fewer rows (Row-F1 53.7), free-text cells are the hardest, and extra search or spend does not close the gap.

This is new. Most agent benchmarks stay in English and test depth on single answers; this one targets exhaustive breadth, adds structural difficulty controls, and works in Korean. The shared normalization-aware comparator between gold building and scoring is a practical choice that avoids over-penalizing format differences on dates or counts.

The soft spot is the gold construction. The work uses an automated synthesize-and-verify pipeline, and the abstract itself notes that certifying every cell is complete and correct costs more than single-answer checks. No human audit rate, no error estimate on the final 228 tables, and no sensitivity check on how gold mistakes would move the F1 numbers appear in the supplied text. The stress-test concern is therefore on target: modest errors concentrated in free-text columns would directly produce the Item-F1 versus Row-F1 gap without any agent shortcoming. Until that is addressed, the central claim stays provisional.

The paper is for people who evaluate or train web agents and want to move past depth-only, English-only tests. A reader working on agent robustness or non-English data collection would find the task design and the consistent failure mode useful to think about. It deserves a serious referee because the gap it targets is real and the benchmark construction is reproducible in principle, but the review should focus on the verification step and any released data.

Referee Report

1 major / 2 minor

Summary. The paper introduces Ko-WideSearch, a Korean-language benchmark for web-agent breadth search consisting of 228 tables over 190 entities in 16 categories. Tasks require exhaustive enumeration of a set-parent entity plus per-item attribute tables, graded by Item-F1, Column-F1, and Row-F1. Gold tables are produced by an automated synthesize-and-verify pipeline that shares a normalization-aware comparator with the evaluation metric. Two independent structural knobs (table width and 2-D composite-key cardinality) create three difficulty tiers. Evaluation of 20 web agents shows high Item-F1 (92.8) but low Row-F1 (53.7), monotonic degradation with knob settings, and no recovery from increased search budget or spend; the dominant failure mode is value retrieval rather than formatting.

Significance. If the gold tables are verifiably complete and cell-correct, the benchmark would usefully document a systematic limitation of current web agents on exhaustive multi-row collection tasks outside English and would supply controllable structural difficulty axes for future work. The shared normalization comparator is a methodological strength that prevents spurious penalties on stable columns.

major comments (1)

[Abstract] Abstract (pipeline description): the central claim that agents recover sets but not rows (Item-F1 92.8 vs Row-F1 53.7) is interpretable only if the 228 gold tables are both complete and every cell correct. The automated synthesize-and-verify pipeline is described only at the level of a “normalization-aware comparator”; no human audit rate, no estimated error rate on the final gold, and no ablation of F1 sensitivity to plausible gold errors (especially free-text columns) are supplied. Because Row-F1 penalizes any cell mismatch, modest gold errors concentrated in open-ended cells would directly reproduce the reported gap without any agent failure.

minor comments (2)

[Abstract] Abstract: the phrase “sixteen categories across three difficulty tiers” is stated without enumeration or cross-reference; a short table or list in the methods section would improve traceability.
[Abstract] Abstract: the two structural knobs are introduced as “table width and a 2-D composite key” but their exact operationalization (e.g., how composite-key cardinality is counted) is not defined at first mention; a brief parenthetical or forward reference would reduce ambiguity.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive comment on the gold-table validation. The concern is valid: without explicit human-audit statistics or sensitivity analysis, readers cannot independently confirm that the reported Item-F1 / Row-F1 gap reflects agent limitations rather than undetected gold errors. We will address this directly in revision.

read point-by-point responses

Referee: [Abstract] Abstract (pipeline description): the central claim that agents recover sets but not rows (Item-F1 92.8 vs Row-F1 53.7) is interpretable only if the 228 gold tables are both complete and every cell correct. The automated synthesize-and-verify pipeline is described only at the level of a “normalization-aware comparator”; no human audit rate, no estimated error rate on the final gold, and no ablation of F1 sensitivity to plausible gold errors (especially free-text columns) are supplied. Because Row-F1 penalizes any cell mismatch, modest gold errors concentrated in open-ended cells would directly reproduce the reported gap without any agent failure.

Authors: We agree that the interpretability of the central claim rests on the reliability of the gold tables. The current manuscript describes the pipeline only at the level of the shared normalization-aware comparator. In the revised version we will (1) expand the Methods section with a detailed account of the synthesize-and-verify steps, (2) report the human audit rate and any inter-annotator agreement figures obtained during verification, (3) provide an estimated upper bound on residual error rates (particularly for free-text columns), and (4) include a sensitivity ablation that injects controlled plausible errors into the gold tables and measures the resulting change in Item-F1, Column-F1, and Row-F1. These additions will allow readers to quantify how much of the observed gap could plausibly be explained by gold-table noise. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark construction with external measurements

full rationale

The paper constructs a Korean web-agent benchmark via an automated synthesize-and-verify pipeline and reports Item-F1/Row-F1 gaps across agents. No derivation chain, equations, fitted parameters, or self-citations are present; the central claims rest on external agent runs and the pipeline's output, which the paper itself flags as costly to certify. This matches the default case of a self-contained benchmark paper with no load-bearing reductions to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim that agents fail on row completion rests on the correctness of the gold data produced by the synthesize-and-verify pipeline and on the assumption that the normalization comparator treats equivalent values fairly across agents.

axioms (1)

domain assumption The synthesize-and-verify pipeline produces complete and cell-correct gold tables for all 190 entities
Abstract states that certifying completeness is far costlier than single-answer checks yet relies on the pipeline to do so.

pith-pipeline@v0.9.1-grok · 5829 in / 1338 out tokens · 39165 ms · 2026-06-29T01:19:27.700949+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

65 extracted references · 26 canonical work pages · 15 internal anchors

[1]

J.; Wolfson, T.; Rubin, O.; Yoran, O.; Herzig, J.; and Berant, J

Amouyal, S. J.; Wolfson, T.; Rubin, O.; Yoran, O.; Herzig, J.; and Berant, J. 2023. QAMPARI: An Open-domain Question Answering Benchmark for Questions with Many Answers from Multiple Paragraphs. arXiv:2205.12665

work page arXiv 2023
[2]

Asai, A.; Wu, Z.; Wang, Y.; Sil, A.; and Hajishirzi, H. 2024. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. ICLR

2024
[3]

CAIS; and Scale AI. 2025. Humanity's Last Exam. arXiv:2501.14249

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Chen, W.; Zha, H.; Chen, Z.; Xiong, W.; Wang, H.; and Wang, W. Y. 2020. HybridQA: A Dataset of Multi-Hop Question Answering over Tabular and Textual Data. EMNLP Findings

2020
[5]

Chen, W.; Wang, H.; Chen, J.; Zhang, Y.; Wang, H.; Li, S.; Zhou, X.; and Wang, W. Y. 2020. TabFact: A Large-scale Dataset for Table-based Fact Verification. ICLR

2020
[6]

DeepSeek-AI. 2024. DeepSeek-V3 Technical Report. arXiv:2412.19437

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

Deng, X.; Gu, Y.; Zheng, B.; Chen, S.; Stevens, S.; Wang, B.; Sun, H.; and Su, Y. 2023. Mind2Web: Towards a Generalist Agent for the Web. NeurIPS

2023
[8]

Ding, N.; Chen, Y.; Xu, B.; Qin, Y.; Zheng, Z.; Hu, S.; Liu, Z.; Sun, M.; and Zhou, B. 2023. Enhancing Chat Language Models by Scaling High-quality Instructional Conversations. EMNLP

2023
[9]

Dong, Y.; Jiang, X.; Liu, H.; Jin, Z.; Gu, B.; Yang, M.; and Li, G. 2024. Generalization or Memorization: Data Contamination and Trustworthy Evaluation for LLMs. ACL Findings

2024
[10]

Gemini Team. 2023. Gemini: A Family of Highly Capable Multimodal Models. arXiv:2312.11805

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

Golchin, S.; and Surdeanu, M. 2024. Time Travel in LLMs: Tracing Data Contamination in Large Language Models. ICLR

2024
[12]

He, H.; Yao, W.; Ma, K.; Yu, W.; Dai, Y.; Zhang, H.; Lan, Z.; and Yu, D. 2024. WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. ACL

2024
[13]

Ho, X.; Duong Nguyen, A.-K.; Sugawara, S.; and Aizawa, A. 2020. Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. COLING

2020
[14]

HyperCLOVA X Team. 2024. HyperCLOVA X Technical Report. arXiv:2404.01954

work page arXiv 2024
[15]

S.; and Davis, E

Jang, M.; Kim, D.; Kwon, D. S.; and Davis, E. 2022. KoBEST: Korean Balanced Evaluation of Significant Tasks. COLING

2022
[16]

Jin, J.; Kim, J.; Lee, N.; Yoo, H.; Oh, A.; and Lee, H. 2024. KoBBQ: Korean Bias Benchmark for Question Answering. TACL

2024
[17]

Joshi, M.; Choi, E.; Weld, D.; and Zettlemoyer, L. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. ACL

2017
[18]

Kanana LLM Team. 2025. Kanana: Compute-efficient Bilingual Language Models. arXiv:2502.18934

work page arXiv 2025
[19]

H.; Park, S.; Kim, S.; Kim, S.; Seo, D.; Lee, H.; Jeong, M.; Lee, S.; Kim, M.; Ko, S

Kim, B.; Kim, H.; Lee, S.-W.; Lee, G.; Kwak, D.; Jeon, D. H.; Park, S.; Kim, S.; Kim, S.; Seo, D.; Lee, H.; Jeong, M.; Lee, S.; Kim, M.; Ko, S. H.; Kim, S.; Park, T.; Kim, J.; Kang, S.; Ryu, N.-H.; Yoo, K. M.; Chang, M.; Suh, S.; In, S.; Park, J.; Kim, K.; Kim, H.; Jeong, J.; Yeo, Y. G.; Ham, D.; Park, D.; Lee, M. Y.; Kang, J.; Kang, I.; Ha, J.-W.; Park, ...

2021
[20]

Kim, E.; Suk, J.; Oh, P.; Yoo, H.; Thorne, J.; and Oh, A. 2024. CLIcK: A Benchmark Dataset of Cultural and Linguistic Intelligence in Korean. LREC-COLING

2024
[21]

Kim, S.; Kim, D.; Park, C.; Lee, W.; Song, W.; Kim, Y.; Kim, H.; Kim, Y.; Lee, H.; Kim, J.; Ahn, C.; Yang, S.; Lee, S.; Park, H.; Gim, G.; Cha, M.; Lee, H.; and Kim, S. 2024. SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling. NAACL

2024
[22]

Krishna, S.; Krishna, K.; Mohananey, A.; Schwarcz, S.; Stambler, A.; Upadhyay, S.; and Faruqui, M. 2024. FRAMES: Factuality, Retrieval, And reasoning MEasurement Set. arXiv:2409.12941

work page arXiv 2024
[23]

M.; Uszkoreit, J.; Le, Q.; and Petrov, S

Kwiatkowski, T.; Palomaki, J.; Redfield, O.; Collins, M.; Parikh, A.; Alberti, C.; Epstein, D.; Polosukhin, I.; Devlin, J.; Lee, K.; Toutanova, K.; Jones, L.; Kelcey, M.; Chang, M.-W.; Dai, A. M.; Uszkoreit, J.; Le, Q.; and Petrov, S. 2019. Natural Questions: A Benchmark for Question Answering Research. TACL

2019
[24]

Lee, N.; Yoon, D.; Son, G.; Kim, G.; Ko, D.; Park, J.; Yoo, H.; Cho, J.; Park, J.; Lee, C.; Jang, K.; Kim, J.; Kim, E.; Cho, W.; and Kim, S. 2026. K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts. arXiv:2606.02404

work page internal anchor Pith review Pith/arXiv arXiv 2026
[25]

u ttler, H.; Lewis, M.; Yih, W.-t.; Rockt\

Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; K\" u ttler, H.; Lewis, M.; Yih, W.-t.; Rockt\" a schel, T.; Riedel, S.; and Kiela, D. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS

2020
[26]

LG AI Research. 2024. EXAONE 3.0 7.8B Instruction Tuned Language Model. arXiv:2408.03541

work page arXiv 2024
[27]

Li, X.; Dong, G.; Jin, J.; Zhang, Y.; Zhou, Y.; Zhu, Y.; Zhang, P.; and Dou, Z. 2025. Search-o1: Agentic Search-Enhanced Large Reasoning Models. arXiv:2501.05366

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Lim, S.; Kim, M.; and Lee, J. 2019. KorQuAD1.0: Korean QA Dataset for Machine Reading Comprehension. arXiv:1909.07005

work page arXiv 2019
[29]

Liu, Y.; Iter, D.; Xu, Y.; Wang, S.; Xu, R.; and Zhu, C. 2023. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. EMNLP

2023
[30]

Liu, X.; Yu, H.; Zhang, H.; Xu, Y.; Lei, X.; Lai, H.; Gu, Y.; Ding, H.; Men, K.; Yang, K.; Zhang, S.; Deng, X.; Zeng, A.; Du, Z.; Zhang, C.; Shen, S.; Zhang, T.; Su, Y.; Sun, H.; Huang, M.; Dong, Y.; and Tang, J. 2024. AgentBench: Evaluating LLMs as Agents. ICLR

2024
[31]

Llama Team. 2024. The Llama 3 Herd of Models. arXiv:2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

Mialon, G.; Fourrier, C.; Swift, C.; Wolf, T.; LeCun, Y.; and Scialom, T. 2023. GAIA: A Benchmark for General AI Assistants. arXiv:2311.12983

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

Mukherjee, S.; Mitra, A.; Jawahar, G.; Agarwal, S.; Palangi, H.; and Awadallah, A. 2023. Orca: Progressive Learning from Complex Explanation Traces of GPT-4. arXiv:2306.02707

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Nakano, R.; Hilton, J.; Balaji, S.; Wu, J.; Ouyang, L.; Kim, C.; Hesse, C.; Jain, S.; Kosaraju, V.; Saunders, W.; Jiang, X.; Cobbe, K.; Eloundou, T.; Krueger, G.; Button, K.; Knight, M.; Chess, B.; and Schulman, J. 2021. WebGPT: Browser-assisted Question-answering with Human Feedback. arXiv:2112.09332

work page internal anchor Pith review Pith/arXiv arXiv 2021
[35]

V.; Verma, N.; Zhang, R.; Kry\' s ci\' n ski, W.; Schoelkopf, H.; Kong, R.; Tang, X.; Mutuma, M.; Rosand, B.; Trindade, I.; Bandaru, R.; Cunningham, J.; Xiong, C.; and Radev, D

Nan, L.; Hsieh, C.; Mao, Z.; Lin, X. V.; Verma, N.; Zhang, R.; Kry\' s ci\' n ski, W.; Schoelkopf, H.; Kong, R.; Tang, X.; Mutuma, M.; Rosand, B.; Trindade, I.; Bandaru, R.; Cunningham, J.; Xiong, C.; and Radev, D. 2022. FeTaQA: Free-form Table Question Answering. TACL

2022
[36]

OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

Oren, Y.; Meister, N.; Chatterji, N.; Ladhak, F.; and Hashimoto, T. B. 2024. Proving Test Set Contamination in Black Box Language Models. ICLR

2024
[38]

Park, S.; Moon, J.; Kim, S.; Cho, W. I.; Han, J.; Park, J.; Song, C.; Kim, J.; Song, Y.; Oh, T.; Lee, J.; Oh, J.; Lyu, S.; Jeong, Y.; Lee, I.; Seo, S.; Lee, D.; Kim, H.; Lee, M.; Jang, S.; Do, S.; Kim, S.; Lim, K.; Lee, J.; Park, K.; Shin, J.; Kim, S.; Park, L.; Oh, A.; Ha, J.-W.; and Cho, K. 2021. KLUE: Korean Language Understanding Evaluation. arXiv:2105.09680

work page arXiv 2021
[39]

Pasupat, P.; and Liang, P. 2015. Compositional Semantic Parsing on Semi-Structured Tables. ACL

2015
[40]

G.; Zhang, T.; Wang, X.; and Gonzalez, J

Patil, S. G.; Zhang, T.; Wang, X.; and Gonzalez, J. E. 2024. Gorilla: Large Language Model Connected with Massive APIs. NeurIPS

2024
[41]

Qin, Y.; Liang, S.; Ye, Y.; Zhu, K.; Yan, L.; Lu, Y.; Lin, Y.; Cong, X.; Tang, X.; Qian, B.; Zhao, S.; Hong, L.; Tian, R.; Xie, R.; Zhou, J.; Gerstein, M.; Li, D.; Liu, Z.; and Sun, M. 2024. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. ICLR

2024
[42]

Qwen Team. 2024. Qwen2.5 Technical Report. arXiv:2412.15115

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

H.; Singh, S.; Maheshwary, R.; Altomare, M.; Haggag, M

Romanou, A.; Foroutan, N.; Sotnikova, A.; Chen, Z.; Nelaturu, S. H.; Singh, S.; Maheshwary, R.; Altomare, M.; Haggag, M. A.; A, S.; Amayuelas, A.; Amirudin, A. H.; Aryabumi, V.; Boiko, D.; Chang, M.; Chim, J.; Cohen, G.; Dalmia, A. K.; Diress, A.; Duwal, S.; Dzenhaliou, D.; Erazo Florez, D. F.; Farestam, F.; Imperial, J. M.; Islam, S. B.; Isotalo, P.; Jab...

2025
[44]

Sainz, O.; Campos, J.; Garc\' i a-Ferrero, I.; Etxaniz, J.; Lopez de Lacalle, O.; and Agirre, E. 2023. NLP Evaluation in Trouble: On the Need to Measure LLM Data Contamination for each Benchmark. EMNLP Findings

2023
[45]

Schick, T.; Dwivedi-Yu, J.; Dess\` i , R.; Raileanu, R.; Lomeli, M.; Zettlemoyer, L.; Cancedda, N.; and Scialom, T. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS

2023
[46]

C.; Yeom, J

Son, G.; Lee, H.; Kim, S.; Kim, H.; Lee, J. C.; Yeom, J. W.; Jung, J.; Kim, J. W.; and Kim, S. 2024. HAE-RAE Bench: Evaluation of Korean Knowledge in Language Models. LREC-COLING

2024
[47]

M.; and Biderman, S

Son, G.; Lee, H.; Kim, S.; Kim, S.; Muennighoff, N.; Choi, T.; Park, C.; Yoo, K. M.; and Biderman, S. 2025. KMMLU: Measuring Massive Multitask Language Understanding in Korean. NAACL

2025
[48]

arXiv preprint arXiv:2602.12413 , year=

Spiesberger, A.; Vazquez, J. J.; Pochinkov, N.; Gaven c iak, T.; Grietzer, P.; Leech, G.; and Schoots, N. 2026. Soft Contamination Means Benchmarks Test Shallow Generalization. arXiv:2602.12413

work page arXiv 2026
[49]

Trivedi, H.; Balasubramanian, N.; Khot, T.; and Sabharwal, A. 2022. MuSiQue: Multihop Questions via Single-hop Question Composition. TACL

2022
[50]

Vu, T.; Iyyer, M.; Wang, X.; Constant, N.; Wei, J.; Wei, J.; Tar, C.; Sung, Y.-H.; Zhou, D.; Le, Q.; and Luong, T. 2024. FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation. ACL Findings

2024
[51]

A.; Khashabi, D.; and Hajishirzi, H

Wang, Y.; Kordi, Y.; Mishra, S.; Liu, A.; Smith, N. A.; Khashabi, D.; and Hajishirzi, H. 2023. Self-Instruct: Aligning Language Models with Self-Generated Instructions. ACL

2023
[52]

Measuring short-form factuality in large language models

Wei, J.; Karina, N.; Chung, H. W.; Jiao, Y. J.; Papay, S.; Glaese, A.; Schulman, J.; and Fedus, W. 2024. Measuring Short-form Factuality in Large Language Models. arXiv:2411.04368

work page internal anchor Pith review Pith/arXiv arXiv 2024
[53]

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

Wei, J.; Sun, Z.; Papay, S.; McKinney, S.; Han, J.; Fulford, I.; Chung, H. W.; Passos, A. T.; Fedus, W.; and Glaese, A. 2025. BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents. arXiv:2504.12516

work page internal anchor Pith review Pith/arXiv arXiv 2025
[54]

Whitehouse, C.; Ruder, S.; Lin, T.; Kurylo, O.; Takagi, H.; Lam, J.; Busetto, N.; Diaz, D.; and Guzm\' a n, F. 2026. MENLO: Evaluating Native-like Quality Across 47 Languages. arXiv:2509.26601

work page arXiv 2026
[55]

Wong, R.; Wang, J.; Zhao, J.; Chen, L.; Gao, Y.; Zhang, L.; Zhou, X.; Wang, Z.; Xiang, K.; Zhang, G.; Huang, W.; Wang, Y.; and Wang, K. 2025. WideSearch: Benchmarking Agentic Broad Info-Seeking. arXiv:2508.07999

work page arXiv 2025
[56]

Xu, Z.; Jiang, F.; Niu, L.; Deng, Y.; Poovendran, R.; Choi, Y.; and Lin, B. Y. 2024. Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. arXiv:2406.08464

work page internal anchor Pith review Pith/arXiv arXiv 2024
[57]

Xu, C.; Sun, Q.; Zheng, K.; Geng, X.; Zhao, P.; Feng, J.; Tao, C.; Lin, Q.; and Jiang, D. 2024. WizardLM: Empowering Large Pre-trained Language Models to Follow Complex Instructions. ICLR

2024
[58]

Yang, Z.; Qi, P.; Zhang, S.; Bengio, Y.; Cohen, W.; Salakhutdinov, R.; and Manning, C. D. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. EMNLP

2018
[59]

Yao, S.; Chen, H.; Yang, J.; and Narasimhan, K. 2022. WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. NeurIPS

2022
[60]

Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; and Cao, Y. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. ICLR

2023
[61]

J.; Malaviya, C.; Bogin, B.; Press, O.; and Berant, J

Yoran, O.; Amouyal, S. J.; Malaviya, C.; Bogin, B.; Press, O.; and Berant, J. 2024. AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks? EMNLP

2024
[62]

P.; Zhang, H.; Gonzalez, J

Zheng, L.; Chiang, W.-L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E. P.; Zhang, H.; Gonzalez, J. E.; and Stoica, I. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS

2023
[63]

F.; Zhu, H.; Zhou, X.; Lo, R.; Sridhar, A.; Cheng, X.; Ou, T.; Bisk, Y.; Fried, D.; Alon, U.; and Neubig, G

Zhou, S.; Xu, F. F.; Zhu, H.; Zhou, X.; Lo, R.; Sridhar, A.; Cheng, X.; Ou, T.; Bisk, Y.; Fried, D.; Alon, U.; and Neubig, G. 2024. WebArena: A Realistic Web Environment for Building Autonomous Agents. ICLR

2024
[64]

Zhou, P.; Leon, B.; Ying, X.; Zhang, C.; Shao, Y.; Ye, Q.; Chong, D.; Jin, Z.; Xie, C.; Cao, M.; Gu, Y.; Hong, S.; Ren, J.; Chen, J.; Liu, C.; and Hua, Y. 2025. BrowseComp-ZH: Benchmarking Web Browsing Ability of LLMs in Chinese. arXiv:2504.19314

work page internal anchor Pith review Pith/arXiv arXiv 2025
[65]

Zhu, H.; Xu, Q.; Li, H.; Liu, Y.; Qiu, H.; Chen, J.; and Jin, J. 2026. Aggregation Queries over Unstructured Text: Benchmark and Agentic Method. arXiv:2602.01355

work page arXiv 2026

[1] [1]

J.; Wolfson, T.; Rubin, O.; Yoran, O.; Herzig, J.; and Berant, J

Amouyal, S. J.; Wolfson, T.; Rubin, O.; Yoran, O.; Herzig, J.; and Berant, J. 2023. QAMPARI: An Open-domain Question Answering Benchmark for Questions with Many Answers from Multiple Paragraphs. arXiv:2205.12665

work page arXiv 2023

[2] [2]

Asai, A.; Wu, Z.; Wang, Y.; Sil, A.; and Hajishirzi, H. 2024. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. ICLR

2024

[3] [3]

CAIS; and Scale AI. 2025. Humanity's Last Exam. arXiv:2501.14249

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Chen, W.; Zha, H.; Chen, Z.; Xiong, W.; Wang, H.; and Wang, W. Y. 2020. HybridQA: A Dataset of Multi-Hop Question Answering over Tabular and Textual Data. EMNLP Findings

2020

[5] [5]

Chen, W.; Wang, H.; Chen, J.; Zhang, Y.; Wang, H.; Li, S.; Zhou, X.; and Wang, W. Y. 2020. TabFact: A Large-scale Dataset for Table-based Fact Verification. ICLR

2020

[6] [6]

DeepSeek-AI. 2024. DeepSeek-V3 Technical Report. arXiv:2412.19437

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

Deng, X.; Gu, Y.; Zheng, B.; Chen, S.; Stevens, S.; Wang, B.; Sun, H.; and Su, Y. 2023. Mind2Web: Towards a Generalist Agent for the Web. NeurIPS

2023

[8] [8]

Ding, N.; Chen, Y.; Xu, B.; Qin, Y.; Zheng, Z.; Hu, S.; Liu, Z.; Sun, M.; and Zhou, B. 2023. Enhancing Chat Language Models by Scaling High-quality Instructional Conversations. EMNLP

2023

[9] [9]

Dong, Y.; Jiang, X.; Liu, H.; Jin, Z.; Gu, B.; Yang, M.; and Li, G. 2024. Generalization or Memorization: Data Contamination and Trustworthy Evaluation for LLMs. ACL Findings

2024

[10] [10]

Gemini Team. 2023. Gemini: A Family of Highly Capable Multimodal Models. arXiv:2312.11805

work page internal anchor Pith review Pith/arXiv arXiv 2023

[11] [11]

Golchin, S.; and Surdeanu, M. 2024. Time Travel in LLMs: Tracing Data Contamination in Large Language Models. ICLR

2024

[12] [12]

He, H.; Yao, W.; Ma, K.; Yu, W.; Dai, Y.; Zhang, H.; Lan, Z.; and Yu, D. 2024. WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. ACL

2024

[13] [13]

Ho, X.; Duong Nguyen, A.-K.; Sugawara, S.; and Aizawa, A. 2020. Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. COLING

2020

[14] [14]

HyperCLOVA X Team. 2024. HyperCLOVA X Technical Report. arXiv:2404.01954

work page arXiv 2024

[15] [15]

S.; and Davis, E

Jang, M.; Kim, D.; Kwon, D. S.; and Davis, E. 2022. KoBEST: Korean Balanced Evaluation of Significant Tasks. COLING

2022

[16] [16]

Jin, J.; Kim, J.; Lee, N.; Yoo, H.; Oh, A.; and Lee, H. 2024. KoBBQ: Korean Bias Benchmark for Question Answering. TACL

2024

[17] [17]

Joshi, M.; Choi, E.; Weld, D.; and Zettlemoyer, L. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. ACL

2017

[18] [18]

Kanana LLM Team. 2025. Kanana: Compute-efficient Bilingual Language Models. arXiv:2502.18934

work page arXiv 2025

[19] [19]

H.; Park, S.; Kim, S.; Kim, S.; Seo, D.; Lee, H.; Jeong, M.; Lee, S.; Kim, M.; Ko, S

Kim, B.; Kim, H.; Lee, S.-W.; Lee, G.; Kwak, D.; Jeon, D. H.; Park, S.; Kim, S.; Kim, S.; Seo, D.; Lee, H.; Jeong, M.; Lee, S.; Kim, M.; Ko, S. H.; Kim, S.; Park, T.; Kim, J.; Kang, S.; Ryu, N.-H.; Yoo, K. M.; Chang, M.; Suh, S.; In, S.; Park, J.; Kim, K.; Kim, H.; Jeong, J.; Yeo, Y. G.; Ham, D.; Park, D.; Lee, M. Y.; Kang, J.; Kang, I.; Ha, J.-W.; Park, ...

2021

[20] [20]

Kim, E.; Suk, J.; Oh, P.; Yoo, H.; Thorne, J.; and Oh, A. 2024. CLIcK: A Benchmark Dataset of Cultural and Linguistic Intelligence in Korean. LREC-COLING

2024

[21] [21]

Kim, S.; Kim, D.; Park, C.; Lee, W.; Song, W.; Kim, Y.; Kim, H.; Kim, Y.; Lee, H.; Kim, J.; Ahn, C.; Yang, S.; Lee, S.; Park, H.; Gim, G.; Cha, M.; Lee, H.; and Kim, S. 2024. SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling. NAACL

2024

[22] [22]

Krishna, S.; Krishna, K.; Mohananey, A.; Schwarcz, S.; Stambler, A.; Upadhyay, S.; and Faruqui, M. 2024. FRAMES: Factuality, Retrieval, And reasoning MEasurement Set. arXiv:2409.12941

work page arXiv 2024

[23] [23]

M.; Uszkoreit, J.; Le, Q.; and Petrov, S

Kwiatkowski, T.; Palomaki, J.; Redfield, O.; Collins, M.; Parikh, A.; Alberti, C.; Epstein, D.; Polosukhin, I.; Devlin, J.; Lee, K.; Toutanova, K.; Jones, L.; Kelcey, M.; Chang, M.-W.; Dai, A. M.; Uszkoreit, J.; Le, Q.; and Petrov, S. 2019. Natural Questions: A Benchmark for Question Answering Research. TACL

2019

[24] [24]

Lee, N.; Yoon, D.; Son, G.; Kim, G.; Ko, D.; Park, J.; Yoo, H.; Cho, J.; Park, J.; Lee, C.; Jang, K.; Kim, J.; Kim, E.; Cho, W.; and Kim, S. 2026. K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts. arXiv:2606.02404

work page internal anchor Pith review Pith/arXiv arXiv 2026

[25] [25]

u ttler, H.; Lewis, M.; Yih, W.-t.; Rockt\

Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; K\" u ttler, H.; Lewis, M.; Yih, W.-t.; Rockt\" a schel, T.; Riedel, S.; and Kiela, D. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS

2020

[26] [26]

LG AI Research. 2024. EXAONE 3.0 7.8B Instruction Tuned Language Model. arXiv:2408.03541

work page arXiv 2024

[27] [27]

Li, X.; Dong, G.; Jin, J.; Zhang, Y.; Zhou, Y.; Zhu, Y.; Zhang, P.; and Dou, Z. 2025. Search-o1: Agentic Search-Enhanced Large Reasoning Models. arXiv:2501.05366

work page internal anchor Pith review Pith/arXiv arXiv 2025

[28] [28]

Lim, S.; Kim, M.; and Lee, J. 2019. KorQuAD1.0: Korean QA Dataset for Machine Reading Comprehension. arXiv:1909.07005

work page arXiv 2019

[29] [29]

Liu, Y.; Iter, D.; Xu, Y.; Wang, S.; Xu, R.; and Zhu, C. 2023. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. EMNLP

2023

[30] [30]

Liu, X.; Yu, H.; Zhang, H.; Xu, Y.; Lei, X.; Lai, H.; Gu, Y.; Ding, H.; Men, K.; Yang, K.; Zhang, S.; Deng, X.; Zeng, A.; Du, Z.; Zhang, C.; Shen, S.; Zhang, T.; Su, Y.; Sun, H.; Huang, M.; Dong, Y.; and Tang, J. 2024. AgentBench: Evaluating LLMs as Agents. ICLR

2024

[31] [31]

Llama Team. 2024. The Llama 3 Herd of Models. arXiv:2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024

[32] [32]

Mialon, G.; Fourrier, C.; Swift, C.; Wolf, T.; LeCun, Y.; and Scialom, T. 2023. GAIA: A Benchmark for General AI Assistants. arXiv:2311.12983

work page internal anchor Pith review Pith/arXiv arXiv 2023

[33] [33]

Mukherjee, S.; Mitra, A.; Jawahar, G.; Agarwal, S.; Palangi, H.; and Awadallah, A. 2023. Orca: Progressive Learning from Complex Explanation Traces of GPT-4. arXiv:2306.02707

work page internal anchor Pith review Pith/arXiv arXiv 2023

[34] [34]

Nakano, R.; Hilton, J.; Balaji, S.; Wu, J.; Ouyang, L.; Kim, C.; Hesse, C.; Jain, S.; Kosaraju, V.; Saunders, W.; Jiang, X.; Cobbe, K.; Eloundou, T.; Krueger, G.; Button, K.; Knight, M.; Chess, B.; and Schulman, J. 2021. WebGPT: Browser-assisted Question-answering with Human Feedback. arXiv:2112.09332

work page internal anchor Pith review Pith/arXiv arXiv 2021

[35] [35]

V.; Verma, N.; Zhang, R.; Kry\' s ci\' n ski, W.; Schoelkopf, H.; Kong, R.; Tang, X.; Mutuma, M.; Rosand, B.; Trindade, I.; Bandaru, R.; Cunningham, J.; Xiong, C.; and Radev, D

Nan, L.; Hsieh, C.; Mao, Z.; Lin, X. V.; Verma, N.; Zhang, R.; Kry\' s ci\' n ski, W.; Schoelkopf, H.; Kong, R.; Tang, X.; Mutuma, M.; Rosand, B.; Trindade, I.; Bandaru, R.; Cunningham, J.; Xiong, C.; and Radev, D. 2022. FeTaQA: Free-form Table Question Answering. TACL

2022

[36] [36]

OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2023

[37] [37]

Oren, Y.; Meister, N.; Chatterji, N.; Ladhak, F.; and Hashimoto, T. B. 2024. Proving Test Set Contamination in Black Box Language Models. ICLR

2024

[38] [38]

Park, S.; Moon, J.; Kim, S.; Cho, W. I.; Han, J.; Park, J.; Song, C.; Kim, J.; Song, Y.; Oh, T.; Lee, J.; Oh, J.; Lyu, S.; Jeong, Y.; Lee, I.; Seo, S.; Lee, D.; Kim, H.; Lee, M.; Jang, S.; Do, S.; Kim, S.; Lim, K.; Lee, J.; Park, K.; Shin, J.; Kim, S.; Park, L.; Oh, A.; Ha, J.-W.; and Cho, K. 2021. KLUE: Korean Language Understanding Evaluation. arXiv:2105.09680

work page arXiv 2021

[39] [39]

Pasupat, P.; and Liang, P. 2015. Compositional Semantic Parsing on Semi-Structured Tables. ACL

2015

[40] [40]

G.; Zhang, T.; Wang, X.; and Gonzalez, J

Patil, S. G.; Zhang, T.; Wang, X.; and Gonzalez, J. E. 2024. Gorilla: Large Language Model Connected with Massive APIs. NeurIPS

2024

[41] [41]

Qin, Y.; Liang, S.; Ye, Y.; Zhu, K.; Yan, L.; Lu, Y.; Lin, Y.; Cong, X.; Tang, X.; Qian, B.; Zhao, S.; Hong, L.; Tian, R.; Xie, R.; Zhou, J.; Gerstein, M.; Li, D.; Liu, Z.; and Sun, M. 2024. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. ICLR

2024

[42] [42]

Qwen Team. 2024. Qwen2.5 Technical Report. arXiv:2412.15115

work page internal anchor Pith review Pith/arXiv arXiv 2024

[43] [43]

H.; Singh, S.; Maheshwary, R.; Altomare, M.; Haggag, M

Romanou, A.; Foroutan, N.; Sotnikova, A.; Chen, Z.; Nelaturu, S. H.; Singh, S.; Maheshwary, R.; Altomare, M.; Haggag, M. A.; A, S.; Amayuelas, A.; Amirudin, A. H.; Aryabumi, V.; Boiko, D.; Chang, M.; Chim, J.; Cohen, G.; Dalmia, A. K.; Diress, A.; Duwal, S.; Dzenhaliou, D.; Erazo Florez, D. F.; Farestam, F.; Imperial, J. M.; Islam, S. B.; Isotalo, P.; Jab...

2025

[44] [44]

Sainz, O.; Campos, J.; Garc\' i a-Ferrero, I.; Etxaniz, J.; Lopez de Lacalle, O.; and Agirre, E. 2023. NLP Evaluation in Trouble: On the Need to Measure LLM Data Contamination for each Benchmark. EMNLP Findings

2023

[45] [45]

Schick, T.; Dwivedi-Yu, J.; Dess\` i , R.; Raileanu, R.; Lomeli, M.; Zettlemoyer, L.; Cancedda, N.; and Scialom, T. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS

2023

[46] [46]

C.; Yeom, J

Son, G.; Lee, H.; Kim, S.; Kim, H.; Lee, J. C.; Yeom, J. W.; Jung, J.; Kim, J. W.; and Kim, S. 2024. HAE-RAE Bench: Evaluation of Korean Knowledge in Language Models. LREC-COLING

2024

[47] [47]

M.; and Biderman, S

Son, G.; Lee, H.; Kim, S.; Kim, S.; Muennighoff, N.; Choi, T.; Park, C.; Yoo, K. M.; and Biderman, S. 2025. KMMLU: Measuring Massive Multitask Language Understanding in Korean. NAACL

2025

[48] [48]

arXiv preprint arXiv:2602.12413 , year=

Spiesberger, A.; Vazquez, J. J.; Pochinkov, N.; Gaven c iak, T.; Grietzer, P.; Leech, G.; and Schoots, N. 2026. Soft Contamination Means Benchmarks Test Shallow Generalization. arXiv:2602.12413

work page arXiv 2026

[49] [49]

Trivedi, H.; Balasubramanian, N.; Khot, T.; and Sabharwal, A. 2022. MuSiQue: Multihop Questions via Single-hop Question Composition. TACL

2022

[50] [50]

Vu, T.; Iyyer, M.; Wang, X.; Constant, N.; Wei, J.; Wei, J.; Tar, C.; Sung, Y.-H.; Zhou, D.; Le, Q.; and Luong, T. 2024. FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation. ACL Findings

2024

[51] [51]

A.; Khashabi, D.; and Hajishirzi, H

Wang, Y.; Kordi, Y.; Mishra, S.; Liu, A.; Smith, N. A.; Khashabi, D.; and Hajishirzi, H. 2023. Self-Instruct: Aligning Language Models with Self-Generated Instructions. ACL

2023

[52] [52]

Measuring short-form factuality in large language models

Wei, J.; Karina, N.; Chung, H. W.; Jiao, Y. J.; Papay, S.; Glaese, A.; Schulman, J.; and Fedus, W. 2024. Measuring Short-form Factuality in Large Language Models. arXiv:2411.04368

work page internal anchor Pith review Pith/arXiv arXiv 2024

[53] [53]

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

Wei, J.; Sun, Z.; Papay, S.; McKinney, S.; Han, J.; Fulford, I.; Chung, H. W.; Passos, A. T.; Fedus, W.; and Glaese, A. 2025. BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents. arXiv:2504.12516

work page internal anchor Pith review Pith/arXiv arXiv 2025

[54] [54]

Whitehouse, C.; Ruder, S.; Lin, T.; Kurylo, O.; Takagi, H.; Lam, J.; Busetto, N.; Diaz, D.; and Guzm\' a n, F. 2026. MENLO: Evaluating Native-like Quality Across 47 Languages. arXiv:2509.26601

work page arXiv 2026

[55] [55]

Wong, R.; Wang, J.; Zhao, J.; Chen, L.; Gao, Y.; Zhang, L.; Zhou, X.; Wang, Z.; Xiang, K.; Zhang, G.; Huang, W.; Wang, Y.; and Wang, K. 2025. WideSearch: Benchmarking Agentic Broad Info-Seeking. arXiv:2508.07999

work page arXiv 2025

[56] [56]

Xu, Z.; Jiang, F.; Niu, L.; Deng, Y.; Poovendran, R.; Choi, Y.; and Lin, B. Y. 2024. Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. arXiv:2406.08464

work page internal anchor Pith review Pith/arXiv arXiv 2024

[57] [57]

Xu, C.; Sun, Q.; Zheng, K.; Geng, X.; Zhao, P.; Feng, J.; Tao, C.; Lin, Q.; and Jiang, D. 2024. WizardLM: Empowering Large Pre-trained Language Models to Follow Complex Instructions. ICLR

2024

[58] [58]

Yang, Z.; Qi, P.; Zhang, S.; Bengio, Y.; Cohen, W.; Salakhutdinov, R.; and Manning, C. D. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. EMNLP

2018

[59] [59]

Yao, S.; Chen, H.; Yang, J.; and Narasimhan, K. 2022. WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. NeurIPS

2022

[60] [60]

Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; and Cao, Y. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. ICLR

2023

[61] [61]

J.; Malaviya, C.; Bogin, B.; Press, O.; and Berant, J

Yoran, O.; Amouyal, S. J.; Malaviya, C.; Bogin, B.; Press, O.; and Berant, J. 2024. AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks? EMNLP

2024

[62] [62]

P.; Zhang, H.; Gonzalez, J

Zheng, L.; Chiang, W.-L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E. P.; Zhang, H.; Gonzalez, J. E.; and Stoica, I. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS

2023

[63] [63]

F.; Zhu, H.; Zhou, X.; Lo, R.; Sridhar, A.; Cheng, X.; Ou, T.; Bisk, Y.; Fried, D.; Alon, U.; and Neubig, G

Zhou, S.; Xu, F. F.; Zhu, H.; Zhou, X.; Lo, R.; Sridhar, A.; Cheng, X.; Ou, T.; Bisk, Y.; Fried, D.; Alon, U.; and Neubig, G. 2024. WebArena: A Realistic Web Environment for Building Autonomous Agents. ICLR

2024

[64] [64]

Zhou, P.; Leon, B.; Ying, X.; Zhang, C.; Shao, Y.; Ye, Q.; Chong, D.; Jin, Z.; Xie, C.; Cao, M.; Gu, Y.; Hong, S.; Ren, J.; Chen, J.; Liu, C.; and Hua, Y. 2025. BrowseComp-ZH: Benchmarking Web Browsing Ability of LLMs in Chinese. arXiv:2504.19314

work page internal anchor Pith review Pith/arXiv arXiv 2025

[65] [65]

Zhu, H.; Xu, Q.; Li, H.; Liu, Y.; Qiu, H.; Chen, J.; and Jin, J. 2026. Aggregation Queries over Unstructured Text: Benchmark and Agentic Method. arXiv:2602.01355

work page arXiv 2026