HyperCLOVA X technical report. Preprint at https://arxiv.org/abs/2404.01954 (2024)
7 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CL (7)
years: 2026 (7)
verdicts: UNVERDICTED (7)
representative citing papers
citing papers explorer
- Ko-WideSearch: A Korean Breadth-Search Benchmark for Exhaustive Set Enumeration by Web Agents
  Ko-WideSearch is a new Korean breadth-search benchmark spanning 16 categories and three difficulty tiers that evaluates web agents on full set membership plus per-item attributes, showing consistent gaps between set recovery and row completion.
- Anchoring LLM Gender Bias to Human Baselines: A Cross-Lingual Audit
  LLM gender stereotyping across four languages spans roughly 2.5 times the human cross-country range on HEXACO-100, with translation altering specific stereotyped attributes and effects that can compound.
- Discovering Lexical Gaps Using Embeddings from Multilingual LLMs
  A framework extracts embeddings from Korean-English bilingual LLMs across thousands of spaces and uses similarity distributions plus logistic classifiers to identify lexical gaps with AUCs of 0.81 and 0.76.
- K-MetBench: A Multi-Dimensional Benchmark for Fine-Grained Evaluation of Expert Reasoning, Locality, and Multimodality in Meteorology
  K-MetBench shows LLMs have large gaps in interpreting meteorology diagrams and Korean-specific context, with smaller local models beating much larger global ones.
- SCRIPT: A Subcharacter Compositional Representation Injection Module for Korean Pre-Trained Language Models
  SCRIPT is a model-agnostic injection module that enhances Korean PLM embeddings with subcharacter compositional knowledge from Jamo, leading to better performance on NLU and NLG tasks and more linguistically coherent embedding spaces.
- CHERRY: Compressed Hierarchical Experts with Recurrent Representational Yield
  CHERRY combines selective ground-truth token training, recurrent depth compression from 48 to 6 layers, and mixture-of-efficient-experts to achieve competitive loss with fewer parameters on a 1.8B Korean model.
- SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures
  SemEval-2026 Task 7 presents a benchmark and two evaluation tracks for assessing LLMs on everyday knowledge in diverse languages and cultures without allowing training on the test data.