LoCar: Localization-Aware Evaluation of In-Vehicle Assistants through Fine-Grained Sociolinguistic Control

Alice Oh; Eunsu Kim; Jaeho Kim; Ken E. Friedl; Kiwoong Park; Seogyeong Jeong; Seyoung Song

arxiv: 2605.21086 · v1 · pith:T2NIPBOTnew · submitted 2026-05-20 · 💻 cs.CL

Seogyeong Jeong, Kiwoong Park, Seyoung Song, Eunsu Kim, Ken E. Friedl, Jaeho Kim, Alice Oh

Pith reviewed 2026-05-21 04:58 UTC · model grok-4.3

classification 💻 cs.CL

keywords in-vehicle assistants · Korean honorifics · LLM evaluation · localization · sociolinguistic control · conversational metrics · clarification · proactivity

The pith

Current LLMs show unstable fine-grained control of Korean honorifics in in-vehicle conversations and underperform on clarification and proactivity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces LoCar, an evaluation framework tailored to in-vehicle assistants with emphasis on Korean localization. It shows that existing LLMs cannot maintain precise speech-level honorific distinctions, which are essential for culturally appropriate vehicle interactions. The framework also documents weaker results on clarification and proactivity, which the authors link to the subjective character of those tasks and therefore score conservatively to preserve safety signals. These patterns indicate that general language competence is insufficient and that automotive systems require explicit, localization-specific linguistic and strategic benchmarks.

Core claim

LoCar is a localization-aware evaluation framework for in-vehicle assistants that demonstrates unstable fine-grained Korean honorific control in current LLMs and weaker performance on clarification and proactivity, concluding that precise speech-level realization and reliable safety-oriented interaction management must be explicitly evaluated rather than assumed from general competence.

What carries the argument

LoCar evaluation framework that applies fine-grained sociolinguistic control, centered on Korean honorific levels, to measure model behavior in simulated in-vehicle dialogues.

If this is right

Precise speech-level realization must be explicitly tested rather than assumed in any Korean-language in-vehicle deployment.
Clarification and proactivity require conservative scoring because their subjective nature can mask safety risks if over-optimistically evaluated.
Automotive AI must advance from general linguistic competence to domain-specific tailoring and strategic interaction management.
Localization-aware benchmarks are needed to select and improve models for real-world cultural and safety constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same fine-grained honorific evaluation approach could be adapted to other languages with complex politeness systems to test cross-lingual localization gaps.
Embedding these metrics directly into model training loops might reduce the observed instability in safety-critical conversational domains.
The conservative scoring method for subjective tasks offers a template for evaluations in other high-stakes settings such as medical or legal assistants.
Applying the framework to real driving data with time pressure and distraction would test whether the reported weaknesses translate to measurable safety outcomes.

Load-bearing premise

The subjective complexity of clarification and proactivity tasks justifies a conservative evaluation stance that still produces reliable signals for real-world in-vehicle safety.
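The reviewed text does not say how the conservative stance is implemented. As a minimal sketch, assuming each response receives several independent pass/fail judgments, one conservative rule is to count a response as passing only when every judgment agrees. The function names and the unanimity rule below are illustrative assumptions, not the paper's protocol.

```python
from typing import Callable, List

def conservative_pass(judgments: List[bool]) -> bool:
    """Assumed rule: pass only if every judge says pass; no judgments means fail."""
    return bool(judgments) and all(judgments)

def majority_pass(judgments: List[bool]) -> bool:
    """Comparison rule: pass if strictly more than half of the judges say pass."""
    return sum(judgments) * 2 > len(judgments)

def pass_rate(items: List[List[bool]], rule: Callable[[List[bool]], bool]) -> float:
    """Fraction of items that pass under the given aggregation rule."""
    if not items:
        return 0.0
    return sum(rule(j) for j in items) / len(items)

# Hypothetical judgments for four responses, three judges each.
example_items = [
    [True, True, True],
    [True, True, False],
    [False, False, True],
    [True, True, True],
]
conservative_rate = pass_rate(example_items, conservative_pass)
majority_rate = pass_rate(example_items, majority_pass)
```

Under a rule like this, any disagreement among judges lowers the score, so a task with more subjective judgments would score lower than the same task under a majority rule.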

What would settle it

A model that produces consistent, context-appropriate Korean honorific shifts across multiple speech levels in a new set of in-vehicle clarification and proactivity scenarios without task-specific prompting would falsify the instability finding.
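A consistency check of this kind can be approximated with a sentence-level suffix heuristic. The sketch below is an illustration under stated assumptions, not the LoCar implementation: the ending lists are a small hand-picked subset of Korean sentence endings, the level names `hae`, `haeyo`, and `hapsyo` follow the labels visible in Figure 2, and all function names are hypothetical.

```python
import re
from typing import List

# Assumed, incomplete ending lists; real Korean endings are far more varied.
HAPSYO_ENDINGS = ("니다", "니까", "십시오")  # formal polite endings
HAEYO_ENDINGS = ("요", "죠")                 # informal polite endings

def classify_speech_level(sentence: str) -> str:
    """Guess the speech level of one sentence from its final syllables."""
    core = sentence.strip().rstrip(".!?~ ")
    if core.endswith(HAPSYO_ENDINGS):
        return "hapsyo"
    if core.endswith(HAEYO_ENDINGS):
        return "haeyo"
    # Anything else is treated as plain (hae) style; this over-generalizes.
    return "hae"

def split_sentences(text: str) -> List[str]:
    """Split on whitespace that follows sentence-final punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def level_consistency(response: str, target: str) -> float:
    """Fraction of sentences in a response whose guessed level matches the target."""
    sentences = split_sentences(response)
    if not sentences:
        return 0.0
    hits = sum(classify_speech_level(s) == target for s in sentences)
    return hits / len(sentences)
```

A response that mixes levels, such as one formal sentence followed by one informal polite sentence, scores 0.5 against either target. A suffix check alone cannot judge whether the chosen level suits the context, so it would need to be paired with contextual judgment.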

Figures

Figures reproduced from arXiv: 2605.21086 by Alice Oh, Eunsu Kim, Jaeho Kim, Ken E. Friedl, Kiwoong Park, Seogyeong Jeong, Seyoung Song.

**Figure 2.** Summary of overall performance across all 13 KPIs, with domain-specific breakdowns in Figure 3 and detailed scores reported in Appendix F. Across models, fine-grained honorific control remains inconsistent despite generally high compliance with polite speech. In addition, interaction-sensitive KPIs, such as clarification and proac… (caption truncated in the source). KPI labels visible in the figure include Conciseness, Hae, Haeyo, Hapsyo, Implicit Understanding, and Context Understanding.

**Figure 3.** Full scatter plot of 11 LLMs evaluated under the LoCar framework, with KPIs separated by use case.

Original abstract

While Large Language Models (LLMs) are increasingly integrated into in-vehicle conversational systems, identifying the optimal model remains challenging due to the lack of domain-specific evaluation standards tailored to real-world deployment requirements. In this paper, we propose a novel evaluation framework for in-vehicle assistants, with a particular focus on Korean-language localization. Our empirical analysis reveals notable patterns in model behavior. First, fine-grained Korean honorific control remains unstable in current LLMs, indicating that precise speech-level realization must be explicitly evaluated in localization settings. Second, models exhibit weaker performance in strategic conversational metrics like clarification and proactivity. Our analysis suggests this stems from the inherent subjective complexity of these tasks, where our framework adopts a conservative evaluation stance to prioritize reliability. Together, our findings underscore that automotive AI must move beyond general competence toward precise linguistic tailoring and reliable, safety-oriented interaction management.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces a Korean-focused evaluation framework for in-vehicle LLMs that flags instability in honorific control, but the supporting details on test design remain thin.

The core point is that current LLMs show unstable fine-grained control over Korean honorific levels in car-assistant scenarios, and the authors built LoCar to make that kind of localization explicit in evaluation. They also note weaker results on clarification and proactivity, which they tie to the subjective nature of those tasks and handle with a conservative scoring approach. That framing is the main new angle: applying sociolinguistic metrics to an automotive setting rather than general chat or translation benchmarks. The practical angle is useful because Korean honorific choices affect perceived politeness and could matter for driver attention or trust in real vehicles. The paper does a reasonable job of naming the gap and sketching why general-purpose models fall short on these dimensions.

The stress-test concern about test construction lands. The abstract states the instability finding but gives no prompt templates, variable list, item count, or annotation rules, so it is hard to tell whether the signal comes from genuine model limits or from how the tests were written. If the full paper still leaves those steps underspecified, the central claim stays hard to reproduce. The conservative stance on subjective tasks is defensible for safety-critical use, yet it risks making the framework too blunt to catch real differences.

This work is aimed at applied NLP groups and automotive teams working on Korean or other honorific-heavy languages. A reader who needs concrete localization metrics for deployment would get some value from the framing even if the numbers need more scrutiny. The paper is coherent enough on its own terms to deserve referee time rather than a desk reject, mainly because the application area is narrow but real and the framework could be stress-tested in review.

Referee Report

2 major / 1 minor

Summary. The paper proposes the LoCar evaluation framework for in-vehicle conversational assistants, with a focus on Korean-language localization via fine-grained sociolinguistic control (e.g., honorific levels). Empirical analysis reports that current LLMs exhibit unstable fine-grained Korean honorific control and weaker performance on clarification and proactivity tasks; the latter is attributed to subjective task complexity, leading the framework to adopt a conservative evaluation stance for reliability in safety-critical settings. The work concludes that automotive AI requires explicit evaluation of precise linguistic tailoring beyond general competence.

Significance. If the empirical patterns hold under rigorous verification, this contributes a domain-specific benchmark that could improve localization and safety in real-world in-vehicle systems by highlighting gaps in LLM handling of culturally nuanced speech levels. The introduction of fine-grained sociolinguistic metrics for Korean honorifics is a constructive addition to localization-aware NLP evaluation, particularly if accompanied by reproducible test protocols.

major comments (2)

[Abstract] The central claim that 'fine-grained Korean honorific control remains unstable in current LLMs' and the resulting recommendation that 'precise speech-level realization must be explicitly evaluated' rest on an unspecified test construction; no details are provided on prompt templates, the exact sociolinguistic variables manipulated (e.g., context-sensitive shifts among banmal, jondaemal, and finer status/age/relationship gradations), number of test items, or the annotation protocol used to label outputs as unstable. This is load-bearing because the instability signal could be an artifact of prompt engineering rather than a robust localization requirement.
[Abstract] The assertion that weaker performance in clarification and proactivity 'stems from the inherent subjective complexity of these tasks' and justifies a 'conservative evaluation stance' lacks supporting evidence such as inter-annotator agreement scores, concrete task examples, or error analysis, undermining the claim that this stance still produces reliable signals for real-world in-vehicle safety.
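For reference, the inter-annotator agreement requested in the second comment is often reported as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a generic two-annotator implementation with hypothetical labels; the reviewed text does not say which agreement statistic, if any, the paper uses.

```python
from collections import Counter
from typing import Hashable, Sequence

def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Hypothetical pass/fail labels from two annotators on four responses.
annotator_1 = ["pass", "pass", "fail", "fail"]
annotator_2 = ["pass", "fail", "fail", "fail"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on three of four items (0.75) while chance agreement is 0.5, giving a kappa of 0.5. Reporting such a value per KPI would show whether clarification and proactivity are in fact harder to judge consistently than the other metrics.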

minor comments (1)

The abstract would be strengthened by briefly stating the number and identities of LLMs evaluated and the overall size of the test suite to contextualize the reported patterns.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and have revised the manuscript to improve clarity and provide additional supporting details where appropriate.

Referee: [Abstract] The central claim that 'fine-grained Korean honorific control remains unstable in current LLMs' and the resulting recommendation that 'precise speech-level realization must be explicitly evaluated' rest on an unspecified test construction; no details are provided on prompt templates, the exact sociolinguistic variables manipulated (e.g., context-sensitive shifts among banmal, jondaemal, and finer status/age/relationship gradations), number of test items, or the annotation protocol used to label outputs as unstable. This is load-bearing because the instability signal could be an artifact of prompt engineering rather than a robust localization requirement.

Authors: We acknowledge that the abstract, due to length constraints, does not enumerate every methodological parameter. The full manuscript details the LoCar framework's test construction in Sections 3 and 4, including the prompt templates, the sociolinguistic variables (context-sensitive shifts across banmal, jondaemal, and finer gradations by status/age/relationship), the test item count, and the expert annotation protocol. To directly address the concern and strengthen the abstract's self-containment, we have revised it to briefly summarize these elements and added an explicit pointer to the reproducibility details in the methods. This revision ensures the instability findings are presented as grounded in the described protocol rather than prompt artifacts.

Revision: yes

Referee: [Abstract] The assertion that weaker performance in clarification and proactivity 'stems from the inherent subjective complexity of these tasks' and justifies a 'conservative evaluation stance' lacks supporting evidence such as inter-annotator agreement scores, concrete task examples, or error analysis, undermining the claim that this stance still produces reliable signals for real-world in-vehicle safety.

Authors: We agree that the abstract would benefit from explicit supporting evidence for this claim. The manuscript body already contains the relevant analysis, but we have now incorporated a concise summary of inter-annotator agreement, representative task examples, and error patterns into both the abstract and the results section. These additions substantiate the subjective complexity of clarification and proactivity tasks and justify the conservative stance as a deliberate choice for reliability in safety-critical automotive contexts.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical observations and new metrics stand independently

The paper proposes an evaluation framework for in-vehicle assistants with focus on Korean sociolinguistic control and reports empirical patterns from model testing. Central claims rest on direct observation of LLM behavior (unstable fine-grained honorifics, weaker clarification/proactivity) rather than any derivation, equation, or fitted parameter that reduces to prior inputs. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked to justify the framework; the conservative stance on subjective tasks is presented as an explicit methodological choice. The analysis is self-contained against external benchmarks of model outputs and introduces novel metrics without renaming or re-deriving known results from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review limits visibility into parameters or entities; no free parameters, axioms, or invented entities are explicitly introduced in the provided text.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "fine-grained Korean honorific control remains unstable in current LLMs, indicating that precise speech-level realization must be explicitly evaluated"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Paper passage: "hybrid verification step alongside contextual judgment... sentence-level suffix checking"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
