Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories
Pith reviewed 2026-06-27 13:41 UTC · model grok-4.3
The pith
A multi-agent framework produces evidence-grounded multimodal data stories that match expert work in verifiability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Data2Story coordinates multiple agents into a complete virtual newsroom that converts raw data into published multimedia stories. An Inspector ensures that every number, angle, and asset links directly to the underlying data, code, or an external reference, and the system generates interactive or audio elements when it judges that readers would benefit from them. On 18 paired articles, the outputs are competitive with the expert human versions across four axes: human-agent angle coverage, a 53-participant rubric, computer-use agents as judges, and a coding verifier that re-executes statements and checks references.
What carries the argument
The Inspector agent, which enforces evidence-grounding by linking every element back to data or references, together with multimodal tool selection that chooses interactive maps, audio, or other formats based on content.
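To make the evidence-grounding idea concrete, here is a minimal sketch of a check that flags claims with no link to data, code, or a reference. The paper does not describe the Inspector's implementation, so the `Claim` and `Evidence` types, the `ungrounded` function, and the sample claims are all hypothetical.

```python
from dataclasses import dataclass, field

# Assumption: the three evidence kinds mirror the targets named in the abstract.
ALLOWED_KINDS = {"data", "code", "reference"}


@dataclass
class Evidence:
    kind: str     # "data", "code", or "reference"
    locator: str  # e.g. a file path, a code cell id, or a URL (hypothetical)


@dataclass
class Claim:
    text: str
    evidence: list[Evidence] = field(default_factory=list)


def ungrounded(claims: list[Claim]) -> list[Claim]:
    """Return the claims that carry no evidence link of an allowed kind."""
    return [
        c for c in claims
        if not any(e.kind in ALLOWED_KINDS and e.locator for e in c.evidence)
    ]


# Illustrative claims only; they are not taken from the paper.
claims = [
    Claim("Rents rose 12% in 2023.", [Evidence("code", "analysis.py#rent_change")]),
    Claim("The city is the fastest-growing in the region."),
]
blocked = ungrounded(claims)
print([c.text for c in blocked])
```

In this sketch a claim passes as soon as one link exists. Checking that the link actually supports the claim is a separate step, which is what the coding verifier is described as doing.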
If this is right
- Every generated story can be audited by re-running the linked code and checking external references.
- Multimodal elements such as interactive maps are added only when the agent determines they aid reader understanding.
- The system serves as a starting draft that journalists can edit while retaining the built-in traceability.
- Verifiability scores rise because the Inspector prevents unsupported claims from reaching the final article.
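The first consequence above, auditing a story by re-running its linked code, could look roughly like the following. This is an illustrative sketch under stated assumptions, not the paper's verifier: the dataset, the `rent_change_pct` function, and the `audit` helper are all invented for the example.

```python
import math

# Hypothetical in-memory dataset standing in for a story's source data.
rows = [
    {"year": 2022, "rent": 1000.0},
    {"year": 2023, "rent": 1120.0},
]


def rent_change_pct(data):
    """Recompute the percentage change in rent from 2022 to 2023."""
    by_year = {r["year"]: r["rent"] for r in data}
    return (by_year[2023] - by_year[2022]) / by_year[2022] * 100


# Each statement pairs the published number with the code linked to it.
statements = [
    {"text": "Rents rose 12% in 2023.", "stated": 12.0, "recompute": rent_change_pct},
    {"text": "Rents rose 15% in 2023.", "stated": 15.0, "recompute": rent_change_pct},
]


def audit(statements, data, rel_tol=1e-6):
    """Re-run each statement's linked code and compare with the published value."""
    return [
        math.isclose(s["recompute"](data), s["stated"], rel_tol=rel_tol)
        for s in statements
    ]


results = audit(statements, rows)
print(results)
```

The point of the design is that the published figure and the code that produced it travel together, so an auditor needs only the data to repeat the check.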
Where Pith is reading between the lines
- The same evidence-linking structure could apply to other domains that require traceable outputs, such as policy reports or scientific summaries.
- If extended to live data streams, the framework might support rapid updates with automatic re-verification of changed numbers.
- Combining the Inspector with external fact-checking databases could reduce the remaining gap in editorial angle selection.
Load-bearing premise
The 18 chosen articles and the four evaluation axes capture the full range of data journalism tasks and reader requirements without selection bias or missing metrics that would change the competitiveness result.
What would settle it
Re-running the coding verifier on a new set of 20 articles finds that more than 25 percent of Data2Story claims cannot be re-executed against the source data or matched to references.
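The threshold in this test reduces to simple arithmetic. The sketch below assumes that claims from all articles are pooled into one list of pass/fail verifier results, which is one possible reading of the criterion. The function name and the sample counts are hypothetical.

```python
def falsifies_claim(verifier_results: list[bool], threshold: float = 0.25) -> bool:
    """Return True if the share of failed claims strictly exceeds the threshold."""
    if not verifier_results:
        raise ValueError("no claims to evaluate")
    failures = sum(1 for ok in verifier_results if not ok)
    return failures / len(verifier_results) > threshold


# Illustrative counts: 40 pooled claims with 10 or 11 verification failures.
at_threshold = falsifies_claim([True] * 30 + [False] * 10)    # 25% failed
over_threshold = falsifies_claim([True] * 29 + [False] * 11)  # 27.5% failed
print(at_threshold, over_threshold)
```

Because the criterion says "more than 25 percent", exactly 25 percent does not meet it in this sketch.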
Original abstract
Data tells stories that shape society; the data journalist's job is to turn raw information into stories non-experts can trust. A high-quality news feature takes a newsroom team weeks: hunting for context, running statistics, choosing an angle, and designing visuals. Recent agents handle individual steps well: data-science agents close the analysis loop, while design agents synthesize beautiful websites. But can an agent serve as a data journalist end to end? We introduce Data Journalist Agent (Data2Story), a multi-agent framework that orchestrates specialized roles into a single virtual newsroom. Data2Story contributes two innovations. (i) Claims are evidence-grounded: an Inspector links every number, angle, and asset back to data, code, or an external reference. (ii) Articles are multimodally generative: rather than defaulting to plain text and static charts, Data2Story reasons about what readers will want to see, then deploys multimodal tools, such as interactive maps for geography and audio for music. We evaluate Data2Story on 18 articles, each paired with the originally published expert piece, along four axes: (a) human-agent angle coverage; (b) rubric evaluation with 53 participants across five dimensions; (c) computer-use agents as judges, a cost-saving proxy for how readers navigate interactive articles; and (d) verifiability, where a coding verifier re-executes statements against the data and checks claims against references. Data2Story produces competitive, evidence-traceable multimedia stories, with particular strength in transparency and auditability. Human articles retain an edge in editorial angle, creative design, and presentation. We position Data2Story as a collaborator for journalists, enabling more evidence-based, transparent, and verifiable reporting. Code and demos are available at https://data2story.github.io.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Data2Story, a multi-agent framework that orchestrates specialized roles to automate end-to-end data journalism. It contributes an Inspector agent that grounds every claim, angle, and asset to data, code, or external references, plus multimodal generation that reasons about reader needs to produce interactive maps, audio, and other assets rather than static text/charts. The system is evaluated on 18 articles (each paired with the originally published expert piece) along four axes: human-agent angle coverage, a rubric scored by 53 participants across five dimensions, computer-use agents as judges for interactive navigation, and a coding verifier that re-executes statements against the original data and checks references. The authors conclude that Data2Story produces competitive, evidence-traceable multimedia stories with particular strength in transparency and auditability, while human articles retain advantages in editorial angle, creative design, and presentation; they position the system as a collaborator for journalists.
Significance. If the evaluation holds, the work would be significant for showing that agentic systems can handle the full pipeline of data journalism with explicit verifiability mechanisms, moving beyond isolated tools for analysis or design. The open release of code and demos at https://data2story.github.io supports reproducibility and allows direct inspection of the Inspector and multimodal components. This could influence future agent frameworks that require auditability in high-stakes domains like news.
major comments (4)
- [Evaluation section] Evaluation section (abstract and implied §4): the claim that Data2Story produces 'competitive' stories rests on the 18 selected articles, yet the manuscript provides no explicit selection protocol, diversity metrics, or justification that these cases represent the range of data journalism challenges (e.g., time pressure, ethical framing, long-form engagement). This selection bias risk directly affects the generalizability of the competitiveness conclusion.
- [Evaluation section] Evaluation section: no information is given on how the 53 participants were recruited, what statistical tests were applied to the rubric scores, inter-rater reliability, or data exclusion rules. These omissions make it impossible to confirm that the reported human evaluation results support the competitiveness claim.
- [Verifiability axis] Verifiability axis (abstract): the coding verifier is presented as a strength for re-executing statements against data and checking references, but the manuscript supplies no description of how it handles edge cases, ambiguous claims, or complex multimodal assets, which is load-bearing for the auditability advantage asserted.
- [Method section] Method section (abstract): the Inspector agent is introduced as a core innovation for evidence-grounding, yet the paper does not detail its implementation, failure modes, or how it avoids missing references, leaving the 'evidence-traceable' claim underspecified.
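The second comment asks for inter-rater reliability. As one example of what such a metric could be, the sketch below computes Fleiss' kappa, a standard agreement statistic for more than two raters assigning categorical scores. Nothing in the manuscript says which statistic the authors used or will use, and the rating counts here are invented.

```python
def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every item needs the same rater count."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of raters")
    n_categories = len(counts[0])

    # Observed agreement per item, then averaged over items.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items

    # Agreement expected by chance from the overall category proportions.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_bar - p_e) / (1 - p_e)


# Invented example: 2 items, 3 raters, 2 categories.
perfect = fleiss_kappa([[3, 0], [0, 3]])
split = fleiss_kappa([[2, 1], [1, 2]])
print(perfect, split)
```

A value of 1 indicates complete agreement, and values at or below 0 indicate agreement no better than chance.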
minor comments (2)
- The abstract would be strengthened by including at least one quantitative result (e.g., average rubric scores or verifiability pass rate) rather than the qualitative statement 'produces competitive' stories.
- The four evaluation axes are listed but their precise mapping to the five rubric dimensions is not clarified in the provided text.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We address each of the four major comments below. Where information was missing or underspecified, we agree that revisions are needed and will incorporate the requested details in the next version.
-
Referee: [Evaluation section] Evaluation section (abstract and implied §4): the claim that Data2Story produces 'competitive' stories rests on the 18 selected articles, yet the manuscript provides no explicit selection protocol, diversity metrics, or justification that these cases represent the range of data journalism challenges (e.g., time pressure, ethical framing, long-form engagement). This selection bias risk directly affects the generalizability of the competitiveness conclusion.
Authors: We agree that the selection protocol requires explicit description to support generalizability claims. The 18 articles were chosen from publicly available data journalism pieces with accompanying datasets. In the revised manuscript we will add a subsection in §4 specifying the selection criteria (public data availability, topic variety across politics, environment, health, and economics), diversity metrics (e.g., distribution of data types and story lengths), and limitations regarding real-time reporting or ethically complex cases.
Revision: yes
-
Referee: [Evaluation section] Evaluation section: no information is given on how the 53 participants were recruited, what statistical tests were applied to the rubric scores, inter-rater reliability, or data exclusion rules. These omissions make it impossible to confirm that the reported human evaluation results support the competitiveness claim.
Authors: We acknowledge the omission of these methodological details. The revised manuscript will expand the human evaluation description to include recruitment procedures, the statistical tests applied to rubric scores, inter-rater reliability metrics, and data exclusion rules. These additions will allow readers to evaluate the robustness of the reported results.
Revision: yes
-
Referee: [Verifiability axis] Verifiability axis (abstract): the coding verifier is presented as a strength for re-executing statements against data and checking references, but the manuscript supplies no description of how it handles edge cases, ambiguous claims, or complex multimodal assets, which is load-bearing for the auditability advantage asserted.
Authors: The current manuscript does not detail the verifier's handling of edge cases. We will add a description in the revised Evaluation section covering tolerance thresholds for numerical matches, flagging of ambiguous claims for review, and verification procedures for multimodal assets via associated data sources and code. This will strengthen the auditability claims.
Revision: yes
-
Referee: [Method section] Method section (abstract): the Inspector agent is introduced as a core innovation for evidence-grounding, yet the paper does not detail its implementation, failure modes, or how it avoids missing references, leaving the 'evidence-traceable' claim underspecified.
Authors: We agree that the Inspector agent's implementation requires more detail. The revised Method section will include its prompting approach, mechanisms for linking claims to sources, documented failure modes such as missed references, and mitigation steps including multi-agent cross-verification. These additions will better substantiate the evidence-grounding contribution.
Revision: yes
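The response on the verifier mentions tolerance thresholds for numerical matches without defining them. One possible rule, sketched here purely as an assumption, accepts a recomputed value when it lies within half a unit of the last digit printed in the article. The function name and example figures are hypothetical.

```python
from decimal import Decimal


def matches_as_published(stated: str, recomputed: float) -> bool:
    """Return True if `recomputed` is within half a unit of the last printed
    digit of `stated` (bounds inclusive), e.g. within 0.05 of "12.0"."""
    stated_dec = Decimal(stated)
    exponent = stated_dec.as_tuple().exponent  # -1 for "12.0", 0 for "12"
    half_unit = Decimal(1).scaleb(exponent) / 2
    return abs(Decimal(str(recomputed)) - stated_dec) <= half_unit


# Illustrative figures only.
print(matches_as_published("12.0", 11.96))  # within 0.05 of 12.0
print(matches_as_published("12.0", 11.9))   # off by 0.1
print(matches_as_published("12", 11.6))     # within 0.5 of 12
```

Tying the tolerance to the printed precision means a figure reported as "12" is held to a looser standard than one reported as "12.0", which matches how readers interpret rounded numbers.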
Circularity Check
No circularity in derivation or evaluation chain
The paper's central claims rest on external evaluation: 18 articles are compared to originally published expert pieces, scored by 53 human participants on a rubric, judged by separate computer-use agents, and verified by a coding verifier that re-executes statements against the original data and references. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided abstract or evaluation description. The derivation chain (multi-agent orchestration with Inspector for grounding) is assessed against independent benchmarks rather than quantities defined by the system itself, making the result self-contained against external data.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Specialized agents can be orchestrated into a reliable end-to-end pipeline for complex creative tasks without coordination failures that would break evidence links.
invented entities (1)
- Inspector agent: no independent evidence