Measuring massive multitask language understanding
2 Pith papers cite this work. Polarity classification is still indexing.
Citation summary
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Roles: background (1)
Polarities: background (1)
citing papers explorer
- FactoryBench: Evaluating Industrial Machine Understanding
  FactoryBench reveals that frontier LLMs achieve under 50% on structured causal questions and under 18% on decision-making in industrial robotic telemetry.
- Do Thought Streams Matter? Evaluating Reasoning in Gemini Vision-Language Models for Video Scene Understanding
  Quality gains from extra thinking in Gemini models for video understanding plateau after the first few hundred tokens, Flash Lite balances quality and cost best, and tight reasoning budgets lead to compression-step hallucination where final outputs include un-reasoned content.