LLMs show a consistent performance drop on arithmetic, spatial, and temporal reasoning tasks when framed in multi-turn dialogues versus isolated settings, demonstrated by the new BOULDER benchmark across eight travel-related tasks.
The time interval between 20:00 and midnight is 4 hours, which is equivalent to 240 minutes.
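The arithmetic in the excerpt above can be checked with a minimal sketch. The helper name `minutes_until_midnight` is hypothetical and not taken from the paper or the BOULDER benchmark; it assumes a 24-hour `HH:MM` input and treats midnight as the start of the following day.

```python
from datetime import datetime, timedelta


def minutes_until_midnight(start_hhmm: str) -> int:
    """Return whole minutes from a 24-hour HH:MM time to the next midnight.

    Hypothetical helper for illustration only.
    """
    start = datetime.strptime(start_hhmm, "%H:%M")
    # Midnight is taken to be 00:00 on the day after the start time.
    midnight = (start + timedelta(days=1)).replace(hour=0, minute=0)
    return int((midnight - start).total_seconds() // 60)


interval = minutes_until_midnight("20:00")
print(interval // 60, "hours =", interval, "minutes")  # 4 hours = 240 minutes
```

Working in minutes first and deriving hours afterwards keeps the unit conversion (4 × 60 = 240) explicit.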
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
Reasoning Gets Harder for LLMs Inside A Dialogue