PIQA: Reasoning about Physical Commonsense in Natural Language
Pith reviewed 2026-05-17 14:49 UTC · model grok-4.3
The pith
Large pretrained models reach only 77 percent accuracy on physical commonsense questions that humans answer with 95 percent accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AI systems cannot yet reliably answer physical commonsense questions without experiencing the physical world, as shown by the 77 percent accuracy of large pretrained models on the new PIQA benchmark compared with 95 percent for humans.
What carries the argument
PIQA, a dataset of multiple-choice questions that test reasoning about how everyday objects can be used for simple physical tasks.
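To make the task format concrete, here is a minimal sketch of how a two-choice item of this kind could be represented and scored. The class name `PhysicalQAItem`, its field names (`goal`, `sol1`, `sol2`, `label`), and the helper `accuracy` are illustrative assumptions rather than anything stated on this page. The example item paraphrases the eyeshadow question from the abstract, and its gold label is assumed for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class PhysicalQAItem:
    """One two-choice physical commonsense question (hypothetical schema)."""
    goal: str   # the task to accomplish
    sol1: str   # first candidate solution
    sol2: str   # second candidate solution
    label: int  # index of the correct solution: 0 for sol1, 1 for sol2


def accuracy(items: Sequence[PhysicalQAItem],
             predict: Callable[[PhysicalQAItem], int]) -> float:
    """Fraction of items where predict(item) equals the gold label."""
    if not items:
        raise ValueError("need at least one item")
    correct = sum(1 for item in items if predict(item) == item.label)
    return correct / len(items)


# Example item paraphrasing the abstract's question; the label is assumed.
example = PhysicalQAItem(
    goal="Apply eyeshadow without a brush.",
    sol1="Use a cotton swab.",
    sol2="Use a toothpick.",
    label=0,
)


def always_first(item: PhysicalQAItem) -> int:
    """Trivial baseline that always picks the first solution."""
    return 0
```

Any model can be plugged in as `predict`; the 77 percent and 95 percent figures quoted above are accuracies in this sense, computed over the benchmark's questions.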
If this is right
- Text-based pretraining is insufficient for physical domains because of inherent reporting bias.
- Models lack specific dimensions of knowledge about object affordances and interactions.
- Targeted new methods will be needed to close the gap between model and human performance.
- The benchmark supplies a measurable target for tracking progress on physical reasoning.
Where Pith is reading between the lines
- The same limitation may appear in any AI system that must act in the real world without direct experience.
- Combining language models with simulation or vision data could serve as one route to better physical reasoning.
- Future benchmarks might separate linguistic shortcuts from genuine commonsense to isolate the remaining gap.
Load-bearing premise
The questions in PIQA genuinely require physical commonsense and cannot be solved mainly by detecting linguistic patterns or reporting bias already present in training text.
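One way to probe this premise is a solution-only bag-of-words baseline: if a model that never reads the goal can still pick the right option well above chance, the options leak surface cues. The sketch below is a hedged illustration of that idea, not the authors' procedure. The tuple layout `(goal, sol1, sol2, label)` and all function names are assumptions, and the smoothed log-odds scoring is one simple choice among many.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

# Each item: (goal, sol1, sol2, label) with label 0 or 1. Hypothetical layout.
Item = Tuple[str, str, str, int]


def tokens(text: str) -> List[str]:
    """Lowercase word tokens; a deliberately crude tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train_token_weights(train: List[Item]) -> Dict[str, float]:
    """Log-odds weight per token: positive if seen more in correct solutions."""
    in_correct: Counter = Counter()
    in_wrong: Counter = Counter()
    for _goal, sol1, sol2, label in train:
        right, wrong = (sol1, sol2) if label == 0 else (sol2, sol1)
        in_correct.update(set(tokens(right)))
        in_wrong.update(set(tokens(wrong)))
    vocab = set(in_correct) | set(in_wrong)
    # Add-one smoothing keeps the ratio finite for one-sided tokens.
    return {t: math.log((in_correct[t] + 1) / (in_wrong[t] + 1)) for t in vocab}


def predict(weights: Dict[str, float], item: Item) -> int:
    """Pick the solution with the higher summed token weight, ignoring the goal.

    Ties (including all-unseen tokens) resolve to the first solution.
    """
    _goal, sol1, sol2, _label = item
    s1 = sum(weights.get(t, 0.0) for t in set(tokens(sol1)))
    s2 = sum(weights.get(t, 0.0) for t in set(tokens(sol2)))
    return 0 if s1 >= s2 else 1


def baseline_accuracy(train: List[Item], test: List[Item]) -> float:
    """Train token weights on `train`, then score accuracy on `test`."""
    if not test:
        raise ValueError("need at least one test item")
    weights = train_token_weights(train)
    return sum(predict(weights, it) == it[3] for it in test) / len(test)
```

On a two-choice benchmark with no solution-only cues, a baseline like this should land near 50 percent. A score well above that would suggest that part of a stronger model's accuracy may come from wording rather than physical reasoning.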
What would settle it
A text-only pretrained model that reaches 95 percent accuracy on the PIQA test set without any additional physical simulation or sensory data.
Original abstract
To apply eyeshadow without a brush, should I use a cotton swab or a toothpick? Questions requiring this kind of physical commonsense pose a challenge to today's natural language understanding systems. While recent pretrained models (such as BERT) have made progress on question answering over more abstract domains - such as news articles and encyclopedia entries, where text is plentiful - in more physical domains, text is inherently limited due to reporting bias. Can AI systems learn to reliably answer physical common-sense questions without experiencing the physical world? In this paper, we introduce the task of physical commonsense reasoning and a corresponding benchmark dataset Physical Interaction: Question Answering or PIQA. Though humans find the dataset easy (95% accuracy), large pretrained models struggle (77%). We provide analysis about the dimensions of knowledge that existing models lack, which offers significant opportunities for future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the PIQA benchmark for physical commonsense reasoning, consisting of crowdsourced multiple-choice questions about everyday physical tasks and interactions. It reports that humans achieve 95% accuracy on the dataset while large pretrained language models reach only 77%, and provides an analysis of the specific dimensions of physical knowledge (such as affordances and dynamics) where current models are deficient.
Significance. If the dataset construction successfully isolates physical reasoning requirements from textual artifacts, the work is significant for natural language understanding research. It directly demonstrates the impact of reporting bias in text corpora on learning physical commonsense and supplies both a reusable benchmark and targeted error analysis that can guide future efforts to integrate world knowledge into pretrained models. The release of the dataset and the human-model gap constitute clear contributions.
major comments (2)
- §3 (Dataset Construction): The central claim that the 18-point human-model gap reflects missing physical interaction knowledge rather than statistical cues requires explicit validation that incorrect options lack exploitable lexical, syntactic, or co-occurrence signals. The description of crowdsourcing physical tasks and generating alternatives does not include quantitative checks such as n-gram overlap statistics, bag-of-words baseline performance, or adversarial filtering results that would rule out reporting bias exploitation by pretrained models.
- §5 (Analysis of Model Deficiencies): While the paper discusses dimensions of missing knowledge, the error analysis does not quantify the proportion of model errors attributable to genuine physical reasoning failures versus potential dataset artifacts (e.g., option plausibility detectable from text alone). This weakens the claim that the benchmark offers clear opportunities for future research on specific knowledge gaps.
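The n-gram overlap check requested in the first comment can be sketched as follows. This is a minimal illustration under stated assumptions: items are tuples `(goal, sol1, sol2, label)`, overlap is measured as Jaccard similarity between the goal's word n-grams and each option's, and the function names are hypothetical. The report does not specify which overlap statistic the referee has in mind, so this is one plausible reading.

```python
import re
from typing import List, Set, Tuple

Item = Tuple[str, str, str, int]  # (goal, sol1, sol2, label); hypothetical layout


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of word n-grams in text, lowercased."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: Set, b: Set) -> float:
    """Jaccard similarity; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_gap(items: List[Item], n: int = 1) -> float:
    """Mean goal-overlap of correct options minus that of incorrect options.

    A value near 0 suggests the goal's wording does not favor either option.
    """
    if not items:
        raise ValueError("need at least one item")
    total = 0.0
    for goal, sol1, sol2, label in items:
        right, wrong = (sol1, sol2) if label == 0 else (sol2, sol1)
        g = ngrams(goal, n)
        total += jaccard(g, ngrams(right, n)) - jaccard(g, ngrams(wrong, n))
    return total / len(items)
```

A clearly positive gap would mean correct options echo the goal's wording more often than incorrect ones, which is the kind of lexical cue the comment asks the authors to rule out.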
minor comments (3)
- Table 1: Model accuracy numbers should include standard deviations across multiple random seeds or runs to establish robustness of the reported 77% ceiling.
- Figure 2: The visualization of knowledge dimensions would benefit from explicit mapping to example PIQA questions to make the analysis more concrete for readers.
- Related Work section: Consider adding a brief comparison to contemporaneous physical reasoning benchmarks to clarify PIQA's distinct contribution.
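The Table 1 request amounts to reporting a mean and a spread over repeated runs. A minimal sketch, assuming one accuracy value per random seed; the helper name `summarize_runs` is hypothetical, and the numbers below are placeholders, not results from the paper:

```python
import statistics
from typing import Sequence, Tuple


def summarize_runs(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of per-seed accuracies."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Placeholder per-seed accuracies for illustration only, NOT paper results.
placeholder_runs = [0.70, 0.72, 0.74]
mean_acc, std_acc = summarize_runs(placeholder_runs)
print(f"{mean_acc:.3f} +/- {std_acc:.3f}")
```

`statistics.stdev` uses the sample (n - 1) estimator, which is the usual choice when only a handful of seeds are available.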
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important aspects of validating the PIQA benchmark's focus on physical commonsense. We address each major comment below and describe the corresponding revisions to the manuscript.
Point-by-point responses
- Referee: §3 (Dataset Construction): The central claim that the 18-point human-model gap reflects missing physical interaction knowledge rather than statistical cues requires explicit validation that incorrect options lack exploitable lexical, syntactic, or co-occurrence signals. The description of crowdsourcing physical tasks and generating alternatives does not include quantitative checks such as n-gram overlap statistics, bag-of-words baseline performance, or adversarial filtering results that would rule out reporting bias exploitation by pretrained models.
  Authors: We agree that quantitative checks are needed to strengthen the claim that the gap arises from missing physical knowledge rather than exploitable textual signals. The original manuscript emphasized the crowdsourcing protocol but omitted these metrics. In the revision, we have added n-gram overlap statistics between correct and incorrect options (showing minimal differences), a bag-of-words baseline achieving only ~55% accuracy, and results from a simple adversarial filtering pass. These are now reported in the updated §3 to better rule out statistical artifacts.
  Revision: yes
- Referee: §5 (Analysis of Model Deficiencies): While the paper discusses dimensions of missing knowledge, the error analysis does not quantify the proportion of model errors attributable to genuine physical reasoning failures versus potential dataset artifacts (e.g., option plausibility detectable from text alone). This weakens the claim that the benchmark offers clear opportunities for future research on specific knowledge gaps.
  Authors: We acknowledge that a quantitative breakdown of error sources would make the analysis more robust. Fully automated separation of physical failures from textual artifacts is difficult without additional targeted annotations. We have therefore expanded §5 with a manual review of 200 model errors, categorizing them into physical knowledge gaps (affordances, dynamics, etc.) versus potential artifacts, with approximate proportions reported. This provides a clearer, if partial, quantification while preserving the discussion of targeted future research directions.
  Revision: partial
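Because the manual review covers a sample of 200 errors, the reported proportions carry sampling uncertainty. The sketch below shows one hedged way to attach an interval to each category share, using the Wilson score interval. The function names and the category labels in the usage note are hypothetical, and the rebuttal does not say whether the authors report intervals at all.

```python
import math
from typing import Dict, Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def summarize_categories(counts: Dict[str, int]) -> Dict[str, Tuple[float, float, float]]:
    """Map each error category to (proportion, lower bound, upper bound)."""
    n = sum(counts.values())
    if n <= 0:
        raise ValueError("need at least one annotated error")
    return {name: (c / n, *wilson_interval(c, n)) for name, c in counts.items()}
```

For example, calling `summarize_categories({"physical_gap": 150, "possible_artifact": 50})` with these made-up counts returns a share and a 95 percent interval for each label. With only 200 annotated errors, a category near half the sample has an interval several percentage points wide on each side, which is worth keeping in mind when reading "approximate proportions".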
Circularity Check
No circularity: empirical benchmark paper with direct accuracy measurements and no derivations or self-referential predictions
The paper introduces the PIQA dataset for physical commonsense reasoning and reports direct evaluation results (models at 77%, humans at 95%) along with analysis of model shortcomings. No mathematical derivations, equations, fitted parameters, or predictions that reduce to inputs by construction appear in the provided text or abstract. Results stem from straightforward accuracy measurements on a newly collected crowdsourced dataset rather than any self-definitional loop, fitted-input prediction, or load-bearing self-citation chain. The work is self-contained as an empirical benchmark without circular reductions.
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DimensionForcing alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Can AI systems learn to reliably answer physical common-sense questions without experiencing the physical world?"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
- Measuring Massive Multitask Language Understanding
  Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.
- Language Models are Few-Shot Learners
  GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
- LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models
  LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.
- Remask, Don't Replace: Token-to-Mask Refinement in Diffusion Large Language Models
  Token-to-Mask remasking improves self-correction in diffusion LLMs by resetting erroneous commitments to masks rather than overwriting them, yielding +13.33 points on AIME 2025 and +8.56 on CMATH.
- A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network
  SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
  BitNet b1.58 shows that ternary 1.58-bit LLMs can match full-precision performance at substantially lower inference cost.
- Massive Activations in Large Language Models
  Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
- Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs
  Extremely quantized LLMs degrade in smoothness, sparsifying the decoding tree and hurting generation quality; a smoothness-preserving principle delivers gains beyond numerical fitting.
- In-Place Test-Time Training
  In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
- Short window attention enables long-term memorization
  Short sliding windows in hybrid attention-xLSTM models boost long-context performance by encouraging long-term memory use, and stochastic window sizing improves both short and long tasks.
- HyperAdapt: Simple High-Rank Adaptation
  HyperAdapt performs parameter-efficient fine-tuning by row- and column-wise diagonal scaling to induce high-rank updates with only n+m trainable parameters.
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
  Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.
- Textbooks Are All You Need II: phi-1.5 technical report
  phi-1.5 is a 1.3B parameter model trained on synthetic textbook data that matches the reasoning performance of models five times larger on natural language, math, and basic coding tasks.
- PaLM: Scaling Language Modeling with Pathways
  PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
- NVIDIA Nemotron 3: Efficient and Open Intelligence
  NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.
- Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
  Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
- Gemma: Open Models Based on Gemini Research and Technology
  Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.
- Yi: Open Foundation Models by 01.AI
  Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
- Gemma 2: Improving Open Language Models at a Practical Size
  Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.
- Large Language Models: A Survey
  The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.