pith. machine review for the scientific record.
Pith Number

pith:FMJQSJSF

pith:2023:FMJQSJSFFAOLGAFINRTNTVFPHH
not attested · not anchored · not stored · refs resolved

A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

Bryan Wilie, Dan Su, Holy Lovenia, Nayeon Lee, Pascale Fung, Quyet V. Do, Samuel Cahyawijaya, Tiezheng Yu, Wenliang Dai, Willy Chung, Yan Xu, Yejin Bang, Ziwei Ji

ChatGPT averages 63.41% accuracy across ten reasoning categories and improves only modestly with human interaction.

arxiv:2302.04023 v4 · 2023-02-08 · cs.CL · cs.AI

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

ChatGPT averages 63.41% accuracy across 10 reasoning categories spanning logical, non-textual, and commonsense reasoning, which makes it an unreliable reasoner. For example, it is better at deductive than inductive reasoning.

C2 weakest assumption

That the 23 chosen datasets, the newly designed multimodal dataset, and the 10 reasoning categories provide a representative and low-bias measure of ChatGPT capabilities without major sensitivity to prompt wording or subjective hallucination labeling.

C3 one-line summary

ChatGPT outperforms zero-shot LLMs on most tasks and improves with interaction, but it scores only 63.41% across the reasoning categories and generates extrinsic hallucinations from its training data.

References

23 extracted · 23 resolved · 2 Pith anchors

[1] News summarization and evaluation in the era of gpt-3 2023
[2] In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7890–7900 2021
[3] QA dataset explosion: A taxonomy of NLP resources for question answering and reading comprehension. ACM Comput. Surv. Just Accepted. 2022
[4] Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models 2022 · arXiv:2206.04615
[5] Richmond Thomason 2018

Formal links

2 machine-checked theorem links

Cited by

19 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:13.237057Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

2b13092645281cb300a86c66d9d4af39df07b806108f271b30a4904d70721687

Aliases

arxiv: 2302.04023 · arxiv_version: 2302.04023v4 · doi: 10.48550/arxiv.2302.04023 · pith_short_12: FMJQSJSFFAOL · pith_short_16: FMJQSJSFFAOLGAFI · pith_short_8: FMJQSJSF
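
The short aliases listed here are prefixes of the full identifier. For this record, the full identifier also matches the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash. The Python sketch below checks both relationships. The helper names `pith_id_from_hash` and `short_aliases` are hypothetical, and the 26-character truncation is inferred from this single record, so treat it as an assumption and not as a documented rule of the builder.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "2b13092645281cb300a86c66d9d4af39df07b806108f271b30a4904d70721687"
FULL_ID = "FMJQSJSFFAOLGAFINRTNTVFPHH"


def pith_id_from_hash(hex_digest: str, length: int = 26) -> str:
    """Hypothetical helper: Base32-encode the digest bytes and truncate.

    The truncation length is an assumption inferred from this one record.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(full_id: str) -> dict:
    """Return the 8-, 12- and 16-character prefixes listed under Aliases."""
    return {f"pith_short_{n}": full_id[:n] for n in (8, 12, 16)}


if __name__ == "__main__":
    # True when the identifier is the truncated Base32 form of the hash.
    print(pith_id_from_hash(CANONICAL_HASH) == FULL_ID)
    print(short_aliases(FULL_ID))
```

A mirror could use a check like this to confirm that an alias and a canonical hash refer to the same record before resolving either one.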
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/FMJQSJSFFAOLGAFINRTNTVFPHH \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 2b13092645281cb300a86c66d9d4af39df07b806108f271b30a4904d70721687
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "a2a18e59511c993ddda9d3136987bf9209d89feec0b355f8f79871b9e7e58bd3",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2023-02-08T12:35:34Z",
    "title_canon_sha256": "0af1cd3d0f93626347676b13538faee2652ea3b76a43db3c8090a2df56595df6"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2302.04023",
    "kind": "arxiv",
    "version": 4
  }
}
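
The verification one-liner above can also be written as a small offline Python sketch that hashes the canonical record without fetching it. It mirrors the serialization used in that command: sorted keys, compact separators, `ensure_ascii=False`, and UTF-8 encoding. The function name `canonical_sha256` is hypothetical. Whether the printed digest equals the canonical hash depends on the record shown on this page being complete and byte-for-byte faithful, which has not been verified here.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash the compact, key-sorted UTF-8 JSON form of a record."""
    payload = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Canonical record as shown on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "a2a18e59511c993ddda9d3136987bf9209d89feec0b355f8f79871b9e7e58bd3",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2023-02-08T12:35:34Z",
        "title_canon_sha256": "0af1cd3d0f93626347676b13538faee2652ea3b76a43db3c8090a2df56595df6",
    },
    "schema_version": "1.0",
    "source": {"id": "2302.04023", "kind": "arxiv", "version": 4},
}

if __name__ == "__main__":
    # Compare this output with the value listed under "Canonical hash".
    print(canonical_sha256(record))
```

Because the keys are sorted before hashing, the digest does not depend on the order in which a mirror stores or emits the fields.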