Pith Number

pith:FMJQSJSF

pith:2023:FMJQSJSFFAOLGAFINRTNTVFPHH

not attested · not anchored · not stored · refs resolved

A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

Bryan Wilie, Dan Su, Holy Lovenia, Nayeon Lee, Pascale Fung, Quyet V. Do, Samuel Cahyawijaya, Tiezheng Yu, Wenliang Dai, Willy Chung, Yan Xu, Yejin Bang, Ziwei Ji

ChatGPT averages 63.41% accuracy across ten reasoning categories and improves only modestly with human interaction.

arxiv:2302.04023 v4 · 2023-02-08 · cs.CL · cs.AI


Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live · download bundle · merged state

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
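To illustrate what "deterministic merge" means in practice, here is a minimal sketch of an order-independent fold over events. It is not the Pith merge algorithm, which this page does not specify. The event shape (`id`, `ts`, `field`, `value`) and the last-writer-wins rule are assumptions made purely for illustration, and signature checking is left out.

```python
from typing import Any

def merge_events(events: list[dict[str, Any]]) -> dict[str, Any]:
    """Fold events into one state, independent of the order they arrive in.

    Hypothetical event shape (an assumption, not the Pith schema):
        {"id": str, "ts": ISO-8601 str, "field": str, "value": Any}
    """
    state: dict[str, Any] = {}
    # Sorting by (timestamp, id) gives every mirror the same total order,
    # so ties between equal timestamps are broken identically everywhere.
    for event in sorted(events, key=lambda e: (e["ts"], e["id"])):
        state[event["field"]] = event["value"]  # last writer wins
    return state

# Illustrative events only; the values are made up for the example.
example_events = [
    {"id": "e1", "ts": "2026-05-17T00:00:00Z", "field": "citations", "value": 1},
    {"id": "e2", "ts": "2026-05-18T00:00:00Z", "field": "citations", "value": 2},
    {"id": "e3", "ts": "2026-05-18T00:00:00Z", "field": "claimed", "value": False},
]

state_a = merge_events(example_events)
state_b = merge_events(list(reversed(example_events)))
print(state_a == state_b)
```

Because the events are sorted before folding, two mirrors that received the same events in different orders compute the same state, which is the property the bundle description relies on.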

Claims

C1 strongest claim

ChatGPT is 63.41% accurate on average in 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning, hence making it an unreliable reasoner. It is, for example, better at deductive than inductive reasoning.

C2 weakest assumption

That the 23 chosen datasets, the newly designed multimodal dataset, and the 10 reasoning categories provide a representative and low-bias measure of ChatGPT capabilities without major sensitivity to prompt wording or subjective hallucination labeling.

C3 one line summary

ChatGPT outperforms other LLMs in the zero-shot setting on most tasks and improves with interaction, but it averages only 63.41% accuracy across the reasoning categories and generates extrinsic hallucinations from its training data.

References

23 extracted · 23 resolved · 2 Pith anchors

[1] News summarization and evaluation in the era of gpt-3 2023

[2] In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7890–7900 2021

[3] Qa dataset explosion: A taxonomy of nlp resources for question answering and reading comprehension. ACM Comput. Surv. Just Accepted. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, 2022

[4] Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models 2022 · arXiv:2206.04615

[5] Richmond Thomason 2018

Formal links

2 machine-checked theorem links

Cited by

19 papers in Pith

ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards

Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

An Embodied Generalist Agent in 3D World

Low-Resource Languages Jailbreak GPT-4

Receipt and verification

First computed	2026-05-17T23:38:13.237057Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

2b13092645281cb300a86c66d9d4af39df07b806108f271b30a4904d70721687

Aliases

arxiv: 2302.04023 · arxiv_version: 2302.04023v4 · doi: 10.48550/arxiv.2302.04023 · pith_short_12: FMJQSJSFFAOL · pith_short_16: FMJQSJSFFAOLGAFI · pith_short_8: FMJQSJSF
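The values shown on this page are consistent with the identifier being a Base32 (RFC 4648) encoding of the canonical hash, truncated to 26 characters, with the `pith_short_*` aliases as shorter prefixes. This is an observation drawn from the values above, not a documented rule, so treat the sketch below as a consistency check under that assumption. The helper name `base32_prefix` is hypothetical.

```python
import base64

# Values copied from this page.
CANONICAL_HASH = "2b13092645281cb300a86c66d9d4af39df07b806108f271b30a4904d70721687"
PITH_ID = "FMJQSJSFFAOLGAFINRTNTVFPHH"

def base32_prefix(hex_digest: str, length: int) -> str:
    """Base32-encode a hex digest and keep the first `length` characters."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]

# Assumption: the full identifier is the first 26 Base32 characters.
print(base32_prefix(CANONICAL_HASH, 26) == PITH_ID)

# The short aliases are then prefixes of the same encoding.
for n in (8, 12, 16):
    print(n, base32_prefix(CANONICAL_HASH, n))
```

If the relationship holds for other records as well, an alias can be checked against a canonical hash without contacting the resolver.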

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/FMJQSJSFFAOLGAFINRTNTVFPHH \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 2b13092645281cb300a86c66d9d4af39df07b806108f271b30a4904d70721687
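The hashing step of the one-liner above can be unpacked into a small function, which makes the canonicalization rules easier to see: keys sorted, no whitespace between separators, non-ASCII characters kept as-is, UTF-8 bytes. This is a sketch of that step only; fetching the record from the resolver is left out, and the function name `canonical_sha256` is a hypothetical choice.

```python
import hashlib
import json

def canonical_sha256(record: dict) -> str:
    """Hash a record using the same serialization as the shell one-liner."""
    canonical = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # compact form, no spaces
        ensure_ascii=False,      # keep non-ASCII characters unescaped
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

# Key order in the input does not change the digest.
print(canonical_sha256({"b": 2, "a": 1}) == canonical_sha256({"a": 1, "b": 2}))
```

To check a record, pass the parsed `canonical_record` object to `canonical_sha256` and compare the result with the canonical hash listed on the page.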

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "a2a18e59511c993ddda9d3136987bf9209d89feec0b355f8f79871b9e7e58bd3",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by-nc-sa/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2023-02-08T12:35:34Z",
    "title_canon_sha256": "0af1cd3d0f93626347676b13538faee2652ea3b76a43db3c8090a2df56595df6"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2302.04023",
    "kind": "arxiv",
    "version": 4
  }
}