Pith Number

pith:WPWLP2AF

pith:2023:WPWLP2AFNH6A5CCMKTPPTQYGSQ
not attested · not anchored · not stored · refs resolved

FinanceBench: A New Benchmark for Financial Question Answering

Anand Kannappan, Bertie Vidgen, Douwe Kiela, Nino Scherrer, Pranab Islam, Rebecca Qian

Existing LLMs incorrectly answer or refuse to answer 81 percent of financial questions, even with retrieval support.

arxiv:2311.11944 v1 · 2023-11-20 · cs.CL · cs.AI · cs.CE · stat.ML

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{WPWLP2AFNH6A5CCMKTPPTQYGSQ}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

GPT-4-Turbo used with a retrieval system incorrectly answered or refused to answer 81% of questions.

C2 weakest assumption

The 150 sampled cases are assumed to be representative of the full 10,231 questions, and all questions are assumed to be ecologically valid and clear-cut as stated.
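As a rough illustration of why representativeness matters, the sketch below estimates the sampling uncertainty on an 81% rate measured on 150 cases drawn from 10,231. It assumes simple random sampling and a normal approximation, neither of which this record confirms, and the function name `proportion_interval` is hypothetical.

```python
import math


def proportion_interval(p_hat, n, population=None, z=1.96):
    """Normal-approximation interval for a sample proportion.

    Assumes simple random sampling; z=1.96 gives roughly 95% coverage.
    """
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    if population is not None:
        # Finite population correction: the sample is drawn without
        # replacement from a fixed pool of questions.
        se *= math.sqrt((population - n) / (population - 1))
    return p_hat - z * se, p_hat + z * se


# Figures from the claims above: 81% failure rate, 150 cases, 10,231 questions.
low, high = proportion_interval(0.81, 150, population=10231)
print(f"approximate 95% interval: {low:.3f} to {high:.3f}")
```

The interval covers sampling error only. It says nothing about whether the questions themselves are ecologically valid or clear-cut, which is the other half of the assumption.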

C3 one line summary

FinanceBench shows state-of-the-art LLMs incorrectly answer or refuse 81% of tested financial QA cases even with retrieval augmentation.

References

18 extracted · 18 resolved · 0 Pith anchors

[1] In Findings of the Association for Computational Linguistics: ACL 2023, pages 1298–1313, Toronto, Canada 2023
[2] Qa dataset explosion: A taxonomy of nlp resources for question answering and reading comprehension. ACM Comput. Surv., 55(10). Julio Cesar Salinas Alvarado, Karin Verspoor, and Timothy Baldwin. 20 2015
[3] Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua 2021
[4] finacebench_id_0000
[5] A value for whether it is in the eval sample of 298 cases (‘1’), in the open source sample (‘2’) or in neither (‘0’)

Formal links

1 machine-checked theorem link

Cited by

30 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:49.037230Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

b3ecb7e80569fc0e884c54def9c306940cc8af16666c5227b5b02cc34ae29d57

Aliases

arxiv: 2311.11944 · arxiv_version: 2311.11944v1 · doi: 10.48550/arxiv.2311.11944 · pith_short_12: WPWLP2AFNH6A · pith_short_16: WPWLP2AFNH6A5CCM · pith_short_8: WPWLP2AF
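The identifiers on this page appear to be related. The full Pith Number matches the unpadded Base32 encoding of the first 16 bytes of the canonical hash, and the `pith_short_*` aliases are prefixes of it. This is an observation from the values shown here, not a documented rule, and the helper name `base32_prefix` is hypothetical. A minimal sketch:

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "b3ecb7e80569fc0e884c54def9c306940cc8af16666c5227b5b02cc34ae29d57"
PITH_NUMBER = "WPWLP2AFNH6A5CCMKTPPTQYGSQ"


def base32_prefix(hex_digest, n_bytes=16):
    """Base32-encode the first n_bytes of a hex digest, dropping '=' padding.

    The 16-byte truncation is an assumption inferred from this record.
    """
    raw = bytes.fromhex(hex_digest)[:n_bytes]
    return base64.b32encode(raw).decode("ascii").rstrip("=")


derived = base32_prefix(CANONICAL_HASH)
print(derived == PITH_NUMBER)  # True for the values on this page

# The short aliases are the leading 8, 12, and 16 characters.
for width in (8, 12, 16):
    print(f"pith_short_{width}: {PITH_NUMBER[:width]}")
```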
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/WPWLP2AFNH6A5CCMKTPPTQYGSQ \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: b3ecb7e80569fc0e884c54def9c306940cc8af16666c5227b5b02cc34ae29d57
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "b93fb6e2e2c745257d9732d756d1bf963b7deaf484def9c20e8cfb5acf1f0834",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CE",
      "stat.ML"
    ],
    "license": "http://creativecommons.org/licenses/by-nc-nd/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2023-11-20T17:28:02Z",
    "title_canon_sha256": "0065779b111e2415d7e651f7202b8f5738a55baee518adfb61aae403754ff70d"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2311.11944",
    "kind": "arxiv",
    "version": 1
  }
}
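The same check can be run offline against the record printed above. The sketch below mirrors the canonicalization in the one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 encoding, SHA-256). The function name `canonical_sha256` is hypothetical. Whether the digest equals the expected value depends on the displayed record being complete, so the result should be compared with the canonical hash rather than assumed.

```python
import hashlib
import json

# The canonical record as displayed on this page.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "b93fb6e2e2c745257d9732d756d1bf963b7deaf484def9c20e8cfb5acf1f0834",
    "cross_cats_sorted": ["cs.AI", "cs.CE", "stat.ML"],
    "license": "http://creativecommons.org/licenses/by-nc-nd/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2023-11-20T17:28:02Z",
    "title_canon_sha256": "0065779b111e2415d7e651f7202b8f5738a55baee518adfb61aae403754ff70d"
  },
  "schema_version": "1.0",
  "source": {"id": "2311.11944", "kind": "arxiv", "version": 1}
}
"""

EXPECTED = "b3ecb7e80569fc0e884c54def9c306940cc8af16666c5227b5b02cc34ae29d57"


def canonical_sha256(record):
    """Serialize with sorted keys and compact separators, then hash as UTF-8."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


digest = canonical_sha256(json.loads(RECORD_JSON))
print(digest)
print("matches expected:", digest == EXPECTED)
```

Because keys are sorted before hashing, the whitespace and key order of the input JSON do not affect the digest.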