Pith Number

pith:ENWG2XKP

pith:2026:ENWG2XKPMCBSPT5H7Q5XXEX7HU
not attested · not anchored · not stored · refs resolved

LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling

Dieuwke Hupkes, Jesse Dodge, Philipp Mondorf, Samuel J. Bell

Logic-preserving difficulty scaling finds problem variations that cause language models to fail up to five times more often than random tests.

arxiv:2605.15393 v1 · 2026-05-14 · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{ENWG2XKPMCBSPT5H7Q5XXEX7HU}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

We show that LPDS efficiently finds difficult problem variations for a model, resulting in performance drops up to 5 times larger compared to random sampling.

C2 · weakest assumption

The framework assumes that the difficulty of a logic-preserving variation can be quantified in a model-agnostic, or at least transferable, way that reliably predicts where failures will occur, and that the search procedure finds variations that are truly harder rather than merely different.

C3 · one-line summary

LPDS quantifies the difficulty of logic-preserving problem variations and searches for the hardest ones, producing performance drops up to 5 times larger than random sampling and better robustness gains from fine-tuning on difficult examples.

References

74 extracted · 74 resolved · 12 Pith anchors


Formal links

2 machine-checked theorem links

Receipt and verification
First computed 2026-05-20T00:00:56.318662Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

236c6d5d4f608327cfa7fc3b7b92ff3d3f3550026e8e6c3cc78bcff3c97c5f66

Aliases

arxiv: 2605.15393 · arxiv_version: 2605.15393v1 · doi: 10.48550/arxiv.2605.15393 · pith_short_12: ENWG2XKPMCBS · pith_short_16: ENWG2XKPMCBSPT5H · pith_short_8: ENWG2XKP
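
The Pith aliases above appear to be derived mechanically from the canonical hash. On this page, the full Pith Number matches the first 26 characters of the RFC 4648 Base32 encoding of the SHA-256 digest, and each `pith_short_N` alias is the first N characters of that number. The following is a minimal sketch of that relationship. It is inferred from the values shown on this page, not taken from the pith-number schema, and the function name `pith_number_from_hash` and the default length of 26 are assumptions.

```python
import base64

# Canonical hash exactly as shown on this page.
CANONICAL_HASH = "236c6d5d4f608327cfa7fc3b7b92ff3d3f3550026e8e6c3cc78bcff3c97c5f66"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest bytes and keep the leading characters.

    Assumption: this rule is inferred from the record on this page,
    not from a published specification.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


full = pith_number_from_hash(CANONICAL_HASH)

# Short aliases are plain prefixes of the full number.
aliases = {f"pith_short_{n}": full[:n] for n in (8, 12, 16)}

print(full)
print(aliases)
```

If the inferred rule holds, a client holding only the canonical hash can recompute the identifier and its short forms locally and compare them with the aliases listed above.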
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/ENWG2XKPMCBSPT5H7Q5XXEX7HU \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 236c6d5d4f608327cfa7fc3b7b92ff3d3f3550026e8e6c3cc78bcff3c97c5f66
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "1cef0907d0f498f145adbc8860fb3d834bef77895a396b62279f7af2a33f5c26",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by-sa/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-05-14T20:26:59Z",
    "title_canon_sha256": "3084e86520bad1b69c41ebd76253a81e4cba3f8551ce75ec8cb122d6bad18f0c"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.15393",
    "kind": "arxiv",
    "version": 1
  }
}
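
The canonicalization step in the verification one-liner under "Verify this Pith Number yourself" can also be run offline against a saved copy of the record. The sketch below repeats the same serialization choices (sorted keys, compact separators, no ASCII escaping, UTF-8 bytes) and hashes the result with SHA-256. The function name `canonical_sha256` is hypothetical, and the record literal is copied from this page; the printed digest is meant to be compared by hand with the canonical hash listed above.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record using the serialization from the page's one-liner."""
    payload = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters as UTF-8
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Canonical record as displayed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "1cef0907d0f498f145adbc8860fb3d834bef77895a396b62279f7af2a33f5c26",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by-sa/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2026-05-14T20:26:59Z",
        "title_canon_sha256": "3084e86520bad1b69c41ebd76253a81e4cba3f8551ce75ec8cb122d6bad18f0c",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.15393", "kind": "arxiv", "version": 1},
}

digest = canonical_sha256(record)
print(digest)
```

Because the keys are sorted before hashing, two mirrors that store the same record with different key order should still compute the same digest.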