CryptoX : Compositional Reasoning Evaluation of Large Language Models

Chaoren Wei; Chenghao Yang; Ge Zhang; Jiajun Shi; Jian Yang; Liqun Yang; Stephen Huang; Tao Peng; Zekun Moore Wang; Zhoufutu Wen

arxiv: 2502.07813 · v2 · pith:U2YGPS47new · submitted 2025-02-08 · 💻 cs.CR · cs.AI

keywords: compositional · LLMs · reasoning · benchmarks · capacity · CryptoBench · CryptoX · evaluation

Compositional reasoning capacity has long been regarded as critical to the generalization and intelligence emergence of large language models (LLMs). However, despite numerous reasoning-related benchmarks, the compositional reasoning capacity of LLMs is rarely studied or quantified in existing benchmarks. In this paper, we introduce CryptoX, an evaluation framework that, for the first time, combines existing benchmarks with cryptographic principles to quantify the compositional reasoning capacity of LLMs. Building upon CryptoX, we construct CryptoBench, which integrates these principles into several benchmarks for systematic evaluation. We conduct detailed experiments on widely used open-source and closed-source LLMs using CryptoBench, revealing a huge gap between open-source and closed-source LLMs. We further conduct thorough mechanical interpretability experiments to reveal the inner mechanism of LLMs' compositional reasoning, involving subproblem decomposition, subproblem inference, and summarizing subproblem conclusions. Through analysis based on CryptoBench, we highlight the value of independently studying compositional reasoning and emphasize the need to enhance the compositional reasoning capabilities of LLMs.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

JT-SAFE-V2: Safety-by-Design Foundation Model with World-Context Data
cs.AI · 2026-05 · unverdicted · novelty 4.0

JT-Safe-V2 is a safety-by-design LLM that reports SOTA scores on both capability and safety benchmarks, while Safe-MoMA cuts inference cost by over 30 percent.