Brick: Spatial Capability Routing for the Mixture-of-Models (MoM) Paradigm

Francesco Massa; Marco Cristofanilli

arxiv: 2606.13241 · v1 · pith:EPJYVJIVnew · submitted 2026-06-11 · 💻 cs.AI

Pith reviewed 2026-06-27 06:54 UTC · model grok-4.3

classification 💻 cs.AI

keywords mixture of models · LLM routing · capability dimensions · query difficulty · cost optimization · geometric dispatch · router benchmark

The pith

Brick routes queries using six capability dimensions plus difficulty to beat the best single model while cutting cost up to 22 times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Brick as a router that assigns each model scores along six capability dimensions, pairs them with an estimate of query difficulty, and selects the model through a geometric rule that penalizes higher cost. A single preference setting lets the operator move continuously from highest accuracy to lowest cost without retraining. On a test set of 5,504 queries the max-quality mode exceeds the strongest individual model, while balanced and minimum-cost modes trade modest accuracy for large reductions in expense and latency. The approach targets the observation that success within a domain varies sharply with query properties that surface features such as keywords or token count fail to capture. If the routing rule works as described, production systems can keep frontier-level performance on many requests while paying far less on average.

Core claim

Brick scores each model on six capability dimensions, combines this with a per-query difficulty estimate, and dispatches via a cost-penalized geometric rule. A continuous preference knob lets operators slide between max-quality and max-saving profiles at deploy time. On a benchmark of 5,504 queries, Brick at max-quality reaches 76.98% accuracy, beating the best single model (75.02%) and all tested routers. At a neutral cost-quality profile, Brick achieves 74.11% accuracy at 4.71x lower cost than always using the strongest model. At min-cost, it cuts cost 22.15x with 11.85 points accuracy loss. Median latency drops from 51.2s to 22.8s.

What carries the argument

The cost-penalized geometric dispatch rule that operates on six-dimensional capability scores and per-query difficulty estimates.
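As a rough illustration of what such a rule could look like, the sketch below uses the quantities named in the figure captions: the difficulty lift z_q = logit(τ_q), the per-capability requirement r_{q,c} = p_c z_q, and the per-model capacity v_{m,c} = p_c logit(s_{m,c}). The Euclidean distance, the additive cost penalty, the knob name `lam`, and the model names and costs are all assumptions made for illustration; the summary does not give the paper's exact formula.

```python
import math

def logit(p, eps=1e-6):
    """Map a probability to logit space, clamping away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def route(p, tau, models, lam):
    """Pick a model for one query (illustrative, not the paper's exact rule).

    p      : capability distribution over the dimensions (sums to 1)
    tau    : per-query difficulty estimate in (0, 1)
    models : dict name -> {"scores": per-dimension scores in (0, 1), "cost": float}
    lam    : hypothetical preference knob; 0 ignores cost, larger favors cheap models
    """
    z = logit(tau)                      # difficulty lift
    req = [pc * z for pc in p]          # requirement r_{q,c} = p_c * z_q
    best_name, best_obj = None, math.inf
    for name, m in models.items():
        cap = [pc * logit(s) for pc, s in zip(p, m["scores"])]  # v_{m,c}
        # Assumed geometry: Euclidean distance between capacity and requirement.
        dist = math.sqrt(sum((r - v) ** 2 for r, v in zip(req, cap)))
        obj = dist + lam * m["cost"]    # assumed additive cost penalty
        if obj < best_obj:
            best_name, best_obj = name, obj
    return best_name

# Hypothetical two-model pool with made-up scores and relative costs.
pool = {
    "frontier": {"scores": [0.9] * 6, "cost": 10.0},
    "local":    {"scores": [0.6] * 6, "cost": 1.0},
}
query_p = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # query loads on a single capability
print(route(query_p, tau=0.9, models=pool, lam=0.0))
print(route(query_p, tau=0.9, models=pool, lam=1.0))
```

Under these assumptions, moving `lam` between runs changes the selected model without touching the scores, which is the sense in which a single preference setting can be adjusted at deploy time without retraining.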

If this is right

At the max-quality setting the router exceeds the accuracy of any individual model tested.
At the neutral setting the same router delivers near-frontier accuracy at roughly one-fifth the cost of the strongest model.
At the minimum-cost setting cost falls by more than twenty times, with a documented accuracy penalty of 11.85 points.
Median response latency falls by more than half under the routing policy.
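The phrases "roughly one-fifth the cost" and "more than half" can be checked directly against the figures quoted in the core claim. The short calculation below does only that arithmetic; it assumes the "strongest model" used as the cost baseline is the same model as the 75.02% best single model, which the summary implies but does not state outright.

```python
# Figures quoted in the core claim above.
best_single_acc = 75.02     # % accuracy, best single model
brick_max_acc = 76.98       # % accuracy, Brick at max-quality
brick_neutral_acc = 74.11   # % accuracy, Brick at the neutral profile
neutral_cost_factor = 4.71  # cost reduction factor at the neutral profile
min_cost_factor = 22.15     # cost reduction factor at min-cost
latency_before_s, latency_after_s = 51.2, 22.8  # median latency in seconds

gain_at_max = round(brick_max_acc - best_single_acc, 2)         # points gained
gap_at_neutral = round(best_single_acc - brick_neutral_acc, 2)  # points given up
neutral_cost_share = round(1 / neutral_cost_factor, 3)          # fraction of baseline cost
min_cost_share = round(1 / min_cost_factor, 3)
latency_ratio = round(latency_after_s / latency_before_s, 3)

print(gain_at_max)         # 1.96
print(gap_at_neutral)      # 0.91
print(neutral_cost_share)  # 0.212
print(min_cost_share)      # 0.045
print(latency_ratio)       # 0.445
```

So the neutral profile pays about 21% of the baseline cost for a 0.91-point accuracy gap, and the reported median latency is about 45% of the original figure.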

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same six-dimensional scoring could be reused to add or retire models without retraining the router.
If difficulty estimates can be updated online, the router might adapt to shifts in query distribution over time.
The geometric rule offers a template for adding other constraints such as latency targets or regulatory requirements.
Operators could expose the preference knob to end users so that cost versus quality choices become per-session decisions.

Load-bearing premise

The six capability dimensions and per-query difficulty estimate can be computed in a way that reliably predicts which model will succeed on unseen queries.

What would settle it

A held-out query set where the model chosen by the six-dimensional scores and difficulty estimate performs worse on average than the strongest single model or a simple baseline router.
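One way to run such a check is a paired comparison on the same held-out queries, with a bootstrap interval on the accuracy difference. The sketch below is a generic percentile bootstrap, not a procedure taken from the paper; the function name and the per-query correctness lists are hypothetical inputs.

```python
import random

def paired_bootstrap_diff(router_correct, baseline_correct, n_boot=2000, seed=0):
    """Accuracy difference (router minus baseline) with a 95% percentile bootstrap CI.

    Both inputs are per-query 0/1 correctness flags over the SAME held-out queries.
    Returns (point_estimate, ci_low, ci_high).
    """
    if len(router_correct) != len(baseline_correct):
        raise ValueError("inputs must cover the same queries")
    n = len(router_correct)
    # Work on per-query differences so the pairing is preserved when resampling.
    diffs = [int(r) - int(b) for r, b in zip(router_correct, baseline_correct)]
    point = sum(diffs) / n
    rng = random.Random(seed)
    samples = []
    for _ in range(n_boot):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        samples.append(sum(resample) / n)
    samples.sort()
    ci_low = samples[int(0.025 * n_boot)]
    ci_high = samples[int(0.975 * n_boot) - 1]
    return point, ci_low, ci_high

# Toy example with made-up correctness flags for eight held-out queries.
router = [1, 1, 0, 1, 1, 0, 1, 1]
baseline = [1, 0, 0, 1, 1, 1, 1, 0]
print(paired_bootstrap_diff(router, baseline))
```

If the whole interval sits below zero on a held-out set, the router is doing worse than the baseline it is compared against, which is the outcome described above.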

Figures

Figures reproduced from arXiv: 2606.13241 by Francesco Massa, Marco Cristofanilli.

**Figure 1.** Cost vs response accuracy on Dataset A. Single-model baselines are squares, external routers are triangles, Brick (MoM) profiles are circles (max profile is a star). The dashed line marks the three-model oracle ceiling.

**Figure 2.** Brick architecture. The query $x$ flows through text normalization and keyword prior injection, then branches into two parallel estimators: a ModernBERT capability distribution head producing $p(x) \in \Delta^{D-1}$, and a complexity head producing the difficulty target $\tau_q$. The routing math is decomposed into five sub-blocks: difficulty lift (raising $\tau_q$ to logit space $z_q$), per-capability requirement $r_{q,c}$ and per-model ca…

**Figure 3.** Two views of the routing decision for the worked-example query q_03563 of §7.2. (A) Heatmap of the post-projection quantities the router actually compares: the top row is the per-capability requirement $r_{q,c} = p_c z_q$ (Step 3), the next three rows are the per-model capacities $v_{m,c} = p_c \,\mathrm{logit}(s_{m,c})$ (Step 4); the red box in each column marks the model whose capacity is closest to the query requirement on that dimen…

**Figure 4.** Top-10 runs from the ModernBERT W&B sweep by Pearson-macro, in a multi-metric dashboard view (training loss, learning rate, gradient norm, validation loss, validation Pearson macro). 7.5. Complexity classifier and difficulty blending. Step (iv) of the pipeline is a separate fine-tuned classifier that scores how hard the query is for the pool. We use a 3-class head (easy/medium/hard) built on Qwen3.5-0.8B w…

**Figure 5.** End-to-end latency CDF on Dataset A. Brick (MoM) at the max profile sits between always-ds4 and always-kimi, while achieving higher response accuracy than either. … two auxiliary classifier calls (capability and complexity) over a remote endpoint, while Cascade Routing performs a local hash lookup. In a production deployment of Brick observed over 6,376 live requests, the same decision step has a median of…

Original abstract

Defining query difficulty is one of the hardest problems in deployment engineering. Existing LLM routers rely on surface features such as domain labels, keywords, and token count, ignoring the within-domain variance that actually determines model success. Frontier models cost ten to one hundred times more than local open-weight models, so at production scale even small per-request savings become a direct cloud-bill lever. We present Brick, a multimodal router that scores each model on six capability dimensions, combines this with a per-query difficulty estimate, and dispatches via a cost-penalized geometric rule. A continuous preference knob lets operators slide between max-quality and max-saving profiles at deploy time. On a benchmark of 5,504 queries, Brick at max-quality reaches 76.98% accuracy, beating the best single model (75.02%) and all tested routers. At a neutral cost-quality profile, Brick achieves 74.11% accuracy at 4.71x lower cost than always using the strongest model. At min-cost, it cuts cost 22.15x with 11.85 points accuracy loss. Median latency drops from 51.2s to 22.8s.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Brick's gains hinge on unvalidated capability scores and difficulty estimates that the abstract does not describe.

The one thing to know is that Brick's reported gains depend entirely on whether those six capability dimensions and the difficulty estimate actually predict which model will succeed on a query, and the paper gives no evidence that they do.

What the paper does is lay out a router that scores models on six dimensions, adds a difficulty value, and dispatches with a cost-penalized geometric rule. It includes a knob to adjust the quality versus cost preference at deployment time. On their benchmark of 5,504 queries, it shows small accuracy improvements at max quality and substantial cost reductions at other settings, with latency benefits too.

This is useful as a concrete engineering proposal for handling the cost difference between expensive frontier models and cheaper ones. The idea of a tunable profile is a good practical feature.

The soft spots are significant though. There is no account of how the capability scores are obtained or whether they were validated on held-out data separate from the benchmark. The same goes for the difficulty estimate. Without that, the results could easily be due to overfitting or circular fitting rather than a general method. The stress test concern is on point here.

This paper is for people who run large-scale LLM inference and need to manage costs. A reader interested in routing techniques might get some ideas from it, but it won't change how we think about model selection in a deeper way.

I would bring this to a reading group to discuss the methods if they are detailed in the full text. It is worth sending to peer review so reviewers can check the validation of the scoring components.

Referee Report

3 major / 1 minor

Summary. The paper introduces Brick, a multimodal router for the Mixture-of-Models paradigm. It assigns each model scores along six capability dimensions, combines these with a per-query difficulty estimate, and dispatches via a cost-penalized geometric rule controlled by a continuous preference knob. On a 5,504-query benchmark, Brick reports 76.98% accuracy at max-quality (exceeding the best single model at 75.02%), 74.11% accuracy at a neutral profile with 4.71× lower cost than the strongest model, and 22.15× cost reduction at min-cost with 11.85-point accuracy loss.

Significance. If the capability scores and difficulty estimates can be shown to generalize beyond the reported benchmark, the method would offer a practical, tunable mechanism for balancing accuracy and inference cost in production LLM deployments, addressing a key engineering constraint where frontier models are 10-100× more expensive than open-weight alternatives.

major comments (3)

[Abstract, §3] Abstract and §3 (methodology): The headline results (76.98% max-quality accuracy, 74.11% at neutral profile) rest entirely on the router correctly ranking models via the six capability dimensions plus per-query difficulty. No description is supplied of how these quantities are computed, whether they are derived from static metadata, surface features, or performance on the 5,504-query set itself, or whether any cross-validation or held-out validation demonstrates predictive validity on unseen queries.
[Abstract] Abstract: The geometric cost-penalized dispatch rule is presented as parameter-free at deployment time, yet the paper supplies no evidence that the underlying capability scores and difficulty estimates were not fitted or calibrated on the same benchmark used for the reported accuracy and cost numbers; this leaves open the possibility that the observed gains are circular.
[Abstract, evaluation] Abstract and evaluation section: No calibration plots, per-model success/failure prediction metrics, or out-of-distribution test are mentioned to confirm that the six-dimensional scores reliably predict which model will succeed on a given query; without this, the 4.71× cost reduction and latency claims cannot be assessed as generalizable.
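For readers unfamiliar with the calibration evidence the third comment asks for, the sketch below computes a standard expected calibration error (ECE) over predicted success probabilities. It is a generic illustration, not something drawn from the paper; it assumes the router's scores can be turned into a per-query probability that the chosen model answers correctly, and the function name and toy inputs are hypothetical.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Equal-width-bin ECE.

    probs    : predicted probability that the chosen model succeeds, each in [0, 1]
    outcomes : observed 0/1 success flags for the same queries
    """
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(probs)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(p for p, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        # Weight each bin's confidence/accuracy gap by its share of the queries.
        ece += (len(b) / n) * abs(mean_conf - accuracy)
    return ece

# Toy inputs, made up for illustration.
print(expected_calibration_error([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 1]))
```

A value near zero on held-out queries would indicate that predicted success rates track observed ones; a large value would support the referee's concern that the scores do not reliably predict which model succeeds.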

minor comments (1)

[Abstract] The abstract states benchmark numbers but does not define the six capability dimensions or the difficulty estimator; adding a short methods paragraph would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for highlighting the need for greater methodological transparency. We address each major comment below and commit to revisions that add the requested details without altering the reported results.

Referee: [Abstract, §3] Abstract and §3 (methodology): The headline results (76.98% max-quality accuracy, 74.11% at neutral profile) rest entirely on the router correctly ranking models via the six capability dimensions plus per-query difficulty. No description is supplied of how these quantities are computed, whether they are derived from static metadata, surface features, or performance on the 5,504-query set itself, or whether any cross-validation or held-out validation demonstrates predictive validity on unseen queries.

Authors: The current manuscript does not supply a full description of the computation procedure for the six capability dimensions or the per-query difficulty estimate. We will revise §3 to include the exact derivation method (static metadata combined with a separate validation set), the formulas used, and cross-validation results on held-out queries to demonstrate predictive validity. (revision: yes)

Referee: [Abstract] Abstract: The geometric cost-penalized dispatch rule is presented as parameter-free at deployment time, yet the paper supplies no evidence that the underlying capability scores and difficulty estimates were not fitted or calibrated on the same benchmark used for the reported accuracy and cost numbers; this leaves open the possibility that the observed gains are circular.

Authors: We agree that the manuscript provides no explicit evidence separating the score computation from the 5,504-query evaluation set. The revision will add a subsection documenting the independent data sources and procedures used to obtain the scores, thereby addressing the circularity concern. (revision: yes)

Referee: [Abstract, evaluation] Abstract and evaluation section: No calibration plots, per-model success/failure prediction metrics, or out-of-distribution test are mentioned to confirm that the six-dimensional scores reliably predict which model will succeed on a given query; without this, the 4.71× cost reduction and latency claims cannot be assessed as generalizable.

Authors: The present version omits calibration plots, per-model prediction metrics, and out-of-distribution tests. We will incorporate these analyses into the evaluation section of the revised manuscript to support the generalizability of the routing decisions and cost-accuracy trade-offs. (revision: yes)

Circularity Check

0 steps flagged

No circularity; derivation self-contained on presented claims

The abstract and available text describe Brick as computing six capability dimensions plus per-query difficulty then applying a cost-penalized geometric dispatch rule, with performance reported as empirical results on the 5,504-query benchmark. No equations, definitions, or steps are quoted that define the dimensions or difficulty from the evaluation outcomes themselves, nor any self-citation chain that imports the core routing logic. The central claims rest on the (unshown) computation of those scores generalizing, but nothing in the given material reduces the reported accuracy or cost gains to a tautology or fitted input renamed as prediction. This is the normal case of an empirical router paper whose internal validity cannot be challenged for circularity from the abstract alone.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no equations, no derivation, and no description of how the six capability dimensions or difficulty estimate are obtained; therefore no free parameters, axioms, or invented entities can be identified with certainty.

pith-pipeline@v0.9.1-grok · 5735 in / 1167 out tokens · 28345 ms · 2026-06-27T06:54:38.580270+00:00 · methodology

Reference graph

Works this paper leans on

17 extracted references · 10 canonical work pages · 8 internal anchors

[1] D. Hendrycks et al., “Measuring Mathematical Problem Solving with the MATH Dataset,” in Proc. NeurIPS Datasets and Benchmarks, 2021.
[2] K. Cobbe et al., “Training Verifiers to Solve Math Word Problems,” arXiv:2110.14168, 2021.
[3] Y. Wang et al., “MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark,” in Proc. NeurIPS Datasets and Benchmarks, 2024.
[4] J. Zhou et al., “Instruction-Following Evaluation for Large Language Models,” arXiv:2311.07911, 2023.
[5] V. Pyatkin et al., “IFBench: Granular Instruction-Following Evaluation,” arXiv:2503.07879, 2025.
[6] F. Yan et al., “Berkeley Function-Calling Leaderboard,” Berkeley AI Research, 2024.
[7] N. Jain et al., “LiveCodeBench: Holistic and Contamination-Free Evaluation of Large Language Models for Code,” arXiv:2403.07974, 2024.
[8] J. Wei et al., “SimpleQA: Measuring Short-Form Factuality in Large Language Models,” arXiv:2411.04368, 2024.
[9] D. Rein et al., “GPQA: A Graduate-Level Google-Proof Q&A Benchmark,” in Proc. COLM, 2024.
[10] S. Paech, “EQ-Bench-Creative-v3: Emotional Intelligence and Creative Writing Evaluation,” eqbench.com benchmark release, 2024. https://eqbench.com/creative_writing.html
[11] Stanford NLP, “LitBench: A Literary Synthesis Benchmark for Long-Form Evaluation,” Hugging Face dataset release, 2025. https://huggingface.co/datasets/SAA-Lab/litbench-test
[12] S. Yao et al., “τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains,” arXiv:2406.12045, 2024.
[13] I. Ong et al., “RouteLLM: Learning to Route LLMs with Preference Data,” arXiv:2406.18665, 2024.
[14] L. Chen, M. Zaharia, and J. Zou, “FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance,” arXiv:2305.05176, 2023.
[15] W. Jitkrittum et al., “Cascade Routing for Large Language Models,” arXiv:2405.20828, 2024.
[16] B. Warner et al., “Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory-Efficient, and Long-Context Fine-Tuning and Inference,” arXiv:2412.13663, 2024.
[17] J. Cohen, “A Coefficient of Agreement for Nominal Scales,” Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, 1960.
