KamonBench: A Grammar-Based Dataset for Evaluating Compositional Factor Recovery in Vision-Language Models

Richard Sproat; Stefano Peluchetti

arxiv: 2605.13322 · v2 · pith:OPOWPHNAnew · submitted 2026-05-13 · 💻 cs.CV · cs.LG


Pith reviewed 2026-05-20 21:35 UTC · model grok-4.3


keywords: kamon, compositional visual recognition, vision-language models, synthetic dataset, factor recovery, grammar-based generation, Japanese crests, program code annotations


The pith

KamonBench generates synthetic crests from known container, modifier and motif factors so vision-language models can be scored directly on factor recovery instead of captions alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces KamonBench, a collection of 20,000 grammar-generated images of Japanese family crests together with formal descriptions, English translations, and program-code representations. Each crest is built from three explicit factors—container, modifier, and motif—so the ground truth for every element is known by construction. This structure makes it possible to run evaluations that measure exact factor recovery, test performance on recombined factor pairs, check sensitivity to motif changes under fixed contexts, and probe whether factors are linearly accessible in model embeddings. A reader would care because most current vision-language models are evaluated only on loose caption matches, leaving open whether they truly parse compositional structure. The authors supply baseline results from a ViT-Transformer model and two VGG n-gram decoders to illustrate the kinds of measurements the benchmark enables.
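To make "linearly accessible" concrete, here is a minimal sketch of a linear probe: a multiclass perceptron fitted on frozen embeddings to predict a single factor. The toy embeddings, the label ids, and all function names are illustrative assumptions; the paper's actual probe setup is not described on this page.

```python
# Minimal linear-probe sketch (multiclass perceptron) on frozen embeddings.
# All data and names are illustrative assumptions, not KamonBench's API.

def predict(weights, x):
    """Return the class whose linear score (with bias) is highest."""
    xb = list(x) + [1.0]  # append a constant bias feature
    scores = [sum(w_i * v for w_i, v in zip(w, xb)) for w in weights]
    return max(range(len(scores)), key=scores.__getitem__)

def train_linear_probe(embeddings, labels, num_classes, epochs=20):
    """Fit one weight vector per class with the perceptron update rule."""
    dim = len(embeddings[0])
    weights = [[0.0] * (dim + 1) for _ in range(num_classes)]
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            guess = predict(weights, x)
            if guess != y:
                xb = list(x) + [1.0]
                for i, v in enumerate(xb):
                    weights[y][i] += v      # pull the true class toward x
                    weights[guess][i] -= v  # push the wrong class away
    return weights

def probe_accuracy(weights, embeddings, labels):
    hits = sum(predict(weights, x) == y for x, y in zip(embeddings, labels))
    return hits / len(labels)

# Toy 2-D "embeddings" in which a hypothetical motif factor is linearly separable.
toy_embeddings = [(2.0, 0.1), (1.8, -0.2), (-2.0, 0.3),
                  (-1.7, 0.0), (0.1, 2.0), (-0.2, 1.9)]
toy_motif_ids = [0, 0, 1, 1, 2, 2]

probe = train_linear_probe(toy_embeddings, toy_motif_ids, num_classes=3)
toy_accuracy = probe_accuracy(probe, toy_embeddings, toy_motif_ids)
```

In practice the probe would be fitted on embeddings from a frozen model and scored on held-out images; a high score suggests the factor can be read out linearly, though it does not show that the model uses it.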

Core claim

Because each synthetic crest is generated from known factors, namely container, modifier, and motif, KamonBench supports evaluation beyond caption-level accuracy: direct program-code factor metrics, controlled factor-pair recombination splits, counterfactual motif-sensitivity groups under fixed container-modifier contexts, and linear probes of factor accessibility, thereby supplying a controlled testbed for sparse compositional visual recognition and factor recovery in vision-language models.
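A factor-pair recombination split can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's released split code: the record fields, the choice of the (container, motif) pair, and the held-out pairs are all hypothetical.

```python
# Sketch of a held-out factor-pair recombination split.
# Record fields and the held-out pairs are hypothetical.

def recombination_split(records, pair_keys, held_out_pairs):
    """Send every record whose factor pair is held out to the test set."""
    train, test = [], []
    for rec in records:
        pair = tuple(rec[k] for k in pair_keys)
        (test if pair in held_out_pairs else train).append(rec)
    return train, test

def unseen_pairs_with_seen_parts(train, test, pair_keys):
    """Check the split is compositional: test pairs never occur in training,
    but each individual factor value does."""
    train_pairs = {tuple(r[k] for k in pair_keys) for r in train}
    seen_values = [{r[k] for r in train} for k in pair_keys]
    for rec in test:
        pair = tuple(rec[k] for k in pair_keys)
        if pair in train_pairs:
            return False
        if any(v not in seen for v, seen in zip(pair, seen_values)):
            return False
    return True

toy_records = [
    {"container": c, "modifier": "none", "motif": m}
    for c in ("circle", "square", "hexagon")
    for m in ("crab", "crane", "wisteria")
]
held_out = {("circle", "crab"), ("square", "crane")}
pair_keys = ("container", "motif")

train_set, test_set = recombination_split(toy_records, pair_keys, held_out)
split_is_valid = unseen_pairs_with_seen_parts(train_set, test_set, pair_keys)
```

The validity check matters because a test pair whose container or motif never appears in training would measure exposure to new vocabulary rather than recombination.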

What carries the argument

Grammar-based synthetic generation that defines every crest by the explicit factors of container, modifier, and motif and pairs each image with a formal kamon yōgo description plus executable program code.
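The idea of generating a crest from explicit factors, so that ground truth is known by construction, can be sketched as below. The factor inventories, the index-triple program code, and the English gloss format are assumptions made for illustration; the real inventories and code format are provided with the released benchmark.

```python
import random

# Hypothetical factor inventories; the real ones ship with the benchmark.
CONTAINERS = ["circle", "square", "hexagon"]
MODIFIERS = ["", "three", "facing"]  # "" stands for the empty (unmodified) alternative
MOTIFS = ["crab", "crane", "wisteria"]

def sample_crest(rng):
    """Draw one (container, modifier, motif) triple; the triple is the label."""
    return {
        "container": rng.choice(CONTAINERS),
        "modifier": rng.choice(MODIFIERS),
        "motif": rng.choice(MOTIFS),
    }

def to_program_code(factors):
    """Encode the factors as a non-linguistic code (here, an index triple)."""
    return (
        CONTAINERS.index(factors["container"]),
        MODIFIERS.index(factors["modifier"]),
        MOTIFS.index(factors["motif"]),
    )

def to_english(factors):
    """Render a simple English gloss such as 'crab in a circle'."""
    motif_part = " ".join(p for p in (factors["modifier"], factors["motif"]) if p)
    return f"{motif_part} in a {factors['container']}"

rng = random.Random(0)  # fixed seed so the sample is reproducible
crest = sample_crest(rng)
code = to_program_code(crest)
gloss = to_english(crest)
```

Because the label is produced before any image is rendered, every downstream metric can compare predictions against exact factor values rather than against a free-form caption.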

If this is right

Models can be scored on the exact accuracy with which they recover each factor through the supplied program code.
Performance can be measured on held-out recombinations of factor pairs to test whether models generalize beyond training combinations.
Motif changes can be isolated while container and modifier stay fixed, revealing whether models are sensitive to the intended compositional variable.
Linear probes can determine how readily the individual factors can be read out from the model’s internal representations.
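The first and third consequences above can be sketched as simple scoring functions. The definitions here are assumptions chosen for illustration (per-factor exact match, and the share of fixed container-modifier groups in which every motif is predicted correctly); the paper's exact metric definitions may differ.

```python
from collections import defaultdict

FACTORS = ("container", "modifier", "motif")

def factor_accuracy(gold, pred):
    """Per-factor exact-match rate, plus the rate at which all factors match."""
    n = len(gold)
    scores = {f: sum(g[f] == p[f] for g, p in zip(gold, pred)) / n for f in FACTORS}
    scores["all"] = sum(
        all(g[f] == p[f] for f in FACTORS) for g, p in zip(gold, pred)
    ) / n
    return scores

def motif_sensitivity(gold, pred):
    """Share of counterfactual groups (same gold container and modifier)
    in which every member's motif is predicted correctly."""
    groups = defaultdict(list)
    for g, p in zip(gold, pred):
        groups[(g["container"], g["modifier"])].append(g["motif"] == p["motif"])
    # Only contexts with at least two members form a counterfactual group.
    eligible = [hits for hits in groups.values() if len(hits) >= 2]
    if not eligible:
        return 0.0
    return sum(all(hits) for hits in eligible) / len(eligible)

def crest(container, modifier, motif):
    return {"container": container, "modifier": modifier, "motif": motif}

# Toy gold labels and hypothetical model predictions.
gold = [crest("circle", "", "crab"), crest("circle", "", "crane"),
        crest("square", "", "crab"), crest("square", "", "crane")]
pred = [crest("circle", "", "crab"), crest("circle", "", "crane"),
        crest("square", "", "crane"), crest("circle", "", "crane")]

scores = factor_accuracy(gold, pred)
sensitivity = motif_sensitivity(gold, pred)
```

In this toy case the model gets both motifs right inside the circle context but predicts "crane" for both crests in the square context, so one group of two counts as sensitive.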

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same factor-known synthetic construction could be applied to other domains that combine a small number of symbolic choices, such as logos or technical diagrams.
If models improve on the benchmark’s factor-recovery metrics, the gains might transfer to captioning tasks that require describing scenes with multiple interacting objects.
The benchmark’s emphasis on program-code outputs suggests a route for training models to produce structured representations rather than free-form text.

Load-bearing premise

The synthetic images produced by the grammar capture the same compositional difficulties that appear in natural visual recognition without letting models exploit generation-specific regularities.

What would settle it

A result in which models achieve high scores on the program-code factor metrics and recombination splits yet show no improvement when tested on photographs of real kamon crests would indicate that the benchmark does not measure the intended recognition challenges.

Figures

Figures reproduced from arXiv: 2605.13322 by Richard Sproat, Stefano Peluchetti.

**Figure 2.** Synthetic examples of crests with various modifiers: a) crab in a circle (…

**Figure 3.** Cumulative error counts for the 32 participants, ranked from the best (2 with no errors) to…

**Figure 4.** BNF for kamon generation. Valid ⟨CONTAINER⟩s and ⟨MOTIF⟩s are provided with the released benchmark dataset. The ⟨modifier⟩ non-terminal covers both spatial arrangements and modifications. The ⟨empty⟩ alternative denotes the null/unmodified value for a motif placed directly inside a container. Containerless composite examples use a spatial arrangement. Note that the recursion on the ⟨complex-motif⟩ node, wh…

**Figure 5.** Schematic VGG n-gram decoder family. The blue components are shared across output…

**Figure 6.** Positional masks from the masked VGG baselines. Images are inverted: darker regions…

Original abstract

Kamon (family crests) are an important part of Japanese culture and a natural test case for compositional visual recognition: each crest combines a small number of symbolic choices, but the space of possible descriptions is sparse. We introduce KamonBench, a grammar-based image-to-structure benchmark with 20,000 synthetic composite crests and auxiliary component examples. Each composite crest is paired with a formal kamon description language ("kamon yōgo") description, a segmented Japanese analysis, an English translation, and a non-linguistic program code. Because each synthetic crest is generated from known factors, namely container, modifier, and motif, KamonBench supports evaluation beyond caption-level accuracy: direct program-code factor metrics, controlled factor-pair recombination splits, counterfactual motif-sensitivity groups under fixed container-modifier contexts, and linear probes of factor accessibility. We include baseline results for a ViT encoder/Transformer decoder and two VGG n-gram decoders, with and without learned positional masks. KamonBench therefore provides a controlled testbed for sparse compositional visual recognition and factor recovery in vision-language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

KamonBench supplies a grammar-based synthetic testbed with direct factor metrics and controlled splits, but the rendering process may embed exploitable regularities.


The paper introduces KamonBench, a set of 20,000 synthetic kamon crests built from an explicit grammar over container, modifier, and motif factors. Each image comes with a formal kamon yōgo description, segmented Japanese text, an English translation, and a program-code representation. This setup lets the authors run program-code factor metrics, recombination splits, counterfactual motif groups under fixed contexts, and linear probes for factor accessibility, rather than stopping at caption accuracy. Baselines appear for a ViT encoder with Transformer decoder and for VGG n-gram decoders with and without learned positional masks.

The construction is straightforward and the ground-truth factors are known by design, which is the main practical advantage over caption-only benchmarks. The grammar and auxiliary component examples are new relative to the cited literature. The evaluation protocols are also new: the recombination and counterfactual splits give a cleaner way to isolate factor recovery than most existing VLM tests. The baselines are reported plainly, which is useful for comparison.

The soft spot is the synthetic rendering itself. Uniform stroke widths, exact centering, and consistent textures could let models achieve high factor scores through low-level cues instead of compositional structure. The abstract gives no sign of artifact audits or side-by-side checks against real kamon photographs, so the reported numbers might overstate what the same models would do on natural images. If the full paper contains those controls, the concern shrinks; if not, it is a moderate but real limitation for claims about general compositional recovery.

The work is aimed at groups studying factor disentanglement and controlled evaluation in vision-language models. Anyone building or testing synthetic benchmarks for compositionality will find the split design and program-code metrics worth looking at.

It is worth sending to peer review because the core idea and the evaluation machinery are concrete and falsifiable, even if the authors need to add validation steps against real data and generation artifacts.

Referee Report

1 major / 1 minor

Summary. The paper introduces KamonBench, a grammar-based image-to-structure benchmark consisting of 20,000 synthetic composite crests generated from known factors (container, modifier, motif). Each crest is paired with a formal kamon yōgo description, segmented Japanese analysis, English translation, and non-linguistic program code. The dataset supports evaluations beyond caption-level accuracy via direct program-code factor metrics, controlled factor-pair recombination splits, counterfactual motif-sensitivity groups under fixed contexts, and linear probes of factor accessibility. Baseline results are reported for a ViT encoder/Transformer decoder and two VGG n-gram decoders, with and without learned positional masks.

Significance. If the central assumptions hold, KamonBench would provide a useful controlled testbed for sparse compositional visual recognition and factor recovery in vision-language models. The use of an external grammar for independent synthetic generation with known ground-truth factors enables direct, falsifiable metrics and controlled experimental splits that are difficult to achieve with natural-image datasets. This is a clear strength for reproducibility and precise probing of compositional abilities.

major comments (1)

Abstract: The central claim that KamonBench enables valid evaluation of compositional factor recovery (via program-code metrics, recombination splits, counterfactual groups, and linear probes) depends on the rendered synthetic images presenting the same visual challenges as natural kamon without generation-specific regularities (e.g., consistent stroke widths, exact centering, or texture uniformity) that models could exploit as shortcuts. No validation of grammar fidelity, data quality checks, artifact controls, or comparison to real kamon images is described, which directly undermines the load-bearing assumption that the controlled splits and probes test true compositionality rather than dataset artifacts.

minor comments (1)

Abstract: The baseline model descriptions (ViT encoder/Transformer decoder and VGG n-gram decoders) would benefit from explicit details on training procedures, loss functions, and how the learned positional masks are implemented to support reproducibility of the reported results.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive assessment of KamonBench's value as a controlled testbed and for identifying a key requirement for validating the synthetic data. We address the major comment below and will strengthen the manuscript accordingly.


Referee: Abstract: The central claim that KamonBench enables valid evaluation of compositional factor recovery (via program-code metrics, recombination splits, counterfactual groups, and linear probes) depends on the rendered synthetic images presenting the same visual challenges as natural kamon without generation-specific regularities (e.g., consistent stroke widths, exact centering, or texture uniformity) that models could exploit as shortcuts. No validation of grammar fidelity, data quality checks, artifact controls, or comparison to real kamon images is described, which directly undermines the load-bearing assumption that the controlled splits and probes test true compositionality rather than dataset artifacts.

Authors: We agree that explicit validation is necessary to rule out generation-specific shortcuts and thereby support the central claims. The grammar ensures known ground-truth factors by construction, yet the initial manuscript does not report direct comparisons to real kamon, quantitative checks for rendering regularities, or controls for potential artifacts such as uniform stroke widths or centering. In the revised version we will add a new subsection on data quality and fidelity. This will include side-by-side visual comparisons with authentic kamon examples, statistical summaries of rendering parameters across the dataset, and targeted ablations that test whether models can exploit centering, texture uniformity, or stroke consistency when factor labels are held constant. These additions will directly address the concern that the reported metrics and splits may reflect dataset artifacts rather than compositional factor recovery.

Revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark relies on explicit external grammar and known generative factors


The paper constructs KamonBench by generating 20,000 synthetic crests from a defined grammar with explicitly known factors (container, modifier, motif) and pairs each with formal descriptions and program code. All listed evaluation capabilities—program-code factor metrics, recombination splits, counterfactual groups, and linear probes—follow directly from this transparent construction rather than from any fitted parameter, self-referential prediction, or load-bearing self-citation. No equations or derivations reduce a claimed result to its own inputs; the central claim is simply that the synthetic data enables controlled testing, which is self-contained and externally verifiable by inspecting the generation process.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that kamon crests exhibit decomposable factors of container, modifier, and motif that can be faithfully reproduced synthetically for evaluation.

free parameters (1)

Dataset size
Arbitrary scale choice of 20,000 examples to support benchmarking.

axioms (1)

domain assumption: Kamon crests are compositional objects decomposable into container, modifier, and motif factors.
Invoked to justify generation of synthetic crests and definition of factor recovery metrics.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Each composite crest is paired with a formal kamon description language... Because each synthetic crest is generated from known factors, namely container, modifier, and motif, KamonBench supports evaluation beyond caption-level accuracy: direct program-code factor metrics, controlled factor-pair recombination splits..."

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We include baseline results for a ViT encoder/Transformer decoder and two VGG n-gram decoders..."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

