
arxiv: 2606.25231 · v2 · pith:KF2I56CRnew · submitted 2026-06-23 · 💻 cs.CL

Towards Structuring an Arabic-English Machine-Readable Dictionary Using Parsing Expression Grammars

Pith reviewed 2026-06-29 05:07 UTC · model grok-4.3

classification 💻 cs.CL
keywords Arabic-English dictionary · machine-readable dictionary · parsing expression grammars · microstructure induction · natural language processing · lexical resources · Al-Mawrid dictionary · dictionary structuring

The pith

A parsing expression grammar can convert unstructured Arabic dictionary entries into explicit hierarchical machine-readable structures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that even though Arabic dictionaries lack standardized microstructure, their entry patterns can be induced and then parsed automatically or semi-automatically into usable form. The method takes the stream of words and punctuation in the Al-Mawrid Arabic-English dictionary and turns each entry into a hierarchy that marks subentries, defining phrases, domain labels, cross-references, and translation equivalences. This matters because printed dictionaries hold lexical data needed for natural language processing, yet that data remains inaccessible to machines until it is explicitly structured. The approach relies on cascaded parsing steps with a PEG parser at the core. If the patterns prove inducible, the same process could reduce the manual work required to make lexical resources machine-ready.

Core claim

After inducing the microstructure of the Al-Mawrid dictionary, a parser built with parsing expression grammars converts each entry from a sequence of words and marks into a hierarchical structure that explicitly represents its subentries along with their defining phrases, domain labels, cross-references, and translation equivalences.

What carries the argument

The argument rests on parsing expression grammars (PEGs): the parser built with them processes dictionary entries in cascaded steps to build hierarchical structures.
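
To make the mechanism concrete, here is a minimal sketch of PEG-style parsing in plain Python. It is not the paper's grammar: the entry syntax (`headword: [label] gloss; gloss; (see target) | ...`), the sample entry, and every function name are assumptions made for illustration, and the combinators are hand-rolled rather than taken from any parsing library. The PEG property on display is ordered choice: `choice` tries its alternatives in order and commits to the first one that matches.

```python
import re
from typing import Any, Callable, Optional, Tuple

# A parser takes (text, position) and returns (value, new_position) or None.
Parser = Callable[[str, int], Optional[Tuple[Any, int]]]

def regex(pattern: str) -> Parser:
    compiled = re.compile(pattern)
    def parse(text, pos):
        m = compiled.match(text, pos)
        return (m.group(0), m.end()) if m else None
    return parse

def seq(*parsers: Parser) -> Parser:
    # Sequence: every part must match in order, otherwise the whole fails.
    def parse(text, pos):
        values = []
        for p in parsers:
            r = p(text, pos)
            if r is None:
                return None
            value, pos = r
            values.append(value)
        return values, pos
    return parse

def choice(*parsers: Parser) -> Parser:
    # Ordered choice: the first alternative that matches wins.
    def parse(text, pos):
        for p in parsers:
            r = p(text, pos)
            if r is not None:
                return r
        return None
    return parse

def many(p: Parser) -> Parser:
    # Greedy zero-or-more repetition; stops if no input is consumed.
    def parse(text, pos):
        values = []
        while True:
            r = p(text, pos)
            if r is None or r[1] == pos:
                return values, pos
            value, pos = r
            values.append(value)
    return parse

def opt(p: Parser) -> Parser:
    def parse(text, pos):
        r = p(text, pos)
        return r if r is not None else (None, pos)
    return parse

def action(p: Parser, fn: Callable[[Any], Any]) -> Parser:
    def parse(text, pos):
        r = p(text, pos)
        return None if r is None else (fn(r[0]), r[1])
    return parse

# Hypothetical toy grammar (NOT the induced Al-Mawrid microstructure):
#   entry    <- headword ":" subentry ("|" subentry)*
#   subentry <- label? item (";" item)*
#   item     <- xref / gloss
headword = action(regex(r"\s*[^:\s]+"), str.strip)
label = action(regex(r"\s*\[[^\]]+\]"), lambda s: s.strip()[1:-1])
xref = action(regex(r"\s*\(see [^)]+\)"), lambda s: ("xref", s.strip()[5:-1]))
gloss = action(regex(r"\s*[^;|()\[\]\s][^;|()\[\]]*"), lambda s: ("gloss", s.strip()))
item = choice(xref, gloss)

def build_subentry(v):
    domain, first, rest = v
    items = [first] + rest
    return {
        "domain": domain,
        "translations": [t for kind, t in items if kind == "gloss"],
        "cross_references": [t for kind, t in items if kind == "xref"],
    }

subentry = action(
    seq(opt(label), item, many(action(seq(regex(r"\s*;"), item), lambda v: v[1]))),
    build_subentry,
)
entry = action(
    seq(headword, regex(r"\s*:"), subentry,
        many(action(seq(regex(r"\s*\|"), subentry), lambda v: v[1]))),
    lambda v: {"headword": v[0], "subentries": [v[2]] + v[3]},
)

def parse_entry(text: str) -> dict:
    r = entry(text, 0)
    if r is None or text[r[1]:].strip():
        raise ValueError("entry does not match the toy grammar")
    return r[0]
```

With the invented input `"qalb: [anat.] heart; core; (see lubb) | [mil.] center"`, `parse_entry` returns a headword plus two subentries, the first carrying the domain label `anat.`, the translations `heart` and `core`, and the cross-reference `lubb`. Input that does not fit the grammar raises `ValueError`, which is where a semi-automatic workflow would hand the entry to a human.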

If this is right

  • Structured entries enable direct use of lexical data in NLP applications without further parsing.
  • The induction-plus-parsing process can be reused on other Arabic dictionaries that also lack standardization.
  • Explicit hierarchies make specific elements such as translations and cross-references immediately extractable by software.
  • Semi-automatic structuring lowers the cost of creating machine-readable versions of printed lexical resources.
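
As a sketch of what "immediately extractable" could look like, assume each structured entry is stored in the record types below. The class names, field names, and helper functions are hypothetical; the fields simply mirror the components listed in the abstract, and the sample entry is invented rather than taken from Al-Mawrid.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Subentry:
    # Field names follow the components named in the abstract (assumed layout).
    defining_phrases: List[str] = field(default_factory=list)
    domain_labels: List[str] = field(default_factory=list)
    cross_references: List[str] = field(default_factory=list)
    translation_equivalents: List[str] = field(default_factory=list)

@dataclass
class Entry:
    headword: str
    subentries: List[Subentry] = field(default_factory=list)

def translations(entry: Entry) -> List[str]:
    """Collect every translation equivalent across all subentries."""
    return [t for sub in entry.subentries for t in sub.translation_equivalents]

def cross_reference_edges(entries: List[Entry]) -> List[Tuple[str, str]]:
    """Return (source headword, target headword) pairs for all cross-references."""
    return [
        (e.headword, target)
        for e in entries
        for sub in e.subentries
        for target in sub.cross_references
    ]

# Invented sample entry, for illustration only.
sample = Entry(
    headword="qalb",
    subentries=[
        Subentry(domain_labels=["anat."],
                 translation_equivalents=["heart"],
                 cross_references=["lubb"]),
        Subentry(translation_equivalents=["core", "center"]),
    ],
)
```

Once entries are in this shape, pulling out translations or building a cross-reference graph is a plain traversal, with no further text parsing involved.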

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pattern-induction approach might extend to dictionaries in other languages that also lack fixed formats.
  • Once structured, the resulting data could feed directly into downstream tasks such as machine translation or semantic search involving Arabic.
  • Success here suggests that many non-standard lexical resources could be made machine-usable through targeted grammar induction rather than full redesign.

Load-bearing premise

Dictionary entries contain consistent and inducible patterns that a PEG parser can capture without excessive manual rules or post-processing exceptions.

What would settle it

Applying the parser to a large sample of entries and finding that most require heavy manual correction or produce incorrect hierarchies would falsify the claim of plausible accuracy.

Original abstract

Dictionaries are rich sources of lexical information about words that is required for many applications of natural language processing and human language technology. However, publishers prepare printed dictionaries for human usage not for machine processing. This paper presented a method to structure partly a machine-readable version of the Arabic-English Al-Mawrid dictionary. The method converted the entries of Al-Mawrid from a stream of words and punctuation marks into hierarchical structures. The hierarchical structure expresses the components of each dictionary entry in explicit format. A dictionary entry is composed of subentries and each subentry consists of defining phrases, domain labels, cross-references, and translation equivalences. We designed the proposed method as cascaded steps where parsing is the main step. We implemented the parser using the parsing expression grammars formalism. In conclusion, although Arabic dictionaries do not have microstructure standardization, this study demonstrated that it is possible to structure them automatically or semi-automatically with plausible accuracy after inducing their microstructure.
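
The abstract describes the method as cascaded steps with parsing as the main step, but this page does not say what the other steps are. The sketch below therefore assumes a three-stage cascade (segment, normalize, parse) purely for illustration; the stage names, the one-entry-per-line input, and the stand-in `parse` function are all hypothetical. Its point is the semi-automatic routing: entries the parser rejects are set aside for manual review instead of being silently dropped.

```python
import re
from typing import Dict, List, Optional, Tuple

def segment(raw: str) -> List[str]:
    # Assumed stage 1: one entry per non-empty line.
    return [line for line in raw.splitlines() if line.strip()]

def normalize(line: str) -> str:
    # Assumed stage 2: collapse runs of whitespace.
    return re.sub(r"\s+", " ", line).strip()

def parse(line: str) -> Optional[Dict[str, object]]:
    # Stage 3 stand-in: a real system would call the PEG parser here.
    head, sep, body = line.partition(":")
    if not sep or not head.strip() or not body.strip():
        return None
    return {
        "headword": head.strip(),
        "translations": [t.strip() for t in body.split(";") if t.strip()],
    }

def run_cascade(raw: str) -> Tuple[List[Dict[str, object]], List[str]]:
    """Return (structured entries, entries that need manual review)."""
    structured, needs_review = [], []
    for line in segment(raw):
        clean = normalize(line)
        parsed = parse(clean)
        if parsed is None:
            needs_review.append(clean)
        else:
            structured.append(parsed)
    return structured, needs_review
```

The size of the `needs_review` list relative to the input is one rough indicator of how much manual work a given grammar still leaves.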

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents a cascaded parsing method using Parsing Expression Grammars (PEG) to convert entries from the Arabic-English Al-Mawrid dictionary from unstructured text into explicit hierarchical structures. Each entry is decomposed into subentries containing defining phrases, domain labels, cross-references, and translation equivalences. The authors conclude that, despite the absence of microstructure standardization in Arabic dictionaries, automatic or semi-automatic structuring is feasible with plausible accuracy once the microstructure has been induced.

Significance. If the accuracy claim can be substantiated, the work would contribute a practical technique for retrofitting legacy printed dictionaries into machine-readable lexical resources, which remains a bottleneck for Arabic NLP. The choice of PEG for handling the semi-regular but non-standardized entry formats is a reasonable technical fit, though the absence of any reported metrics prevents assessment of whether the approach scales beyond the specific dictionary examined.

major comments (1)
  1. [Abstract, conclusion] The central claim that the method achieves 'plausible accuracy' is unsupported by any quantitative evaluation. No test-set size, precision/recall, coverage rate, or error analysis is provided, which directly undermines the assertion that structuring is possible 'automatically or semi-automatically' without excessive manual exceptions.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and the recommendation for major revision. The central issue identified is the lack of quantitative support for the accuracy claims in the abstract and conclusion. We address this point directly below.

point-by-point responses
  1. Referee: [Abstract, conclusion] The central claim that the method achieves 'plausible accuracy' is unsupported by any quantitative evaluation. No test-set size, precision/recall, coverage rate, or error analysis is provided, which directly undermines the assertion that structuring is possible 'automatically or semi-automatically' without excessive manual exceptions.

    Authors: We agree that the manuscript does not include quantitative metrics such as test-set size, precision, recall, coverage, or error analysis, and that this leaves the 'plausible accuracy' claim unsubstantiated. The current version demonstrates the method through example parses after microstructure induction but relies on qualitative observation rather than formal evaluation. In the revised manuscript we will add a dedicated evaluation section reporting these metrics on a held-out set of entries, along with an error analysis, to allow assessment of whether the approach scales with limited manual exceptions.
    revision: yes
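
The metrics named in this exchange can be computed along the following lines. This is a sketch under stated assumptions, not the evaluation the authors promise: it assumes each entry's gold annotation and parser output are sets of `(component type, text)` pairs, scores exact matches only, and treats an entry the parser rejected (`None`) as contributing only misses. The function name and data layout are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Component = Tuple[str, str]  # (component type, text), e.g. ("translation", "heart")

def evaluate(
    gold: List[Set[Component]],
    predicted: List[Optional[Set[Component]]],
) -> Dict[str, float]:
    """Micro-averaged precision, recall, F1, and parse coverage over entries."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same entries")
    tp = fp = fn = parsed = 0
    for g, p in zip(gold, predicted):
        if p is None:            # parser produced no structure for this entry
            fn += len(g)
            continue
        parsed += 1
        tp += len(g & p)         # components found with the right type and text
        fp += len(p - g)         # components the parser invented or mislabeled
        fn += len(g - p)         # gold components the parser missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    coverage = parsed / len(gold) if gold else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "coverage": coverage}
```

Exact-match scoring is the strictest option; a real evaluation might also want per-component breakdowns (domain labels versus translations, for instance) so that the error analysis can show which parts of the microstructure the grammar handles poorly.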

Circularity Check

0 steps flagged

No circularity: empirical method demonstration with no fitted predictions or self-citation chains

full rationale

The paper describes a cascaded PEG parser to induce microstructure from Al-Mawrid dictionary entries and convert them to hierarchical structures. No equations, parameters, or 'predictions' are defined; the central claim is an empirical demonstration that patterns can be captured. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The derivation chain is self-contained as a grammar-engineering exercise whose success is asserted via the method itself rather than reducing to prior fitted inputs or author citations. Absence of quantitative metrics is a separate evaluation concern, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that dictionary text streams contain recoverable hierarchical patterns inducible from examples; no free parameters or invented entities are described in the abstract.

axioms (1)
  • Domain assumption: dictionary entries follow sufficiently consistent patterns to allow induction of a microstructure grammar.
    Invoked in the conclusion when claiming automatic structuring is possible after induction.

pith-pipeline@v0.9.1-grok · 5709 in / 1134 out tokens · 16195 ms · 2026-06-29T05:07:51.438479+00:00 · methodology

