Towards Structuring an Arabic-English Machine-Readable Dictionary Using Parsing Expression Grammars
Pith reviewed 2026-06-29 05:07 UTC · model grok-4.3
The pith
A parsing expression grammar can convert unstructured Arabic dictionary entries into explicit hierarchical machine-readable structures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
After inducing the microstructure of the Al-Mawrid dictionary, a parser built with parsing expression grammars converts each entry from a sequence of words and marks into a hierarchical structure that explicitly represents its subentries along with their defining phrases, domain labels, cross-references, and translation equivalences.
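To make the target representation concrete, here is a minimal sketch of one way such a hierarchy could be modelled. The class and field names are assumptions chosen to mirror the components named above, and the sample entry is invented for illustration; neither is taken from the paper or from Al-Mawrid.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Subentry:
    # One sense-like unit of an entry; every component is optional.
    defining_phrases: List[str] = field(default_factory=list)
    domain_labels: List[str] = field(default_factory=list)
    cross_references: List[str] = field(default_factory=list)
    translations: List[str] = field(default_factory=list)


@dataclass
class Entry:
    # A headword plus its ordered subentries.
    headword: str
    subentries: List[Subentry] = field(default_factory=list)


# Invented example, not an actual Al-Mawrid entry.
entry = Entry(
    headword="كتاب",
    subentries=[
        Subentry(translations=["book"]),
        Subentry(translations=["letter"], cross_references=["رسالة"]),
    ],
)

# asdict() turns the nested dataclasses into plain dicts and lists,
# which is convenient for JSON export.
as_plain = asdict(entry)
```

Keeping each component in its own list means software can read, for example, only the translations of the second subentry without touching the original text.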
What carries the argument
Parsing expression grammars (PEGs), used to implement the parser that forms the main step of a cascaded pipeline for building hierarchical structures from dictionary entries.
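As a rough illustration of how a PEG-style parser works, the sketch below hand-rolls a few combinators (sequence, ordered choice, repetition) using only the Python standard library. The toy entry format, with parenthesised domain labels, `see` cross-references, commas between items and semicolons between subentries, is an assumption made for this example; it is not the microstructure the authors induced, and this is not their grammar or tooling.

```python
import re
from typing import Any, Callable, Optional, Tuple

Result = Optional[Tuple[Any, int]]  # (value, next position), or None on failure
Parser = Callable[[str, int], Result]


def token(pattern: str) -> Parser:
    """Match a regex after optional whitespace; yield group 1 if present."""
    rx = re.compile(r"\s*" + pattern)

    def parse(text: str, pos: int) -> Result:
        m = rx.match(text, pos)
        if m is None:
            return None
        value = m.group(1) if rx.groups else m.group(0)
        return value.strip(), m.end()
    return parse


def seq(*parsers: Parser) -> Parser:
    """PEG sequence: every part must match, in order."""
    def parse(text: str, pos: int) -> Result:
        values = []
        for p in parsers:
            r = p(text, pos)
            if r is None:
                return None
            value, pos = r
            values.append(value)
        return values, pos
    return parse


def choice(*parsers: Parser) -> Parser:
    """PEG ordered choice: the first alternative that succeeds wins."""
    def parse(text: str, pos: int) -> Result:
        for p in parsers:
            r = p(text, pos)
            if r is not None:
                return r
        return None
    return parse


def many(parser: Parser) -> Parser:
    """PEG zero-or-more: greedy repetition that never fails."""
    def parse(text: str, pos: int) -> Result:
        values = []
        while True:
            r = parser(text, pos)
            if r is None or r[1] == pos:
                return values, pos
            value, pos = r
            values.append(value)
    return parse


def apply(parser: Parser, fn: Callable[[Any], Any]) -> Parser:
    """Transform the value of a successful parse."""
    def parse(text: str, pos: int) -> Result:
        r = parser(text, pos)
        return None if r is None else (fn(r[0]), r[1])
    return parse


# Hypothetical toy grammar:
#   entry    <- subentry (";" subentry)* END
#   subentry <- domain* item ("," item)*
#   item     <- xref / translation
domain = apply(token(r"\(([^)]+)\)"), lambda v: ("domain", v))
xref = apply(token(r"see\s+([^\s,;][^,;]*)"), lambda v: ("xref", v))
translation = apply(token(r"([^\s,;()][^,;()]*)"), lambda v: ("translation", v))
item = choice(xref, translation)
subentry = apply(
    seq(many(domain), item, many(seq(token(","), item))),
    lambda v: v[0] + [v[1]] + [pair[1] for pair in v[2]],
)
entry = apply(
    seq(subentry, many(seq(token(";"), subentry)), token(r"$")),
    lambda v: [v[0]] + [pair[1] for pair in v[1]],
)


def parse_entry(text: str):
    """Return a list of subentries (lists of tagged items), or None."""
    r = entry(text, 0)
    return None if r is None else r[0]
```

Because `choice` tries `xref` before `translation`, a string such as `see agreement` is read as a cross-reference, while `seed` falls through to a translation. That deterministic ordering is the main difference between PEGs and context-free grammars, where such overlaps would be ambiguous.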
If this is right
- Structured entries enable direct use of lexical data in NLP applications without further parsing.
- The induction-plus-parsing process can be reused on other Arabic dictionaries that also lack standardization.
- Explicit hierarchies make specific elements such as translations and cross-references immediately extractable by software.
- Semi-automatic structuring lowers the cost of creating machine-readable versions of printed lexical resources.
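The extractability point in the list above can be sketched as follows. The dictionary layout, the key names, and the sample data are all assumptions invented for this example; the paper does not specify an output format here.

```python
from typing import Dict, List

# Invented structured entries; the keys are hypothetical.
entries = [
    {
        "headword": "hw1",
        "subentries": [
            {"translations": ["book"], "cross_references": []},
            {"translations": ["letter"], "cross_references": ["hw2"]},
        ],
    },
    {
        "headword": "hw2",
        "subentries": [
            {"translations": ["message"], "cross_references": []},
        ],
    },
]


def collect(structured: List[dict], component: str) -> Dict[str, List[str]]:
    """Gather one component (e.g. translations) per headword."""
    result: Dict[str, List[str]] = {}
    for e in structured:
        values: List[str] = []
        for sub in e["subentries"]:
            values.extend(sub.get(component, []))
        result[e["headword"]] = values
    return result


translations = collect(entries, "translations")
cross_refs = collect(entries, "cross_references")
```

Once the hierarchy is explicit, pulling out a component is a plain traversal; no further parsing of the entry text is involved.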
Where Pith is reading between the lines
- The same pattern-induction approach might extend to dictionaries in other languages that also lack fixed formats.
- Once structured, the resulting data could feed directly into downstream tasks such as machine translation or semantic search involving Arabic.
- Success here suggests that many non-standard lexical resources could be made machine-usable through targeted grammar induction rather than full redesign.
Load-bearing premise
Dictionary entries contain consistent and inducible patterns that a PEG parser can capture without excessive manual rules or post-processing exceptions.
What would settle it
Applying the parser to a large sample of entries and finding that most require heavy manual correction or produce incorrect hierarchies would falsify the claim of plausible accuracy.
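One way to run such a check is sketched below, assuming a manually structured gold sample is available. The function name, the dictionary keys, and the toy data are hypothetical; the paper reports no such figures.

```python
from typing import Any, Dict, Optional


def entry_level_report(predicted: Dict[str, Optional[Any]],
                       gold: Dict[str, Any]) -> Dict[str, float]:
    """Compare parser output with gold structures, entry by entry.

    predicted maps an entry id to a structure, or to None when parsing failed.
    """
    total = len(gold)
    if total == 0:
        return {"coverage": 0.0, "exact_match": 0.0, "needs_correction": 0.0}
    parsed = sum(1 for key in gold if predicted.get(key) is not None)
    exact = sum(1 for key in gold if predicted.get(key) == gold[key])
    return {
        "coverage": parsed / total,           # entries the parser accepted
        "exact_match": exact / total,         # entries with a fully correct hierarchy
        "needs_correction": 1 - exact / total,
    }


# Invented toy sample of four entries.
gold = {"e1": [["a"]], "e2": [["b", "c"]], "e3": [["d"]], "e4": [["e"]]}
predicted = {"e1": [["a"]], "e2": [["b"], ["c"]], "e3": [["d"]], "e4": None}
report = entry_level_report(predicted, gold)
```

A high `needs_correction` rate on a large, randomly drawn sample would count against the claim; a low one would support it.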
Original abstract
Dictionaries are rich sources of lexical information about words that is required for many applications of natural language processing and human language technology. However, publishers prepare printed dictionaries for human usage not for machine processing. This paper presented a method to structure partly a machine-readable version of the Arabic-English Al-Mawrid dictionary. The method converted the entries of Al-Mawrid from a stream of words and punctuation marks into hierarchical structures. The hierarchical structure expresses the components of each dictionary entry in explicit format. A dictionary entry is composed of subentries and each subentry consists of defining phrases, domain labels, cross-references, and translation equivalences. We designed the proposed method as cascaded steps where parsing is the main step. We implemented the parser using the parsing expression grammars formalism. In conclusion, although Arabic dictionaries do not have microstructure standardization, this study demonstrated that it is possible to structure them automatically or semi-automatically with plausible accuracy after inducing their microstructure.
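The "cascaded steps" design mentioned in the abstract can be pictured as a chain of functions, each consuming the previous stage's output. The abstract does not name the individual steps, so the stage names and their behaviour below are assumptions, and the string splitting is only a stand-in for the actual PEG parsing step.

```python
import re
from functools import reduce
from typing import Any, Callable, List


def normalize(text: str) -> str:
    """Hypothetical stage 1: collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text).strip()


def split_subentries(text: str) -> List[str]:
    """Hypothetical stage 2: cut the entry body at semicolons."""
    return [chunk.strip() for chunk in text.split(";") if chunk.strip()]


def parse_items(chunks: List[str]) -> List[List[str]]:
    """Hypothetical stage 3: stand-in for the parsing step."""
    return [[i.strip() for i in chunk.split(",") if i.strip()] for chunk in chunks]


def run_cascade(raw: Any, stages: List[Callable[[Any], Any]]) -> Any:
    """Feed the output of each stage into the next."""
    return reduce(lambda value, stage: stage(value), stages, raw)


structured = run_cascade(
    "  contract ,  deed ;  agreement ",
    [normalize, split_subentries, parse_items],
)
```

Keeping the stages separate lets each one be tested or replaced independently, for instance by swapping the stand-in for a real grammar-based parser.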
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a cascaded parsing method using Parsing Expression Grammars (PEG) to convert entries from the Arabic-English Al-Mawrid dictionary from unstructured text into explicit hierarchical structures. Each entry is decomposed into subentries containing defining phrases, domain labels, cross-references, and translation equivalences. The authors conclude that, despite the absence of microstructure standardization in Arabic dictionaries, automatic or semi-automatic structuring is feasible with plausible accuracy once the microstructure has been induced.
Significance. If the accuracy claim can be substantiated, the work would contribute a practical technique for retrofitting legacy printed dictionaries into machine-readable lexical resources, which remains a bottleneck for Arabic NLP. The choice of PEG for handling the semi-regular but non-standardized entry formats is a reasonable technical fit, though the absence of any reported metrics prevents assessment of whether the approach scales beyond the specific dictionary examined.
major comments (1)
- [Abstract and conclusion] The central claim that the method achieves 'plausible accuracy' is unsupported by any quantitative evaluation. No test-set size, precision/recall, coverage rate, or error analysis is provided, which directly undermines the assertion that structuring is possible 'automatically or semi-automatically' without excessive manual exceptions.
Simulated Author's Rebuttal
We thank the referee for the detailed review and the recommendation for major revision. The central issue identified is the lack of quantitative support for the accuracy claims in the abstract and conclusion. We address this point directly below.
Point-by-point responses
- Referee: [Abstract and conclusion] The central claim that the method achieves 'plausible accuracy' is unsupported by any quantitative evaluation. No test-set size, precision/recall, coverage rate, or error analysis is provided, which directly undermines the assertion that structuring is possible 'automatically or semi-automatically' without excessive manual exceptions.
  Authors: We agree that the manuscript does not include quantitative metrics such as test-set size, precision, recall, coverage, or error analysis, and that this leaves the 'plausible accuracy' claim unsubstantiated. The current version demonstrates the method through example parses after microstructure induction but relies on qualitative observation rather than formal evaluation. In the revised manuscript we will add a dedicated evaluation section reporting these metrics on a held-out set of entries, along with an error analysis, so that readers can assess whether the approach scales with limited manual exceptions.
  Revision: yes
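The component-level metrics discussed in this exchange could be computed along the following lines. This is a sketch under stated assumptions: each extracted component is represented as an (entry id, component type, value) triple, and the sample triples are invented; the paper itself reports no such evaluation.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]  # (entry id, component type, value)


def precision_recall_f1(predicted: Set[Triple], gold: Set[Triple]) -> Dict[str, float]:
    """Score extracted components against a manually annotated gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented held-out sample.
gold = {
    ("e1", "translation", "book"),
    ("e1", "translation", "letter"),
    ("e1", "xref", "e2"),
    ("e2", "domain", "law"),
}
predicted = {
    ("e1", "translation", "book"),
    ("e1", "translation", "letter"),
    ("e1", "translation", "e2"),   # cross-reference mislabelled as a translation
    ("e2", "domain", "law"),
}
scores = precision_recall_f1(predicted, gold)
```

Scoring per triple rather than per entry shows which component types the grammar handles poorly, which is the kind of error analysis the referee asks for.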
Circularity Check
No circularity: empirical method demonstration with no fitted predictions or self-citation chains
Full rationale
The paper describes a cascaded PEG parser to induce microstructure from Al-Mawrid dictionary entries and convert them to hierarchical structures. No equations, parameters, or 'predictions' are defined; the central claim is an empirical demonstration that patterns can be captured. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The derivation chain is self-contained as a grammar-engineering exercise whose success is asserted via the method itself rather than reducing to prior fitted inputs or author citations. Absence of quantitative metrics is a separate evaluation concern, not circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: dictionary entries follow patterns consistent enough to allow induction of a microstructure grammar.