Towards Structuring an Arabic-English Machine-Readable Dictionary Using Parsing Expression Grammars

Aly A. Fahmy; Diaa Mohamed Fayed; Mohsen A. Rashwan; Wafaa K. Fayed

arxiv: 2606.25231 · v2 · pith:KF2I56CRnew · submitted 2026-06-23 · 💻 cs.CL

Pith reviewed 2026-06-29 05:07 UTC · model grok-4.3

classification 💻 cs.CL

keywords Arabic-English dictionary · machine-readable dictionary · parsing expression grammars · microstructure induction · natural language processing · lexical resources · Al-Mawrid dictionary · dictionary structuring

The pith

A parsing expression grammar can convert unstructured Arabic dictionary entries into explicit hierarchical machine-readable structures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that even though Arabic dictionaries lack standardized microstructure, their entry patterns can be induced and then parsed automatically or semi-automatically into usable form. The method takes the stream of words and punctuation in the Al-Mawrid Arabic-English dictionary and turns each entry into a hierarchy that marks subentries, defining phrases, domain labels, cross-references, and translation equivalences. This matters because printed dictionaries hold lexical data needed for natural language processing, yet that data remains inaccessible to machines until it is explicitly structured. The approach relies on cascaded parsing steps with a PEG parser at the core. If the patterns prove inducible, the same process could reduce the manual work required to make lexical resources machine-ready.

Core claim

After inducing the microstructure of the Al-Mawrid dictionary, a parser built with parsing expression grammars converts each entry from a sequence of words and marks into a hierarchical structure that explicitly represents its subentries along with their defining phrases, domain labels, cross-references, and translation equivalences.
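To make the target representation concrete, here is a minimal Python sketch of the hierarchy the core claim describes. The class names `Entry` and `Subentry`, the field names, the helper `all_translations`, and the sample values are all assumptions made for illustration. The paper's actual output format is not given in the abstract, and the sample entry is invented rather than copied from Al-Mawrid.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Subentry:
    """One subentry of a dictionary entry (hypothetical field names)."""
    defining_phrases: List[str] = field(default_factory=list)
    domain_labels: List[str] = field(default_factory=list)
    cross_references: List[str] = field(default_factory=list)
    translations: List[str] = field(default_factory=list)


@dataclass
class Entry:
    """A headword plus its ordered subentries."""
    headword: str
    subentries: List[Subentry] = field(default_factory=list)


def all_translations(entry: Entry) -> List[str]:
    """Collect every translation equivalence, in subentry order."""
    return [t for sub in entry.subentries for t in sub.translations]


# Invented example values, used only to show the shape of the structure.
entry = Entry(
    headword="كتاب",
    subentries=[
        Subentry(translations=["book"]),
        Subentry(translations=["letter"], cross_references=["رسالة"]),
    ],
)

# asdict() turns the nested dataclasses into plain dicts and lists,
# which can then be serialized (for example to JSON).
as_plain_data = asdict(entry)
```

Once entries are held in a form like this, pulling out one component, such as every translation, becomes a short traversal instead of another pass over raw text.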

What carries the argument

A parser implemented with parsing expression grammars (PEGs), which processes dictionary entries in cascaded steps and builds a hierarchical structure for each one.
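The defining feature of a PEG is ordered choice: alternatives are tried in sequence and the first one that matches wins, so the grammar is unambiguous by construction. The sketch below hand-rolls a few PEG-style combinators in plain Python and applies them to a toy entry notation. The notation (`;` between subentries, `,` between items, domain labels in parentheses, cross-references introduced by `see`) is a hypothetical stand-in, not the microstructure induced from Al-Mawrid, and the code is not the authors' implementation.

```python
import re

# A parser is a function (text, pos) -> (value, new_pos), or None on failure.


def regex(pattern):
    """Terminal: skip leading whitespace, then match `pattern` at `pos`."""
    compiled = re.compile(r"\s*(?:" + pattern + r")")

    def parse(text, pos):
        m = compiled.match(text, pos)
        if m is None:
            return None
        # Use the last matched capturing group if there is one, else the whole match.
        return m.group(m.lastindex or 0).strip(), m.end()
    return parse


def seq(*parsers):
    """Sequence: every parser must match, one after another."""
    def parse(text, pos):
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse


def choice(*parsers):
    """Ordered choice: the first alternative that matches wins."""
    def parse(text, pos):
        for p in parsers:
            result = p(text, pos)
            if result is not None:
                return result
        return None
    return parse


def many(parser):
    """Greedy zero-or-more repetition; never fails."""
    def parse(text, pos):
        values = []
        while True:
            result = parser(text, pos)
            if result is None or result[1] == pos:  # stop on failure or empty match
                return values, pos
            value, pos = result
            values.append(value)
    return parse


def tag(label, parser):
    """Attach a label to the value produced by `parser`."""
    def parse(text, pos):
        result = parser(text, pos)
        return None if result is None else ((label, result[0]), result[1])
    return parse


def sep_by(parser, separator):
    """parser (separator parser)*, keeping only the parser values."""
    inner = seq(parser, many(seq(separator, parser)))

    def parse(text, pos):
        result = inner(text, pos)
        if result is None:
            return None
        (first, pairs), pos = result
        return [first] + [value for _sep, value in pairs], pos
    return parse


# Toy grammar (hypothetical notation).
comma, semicolon = regex(","), regex(";")
domain = tag("domain", regex(r"\(([^)]*)\)"))
xref = tag("xref", regex(r"see\s+([^,;]+)"))
translation = tag("translation", regex(r"[^,;()\s][^,;()]*"))
plain = choice(xref, translation)  # xref first: translation would also match "see ..."
subentry = seq(many(domain), sep_by(plain, comma))
entry = sep_by(subentry, semicolon)


def parse_entry(text):
    """Parse a whole entry body into a list of subentry dicts."""
    text = text.strip()
    result = entry(text, 0)
    if result is None or result[1] != len(text):
        raise ValueError("entry does not match the toy grammar")
    structured = []
    for labels, items in result[0]:
        sub = {"domain": [], "translation": [], "xref": []}
        for label, value in labels + items:
            sub[label].append(value)
        structured.append(sub)
    return structured
```

For instance, `parse_entry("alpha, beta; (law) gamma, see delta")` yields two subentries, the second carrying the domain label `law`, the translation `gamma`, and the cross-reference `delta`. Swapping the order inside `choice` would silently turn the cross-reference into a translation, which is the kind of rule interaction a grammar writer has to watch for.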

If this is right

Structured entries enable direct use of lexical data in NLP applications without further parsing.
The induction-plus-parsing process can be reused on other Arabic dictionaries that also lack standardization.
Explicit hierarchies make specific elements such as translations and cross-references immediately extractable by software.
Semi-automatic structuring lowers the cost of creating machine-readable versions of printed lexical resources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pattern-induction approach might extend to dictionaries in other languages that also lack fixed formats.
Once structured, the resulting data could feed directly into downstream tasks such as machine translation or semantic search involving Arabic.
Success here suggests that many non-standard lexical resources could be made machine-usable through targeted grammar induction rather than full redesign.

Load-bearing premise

Dictionary entries contain consistent and inducible patterns that a PEG parser can capture without excessive manual rules or post-processing exceptions.

What would settle it

Applying the parser to a large sample of entries and finding that most require heavy manual correction or produce incorrect hierarchies would falsify the claim of plausible accuracy.

Original abstract

Dictionaries are rich sources of lexical information about words that is required for many applications of natural language processing and human language technology. However, publishers prepare printed dictionaries for human usage not for machine processing. This paper presented a method to structure partly a machine-readable version of the Arabic-English Al-Mawrid dictionary. The method converted the entries of Al-Mawrid from a stream of words and punctuation marks into hierarchical structures. The hierarchical structure expresses the components of each dictionary entry in explicit format. A dictionary entry is composed of subentries and each subentry consists of defining phrases, domain labels, cross-references, and translation equivalences. We designed the proposed method as cascaded steps where parsing is the main step. We implemented the parser using the parsing expression grammars formalism. In conclusion, although Arabic dictionaries do not have microstructure standardization, this study demonstrated that it is possible to structure them automatically or semi-automatically with plausible accuracy after inducing their microstructure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

PEG parsing is applied to structure Al-Mawrid entries but the paper supplies no accuracy numbers or error analysis to support the claim of plausible success.

The main thing to know is that the authors used parsing expression grammars in a cascaded setup to turn the text of the Al-Mawrid Arabic-English dictionary into explicit hierarchical entries, yet they give no test results to show how well it worked.

They first induce the microstructure from the printed dictionary and then parse subentries, definitions, domain labels, cross-references, and translations. The approach treats the input as a stream of words and punctuation and builds nested structures step by step. This is a straightforward extension of grammar-based extraction to Arabic, where dictionaries lack fixed formatting.
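A cascade of this kind can be pictured as a chain of small stages, each consuming the previous stage's output. The sketch below is only an illustration under assumed conventions: the stage names, the `;` and `,` separators, and the output shape are hypothetical, and a simple token-splitting stage stands in for the PEG parsing step the paper actually uses.

```python
import re
from functools import reduce


def tokenize(text):
    """Stage 1: turn raw text into a stream of word and punctuation tokens."""
    # \w+ covers Unicode letters; vocalized Arabic with combining
    # diacritics would probably need a richer pattern than this.
    return re.findall(r"\w+|[^\w\s]", text)


def split_subentries(tokens):
    """Stage 2: start a new token group at each ';' (assumed separator)."""
    groups, current = [], []
    for tok in tokens:
        if tok == ";":
            groups.append(current)
            current = []
        else:
            current.append(tok)
    groups.append(current)
    return [g for g in groups if g]


def split_items(groups):
    """Stage 3: inside each group, join the words between ',' into items."""
    result = []
    for group in groups:
        items, words = [], []
        for tok in group + [","]:  # the sentinel comma flushes the last item
            if tok == ",":
                if words:
                    items.append(" ".join(words))
                words = []
            else:
                words.append(tok)
        result.append({"items": items})
    return result


def run_cascade(text, stages=(tokenize, split_subentries, split_items)):
    """Feed the text through every stage in order."""
    return reduce(lambda data, stage: stage(data), stages, text)
```

Keeping the stages separate means each can be inspected or replaced on its own, for example by swapping the third stage for a grammar-driven parser.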

The description of entry components and the choice of PEG for this kind of semi-structured text are clear enough. The practical focus on a real, non-standardized resource is the part that could be useful to others facing similar digitization tasks.

The soft spot is the complete absence of evaluation. The abstract concludes that automatic or semi-automatic structuring is possible with plausible accuracy, but there is no mention of how many entries were processed, what precision or recall was achieved, how coverage was measured, or what kinds of exceptions arose. Without those figures it is impossible to judge whether the grammar captured consistent patterns or required extensive manual rules that would weaken the automation claim.
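For readers wondering what the missing figures would look like, here is a small sketch of how precision, recall, and coverage could be computed against a hand-annotated sample. The function name `evaluate`, the data layout, and the toy values are assumptions made for illustration; none of the numbers come from the paper.

```python
def evaluate(gold, predicted):
    """Score parser output against hand-annotated entries.

    gold:      dict entry_id -> set of (label, text) pairs
    predicted: dict entry_id -> set of (label, text) pairs,
               or None when the parser rejected the entry
    """
    tp = fp = fn = parsed = 0
    for entry_id, gold_items in gold.items():
        pred_items = predicted.get(entry_id)
        if pred_items is None:
            # An unparsed entry contributes only misses.
            fn += len(gold_items)
            continue
        parsed += 1
        tp += len(gold_items & pred_items)
        fp += len(pred_items - gold_items)
        fn += len(gold_items - pred_items)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "coverage": parsed / len(gold) if gold else 0.0,
    }


# Invented toy data: the parser mislabels one item in e1 and rejects e2.
gold = {
    "e1": {("translation", "alpha"), ("xref", "delta")},
    "e2": {("translation", "beta")},
}
predicted = {
    "e1": {("translation", "alpha"), ("translation", "delta")},
    "e2": None,
}
scores = evaluate(gold, predicted)
```

Reporting coverage alongside precision and recall matters here because a grammar can look precise simply by refusing to parse the entries it finds difficult.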

The assumption that dictionary entries contain sufficiently regular patterns for a PEG parser to handle reliably is left untested in the available text.

This paper is aimed at people building lexical resources for Arabic NLP. A reader already working on dictionary parsing might borrow the cascaded PEG idea, but anyone needing validated performance data will find little to take away.

I would not send it to peer review in this form. It needs at least a quantitative evaluation section and error analysis before it merits referee time.

Referee Report

1 major / 0 minor

Summary. The paper presents a cascaded parsing method using Parsing Expression Grammars (PEG) to convert entries from the Arabic-English Al-Mawrid dictionary from unstructured text into explicit hierarchical structures. Each entry is decomposed into subentries containing defining phrases, domain labels, cross-references, and translation equivalences. The authors conclude that, despite the absence of microstructure standardization in Arabic dictionaries, automatic or semi-automatic structuring is feasible with plausible accuracy once the microstructure has been induced.

Significance. If the accuracy claim can be substantiated, the work would contribute a practical technique for retrofitting legacy printed dictionaries into machine-readable lexical resources, which remains a bottleneck for Arabic NLP. The choice of PEG for handling the semi-regular but non-standardized entry formats is a reasonable technical fit, though the absence of any reported metrics prevents assessment of whether the approach scales beyond the specific dictionary examined.

major comments (1)

[Abstract] (and conclusion): The central claim that the method achieves 'plausible accuracy' is unsupported by any quantitative evaluation. No test-set size, precision/recall, coverage rate, or error analysis is provided, which directly undermines the assertion that structuring is possible 'automatically or semi-automatically' without excessive manual exceptions.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and the recommendation for major revision. The central issue identified is the lack of quantitative support for the accuracy claims in the abstract and conclusion. We address this point directly below.

Referee: [Abstract] (and conclusion): The central claim that the method achieves 'plausible accuracy' is unsupported by any quantitative evaluation. No test-set size, precision/recall, coverage rate, or error analysis is provided, which directly undermines the assertion that structuring is possible 'automatically or semi-automatically' without excessive manual exceptions.

Authors: We agree that the manuscript does not include quantitative metrics such as test-set size, precision, recall, coverage, or error analysis, and that this leaves the 'plausible accuracy' claim unsubstantiated. The current version demonstrates the method through example parses after microstructure induction but relies on qualitative observation rather than formal evaluation. In the revised manuscript we will add a dedicated evaluation section reporting these metrics on a held-out set of entries, along with an error analysis, to allow assessment of whether the approach scales with limited manual exceptions.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method demonstration with no fitted predictions or self-citation chains

The paper describes a cascaded PEG parser to induce microstructure from Al-Mawrid dictionary entries and convert them to hierarchical structures. No equations, parameters, or 'predictions' are defined; the central claim is an empirical demonstration that patterns can be captured. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The derivation chain is self-contained as a grammar-engineering exercise whose success is asserted via the method itself rather than reducing to prior fitted inputs or author citations. Absence of quantitative metrics is a separate evaluation concern, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that dictionary text streams contain recoverable hierarchical patterns inducible from examples; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption: Dictionary entries follow patterns consistent enough to allow induction of a microstructure grammar.
Invoked in the conclusion when claiming automatic structuring is possible after induction.
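One rough way to probe this assumption before writing any grammar is to reduce each entry to a punctuation "signature" and count how often each signature occurs. The sketch below is a hypothetical diagnostic, not the induction procedure used in the paper; the function names and the sample strings are invented.

```python
import re
from collections import Counter


def signature(entry_text):
    """Replace each run of words with 'W' and keep punctuation marks as-is."""
    tokens = re.findall(r"\w+|[^\w\s]", entry_text)
    shape = []
    for tok in tokens:
        symbol = "W" if re.fullmatch(r"\w+", tok) else tok
        if symbol == "W" and shape and shape[-1] == "W":
            continue  # collapse consecutive words into a single 'W'
        shape.append(symbol)
    return " ".join(shape)


def signature_counts(entries):
    """Count how many entries share each signature."""
    return Counter(signature(e) for e in entries)


# Invented sample strings, not taken from any dictionary.
counts = signature_counts([
    "alpha, beta",
    "gamma, delta epsilon",
    "zeta; eta",
])
```

If a handful of signatures account for most entries, the patterns are plausibly regular enough for a grammar. A long tail of rare signatures would hint at the manual exceptions the referee is worried about.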

