Some of the tools below use a Sahidic Coptic lexicon based on data kindly provided by Prof. Tito Orlandi and the CMCL project. When using the part-of-speech tagging models or the tokenization script and its lexicon, please make sure to refer back to the CMCL project.
Natural Language Processing API
New: You can now get unified access to the latest NLP tools online through a web interface or a machine-actionable REST API: Coptic NLP Service
The NLP service currently covers segmentation, normalization, part-of-speech tagging, lemmatization, and language-of-origin tagging. For individual command-line tools, see below.
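As a rough illustration of what a machine-actionable request to such a service could look like, here is a minimal Python sketch. The endpoint URL, the JSON field names (`text`, `stages`) and the stage identifiers are all placeholders invented for this example; only the list of processing steps comes from the description above. Consult the Coptic NLP Service itself for the real address and parameters.

```python
import json
import urllib.request

# Hypothetical endpoint: replace with the address published by the Coptic NLP Service.
API_URL = "https://example.org/coptic-nlp/api"

# Stage names mirror the steps listed in the text; the real parameter names may differ.
STAGES = (
    "segmentation",
    "normalization",
    "pos_tagging",
    "lemmatization",
    "language_of_origin",
)


def build_request(text, stages=STAGES, url=API_URL):
    """Build (but do not send) a JSON POST request for the given text."""
    unknown = [s for s in stages if s not in STAGES]
    if unknown:
        raise ValueError(f"unknown stage(s): {unknown}")
    payload = json.dumps(
        {"text": text, "stages": list(stages)}, ensure_ascii=False
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_request("ⲁϥⲥⲱⲧⲙ", stages=("segmentation", "pos_tagging"))
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To send it, call urllib.request.urlopen(req) once API_URL points at the real service.
```

Building the request separately from sending it makes it easy to inspect the payload before any network traffic happens.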
- Scripts and models
Additional Annotation Tools
- Normalizer (normalizes orthography, removes diacritics)
- Language-of-origin tagger (to annotate loan words from Greek, Latin, Hebrew/Greco-Hebrew, Aramaic)
- Lemmatizer (to annotate words with their dictionary headword; embedded in the Part-of-Speech Tagger)
- Coptic encoding converter (converts legacy font-based encodings, such as those of the Coptic and Laser Coptic fonts, into standards-compliant Coptic Unicode characters)
- Simple recoding script in Perl (supports CMCL, Laser Coptic and UTF-8 encoding conversion)
- Converter between UTF-8 and the ASCII encoding of Dirk Van Damme and Gregor Wurst
- Download both converters
- SaltNPepper - a metamodel-based Java framework for multi-format conversion
- Excel plugin for importing and exporting EXMARaLDA XML, SGML, PAULA XML and subsets of TEI XML
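To give a feel for the diacritic-removal step mentioned for the Normalizer above, here is a minimal Python sketch. It is not the project's normalizer, which also normalizes orthography; it only strips Unicode combining marks (such as the supralinear stroke, U+0305) using the standard library. The function name `strip_diacritics` is made up for this example.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks (e.g. supralinear strokes) from a string."""
    # Decompose so that any precomposed characters expose their combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every character whose general category is a mark (Mn, Mc, Me).
    kept = [ch for ch in decomposed if not unicodedata.category(ch).startswith("M")]
    # Recompose what is left into canonical form.
    return unicodedata.normalize("NFC", "".join(kept))


if __name__ == "__main__":
    # Coptic letters mu and nu, each followed by a combining overline (U+0305).
    marked = "\u2c99\u0305\u2c9b\u0305"
    print(strip_diacritics(marked))
```

Filtering on the Unicode general category rather than on a fixed list of code points means the sketch also catches marks it was not written with in mind, at the cost of removing marks a real normalizer might want to keep.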