Some of the tools below use a Sahidic Coptic lexicon based on data kindly provided by Prof. Tito Orlandi and the CMCL project. When using the part-of-speech tagging models, or the tokenization script and its lexicon, please make sure to refer back to the CMCL project.
New: We've posted some visualizations of our Coptic entity annotations. Try playing with the data, which comes from our freely available corpora:
Natural Language Processing API
You can now get unified access to the latest NLP tools online, using either a web interface or a machine-actionable REST API:
The NLP service currently covers segmentation, normalization, part-of-speech tagging, lemmatization and language-of-origin tagging. For individual command-line tools, see below.
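As a rough illustration of what machine-actionable access could look like, here is a minimal Python sketch that builds a POST request for such a service without sending it. The endpoint URL, the parameter names `data` and `steps`, and the step identifiers are all hypothetical placeholders and are not taken from the actual API; consult the service's own documentation for the real values.

```python
from urllib import parse, request

# Hypothetical values: the real endpoint and parameter names are not given
# on this page, so replace them with those from the service documentation.
API_URL = "https://example.org/coptic-nlp/api"  # placeholder, not the real service
STEPS = (
    "segmentation",
    "normalization",
    "pos_tagging",
    "lemmatization",
    "language_of_origin",
)


def build_request(text, steps=STEPS):
    """Build (but do not send) a form-encoded POST request for the given text."""
    payload = parse.urlencode({"data": text, "steps": ",".join(steps)})
    return request.Request(API_URL, data=payload.encode("utf-8"), method="POST")


# Sample input string; any Coptic Unicode text could be used here.
req = build_request("ⲁϥⲥⲱⲧⲙ")
print(req.get_method(), req.full_url)

# Decode the body again to check what would be transmitted.
sent = parse.parse_qs(req.data.decode("utf-8"))
print(sent["steps"])
```

Sending the request would be a matter of passing `req` to `urllib.request.urlopen`, once the placeholder URL and parameters have been replaced with real ones.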
Scripts and models
Coptic Universal Dependency Treebank
A treebank is a collection of texts in which sentences have been exhaustively annotated with syntactic analyses. Our Coptic Treebank project uses the Universal Dependencies standards, which apply the same annotation scheme to multiple languages.
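Universal Dependencies treebanks are distributed in the tab-separated CoNLL-U format, with ten columns per token. The sketch below shows how such annotations can be read into head/relation/dependent triples. The three-token English sentence is a made-up illustration rather than data from the Coptic Treebank, and the helper names `parse_conllu` and `dependency_edges` are our own.

```python
# The ten CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")

# Toy sentence for illustration only (not taken from the Coptic Treebank).
SAMPLE = (
    "# text = She reads books\n"
    "1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\treads\tread\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tbooks\tbook\tNOUN\t_\t_\t2\tobj\t_\t_\n"
)


def parse_conllu(block):
    """Turn one CoNLL-U sentence block into a list of token dicts."""
    tokens = []
    for line in block.splitlines():
        # Skip blank lines and comment lines such as "# text = ...".
        if not line or line.startswith("#"):
            continue
        tokens.append(dict(zip(FIELDS, line.split("\t"))))
    return tokens


def dependency_edges(tokens):
    """Return (head form, relation, dependent form) triples; head 0 is ROOT."""
    forms = {t["id"]: t["form"] for t in tokens}
    return [(forms.get(t["head"], "ROOT"), t["deprel"], t["form"])
            for t in tokens]


tokens = parse_conllu(SAMPLE)
edges = dependency_edges(tokens)
for edge in edges:
    print(edge)
```

This sketch ignores multiword-token ranges and empty nodes, which CoNLL-U also allows; a dedicated CoNLL-U library is a better choice for real treebank files.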
Models and Examples
Additional Annotation Tools
- Coptic encoding converter (converts legacy font-based character encodings, such as those of the Coptic and Laser Coptic fonts, into standards-compliant Coptic Unicode characters)
- Simple recoding script in Perl (supports CMCL, Laser Coptic and UTF-8 encoding conversion)
- Converter for ASCII encoding / UTF-8 of Dirk Van Damme and Gregor Wurst
- Download both converters
- SaltNPepper - a metamodel-based Java framework for multi-format conversion
- Excel plugin for importing and exporting EXMARaLDA XML, SGML, PAULA XML and subsets of TEI XML