Some of the tools below use a Sahidic Coptic lexicon based on data kindly provided by Prof. Tito Orlandi and the CMCL project. When using the part-of-speech tagging models, or the tokenization script and its lexicon, please make sure to refer back to the CMCL project.
Lacuna Prediction Tool
New: Check out the demo of the neural lacuna prediction tool from our paper:
Entity Visualizations
We've posted some visualizations of our Coptic entity annotations. Try playing with the data, which comes from our freely available corpora:
Natural Language Processing API
You can now get unified access to the latest NLP tools online using a web interface or a machine-actionable REST API:
The NLP service currently covers segmentation, normalization, part-of-speech tagging, lemmatization and language-of-origin tagging. For individual command-line tools, see below.
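As a rough illustration of calling a REST service like this from a script, here is a minimal sketch that only builds a request URL and sends nothing. The endpoint address and the parameter names `data` and `tasks` are placeholders we made up for this example, not the real API; check the service's own documentation for the actual address and options.

```python
from urllib.parse import urlencode

# Placeholder endpoint: NOT the real service address.
API_URL = "https://example.org/coptic-nlp/api"


def build_request_url(text, tasks):
    """Return a GET URL carrying the text and the requested processing steps.

    "data" and "tasks" are hypothetical parameter names used for illustration.
    """
    # urlencode percent-encodes the Coptic Unicode text as UTF-8.
    query = urlencode({"data": text, "tasks": ",".join(tasks)})
    return f"{API_URL}?{query}"


# A short Coptic sample string and two illustrative task names.
url = build_request_url("ⲁϥⲥⲱⲧⲙ", ["segment", "tag"])
print(url)
```

The point of the sketch is the encoding step: Coptic Unicode text has to be percent-encoded before it goes into a query string, and `urlencode` handles that.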
Part-of-Speech Tagging
Scripts and models
- Tokenization script and lexicon (assumes normalized Coptic, see tokenization guidelines)
- TreeTagger - an open source part-of-speech tagger (additional Windows interface WinTreeTagger)
- Coptic TreeTagger training models - for the fine-grained and coarse-grained tagsets (see tagging guidelines)
Documentation
- Diplomatic Transcription Guidelines
- Tokenization Guidelines (see sections 3 & 4 of the Transcription Guidelines)
- Part-of-Speech Tagging Guidelines
- Lemmatization Guidelines
Coptic Universal Dependency Treebank
A treebank is a collection of texts in which sentences have been exhaustively annotated with syntactic analyses. Our Coptic Treebank project uses the Universal Dependencies standards, which apply the same annotation scheme to multiple languages.
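Universal Dependencies treebanks are distributed in the CoNLL-U format, where each token is one line of ten tab-separated fields and sentences are separated by blank lines. The sketch below is a minimal reader for that format, not part of our tools; the two-word English sample is a toy input invented for illustration, and the function name `parse_conllu` is our own.

```python
# The ten CoNLL-U columns, in order.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, current = [], []
    for raw in text.splitlines():
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence metadata; skipped in this sketch.
            continue
        values = line.split("\t")
        if len(values) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 fields, got {len(values)}: {line!r}")
        current.append(dict(zip(CONLLU_FIELDS, values)))
    if current:
        sentences.append(current)
    return sentences


# Toy sample (invented for illustration, not taken from the treebank).
SAMPLE = "\n".join([
    "# text = Dogs bark",
    "\t".join(["1", "Dogs", "dog", "NOUN", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "bark", "bark", "VERB", "_", "_", "0", "root", "_", "_"]),
    "",
])

for token in parse_conllu(SAMPLE)[0]:
    print(token["form"], token["deprel"], token["head"])
```

Each token's `head` field points at the `id` of its syntactic parent, with `0` marking the root of the sentence.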
Models and Examples
Additional Annotation Tools
- Normalizer (normalizes orthography, removes diacritics)
- Language-of-origin tagger (to annotate loan words from Greek, Latin, Hebrew/Greco-Hebrew, Aramaic)
- Lemmatizer (to annotate words with their dictionary head word; embedded in the Part-of-Speech Tagger)
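To give a feel for the diacritic-removal step mentioned for the Normalizer above, here is a minimal sketch using Python's standard `unicodedata` module. It is not the Normalizer itself and does no orthographic normalization; it assumes that every character in a Unicode "Mark" category (such as the combining overline used for supralinear strokes) should simply be dropped, and the function name `strip_diacritics` is our own.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks from text, leaving base letters untouched."""
    # Decompose so that any precomposed characters split into base + marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Unicode general categories starting with "M" are combining marks.
    kept = [ch for ch in decomposed
            if not unicodedata.category(ch).startswith("M")]
    # Recompose whatever remains into canonical composed form.
    return unicodedata.normalize("NFC", "".join(kept))


# Two Coptic letters, each followed by U+0305 COMBINING OVERLINE.
sample = "ⲙ\u0305ⲛ\u0305"
print(strip_diacritics(sample))
```

A real normalizer would need to decide which marks carry meaning and which are scribal decoration; this sketch treats them all alike.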
Converters
- Coptic encoding converter (converts older text character systems used for fonts such as Coptic and Laser Coptic into standards-compliant Coptic Unicode characters)
- Simple recoding script in Perl (supports CMCL, Laser Coptic and UTF-8 encoding conversion)
- Converter for ASCII encoding / UTF-8 of Dirk Van Damme and Gregor Wurst
- Download both converters
- SaltNPepper - a metamodel-based Java framework for multi-format conversion
- Excel-Plugin for importing and exporting EXMARaLDA XML, SGML, PAULA XML and subsets of TEI XML