Some of the tools below use a Sahidic Coptic lexicon based on data kindly provided by Prof. Tito Orlandi and the CMCL project. When using the part-of-speech tagging models or the tokenization script and its lexicon, please make sure to refer back to the CMCL project.

Entity Visualizations

New: We've posted some visualizations of our Coptic entity annotations. Try playing with the data, which comes from our freely available corpora:

Entity visualizations

Natural Language Processing API

You can now get unified access to the latest NLP tools online through a web interface or a machine-actionable REST API:

Coptic NLP Service

The NLP service currently covers segmentation, normalization, part-of-speech tagging, lemmatization and language-of-origin tagging. For individual command-line tools, see below.

Part-of-Speech Tagging

Scripts and models

Documentation

Coptic Universal Dependency Treebank

A treebank is a collection of texts in which sentences have been exhaustively annotated with syntactic analyses. Our Coptic Treebank project uses the Universal Dependencies standards, which apply the same annotation scheme to multiple languages.
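As a rough illustration of what such annotations look like to a program, here is a minimal sketch that reads CoNLL-U, the tab-separated file format defined by Universal Dependencies. It assumes the treebank files are available in that format; the function name parse_conllu and the short English sample are made up for this example and are not taken from the treebank.

```python
from typing import Dict, List

# The ten tab-separated columns defined by the CoNLL-U format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# Illustrative sample only (English, not data from the Coptic Treebank).
SAMPLE = "\n".join([
    "# text = Dogs bark",
    "\t".join(["1", "Dogs", "dog", "NOUN", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "bark", "bark", "VERB", "_", "_", "0", "root", "_", "_"]),
    "",
])


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (metadata) are skipped in this sketch.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    for token in parse_conllu(SAMPLE)[0]:
        print(token["form"], token["upos"], token["head"], token["deprel"])
```

Each token carries the index of its syntactic head in the head column and the relation to that head in deprel; a head of 0 marks the root of the sentence.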

Additional Annotation Tools

  • Normalizer (normalizes orthography, removes diacritics)
  • Language of origin tagger (to annotate loan words from Greek, Latin, Hebrew/Greco-Hebrew, Aramaic)
  • Lemmatizer (to annotate words with their dictionary head word; embedded in the Part-of-Speech Tagger)
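The diacritic-removal step mentioned for the Normalizer can be approximated with Python's standard library. The following is only a sketch of that one step, not the Normalizer itself: it does not normalize orthography, and the function name strip_diacritics is an assumption made for this example.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks (for example supralinear strokes) from text."""
    # Decompose so that precomposed characters split into base letter + mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark (Unicode general categories Mn, Mc, Me).
    kept = "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )
    # Recompose whatever remains into canonical composed form.
    return unicodedata.normalize("NFC", kept)


if __name__ == "__main__":
    # COPTIC SMALL LETTER NI followed by COMBINING OVERLINE.
    print(strip_diacritics("\u2c9b\u0305") == "\u2c9b")
```

Removing all combining marks is a blunt rule; a real normalizer may need to keep some marks or handle additional characters, which this sketch does not attempt.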

Converters

  • Coptic encoding converter (converts legacy character encodings tied to fonts such as Coptic and Laser Coptic into standards-compliant Coptic Unicode characters)
    • Simple recoding script in Perl (supports CMCL, Laser Coptic and UTF-8 encoding conversion)
    • Converter for ASCII encoding / UTF-8 of Dirk Van Damme and Gregor Wurst
    • Download both converters
  • SaltNPepper - a metamodel-based Java framework for multi-format conversion
  • Excel plugin for importing and exporting EXMARaLDA XML, SGML, PAULA XML and subsets of TEI XML
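The idea behind the encoding converters listed above, replacing each legacy font character with its Coptic Unicode equivalent, can be sketched as a table lookup. The mapping below is hypothetical and covers three letters only; it is not the actual table for CMCL, Laser Coptic or any other font, and the names LEGACY_TO_UNICODE and convert_legacy are invented for this example.

```python
# HYPOTHETICAL mapping for illustration; real legacy fonts need their own
# complete tables, which are not reproduced here.
LEGACY_TO_UNICODE = {
    "a": "\u2c81",  # COPTIC SMALL LETTER ALFA
    "b": "\u2c83",  # COPTIC SMALL LETTER VIDA
    "g": "\u2c85",  # COPTIC SMALL LETTER GAMMA
}


def convert_legacy(text: str, table: dict = LEGACY_TO_UNICODE) -> str:
    """Replace each mapped legacy character with its Unicode counterpart."""
    # str.maketrans builds a per-character translation table; characters
    # that are not in the table pass through unchanged.
    return text.translate(str.maketrans(table))


if __name__ == "__main__":
    print(convert_legacy("abg"))
```

A one-to-one character table is the simplest case. Legacy encodings that represent diacritics or letter combinations differently would need extra handling beyond this sketch.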