The Coptic Universal Dependency Treebank
Welcome to the Coptic Dependency Treebank, a project of Coptic Scriptorium.
FAQ
- What's a treebank?
A treebank is a collection of texts in which sentences have been exhaustively annotated with syntactic analyses. The term itself, pioneered by the Penn Treebank for English, draws from the traditional representation of sentences as upside-down trees, whose leaves are the words in the sentence.
- What are dependencies?
Dependency trees connect each word in a sentence to the word on which it "depends". For example, we think of subjects and objects of verbs as depending on those verbs, since the verb is the word determining their appearance. This is illustrated by the English example further below.
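As a minimal sketch of this idea, a dependency tree can be stored as one head position per word. The sentence, the `Token` class and the `describe` helper below are invented for illustration and are not taken from the treebank.

```python
from dataclasses import dataclass


@dataclass
class Token:
    idx: int    # 1-based position in the sentence
    form: str   # the word itself
    head: int   # position of the governing word; 0 marks the root
    func: str   # dependency label


# Invented English sentence: "Mary reads books"
sentence = [
    Token(1, "Mary", 2, "nsubj"),   # subject depends on the verb
    Token(2, "reads", 0, "root"),   # the verb is the root of the tree
    Token(3, "books", 2, "obj"),    # object depends on the verb
]


def describe(tokens):
    """Return one human-readable line per dependency edge."""
    by_idx = {t.idx: t for t in tokens}
    lines = []
    for t in tokens:
        governor = "ROOT" if t.head == 0 else by_idx[t.head].form
        lines.append(f"{t.form} --{t.func}--> {governor}")
    return lines


for line in describe(sentence):
    print(line)
```

Both the subject and the object point at the verb, which is what "depending on" means in this representation.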
- What are universal dependencies?
The Universal Dependencies (UD) project is an initiative to create treebanks for a wide range of languages using the same annotation scheme. This helps us to compare data from different languages, develop tools that work on multiple languages, and leverage data from one or more languages to improve analyses for other languages.
The UD scheme is lexico-centric. This means that function words are dependents of content words. This makes information extraction and entity recognition easier, since we can tell what is happening in a sentence just by looking at verbs and their dependents. For example, prepositions like 'on' are not seen as governing a following noun; rather, the noun is a nominal modifier (nmod), and 'on' is a case-marker designating the kind of modification, much like case endings in Greek or Latin.
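The treatment of prepositions can be sketched with a small English phrase. The phrase, the tuple layout and the helper names below are assumptions made for illustration; `FUNCTION_RELS` is only a partial list of UD relations that typically attach function words.

```python
# Invented phrase "the book on the table" as (id, form, head, func) tuples.
# Under UD, "on" attaches to "table" as case, and "table" attaches
# to "book" as nmod, so the two content words are linked directly.
PHRASE = [
    (1, "the", 2, "det"),
    (2, "book", 0, "root"),
    (3, "on", 5, "case"),
    (4, "the", 5, "det"),
    (5, "table", 2, "nmod"),
]

# Relations assumed here to mark function words (partial list for this sketch).
FUNCTION_RELS = {"det", "case", "aux", "cop", "mark", "cc"}


def content_edges(tokens):
    """Keep only edges between content words, as (governor, func, dependent)."""
    forms = {tid: form for tid, form, _, _ in tokens}
    return [
        (forms[head], func, form)
        for _, form, head, func in tokens
        if head != 0 and func not in FUNCTION_RELS
    ]


def case_marker(tokens, noun_id):
    """Return the form of the case-marking dependent of a noun, or None."""
    for _, form, head, func in tokens:
        if head == noun_id and func == "case":
            return form
    return None


print(content_edges(PHRASE))   # [('book', 'nmod', 'table')]
print(case_marker(PHRASE, 5))  # on
```

Dropping the function words leaves the content words still connected, which is the practical benefit of a lexico-centric scheme.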
- What are treebanks good for?
Treebanks help us to build parsers that can tell us who did what to whom. For example, if we want to know what Shenoute of Atripe is doing in his Not Because a Fox Barks, we can look for all verbs attached to a first person subject:
pos="V" ->dep[func="nsubj"] lemma="ⲁⲛⲟⲕ"
Having trees is also essential to further processing steps, such as Named Entity Recognition (NER). A mention of an entity typically consists of a nominal word and all of its dependents, but we can't know which words those are without a parse tree.
(see Search below for more examples)
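The point about entity mentions can be sketched as follows: given a parse, the mention headed by a noun is that noun plus everything below it in the tree. The sentence and the `mention_span` helper are invented for illustration and do not reflect how Coptic Scriptorium's own tools are implemented.

```python
from collections import defaultdict

# Invented sentence "the monk from the monastery wrote letters"
# as (id, form, head, func) tuples.
SENT = [
    (1, "the", 2, "det"),
    (2, "monk", 6, "nsubj"),
    (3, "from", 5, "case"),
    (4, "the", 5, "det"),
    (5, "monastery", 2, "nmod"),
    (6, "wrote", 0, "root"),
    (7, "letters", 6, "obj"),
]


def mention_span(tokens, head_id):
    """Return the text of the word at head_id plus all of its descendants."""
    children = defaultdict(list)
    for tid, _, head, _ in tokens:
        children[head].append(tid)

    # Depth-first walk collecting every token below head_id.
    seen = set()
    stack = [head_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(children[current])

    forms = {tid: form for tid, form, _, _ in tokens}
    return " ".join(forms[tid] for tid in sorted(seen))


print(mention_span(SENT, 2))  # the monk from the monastery
```

Without the head column there would be no principled way to decide that "from the monastery" belongs to the mention while "wrote letters" does not.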
Treebank contents
The treebank currently contains the following data:
subcorpus | documents | tokens |
---|---|---|
Not Because a Fox Barks | MONB XH204-216 | 2,547 |
Abraham our Father | MONB XL93-94, YA518-520 | 1,197 |
Acephalous Work 22 | MONB YA421-428 | 1,698 |
I See Your Eagerness | MONB GF31-32 | 439 |
Epistle of Pseudo-Ephrem | psephrem.letter | 1,925 |
Gospel of Mark | Chapters 1 - 9 | 10,812 |
1 Corinthians | Chapters 1 - 6 | 3,570 |
Book of Ruth | Chapters 1 - 4 (complete) | 3,470 |
Letters of Besa | #1,2,13,15,25 | 3,939 |
Life of Cyrus | life.cyrus.01 | 1,962 |
Life of Onnophrius | life.onnophrius.01 | 2,745 |
Apophthegmata Patrum | #1-6,18-19,23-32,114-139 | 4,152 |
Martyrdom of St. Victor | Chapters 1 - 6 | 1,985 |
Dormition of John | dormition.john.mercad | 3,064 |
Pseudo-Athanasius | mercy_judgment | 2,782 |
Proclus Homilies | #13 On Easter | 2,344 |
Total | | 48,631 |
See http://copticscriptorium.org for more details on these sources.
Guidelines
The documentation below contains the complete list of UD labels as they apply to Coptic, as well as detailed guidelines for the correct labels and dependency structures in different constructions. The guidelines also make reference to the underlying Coptic Scriptorium part of speech tags, which are documented separately.
Treebank Annotation Guidelines
Search
The syntactically annotated data can be queried and visualized using ANNIS (http://corpus-tools.org/annis), the same interface used for the other Coptic Scriptorium corpora which do not contain syntactic analyses.
Currently only Shenoute's Not Because a Fox Barks (NBFB) is completely annotated with syntax trees. However, since parts of the Gospel of Mark, Letters of Besa, the Martyrdom of Victor, Acephalous Work 22 and the Apophthegmata Patrum have also been annotated, we are offering a separately searchable Coptic Treebank corpus, which contains all syntactically annotated data. You can also find automatically parsed analyses in the ANNIS interface for most of our data, but these have not been manually checked.
The following ANNIS queries work in all of our syntactically annotated corpora (the treebank, and individual work corpora as they are annotated):
- Search for verbs governing a 1st person subject in NBFB:
pos="V" ->dep[func="nsubj"] lemma="ⲁⲛⲟⲕ"
- Search for complement clauses in the Treebank:
pos="V" ->dep[func="ccomp"] norm
- Search for appositions in the Treebank:
norm ->dep[func="appos"] norm
- Find dislocated arguments preceding their verb:
pos=/V.*/ ->dep[func="dislocated"] norm & #2 .* #1
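For readers working with the downloaded data rather than ANNIS, the logic of the last query can be approximated in a few lines of Python. This is a rough sketch under stated assumptions: the English sentence, the dictionary keys and the part of speech tags are placeholders invented here, and the regular expression is assumed to match the whole tag value.

```python
import re

# Invented sentence "that book I read it"; the pos values are placeholders,
# not the actual Coptic Scriptorium tag set.
TOKENS = [
    {"id": 1, "norm": "that", "pos": "DET", "head": 2, "func": "det"},
    {"id": 2, "norm": "book", "pos": "N", "head": 4, "func": "dislocated"},
    {"id": 3, "norm": "I", "pos": "PRON", "head": 4, "func": "nsubj"},
    {"id": 4, "norm": "read", "pos": "V", "head": 0, "func": "root"},
    {"id": 5, "norm": "it", "pos": "PRON", "head": 4, "func": "obj"},
]


def dislocated_before_verb(tokens):
    """Find (verb, dependent) pairs where a dislocated dependent precedes its verb."""
    by_id = {t["id"]: t for t in tokens}
    hits = []
    for dep in tokens:
        gov = by_id.get(dep["head"])  # None for the root, whose head is 0
        if gov is None:
            continue
        if (
            dep["func"] == "dislocated"             # ->dep[func="dislocated"]
            and re.fullmatch(r"V.*", gov["pos"])    # pos=/V.*/
            and dep["id"] < gov["id"]               # #2 .* #1
        ):
            hits.append((gov["norm"], dep["norm"]))
    return hits


print(dislocated_before_verb(TOKENS))  # [('read', 'book')]
```

The three conditions in the `if` statement correspond one to one to the three parts of the query, which may help when adapting the other queries in the same way.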
Download
You can download the latest version of the treebank from the dev branch here, or switch to the master branch for possibly older stable releases.
Coptic Universal Dependency Treebank (CoNLLU format)
Download multilayer data including syntax in various formats for Not Because a Fox Barks
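CoNLL-U files store one token per line in ten tab-separated columns, with blank lines between sentences and comment lines starting with `#`. The reader below is a minimal sketch that runs on a toy English snippet invented here; the function name `read_conllu` is hypothetical, and a dedicated CoNLL-U library is likely a more robust choice for real work.

```python
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# Toy CoNLL-U snippet (invented, not from the treebank).
SAMPLE = "\n".join([
    "# text = Mary reads books",
    "\t".join(["1", "Mary", "Mary", "PROPN", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "reads", "read", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", "books", "book", "NOUN", "_", "_", "2", "obj", "_", "_"]),
    "",
])


def read_conllu(text):
    """Parse CoNLL-U text into a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():            # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):        # sentence-level comment
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if not cols[0].isdigit():       # skip ranges like 1-2 and empty nodes like 1.1
            continue
        tok = dict(zip(FIELDS, cols))
        tok["id"] = int(tok["id"])
        tok["head"] = int(tok["head"]) if tok["head"] != "_" else None
        current.append(tok)
    if current:                         # file may not end with a blank line
        sentences.append(current)
    return sentences


sentences = read_conllu(SAMPLE)
print(len(sentences), [t["deprel"] for t in sentences[0]])
```

Skipping multiword token ranges keeps only the syntactic words that carry a head and relation, which is what the queries above operate on.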
Contribute
If you know Coptic and would like to join the effort to extend the Coptic Treebank, please contact Amir Zeldes.
Special thanks are due to Kim Gerdes for creating Arborator, the annotation interface used to build the treebank and the tree visualizations above.