Using ANNIS for search in Coptic corpora

Introduction

The ANNIS search and visualization platform offers highly complex search capabilities for texts provided by Coptic Scriptorium. To get started using ANNIS, go to:

https://annis.copticscriptorium.org/annis/scriptorium

The interface shows a search box at the top left with Coptic works below, and some example queries on the right. All queries in this tutorial are linked to searches in the small Coptic Treebank, but you can change this in the bottom left list.

[Figure: ANNIS interface]

Cheat sheet

Note: by default, searches are run on the Coptic Treebank only (you can select other or additional corpora in ANNIS).

Words

Where are the words in Coptic?

Words in Coptic can be complex, and the fact that manuscripts use various spellings for words complicates things further. Some terminology can help:

  • Bound groups - these are the Coptic units we are used to seeing written between spaces, for example ⲁϥⲥⲱⲧⲙ 'he has heard' is a bound group
  • Norm units - these are the components of bound groups, some of which can appear by themselves (like ⲥⲱⲧⲙ 'hear' above) and some of which always appear bound (like the past tense marker ⲁ). Norm units always have a part of speech, such as noun (N) or verb (V), or even auxiliary in the case of ⲁ.
  • Morphs - these units are prefixes or suffixes smaller than norm units, and do not have their own part of speech, for example the complex norm unit ⲙⲛⲧ-ⲁⲧ-ⲥⲱⲧⲙ 'disobedience' has three morphs: the abstract prefix ⲙⲛⲧ (a little like English -ness), ⲁⲧ (like dis-) and ⲥⲱⲧⲙ. Notice that although ⲥⲱⲧⲙ is usually a norm unit (a verb), in this case it is only a smaller morph, since it is part of a bigger noun.

Use the cheat sheet for some commonly used query types, as well as the explanations below. Also see our overview of annotation guidelines for some common annotation practices.

How to search for norms, morphs and groups

You can search for norm units, groups and morphs in ANNIS like this:
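The original example queries are not reproduced here, so the following is only a sketch. It assumes the example forms from the terminology list as search terms; each line is a separate query to paste into the query box:

```aql
norm="ⲥⲱⲧⲙ"
norm_group="ⲁϥⲥⲱⲧⲙ"
morph="ⲙⲛⲧ"
```

The first query looks for a norm unit, the second for a whole bound group, and the third for a morph inside a complex norm unit.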

Enter the query in the query box and click Search or press Ctrl+Enter (or click the link above). Once you have a search result, you can view all of the available annotations for each result by expanding the [+] next to each annotation layer. Expanding the annotations grid will show you the available annotation layers that can be searched for, such as norm and morph below.

[Figure: annotation grid]

Using orig and orig_group for orthographic variants

Sometimes we want to search not for norm units, but for original spellings found in manuscripts. You can use the following queries to find specific spellings, including supralinear strokes and other diacritics:
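As a sketch, the queries below search for spellings written with a supralinear stroke. The two spellings are assumptions chosen for illustration, and whether they return results depends on the corpora selected:

```aql
orig="ϩⲛ̄"
orig_group="ⲁϥⲥⲱⲧⲙ̄"
```

The first query targets a single unit in its original spelling, the second an entire bound group.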

Note that diacritics will always be removed in the norm annotations, but will be retained in orig if available in the original transcription. After you have run a search, you can also toggle the visualization between original and normalized spelling by choosing the Base text drop down at the top of your search results and switching between norm_group (the default) and orig_group. It is also possible to switch to norm or orig, to see the text segmented into units, rather than bound groups.

Searching for lemmas

If you want to find all forms of an inflected word, you can search for lemmas instead of norm forms. For example:
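Going by the forms this section describes, the lemma query presumably looks like the following (the search term is inferred from the description rather than quoted):

```aql
lemma="ⲕⲱⲧ"
```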

This search finds the absolute form ⲕⲱⲧ, but also the reduced form ⲕⲟⲧ and even the stative form ⲕⲏⲧ, which all share the same lemma or dictionary entry. In the annotation grid, lemmas are clickable and link to a search in the Coptic Dictionary Online.

Wild cards and regular expressions

Sometimes it can be useful to search for units or bound groups containing some letter or letters. We can do this using wildcard, or 'regular expression', searches. You can run such a search on any annotation layer by using slashes instead of double quotes, and the following operators:

  • . - any single character
  • ? - makes the preceding character optional
  • * - the preceding character any number of times (including zero times)

For example, you can run these searches:
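The two queries can be reconstructed from the explanation that follows them, with the caveat that the layer names norm and norm_group are assumptions. A pattern has to match the whole annotation value, which is why the second query ends in .* to allow any continuation:

```aql
norm=/ⲥ.ⲧⲡ/
norm_group=/ⲉ?ⲛⲧ.*/
```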

The first example searches for words like ⲥⲱⲧⲡ, ⲥⲟⲧⲡ or ⲥⲉⲧⲡ, with the dot indicating 'any character'. The second example searches for groups that begin with ⲉⲛⲧ or ⲛⲧ (the ⲉ is made optional by the '?') and may end with anything: .* means any character, any number of times, so it allows the group to end in any way.

There are more regular expression operators available - for more information on regular expression operators, see http://www.regular-expressions.info/.

Tags

Coptic Scriptorium data is annotated with grammatical parts of speech for every norm/orig unit.

The part of speech tagset

The possible parts of speech are divided into coarse categories, which can be searched for using regular expressions:

Tag      Name                     Examples
A.*      Auxiliary                [ϥ], ⲙⲉ[ϥ], ⲧⲣⲉ[ϥ], ...
ADV      Adverb                   ⲉⲃⲟⲗ, ⲟⲛ, ⲡⲱⲥ
ART      Article                  (), (), (), ϩⲉⲛ, ⲕⲉ
C.*      Converter                , ⲉⲧⲉ, ⲛⲉ, ...
CONJ     Conjunction              ⲁⲩⲱ, ϫⲉ, , ⲙⲏ, ⲉⲓⲧⲉ, ...
COP      Copula                   ⲡⲉ/ⲧⲉ/ⲛⲉ
EXIST    Existential/possessive   ⲟⲩⲛ/ⲙⲛ
FM       Foreign material         ⲡⲁⲣⲁ ⲧⲟⲩⲧⲟ
FUT      Future                   ⲛⲁ
IMOD     Inflected modifier       ⲧⲏⲣ[ϥ], ϩⲱⲱ[], ...
N.*      Noun                     ⲁⲑⲏⲧ, ⲣⲱⲙⲉ, ⲁⲣⲭⲏ, ...
NEG      Negation                 , ⲁⲛ, ⲧⲙ[ⲥⲱⲧⲙ]
NUM      Numeral                  ⲟⲩⲁ, ⲥⲛⲁⲩ, ...
PDEM     Pronoun, demonstrative   ⲡⲉⲓ/ⲡⲁⲓ, ⲧⲉⲓ/ⲧⲁⲓ, ⲛⲉⲓ/ⲛⲁⲓ
PINT     Pronoun, interrogative   ⲟⲩ, ⲛⲓⲙ
PPER.*   Pronoun, personal        ϥ,,,ϯ,,ⲁⲛⲟⲕ,ⲁⲛⲅ̄,...
PPOS     Pronoun, possessive      ⲡⲉϥ,ⲧⲉⲧⲛ̄,ⲡⲟⲩ,ⲡⲁ,ⲡⲱⲓ,...
PREP     Preposition              ⲉⲧⲃⲉ, ϩⲛ̄, , ⲙ̄ⲙⲟ[ϥ], ...
PTC      Particle                 ⲇⲉ, ⲛ̄ϭⲓ, ...
PUNCT    Punctuation              . , · ...
UNKNOWN  Unknown, lacuna          _ _ _, _ _ⲟⲥ, _ _ _, ...
V.*      Verb                     ⲥⲱⲧⲙ, ⲥⲱⲧⲡ, ⲥⲟⲧⲡ, ⲉⲓⲣⲉ, , ⲁⲣⲓ, ...
VBD      Verboid                  ⲛⲁⲛⲟⲩ[ϥ], ⲡⲉϫⲁ[ϥ], ⲡⲉϫⲉ,...

Each of the tags containing wild cards stands for multiple options, for example V.* encompasses V (a regular verb), VSTAT (stative verb) and VIMP (inflected imperative verbs). For complete documentation of fine-grained POS tags, see the documentation.

Searching for words with tags

Some example searches using wild cards for coarse POS or exact matches for fine POS categories:
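A minimal sketch, assuming the part of speech layer is called pos (the name is not spelled out in this tutorial). The first and third queries use a coarse category with a wildcard, the second an exact fine-grained tag:

```aql
pos=/V.*/
pos="VSTAT"
pos=/PPER.*/
```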

We can combine the search for words and tags using the operator _=_, which means 'in the same place' or 'covering the same span of text'. For example, the following searches for verbs starting with a particular letter:
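For instance, assuming the layer names pos and norm and picking ⲥ as an arbitrary initial letter, such a query could be written as:

```aql
pos=/V.*/ _=_ norm=/ⲥ.*/
```

Both search terms must cover exactly the same span, so only units that are tagged as some kind of verb and also begin with the chosen letter are returned.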

Language of origin

For words of foreign origin, Scriptorium tags the earliest language of origin using lang, as follows:
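A sketch of such queries, assuming that language values are spelled as capitalized English names:

```aql
lang="Greek"
lang="Hebrew"
```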

Note that names of Hebrew origin, such as ⲁⲃⲓⲙⲉⲗⲉⲭ, are tagged as Hebrew, not Greek. It is also possible to combine language of origin with part of speech, for example to find verbs of Greek origin:
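Assuming that the part of speech layer is named pos and that the language value is spelled "Greek", a combined query might read:

```aql
lang="Greek" _=_ pos="V"
```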

Note also that if a verb is complex, such as -ⲭⲣⲉⲓⲁ, the search above will not find it, since the word as a whole is not Greek. In such cases, only a search on the morph level will recover the Greek language of origin. See more about searching within spans under 'Searching for longer span annotations'.

Sequences

To search for multiple words or bound groups we must specify the order in which they appear, and possibly the distance. Sequences of words and annotations work similarly, and can be mixed freely.

Words

The following queries illustrate searching for two adjacent norm units, three norm units, two adjacent bound groups, etc. The operator . indicates that two search terms are adjacent:
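The queries below sketch these cases in order, reusing the bound group ⲁϥⲥⲱⲧⲙ and its norm units from the terminology section; the search terms are illustrative assumptions:

```aql
norm="ⲁ" . norm="ϥ"
norm="ⲁ" . norm="ϥ" . norm="ⲥⲱⲧⲙ"
norm_group="ⲁϥⲥⲱⲧⲙ" . norm_group="ⲇⲉ"
```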

We can also specify a range of possible distances between words, for example the auxiliary followed by ⲇⲉ within 1-10 tokens:
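A possible form of this query, assuming the auxiliary ⲁ as the first term; the two numbers after the operator give the minimum and maximum distance:

```aql
norm="ⲁ" .1,10 norm="ⲇⲉ"
```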

If order does not matter, we can use the operator ^ instead, which can be used with or without token ranges, just like .:
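As a sketch with the same assumed search terms ⲁ and ⲇⲉ, the first query finds the two units directly next to each other in either order, and the second allows 1-10 tokens between them in either direction:

```aql
norm="ⲁ" ^ norm="ⲇⲉ"
norm="ⲁ" ^1,10 norm="ⲇⲉ"
```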

If we don't care how far apart two terms are, we can also use .* ('any distance forward') and ^* ('any distance in any direction').

Words and annotations

Combining annotation and word search is possible too:
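A sketch of such a query, assuming the layer name pos and taking ⲛ as an illustrative norm unit that can carry more than one tag:

```aql
norm="ⲛ" _=_ pos="PREP"
```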

This finds the norm unit , but only if it is also tagged as PREP.

Using value negation

Sometimes it makes sense to ask for all values except for something, usually in combination with some positive search. You can negate values with != instead of =:
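For example, assuming the layer name pos and the illustrative norm unit ⲛ, a negated query could look like this:

```aql
norm="ⲛ" _=_ pos!="PREP"
```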

This finds all cases of the norm unit which are not prepositions.

Spans

If you look at an annotation grid for any search result, you will notice longer span annotations, such as translation, entity or multiword, which give translations, entity types (person, place etc.) and multiword expressions respectively. Searching for these by themselves works as usual with both exact and wildcard searches, as well as multiple adjacent spans with .:
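A sketch of the three cases mentioned (exact match, wildcard match, adjacent spans). The values person and place are the entity types named above; the word searched for in the translation is an arbitrary choice:

```aql
entity="person"
translation=/.*barley.*/
entity="person" . entity="place"
```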

However, sometimes we want to express complex overlap relations between spans, such as a translation or entity span containing a certain word. For example, we can find instances of ⲉⲓⲱⲧ meaning 'barley' rather than 'father' in two different ways, using the operator _i_, which means that one span includes another (i.e. the second span is nested in the first and is smaller than or equal to it):
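One plausible reading of the 'two different ways' is the inline operator form and the equivalent form that declares both terms first and then relates them by number. This is an assumption about the missing queries, as is the choice of the norm layer for the word:

```aql
translation=/.*barley.*/ _i_ norm="ⲉⲓⲱⲧ"
translation=/.*barley.*/ & norm="ⲉⲓⲱⲧ" & #1 _i_ #2
```

In both forms the translation span is the first term and must include the norm unit, which is the second.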

Metadata

You can see what metadata a corpus or document has by clicking on the i-button for that corpus, or for the search result that comes from a particular document:

[Figure: metadata]

To search using metadata criteria, just use the annotation name and value after the prefix meta:: and add them to your query with & like this (regular expression wildcards work as usual):
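A minimal sketch of the pattern. The metadata names author and title and their values are hypothetical; check the i-button of your corpus for the names it actually provides:

```aql
norm="ⲥⲱⲧⲙ" & meta::author="Shenoute"
norm="ⲥⲱⲧⲙ" & meta::title=/.*Abraham.*/
```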

Syntax

Functions and dependencies

Norm units in each sentence are connected by dependencies which express their grammatical functions. To search for these functions, you can use the func annotation, for example to search for nominal or clausal subjects (nsubj and csubj), or objects (obj):
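As a sketch, the first query below uses the single-character wildcard to cover both nsubj and csubj, and the second searches for objects:

```aql
func=/.subj/
func="obj"
```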

You can see how these functions connect to other words by expanding the syntax visualization (the complete list of grammatical functions can be found in the Coptic Universal Dependencies Documentation).

[Figure: syntactic dependencies]

Using the dependency relation operator ->dep we can also constrain functions to attach to certain words or parts of speech, for example to search for objects of the verb ϯ 'give', or for cases of fronted dislocated arguments (e.g. "Me, I haven't seen him"):
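The two queries might be written as follows. This is a reconstruction from the surrounding explanation; the use of lemma for the verb ϯ and of pos for the verb tag are assumptions:

```aql
lemma="ϯ" ->dep func="obj"
pos=/V.*/ ->dep func="dislocated" & #2 .* #1
```

In each query the term before ->dep is the governing word and the term after it is the dependent carrying the function label.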

In the second example, we want to specify both the dislocated dependency between the two search terms (with ->dep) and the ordering (dislocated argument precedes verb). Since we cannot use two operators simultaneously, we add a condition using &, then specify that the second term we declared (func="dislocated", now referred to as #2) should precede (.*) the first term we declared (the verb, now #1).

Full list of func labels

label        description
acl          adjunct adnominal clause predicate ([ⲧⲉⲝⲟⲩⲥⲓⲁ ⲉⲧⲣⲁ]ⲙⲟⲟϣⲉ)
acl:relcl    relative clause predicate ([ⲉⲛⲧⲁϥ]ⲥⲱⲧⲡ)
advcl        adverbial clause predicate ([ⲉⲃⲟⲗ ϫⲉ ϯ] ⲙⲙⲁⲩ)
advmod       adverb (ⲙⲙⲁⲩ, ⲕⲁⲗⲱⲥ)
amod         adjectival modifier ([ϣⲏⲣⲉ] ϣⲏⲙ)
appos        apposition ([ⲡⲣⲣⲟ, ] ⲇⲓⲟⲕⲗⲏϯⲁⲛⲟⲥ)
aux          auxiliary (, ⲙⲡⲉ, ϣⲁⲣⲉ)
case         case marker such as a preposition (ϩⲛ, )
cc           coordinating conjunction (ⲁⲩⲱ, , ⲙⲛ)
ccomp        complement clause predicate ([ⲡⲉϫⲁϥ ϫⲉ]ⲁⲛⲁⲩ)
compound     part of compound word, usually a complex number (ⲙⲏⲧ [ⲛϣⲉ])
conj         coordinate head ([ϩⲉⲛϣⲏⲧⲉ ⲙⲛ ϩⲉⲛ]ϣⲉⲉⲣⲉ)
cop          copula dependent of predicate ([ⲟⲩ ⲣⲱⲙⲉ] ⲡⲉ)
csubj        clausal subject ([ϣϣⲉ ]ϣⲗⲏⲗ)
dep          unspecified dependency/other
det          article or other determiner (ⲟⲩ, ⲡⲉⲓ)
discourse    interjections (, ϩⲁⲙⲏⲛ)
dislocated   second realization of argument out of place ([]ⲣⲱⲙⲉ [ⲁϥⲥⲱⲧⲙ], [ⲁϥⲥⲱⲧⲙ ⲛϭⲓⲡ]ⲣⲱⲙⲉ)
fixed        non-initial token in a fixed expression ([ⲉⲃⲟⲗ] ϩⲛ)
flat         non-initial part of a name ([ⲁⲡⲁ] ⲡⲟⲓⲙⲏⲛ)
iobj         indirect object in possession ([ⲟⲩⲛⲧ]ϥ [ϭⲟⲙ])
mark         subordinating or clause-introducing conjunction (ⲉⲣⲉ, ϫⲉ)
nmod         adnominal prepositional phrase ([ⲙⲁ ] ϣⲱⲡⲉ, [ⲟⲩ ⲣⲱⲙⲉ ϩⲓ ] ϫⲁⲉⲓⲉ)
nsubj        nominal subject (ϥ[ⲥⲱⲧⲙ], []ⲏⲓ [ⲕⲏⲧ])
nummod       numeric modifier ([ⲣⲱⲙⲉ] ⲥⲛⲁⲩ, ϣⲟⲙⲛⲧ [ⲛϩⲟⲟⲩ])
obj          direct object ([ⲥⲟⲧⲡ]ϥ, [ⲙⲙⲟ]ϥ)
obl          oblique/adverbial prepositional phrase ([ⲛⲁⲩ ⲉⲣⲟ]ϥ, [ϩⲙ ]ⲏⲓ)
obl:npmod    oblique noun phrase ([]ⲟⲩⲁ [ⲡⲟⲩⲁ])
orphan       links arguments whose joint head is elliptical
parataxis    additional phrase head without explicit coordination ([ⲁϥⲃⲱⲕ ⲁϥ]ⲛⲁⲩ )
punct        punctuation (., )
reparandum   marks head of erroneous or dysfluent material
root         main predicate (ⲡⲉϫⲉ, ⲥⲱⲧⲙ)
vocative     used in appellations ([ ]ⲣⲱⲙⲉ)
xcomp        external complement with shared object, usually infinitive/causative ([ⲁϥⲃⲱⲕ ]ⲛⲁⲩ, [ⲉⲧⲣⲉϥ]ⲥⲱⲧⲙ)

Frequencies

Once a query has been formulated, we can select More -> Frequencies to get a frequency breakdown of the values for the annotations we searched for. In such cases, it often makes sense to leave some annotation values unspecified. For example, we can look for a breakdown of all Greek origin words by specifying the language and asking for a lemma with no constraints:
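Such a query could be sketched as follows, assuming the language value is spelled "Greek". Writing the annotation name lemma without a value leaves it unconstrained:

```aql
lang="Greek" _=_ lemma
```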

This will match any lemma with Greek language, and the frequency breakdown will provide counts for each type.

[Figures: frequency query; frequency breakdown]

Alternatively, we can download all results matching this query using the CSV exporter by selecting More -> Export, then clicking Perform Export and finally using Download when the export is ready. For more information on different export formats, see the ANNIS documentation.

Citing

Please see the citation guidelines here for how to cite search results in academic papers.
