Using ANNIS for search in Coptic corpora

Introduction

The ANNIS search and visualization platform offers highly complex search capabilities for texts provided by Coptic Scriptorium. To get started using ANNIS, go to:

https://annis.copticscriptorium.org/annis/scriptorium

The interface shows a search box at the top left with Coptic works below, and some example queries on the right. All queries in this tutorial are linked to searches in the small Coptic Treebank, but you can change this in the bottom left list.

ANNIS interface

Cheat sheet

Note searches are run on the Coptic Treebank only by default (you can select other/more corpora in ANNIS)

norm="ⲧⲉⲓ" (standard form)
orig="ⲧⲉⲉⲓ" (form as spelled in manuscript)
lemma="ⲡⲉⲓ" (dictionary entry)
norm=/.*ⲟⲥ/ (form ends in -ⲟⲥ)
lang (any item with a language annotation such as Greek, Latin, Hebrew, etc.)
lang="Greek" (exactly all Greek origin terms)
pos="V" _o_ lang="Greek" (a verb overlapping Greek material, incl. either part or all of a word)
pos="VSTAT" (all stative verbs)
pos=/V.*/ (any kind of verb)
pos=/CCIRC|CFOC/ (a circumstantial or a focalizing converter)
morph="ⲙⲛⲧ" (part of a complex unit)
norm_group="ⲙⲡⲉⲭⲣⲓⲥⲧⲟⲥ" (whole bound group)
orig_group="ⲙ̇ⲡⲉⲭ︦ⲥ︦" (whole group as spelled)

norm="ⲛϭⲓ" . norm="ⲧ" (ⲛϭⲓ followed by feminine article)
lang="Greek" . lang="Greek" (two adjacent Greek words)
norm=/ⲛⲧⲟⲕ?/ .1,3 lemma="ⲡⲉ" (ⲛⲧⲟ or ⲛⲧⲟⲕ followed by lemma ⲡⲉ within 3 tokens)
lemma="ⲁⲩⲱ" ^1,5 lemma="ⲟⲛ" (the words ⲁⲩⲱ and ⲟⲛ within 5 tokens of each other)
pos="N" . pos="PPERO" (noun followed by clitic pronoun)

norm="ⲁ" . norm="ϥ" . norm="ⲥⲱⲧⲙ" (the sequence ⲁ-ϥ-ⲥⲱⲧⲙ)
lemma="ⲟⲩ" . norm="ⲉⲃⲟⲗ" . lemma="ϩⲛ" (ⲟⲩ-ⲉⲃⲟⲗ followed by lemma ϩⲛ)
norm="ⲣ" . lemma="ⲡ" . pos="N" (ⲣ, definite article, Noun)
norm="ⲉ" . pos="V" . pos="PPERO" (to-infinitive followed by clitic pronoun)

func="acl:relcl" (relative clauses)
pos="V" ->dep func="dislocated" (verb governing a dislocated argument)
pos="V" ->dep func="dislocated" & #2 .* #1 (verb governing a dislocated argument which precedes the verb)
norm="ⲉⲃⲟⲗ" . lemma="ϩⲛ" _=_ func="fixed" (the fixed expression ⲉⲃⲟⲗ ϩⲛ meaning 'out-of')
norm="ⲉⲃⲟⲗ" . lemma="ϩⲛ" _=_ func="case" (ⲉⲃⲟⲗ followed by a separate ϩⲛ meaning 'out ... in')
norm=/.*ⲙ/ _=_ lemma=/.*ⲛ/ _=_ pos="PREP" (preposition ending in -ⲙ with lemma in -ⲛ)

lang="Latin" & meta::msName=/MONB.*/ (Latin words in MONB manuscripts from the White Monastery)
norm="ⲛⲟⲩⲧⲉ" & meta::annotation=/.*Krawiec.*/ (the word ⲛⲟⲩⲧⲉ in documents edited by Rebecca Krawiec)
translation=/.*Lord.*/ & meta::translation=/.*Budge.*/ (the word 'Lord' in translations by Budge)
norm="ⲉⲛⲉϩ" & meta::redundant="no" (look for ⲉⲛⲉϩ excluding 'redundant' parallel witnesses)
norm="ⲉⲛⲉϩ" & meta::redundant="yes" (look for ⲉⲛⲉϩ in 'redundant' parallel witnesses)

hi_rend="ekthetic" (ekthetic letters)
translation _l_ hi_rend="ekthetic" (ekthetic and sentence initial)
translation _l_ tok .* hi_rend="ekthetic" & #1 _i_ #3 (ekthetic and non-sentence initial)
note=/.*sic.*/ (note annotation indicating scribal error)
morph="ⲣ" . morph _=_ lang="Greek" (complex verb in ⲣ + Greek stem)
lb_n _i_ norm_group=/.*ⲱ.*/ & lb_n _i_ norm_group=/.*ⲁⲡϫ.*/ & #1 . #3 (two consecutive lines, the first containing ⲱ and the second ⲁⲡϫ)

Words

Where are the words in Coptic?

Words in Coptic can be complex, and the fact that manuscripts use various spellings for words complicates things further. Some terminology can help:

Bound groups - these are the Coptic units we are used to seeing written between spaces, for example ⲁϥⲥⲱⲧⲙ 'he has heard' is a bound group
Norm units - these are the components of bound groups, some of which can appear by themselves (like ⲥⲱⲧⲙ 'hear' above) and some of which always appear bound (like the past tense marker ⲁ). Norm units always have a part of speech, such as being a noun (N) or a verb (V), or even an auxiliary in the case of ⲁ.
Morphs - these units are prefixes or suffixes smaller than norm units, and do not have their own part of speech, for example the complex norm unit ⲙⲛⲧ-ⲁⲧ-ⲥⲱⲧⲙ 'disobedience' has three morphs: the abstract prefix ⲙⲛⲧ (a little like English -ness), ⲁⲧ (like dis-) and ⲥⲱⲧⲙ. Notice that although ⲥⲱⲧⲙ is usually a norm unit (a verb), in this case it is only a smaller morph, since it is part of a bigger noun.

Use the cheat sheet for some commonly used query types, as well as the explanations below. Also see our overview of annotation guidelines for some common annotation practices.

How to search for norms, morphs and groups

You can search for norm units, groups and morphs in ANNIS like this:

Enter the query in the query box and click Search or hit ctrl+Enter (or click the link above). Once you have a search result, you can view all of the available annotations for each result by expanding the [+] next to each annotation layer. Expanding the annotations grid will show you the available annotation layers that can be searched for, such as norm and morph below.

annotation grid

Using orig and orig_group for orthographic variants

Sometimes we want to search not for norm units, but for original spellings found in manuscripts. You can use the following queries to find specific spellings, including supralinear strokes and other diacritics:

Note that diacritics will always be removed in the norm annotations, but will be retained in orig if available in the original transcription. After you have run a search, you can also toggle the visualization between original and normalized spelling by choosing the Base text drop down at the top of your search results and switching between norm_group (the default) and orig_group. It is also possible to switch to norm or orig, to see the text segmented into units, rather than bound groups.

Searching for lemmas

If you want to find all forms of an inflected word, you can search for lemmas instead of norm forms. For example:

lemma="ⲕⲱⲧ"

This search finds the absolute form ⲕⲱⲧ, but also the reduced form "ⲕⲟⲧ" and even the stative form "ⲕⲏⲧ", which all have the same lemma or dictionary entry. In the annotation grid, lemmas are clickable and link to a search in the Coptic Dictionary Online.

Wild cards and regular expressions

Sometimes it can be useful to search for units or bound group containing some letter or letters. We can do this using wildcard, or 'regular expression' searches. You can run such a search on any annotation layer by using slashes instead of double quotes, and the following operators:

. - any single character
? - makes the preceding character optional
* - the preceding character any number of times (including zero times)

For example, you can run these searches:

The first example searches for words like ⲥⲱⲧⲡ, ⲥⲟⲧⲡ or ⲥⲉⲧⲡ, with the dot indicating 'any character'. The second example searches for groups beginning with ⲉⲛⲧ or ⲛⲧ (the ⲉ is made optional by the '?'), and may end with anything - .* means any character, any number of times, so it allows the group to end in any way.

There are more regular expression operators available - for more information on regular expression operators, see http://www.regular-expressions.info/.

Tag	Name	Examples
A.*	Auxiliary	ⲁ[ϥ], ⲙⲉ[ϥ], ⲧⲣⲉ[ϥ], ...
ADV	Adverb	ⲉⲃⲟⲗ, ⲟⲛ, ⲡⲱⲥ
ART	Article	ⲡ(ⲉ), ⲧ(ⲉ), ⲛ(ⲉ), ϩⲉⲛ, ⲕⲉ
C.*	Converter	ⲉ, ⲉⲧⲉ, ⲛⲉ, ...
CONJ	Conjunction	ⲁⲩⲱ, ϫⲉ, ⲏ, ⲙⲏ, ⲉⲓⲧⲉ, ...
COP	Copula	ⲡⲉ/ⲧⲉ/ⲛⲉ
EXIST	Existential/possessive	ⲟⲩⲛ/ⲙⲛ
FM	Foreign material	ⲡⲁⲣⲁ ⲧⲟⲩⲧⲟ
FUT	Future	ⲛⲁ
IMOD	Inflected modifier	ⲧⲏⲣ[ϥ], ϩⲱⲱ[ⲧ], ...
N.*	Noun	ⲁⲑⲏⲧ, ⲣⲱⲙⲉ, ⲁⲣⲭⲏ, ...
NEG	Negation	ⲛ, ⲁⲛ, ⲧⲙ[ⲥⲱⲧⲙ]
NUM	Numeral	ⲟⲩⲁ, ⲥⲛⲁⲩ, ...
PDEM	Pronoun, demonstrative	ⲡⲉⲓ/ⲡⲁⲓ, ⲧⲉⲓ/ⲧⲁⲓ, ⲛⲉⲓ/ⲛⲁⲓ
PINT	Pronoun, interrogative	ⲟⲩ, ⲛⲓⲙ
PPER.*	Pronoun, personal	ϥ,ⲥ,ⲓ,ϯ,ⲛ,ⲁⲛⲟⲕ,ⲁⲛⲅ̄,...
PPOS	Pronoun, possessive	ⲡⲉϥ,ⲧⲉⲧⲛ̄,ⲡⲟⲩ,ⲡⲁ,ⲡⲱⲓ,...
PREP	Preposition	ⲉⲧⲃⲉ, ϩⲛ̄, ⲛ, ⲙ̄ⲙⲟ[ϥ], ...
PTC	Particle	ⲇⲉ, ⲛ̄ϭⲓ, ...
PUNCT	Punctuation	. , · ...
UNKNOWN	Unknown, lacuna	ⲃ_ _ _, _ _ⲟⲥ, _ _ _, ...
V.*	Verb	ⲥⲱⲧⲙ, ⲥⲱⲧⲡ, ⲥⲟⲧⲡ, ⲉⲓⲣⲉ, ⲟ, ⲁⲣⲓ, ...
VBD	Verboid	ⲛⲁⲛⲟⲩ[ϥ], ⲡⲉϫⲁ[ϥ], ⲡⲉϫⲉ,...

Sequences

To search for multiple words or bound groups we must specify the order in which they appear, and possibly the distance. Sequences of words and annotations work similarly, and can be mixed freely.

Words

The following queries illustrate searching for two adjacent norm units, three norm units, two adjacent bound groups, etc. The operator . indicates that two search terms are adjacent:

We can also specify a range of possible distances between words, for example the auxiliary ⲁ followed by ⲇⲉ within 1-10 tokens:

norm="ⲁ" .1,10 norm="ⲇⲉ"

If order does not matter, we can use the operator ^ instead, which can be used with or without token ranges, just like .:

norm="ⲁ" ^ norm="ⲇⲉ" (ⲁ followed by ⲇⲉ or the opposite order)
norm="ⲁ" ^1,10 norm="ⲇⲉ" (ⲁ and ⲇⲉ within 10 tokens of each other, in either order)

If we don't care how far two terms are, we can also use .* ('any distance forward') and ^* ('any distance in any direction').

Words and annotations

Combining annotation and word search is possible too:

norm="ⲛ" _=_ pos="PREP"

This finds the norm unit ⲛ, but only if it is also tagged as PREP.

Using value negation

Sometimes it makes sense to ask for all values except for something, usually in combination with some positive search. You can negate values with != instead of =:

norm="ⲛ" _=_ pos!="PREP"

This finds all cases of the norm unit ⲛ which are not prepositions.

Spans

If you look at an annotation grid for any search result, you will notice longer span annotations, such as translation, entity or multiword, which give translations, entity types (person, place etc.) and multiword expressions respectively. Searching for these by themselves works as usual with both exact and wildcard searches, as well as multiple adjacent spans with .:

entity="place" (find all place annotations)
translation=/.*[Bb]elieve.*/ (find all translations containing 'believe', lower or upper case)
entity="person" . entity!="person" (find a person entity followed by a non-person entity)

However sometimes we want to express complex overlap relations between spans, such as a translation or entity span containing a certain word. For example, we can find instances of ⲉⲓⲱⲧ meaning 'barley' rather than 'father' in two different ways, using the operator _i_, which means that one span includes another (i.e. the second span is nested and smaller or equal to the first one:

Metadata

You can see what metadata a corpus or document has by clicking on the i-button for that corpus, or for the search result that comes from a particular document:

metadata

To search using metadata criteria, just use the annotation name and value after the prefix meta:: and add them to your query with & like this (regular expression wildcards work as usual):

Syntax

Functions and dependencies

Norm units in each sentence are connected by dependencies which express their grammatical functions. To search for these functions, you can use the func annotation, for example to search for nominal or clausal subjects (nsubj and csubj), or objects (obj):

func=/.subj/ (find both nominal and clausal subjects)
func=/obj/ (find direct objects)

You can see how these functions connect to other words by expanding the syntax visualization (the complete list of grammatical functions can be found in the Coptic Universal Dependencies Documentation).

syntactic dependencies

Using the dependency relation operator ->dep we can also constrain functions to attach to certain words or parts of speech, for example to search for objects of the verb ϯ 'give', or for cases of fronted dislocated arguments (e.g. "Me, I haven't seen him"):

lemma="ϯ" ->dep func="obj" (the verb ϯ and the object it governs)
pos="V" ->dep func="dislocated" & #2 .* #1 (a Verb and its dislocated argument, which precedes the verb)

In the second example, we want to specify both the dislocated dependency between the two search terms (with ->dep) and the ordering (dislocated argument precedes verb). Since we cannot use two operators simultaneously, we add a condition using &, then specify that the second term we declated (func="dislocated", now referred to as #2) should precede (.*) the first thing we declared (the verb, now #1).

Full list of func labels

label	description
acl	adjunct adnominal clause predicate ([ⲧⲉⲝⲟⲩⲥⲓⲁ ⲉⲧⲣⲁ]ⲙⲟⲟϣⲉ)
acl:relcl	relative clause predicate ([ⲉⲛⲧⲁϥ]ⲥⲱⲧⲡ)
advcl	adverbial clause predicate ([ⲉⲃⲟⲗ ϫⲉ ϯ] ⲙⲙⲁⲩ)
advmod	adverb (ⲙⲙⲁⲩ, ⲕⲁⲗⲱⲥ)
amod	adjectival modifier ([ϣⲏⲣⲉ] ϣⲏⲙ)
appos	apposition ([ⲡⲣⲣⲟ, ] ⲇⲓⲟⲕⲗⲏϯⲁⲛⲟⲥ)
aux	auxiliary (ⲁ, ⲙⲡⲉ, ϣⲁⲣⲉ)
case	case marker such as a preposition (ϩⲛ, ⲛ)
cc	coordinating conjunction (ⲁⲩⲱ, ⲏ, ⲙⲛ)
ccomp	complement clause predicate ([ⲡⲉϫⲁϥ ϫⲉ]ⲁⲛⲁⲩ)
compound	part of compound word, usually a complex number (ⲙⲏⲧ [ⲛϣⲉ])
conj	coordinate head ([ϩⲉⲛϣⲏⲧⲉ ⲙⲛ ϩⲉⲛ]ϣⲉⲉⲣⲉ)
cop	copula dependent of predicate ([ⲟⲩ ⲣⲱⲙⲉ] ⲡⲉ)
csubj	clausal subject ([ϣϣⲉ ⲉ]ϣⲗⲏⲗ)
dep	unspecified dependency/other
det	article or other determiner (ⲟⲩ, ⲡⲉⲓ)
discourse	interjections (ⲱ, ϩⲁⲙⲏⲛ)
dislocated	second realization of argument out of place ([ⲡ]ⲣⲱⲙⲉ [ⲁϥⲥⲱⲧⲙ], [ⲁϥⲥⲱⲧⲙ ⲛϭⲓⲡ]ⲣⲱⲙⲉ)
fixed	non-initial token in a fixed expression ([ⲉⲃⲟⲗ] ϩⲛ)
flat	non-initial part of a name ([ⲁⲡⲁ] ⲡⲟⲓⲙⲏⲛ)
iobj	indirect object in possession ([ⲟⲩⲛⲧ]ϥ [ϭⲟⲙ])
mark	subordinating or clause-introducing conjunction (ⲉⲣⲉ, ϫⲉ)
nmod	adnominal prepositional phrase ([ⲙⲁ ⲛ] ϣⲱⲡⲉ, [ⲟⲩ ⲣⲱⲙⲉ ϩⲓ ⲡ] ϫⲁⲉⲓⲉ)
nsubj	nominal subject (ϥ[ⲥⲱⲧⲙ], [ⲡ]ⲏⲓ [ⲕⲏⲧ])
nummod	numeric modifier ([ⲣⲱⲙⲉ] ⲥⲛⲁⲩ, ϣⲟⲙⲛⲧ [ⲛϩⲟⲟⲩ])
obj	direct object ([ⲥⲟⲧⲡ]ϥ, [ⲙⲙⲟ]ϥ)
obl	oblique/adverbial prepositional phrase ([ⲛⲁⲩ ⲉⲣⲟ]ϥ, [ϩⲙ ⲡ]ⲏⲓ)
obl:npmod	oblique noun phrase ([ⲡ]ⲟⲩⲁ [ⲡⲟⲩⲁ])
orphan	links arguments whose joint head is elliptical
parataxis	additional phrase head without explicit coordination ([ⲁϥⲃⲱⲕ ⲁϥ]ⲛⲁⲩ )
punct	punctuation (., ⳾)
reparandum	marks head of erroneous or dysfluent material
root	main predicate (ⲡⲉϫⲉ, ⲥⲱⲧⲙ)
vocative	used in appellations ([ⲱ ⲡ]ⲣⲱⲙⲉ)
xcomp	external complement with shared object, usually infinitive/causative ([ⲁϥⲃⲱⲕ ⲉ]ⲛⲁⲩ, [ⲉⲧⲣⲉϥ]ⲥⲱⲧⲙ)

Frequencies

Once a query has been formulated, we can select More -> Frequencies to get a frequency breakdown of the values for the annotations we searched for. In such cases, it often makes sense to leave some annotation values unspecified. For example, we can look for a breakdown of all Greek origin words by specifying the language and asking for a lemma with no constraints:

lang="Greek" _=_ lemma

This will match any lemma with Greek language, and the frequency breakdown will provide counts for each type.

frequency query frequency breakdown

Alternatively, we can download all results matching this query using the CSV exporter by selecting More -> Export, then clicking Perform Export and finally using Download when the export is ready. For more information on different export formats, see the ANNIS documentation.

Citing

Please see the citation guidelines here for how to cite search results in academic papers.

For more walkthrough tutorials see also:
For more information on the Scriptorium tag set, see tagging guidelines
Lemmatization practices are documented here
Syntactic parsing guidelines are part of the Coptic Universal Dependencies Documentation project
Entity tagging guidelines are here
For a complete list of ANNIS operators, see the ANNIS documentation.

Using ANNIS for search in Coptic corpora

Introduction

Cheat sheet

Words

Where are the words in Coptic?

How to search for norms, morphs and groups

Using orig and orig_group for orthographic variants

Searching for lemmas

Wild cards and regular expressions

Tags

The part of speech tagset

Searching for words with tags

Language of origin

Sequences

Words

Words and annotations

Using value negation

Spans

Metadata

Syntax

Functions and dependencies

Full list of func labels

Frequencies

Citing

More