UDAR(enie)

UDAR Does Accented Russian: A finite-state morphological analyzer of Russian that handles stressed wordforms.

A Python wrapper for the Russian finite-state transducer originally described in chapter 2 of my dissertation.

If you use this work in your research please cite the following:

Reynolds, Robert J. "Russian natural language processing for computer-assisted language learning: capturing the benefits of deep morphological analysis in real-life applications" PhD Diss., UiT–The Arctic University of Norway, 2016. https://hdl.handle.net/10037/9685

Feature requests, issues, and pull requests are welcome!

Dependencies

For all features to be available, you should have hfst and vislcg3 installed as command-line utilities. Specifically, hfst is needed for FST-based tokenization, and vislcg3 is needed for grammatical disambiguation. The versions used to successfully test the code are recorded with each commit in the file hfst_vislcg3_versions.txt. The recommended method for installing these dependencies is as follows:

Debian / Ubuntu

$ wget https://apertium.projectjj.com/apt/install-nightly.sh -O - | sudo bash
$ sudo apt-get install cg3 hfst python3-hfst

MacOS (Python 3.6/3.7 only)

On MacOS, one of udar's dependencies, the Python package hfst, is not currently available for Python 3.8+. Hopefully, this will be remedied soon.

$ curl https://apertium.projectjj.com/osx/install-nightly.sh | sudo bash
$ python3 -m pip install hfst

Installation

This package can be installed from PyPI using the usual...

$ python3 -m pip install --user udar

...or directly from this repository using...

$ python3 -m pip install --user git+https://github.com/reynoldsnlp/udar

Introduction

NB! Documentation is currently limited to docstrings. I recommend that you use help() frequently to see how to use classes and methods. For example, to see what options are available for building a Document, try help(Document).

The most common use-case is to use the Document constructor to automatically tokenize and analyze a text. If you print() a Document object, the result is an XFST/HFST stream:

import udar
doc1 = udar.Document('Мы удивились простоте системы.')
print(doc1)
# Мы	мы+Pron+Pers+Pl1+Nom	0.000000
#
# удивились	удивиться+V+Perf+IV+Pst+MFN+Pl	5.078125
#
# простоте	простота+N+Fem+Inan+Sg+Dat	4.210938
# простоте	простота+N+Fem+Inan+Sg+Loc	4.210938
#
# системы	система+N+Fem+Inan+Pl+Acc	5.429688
# системы	система+N+Fem+Inan+Pl+Nom	5.429688
# системы	система+N+Fem+Inan+Sg+Gen	5.429688
#
# .	.+CLB	0.000000
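The stream layout above is one reading per line, with tab-separated columns for the surface form, the reading, and the weight, and a blank line between tokens. The following is a minimal sketch of reading that layout with the standard library only. The function name `parse_hfst_stream` is hypothetical and not part of udar; for real work, `Document.from_hfst` (listed in the method table below) is the intended entry point.

```python
from typing import List, Tuple

# Two tokens copied from the printed output above (columns are tab-separated).
SAMPLE = (
    "Мы\tмы+Pron+Pers+Pl1+Nom\t0.000000\n"
    "\n"
    "простоте\tпростота+N+Fem+Inan+Sg+Dat\t4.210938\n"
    "простоте\tпростота+N+Fem+Inan+Sg+Loc\t4.210938\n"
)


def parse_hfst_stream(stream: str) -> List[List[Tuple[str, str, float]]]:
    """Group (surface, reading, weight) triples into one list per token.

    Hypothetical helper for illustration; not udar's own parser.
    """
    tokens: List[List[Tuple[str, str, float]]] = []
    current: List[Tuple[str, str, float]] = []
    for line in stream.splitlines():
        if not line.strip():
            # A blank line closes the current token's block of readings.
            if current:
                tokens.append(current)
                current = []
            continue
        surface, reading, weight = line.split("\t")
        current.append((surface, reading, float(weight)))
    if current:
        tokens.append(current)
    return tokens


parsed = parse_hfst_stream(SAMPLE)
for readings in parsed:
    # Print each surface form with its number of readings.
    print(readings[0][0], len(readings))
```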

Passing the argument disambiguate=True, or running doc1.disambiguate() after the fact, runs a Constraint Grammar that removes as many ambiguous readings as possible. This grammar is far from complete, so some ambiguous readings will remain.

Data objects

`Document` object

Property	Type	Description
text	`str`	Original text of this document
sentences	`List[Sentence]`	List of sentences in this document
num_tokens	`int`	Number of tokens in this document
features	`tuple`	`udar.features.FeatureExtractor` stores extracted features here

Document objects have convenient methods for adding stress or converting to phonetic transcription.

Method	Return type	Description
stressed	`str`	The original text of the document with stress marks
phonetic	`str`	The original text converted to phonetic transcription
transliterate	`str`	The original text converted to Romanized Cyrillic (default=Scholarly)
disambiguate	`None`	Disambiguate readings using the Constraint Grammar
cg3_str	`str`	Analysis stream in the VISL-CG3 format
from_cg3	`Document`	Create `Document` from VISL-CG3 format stream
hfst_str	`str`	Analysis stream in the XFST/HFST format
from_hfst	`Document`	Create `Document` from XFST/HFST format stream
to_dict	`list`	Convert to a complex list object
to_html	`str`	Convert to HTML with markup in `data-` attributes
to_json	`str`	Convert to a JSON string

Examples

stressed_doc1 = doc1.stressed()
print(stressed_doc1)
# Мы́ удиви́лись простоте́ систе́мы.

ambig_doc = udar.Document('Твои слова ничего не значат.', disambiguate=True)
print(sorted(ambig_doc[1].stresses()))  # Note that слова is still ambiguous
# ['сло́ва', 'слова́']

print(ambig_doc.stressed(selection='safe'))  # 'safe' skips сло́ва and слова́
# Твои́ слова ничего́ не зна́чат.
print(ambig_doc.stressed(selection='all'))  # 'all' combines сло́ва and слова́
# Твои́ сло́ва́ ничего́ не зна́чат.
print(ambig_doc.stressed(selection='rand') in {'Твои́ сло́ва ничего́ не зна́чат.', 'Твои́ слова́ ничего́ не зна́чат.'})  # 'rand' randomly chooses between сло́ва and слова́
# True
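To make the difference between 'safe' and 'all' concrete, here is a rough sketch of how competing stressed variants of one word could be merged. It is not udar's implementation: the helper names are hypothetical, and it assumes that stress is written with the combining acute accent (U+0301) placed after the stressed vowel.

```python
ACUTE = "\u0301"  # combining acute accent, assumed to be the stress mark


def strip_stress(word: str) -> str:
    """Remove every stress mark from a word."""
    return word.replace(ACUTE, "")


def stress_positions(word: str) -> set:
    """Indices (in the unstressed spelling) of the letters that carry stress."""
    positions = set()
    index = -1
    for char in word:
        if char == ACUTE:
            positions.add(index)  # the mark belongs to the preceding letter
        else:
            index += 1
    return positions


def combine(variants, selection: str = "safe") -> str:
    """Merge stressed variants of one word (hypothetical helper)."""
    variants = set(variants)
    bases = {strip_stress(v) for v in variants}
    if len(bases) != 1:
        raise ValueError("variants must share one unstressed spelling")
    base = bases.pop()
    if len(variants) == 1:
        return next(iter(variants))  # no ambiguity, keep the stress
    if selection == "safe":
        return base  # variants disagree, so leave the word unstressed
    if selection == "all":
        marked = set().union(*(stress_positions(v) for v in variants))
        return "".join(c + ACUTE if i in marked else c for i, c in enumerate(base))
    raise ValueError(f"unsupported selection: {selection!r}")


slova = {"сло\u0301ва", "слова\u0301"}
print(combine(slova, "safe"))
print(combine(slova, "all"))
```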


phonetic_doc1 = doc1.phonetic()
print(phonetic_doc1)
# мы́ уд'ив'и́л'ис' пръстʌт'э́ с'ис'т'э́мы.

`Sentence` object

Property	Type	Description
doc	`Document`	"Back pointer" to the parent document of this sentence
text	`str`	Original text of this sentence
tokens	`List[Token]`	The list of tokens in this sentence
id	`str`	(optional) Sentence id, if assigned at creation

Method	Return type	Description
stressed	`str`	The original text of the sentence with stress marks
phonetic	`str`	The original text converted to phonetic transcription
transliterate	`str`	The original text converted to Romanized Cyrillic (default=Scholarly)
disambiguate	`None`	Disambiguate readings using the Constraint Grammar
cg3_str	`str`	Analysis stream in the VISL-CG3 format
from_cg3	`Sentence`	Create `Sentence` from VISL-CG3 format stream
hfst_str	`str`	Analysis stream in the XFST/HFST format
from_hfst	`Sentence`	Create `Sentence` from XFST/HFST format stream
to_dict	`list`	Convert to a complex list object
to_html	`str`	Convert to HTML with markup in `data-` attributes
to_json	`str`	Convert to a JSON string

`Token` object

Property	Type	Description
id	`str`	The index of this token in the sentence, 1-based
text	`str`	The original text of this token
misc	`str`	Miscellaneous annotations with regard to this token
lemmas	`Set[str]`	All possible lemmas, based on remaining readings
readings	`List[Reading]`	List of readings not removed by the Constraint Grammar
removed_readings	`List[Reading]`	List of readings removed by the Constraint Grammar
deprel	`str`	The dependency relation between this word and its syntactic head. Example: ‘nmod’.

Method	Return type	Description
stresses	`Set[str]`	All possible stressed wordforms, based on remaining readings
stressed	`str`	The original text of the token with stress marks
phonetic	`str`	The original text converted to phonetic transcription
most_likely_reading	`Reading`	"Most likely" reading (may be partially random selection)
most_likely_lemmas	`List[str]`	List of lemma(s) from the "most likely" reading
transliterate	`str`	The original text converted to Romanized Cyrillic (default=Scholarly)
force_disambiguate	`None`	Fully disambiguate readings using methods other than the Constraint Grammar
cg3_str	`str`	Analysis stream in the VISL-CG3 format
hfst_str	`str`	Analysis stream in the XFST/HFST format
to_dict	`dict`	Convert to a `dict` object
to_html	`str`	Convert to HTML with markup in `data-` attributes
to_json	`str`	Convert to a JSON string

`Reading` object

Property	Type	Description
subreadings	`List[Subreading]`	Usually only one subreading, but multiple subreadings are possible for complex `Token`s.
lemmas	`List[str]`	Lemmas from all subreadings
grouped_tags	`List[Tag]`	The part-of-speech, morphosyntactic, semantic and other tags from all subreadings
weight	`str`	Weight indicating the likelihood of the reading, without respect to context
cg_rule	`str`	Reference to the rule in the constraint grammar that removed/selected/etc. this reading. If no action has been taken on this reading, then `''`.
is_most_likely	`bool`	Indicates whether this reading has been selected as the most likely reading of its `Token`. Note that some selection methods may be at least partially random.

Method	Return type	Description
cg3_str	`str`	Analysis stream in the VISL-CG3 format
hfst_str	`str`	Analysis stream in the XFST/HFST format
generate	`str`	Generate the wordform from this reading
replace_tag	`None`	Replace a tag in this reading
does_not_conflict	`bool`	Determine whether reading from external tagset (e.g. Universal Dependencies) conflicts with this reading
to_dict	`list`	Convert to a `list` object
to_json	`str`	Convert to a JSON string

`Subreading` object

Property	Type	Description
lemma	`str`	The lemma of the subreading
tags	`List[Tag]`	The part-of-speech, morphosyntactic, semantic and other tags
tagset	`Set[Tag]`	Same as `tags`, but for faster membership testing (`in` Reading)

Method	Return type	Description
cg3_str	`str`	Analysis stream in the VISL-CG3 format
hfst_str	`str`	Analysis stream in the XFST/HFST format
replace_tag	`None`	Replace a tag in this subreading
to_dict	`dict`	Convert to a `dict` object
to_json	`str`	Convert to a JSON string

`Tag` object

Property	Type	Description
name	`str`	The name of this tag
ms_feat	`str`	Morphosyntactic feature that this tag is associated with (e.g. `Dat` has ms_feat `CASE`)
detail	`str`	Description of the tag's purpose or meaning
is_L2_error	`bool`	Whether this tag indicates a second-language learner error

Method	Return type	Description
info	`str`	Alias for `Tag.detail`

Convenience functions

A number of functions are included, both for convenience, and to give concrete examples for using the API.

noun_distractors()

This function generates all six cases of a given noun. If the given noun is singular, then the function generates singular forms. If the given noun is plural, then the function generates plural forms. The result is a set, so cases that share a form (nominative and accusative in the examples below) appear only once. Such a list can be used in a multiple-choice exercise, hence the name distractors.

sg_paradigm = udar.noun_distractors('словом')
print(sg_paradigm == {'сло́ву', 'сло́ве', 'сло́вом', 'сло́ва', 'сло́во'})
# True

pl_paradigm = udar.noun_distractors('словах')
print(pl_paradigm == {'слова́м', 'слова́', 'слова́х', 'слова́ми', 'сло́в'})
# True

If unstressed forms are desired, simply pass the argument stressed=False.
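One way such a paradigm could be built is to swap the case tag in a reading string and pass each result to a generator. The sketch below covers only the tag-swapping step and uses no udar calls. It is an illustration under assumptions, not necessarily how `noun_distractors` works internally: the function name `case_variants` is hypothetical, and the tag name `Ins` for the instrumental case is assumed (the other five case tags appear elsewhere in this README).

```python
# Case tags; "Ins" is an assumed name for the instrumental case.
CASES = ("Nom", "Acc", "Gen", "Loc", "Dat", "Ins")


def case_variants(reading: str) -> list:
    """Return one reading string per case, keeping all other tags unchanged.

    Hypothetical helper; assumes a simple '+'-delimited reading.
    """
    tags = reading.split("+")
    variants = []
    for case in CASES:
        swapped = [case if tag in CASES else tag for tag in tags]
        variants.append("+".join(swapped))
    return variants


for variant in case_variants("слово+N+Neu+Inan+Sg+Gen"):
    print(variant)
```

Each resulting string could then be given to a stressed generator, as shown for `слово+N+Neu+Inan+Pl+Nom` in the Generator section.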

diagnose_L2()

This function will take a text string as the argument, and will return a dictionary of all the types of L2 errors in the text, along with examples of the error.

diag = udar.diagnose_L2('Етот малчик говорит по-русски.')
print(diag == {'Err/L2_e2je': {'Етот'}, 'Err/L2_NoSS': {'малчик'}})
# True

tag_info()

This function will look up the meaning of any tag used by the analyzer.

print(udar.tag_info('Err/L2_ii'))
# L2 error: Failure to change ending ие to ии in +Sg+Loc or +Sg+Dat, e.g. к Марие, о кафетерие, о знание

Using the transducers manually

The transducers come in two varieties: the Analyzer class and the Generator class. For memory efficiency, I recommend using the get_analyzer and get_generator functions, which ensure that each flavor of the transducers remains a singleton in memory.
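The singleton-per-flavor idea can be sketched with a cached factory from the standard library. This is not udar's code: `ToyTransducer` and `get_toy_analyzer` are hypothetical stand-ins, and the sketch only shows why repeated calls with the same options can share one object in memory.

```python
from functools import lru_cache


class ToyTransducer:
    """Stand-in for a transducer that is expensive to load (hypothetical)."""

    def __init__(self, L2_errors: bool = False):
        self.L2_errors = L2_errors


@lru_cache(maxsize=None)
def _load(L2_errors: bool) -> ToyTransducer:
    # Runs once per distinct argument value; later calls reuse the result.
    return ToyTransducer(L2_errors)


def get_toy_analyzer(L2_errors: bool = False) -> ToyTransducer:
    # Normalize the argument so that get_toy_analyzer() and
    # get_toy_analyzer(L2_errors=False) hit the same cache entry.
    return _load(bool(L2_errors))


a = get_toy_analyzer()
b = get_toy_analyzer(L2_errors=False)
c = get_toy_analyzer(L2_errors=True)
print(a is b)  # same flavor, same object
print(a is c)  # different flavor, different object
```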

Analyzer

The Analyzer can be initialized with or without analyses for second-language learner errors using the keyword L2_errors.

analyzer = udar.get_analyzer()  # by default, L2_errors is False
L2_analyzer = udar.get_analyzer(L2_errors=True)

Analyzers are callable. They take a token str and return a sequence of reading/weight tuples.

raw_readings1 = analyzer('сло́ва')
print(raw_readings1)
# (('слово+N+Neu+Inan+Sg+Gen', 5.9755859375),)

raw_readings2 = analyzer('слова')
print(raw_readings2)
# (('слово+N+Neu+Inan+Pl+Acc', 5.9755859375), ('слово+N+Neu+Inan+Pl+Nom', 5.9755859375), ('слово+N+Neu+Inan+Sg+Gen', 5.9755859375))
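The two outputs above show that a stress mark in the input narrows the set of analyses. Working only with the tuples copied from those outputs (no transducer is called here), the readings ruled out by the stress mark are a plain set difference:

```python
# Tuples copied from the two analyzer outputs shown above.
stressed_input = (
    ("слово+N+Neu+Inan+Sg+Gen", 5.9755859375),
)
unstressed_input = (
    ("слово+N+Neu+Inan+Pl+Acc", 5.9755859375),
    ("слово+N+Neu+Inan+Pl+Nom", 5.9755859375),
    ("слово+N+Neu+Inan+Sg+Gen", 5.9755859375),
)

# Readings that are possible for the unstressed spelling
# but not for the stressed one.
ruled_out = sorted(set(unstressed_input) - set(stressed_input))
for reading, weight in ruled_out:
    print(reading, weight)
```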

Generator

The Generator can be initialized in three varieties: unstressed, stressed, and phonetic.

generator = udar.get_generator()  # unstressed by default
stressed_generator = udar.get_generator(stressed=True)
phonetic_generator = udar.get_generator(phonetic=True)

Generators are callable. They take a Reading or raw reading str and return a surface form.

print(stressed_generator('слово+N+Neu+Inan+Pl+Nom'))
# слова́

Working with `Token`s and `Reading`s

You can easily check if a morphosyntactic tag is in a Token, Reading, or Subreading using in:

token2 = udar.Token('слова', analyze=True)
print(token2)
# слова [слово_N_Neu_Inan_Pl_Acc  слово_N_Neu_Inan_Pl_Nom  слово_N_Neu_Inan_Sg_Gen]

print('Gen' in token2)  # do any of the readings include Genitive case?
# True

print('слово' in token2)  # does not work for lemmas; use `in Token.lemmas`
# False

print('слово' in token2.lemmas)
# True

You can make a filtered list of a Token's readings using the following idiom:

pl_readings = [reading for reading in token2 if 'Pl' in reading]
print(pl_readings)
# [Reading(слово+N+Neu+Inan+Pl+Acc, 5.975586, ), Reading(слово+N+Neu+Inan+Pl+Nom, 5.975586, )]
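The `in` checks and the list-comprehension idiom above rely on Python's container protocols. The toy classes below sketch how that behavior can be obtained with `__contains__` and `__iter__`. They are hypothetical stand-ins, not udar's `Token` and `Reading` classes, and they assume simple '+'-delimited readings with a single lemma.

```python
class ToyReading:
    """Toy reading built from a raw string such as 'слово+N+Neu+Inan+Pl+Nom'."""

    def __init__(self, raw: str):
        self.raw = raw
        self.lemma, *self.tags = raw.split("+")

    def __contains__(self, tag: str) -> bool:
        # Membership tests look at tags only, never at the lemma.
        return tag in self.tags

    def __repr__(self) -> str:
        return f"ToyReading({self.raw})"


class ToyToken:
    """Toy token that holds several readings."""

    def __init__(self, text: str, raw_readings):
        self.text = text
        self.readings = [ToyReading(raw) for raw in raw_readings]

    def __iter__(self):
        # Iterating over a token yields its readings.
        return iter(self.readings)

    def __contains__(self, tag: str) -> bool:
        # True if any reading carries the tag.
        return any(tag in reading for reading in self.readings)

    @property
    def lemmas(self) -> set:
        return {reading.lemma for reading in self.readings}


token = ToyToken("слова", [
    "слово+N+Neu+Inan+Pl+Acc",
    "слово+N+Neu+Inan+Pl+Nom",
    "слово+N+Neu+Inan+Sg+Gen",
])
pl = [reading for reading in token if "Pl" in reading]
print("Gen" in token, "слово" in token, "слово" in token.lemmas)
print(pl)
```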

Related projects

Finite-state tools

https://github.com/giellalt/lang-rus (The FSTs underlying this package come from here)
https://github.com/mikahama/uralicNLP
