Utils

Utility functions for TRUNAJOD library.

class TRUNAJOD.utils.SupportedModels

Enum for supported Doc models.

TRUNAJOD.utils.flatten(list_of_lists)

Flatten a list of lists.

This is a utility function that takes a list of lists and flattens it. For example, the list [[1, 2, 3], [4, 5]] would be converted into [1, 2, 3, 4, 5].

Parameters

list_of_lists (Python List of Lists) – List to be flattened

Returns

The flattened list

Return type

Python List
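
A minimal usage sketch, mirroring the example above:

    from TRUNAJOD.utils import flatten

    nested = [[1, 2, 3], [4, 5]]
    print(flatten(nested))  # [1, 2, 3, 4, 5]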

TRUNAJOD.utils.get_sentences_lemmas(docs, lemma_dict, stopwords=[])

Get lemmas from sentences.

Get different types of lemma measurements, such as noun lemmas, verb lemmas, and content lemmas. It calls TRUNAJOD.utils.get_token_lemmas() internally to extract the different lemma types for each sentence. This function extracts the following lemmas:

  • Noun lemmas

  • Verb lemmas

  • Function lemmas (provided as stopwords)

  • Content lemmas (anything that is not in stopwords)

  • Adjective lemmas

  • Adverb lemmas

  • Proper pronoun lemmas

Parameters
  • docs (List of Spacy Doc (Doc.sents)) – List of sentences to be processed.

  • lemma_dict (dict) – Lemmatizer dictionary

  • stopwords (list, optional) – List of stopwords (function words), defaults to []

Returns

List of lemmas from text

Return type

List of Lists of str
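
A usage sketch, assuming a Spanish spaCy model (the model name es_core_news_sm is an assumption) is installed; the tiny lemma_dict and stopword list here are illustrative only:

    import spacy
    from TRUNAJOD.utils import get_sentences_lemmas

    nlp = spacy.load("es_core_news_sm")  # assumed model name
    doc = nlp("El gato negro duerme. Los perros corren rápido.")

    # Toy lemmatizer dictionary and stopword list, for illustration only
    lemma_dict = {"perros": "perro", "corren": "correr", "duerme": "dormir"}
    stopwords = ["el", "los"]

    # One call processes every sentence in the document
    sentence_lemmas = get_sentences_lemmas(list(doc.sents), lemma_dict, stopwords)
    print(sentence_lemmas)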

TRUNAJOD.utils.get_stopwords(filename)

Read stopword list from file.

Assumes that the list is defined as newline-separated words (one word per line). It is a utility in case you’d like to provide your own stopword list. Assumes UTF-8 encoding.

Parameters

filename (string) – Name of the file containing stopword list.

Returns

List of stopwords

Return type

set
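
A minimal sketch, assuming a hypothetical file stopwords_es.txt with one word per line:

    from TRUNAJOD.utils import get_stopwords

    # stopwords_es.txt (hypothetical) contains, e.g.:
    #   el
    #   la
    #   de
    stopwords = get_stopwords("stopwords_es.txt")
    print("de" in stopwords)  # membership tests are cheap since a set is returned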

TRUNAJOD.utils.get_token_lemmas(doc, lemma_dict, stopwords=[])

Return lemmas from a sentence.

From a sentence, extracts the following lemmas:

  • Noun lemmas

  • Verb lemmas

  • Function lemmas (provided as stopwords)

  • Content lemmas (anything that is not in stopwords)

  • Adjective lemmas

  • Adverb lemmas

  • Proper pronoun lemmas

Parameters
  • doc (Spacy Doc) – Doc containing tokens from text

  • lemma_dict (Dict) – Lemmatizer key-value pairs

  • stopwords (set/list, optional) – list of stopwords, defaults to []

Returns

All lemmas for noun, verb, etc.

Return type

tuple of lists
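
A sketch assuming a Spanish spaCy model is installed (the model name is an assumption); the lemma_dict and stopword list are illustrative:

    import spacy
    from TRUNAJOD.utils import get_token_lemmas

    nlp = spacy.load("es_core_news_sm")  # assumed model name
    doc = nlp("Los perros grandes corren rápido.")

    lemma_dict = {"perros": "perro", "corren": "correr", "grandes": "grande"}
    stopwords = ["los"]

    # Returns a tuple of lists, one list per lemma type listed above
    for lemma_group in get_token_lemmas(doc, lemma_dict, stopwords):
        print(lemma_group)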

TRUNAJOD.utils.is_adjective(token: spacy.tokens.token.Token)

Return True if the token’s POS tag is ADJ, False otherwise.

Parameters

token (spacy.tokens.Token) – Token to be checked

Returns

True if POS is adjective

Return type

boolean
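
A sketch showing how this predicate can be used to filter tokens (a Spanish spaCy model is assumed); is_adverb, is_noun, is_pronoun and is_verb below follow the same pattern:

    import spacy
    from TRUNAJOD.utils import is_adjective

    nlp = spacy.load("es_core_news_sm")  # assumed model name
    doc = nlp("El gato negro duerme.")

    adjectives = [token.text for token in doc if is_adjective(token)]
    print(adjectives)  # e.g. ['negro'], depending on the model's tagging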

TRUNAJOD.utils.is_adverb(token: spacy.tokens.token.Token)

Return True if the token’s POS tag is ADV, False otherwise.

Parameters

token (spacy.tokens.Token) – Token to be checked

Returns

True if POS is adverb

Return type

boolean

TRUNAJOD.utils.is_noun(token: spacy.tokens.token.Token)

Return True if the token’s POS tag is NOUN or PROPN, False otherwise.

Parameters

token (spacy.tokens.Token) – Token to be checked

Returns

True if POS is noun or proper noun

Return type

boolean

TRUNAJOD.utils.is_pronoun(token: spacy.tokens.token.Token)

Return True if the token’s POS tag is PRON, False otherwise.

Parameters

token (spacy.tokens.Token) – Token to be checked

Returns

True if POS is pronoun

Return type

boolean

TRUNAJOD.utils.is_stopword(word, stopwords)

Return True if word is in stopwords, False otherwise.

Parameters
  • word (string) – Word to be checked

  • stopwords (List of strings) – stopword list

Returns

True if word in stopwords

Return type

boolean
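
A minimal sketch with an illustrative stopword list:

    from TRUNAJOD.utils import is_stopword

    stopwords = ["el", "la", "de", "que"]
    print(is_stopword("el", stopwords))    # True
    print(is_stopword("gato", stopwords))  # False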

TRUNAJOD.utils.is_verb(token: spacy.tokens.token.Token)

Return True if the token’s POS tag is VERB, False otherwise.

Parameters

token (spacy.tokens.Token) – Token to be checked

Returns

True if POS is verb

Return type

boolean

TRUNAJOD.utils.is_word(token: spacy.tokens.token.Token)

Return True if the token is not punctuation, False otherwise.

This method checks that the token’s POS tag does not belong to the following set: {'PUNCT', 'SYM', 'SPACE'}.

Parameters

token (spacy.tokens.Token) – Token to be checked

Returns

True if POS is a word

Return type

boolean
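
A sketch counting word tokens while skipping punctuation (a Spanish spaCy model is assumed):

    import spacy
    from TRUNAJOD.utils import is_word

    nlp = spacy.load("es_core_news_sm")  # assumed model name
    doc = nlp("Hola, ¿cómo estás?")

    n_words = sum(1 for token in doc if is_word(token))
    print(n_words)  # punctuation tokens such as ',' and '?' are not counted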

TRUNAJOD.utils.lemmatize(lemma_dict, word)

Lemmatize a word.

Lemmatizes a word using a lemmatizer represented as a dict that has (word, lemma) as (key, value) pairs. An example of such a lemma list can be found at https://github.com/michmech/lemmatization-lists.

If the word is not found in the dictionary, the lemma returned will be the word.

Parameters
  • lemma_dict (Python dict) – A dict (word, lemma)

  • word (string) – The word to be lemmatized

Returns

Lemmatized word

Return type

string
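
A minimal sketch with a toy lemmatizer dictionary:

    from TRUNAJOD.utils import lemmatize

    lemma_dict = {"corren": "correr", "perros": "perro"}
    print(lemmatize(lemma_dict, "corren"))  # 'correr'
    print(lemmatize(lemma_dict, "gato"))    # 'gato' (not in the dict, returned as-is)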

TRUNAJOD.utils.process_text(text, sent_tokenize)

Process text by tokenizing sentences given a tokenizer.

Parameters
  • text (string) – Text to be processed

  • sent_tokenize (Python callable that takes a string and returns a list of strings) – Sentence tokenizer

Returns

Tokenized sentences

Return type

List of strings
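
A sketch with a naive sentence tokenizer, for illustration only; any callable that maps a text to a list of sentence strings can be passed:

    from TRUNAJOD.utils import process_text

    def naive_sent_tokenize(text):
        # Illustrative only: splits on periods
        return [s.strip() for s in text.split(".") if s.strip()]

    sentences = process_text("El gato duerme. Los perros corren.", naive_sent_tokenize)
    print(sentences)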

TRUNAJOD.utils.read_text(filename)

Read a UTF-8 encoded text file and return the text as a string.

This is just a utility function; it is not recommended if the text file does not fit in your available RAM. It is mostly intended for small text files.

Parameters

filename (string) – File from which the text is to be read

Returns

Text in the file

Return type

string
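
A minimal sketch, assuming a small UTF-8 text file named example.txt exists:

    from TRUNAJOD.utils import read_text

    text = read_text("example.txt")  # hypothetical small input file
    print(len(text))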