Utils

Utility functions for TRUNAJOD library.

class TRUNAJOD.utils.SupportedModels

Enum for supported Doc models.

TRUNAJOD.utils.flatten(list_of_lists)

Flatten a list of lists.

This is a utility function that takes a list of lists and flattens it. For example, the list [[1, 2, 3], [4, 5]] would be converted into [1, 2, 3, 4, 5].

Parameters

list_of_lists (Python List of Lists) – List to be flattened

Returns

The flattened list

Return type

Python List
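
A minimal usage sketch, mirroring the example above:

    from TRUNAJOD.utils import flatten

    nested = [[1, 2, 3], [4, 5]]
    print(flatten(nested))  # [1, 2, 3, 4, 5]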

TRUNAJOD.utils.get_sentences_lemmas(docs, lemma_dict, stopwords=[])

Get lemmas from sentences.

Get different types of lemma measurements, such as noun lemmas, verb lemmas, and content lemmas. It calls TRUNAJOD.utils.get_token_lemmas() internally to extract the different lemma types for each sentence. This function extracts the following lemmas:

  • Noun lemmas

  • Verb lemmas

  • Function lemmas (provided as stopwords)

  • Content lemmas (anything that is not in stopwords)

  • Adjective lemmas

  • Adverb lemmas

  • Proper pronoun lemmas

Parameters
  • docs (List of Spacy Doc (Doc.sents)) – List of sentences to be processed.

  • lemma_dict (dict) – Lemmatizer dictionary

  • stopwords (list, optional) – List of stopwords (function words), defaults to []

Returns

List of lemmas from text

Return type

List of Lists of str
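
A usage sketch, assuming a Spanish spaCy model (the model name es_core_news_sm is an assumption) is installed; the tiny lemma_dict and stopword list here are illustrative only:

    import spacy
    from TRUNAJOD.utils import get_sentences_lemmas

    nlp = spacy.load("es_core_news_sm")  # assumed model name
    doc = nlp("El gato negro duerme. Los perros corren rápido.")

    # Toy lemmatizer dictionary and stopword list, for illustration only
    lemma_dict = {"perros": "perro", "corren": "correr", "duerme": "dormir"}
    stopwords = ["el", "los"]

    # One call processes every sentence in the document
    sentence_lemmas = get_sentences_lemmas(list(doc.sents), lemma_dict, stopwords)
    print(sentence_lemmas)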

TRUNAJOD.utils.get_stopwords(filename)

Read stopword list from file.

Assumes that the list is defined as newline-separated words (one word per line). It is a utility in case you’d like to provide your own stopword list. Assumes UTF-8 encoding.

Parameters

filename (string) – Name of the file containing stopword list.

Returns

List of stopwords

Return type

set
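
A minimal sketch, assuming a hypothetical file stopwords_es.txt with one word per line:

    from TRUNAJOD.utils import get_stopwords

    # stopwords_es.txt (hypothetical) contains, e.g.:
    #   el
    #   la
    #   de
    stopwords = get_stopwords("stopwords_es.txt")
    print("de" in stopwords)  # membership tests are cheap since a set is returned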

TRUNAJOD.utils.get_token_lemmas(doc, lemma_dict, stopwords=[])

Return lemmas from a sentence.

From a sentence, extracts the following lemmas:

  • Noun lemmas

  • Verb lemmas

  • Function lemmas (provided as stopwords)

  • Content lemmas (anything that is not in stopwords)

  • Adjective lemmas

  • Adverb lemmas

  • Proper pronoun lemmas

Parameters
  • doc (Spacy Doc) – Doc containing tokens from text

  • lemma_dict (Dict) – Lemmatizer key-value pairs

  • stopwords (set/list, optional) – list of stopwords, defaults to []

Returns

All lemmas for noun, verb, etc.

Return type

tuple of lists
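
A sketch assuming a Spanish spaCy model is installed (the model name is an assumption); the lemma_dict and stopword list are illustrative:

    import spacy
    from TRUNAJOD.utils import get_token_lemmas

    nlp = spacy.load("es_core_news_sm")  # assumed model name
    doc = nlp("Los perros grandes corren rápido.")

    lemma_dict = {"perros": "perro", "corren": "correr", "grandes": "grande"}
    stopwords = ["los"]

    # Returns a tuple of lists, one list per lemma type listed above
    for lemma_group in get_token_lemmas(doc, lemma_dict, stopwords):
        print(lemma_group)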

TRUNAJOD.utils.is_adjective(token: spacy.tokens.token.Token)

Return True if the token’s POS tag is ADJ, False otherwise.

Parameters

token (spacy.tokens.Token) – Token to be checked

Returns

True if POS is adjective

Return type

boolean
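
A sketch showing how this predicate can be used to filter tokens (a Spanish spaCy model is assumed); is_adverb, is_noun, is_pronoun and is_verb below follow the same pattern:

    import spacy
    from TRUNAJOD.utils import is_adjective

    nlp = spacy.load("es_core_news_sm")  # assumed model name
    doc = nlp("El gato negro duerme.")

    adjectives = [token.text for token in doc if is_adjective(token)]
    print(adjectives)  # e.g. ['negro'], depending on the model's tagging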

TRUNAJOD.utils.is_adverb(token: spacy.tokens.token.Token)

Return True if the token’s POS tag is ADV, False otherwise.

Parameters

token (spacy.tokens.Token) – Token to be checked

Returns

True if POS is adverb

Return type

boolean

TRUNAJOD.utils.is_noun(token: spacy.tokens.token.Token)

Return True if the token’s POS tag is NOUN or PROPN, False otherwise.

Parameters

token (spacy.tokens.Token) – Token to be checked

Returns

True if POS is noun or proper noun

Return type

boolean

TRUNAJOD.utils.is_pronoun(token: spacy.tokens.token.Token)

Return True if the token’s POS tag is PRON, False otherwise.

Parameters

token (spacy.tokens.Token) – Token to be checked

Returns

True if POS is pronoun

Return type

boolean

TRUNAJOD.utils.is_stopword(word, stopwords)

Return True if word is in stopwords, False otherwise.

Parameters
  • word (string) – Word to be checked

  • stopwords (List of strings) – stopword list

Returns

True if word in stopwords

Return type

boolean
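
A minimal sketch with an illustrative stopword list:

    from TRUNAJOD.utils import is_stopword

    stopwords = ["el", "la", "de", "que"]
    print(is_stopword("el", stopwords))    # True
    print(is_stopword("gato", stopwords))  # False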

TRUNAJOD.utils.is_verb(token: spacy.tokens.token.Token)

Return True if the token’s POS tag is VERB, False otherwise.

Parameters

token (spacy.tokens.Token) – Token to be checked

Returns

True if POS is verb

Return type

boolean

TRUNAJOD.utils.is_word(token: spacy.tokens.token.Token)

Return True if the token is not punctuation, False otherwise.

This method checks that the token’s POS tag does not belong to the following set: {'PUNCT', 'SYM', 'SPACE'}.

Parameters

token (spacy.tokens.Token) – Token to be checked

Returns

True if POS is a word

Return type

boolean
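
A sketch counting word tokens while skipping punctuation (a Spanish spaCy model is assumed):

    import spacy
    from TRUNAJOD.utils import is_word

    nlp = spacy.load("es_core_news_sm")  # assumed model name
    doc = nlp("Hola, ¿cómo estás?")

    n_words = sum(1 for token in doc if is_word(token))
    print(n_words)  # punctuation tokens such as ',' and '?' are not counted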

TRUNAJOD.utils.lemmatize(lemma_dict, word)

Lemmatize a word.

Lemmatizes a word using a lemmatizer represented as a dict that has (word, lemma) as (key, value) pairs. An example of such a lemma list can be found at https://github.com/michmech/lemmatization-lists.

If the word is not found in the dictionary, the lemma returned will be the word.

Parameters
  • lemma_dict (Python dict) – A dict (word, lemma)

  • word (string) – The word to be lemmatized

Returns

Lemmatized word

Return type

string
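
A minimal sketch with a toy lemmatizer dictionary:

    from TRUNAJOD.utils import lemmatize

    lemma_dict = {"corren": "correr", "perros": "perro"}
    print(lemmatize(lemma_dict, "corren"))  # 'correr'
    print(lemmatize(lemma_dict, "gato"))    # 'gato' (not in the dict, returned as-is)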

TRUNAJOD.utils.process_text(text, sent_tokenize)

Process text by tokenizing sentences given a tokenizer.

Parameters
  • text (string) – Text to be processed

  • sent_tokenize (Python callable that takes a string and returns a list of strings) – Sentence tokenizer

Returns

Tokenized sentences

Return type

List of strings
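
A sketch with a naive sentence tokenizer, for illustration only; any callable that maps a text to a list of sentence strings can be passed:

    from TRUNAJOD.utils import process_text

    def naive_sent_tokenize(text):
        # Illustrative only: splits on periods
        return [s.strip() for s in text.split(".") if s.strip()]

    sentences = process_text("El gato duerme. Los perros corren.", naive_sent_tokenize)
    print(sentences)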

TRUNAJOD.utils.read_text(filename)

Read a UTF-8 encoded text file and return the text as a string.

This is just a utility function; it is not recommended if the text file does not fit in your available RAM. It is mostly intended for small text files.

Parameters

filename (string) – File from which the text is to be read

Returns

Text in the file

Return type

string
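
A minimal sketch, assuming a small UTF-8 text file named example.txt exists:

    from TRUNAJOD.utils import read_text

    text = read_text("example.txt")  # hypothetical small input file
    print(len(text))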