Utils
Utility functions for TRUNAJOD library.
-
class TRUNAJOD.utils.SupportedModels
Enum for supported Doc models.
-
TRUNAJOD.utils.flatten(list_of_lists)
Flatten a list of lists.
This is a utility function that takes a list of lists and flattens it. For example, the list [[1, 2, 3], [4, 5]] would be converted into [1, 2, 3, 4, 5].
- Parameters
list_of_lists (Python List of Lists) – List to be flattened
- Returns
The list flattened
- Return type
Python List
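The flattening described above can be illustrated with a minimal sketch. This is not TRUNAJOD's implementation: `flatten_sketch` is a hypothetical stand-in built on the standard library, and it assumes only one level of nesting is removed.

```python
from itertools import chain


def flatten_sketch(list_of_lists):
    """Concatenate the inner lists into one flat list (one level deep)."""
    return list(chain.from_iterable(list_of_lists))


print(flatten_sketch([[1, 2, 3], [4, 5]]))  # [1, 2, 3, 4, 5]
```

Deeper nesting is left untouched in this sketch, so `[[1, [2]], [3]]` would become `[1, [2], 3]`.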
-
TRUNAJOD.utils.get_sentences_lemmas(docs, lemma_dict, stopwords=[])
Get lemmas from sentences.
Get different types of lemma measurements, such as noun lemmas, verb lemmas, and content lemmas. It calls TRUNAJOD.utils.get_token_lemmas() internally to extract the different lemma types for each sentence. This function extracts the following lemmas:
Noun lemmas
Verb lemmas
Function lemmas (provided as stopwords)
Content lemmas (anything that is not in stopwords)
Adjective lemmas
Adverb lemmas
Proper pronoun lemmas
- Parameters
docs (List of spaCy Doc (Doc.sents)) – List of sentences to be processed.
lemma_dict (dict) – Lemmatizer dictionary
stopwords (list, optional) – List of stopwords (function words), defaults to []
- Returns
List of lemmas from text
- Return type
List of Lists of str
-
TRUNAJOD.utils.get_stopwords(filename)
Read stopword list from file.
Assumes that the list is defined as newline-separated words. It is a utility in case you'd like to provide your own stopword list. Assumes utf8 encoding.
- Parameters
filename (string) – Name of the file containing stopword list.
- Returns
Set of stopwords
- Return type
set
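As a rough illustration of the file layout this function expects, the sketch below reads one stopword per line into a set. `get_stopwords_sketch` is a hypothetical stand-in, not the library's code, and skipping blank lines is an assumption.

```python
import os
import tempfile


def get_stopwords_sketch(filename):
    """Read one stopword per line from a UTF-8 file and return them as a set."""
    with open(filename, encoding="utf-8") as fp:
        # Strip surrounding whitespace and skip blank lines (an assumption).
        return {line.strip() for line in fp if line.strip()}


# Write a small throwaway stopword file to demonstrate the expected layout.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("el\nla\nde\n")
    path = tmp.name

stopwords = get_stopwords_sketch(path)
os.remove(path)
print(sorted(stopwords))  # ['de', 'el', 'la']
```

Returning a set keeps membership checks fast when the stopwords are later used to separate function words from content words.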
-
TRUNAJOD.utils.get_token_lemmas(doc, lemma_dict, stopwords=[])
Return lemmas from a sentence.
From a sentence, extracts the following lemmas:
Noun lemmas
Verb lemmas
Function lemmas (provided as stopwords)
Content lemmas (anything that is not in stopwords)
Adjective lemmas
Adverb lemmas
Proper pronoun lemmas
- Parameters
doc (spaCy Doc) – Doc containing tokens from text
lemma_dict (Dict) – Lemmatizer key-value pairs
stopwords (set/list, optional) – list of stopwords, defaults to []
- Returns
All lemmas for noun, verb, etc.
- Return type
tuple of lists
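The grouping described for this function can be sketched without spaCy. Everything below is hypothetical: `Token` is a stand-in for a spaCy token, `token_lemmas_sketch` returns a dict rather than the documented tuple of lists (the tuple order is not stated here), and deciding function versus content on the lowercased word is an assumption.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token: only the surface form and POS tag.
Token = namedtuple("Token", ["text", "pos"])

CATEGORIES = ("noun", "verb", "function", "content",
              "adjective", "adverb", "pronoun")


def token_lemmas_sketch(tokens, lemma_dict, stopwords=()):
    """Group lemmas by category; returns a dict, not the documented tuple."""
    groups = {name: [] for name in CATEGORIES}
    for token in tokens:
        if token.pos in {"PUNCT", "SYM", "SPACE"}:
            continue  # Skip anything that is not a word.
        word = token.text.lower()  # Lowercasing is an assumption.
        lemma = lemma_dict.get(word, word)  # Fall back to the word itself.
        # Assumption: stopwords are function words, everything else is content.
        groups["function" if word in stopwords else "content"].append(lemma)
        if token.pos in {"NOUN", "PROPN"}:
            groups["noun"].append(lemma)
        elif token.pos == "VERB":
            groups["verb"].append(lemma)
        elif token.pos == "ADJ":
            groups["adjective"].append(lemma)
        elif token.pos == "ADV":
            groups["adverb"].append(lemma)
        elif token.pos == "PRON":
            groups["pronoun"].append(lemma)
    return groups


sentence = [Token("Los", "DET"), Token("gatos", "NOUN"),
            Token("corren", "VERB"), Token("rápido", "ADV"),
            Token(".", "PUNCT")]
lemmas = token_lemmas_sketch(
    sentence, {"gatos": "gato", "corren": "correr"}, {"los"}
)
print(lemmas["noun"], lemmas["verb"], lemmas["function"])
# ['gato'] ['correr'] ['los']
```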
-
TRUNAJOD.utils.is_adjective(pos_tag)
Return True if pos_tag is ADJ, False otherwise.
- Parameters
pos_tag (string) – Part of Speech tag
- Returns
True if POS is adjective
- Return type
boolean
-
TRUNAJOD.utils.is_adverb(pos_tag)
Return True if pos_tag is ADV, False otherwise.
- Parameters
pos_tag (string) – Part of Speech tag
- Returns
True if POS is adverb
- Return type
boolean
-
TRUNAJOD.utils.is_noun(pos_tag)
Return True if pos_tag is NOUN or PROPN, False otherwise.
- Parameters
pos_tag (string) – Part of Speech tag
- Returns
True if POS is noun or proper noun
- Return type
boolean
-
TRUNAJOD.utils.is_pronoun(pos_tag)
Return True if pos_tag is PRON, False otherwise.
- Parameters
pos_tag (string) – Part of Speech tag
- Returns
True if POS is pronoun
- Return type
boolean
-
TRUNAJOD.utils.is_stopword(word, stopwords)
Return True if word is in stopwords, False otherwise.
- Parameters
word (string) – Word to be checked
stopwords (List of strings) – stopword list
- Returns
True if word in stopwords
- Return type
boolean
-
TRUNAJOD.utils.is_verb(pos_tag)
Return True if pos_tag is VERB, False otherwise.
- Parameters
pos_tag (string) – Part of Speech tag
- Returns
True if POS is verb
- Return type
boolean
-
TRUNAJOD.utils.is_word(pos_tag)
Return True if pos_tag is not punctuation, False otherwise.
This method checks that the pos_tag does not belong to the following set: {'PUNCT', 'SYM', 'SPACE'}.
- Parameters
pos_tag (string) – Part of Speech tag
- Returns
True if POS is a word
- Return type
boolean
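The POS predicates in this module amount to simple membership tests on the tag string. The sketch below mirrors the documented behaviour of is_noun and is_word; the `_sketch` functions and `NON_WORD_TAGS` are hypothetical names, not the library's own code.

```python
# Tags treated as non-words, as listed in the description of is_word.
NON_WORD_TAGS = {"PUNCT", "SYM", "SPACE"}


def is_noun_sketch(pos_tag):
    """True for common nouns and proper nouns."""
    return pos_tag in {"NOUN", "PROPN"}


def is_word_sketch(pos_tag):
    """True for any tag outside the non-word set."""
    return pos_tag not in NON_WORD_TAGS


tags = ["DET", "NOUN", "PUNCT", "PROPN", "SPACE"]
print([t for t in tags if is_word_sketch(t)])  # ['DET', 'NOUN', 'PROPN']
print([t for t in tags if is_noun_sketch(t)])  # ['NOUN', 'PROPN']
```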
-
TRUNAJOD.utils.lemmatize(lemma_dict, word)
Lemmatize a word.
Lemmatizes a word using a lemmatizer represented as a dict whose (key, value) pairs are (word, lemma). An example of a lemma list can be found at https://github.com/michmech/lemmatization-lists.
If the word is not found in the dictionary, the word itself is returned as the lemma.
- Parameters
lemma_dict (Python dict) – A dict (word, lemma)
word (string) – The word to be lemmatized
- Returns
Lemmatized word
- Return type
string
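The lookup-with-fallback behaviour described above can be sketched with a plain dict. `lemmatize_sketch` and the sample entries are hypothetical and only illustrate the documented contract.

```python
def lemmatize_sketch(lemma_dict, word):
    """Look up the lemma for word, falling back to word itself when absent."""
    return lemma_dict.get(word, word)


lemma_dict = {"corriendo": "correr", "gatos": "gato"}
print(lemmatize_sketch(lemma_dict, "gatos"))  # gato
print(lemmatize_sketch(lemma_dict, "perro"))  # perro
```

Because the lookup is an exact key match in this sketch, casing matters: "Gatos" would be returned unchanged unless the caller lowercases it first.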
-
TRUNAJOD.utils.process_text(text, sent_tokenize)
Process text by tokenizing sentences given a tokenizer.
- Parameters
text (string) – Text to be processed
sent_tokenize (Python callable that returns list of strings) – Tokenizer
- Returns
Tokenized sentences
- Return type
List of strings
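To show what kind of callable fits the sent_tokenize parameter, here is a minimal sketch. Both functions are hypothetical: `simple_sent_tokenize` is a naive regex splitter used only for illustration, and `process_text_sketch` assumes the function simply delegates to the supplied tokenizer.

```python
import re


def simple_sent_tokenize(text):
    """Hypothetical tokenizer: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def process_text_sketch(text, sent_tokenize):
    """Delegate sentence splitting to the supplied callable."""
    return sent_tokenize(text)


print(process_text_sketch("Hola. ¿Cómo estás? Bien.", simple_sent_tokenize))
# ['Hola.', '¿Cómo estás?', 'Bien.']
```

Any callable that takes a string and returns a list of strings would fit the same slot, so a trained sentence tokenizer can replace the regex splitter without changing the call.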
-
TRUNAJOD.utils.read_text(filename)
Read a utf-8 encoded text file and return the text as a string.
This is just a utility function; it is not recommended if the text file does not fit in your available RAM. It is mostly intended for small text files.
- Parameters
filename (string) – File from which the text is to be read
- Returns
Text in the file
- Return type
string