Utils
Utility functions for TRUNAJOD library.
-
class TRUNAJOD.utils.SupportedModels
Enum for supported Doc models.
-
TRUNAJOD.utils.flatten(list_of_lists)
Flatten a list of lists.
This is a utility function that takes a list of lists and flattens it. For example, the list [[1, 2, 3], [4, 5]] would be converted into [1, 2, 3, 4, 5].
- Parameters
list_of_lists (Python List of Lists) – List to be flattened
- Returns
The list flattened
- Return type
Python List
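The flattening described above can be illustrated with a minimal sketch. The name flatten_sketch is hypothetical and the use of itertools.chain is an assumption; this is a stand-in that reproduces the documented behaviour, not the library source.

```python
from itertools import chain


def flatten_sketch(list_of_lists):
    """Flatten one level of nesting (illustrative stand-in, not the library source)."""
    # chain.from_iterable concatenates the inner lists in order.
    return list(chain.from_iterable(list_of_lists))


flat = flatten_sketch([[1, 2, 3], [4, 5]])
print(flat)  # [1, 2, 3, 4, 5]
```

Only one level of nesting is removed in this sketch; deeper lists inside the inner lists are left as they are.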
-
TRUNAJOD.utils.get_sentences_lemmas(docs, lemma_dict, stopwords=[])
Get lemmas from sentences.
Get different types of lemma measurements, such as noun lemmas, verb lemmas, and content lemmas. It calls TRUNAJOD.utils.get_token_lemmas() internally to extract the different lemma types for each sentence. This function extracts the following lemmas:
Noun lemmas
Verb lemmas
Function lemmas (provided as stopwords)
Content lemmas (anything that is not in stopwords)
Adjective lemmas
Adverb lemmas
Proper pronoun lemmas
- Parameters
docs (List of Spacy Doc (Doc.sents)) – List of sentences to be processed.
lemma_dict (dict) – Lemmatizer dictionary
stopwords (list, optional) – List of stopwords (function words), defaults to []
- Returns
List of lemmas from text
- Return type
List of Lists of str
-
TRUNAJOD.utils.get_stopwords(filename)
Read stopword list from file.
Assumes that the list is stored as newline-separated words and that the file is encoded as utf8. It is a utility in case you'd like to provide your own stopword list.
- Parameters
filename (string) – Name of the file containing stopword list.
- Returns
Stopwords read from the file
- Return type
set
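As a rough illustration of the file format described above, the following sketch reads newline-separated words from a UTF-8 file into a set. The name load_stopwords_sketch and the decision to skip blank lines are assumptions, not taken from the library.

```python
import os
import tempfile


def load_stopwords_sketch(filename):
    """Read newline-separated words from a UTF-8 file into a set."""
    with open(filename, encoding="utf8") as handle:
        # Strip surrounding whitespace; skipping blank lines is an assumption.
        return {line.strip() for line in handle if line.strip()}


# Demo with a throwaway file holding one word per line.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf8", suffix=".txt", delete=False
) as tmp:
    tmp.write("el\nla\nde\n\nque\n")
    path = tmp.name

stopwords = load_stopwords_sketch(path)
os.remove(path)
```

Returning a set makes later membership checks fast, which matches the documented return type.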
-
TRUNAJOD.utils.get_token_lemmas(doc, lemma_dict, stopwords=[])
Return lemmas from a sentence.
From a sentence, extracts the following lemmas:
Noun lemmas
Verb lemmas
Function lemmas (provided as stopwords)
Content lemmas (anything that is not in stopwords)
Adjective lemmas
Adverb lemmas
Proper pronoun lemmas
- Parameters
doc (Spacy Doc) – Doc containing tokens from text
lemma_dict (Dict) – Lemmatizer key-value pairs
stopwords (set/list, optional) – list of stopwords, defaults to []
- Returns
All lemmas for noun, verb, etc.
- Return type
tuple of lists
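The grouping described above might look roughly like the sketch below. Everything here is an assumption made for illustration: FakeToken is a hypothetical stand-in for a spaCy token, words are lowercased before lookup, the "Proper pronoun" group is mapped to the PRON tag, and the result is a dict rather than the documented tuple of lists so that each group is labelled.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token; only `text` and `pos_` are used.
FakeToken = namedtuple("FakeToken", ["text", "pos_"])

NON_WORD_TAGS = {"PUNCT", "SYM", "SPACE"}
POS_GROUPS = {
    "noun": {"NOUN", "PROPN"},
    "verb": {"VERB"},
    "adjective": {"ADJ"},
    "adverb": {"ADV"},
    "pronoun": {"PRON"},  # assumed mapping for "Proper pronoun lemmas"
}


def group_lemmas_sketch(tokens, lemma_dict, stopwords=()):
    """Bucket lemmas by category (illustrative; not the library source)."""
    groups = {name: [] for name in POS_GROUPS}
    groups["function"] = []
    groups["content"] = []
    for token in tokens:
        if token.pos_ in NON_WORD_TAGS:
            continue  # punctuation, symbols and spaces are not words
        word = token.text.lower()  # lowercasing is an assumption
        lemma = lemma_dict.get(word, word)
        # Function vs. content is decided by stopword membership.
        groups["function" if word in stopwords else "content"].append(lemma)
        # POS-based groups, assumed independent of the split above.
        for name, tags in POS_GROUPS.items():
            if token.pos_ in tags:
                groups[name].append(lemma)
    return groups


sentence = [
    FakeToken("Los", "DET"),
    FakeToken("gatos", "NOUN"),
    FakeToken("corren", "VERB"),
    FakeToken("rápido", "ADV"),
    FakeToken(".", "PUNCT"),
]
lemmas = group_lemmas_sketch(
    sentence,
    lemma_dict={"gatos": "gato", "corren": "correr", "los": "el"},
    stopwords={"los"},
)
```

In this sketch a word missing from lemma_dict, such as "rápido", is kept unchanged as its own lemma.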
-
TRUNAJOD.utils.is_adjective(token: spacy.tokens.token.Token)
Return True if the token's POS tag is ADJ, False otherwise.
- Parameters
token (spacy.tokens.token.Token) – Token whose part-of-speech tag is checked
- Returns
True if POS is adjective
- Return type
boolean
-
TRUNAJOD.utils.is_adverb(token: spacy.tokens.token.Token)
Return True if the token's POS tag is ADV, False otherwise.
- Parameters
token (spacy.tokens.token.Token) – Token whose part-of-speech tag is checked
- Returns
True if POS is adverb
- Return type
boolean
-
TRUNAJOD.utils.is_noun(token: spacy.tokens.token.Token)
Return True if the token's POS tag is NOUN or PROPN, False otherwise.
- Parameters
token (spacy.tokens.token.Token) – Token whose part-of-speech tag is checked
- Returns
True if POS is noun or proper noun
- Return type
boolean
-
TRUNAJOD.utils.is_pronoun(token: spacy.tokens.token.Token)
Return True if the token's POS tag is PRON, False otherwise.
- Parameters
token (spacy.tokens.token.Token) – Token whose part-of-speech tag is checked
- Returns
True if POS is pronoun
- Return type
boolean
-
TRUNAJOD.utils.is_stopword(word, stopwords)
Return True if word is in stopwords, False otherwise.
- Parameters
word (string) – Word to be checked
stopwords (List of strings) – stopword list
- Returns
True if word in stopwords
- Return type
boolean
-
TRUNAJOD.utils.is_verb(token: spacy.tokens.token.Token)
Return True if the token's POS tag is VERB, False otherwise.
- Parameters
token (spacy.tokens.token.Token) – Token whose part-of-speech tag is checked
- Returns
True if POS is verb
- Return type
boolean
-
TRUNAJOD.utils.is_word(token: spacy.tokens.token.Token)
Return True if the token's POS tag is not punctuation, False otherwise.
This method checks that the POS tag does not belong to the following set: {'PUNCT', 'SYM', 'SPACE'}.
- Parameters
token (spacy.tokens.token.Token) – Token whose part-of-speech tag is checked
- Returns
True if POS is a word
- Return type
boolean
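The POS-based checks documented above (is_noun, is_word and the others) can be pictured with a short sketch. FakeToken is a hypothetical stand-in for a spaCy token, and reading the tag from the pos_ attribute is an assumption about how these checks are performed.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token; real tokens expose `pos_` too.
FakeToken = namedtuple("FakeToken", ["text", "pos_"])


def is_noun_sketch(token):
    """True for common and proper nouns."""
    return token.pos_ in {"NOUN", "PROPN"}


def is_word_sketch(token):
    """True unless the tag marks punctuation, a symbol or whitespace."""
    return token.pos_ not in {"PUNCT", "SYM", "SPACE"}


tokens = [
    FakeToken("Madrid", "PROPN"),
    FakeToken("casa", "NOUN"),
    FakeToken(",", "PUNCT"),
    FakeToken("come", "VERB"),
]
nouns = [t.text for t in tokens if is_noun_sketch(t)]
words = [t.text for t in tokens if is_word_sketch(t)]
```

With a real spaCy pipeline, the same checks would be applied to the tokens of a Doc instead of FakeToken instances.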
-
TRUNAJOD.utils.lemmatize(lemma_dict, word)
Lemmatize a word.
Lemmatizes a word using a lemmatizer represented as a dict with (word, lemma) as (key, value) pairs. An example of a lemma list can be found at https://github.com/michmech/lemmatization-lists.
If the word is not found in the dictionary, the lemma returned will be the word.
- Parameters
lemma_dict (Python dict) – A dict (word, lemma)
word (string) – The word to be lemmatized
- Returns
Lemmatized word
- Return type
string
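The lookup-with-fallback behaviour described above can be sketched in one line with dict.get. The name lemmatize_sketch and the sample dictionary are hypothetical; this mirrors the documented behaviour rather than the library source.

```python
def lemmatize_sketch(lemma_dict, word):
    """Return the lemma for `word`, or `word` itself when it is not in the dict."""
    return lemma_dict.get(word, word)


# Tiny hand-made dictionary, for illustration only.
lemma_dict = {"corriendo": "correr", "gatos": "gato"}

known = lemmatize_sketch(lemma_dict, "gatos")
unknown = lemmatize_sketch(lemma_dict, "perro")
```

Here known is "gato", while unknown stays "perro" because the word is missing from the dictionary.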
-
TRUNAJOD.utils.process_text(text, sent_tokenize)
Process text by tokenizing sentences given a tokenizer.
- Parameters
text (string) – Text to be processed
sent_tokenize (Python callable that returns list of strings) – Tokenizer
- Returns
Tokenized sentences
- Return type
List of strings
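Any callable that maps a string to a list of sentence strings fits the sent_tokenize parameter described above. The sketch below assumes the function simply applies that callable; process_text_sketch and the naive regex tokenizer are hypothetical names used only for illustration.

```python
import re


def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' followed by whitespace (a deliberately naive rule)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def process_text_sketch(text, sent_tokenize):
    """Apply the supplied tokenizer to the text (assumed behaviour)."""
    return sent_tokenize(text)


sentences = process_text_sketch(
    "Hola mundo. ¿Cómo estás? Bien.", naive_sent_tokenize
)
```

Passing the tokenizer as an argument keeps the function independent of any particular NLP library; a more robust sentence splitter can be swapped in without other changes.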
-
TRUNAJOD.utils.read_text(filename)
Read a utf-8 encoded text file and return the text as a string.
This is just a utility function. It is not recommended if the text file does not fit in your available RAM, and it is mostly used for small text files.
- Parameters
filename (string) – File from which the text is read
- Returns
Text in the file
- Return type
string
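A minimal sketch of this kind of helper is shown below. The name read_text_sketch is hypothetical, and the body is an assumption that follows the description above: the whole file is read into memory in one call, which is why it only suits small files.

```python
import os
import tempfile


def read_text_sketch(filename):
    """Read a whole UTF-8 text file into a single string."""
    with open(filename, encoding="utf-8") as handle:
        return handle.read()  # loads the entire file into memory


# Demo with a throwaway file.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write("Hola, ¿qué tal?\n")
    path = tmp.name

content = read_text_sketch(path)
os.remove(path)
```

For files larger than the available RAM, iterating over the file line by line would be the usual alternative.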