Type Token Ratios
Type Token Ratios module.
Type token ratios (TTR) are a measurement of lexical diversity. The TTR is defined as the number of unique tokens (types) divided by the total number of tokens. This measurement is bounded between 0 and 1: if there is no repetition in the text it is 1, and as repetition grows without bound it tends to 0. This measurement is not recommended when comparing texts of different lengths, because the TTR tends to flatten as the number of tokens increases.
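As an illustration of the definition and of its sensitivity to text length, here is a minimal sketch in plain Python. The name `ttr_sketch` is hypothetical and is not part of TRUNAJOD; returning 0.0 for empty input is a choice made for this example only.

```python
from typing import Sequence


def ttr_sketch(tokens: Sequence[str]) -> float:
    """Unique tokens divided by total tokens (0.0 for empty input, by choice)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


sentence = "the cat sat on the mat".split()
short_ttr = ttr_sketch(sentence)       # 5 types over 6 tokens
long_ttr = ttr_sketch(sentence * 10)   # still 5 types, but now 60 tokens
print(short_ttr, long_ttr)
```

Repeating the same sentence adds tokens but no new types, so the ratio drops even though the vocabulary is unchanged. This is why raw TTR values from texts of different lengths are hard to compare.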
-
TRUNAJOD.ttr.d_estimate(doc: spacy.tokens.doc.Doc, min_range: int = 35, max_range: int = 50, trials: int = 5) → float
Compute the D measurement for lexical diversity.
The measurement is based on [RM00]. We pick n tokens, varying n from min_range up to max_range. For each n we do the following:
1. Sample n tokens without replacement.
2. Compute the TTR of the sample.
3. Repeat steps 1 and 2 trials times.
4. Compute the average TTR.
At this point, we have a set of points (n, ttr). We then fit these observations to the following model, where N is the sample size n:
\[TTR = \displaystyle\frac{D}{N}\left[\sqrt{1 + 2\frac{N}{D}} - 1\right]\]
The fit yields an estimate of the D parameter, using least squares as the fitting criterion.
- Parameters
doc (Doc) – SpaCy doc of the text.
min_range (int, optional) – Lower bound for n, defaults to 35
max_range (int, optional) – Upper bound for n, defaults to 50
trials (int, optional) – Number of trials to estimate TTR, defaults to 5
- Raises
ValueError – If invalid range is provided.
- Returns
D metric
- Return type
float
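The sampling and fitting procedure described for d_estimate can be sketched in plain Python as follows. This is not TRUNAJOD's implementation: all function names are hypothetical, the input is a list of strings rather than a spaCy Doc, the range is assumed to include max_range, the conditions that raise ValueError are guesses, and the least-squares fit is approximated by a simple grid search over candidate D values instead of a numerical optimizer.

```python
import math
import random
from typing import List, Sequence, Tuple


def model_ttr(n: int, d: float) -> float:
    """TTR predicted by the model for sample size n and parameter d."""
    return (d / n) * (math.sqrt(1 + 2 * n / d) - 1)


def average_ttr(tokens: List[str], n: int, trials: int,
                rng: random.Random) -> float:
    """Mean TTR over `trials` samples of n tokens drawn without replacement."""
    total = 0.0
    for _ in range(trials):
        sample = rng.sample(tokens, n)
        total += len(set(sample)) / n
    return total / trials


def fit_d(points: List[Tuple[int, float]], lo: float = 0.5,
          hi: float = 200.0, steps: int = 4000) -> float:
    """Return the grid candidate d with the smallest sum of squared residuals."""
    def sse(d: float) -> float:
        return sum((model_ttr(n, d) - ttr) ** 2 for n, ttr in points)

    candidates = [lo + i * (hi - lo) / steps for i in range(steps + 1)]
    return min(candidates, key=sse)


def d_estimate_sketch(tokens: Sequence[str], min_range: int = 35,
                      max_range: int = 50, trials: int = 5,
                      seed: int = 0) -> float:
    """Estimate D from (n, average TTR) points; assumptions as noted above."""
    tokens = list(tokens)
    # Assumed validity checks; the library's exact conditions are not documented here.
    if min_range < 1 or min_range > max_range or len(tokens) < max_range:
        raise ValueError("invalid range for the given text")
    rng = random.Random(seed)
    points = [(n, average_ttr(tokens, n, trials, rng))
              for n in range(min_range, max_range + 1)]
    return fit_d(points)
```

Because each point is an average over random samples, the estimate varies between runs unless the random generator is seeded, which is why the sketch takes a `seed` argument.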
-
TRUNAJOD.ttr.lexical_diversity_mtld(doc: spacy.tokens.doc.Doc, model_name: str = 'spacy', ttr_segment: float = 0.72) → float
Compute MTLD lexical diversity in a bi-directional fashion.
- Parameters
doc (NLP Doc) – Processed text
model_name (str) – Determines which model is used (spacy or stanza)
ttr_segment (float) – Threshold for TTR mean computation
- Returns
Bi-directional lexical diversity MTLD
- Return type
float
-
TRUNAJOD.ttr.one_side_lexical_diversity_mtld(doc: spacy.tokens.doc.Doc, model_name: str = 'spacy', ttr_segment: float = 0.72) → float
Lexical diversity per MTLD.
- Parameters
doc (NLP Doc) – Tokenized text
model_name (str) – Determines which model is used (spacy or stanza)
ttr_segment (float) – Threshold for TTR mean computation
- Returns
MTLD lexical diversity
- Return type
float
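For orientation, the following sketch shows how MTLD is commonly described: the text is read token by token, a "factor" is counted each time the running TTR falls to the threshold (ttr_segment), and the score is the number of tokens divided by the number of factors. The bi-directional variant averages a forward and a backward pass. This follows the usual description of the measure and may differ from TRUNAJOD's implementation in details. The function names are hypothetical, the input is a list of strings rather than a Doc, and both the `<=` comparison and the value returned when no factor is found are assumptions of this example.

```python
from typing import Sequence


def one_side_mtld_sketch(tokens: Sequence[str],
                         ttr_segment: float = 0.72) -> float:
    """Single-pass MTLD: number of tokens divided by the factor count."""
    factors = 0.0
    types = set()
    count = 0
    for token in tokens:
        count += 1
        types.add(token)
        if len(types) / count <= ttr_segment:
            # The running TTR reached the threshold: close this factor.
            factors += 1
            types = set()
            count = 0
    if count > 0:
        # Partial credit for the unfinished segment at the end of the text.
        remainder_ttr = len(types) / count
        factors += (1 - remainder_ttr) / (1 - ttr_segment)
    if factors == 0:
        # The score is undefined here; returning 0.0 is this sketch's convention.
        return 0.0
    return len(tokens) / factors


def mtld_sketch(tokens: Sequence[str], ttr_segment: float = 0.72) -> float:
    """Bi-directional MTLD: mean of the forward and backward passes."""
    forward = one_side_mtld_sketch(tokens, ttr_segment)
    backward = one_side_mtld_sketch(list(reversed(tokens)), ttr_segment)
    return (forward + backward) / 2
```

Running the pass in both directions reduces the dependence of the score on where the segments happen to start.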
-
TRUNAJOD.ttr.type_token_ratio(word_list: List[str]) → float
Return Type Token Ratio of a word list.
- Parameters
word_list (List of strings) – List of words
- Returns
TTR of the word list
- Return type
float
-
TRUNAJOD.ttr.word_variation_index(doc: spacy.tokens.doc.Doc) → float
Compute Word Variation Index.
Word variation index might be thought of as the density of ideas in a text. It is computed as:
\[WVI = \displaystyle\frac{\log\left(n(w)\right)} {\log\left(2 - \frac{\log(n(vw))}{\log(n(w))}\right)}\]
where n(w) is the number of words in the text, and n(vw) is the number of unique words in the text.
- Parameters
doc (Doc) – Document to be processed
- Returns
Word variation index
- Return type
float
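The formula can be evaluated directly from word counts, as in the sketch below. The name `wvi_sketch` is hypothetical and the input is a list of strings rather than a Doc. The formula is undefined when the text has fewer than two words (the numerator's inner logarithm is zero) or when every word is unique (the denominator is log(1) = 0); raising ValueError in those cases is a choice made for this example, not documented library behaviour.

```python
import math
from typing import Sequence


def wvi_sketch(words: Sequence[str]) -> float:
    """Word variation index computed from n(w) and n(vw)."""
    n_w = len(words)         # number of words
    n_vw = len(set(words))   # number of unique words
    if n_w < 2 or n_vw == n_w:
        raise ValueError("WVI is undefined for this input")
    ratio = math.log(n_vw) / math.log(n_w)
    return math.log(n_w) / math.log(2 - ratio)


# 100 words drawn from a vocabulary of 10 distinct words.
words = ["w%d" % (i % 10) for i in range(100)]
print(wvi_sketch(words))
```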
-
TRUNAJOD.ttr.yule_k(doc: spacy.tokens.doc.Doc) → float
Compute Yule’s K from a text.
Yule’s K is defined as follows [Yul14]:
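\[K=10^{4}\displaystyle\frac{\left(\sum_{r} r^2V_r\right)-N}{N^2}\]
where V_r is the number of distinct words (types) occurring r times and N is the total number of tokens. This is a measurement of lexical diversity; larger values of K indicate more repetition, that is, lower diversity.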
- Parameters
doc (Doc) – Processed spaCy Doc
- Returns
Yule’s K of the text
- Return type
float
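A minimal sketch of the computation, assuming V_r counts the distinct words that occur exactly r times and N is the total number of tokens. The name `yule_k_sketch` is hypothetical, the input is a list of strings rather than a Doc, and raising ValueError on empty input is a choice made here.

```python
from collections import Counter
from typing import Sequence


def yule_k_sketch(tokens: Sequence[str]) -> float:
    """Yule's K from a frequency spectrum built with two Counters."""
    n = len(tokens)
    if n == 0:
        raise ValueError("Yule's K is undefined for an empty text")
    word_freqs = Counter(tokens)             # word -> number of occurrences
    spectrum = Counter(word_freqs.values())  # r -> V_r
    weighted = sum(r * r * v_r for r, v_r in spectrum.items())
    return 1e4 * (weighted - n) / (n * n)


print(yule_k_sketch(["a", "a", "b"]))  # V_1 = 1, V_2 = 1, N = 3
```

When every word occurs exactly once, the sum equals N and K is 0; repeated words push K upwards.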