Type Token Ratios

Type Token Ratios module.

Type token ratios (TTR) are a measurement of lexical diversity. TTR is defined as the number of unique tokens (types) divided by the total number of tokens. The measurement is bounded between 0 and 1: it equals 1 when the text has no repetition, and it tends to 0 as repetition grows without bound. TTR is not recommended for comparing texts of different lengths, because it tends to flatten as the number of tokens increases.
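As a minimal sketch of the definition above (not the module's own implementation), the ratio can be computed from a plain list of tokens. The helper name `simple_ttr` and the choice to return 0.0 for an empty list are assumptions made for illustration.

```python
def simple_ttr(tokens):
    """Unique tokens divided by total tokens (hypothetical helper)."""
    if not tokens:
        return 0.0  # assumption: treat an empty text as having TTR 0
    return len(set(tokens)) / len(tokens)


text = "the cat sat on the mat".split()
short_ttr = simple_ttr(text)       # 5 unique tokens out of 6
long_ttr = simple_ttr(text * 10)   # same 5 unique tokens out of 60
```

Repeating the same text lowers the ratio even though the vocabulary is unchanged, which illustrates why TTR is hard to compare across texts of different lengths.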

TRUNAJOD.ttr.d_estimate(doc: spacy.tokens.doc.Doc, min_range: int = 35, max_range: int = 50, trials: int = 5) → float

Compute D measurement for lexical diversity.

The measurement is based on [RM00]. We pick n tokens, varying n from min_range up to max_range. For each n we do the following:

1. Sample n tokens without replacement
2. Compute TTR
3. Repeat steps 1 and 2 trials times
4. Compute the average TTR

At this point, we have a set of points (n, ttr). We then fit these observations to the following model, where N is the sample size n:

\[TTR = \displaystyle\frac{D}{N}\left[\sqrt{1 + 2\frac{N}{D}} - 1\right]\]

The fit yields an estimate of the D parameter, using least squares as the fitting criterion.
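The sampling and fitting procedure can be sketched with the standard library alone. This is an illustration under stated assumptions, not the module's code: the helper names (`d_model`, `average_ttr_points`, `fit_d`) are hypothetical, the range of n is treated as inclusive of max_range, and the least-squares minimization uses a golden-section search over D, whereas the library may rely on a different optimizer.

```python
import math
import random


def d_model(n, d):
    """TTR predicted by the model for sample size n and parameter D."""
    return (d / n) * (math.sqrt(1.0 + 2.0 * n / d) - 1.0)


def average_ttr_points(tokens, min_range=35, max_range=50, trials=5, seed=0):
    """Return [(n, mean TTR)] for n in min_range..max_range (inclusive here)."""
    if trials < 1 or min_range < 1 or not min_range <= max_range <= len(tokens):
        raise ValueError("invalid range or number of trials")
    rng = random.Random(seed)
    points = []
    for n in range(min_range, max_range + 1):
        ttrs = []
        for _ in range(trials):                 # step 3: repeat `trials` times
            sample = rng.sample(tokens, n)      # step 1: sample without replacement
            ttrs.append(len(set(sample)) / n)   # step 2: TTR of the sample
        points.append((n, sum(ttrs) / trials))  # step 4: average TTR
    return points


def fit_d(points, lo=1e-3, hi=1e4, tol=1e-6):
    """Least-squares estimate of D by golden-section search (an assumption)."""
    def sse(d):
        return sum((d_model(n, d) - ttr) ** 2 for n, ttr in points)

    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        e = a + ratio * (b - a)
        if sse(c) < sse(e):
            b = e   # the minimum lies in [a, e]
        else:
            a = c   # the minimum lies in [c, b]
    return (a + b) / 2.0
```

Because the model increases monotonically with D for every n, the squared error has a single minimum when the points are consistent with the model, which is what makes a one-dimensional search reasonable here. The search bounds `lo` and `hi` are arbitrary choices for the sketch.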

Parameters

doc (Doc) – spaCy Doc of the text.
min_range (int, optional) – Lower bound for n, defaults to 35
max_range (int, optional) – Upper bound for n, defaults to 50
trials (int, optional) – Number of trials to estimate TTR, defaults to 5

Raises

ValueError – If invalid range is provided.

Returns

D metric

Return type

float

TRUNAJOD.ttr.lexical_diversity_mtld(doc: spacy.tokens.doc.Doc, model_name: str = 'spacy', ttr_segment: float = 0.72) → float

Compute MTLD lexical diversity in a bi-directional fashion.

Parameters

doc (NLP Doc) – Processed text
model_name (str) – Determines which model is used (spacy or stanza)
ttr_segment (float) – Threshold for TTR mean computation

Returns

Bi-directional lexical diversity MTLD

Return type

float

TRUNAJOD.ttr.one_side_lexical_diversity_mtld(doc: spacy.tokens.doc.Doc, model_name: str = 'spacy', ttr_segment: float = 0.72) → float

Lexical diversity per MTLD.

Parameters

doc (NLP Doc) – Tokenized text
model_name (str) – Determines which model is used (spacy or stanza)
ttr_segment (float) – Threshold for TTR mean computation

Returns

MTLD lexical diversity

Return type

float
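The two MTLD functions above do not spell out the procedure, so the following is only a sketch of MTLD as it is commonly described, not necessarily what the module does. It assumes that a segment ("factor") ends when the running TTR drops below the threshold, that the leftover segment contributes a partial factor, and that the bi-directional value is the mean of a forward and a reversed pass. The helper names and the 0.0 fallback for texts without any factor are hypothetical.

```python
def one_sided_mtld(tokens, threshold=0.72):
    """One pass of MTLD over a token list (hypothetical helper)."""
    factors = 0.0
    seen = set()
    count = 0
    for token in tokens:
        count += 1
        seen.add(token)
        if len(seen) / count < threshold:
            factors += 1.0          # a full factor is complete
            seen = set()            # start a new segment
            count = 0
    if count > 0:
        # Partial factor for the remaining, unfinished segment
        ttr = len(seen) / count
        factors += (1.0 - ttr) / (1.0 - threshold)
    if factors == 0.0:
        return 0.0                  # assumption: arbitrary value when undefined
    return len(tokens) / factors


def bidirectional_mtld(tokens, threshold=0.72):
    """Mean of the forward and reversed passes (assumed convention)."""
    forward = one_sided_mtld(tokens, threshold)
    backward = one_sided_mtld(list(reversed(tokens)), threshold)
    return (forward + backward) / 2.0
```

Here `threshold` plays the role of the ttr_segment parameter. Running the pass in both directions reduces the dependence on where the segments happen to start.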

TRUNAJOD.ttr.type_token_ratio(word_list: List[str]) → float

Return Type Token Ratio of a word list.

Parameters: word_list (List of strings) – List of words
Returns: TTR of the word list
Return type: float

TRUNAJOD.ttr.word_variation_index(doc: spacy.tokens.doc.Doc) → float

Compute Word Variation Index.

The word variation index can be thought of as the density of ideas in a text. It is computed as:

\[WVI = \displaystyle\frac{\log\left(n(w)\right)}{\log\left(2 - \frac{\log(n(vw))}{\log(n(w))}\right)}\]

Where n(w) is the number of words in the text, and n(vw) is the number of unique words in the text.

Parameters: doc (Doc) – Document to be processed
Returns: Word variation index
Return type: float
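The formula can be evaluated directly on a list of words. This is a sketch rather than the module's implementation: the helper name `wvi_sketch` is hypothetical, and the handling of the two degenerate cases (fewer than two words, and no repeated words, where the denominator is log(1) = 0) is an assumption.

```python
import math


def wvi_sketch(words):
    """Word variation index from a list of words (hypothetical helper)."""
    n_w = len(words)        # n(w): number of words
    n_vw = len(set(words))  # n(vw): number of unique words
    if n_w < 2:
        raise ValueError("need at least two words, since log(n(w)) must be nonzero")
    if n_vw == n_w:
        return math.inf     # assumption: denominator is log(1) = 0
    inner = 2.0 - math.log(n_vw) / math.log(n_w)
    return math.log(n_w) / math.log(inner)
```

For example, four words with two unique words give an inner term of 1.5, so the index is log(4) / log(1.5).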

TRUNAJOD.ttr.yule_k(doc: spacy.tokens.doc.Doc) → float

Compute Yule’s K from a text.

Yule’s K is defined as follows [Yul14]:

\[K = 10^{4}\displaystyle\frac{\left(\sum_{r} r^2 V_r\right) - N}{N^2}\]

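Where V_r is the number of types (distinct tokens) occurring r times, and N is the total number of tokens. This is a measurement of lexical diversity in which larger values indicate more repetition.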

Parameters: doc (Doc) – Processed spaCy Doc
Returns: Yule’s K of the text
Return type: float
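As a worked illustration of the formula, Yule's K can be computed from a token list by first building the frequency spectrum V_r. The helper name `yule_k_sketch` is hypothetical, and any normalization the module applies to the tokens of a Doc (for example lowercasing or filtering) is not modeled here.

```python
from collections import Counter


def yule_k_sketch(tokens):
    """Yule's K from a list of tokens (hypothetical helper)."""
    n = len(tokens)  # N: total number of tokens
    if n == 0:
        raise ValueError("Yule's K is undefined for an empty text")
    type_freqs = Counter(tokens)              # type -> frequency r
    spectrum = Counter(type_freqs.values())   # r -> V_r
    s2 = sum(r * r * v_r for r, v_r in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)
```

For the tokens a, a, b we have N = 3, V_1 = 1 and V_2 = 1, so the sum is 1 + 4 = 5 and K = 10^4 (5 - 3) / 9. A text with no repeated tokens gives K = 0.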

RM00: Brian Richards and David Malvern. Measuring vocabulary richness in teenage learners of French. British Educational Research Association Annual Conference, 2000.
Yul14: C Udny Yule. The statistical study of literary vocabulary. Cambridge University Press, 2014.