Type Token Ratios

Type Token Ratios module.

Type token ratios (TTR) are a measurement of lexical diversity. The TTR is defined as the number of unique tokens (types) divided by the total number of tokens, so it is bounded between 0 and 1. If no token is repeated in the text, the TTR is 1; as repetition increases, it tends to 0. This measurement is not recommended for comparing texts of different lengths, because the TTR tends to flatten as the number of tokens increases.
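As a rough illustration of the definition, the ratio can be computed with a set. This is a minimal sketch, not the library's implementation: the name `simple_ttr` and the choice to raise on empty input are assumptions made for the example.

```python
from typing import List


def simple_ttr(word_list: List[str]) -> float:
    """Number of unique tokens divided by the total number of tokens."""
    if not word_list:
        # Assumption: the ratio is undefined for an empty list, so we raise.
        raise ValueError("word_list must not be empty")
    return len(set(word_list)) / len(word_list)


# No repetition gives 1.0; heavy repetition pushes the ratio toward 0.
no_repetition = simple_ttr(["the", "cat", "sat"])
heavy_repetition = simple_ttr(["la"] * 100)
```

Repeating the same text twice halves its TTR, since the number of types stays fixed while the number of tokens doubles. That is the length sensitivity mentioned above.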

TRUNAJOD.ttr.d_estimate(doc: spacy.tokens.doc.Doc, min_range: int = 35, max_range: int = 50, trials: int = 5) → float

Compute D measurement for lexical diversity.

The measurement is based on [RM00]. We vary the sample size n from min_range up to max_range. For each n we do the following:

  1. Sample n tokens without replacement

  2. Compute TTR

  3. Repeat steps 1 and 2 trials times

  4. Compute the average TTR

At this point, we have a set of points (n, ttr). We then fit these observations to the following model, where N is the sample size n:

\[TTR = \displaystyle\frac{D}{N}\left[\sqrt{1 + 2\frac{N}{D}} - 1\right]\]

The fit yields an estimate of the D parameter, using least squares as the fitting criterion.
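The sampling steps and the fit can be sketched with the standard library alone. This is a hedged illustration, not the library's code. The names `ttr_model`, `average_ttr_points` and `fit_d` are hypothetical. Treating max_range as inclusive and the exact range check are assumptions. A plain grid search over D stands in for whatever least-squares optimizer the library actually uses.

```python
import math
import random
from typing import List, Sequence, Tuple


def ttr_model(n: float, d: float) -> float:
    """TTR predicted by the model for sample size n and parameter d."""
    return (d / n) * (math.sqrt(1.0 + 2.0 * n / d) - 1.0)


def average_ttr_points(
    tokens: Sequence[str],
    min_range: int = 35,
    max_range: int = 50,
    trials: int = 5,
    seed: int = 0,
) -> List[Tuple[int, float]]:
    """Steps 1 to 4: average TTR of random samples for each sample size n."""
    # Assumption: this is our guess at what counts as an invalid range.
    if not 0 < min_range <= max_range <= len(tokens):
        raise ValueError("invalid range for the number of tokens available")
    rng = random.Random(seed)
    pool = list(tokens)
    points = []
    for n in range(min_range, max_range + 1):  # assumption: inclusive bound
        ttrs = []
        for _ in range(trials):
            sample = rng.sample(pool, n)  # sampling without replacement
            ttrs.append(len(set(sample)) / n)
        points.append((n, sum(ttrs) / trials))
    return points


def fit_d(
    points: Sequence[Tuple[int, float]],
    d_max: float = 200.0,
    step: float = 0.1,
) -> float:
    """Least-squares estimate of D by exhaustive search over a grid."""
    best_d, best_error = step, math.inf
    for k in range(1, int(round(d_max / step)) + 1):
        d = k * step
        error = sum((ttr - ttr_model(n, d)) ** 2 for n, ttr in points)
        if error < best_error:
            best_d, best_error = d, error
    return best_d
```

With a token list of at least max_range items, `fit_d(average_ttr_points(tokens))` gives a D estimate on the grid. The fixed seed keeps the sketch reproducible; the library's sampling may be seeded differently or not at all.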

Parameters
  • doc (Doc) – spaCy doc of the text.

  • min_range (int, optional) – Lower bound for n, defaults to 35

  • max_range (int, optional) – Upper bound for n, defaults to 50

  • trials (int, optional) – Number of trials to estimate TTR, defaults to 5

Raises

ValueError – If invalid range is provided.

Returns

D metric

Return type

float

TRUNAJOD.ttr.lexical_diversity_mtld(doc: spacy.tokens.doc.Doc, model_name: str = 'spacy', ttr_segment: float = 0.72) → float

Compute MTLD lexical diversity in a bi-directional fashion.

Parameters
  • doc (NLP Doc) – Processed text

  • model_name (str) – Determines which model is used (spacy or stanza)

  • ttr_segment (float) – Threshold for TTR mean computation

Returns

Bi-directional lexical diversity MTLD

Return type

float

TRUNAJOD.ttr.one_side_lexical_diversity_mtld(doc: spacy.tokens.doc.Doc, model_name: str = 'spacy', ttr_segment: float = 0.72) → float

Lexical diversity per MTLD.
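The procedure itself is not spelled out in this reference. As a rough sketch, MTLD as commonly described in the literature walks through the tokens, counts a "factor" each time the running TTR falls below the threshold, adds a partial factor for the leftover tokens, and divides the number of tokens by the factor count. The bi-directional variant averages a forward and a backward pass. The function names, the strict comparison, and returning infinity when no repetition occurs are assumptions for illustration and may differ from TRUNAJOD's behavior.

```python
import math
from typing import Sequence


def one_side_mtld_sketch(tokens: Sequence[str], ttr_segment: float = 0.72) -> float:
    """One pass of MTLD over tokens, in the order given."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    factors = 0.0
    types = set()
    count = 0
    for token in tokens:
        count += 1
        types.add(token)
        if len(types) / count < ttr_segment:
            # The running TTR fell below the threshold: close this segment.
            factors += 1.0
            types = set()
            count = 0
    if count > 0:
        # Partial factor for the remaining, unfinished segment.
        remainder_ttr = len(types) / count
        factors += (1.0 - remainder_ttr) / (1.0 - ttr_segment)
    if factors == 0.0:
        # Assumption: with no repetition at all the score is unbounded.
        return math.inf
    return len(tokens) / factors


def bidirectional_mtld_sketch(tokens: Sequence[str], ttr_segment: float = 0.72) -> float:
    """Mean of a forward pass and a backward pass."""
    forward = one_side_mtld_sketch(tokens, ttr_segment)
    backward = one_side_mtld_sketch(list(reversed(tokens)), ttr_segment)
    return (forward + backward) / 2.0
```

Higher values mean the text sustains a TTR above the threshold for longer stretches, which is read as greater lexical diversity.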

Parameters
  • doc (NLP Doc) – Tokenized text

  • model_name (str) – Determines which model is used (spacy or stanza)

  • ttr_segment (float) – Threshold for TTR mean computation

Returns

MTLD lexical diversity

Return type

float

TRUNAJOD.ttr.type_token_ratio(word_list: List[str]) → float

Return Type Token Ratio of a word list.

Parameters

word_list (List of strings) – List of words

Returns

TTR of the word list

Return type

float

TRUNAJOD.ttr.word_variation_index(doc: spacy.tokens.doc.Doc) → float

Compute Word Variation Index.

The word variation index can be thought of as the density of ideas in a text. It is computed as:

\[WVI = \displaystyle\frac{\log\left(n(w)\right)} {\log\left(2 - \frac{\log(n(vw))}{\log(n(w))}\right)}\]

Where n(w) is the number of words in the text, and n(vw) is the number of unique words in the text.
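The formula can be evaluated directly on a list of words. This is a minimal sketch under stated assumptions: `wvi_sketch` is a hypothetical name, words are compared as plain strings, and the function returns infinity when every word is unique, because the denominator is then log(1) = 0. How the library tokenizes the Doc and handles that edge case is not stated here.

```python
import math
from typing import Sequence


def wvi_sketch(words: Sequence[str]) -> float:
    """Word variation index from n(w) total words and n(vw) unique words."""
    n_w = len(words)
    n_vw = len(set(words))
    if n_w == 0:
        raise ValueError("words must not be empty")
    if n_vw == n_w:
        # Assumption: the denominator is log(1) = 0 here, so we report
        # infinity instead of dividing by zero.
        return math.inf
    ratio = math.log(n_vw) / math.log(n_w)
    return math.log(n_w) / math.log(2.0 - ratio)


# 16 words, 4 of them unique.
example_wvi = wvi_sketch(["a", "b", "c", "d"] * 4)
```

In the example, log(4)/log(16) is 0.5, so the index is log(16)/log(1.5).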

Parameters

doc (Doc) – Document to be processed

Returns

Word variation index

Return type

float

TRUNAJOD.ttr.yule_k(doc: spacy.tokens.doc.Doc) → float

Compute Yule’s K from a text.

Yule’s K is defined as follows [Yul14]:

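\[K = 10^{4}\displaystyle\frac{\left(\sum_{r} r^2 V_r\right) - N}{N^2}\]

Where V_r is the number of distinct tokens (types) occurring r times and N is the total number of tokens. This is a measurement of lexical diversity; larger values of K indicate more repetition, that is, lower diversity.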
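A short sketch of the computation on a list of tokens, reading the sum as covering only the r² V_r terms, with N subtracted once afterwards. The name `yule_k_sketch` and the use of plain strings instead of a spaCy Doc are assumptions made for illustration.

```python
from collections import Counter
from typing import Sequence


def yule_k_sketch(tokens: Sequence[str]) -> float:
    """Yule's K from the frequency spectrum of a token list."""
    n = len(tokens)
    if n == 0:
        raise ValueError("tokens must not be empty")
    token_freq = Counter(tokens)           # token -> number of occurrences r
    spectrum = Counter(token_freq.values())  # r -> V_r, types occurring r times
    sum_r2_vr = sum(r * r * v_r for r, v_r in spectrum.items())
    return 1e4 * (sum_r2_vr - n) / (n * n)


# "a" occurs twice and "b" once: V_1 = 1, V_2 = 1, N = 3.
example_k = yule_k_sketch(["a", "a", "b"])
```

For the example, the sum is 1·1 + 4·1 = 5, so K = 10⁴ · (5 − 3) / 9. A text with no repeated tokens gives K = 0.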

Parameters

doc (Doc) – Processed spaCy Doc

Returns

Yule’s K of the text

Return type

float

RM00

Brian Richards and David Malvern. Measuring vocabulary richness in teenage learners of French. British Educational Research Association Annual Conference, 2000.

Yul14

C Udny Yule. The statistical study of literary vocabulary. Cambridge University Press, 2014.