Surface Proxies

Surface proxies of TRUNAJOD.

Surface proxies are shallow measurements of a text that approximate its intrinsic properties, such as cohesion, coherence, and complexity. Examples include, but are not limited to, the number of sentences and the number of syllables.

TRUNAJOD.surface_proxies.add_periphrasis(doc, periphrasis_type, periphrasis_list)

Add periphrasis to spaCy tags.

One drawback of spaCy is that it does not properly handle periphrasis in texts (in our case, Spanish texts). This function adds periphrasis to the text in order to improve further analysis, such as clause segmentation and clause counting. It is used by TRUNAJOD.surface_proxies.fix_parse_tree().

Parameters
  • doc (Spacy Doc) – Tokenized text

  • periphrasis_type (string) – Periphrasis type

  • periphrasis_list (List of strings) – List of periphrasis

Returns

Corrected doc

Return type

Spacy Doc

TRUNAJOD.surface_proxies.average_clause_length(doc, infinitive_map)

Return average clause length (heuristic).

This measurement is computed as the ratio: # of words / # of clauses. Clauses are counted heuristically; refer to TRUNAJOD.surface_proxies.clause_count() for more details.

Parameters
  • doc (Spacy Doc) – Text to be processed

  • infinitive_map – Lexicon mapping conjugated verb forms to their infinitives.

Returns

Average clause length

Return type

float

TRUNAJOD.surface_proxies.average_sentence_length(doc)

Return average sentence length.

This measurement is computed as the ratio of: # of words / # of sentences.

Parameters

doc (Spacy Doc) – Text to be processed.

Returns

Average sentence length

Return type

float

TRUNAJOD.surface_proxies.average_word_length(doc)

Return average word length.

Computed as the ratio: # of chars / # of words.

Parameters

doc (Spacy Doc) – Text to be processed.

Returns

Average word length

Return type

float

TRUNAJOD.surface_proxies.char_count(doc)

Return number of chars in a text.

This count excludes any token whose Token.pos_ tag is PUNCT or SPACE.

Parameters

doc (Spacy Doc) – Text to be processed.

Returns

Char count

Return type

int
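As a rough illustration of the exclusion rule described for char_count, the sketch below sums token lengths while skipping PUNCT and SPACE tokens. The `Token` tuple is a hypothetical stand-in for spaCy tokens, and `char_count_sketch` is an illustrative name; this is not TRUNAJOD's implementation.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token: only the two attributes used here.
Token = namedtuple("Token", ["text", "pos_"])


def char_count_sketch(tokens):
    """Sum the lengths of all tokens that are not tagged PUNCT or SPACE."""
    return sum(len(t.text) for t in tokens if t.pos_ not in {"PUNCT", "SPACE"})


# Hand-tagged tokens for "Hola, mundo."
tokens = [
    Token("Hola", "INTJ"),
    Token(",", "PUNCT"),
    Token("mundo", "NOUN"),
    Token(".", "PUNCT"),
]
print(char_count_sketch(tokens))  # 4 + 5 = 9
```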

TRUNAJOD.surface_proxies.clause_count(doc, infinitive_map)

Return clause count (heuristic).

This function is decorated by TRUNAJOD.surface_proxies.fix_parse_tree() in order to count clauses heuristically.

Parameters
  • doc (Spacy Doc) – Text to be processed.

  • infinitive_map – Lexicon mapping conjugated verb forms to their infinitives.

Returns

Clause count

Return type

int

TRUNAJOD.surface_proxies.connection_words_ratio(doc)

Get the ratio of connective words to the total number of words in the text.

The ratio is computed as # of connective words / # of words. The implementation supports only Spanish and considers the following lemmas: y, o, no, si.

Parameters

doc (Spacy Doc) – Tokenized text

Returns

Connection word ratio

Return type

float
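The connective ratio above can be sketched as follows. The lemma set comes from the description; the `Token` stand-in, the choice to exclude PUNCT and SPACE tokens from the word total, and the 0.0 result for empty input are assumptions made for illustration only.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token with hand-written lemmas and tags.
Token = namedtuple("Token", ["lemma_", "pos_"])

# Lemmas listed in the description of connection_words_ratio.
CONNECTIVE_LEMMAS = {"y", "o", "no", "si"}


def connection_words_ratio_sketch(tokens):
    """Ratio of connective lemmas to words (assumed: non-PUNCT, non-SPACE tokens)."""
    words = [t for t in tokens if t.pos_ not in {"PUNCT", "SPACE"}]
    if not words:
        return 0.0  # assumption: empty input yields 0.0
    connectives = sum(1 for t in words if t.lemma_ in CONNECTIVE_LEMMAS)
    return connectives / len(words)


# Hand-annotated tokens for "Si llueve, no salgo y leo."
tokens = [
    Token("si", "SCONJ"),
    Token("llover", "VERB"),
    Token(",", "PUNCT"),
    Token("no", "ADV"),
    Token("salir", "VERB"),
    Token("y", "CCONJ"),
    Token("leer", "VERB"),
    Token(".", "PUNCT"),
]
print(connection_words_ratio_sketch(tokens))  # 3 connectives / 6 words = 0.5
```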

TRUNAJOD.surface_proxies.first_second_person_count(doc)

Count first- and second-person tokens.

Parameters

doc (Spacy Doc) – Processed text

Returns

First and second person count

Return type

int

TRUNAJOD.surface_proxies.first_second_person_density(doc)

Compute the density of first- and second-person tokens.

Parameters

doc (Spacy Doc) – Processed text

Returns

First and second person density

Return type

float

TRUNAJOD.surface_proxies.fix_parse_tree(doc, infinitive_map)

Fix spaCy parse tree.

We found that, for Spanish texts, spaCy tags do not deal appropriately with periphrasis and other linguistic cues. This function addresses this shortcoming by modifying the parse tree computed by spaCy, adding periphrasis for gerunds, infinitives, and past-tense verbs.

Parameters
  • doc (Spacy Doc) – Processed text

  • infinitive_map – Lexicon mapping conjugated verb forms to their infinitives.

Returns

Fixed Doc

Return type

Spacy Doc

TRUNAJOD.surface_proxies.frequency_index(doc, frequency_dict)

Return frequency index.

The frequency index is defined as the average, over sentences, of the frequency of the rarest word in each sentence. Word frequencies are looked up in a dictionary; for this Spanish implementation, the RAE's CREA dictionary could be used.

Parameters
  • doc (Spacy Doc) – Tokenized text.

  • frequency_dict – Dictionary of word frequencies.

Returns

Frequency index

Return type

float
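A minimal sketch of the frequency index as defined above: take the lowest frequency in each sentence, then average across sentences. The frequency values are toy numbers, not CREA data. Treating frequency_dict as a plain word-to-frequency dict and skipping words missing from it are assumptions, not documented behavior.

```python
def frequency_index_sketch(sentences, frequency_dict):
    """Average, over sentences, of the frequency of each sentence's rarest word."""
    rarest = []
    for words in sentences:
        # Assumption: words missing from the dictionary are ignored.
        known = [frequency_dict[w] for w in words if w in frequency_dict]
        if known:
            rarest.append(min(known))
    return sum(rarest) / len(rarest) if rarest else 0.0


# Toy frequencies, invented for this example only.
toy_frequencies = {"el": 1000, "gato": 50, "come": 80, "perro": 60, "duerme": 40}
sentences = [["el", "gato", "come"], ["el", "perro", "duerme"]]

# Rarest words: "gato" (50) and "duerme" (40), so the index is (50 + 40) / 2.
print(frequency_index_sketch(sentences, toy_frequencies))  # 45.0
```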

TRUNAJOD.surface_proxies.get_word_depth(index, doc)

Get word depth in the parse tree given a sentence and token index.

The ROOT of the sentence is considered level 1. This method traverses the parse tree until reaching the ROOT, and counts all levels traversed.

Parameters
  • index (int) – Position of the token in the sentence

  • doc (Spacy Doc) – Tokenized text

Returns

Depth of the token

Return type

int
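The traversal described for get_word_depth can be sketched without spaCy by representing the parse tree as a list of head indices. The `heads` list and the function name are illustrative assumptions. The one spaCy convention borrowed here is that the ROOT token is its own head.

```python
def word_depth_sketch(index, heads):
    """Count levels from a token up to the ROOT, with the ROOT at level 1.

    heads[i] is the index of token i's head; the ROOT is its own head.
    """
    depth = 1
    while heads[index] != index:
        index = heads[index]
        depth += 1
    return depth


# Hand-written parse of "El gato come pescado":
# El -> gato -> come (ROOT), pescado -> come
heads = [1, 2, 2, 2]
print([word_depth_sketch(i, heads) for i in range(len(heads))])  # [3, 2, 1, 2]
```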

TRUNAJOD.surface_proxies.infinitve(conjugate, infinitive_map)

Get infinitive form of a conjugated verb.

Given a mapping from conjugated forms to infinitives, this function returns the infinitive form of a conjugated verb. We provide models available for download, so you do not have to build the infinitive_map yourself. Unfortunately, we only provide models for Spanish texts.

Parameters
  • conjugate (string) – Verb to be processed

  • infinitive_map – Lexicon mapping conjugated verb forms to their infinitives.

Returns

Infinitive form of the verb, None if not found

Return type

string
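To make the shape of the lookup concrete, here is a small sketch that assumes the infinitive_map behaves like a plain dict from conjugated forms to infinitives. The format of the downloadable models is not described here, so `toy_map` is purely hypothetical.

```python
def infinitive_sketch(conjugate, infinitive_map):
    """Look up the infinitive of a conjugated verb; None when it is unknown."""
    return infinitive_map.get(conjugate)


# Hypothetical toy lexicon; the real models are downloaded separately.
toy_map = {"comió": "comer", "corriendo": "correr", "dicho": "decir"}

print(infinitive_sketch("comió", toy_map))  # comer
print(infinitive_sketch("xyz", toy_map))    # None
```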

TRUNAJOD.surface_proxies.lexical_density(doc)

Compute lexical density.

Lexical density is defined as the ratio of tokens tagged VERB, AUX, ADJ, NOUN, PROPN, or ADV to the total number of words.

Parameters

doc (Spacy Doc) – Tokenized text

Returns

Lexical density

Return type

float
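The lexical density ratio can be illustrated on a list of POS tags. The tag set comes from the description above. Counting every non-PUNCT, non-SPACE token as a word is an assumption of this sketch, which is not the library's code.

```python
# Tags that count as lexical (content) words, per the description above.
LEXICAL_TAGS = {"VERB", "AUX", "ADJ", "NOUN", "PROPN", "ADV"}


def lexical_density_sketch(pos_tags):
    """Share of words whose POS tag is in LEXICAL_TAGS."""
    # Assumption: a "word" is any token not tagged PUNCT or SPACE.
    words = [tag for tag in pos_tags if tag not in {"PUNCT", "SPACE"}]
    if not words:
        return 0.0
    return sum(1 for tag in words if tag in LEXICAL_TAGS) / len(words)


# Hand-tagged "El gato negro come pescado."
tags = ["DET", "NOUN", "ADJ", "VERB", "NOUN", "PUNCT"]
print(lexical_density_sketch(tags))  # 4 lexical words / 5 words = 0.8
```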

TRUNAJOD.surface_proxies.negation_density(doc)

Compute negation density.

This is defined as the ratio of the number of occurrences of TRUNAJOD.surface_proxies.NEGATION_WORDS in the text to the total word count.

Parameters

doc (Spacy Doc) – Tokenized text

Returns

Negation density

Return type

float

TRUNAJOD.surface_proxies.node_similarity(node1, node2, is_central_node=False)

Compute node similarity recursively, based on common children POS.

This is an auxiliary function called inside TRUNAJOD.surface_proxies.syntactic_similarity(). In the common use case, it is unlikely that you will need to call it directly, but we provide it for debugging purposes.

Parameters
  • node1 (Spacy Token) – Node of the parse tree.

  • node2 (Spacy Token) – Node of the parse tree

  • is_central_node (bool, optional) – Whether this is the central node, defaults to False

Returns

Total number of children in common between node1 and node2.

Return type

int

TRUNAJOD.surface_proxies.noun_count(doc)

Count nouns in the text.

Count all tokens whose Part of Speech tag is either NOUN or PROPN.

Parameters

doc (Spacy Doc) – Text to be processed

Returns

Noun count

Return type

int

TRUNAJOD.surface_proxies.noun_phrase_density(doc)

Compute NP density.

NP density is computed heuristically; we might improve it in the future by using an NP-chunking strategy. To count noun phrases, we check that the head of a node in the parse tree is a noun. Then, we check whether either of the following conditions is met:

  • The token is the article del or al

  • The token dependency is not cc, case or cop, the token is not punctuation, and the token is not the ROOT

Then we compute the ratio: # of NP / noun count.

Parameters

doc (Spacy Doc) – Tokenized text.

Returns

NP density

Return type

float
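The noun phrase heuristic described above can be sketched on hand-annotated tokens. The flat `Token` tuple (with `head_pos` holding the POS tag of the token's head) is a hypothetical stand-in for spaCy tokens. Requiring the head to be tagged NOUN only, and counting both NOUN and PROPN in the denominator, are assumptions of this sketch.

```python
from collections import namedtuple

# Hypothetical flat stand-in for a spaCy token; head_pos is the POS tag of the
# token's syntactic head.
Token = namedtuple("Token", ["text", "pos_", "dep_", "head_pos"])

EXCLUDED_DEPS = {"cc", "case", "cop"}


def is_np_member(token):
    """Apply the two conditions from the description to a token with a noun head."""
    if token.head_pos != "NOUN":  # assumption: only NOUN heads qualify
        return False
    if token.text.lower() in {"del", "al"}:
        return True
    return (
        token.dep_ not in EXCLUDED_DEPS
        and token.pos_ != "PUNCT"
        and token.dep_ != "ROOT"
    )


def noun_phrase_density_sketch(tokens):
    """Ratio: # of NP (heuristic count) / noun count."""
    nouns = sum(1 for t in tokens if t.pos_ in {"NOUN", "PROPN"})
    if nouns == 0:
        return 0.0
    return sum(1 for t in tokens if is_np_member(t)) / nouns


# Hand-written parse of "El gato del vecino duerme."
tokens = [
    Token("El", "DET", "det", "NOUN"),
    Token("gato", "NOUN", "nsubj", "VERB"),
    Token("del", "ADP", "case", "NOUN"),
    Token("vecino", "NOUN", "nmod", "NOUN"),
    Token("duerme", "VERB", "ROOT", "VERB"),
    Token(".", "PUNCT", "punct", "VERB"),
]
print(noun_phrase_density_sketch(tokens))  # 3 matches / 2 nouns = 1.5
```

In this example "del" is counted even though its dependency is case, because the first condition takes precedence in the sketch.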

TRUNAJOD.surface_proxies.pos_dissimilarity(doc)

Measure Part of Speech dissimilarity over sentences.

The POS dissimilarity between two sentences is the difference between their POS distributions, normalized by the total population of POS tags. It is computed as follows:

  • For each sentence, PoS tag distribution is computed.

  • For each tag in either of the two sentences, we compute the difference in distributions (absolute value)

  • This difference is divided by the total population of the sentences

This is done for each pair of adjacent sentences (N - 1 pairs), and the results are averaged (again, over N - 1).

Parameters

doc (Spacy Doc) – Processed text

Returns

Part of Speech dissimilarity

Return type

float
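The steps above can be sketched on plain lists of POS tags. Two readings are assumed here and not confirmed by the description: the "total population" is taken to be the combined number of tags in both sentences, and pairs are formed from adjacent sentences.

```python
from collections import Counter


def pair_dissimilarity(tags_a, tags_b):
    """Summed absolute count differences, divided by the combined tag count."""
    dist_a, dist_b = Counter(tags_a), Counter(tags_b)
    all_tags = set(dist_a) | set(dist_b)
    # A Counter returns 0 for tags that do not occur in a sentence.
    difference = sum(abs(dist_a[tag] - dist_b[tag]) for tag in all_tags)
    return difference / (len(tags_a) + len(tags_b))


def pos_dissimilarity_sketch(sentences):
    """Average pair dissimilarity over the N - 1 adjacent sentence pairs."""
    pairs = list(zip(sentences, sentences[1:]))
    if not pairs:
        return 0.0
    return sum(pair_dissimilarity(a, b) for a, b in pairs) / len(pairs)


sentences = [
    ["NOUN", "VERB", "ADJ"],
    ["NOUN", "VERB", "VERB", "ADV"],
]
# Differences: NOUN 0, VERB 1, ADJ 1, ADV 1, so 3 over 3 + 4 = 7 tags.
print(pos_dissimilarity_sketch(sentences))
```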

TRUNAJOD.surface_proxies.pos_distribution(doc)

Get POS distribution from a processed text.

Let us suppose that a given sentence has the following pos tags: [NOUN, VERB, ADJ, VERB, ADJ]. The PoS distribution would be

  • NOUN: 1

  • VERB: 2

  • ADJ: 2

This function returns this distribution as a dict.

Parameters

doc (Spacy Doc) – Processed text

Returns

POS distribution as a dict mapping each POS tag to its count

Return type

dict
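The example distribution above can be reproduced with the standard library. This sketch works on a plain list of tag strings rather than a spaCy Doc, and the function name is illustrative.

```python
from collections import Counter


def pos_distribution_sketch(pos_tags):
    """Count how often each POS tag occurs and return the counts as a dict."""
    return dict(Counter(pos_tags))


print(pos_distribution_sketch(["NOUN", "VERB", "ADJ", "VERB", "ADJ"]))
# {'NOUN': 1, 'VERB': 2, 'ADJ': 2}
```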

TRUNAJOD.surface_proxies.pos_ratio(doc, pos_types)

Compute POS ratio given desired type of ratio.

The pos_types argument may be a regular expression if a composite ratio is needed. For example: pos_ratio(doc, "VERB|AUX").

Parameters
  • doc (Spacy Doc) – Spacy processed text

  • pos_types (string) – POS to get the ratio

Returns

Ratio over number of words

Return type

float
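A sketch of how a regular expression such as "VERB|AUX" can select several tags at once. It works on a list of tag strings. Using `re.fullmatch` (rather than a prefix or substring match) and excluding PUNCT and SPACE tokens from the word total are assumptions of this example, not documented behavior.

```python
import re


def pos_ratio_sketch(pos_tags, pos_types):
    """Share of words whose POS tag fully matches the pos_types pattern."""
    # Assumption: a "word" is any token not tagged PUNCT or SPACE.
    words = [tag for tag in pos_tags if tag not in {"PUNCT", "SPACE"}]
    if not words:
        return 0.0
    pattern = re.compile(pos_types)
    return sum(1 for tag in words if pattern.fullmatch(tag)) / len(words)


tags = ["DET", "NOUN", "AUX", "VERB", "ADJ", "PUNCT"]
print(pos_ratio_sketch(tags, "VERB|AUX"))  # 2 matching tags / 5 words = 0.4
print(pos_ratio_sketch(tags, "NOUN"))      # 1 matching tag / 5 words = 0.2
```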

TRUNAJOD.surface_proxies.sentence_count(doc)

Return number of sentences in a text.

Parameters

doc (Spacy Doc) – Text to be processed

Returns

Number of sentences in the text

Return type

int

TRUNAJOD.surface_proxies.subordination(doc, infinitive_map)

Return subordination, defined as the clause density.

Subordination is defined as the ratio between the # of clauses and the # of sentences. The number of clauses is computed heuristically.

Parameters
  • doc (Spacy Doc) – Text to be processed.

  • infinitive_map – Lexicon mapping conjugated verb forms to their infinitives.

Returns

Subordination index

Return type

float

TRUNAJOD.surface_proxies.syllable_count(doc)

Return number of syllables of a text.

Parameters

doc (Spacy Doc) – Text to be processed.

Returns

Number of syllables in the text

Return type

int

TRUNAJOD.surface_proxies.syllable_word_ratio(doc)

Return average syllable word ratio.

It is computed as # of syllables / # of words.

Parameters

doc (Spacy Doc) – Text to be processed.

Returns

Syllable word ratio

Return type

float

TRUNAJOD.surface_proxies.syntactic_similarity(doc)

Compute average syntactic similarity between sentences.

For each pair of sentences, compute the similarity between each pair of nodes, using TRUNAJOD.surface_proxies.node_similarity(). Then, the result is averaged over the N - 1 pairs of sentences.

Parameters

doc (Spacy Doc) – Processed text

Returns

Average syntactic similarity over sentences.

Return type

float

TRUNAJOD.surface_proxies.verb_noun_ratio(doc)

Compute Verb/Noun ratio.

Parameters

doc (Spacy Doc) – Processed text

Returns

Verb Noun ratio

Return type

float

TRUNAJOD.surface_proxies.word_count(doc)

Return number of words in a text.

Parameters

doc (Spacy Doc) – Text to be processed.

Returns

Word count

Return type

int

TRUNAJOD.surface_proxies.words_before_root(doc, max_depth=4)

Return the average number of words before the root.

For each sentence, the number of words before the root is computed if the root is a verb. Otherwise, the verb at the highest node in the parse tree is treated as the root.

Parameters

doc (Spacy Doc) – Text to be processed.

Returns

Average words before root

Return type

float
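One possible reading of the rule for words_before_root is sketched below, with several assumptions. Each sentence is a list of hypothetical (pos, head_index) pairs in which the ROOT is its own head. The "highest" verb is found by breadth-first search from the ROOT. max_depth is treated as a limit on that search, which the documentation does not confirm. A sentence with no verb in range contributes 0, and every token before the verb counts as a word.

```python
from collections import deque


def words_before_root_sketch(sentences, max_depth=4):
    """Average number of tokens preceding the main verb of each sentence."""
    counts = []
    for sent in sentences:
        # The ROOT is the token that is its own head.
        root = next(i for i, (_, head) in enumerate(sent) if head == i)
        queue = deque([(root, 1)])
        verb_index = None
        # Breadth-first search, so the shallowest verb is found first.
        while queue:
            index, depth = queue.popleft()
            if sent[index][0] == "VERB":
                verb_index = index
                break
            if depth < max_depth:  # assumed meaning of max_depth
                queue.extend(
                    (child, depth + 1)
                    for child, (_, head) in enumerate(sent)
                    if head == index and child != index
                )
        # Assumption: a sentence without a reachable verb contributes 0.
        counts.append(verb_index if verb_index is not None else 0)
    return sum(counts) / len(counts) if counts else 0.0


sentences = [
    # "El gato comió pescado": the ROOT "comió" is a verb at position 2.
    [("DET", 1), ("NOUN", 2), ("VERB", 2), ("NOUN", 2)],
    # "Un libro que leí ayer": the ROOT "libro" is a noun; "leí" is the highest verb.
    [("DET", 1), ("NOUN", 1), ("PRON", 3), ("VERB", 1), ("ADV", 3)],
]
print(words_before_root_sketch(sentences))  # (2 + 3) / 2 = 2.5
```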