Surface Proxies

Surface proxies of TRUNAJOD.

Surface proxies are shallow measurements taken from a text that approximate intrinsic properties of the text, such as cohesion, coherence, and complexity. Examples include, but are not limited to, the number of sentences and the number of syllables.

TRUNAJOD.surface_proxies.add_periphrasis(doc, periphrasis_type, periphrasis_list)

Add periphrasis to spaCy tags.

One drawback of spaCy is that it does not properly handle periphrasis in texts (in our case, Spanish texts). This function adds periphrasis to the text in order to improve further analysis, such as clause segmentation and clause count. It is used by TRUNAJOD.surface_proxies.fix_parse_tree().

Parameters

doc (Spacy Doc) – Tokenized text
periphrasis_type (string) – Periphrasis type
periphrasis_list (List of strings) – List of periphrasis

Returns

Corrected doc

Return type

Spacy Doc

TRUNAJOD.surface_proxies.average_clause_length(doc, infinitive_map)

Return average clause length (heuristic).

This measurement is computed as the ratio # of words / # of clauses. Clauses are counted heuristically; refer to TRUNAJOD.surface_proxies.clause_count() for more details.

Parameters

doc (Spacy Doc) – Text to be processed
infinitive_map – Lexicon mapping conjugated verbs to their infinitives.

Returns

Average clause length

Return type

float

TRUNAJOD.surface_proxies.average_sentence_length(doc)

Return average sentence length.

This measurement is computed as the ratio of: # of words / # of sentences.

Parameters: doc (Spacy Doc) – Text to be processed.
Returns: average sentence length
Return type: float
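
The ratio above can be illustrated with a minimal sketch. It is not TRUNAJOD's implementation: the real function takes a spaCy Doc, whereas the hypothetical helper `average_sentence_length_sketch` below works on plain lists of words so that it runs without any model installed.

```python
def average_sentence_length_sketch(sentences):
    """Return (# of words) / (# of sentences).

    `sentences` is a list of sentences, each given as a list of word
    strings. This is a stand-in for a spaCy Doc (an assumption made
    for illustration only).
    """
    if not sentences:
        return 0.0
    total_words = sum(len(sentence) for sentence in sentences)
    return total_words / len(sentences)


example_sentences = [
    ["El", "gato", "duerme"],
    ["La", "niña", "lee", "un", "libro"],
]
# 8 words over 2 sentences
print(average_sentence_length_sketch(example_sentences))
```

Returning 0.0 for an empty input is a choice made for this sketch; the source does not say how the library handles empty texts.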

TRUNAJOD.surface_proxies.average_word_length(doc)

Return average word length.

Computed as the ratio of: # of chars / # of words

Parameters: doc (Spacy Doc) – Text to be processed.
Returns: Average word length
Return type: float

TRUNAJOD.surface_proxies.char_count(doc)

Return number of chars in a text.

This count excludes any token whose Token.pos_ tag is PUNCT or SPACE.

Parameters: doc (Spacy Doc) – Text to be processed.
Returns: Char count
Return type: int
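
As a rough illustration of the filtering rule, the sketch below counts characters over (text, POS tag) pairs and skips PUNCT and SPACE entries. The pairs are a hand-written stand-in for spaCy tokens, and `char_count_sketch` is a hypothetical name, not part of TRUNAJOD.

```python
SKIPPED_TAGS = {"PUNCT", "SPACE"}


def char_count_sketch(tagged_tokens):
    """Sum the lengths of all tokens that are not punctuation or whitespace.

    `tagged_tokens` is a list of (text, pos_tag) tuples, assumed here
    in place of spaCy tokens and their `pos_` attribute.
    """
    return sum(len(text) for text, pos in tagged_tokens if pos not in SKIPPED_TAGS)


tagged = [
    ("Hola", "INTJ"),
    (",", "PUNCT"),
    (" ", "SPACE"),
    ("mundo", "NOUN"),
    (".", "PUNCT"),
]
# Only "Hola" (4 chars) and "mundo" (5 chars) are counted.
print(char_count_sketch(tagged))
```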

TRUNAJOD.surface_proxies.clause_count(doc, infinitive_map)

Return clause count (heuristic).

This function is decorated by the TRUNAJOD.surface_proxies.fix_parse_tree() function in order to count clauses heuristically.

Parameters

doc (Spacy Doc) – Text to be processed.
infinitive_map – Lexicon mapping conjugated verbs to their infinitives.

Returns

Clause count

Return type

int

TRUNAJOD.surface_proxies.connection_words_ratio(doc)

Get ratio of connective words over total words of the text.

This function computes the ratio of connective words to the total number of words. The implementation supports only Spanish and considers the following lemmas: y, o, no, si.

Parameters: doc (Spacy Doc) – Tokenized text
Returns: Connection word ratio
Return type: float
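
A minimal sketch of this ratio, assuming the text has already been reduced to a list of word lemmas. The helper name `connection_words_ratio_sketch` and the lowercasing step are assumptions for illustration; the library works on a spaCy Doc and may normalize differently.

```python
CONNECTIVE_LEMMAS = {"y", "o", "no", "si"}  # lemmas listed in the description


def connection_words_ratio_sketch(lemmas):
    """Return (# of connective lemmas) / (# of words)."""
    if not lemmas:
        return 0.0
    # Lowercasing is an assumption of this sketch.
    hits = sum(1 for lemma in lemmas if lemma.lower() in CONNECTIVE_LEMMAS)
    return hits / len(lemmas)


example_lemmas = ["si", "llover", "no", "salir", "y", "leer"]
# 3 connective lemmas out of 6 words
print(connection_words_ratio_sketch(example_lemmas))
```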

TRUNAJOD.surface_proxies.first_second_person_count(doc)

Count first- and second-person tokens.

Parameters: doc (Spacy Doc) – Processed text
Returns: First and second person count
Return type: int

TRUNAJOD.surface_proxies.first_second_person_density(doc)

Compute density of first- and second-person tokens.

Parameters: doc (Spacy Doc) – Processed text
Returns: First and second person density
Return type: float

TRUNAJOD.surface_proxies.fix_parse_tree(doc, infinitive_map)

Fix spaCy parse tree.

We found that, for Spanish texts, spaCy tags do not deal appropriately with periphrasis and other linguistic cues. This function addresses this shortcoming by modifying the parse tree computed by spaCy, adding periphrasis for gerunds, infinitives, and past tense verbs.

Parameters

doc (Spacy Doc) – Processed text
infinitive_map – Lexicon mapping conjugated verbs to their infinitives.

Returns

Fixed Doc

Return type

Spacy Doc

TRUNAJOD.surface_proxies.frequency_index(doc, frequency_dict)

Return frequency index.

The frequency index is defined as the average, over sentences, of the frequency of the rarest word in each sentence. Word frequencies are looked up in a dictionary (frequency_dict). For this Spanish implementation, the RAE dictionary CREA could be used.

Parameters: doc (Spacy Doc) – Tokenized text.
Returns: Frequency index
Return type: float
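
The definition can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: sentences are plain lists of words, words missing from the dictionary are skipped, and the frequencies are invented values (not taken from CREA). `frequency_index_sketch` is a hypothetical name.

```python
def frequency_index_sketch(sentences, frequency_dict):
    """Average, over sentences, of the frequency of each sentence's rarest word."""
    rarest_per_sentence = []
    for sentence in sentences:
        # Assumption: words absent from the dictionary are ignored.
        known = [frequency_dict[word] for word in sentence if word in frequency_dict]
        if known:
            rarest_per_sentence.append(min(known))
    if not rarest_per_sentence:
        return 0.0
    return sum(rarest_per_sentence) / len(rarest_per_sentence)


# Invented frequencies, for illustration only.
toy_frequencies = {"el": 1000.0, "gato": 50.0, "duerme": 20.0, "la": 900.0, "casa": 80.0}
toy_sentences = [["el", "gato", "duerme"], ["la", "casa"]]
# Rarest words: "duerme" (20.0) and "casa" (80.0)
print(frequency_index_sketch(toy_sentences, toy_frequencies))
```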

TRUNAJOD.surface_proxies.get_word_depth(index, doc)

Get word depth in the parse tree given a sentence and token index.

The ROOT of the sentence is considered level 1. This method traverses the parse tree until reaching the ROOT, and counts all levels traversed.

Parameters

index (int) – Position of the token in the sentence
doc (Spacy Doc) – Tokenized text

Returns

Depth of the token

Return type

int
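
The traversal can be sketched without spaCy by encoding the parse tree as a list of head indices. In spaCy the root token is its own head, and the sketch keeps that convention. The name `word_depth_sketch` and the hand-written parse are assumptions for illustration.

```python
def word_depth_sketch(index, heads):
    """Count levels from the token at `index` up to the ROOT (ROOT is level 1).

    `heads[i]` is the index of the head of token i; the ROOT points to itself.
    """
    depth = 1
    current = index
    while heads[current] != current:
        current = heads[current]
        depth += 1
    return depth


# Hand-written parse of "El gato duerme": El -> gato -> duerme (ROOT)
toy_heads = [1, 2, 2]
for i, word in enumerate(["El", "gato", "duerme"]):
    print(word, word_depth_sketch(i, toy_heads))
```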

TRUNAJOD.surface_proxies.infinitve(conjugate, infinitive_map)

Get infinitive form of a conjugated verb.

Given a mapping from conjugated forms to infinitives, this function returns the infinitive form of a conjugated verb. We provide downloadable models, so you do not have to worry about building the infinitive_map. Unfortunately, we only provide models for Spanish texts.

Parameters

conjugate (string) – Verb to be processed
infinitive_map – Lexicon mapping conjugated verbs to their infinitives.

Returns

Infinitive form of the verb, None if not found

Return type

string
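
Conceptually this is a dictionary lookup that yields None for unknown verbs. The sketch below assumes `infinitive_map` behaves like a plain dict; the entries shown are made up for illustration and do not come from the downloadable models, and `infinitive_lookup_sketch` is a hypothetical name.

```python
def infinitive_lookup_sketch(conjugate, infinitive_map):
    """Return the infinitive for `conjugate`, or None if it is not in the map."""
    return infinitive_map.get(conjugate)


# Toy lexicon (assumed shape: conjugated form -> infinitive).
toy_infinitive_map = {"duerme": "dormir", "comieron": "comer"}

print(infinitive_lookup_sketch("duerme", toy_infinitive_map))
print(infinitive_lookup_sketch("xyz", toy_infinitive_map))
```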

TRUNAJOD.surface_proxies.lexical_density(doc)

Compute lexical density.

The lexical density is defined as the Part of Speech ratio of the following tags: VERB, AUX, ADJ, NOUN, PROPN and ADV over the total number of words.

Parameters: doc (Spacy Doc) – Tokenized text
Returns: Lexical density
Return type: float
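
A small sketch of the ratio, assuming the text is already available as a list of POS tags for its words. `lexical_density_sketch` is a hypothetical helper and the tag sequence is hand-written, not spaCy output.

```python
LEXICAL_TAGS = {"VERB", "AUX", "ADJ", "NOUN", "PROPN", "ADV"}


def lexical_density_sketch(pos_tags):
    """Return (# of words with a lexical POS tag) / (# of words)."""
    if not pos_tags:
        return 0.0
    lexical = sum(1 for tag in pos_tags if tag in LEXICAL_TAGS)
    return lexical / len(pos_tags)


toy_tags = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN", "ADJ", "ADV"]
# 5 lexical tags (NOUN, VERB, NOUN, ADJ, ADV) out of 8 words
print(lexical_density_sketch(toy_tags))
```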

TRUNAJOD.surface_proxies.negation_density(doc)

Compute negation density.

This is defined as the ratio between number of occurrences of TRUNAJOD.surface_proxies.NEGATION_WORDS in the text over the total word count.

Parameters: doc (Spacy Doc) – Tokenized text
Returns: Negation density
Return type: float

TRUNAJOD.surface_proxies.node_similarity(node1, node2, is_central_node=False)

Compute node similarity recursively, based on common children POS.

This is an auxiliary function called inside TRUNAJOD.surface_proxies.syntactic_similarity(). In the common use case, it is unlikely that you will need to call it directly, but we provide it for debugging purposes.

Parameters

node1 (Spacy Token) – Node of the parse tree.
node2 (Spacy Token) – Node of the parse tree
is_central_node (bool, optional) – Whether the node is the central node, defaults to False

Returns

Total children in common between node1 and node2.

Return type

int

TRUNAJOD.surface_proxies.noun_count(doc)

Count nouns in the text.

Count all tokens whose Part of Speech tag is either NOUN or PROPN.

Parameters: doc (Spacy Doc) – Text to be processed
Returns: Noun count
Return type: int

TRUNAJOD.surface_proxies.noun_phrase_density(doc)

Compute NP density.

NP density is computed heuristically. We might improve it in the future by using some NP-chunking strategy. To count noun phrases, we check that the head of a node in the parse tree is a noun. Then, we check whether either of the following conditions is met:

The token is the article del or al
The token dependency is not cc, case or cop, the token is not punctuation, and the token is not the ROOT

Then we compute the ratio # of NPs / noun count.

Parameters: doc (Spacy Doc) – Tokenized text.
Returns: NP density
Return type: float

TRUNAJOD.surface_proxies.pos_dissimilarity(doc)

Measure Part of Speech dissimilarity over sentences.

The POS dissimilarity between two sentences is the difference between their POS distributions, relative to the total population of POS tags. It is computed as follows:

For each sentence, the POS tag distribution is computed.
For each tag in either of the two sentences, we compute the absolute difference between the distributions.
This difference is divided by the total population of the sentences.

This is done for each pair of consecutive sentences (N - 1 pairs for N sentences), and the results are averaged (again, over N - 1).

Parameters: doc (Spacy Doc) – Processed text
Returns: Part of Speech dissimilarity
Return type: float
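
The steps above can be sketched as follows. The sketch assumes that pairs are formed from consecutive sentences and that "total population" means the combined number of tags in the two sentences; both readings are assumptions, and the function names are hypothetical. Sentences are given as plain lists of POS tags.

```python
from collections import Counter


def pair_dissimilarity_sketch(tags_a, tags_b):
    """Sum of absolute per-tag count differences, divided by the total tag count."""
    dist_a, dist_b = Counter(tags_a), Counter(tags_b)
    population = len(tags_a) + len(tags_b)
    if population == 0:
        return 0.0
    all_tags = set(dist_a) | set(dist_b)
    difference = sum(abs(dist_a[tag] - dist_b[tag]) for tag in all_tags)
    return difference / population


def pos_dissimilarity_sketch(sentences):
    """Average the pairwise dissimilarity over the N - 1 consecutive pairs."""
    pairs = list(zip(sentences, sentences[1:]))
    if not pairs:
        return 0.0
    return sum(pair_dissimilarity_sketch(a, b) for a, b in pairs) / len(pairs)


toy_tagged_sentences = [
    ["DET", "NOUN", "VERB"],
    ["DET", "NOUN", "VERB", "ADJ"],
    ["PRON", "VERB"],
]
# Pair 1 differs by 1 tag out of 7; pair 2 differs by 4 tags out of 6.
print(pos_dissimilarity_sketch(toy_tagged_sentences))
```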

TRUNAJOD.surface_proxies.pos_distribution(doc)

Get POS distribution from a processed text.

Let us suppose that a given sentence has the following pos tags: [NOUN, VERB, ADJ, VERB, ADJ]. The PoS distribution would be

NOUN: 1
VERB: 2
ADJ: 2

This function returns this distribution as a dict.

Parameters: doc (Spacy Doc) – Processed text
Returns: POS distribution as a dict mapping each POS tag to its count
Return type: dict
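
The example above can be reproduced with the standard library. This sketch counts a plain list of tags instead of reading them from a spaCy Doc, and `pos_distribution_sketch` is a hypothetical name.

```python
from collections import Counter


def pos_distribution_sketch(pos_tags):
    """Return a dict mapping each POS tag to the number of times it occurs."""
    return dict(Counter(pos_tags))


distribution = pos_distribution_sketch(["NOUN", "VERB", "ADJ", "VERB", "ADJ"])
print(distribution)  # {'NOUN': 1, 'VERB': 2, 'ADJ': 2}
```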

TRUNAJOD.surface_proxies.pos_ratio(doc, pos_types)

Compute POS ratio given desired type of ratio.

The pos_types might be a regular expression if a composed ratio is needed. An example of usage would be pos_ratio(doc, "VERB|AUX").

Parameters

doc (Spacy Doc) – Spacy processed text
pos_types (string) – POS to get the ratio

Returns

Ratio over number of words

Return type

float
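
To show how a pattern such as "VERB|AUX" selects several tags at once, here is a minimal sketch over a plain list of POS tags. Matching each tag in full with `re.fullmatch` is an assumption of the sketch; the source does not state which matching mode the library uses. `pos_ratio_sketch` is a hypothetical name.

```python
import re


def pos_ratio_sketch(pos_tags, pos_types):
    """Return (# of tags matching the `pos_types` pattern) / (# of words)."""
    if not pos_tags:
        return 0.0
    pattern = re.compile(pos_types)
    # Assumption: a tag counts only if the whole tag matches the pattern.
    matches = sum(1 for tag in pos_tags if pattern.fullmatch(tag))
    return matches / len(pos_tags)


toy_pos_tags = ["DET", "NOUN", "AUX", "VERB", "ADJ"]
# AUX and VERB match, so the ratio is 2 / 5.
print(pos_ratio_sketch(toy_pos_tags, "VERB|AUX"))
```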

TRUNAJOD.surface_proxies.sentence_count(doc)

Return number of sentences in a text.

Parameters: doc (Spacy Doc) – Text to be processed
Returns: Number of sentences in the text
Return type: int

TRUNAJOD.surface_proxies.subordination(doc, infinitive_map)

Return subordination, defined as the clause density.

The subordination is defined as the ratio between # of clauses and the # of sentences. To compute number of clauses, a heuristic is used.

Parameters

doc (Spacy Doc) – Text to be processed.
infinitive_map – Lexicon mapping conjugated verbs to their infinitives.

Returns

Subordination index

Return type

float
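
As a worked example of the ratio only, the sketch below takes the two counts as inputs. It does not implement the clause-counting heuristic; the counts are hypothetical, and `subordination_sketch` is not a TRUNAJOD function.

```python
def subordination_sketch(clause_count, sentence_count):
    """Return (# of clauses) / (# of sentences)."""
    if sentence_count == 0:
        return 0.0
    return clause_count / sentence_count


# Hypothetical counts: 6 clauses spread over 3 sentences.
print(subordination_sketch(6, 3))
```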

TRUNAJOD.surface_proxies.syllable_count(doc)

Return number of syllables of a text.

Parameters: doc (Spacy Doc) – Text to be processed.
Returns: Number of syllables in the text
Return type: int

TRUNAJOD.surface_proxies.syllable_word_ratio(doc)

Return average syllable word ratio.

It is computed as # Syllables / # of words.

Parameters: doc (Spacy Doc) – Text to be processed.
Returns: syllable word ratio
Return type: float

TRUNAJOD.surface_proxies.syntactic_similarity(doc)

Compute average syntactic similarity between sentences.

For each pair of sentences, compute the similarity between each pair of nodes using TRUNAJOD.surface_proxies.node_similarity(). Then, the result is averaged over the N - 1 pairs of sentences.

Parameters: doc (Spacy Doc) – Processed text
Returns: Average syntactic similarity over sentences.
Return type: float

TRUNAJOD.surface_proxies.verb_noun_ratio(doc)

Compute Verb/Noun ratio.

Parameters: doc (Spacy Doc) – Processed text
Returns: Verb Noun ratio
Return type: float

TRUNAJOD.surface_proxies.word_count(doc)

Return number of words in a text.

Parameters: doc (Spacy Doc) – Text to be processed.
Returns: Word count
Return type: int

TRUNAJOD.surface_proxies.words_before_root(doc, max_depth=4)

Return average word count of words before root.

For each sentence, word count before root is computed in the case that the root is a verb. Otherwise, the root is considered to be the verb in the highest node in the parse tree.

Parameters: doc (Spacy Doc) – Text to be processed.
Returns: Average words before root
Return type: float