Surface Proxies
Surface proxies of TRUNAJOD.
These surface proxies are shallow measurements of a text that approximate intrinsic properties of the text such as cohesion, coherence, and complexity. Examples include, but are not limited to, the number of sentences and the number of syllables.
-
TRUNAJOD.surface_proxies.add_periphrasis(doc, periphrasis_type, periphrasis_list)
Add periphrasis to spaCy tags.
One drawback of spaCy is that it does not properly handle periphrasis in texts (in our case, Spanish texts). This function adds periphrasis to the text in order to improve further analysis such as clause segmentation and clause count. It is used by TRUNAJOD.surface_proxies.fix_parse_tree().
- Parameters
doc (Spacy Doc) – Tokenized text
periphrasis_type (string) – Periphrasis type
periphrasis_list (List of strings) – List of periphrasis
- Returns
Corrected doc
- Return type
Spacy Doc
-
TRUNAJOD.surface_proxies.average_clause_length(doc, infinitive_map)
Return average clause length (heuristic).
This measurement is computed as the ratio # of words / # of clauses. Clauses are counted heuristically; refer to TRUNAJOD.surface_proxies.clause_count() for more details.
- Parameters
doc (Spacy Doc) – Text to be processed
infinitive_map – Lexicon containing maps from conjugate to infinitive.
- Returns
Average clause length
- Return type
float
-
TRUNAJOD.surface_proxies.average_sentence_length(doc)
Return average sentence length.
This measurement is computed as the ratio of: # of words / # of sentences.
- Parameters
doc (Spacy Doc) – Text to be processed.
- Returns
Average sentence length
- Return type
float
-
TRUNAJOD.surface_proxies.average_word_length(doc)
Return average word length.
Computed as the ratio of: # of chars / # of words.
- Parameters
doc (Spacy Doc) – Text to be processed.
- Returns
Average word length
- Return type
float
-
TRUNAJOD.surface_proxies.char_count(doc)
Return number of chars in a text.
This count ignores any token whose Token.pos_ tag is either PUNCT or SPACE.
- Parameters
doc (Spacy Doc) – Text to be processed.
- Returns
Char count
- Return type
int
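The counting rule can be sketched as follows. This is a minimal illustration, not TRUNAJOD's implementation: tokens are plain (text, pos) tuples standing in for spaCy tokens, and the helper name count_chars is hypothetical.

```python
# Illustrative only: tokens are (text, pos) tuples standing in for spaCy tokens.
SKIPPED_POS = {"PUNCT", "SPACE"}


def count_chars(tokens):
    """Sum the lengths of all tokens whose POS tag is not PUNCT or SPACE."""
    return sum(len(text) for text, pos in tokens if pos not in SKIPPED_POS)


tokens = [("Hola", "INTJ"), (",", "PUNCT"), ("mundo", "NOUN"), (".", "PUNCT")]
print(count_chars(tokens))  # 9: "Hola" (4) + "mundo" (5)
```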
-
TRUNAJOD.surface_proxies.clause_count(doc, infinitive_map)
Return clause count (heuristic).
This function is decorated by the TRUNAJOD.surface_proxies.fix_parse_tree() function in order to count clauses heuristically.
- Parameters
doc (Spacy Doc) – Text to be processed.
infinitive_map – Lexicon containing maps from conjugate to infinitive.
- Returns
Clause count
- Return type
int
-
TRUNAJOD.surface_proxies.connection_words_ratio(doc)
Get ratio of connecting words over total words of text.
This function computes the ratio of connective words over the total number of words. This implementation only supports Spanish and considers the following lemmas: y, o, no, si.
- Parameters
doc (Spacy Doc) – Tokenized text
- Returns
Connection word ratio
- Return type
float
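A minimal sketch of this ratio, assuming tokens are (lemma, pos) tuples rather than spaCy tokens and that "words" excludes PUNCT and SPACE tokens. Both the helper name connective_ratio and the word-filtering rule are assumptions for illustration, not TRUNAJOD's code.

```python
# Illustrative only: each token is a (lemma, pos) tuple, not a spaCy token.
CONNECTIVE_LEMMAS = {"y", "o", "no", "si"}
NON_WORD_POS = {"PUNCT", "SPACE"}  # assumption: these are not counted as words


def connective_ratio(tokens):
    """Ratio of connective lemmas over the number of word tokens."""
    words = [lemma for lemma, pos in tokens if pos not in NON_WORD_POS]
    if not words:
        return 0.0
    hits = sum(1 for lemma in words if lemma.lower() in CONNECTIVE_LEMMAS)
    return hits / len(words)


tokens = [
    ("si", "SCONJ"),
    ("llover", "VERB"),
    (",", "PUNCT"),
    ("no", "ADV"),
    ("salir", "VERB"),
]
print(connective_ratio(tokens))  # 0.5: two connectives out of four words
```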
-
TRUNAJOD.surface_proxies.first_second_person_count(doc)
Count first and second person tokens.
- Parameters
doc (Spacy Doc) – Processed text
- Returns
First and second person count
- Return type
int
-
TRUNAJOD.surface_proxies.first_second_person_density(doc)
Compute density of first and second person tokens.
- Parameters
doc (Spacy Doc) – Processed text
- Returns
First and second person density
- Return type
float
-
TRUNAJOD.surface_proxies.fix_parse_tree(doc, infinitive_map)
Fix spaCy parse tree.
We found that, for Spanish texts, spaCy tags do not deal appropriately with periphrasis and other linguistic cues. This function addresses this shortcoming by modifying the parse tree computed by spaCy, adding periphrasis for gerunds, infinitives and past tense verbs.
- Parameters
doc (Spacy Doc) – Processed text
infinitive_map – Lexicon containing maps from conjugate to infinitive.
- Returns
Fixed Doc
- Return type
Spacy Doc
-
TRUNAJOD.surface_proxies.frequency_index(doc, frequency_dict)
Return frequency index.
The frequency index is defined as the average, over sentences, of the frequency of the rarest word in each sentence. Word frequencies are looked up in a dictionary; for this Spanish implementation, the RAE's CREA dictionary could be used.
- Parameters
doc (Spacy Doc) – Tokenized text.
frequency_dict – Dictionary used to look up word frequencies.
- Returns
Frequency index
- Return type
float
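The definition can be illustrated with a small sketch. It assumes sentences are given as lists of word strings, that frequency_dict maps words to raw frequency values, and that words missing from the dictionary are skipped. These are assumptions for illustration and may differ from TRUNAJOD's handling, which may for example scale frequencies differently. The helper name frequency_index_sketch and the toy frequencies are hypothetical.

```python
def frequency_index_sketch(sentences, frequency_dict):
    """Average, over sentences, of the lowest word frequency in each sentence."""
    minima = []
    for words in sentences:
        # Assumption: words absent from the dictionary are ignored.
        known = [frequency_dict[w] for w in words if w in frequency_dict]
        if known:
            minima.append(min(known))
    return sum(minima) / len(minima) if minima else 0.0


# Toy frequencies, invented for the example (not taken from CREA).
toy_frequencies = {"el": 1000, "gato": 20, "duerme": 5, "la": 900, "casa": 50}
sentences = [["el", "gato", "duerme"], ["la", "casa"]]
print(frequency_index_sketch(sentences, toy_frequencies))  # 27.5 = (5 + 50) / 2
```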
-
TRUNAJOD.surface_proxies.get_word_depth(index, doc)
Get word depth in the parse tree given a sentence and token index.
The ROOT of the sentence is considered level 1. This method traverses the parse tree until reaching the ROOT, and counts all levels traversed.
- Parameters
index (int) – Position of the token in the sentence
doc (Spacy Doc) – Tokenized text
- Returns
Depth of the token
- Return type
int
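The traversal can be sketched without spaCy by representing the parse tree as a list of head indices, where heads[i] is the position of token i's head and the ROOT is its own head (the convention spaCy uses for Token.head). The helper name word_depth is hypothetical; this is a sketch of the idea, not the library's code.

```python
def word_depth(index, heads):
    """Depth of token `index`; the ROOT (its own head) has depth 1."""
    depth = 1
    while heads[index] != index:
        index = heads[index]  # move one level up towards the ROOT
        depth += 1
    return depth


# "El gato duerme": El -> gato -> duerme (ROOT)
heads = [1, 2, 2]
print([word_depth(i, heads) for i in range(len(heads))])  # [3, 2, 1]
```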
-
TRUNAJOD.surface_proxies.infinitve(conjugate, infinitive_map)
Get infinitive form of a conjugated verb.
Given a mapping from conjugated forms to infinitives, this function returns the infinitive form of a conjugated verb. We provide models available for download, so you do not have to build the infinitive_map yourself. Regrettably, we only provide models for Spanish texts.
- Parameters
conjugate (string) – Verb to be processed
infinitive_map – Lexicon containing maps from conjugate to infinitive.
- Returns
Infinitive form of the verb, None if not found
- Return type
string
-
TRUNAJOD.surface_proxies.lexical_density(doc)
Compute lexical density.
The lexical density is defined as the Part of Speech ratio of the following tags: VERB, AUX, ADJ, NOUN, PROPN and ADV over the total number of words.
- Parameters
doc (Spacy Doc) – Tokenized text
- Returns
Lexical density
- Return type
float
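As a rough illustration, the ratio can be computed over a plain list of POS tags. The helper name lexical_density_sketch is hypothetical, and treating every token other than PUNCT and SPACE as a word is an assumption made for this sketch.

```python
LEXICAL_TAGS = {"VERB", "AUX", "ADJ", "NOUN", "PROPN", "ADV"}
NON_WORD_TAGS = {"PUNCT", "SPACE"}  # assumption: not counted as words


def lexical_density_sketch(pos_tags):
    """Share of word tokens whose POS tag is one of the lexical tags."""
    words = [tag for tag in pos_tags if tag not in NON_WORD_TAGS]
    if not words:
        return 0.0
    return sum(1 for tag in words if tag in LEXICAL_TAGS) / len(words)


# "El gato duerme en la casa."
tags = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN", "PUNCT"]
print(lexical_density_sketch(tags))  # 0.5: three lexical tags out of six words
```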
-
TRUNAJOD.surface_proxies.negation_density(doc)
Compute negation density.
This is defined as the ratio between the number of occurrences of TRUNAJOD.surface_proxies.NEGATION_WORDS in the text and the total word count.
- Parameters
doc (Spacy Doc) – Tokenized text
- Returns
Negation density
- Return type
float
-
TRUNAJOD.surface_proxies.node_similarity(node1, node2, is_central_node=False)
Compute node similarity recursively, based on common children POS.
This is an auxiliary function called inside TRUNAJOD.surface_proxies.syntactic_similarity(). In the common use case it is unlikely that you will need to call this function directly, but we provide it for debugging purposes.
- Parameters
node1 (Spacy Token) – Node of the parse tree.
node2 (Spacy Token) – Node of the parse tree.
is_central_node (bool, optional) – Whether this is the central node, defaults to False
- Returns
Total children in common between node1 and node2.
- Return type
int
-
TRUNAJOD.surface_proxies.noun_count(doc)
Count nouns in the text.
Count all tokens whose Part of Speech tag is either NOUN or PROPN.
- Parameters
doc (Spacy Doc) – Text to be processed
- Returns
Noun count
- Return type
int
-
TRUNAJOD.surface_proxies.noun_phrase_density(doc)
Compute NP density.
NP density is computed heuristically. We might improve it in the future by using some NP-chunking strategy. To count noun phrases, we check that the head of a node in the parse tree is a noun. Then, we check whether either of the following conditions is met:
The token is the article del or al
The token dependency is not cc, case or cop, the token is not punctuation, and the token is not the ROOT
Then we compute the ratio # of NP / noun count.
- Parameters
doc (Spacy Doc) – Tokenized text.
- Returns
NP density
- Return type
float
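One possible reading of this heuristic is sketched below. Tokens are modelled as a small namedtuple (Tok, hypothetical) carrying the text, POS tag, dependency label and the POS tag of the head. Treating both NOUN and PROPN as "noun" and counting one NP per matching token are assumptions of this sketch and may not match TRUNAJOD's actual code.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token.
Tok = namedtuple("Tok", "text pos dep head_pos")

NOUN_TAGS = {"NOUN", "PROPN"}  # assumption: same tags as the noun count
EXCLUDED_DEPS = {"cc", "case", "cop"}


def np_density_sketch(tokens):
    """Ratio of heuristically counted noun-phrase tokens over the noun count."""
    noun_count = sum(1 for t in tokens if t.pos in NOUN_TAGS)
    if noun_count == 0:
        return 0.0
    np_count = 0
    for t in tokens:
        if t.head_pos not in NOUN_TAGS:
            continue  # only tokens attached to a noun are considered
        is_contraction = t.text.lower() in {"del", "al"}
        is_dependent = (
            t.dep not in EXCLUDED_DEPS and t.pos != "PUNCT" and t.dep != "ROOT"
        )
        if is_contraction or is_dependent:
            np_count += 1
    return np_count / noun_count


# "La casa del pueblo", with "casa" as ROOT (its own head).
tokens = [
    Tok("La", "DET", "det", "NOUN"),
    Tok("casa", "NOUN", "ROOT", "NOUN"),
    Tok("del", "ADP", "case", "NOUN"),
    Tok("pueblo", "NOUN", "nmod", "NOUN"),
]
print(np_density_sketch(tokens))  # 1.5: "La", "del", "pueblo" over two nouns
```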
-
TRUNAJOD.surface_proxies.pos_dissimilarity(doc)
Measure Part of Speech dissimilarity over sentences.
The POS dissimilarity between two sentences is the difference between their POS distributions over the total population of POS tags. It is computed as follows:
For each sentence, the POS tag distribution is computed.
For each tag in either of the two sentences, we compute the absolute difference between the two distributions.
This difference is divided by the total population of the two sentences.
This is done for each pair of adjacent sentences (N - 1 pairs) and the results are averaged (again, over N - 1).
- Parameters
doc (Spacy Doc) – Processed text
- Returns
Part of Speech dissimilarity
- Return type
float
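The steps above can be sketched over plain lists of POS tags, one list per sentence. The helper name pos_dissimilarity_sketch is hypothetical, and the sketch assumes that pairs are formed from adjacent sentences and that no sentence is empty.

```python
from collections import Counter


def pos_dissimilarity_sketch(sentences):
    """Average POS dissimilarity over adjacent sentence pairs.

    `sentences` is a list of non-empty POS tag lists, one per sentence.
    """
    distributions = [Counter(tags) for tags in sentences]
    if len(distributions) < 2:
        return 0.0
    total = 0.0
    for current, following in zip(distributions, distributions[1:]):
        tags = set(current) | set(following)
        # Counter returns 0 for tags that are missing from a sentence.
        difference = sum(abs(current[tag] - following[tag]) for tag in tags)
        population = sum(current.values()) + sum(following.values())
        total += difference / population
    return total / (len(distributions) - 1)


sentences = [["NOUN", "VERB"], ["NOUN", "ADJ"], ["VERB"]]
print(pos_dissimilarity_sketch(sentences))  # 0.75 = (2/4 + 3/3) / 2
```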
-
TRUNAJOD.surface_proxies.pos_distribution(doc)
Get POS distribution from a processed text.
Suppose that a given sentence has the following POS tags: [NOUN, VERB, ADJ, VERB, ADJ]. The POS distribution would be:
NOUN: 1
VERB: 2
ADJ: 2
This function returns this distribution as a dict.
- Parameters
doc (Spacy Doc) – Processed text
- Returns
POS distribution as a dict key POS, value Count
- Return type
dict
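The example above can be reproduced with the standard library. This sketch works on a plain list of tag strings rather than a spaCy Doc, and the helper name pos_distribution_sketch is hypothetical.

```python
from collections import Counter


def pos_distribution_sketch(pos_tags):
    """Count how many times each POS tag occurs and return a plain dict."""
    return dict(Counter(pos_tags))


print(pos_distribution_sketch(["NOUN", "VERB", "ADJ", "VERB", "ADJ"]))
# {'NOUN': 1, 'VERB': 2, 'ADJ': 2}
```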
-
TRUNAJOD.surface_proxies.pos_ratio(doc, pos_types)
Compute POS ratio given desired type of ratio.
The pos_types might be a regular expression if a composed ratio is needed. An example of usage would be pos_ratio(doc, "VERB|AUX").
- Parameters
doc (Spacy Doc) – Spacy processed text
pos_types (string) – POS to get the ratio
- Returns
Ratio over number of words
- Return type
float
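To illustrate how a regular expression selects several tags at once, the sketch below applies the pattern to a plain list of POS tags. The helper name pos_ratio_sketch is hypothetical; the use of a full match and the exclusion of PUNCT and SPACE from the word count are assumptions of this sketch, not a description of TRUNAJOD's internals.

```python
import re

NON_WORD_TAGS = {"PUNCT", "SPACE"}  # assumption: not counted as words


def pos_ratio_sketch(pos_tags, pos_types):
    """Ratio of word tokens whose POS tag fully matches the given pattern."""
    pattern = re.compile(pos_types)
    words = [tag for tag in pos_tags if tag not in NON_WORD_TAGS]
    if not words:
        return 0.0
    return sum(1 for tag in words if pattern.fullmatch(tag)) / len(words)


# "Ha llegado el tren."
tags = ["AUX", "VERB", "DET", "NOUN", "PUNCT"]
print(pos_ratio_sketch(tags, "VERB|AUX"))  # 0.5: two of four words match
```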
-
TRUNAJOD.surface_proxies.sentence_count(doc)
Return number of sentences in a text.
- Parameters
doc (Spacy Doc) – Text to be processed
- Returns
Number of sentences in the text
- Return type
int
-
TRUNAJOD.surface_proxies.subordination(doc, infinitive_map)
Return subordination, defined as the clause density.
Subordination is defined as the ratio between the # of clauses and the # of sentences. The number of clauses is computed heuristically.
- Parameters
doc (Spacy Doc) – Text to be processed.
infinitive_map – Lexicon containing maps from conjugate to infinitive.
- Returns
Subordination index
- Return type
float
-
TRUNAJOD.surface_proxies.syllable_count(doc)
Return number of syllables of a text.
- Parameters
doc (Spacy Doc) – Text to be processed.
- Returns
Number of syllables in the text
- Return type
int
-
TRUNAJOD.surface_proxies.syllable_word_ratio(doc)
Return average syllable word ratio.
It is computed as # of syllables / # of words.
- Parameters
doc (Spacy Doc) – Text to be processed.
- Returns
Syllable word ratio
- Return type
float
-
TRUNAJOD.surface_proxies.syntactic_similarity(doc)
Compute average syntactic similarity between sentences.
For each pair of sentences, compute the similarity between each pair of nodes using TRUNAJOD.surface_proxies.node_similarity(). Then, the result is averaged over the N - 1 pairs of sentences.
- Parameters
doc (Spacy Doc) – Processed text
- Returns
Average syntactic similarity over sentences.
- Return type
float
-
TRUNAJOD.surface_proxies.verb_noun_ratio(doc)
Compute Verb/Noun ratio.
- Parameters
doc (Spacy Doc) – Processed text
- Returns
Verb Noun ratio
- Return type
float
-
TRUNAJOD.surface_proxies.word_count(doc)
Return number of words in a text.
- Parameters
doc (Spacy Doc) – Text to be processed.
- Returns
Word count
- Return type
int
-
TRUNAJOD.surface_proxies.words_before_root(doc, max_depth=4)
Return average word count of words before root.
For each sentence, the number of words before the root is counted when the root is a verb. Otherwise, the verb at the highest node in the parse tree is treated as the root.
- Parameters
doc (Spacy Doc) – Text to be processed.
- Returns
Average words before root
- Return type
float