Semantic Measures

Semantic measure methods for TRUNAJOD.

The dimensions defined in this module require external knowledge. For example, synonym overlap measurement requires knowledge from a word ontology, and semantic measurements require word vectors (word embeddings) obtained from corpus semantics.

TRUNAJOD.semantic_measures.avg_w2v_semantic_similarity(docs, N)

Compute average semantic similarity between adjacent sentences.

This uses a word2vec [MCC+13] model based on the spaCy implementation. The semantic similarity follows the [FKL98] approach to computing text coherence.

Parameters
  • docs (Doc Generator) – Generator of Doc objects provided by the spaCy API

  • N (int) – Number of sentences

Returns

Average sentence similarity (cosine)

Return type

float
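
As a rough illustration of the idea described above (average cosine similarity between adjacent sentences), here is a minimal sketch in plain Python. It is not the TRUNAJOD implementation: the helper names `cosine` and `avg_adjacent_similarity` are hypothetical, sentence vectors are given as plain lists of floats rather than spaCy Doc objects, and averaging over the number of adjacent pairs is an assumption.

```python
import math
from typing import Iterable, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def avg_adjacent_similarity(vectors: Iterable[Sequence[float]]) -> float:
    """Average cosine similarity over consecutive sentence vectors.

    Hypothetical helper: assumes one vector per sentence and averages
    over the adjacent pairs (number of sentences minus one).
    """
    vectors = list(vectors)
    if len(vectors) < 2:
        return 0.0  # no adjacent pair to compare
    sims = [cosine(a, b) for a, b in zip(vectors, vectors[1:])]
    return sum(sims) / len(sims)


# Toy sentence vectors: the first two are identical, the third is orthogonal.
sentence_vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
result = avg_adjacent_similarity(sentence_vectors)
```

With real data, the vectors would come from a word-embedding model, for example the per-sentence vectors exposed by a spaCy pipeline that ships word vectors.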

TRUNAJOD.semantic_measures.get_synsets(lemma, synset_dict)

Return synonym set given a word lemma.

The function requires that the synset_dict is passed into it. In our case, we provide downloadable models from MCR (Multilingual Central Repository) [GALR12]. If the lemma is not found in the synset_dict, this function returns a set containing only the lemma.

Parameters
  • lemma (string) – Lemma to look up in the synset dictionary

  • synset_dict (Python dict) – key-value pairs, lemma to synset

Returns

The set of synonyms of a given lemma

Return type

Python set of strings
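
The lookup-with-fallback behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `lookup_synset` is a hypothetical stand-in, and `toy_synsets` is a made-up dictionary rather than an MCR model.

```python
def lookup_synset(lemma, synset_dict):
    """Return the synonym set for a lemma, or {lemma} if it is unknown.

    Hypothetical stand-in for the behaviour described in the text.
    """
    return synset_dict.get(lemma, {lemma})


# Made-up lemma-to-synset mapping, for illustration only.
toy_synsets = {"casa": {"casa", "hogar", "vivienda"}}

known = lookup_synset("casa", toy_synsets)      # lemma present in the dictionary
unknown = lookup_synset("perro", toy_synsets)   # lemma missing: falls back to {lemma}
```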

TRUNAJOD.semantic_measures.overlap(lemma_list_group, synset_dict)

Compute average overlap in a text.

Computes semantic synset (synonym) overlap based on a lemma list group and a dictionary containing synsets. Note that the computation divides by the number of text segments considered, which matches the TAACO implementation. For more details about this measurement, refer to [CKM16].

Parameters
  • lemma_list_group (List of List of strings) – List of tokenized and lemmatized sentences

  • synset_dict (Python dict) – key-value pairs for lemma-synonyms

Returns

Average overlap between sentences

Return type

float
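
One way to picture the measurement is the sketch below, which counts how many lemmas of each sentence appear in the synonym-expanded lemmas of the preceding sentence. It is only an approximation under assumptions: `synset_overlap` is a hypothetical name, comparing adjacent sentences is an assumption, and so is normalising by the number of adjacent pairs; the library and TAACO may define the segments and the denominator differently.

```python
def synset_overlap(lemma_list_group, synset_dict):
    """Average synonym overlap between consecutive sentences.

    Hypothetical sketch: for each adjacent pair, expand the lemmas of the
    first sentence with their synsets, count the lemmas of the second
    sentence found in that expansion, then divide the total by the number
    of pairs (assumed normalisation).
    """
    pairs = list(zip(lemma_list_group, lemma_list_group[1:]))
    if not pairs:
        return 0.0  # fewer than two sentences: nothing to compare
    total = 0
    for current, following in pairs:
        expanded = set()
        for lemma in current:
            # Unknown lemmas fall back to a set containing only themselves.
            expanded |= synset_dict.get(lemma, {lemma})
        total += sum(1 for lemma in following if lemma in expanded)
    return total / len(pairs)


# Made-up data: "hogar" is listed as a synonym of "casa".
toy_synsets = {"casa": {"casa", "hogar"}}
sentences = [["casa", "grande"], ["hogar", "bonito"], ["perro", "ladrar"]]
score = synset_overlap(sentences, toy_synsets)
```

In this toy input, only the first pair of sentences shares a synonym, so the single match is averaged over two pairs.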

CKM16

Scott A Crossley, Kristopher Kyle, and Danielle S McNamara. The tool for the automatic analysis of text cohesion (taaco): automatic assessment of local, global, and text cohesion. Behavior research methods, 48(4):1227–1237, 2016.

FKL98

Peter W Foltz, Walter Kintsch, and Thomas K Landauer. The measurement of textual coherence with latent semantic analysis. Discourse processes, 25(2-3):285–307, 1998.

GALR12

Aitor Gonzalez-Agirre, Egoitz Laparra, and German Rigau. Multilingual central repository version 3.0. In LREC, 2525–2529. 2012.

MCC+13

Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, L Sutskever, and G Zweig. Word2vec. URL https://code.google.com/p/word2vec, 2013.