Notes from the Coursera Text Mining and Analytics course.
Syntagmatic Relations
**Categorisation**
There are different levels of text representation which are useful for different purposes. We often use multiple representations at once.
Paradigmatic - words belong to the same class. e.g. “banana” and “apple”
We can mine paradigmatic relations by finding words with high context similarity.
Syntagmatic - words that combine with each other and tend to co-occur - e.g. “fly” and “helicopter”
We can mine syntagmatic relations by finding words with high co-occurrences but relatively low individual occurrences.
Word context can be considered as a pseudo document - a “bag of words”.
Split a pseudo document into words and count them:
`d1=(x_1, ...x_n), x_i=(c(w_i, d1))/|d1|`
`d2=(y_1, ...y_n), y_i=(c(w_i, d2))/|d2|`
where:
`c(w_i, d1)` is the count of word `w_i` in d1
`|d1|` is the total count of words in d1
`x_i` is the probability that a randomly picked word from d1 is `w_i`
Compute document similarity as the dot product of the normalised word count vectors (the Expected Overlap of Words in Context, EOWC):
Similarity(d1, d2) = `d1 . d2 = sum_(i=1)^n x_i y_i`
This works but has faults: it favours matching one frequent word over matching several distinct words, and it treats every word as equally important.
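The normalised count vectors and their dot product can be sketched in a few lines of Python. Whitespace tokenisation is an assumption for illustration; a real system would use a proper tokeniser.

```python
from collections import Counter

def word_count_vector(doc):
    """Normalised word-count vector: x_i = c(w_i, d) / |d|."""
    words = doc.split()  # assumption: whitespace tokenisation
    total = len(words)
    return {w: c / total for w, c in Counter(words).items()}

def similarity(d1, d2):
    """EOWC: dot product of the two normalised count vectors."""
    v1, v2 = word_count_vector(d1), word_count_vector(d2)
    return sum(p * v2.get(w, 0.0) for w, p in v1.items())
```

For example, `similarity("the cat sat", "the cat ran")` is `1/9 + 1/9 = 2/9`, since only “the” and “cat” are shared.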
BM25 Transformation - a term frequency transformation
`y=TF(w,d), x=c(w,d)`
`y=((k+1)x) / (x+k)`
`k in [0, +oo)`
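The transformation is sub-linear in the raw count and bounded above by `k+1`, so one very frequent word cannot dominate. A minimal sketch, using `k = 1.2` as a commonly used default (an assumption, not a value from the notes):

```python
def bm25_tf(x, k=1.2):
    """BM25 term-frequency transformation: y = ((k+1) x) / (x + k).
    Sub-linear in x and bounded above by k + 1."""
    return ((k + 1) * x) / (x + k)
```

Note that `bm25_tf(0) = 0` and the value approaches `k + 1` as the count grows.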
IDF term weighting
`IDF(W) = log((M+1)/k)`
`M` is the total number of docs in collection
`k` is the total number of docs containing word W
We can use the BM25 transformation and IDF term weighting to improve on EOWC (the Expected Overlap of Words in Context similarity).
For two documents, `d1=(x_1, ...x_n)` and `d2=(y_1, ...y_n)`
`BM25(w_i, d1) = ((k+1)c(w_i, d1))/(c(w_i, d1) + k(1 - b + b |d1|/avdl))`
`b in [0, 1], k in [0, +oo)`, `avdl` is the average document length
`x_i = (BM25(w_i, d1)) / (sum_(j=1)^n BM25(w_j, d1))`
`y_i = ... as for x_i`
Similarity(d1, d2) = `sum_(i=1)^n IDF(w_i) x_i y_i`
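A sketch of the improved similarity, assuming the common defaults `k = 1.2` and `b = 0.75` and an IDF table supplied by the caller (both assumptions; whitespace tokenisation again for illustration):

```python
from collections import Counter

def bm25_weights(doc_counts, doc_len, avdl, k=1.2, b=0.75):
    """BM25(w, d) = ((k+1) c(w,d)) / (c(w,d) + k(1 - b + b|d|/avdl))."""
    denom_norm = k * (1 - b + b * doc_len / avdl)
    return {w: ((k + 1) * c) / (c + denom_norm) for w, c in doc_counts.items()}

def bm25_similarity(doc1, doc2, idf):
    """Similarity(d1, d2) = sum_i IDF(w_i) x_i y_i, with x and y the
    normalised BM25 weights of the two documents."""
    docs = [doc1.split(), doc2.split()]
    avdl = sum(len(d) for d in docs) / 2  # assumption: avdl over this pair
    vecs = []
    for words in docs:
        bm = bm25_weights(Counter(words), len(words), avdl)
        total = sum(bm.values())
        vecs.append({w: v / total for w, v in bm.items()})
    x, y = vecs
    return sum(idf.get(w, 0.0) * p * y.get(w, 0.0) for w, p in x.items())
```

In a real collection `avdl` and the IDF table would be computed over all documents, not just the pair being compared.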
Define a binary random variable `X_w in {0,1}`
`X_w = {0 if w not present; 1 if w present}`
then
`P(X_w =1) + P(X_w =0) = 1` since a word is either present or not
Entropy measures the randomness of X
`H(X_w) = sum_(v in {0,1})-p(X_w=v)log_2p(X_w=v)`
`= -p(X_w=0)log_2 p(X_w=0) - p(X_w=1)log_2 p(X_w=1)`
Define `0 log_2 0=0` since `log_2 0` is undefined
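The binary entropy formula, including the `0 log_2 0 = 0` convention, in a short sketch:

```python
import math

def entropy(p):
    """H(X_w) for a binary variable with P(X_w = 1) = p,
    using the convention 0 * log2(0) = 0."""
    h = 0.0
    for q in (p, 1 - p):
        if q > 0:  # skip the 0 * log2(0) term, defined as 0
            h -= q * math.log2(q)
    return h
```

Entropy is maximal (`H = 1` bit) at `p = 0.5` and zero when the outcome is certain.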
Conditional entropy is the entropy of `X_(w1)` given that we know the value of `X_(w2)` (i.e. whether w2 is present):
`H(X_(w1) | X_(w2)) = sum_(u in {0,1})[p(X_(w2) = u)H(X_(w1) | X_(w2) = u)]`
= `sum_(u in {0,1})[p(X_(w2) = u) sum_(v in {0,1})[-p(X_(w1) = v | X_(w2) = u) log_2 p(X_(w1) = v | X_(w2) = u)]]`
This has the following properties:
H(X) >= H(X|Y)
H(X|Y) >= 0
H(X|X) = 0
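A sketch of conditional entropy computed from a joint distribution; representing the joint as a dict keyed by `(v, u) = (X_(w1), X_(w2))` is an assumption for illustration:

```python
import math

def cond_entropy(joint):
    """H(X_(w1) | X_(w2)) from a joint table joint[(v, u)] = P(X_(w1)=v, X_(w2)=u)."""
    h = 0.0
    for u in (0, 1):
        p_u = joint.get((0, u), 0.0) + joint.get((1, u), 0.0)  # P(X_(w2) = u)
        if p_u == 0:
            continue
        for v in (0, 1):
            p_vu = joint.get((v, u), 0.0) / p_u  # P(X_(w1)=v | X_(w2)=u)
            if p_vu > 0:
                h -= p_u * p_vu * math.log2(p_vu)
    return h
```

Two fair, independent variables give `H(X|Y) = 1` bit, while perfectly correlated variables give `H(X|Y) = 0`, matching the properties above.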
This is great for ranking candidate words w2 against a fixed w1, but it can’t be used to compare different pairs of words against each other, because the upper bound `H(X_(w1))` differs from word to word.
Mutual information is the reduction in entropy of X obtained by knowing Y.
This measure CAN be used to compare different pairs of words against each other.
`I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)`
This has the following properties:
I(X;Y) >= 0
I(X;Y) = I(Y;X)
I(X;Y) = 0 <=> X,Y independent
`I(X_(w1); X_(w2)) = sum_(u in {0,1}) sum_(v in {0,1}) p(X_(w1) = u, X_(w2) = v) log_2 ((p(X_(w1) = u, X_(w2) = v)) / (p(X_(w1) = u) p(X_(w2) = v)))`
We use this to mine syntagmatic relations between pairs of words.
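A sketch of the mutual information formula, using the same joint-table representation as above (an assumption for illustration):

```python
import math

def mutual_information(joint):
    """I(X_(w1); X_(w2)) = sum_{v,u} p(v,u) log2( p(v,u) / (p(v) p(u)) )."""
    # Marginals from the joint table joint[(v, u)] = P(X_(w1)=v, X_(w2)=u)
    p1 = {v: joint.get((v, 0), 0.0) + joint.get((v, 1), 0.0) for v in (0, 1)}
    p2 = {u: joint.get((0, u), 0.0) + joint.get((1, u), 0.0) for u in (0, 1)}
    mi = 0.0
    for v in (0, 1):
        for u in (0, 1):
            p_vu = joint.get((v, u), 0.0)
            if p_vu > 0:
                mi += p_vu * math.log2(p_vu / (p1[v] * p2[u]))
    return mi
```

Independent variables give `I = 0`; two perfectly correlated fair coins give `I = 1` bit, matching the properties above.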
In a Probabilistic Topical Model we treat a topic as a word distribution. We use θ to denote a topic.
We define a vocabulary set V which contains all words:
`V={w1, w2, ...}`
As this is a probability distribution, the sum of all the probabilities must be 1.
`sum_(w in V) p(w|theta_i) = 1`
We denote the probability of document `d_i` covering topic `theta_j` as `π_(ij)`
The coverage of topics for each document `d_i` also sums to 1. i.e:
`sum_(j=1)^k π_(ij) = 1`
Input: C - collection of documents, k - number of topics, V - vocabulary set
Output: Set of topics {θ_1, … θ_k}; coverage of topics for each document d_i: {π_(i1), … π_(ik)}
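A hypothetical toy example of what such an output looks like (all words, topics and numbers are made up, purely to illustrate the two sum-to-1 constraints):

```python
# k = 2 topics over the vocabulary V = {"loan", "bank", "river"}
topics = [
    {"loan": 0.5, "bank": 0.4, "river": 0.1},  # theta_1
    {"loan": 0.1, "bank": 0.4, "river": 0.5},  # theta_2
]
pi = [
    [0.7, 0.3],  # document d_1: 70% theta_1, 30% theta_2
    [0.2, 0.8],  # document d_2
]

# Both constraints from the notes hold:
assert all(abs(sum(t.values()) - 1.0) < 1e-9 for t in topics)   # sum_w p(w|theta_i) = 1
assert all(abs(sum(row) - 1.0) < 1e-9 for row in pi)            # sum_j pi_ij = 1
```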
…
| | system - Y | system - N |
|---|---|---|
| human - Y | TP (true positive) | FN (false negative) |
| human - N | FP (false positive) | TN (true negative) |
`Precision = (TP) / (TP + FP)`
When the system says yes - how many are correct?
`Recall = (TP) / (TP + FN)`
Of the categories the document should have, how many did the system find?
`F_beta = ((beta^2 + 1) (P * R)) / (beta^2 P + R)`
`F_1 = 2PR / (P + R)`
Harmonic mean of precision and recall. Better than using arithmetic mean.
beta = 1 -> precision and recall are given equal weight; larger beta weights recall more heavily
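The formulas above, computed directly from the contingency counts:

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """P, R and F_beta from the contingency-table counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = ((beta ** 2 + 1) * p * r) / (beta ** 2 * p + r)
    return p, r, f
```

For example, `tp=8, fp=2, fn=2` gives `P = R = F_1 = 0.8`.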
We can aggregate the P, R and F1 values in different ways, either with micro averaging or with macro averaging. It is beneficial to compare both.
In micro averaging you pool all decisions into one contingency table, then compute P, R and F1 in one step using the totals.
This method gives every individual decision the same importance, so frequent categories dominate the result.
In macro averaging you compute P, R and F1 per category (or per document) and then average them, giving each category equal weight. Macro averaging tends to be better.
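A sketch of the two aggregation methods over per-category contingency counts `(tp, fp, fn)`; the counts used in the example are made up to show how a large category dominates the micro average:

```python
def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_f1(tables):
    """Pool all counts into one table, then compute F1 once."""
    tp = sum(t[0] for t in tables)
    fp = sum(t[1] for t in tables)
    fn = sum(t[2] for t in tables)
    return f1(tp, fp, fn)

def macro_f1(tables):
    """Compute F1 per category, then average with equal weight."""
    return sum(f1(*t) for t in tables) / len(tables)
```

With one large, easy category `(90, 10, 10)` and one small, hard one `(1, 4, 4)`, micro F1 is about 0.87 while macro F1 is 0.55: the macro average exposes the poor small-category performance that the micro average hides.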
An opinion is a subjective statement about what a person thinks about something
Unigram language models aren’t great for sentiment analysis, e.g. “it’s not good” vs “it’s not as good as”. So it is better to use n-gram models, but watch out for overfitting.