# Text Mining and Analytics

Notes from the Coursera Text Mining and Analytics course.

Recommended reading on syntagmatic relations:

• Chris Manning and Hinrich Schütze - Foundations of Statistical Natural Language Processing
• ChengXiang Zhai - Exploiting context to identify lexical atoms: A statistical view of linguistic context
• Shan Jiang and ChengXiang Zhai - Random walks on adjacency graphs for mining lexical relations from big text data

## Analysing Text

• Part-of-speech tagging - labelling each word in a piece of text as a noun, verb, etc. (see the sketch after this list)
• Entity / relation extraction - seeing how entities in text relate to one another
• Sentiment analysis - extracting sentiment from text
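
A minimal sketch of part-of-speech tagging, assuming the NLTK library is available (the tokenizer/tagger resource names are assumptions and vary slightly between NLTK versions):

```python
import nltk

# One-off downloads of the tokenizer and tagger models (resource names vary by NLTK version).
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

text = "A man saw a boy with a telescope"
tokens = nltk.word_tokenize(text)

# Each token is labelled with a part-of-speech tag, e.g. ('man', 'NN'), ('saw', 'VBD').
print(nltk.pos_tag(tokens))
```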

### Ambiguity

• Word level ambiguity - words can have more than one meaning. e.g. “design” can be either a noun or a verb
• Syntactic ambiguity - sentence structure can cause ambiguity, e.g. “A man saw a boy with a telescope” (did the man use the telescope, or did the boy have it?)

### Shallow vs Deep Analysis

• Deep understanding extracts precise, in-depth knowledge from text. This is difficult and imprecise, and may need a human in the loop, because computers lack the common sense that humans have.
• Shallow understanding is robust and general. It is based on statistical methods.

### Text Representation

There are different levels of text representation which are useful for different purposes. We often use multiple representations at once.

1. String of characters - the simplest form
2. Sequence of words + part-of-speech tags - each word together with its word type
3. Syntactic structures. Noun phrase, verb phrase etc.
4. Entities and relations - what entities are in the text and how do they relate
5. Logic predicates
6. Speech acts - is the text asking somebody to do something?

## Word Relations

### Types of Word Relations

Paradigmatic - words belong to the same class and can substitute for each other, e.g. “banana” and “apple”

We can mine paradigmatic relations by finding words with high context similarity.

Syntagmatic - words that are semantically related and tend to be combined with each other, e.g. “fly” and “helicopter”

We can mine syntagmatic relations by finding words with high co-occurrences but relatively low individual occurrences.

### Word Contexts

• Left context - the words appearing to the left of the target word
• Right context - the words appearing to the right of the target word
• General context - nearby words, e.g. words in the same sentence

Word context can be considered as a pseudo document - a “bag of words”.

### Vector Space Model (VSM) of a Bag of Words

Split a pseudo document into words and count them:

• Terms: (“eats”, “ate”, “is”, “has” ….)
• Counts: (3, 2, 10, 3 ….)
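
A minimal sketch of building such a count vector from a pseudo document, using only the Python standard library (the example text is invented):

```python
from collections import Counter

pseudo_doc = "the cat eats fish the dog eats meat"
counts = Counter(pseudo_doc.split())

# Terms and counts, mirroring the Terms/Counts vectors above.
print(list(counts.keys()))    # ['the', 'cat', 'eats', 'fish', 'dog', 'meat']
print(list(counts.values()))  # [2, 1, 2, 1, 1, 1]
```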

### Expected Overlap of Words in Context (EOWC)

Compute document similarity as the dot product of the normalised word count vectors.

This works but has faults:

• It favours matching one frequent term over matching more distinct terms
• It treats every word as equally important, including uninformative words such as “the”

d1=(x_1, ...x_n), x_i=(c(w_i, d1))/|d1|

d2=(y_1, ...y_n), y_i=(c(w_i, d2))/|d2|

Similarity(d1, d2) = d1 . d2 = sum_(i=1)^n x_i y_i

x_i is the probability that a randomly picked word from d1 is w_i

c(w_i, d1) is the count of word w_i in d1

|d1| is the total count of words in d1
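
A sketch of EOWC exactly as defined above: normalise each count vector by the document length, then take the dot product. The two pseudo documents are invented for illustration:

```python
from collections import Counter

def normalised_counts(doc_words, vocab):
    counts = Counter(doc_words)
    total = len(doc_words)
    # x_i = c(w_i, d) / |d|
    return [counts[w] / total for w in vocab]

d1 = "the cat eats fish and the cat sleeps".split()
d2 = "the dog eats meat and the dog barks".split()

vocab = sorted(set(d1) | set(d2))
x = normalised_counts(d1, vocab)
y = normalised_counts(d2, vocab)

# Similarity(d1, d2) = sum_i x_i * y_i
similarity = sum(xi * yi for xi, yi in zip(x, y))
print(similarity)
```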

## Retrieval Heuristics

BM25 Transformation - a term frequency transformation

y=TF(w,d), x=c(w,d)

y=((k+1)x) / (x+k)

k in [0, +oo)

IDF term weighting

IDF(W) = log((M+1)/k)

M is the total number of docs in the collection

k is the total number of docs containing word W
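
A small sketch of both transformations with made-up counts and collection sizes, to show how they dampen frequent terms and reward rare words:

```python
import math

def bm25_tf(x, k=1.2):
    # y = ((k + 1) * x) / (x + k): sub-linear and bounded above by k + 1
    return ((k + 1) * x) / (x + k)

def idf(doc_freq, num_docs):
    # IDF(W) = log((M + 1) / k), where k is the number of docs containing W
    return math.log((num_docs + 1) / doc_freq)

# A raw count of 10 is dampened; a word in 5 of 1000 docs gets a high weight.
print(bm25_tf(10))     # close to the upper bound k + 1 = 2.2
print(idf(5, 1000))    # rare word -> large IDF
print(idf(900, 1000))  # common word -> small IDF
```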

### Using BM25 for Paradigmatic Relation Mining

We can use BM25 and IDF term weighting to improve on EOWC

For two documents, d1=(x_1, ...x_n) and d2=(y_1, ...y_n)

BM25(w_i, d1) = ((k+1) c(w_i, d1)) / (c(w_i, d1) + k(1 - b + b |d1| / avdl))

b in [0, 1], k in [0, +oo), avdl is the average document length

x_i = BM25(w_i, d1) / (sum_(j=1)^n BM25(w_j, d1))

y_i is defined in the same way using d2

Similarity(d1, d2) = sum_(i=1)^n IDF(w_i) x_i y_i
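
A sketch of the BM25-weighted similarity, combining the normalised BM25 vectors with IDF weighting; the documents, k and b values are illustrative choices rather than tuned parameters:

```python
import math
from collections import Counter

def bm25_weight(count, doc_len, avdl, k=1.2, b=0.75):
    # ((k + 1) c(w, d)) / (c(w, d) + k (1 - b + b * |d| / avdl))
    return ((k + 1) * count) / (count + k * (1 - b + b * doc_len / avdl))

def bm25_vector(doc_words, vocab, avdl):
    counts = Counter(doc_words)
    weights = [bm25_weight(counts[w], len(doc_words), avdl) for w in vocab]
    total = sum(weights)
    # x_i = BM25(w_i, d) / sum_j BM25(w_j, d)
    return [w / total for w in weights]

d1 = "the cat eats fish and the cat sleeps".split()
d2 = "the dog eats meat and the dog barks".split()
vocab = sorted(set(d1) | set(d2))
avdl = (len(d1) + len(d2)) / 2

# Document frequencies over this tiny two-document "collection" for the IDF term.
doc_freq = {w: (w in set(d1)) + (w in set(d2)) for w in vocab}
num_docs = 2

x = bm25_vector(d1, vocab, avdl)
y = bm25_vector(d2, vocab, avdl)

# Similarity(d1, d2) = sum_i IDF(w_i) x_i y_i
similarity = sum(
    math.log((num_docs + 1) / doc_freq[w]) * xi * yi
    for w, xi, yi in zip(vocab, x, y)
)
print(similarity)
```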

## Syntagmatic Relation Discovery

### Entropy

Define a binary random variable X_w in {0,1}

X_w = 1 if w is present, 0 if w is absent

then

P(X_w =1) + P(X_w =0) = 1 since a word is either present or not

Entropy measures the randomness of X

H(X_w) = sum_(v in {0,1}) -p(X_w = v) log_2 p(X_w = v)

= -p(X_w = 0) log_2 p(X_w = 0) - p(X_w = 1) log_2 p(X_w = 1)

Define 0 log_2 0=0 since log_2 0 is undefined
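
A minimal sketch of estimating p(X_w = 1) as the fraction of text segments containing w and computing H(X_w); the segments are invented for illustration:

```python
import math

segments = [
    "the cat eats fish",
    "the dog eats meat",
    "the sun is bright",
    "a cat sleeps on the mat",
]

def entropy(word, segments):
    # p(X_w = 1): fraction of segments containing the word
    p1 = sum(word in s.split() for s in segments) / len(segments)
    p0 = 1 - p1
    # H(X_w) = -p0 log2 p0 - p1 log2 p1, using the convention 0 log2 0 = 0
    return -sum(p * math.log2(p) for p in (p0, p1) if p > 0)

print(entropy("the", segments))  # appears everywhere -> entropy 0
print(entropy("cat", segments))  # appears in half the segments -> maximum entropy of 1 bit
```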

### Conditional Entropy

Conditional entropy is the remaining entropy of X_(w1) once we know whether w2 is present or not

H(X_(w1) | X_(w2)) = sum_(u in {0,1})[p(X_(w2) = u)H(X_(w1) | X_(w2) = u)]

= sum_(u in {0,1})[p(X_(w2) = u) sum_(v in {0,1})[-p(X_(w1) = v | X_(w2) = u) log_2 p(X_(w1) = v | X_(w2) = u)]]

This has the following properties:

• H(X) >= H(X|Y)
• H(X|Y) >= 0
• H(X|X) = 0

This is useful, but it can’t be used to compare different pairs of words against each other: the conditional entropy is bounded by H(X_(w1)), which differs from one target word to another.
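
A sketch of conditional entropy, estimated by splitting invented segments on whether w2 occurs and averaging the resulting entropies of w1:

```python
import math

def binary_entropy(p1):
    # Entropy of a binary variable with p(X = 1) = p1, using 0 log2 0 = 0
    return -sum(p * math.log2(p) for p in (p1, 1 - p1) if p > 0)

def conditional_entropy(w1, w2, segments):
    # H(X_w1 | X_w2) = sum_u p(X_w2 = u) H(X_w1 | X_w2 = u)
    total = 0.0
    for u in (0, 1):
        subset = [s for s in segments if (w2 in s.split()) == bool(u)]
        if not subset:
            continue
        p_u = len(subset) / len(segments)
        p_w1 = sum(w1 in s.split() for s in subset) / len(subset)
        total += p_u * binary_entropy(p_w1)
    return total

segments = [
    "the cat eats fish",
    "the cat sleeps",
    "the dog eats meat",
    "the dog barks",
]
# Knowing whether "cat" occurs removes some of the uncertainty about "fish".
print(conditional_entropy("fish", "cat", segments))
print(binary_entropy(1 / 4))  # H(X_fish) for comparison: H(X) >= H(X|Y)
```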

### Mutual Information

The reduction in entropy of X obtained by knowing Y.

This measure CAN be used to compare different pairs of words against each other.

I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

This has the following properties:

• I(X;Y) >= 0
• I(X;Y) = I(Y;X)
• I(X;Y) = 0 <=> X,Y independent

#### KL-Divergence Form of Mutual Information

I(X_(w1); X_(w2)) = sum_(u in {0,1}) sum_(v in {0,1}) p(X_(w1) = u, X_(w2) = v) log_2 ((p(X_(w1) = u, X_(w2) = v)) / (p(X_(w1) = u) p(X_(w2) = v)))

We use this to mine syntagmatic relations between pairs of words.
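
A sketch of the KL-divergence form, estimating the joint and marginal probabilities by counting invented segments (in practice the counts would be smoothed to avoid zero probabilities):

```python
import math

def mutual_information(w1, w2, segments):
    n = len(segments)
    # Joint counts of (X_w1 = u, X_w2 = v) over the segments
    joint = {(u, v): 0 for u in (0, 1) for v in (0, 1)}
    for s in segments:
        words = set(s.split())
        joint[(int(w1 in words), int(w2 in words))] += 1

    p1 = sum(joint[(1, v)] for v in (0, 1)) / n  # p(X_w1 = 1)
    p2 = sum(joint[(u, 1)] for u in (0, 1)) / n  # p(X_w2 = 1)
    marg1 = {1: p1, 0: 1 - p1}
    marg2 = {1: p2, 0: 1 - p2}

    mi = 0.0
    for (u, v), c in joint.items():
        if c == 0:
            continue  # a zero-probability cell contributes 0
        p_uv = c / n
        mi += p_uv * math.log2(p_uv / (marg1[u] * marg2[v]))
    return mi

segments = [
    "the cat eats fish",
    "the cat sleeps",
    "the dog eats meat",
    "the dog barks",
]
print(mutual_information("cat", "fish", segments))  # positively associated
print(mutual_information("cat", "dog", segments))   # never co-occur here, also informative
```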

## Topic Mining

### Probabilistic Topic Model

In a probabilistic topic model we treat a topic as a word distribution. We use θ to denote a topic.

We define a vocabulary set V which contains all words:

V={w1, w2, ...}

As this is a probability distribution, the sum of all the probabilities must be 1.

sum_(w in V) p(w|theta_i) = 1

We denote the probability of document d_i covering topic theta_j as π_(ij)

The coverage of topics for each document d_i also sums to 1, i.e.:

sum_(j=1)^k π_(ij) = 1
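
A tiny sketch of the two constraints, with invented topics and coverages: every word distribution p(w|θ_i) sums to 1, and so does each document's coverage π_ij:

```python
# Two toy topics over a tiny vocabulary V; each p(w | theta_i) distribution sums to 1.
topics = {
    "sports": {"game": 0.4, "team": 0.3, "loan": 0.1, "rate": 0.2},
    "finance": {"game": 0.1, "team": 0.1, "loan": 0.4, "rate": 0.4},
}

# Topic coverage pi_ij for each document; each row also sums to 1.
coverage = {
    "d1": {"sports": 0.8, "finance": 0.2},
    "d2": {"sports": 0.3, "finance": 0.7},
}

for name, dist in topics.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9, name
for doc, pis in coverage.items():
    assert abs(sum(pis.values()) - 1.0) < 1e-9, doc
```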

### Topic Mining Game Plan

Input: C - collection of documents, k - number of topics, V - vocabulary set

Output: a set of topics {θ1, … θk} and, for each document d_i, its coverage of those topics {πi1, … πik}

## Categorization

### Evaluation

|           | system - Y          | system - N          |
|-----------|---------------------|---------------------|
| human - Y | TP (true positive)  | FN (false negative) |
| human - N | FP (false positive) | TN (true negative)  |

#### Precision

Precision = (TP) / (TP + FP)

Of the cases where the system says yes, how many are correct?

#### Recall

Recall = (TP) / (TP + FN)

Of the categories the document should have, how many did the system find?

#### F-Score

F_beta = ((beta^2 + 1) (P * R)) / (beta^2 P + R)

F_1 = (2 P R) / (P + R)

The harmonic mean of precision and recall. It is a better summary than the arithmetic mean because it is dominated by the lower of the two values.

beta = 1 -> precision and recall are given equal weight
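
A small sketch of precision, recall and F_beta computed from raw counts (the TP/FP/FN values are invented):

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_score(p, r, beta=1.0):
    # F_beta = ((beta^2 + 1) P R) / (beta^2 P + R); beta = 1 gives the harmonic mean
    return ((beta ** 2 + 1) * p * r) / (beta ** 2 * p + r)

tp, fp, fn = 80, 20, 40
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, f_score(p, r))  # 0.8, ~0.667, ~0.727
```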

### Aggregation

We can aggregate the P, R and F1 values in different ways

#### Macro averaging

Aggregation can be done either:

• per-category - compute averages for all documents in a category
• per-document - compute averages for all categories in a document

Beneficial to compare both.

Different aggregation methods:

• using arithmetic mean emphasises high values
• using geometric mean emphasises low values

#### Micro averaging

In micro averaging you pool all results into one table then compute P, R, F1 in one step using the totals.

This method gives every individual classification decision the same weight, so frequent categories dominate the result.

Macro averaging tends to be preferred because it treats each category (or document) equally, so rare categories are not drowned out.
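
A small sketch contrasting the two aggregations on invented per-category counts; the rare category drags the macro average down, while the micro average is dominated by the frequent category:

```python
# Per-category (TP, FP, FN) counts, invented for illustration.
categories = {
    "frequent": (90, 10, 10),
    "rare": (1, 4, 9),
}

def f1(tp, fp, fn):
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

# Macro: average the per-category F1 scores (each category counts equally).
macro_f1 = sum(f1(*c) for c in categories.values()) / len(categories)

# Micro: pool all counts into one table, then compute a single F1 from the totals.
totals = [sum(c[i] for c in categories.values()) for i in range(3)]
micro_f1 = f1(*totals)

print(macro_f1, micro_f1)  # the macro score is pulled down by the rare category
```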

## Opinion Mining

An opinion is a subjective statement about what a person thinks about something

### Elements of an Opinion

• Opinion holder
• Opinion target
• Opinion content
• Opinion context
• Opinion sentiment

### Types of Opinions

• Author’s opinion
• Reported opinion
• Indirect/inferred opinion

### Features

Unigram (bag-of-words) features aren’t great for sentiment analysis - e.g. “it’s not good” and “it’s not as good as” look very similar as bags of words. N-gram features capture such patterns better, but watch out for overfitting (see the sketch after the list below).

• Word classes - syntactic (e.g. POS tags), semantic (e.g. thesaurus entries / recognised entities), or empirical word clusters (e.g. clusters of syntagmatically related words)
• Frequent patterns in text
• Parse-tree-based features - patterns derived from the syntactic parse tree of a sentence (e.g. frequent subtrees)
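
A minimal sketch of the unigram vs n-gram point above: bigrams capture the negation pattern in “not good” that unigrams miss (the tokenisation is deliberately naive):

```python
def ngrams(tokens, n):
    # Contiguous n-grams; bigrams keep "not good" together as a single feature.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "it is not good".split()
print(ngrams(tokens, 1))  # unigrams lose the negation pattern
print(ngrams(tokens, 2))  # ['it is', 'is not', 'not good']
```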