Notes from the Coursera Text Mining and Analytics course.

**Syntagmatic Relations**

- Chris Manning and Hinrich Schütze - Foundations of Statistical Natural Language Processing
- ChengXiang Zhai - Exploiting Context to Identify Lexical Atoms: A Statistical View of Linguistic Context
- Shan Jiang and ChengXiang Zhai - Random Walks on Adjacency Graphs for Mining Lexical Relations from Big Text Data

**Categorisation**

- **Parts of speech tagging** - splitting a piece of text into nouns, verbs etc.
- **Entity / relation extraction** - seeing how entities in text relate to one another
- **Sentiment analysis** - extracting sentiment from text

- **Word level ambiguity** - words can have more than one meaning, e.g. *“design”* can be either a noun or a verb
- **Syntactic ambiguity** - sentence structure can cause ambiguity, e.g. *“A man saw a boy with a telescope”*

**Deep understanding** is extracting precise, in-depth knowledge from text. This is difficult and imprecise, and may need a human in the loop, because computers do not have the same common sense that humans do. **Shallow understanding** is robust and general. It is based on statistical methods.

There are different levels of text representation which are useful for different purposes. We often use multiple representations at once.

- String of Characters - most simple form
- Sequence of words + *Parts Of Speech* tags - words and their word types
- Syntactic structures - noun phrases, verb phrases etc.
- Entities and relations - what entities are in the text and how do they relate
- Logic predicates
- Speech acts - is the text asking somebody to do something?

**Paradigmatic** - words belong to the same class. e.g. *“banana” and “apple”*

We can mine paradigmatic relations by finding words with high context similarity.

**Syntagmatic** - words that are syntagmatically related, i.e. they combine with each other and tend to co-occur. e.g *“fly” and “helicopter”*

We can mine syntagmatic relations by finding words with high co-occurrences but relatively low individual occurrences.

- **Left context** - word to the left of another word
- **Right context** - word to the right of another word
- **General context** - nearby words, e.g. in the same sentence

Word context can be considered as a pseudo document - a “bag of words”.

Split a pseudo document into words and count them:

- Terms: (“eats”, “ate”, “is”, “has” ….)
- Counts: (3, 2, 10, 3 ….)
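For example, a minimal sketch of building such a pseudo document in Python with `collections.Counter` (the context words are made up for illustration):

```python
from collections import Counter

# Hypothetical context words of some target word, gathered from a corpus.
context_words = ["eats", "ate", "is", "eats", "has", "is", "eats", "ate"]

# The pseudo document is just a bag of words: term -> count.
pseudo_doc = Counter(context_words)
print(pseudo_doc)  # Counter({'eats': 3, 'ate': 2, 'is': 2, 'has': 1})
```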

Compute document similarity as the dot product of the normalised word count vectors.

This works but has faults:

- Frequent terms are favoured over more distinct terms
- All words are treated equally - e.g. “the” counts as much as an informative word

`d1=(x_1, ...x_n), x_i=(c(w_i, d1))/|d1|`

`d2=(y_1, ...y_n), y_i=(c(w_i, d2))/|d2|`

Similarity(d1, d2) = `d1 . d2 = sum_(i=1)^n x_i y_i`

`x_i` is the probability that a randomly picked word from d1 is `w_i`

`c(w_i, d1)` is the count of word `w_i` in d1

`|d1|` is the total count of words in d1
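A sketch of this similarity in Python, following the definitions above (the word counts are illustrative):

```python
def normalised_counts(counts):
    """Turn raw counts c(w, d) into probabilities x_i = c(w_i, d) / |d|."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def eowc_similarity(d1_counts, d2_counts):
    """Dot product of the normalised count vectors of two pseudo documents."""
    x = normalised_counts(d1_counts)
    y = normalised_counts(d2_counts)
    return sum(x[w] * y.get(w, 0.0) for w in x)

# Illustrative counts only.
d1 = {"eats": 3, "ate": 2, "is": 10, "has": 3}
d2 = {"eats": 1, "flies": 4, "is": 8, "has": 2}
print(eowc_similarity(d1, d2))
```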

**BM25 Transformation** - a term frequency transformation

`y = TF(w,d)`, `x = c(w,d)`

`y=((k+1)x) / (x+k)`

`k in [0, +oo)`
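A quick illustration of the transformation: as the raw count `x` grows, `y` saturates towards `k + 1`, so very frequent terms stop dominating (k = 1.2 is just an example value):

```python
def bm25_tf(x, k=1.2):
    """BM25 term-frequency transformation: y = (k + 1) x / (x + k)."""
    return (k + 1) * x / (x + k)

for x in [0, 1, 2, 5, 20, 100]:
    print(x, round(bm25_tf(x), 3))
# y rises quickly then flattens out towards k + 1 = 2.2
```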

**IDF term weighting**

`IDF(W) = log((M+1)/k)`

`M` is the total number of docs in collection

`k` is the total number of docs containing word W

We can use BM25 and IDF term weighting to improve on EOWC (the Expected Overlap of Words in Context approach above).

For two documents, `d1=(x_1, ...x_n)` and `d2=(y_1, ...y_n)`

`BM25(w_i, d1) = ((k+1)c(w_i, d1))/(c(w_i, d1) + k(1 - b + b*|d1|/av(dl)))`

`b in [0, 1], k in [0, +oo), av(dl) is average document length`

`x_i = (BM25(w_i, d1)) / (sum_(j=1)^n BM25(w_j, d1))`

`y_i` is defined in the same way using d2

Similarity(d1, d2) = `sum_(i=1)^n IDF(w_i) x_i y_i`
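A sketch combining the BM25 weights and IDF weighting into the similarity above; the parameter values, document frequencies and word counts below are illustrative assumptions:

```python
import math

def bm25(count, doc_len, avg_dl, k=1.2, b=0.75):
    """BM25(w, d) with document-length normalisation."""
    return (k + 1) * count / (count + k * (1 - b + b * doc_len / avg_dl))

def idf(docs_with_word, n_docs):
    """IDF(w) = log((M + 1) / k), k = number of docs containing w."""
    return math.log((n_docs + 1) / docs_with_word)

def bm25_vector(counts, avg_dl):
    """Normalised BM25 weights x_i, summing to 1 across the document."""
    doc_len = sum(counts.values())
    weights = {w: bm25(c, doc_len, avg_dl) for w, c in counts.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}

def similarity(d1_counts, d2_counts, doc_freq, n_docs, avg_dl):
    """Similarity(d1, d2) = sum_i IDF(w_i) * x_i * y_i over shared words."""
    x = bm25_vector(d1_counts, avg_dl)
    y = bm25_vector(d2_counts, avg_dl)
    return sum(idf(doc_freq[w], n_docs) * x[w] * y[w] for w in x if w in y)

# Illustrative usage: two tiny pseudo documents in a 20-document collection.
d1 = {"eats": 3, "is": 10, "has": 3}
d2 = {"eats": 1, "is": 8, "flies": 4}
doc_freq = {"eats": 5, "is": 18, "has": 7, "flies": 2}
print(similarity(d1, d2, doc_freq, n_docs=20, avg_dl=12))
```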

Define a binary random variable `X_w in {0,1}`

`X_w = 1` if w is present, `X_w = 0` if w is not present

then

`P(X_w =1) + P(X_w =0) = 1` since a word is either present or not

**Entropy** measures the randomness of X

`H(X_w) = sum_(v in {0,1})-p(X_w=v)log_2p(X_w=v)`

`= -p(X_w=0)log_2 p(X_w=0) - p(X_w=1)log_2 p(X_w=1)`

Define `0 log_2 0=0` since `log_2 0` is undefined
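A small sketch computing `H(X_w)` from `p(X_w = 1)`, with the `0 log_2 0 = 0` convention (the probabilities passed in are made up):

```python
import math

def entropy(p_present):
    """Entropy of the binary variable X_w, where p_present = p(X_w = 1)."""
    h = 0.0
    for p in (p_present, 1.0 - p_present):
        if p > 0.0:          # define 0 * log2(0) = 0
            h -= p * math.log2(p)
    return h

print(entropy(0.5))   # 1.0, maximum uncertainty
print(entropy(0.01))  # close to 0, the word is almost always absent
```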

**Conditional entropy** is the entropy that remains in `X_(w1)` once we know the value of `X_(w2)`

`H(X_(w1) | X_(w2)) = sum_(u in {0,1})[p(X_(w2) = u)H(X_(w1) | X_(w2) = u)]`

= `sum_(u in {0,1})[p(X_(w2) = u) sum_(v in {0,1})[-p(X_(w1) = v | X_(w2) = u) log_2 p(X_(w1) = v | X_(w2) = u)]]`

This has the following properties:

`H(X) >= H(X|Y)`

`H(X|Y) >= 0`

`H(X|X) = 0`

This is great, but it can’t be used to compare different pairs of words against each other: `H(X_(w1) | X_(w2))` is bounded by `H(X_(w1))`, so it is only comparable across pairs that share the same first word w1.

**Mutual information** `I(X;Y)` is the reduction in entropy of X obtained by knowing Y.

This measure CAN be used to compare different pairs of words against each other.

`I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)`

This has the following properties:

`I(X;Y) >= 0`

`I(X;Y) = I(Y;X)`

`I(X;Y) = 0 <=> X,Y independent`

`I(X_(w1);X_(w2)) = sum_(u in {0,1}) sum_(v in {0,1}) p(X_(w1) = u, X_(w2) = v) log_2 ((p(X_(w1) = u, X_(w2) = v)) / (p(X_(w1) = u)p(X_(w2) = v)))`

We use this to mine syntagmatic relations between pairs of words.
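A sketch of estimating this from a 2x2 table of segment counts (e.g. how many sentences contain both words, only one, or neither); the counts below are invented for illustration:

```python
import math

def mutual_information(n_11, n_10, n_01, n_00):
    """I(X_w1; X_w2) from a 2x2 co-occurrence table.

    n_11: segments containing both words, n_10: only w1,
    n_01: only w2, n_00: neither.
    """
    n = n_11 + n_10 + n_01 + n_00
    p = {(1, 1): n_11 / n, (1, 0): n_10 / n, (0, 1): n_01 / n, (0, 0): n_00 / n}
    p1 = {1: p[(1, 1)] + p[(1, 0)], 0: p[(0, 1)] + p[(0, 0)]}  # marginal of X_w1
    p2 = {1: p[(1, 1)] + p[(0, 1)], 0: p[(1, 0)] + p[(0, 0)]}  # marginal of X_w2
    mi = 0.0
    for u in (0, 1):
        for v in (0, 1):
            if p[(u, v)] > 0.0:
                mi += p[(u, v)] * math.log2(p[(u, v)] / (p1[u] * p2[v]))
    return mi

# Made-up counts: "fly" and "helicopter" co-occur more often than chance.
print(mutual_information(n_11=30, n_10=70, n_01=20, n_00=880))
```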

In a *Probabilistic Topical Model* we treat a topic as a word distribution. We use θ to denote a topic.

We define a vocabulary set V which contains all words:

`V={w1, w2, ...}`

As this is a probability distribution, the sum of all the probabilities must be 1.

`sum_(w in V) p(w|theta_i) = 1`

We denote the probability of document `d_i` covering topic `theta_j` as `π_(ij)`

The coverage of topics for each document `d_i` also sums to 1, i.e:

`sum_(j=1)^k π_(ij) = 1`
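A toy illustration of these two constraints (all numbers are made up):

```python
# Two illustrative topics over a tiny vocabulary, plus topic coverage for two docs.
topics = {
    "theta_1": {"sport": 0.4, "game": 0.3, "player": 0.2, "science": 0.1},
    "theta_2": {"science": 0.5, "lab": 0.3, "game": 0.1, "player": 0.1},
}
coverage = {"d_1": {"theta_1": 0.7, "theta_2": 0.3},
            "d_2": {"theta_1": 0.2, "theta_2": 0.8}}

# Each topic is a word distribution summing to 1; each document's coverage sums to 1.
assert all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in topics.values())
assert all(abs(sum(cov.values()) - 1.0) < 1e-9 for cov in coverage.values())
```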

**Input:** C - collection of documents, k - number of topics, V - vocabulary set

**Output:** Set of topics {θ1, … θk}; coverage of topics for each document d_i: {π_i1, …, π_ik}

…

| | system - Y | system - N |
|---|---|---|
| human - Y | TP (true positive) | FN (false negative) |
| human - N | FP (false positive) | TN (true negative) |

`Precision = (TP) / (TP + FP)`

When the system says yes, how many of those are correct?

`Recall = (TP) / (TP + FN)`

Does the document have all the categories it should have?

`F_beta = ((beta^2 + 1) (P * R)) / (beta^2 P + R)`

`F_1 = 2PR / (P + R)`

Harmonic mean of precision and recall. Better than using the arithmetic mean, since a low value of either precision or recall drags the score down.

beta = 1 -> precision and recall are given equal weight
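A minimal sketch of these measures (the confusion counts are hypothetical):

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    """F_beta = (beta^2 + 1) P R / (beta^2 P + R); beta = 1 gives the harmonic mean."""
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

# Hypothetical confusion counts.
tp, fp, fn = 40, 10, 20
p, r = precision(tp, fp), recall(tp, fn)
print(p, r, f_beta(p, r))  # 0.8, ~0.667, ~0.727
```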

We can aggregate the P, R and F1 values in different ways

Can consider aggregation either with:

- per-category - compute averages for all documents in a category
- per-document - compute averages for all categories in a document

Beneficial to compare both.

Different aggregation methods:

- using the **arithmetic mean** emphasises **high values**
- using the **geometric mean** emphasises **low values**

In micro averaging you pool all results into one table then compute P, R, F1 in one step using the totals.

This method gives every individual decision the same importance, so large categories dominate the result.

Macro averaging tends to be better.
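A sketch contrasting the two; the per-category confusion counts are invented, and the point is how the large category dominates the micro-averaged score:

```python
def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Made-up per-category (TP, FP, FN) counts: one large, one small category.
categories = {"sports": (90, 10, 10), "chess": (1, 4, 5)}

# Macro: average the per-category F1 scores (each category counts equally).
macro_f1 = sum(f1(*c) for c in categories.values()) / len(categories)

# Micro: pool all counts into one table, then compute F1 once.
tp = sum(c[0] for c in categories.values())
fp = sum(c[1] for c in categories.values())
fn = sum(c[2] for c in categories.values())
micro_f1 = f1(tp, fp, fn)

print(macro_f1, micro_f1)  # the large category dominates the micro score
```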

An opinion is a subjective statement about what a person thinks about something

- Opinion holder
- Opinion target
- Opinion content
- Opinion context
- Opinion sentiment

- Author’s opinion
- Reported opinion
- Indirect/inferred opinion

Unigram language models aren’t great for sentiment analysis, e.g. “it’s not good” vs “it’s not as good as”. It is better to use n-gram models, but watch out for overfitting.
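A sketch of extracting unigram and bigram features, assuming scikit-learn is available (`CountVectorizer` with `ngram_range=(1, 2)`); the example phrases are the ones from the note above:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["it's not good", "it's not as good as"]

# ngram_range=(1, 2) keeps unigrams and bigrams, so "not good" and
# "as good" become distinct features a classifier can weight differently.
vectoriser = CountVectorizer(ngram_range=(1, 2))
X = vectoriser.fit_transform(docs)
print(vectoriser.get_feature_names_out())
```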

- **Word classes** - syntactic, e.g. POS tags; semantic, e.g. thesaurus / recognised entities; empirical word clusters, e.g. clusters of syntagmatically related words
- **Frequent patterns in text**
- **Parse tree based** - features derived from a sentence’s syntactic parse tree

© Will Robertson