Notes from the Coursera Text Mining and Analytics course.
Syntagmatic Relations
**Categorisation**
There are different levels of text representation which are useful for different purposes. We often use multiple representations at once.
Paradigmatic - words belong to the same class. e.g. “banana” and “apple”
We can mine paradigmatic relations by finding words with high context similarity.
Syntagmatic - words that combine with each other and tend to co-occur - e.g. “fly” and “helicopter”
We can mine syntagmatic relations by finding words with high co-occurrences but relatively low individual occurrences.
Word context can be considered as a pseudo document - a “bag of words”.
Split a pseudo document into words and count them:
`d1=(x_1, ...x_n), x_i=(c(w_i, d1))/|d1|`
`d2=(y_1, ...y_n), y_i=(c(w_i, d2))/|d2|`
where:
`c(w_i, d1)` is the count of word `w_i` in d1
`|d1|` is the total count of words in d1
`x_i` is the probability that a randomly picked word from d1 is `w_i`
Compute document similarity as the dot product of the normalised word count vectors (the Expected Overlap of Words in Context, EOWC):
Similarity(d1, d2) = `d1 . d2 = sum_(i=1)^n x_i y_i`
This works but has faults: it favours matching one frequent word over matching several distinct words, and it treats every word as equally important.
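The normalised count vectors and their dot product can be sketched in a few lines of Python. Whitespace tokenisation is an assumption for illustration; a real system would use a proper tokeniser.

```python
from collections import Counter

def word_count_vector(doc):
    """Normalised word-count vector: x_i = c(w_i, d) / |d|."""
    words = doc.split()  # assumption: whitespace tokenisation
    total = len(words)
    return {w: c / total for w, c in Counter(words).items()}

def similarity(d1, d2):
    """EOWC: dot product of the two normalised count vectors."""
    v1, v2 = word_count_vector(d1), word_count_vector(d2)
    return sum(p * v2.get(w, 0.0) for w, p in v1.items())
```

For example, `similarity("the cat sat", "the cat ran")` is `1/9 + 1/9 = 2/9`, since only “the” and “cat” are shared.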
BM25 Transformation - a term frequency transformation
`y=TF(w,d), x=c(w,d)`
`y=((k+1)x) / (x+k)`
`k in [0, +oo)`
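The transformation is sub-linear in the raw count and bounded above by `k+1`, so one very frequent word cannot dominate. A minimal sketch, using `k = 1.2` as a commonly used default (an assumption, not a value from the notes):

```python
def bm25_tf(x, k=1.2):
    """BM25 term-frequency transformation: y = ((k+1) x) / (x + k).
    Sub-linear in x and bounded above by k + 1."""
    return ((k + 1) * x) / (x + k)
```

Note that `bm25_tf(0) = 0` and the value approaches `k + 1` as the count grows.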
IDF term weighting
`IDF(W) = log((M+1)/k)`
`M` is the total number of docs in collection
`k` is the total number of docs containing word W
We can use the BM25 transformation and IDF term weighting to improve on EOWC (the Expected Overlap of Words in Context similarity).
For two documents, `d1=(x_1, ...x_n)` and `d2=(y_1, ...y_n)`
`BM25(w_i, d1) = ((k+1)c(w_i, d1))/(c(w_i, d1) + k(1 - b + b |d1|/avdl))`
`b in [0, 1], k in [0, +oo)`, `avdl` is the average document length
`x_i = (BM25(w_i, d1)) / (sum_(j=1)^n BM25(w_j, d1))`
`y_i = ... as for x_i`
Similarity(d1, d2) = `sum_(i=1)^n IDF(w_i) x_i y_i`
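A sketch of the improved similarity, assuming the common defaults `k = 1.2` and `b = 0.75` and an IDF table supplied by the caller (both assumptions; whitespace tokenisation again for illustration):

```python
from collections import Counter

def bm25_weights(doc_counts, doc_len, avdl, k=1.2, b=0.75):
    """BM25(w, d) = ((k+1) c(w,d)) / (c(w,d) + k(1 - b + b|d|/avdl))."""
    denom_norm = k * (1 - b + b * doc_len / avdl)
    return {w: ((k + 1) * c) / (c + denom_norm) for w, c in doc_counts.items()}

def bm25_similarity(doc1, doc2, idf):
    """Similarity(d1, d2) = sum_i IDF(w_i) x_i y_i, with x and y the
    normalised BM25 weights of the two documents."""
    docs = [doc1.split(), doc2.split()]
    avdl = sum(len(d) for d in docs) / 2  # assumption: avdl over this pair
    vecs = []
    for words in docs:
        bm = bm25_weights(Counter(words), len(words), avdl)
        total = sum(bm.values())
        vecs.append({w: v / total for w, v in bm.items()})
    x, y = vecs
    return sum(idf.get(w, 0.0) * p * y.get(w, 0.0) for w, p in x.items())
```

In a real collection `avdl` and the IDF table would be computed over all documents, not just the pair being compared.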
Define a binary random variable `X_w in {0,1}`
`X_w = {0 if w not present; 1 if w present}`
then
`P(X_w =1) + P(X_w =0) = 1` since a word is either present or not
Entropy measures the randomness of X
`H(X_w) = sum_(v in {0,1})-p(X_w=v)log_2p(X_w=v)`
`= -p(X_w=0)log_2 p(X_w=0) - p(X_w=1)log_2 p(X_w=1)`
Define `0 log_2 0=0` since `log_2 0` is undefined
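The binary entropy formula, including the `0 log_2 0 = 0` convention, in a short sketch:

```python
import math

def entropy(p):
    """H(X_w) for a binary variable with P(X_w = 1) = p,
    using the convention 0 * log2(0) = 0."""
    h = 0.0
    for q in (p, 1 - p):
        if q > 0:  # skip the 0 * log2(0) term, defined as 0
            h -= q * math.log2(q)
    return h
```

Entropy is maximal (`H = 1` bit) at `p = 0.5` and zero when the outcome is certain.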
Conditional entropy is the entropy of `X_(w1)` given that we know the value of `X_(w2)` (i.e. whether w2 is present):
`H(X_(w1) | X_(w2)) = sum_(u in {0,1})[p(X_(w2) = u)H(X_(w1) | X_(w2) = u)]`
= `sum_(u in {0,1})[p(X_(w2) = u) sum_(v in {0,1})[-p(X_(w1) = v | X_(w2) = u) log_2 p(X_(w1) = v | X_(w2) = u)]]`
This has the following properties:
H(X) >= H(X|Y)
H(X|Y) >= 0
H(X|X) = 0
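A sketch of conditional entropy computed from a joint distribution; representing the joint as a dict keyed by `(v, u) = (X_(w1), X_(w2))` is an assumption for illustration:

```python
import math

def cond_entropy(joint):
    """H(X_(w1) | X_(w2)) from a joint table joint[(v, u)] = P(X_(w1)=v, X_(w2)=u)."""
    h = 0.0
    for u in (0, 1):
        p_u = joint.get((0, u), 0.0) + joint.get((1, u), 0.0)  # P(X_(w2) = u)
        if p_u == 0:
            continue
        for v in (0, 1):
            p_vu = joint.get((v, u), 0.0) / p_u  # P(X_(w1)=v | X_(w2)=u)
            if p_vu > 0:
                h -= p_u * p_vu * math.log2(p_vu)
    return h
```

Two fair, independent variables give `H(X|Y) = 1` bit, while perfectly correlated variables give `H(X|Y) = 0`, matching the properties above.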
This is great for ranking candidate words w2 against a fixed w1, but it can’t be used to compare different pairs of words against each other, because the upper bound `H(X_(w1))` differs from word to word.
Mutual information is the reduction in entropy of X obtained by knowing Y.
This measure CAN be used to compare different pairs of words against each other.
`I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)`
This has the following properties:
I(X;Y) >= 0
I(X;Y) = I(Y;X)
I(X;Y) = 0 <=> X,Y independent
`I(X_(w1); X_(w2)) = sum_(u in {0,1}) sum_(v in {0,1}) p(X_(w1) = u, X_(w2) = v) log_2 ((p(X_(w1) = u, X_(w2) = v)) / (p(X_(w1) = u) p(X_(w2) = v)))`
We use this to mine syntagmatic relations between pairs of words.
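A sketch of the mutual information formula, using the same joint-table representation as above (an assumption for illustration):

```python
import math

def mutual_information(joint):
    """I(X_(w1); X_(w2)) = sum_{v,u} p(v,u) log2( p(v,u) / (p(v) p(u)) )."""
    # Marginals from the joint table joint[(v, u)] = P(X_(w1)=v, X_(w2)=u)
    p1 = {v: joint.get((v, 0), 0.0) + joint.get((v, 1), 0.0) for v in (0, 1)}
    p2 = {u: joint.get((0, u), 0.0) + joint.get((1, u), 0.0) for u in (0, 1)}
    mi = 0.0
    for v in (0, 1):
        for u in (0, 1):
            p_vu = joint.get((v, u), 0.0)
            if p_vu > 0:
                mi += p_vu * math.log2(p_vu / (p1[v] * p2[u]))
    return mi
```

Independent variables give `I = 0`; two perfectly correlated fair coins give `I = 1` bit, matching the properties above.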
In a Probabilistic Topical Model we treat a topic as a word distribution. We use θ to denote a topic.
We define a vocabulary set V which contains all words:
`V={w1, w2, ...}`
As this is a probability distribution, the sum of all the probabilities must be 1.
`sum_(w in V) p(w|theta_i) = 1`
We denote the probability of document `d_i` covering topic `theta_j` as `π_(ij)`
The coverage of topics for each document `d_i` also sums to 1. i.e:
`sum_(j=1)^k π_(ij) = 1`
Input: C - collection of documents, k - number of topics, V - vocabulary set
Output: Set of topics {θ_1, … θ_k}; coverage of topics for each document d_i: {π_(i1), … π_(ik)}
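A hypothetical toy example of what such an output looks like (all words, topics and numbers are made up, purely to illustrate the two sum-to-1 constraints):

```python
# k = 2 topics over the vocabulary V = {"loan", "bank", "river"}
topics = [
    {"loan": 0.5, "bank": 0.4, "river": 0.1},  # theta_1
    {"loan": 0.1, "bank": 0.4, "river": 0.5},  # theta_2
]
pi = [
    [0.7, 0.3],  # document d_1: 70% theta_1, 30% theta_2
    [0.2, 0.8],  # document d_2
]

# Both constraints from the notes hold:
assert all(abs(sum(t.values()) - 1.0) < 1e-9 for t in topics)   # sum_w p(w|theta_i) = 1
assert all(abs(sum(row) - 1.0) < 1e-9 for row in pi)            # sum_j pi_ij = 1
```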
…
| | system - Y | system - N |
|---|---|---|
| human - Y | TP (true positive) | FN (false negative) |
| human - N | FP (false positive) | TN (true negative) |
`Precision = (TP) / (TP + FP)`
When the system says yes - how many are correct?
`Recall = (TP) / (TP + FN)`
Of the categories the document should have, how many did the system find?
`F_beta = ((beta^2 + 1) (P * R)) / (beta^2 P + R)`
`F_1 = 2PR / (P + R)`
Harmonic mean of precision and recall. Better than using arithmetic mean.
beta = 1 -> precision and recall are given equal weight; larger beta weights recall more heavily
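The formulas above, computed directly from the contingency counts:

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """P, R and F_beta from the contingency-table counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = ((beta ** 2 + 1) * p * r) / (beta ** 2 * p + r)
    return p, r, f
```

For example, `tp=8, fp=2, fn=2` gives `P = R = F_1 = 0.8`.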
We can aggregate the P, R and F1 values in different ways, either with micro averaging or with macro averaging. It is beneficial to compare both.
In micro averaging you pool all decisions into one contingency table, then compute P, R and F1 in one step using the totals.
This method gives every individual decision the same importance, so frequent categories dominate the result.
In macro averaging you compute P, R and F1 per category (or per document) and then average them, giving each category equal weight. Macro averaging tends to be better.
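A sketch of the two aggregation methods over per-category contingency counts `(tp, fp, fn)`; the counts used in the example are made up to show how a large category dominates the micro average:

```python
def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_f1(tables):
    """Pool all counts into one table, then compute F1 once."""
    tp = sum(t[0] for t in tables)
    fp = sum(t[1] for t in tables)
    fn = sum(t[2] for t in tables)
    return f1(tp, fp, fn)

def macro_f1(tables):
    """Compute F1 per category, then average with equal weight."""
    return sum(f1(*t) for t in tables) / len(tables)
```

With one large, easy category `(90, 10, 10)` and one small, hard one `(1, 4, 4)`, micro F1 is about 0.87 while macro F1 is 0.55: the macro average exposes the poor small-category performance that the micro average hides.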
An opinion is a subjective statement about what a person thinks about something
Unigram language models aren’t great for sentiment analysis, e.g. “it’s not good” vs “it’s not as good as”. So it is better to use n-gram models, but watch out for overfitting.