COM3250 - Machine Learning
Leon Derczynski
000174828
aca00lad / u0ld
13 Dec 2005

Task 1.

The task is to extract features from a corpus of 2000 documents. These documents are movie reviews in English; 1000 of them are positive reviews, and the other 1000 are negative reviews. The extracted features are to be presented in arff format for use by Weka.

The method used was to select a number of n-grams present in the corpus that are good determinants of the nature of the review (i.e. whether it is positive or negative). To do this, a list of the n-grams occurring in the entire corpus was generated, with n in the range 1 to 6. Stop words (words common in the English language that carry little or no meaning or value in information extraction and retrieval, such as "in" and "the") were removed prior to n-gram analysis. It would also have been possible to use a stemmer to group words with common roots, although the complexity of implementing a stemming algorithm left this option outside the scope of the task. In addition, any punctuation and non-alphabetic characters were stripped from all documents. This was done before stopword removal to prevent them being misinterpreted by our simplistic algorithm; for example, without punctuation removal, "I'd" would be converted into "'d" after stopword removal. The text was already lowercased, so there were no issues grouping n-grams of differing case.

This n-gram generation left a significant number of n-grams:

    n   number of n-grams
    1    42409
    2   390456
    3   461636
    4   465448

Representing documents using a list of over a million features is not an easy task, and such a list is likely to contain a great deal of extraneous information. To reduce the number of features required, we first removed any n-gram that occurred in the corpus only once. This is too low a frequency to be of any significance, and too small a sample to base a decision upon. To reduce the feature set further, the Chi square value was computed for each of the remaining n-grams and the list of values sorted. The top k n-grams by Chi square value were then selected for use in representing the documents; both the extraction and the selection steps are sketched below.

Resource consumption was still significant whilst calculating Chi square values, with up to 600MB of memory required. There is room for improvement in the algorithm; for example, it currently collects n-grams for all values of n whilst iterating through the documents, so removal of unique n-grams can only be done once all documents have been processed. This could be circumvented by iterating through all documents with n=1 and removing uniques, then incrementing n and repeating as required, although processing time would likely increase. Out of 1359949 n-grams found, 1294420 were unique, leaving only 65529 n-grams that occurred two or more times in the entire corpus.

The following figures were obtained with k=1000, which yielded a minimum Chi square of 6.0180541624875 among the selected grams:

    n   frequency > 1   % of n-grams with freq > 1   mean Chi square    grams selected   mean Chi square of selected grams
    1   22819           53.80                        1.6137833582813    824              10.73968069
    2   33163            8.49                        1.2977219219009    172               7.98082176
    3    6371            1.38                        1.6105925194914      4               7.669310996
    4    3176            0.68                        1.8632668248885      0               n/a

As n grows, more unique (and therefore unusable) grams are found; with n=4, only a very small handful of the n-grams found occur more than once in the corpus. With n=3 and k=1000, only 4 grams are selected; none are selected at n=4.
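
For concreteness, the preprocessing and n-gram collection described above amount to something like the following minimal Python sketch. The stopword list, the (text, label) input format and the exact cleaning rule are illustrative assumptions rather than the code actually submitted.

    import re
    from collections import defaultdict

    # Illustrative stopword list only; the list actually used was much longer.
    STOPWORDS = set("the a an in of to and or is it i he she they was were".split())

    def preprocess(text):
        # Strip punctuation and non-alphabetic characters first, so that e.g.
        # "i'd" does not survive stopword removal as "'d"; the corpus is
        # already lowercased, so no case folding is needed.
        text = re.sub(r"[^a-z\s]", " ", text)
        return [w for w in text.split() if w not in STOPWORDS]

    def ngrams(tokens, n):
        # A document of w words yields w - (n-1) n-grams.
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def count_ngrams(documents, max_n=4):
        # `documents` is a list of (text, label) pairs, label in {"pos", "neg"}.
        # Count every occurrence of every n-gram, per polarity.
        counts = defaultdict(lambda: {"pos": 0, "neg": 0})
        for text, label in documents:
            tokens = preprocess(text)
            for n in range(1, max_n + 1):
                for gram in ngrams(tokens, n):
                    counts[gram][label] += 1
        # Remove n-grams occurring only once in the whole corpus.
        return {g: c for g, c in counts.items() if c["pos"] + c["neg"] > 1}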
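
The Chi square score for each surviving n-gram can then be computed from a 2x2 contingency table of document counts (review polarity against presence or absence of the gram). The sketch below uses this standard document-level formulation; it is an assumption that this matches the exact aggregation used for the coursework, and the helper names are hypothetical.

    def chi_square(pos_with, neg_with, n_pos=1000, n_neg=1000):
        # 2x2 contingency table of document counts for one n-gram:
        #   a = positive reviews containing it,  b = negative reviews containing it,
        #   c = positive reviews without it,     d = negative reviews without it.
        a, b = float(pos_with), float(neg_with)
        c, d = n_pos - a, n_neg - b
        n = a + b + c + d
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

    def select_top_k(doc_counts, k=1000):
        # `doc_counts` maps each n-gram to (positive documents containing it,
        # negative documents containing it); rank by Chi square and keep the top k.
        scored = sorted(((chi_square(p, q), gram) for gram, (p, q) in doc_counts.items()),
                        reverse=True)
        return [gram for _, gram in scored[:k]]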

Greater resolution in the figures above could be obtained by increasing k significantly, although given the current trend and our maximum possible k of 65529, the usable value of n is unlikely to get much higher. At any rate, k=1000 should be enough to generate a good representation of the corpus, and there are no 4-grams at this value (and only a very small number of trigrams), so further investigation is outside the scope of the task.

The average Chi square value of the selected n-grams falls as n increases, which corresponds to the smaller number of selected n-grams. It is interesting to note that the mean Chi square value over all n-grams does not follow this trend, possibly because the smaller number of non-unique n-grams provides less ambiguous examples: where a small sample gives, say, 4 true positives and only one false positive, it can be granted a higher Chi square value. Another reason for the lack of selected grams as n increases could be that the small sample sizes leave a longer low-frequency tail than with lower n. At n=3 or 4, there are likely to be a large number of cases where the n-gram frequency over the corpus is only 2 or 3, and if not all of these occurrences are of the same polarity, the statistical usefulness of that feature will be low.

At n=5 and k=2000, there are four 4-grams and no 5-grams. The average Chi square value for n=5 is about 1.9522, following the trend of rising as n increases past n=2. 2558 of the 464483 available 5-grams occurred more than once, which follows the trend observed above, where the proportion of n-grams with frequency greater than one decreases as n rises.

Finally, now that there is a list of features with which to represent documents, the corpus is re-examined and an arff file built, with a header describing the n-grams and then a row for each document, containing a document ID and k feature values. Instead of sticking with a simple measure such as a Boolean "present or not present" for each n-gram, some indication of how heavily the document uses the n-gram can be recorded, to give more information to the learning algorithms. This should help at least IB1, as the extra information should allow more evident groupings by proximity in the n-dimensional feature space than simple values of 1 or 0 would. Using the plain term frequency is not sensible, as it is easily skewed by longer documents. This can be countered by recording the keyword density instead, or the tf.idf product - that is, the frequency of the term (here, the n-gram) in the document multiplied by its inverse document frequency (see Salton 91). Although tf.idf was originally implemented, it attempts to capture how useful a term is over the corpus as a whole, something already done by the Chi square feature selection test; using tf.idf might therefore obscure effects resulting from the choice of feature selection method, so the plain keyword density was used in the end. Keyword density was calculated on the assumption that a document of w words contains w - (n-1) n-grams (ignoring punctuation, paragraph breaks and so forth); this step is sketched below.

The majority of the n-grams with the highest Chi square values seem to be indicative of negative reviews; the top 10 are "bad, worst, boring, stupid, mess, ridiculous, wasted, plot, life, lame". This may reflect the distribution of negative words across the English vocabulary (and the frequency with which each word is used), or perhaps the general style of the authors.
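
A sketch of the arff generation step described above, using the keyword density measure (occurrences of the gram divided by the w - (n-1) n-grams available in a w-word document), follows. The attribute naming, the inclusion of a class attribute and the helper names are illustrative assumptions about the file layout rather than the exact format submitted.

    def keyword_density(tokens, gram):
        # Occurrences of `gram` divided by the number of n-grams of that length
        # the document can hold: a w-word document holds w - (n-1) n-grams.
        n = len(gram.split())
        available = len(tokens) - (n - 1)
        if available <= 0:
            return 0.0
        count = sum(1 for i in range(available)
                    if " ".join(tokens[i:i + n]) == gram)
        return count / float(available)

    def write_arff(path, documents, features):
        # `documents` is a list of (doc_id, tokens, label) triples and
        # `features` the list of k selected n-grams, in Chi square order.
        with open(path, "w") as out:
            out.write("@relation movie_reviews\n\n")
            out.write("@attribute doc_id string\n")
            for gram in features:
                out.write("@attribute \"%s\" numeric\n" % gram)
            out.write("@attribute class {pos,neg}\n\n@data\n")
            for doc_id, tokens, label in documents:
                row = ",".join("%.6f" % keyword_density(tokens, g) for g in features)
                out.write("%s,%s,%s\n" % (doc_id, row, label))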

As a curio regarding the selected features, some actor and film names are stronger indicators of a negative review than very negative words; for example, "battlefield earth" and "steven seagal" both have higher Chi square values and appear in more negative reviews than "abysmal".

Task 2.

The learning algorithms chosen for comparison were IB1, Naive Bayes and J48 (an implementation of the C4.5 decision tree algorithm). The method used was to perform 3-fold cross validation over the entire 2000-instance training set with each algorithm, using the Weka experimenter.

Looking at the proportion of correctly identified examples, Naive Bayes performed best, with an average accuracy of 84.85%. J48 did not fare quite so well, despite the large volume of training data, attaining an average accuracy of 69.73%; IB1 (a nearest-neighbour learner) came off worst, with an accuracy of only 62.10%. All algorithms managed to beat the baseline performance of 50% (there being an equal number of positive and negative reviews).

The performance of IB1 is disappointing given the care taken to filter out irrelevant features and the use of keyword density to provide richer training data, although it has still beaten the baseline. The trees constructed by J48 were all complex, which is unsurprising given the large number of available features. They seem to be effective at classifying unseen examples, though not as effective as Naive Bayes. The Naive Bayes algorithm may be best equipped to handle the large amount of data presented, and can make statistically based decisions on it. This statistical approach may be so effective because only the features ranked most statistically significant by Chi square have been used.

Bibliography:

Salton, G. (1991). "Developments in Automatic Text Retrieval". Science, 253, pp. 974-980.