The University of Sheffield
Tuning Brown Clustering

 

Intro

Brown clustering takes a corpus and outputs clusters of word types.

It's a hard clustering, so each type only appears in one cluster. You choose beforehand how many clusters you want — this is c.

As well as grouping words into clusters, Brown also generates a hierarchy above the clusters, in the form of an unbalanced binary tree.

This makes it possible to represent each cluster using a bit string, where each bit indicates the choice of branch, starting at the root and going down. We'll see some examples in the next section.
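
As a quick illustration, here's a minimal Python sketch (the path "0010" is just a made-up example) of how to read one of these bit strings as a walk down the tree:

    def describe_path(path):
        """Read a Brown cluster bit string as branch choices from the root:
        0 = take the left branch, 1 = take the right branch."""
        return ["left" if bit == "0" else "right" for bit in path]

    # describe_path("0010") -> ['left', 'left', 'right', 'left']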

Brown clustering performs representation learning through unsupervised hierarchical clustering. It's been used in thousands of papers, and is becoming more and more popular.

Here, we present our research into how Brown clustering behaves and how to tune it, from our paper "Tune Your Brown Clustering, Please" — details are below in the outro. If you want more details about the internals of Brown clustering, here are links to the original paper and to some slides describing the algorithm.

Examples

Longer common prefixes between two paths mean greater similarity. Here are some Brown clusters and their paths.

Path       Terms
001010110  never neva nvr gladly nevr #never neverr nver neverrr nevaa nevah nva neverrrr letchu letcha ne'er -never neveer glady #inever bever nevaaa neever nerver enver neeever nevet neeeever nevva
001010111  ever eva evar evr everrr everr everrrr evah everrrrr everrrrrr evaa evaaa everrrrrrr nevar eveer evaaaa eveeer everrrrrrrr everrrrrrrrr evea eveeeer evaaaaa evur
01000010   does duz doess -does sayeth doez doesss d0es deos
 
Path      Terms
0100      Monday
010100    Sunday
010101    Friday
0101100   Thursday
01011010  Saturday
(from: Chris Dyer's cluster viewer; brownclusterings.tumblr.com)
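
To make the prefix idea concrete, here's a small Python sketch. It assumes a tab-separated paths file with lines of the form bit string, word, count (as produced by Percy Liang's widely used implementation); the file name and the example lookups are just placeholders based on the clusters shown above.

    def load_paths(filename):
        """Read a Brown paths file with lines of the form: <bit string>\t<word>\t<count>."""
        word_to_path = {}
        with open(filename, encoding="utf-8") as f:
            for line in f:
                path, word, _count = line.rstrip("\n").split("\t")
                word_to_path[word] = path
        return word_to_path

    def common_prefix_len(a, b):
        """Length of the shared prefix of two bit-string paths: longer means more similar."""
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        return n

    # e.g., with the clusters shown above (file name is a placeholder):
    # paths = load_paths("paths")
    # common_prefix_len(paths["never"], paths["ever"])  # -> 8, shared prefix "00101011"
    # common_prefix_len(paths["never"], paths["does"])  # -> 1, barely related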

The more clusters you choose to have, the more hierarchy there is above them. The theoretical maximum number of clusters, c, is the size of the vocabulary, V.

This means that Brown clustering gives us two kinds of information: clusterings, where similar words are placed together, and the hierarchy over these clusters.

Problems

If we don't have enough clusters, we end up putting words into the wrong clusters, and the quality of the clustering suffers. For example, if we have to take the toy illustration above (animals, currencies, and the word "overflow") down from four clusters to three, where is "overflow" going to end up?

Maybe it would even be better to merge the currencies with the animals, and give "overflow" its own singleton cluster. And even though this is a toy example, you can really see this kind of decrease in cluster quality in large corpora, too. Let's look at the data. In this example below, does "apt-get" fit? Is this a coherent cluster? Not really!

Path          Terms
001011111001  ii id ion iv ll iii ud wd uma ul idnt provoking hed 1+1 ididnt hast ine 2+2 idw #thingsblackpeopledo iiii #onlywhitepeople dost doan uon apt-get 2010/06 donn 2010/08 sont cyaa 2010/04 2010/10 neer letts 2010/12 #onlyuglypeople i-i-i #urgirlfriendever cani muz _i_ #confusingthingsgirlsdo ild ifi dontchu sge #onlyfatpeople #thingsiaintdoneyet 19t

That's cool, you might think: we can just generate more clusters, so there's enough room to make them coherent. But generating high numbers of clusters is slow, and it gets progressively slower the more clusters c you ask for (the standard implementation scales roughly with |V|·c²). So how can we pick a good value for c? And what role does our corpus, T, play? Let's see how sensitive Brown clustering is to these factors by evaluating it, using its output as features for recognising named entities (locations, organisations and people) in tweets, which is a difficult task.

We describe the size of the input with |T|; for example, the dark red trace in the graph is from a 64-million-token corpus. From this, we can see that bigger corpora are better for almost any given c, and that big values of c are better than small ones. The default value of 1000 clusters is not the best choice for any of the corpus sizes.
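
To be concrete about what "using its output as features" means: a common recipe in NER systems is to add prefixes of each token's cluster bit string as extra features for the tagger. The sketch below is illustrative rather than our exact setup; the prefix lengths are a typical choice, and word_to_path is a dictionary like the one loaded in the earlier sketch.

    # Typical prefix lengths for Brown path features; not the exact set from our experiments.
    PREFIX_LENGTHS = (2, 4, 6, 10, 20)

    def brown_features(token, word_to_path):
        """Bit-string prefix features for one token, to feed into an NER tagger
        (token normalisation, e.g. lowercasing, is left out of this sketch)."""
        path = word_to_path.get(token)
        if path is None:
            return ["brown=OOV"]                      # out-of-vocabulary token
        return ["brown_%d=%s" % (n, path[:n])
                for n in PREFIX_LENGTHS if len(path) >= n]

    # e.g. brown_features("Monday", word_to_path) -> ['brown_2=01', 'brown_4=0100']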

Analysis

Low cluster counts and small input corpora both lead to poor quality Brown clusterings. But this isn't very useful general advice. We need to look deeper. We'll change to recognising named entities in news text, because we have bigger and more reliable data in this area, which improves the stability and resolution of the performance scores.

Just as we did above conceptually, let's separate the contribution of Brown clustering into path information (paths in the tree) and class information (the clusters). Details of how we did this are in our paper (below).
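
As an illustration of the split (the paper gives the exact experimental setup): path information corresponds to proper prefixes of the bit string, while class information keeps only the full bit string, i.e. which cluster a word is in, with none of the hierarchy above it.

    def path_information(path, prefix_lengths=(2, 4, 6)):
        """Hierarchy only: proper prefixes of the cluster's bit string."""
        return [path[:n] for n in prefix_lengths if n < len(path)]

    def class_information(path):
        """Clustering only: the full bit string, i.e. which cluster the word is in."""
        return [path]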

Paths

First, the paths.

[Interactive 3D graph: NER performance using path information, plotted against cluster count c and corpus size |T|.]

You can see that both cluster count increases and corpus size increases help performance. Note that the value of 1000 clusters, the default used by hundreds of papers each year, doesn't yield optimal results in a single case! Also look how steep the increase in performance is as c increases when the corpus is large, compared to when it's small.


Clusters

Next, the clustered classes of words.

[Interactive 3D graph: NER performance using class (cluster) information, plotted against cluster count c and corpus size |T|.]

From this, we see that too few clusters and too small a corpus both lead to poor performance. A few other observations can be made. Too many clusters also decrease performance: at that point we aren't grouping enough words together for the classes to make a useful impact. Also look at the line for extreme values of c, where there is one class per word, i.e. where we add no clustering information at all. These high values aren't as good for many input sizes, and the problem seems to get worse the more input we have: for many fixed corpus sizes, lower values of c do better.

The sweet spot looks to be around 10^3.4 for corpora over about a million tokens (i.e. roughly c = 2500), and this trend continues up to the upper bound of our newswire dataset.

One interesting thing to see is that big corpora suffer more with low numbers of clusters than small corpora do. At any one time, Brown clustering only considers a limited number of word types for merging, and that number is tied to c. When we have a bigger corpus with more word types, this active set represents a smaller proportion of the vocabulary: with c = 1000, for instance, it covers about 1% of a 100,000-type vocabulary but only about 0.1% of a 1,000,000-type one. We can see from the results that this proportionally smaller decision space damages the quality of the results.

Advice

  • Avoid default values of c
  • Getting a big corpus is more helpful than generating a high number of clusters, though watch out: very small numbers of clusters can be bad with larger corpora
  • If you care more about path information (maybe you're dealing with tweets or NER), make c as big as you can; if you care more about how words are clustered together (maybe you're doing text normalisation), avoid making c too big
  • Try random search for c, weighted away from very low and high values of c (a minimal sketch follows this list)
  • To save time, start your parameter search using some of our pre-generated clusterings (download link below)
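
Here's a minimal sketch of the weighted random search we have in mind. The candidate grid, the weights, and the helpers train_brown and eval_clustering are placeholders; the point is just to sample c on a log scale while down-weighting the extremes.

    import random

    # Candidate values of c on a log scale, from about 160 (10^2.2) up to 10000 (10^4).
    exponents = [2.2, 2.4, 2.6, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8, 4.0]
    candidates = [int(round(10 ** x)) for x in exponents]

    # Down-weight very low and very high values of c.
    weights = [1, 2, 3, 4, 5, 5, 5, 4, 2, 1]

    def sample_c(k=3):
        """Draw k values of c to try, favouring the middle of the range."""
        return random.choices(candidates, weights=weights, k=k)

    # for c in sample_c():
    #     clustering = train_brown(corpus, c)       # hypothetical helper
    #     score = eval_clustering(clustering)       # hypothetical helper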

For full details, see our paper, Tune Your Brown Clustering, Please, or visit us at RANLP in Hissar, Bulgaria, 7-9 September 2015.

Outro

If you found this page helpful, please cite the research of ours that it summarises!

Derczynski, Chester and Bøgh, 2015. Tune Your Brown Clustering, Please. In Proceedings of the Conference on Recent Advances in Natural Language Processing (RANLP).

This research was sponsored by the Pheme and WallViz projects.


By Leon Derczynski, Sean Chester and Kenneth S. Bøgh.

The University of Sheffield
Western Bank
Sheffield S10 2TN
UK