Pete's Log: Log Entry 1004

Entry #1004 (Coding, Hacking, & CS stuff)
(posted when I was 22 years old.)

So I finally made it to a seminar today, for the first time in quite a while. It was a talk by Kevin Bowyer about pattern matching for very large datasets, and it was one of the best seminars I think I've been to at ND.

Basically, it came down to this: they had a huge dataset, around 3.7 million data points (something to do with proteins and amino acids, but that doesn't matter too much), and a classifier they had trained on the entire thing, with a prediction accuracy of about 78%. They also tried classifiers trained on only 1/8 of the dataset, and those averaged about 74% accuracy. So they created committees: 8 classifiers, each trained on a different 1/8 of the dataset, with the classifiers voting on each prediction. This committee outperformed the single classifier trained on the whole dataset. It seems almost counterintuitive at first, but it's really cool. They then built committees of 16, 24, and 32 classifiers (still giving each classifier 1/8 of the data, just partitioning it in different ways, so data points showed up in multiple subsets) and got even better accuracy, up to a point; I think accuracy plateaued around 32 classifiers. Also cool: beyond the improved accuracy, the committee model makes the whole thing naturally parallel, since each member can be trained independently. It was a hella cool talk.
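Out of curiosity, here's a tiny sketch of the committee idea in Python, just me playing with the voting mechanics. The scikit-learn decision trees and the made-up dataset are my own stand-ins and have nothing to do with their actual classifier or the protein data:

    # Toy sketch of the committee/voting idea from the talk -- NOT their setup.
    # Stand-in data and decision trees, just to show the mechanics.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    # Made-up binary classification data (the real set was ~3.7 million points).
    X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    def train_committee(n_members, fraction=1/8):
        """Train n_members classifiers, each on a random `fraction` of the data."""
        subset_size = int(len(X_train) * fraction)
        members = []
        for _ in range(n_members):
            idx = rng.choice(len(X_train), size=subset_size, replace=False)
            members.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))
        return members

    def committee_predict(members, X):
        """Majority vote over the members' predictions (binary labels)."""
        votes = np.stack([m.predict(X) for m in members])  # shape: (n_members, n_samples)
        return (votes.mean(axis=0) > 0.5).astype(int)     # did more than half say 1?

    single = DecisionTreeClassifier().fit(X_train, y_train)
    print("single classifier:", single.score(X_test, y_test))
    for k in (8, 16, 32):
        acc = (committee_predict(train_committee(k), X_test) == y_test).mean()
        print(f"committee of {k}:", acc)

Each committee member only ever sees 1/8 of the training data, and since the members don't depend on each other, you could train them all at once on separate machines.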

now it's algorithms time. woo!