2009-09-06

Github contest: Winning the bottle of Pappy

Daniel Haran and I worked together and won the Github Contest. The
following is Daniel Haran's blog post:

The Github contest ended last weekend. Liang Xiang (xlvector) and I
cooperated on an entry that took first place.

We won a bottle of aged Pappy Van Winkle, a large Github account and
bragging rights among our fellow geeks.

Scott Chacon asked me to write up the algorithms we used; that write-up
is now in the project's README.

I learned 3 more lessons from the Github contest.

Diversity

While our main focus was improving our score, we did stumble on one
interesting avenue for future research.

Imagine a user who watches Rails projects for work, and Haskell
related ones for fun. Treating that user as 2 distinct users should
outperform the simpler approach. Using a kNN algorithm with too large
a value of k and 2+ very different interest clusters would guarantee
poor performance.
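A minimal sketch of that idea, not taken from the contest entry: split one user's watched repositories into interest clusters, so that each cluster can be fed to the recommender as if it were a separate user. The names `jaccard`, `interest_clusters`, the `watchers` mapping and the 0.3 threshold are all hypothetical choices made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of watcher ids."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def interest_clusters(user_repos, watchers, threshold=0.3):
    """Group one user's repos into clusters of similar repos.

    Two repos are linked when the Jaccard similarity of their watcher sets
    reaches `threshold`; clusters are the connected components of that graph.
    `watchers` maps repo -> set of user ids (an assumed data layout).
    """
    repos = sorted(user_repos)
    parent = {r: r for r in repos}

    def find(r):
        # Union-find lookup with path halving.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in combinations(repos, 2):
        if jaccard(watchers[a], watchers[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for r in repos:
        clusters.setdefault(find(r), set()).add(r)
    # Sort clusters by their members so the output order is deterministic.
    return sorted(clusters.values(), key=lambda c: sorted(c))
```

Each returned cluster could then be treated as its own pseudo-user when looking up neighbors, so a Rails cluster is never matched against Haskell-only watchers.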

Given the time constraint we decided to just increase recommendation diversity.

Ensembles win, but require preparation

Netflix taught us that ensembles win. Ilya Grigorik submitted an entry
exploiting that fact and wrote about it in "Collaborative Filtering with
Ensembles".

The ideal co-operation scenario would have involved participants using
the same training data and result file formats. Had I realized that
sooner, the winning entry would have included Jeremy's results.
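To show why a shared result format matters, here is a rough sketch of blending several entries once they can all be parsed the same way. It assumes each result line looks like `user_id:repo_id,repo_id,...` and uses a weighted Borda count; the function names, the weights and the scoring rule are illustrative assumptions, not the blending used in the winning entry.

```python
from collections import defaultdict


def parse_results(text):
    """Parse lines of the assumed form 'user_id:repo_id,repo_id,...'."""
    results = {}
    for line in text.strip().splitlines():
        user, _, repos = line.partition(":")
        results[user] = [r for r in repos.split(",") if r]
    return results


def blend(entries, weights, top_n=10):
    """Merge ranked guess lists from several entries (weighted Borda count)."""
    scores = defaultdict(lambda: defaultdict(float))
    for entry, weight in zip(entries, weights):
        for user, ranked in entry.items():
            for rank, repo in enumerate(ranked):
                # Earlier positions in a list earn more points.
                scores[user][repo] += weight * (len(ranked) - rank)
    return {
        # Highest score first; ties broken by repo id for determinism.
        user: sorted(repo_scores, key=lambda r: (-repo_scores[r], r))[:top_n]
        for user, repo_scores in scores.items()
    }
```

With every participant writing the same format, adding another entry to the ensemble is one more `parse_results` call and one more weight.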

Overfitting

Avoiding overfitting is usually very easy, and early on we decided to
test locally on our own training data subset and test file. Ideally,
we would have had time to do the same thing with each of the
heuristics and data sources we were using. Not doing so ironically
resulted in over-emphasizing those weights in our blending, thereby
underfitting the test set.
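A local test of this kind can be sketched as follows, assuming the watch data is a mapping from user to a set of repos and that a guess counts as a hit when the hidden repo appears in the user's guess list. `holdout_split` and `hit_rate` are hypothetical helpers, not the scripts we actually ran.

```python
import random


def holdout_split(watches, n_test_users, seed=0):
    """Hide one watched repo for each of `n_test_users` sampled users.

    Returns (train, hidden): `train` is a copy of `watches` with the hidden
    repos removed, `hidden` maps each sampled user to the withheld repo.
    """
    rng = random.Random(seed)
    # Only users with at least two repos can lose one and keep some history.
    eligible = sorted(u for u, repos in watches.items() if len(repos) >= 2)
    test_users = rng.sample(eligible, min(n_test_users, len(eligible)))
    train = {u: set(repos) for u, repos in watches.items()}
    hidden = {}
    for user in test_users:
        repo = rng.choice(sorted(train[user]))
        train[user].remove(repo)
        hidden[user] = repo
    return train, hidden


def hit_rate(guesses, hidden):
    """Fraction of test users whose hidden repo appears in their guesses."""
    if not hidden:
        return 0.0
    hits = sum(1 for user, repo in hidden.items() if repo in guesses.get(user, []))
    return hits / len(hidden)
```

Scoring each heuristic and data source separately with `hit_rate` on the same split gives a basis for choosing blending weights without touching the real test set.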

Thanks to Scott and Github for giving us a challenge between Netflix contests.
