The Github contest ended last weekend. Liang Xiang (xlvector) and I
cooperated on an entry that took first place.
We won a bottle of aged Pappy Van Winkle, a large Github account and
bragging rights among our fellow geeks.
Scott Chacon asked me to write up the algorithms we used; that's
now in the project's README.
I learned 3 more lessons from the Github contest.
Diversity
While our main focus was improving our score, we did stumble on one
interesting avenue for future research.
Imagine a user who watches Rails projects for work, and
Haskell-related ones for fun. Treating that user as two distinct
users, one per interest, should outperform the simpler approach of a
single combined profile. Using a kNN algorithm with too large a value
of k on a user with two or more very different interest clusters
would all but guarantee poor performance.
Given the time constraint, we decided to simply increase recommendation diversity.
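The split-profile idea can be sketched in a few lines of Python. This
is only an illustration, not the contest entry's code: the per-repo
language labels, the Jaccard similarity, the helper names and the toy
data are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def jaccard(a, b):
    """Jaccard similarity between two sets of repos."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def split_by_language(watched, repo_language):
    """Split one user's watched repos into per-language 'virtual users'."""
    clusters = defaultdict(set)
    for repo in watched:
        clusters[repo_language.get(repo, "unknown")].add(repo)
    return dict(clusters)

def recommend(profile, others, k, exclude):
    """Repos watched by the k users most similar to `profile`, most-voted first."""
    neighbours = sorted(others, key=lambda u: jaccard(profile, others[u]),
                        reverse=True)[:k]
    votes = Counter(repo for u in neighbours for repo in others[u] - exclude)
    return [repo for repo, _ in votes.most_common()]

# Toy data, invented for illustration (hypothetical users and watch lists).
repo_language = {"rails": "ruby", "devise": "ruby", "rspec": "ruby",
                 "xmonad": "haskell", "pandoc": "haskell", "yesod": "haskell"}
others = {"alice": {"rails", "devise", "rspec"},
          "bob": {"xmonad", "pandoc", "yesod"}}
watched = {"rails", "devise", "xmonad", "pandoc"}

# One recommendation pass per interest cluster instead of one for the whole user.
per_cluster = {
    lang: recommend(profile, others, k=1, exclude=watched)
    for lang, profile in split_by_language(watched, repo_language).items()
}
print(per_cluster)
```

In this toy data the combined profile is equally similar to both
neighbours, so a single pass with k=1 would surface only one of the
two interests, while the per-cluster passes cover both.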
Ensembles win, but require preparation
Netflix taught us that ensembles win. Ilya Grigorik submitted an entry
exploiting that fact and wrote about it in "Collaborative Filtering
with Ensembles".
The ideal cooperation scenario would have involved participants using
the same training data and result file formats. Had I realized that
sooner, the winning entry would have included Jeremy's results.
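To show why a shared result format matters, here is a minimal sketch
of blending several entries once they can all be parsed the same way.
It assumes one line per user of the form "user_id:repo_id,repo_id,..."
and uses a weighted Borda-style count; both the format and the scoring
rule are assumptions for illustration, not the blend used by any
actual entry.

```python
from collections import defaultdict

def parse_results(text):
    """Parse lines of the assumed form 'user_id:repo_id,repo_id,...'."""
    results = {}
    for line in text.strip().splitlines():
        user, _, repos = line.partition(":")
        results[user] = [r for r in repos.split(",") if r]
    return results

def blend(entries, weights, top_n=10):
    """Weighted Borda-style blend of several ranked recommendation lists."""
    scores = defaultdict(lambda: defaultdict(float))
    for entry, weight in zip(entries, weights):
        for user, repos in entry.items():
            for rank, repo in enumerate(repos):
                # Earlier positions in a list earn more points.
                scores[user][repo] += weight * (len(repos) - rank)
    # Sort by descending score, breaking ties by repo id for determinism.
    return {
        user: sorted(s, key=lambda r: (-s[r], r))[:top_n]
        for user, s in scores.items()
    }

# Two hypothetical entries in the same format.
entry_a = parse_results("1:10,20,30\n2:40")
entry_b = parse_results("1:20,50\n2:40,60")
blended = blend([entry_a, entry_b], weights=[1.0, 1.0], top_n=3)
print(blended)
```

Because every entry parses into the same structure, adding another
participant's results is just one more element in each list passed to
blend.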
Overfitting
Avoiding overfitting is usually very easy, and early on we decided to
test locally on our own training data subset and test file. Ideally,
we would have had time to do the same thing with each of the
heuristics and data sources we were using. Not doing so ironically
resulted in over-emphasizing those weights in our blending, thereby
underfitting the test set.
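Local testing of this kind can be sketched as follows: hide one
watched repo per user, build recommendations from what remains, and
count how often the hidden repo is recovered. The function names, the
one-repo-per-user holdout and the hit-rate metric are assumptions for
illustration rather than the exact setup described above.

```python
import random

def holdout_split(watches, seed=0):
    """Hide one watched repo per user (where possible) to form a local test set."""
    rng = random.Random(seed)
    train, hidden = {}, {}
    for user, repos in watches.items():
        repos = sorted(repos)  # fixed order so the seed gives a repeatable split
        if len(repos) < 2:
            # Too little data to hide anything; keep the user for training only.
            train[user] = set(repos)
            continue
        held = rng.choice(repos)
        hidden[user] = held
        train[user] = set(repos) - {held}
    return train, hidden

def hit_rate(recommendations, hidden):
    """Fraction of hidden repos that appear in the user's recommendation list."""
    if not hidden:
        return 0.0
    hits = sum(1 for user, repo in hidden.items()
               if repo in recommendations.get(user, []))
    return hits / len(hidden)

# Hypothetical watch data.
watches = {"u1": {"a", "b", "c"}, "u2": {"a"}}
train, hidden = holdout_split(watches)
```

Running each heuristic or data source through the same split and
comparing hit rates gives a local estimate of how much weight it
deserves before it goes into the blend.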
Thanks to Scott and Github for giving us a challenge between Netflix contests.
A user buys a book because he or she expects to like it. So why would
that user give the book a low rating after reading it? Probably
because the book did not fully satisfy the user's interests. The user
must like most of the book's properties (the topic, the author, or
the fact that a close friend recommended it); otherwise he or she
would not have bought it. But one or two shortcomings of the book may
still lead the user to give it a low rating.
This may sound like common sense, but we can draw a conclusion from
the example: if a user gives a book a high rating, he or she probably
likes all of its properties, but if a user gives a book a low rating,
he or she may dislike only part of it, because a user who disliked
everything about the book would never have bought it.