2009-09-24
Netflix Update Leaderboard
2009-09-23
YouTube: users rate the videos they either like very much or hate very much
Photos of The Ensemble from Netflix Prize

2009-09-22
Netflix Prize: we lost
http://www.netflixprize.com//community/viewtopic.php?id=1537
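Netflix also announced its next competition. As I understand it, the task is to use content information to solve the cold-start problem. However, Netflix has not said how the algorithms will be evaluated, so the problem is not yet fully defined.
I plan to finish the paper I am working on before deciding whether to enter Netflix 2. If it is purely content filtering, I don't find it very interesting: fine as a spare-time project, but not well suited as a research direction.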
2009-09-20
2009-09-19
2009-09-17
Collecting data from Google Reader
http://www.google.com/reader/public/atom/user/06601636036055060713/state/com.google/broadcast
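First, special thanks to kuber, who gave me this link.
The link lists the articles shared by user 06601636036055060713 and, for each article, the ids of the users who liked it. Starting from this link, a breadth-first search can therefore crawl the whole Google Reader data set (without being too aggressive, or the crawler will get blocked). The crawl has to be refreshed every day to pick up the latest shares.
My crawler is hard at work right now. My purpose is mainly research, so data on 100,000 users and 1,000,000 articles will be enough. The data set is very rich, with both time and content information, and I believe a lot of work can be built on it.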
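A breadth-first crawl of this kind can be sketched roughly as follows. This is only an illustration, not my actual crawler: `fetch_shared_items` is a hypothetical callable standing in for downloading and parsing one user's shared-items feed, and the toy dictionary replaces real network access. A real crawler would also pause between requests to avoid being blocked.

```python
from collections import deque


def crawl_shares(seed_user, fetch_shared_items,
                 max_users=100_000, max_articles=1_000_000):
    """Breadth-first crawl over users linked through likes on shared articles.

    fetch_shared_items(user_id) is a hypothetical function that returns a
    list of (article_id, liker_ids) pairs for that user's shared items.
    """
    seen_users = {seed_user}
    queue = deque([seed_user])
    articles = {}  # article id -> set of user ids who liked it

    while queue:
        user = queue.popleft()
        for article_id, likers in fetch_shared_items(user):
            # Stop recording new articles once the article budget is used up.
            if article_id not in articles and len(articles) >= max_articles:
                continue
            articles.setdefault(article_id, set()).update(likers)
            # Every liker is a new user whose shares can be crawled later.
            for liker in likers:
                if liker not in seen_users and len(seen_users) < max_users:
                    seen_users.add(liker)
                    queue.append(liker)
    return seen_users, articles


# Toy data standing in for the real feeds (made up for illustration).
toy_shares = {
    "u1": [("a1", ["u2", "u3"])],
    "u2": [("a2", ["u3"])],
    "u3": [("a3", ["u4"])],
    "u4": [],
}

users, articles = crawl_shares("u1", lambda user: toy_shares.get(user, []))
print(len(users), "users,", len(articles), "articles")
```

The `max_users` and `max_articles` limits mirror the target sizes mentioned above; the queue guarantees that users closer to the seed are visited first.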
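P.S. I really hope Google Reader will expose users' follow data; it would be very valuable for research on combining social networks with recommender systems.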
2009-09-11
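I like Google Reader more and more
For example, Google Reader may still be at the data-collection stage and has not yet put much effort into data mining. I think that if recommender-system methods were applied to Google Reader, the items shared with a user could be re-ranked so that the ones the user is most likely to enjoy come first.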
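One simple way such a re-ranking could work is sketched below. It is an assumption for illustration only, not anything Google Reader does: each candidate item is scored by how similar the users who liked it are to the current user, using Jaccard similarity over liked-item sets. The function names and the toy data are made up.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rerank_shares(user, candidate_items, likes_by_user):
    """Order candidate shared items by the similarity of their likers to `user`.

    likes_by_user maps a user id to the set of item ids that user liked.
    """
    mine = likes_by_user.get(user, set())

    def score(item):
        # Sum the similarity of every other user who liked this item.
        return sum(
            jaccard(mine, liked)
            for other, liked in likes_by_user.items()
            if other != user and item in liked
        )

    return sorted(candidate_items, key=score, reverse=True)


# Toy data (hypothetical): "x" has tastes close to "me", "y" does not.
toy_likes = {
    "me": {"a", "b"},
    "x": {"a", "b", "c"},
    "y": {"z", "d"},
}

ranked = rerank_shares("me", ["d", "c"], toy_likes)
print(ranked)
```

Item "c" is liked by a user who shares two items with "me", so it is placed ahead of "d", which is liked only by a user with no overlap.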
2009-09-10
Official news: we won the Github Contest 2009
http://github.com/blog/489-github-contest-winners
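Github also gave a prize to the second-place entrant. Jeremy and I were in touch during the contest; in fact hintly was originally meant to be a team of the three of us, but time ran short and we did not manage to blend in his results. I also admire Jeremy's open attitude. During the first few days of the contest I published my source code too, but later, probably out of worry that someone would overtake us, I was less open. After the contest ended, I did upload all of the code.
The design of this Github Contest was not very sound: unlimited submissions should not have been allowed, the data should have been transformed to stop contestants from using outside information, too much content information was provided, and so on. Either way, I am very happy about winning, and now I am just waiting for Daniel to send me that bottle, haha.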
2009-09-09
2009-09-06
Github contest: Winning the bottle of Pappy
The following is the blog post by Daniel Haran:
The Github contest ended last week-end. Liang Xiang (xlvector) and I
cooperated on an entry that took first place.
We win a bottle of aged Pappy Van Winkle, a large github account and
bragging rights among our fellow geeks.
Scott Chacon asked me to write up about the algorithms we used; that's
now in the project's README.
I learned 3 more lessons from the Github contest.
Diversity
While our main focus was improving our score, we did stumble on one
interesting avenue for future research.
Imagine a user who watches Rails projects for work, and Haskell
related ones for fun. Treating that user as 2 distinct users should
outperform the simpler approach. Using a kNN algorithm with too large
a value of k and 2+ very different interest clusters would guarantee
poor performance.
Given the time constraint we decided to just increase recommendation diversity.
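The idea of treating a user with several interests as several distinct users can be sketched as follows. This is not the method used in the winning entry (which only increased diversity); it is a minimal illustration that assumes a hypothetical `repo_language` map is available to split a user's watched repositories into interest groups, and it uses simple co-watch counts in place of a tuned kNN.

```python
from collections import Counter, defaultdict
from itertools import zip_longest


def recommend_per_interest(user, watches, repo_language, top_n=4):
    """Recommend separately for each interest group, then interleave.

    watches maps user -> set of watched repos; repo_language maps each of the
    target user's repos to a language used here as a crude interest cluster.
    """
    mine = watches[user]
    groups = defaultdict(set)
    for repo in mine:
        groups[repo_language[repo]].add(repo)

    ranked_lists = []
    for language in sorted(groups):
        profile = groups[language]  # one "virtual user" per interest
        scores = Counter()
        for other, theirs in watches.items():
            if other == user:
                continue
            overlap = len(profile & theirs)
            if overlap == 0:
                continue
            for repo in theirs - mine:
                scores[repo] += overlap
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        ranked_lists.append([repo for repo, _ in ranked])

    # Round-robin merge so every interest gets a slot near the top.
    merged = []
    for tier in zip_longest(*ranked_lists):
        for repo in tier:
            if repo is not None and repo not in merged:
                merged.append(repo)
    return merged[:top_n]


# Toy data (made up): one work interest (Ruby) and one hobby (Haskell).
toy_watches = {
    "me": {"rails", "xmonad"},
    "a": {"rails", "devise", "rspec"},
    "b": {"rails", "devise"},
    "c": {"xmonad", "pandoc"},
}
toy_languages = {"rails": "Ruby", "xmonad": "Haskell"}

recommendations = recommend_per_interest("me", toy_watches, toy_languages)
print(recommendations)
```

With a single merged profile the two Rails-heavy neighbours would dominate the top of the list; interleaving per-interest lists keeps the lone Haskell neighbour's suggestion visible.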
Ensembles win, but require preparation
Netflix taught us that ensembles win. Ilya Grigorik submitted an entry
exploiting that fact and wrote about it in "Collaborative Filtering with
Ensembles".
The ideal co-operation scenario would have involved participants using
the same training data and result file formats. Had I realized that
sooner, the winning entry would have included Jeremy's results.
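Once every participant emits ranked lists in the same format, blending them is straightforward. The sketch below is one possible scheme (a weighted reciprocal-rank sum), chosen for illustration; the post does not say how the actual entries were blended, and the model outputs and weights here are made up.

```python
from collections import defaultdict


def blend_rankings(rankings, weights, top_n=10):
    """Blend several ranked lists with a weighted reciprocal-rank score.

    rankings is a list of ranked item lists (best first), one per model;
    weights gives the relative trust placed in each model.
    """
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for position, item in enumerate(ranking):
            scores[item] += weight / (position + 1)
    # Sort by descending score, breaking ties by item name for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


# Hypothetical outputs of two models for the same user.
model_a = ["r1", "r2", "r3"]
model_b = ["r2", "r4"]

blended = blend_rankings([model_a, model_b], weights=[1.0, 1.0])
print(blended)
```

An item that both models rank highly ("r2" here) rises above an item that only one model places first.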
Overfitting
Avoiding over-fitting is usually very easy, and early on we decided to
test locally on our own training data subset and test file. Ideally,
we would have had time to do the same thing with each of the
heuristics and data sources we were using. Not doing so ironically
resulted in over-emphasizing those weights in our blending, thereby
underfitting the test set.
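Local testing of this kind can be sketched as below: hide one watched repository per user, train on the rest, and measure how often the hidden one is recovered. The split rule, the metric, and the `recommend` callable are assumptions for illustration and are not taken from the contest code.

```python
import random


def split_holdout(watches, seed=0):
    """Hide one watched repo per user to build a local test set."""
    rng = random.Random(seed)
    train, held_out = {}, {}
    for user, repos in watches.items():
        ordered = sorted(repos)
        if len(ordered) < 2:
            # Too little data to hide anything; keep the user for training only.
            train[user] = set(ordered)
            continue
        hidden = rng.choice(ordered)
        held_out[user] = hidden
        train[user] = set(ordered) - {hidden}
    return train, held_out


def hit_rate(recommend, train, held_out, top_n=10):
    """Fraction of users whose hidden repo appears in their top-n list.

    recommend(user, train) is a hypothetical callable returning a ranked list.
    """
    if not held_out:
        return 0.0
    hits = sum(
        1 for user, hidden in held_out.items()
        if hidden in recommend(user, train)[:top_n]
    )
    return hits / len(held_out)


# Toy data (made up).
toy_watch_lists = {"u1": {"a", "b"}, "u2": {"a", "c"}, "u3": {"z"}}
train, held_out = split_holdout(toy_watch_lists)

# Baseline recommender: every repo seen in training that the user lacks.
def baseline(user, training):
    everything = set().union(*training.values())
    return sorted(everything - training[user])

print(hit_rate(baseline, train, held_out))
```

Running the same split against each heuristic or data source separately is what would show which blending weights generalize and which only fit the local data.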
Thanks to Scott and Github for giving us a challenge between Netflix contests.
2009-09-04
Why users give low ratings to books they bought
This post asks why users sometimes give low ratings to books they
bought. Here, I suppose there is no cheating in the system and users
are not forced to rate items.
A user buys a book because he or she likes it. So why would the user
give it a low rating after reading it? It may be because the book does
not fully satisfy the user's interest. The user must like most of the
properties of the book (the topic, the author, or the fact that a close
friend recommended it); otherwise he or she would not have bought it.
But one or two disadvantages of the book may still lead to a low
rating.
The above seems like common sense, but we can draw one conclusion from
this example. If a user gives a book a high rating, he or she may like
all of its properties. If a user gives a book a low rating, he or she
may dislike only part of it, because a user who disliked every part of
the book would never have bought it.
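One way to turn this asymmetry into an update rule is sketched below. It is only my illustration of the argument, with made-up feature names and an assumed rating threshold: a high rating credits every property of the book in full, while a low rating spreads a single unit of blame across all properties, since we do not know which one or two the user disliked.

```python
from collections import defaultdict


def update_feature_scores(scores, features, rating, like_threshold=4):
    """Update per-property preference scores from one rating of a bought book.

    High rating: the user may like every property, so each gets +1.
    Low rating: only some unknown properties are disliked, so one unit of
    penalty is shared equally among all of them.
    """
    if not features:
        return scores
    if rating >= like_threshold:
        for feature in features:
            scores[feature] += 1.0
    else:
        share = 1.0 / len(features)
        for feature in features:
            scores[feature] -= share
    return scores


# Hypothetical books described by their properties.
preference = defaultdict(float)
update_feature_scores(preference, ["topic:ml", "author:smith"], rating=5)
update_feature_scores(preference, ["topic:ml", "author:jones"], rating=1)
print(dict(preference))
```

After these two ratings the shared topic still has a positive score, which matches the intuition that a low rating on a purchased book is weak evidence against any single property.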