2009-09-30

Blog Migration

I recently set up my own WordPress blog and have moved all of this blog's articles over to it.

The new blog address:

http://xlvector.cn/blog/

Everyone is welcome to visit.

2009-09-24

Some statistics from Google Reader


I downloaded many users' broadcast feeds from Google Reader. From a user's feed we can see which articles the user shared, and for every article there is a list of the users who liked it. Therefore, we can use a breadth-first-search crawl to collect the broadcast feeds of all users who have shown their preference on more than one article.

I only crawl the feeds of users who like Chinese articles. After one week of crawling, I have collected 12,586 such users and 51,690 Chinese articles.

I extracted the data about which users like which articles; the results show that a user likes 10 articles on average and shares 8 articles on average.

Following are some results:

users: 12,586
items: 51,690
like records: 127,616
share records: 99,937
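For reference, a minimal sketch of how such numbers can be computed, assuming the crawled like and share records are stored as tab-separated (user_id, item_id) lines; the file names here are assumptions:

```python
# Sketch: compute the counts and per-user averages from crawled records.
# Assumes each line of the input file is "user_id<TAB>item_id".
from collections import defaultdict

def load_records(path):
    records = defaultdict(set)  # user_id -> set of item ids
    with open(path) as f:
        for line in f:
            user, item = line.strip().split('\t')
            records[user].add(item)
    return records

likes = load_records('likes.tsv')    # hypothetical file names
shares = load_records('shares.tsv')

n_likes = sum(len(v) for v in likes.values())
n_shares = sum(len(v) for v in shares.values())
print('users:', len(likes))
print('like records:', n_likes)
print('avg likes per user:', n_likes / len(likes))
print('avg shares per user:', n_shares / len(shares))
```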

Netflix Updates the Leaderboard

Netflix updated the leaderboard today and reported the RMSE on the test set for every competitor. Following is the screenshot.

I am #13 on the leaderboard, and The Ensemble is now #2.

The full dataset can be downloaded here

The report from the winning team, BPC, can be downloaded here

2009-09-23

Youtube: users give ratings only to videos they like very much or hate very much

YouTube published a chart on its official blog showing the distribution of user ratings from 1 to 5 stars. The vast majority of ratings are 5 stars, the second most common rating is 1 star, and 2, 3, and 4 stars are very rare. This suggests that users only use the rating system when they like a video very much or hate it very much.

This reminds me of recommender systems: the phenomenon YouTube describes does not seem to appear in IMDB or Netflix. I think this is because users cannot watch movies on IMDB; they visit the site after watching a movie elsewhere, so for a user who logs in to IMDB, rating may well be his main task.

On YouTube it is different. A user's first task is watching videos, not rating them. As long as the videos meet his needs, he keeps watching without thinking about ratings; only when a video provokes him strongly enough that he feels the need to express his opinion does he rate it.
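Such a J-shaped rating histogram is easy to check on any rating dataset; a minimal sketch over hypothetical (user, video, rating) tuples:

```python
# Sketch: compute the 1-5 star rating distribution from a rating log.
from collections import Counter

# Hypothetical rating records: (user_id, video_id, rating).
ratings = [('u1', 'v1', 5), ('u2', 'v1', 1), ('u3', 'v2', 5)]

histogram = Counter(r for _, _, r in ratings)
total = sum(histogram.values())
for star in range(1, 6):
    print(f'{star} stars: {histogram[star] / total:.1%}')
```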

Photos of The Ensemble from Netflix Prize


This is the first time I have seen my teammates in a photo. The photos come from Chefele, a member of The Ensemble.

I did not attend the conference, so I am not in the photo.

Ten of the 30+ members of "The Ensemble" posed for this team photo while attending the Netflix Prize awards ceremony in New York on 9/21/2009. From left to right: Joe Sill, Bill Bame, David Weiss, Lester Mackey, Greg McAlpin, Jacob Spoelstra, Chris Hefele, Ces Bertino, Craig Carmichael, Bo Yang.

2009-09-22

Netflix Prize: we lost

Netflix has officially announced the results of the Netflix Prize. BPC won; their score on the test set was a 10.06% improvement, and of course ours was also 10.06%. We lost because we submitted more than 20 minutes later than BPC did. Sigh, fate plays tricks on us.

http://www.netflixprize.com//community/viewtopic.php?id=1537

Netflix also announced the next contest. As I understand it, the task is to use content information to solve the cold-start problem, but Netflix has not said how the algorithms will be evaluated, so the problem is not yet fully defined.

I plan to finish the papers I am working on before deciding whether to enter Netflix 2. If it is purely content filtering, it does not seem very interesting: fine as a hobby, but not well suited as a research direction.

2009-09-19

Popular Chinese articles on Google Reader

I crawled 15,000 Chinese users and counted which articles were Liked most often in September. I generated a feed of them; the address is given below.
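The counting step itself is simple; a minimal sketch, assuming the crawl produced (user_id, article_id, like_time) records (all names here are assumptions):

```python
# Sketch: find the most-liked articles within a given month.
from collections import Counter
from datetime import datetime

# Hypothetical like records: (user_id, article_id, like_time).
likes = [('u1', 'a1', datetime(2009, 9, 3)),
         ('u2', 'a1', datetime(2009, 9, 10)),
         ('u2', 'a2', datetime(2009, 8, 30))]

september = Counter(article for _, article, t in likes
                    if t.year == 2009 and t.month == 9)
for article, n in september.most_common(10):
    print(article, n)
```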

2009-09-17

Collecting Google Reader data

My intuition tells me that Google Reader's share and Like features will have a big impact on personalized article recommendation. Recently I have been crawling Google Reader data, mainly through feed links of the following form:

http://www.google.com/reader/public/atom/user/06601636036055060713/state/com.google/broadcast

First of all, special thanks to kuber, who provided me with this link.

This link gives the articles shared by user 06601636036055060713, and for every article it lists the ids of the users who liked it. So starting from this link, we can crawl the whole of Google Reader's data by breadth-first search (though not too aggressively, or you will get banned), updating every day to pick up the latest shares.
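A minimal sketch of that breadth-first crawl; the feed URL pattern is the one above, but how the liking users are encoded inside each atom entry is not documented in this post, so parse_feed below is a placeholder to be filled in:

```python
# Sketch: BFS over Google Reader broadcast feeds, starting from one user.
import time
import urllib.request
from collections import deque

FEED = ('http://www.google.com/reader/public/atom/'
        'user/%s/state/com.google/broadcast')

def parse_feed(xml):
    """Placeholder: extract (shared_articles, liking_user_ids) from the
    atom XML; the actual element names are not given in the post."""
    raise NotImplementedError

def crawl(seed_user, max_users=100000):
    seen, queue = {seed_user}, deque([seed_user])
    while queue and len(seen) < max_users:
        user = queue.popleft()
        xml = urllib.request.urlopen(FEED % user).read()
        articles, likers = parse_feed(xml)
        for u in likers:                 # expand the frontier
            if u not in seen:
                seen.add(u)
                queue.append(u)
        time.sleep(1.0)                  # be polite, or you will get banned
    return seen
```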

My crawler is hard at work right now. Since my purpose is mainly research, collecting data on 100,000 users and 1,000,000 articles should be enough. The dataset is very rich: it contains both timestamps and content, and I believe a lot of work can be built on top of it.

P.S. I really hope Google Reader will expose data on which users a user follows; that would be very valuable for research on combining social networks with recommender systems.

Finally, let me recommend a recommender system kuber built on Google Reader data: http://www.feedzshare.com/

2009-09-11

I like Google Reader more and more

Google Reader is becoming more and more social, and I increasingly believe that the socialization of the web is the direction of the future. There is too much information on the Internet now; in real life we often rely on friends to filter information for us. In Google Reader I now rarely read my own subscriptions, because there are far too many items to keep up with; every day I just read what my friends have shared.

Google Reader is probably still at the stage of collecting data and has not yet started to push hard on data mining. I think that if recommender-system methods were applied to Google Reader, the items users share could be re-ranked, which would surface the information a user likes much better.
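As one concrete possibility (my own sketch, not anything Google Reader actually does): weight each friend's shares by how much that friend's past shares overlap with what I have liked, then rank unread items by the weighted votes. All names below are hypothetical:

```python
# Sketch: re-rank friends' shared items by friend-user similarity.
from collections import defaultdict

def rank_shared_items(my_likes, friend_shares):
    """my_likes: set of item ids I liked.
    friend_shares: dict friend_id -> set of item ids they shared."""
    scores = defaultdict(float)
    for friend, shared in friend_shares.items():
        weight = len(my_likes & shared) / (len(shared) + 1)
        for item in shared - my_likes:   # only items I have not seen
            scores[item] += weight       # similar friends count more
    return sorted(scores, key=scores.get, reverse=True)
```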

2009-09-10

Official news: we won the Github Contest 2009

Today the official Github blog announced that Daniel Haran and I had won the Github Contest 2009. Many people have asked me how we are splitting the prize. Daniel and I agreed that he gets the large Github account (of course I can use it too, but he is the owner) and I get the bottle. So drinkers, come find me, haha.

http://github.com/blog/489-github-contest-winners

Github also gave a prize to the second-place entry. Jeremy and I were in contact during the contest; in fact hintly was originally meant to be a team of the three of us, but in the end time was too tight and we did not manage to blend his results. I really admire Jeremy's openness. In the first few days of the contest I also published my source code, but later on, probably worried that someone would overtake me, I was not open enough. After the contest ended, though, I did upload all of my code.

To be honest, the design of this Github Contest was not entirely sound: users should not have been allowed unlimited submissions, the data should have been transformed to prevent contestants from exploiting external information, too much content information was provided, and so on. In any case, winning still makes me very happy; now I am just waiting for Daniel to send me that bottle, haha.

2009-09-09

Recommending reading sequences

In a typical book recommender system we recommend individual books to users, but in real life there is an even more useful kind of recommendation: recommending a reading sequence.
A simple example: whenever a new junior student joins our lab, we tell him which book to read first, which to read next, and which after that, giving him a reading list ordered from easy to hard.

So the questions I am considering now are: can the difficulty of a book be estimated from the behavior of a large number of users? And how should we recommend reading lists to users?
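On the first question, one naive statistic (purely my speculation) is the average normalized position of a book within users' reading histories: books that most users get to late are plausibly harder. A minimal sketch, assuming we have each user's books in reading order:

```python
# Sketch: estimate book difficulty from reading order across many users.
from collections import defaultdict

def difficulty_scores(reading_logs):
    """reading_logs: dict user_id -> list of book ids in reading order.
    Returns book -> mean normalized position (0 = read early, 1 = late)."""
    positions = defaultdict(list)
    for user, books in reading_logs.items():
        if len(books) < 2:
            continue
        for i, book in enumerate(books):
            positions[book].append(i / (len(books) - 1))
    return {b: sum(p) / len(p) for b, p in positions.items()}

# A reading list could then be ordered by ascending difficulty score.
```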

2009-09-06

Github contest: Winning the bottle of Pappy

Daniel Haran and I worked together and won the Github Contest. Following
is the blog post by Daniel Haran:

The Github contest ended last weekend. Liang Xiang (xlvector) and I
cooperated on an entry that took first place.

We won a bottle of aged Pappy Van Winkle, a large github account and
bragging rights among our fellow geeks.

Scott Chacon asked me to write up about the algorithms we used; that's
now in the project's README.

I learned 3 more lessons from the Github contest.

Diversity

While our main focus was improving our score, we did stumble on one
interesting avenue for future research.

Imagine a user who watches Rails projects for work, and Haskell-related
ones for fun. Treating that user as 2 distinct users should
outperform the simpler approach. Using a kNN algorithm with too large
a value of k and 2+ very different interest clusters would guarantee
poor performance.

Given the time constraint we decided to just increase recommendation diversity.
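A sketch of the split-the-user idea (a hypothetical illustration, not the entry's actual code): cluster the repositories a user watches, for example by language, and run any kNN recommender once per cluster:

```python
# Sketch: treat each interest cluster of a user as its own virtual user.
from collections import defaultdict

def recommend_per_cluster(user_repos, repo_language, knn_recommend, k=20):
    """user_repos: repos the user watches; repo_language: repo -> language.
    knn_recommend(seed_repos, k): any kNN recommender over a seed set."""
    clusters = defaultdict(set)
    for repo in user_repos:
        clusters[repo_language.get(repo, 'unknown')].add(repo)
    recs = []
    for cluster in clusters.values():
        recs.extend(knn_recommend(cluster, k))  # one query per cluster
    return recs
```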

Ensembles win, but require preparation

Netflix taught us that ensembles win. Ilya Grigorik submitted an entry
exploiting that fact and wrote about it in Collaborative Filtering with
Ensembles.

The ideal co-operation scenario would have involved participants using
the same training data and result file formats. Had I realized that
sooner, the winning entry would have included Jeremy's results.

Overfitting

Avoiding over-fitting is usually very easy, and early on we decided to
test locally on our own training data subset and test file. Ideally,
we would have had time to do the same thing with each of the
heuristics and data sources we were using. Not doing so ironically
resulted in over-emphasizing those weights in our blending, thereby
underfitting the test set.

Thanks to Scott and Github for giving us a challenge between Netflix contests.
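To make the blending pitfall above concrete, here is a sketch (my own, with assumed interfaces) of choosing blend weights by grid search on a locally held-out set, so a heuristic that was never validated cannot silently dominate the blend:

```python
# Sketch: fit blend weights for several heuristics on held-out users.
import itertools

def fit_blend_weights(heuristics, validation_users, evaluate):
    """heuristics: list of functions user -> ranked list of repos.
    evaluate(predict_fn, users): score of a predictor on held-out users."""
    best_weights, best_score = None, float('-inf')
    grid = [w for w in itertools.product([0.0, 0.5, 1.0],
                                         repeat=len(heuristics)) if any(w)]
    for weights in grid:
        def blended(user, weights=weights):
            scores = {}
            for w, heuristic in zip(weights, heuristics):
                for rank, repo in enumerate(heuristic(user)):
                    scores[repo] = scores.get(repo, 0.0) + w / (rank + 1)
            return sorted(scores, key=scores.get, reverse=True)
        score = evaluate(blended, validation_users)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights
```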

2009-09-04

Why users give low ratings to books they bought

In a book recommender system, users sometimes give low ratings to
books they bought. Here, I suppose there is no cheating in the system
and that users are not forced to rate items.

A user buys a book because he likes it. So why does he give it a low
rating after he has read it? It may be that the book does not fully
satisfy his interests. The user likes most of the properties of the
book (he may like the topic or the author, or the book was recommended
by one of his best friends); otherwise he would not have bought it.
But one or two flaws in the book may make him give it a low rating.

The above seems like common sense, but we can draw one conclusion
from this example: if a user gives a high rating to a book, he
probably likes all of the book's properties; if he gives a low rating,
he may hate only part of the book, because if he hated all of it, he
would never have bought it.