2009-09-30

Moving My Blog

I recently set up my own WordPress blog and moved all the articles from this blog over to the new one.

The new blog's address:

http://xlvector.cn/blog/

Everyone is welcome to visit.

2009-09-24

Some statistics from Google Reader


I downloaded many users' broadcast feeds from Google Reader. A user's feed shows which articles that user shared, and each article carries a list of the users who like it. Therefore, we can use a breadth-first-search crawler to fetch the broadcast feeds of all users who have expressed a preference for more than one article.

I only crawled the feeds of users who like Chinese articles. After one week of crawling, I have only collected 12586 such users and 51690 Chinese articles.

I extracted which user likes which article; the results show that on average a user likes 10 articles and shares 8 articles.

Here are some of the numbers:

number of users : 12586
number of items : 51690
number of like records : 127616
number of share records : 99937

Netflix Updates the Leaderboard

Netflix updated the leaderboard today and reported the test-set RMSE for every competitor. Below is a screenshot.

I am at #13 on the leaderboard, and The Ensemble is now at #2.

The full dataset can be downloaded here

The report from the winning team BPC can be downloaded here

2009-09-23

YouTube: users only rate videos they like very much or hate very much

YouTube published a chart on its official blog showing the distribution of user ratings from 1 to 5 stars. The chart shows that the overwhelming majority of ratings are 5 stars, the second most common rating is 1 star, and 2, 3 and 4 stars are very rare. This suggests that users only use the rating system when they like a video, or dislike it, to an extreme degree.

This example reminds me of recommender systems: the phenomenon YouTube describes does not seem to appear in IMDB or Netflix. I think this is because users cannot watch videos on IMDB; they log in only after watching a movie elsewhere, so for a user who logs into IMDB, rating may well be the task itself.

On YouTube it is different: a user's first task is not rating but watching. As long as the videos meet their needs, users keep watching without ever thinking of rating. Only when a video provokes them enough that they feel the need to express an opinion do they rate it.

Photos of The Ensemble from the Netflix Prize


This is the first time I have seen my teammates in photos. The photos come from Chefele, a member of The Ensemble.


I did not attend the conference, so I am not in the photo.

Ten of the 30+ members of "The Ensemble" posed for this team photo while attending the Netflix Prize awards ceremony in New York on 9/21/2009. From left to right: Joe Sill, Bill Bame, David Weiss, Lester Mackey, Greg McAlpin, Jacob Spoelstra, Chris Hefele, Ces Bertino, Craig Carmichael, Bo Yang.

2009-09-22

Netflix Prize: We Lost

Netflix officially announced the results of the Netflix Prize: BPC won. Their result on the test set was 10.06%, and of course ours was also 10.06%. We lost because we submitted more than 20 minutes later than BPC. Sigh, fate plays tricks on us.

http://www.netflixprize.com//community/viewtopic.php?id=1537

Netflix also announced the next contest. As I understand it, the task is to use content information to solve the cold-start problem. However, Netflix has not said how the algorithms will be evaluated, so the problem is not yet fully defined.

I plan to finish the paper I am working on before deciding whether to enter Netflix Prize 2. If it is purely content filtering, it does not seem very interesting: fine as a hobby, but not well suited as a research direction.

2009-09-19

Popular Chinese Articles on Google Reader

I crawled 15,000 Chinese users and counted which articles were liked the most in September. I generated a feed from them; the address follows

2009-09-17

Collecting Data from Google Reader

My intuition tells me that Google Reader's share and like features will have a big impact on personalized article recommendation. Recently I have been crawling Google Reader data, mainly through feed links like this one:

http://www.google.com/reader/public/atom/user/06601636036055060713/state/com.google/broadcast

First, special thanks to kuber, who provided me with this link.

This feed lists the articles shared by user 06601636036055060713, and for each article it gives the ids of the users who liked it. So starting from this link, we can crawl the whole of Google Reader's data with a breadth-first search (though not too aggressively, or you will be blocked), updating daily to get the latest sharing activity.
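
A minimal sketch of this breadth-first crawl in Python (the likingUser element name and its namespace URI are assumptions about the Atom format of the time, and a real crawler would add rate limiting):

import collections
import urllib.request
import xml.etree.ElementTree as ET

FEED = "http://www.google.com/reader/public/atom/user/%s/state/com.google/broadcast"
LIKER = "{http://www.google.com/schemas/reader/atom/}likingUser"  # assumed tag name

def crawl(seed_user, max_users=10000):
    seen = {seed_user}
    queue = collections.deque([seed_user])
    while queue and len(seen) < max_users:
        user = queue.popleft()
        try:
            feed = urllib.request.urlopen(FEED % user, timeout=10).read()
        except OSError:
            continue  # skip users whose feed cannot be fetched
        for node in ET.fromstring(feed).iter(LIKER):
            uid = (node.text or "").strip()
            if uid and uid not in seen:  # enqueue every newly discovered liker
                seen.add(uid)
                queue.append(uid)
    return seen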

My crawler is now hard at work. Since my purpose is research, collecting 100,000 users and 1,000,000 articles will be enough. This dataset is very rich, containing both time and content information, and I believe a lot of work can be built on top of it.

P.S. I really hope Google Reader will provide users' follow data; that would be very useful for studying the combination of social networks and recommender systems.

Finally, let me recommend a recommender system kuber built on Google Reader data: http://www.feedzshare.com/

2009-09-11

I Like Google Reader More and More

Google Reader is becoming more and more social, and I increasingly believe that the socialization of the web is the direction of the future. There is too much information on the Internet now; in real life we often rely on friends to filter information for us. In Google Reader I now hardly read my own subscriptions, because there are simply too many to get through; every day I just read what my friends share.

Google Reader is probably still at the stage of collecting data and has not yet pushed hard on data mining. I think that if recommender-system methods were applied to Google Reader, the items users share could be re-ranked, which would surface the information each user likes much better.

2009-09-10

Official News: We Won the GitHub Contest 2009

Today the official GitHub blog announced that Daniel Haran and I have won the GitHub Contest 2009. Many people have asked how we will split that bottle; the deal between Daniel and me is that he gets the account (I can use it too, but he is the owner) and I get the bottle. So drinkers, come find me, haha.

http://github.com/blog/489-github-contest-winners

GitHub also gave the runner-up a prize. Jeremy and I were in touch during the contest; hintly was originally meant to be a three-person team, but time ran short and we never managed to blend in his results. I really admire Jeremy's openness. In the first few days of the contest I also published my source code, but later, probably worried about being overtaken, I became less open. After the contest ended, though, I uploaded all of my code.

To be honest, the design of this GitHub Contest was not entirely sound: users should not have been allowed unlimited submissions, the data should have been transformed to keep contestants from exploiting outside information, too much content information was provided, and so on. Still, winning the prize makes me happy. Now I am just waiting for Daniel to send me that bottle, haha.

2009-09-09

Recommending Reading Sequences

In a typical book recommender system we recommend individual books to users, but in real life there is an even more useful kind of recommendation: the reading sequence.
A simple example: every time a junior student joins our lab, we tell him which book to read first, which to read next, and which after that, giving him a reading list ordered from easy to hard.

So the questions I am now considering are: can the difficulty of a book be estimated from the behavior of a large number of users, and how should we recommend reading lists?

2009-09-06

Github contest: Winning the bottle of Pappy

Daniel Haran and I worked together and won the GitHub Contest. Below is the blog post by Daniel Haran:

The Github contest ended last weekend. Liang Xiang (xlvector) and I cooperated on an entry that took first place.

We win a bottle of aged Pappy Van Winkle, a large github account and
bragging rights among our fellow geeks.

Scott Chacon asked me to write up about the algorithms we used; that's
now in the project's README.

I learned 3 more lessons from the Github contest.

Diversity

While our main focus was improving our score, we did stumble on one
interesting avenue for future research.

Imagine a user who watches Rails projects for work, and Haskell
related ones for fun. Treating that user as 2 distinct users should
outperform the simpler approach. Using a kNN algorithm with too large
a value of k and 2+ very different interest clusters would guarantee
poor performance.

Given the time constraint we decided to just increase recommendation diversity.

Ensembles win, but require preparation

Netflix taught us that ensembles win. Ilya Grigorik submitted an entry exploiting that fact and wrote about it in Collaborative Filtering with Ensembles.

The ideal co-operation scenario would have involved participants using
the same training data and result file formats. Had I realized that
sooner, the winning entry would have included Jeremy's results.

Overfitting

Avoiding over-fitting is usually very easy, and early on we decided to
test locally on our own training data subset and test file. Ideally,
we would have had time to do the same thing with each of the
heuristics and data sources we were using. Not doing so ironically
resulted in over-emphasizing those weights in our blending, thereby
underfitting the test set.

Thanks to Scott and Github for giving us a challenge between Netflix contests.

2009-09-04

Why Users Give Low Ratings to Books They Bought

In a book recommender system, users sometimes give low ratings to books they bought. Here I suppose there is no cheating in the system and users are not forced to rate items.

A user buys a book because he likes it. So why does he give it a low rating after reading it? It may be because the book does not satisfy his interests. The user likes most of the properties of the book (he may like the topic or the author, or the book may have been recommended by one of his best friends); if not, he would not have bought it. But one or two shortcomings of the book may make him give it a low rating.

The above sounds like common sense, but we can draw one conclusion from it: if a user gives a high rating to a book, he probably likes all of its properties, while if a user gives a low rating to a book, he probably hates only part of it, because if he hated every part of it he would never have bought it.

2009-08-23

Yesterday's Netflix Prize Talk at Qiyu Cafe

Yesterday at Qiyu Cafe near Xizhimen, I exchanged ideas with many recommender-system experts and talked about my experience in the Netflix Prize. I am not sure how well I spoke, heh. Our lab has not held group meetings for a long time, and it has been a while since I gave such a long talk.

Thanks to clickstone and wanghongtao for organizing the event, and to Wang Yuantao for his explanation of RBM.

Below are some photos taken by Guozhu.Wen



2009-08-21

net@night interview with some Ensemble members (NetflixPrize)

The URL is: http://www.twit.tv/natn114

Heh, the MP3 can be downloaded from that page. I listened to the whole interview and understood most of it; it seems my listening comprehension has improved quite a bit these past few years.

The three people interviewed are all very active members of The Ensemble, and we cooperated well.

2009-08-18

my solutions of github contest - removing unlikely items

The purpose of a recommender system is to recommend items a user will like, but the problem has an inverse: how do we find the items a user will not like?

This idea turns out to be quite useful in the GitHub contest. I generally remove items a user is unlikely to like by looking at the extreme values of his history.

One example: suppose a user watches many projects, and the most popular project he watches has popularity A. Then we can assume the probability that he will watch a project more popular than A is very low.

Another example: a user watches many projects, all created after 2006. We can then assume the probability that he watches a project created before 2006 is also very low.

Or: a user watches many projects, but they are all small programs of only a few hundred lines. Then we can assume the probability that he watches a large project is very low as well.

Many such rules can be constructed, and each extreme value used improves the results. However, these rules only work for users who have watched many projects, because only for them are the extreme values reliable (the confidence is high).
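
A minimal sketch of this extreme-value filtering, assuming we already have per-project popularity and creation-year tables (the names and thresholds here are illustrative, not my contest code):

def prune_unlikely(watched, candidates, popularity, created_year, min_history=10):
    # Extreme values are only trustworthy for users with a long watch history.
    if len(watched) < min_history:
        return candidates
    max_pop = max(popularity[r] for r in watched)
    earliest = min(created_year[r] for r in watched)
    return [r for r in candidates
            if popularity[r] <= max_pop and created_year[r] >= earliest]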

2009-08-17

my solutions of github contest - item-based KNN

Wendong once asked me to write about some concrete algorithmic problems, so starting today I will describe in detail all the algorithms I used in the GitHub contest.

My GitHub Contest solution: item-based KNN

Item-based KNN is the most widely used method for top-K recommendation. Related papers include:

Item-based collaborative filtering recommendation algorithms
Item-based top-n recommendation algorithms
Amazon.com recommendations: Item-to-item collaborative filtering

In the GitHub contest I used item-based KNN first, but my implementation details differ from the papers above in a few ways:

1) If two projects are watched by the same user, that user contributes some similarity between them. In the traditional similarity computation, every user contributes equally; but consider two users, one who has watched 100 projects and one who has watched only 2. The user who watched 2 projects should contribute more similarity than the one who watched 100. (This effect is called inverse user frequency and corresponds to idf in information retrieval.)

2) Recommendation: for a user, we take all the projects he has watched, find the projects similar to each of them, and thereby find the projects he has not watched that are most similar to those he has. For a project j and a user u, u's preference for j is defined as

p(u,j) = sum_{i in N(u)} w(i,j)

where w(i,j) is the similarity between projects i and j, and N(u) is the set of all projects u has watched. Since w(i,j) is a linear correlation coefficient between 0 and 1, in my implementation I squared w(i,j), mainly to weaken the influence of small similarities.
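
A minimal sketch of this item-based KNN in Python; the cosine-style normalization stands in for the linear correlation coefficient, and the data structures are illustrative rather than the contest code:

import math
from collections import defaultdict

def item_similarity(user_items):
    """user_items: dict mapping user -> set of watched projects."""
    co = defaultdict(lambda: defaultdict(float))
    watchers = defaultdict(int)
    for items in user_items.values():
        iuf = 1.0 / math.log(1.0 + len(items))  # heavy watchers contribute less
        for i in items:
            watchers[i] += 1
            for j in items:
                if i != j:
                    co[i][j] += iuf
    return {i: {j: c / math.sqrt(watchers[i] * watchers[j])
                for j, c in row.items()}
            for i, row in co.items()}

def recommend(u, user_items, sim, top_k=10):
    scores = defaultdict(float)
    for i in user_items[u]:
        for j, wij in sim.get(i, {}).items():
            if j not in user_items[u]:
                scores[j] += wij ** 2  # p(u,j) with squared similarities
    return sorted(scores, key=scores.get, reverse=True)[:top_k]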

2009-08-16

Graph Layout Project in Github

I have become more and more fluent with Git lately, so I put the graph visualization program I wrote some time ago on GitHub: http://github.com/xlvector/graph-layout/tree/master


I used this tool to visualize the relationships among projects on GitHub; some of the results are below:


http://twitpic.com/dzr27

2009-08-15

Some Reflections by clueless on the Netflix Prize

This is a blog post by clueless, reposted here. clueless is the nickname of Bill Bame, another member of The Ensemble; his blog is at http://information-density.blogspot.com/

There is one question that seems to come up in almost every competition-related conversation I have had, casual or professional: What did you learn? Actually, "Why did you bother?" is a fairly frequent question, too, but I'll save that for another time.

Oddly enough "what did I learn?" is also the question that I most frequently ask myself, and so far I haven't come up with any sort of gestalt answer. I find this comforting. A pat or encapsulated answer tends to trivialize any endeavor. Of course the more enlightened press seems to want to grab hold of the "big concept" ideas: crowdsourcing, contest-motivated problem solving, new paradigms for business analytics, that sort of thing. While those are important, and certainly make more interesting copy than explanations of why better movie rating predictions are good for Netflix, they don't offer any insight into the more personal explorations that contest participants undoubtedly experienced.

So, for better or worse, I'm going to try to cover some of what I, personally, learned. This will obviously take more than a single blog entry, and I have no doubt that some of my "revelations" will seem simple and obvious (possibly stupid), while others may seem more like heresy. There's not much I can do about that. After all, the only public opinion prediction engine I have available is incapable of doing more than making wild guesses – I refer, of course, to the squishy pattern matcher between my ears.



Algorithmists vs. Mathematicians
(My note: people in the Netflix contest generally fell into two backgrounds, programmers (engineering) and mathematicians (theory).)

This is something I learned very early in the competition, before I had interacted with any other competitors. It's really more of a personal discovery rather than something directly related to the competition, but I think it's a good place to start.

I am not a mathematician, statistician, or actuary. In fact, math was generally my worst subject when I was in grade school. Still, I did take quite a few math courses in college: 3 semesters of calculus, linear algebra, and differential equations. I would have avoided these courses like the plague, but they were required for one or more of the majors I blithely skipped between as an undergraduate. Once again, my grades in these courses were not stellar. But I found that I understood the theories we studied at least as well as most of my classmates. I even helped others figure out how to do homework problems. Why, then, was my own work so poor? My conclusion: I lacked the ability (or perhaps desire) to do dozens of contrived problems based on the single assumption that the method(s) used should be drawn from the chapter we were studying at the time. After the first few problems I would get bored and start looking for shortcuts – some of which worked, but most of which failed. I once spent half of the allotted time for a calculus test attempting a derivation that would greatly simplify the bulk of the problems presented. I could see that such a formula must exist, and was much more interested in finding it than doing the problems. Unfortunately I didn't finish the derivation, and subsequently failed the test. We learned the formula a couple of months later. At the time I didn't understand the point of hiding that formula. To be honest, I still don't.

(My note: clueless's strength is not mathematics.)

What does this have to do with the Netflix competition, you ask? Am I just using this as an excuse to rail against poor educational methods or cover up my character flaws? Good question.

One reason I stumbled across this particular life lesson was because I had been reading books by Wirth (particularly Algorithms + Data Structures = Programs) and Hunter (Metalogic: An Introduction to the Metatheory of Standard First Order Logic).

The academic study of algorithms can be a complex business. It often overlaps combinatorics in ways less obvious than combinatorial optimization; but that's a discussion for another day. In the study of algorithms there is much discussion of provability, efficiency (in terms of big O notation), and elegance. What it all comes down to, though, is finding a reliable, relatively efficient, sequence of steps by which a particular family of problems can be solved. An algorithmist (yes, I choose to use that term) rarely assumes that any algorithm is perfect, especially when seen in the harsh light of real world problems. Every act of publishing an algorithm contains an implicit invitation, or even challenge, to improve upon it. Where would we be if nobody had ever challenged the "bubble sort", after all? Computer science students, in my day at least, were encouraged (even expected) to look for better algorithms than those presented in the textbooks. Discovery of novel algorithms, even if they only fit a small subset of problems, was noteworthy. This was all considered good practice for life-after-college. Why, then, did the related field of mathematics seem to discourage rather than encourage the same sort of thinking?

Metalogic (which also overlaps combinatorics) revealed the intriguing thought that mathematics was just another logical system, and, as such, could be scrutinized from a perspective that was different than the one I had been so painfully taught. Unfortunately this was before the publication of Gödel, Escher, Bach, which would have saved me a lot of mental calisthenics. Still, the Hunter book led to Gödel. I delighted in Gödel's incompleteness theorem, willfully misinterpreting his conclusions to support my contention that mathematics was deeply and inherently flawed.

So, I went on my merry way, believing that mathematics was a necessary evil: something to be suffered through.

It was clear from my earliest exposure to the Netflix data that I would need to learn or re-learn some fairly basic math. For example: I certainly knew about gradient methods for problem solving, but finding the proper gradient for some of the formulas I was starting to come up with would require more knowledge of differential equations than I could muster. In the past I had relied on co-workers to handle that. It was equally clear that the bits and pieces of linear algebra that I remembered were not going to be adequate – I could barely remember how to construct and solve simple linear regression problems. That's what SAS and SPSS are for, right? Except I wasn't allowed to use commercial software.

(My note: mastering some basic mathematical knowledge is still very important.)

What happened next was surprising. I did sit down and grind through some old math texts; just enough to pick up what I needed. (Coincidentally I also read How to Solve It: Modern Heuristics at about the same time – if you've read it you'll know why that's relevant). Then some competitors started to publish Netflix Prize-related papers. These papers were often riddled with formulas (ugh, I thought, more math to slog through). My own notes rarely contained formulas, but had copious quantities of pseudo-code interspersed with the text. So, have you figured out my "brilliant" revelation yet? Here it is: these mathematicians were doing exactly the same thing that I was doing – only the syntax was different. All I had to do was translate the formulas into pseudo-code (or actual code) and what they were trying to say suddenly became very clear. I had a sudden urge to smack myself in the forehead with my palm. How stupid had I been all these years!? But this was applied math, right? Not one of the "pure" pursuits. So, out of morbid curiosity I started reading all sorts of math-related papers (gotta love Google, arxiv.org, and the web in general). Guess what I found? Mathematicians really do promote the same sort of thinking, and have the same sort of expectations, that algorithmists do. It's true that mathematicians have been around longer and explored more of their "world", and in some cases this has led to stagnation in how it is taught and performed, but, sans baggage, it's still all about finding solutions to problems. Who would have guessed.

An Example of Applying Diversity in the GitHub Contest

In the GitHub contest, diversity is an important consideration. The contest asks us to guess only one project each user might watch, so if we ignore diversity, the predictions for active users turn out poorly.

If we call users who watch 4 or fewer projects inactive and the rest active, current experiments show about 60% prediction accuracy for inactive users but only 40% for active ones.

The reason is that an active user has many projects he could plausibly watch, while the contest asks us to guess just one; it is like fishing a needle out of the sea. This is where diversity comes into play.

So far I have only tried a simple method: when recommending to an active user, keep only the first few projects of each category and down-weight the rest. This has proven quite useful.

It also shows that diversity matters for prediction as well.
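
A minimal sketch of this down-weighting, assuming every project carries a single category label and the candidate list is already scored (the cutoff and decay factor are assumptions):

from collections import defaultdict

def diversify(ranked, category, keep=2, decay=0.5):
    """ranked: list of (item, score) pairs, best first."""
    seen = defaultdict(int)
    rescored = []
    for item, score in ranked:
        c = category[item]
        if seen[c] >= keep:  # down-weight further items from a covered category
            score *= decay ** (seen[c] - keep + 1)
        seen[c] += 1
        rescored.append((item, score))
    return sorted(rescored, key=lambda pair: -pair[1])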

2009-08-12

Two Kinds of Veteran Users in Recommender Systems

Recommender systems often contain two different kinds of veteran users. Take book recommendation as an example.

The first kind buys many books, mostly bestsellers; this seems to be the majority of people.

The second kind also buys many books, but mostly obscure ones; such users are probably scholars of some field. For example, many of the math books I used to look up on Douban were so obscure that they did not even have cover images.

Both kinds are veteran users, but they surely have different habits.

If we find that in some field a user reads only the things nobody else reads, is he not likely an expert in that field? Take C++: beginners all read the bestsellers, such as C++ Primer and Thinking in C++, while those who read the non-bestsellers are mostly experts. There is a problem here, though: using reader counts alone to decide whether a book is a bestseller is inaccurate, because the book may simply be bad. Perhaps author and publisher information can help: if an author's books usually sell well but this one does not, it is probably a specialized book. (For example, few people read Hawking's papers, but his popular science books are bestsellers.)

2009-08-07

Netflix Is Holding Another Contest? Netflix Prize 2.0

http://www.netflixprize.com//community/viewtopic.php?pid=9371#p9371

This is Neil Hunt, Chief Product Officer at Netflix.

To everyone who participated in the Netflix Prize: You've made this a truly remarkable contest and you've brought great innovation to the field. We applaud you for your contributions and we hope you've enjoyed the journey. We look forward to announcing a winner of the $1M Grand Prize in late September.

And, like so many great movies, there will be a sequel.

The advances spurred by the Netflix Prize have so impressed us that we’re planning Netflix Prize 2, a new big money contest with some new twists.

Here’s one: three years was a long time to compete in Prize 1, so the next contest will be a shorter time limited race, with grand prizes for the best results at 6 and 18 months.

While the first contest has been remarkable, we think Netflix Prize 2 will be more challenging, more fun, and even more useful to the field.

Stay tuned for more details when we announce the winners of Prize 1 in September.

I am looking forward to Netflix's next contest and hope they will release more data and a more challenging problem.

2009-08-05

The Periodicity of User Interest: An Experiment on World War II Movies

Periodic effects matter in recommender systems: every year, certain things become popular on certain dates. I ran an experiment on the Netflix dataset to analyze on which days of the year World War II movies become popular.

I define a movie's popularity on a given day as the number of ratings it received that day divided by the total number of ratings given that day. I selected World War II movies such as Schindler's List, Pearl Harbor, and Saving Private Ryan; the full list of movies is at

http://en.wikipedia.org/wiki/List_of_World_War_II_films

I computed these WWII movies' popularity for every day of the year and found that the two most popular days are September 2 and June 5. This is a remarkable coincidence: September 2 is the day Japan surrendered, the anniversary of victory in World War II, and June 5 is the day of the Allied landings at Normandy. It shows that user interest really is periodic: when these two days come around each year, users are more inclined to watch WWII movies.

Moreover, the result is statistically significant; September 2 in particular is significantly more popular than other days. Below are the popularity scores of WWII movies on the top dates:

9-02 0.0096554
6-05 0.0070638
6-18 0.0068133
6-12 0.0067752
6-03 0.0067715
6-19 0.006768
8-20 0.0067643
6-04 0.0067132
6-11 0.0067011
6-09 0.0066593
7-30 0.0066514
6-02 0.0065948
6-17 0.0065529
8-14 0.0065498
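
A minimal sketch of the popularity measure defined above, folding all years onto a single calendar (the rating tuples here are hypothetical):

from collections import Counter

def daily_popularity(ratings, ww2_ids):
    """ratings: iterable of (movie_id, date) pairs, date being a datetime.date."""
    total = Counter()
    ww2 = Counter()
    for movie, date in ratings:
        day = date.strftime("%m-%d")  # aggregate the same calendar day across years
        total[day] += 1
        if movie in ww2_ids:
            ww2[day] += 1
    return {day: ww2[day] / total[day] for day in total}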

2009-08-04

Thoughts on Top-K versus RMSE Evaluation

Many people are confused by the difference between top-K evaluation and RMSE evaluation. I think the two actually address two different problems.

When designing a real recommender system, we cannot compute a user's rating for every movie, sort them all, and take the top K. In BellKor's paper, when they evaluate prediction with a top-K metric, they randomly select 1000 movies, rank them by predicted rating, and take the top K.

In a real system, we first use binary data to find a candidate set; this is the top-K step (it needs no ratings, only the 0-1 relation matrix). We then predict the user's ratings for the movies in the candidate set and rank the candidates by rating. So top-K and Netflix are not one problem but two different problems within a recommender system, and it is natural that they use different evaluation methods.

In Netflix I do not need to do top-K, because the candidate set is given: the quiz set. In a real system, we first do top-K and then rank the K candidates by predicted rating.

A simple example: top-K finds the movies a user is most likely to watch, ranked by the probability of watching, while rating prediction finds, among the movies a user may watch, the ones he will actually like (sometimes users rate even movies they dislike). Combining the two therefore finds the movies a user is most likely to watch and will like after watching.
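
A minimal sketch of this two-stage pipeline; the candidate generator and the rating model are passed in as functions because both are hypothetical stand-ins here:

def recommend(user, topk_candidates, predict_rating, k=100, n=10):
    """topk_candidates(user, k) -> candidates from 0-1 watch data;
    predict_rating(user, item) -> predicted rating for a candidate."""
    candidates = topk_candidates(user, k)  # stage 1: top-K candidate generation
    ranked = sorted(candidates, key=lambda m: predict_rating(user, m), reverse=True)
    return ranked[:n]  # stage 2: rank the candidates by predicted rating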

Forum for Recommender Systems on Google Groups

I have created a discussion group for recommender systems on Google Groups; everyone is welcome to join:

https://groups.google.com/group/resys

2009-08-03

Crowd Wisdom in Recommender Systems

I think the most important thing in designing a recommender system is not to find a single algorithm that produces the most accurate predictions and recommendations. No such algorithm exists. Users' preferences differ widely, and different types of users show different patterns, so no single model can fit everyone's habits. (If such a model existed, governments would have a very easy time managing their people.) The main task, then, is to design different recommendation algorithms and combine them in some way: classify the users, and for each class find a combination of algorithms that works well. Only in this way can we build a high-accuracy recommender system.

The Netflix Prize and the GitHub Contest solve different problems: the former is a prediction problem, the latter a recommendation problem. Essentially none of the Netflix algorithms can be used in the GitHub Contest (except KNN), but the idea of combining models applies everywhere. In Netflix we combined models with regression, while in GitHub we can combine models with bagging plus some stochastic optimization algorithms (SA and GA, simulated annealing and genetic algorithms, are well-known examples).
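
A minimal sketch of regression-based blending as in the Netflix case: fit weights on a held-out probe set by least squares, then apply them to each model's quiz predictions (the shapes and the use of numpy are assumptions):

import numpy as np

def fit_blend(probe_preds, probe_truth):
    """probe_preds: (n_samples, n_models) predictions; probe_truth: (n_samples,)."""
    w, *_ = np.linalg.lstsq(probe_preds, probe_truth, rcond=None)
    return w

def blend(preds, w):
    return preds @ w  # weighted combination of the individual models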

2009-08-02

My Story in NetflixPrize

I am a student from China. I graduated from the University of Science and Technology of China (USTC), and I now study at the Institute of Automation, Chinese Academy of Sciences (CASIA).

The reason I chose recommender systems as my research field is quite accidental. I used to do research on search engines. Last year I wrote a paper on finding similar web pages on the Internet and submitted it to WWW09. It was rejected, and a reviewer told me that the problem had already been studied by many researchers in recommender systems. That was when I first learned of collaborative filtering and recommender systems.

After my paper was rejected, I talked with another member of our lab, and we found a good conference on the topic: the ACM Conference on Recommender Systems. After reading some papers I thought it was a good research area, so I downloaded all the papers from the ACM RecSys conference. This was in February 2009, during the Spring Festival in China. I took all these papers home and read them over the holiday. While reading, I noticed two datasets that are often used in recommender systems and collaborative filtering: MovieLens and Netflix.

I tried MovieLens first because it is small. After doing some research on MovieLens, I wanted to try the Netflix data, and so I entered the Netflix Prize (2009-02-21).

In the Netflix Prize I tried SVD first, because I thought my computer could not hold the item-item matrix in memory. I found I was wrong, and after SVD I used KNN. I read all the papers from BPC, GPT, and other teams on the leaderboard. The last algorithm I implemented was RBM, because the RBM paper is hard to understand. My friend Wang Yuantao from Tsinghua University helped me implement RBM; he studies math at Tsinghua, so the formulas in the RBM paper were easy for him. After implementing RBM, I broke 0.87 in the Netflix Prize. Meanwhile, I wrote a paper about my time-dependent model, and it was accepted by WI09.

On June 26, when I woke up, I received a message from Wang telling me that someone had broken the 10% line. I was very surprised, because at the time PT was in first place (0.858x) and I thought they would need at least half a year to break the 10% line. I turned on my computer and found that PT had merged with BellKor and BigChaos. At the same time, I received two emails from Greg and Larry (Dace): they wanted to combine their results with mine, and I agreed. I joined VI then, and Larry joined VI at the same time. Everyone in VI knows the rest of the story, so I will not write it here.

2009-07-31

GitHub Contest


I am working on the GitHub contest with Daniel Haran. Unlike the Netflix Prize, the GitHub contest is a top-K recommendation task. I think it is the other important task in recommender systems.

Let's take a movie recommender system as an example. When we design one, we meet two problems:
1) Given a user, find which movies he/she will watch; that is, find a candidate movie set.
2) Within the candidate set, find which movies the user will like after watching them.

I think the first task is the top-K recommendation task (GitHub) and the second is the prediction task (Netflix Prize).

Solving these two tasks is the foundation of designing a good recommender system.

2009-07-28

Netflix Competitors Learn the Power of Teamwork

Published: July 27, 2009

A contest set up by Netflix, which offered a $1 million prize to anyone who could significantly improve its movie recommendation system, ended on Sunday with two teams in a virtual dead heat, and no winner to be declared until September.

But the contest, which began in October 2006, has already produced an impressive legacy. It has shaped careers, spawned at least one start-up company and inspired research papers. It has also changed conventional wisdom about the best way to build the automated systems that increasingly help people make online choices about movies, books, clothing, restaurants, news and other goods and services.

These so-called recommendation engines are computing models that predict what a person might enjoy based on statistical scoring of that person’s stated preferences, past consumption patterns and similar choices made by many others — all made possible by the ease of data collection and tracking on the Web.

“The Netflix prize contest will be looked at for years by people studying how to do predictive modeling,” said Chris Volinsky, a scientist at AT&T Research and a leader of one of the two highest-ranked teams in the competition.

The biggest lesson learned, according to members of the two top teams, was the power of collaboration. It was not a single insight, algorithm or concept that allowed both teams to surpass the goal Netflix, the movie rental company, set nearly three years ago: to improve the movie recommendations made by its internal software by at least 10 percent, as measured by predicted versus actual one-through-five-star ratings by customers.

Instead, they say, the formula for success was to bring together people with complementary skills and combine different methods of problem-solving. This became increasingly apparent as the contest evolved. Mr. Volinsky’s team, BellKor’s Pragmatic Chaos, was the longtime front-runner and the first to surpass the 10 percent hurdle. It is actually a seven-person collection of other teams, and its members are statisticians, machine learning experts and computer engineers from the United States, Austria, Canada and Israel.

When BellKor’s announced last month that it had passed the 10 percent threshold, it set off a 30-day race, under contest rules, for other teams to try to best it. That led to another round of team-merging by BellKor’s leading rivals, who assembled a global consortium of about 30 members, appropriately called the Ensemble.

Submissions came fast and furious in the last few weeks from BellKor’s and the Ensemble. Just minutes before the contest deadline on Sunday, the Ensemble’s latest entry edged ahead of BellKor’s on the public Web leader board — by one-hundredth of a percentage point.

“The contest was almost a race to agglomerate as many teams as possible,” said David Weiss, a Ph.D. candidate in computer science at the University of Pennsylvania and a member of the Ensemble. “The surprise was that the collaborative approach works so well, that trying all the algorithms, coding them up and putting them together far exceeded our expectations.”

The contestants evolved, it seems, along with the contest. When the Netflix competition began, Mr. Weiss was one of three seniors at Princeton University, including David Lin and Lester Mackey, who made up a team called Dinosaur Planet. Mr. Lin, a math major, went on to become a derivatives trader on Wall Street.

But Mr. Mackey is a Ph.D. candidate at the Statistical Artificial Intelligence Lab at the University of California, Berkeley. “My interests now have been influenced by working on the Netflix prize contest,” he said.

Software recommendation systems, Mr. Mackey said, will increasingly become common tools to help people find useful information and products amid the explosion of information and offerings competing for their attention on the Web. “A lot of these techniques will propagate across the Internet,” he predicted.

That is certainly the hope of Domonkos Tikk, a Hungarian computer scientist and a member of the Ensemble. Mr. Tikk, 39, and three younger colleagues started working on the contest shortly after it began, and in 2007 they teamed up with the Princeton group. “When we entered the Netflix competition, we had no experience in collaborative filtering,” Mr. Tikk said.

Yet based on what they learned, Mr. Tikk and his colleagues founded a start-up, Gravity, which is developing recommendation systems for commercial clients, including e-commerce Web sites and a European cellphone company.

Though the Ensemble team nudged ahead of BellKor’s on the public leader board, it is not necessarily the winner. BellKor’s, according to Mr. Volinsky, remains in first place, and Netflix contacted it on Sunday to say so.

And in an online forum, another member of the BellKor’s team, Yehuda Koren, a researcher for Yahoo in Israel, said his team had “a better test score than the Ensemble,” despite what the rival team submitted for the leader board.

So is BellKor’s the winner? Certainly not yet, according to a Netflix spokesman, Steve Swasey. “There is no winner,” he said.

A winner, Mr. Swasey said, will probably not be announced until sometime in September at an event hosted by Reed Hastings, Netflix’s chief executive. The movie rental company is not holding off for maximum public relations effect, Mr. Swasey said, but because the winner has not yet been determined.

The Web leader board, he explained, is based on what the teams submit. Next, Netflix’s in-house researchers and outside experts have to validate the teams’ submissions, poring over the submitted code, design documents and other materials. “This is really complex stuff,” Mr. Swasey said.

In Hungary, Mr. Tikk did not sound optimistic. “We didn’t get any notification from Netflix,” he said in a phone interview. “So I think the chances that we won are very slight. It was a nice try.”

2009-07-27

Some Thoughts on the Next Generation of Recommender Systems

First, congratulations to our team, The Ensemble, on taking first place on the leaderboard. The final winner has not been officially announced yet, but achieving this in the last 30 days is already a good result. Thanks to my teammates: without their help, I would never have learned how results like 0.855x are obtained.

Now that the contest is over, I have been thinking about existing recommender systems. In China, the one I have always admired most is Douban's, because its design is professional rather than a pile of simple tricks; they are said to use some of the industry's advanced algorithms. I have used Douban for a long time, and its recommender still has problems; these are not really Douban's fault, but problems that are genuinely hard for recommender systems to solve.

1. We know that pure collaborative filtering is not enough in a real system; we need content information, but we usually use content only to compute similarity. For example, given a book's author, publisher, title, and tags, we typically use them to compute book-to-book similarity and then recommend similar books. In my research, however, I found that users rely on different attributes to different degrees. Some users trust publishers: when I buy computer books, I only buy from a few famous publishers, because I do not trust the quality of the others. Sometimes the author matters more: for C++, I generally only buy books by the masters. Douban's recommender has not learned these preferences of mine (at least not fully); it has learned that I like C++ books, but not my requirements on author and publisher. Partly this is because I have not provided much preference data, and partly it is because these features are probably not being learned at all.

2. User reviews and natural language. When we buy things on Taobao, we often read the earlier buyers' reviews to decide what to do, so next-generation recommender systems need to exploit forum and review text. This involves sentiment analysis from natural language understanding, and complete accuracy is hard to achieve, but existing techniques are already enough for reviews to improve recommendation accuracy; however large the improvement, there will certainly be one.

3. Massive data. Consider web page recommendation, which is essentially personalized search. This problem differs from movie recommendation: in a movie recommender the number of movies is always smaller than the number of users, but in personalized search the number of users is far smaller than the number of pages. In this setting I think clustering is the most effective approach. Learning a user's preference for specific pages is hopeless, since the computation is too large, but we can first cluster the pages and then learn the user's attitude toward each cluster of pages (the clusters can be based on content, on interface, or on domain; there are many ways to cluster). For systems that also have many users, such as Google, we can cluster the users too and learn the preferences of a class of users for a class of pages. In the design of large systems, this can serve as a baseline (a minimal sketch follows the list below).

So, in my view, the three problems future recommender systems need to solve are:
1) how to integrate content features;
2) how to understand users' natural language;
3) how to handle massive data.
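
A minimal sketch of the cluster-to-cluster baseline from point 3, assuming the cluster assignments are given; the Laplace smoothing is an assumption:

from collections import defaultdict

def cluster_preference(visits, user_cluster, page_cluster, smooth=1.0):
    """visits: iterable of (user, page); returns P(page cluster | user cluster)."""
    counts = defaultdict(lambda: defaultdict(float))
    for u, p in visits:
        counts[user_cluster[u]][page_cluster[p]] += 1.0
    prefs = {}
    for uc, row in counts.items():
        norm = sum(row.values()) + smooth * len(row)
        prefs[uc] = {pc: (c + smooth) / norm for pc, c in row.items()}
    return prefs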

2009-07-17

Ruby

I have recently been designing a recommender system with a team, and they recommended Ruby to me. I had heard of the language from QQP before, but I felt I had already learned too many languages (C++, Java, Python, PHP), so I did not take up Ruby at the time. Since this project needs it, I bought a book yesterday and read through it.

I learned the syntax quickly and used it to implement the SVD++ model. The speed is of course not as good as C++, but about the same as Python, and there are many libraries, which makes lots of other things easier. They also recommended JRuby, saying its speed can now rival C++; I hope so.

I think a programmer today should master one language, be familiar with two or three, and know the syntax of five or six more. There are too many new languages these days.

P.S. The Netflix Prize ends in 10 days; I have to hurry, but there is still hope! Lately I have been working on user clustering, and it looks promising!

This Blog's Tribulations

This blog is reachable one moment and blocked the next. I have tried domestic blog services in the meantime, but their interfaces never suited me: too much like pages for post-90s kids. I prefer something clean that I can control myself. So even if this blog cannot be reached from China, I am too lazy to move; I believe it will be reachable again someday.

Besides, most people who read my blog can get over the wall anyway, so there is no need to move. Yesterday I finally learned how to get over it myself, so I can update again. Great!

I can accept some sites being blocked, but Blogspot should not be blocked wholesale. Blogs like mine that just talk about technology are neither pornographic nor subversive. Sigh...

2009-07-16

Writing a Blog Is Really Not Easy

Fewer and fewer sites can be reached now. Sigh!!

2009-04-30

Watched John Rabe Today

My girlfriend had a medical checkup in the morning; at noon someone treated us to the newly opened sushi place in Carrefour, very cheap, everything 3 yuan a plate. Bored in the afternoon, we went to the movies again. Having seen City of Life and Death, I had not planned to watch John Rabe too, but it was the only film showing when we arrived and the others were more than an hour away, so we watched John Rabe. A few days ago Lu Chuan said that City of Life and Death deliberately played down the role of the foreigners. Honestly, I think we would be better off without that kind of national sentiment. The foreigners really did give us enormous help in Nanjing, and we should praise them wholeheartedly. If we could not save ourselves, we should not erase the credit of those who helped.

Another thought after watching John Rabe: the evil Nazi flag plainly becomes a savior in this film. Of course, Rabe was not a typical Nazi.

2009-04-27

Watching City of Life and Death (南京!南京!)

Today my girlfriend and I went to see City of Life and Death at the Meijia cinema in Zhongguancun. I will not bother to judge or condemn the Japanese. Toward the Chinese, I hold the same view as Lu Xun: grieve for their misfortune, rage at their refusal to fight. Enough said; we must strive so that history does not repeat itself.

2009-04-22

Two Books by Li Ao

I have become very interested in Li Ao lately and have watched every episode of his show 李敖语妙天下. Yesterday I bought two of his books from Joyo: his memoirs and his critical biography of Chiang Kai-shek. I skimmed the memoirs; the style is as sharp as expected, nothing veiled, saying exactly what he wants to say.

2009-04-15

The ICML 2008 Paper List

Reposted from http://www.informatik.uni-trier.de/~ley/db/conf/icml/icml2008.html