2009-08-23

昨天在奇遇咖啡关于Netflix Prize的报告

昨天在西直门的奇遇咖啡,和很多推荐系统的experts做了一下交流,同时谈了谈我在Netflix Prize的经历。也不知道讲的如何,呵呵。我们实验室已经很久没有组会了,我也很久没有做过这种长篇的报告。

感谢clickstone和wanghongtao的组织,同时感谢王元涛对RBM的阐述。

下面是Guozhu.Wen拍的一些照片



2009-08-21

net@night interview with some Ensemble members (NetflixPrize)

URL is: http://www.twit.tv/natn114

嘿嘿,这个页面可以下载MP3,我听了一遍,基本上听懂了,看来我的听力这几年进步的挺大。嘿嘿。

采访中的3个人,在Ensemble中都是很活跃的,我们合作的很好。

2009-08-18

my solutions of github contest - remove unlike items

推荐系统的目的是为用户推荐他喜欢的东西,不过这个问题有一个反问题,就是如何找到一个用户不喜欢的东西。

这个想法在Github中很有作用,我一般是通过极值的方法来去掉用户不喜欢的东西。

举一个例子,比如一个用户watch了很多工程,但他watch的工程中最popular的工程的popular程度为A,那么我们可以认为,这个用户再watch超过这个popularity的可能性会很低。

再举一个例子,一个用户watch了很多工程,这些工程都是2006年之后创建的,那么我们可以认为,这个用户watch 2006年前的工程的可能性也很低。

或者,一个用户watch了很多工程,但这些工程都是小程序,行数只有几百行,那么也可以认为,这个用户watch大工程的可能性也很低

可以举出很多这样的例子,每用一种极值,对结果的提高都有好处。不过,这些方法只对watch了很多工程的用户才有作用,因为只有他们的极值是比较可靠的(置信概率比较高)

2009-08-17

my solutions of github contest - item based KNN

文栋以前让我写一些具体的算法问题,所以从今天起,我详细介绍一下我在Github中用到的所有算法,我用中文写。

我的Github Contest解决方案 : item-based KNN

item-based KNN是top-K推荐问题中用的最广泛的一个方法,他的相关论文有

Item-based collaborative filtering recommendation algorithms
Item-based top-n recommendation algorithms
Amazon. com recommendations: Item-to-item collaborative filtering

在github contest里面,我首先使用了item-based KNN,不过具体的实现细节和前面几篇论文不太一样,主要有下面几点

1) 如果两个工程被同一个用户watch过,那这个用户肯定给这两个工程贡献一定的相似度。在传统的相似度计算中,不同的用户贡献相似度的能力是相同的,不过我们考虑两个用户,一个看了100个工程,一个只看过两个工程,那么看过2个工程的用户贡献的相似度应该要高于看过100个工程的用户。(这个效应被称为inverse
user frequence,是和信息检索中的idf相对应的)

2) 推荐过程,对于一个用户,我们找出他曾经watch过的所有工程,然后对每个工程找出和他相似的工程,从而找出这个用户没有watch过得,但是和他watch过的工程最相似的工程。比如一个工程j,一个用户u,那么u对j的喜欢程度定义为

p(u,j) = sum_{i in N(u)} w(i,j)

这里的w(i.j)就是i和j两个工程的相似度,N(u)是u
watch过的所有工程。因为w(i,j)是线性相关系数,在我的实现中,我对w(i,j)进行了平方,这样的目的主要是削弱小相似度的影响,因为w(i,j)是大于0小于1的

2009-08-16

Graph Layout Project in Github

最近用Git越来越熟练了,于是把以前开发的图可视化程序放到了Github上,http://github.com/xlvector/graph-layout/tree/master


我用这个工具对github上的工程之间的关系进行了可视化,下面是一些结果


http://twitpic.com/dzr27

2009-08-15

clueless关于NetflixPrize的一些感悟

这是clueless的一篇博客,我转载过来。clueless是Bill Bame的昵称,他也是Ensemble的一员,他的博客地址是 http://information-density.blogspot.com/

There is one question that seem to come up in almost every competition-related conversation I have had, casual or professional: What did you learn? Actually, "Why did you bother?" is a fairly frequent question, too, but I'll save that for another time.

Oddly enough "what did I learn?" is also the question that I most frequently ask myself, and so far I haven't come up with any sort of gestalt answer. I find this comforting. A pat or encapsulated answer tends to trivialize any endeavor. Of course the more enlightened press seems to want to grab hold of the "big concept" ideas: crowdsourcing, contest-motivated problem solving, new paradigms for business analytics, that sort of thing. While those are important, and certainly make more interesting copy than explanations of why better movie rating predictions are good for Netflix, they don't offer any insight into the more personal explorations that contest participants undoubtedly experienced.

So, for better or worse, I'm going to try to cover some of what I, personally, learned. This will obviously take more than a single blog entry, and I have no doubt that some of my "revelations" will seem simple and obvious (possibly stupid), while others may seem more like heresy. There's not much I can do about that. After all, the only public opinion prediction engine I have available is incapable of doing more than making wild guesses – I refer, of course, to the squishy pattern matcher between my ears.



Algorithmists vs. Mathematicians
Netflix竞赛中的人一般就是分为两种背景,程序员(工程人员)和数学家(偏重理论)

This is something I learned very early in the competition, before I had interacted with any other competitors. It's really more of a personal discovery rather than something directly related to the competition, but I think it's a good place to start.

I am not a mathematician, statistician, or actuary. In fact, math was generally my worst subject when I was in grade school. Still, I did take quite a few math courses in college: 3 semesters of calculus, linear algebra, and differential equations. I would have avoided these courses like the plague, but they were required for one or more of the majors I blithely skipped between as an undergraduate. Once again, my grades in these courses were not stellar. But I found that I understood the theories we studied at least as well as most of my classmates. I even helped others figure out how to do homework problems. Why, then, was my own work so poor? My conclusion: I lacked the ability (or perhaps desire) to do dozens of contrived problems based on the single assumption that the method(s) used should be drawn from the chapter we were were studying at the time. After the first few problems I would get bored and start looking for shortcuts – some of which worked, but most of which failed. I once spent half of the allotted time for a calculus test attempting a derivation that would greatly simplify the bulk of the problems presented. I could see that such a formula must exist, and was much more interested in finding it than doing the problems. Unfortunately I didn't finish the derivation, and subsequently failed the test. We learned the formula a couple of months later. At the time I didn't understand the point of hiding that formula. To be honest, I still don't.

clueless的强项不在数学

What does this have to do with the Netflix competition, you ask? Am I just using this as an excuse to rail against poor educational methods or cover up my character flaws? Good question.

One reason I stumbled across this particular life lesson was because I had been reading books by Wirth (particularly Algorithms + Data Structures = Programs) and Hunter (Metalogic: An Introduction to the Metatheory of Standard First Order Logic).

The academic study of algorithms can be a complex business. It often overlaps combinatorics in ways less obvious than combinatorial optimization; but that's a discussion for another day. In the study of algorithms there is much discussion of provability, efficiency (in terms of big O notation), and elegance. What it all comes down to, though, is finding a reliable, relatively efficient, sequence of steps by which a particular family of problems can be solved. An algorithmist (yes, I choose to use that term) rarely assumes that any algorithm is perfect, especially when seen in the harsh light of real world problems. Every act of publishing an algorithm contains an implicit invitation, or even challenge, to improve upon it. Where would we be if nobody had ever challenged the "bubble sort", after all? Computer science students, in my day at least, were encouraged (even expected) to look for better algorithms than those presented in the textbooks. Discovery of novel algorithms, even if they only fit a small subset of problems, was noteworthy. This was all considered good practice for life-after-college. Why, then, did the related field of mathematics seemed to discourage rather than encourage the same sort of thinking?

Metalogic (which also overlaps combinatorics) revealed the intriguing thought that mathematics was just another logical system, and, as such, could be scrutinized from a perspective that was different than the one I had been so painfully taught. Unfortunately this was before the publication of Gödel, Escher, Bach, which would have saved me a lot of mental calisthenics. Still, the Hunter book lead to Gödel. I delighted in Gödel's incompleteness theorem, willfully misinterpreting his conclusions to support my contention that mathematics was deeply and inherently flawed.

So, I went on my merry way, believing that mathematics was a necessary evil: something to be suffered through.

It was clear from my earliest exposure to the Netflix data that I would need to learn or re-learn some fairly basic math. For example: I certainly knew about gradient methods for problem solving, but finding the proper gradient for some of the formulas I was starting to come up with would require more knowledge of differential equations than I could muster. In the past I had relied on co-workers to handle that. It was equally clear that the bits and pieces of linear algebra that I remembered were not going to be adequate – I could barely remember how to construct and solve simple linear regression problems. That's what SAS and SPSS are for, right? Except I wasn't allowed to use commercial software.

掌握一些基础的数学知识还是很重要的。

What happened next was surprising. I did sit down and grind through some old math texts; just enough to pick up what I needed. (Coincidentally I also read How to Solve It: Modern Heuristics at about the same time – if you've read it you'll know why that's relevant). Then some competitors started to publish Netflix Prize-related papers. These papers were often riddled with formulas (ugh, I thought, more math to slog through). My own notes rarely contained formulas, but had copious quantities of pseudo-code interspersed with the text. So, have you figured out my "brilliant" revelation yet? Here it is: these mathematician's were doing exactly the same thing that I was doing – only the syntax was different. All I had to do was translate the formulas into pseudo-code (or actual code) and what they were trying to say suddenly became very clear. I had a sudden urge to smack myself in the forehead with my palm. How stupid had I been all these years!? But this was applied math, right? Not one of the "pure" pursuits. So, out of morbid curiosity I started reading all sorts of math-related papers (gotta love Google, arxiv.org, and the web in general). Guess what I found? Mathematicians really do promote the same sort of thinking, and have the same sort of expectations, that algorithmists do. It's true that mathematicians have been around longer and explored more of their "world", and in some cases this has lead to stagnation in how it is taught and performed, but, sans baggage, it's still all about finding solutions to problems. Who would have guessed.

2009-08-13

BellKor's article on matrix factorization methods in in the new edition of IEEE Computer

http://www.research.att.com/%7Evolinsky/papers/ieeecomputer.pdf

Hadoop MapReduce for Netflix Prize

Interesting...

Github Contest关于diversity应用的实例

在Github比赛中,考虑diversity是一个重要的因素。因为这个比赛,对每个用户只要猜出一个可能watch的工程,那么如果不考虑
diversity,对于活跃用户的预测就会很差。

如果我们把watch4个以下的用户称为非活跃,反之称为活跃。那么现在的实验表示,对于非活跃用户有60%的预测准确率,而活跃用户只有40%。

这个原因是,活跃用户实际有很多潜在可能watch的工程,而比赛只要我们猜出1个,这个就像大海捞针,是比较困难的。这个时候diversity就

作用了。

我现在只考虑了简单的方法,对于同一类的,给活跃用户推荐时,只选择前几个,后面的降低权重,这个方法目前是很有用的。

这个也说明,diversity在预测时也是很重要的。

2009-08-12

推荐系统中两种不同的老用户

推荐系统上往往存在两种不同的老用户,举一个图书推荐的例子。

第一种用户,买了很多书,但大都是畅销书,这种用户似乎是大多数人

还有一种用户,也买了很多书,但大都是很偏的书,这种用户似乎是某方面的学者,比如我以前在豆瓣查很多数学方面的书都非常偏,很多连封面都没有

这两种都是老用户,但是他们肯定是有不同的习惯的。

如果我们发现某个用户在某个领域尽看那些别人不看的东西,这个人是不是很可能是这个领域的专家?
比如C++,初学者肯定都看一些畅销的书,比如C++ Primer Thinking in C++。而越是看不畅销的书的大都是专家。 不过这里面还有一个问题,用看的人多少来描述一个书是否畅销,也不准确,可能是因为书确实很烂,也许可以用一些作者,出版社的消息, 比如这个作者写的书很畅销,但就这一本不畅销,那可能这本是比较专业的书,嘿嘿。(比如霍金的论文很少人看,但他的科普读物很 畅销,嘿嘿)

2009-08-07

Netflix又要办比赛了? Next Netflix Prize 2.0

http://www.netflixprize.com//community/viewtopic.php?pid=9371#p9371

This is Neil Hunt, Chief Product Officer at Netflix.

To everyone who participated in the Netflix Prize: You've made this a truly remarkable contest and you've brought great innovation to the field. We applaud you for your contributions and we hope you've enjoyed the journey. We look forward to announcing a winner of the $1M Grand Prize in late September.

And, like so many great movies, there will be a sequel.

The advances spurred by the Netflix Prize have so impressed us that we’re planning Netflix Prize 2, a new big money contest with some new twists.

Here’s one: three years was a long time to compete in Prize 1, so the next contest will be a shorter time limited race, with grand prizes for the best results at 6 and 18 months.

While the first contest has been remarkable, we think Netflix Prize 2 will be more challenging, more fun, and even more useful to the field.

Stay tuned for more details when we announce the winners of Prize 1 in September.

很期待Netflix的下一场比赛,希望能公布更多的数据,以及更有挑战力的问题。

2009-08-05

用户兴趣的周期效应,一个关于二战电影的实验

在推荐系统中,周期效应是很重要的。比如,每年到了某个日子,总有一些东西成为热门。我在Netflix数据集上做了一个实验,分析了一下二战电影在每年的哪些日子会成为热门。

我定义一个电影在某一天的热门程度为这个电影在这一天被rate的次数除以这一天所有的rating数目。我选择了像辛德勒名单,珍珠港,拯救大兵瑞恩这样的二战电影,所有的电影列表见

http://en.wikipedia.org/wiki/List_of_World_War_II_films

我计算了这些二战电影在每一天的热门程度,然后我发现,最热门的两天是:9月2号和6月5号。
这实在是太巧了,9月2号是日本投降的日子,也是二战胜利纪念日,6月5号是盟军诺曼底登陆的日子。真是太巧了,说明用户的兴趣确实是有周期性的,每年到了这两天,用户就比较喜欢看二战电影。

同时,这个结果还是具有统计上的显著性的,特别是9月2号,比其他日子显著的热门,下面是前几个日期二战电影的热门程度:

9-02 0.0096554
6-05 0.0070638
6-18 0.0068133
6-12 0.0067752
6-03 0.0067715
6-19 0.006768
8-20 0.0067643
6-04 0.0067132
6-11 0.0067011
6-09 0.0066593
7-30 0.0066514
6-02 0.0065948
6-17 0.0065529
8-14 0.0065498

2009-08-04

对TopK和预测RMSE的看法

很多人都困惑与TopK和RMSE的评测的区别,我感觉其实这两种评测方法解决的是不同的两个问题。

在设计实际的推荐系统时,我们不可能计算一个用户对所以电影的评分,然后排序,找出topK。在BellKor的论文中,他用TopK评测预测问题时, 是随机选出1000个电影,然后评分排序,得出TopK。

实际的系统中,我们需要用binary data首先找出一个候选集,这个过程其实是TopK的过程(这个过程其实不需要评分,只需要关系0-1矩 阵),然后我们计算用户对候选集中电影的评分,然后对候 选集用评分排序。所以说,topk和netflix其实不是一个问题,而是推荐系统中两个不同的问题,所以用不同的评测方法也是应该的。

在Netflix中,我不需要做TopK,因为候选集已经给定了,就是quiz。在实际系统中,我们需要先做TopK,然后用评分对TopK中的K个候选物品评分排序。

举个简单的例子。 topk是找到用户最可能看的电影,他的排名是根据用户看电影的可能性排名的,而rating是在用户可能看的电影中找出用户喜欢的电影,因为有的时候 用户也会对不喜欢的电影评分。 所以这两者结合的结果就是,找出用户最可能看,且看了之后会喜欢的电影。

推荐系统在线讨论组 Forum for resys in Google Group in China

在Google Group上建了一个推荐系统的讨论组,欢迎大家加入

I build a forum for recommender system in Google Groups, welcome to join

https://groups.google.com/group/resys

2009-08-03

Cloud Wisdom in Resys 推荐系统中的群体智慧

I think, the most important thing in design resys is not to find a single algorithm which produce the most accurate predictions and recommendations. There is no such algorithms. Users preference is very different and different types of users have different patterns. Therefore, there is no single model which can meet everyone's habit.

我认为,推荐系统中最重要的问题不是找到一个能够产生极高精度的模型,因为这样的模型是不存在的。不同的用户有不同的兴趣模式,在这个世界上,我们不可能用一个模型来规范所以人的行为。如果这种模型存在,那么政府就很好管理人民了。所以,推荐系统的主要任务是设计不同的推荐算法,然后将这些推荐算法通过一定的方式组合起来。对用户进行分类,然后对某一类用户找到比较好的算法组合,只有这种方式才能设计出高精度的推荐系统。

Netflix PrizeGitHub Contest解决的是不同的问题,前面解决的是预测问题,后面解决的是推荐问题。基本上来说,Netflix中的算法几乎是不能用到Github Contest中的(除了KNN),但是模型组合的思想是放诸四海而皆准的。在Netflix中,我们用回归来组合模型,而在Github中,我们可以通过Bagging加上一些随机优化算法来组合模型(SAGA都是著名的随机优化算法)。

2009-08-02

My Story in NetflixPrize

I am a student from China and I graduated from University of Science and Technology of China (USTC). Now, I study in Institute of Automation, Chinese Academy of Sciences (CASIA).

The reason why I choose recommender system as my research field is very accidental. I do research on search engine before and last year, I wrote a paper of how to find similar web pages in Internet and submit it to WWW09. However, it is rejected by WWW09 and the reviewer tell me this have been studied by many other researchers in resys. At that time, I know the existing of collaborative filtering and resys.

After my paper was rejected, I told with another member in our lab and we found a good conference about resys, that is "ACM Conference on Recommender Systems". After I read some papers about resys, I though resys is a good research area and then I downloaded all papers in ACM resys conference. This happened in Feb. 2009 and it was Spring Festival in China. I took all these papers to home and read them in my holiday. When I read papers, I found two data set is often used resys and CF, that is movielens and Netflix.

I tried movielens firstly because it is small. After doing some researches in movielens, I want to try Netflix data and then I attend Netflix Prize (2009-2-21).

I tried SVD firstly in NetflixPrize because I thought my computer can not store item-item matrix in memory. However, I found I was wrong and I used KNN in Netflix after SVD. I read all papers from BPC, GPT and other members in Leaderboard. The last algorithm I implemented was RBM because the paper of RBM is hard to understand. My friend Wang Yuantao from Tsing-Hua University help me realized RBM. He study math in Tsing-Hua, so understand formulas in RBM paper is easy for him. After I implement RBM, I break 0.87 in NetflixPrize. Meanwhile, I wrote a paper about my time-dependent model and this paper is accepted by WI09.

In June 26, when I woke up, I received a message from Wang and he tell me someone have break 10% line. I was very surprised because at the time, PT was in the firstly place (0.858x) and I thought they need at least half a year to break 10% line. I opened the computer and I found PT is merged with BellKor and Bigchaos. At the same time, I received two emails from Greg and Larry(Dace). They want to combine their results with me and I agreed. Then, I joined VI and Larry also joined VI in the same time. The following story everyone in VI knows and I will not write it.