






Some statistics from Google Reader

I downloaded the broadcast feeds of many Google Reader users. A user's feed lists the articles that user shared, and each article carries a list of the users who liked it. We can therefore run a breadth-first-search crawl over the broadcast feeds of all users who have liked more than one article.
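
The crawl described above can be sketched roughly as follows. This is only a sketch under assumptions: `fetch_feed` is a hypothetical callable (not a real Google Reader API) that returns, for one user, a list of `(article_id, [ids of users who liked it])` pairs, and `min_likes` and `max_users` are illustrative parameters of my own.

```python
from collections import deque


def crawl_broadcast_feeds(seed_user, fetch_feed, min_likes=1, max_users=1000):
    """Breadth-first crawl of broadcast feeds, starting from one user.

    fetch_feed(user_id) is a hypothetical callable returning a list of
    (article_id, [user ids who liked the article]) pairs.
    A user is queued once they have liked more than `min_likes` articles.
    """
    shares = {}          # user id -> list of article ids the user shared
    likes = {}           # article id -> set of user ids who liked it
    liked_by_user = {}   # user id -> set of article ids the user liked
    seen = {seed_user}
    queue = deque([seed_user])

    while queue and len(shares) < max_users:
        user = queue.popleft()
        feed = fetch_feed(user)
        shares[user] = [article for article, _ in feed]
        for article, likers in feed:
            likes.setdefault(article, set()).update(likers)
            for liker in likers:
                liked = liked_by_user.setdefault(liker, set())
                liked.add(article)
                # Only follow users who liked more than min_likes articles.
                if len(liked) > min_likes and liker not in seen:
                    seen.add(liker)
                    queue.append(liker)
    return shares, likes
```

A real crawler would also need rate limiting and a way to resume, which this sketch leaves out.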

I only crawled the feeds of users who like Chinese articles. After one week of crawling, I had collected only 12586 such users and 51690 Chinese articles.

I extracted which user likes which article. The results show that a user likes about 10 articles and shares about 8 articles on average.

Here are some results:

Number of users: 12586
Number of items: 51690
Number of like records: 127616
Number of share records: 99937

Netflix updates the leaderboard

Netflix updated the leaderboard today and now reports the RMSE on the test set for every competitor. A screenshot follows.
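
For readers who have not followed the contest: RMSE is the root mean squared error between predicted and actual ratings, and the contest reports progress as the percentage improvement over a baseline RMSE. A minimal sketch (the function names are my own, and the numbers in the usage note are made up for illustration, not contest figures):

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equally long rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of the same length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def improvement_percent(baseline_rmse, new_rmse):
    """Relative RMSE improvement over a baseline, in percent."""
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse
```

For example, `improvement_percent(1.0, 0.9)` is about 10, meaning a 10% improvement over a baseline RMSE of 1.0.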

I am #13 on the leaderboard, and The Ensemble is now #2.

The whole dataset can be downloaded here

The report from the winning team, BPC, can be downloaded here


YouTube: users only rate videos they like very much or hate very much

YouTube published a chart on its official blog showing the distribution of user ratings from 1 to 5 stars. The overwhelming majority of ratings are 5 stars, the second most common rating is 1 star, and ratings of 2, 3 and 4 stars are very rare. YouTube's explanation is that users only use the rating system when they like a video very much or dislike it very much.

This result is different from the ratings on IMDB and Netflix. I think the first thing users want to do on IMDB is rate a movie, because they cannot watch movies there. On YouTube, however, the most important task is watching videos. So if the videos are merely OK, users will not stop watching to rate them. They only rate a video when it is so special that they feel they must express their view.

Photos of The Ensemble from the Netflix Prize

This is the first time I have seen my teammates in a photo. The photos come from Chefele, a member of The Ensemble.

I did not attend the ceremony, so I am not in the photo.

Ten of the 30+ members of "The Ensemble" posed for this team photo while attending the Netflix Prize awards ceremony in New York on 9/21/2009. From left to right: Joe Sill, Bill Bame, David Weiss, Lester Mackey, Greg McAlpin, Jacob Spoelstra, Chris Hefele, Ces Bertino, Craig Carmichael, Bo Yang.


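Netflix Prize: we lost

Netflix has officially announced the result of the Netflix Prize. BPC won the championship. Their result on the test set is 10.06%, and of course ours is also 10.06%. We lost because we submitted more than 20 minutes later than BPC. Sigh, fate plays tricks on us.

Netflix also announced its next contest. As I understand it, the task is to use content information to solve the cold-start problem. However, Netflix has not said how the algorithms will be evaluated, so the problem is not yet fully defined.

I plan to finish the papers I am working on before deciding whether to take part in Netflix 2. If it is purely content filtering, it does not seem very interesting to me. It is fine as a hobby, but not very suitable as a research direction.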





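Collecting data from Google Reader

My intuition tells me that the share and Like features of Google Reader will have a big impact on personalized article recommendation. Recently I have been crawling Google Reader data, mainly through the following feed link: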



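This link gives the articles shared by user 06601636036055060713 and, for each article, the ids of the users who liked it. So starting from this link, we can crawl the whole of Google Reader's data by breadth-first search (but not too aggressively, or we will get blocked). The crawl needs to be refreshed every day to pick up the latest shares.

P.S. I really hope Google Reader will provide users' follow data, which would be very valuable for research combining social networks and recommender systems.

Finally, I recommend a recommender system that kuber built on Google Reader data: http://www.feedzshare.com/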


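I like Google Reader more and more these days

Google Reader is becoming more and more social, and I increasingly feel that the social web is the direction of the future. There is too much information on the Internet now. In real life we also tend to rely on friends to filter information. In Google Reader I now rarely read my own subscriptions, because there are far too many items to get through. Every day I just read what my friends share.

Google Reader is probably still at the data-collection stage and has not yet put effort into data mining. I think that if recommender-system methods were applied in Google Reader, users' shared items could be re-ranked to better surface what each user likes.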


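Official news: we won the GitHub Contest 2009

Today GitHub's official blog announced that Daniel Haran and I won the GitHub Contest 2009. Many people have asked me how we will split the bottle. My agreement with Daniel is that he gets the account (I can use it too, of course, but he is the owner) and I get the bottle. So drinkers can come and find me, haha.

In fact, the design of this GitHub Contest was not very sound. For example, unlimited submissions should not have been allowed, the data should have been transformed to stop contestants from using outside information, and too much content information was provided. Still, winning the prize makes me very happy. Now I am just waiting for Daniel to send me that bottle, haha.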






Github contest: Winning the bottle of Pappy

Daniel Haran and I worked together and won the Github Contest. The following
is the blog post by Daniel Haran:

The Github contest ended last week-end. Liang Xiang (xlvector) and I
cooperated on an entry that took first place.

We win a bottle of aged Pappy Van Winkle, a large github account and
bragging rights among our fellow geeks.

Scott Chacon asked me to write up about the algorithms we used; that's
now in the project's README.

I learned 3 more lessons from the Github contest.


While our main focus was improving our score, we did stumble on one
interesting avenue for future research.

Imagine a user who watches Rails projects for work, and Haskell
related ones for fun. Treating that user as 2 distinct users should
outperform the simpler approach. Using a kNN algorithm with too large
a value of k and 2+ very different interest clusters would guarantee
poor performance.

Given the time constraint we decided to just increase recommendation diversity.

Ensembles win, but require preparation

Netflix taught us that ensembles win. Ilya Grigorik submitted an entry
exploiting that fact and wrote about it Collaborative Filtering with

The ideal co-operation scenario would have involved participants using
the same training data and result file formats. Had I realized that
sooner, the winning entry would have included Jeremy's results.


Avoiding over-fitting is usually very easy, and early on we decided to
test locally on our own training data subset and test file. Ideally,
we would have had time to do the same thing with each of the
heuristics and data sources we were using. Not doing so ironically
resulted in over-emphasizing those weights in our blending, thereby
underfitting the test set.
Thanks to Scott and Github for giving us a challenge between Netflix contests.


Why users give low ratings to books they bought

In a book recommender system, users sometimes give low ratings to
books they bought. Here, I suppose there is no cheating in the system
and users are not forced to rate items.

A user buys a book because he or she likes it. So why would the user
give it a low rating after reading it? It may be because the book did
not satisfy the user's interest. The user likes most of the properties
of the book (the topic, the author, or the fact that a close friend
recommended it); otherwise, he would not have bought it. But one or
two disadvantages of the book may lead him to give it a low rating.

This seems like common sense. But we can draw one conclusion
from this example: if a user gives a high rating to a book, he may like
all of its properties, but if a user gives a low rating to a
book, he may only dislike part of it, because if he disliked every part
of the book, he would never have bought it.


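My talk about the Netflix Prize at 奇遇咖啡 (Qiyu Cafe) yesterday

Yesterday at 奇遇咖啡 near Xizhimen, I exchanged ideas with many recommender-system experts and talked about my experience in the Netflix Prize. I am not sure how the talk went, hehe. Our lab has not had a group meeting for a long time, and I have not given a long talk like this for a long time either.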




net@night interview with some Ensemble members (NetflixPrize)

URL is: http://www.twit.tv/natn114




My solution to the GitHub Contest: removing unlikely items




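Here is another example: if a user watches many projects and all of them were created after 2006, we can assume that this user is also unlikely to watch a project created before 2006.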
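
The creation-date example can be sketched as a simple candidate filter. This is a simplification of the idea rather than my contest code: the post only says such projects are unlikely, while the sketch drops them outright. The function name and the `created_year` mapping (project id to creation year) are assumptions made for illustration.

```python
def remove_unlikely_by_year(candidates, watched, created_year):
    """Drop candidates created before the oldest project the user watches.

    candidates:   list of candidate project ids, in ranked order
    watched:      project ids the user already watches
    created_year: hypothetical dict mapping project id -> creation year
    Candidates with an unknown creation year are kept.
    """
    watched_years = [created_year[p] for p in watched if p in created_year]
    if not watched_years:
        return list(candidates)
    earliest = min(watched_years)
    return [p for p in candidates
            if created_year.get(p, earliest) >= earliest]
```

A softer variant would down-weight such candidates instead of removing them.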




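My solution to the GitHub Contest: item-based KNN

Item-based KNN is the most widely used method for top-K recommendation. Related papers include:

Item-based collaborative filtering recommendation algorithms
Item-based top-n recommendation algorithms
Amazon.com recommendations: Item-to-item collaborative filtering

In the GitHub Contest I started with item-based KNN, but my implementation differs from those papers in a few details:

1) If two projects are watched by the same user, that user contributes some similarity between the two projects. In the traditional similarity computation, every user contributes equally. But consider two users, one who watches 100 projects and one who watches only 2: the user who watches 2 projects should contribute more similarity than the user who watches 100. (This effect is called inverse user frequency, and corresponds to idf in information retrieval.)

2) Recommendation. For a user, we take all the projects he has watched and, for each of them, find the projects similar to it. This gives the projects the user has not watched that are most similar to the ones he has. For a project j and a user u, how much u likes j is defined as

p(u,j) = sum_{i in N(u)} w(i,j)

where N(u) is the set of projects watched by u and w(i,j) is the similarity between projects i and j.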
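
A rough sketch of the two points above. The exact weighting I used in the contest is not given here, so the `1 / log(1 + |N(u)|)` contribution per co-watching user is an assumed form of inverse user frequency, the similarity is left unnormalized, and the function names and the optional `k` cut-off are illustrative.

```python
import math
from collections import defaultdict


def item_similarity(user_items):
    """w[i][j]: sum over users watching both i and j of 1 / log(1 + |N(u)|).

    user_items maps a user id to the projects that user watches.
    The log-based weight is an assumed inverse-user-frequency form.
    """
    w = defaultdict(lambda: defaultdict(float))
    for items in user_items.values():
        items = set(items)
        if len(items) < 2:
            continue  # a single watched project creates no pairs
        contribution = 1.0 / math.log(1.0 + len(items))
        for i in items:
            for j in items:
                if i != j:
                    w[i][j] += contribution
    return w


def recommend(user, user_items, w, k=None, top_n=10):
    """Rank unwatched projects by p(u, j) = sum of w(i, j) over watched i.

    If k is given, only the k most similar neighbours of each watched
    project contribute to the score.
    """
    watched = set(user_items[user])
    scores = defaultdict(float)
    for i in watched:
        neighbours = sorted(w.get(i, {}).items(),
                            key=lambda kv: (-kv[1], kv[0]))[:k]
        for j, sim in neighbours:
            if j not in watched:
                scores[j] += sim
    return sorted(scores, key=lambda j: (-scores[j], j))[:top_n]
```

With this weighting, a user who watches 2 projects adds `1 / log(3)` to their pair, while a user who watches 100 projects adds only `1 / log(101)` to each of their pairs.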



Graph Layout Project in Github






This is a blog post by clueless, which I am reposting here. clueless is the nickname of Bill Bame, who is also a member of The Ensemble. His blog is at http://information-density.blogspot.com/

There is one question that seems to come up in almost every competition-related conversation I have had, casual or professional: What did you learn? Actually, "Why did you bother?" is a fairly frequent question, too, but I'll save that for another time.

Oddly enough "what did I learn?" is also the question that I most frequently ask myself, and so far I haven't come up with any sort of gestalt answer. I find this comforting. A pat or encapsulated answer tends to trivialize any endeavor. Of course the more enlightened press seems to want to grab hold of the "big concept" ideas: crowdsourcing, contest-motivated problem solving, new paradigms for business analytics, that sort of thing. While those are important, and certainly make more interesting copy than explanations of why better movie rating predictions are good for Netflix, they don't offer any insight into the more personal explorations that contest participants undoubtedly experienced.

So, for better or worse, I'm going to try to cover some of what I, personally, learned. This will obviously take more than a single blog entry, and I have no doubt that some of my "revelations" will seem simple and obvious (possibly stupid), while others may seem more like heresy. There's not much I can do about that. After all, the only public opinion prediction engine I have available is incapable of doing more than making wild guesses – I refer, of course, to the squishy pattern matcher between my ears.

Algorithmists vs. Mathematicians

This is something I learned very early in the competition, before I had interacted with any other competitors. It's really more of a personal discovery rather than something directly related to the competition, but I think it's a good place to start.

I am not a mathematician, statistician, or actuary. In fact, math was generally my worst subject when I was in grade school. Still, I did take quite a few math courses in college: 3 semesters of calculus, linear algebra, and differential equations. I would have avoided these courses like the plague, but they were required for one or more of the majors I blithely skipped between as an undergraduate. Once again, my grades in these courses were not stellar. But I found that I understood the theories we studied at least as well as most of my classmates. I even helped others figure out how to do homework problems. Why, then, was my own work so poor? My conclusion: I lacked the ability (or perhaps desire) to do dozens of contrived problems based on the single assumption that the method(s) used should be drawn from the chapter we were studying at the time. After the first few problems I would get bored and start looking for shortcuts – some of which worked, but most of which failed. I once spent half of the allotted time for a calculus test attempting a derivation that would greatly simplify the bulk of the problems presented. I could see that such a formula must exist, and was much more interested in finding it than doing the problems. Unfortunately I didn't finish the derivation, and subsequently failed the test. We learned the formula a couple of months later. At the time I didn't understand the point of hiding that formula. To be honest, I still don't.


What does this have to do with the Netflix competition, you ask? Am I just using this as an excuse to rail against poor educational methods or cover up my character flaws? Good question.

One reason I stumbled across this particular life lesson was because I had been reading books by Wirth (particularly Algorithms + Data Structures = Programs) and Hunter (Metalogic: An Introduction to the Metatheory of Standard First Order Logic).

The academic study of algorithms can be a complex business. It often overlaps combinatorics in ways less obvious than combinatorial optimization; but that's a discussion for another day. In the study of algorithms there is much discussion of provability, efficiency (in terms of big O notation), and elegance. What it all comes down to, though, is finding a reliable, relatively efficient, sequence of steps by which a particular family of problems can be solved. An algorithmist (yes, I choose to use that term) rarely assumes that any algorithm is perfect, especially when seen in the harsh light of real world problems. Every act of publishing an algorithm contains an implicit invitation, or even challenge, to improve upon it. Where would we be if nobody had ever challenged the "bubble sort", after all? Computer science students, in my day at least, were encouraged (even expected) to look for better algorithms than those presented in the textbooks. Discovery of novel algorithms, even if they only fit a small subset of problems, was noteworthy. This was all considered good practice for life-after-college. Why, then, did the related field of mathematics seem to discourage rather than encourage the same sort of thinking?

Metalogic (which also overlaps combinatorics) revealed the intriguing thought that mathematics was just another logical system, and, as such, could be scrutinized from a perspective that was different than the one I had been so painfully taught. Unfortunately this was before the publication of Gödel, Escher, Bach, which would have saved me a lot of mental calisthenics. Still, the Hunter book led to Gödel. I delighted in Gödel's incompleteness theorem, willfully misinterpreting his conclusions to support my contention that mathematics was deeply and inherently flawed.

So, I went on my merry way, believing that mathematics was a necessary evil: something to be suffered through.

It was clear from my earliest exposure to the Netflix data that I would need to learn or re-learn some fairly basic math. For example: I certainly knew about gradient methods for problem solving, but finding the proper gradient for some of the formulas I was starting to come up with would require more knowledge of differential equations than I could muster. In the past I had relied on co-workers to handle that. It was equally clear that the bits and pieces of linear algebra that I remembered were not going to be adequate – I could barely remember how to construct and solve simple linear regression problems. That's what SAS and SPSS are for, right? Except I wasn't allowed to use commercial software.


What happened next was surprising. I did sit down and grind through some old math texts; just enough to pick up what I needed. (Coincidentally I also read How to Solve It: Modern Heuristics at about the same time – if you've read it you'll know why that's relevant). Then some competitors started to publish Netflix Prize-related papers. These papers were often riddled with formulas (ugh, I thought, more math to slog through). My own notes rarely contained formulas, but had copious quantities of pseudo-code interspersed with the text. So, have you figured out my "brilliant" revelation yet? Here it is: these mathematicians were doing exactly the same thing that I was doing – only the syntax was different. All I had to do was translate the formulas into pseudo-code (or actual code) and what they were trying to say suddenly became very clear. I had a sudden urge to smack myself in the forehead with my palm. How stupid had I been all these years!? But this was applied math, right? Not one of the "pure" pursuits. So, out of morbid curiosity I started reading all sorts of math-related papers (gotta love Google, arxiv.org, and the web in general). Guess what I found? Mathematicians really do promote the same sort of thinking, and have the same sort of expectations, that algorithmists do. It's true that mathematicians have been around longer and explored more of their "world", and in some cases this has led to stagnation in how it is taught and performed, but, sans baggage, it's still all about finding solutions to problems. Who would have guessed.


BellKor's article on matrix factorization methods is in the new edition of IEEE Computer


Hadoop MapReduce for Netflix Prize


An example of applying diversity in the GitHub Contest













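Take C++ for example: beginners all read the best-selling books, such as C++ Primer and Thinking in C++, while the people who read the less popular books are mostly experts. There is a catch, though: using the number of readers to decide whether a book is a best-seller is not accurate either, because a book may have few readers simply because it is bad. Perhaps we could use information about the author and the publisher. For example, if an author's books sell well but one particular book does not, that book may be a more specialized one, hehe. (For example, few people read Hawking's papers, but his popular-science books are best-sellers, hehe.)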


Is Netflix holding another contest? Next: Netflix Prize 2.0


This is Neil Hunt, Chief Product Officer at Netflix.

To everyone who participated in the Netflix Prize: You've made this a truly remarkable contest and you've brought great innovation to the field. We applaud you for your contributions and we hope you've enjoyed the journey. We look forward to announcing a winner of the $1M Grand Prize in late September.

And, like so many great movies, there will be a sequel.

The advances spurred by the Netflix Prize have so impressed us that we’re planning Netflix Prize 2, a new big money contest with some new twists.

Here’s one: three years was a long time to compete in Prize 1, so the next contest will be a shorter time limited race, with grand prizes for the best results at 6 and 18 months.

While the first contest has been remarkable, we think Netflix Prize 2 will be more challenging, more fun, and even more useful to the field.

Stay tuned for more details when we announce the winners of Prize 1 in September.









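When designing a real recommender system, we cannot compute a user's predicted rating for every movie, sort them, and take the top K. In BellKor's paper, when a top-K measure is used to evaluate rating prediction, 1000 movies are picked at random, then scored and sorted to obtain the top K.

In a real system, we first need to use binary data to find a candidate set. This step is itself a top-K process (it needs no ratings, only the 0-1 relation matrix). We then compute the user's predicted rating for each movie in the candidate set and sort the candidate set by rating. So top-K and the Netflix task are not the same problem but two different problems in recommender systems, and it is reasonable to evaluate them differently.

A simple example: top-K finds the movies the user is most likely to watch, ranked by the probability of watching, while rating prediction finds, among the movies the user may watch, the ones the user will like, because users sometimes also rate movies they do not like. Combining the two gives the movies the user is most likely to watch and will like after watching.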
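
The two-stage idea can be sketched in a few lines. Both scoring functions are placeholders I made up for illustration: `watch_score(user, item)` stands for any top-K model built on binary data, and `predict_rating(user, item)` for any rating predictor.

```python
def recommend_two_stage(user, items, watch_score, predict_rating,
                        n_candidates=100, top_k=10):
    """Stage 1: keep the items the user is most likely to watch.
    Stage 2: re-rank those candidates by predicted rating.

    watch_score and predict_rating are hypothetical callables taking
    (user, item) and returning a number; higher means more likely / better.
    """
    candidates = sorted(items, key=lambda i: watch_score(user, i),
                        reverse=True)[:n_candidates]
    return sorted(candidates, key=lambda i: predict_rating(user, i),
                  reverse=True)[:top_k]
```

Note that an item with a very high predicted rating is never recommended if it does not make it into the candidate set, which is exactly the point of separating the two problems.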

Online discussion group for recommender systems (Google Group in China)

I created a discussion group for recommender systems on Google Groups. Everyone is welcome to join.



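Crowd wisdom in recommender systems

I think the most important thing in designing a recommender system is not to find a single algorithm that produces the most accurate predictions and recommendations. No such algorithm exists. Users' preferences vary widely, and different types of users follow different patterns. Therefore, no single model can fit everyone's habits.


The Netflix Prize and the GitHub Contest solve different problems: the former is a prediction problem and the latter is a recommendation problem. Basically, almost none of the Netflix algorithms can be used in the GitHub Contest (except KNN), but the idea of combining models applies everywhere. In Netflix we combine models with regression, while in GitHub we can combine models through bagging plus some stochastic optimization algorithms (SA and GA are both well-known stochastic optimization algorithms).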
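
As a small illustration of combining models with regression, here is a least-squares blend of two predictors. It is only a toy: real Netflix blends combined many models and were fitted on a held-out probe set, and the function name is my own.

```python
def blend_weights(pred_a, pred_b, target):
    """Least-squares weights (wa, wb) minimising
    sum((wa * a + wb * b - t) ** 2) over the given examples.

    Solves the 2x2 normal equations directly (no intercept term).
    """
    saa = sum(a * a for a in pred_a)
    sbb = sum(b * b for b in pred_b)
    sab = sum(a * b for a, b in zip(pred_a, pred_b))
    sat = sum(a * t for a, t in zip(pred_a, target))
    sbt = sum(b * t for b, t in zip(pred_b, target))
    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        raise ValueError("the two prediction vectors are collinear")
    wa = (sat * sbb - sbt * sab) / det
    wb = (sbt * saa - sat * sab) / det
    return wa, wb
```

The blended prediction for an example is then `wa * a + wb * b`. For a ranking task like the GitHub Contest there is no such closed form, which is why a stochastic search over the blend weights is the more natural choice there.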


My Story in the Netflix Prize

I am a student from China. I graduated from the University of Science and Technology of China (USTC), and I now study at the Institute of Automation, Chinese Academy of Sciences (CASIA).

I chose recommender systems as my research field quite by accident. I used to do research on search engines, and last year I wrote a paper on how to find similar web pages on the Internet and submitted it to WWW09. It was rejected, and the reviewer told me that this problem had already been studied by many researchers in recommender systems. That was when I learned of the existence of collaborative filtering and recommender systems.

After my paper was rejected, I talked with another member of our lab and we found a good conference on recommender systems, the "ACM Conference on Recommender Systems". After reading some papers, I thought it was a good research area, so I downloaded all the papers from the ACM RecSys conference. This happened in February 2009, during the Spring Festival in China. I took all these papers home and read them over my holiday. While reading, I found that two data sets are often used in recommender systems and CF: MovieLens and Netflix.

I tried MovieLens first because it is small. After some experiments on MovieLens, I wanted to try the Netflix data, so I joined the Netflix Prize (2009-2-21).

I tried SVD first in the Netflix Prize because I thought my computer could not hold the item-item matrix in memory. I found I was wrong, however, and I used KNN after SVD. I read all the papers from BPC, GPT and other teams on the leaderboard. The last algorithm I implemented was RBM, because the RBM paper is hard to understand. My friend Wang Yuantao from Tsinghua University helped me implement RBM. He studies math at Tsinghua, so understanding the formulas in the RBM paper was easy for him. After I implemented RBM, I broke 0.87 in the Netflix Prize. Meanwhile, I wrote a paper about my time-dependent model, and it was accepted by WI09.

On June 26, when I woke up, I received a message from Wang telling me that someone had broken the 10% line. I was very surprised, because at the time PT was in first place (0.858x) and I thought they would need at least half a year to break the 10% line. I turned on my computer and found that PT had merged with BellKor and BigChaos. At the same time, I received two emails from Greg and Larry (Dace). They wanted to combine their results with mine, and I agreed. Then I joined VI, and Larry joined VI at the same time. Everyone in VI knows the rest of the story, so I will not write it here.


GitHub Contest

I am working on the GitHub Contest with Daniel Haran. Unlike the Netflix Prize, the GitHub Contest is a top-K recommendation task. I think it is another important task in recommender systems.

Take a movie recommender system as an example. When we design one, we face two problems:
1) Given a user, we should find which movies he/she will watch. That is, finding a candidate set of movies.
2) Within the candidate set, we should find which movies the user will like after watching.

I think the first task is the top-K recommendation task (GitHub) and the second is the prediction task (Netflix Prize).

Solving these two tasks is fundamental to designing a good recommender system.


Netflix Competitors Learn the Power of Teamwork

Published: July 27, 2009

A contest set up by Netflix, which offered a $1 million prize to anyone who could significantly improve its movie recommendation system, ended on Sunday with two teams in a virtual dead heat, and no winner to be declared until September.

But the contest, which began in October 2006, has already produced an impressive legacy. It has shaped careers, spawned at least one start-up company and inspired research papers. It has also changed conventional wisdom about the best way to build the automated systems that increasingly help people make online choices about movies, books, clothing, restaurants, news and other goods and services.

These so-called recommendation engines are computing models that predict what a person might enjoy based on statistical scoring of that person’s stated preferences, past consumption patterns and similar choices made by many others — all made possible by the ease of data collection and tracking on the Web.

“The Netflix prize contest will be looked at for years by people studying how to do predictive modeling,” said Chris Volinsky, a scientist at AT&T Research and a leader of one of the two highest-ranked teams in the competition.

The biggest lesson learned, according to members of the two top teams, was the power of collaboration. It was not a single insight, algorithm or concept that allowed both teams to surpass the goal Netflix, the movie rental company, set nearly three years ago: to improve the movie recommendations made by its internal software by at least 10 percent, as measured by predicted versus actual one-through-five-star ratings by customers.

Instead, they say, the formula for success was to bring together people with complementary skills and combine different methods of problem-solving. This became increasingly apparent as the contest evolved. Mr. Volinsky’s team, BellKor’s Pragmatic Chaos, was the longtime front-runner and the first to surpass the 10 percent hurdle. It is actually a seven-person collection of other teams, and its members are statisticians, machine learning experts and computer engineers from the United States, Austria, Canada and Israel.

When BellKor’s announced last month that it had passed the 10 percent threshold, it set off a 30-day race, under contest rules, for other teams to try to best it. That led to another round of team-merging by BellKor’s leading rivals, who assembled a global consortium of about 30 members, appropriately called the Ensemble.

Submissions came fast and furious in the last few weeks from BellKor’s and the Ensemble. Just minutes before the contest deadline on Sunday, the Ensemble’s latest entry edged ahead of BellKor’s on the public Web leader board — by one-hundredth of a percentage point.

“The contest was almost a race to agglomerate as many teams as possible,” said David Weiss, a Ph.D. candidate in computer science at the University of Pennsylvania and a member of the Ensemble. “The surprise was that the collaborative approach works so well, that trying all the algorithms, coding them up and putting them together far exceeded our expectations.”

The contestants evolved, it seems, along with the contest. When the Netflix competition began, Mr. Weiss was one of three seniors at Princeton University, including David Lin and Lester Mackey, who made up a team called Dinosaur Planet. Mr. Lin, a math major, went on to become a derivatives trader on Wall Street.

But Mr. Mackey is a Ph.D. candidate at the Statistical Artificial Intelligence Lab at the University of California, Berkeley. “My interests now have been influenced by working on the Netflix prize contest,” he said.

Software recommendation systems, Mr. Mackey said, will increasingly become common tools to help people find useful information and products amid the explosion of information and offerings competing for their attention on the Web. “A lot of these techniques will propagate across the Internet,” he predicted.

That is certainly the hope of Domonkos Tikk, a Hungarian computer scientist and a member of the Ensemble. Mr. Tikk, 39, and three younger colleagues started working on the contest shortly after it began, and in 2007 they teamed up with the Princeton group. “When we entered the Netflix competition, we had no experience in collaborative filtering,” Mr. Tikk said.

Yet based on what they learned, Mr. Tikk and his colleagues founded a start-up, Gravity, which is developing recommendation systems for commercial clients, including e-commerce Web sites and a European cellphone company.

Though the Ensemble team nudged ahead of BellKor’s on the public leader board, it is not necessarily the winner. BellKor’s, according to Mr. Volinsky, remains in first place, and Netflix contacted it on Sunday to say so.

And in an online forum, another member of the BellKor’s team, Yehuda Koren, a researcher for Yahoo in Israel, said his team had “a better test score than the Ensemble,” despite what the rival team submitted for the leader board.

So is BellKor’s the winner? Certainly not yet, according to a Netflix spokesman, Steve Swasey. “There is no winner,” he said.

A winner, Mr. Swasey said, will probably not be announced until sometime in September at an event hosted by Reed Hastings, Netflix’s chief executive. The movie rental company is not holding off for maximum public relations effect, Mr. Swasey said, but because the winner has not yet been determined.

The Web leader board, he explained, is based on what the teams submit. Next, Netflix’s in-house researchers and outside experts have to validate the teams’ submissions, poring over the submitted code, design documents and other materials. “This is really complex stuff,” Mr. Swasey said.

In Hungary, Mr. Tikk did not sound optimistic. “We didn’t get any notification from Netflix,” he said in a phone interview. “So I think the chances that we won are very slight. It was a nice try.”



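First of all, congratulations to our team "The Ensemble" on taking first place on the leaderboard. The final winner has not been officially announced, but it is still a good achievement to get this far in the last 30 days. Thanks to my teammates: without their help I would never have known how a 0.855x result is produced.


1. We know that pure collaborative filtering is not enough in a real system and that we need to use content information. But when we use content, we usually just use it to compute similarity. For example, we have a book's author, publisher, title and tags. We usually use this information to compare the similarity of books and then recommend similar books to the user. In my research, however, I found that users rely on different attributes of a book to different degrees. Some users trust the publisher: when I buy computer books, for example, I only buy from a few well-known publishers and do not trust the quality of books from the others. Sometimes the author matters: for C++, I generally only buy books by the top experts. But Douban's recommender system has not learned these preferences of mine (or rather, not fully). It has only learned that I like C++ books, not my requirements for authors and publishers. This is partly because I have not provided much preference data, and partly because these features may not have been learned at all.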









P.S. The Netflix Prize ends in 10 days, so I have to hurry. There is still hope, hehe! Recently I have been studying user clustering, and it looks promising!



















Paper collection of ICML 2008

