2009-09-04

Why users give low ratings to books they bought

In a book recommender system, users sometimes give low ratings to books they bought. Here, I suppose there is no cheating in the system and users are not forced to rate items.

A user buys a book because he/she likes it. So why does he give the book a low rating after reading it? It may be because the book does not fully satisfy his interest. The user probably likes most of the properties of the book (he may like the topic or the author, or the book was recommended by one of his best friends); otherwise he would not have bought it. But one or two drawbacks of the book may make him give it a low rating.

The above may seem like common sense, but we can draw one conclusion from it: if a user gives a high rating to a book, he probably likes all of the properties of the book; but if a user gives a low rating to a book, he may only dislike part of it, because if he disliked every part of it, he would never have bought it.

2009-08-23

Yesterday's talk on the Netflix Prize at Qiyu Café (奇遇咖啡)

Yesterday at Qiyu Café near Xizhimen, I exchanged ideas with many recommender system experts and talked about my experience in the Netflix Prize. I'm not sure how well I presented, heh. Our lab has not held group meetings for a long time, and it has also been a long time since I gave such a long talk.

Thanks to clickstone and wanghongtao for organizing, and thanks to Wang Yuantao for his explanation of RBM.

Below are some photos taken by Guozhu.Wen.



2009-08-21

net@night interview with some Ensemble members (NetflixPrize)

URL is: http://www.twit.tv/natn114

Heh, the MP3 can be downloaded from this page. I listened to it once and understood most of it; it seems my listening skills have improved quite a lot over the past few years. Heh.

The three people interviewed were all very active in The Ensemble, and we worked together very well.

2009-08-18

my solutions for the github contest - removing unlikely items

The goal of a recommender system is to recommend things a user likes, but this problem has an inverse problem: how to find the things a user does not like.

This idea is very useful in the Github contest. I generally use extreme values to remove the things a user is unlikely to like.

For example, suppose a user has watched many projects, and the most popular project he watches has popularity A. Then we can assume that the probability of him watching a project more popular than A is very low.

Another example: a user has watched many projects, all of which were created after 2006; then we can assume that the probability of him watching a project created before 2006 is also very low.

Or, a user has watched many projects, but all of them are small programs with only a few hundred lines of code; then we can also assume that the probability of him watching a large project is very low.

Many such examples can be given, and each extreme value used improves the results. However, these methods only help for users who have watched many projects, because only for them are the extreme values reliable (the confidence is high).
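
A minimal sketch of this kind of extreme-value filtering is given below. The field names (`popularity`, `created_year`, `lines`), the `MIN_WATCHED` threshold, and the overall structure are my own illustrative assumptions, not part of the original solution.

```python
# Hypothetical sketch of extreme-value filtering. Each project is assumed
# to be a dict with "popularity", "created_year" and "lines" fields;
# names and thresholds are illustrative only.

MIN_WATCHED = 10  # only trust extreme values for fairly active users

def filter_unlikely(user_projects, candidates):
    """Drop candidates that fall outside the extreme values of the
    projects this user already watches."""
    if len(user_projects) < MIN_WATCHED:
        return candidates  # extreme values are unreliable for inactive users

    max_popularity = max(p["popularity"] for p in user_projects)
    min_year = min(p["created_year"] for p in user_projects)
    max_lines = max(p["lines"] for p in user_projects)

    return [c for c in candidates
            if c["popularity"] <= max_popularity
            and c["created_year"] >= min_year
            and c["lines"] <= max_lines]
```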

2009-08-17

my solutions for the github contest - item-based KNN

Wendong asked me a while ago to write about some concrete algorithm problems, so starting today I will describe in detail all the algorithms I used in the Github contest. I will write them in Chinese.

My Github Contest solution: item-based KNN

Item-based KNN is one of the most widely used methods for the top-K recommendation problem. Related papers include:

Item-based collaborative filtering recommendation algorithms
Item-based top-n recommendation algorithms
Amazon. com recommendations: Item-to-item collaborative filtering

In the github contest, I first used item-based KNN, but the implementation details differ somewhat from the papers above, mainly in the following points:

1) If two projects have been watched by the same user, that user contributes a certain amount of similarity between the two projects. In the traditional similarity computation, every user contributes equally; but consider two users, one who has watched 100 projects and one who has watched only 2. The similarity contributed by the user who watched 2 projects should be higher than that contributed by the user who watched 100. (This effect is called inverse user frequency, the counterpart of IDF in information retrieval.)

2) The recommendation step: for a user, we find all the projects he has watched, and for each of them we find the most similar projects; from these we pick the projects the user has not yet watched but which are most similar to the ones he has watched. For a project j and a user u, u's preference for j is defined as

p(u,j) = sum_{i in N(u)} w(i,j)

where w(i,j) is the similarity between projects i and j, and N(u) is the set of projects u has watched. Since w(i,j) is a linear correlation coefficient, in my implementation I square w(i,j); the main purpose is to weaken the influence of small similarities, since w(i,j) lies between 0 and 1.
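
Putting the two points together, a minimal sketch might look like the following. This is my own reconstruction under assumptions, not the original code: it uses an IUF-weighted cosine similarity rather than the exact linear correlation coefficient, it sums over all neighbours instead of only the K most similar, and `data` is assumed to map each user to the set of projects he watches.

```python
import math
from collections import defaultdict

def item_similarity(data):
    """IUF-weighted cosine similarity between items (a sketch)."""
    co = defaultdict(lambda: defaultdict(float))  # weighted co-occurrence counts
    item_count = defaultdict(float)               # how many users watch each item
    for user, items in data.items():
        if not items:
            continue
        iuf = 1.0 / math.log(1.0 + len(items))    # active users contribute less
        for i in items:
            item_count[i] += 1.0
            for j in items:
                if i != j:
                    co[i][j] += iuf
    w = defaultdict(dict)
    for i, related in co.items():
        for j, cij in related.items():
            w[i][j] = cij / math.sqrt(item_count[i] * item_count[j])
    return w

def recommend(user, data, w, top_n=1):
    """Score unwatched projects with squared similarities, as described above."""
    watched = data.get(user, set())
    scores = defaultdict(float)
    for i in watched:
        for j, wij in w.get(i, {}).items():
            if j not in watched:
                scores[j] += wij ** 2  # square to suppress small similarities
    return sorted(scores.items(), key=lambda x: -x[1])[:top_n]
```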

2009-08-16

Graph Layout Project on Github

I have become more and more proficient with Git recently, so I put the graph visualization program I developed earlier on Github: http://github.com/xlvector/graph-layout/tree/master


I used this tool to visualize the relationships between projects on github. Below are some of the results:


http://twitpic.com/dzr27

2009-08-15

Some reflections from clueless on the Netflix Prize

This is a blog post by clueless, reposted here. clueless is Bill Bame's nickname; he is also a member of The Ensemble. His blog is at http://information-density.blogspot.com/

There is one question that seems to come up in almost every competition-related conversation I have had, casual or professional: What did you learn? Actually, "Why did you bother?" is a fairly frequent question, too, but I'll save that for another time.

Oddly enough "what did I learn?" is also the question that I most frequently ask myself, and so far I haven't come up with any sort of gestalt answer. I find this comforting. A pat or encapsulated answer tends to trivialize any endeavor. Of course the more enlightened press seems to want to grab hold of the "big concept" ideas: crowdsourcing, contest-motivated problem solving, new paradigms for business analytics, that sort of thing. While those are important, and certainly make more interesting copy than explanations of why better movie rating predictions are good for Netflix, they don't offer any insight into the more personal explorations that contest participants undoubtedly experienced.

So, for better or worse, I'm going to try to cover some of what I, personally, learned. This will obviously take more than a single blog entry, and I have no doubt that some of my "revelations" will seem simple and obvious (possibly stupid), while others may seem more like heresy. There's not much I can do about that. After all, the only public opinion prediction engine I have available is incapable of doing more than making wild guesses – I refer, of course, to the squishy pattern matcher between my ears.



Algorithmists vs. Mathematicians
People in the Netflix competition generally came from two kinds of backgrounds: programmers (engineering-oriented) and mathematicians (theory-oriented).

This is something I learned very early in the competition, before I had interacted with any other competitors. It's really more of a personal discovery rather than something directly related to the competition, but I think it's a good place to start.

I am not a mathematician, statistician, or actuary. In fact, math was generally my worst subject when I was in grade school. Still, I did take quite a few math courses in college: 3 semesters of calculus, linear algebra, and differential equations. I would have avoided these courses like the plague, but they were required for one or more of the majors I blithely skipped between as an undergraduate. Once again, my grades in these courses were not stellar. But I found that I understood the theories we studied at least as well as most of my classmates. I even helped others figure out how to do homework problems. Why, then, was my own work so poor? My conclusion: I lacked the ability (or perhaps desire) to do dozens of contrived problems based on the single assumption that the method(s) used should be drawn from the chapter we were studying at the time. After the first few problems I would get bored and start looking for shortcuts – some of which worked, but most of which failed. I once spent half of the allotted time for a calculus test attempting a derivation that would greatly simplify the bulk of the problems presented. I could see that such a formula must exist, and was much more interested in finding it than doing the problems. Unfortunately I didn't finish the derivation, and subsequently failed the test. We learned the formula a couple of months later. At the time I didn't understand the point of hiding that formula. To be honest, I still don't.

Math is not clueless's strong suit.

What does this have to do with the Netflix competition, you ask? Am I just using this as an excuse to rail against poor educational methods or cover up my character flaws? Good question.

One reason I stumbled across this particular life lesson was because I had been reading books by Wirth (particularly Algorithms + Data Structures = Programs) and Hunter (Metalogic: An Introduction to the Metatheory of Standard First Order Logic).

The academic study of algorithms can be a complex business. It often overlaps combinatorics in ways less obvious than combinatorial optimization; but that's a discussion for another day. In the study of algorithms there is much discussion of provability, efficiency (in terms of big O notation), and elegance. What it all comes down to, though, is finding a reliable, relatively efficient, sequence of steps by which a particular family of problems can be solved. An algorithmist (yes, I choose to use that term) rarely assumes that any algorithm is perfect, especially when seen in the harsh light of real world problems. Every act of publishing an algorithm contains an implicit invitation, or even challenge, to improve upon it. Where would we be if nobody had ever challenged the "bubble sort", after all? Computer science students, in my day at least, were encouraged (even expected) to look for better algorithms than those presented in the textbooks. Discovery of novel algorithms, even if they only fit a small subset of problems, was noteworthy. This was all considered good practice for life-after-college. Why, then, did the related field of mathematics seem to discourage rather than encourage the same sort of thinking?

Metalogic (which also overlaps combinatorics) revealed the intriguing thought that mathematics was just another logical system, and, as such, could be scrutinized from a perspective that was different than the one I had been so painfully taught. Unfortunately this was before the publication of Gödel, Escher, Bach, which would have saved me a lot of mental calisthenics. Still, the Hunter book led to Gödel. I delighted in Gödel's incompleteness theorem, willfully misinterpreting his conclusions to support my contention that mathematics was deeply and inherently flawed.

So, I went on my merry way, believing that mathematics was a necessary evil: something to be suffered through.

It was clear from my earliest exposure to the Netflix data that I would need to learn or re-learn some fairly basic math. For example: I certainly knew about gradient methods for problem solving, but finding the proper gradient for some of the formulas I was starting to come up with would require more knowledge of differential equations than I could muster. In the past I had relied on co-workers to handle that. It was equally clear that the bits and pieces of linear algebra that I remembered were not going to be adequate – I could barely remember how to construct and solve simple linear regression problems. That's what SAS and SPSS are for, right? Except I wasn't allowed to use commercial software.

It is still very important to master some basic mathematics.

What happened next was surprising. I did sit down and grind through some old math texts; just enough to pick up what I needed. (Coincidentally I also read How to Solve It: Modern Heuristics at about the same time – if you've read it you'll know why that's relevant). Then some competitors started to publish Netflix Prize-related papers. These papers were often riddled with formulas (ugh, I thought, more math to slog through). My own notes rarely contained formulas, but had copious quantities of pseudo-code interspersed with the text. So, have you figured out my "brilliant" revelation yet? Here it is: these mathematicians were doing exactly the same thing that I was doing – only the syntax was different. All I had to do was translate the formulas into pseudo-code (or actual code) and what they were trying to say suddenly became very clear. I had a sudden urge to smack myself in the forehead with my palm. How stupid had I been all these years!? But this was applied math, right? Not one of the "pure" pursuits. So, out of morbid curiosity I started reading all sorts of math-related papers (gotta love Google, arxiv.org, and the web in general). Guess what I found? Mathematicians really do promote the same sort of thinking, and have the same sort of expectations, that algorithmists do. It's true that mathematicians have been around longer and explored more of their "world", and in some cases this has led to stagnation in how it is taught and performed, but, sans baggage, it's still all about finding solutions to problems. Who would have guessed.

2009-08-13

BellKor's article on matrix factorization methods in the new edition of IEEE Computer

http://www.research.att.com/%7Evolinsky/papers/ieeecomputer.pdf

Hadoop MapReduce for Netflix Prize

Interesting...

An example of applying diversity in the Github Contest

In the Github contest, diversity is an important factor to consider. Because the contest only requires us to guess one project each user might watch, predictions for active users become very poor if diversity is not taken into account.

If we call users who watch no more than 4 projects inactive, and the rest active, then my current experiments show a prediction accuracy of 60% for inactive users but only 40% for active users.

The reason is that active users actually have many potential projects they might watch, while the contest only asks us to guess one; this is like finding a needle in a haystack, which is quite difficult. This is where diversity comes into play.

For now I have only tried a simple method: when recommending to an active user, among projects of the same type, only keep the top few and down-weight the rest. So far this method has been very useful.

This also shows that diversity is very important in prediction.
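
The simple re-ranking described above might look roughly like the sketch below. The `category` helper and the penalty schedule are assumptions added for illustration; the original implementation may differ.

```python
# Hypothetical sketch of the diversity re-ranking. `ranked` is a list of
# (project, score) pairs sorted by score; `category(project)` is an assumed
# helper returning the project's type (e.g. its language or topic).

def diversify(ranked, category, keep_per_category=2, penalty=0.5):
    """Keep full weight for the first few projects of each category,
    down-weight later ones from the same category, then re-sort."""
    seen = {}
    rescored = []
    for project, score in ranked:
        c = category(project)
        seen[c] = seen.get(c, 0) + 1
        if seen[c] > keep_per_category:
            score *= penalty ** (seen[c] - keep_per_category)
        rescored.append((project, score))
    return sorted(rescored, key=lambda x: -x[1])
```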

2009-08-12

Two different kinds of veteran users in recommender systems

Recommender systems often have two different kinds of veteran users. Take book recommendation as an example.

The first kind of user has bought many books, but most of them are bestsellers; this seems to be the majority of people.

The other kind of user has also bought many books, but most of them are obscure; this kind of user seems to be a scholar in some field. For example, many of the math books I used to look up on Douban were very obscure, and many did not even have a cover image.

Both are veteran users, but they certainly have different habits.

If we find that, in some domain, a user reads only the things nobody else reads, is that person likely to be an expert in that domain? Take C++ for example: beginners all read some bestselling books such as C++ Primer or Thinking in C++, while those who read the less popular books are mostly experts. There is still a problem here, though: describing whether a book is a bestseller by how many people read it is not accurate either, since the book may simply be bad. Perhaps we could use information about the author or the publisher; for example, if an author's books generally sell well but this one does not, it may be a rather specialized book, heh. (For instance, few people read Hawking's papers, but his popular-science books sell very well, heh.)
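
One rough way to quantify this "expert detection" idea is sketched below, under assumptions I am adding here: the data layout (`purchases`, `readers`) and the `category_of` helper are hypothetical, and the popularity weighting is only one of many possible choices.

```python
import math

# Hypothetical sketch: within one category, score how obscure a user's
# purchases are on average. `purchases` maps user -> list of book ids,
# `readers[book]` is how many users bought the book, and `category_of(book)`
# returns the book's category; all names are illustrative.

def obscurity_score(user, category, purchases, readers, category_of):
    books = [b for b in purchases[user] if category_of(b) == category]
    if not books:
        return 0.0
    # The fewer readers a book has, the more it contributes; a user with a
    # high score mostly reads books few others read, and may be an expert.
    return sum(1.0 / math.log(1.0 + readers[b]) for b in books) / len(books)
```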

2009-08-07

Is Netflix holding another contest? Next Netflix Prize 2.0

http://www.netflixprize.com//community/viewtopic.php?pid=9371#p9371

This is Neil Hunt, Chief Product Officer at Netflix.

To everyone who participated in the Netflix Prize: You've made this a truly remarkable contest and you've brought great innovation to the field. We applaud you for your contributions and we hope you've enjoyed the journey. We look forward to announcing a winner of the $1M Grand Prize in late September.

And, like so many great movies, there will be a sequel.

The advances spurred by the Netflix Prize have so impressed us that we’re planning Netflix Prize 2, a new big money contest with some new twists.

Here’s one: three years was a long time to compete in Prize 1, so the next contest will be a shorter time limited race, with grand prizes for the best results at 6 and 18 months.

While the first contest has been remarkable, we think Netflix Prize 2 will be more challenging, more fun, and even more useful to the field.

Stay tuned for more details when we announce the winners of Prize 1 in September.

I am looking forward to Netflix's next contest, and hope they will release more data and pose an even more challenging problem.