There is one question that seems to come up in almost every competition-related conversation I have had, casual or professional: What did you learn? Actually, "Why did you bother?" is a fairly frequent question, too, but I'll save that for another time.
Oddly enough "what did I learn?" is also the question that I most frequently ask myself, and so far I haven't come up with any sort of gestalt answer. I find this comforting. A pat or encapsulated answer tends to trivialize any endeavor. Of course the more enlightened press seems to want to grab hold of the "big concept" ideas: crowdsourcing, contest-motivated problem solving, new paradigms for business analytics, that sort of thing. While those are important, and certainly make more interesting copy than explanations of why better movie rating predictions are good for Netflix, they don't offer any insight into the more personal explorations that contest participants undoubtedly experienced.
So, for better or worse, I'm going to try to cover some of what I, personally, learned. This will obviously take more than a single blog entry, and I have no doubt that some of my "revelations" will seem simple and obvious (possibly stupid), while others may seem more like heresy. There's not much I can do about that. After all, the only public opinion prediction engine I have available is incapable of doing more than making wild guesses – I refer, of course, to the squishy pattern matcher between my ears.
Algorithmists vs. Mathematicians
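People in the Netflix competition generally come from one of two backgrounds: programmers (engineers) and mathematicians (more theory-oriented).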
This is something I learned very early in the competition, before I had interacted with any other competitors. It's really more of a personal discovery rather than something directly related to the competition, but I think it's a good place to start.
I am not a mathematician, statistician, or actuary. In fact, math was generally my worst subject when I was in grade school. Still, I did take quite a few math courses in college: 3 semesters of calculus, linear algebra, and differential equations. I would have avoided these courses like the plague, but they were required for one or more of the majors I blithely skipped between as an undergraduate. Once again, my grades in these courses were not stellar. But I found that I understood the theories we studied at least as well as most of my classmates. I even helped others figure out how to do homework problems. Why, then, was my own work so poor? My conclusion: I lacked the ability (or perhaps desire) to do dozens of contrived problems based on the single assumption that the method(s) used should be drawn from the chapter we were studying at the time. After the first few problems I would get bored and start looking for shortcuts – some of which worked, but most of which failed. I once spent half of the allotted time for a calculus test attempting a derivation that would greatly simplify the bulk of the problems presented. I could see that such a formula must exist, and was much more interested in finding it than doing the problems. Unfortunately I didn't finish the derivation, and subsequently failed the test. We learned the formula a couple of months later. At the time I didn't understand the point of hiding that formula. To be honest, I still don't.
Math is not clueless's strong suit.
What does this have to do with the Netflix competition, you ask? Am I just using this as an excuse to rail against poor educational methods or cover up my character flaws? Good question.
One reason I stumbled across this particular life lesson was because I had been reading books by Wirth (particularly Algorithms + Data Structures = Programs) and Hunter (Metalogic: An Introduction to the Metatheory of Standard First Order Logic).
The academic study of algorithms can be a complex business. It often overlaps combinatorics in ways less obvious than combinatorial optimization; but that's a discussion for another day. In the study of algorithms there is much discussion of provability, efficiency (in terms of big O notation), and elegance. What it all comes down to, though, is finding a reliable, relatively efficient sequence of steps by which a particular family of problems can be solved. An algorithmist (yes, I choose to use that term) rarely assumes that any algorithm is perfect, especially when seen in the harsh light of real world problems. Every act of publishing an algorithm contains an implicit invitation, or even challenge, to improve upon it. Where would we be if nobody had ever challenged the "bubble sort", after all? Computer science students, in my day at least, were encouraged (even expected) to look for better algorithms than those presented in the textbooks. Discovery of novel algorithms, even if they only fit a small subset of problems, was noteworthy. This was all considered good practice for life-after-college. Why, then, did the related field of mathematics seem to discourage rather than encourage the same sort of thinking?
Metalogic (which also overlaps combinatorics) revealed the intriguing thought that mathematics was just another logical system, and, as such, could be scrutinized from a perspective that was different than the one I had been so painfully taught. Unfortunately this was before the publication of Gödel, Escher, Bach, which would have saved me a lot of mental calisthenics. Still, the Hunter book led to Gödel. I delighted in Gödel's incompleteness theorem, willfully misinterpreting his conclusions to support my contention that mathematics was deeply and inherently flawed.
So, I went on my merry way, believing that mathematics was a necessary evil: something to be suffered through.
It was clear from my earliest exposure to the Netflix data that I would need to learn or re-learn some fairly basic math. For example: I certainly knew about gradient methods for problem solving, but finding the proper gradient for some of the formulas I was starting to come up with would require more knowledge of differential equations than I could muster. In the past I had relied on co-workers to handle that. It was equally clear that the bits and pieces of linear algebra that I remembered were not going to be adequate – I could barely remember how to construct and solve simple linear regression problems. That's what SAS and SPSS are for, right? Except I wasn't allowed to use commercial software.
Having a grasp of some basic math is still very important.
What happened next was surprising. I did sit down and grind through some old math texts; just enough to pick up what I needed. (Coincidentally I also read How to Solve It: Modern Heuristics at about the same time – if you've read it you'll know why that's relevant). Then some competitors started to publish Netflix Prize-related papers. These papers were often riddled with formulas (ugh, I thought, more math to slog through). My own notes rarely contained formulas, but had copious quantities of pseudo-code interspersed with the text. So, have you figured out my "brilliant" revelation yet? Here it is: these mathematicians were doing exactly the same thing that I was doing – only the syntax was different. All I had to do was translate the formulas into pseudo-code (or actual code) and what they were trying to say suddenly became very clear. I had a sudden urge to smack myself in the forehead with my palm. How stupid had I been all these years!? But this was applied math, right? Not one of the "pure" pursuits. So, out of morbid curiosity I started reading all sorts of math-related papers (gotta love Google, arxiv.org, and the web in general). Guess what I found? Mathematicians really do promote the same sort of thinking, and have the same sort of expectations, that algorithmists do. It's true that mathematicians have been around longer and explored more of their "world", and in some cases this has led to stagnation in how it is taught and performed, but, sans baggage, it's still all about finding solutions to problems. Who would have guessed?
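As a small illustration of that kind of translation (a sketch for this post only, not drawn from the notes or papers mentioned above), take root mean squared error, the measure the Netflix Prize was scored on. Written as a formula it is RMSE = sqrt((1/N) * sum over i of (p_i - r_i)^2), where p_i is a predicted rating and r_i the actual one. The function name `rmse` and the sample ratings are hypothetical, and Python is simply a convenient choice of language.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists.

    Formula: sqrt( (1/N) * sum_i (p_i - r_i)^2 )
    """
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("need two non-empty lists of equal length")

    total = 0.0
    # "sum over i" in the formula is just a loop over the paired ratings
    for p, r in zip(predicted, actual):
        total += (p - r) ** 2          # the (p_i - r_i)^2 term

    # the 1/N factor and the outer square root
    return math.sqrt(total / len(actual))


# Hypothetical ratings, for illustration only
print(rmse([1, 2, 3], [2, 3, 4]))
```

Read side by side, the summation sign is the loop, the subscripts are the loop variables, and the rest is ordinary arithmetic.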
Thanks for sharing.
Heh, are you a mathematician or a programmer?