2009-09-20
2009-09-19
2009-09-17
Google Reader的数据收集
http://www.google.com/reader/public/atom/user/06601636036055060713/state/com.google/broadcast
这里首先要特别感谢一下kuber,他向我提供了这个链接。
这个链接中给出了用户06601636036055060713所share的文章,同时对每篇文章给出了like它的用户id。所以我们只要从这个链接出发,就可以通过广度优先搜索将整个Google Reader的数据抓下来(不过不能太过分,不然会被封的),每天要更新,获得最新的文章share情况。
目前我的爬虫正在奋勇的爬,我主要是研究目的,所以我准备收集10w用户和100w文章的数据就足够了。这个数据集可以说内容非常丰富,包含了时间和内容信息,相信在他的基础上可以做出不少工作。
P.S. 非常希望google reader能提供用户follow的数据,这样对研究社会网络和推荐系统的结合很有意义
2009-09-11
现在越来越喜欢Google Reader
比如Google Reader目前可能还处在收集数据的阶段,还没有开始在数据挖掘上发力。我觉得如果把推荐系统的方法用到Google Reader中,可以对用户的分享重新排序,这样可以更好的提供用户喜欢的信息。
2009-09-10
官方消息:我们赢得了Github Contest 2009
http://github.com/blog/489-github-contest-winners
Github同时也给予第二名一定的奖励,Jeremy和我在比赛中也联系过,其实本来hintly是由我们3个人组成的,只是后来时间太紧迫,没有来得及融合他的结果。不过jeremy的开放态度我也非常钦佩。其实在比赛开始的几天,我也是公开了源代码的,但后来可能担心有人超过,所以后期不够开放。不过比赛结束后,我还是上传了所有的代码的。
其实这次的Github Contest的设计不是非常合理,比如不应该允许用户无限制的上传,应该对数据做出变换以防止参赛者利用别的信息,提供了过多的内容信息等等。不过不管怎么样,得的奖还是很高兴的,现在就等Daniel把那瓶酒送给我了,哈哈。
2009-09-09
2009-09-06
Github contest: Winning the bottle of Pappy
is the blog post by Daniel Haran:
The Github contest ended last week-end. Liang Xiang (xlvector) and I
cooperated on an entry that took first place.
We win a bottle of aged Pappy Van Winkle, a large github account and
bragging rights among our fellow geeks.
Scott Chacon asked me to write up about the algorithms we used; that's
now in the project's README.
I learned 3 more lessons from the Github contest.
Diversity
While our main focus was improving our score, we did stumble on one
interesting avenue for future research.
Imagine a user who watches Rails projects for work, and Haskell
related ones for fun. Treating that user as 2 distinct users should
outperform the simpler approach. Using a kNN algorithm with too large
a value of k and 2+ very different interest clusters would guarantee
poor performance.
Given the time constraint we decided to just increase recommendation diversity.
Ensembles win, but require preparation
Netflix taught us that ensembles win. Ilya Grigorik submitted an entry
exploiting that fact and wrote about it Collaborative Filtering with
Ensembles.
The ideal co-operation scenario would have involved participants using
the same training data and result file formats. Had I realized that
sooner, the winning entry would have included Jeremy's results.
Overfitting
Avoiding over-fitting is usually very easy, and early on we decided to
test locally on our own training data subset and test file. Ideally,
we would have had time to do the same thing with each of the
heuristics and data sources we were using. Not doing so ironically
resulted in over-emphasizing those weights in our blending, thereby
underfitting the test set.
Thanks to Scott and Github for giving us a challenge between Netflix contests.
2009-09-04
Why user give low ratings to books they bought
book they bought. Here, I supose there is no cheating in the system
and users are not forced to rate items.
A user buy a book because he/she like it. So, why he give low rating
to the book after he read it? This may because this book is not
satisfy this users interest. So, this user like most of the
properities of the book (he may like the topic, the author, or this
book is recommended by one of his best friend), if not, he will not
buy this book, but one of two disadvantage of this book may make he
give low rating to this book.
Above words seem like common sense. But we can draw one conclusion
from this example, if a user give high rating to a book, he may like
all of the properities the book, but if a user give low rating to a
book, he may only hate part of the book because if he hate all parts
of the book, he will never buy it.
2009-08-23
昨天在奇遇咖啡关于Netflix Prize的报告
2009-08-21
net@night interview with some Ensemble members (NetflixPrize)
嘿嘿,这个页面可以下载MP3,我听了一遍,基本上听懂了,看来我的听力这几年进步的挺大。嘿嘿。
采访中的3个人,在Ensemble中都是很活跃的,我们合作的很好。
2009-08-18
my solutions of github contest - remove unlike items
这个想法在Github中很有作用,我一般是通过极值的方法来去掉用户不喜欢的东西。
举一个例子,比如一个用户watch了很多工程,但他watch的工程中最popular的工程的popular程度为A,那么我们可以认为,这个用户再watch超过这个popularity的可能性会很低。
再举一个例子,一个用户watch了很多工程,这些工程都是2006年之后创建的,那么我们可以认为,这个用户watch 2006年前的工程的可能性也很低。
或者,一个用户watch了很多工程,但这些工程都是小程序,行数只有几百行,那么也可以认为,这个用户watch大工程的可能性也很低
可以举出很多这样的例子,每用一种极值,对结果的提高都有好处。不过,这些方法只对watch了很多工程的用户才有作用,因为只有他们的极值是比较可靠的(置信概率比较高)
2009-08-17
my solutions of github contest - item based KNN
我的Github Contest解决方案 : item-based KNN
item-based KNN是top-K推荐问题中用的最广泛的一个方法,他的相关论文有
Item-based collaborative filtering recommendation algorithms
Item-based top-n recommendation algorithms
Amazon. com recommendations: Item-to-item collaborative filtering
在github contest里面,我首先使用了item-based KNN,不过具体的实现细节和前面几篇论文不太一样,主要有下面几点
1) 如果两个工程被同一个用户watch过,那这个用户肯定给这两个工程贡献一定的相似度。在传统的相似度计算中,不同的用户贡献相似度的能力是相同的,不过我们考虑两个用户,一个看了100个工程,一个只看过两个工程,那么看过2个工程的用户贡献的相似度应该要高于看过100个工程的用户。(这个效应被称为inverse
user frequence,是和信息检索中的idf相对应的)
2) 推荐过程,对于一个用户,我们找出他曾经watch过的所有工程,然后对每个工程找出和他相似的工程,从而找出这个用户没有watch过得,但是和他watch过的工程最相似的工程。比如一个工程j,一个用户u,那么u对j的喜欢程度定义为
p(u,j) = sum_{i in N(u)} w(i,j)
这里的w(i.j)就是i和j两个工程的相似度,N(u)是u
watch过的所有工程。因为w(i,j)是线性相关系数,在我的实现中,我对w(i,j)进行了平方,这样的目的主要是削弱小相似度的影响,因为w(i,j)是大于0小于1的