
Picking the Most Important Blogs

Research, News

If, like me, you are relatively new to the blogosphere, you’ve probably found it hard to figure out which blogs to read. Are some blogs more important than others? What would that even mean? My first impulse, and probably yours too, is to use Google. But simply searching for interesting topics doesn’t work that well for me, because what I really want to know is which blogs have the most interesting writing and the most timely topics.

Well, judging “interest” might still be an open problem, but Carlos Guestrin (Assistant Professor in both the Machine Learning and Computer Science Departments), along with Ph.D. students Andreas Krause and Jure Leskovec and faculty collaborators Christos Faloutsos and Jeanne VanBriesen, has an answer for the timeliness issue: think of blog articles as disease agents, and then use an algorithm designed to track the spread of disease outbreaks to figure out which articles are the “most important”. Once that is accomplished, it becomes possible to compute various kinds of rankings, including, for example, the top 100 blogs with the most up-to-date information. (And to answer the obvious question: No, this blog is not on the list. Shocking, I know.)

This is the concept behind the Cascades Project, and as you might imagine, any authoritative ranking of blogs or blog articles is bound to generate a huge amount of interest, especially in the blogosphere itself. (A small sample of the blogging about Cascades is available here.)

Now, a cool aspect of Cascades is that the algorithm wasn’t designed to rank blogs. The original motivation was actually to compute the optimal placement of sensors in a water distribution network, for the purpose of identifying the source(s) of water-borne diseases. (This explains the collaboration with Jeanne VanBriesen, from the Civil and Environmental Engineering Department.) The algorithm is sophisticated, making use of a type of problem structure called submodularity, which in this context intuitively means that adding a sensor to a small deployment helps more than adding it to a large deployment. This allowed the team to develop a heuristic with provable guarantees on the quality of its solutions, enabling huge problems to be solved. In an analysis of a major competition in sensor placement published by the American Society of Civil Engineers, this algorithm was determined to be the best, and by a wide margin.
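To make the diminishing-returns idea concrete, here is a minimal Python sketch. It is not the team's actual algorithm: it assumes a simplified objective (the number of distinct events that at least one chosen location detects) and uses the plain textbook greedy rule, which for this kind of monotone submodular objective is known to reach at least a 1 - 1/e fraction of the best possible value. The names `coverage`, `marginal_gain`, `greedy_placement`, and the toy data in `DETECTS` are all hypothetical. A "location" can be read as a sensor site in a water network or as a blog, and an "event" as a contamination scenario or an information cascade.

```python
from typing import Dict, List, Set


def coverage(chosen: List[str], detects: Dict[str, Set[str]]) -> int:
    """Number of distinct events detected by at least one chosen location."""
    covered: Set[str] = set()
    for loc in chosen:
        covered |= detects[loc]
    return len(covered)


def marginal_gain(loc: str, chosen: List[str], detects: Dict[str, Set[str]]) -> int:
    """Extra events detected when `loc` is added to `chosen`."""
    return coverage(chosen + [loc], detects) - coverage(chosen, detects)


def greedy_placement(detects: Dict[str, Set[str]], budget: int) -> List[str]:
    """Pick up to `budget` locations, always taking the largest marginal gain."""
    chosen: List[str] = []
    for _ in range(budget):
        # Sort candidates so that ties are broken deterministically by name.
        remaining = [loc for loc in sorted(detects) if loc not in chosen]
        if not remaining:
            break
        best = max(remaining, key=lambda loc: marginal_gain(loc, chosen, detects))
        if marginal_gain(best, chosen, detects) == 0:
            break  # nothing left to gain, so stop early
        chosen.append(best)
    return chosen


# Hypothetical toy data: which events each candidate location would detect.
DETECTS = {
    "A": {"e1", "e2", "e3"},
    "B": {"e3", "e4"},
    "C": {"e4", "e5", "e6", "e7"},
    "D": {"e1", "e7"},
}

if __name__ == "__main__":
    print(greedy_placement(DETECTS, budget=2))
    # Diminishing returns: the same location helps less as the deployment grows.
    print(marginal_gain("B", [], DETECTS))
    print(marginal_gain("B", ["C"], DETECTS))
    print(marginal_gain("B", ["C", "A"], DETECTS))
```

In this toy data, location B is worth two new events on its own, one after C is chosen, and none once both C and A are in place. That shrinking gain is the submodularity property in miniature, and it is what lets a simple greedy rule do well.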

Applying the algorithm to a 30GB database of 45,000 blogs containing 10.5M articles and 16.2M links, Carlos and his team were able to determine the most important sources of information in the blogosphere in 2006. See the project web site at http://www.blogcascades.org/ for the results.

And yes, I think the team has heard just about every joke comparing blogs to sewers…

Peter Lee @ November 17, 2007
