
Hadoop’ing at My Desk


Last week, I started scrounging around the office for some unused PCs. Unfortunately, there were more than just a few, given everything going on at the Times. I grabbed three, put them on my desk and spent the rest of the day installing Ubuntu on them. Everything went really smoothly, and I was pleasantly surprised that our IT department didn't give me a hard time about wanting a switch in the office.

I used this post to help set up a Hadoop cluster. It went really smoothly, and before I knew it, the future was sitting on my desk.
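For flavor, the heart of that per-node configuration on a Hadoop 0.20-era cluster looks roughly like the snippet below. The master hostname and ports are illustrative, not necessarily what I used:

```xml
<!-- conf/core-site.xml: every node points at the shared HDFS namenode -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
</property>

<!-- conf/mapred-site.xml: every node points at the jobtracker -->
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
</property>

<!-- conf/hdfs-site.xml: replicate each block to all three machines -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```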

Why the future? The amount of data companies work with is growing far beyond what it used to be, and systems like Hadoop allow that data to be dealt with in more humane ways than stuffing it into some sort of database and hoping your SQL-fu can slice and dice it.

Of course, a three-node Hadoop cluster isn't that impressive when you compare it to the 4,000-node one used by Yahoo!. But that's OK, since this is just the beginning.

What am I doing with all this power, you ask? Well, let me give you an example. I have 10 years' worth of archives loaded into the cluster. As part of the Articles project, I turned each text file into an Atom representation, which has allowed us to do various things with the metadata. At first, I put each individual file into HDFS (the Hadoop Distributed File System), but then I would have needed to write some additional code for Hadoop to treat each file as a single record, as opposed to its default of treating each line of a file as a record. Eventually I'll do that, but it would have been yak shaving at the beginning.

Instead, I collapsed the files from each month into one, with each line being a story. This allowed Hadoop's default splitter to go crazy. One of the first Map/Reduce jobs I wrote was to go through each story, find all of the A1 (front page) stories and see who wrote them. That was the Map part of it, while the Reduce piece added all of the instances together so you could easily see the leaderboard. I mentioned this to one of my colleagues, who warned me that having that data fall into the wrong hands could destroy the newsroom. I think he was kidding, but I'm not that sure.
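To make the Map/Reduce split concrete, here's a minimal sketch of what that byline job could look like with Hadoop's Java API. The markers it greps for in each line (a <page>A1</page> element for placement and <name> for the byline) are hypothetical stand-ins for whatever the real Atom entries use:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FrontPageBylines {

  // Map: one story (one line) in, one (byline, 1) pair out for A1 stories.
  public static class A1Mapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text byline = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String entry = value.toString();
      // Hypothetical placement marker; the real metadata may differ.
      if (!entry.contains("<page>A1</page>")) {
        return;
      }
      int start = entry.indexOf("<name>");
      int end = entry.indexOf("</name>", start);
      if (start >= 0 && end > start) {
        byline.set(entry.substring(start + "<name>".length(), end));
        context.write(byline, ONE);
      }
    }
  }

  // Reduce: sum the 1s for each byline to build the leaderboard.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable v : values) {
        total += v.get();
      }
      context.write(key, new IntWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "front-page bylines");
    job.setJarByClass(FrontPageBylines.class);
    job.setMapperClass(A1Mapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Because each story is a single line, the stock TextInputFormat hands the mapper exactly one story at a time, which is why collapsing the monthly files paid off.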

Other tests have included looking at the breakdown of sections (News, Sports, Business, etc.) on the front page, seeing which keywords have been used the most, both across all 10 years and on the front page, and, more recently, using the keywords to try to train a Naive Bayes classifier with Mahout. That last one didn't really work well, but the idea still intrigues me.
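The keyword tally has the same shape; only the Map side changes. A sketch of that mapper, again assuming a hypothetical term="..." attribute on each keyword and reusing SumReducer from the job above:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Drop-in replacement for A1Mapper in the driver above; SumReducer
// then tallies a count per keyword instead of per byline.
public class KeywordMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text keyword = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String entry = value.toString();
    int idx = 0;
    // Hypothetical marker: emit one count per term="..." attribute.
    while ((idx = entry.indexOf("term=\"", idx)) >= 0) {
      int start = idx + "term=\"".length();
      int end = entry.indexOf('"', start);
      if (end < 0) {
        break;
      }
      keyword.set(entry.substring(start, end));
      context.write(keyword, ONE);
      idx = end + 1;
    }
  }
}
```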

In all the talk of the demise of the newspaper, one thing still bothers me. Newspapers are among the few organizations that have real information about the past, information beyond just the facts. Doing things with this information can only help find the proper place for newspapers and the data they've created.

Hadoop isn't some sort of cure-all for the woes we face, but I think it gives a glimpse of how a future news organization could use data to do incredible things and give users a much different relationship with the news, one they would renew every day.

