Sunday, January 3, 2010

Beautiful Data

I finished reading "Beautiful Data" a couple of weeks ago. It's an O'Reilly book edited by Toby Segaran (of "Programming Collective Intelligence" fame) and Jeff Hammerbacher.

I love the topic. We're awash in data, thanks to the Internet, and the premium on gleaning insight from it all will only grow. This book doesn't teach you how to do that, but it does offer some wonderful examples. The fields are diverse: biology, social sciences, criminology, space exploration, and others.

I thought the book was a little uneven, as compendiums of essays often are. It's hard to establish and meet a high quality standard consistently when many authors are involved.

Three essays stood out for me.

Peter Norvig is a new technical hero of mine. If a B.S. in applied math from Brown, a Ph.D. in computer science from Cal Berkeley, and the title of Director of Research at Google aren't enough, maybe "Teach Yourself Programming In Ten Years" and "How To Write A Spelling Corrector" will put you over the top. His essay, "Natural Language Corpus Data," discusses the trillion-word data set published by Thorsten Brants and Alex Franz of Google and its applications, including a predecessor of the spelling corrector I cited earlier.
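To give a flavor of what a corpus-driven spelling corrector looks like, here is a minimal sketch of my own. It is not Norvig's code: the tiny word-count table is made up for illustration (a real one would come from a large corpus), and the names `COUNTS`, `edits1`, and `correct` are just my choices. The idea is to generate every string one edit away from the input and pick the known word with the highest count.

```python
from collections import Counter

# Hypothetical toy counts standing in for real corpus frequencies.
COUNTS = Counter({"the": 500, "data": 120, "date": 40, "spelling": 15, "beautiful": 10})

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return the set of strings one edit (delete, transpose, replace, insert) from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts=COUNTS):
    """Return word if known, else the most frequent known word one edit away, else word."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    if not candidates:
        return word  # nothing better to offer
    return max(candidates, key=counts.__getitem__)
```

With the toy counts above, `correct("teh")` gives "the" and `correct("dat")` prefers "data" over "date" because its count is higher. Swap in counts from a real corpus and the same few lines start to look surprisingly capable.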

Biology is an up-and-coming field that I'm largely ignorant of. The little I know is being spoon-fed to me by my brilliant youngest daughter, but I'm a messy, inattentive eater who gets as much on the floor as into my gob. "Life in Data: The Story of DNA" by Matt Wood and Ben Blackburne gives a nice overview with a data slant that I enjoyed.

"Superficial Data Analysis: Exploring Millions of Social Stereotypes" by Brendan O'Connor and Lukas Biewald took data from FaceStat.com, which is down 'temporarily', and mined it for relationships between gender and attractiveness, gender bias in word usage, etc. I thought it was a brilliant use of a public data source.

I'd give an honorable mention to "Building Radiohead's House Of Cards" by Aaron Koblin with Valdean Klump. I love the band. I ran right over to YouTube to see the video after reading the piece. It's brilliant stuff.

Lots of large data sets are publicly available. The Stackoverflow.com data, for one, is available under a Creative Commons license. It would be a great source of information about the millions of programmers who frequent the site.

Analyzing data doesn't require expensive tools, either. The R statistics package is free to download. The learning curve is steep, but there are resources available here, here, and there to be your sherpa on the way to the summit.

My list of goals for 2010 is growing. If a goal is a dream with a deadline, I need to become much better at setting deadlines for myself. Too many dreams remain unfulfilled.


