Showing posts with label statistics. Show all posts

Wednesday, December 18, 2013

Best Coursera Yet




I finished my third Coursera course tonight: Data Analysis, eight weeks of learning R and its application to statistics problems. I've enjoyed all three, but this one topped them all. It was a difficult eight weeks. I spent a lot of hours at night after work and weekend time poring over assignments. It's been tiring but worth it. I've remembered the statistics I'd forgotten, learned a lot of new things like generalized linear models, and deepened my knowledge of R. It's exactly what I set out to accomplish when I started taking on-line classes a year ago at this time. I wanted to see if I could adapt to a new style of learning and self-education. I wanted to prove that I could still absorb challenging material. I'm at an age when it's easy to sit back and tell yourself that you already know it all, that you're too old a dog to learn new tricks. Did I still have it in me? I think I succeeded on all counts.

The professor was Jeff Leek, who's on the faculty of the Johns Hopkins biostatistics department. They claim to be pre-eminent in the world, and after sitting through this course I believe them. His graduate students are fortunate.

I've earned certificates with distinction for the two other classes I've taken. I calculated my point totals for quizzes and assignments for 'Data Analysis'. I think I squeezed out another distinguished certificate. I'll have to wait a week or two to see if my numbers match those from Coursera, but I'm confident that it'll turn out well.

I've got a lot of other tasks lined up. I want to take more of these classes next year, but I'm uncertain about what to take. I want to go further with R and statistics, but there's no clear choice listed in the catalog. I might start haunting Kaggle.com and applying these new skills to problems. I've got some development tasks to get back to.

I found out that I'm two speeches away from achieving the Toastmasters Advanced Leadership Bronze designation. I'll fulfill those easily in the next few weeks. That would be three awards in one fiscal year. After that I'll need only ten more speeches for Advanced Communicator Gold, plus the Advanced Leadership Silver, to become a Distinguished Toastmaster. Who would have thought it'd culminate in this when I started in 2008? Maybe I can do it by mid-2015.

I plan to relax a bit over the holidays. It's been a nice way to end the year.



Monday, September 23, 2013

Computing For Data Analysis Using R












I start the next step with Coursera tonight: it's opening night for "Computing For Data Analysis" from Roger Peng at Johns Hopkins University. It's a four-week introduction to using R that should be good. He blogs at Simply Statistics, which looks like it'll be a good resource for stretching my brain.

I'm running on my Windows 7 desktop. I've downloaded the latest version of R, 3.0.1. There's an IDE called Tinn-R that might be okay. I'm sure it won't replace IntelliJ from JetBrains as the world's greatest IDE. Until those brilliant Russians come up with an R environment for me I'll make do.

My friend Steve Roach pointed out a port of R that runs on the JVM called Renjin. I think this statement is surprising:

We built Renjin, a new interpreter for the JVM because we wanted the beauty, the flexibility, and power of R with the performance of the Java Virtual Machine.

My first thought was that it'll be hard to beat LAPACK in C or Fortran. But perhaps a version that leverages parallelism could tip the balance.

So this will be sucking up some of my time and energy for the next four weeks. I hope the lessons diffuse into my brain quickly.





Wednesday, March 13, 2013

First MOOC












I've always prided myself on being a lifelong learner. I've been watching the rise of massive open on-line courses (MOOCs) with great interest and curiosity. All of my education was done the traditional way: sit in a classroom with a lecturer and other students on a fixed schedule, do homework, take tests, let material diffuse into your brain over a week or a semester and hope it sticks.

I've never taken a class on line. How would it feel? Could it be as effective as the traditional approach?

I was excited when I first heard about the AI class being offered by Peter Norvig and Sebastian Thrun. Peter Norvig is the Director of Research at Google and co-author, with Stuart Russell, of Artificial Intelligence: A Modern Approach. He's written some terrific stuff, including Teach Yourself Programming In Ten Years and How To Write A Spelling Corrector. The latter is astounding. When I get on an airplane I get myself a drink and decide which movie I'm going to watch; Peter Norvig writes a statistics-based spelling corrector in 21 lines of Python code that's 70-80% accurate. It was a revelation to me when I first saw it.
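For anyone who hasn't seen the idea, here is a minimal sketch of a frequency-based corrector in that spirit. It is not Norvig's code: the tiny corpus is a made-up stand-in for the large text file a real corrector needs, the function names are my own, and it only looks at candidates one edit away from the input.

```python
import re
from collections import Counter

# Toy corpus, purely illustrative (an assumption, not real training data).
TOY_TEXT = ("the spelling of a word is corrected by picking "
            "the most frequent known word nearby the the")
WORDS = Counter(re.findall(r"[a-z]+", TOY_TEXT.lower()))

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """Keep only the candidates that appear in the corpus."""
    return {w for w in words if w in WORDS}

def correction(word):
    """Prefer the word itself if known, else the most frequent known neighbor."""
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=lambda w: WORDS[w])

print(correction("speling"))
print(correction("teh"))
```

The statistics part is just word counts: among the plausible candidates, pick the one seen most often. A word with no known neighbor is returned unchanged.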

I signed up and started with the best of intentions, but then Hurricane Irene knocked our power out for ten days and put me behind the eight ball. No Internet; no computer; no lectures.

My interest in statistics has been growing over the last few years. I've tried to better understand the Bayesian approach - what it means and how it differs from the frequentist view that I've been exposed to. I've read Doing Bayesian Data Analysis: A Tutorial with R and BUGS by John K. Kruschke. Don't let the adorable puppies on the jacket fool you: this is a terrific, well-written book. I've got blog posts describing other books about Bayes that have caught my attention.

But I've still never taken a basic statistics course. I saw that Sebastian Thrun, one of the AI class instructors, was offering intro statistics at Udacity. I liked the lectures I saw him give for the AI class, so I thought I'd give it a go. I started just after Thanksgiving, with the goal of finishing before the end of the year.

The key for me is to make regular, concentrated effort, track my progress, and make sure that I avoid long gaps between sessions. I set up an Excel spreadsheet to record the date and units I covered. It was the same approach that got me through my first half marathon: plan the work, work the plan. It made it easy to see when I had a few days without getting another dose of learning.

I didn't meet my time goal of finishing before the end of 2012, but I didn't miss it by much. More importantly, I got through the entire course - every lecture, every assignment. The programming assignments were in Python, which I loved. I have the latest version of PyCharm - the Python IDE from JetBrains, makers of the best programming tools on the planet. I have NumPy and SciPy, two terrific libraries for scientific computing and numerical methods. It made programming a pleasure.

Most importantly, I proved to myself that I can take good advantage of all the courses on-line: MIT, Stanford, Coursera, Udacity, Apple U and others.

I would still like to revisit the AI class. There's a course from Stanford called Probabilistic Graphical Models that presents Markov models in depth. Linear algebra from Gil Strang at MIT would be a treat and a privilege.

But my next choice is Computing for Data Analysis by Roger Peng. Coursera isn't offering it now, but it's available on YouTube from Simply Statistics.

All kinds of knowledge are available to anyone with a computer, an Internet connection, and the drive to take it in. What a time to be alive.




Sunday, January 3, 2010

Beautiful Data










I finished reading "Beautiful Data" a couple of weeks ago. It's an O'Reilly book edited by Toby Segaran (of "Programming Collective Intelligence" fame) and Jeff Hammerbacher.

I love the topic. We're awash in data, thanks to the Internet, but there will be a greater premium placed on gleaning insight from it all in the future. This book doesn't provide lessons on how to do that, but it does list some wonderful examples. The fields are diverse: biology, social sciences, criminology, space exploration, and others.

I thought the book was a little uneven, as compendiums of essays often are. It's hard to establish and meet a high quality standard consistently when many authors are involved.

Three essays stood out for me.

Peter Norvig is a new technical hero of mine. If a B.S. from Brown in applied math, a Ph.D. in computer science from Cal Berkeley, and the title of Director of Research at Google aren't enough, maybe "Teach Yourself Programming In Ten Years" and "How To Write A Spelling Corrector" will put you over the top. His essay entitled "Natural Language Corpus Data" discusses the trillion-word data set published by Thorsten Brants and Alex Franz of Google and its applications. This includes a predecessor of the spelling corrector I cited earlier.

Biology is an up-and-coming field that I'm largely ignorant of. The little I know is being spoon fed to me by my brilliant youngest daughter, but I'm a messy, inattentive eater who gets as much on the floor as I do into my gob. "Life in Data: The Story of DNA" by Matt Wood and Ben Blackburne gives a nice overview with a data slant that I enjoyed.

"Superficial Data Analysis: Exploring Millions of Social Stereotypes" by Brendan O'Connor and Lukas Biewald took data from FaceStat.com, which is down 'temporarily', and mined it for relationships between gender and attractiveness, gender bias in word usage, etc. I thought it was a brilliant use of a public data source.

I'd give an honorable mention to "Building Radiohead's House Of Cards" by Aaron Koblin with Valdean Klump. I love the band. I ran right over to YouTube to see the video after reading the piece. It's brilliant stuff.

There are lots of large data sets that are publicly available. The Stackoverflow.com data is available under a Creative Commons license. It would be a great source of information about the millions of programmers who frequent it.

It doesn't require expensive tools, either. The R statistics package is free to download. The learning curve is steep, but there are resources available here, here, and there to be your sherpa on your way to the summit.

My list of goals for 2010 is growing. If a goal is a dream with a deadline, I need to become much better at setting deadlines for myself. Too many dreams remain unfulfilled.