Wednesday, March 13, 2013

First MOOC

I've always prided myself on being a lifelong learner. I've been watching the rise of massive open online courses (MOOCs) with great interest and curiosity. All of my education was done the traditional way: sit in a classroom with a lecturer and other students on a fixed schedule, do homework, take tests, let material diffuse into your brain over a week or a semester and hope it sticks.

I've never taken a class on-line. How would it feel? Could it be as effective as the traditional approach?

I was excited when I first heard about the AI class being offered by Peter Norvig and Sebastian Thrun. Peter Norvig is the Director of Research at Google and the author of Artificial Intelligence: A Modern Approach. He's written some terrific stuff, including Teach Yourself Programming in Ten Years and How to Write a Spelling Corrector. The latter is astounding. When I get on an airplane, I get myself a drink and decide which movie I'm going to watch; Peter Norvig writes a statistics-based spelling corrector in 21 lines of Python code that's 70-80% accurate. It was a revelation to me when I first saw it.
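The core trick, as a rough sketch (my own compressed paraphrase of the idea, not Norvig's actual code; the tiny inline corpus here stands in for his much larger training text): count word frequencies in a corpus, generate every string within one edit of the misspelling, and pick the known candidate the corpus says is most probable.

```python
import re
from collections import Counter

# Tiny inline corpus standing in for a real training text (an assumption
# for this sketch; accuracy depends entirely on corpus size and coverage).
CORPUS = "the spelling of a word matters the word the spelling corrector counts words"
WORDS = Counter(re.findall(r"\w+", CORPUS.lower()))

def P(word, N=sum(WORDS.values())):
    """Probability of `word`, estimated from corpus counts."""
    return WORDS[word] / N

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """Keep only candidates that actually appear in the corpus."""
    return {w for w in words if w in WORDS}

def correction(word):
    """Most probable correction: the word itself if known, else its best 1-edit neighbor."""
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=P)
```

With the corpus above, `correction("speling")` finds `"spelling"` by inserting the missing letter and choosing the candidate with the highest corpus probability.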

I signed up and started with the best of intentions, but then Hurricane Irene knocked our power out for ten days and put me behind the eight ball. No Internet; no computer; no lectures.

My interest in statistics has been growing over the last few years. I've tried to better understand the Bayesian approach - what it means and how it differs from the frequentist view that I've been exposed to. I've read Doing Bayesian Data Analysis: A Tutorial with R and BUGS by John K. Kruschke. Don't let the adorable puppies on the jacket fool you: this is a terrific, well-written book. I've got blog posts describing other books about Bayes that have caught my attention.
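As a toy illustration of the Bayesian view (my own example, not one from the book, which works with continuous beta priors rather than this discrete stand-in): start with a prior over a coin's bias, weigh each candidate by the likelihood of the observed flips, and renormalize to get the posterior.

```python
from math import comb

# Three candidate biases with a uniform prior - a deliberately coarse,
# illustrative setup; a real analysis would use a continuous prior.
thetas = [0.25, 0.5, 0.75]
prior = {t: 1 / 3 for t in thetas}

def posterior(heads, flips, prior):
    """Bayes' rule: posterior is proportional to prior times binomial likelihood."""
    unnorm = {t: p * comb(flips, heads) * t**heads * (1 - t) ** (flips - heads)
              for t, p in prior.items()}
    z = sum(unnorm.values())          # normalizing constant P(data)
    return {t: v / z for t, v in unnorm.items()}

# Observe 8 heads in 10 flips: belief shifts sharply toward the biased coin.
post = posterior(heads=8, flips=10, prior=prior)
```

The frequentist report here would be a point estimate (0.8) and a confidence interval; the Bayesian answer is the whole updated distribution `post` over the candidate biases.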

But I've still never taken a basic statistics course. I saw that Sebastian Thrun, one of the AI class instructors, was offering intro statistics at Udacity. I liked the lectures I saw him give for the AI class, so I thought I'd give it a go. I started just after Thanksgiving, with the goal of finishing before the end of the year.

The key for me is to make regular, concentrated effort, track my progress, and make sure that I avoid long gaps between sessions. I set up an Excel spreadsheet to record the date and units I covered. It was the same approach that got me through my first half marathon: plan the work, work the plan. It made it easy to see when I had a few days without getting another dose of learning.

I didn't meet my time goal of finishing before the end of 2012, but I didn't miss it by much. More importantly, I got through the entire course - every lecture, every assignment. The programming assignments were in Python, which I loved. I have the latest version of PyCharm - the Python IDE from JetBrains, makers of the best programming tools on the planet. I have NumPy and SciPy, two terrific libraries for scientific computing and numerical methods. It made programming a pleasure.

Most importantly, I proved to myself that I can take good advantage of all the courses on-line: MIT, Stanford, Coursera, Udacity, iTunes U and others.

I would still like to revisit AI class. There's a course from Stanford called Probabilistic Graphical Models that presents Markov models in depth. Linear algebra from Gil Strang at MIT would be a treat and a privilege.

But my next choice is Computing for Data Analysis by Roger Peng. Coursera isn't offering it now, but it's available on YouTube from Simply Statistics.

All kinds of knowledge are available to anyone with a computer, an Internet connection, and the drive to take it in. What a time to be alive.




Wednesday, March 6, 2013

Indexing My Diary

We had a winter storm for the ages here last month. January had been relatively mild, with little snow and below-average degree day totals. I ran outside on the road every weekend. My fitness was good; I felt strong.

This storm dropped three feet of snow in my yard. I went out on Friday night and blew 4" of snow off the driveway at 9 pm. When I went out the next morning the snow spilled over the top of my Honda snow thrower. It measures about 2' from the ground to the gas cap. When I went out the third time it spilled over the top again! The news said the storm cut a swath up the Connecticut River and left 30-36" of snow in its wake. I hit the snow jackpot.

I know there's no link between being cold and falling ill, but I was wet and chilled to the bone after each pass with the snow thrower. I felt fine that Friday when the snow started. By Sunday night my sinuses were full. On Monday it descended into my lungs. The coughing wouldn't stop. Rather than feel miserable and infect my co-workers, I decided to stay close to home. I had my work laptop with me, so I could have said I was "working from home." But I didn't want to feel guilty if the need to lie down and take a nap came over me, so I called in sick for a few days to beat it once and for all.

The funny thing is that I didn't take that nap. I've got a backlog of projects that I'm interested in finishing. I'm a little embarrassed about how long some of them have remained on the list, without any progress being made. One of them involved the electronic journal that I've kept for the last 19 years and counting. There's a folder for every year, a Word doc for every month, and an entry of one or more pages for every day that I decided to blather on about myself. It predates the coming of the World Wide Web; I started doing it on the first PC that I ever bought.

So I've got lots of stuff locked up inside. I found myself wondering "When did such and such happen? When did I last mention so and so?", but I didn't have any way to search. Then came Lucene, the Java-based search engine from Apache. I downloaded the latest version and set about creating an index for my journal. Reading and parsing the Word documents was difficult. I used the Apache POI library, because I started with a Word 97 template; docx didn't come along until much later. I didn't like the API or documentation much, but Google found a terrific link that got me off the dime.
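At its heart, what Lucene builds is an inverted index: a map from each term to the documents that contain it. A toy version in Python makes the idea concrete (illustrative only - the real project is Java against the Lucene API, and these diary entries are made up):

```python
import re
from collections import defaultdict

def build_index(entries):
    """entries: {entry_id: text}. Returns {term: set of entry ids}."""
    index = defaultdict(set)
    for entry_id, text in entries.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(entry_id)
    return index

def search(index, term):
    """Return ids of entries containing `term`, oldest first."""
    return sorted(index.get(term.lower(), set()))

# Hypothetical entries standing in for two decades of parsed Word documents.
entries = {
    "1994-06-12": "Celine visited today and we talked for hours",
    "1995-09-03": "Celine's wedding day was a wonderful celebration",
    "1996-01-01": "New year resolutions and a long run in the cold",
}
idx = build_index(entries)
```

A query like `search(idx, "Celine")` returns both entries that mention her. What Lucene adds on top of this basic structure is tokenization, stemming, incremental updates, and - relevant below - relevance scoring.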

I fell into a nice rhythm: code, test, check in, rinse, repeat. I use Git as my local repository and a GitHub account as my master. There were problems and obstacles to overcome, but I persisted and found my way through all of them. It was a satisfying feeling when I created an index and searched it for a few terms that I knew the answer to. When I typed in "Celine", my youngest sister's name, the first entry that came back was a one-sentence entry that must have been rushed. It puzzled me at first, but I think her name scored high because the entry was so short: one mention in a handful of words is a high relative frequency. Fortunately her wedding day was high on the list, too. She's mentioned often on that day, but it's a longer entry, so the relative frequency of her name is smaller. I'll have to dig into the internals to see if I can better understand and optimize my searches.
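My reading of why the short entry won: Lucene's classic similarity multiplies a term-frequency factor (the square root of the raw count) by a length norm (one over the square root of the entry's length in terms), so one mention in a dozen words can outscore several mentions in hundreds. A sketch with made-up numbers (idf and boosts omitted, since the same query term hits both entries):

```python
import math

def classic_score(tf, doc_len):
    """Simplified Lucene classic similarity: tf factor times length norm.
    idf and boost factors are dropped because we compare one term across
    two documents, where they cancel out."""
    return math.sqrt(tf) * (1.0 / math.sqrt(doc_len))

# Illustrative sizes, not measured from the actual diary.
short_entry = classic_score(tf=1, doc_len=12)    # one mention, one rushed sentence
wedding_day = classic_score(tf=4, doc_len=600)   # four mentions, a long entry
```

Here the one-sentence entry scores roughly 0.29 against roughly 0.08 for the long one, which matches the ranking I saw.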

I checked the code into GitHub; there's read-only access granted to the curious at git://github.com/duffymo/diary-index.git.

I plan to put Apache Solr on top of my index so I can have a lovely web interface. I'd also like to leverage either a timed service or a Java 7 file watcher to update my index on a schedule or whenever I make a new entry. I'm also considering abandoning Word and keeping my diary in TeX. Keeping all my thoughts in plain text will insulate me from the whims of format changes in Word...and I love TeX. (I typeset my dissertation myself using LaTeX.) PDFs can be beautiful.

I felt healthy again by the time I went back to work. The week was also a reminder of how much fun it is to fall into a long, sustained coding trance and produce something that's useful and beautiful at the end.

I'm onto the next project on my To-Do list. More to come soon.
