By matloff
An old friend recently called my attention to a thoughtful essay by Stanford statistics professor David Donoho, titled “50 Years of Data Science.” Given the keen interest these days in data science, the essay is quite timely. The work clearly shows that Donoho is not only a grandmaster theoretician, but also a statistical philosopher. The paper should be required reading in all Stat and CS departments. But as a CS person with deep roots in statistics, I believe there are a few points Donoho should have developed more, which I will discuss here, along with other points on which his essay really shines.
Though no one seems to claim to know what data science is — not even on an “I know it when I see it” basis — everyone seems to agree that it is roughly a combination of statistics and computer science. Fine, but what does that mean? Let’s take the computer science aspect first.
By CS here, I mean facility with computers, and by that in turn I mean more than programming. By happenstance, I was in a conversation today with some CS colleagues as to whether material on computer networks should be required for CS majors. One of my colleagues said there were equally deserving topics, such as Hadoop. My reply was that Hadoop is SLOW (so much so that many are predicting its imminent demise), and that maximizing its performance involves, inter alia, an understanding of…computer networks. Donoho doesn’t cover this point about computation (nor, it seems, do most data science curricula), limiting himself to programming languages and libraries.
But he does a fine job on the latter. I was pleased that his essay contains quite a bit of material on R, such as the work of Yihui Xie and Hadley Wickham. That a top theoretician devotes so …
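For readers who haven’t encountered that corner of the R world, here is a minimal sketch (my own illustration, not taken from Donoho’s essay) of the style of data manipulation Wickham’s dplyr package enables, using R’s built-in mtcars dataset; Xie’s knitr plays a similar role for reproducible reporting:

```r
# A small dplyr pipeline: group the built-in mtcars data by
# cylinder count, compute mean fuel economy and group size,
# and sort the groups by that mean.
library(dplyr)

mtcars %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg), n = n()) %>%
  arrange(desc(mean_mpg))
```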
Source: r-bloggers.com