Why is Python a language of choice for data scientists?

Answer by Jeff Hammerbacher:

Python is an interpreted, dynamically typed language with a precise and efficient syntax. It has a good REPL, and new modules can be explored from the REPL with dir() and docstrings. That's one reason to prefer Python over C, C++, or Java.
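As a small illustration of that kind of exploration, here is a minimal sketch using the standard library's json module. The choice of module and the variable names are mine, picked only for illustration; at an actual REPL you would more likely type dir(json) and help(json.dumps) directly.

```python
import json

# List the public names the module exposes, as dir() would show at the REPL.
public_names = [name for name in dir(json) if not name.startswith("_")]
print(public_names)

# Read the first line of a function's docstring without leaving the session.
first_doc_line = json.dumps.__doc__.strip().splitlines()[0]
print(first_doc_line)
```

The same two steps work on any importable module, which is what makes trying out an unfamiliar library from the REPL cheap.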

The Python community invested in the mid-1990s in Numeric, an "extension to Python to support numeric analysis as naturally as [M]atlab does" [1]. Numeric later evolved into NumPy [2]. Several years later, the plotting functionality from Matlab was ported to Python with matplotlib [3]. Libraries for scientific computing were built around NumPy and matplotlib and bundled into the SciPy package [4], which was commercially supported by Enthought [5]. Python's support for Matlab-like array manipulation and plotting is a major reason to prefer it over Perl and Ruby.
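To give a rough sense of what Matlab-like array manipulation looks like in NumPy, here is a minimal sketch. The arrays and variable names are made up for illustration, and the @ operator for matrix products assumes Python 3.5 or later.

```python
import numpy as np

# A 2x2 matrix and a length-2 vector, chosen only for illustration.
a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([10.0, 20.0])

scaled = a * 2              # elementwise arithmetic, no explicit loop
shifted = a + b             # broadcasting: b is added to every row of a
product = a @ b             # matrix-vector product
col_means = a.mean(axis=0)  # reduce along rows to get one mean per column

print(shifted)
print(product)
print(col_means)
```

Whole-array expressions like these are what Perl and Ruby lacked a widely adopted counterpart for.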

Today, the most popular alternatives to Python for data scientists are R, Matlab/Octave, and Mathematica/Sage. In addition to the work mentioned above to port features from Matlab into Python, recent work has ported several popular features from R and Mathematica into Python.

From R, the data frame and associated manipulations (from the plyr and reshape packages) have been implemented by the pandas library [6]. The scikit-learn project [7] presents a common interface to many machine learning algorithms, similar to the caret package in R.
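For readers who have not seen the data frame idiom, here is a minimal sketch of a split-apply-combine operation in pandas, roughly the kind of thing plyr is used for in R. The column names and values are invented for illustration.

```python
import pandas as pd

# A tiny data frame with one grouping column and one numeric column.
df = pd.DataFrame({
    "species": ["a", "a", "b", "b"],
    "length": [1.0, 3.0, 2.0, 6.0],
})

# Split by species, apply a mean to each group, combine into one Series.
mean_length = df.groupby("species")["length"].mean()
print(mean_length)
```

The result is indexed by the grouping key, so mean_length["a"] gives the mean for that group.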

From Mathematica/Sage, the concept of a "notebook" has been implemented with IPython notebooks [8].

From my personal perspective, Python is still lacking in a few important areas.

  1. The first is the more cumbersome syntax for array manipulations and formula specification in Python. The Matlab/Octave syntax for array manipulation is still preferred (that's why it's used in the Stanford ML class, for example), and the R syntax for formula specification is quite nice.
  2. The second is the lack of a Python equivalent to ggplot2 for static graphics and D3 for interactive graphics. The matplotlib library is hard to install, hard to use, and does not facilitate building interactive graphics for the web.
  3. The third is the scalability of NumPy and pandas when working with large data sets. The company Continuum [9] is working to address this problem, but they're a long way from producing something coherent and usable.
  4. The fourth is the lack of an embedded, declarative language for data manipulation, similar to the LINQ project. The pandas library is useful as a low-level data manipulation toolkit, but tracking down the custom pandas syntax for complex operations can be frustrating.
  5. The fifth is the lack of an IDE for data scientists of similar quality to RStudio.

[1] http://hugunin.net/story_of_jython.html
[2] http://numpy.scipy.org/
[3] http://matplotlib.sourceforge.net/
[4] http://www.scipy.org/
[5] http://www.enthought.com/
[6] http://pandas.pydata.org
[7] http://scikit-learn.org
[8] http://blog.fperez.org/2012/01/ipython-notebook-historical.html
[9] http://continuum.io/