Why is Python a language of choice for data scientists?

Answer by Jeff Hammerbacher:

Python is an interpreted, dynamically typed language with a precise and efficient syntax. Python has a good REPL, and new modules can be explored from the REPL with dir() and docstrings. That's one reason to prefer Python over C, C++, or Java.
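As a rough illustration of that kind of exploration, here is a minimal sketch using only the standard library. The json module is an arbitrary stand-in for "a new module", and the variable names are made up for the example.

```python
import inspect
import json  # arbitrary stand-in for a module you have never used before

# dir() lists the names a module exposes; filter out the private ones.
public_names = [name for name in dir(json) if not name.startswith("_")]
print(public_names)

# Docstrings are available at runtime; take the first line as a summary.
summary = inspect.getdoc(json.dumps).splitlines()[0]
print(summary)
```

At an interactive prompt, help(json.dumps) shows the full docstring in a pager.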

The Python community invested in the mid-1990s in Numeric, an "extension to Python to support numeric analysis as naturally as [M]atlab does" [1]. Numeric later evolved into NumPy [2]. Several years later, the plotting functionality from Matlab was ported to Python with matplotlib [3]. Libraries for scientific computing were built around NumPy and matplotlib and bundled into the SciPy package [4], which was commercially supported by Enthought [5]. Python's support for Matlab-like array manipulation and plotting is a major reason to prefer it over Perl and Ruby.
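To give a feel for what Matlab-like array manipulation looks like in NumPy, here is a small sketch. It assumes NumPy is installed, and the array contents and variable names are purely illustrative.

```python
import numpy as np

# A 2x3 array: [[0, 1, 2], [3, 4, 5]]
a = np.arange(6, dtype=float).reshape(2, 3)

# Column means, computed without an explicit loop.
col_means = a.mean(axis=0)

# Broadcasting subtracts the length-3 vector from every row.
centered = a - col_means

# Matrix product of the transpose with the centered data (a 3x3 result).
gram = centered.T @ centered
print(gram)
```

Whole-array expressions like these, with no explicit loops, are the style the paragraph above refers to.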

Today, the most popular alternatives to Python for data scientists are R, Matlab/Octave, and Mathematica/Sage. In addition to the work mentioned above to port features from Matlab into Python, recent work has ported several popular features from R and Mathematica into Python.

From R, the data frame and associated manipulations (from the plyr and reshape packages) have been implemented by the pandas library [6]. The scikit-learn project [7] presents a common interface to many machine learning algorithms, similar to the caret package in R.
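A minimal sketch of the data frame manipulations mentioned above, assuming pandas is installed. The column names and values are invented for the example. The groupby call loosely corresponds to a plyr-style split-apply-combine, and pivot to a reshape-style cast.

```python
import pandas as pd

# A small "long" table: one row per (group, year) observation.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "year": [2011, 2012, 2011, 2012],
    "value": [1.0, 3.0, 10.0, 20.0],
})

# Split by group, apply a mean, and combine into one Series.
group_means = df.groupby("group")["value"].mean()

# Reshape from long to wide: one row per group, one column per year.
wide = df.pivot(index="group", columns="year", values="value")
print(group_means)
print(wide)
```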

From Mathematica/Sage, the concept of a "notebook" has been implemented with IPython notebooks [8].

From my personal perspective, Python is still lacking in a few important areas.

  1. The first is the more cumbersome syntax for array manipulations and formula specification in Python. The Matlab/Octave syntax for array manipulation is still preferred (that's why it's used in the Stanford ML class, for example), and the R syntax for formula specification is quite nice.
  2. The second is the lack of a Python equivalent to ggplot2 for static graphics and D3 for interactive graphics. The matplotlib library is hard to install, hard to use, and does not facilitate building interactive graphics for the web.
  3. The third is the scalability of NumPy and pandas when working with large data sets. The company Continuum [9] is working to address this problem, but they're a long way from producing something coherent and usable.
  4. The fourth is the lack of an embedded, declarative language for data manipulation, similar to the LINQ project. Pandas is useful as a low-level data manipulation toolkit, but tracking down the custom Pandas syntax for complex operations can be frustrating.
  5. The fifth is the lack of an IDE for data scientists of similar quality to RStudio.

[1] http://hugunin.net/story_of_jython.html
[2] http://numpy.scipy.org/
[3] http://matplotlib.sourceforge.net/
[4] http://www.scipy.org/
[5] http://www.enthought.com/
[6] http://pandas.pydata.org
[7] http://scikit-learn.org
[8] http://blog.fperez.org/2012/01/ipython-notebook-historical.html
[9] http://continuum.io/


Emma of Normandy, Queen of England

One of the strongest women in history …

The Freelance History Writer

While on tour in England in 2008, our British tour guide mentioned “We don’t do much medieval here”. My husband and I were standing in Salisbury Cathedral later in the day and I was thinking to myself, why not? That lovely spire that was the highest in Europe for many years was built in 1358 and it’s still standing! This got my imagination going and when I returned to the States, I began studying medieval British history with a vengeance.

The period of British history from the exodus of the Romans until the Norman Conquest has always been shadowy and mist-filled for me. My first thoughts were of Alfred, the only English king to be called “The Great” (871-899). In reading about the successors of Alfred, I came across a queen, Emma, who really intrigued me. It was because of her that the course of English history was sent into…
