Wes McKinney, Software Engineer, Cloudera
Hadley Wickham, Chief Scientist, RStudio
This past January, we (Hadley and Wes) met and discussed some of the systems challenges facing the Python and R open source communities. In particular, we wanted to see if there were some opportunities to collaborate on tools for improving interoperability between Python, R, and external compute and storage systems.
One thing that struck us was that while R’s data frames and Python’s pandas data frames use very different internal memory representations, they share a very similar semantic model. In both R and pandas, data frames are lists of named, equal-length columns, which can be numeric, boolean, date-and-time, categorical (factors), or string. Every column can have missing values.
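To make that shared model concrete, here is a minimal sketch in plain Python. It is an illustration only, not how R or pandas stores data internally; the `make_frame` helper and the use of `None` as the missing-value marker are assumptions made for this example.

```python
from datetime import date


def make_frame(columns):
    """Build a toy data frame: a dict mapping column names to equal-length lists.

    Hypothetical helper for illustration; not part of R, pandas, or Feather.
    """
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    # Copy each column so the frame does not alias the caller's lists.
    return {name: list(values) for name, values in columns.items()}


frame = make_frame({
    "id": [1, 2, 3],                                        # numeric
    "active": [True, None, False],                          # boolean, one missing
    "joined": [date(2016, 1, 4), None, date(2016, 3, 29)],  # date, one missing
    "name": ["ann", "bob", None],                           # string, one missing
})

# Every column has the same number of rows, and any column may hold missing values.
n_rows = len(next(iter(frame.values())))
missing_per_column = {
    name: sum(value is None for value in values)
    for name, values in frame.items()
}
```

The point of the sketch is that the model is defined by column names, a common length, a per-column type, and a missing-value marker, none of which depends on a particular memory layout.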
Around this time, the open source community had just started the new Apache Arrow project, designed to improve data interoperability for systems dealing with columnar tabular data.
In discussing Apache Arrow in the context of Python and R, we wanted to see if we could use the insights from Arrow to design a very fast file format for storing data frames that could be used by both languages. Thus, the Feather format was born.
What is Feather?
Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames. It has a few specific design goals:
- Lightweight, minimal API: make pushing data frames in and out of memory as simple as possible
- Language agnostic: Feather files are the same whether written by Python or R code. Other languages can read and write Feather files, too.
- High read and write performance: when possible, Feather operations should be bound by local disk performance.
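One way to see why a column-oriented binary layout suits these goals is a toy round trip in plain Python. This is not the Feather on-disk layout; the header scheme, the float64-only restriction, and the function names `write_columns` and `read_columns` are all assumptions made for illustration.

```python
import json
import os
import struct
import tempfile


def write_columns(path, columns):
    """Write float64 columns as a small JSON header followed by raw column bytes.

    Toy layout for illustration only; it is not the Feather format.
    """
    names = list(columns)
    n_rows = len(columns[names[0]]) if names else 0
    header = json.dumps({"names": names, "n_rows": n_rows}).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(header)))  # 4-byte little-endian header length
        f.write(header)
        for name in names:
            # Each column is one contiguous run of little-endian float64 values.
            f.write(struct.pack(f"<{n_rows}d", *columns[name]))


def read_columns(path):
    """Read back the columns written by write_columns, in the same order."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<I", f.read(4))
        header = json.loads(f.read(header_len).decode("utf-8"))
        n_rows = header["n_rows"]
        return {
            name: list(struct.unpack(f"<{n_rows}d", f.read(8 * n_rows)))
            for name in header["names"]
        }


path = os.path.join(tempfile.mkdtemp(), "toy_columns.bin")
original = {"x": [1.0, 2.5, 3.0], "y": [0.1, 0.2, 0.3]}
write_columns(path, original)
restored = read_columns(path)
```

Because the byte order and value width are fixed and each column is stored contiguously, any language that can read bytes can decode the file, and reading is mostly a matter of moving bytes from disk into memory.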
Code examples
The Feather API is designed to make reading and writing data frames as easy as possible. In R, the code might look like:
    library(feather)

    path <- "my_data.feather"
    write_feather(df, path)
    df <- read_feather(path)
Analogously, in Python, we have:
    import ...

Source: http://blog.rstudio.org