R can handle fairly big data on a single machine: 2 billion (2e9) rows and a few columns take roughly 100 GB of memory (for example, six 8-byte numeric columns would need about 2e9 × 6 × 8 bytes ≈ 96 GB). That is already enough data to care about performance.
In this post I am going to discuss the scalability of filter queries.
Indexes were introduced in data.table 1.9.4; they are also known as secondary keys. Unlike the key, a single data.table can have multiple indexes.
An index basically stores an additional vector of row order as a data.table attribute.
That sounds really simple, and it gets even better: the user does not have to use indexes in any special way, because data.table handles them automatically.
And the performance gains are big enough to write a post about.
What you should know about data.table indexes (as of 2015-11-23):
- an index will be used when subsetting a dataset with == or %in% on a single variable
- by default, if an index for the filtered variable is not yet present, it is automatically created and then used (see the sketch after this list)
- indexes are lost if you change the order of data
- you can check whether an index is being used with options(datatable.verbose=TRUE)
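A minimal sketch of that behaviour (the column name x and the sampled values are made up for illustration; the exact verbose messages are not reproduced here):

library(data.table)
options(datatable.verbose=TRUE, datatable.auto.index=TRUE)
DT = data.table(x = sample(letters, 1e6, replace=TRUE))
DT[x == "q"]                          # first filter on x: an index is created automatically, then used
DT[x == "q"]                          # second filter: the existing index is reused
names(attributes(attr(DT, "index")))  # lists the stored indexes, e.g. "__x"
DT = DT[sample(nrow(DT))]             # one way to reorder the rows: the result is a new table
attr(DT, "index")                     # expected to be NULL, the index is not carried over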
The features above are likely to be improved in future versions.
It is also important to mention that there is an open FR to automatically utilize indexes when doing an unkeyed join (a new feature in 1.9.6) using the new on argument. So in a future version users will be able to leverage the mighty performance of indexes when joining datasets.
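For context, this is what an unkeyed join with the on argument looks like in 1.9.6 (a minimal sketch with made-up tables; today such a join does not use indexes, which is exactly what the feature request asks for):

library(data.table)
d1 = data.table(id = c("a","b","c"), v1 = 1:3)
d2 = data.table(id = c("b","c","d"), v2 = 4:6)
d1[d2, on = "id"]   # join on id without setting a key on either table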
Brief look at the structure:
library(data.table)
op = options(datatable.verbose=TRUE,
             datatable.auto.index=TRUE)
dt = data.table(a=letters[c(3L,1L,2L)])
set2keyv(dt, "a")
## forder took 0 sec
attr(dt, "index")
## integer(0)
## attr(,"__a")
## [1] 2 3 1
dt[a=="b"]
## Using existing index 'a'
## Starting bmerge ...done in 0 secs
## a
## 1: b
dt[a %in% c("b","c")]
## Using existing index 'a'
## Starting bmerge ...
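To illustrate that a single data.table can hold several indexes, a minimal sketch (assuming an extra column b; each set2keyv call is expected to add its own index attribute next to the existing ones):

d3 = data.table(a = c("c","a","b"), b = c(2L, 3L, 1L))
set2keyv(d3, "a")
set2keyv(d3, "b")
names(attributes(attr(d3, "index")))   # expected to list both "__a" and "__b"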
Source: r-bloggers.com