By Randy Zwitch
In my previous post about the Adobe Analytics Clickstream Data Feed, I showed how to take a single day's worth of data and build a dataframe in R. In practice, though, most analyses require multiple days, weeks, or months of data, and given the size and complexity of the feed, loading the files into a relational database makes a lot of sense. Although there may be database-specific “fast-load” tools better suited to this task, this blog post will show how to handle the process using only R and PostgreSQL.
File Organization
Before loading the data into PostgreSQL, I like to sort my files by type into separate directories (remember from the previous post, you’ll receive three files per day). R makes these OS-level operations simple enough:
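As a sketch, one of the three near-identical blocks might look like the following (the other two differ only in the file pattern and target directory); the “.tsv.gz” naming pattern is an assumption about how the daily hit-data files arrive:

```r
# Hedged sketch: sort the daily hit-data files into their own directory.
# The ".tsv.gz" pattern is an assumption about the feed's file naming.
if (!file.exists("hit_data")) {
  dir.create("hit_data")
}
hit_files <- list.files(pattern = "\\.tsv\\.gz$")         # daily hit-data files
file.rename(hit_files, file.path("hit_data", hit_files))  # move them into place
```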
Were there more file types, I could’ve abstracted this into a function instead of copying the code three times, but the idea is the same: check whether the directory exists; if it doesn’t, create it, then move the files into it.
Connecting and Loading Data to PostgreSQL from R
Once the files are organized, we can begin loading them into PostgreSQL using the RPostgreSQL package. RPostgreSQL is DBI-compliant, so the connection code is essentially the same as for any other database engine. The biggest caveat when loading your server-call data into a database is that the first load is almost guaranteed to require reading every column as text (using the colClasses = "character" argument in R). The reason is that Adobe Analytics implementations necessarily change over time; text is the only column format that guarantees no loss of data (we can fix the schema later within Postgres, either with ALTER TABLE or by writing a view). A sketch of the connection and first load follows.
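Here is a minimal sketch of that workflow, assuming a local PostgreSQL instance; the credentials, database name, file path, and the column name in the ALTER TABLE statement are all placeholders, not values from the feed itself:

```r
# A minimal sketch, assuming a local PostgreSQL instance; credentials,
# database, and file names are placeholders.
library(RPostgreSQL)

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, host = "localhost", dbname = "adobe",
                 user = "postgres", password = "password")

# Hit-data files ship without a header row; read every column as text
# (colClasses = "character") so nothing is coerced or truncated on load
hits <- read.delim("hit_data/hit_data.tsv.gz", header = FALSE,
                   colClasses = "character", quote = "",
                   stringsAsFactors = FALSE)

# The first load creates the staging table; later days use append = TRUE
dbWriteTable(con, "hit_data", hits, row.names = FALSE)

# Later, tighten the schema in place; the column name "v3" is hypothetical
dbSendQuery(con, "ALTER TABLE hit_data
                  ALTER COLUMN v3 TYPE timestamp USING v3::timestamp;")

dbDisconnect(con)
```

From there, each additional day is just another read.delim() and dbWriteTable(..., append = TRUE) call, and the real column names can be pulled from the lookup files delivered alongside the hit data.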