by Bob Horton, Senior Data Scientist, Microsoft
This is a follow-up to my earlier post on learning curves. A learning curve is a plot of predictive error for training and validation sets over a range of training set sizes. Here we’re using simulated data to explore some fundamental relationships between training set size, model complexity, and prediction error.
Start by simulating a dataset:
# Simulate N rows of categorical inputs: num_inputs columns, each sampled
# from the first input_cardinality capital letters.
sim_data <- function(N, num_inputs=8, input_cardinality=10){
  inputs <- rep(input_cardinality, num_inputs)
  names(inputs) <- paste0("X", seq_along(inputs))
  as.data.frame(lapply(inputs, function(cardinality)
    sample(LETTERS[1:cardinality], N, replace=TRUE)))
}
The input columns are named X1, X2, etc.; these are all categorical variables, with single capital letters representing the different categories. Cardinality is the number of possible values in a column; our default cardinality of 10 means we sample from the capital letters A through J.
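As a quick sanity check (this call is my own addition, not from the original post), we can peek at a small simulated sample:

# A tiny example: 5 rows, 3 input columns, letters A through D.
head(sim_data(5, num_inputs=3, input_cardinality=4))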
Next we’ll add an outcome variable (y); it has a base level of 100, but if the values in the first two X variables are equal, this is increased by 10. On top of this we add some normally distributed noise.
set.seed(123)

# Simulate 30,000 rows of categorical inputs.
data <- sim_data(3e4, input_cardinality=10)

# Outcome: 110 when X1 equals X2, 100 otherwise, plus Gaussian noise (sd = 2).
noise <- 2
data <- transform(data, y = ifelse(X1 == X2, 110, 100) +
                        rnorm(nrow(data), sd=noise))
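A quick check of my own (not from the original post): the average outcome in the two groups should differ by about 10.

# Mean of y when X1 == X2 versus when it does not; expect means near 110 and 100.
aggregate(y ~ I(X1 == X2), data = data, FUN = mean)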
With linear models, we handle an interaction between two categorical variables by adding an interaction term; the number of levels in that term is essentially the product of the two cardinalities. In this simulated data set, only the first two columns affect the outcome; the other input columns contain no useful information. We’ll use them to demonstrate how adding non-informative variables affects overfitting and training set size requirements.
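As a sketch of what such an interaction term looks like (the exact modeling code in the earlier post may differ), we might fit:

# Linear model with main effects for X1 and X2 plus their interaction.
fit <- lm(y ~ X1 * X2, data = data)

# Number of estimated coefficients: with cardinality 10 this is
# 1 (intercept) + 9 + 9 (main effects) + 81 (interaction terms) = 100.
length(coef(fit))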
As in the earlier post, I’ll use the …
Source: http://revolutionanalytics.com