Sunday, August 24, 2008

The Basics of Predictive Modeling
(and most data mining exercises)

The general idea of predictive modeling is simple: given all of the information available, predict with as much accuracy as possible a future outcome. The following questions are of particular interest.

  • What information can be used in a predictive model and how?
  • What variables can be predicted?
  • What structures are available for a model?
  • How is accuracy determined?
  • How is a predictive model used in the real world?

Predictive modeling is all about making practical use of real data to solve real problems. In most cases the data is not clean, the model is not quite perfect, and the data often violate some statistical assumptions. On the other hand, the problem a predictive model solves is usually very real, is often currently handled by a decidedly non-data-driven method, and will benefit greatly even from a mediocre model.

Predictive models can be applied to a wide variety of circumstances; however, in my experience they work best when there are many decisions to be made.

Academic Statistics vs. The Real World

I received my PhD many years ago and have learned a lot since then about the practical application of statistics to business problems. My experience comes from several positions at notable companies and from serving as the president of my own consulting firm (www.equitydecisions.com). I have interacted with leaders from many Fortune 500 companies in my various roles, and I find the following two stories illustrative of what I have learned about how statistics interacts with business.

Story 1: Early in my career, having just earned my PhD, I had a meeting with some very good predictive modelers. During the meeting, they described their strategy for attacking a well-known problem involving multiple observations on the same individual. Their solution did not address the correlation between those repeated observations, and I was quick to point out the flaw (the problem may not be obvious to the reader, but to a die-hard statistician it will be). Coming from physics and computer science backgrounds, they probably did not appreciate the statistical implications of their approach and could not follow my reasoning. At the same time, I was unable to articulate the practical consequences of the oversight, and they were not convinced it had a material impact on the predictive model. After some thought and research, I discovered that their approach was not as bad as it originally seemed: imperfect, but with some reasonable statistical properties. What I grappled with was figuring out whether the flaw in their approach actually made a difference in the ROI of the product.

Story 2: During an interview I had with a very well known credit card processor, the interviewer asked me how I would estimate probabilities from a linear regression model fit on binary outcome data (data with a "yes" or "no" outcome, such as whether a customer defaulted on a credit card). I was quick to point out that the model he proposed was inappropriate for binary outcome data, but that if a logistic regression model (the more appropriate model) had been fit, the answer was simple, and I proceeded to write it down for him. I also indicated what the answer would be if the linear model had been fit.
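To make the mechanics of Story 2 concrete, here is a minimal sketch in Python (the coefficient values are made up purely for illustration). A fitted logistic regression reports log-odds, so the inverse-logit transform recovers the probability; a linear model fit on 0/1 outcomes reports the "probability" directly, but the prediction can stray outside [0, 1]:

    import numpy as np

    def logistic_probability(intercept, coefs, x):
        # Logistic regression estimates log-odds, so the inverse-logit
        # transform recovers the probability: p = 1 / (1 + exp(-eta)).
        eta = intercept + np.dot(coefs, x)      # linear predictor (log-odds)
        return 1.0 / (1.0 + np.exp(-eta))

    def linear_probability(intercept, coefs, x):
        # A linear model fit on 0/1 outcomes predicts the probability
        # directly, but the prediction can fall outside [0, 1], which is
        # one reason the model is inappropriate for binary data.
        return np.clip(intercept + np.dot(coefs, x), 0.0, 1.0)

    # Made-up coefficients for two predictors, purely for illustration
    x = np.array([1.2, 0.4])
    print(logistic_probability(-2.0, np.array([0.8, 1.5]), x))   # ~0.39
    print(linear_probability(0.1, np.array([0.05, 0.2]), x))     # 0.24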

These two stories point to a very common situation in business settings: theory is not always followed exactly. Many times an inappropriate model is used to solve a problem because the time and resources needed to implement the correct solution are not readily available, or the correct solution has not yet been invented or programmed. So I was forced to answer the question: what is the trade-off between the wrong model and the time it takes to research and implement the correct model? If the wrong model works 90% as well as the correct model, and the correct model would save $1 million a day, then each day spent researching the correct solution instead of deploying the imperfect one forgoes roughly $900,000 in savings. During that time, the modelers might instead have solved another problem worth millions of dollars a day. So, as an academic statistician, I was forced to accept the inevitable: there are many imprecise solutions that can save (or make) companies money.

Building a predictive model involves several clear steps.

First, the modeler needs a good idea of where the data came from, the problems it may have, and the history of the data storage and collection procedures. Once this basic knowledge of the data is obtained, exploratory analysis is conducted on the response and predictor variables of interest to look for general patterns and to investigate the quality of the data. Particular attention should be paid to missing or invalid records. After the basics of data exploration are exhausted, each variable should be investigated relative to the outcome of interest. This process uncovers each variable's importance with regard to the outcome and gives key indications of which transformations of the data will be appropriate.
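As a minimal sketch of what this exploration step might look like in Python (the file and column names are hypothetical; only the pattern matters):

    import pandas as pd

    # Assumes a table with a binary column "outcome" and arbitrary predictors.
    df = pd.read_csv("modeling_data.csv")   # hypothetical file

    # 1. Data quality: fraction of missing records per variable
    print(df.isna().mean().sort_values(ascending=False))

    # 2. Each predictor relative to the outcome
    for col in df.columns.drop("outcome"):
        if df[col].dtype == object:
            # Categorical: the outcome rate by level shows predictive signal
            print(df.groupby(col)["outcome"].agg(["mean", "size"]))
        else:
            # Numeric: the outcome rate by decile hints at useful transformations
            deciles = pd.qcut(df[col], 10, duplicates="drop")
            print(df.groupby(deciles, observed=True)["outcome"].mean())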

Next comes the actual model building, which involves creating cross-validation samples and determining the outcomes on which the model should be optimized. The modeling process involves selecting variables, investigating alternate transformations, investigating interactions, and investigating alternate model specifications. Along the way, there will be performance statistics you will try to optimize.
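For instance, a minimal cross-validation loop in Python might look like the following sketch (scikit-learn and the synthetic stand-in data are my own choices for illustration, not something prescribed above):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Stand-in data; in practice X and y come from the cleaned dataset above.
    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

    # Scaling and fitting live in one pipeline so every step is re-fit on
    # each training fold, avoiding leakage across folds.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

    # The scoring metric should be the statistic the business is optimizing;
    # ROC AUC is a common default for binary outcomes.
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print("AUC per fold:", scores, "mean:", scores.mean())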

Once the model is completed, there is usually a process of explaining the model and justifying its value to the business. This involves a cost justification of the model in terms of implementation and maintenance costs, weighed against its value to the business, or ROI (return on investment).

After everyone is satisfied that the model is appropriate for the problem, it must be implemented in a way that allows it to be used regularly. This implementation process involves quality control to make sure all of the calculations and transformations are handled correctly when coded into software.



After the model is in use in a business process, its performance and quality must be monitored over time. Additionally, the model may need to be updated occasionally as new data become available or the business usage changes.
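One common monitoring check, though not one named above, is the population stability index, which flags when the distribution of a model input or score has drifted away from what the model was built on. A minimal sketch:

    import numpy as np

    def population_stability_index(expected, actual, bins=10):
        # Compare a baseline sample of a variable (e.g. scores at deployment)
        # with a recent sample of the same variable. Common rules of thumb:
        # PSI < 0.1 stable, 0.1-0.25 watch closely, > 0.25 investigate.
        cuts = np.quantile(expected, np.linspace(0, 1, bins + 1))
        cuts[0], cuts[-1] = -np.inf, np.inf                  # cover full range
        e = np.histogram(expected, cuts)[0] / len(expected)
        a = np.histogram(actual, cuts)[0] / len(actual)
        e = np.clip(e, 1e-6, None)                           # avoid log(0)
        a = np.clip(a, 1e-6, None)
        return np.sum((a - e) * np.log(a / e))

    # Hypothetical: model scores at deployment vs. scores this month
    baseline = np.random.default_rng(0).normal(0.0, 1.0, 10000)
    current = np.random.default_rng(1).normal(0.3, 1.1, 10000)
    print(population_stability_index(baseline, current))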

Pitfalls with Automated Approaches to Predictive Modeling

Most major software vendors that offer predictive modeling solutions have automated many of these tasks. This may work if your data is in perfect order, but I have found that data is almost never in perfect order, and automated approaches will almost certainly latch onto problems in it. Here are some things to look out for:

  • The response variable is correlated with time, and so is the missing category of a predictor variable.
  • A predictor variable contains information that only becomes known after the point in time at which you need to make the prediction.
  • A predictor variable contains information about your outcome (I call this feedback; see the sketch after this list).
  • One of your predictor variables has too many levels to be fit by traditional methods because some levels occur too infrequently.

Of course this list is in no way exhaustive, but it may shed some light on the pitfalls of automated systems.
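A crude screen for the feedback item above is to check whether any single predictor tracks the outcome almost perfectly on its own. The function below is a hypothetical sketch of that idea, assuming a numeric 0/1 outcome column:

    import pandas as pd

    def feedback_screen(df, outcome, threshold=0.95):
        # A predictor whose correlation with the outcome is near 1 usually
        # contains the outcome itself (or is recorded after the fact).
        y = df[outcome]
        suspects = {}
        for col in df.columns.drop(outcome):
            x = pd.to_numeric(df[col], errors="coerce")
            if x.nunique() < 2:
                continue            # constant or entirely non-numeric column
            r = abs(x.corr(y))
            if r > threshold:
                suspects[col] = r
        return suspects

No automated check replaces knowing when each field is actually populated, which is the only reliable way to catch predictors containing information from after the prediction point.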

6 comments:

Anonymous said...

Dr. Speights, how can you best work around the problem when a predictor variable has too many levels to be fit by traditional methods?

David Speights said...

One way that I have dealt with this problem is to use a Bayesian approach. Essentially, you place a prior distribution on the parameters for the variable's levels (b1, b2, ..., bk). Assume b1, ..., bk follow some common distribution and then use the actual observations to estimate the values. This allows sparse cells to sit near the mean of the prior distribution, while cells with plenty of data approach what their estimates would have been under a traditional approach.
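In Python, a minimal empirical-Bayes sketch of this idea might look as follows (the prior weight k stands in for the precision of the prior distribution; names and numbers are illustrative):

    import numpy as np
    import pandas as pd

    def shrunken_level_means(values, levels, k=50.0):
        # Each level's estimate is a weighted average of its own mean and
        # the grand mean. Sparse levels sit near the prior (grand) mean;
        # well-populated levels keep roughly their traditional estimate.
        df = pd.DataFrame({"y": values, "level": levels})
        grand_mean = df["y"].mean()
        stats = df.groupby("level")["y"].agg(["mean", "size"])
        return (stats["size"] * stats["mean"] + k * grand_mean) / (stats["size"] + k)

    rng = np.random.default_rng(0)
    levels = rng.integers(0, 1000, 20000)    # 1,000 levels, many of them sparse
    y = rng.normal(levels % 7, 1.0)          # outcome tied to the level
    print(shrunken_level_means(y, levels).head())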

Simon Woodward said...

Hi, I liked your blog. I have a linear regression model which fits well to 4 separate data sets (which are from different geographical locations). The model predicts pasture growth rate from environmental and soil variables. But when I fit the model to the combined data, the predictions for the 4 data sets are biased. This makes me doubt that the model would give good predictions at a new site (with a presumably unknown bias). My interpretation is that there is another (unknown) site-related variable that I should be including. Have you come across this kind of problem? Thanks. Simon

David Speights said...

You could try to build a model using 3 of the 4 areas and evaluate it on the 4th. Then you could cycle through all 4, holding each one out for evaluation, and use the model that performs best across all of them (see the sketch at the end of this comment).

Finding variables related to the geographical locations may be tough since you only have 4 of them, which will restrict the values those additional variables can take. If you use the method above while also searching for such a variable, you may end up over-fitting the data.

Will you only be using the model on these 4 geographical areas, or could it be used on another completely new area?
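A minimal sketch of this hold-one-area-out scheme, using scikit-learn's leave-one-group-out splitter with stand-in data in place of the pasture measurements:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

    # Stand-ins: X = environmental/soil variables, y = growth rate,
    # groups = which of the 4 sites each observation came from.
    X, y = make_regression(n_samples=400, n_features=6, noise=10.0, random_state=0)
    groups = np.repeat([0, 1, 2, 3], 100)

    # Fit on 3 sites, evaluate on the held-out 4th, cycling through all 4.
    scores = cross_val_score(LinearRegression(), X, y, groups=groups,
                             cv=LeaveOneGroupOut(), scoring="r2")
    print("R^2 on each held-out site:", scores)

A low R^2 on a held-out site is exactly the symptom described above: the model is leaning on something site-specific that the measured variables do not capture.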

Simon Woodward said...

The model is supposed to be applied across the whole of New Zealand. One of the data sets does include sites across the whole country; the other 3 are from two different sites. I would have thought the model fitted to the first (nationwide) data set would work for the other 3, but this is not the case. There is also an issue with two of the data sets being collected in summer and two in spring, so as well as site there is a time issue. The data include weather, soil fertility and moisture properties, but these are evidently not the only differences between sites. I just find it an interesting philosophical problem, and was looking on the web for some thoughts on why a model fitted to one set of data so rarely seems to predict well for other data sets, even when you think you've captured the key variables. There could be an unmeasured variable causing the differences (e.g. presence of insect pests), or maybe it's just the nature of predictive modelling! All the best.

Unknown said...

Dr. Speights

I'm a senior in electrical engineering at the University of Illinois Urbana-Champaign. For my senior design project, I have chosen to build a predictive model of power outage restoration. The basic idea of this model is to be able to
1. predict the number of power outages for a given approaching storm
2. predict the duration of the power outages for that storm
3. predict the damages to the power system for that storm

I think I have the basic steps ready but I'm not sure where to go next and was wondering if you could help me with this.

I have gathered historical outage data (from a local power company) and historical weather data (from the university atmospheric department) for storms that have affected Illinois since 2006.

The historical outage data are as follows:

1. number of outages per hour of the storm
2. outage type per hour of the storm
3. damage type per hour of the storm

The historical weather data are as follows:

1. temperature per hour of the storm
2. precipitation per hour of the storm
3. wind speed per hour of the storm
4. distance between storm front and the center of a major city per hour


What should I do next?
How can I utilize this information to build the predictions and model them for future storms?
How would SAS/Statistics help me achieve my goals?


Thank you so much for your help

Tamer