Sunday, August 24, 2008

The Basics of Predictive Modeling
(and most data mining exercises)

The general idea of predictive modeling is simple: given all of the information available, predict with as much accuracy as possible a future outcome. The following questions are of particular interest.

  • What information can be used in a predictive model and how?
  • What variables can be predicted?
  • What structures are available for a model?
  • How is accuracy determined?
  • How is a predictive model used in the real world?

Predictive modeling is all about making practical use of real data to solve real problems. In most cases the data are not clean, the model is not quite perfect, and the data often violate some statistical assumptions. On the other hand, the problem the predictive model solves is usually very real, often currently solved by a very non-data-driven method, and will benefit greatly even from a mediocre model.

Predictive models can be applied in a wide variety of circumstances; however, in my experience they work best when there are many decisions to be made.

Academic Statistics vs. The Real World

I received my PhD many years ago and have learned a lot since then about the practical application of statistics to solving business problems. My experience comes from several positions at notable companies and from serving as the president of my own consulting firm (www.equitydecisions.com). I have interacted with leaders from many Fortune 500 companies in my various roles, and I find the following two stories illustrative of my realizations about how statistics interacts with business.

Story 1: Early in my career, having just earned my PhD, I had a meeting with some very good predictive modelers. During the meeting, they told me their strategy for attacking a very well known problem which involved using multiple observations on the same individual. Their solution did not address the correlation between subsequent observations on the same individual, and I was quick to point out the flaw (this problem may not be obvious to the reader, but to a die-hard statistician it will be). They could not understand my reasoning; their background was in physics and computer science, and they probably did not appreciate the statistical implications of their approach. For my part, I was unable to articulate the practical consequences of the oversight, and they were not convinced it had a material impact on the predictive model. After some thought and research, I discovered that their approach was not as bad as it originally seemed. It was not perfect, but it did have some reasonable statistical properties. What I grappled with was figuring out whether the flaw in their approach actually made a difference in the ROI of the product.
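
For readers curious about the statistical issue in Story 1: with repeated observations, an ordinary regression treats every row as independent, while a mixed model adds a per-subject effect to absorb the within-person correlation. Here is a minimal sketch of that general issue on synthetic data (statsmodels assumed; this illustrates the concept, not the modelers' actual system):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Synthetic repeated-measures data: 100 subjects, 5 observations each.
    rng = np.random.default_rng(0)
    n_subj, n_obs = 100, 5
    subject = np.repeat(np.arange(n_subj), n_obs)
    x = rng.normal(size=n_subj * n_obs)
    subj_effect = rng.normal(size=n_subj)[subject]  # within-subject correlation
    y = 2.0 * x + subj_effect + rng.normal(size=n_subj * n_obs)
    df = pd.DataFrame({"y": y, "x": x, "subject": subject})

    # Naive OLS treats all 500 rows as independent; the mixed model adds a
    # random intercept per subject, separating within- and between-subject
    # variation. Compare the two fits and their standard errors.
    ols = smf.ols("y ~ x", data=df).fit()
    mixed = smf.mixedlm("y ~ x", data=df, groups=df["subject"]).fit()
    print(ols.bse["x"], mixed.bse["x"])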

Story 2: During an interview with a very well known credit card processor, the interviewer asked me how I would estimate probabilities from a linear regression model fit to binary outcome data (data with a "yes" or "no" outcome, such as whether someone defaulted on a credit card). I was quick to point out that the model he proposed was inappropriate for binary outcome data, but that if a logistic regression model (the more appropriate model) had been fit, the answer was simple, and I proceeded to write it down for him. I also indicated what the answer would be if the linear model had been fit.
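
For the curious, the arithmetic behind that interview answer looks roughly like this. A minimal sketch with made-up coefficients (the names and numbers are purely illustrative):

    import math

    # Hypothetical fitted coefficients (illustrative only, not from a real model).
    intercept, slope = -3.0, 0.8
    x = 2.5  # value of a single predictor

    # Logistic regression: the linear predictor is a log-odds, so the
    # probability comes from the inverse-logit transform.
    log_odds = intercept + slope * x
    p_logistic = 1.0 / (1.0 + math.exp(-log_odds))

    # Linear regression fit to a 0/1 outcome: the fitted value is itself the
    # probability estimate, but it can fall outside [0, 1] and must be clipped.
    p_linear = min(max(intercept + slope * x, 0.0), 1.0)

    print(p_logistic, p_linear)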

These two stories point to a very common practice in business settings: theory is not always followed exactly. Many times an inappropriate model is used to solve a problem because the time and resources required to implement the correct solution are not readily available, or the correct solution has not yet been invented or programmed. So I was forced to answer the question: what is the trade-off between the wrong model and the time it takes to research and implement the correct model? If the wrong model works 90% as well as the correct model and saves millions of dollars a day, each day spent on further research could cost hundreds of thousands of dollars, during which time the modelers might instead have solved another problem that could also save millions of dollars a day. So as an academic statistician, I was forced to accept the inevitable: there are many imprecise solutions that can save (or make) companies money.
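
To make that trade-off concrete, here is the back-of-the-envelope arithmetic with hypothetical figures:

    # Hypothetical figures for the trade-off described above.
    correct_model_savings = 1_000_000  # dollars per day if the correct model ran
    wrong_model_share = 0.90           # the quick, imperfect model captures 90%

    # Every day spent researching the correct model instead of running the
    # imperfect one forgoes the savings the imperfect model would deliver.
    daily_cost_of_delay = correct_model_savings * wrong_model_share
    print(daily_cost_of_delay)  # 900000 -- hundreds of thousands of dollars a day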

Building a predictive model involves several clear steps.

First, the modeler needs a good idea of where the data came from, the problems it may have, and the history of the data storage and collection procedures. Once this basic knowledge of the data is obtained, exploratory analysis is conducted on the response and predictor variables of interest to look for general patterns and to investigate the quality of the data. Particular attention should be paid to missing or invalid records. After the basics of data exploration are exhausted, each variable should be investigated relative to the outcome of interest. This process will uncover each variable's importance with regard to the outcome and give key indications of which transformations of the data will be appropriate.
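
A rough sketch of that exploration step, assuming pandas and a dataset with a binary outcome column (all file and column names here are made up):

    import pandas as pd

    # Hypothetical file with a binary outcome "default" and predictor "balance".
    df = pd.read_csv("accounts.csv")

    # Basic data quality checks: missing and invalid records.
    print(df.isna().sum())            # missing values per column
    print((df["balance"] < 0).sum())  # e.g., impossible negative balances

    # Each predictor relative to the outcome: bin the predictor and look at
    # the outcome rate per bin to gauge importance and suggest transformations.
    bins = pd.qcut(df["balance"], 10, duplicates="drop")
    print(df.groupby(bins)["default"].mean())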

Next comes the actual model building, which involves creating cross validation samples and determining the outcomes on which the model should be optimized. The modeling process will involve selecting variables, investigating alternate transformations, investigating interactions, and investigating alternate model specifications. Along the way, there will be performance statistics which you will try to optimize.
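
A minimal sketch of that cross-validation step, using scikit-learn on synthetic stand-in data (in practice X and y would come from the explored dataset):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-in data: two illustrative predictors, one binary outcome.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 2))
    y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)

    # Five-fold cross-validation on one candidate specification; the AUC is
    # one of the performance statistics you would try to optimize.
    model = LogisticRegression(max_iter=1000)
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(scores.mean(), scores.std())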

Once the model is completed, there is usually a process of explaining the model and justifying its value to the business. This will involve a cost justification of the model in terms of implementation and maintenance costs, along with its value to the business, or ROI (return on investment).
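
The justification itself is often simple arithmetic. A hypothetical example (all figures invented):

    # Hypothetical figures for a first-year ROI justification.
    annual_savings = 2_000_000     # value the model adds per year
    implementation_cost = 400_000  # one-time cost to build and deploy
    annual_maintenance = 100_000   # monitoring, retraining, infrastructure

    total_cost = implementation_cost + annual_maintenance
    roi = (annual_savings - total_cost) / total_cost
    print(f"First-year ROI: {roi:.0%}")  # 300%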

After everyone is satisfied that the model is appropriate for the problem, it must be implemented in a way that it can be used regularly. This implementation process will involve quality control to make sure all of the calculations and transformations are handled correctly when coded into software.
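
One common quality-control check is to score the same records through the development code and the production implementation and compare the results; a sketch (file names hypothetical):

    import numpy as np

    # Scores for the same accounts from the development code and from the
    # production implementation, produced by parallel runs.
    dev_scores = np.loadtxt("dev_scores.txt")
    prod_scores = np.loadtxt("prod_scores.txt")

    # The two should agree to within numerical noise; a large discrepancy
    # usually means a calculation or transformation was miscoded.
    max_diff = np.max(np.abs(dev_scores - prod_scores))
    assert max_diff < 1e-6, f"implementation mismatch: max diff {max_diff}"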

Once the model is in use in a business process, its performance and quality must be monitored over time. Additionally, the model may need to be updated occasionally as new data becomes available or the business usage changes.
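
One widely used monitoring tool is the population stability index, which flags drift between the score distribution at development time and the distribution seen in production; a minimal sketch:

    import numpy as np

    def psi(expected, actual, n_bins=10):
        """Population stability index between a baseline score sample and a
        recent one; larger values indicate more drift."""
        # Bin edges come from the baseline (development) distribution.
        edges = np.percentile(expected, np.linspace(0, 100, n_bins + 1))
        edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range scores
        e = np.histogram(expected, bins=edges)[0] / len(expected)
        a = np.histogram(actual, bins=edges)[0] / len(actual)
        e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
        return float(np.sum((a - e) * np.log(a / e)))

    # A common rule of thumb: below 0.1 is stable, 0.1-0.25 bears watching,
    # and above 0.25 usually warrants investigation or a model refresh.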

Pitfalls with Automated Approaches to Predictive Modeling

Most major software vendors offering predictive modeling solutions have automated many of these tasks. While this may work if your data is in perfect order, I have found that your data is almost never in perfect order. Automated approaches will almost certainly grab onto problems in the data. Here are some things to look out for:

  • The response variable is correlated with time and so is the missing category of a predictor variable.
  • The predictor variables contain information that only becomes known after the point in time at which you need to make the prediction.
  • The predictor variable contains information about your outcome (I call this feedback; see the sketch after this list).
  • One of your predictor variables has too many levels to be fit by traditional methods because some levels occur infrequently.
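
A crude screen for the feedback problem in particular: any single predictor whose binned outcome rates swing from nearly 0 to nearly 1 deserves suspicion, since it may encode the outcome itself or information from after the prediction date. A sketch, reusing the hypothetical file and column names from the earlier examples:

    import pandas as pd

    df = pd.read_csv("accounts.csv")  # hypothetical file from earlier sketches

    # Flag numeric predictors that separate the outcome almost perfectly.
    for col in df.select_dtypes("number").columns.drop("default"):
        bins = pd.qcut(df[col], 10, duplicates="drop")
        rates = df.groupby(bins)["default"].mean()
        if rates.max() - rates.min() > 0.9:
            print(f"possible feedback/leakage in {col}")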

Of course this list is in no way exhaustive, but it may shed some light on the pitfalls of automated systems.