Predictive Analytics and Data Modeling

Creascience provides support for all phases of building a reliable prediction model. We believe that four steps require specific attention:

To Prepare a Clean and Informative Dataset

To start with, a lot of attention should be paid to the data preparation phase. It is illusory to believe that any statistical model can deal perfectly with issues such as missing values, redundant information, and variables with too many categories. Many models can address some of these issues to some degree, but they will perform much better if the data have first been prepared adequately. In the same way, it is illusory to believe that one can build a good model without any information on the data and the context of application. When the time comes to remove or keep specific variables, it is crucial to know what each one measures and how reliable it is. Therefore, we consider that adequate preparation of the data to be analyzed should never be overlooked.

To Select an Adapted Prediction Criterion

Most ways of measuring the predictive ability of a model are based on the idea that different data should be used to build the model and to assess its performance. There are, however, several ways of implementing this idea, for example a single hold-out sample or cross-validation, and practical constraints sometimes also have to be accounted for in the definition of the objective.

To Compare the Performance of Several Methods

There is no such thing as an overall best modeling technique. The right choice depends first on the type of response one tries to predict, but also on the context and ultimately on the data themselves. Therefore, we never limit our investigations to a single type of model. We believe that the key to successful predictive analytics lies in testing the performance of several types of models in order to select the one that performs best in a given context.

To Provide Actionable Results

Building a predictive model is rarely limited to providing a set of predicted values. First, the model itself might need to be delivered in a usable form. Second, model performance can rarely be summarized with a single measure: a model typically works well for some data and not so well for others. Last but not least, many recent modeling techniques appear as "black boxes" when one tries to understand what the most important drivers or predictors are and how they affect the response. We make sure to address these issues, notably by providing uncertainty estimates for the predictions, together with understandable plots and tables illustrating the relationship between the response and the key predictors.
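
To make the ideas of assessing a model on data it was not built on, and of comparing several methods, more concrete, here is a minimal sketch in Python. It is an illustration only and not a description of Creascience's actual workflow: the data are synthetic, the two candidate models (a mean baseline and a one-predictor linear regression) are deliberately simple, and all function names (`k_fold_indices`, `fit_mean`, `fit_line`, `cv_rmse`) are hypothetical helpers introduced for this example. The criterion assumed here is the cross-validated root mean squared error; other criteria may suit other objectives better.

```python
import random
import statistics


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_mean(xs, ys):
    """Baseline model: always predict the mean response of the training data."""
    m = statistics.fmean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Ordinary least squares with a single predictor."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def cv_rmse(fit, xs, ys, k=5, seed=0):
    """Cross-validated RMSE: each fold is held out once and predicted
    by a model built on the remaining folds only."""
    sq_errors = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        sq_errors += [(ys[i] - model(xs[i])) ** 2 for i in held_out]
    return (sum(sq_errors) / len(sq_errors)) ** 0.5


# Synthetic data (an assumption for illustration): linear trend plus noise.
rng = random.Random(42)
xs = [i / 10 for i in range(100)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]

# Evaluate every candidate with the same folds, then keep the best one.
candidates = {"mean baseline": fit_mean, "linear regression": fit_line}
scores = {name: cv_rmse(fit, xs, ys) for name, fit in candidates.items()}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: cross-validated RMSE = {score:.3f}")
print("selected:", best)
```

Using the same folds for every candidate keeps the comparison fair, since each model is assessed on exactly the same held-out observations. In practice the list of candidates would contain richer model types, and the per-observation errors could also be inspected to see where a model works well and where it does not.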