Don’t let data issues stop you from switching to machine learning
Everyone talks about the wonders of big data — everyone, that is, but data scientists. It’s their job to organize, cleanse, and prepare jillions of pieces of information to produce new data models, especially machine learning models, which can consume up to 100 times more data variables than traditional logistic regression models.
And maybe all this data prep work used to deter organizations from deploying machine learning in their modeling, but not any longer. The fact is, it's never been easier to put powerful, precise machine learning models into production. New technology tools (and the nature of ML itself) are eliminating many of the roadblocks to ML implementation.
Switching to machine learning can even make life as a modeler easier, because a single machine learning model can replace multiple logistic regression models, streamlining your modeling operations so you can adapt more quickly to changes in the market. One bank we worked with was looking to update its modeling portfolio and staring at the prospect of revising 110 models at a rate of 12 months each, or roughly 110 man-years for those keeping score at home. Working together, we were able to cut that development time down to 18 months. That kind of time saved allows data science and analytics teams to focus on higher-value work (like building awesome new models) instead of rote tasks like re-coding variables. Here are three reasons it's easier than ever to deploy ML models in your underwriting:
You don’t have to worry about clean or complete data
It used to be that gathering data for modeling was a bit like assembling a jigsaw puzzle. You needed all the pieces for an accurate picture.
Thanks to machine learning, that's no longer the case. Because ML models use orders of magnitude more data than traditional logistic regression, they're able to provide valid predictions despite missing variables. One or two or even twenty missing puzzle pieces won't prevent the model from forming a complete picture or introduce an unmanageable amount of instability. The goal is to collect as many "good enough" variables as possible and then feed them into your model.
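To make that concrete, here's a minimal sketch, using scikit-learn and synthetic stand-in data rather than anything from a real lending stack, of a gradient-boosted model training and scoring on rows with missing values (a vanilla logistic regression would reject the NaNs outright):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for lending data: 5,000 applicants, 50 variables.
X, y = make_classification(n_samples=5000, n_features=50, random_state=0)

# Knock out ~10% of values at random to mimic incomplete bureau data.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.10] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# HistGradientBoostingClassifier handles NaNs natively: each split learns
# which branch missing values should follow, so no imputation is required.
model = HistGradientBoostingClassifier(random_state=0).fit(X_train, y_train)
print(f"Holdout accuracy with 10% missing data: {model.score(X_test, y_test):.3f}")
```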
Being less exacting about data cleanliness means you're free to use more of your own data, as well as data from more sources than ever before, both of which offer a big performance boost. For example, a lender may be ingesting 1,100 data points from a credit bureau but end up using only 25 summary variables in a logistic regression model. A few of our customers have derived up to two-thirds of the incremental performance gain from ML (in the form of higher approval rates or lower charge-offs) just by using more of the data they already had on hand.
Getting ML models ready for production is easier and cheaper than it used to be
Building models used to go something like this: a modeler would compose a model from 30 or so variables in a statistics package like SAS, R, or Excel, and then hand it to a software engineer to re-code into a run-time language such as Java so it could operate in harmony with the organization's data storage and retrieval systems. This could take a year or more. It was common for data scientists to spend months just verifying that variables were coded correctly during re-coding. (Transforming raw data into model inputs is called feature engineering, and those steps must be translated faithfully for a model to behave the same way in production as it did in development.) Model operationalization, such as translating SAS models into Java, was time-consuming and risked producing a model that didn't run the way it was designed.
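The modern alternative is to make the feature engineering part of the model artifact itself. Here's a hedged sketch of that idea using scikit-learn's Pipeline and joblib (the specific steps are illustrative, not a prescribed recipe): the transformations fitted in development are serialized with the model, so production scores with exactly the same code.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# One artifact holds the feature engineering AND the model, so the exact
# transformations fitted in development are the ones that run in production.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
]).fit(X, y)

joblib.dump(pipeline, "underwriting_model.joblib")   # development side
scorer = joblib.load("underwriting_model.joblib")    # production side
print(scorer.predict_proba(X[:5])[:, 1])             # same behavior, nothing to re-code
```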
Unless you were really, really good at this kind of thing, such hand re-coding was all but impossible with ML models, with their thousands of variables and exponentially more intricate math. Fortunately, modern ML platforms from Microsoft (Azure ML), Google (Cloud AI), Amazon (SageMaker), open-source contributors, and others have greatly reduced the time and effort of building new ML models by automating the programming and data scutwork and letting you run it all on their servers through APIs. You can experiment with different kinds of models to your heart's content, with no dependency on a massive software engineering project.
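As a taste of that freedom, here's a small local sketch (the model families and metric are arbitrary choices for illustration) that swaps three kinds of models in and out and compares them with cross-validation in a handful of lines:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=30, random_state=2)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=2),
    "gradient_boosting": HistGradientBoostingClassifier(random_state=2),
}

# Trying a new model family is a one-line change, not an engineering project.
for name, model in candidates.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean AUC = {auc:.3f}")
```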
Explainable machine learning makes the data prep more efficient
A major block to adopting ML models in production has been the inability to explain the models well enough to get them past compliance and risk approvals. New advances in ML model explainability and monitoring are solving that, allowing more companies to get ML into production. But Zest's explainable AI has benefits in model development as well, letting you quantify, in dollars and cents, each feature's impact on a model. That gives you an economic rationale for choosing the most effective variables, saving you time and money when there are literally thousands of potential options. Say a slow input is dragging down your model's response time. Zest's explainability tools can help you weigh the cost of that lag against the investment required to ensure a fast response (assuming you have that luxury). Considering licensing a new data source? Zest's explainability tools will help you evaluate the ROI.
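One common way to compute per-feature impact looks like the sketch below, which uses the open-source shap library with an XGBoost model; Zest's own tooling is proprietary, and the dollar conversion here is a hypothetical placeholder for your own economics (say, approval-rate lift per unit of score):

```python
import numpy as np
import shap
import xgboost
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10, random_state=3)
model = xgboost.XGBClassifier(n_estimators=100, random_state=3).fit(X, y)

# TreeExplainer attributes each score to the input features (log-odds space).
shap_values = shap.TreeExplainer(model).shap_values(X)  # (n_samples, n_features)

# Rank features by mean absolute contribution, then translate into money
# using a hypothetical dollar value per unit of score impact.
impact = np.abs(shap_values).mean(axis=0)
value_per_point = 120.0  # illustrative figure, not a real benchmark
for i in np.argsort(impact)[::-1][:5]:
    print(f"feature_{i}: mean |SHAP| {impact[i]:.4f} (~${impact[i] * value_per_point:,.2f})")
```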
Zest AI's technology also provides continuous model monitoring, which allows you to assess how consistent your model inputs are over time, giving you additional insight into how much missing data or other mistakes affect the robustness of your underwriting scores. Our tools also keep track of how data is combined into features, the provenance of that data, and the transformations performed along the model-development pipeline, with plain-English descriptions of each step. The result is faster time to production and a more accurate implementation.
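A standard way to quantify input consistency over time is the population stability index (PSI). The sketch below is a generic, assumed implementation (not Zest's) that compares a variable's recent distribution against its development baseline:

```python
import numpy as np

def population_stability_index(baseline, recent, n_bins=10):
    """PSI between two samples of one model input. Common rule of thumb:
    < 0.1 stable, 0.1-0.25 worth watching, > 0.25 investigate."""
    # Bin edges come from the development baseline's quantiles.
    edges = np.quantile(baseline, np.linspace(0, 1, n_bins + 1))
    base_idx = np.clip(np.searchsorted(edges, baseline, side="right") - 1, 0, n_bins - 1)
    new_idx = np.clip(np.searchsorted(edges, recent, side="right") - 1, 0, n_bins - 1)
    base_pct = np.bincount(base_idx, minlength=n_bins) / len(baseline)
    new_pct = np.bincount(new_idx, minlength=n_bins) / len(recent)
    base_pct, new_pct = base_pct + 1e-6, new_pct + 1e-6  # avoid log(0)
    return float(np.sum((new_pct - base_pct) * np.log(new_pct / base_pct)))

# Example: a score-like input whose distribution drifted after deployment.
rng = np.random.default_rng(4)
dev_sample = rng.normal(650, 60, 10_000)    # distribution at development time
prod_sample = rng.normal(620, 80, 10_000)   # same input, six months later
print(f"PSI = {population_stability_index(dev_sample, prod_sample):.3f}")
```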
The triple play of data resiliency, better platform tools, and fully explainable AI makes now as good a time as any to deploy ML models for your underwriting.