If so, the journey to realizing AI and machine learning for your credit and lending decisions essentially begins with data. Handling and preparing a massive amount of data for AI and machine learning can be uncharted territory for many.
If you’re looking to partner with a company like Trust Science to use AI and machine learning for your credit decisioning, there are some fundamentals to adopt and ‘data’ mistakes to avoid to help you adopt new tech faster. In this blog we cover how to overcome 4 common data mistakes keeping you from adopting AI and machine learning:
- Poor categorization of different data types
- Not having a data dictionary or data ontology
- Lack of time-stamps in the data
- Improper use of the LOS and/or LMS
This is the last blog of our 4-part blog series on structured and unstructured data in data-driven underwriting. Our previous 3 blogs covered structured and unstructured data, data ontology, and time-series data in lending.
1. Poor Categorization of Different Data Types
The first data mistake is simply a lack of categorization of the types of data in your datasets, i.e. whether a particular dataset is comprised of structured, semi-structured or unstructured data. (Read blog part 1 on data categorization.) We recommend that you:
- Review historical records and/or corporate data sets
- Recognize their different data types and
- Appropriately tag the data as structured, semi-structured or unstructured.
The outcomes of these steps determine which AI/ML algorithms can be adapted for extracting business insights from the data. For underwriting purposes, we consider semi-structured data to be unstructured data. In layman’s terms, examples of structured data are data that fit into MS Excel sheets, and examples of unstructured data are email conversations, audio or speech data, text messages, and images.
In credit underwriting, bureau data is sometimes mistaken for structured data. For example, PDF files, picture scans, and bank statement scans are all types of unstructured data. This unstructured data needs to be converted into structured data.
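The tagging step recommended above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Trust Science's method: the extension lists and the function name `tag_data_type` are hypothetical, and a real inventory would inspect file contents rather than rely on extensions alone. By default, semi-structured formats are folded into the unstructured bucket, following the underwriting convention described above.

```python
from pathlib import Path

# Hypothetical mapping from file extension to data category.
# A real inventory would also inspect content, not just the extension.
STRUCTURED = {".csv", ".xlsx", ".xls", ".parquet"}
SEMI_STRUCTURED = {".json", ".xml"}


def tag_data_type(filename: str, fold_semi_structured: bool = True) -> str:
    """Return 'structured', 'semi-structured', or 'unstructured' for a file name."""
    ext = Path(filename).suffix.lower()
    if ext in STRUCTURED:
        return "structured"
    if ext in SEMI_STRUCTURED:
        # For underwriting, semi-structured data is treated as unstructured.
        return "unstructured" if fold_semi_structured else "semi-structured"
    # PDFs, scans, emails, audio, and images all land here.
    return "unstructured"


# Hypothetical file inventory used only for illustration.
inventory = [
    "loans_2021.csv",
    "bureau_report.pdf",
    "bank_statement_scan.png",
    "application.json",
]
tags = {name: tag_data_type(name) for name in inventory}
```

Anything tagged as unstructured in such an inventory would then be a candidate for conversion into structured data.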
2. No Data Dictionary
The next data mistake is not having proper documentation for all of your data and what the data means. If you’re working with a partner to use your data to build a custom score with machine learning and AI, then it’s crucial to have an ontology (read blog part 2 on Ontology), or shared agreement on the meaning of what the data represents.
Other crucial benefits of proper data documentation are: reduction in time to understanding the business problems being solved; effective feature selection for modeling; and adequate alignment of AI/ML output and the business problems solved. At Trust Science, we work with clients and prospects to reach a mutual agreement on what the data means.
For example, a lender provides the numerical amount of 500 in the field labelled “income”. However, income comes in many forms:
- Aspect of pay, e.g. net vs gross
- Time period, e.g. monthly vs annual
- Currency, e.g. US dollars vs Japanese yen
Having a data dictionary would help both parties understand that income is defined as, for example, the annual net income expressed in US dollars.
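As a rough sketch of what one data dictionary entry could look like in code, the Python below records the three attributes from the income example. The class `FieldDefinition`, the `DATA_DICTIONARY` structure, and the helper `to_annual` are all hypothetical names chosen for illustration; an actual data dictionary may just as well live in a spreadsheet or a schema file.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FieldDefinition:
    """One data dictionary entry (hypothetical structure)."""
    name: str
    description: str
    aspect: str    # "net" or "gross"
    period: str    # "monthly" or "annual"
    currency: str  # ISO 4217 code, e.g. "USD"


# The agreed meaning of "income": annual net income in US dollars.
DATA_DICTIONARY = {
    "income": FieldDefinition(
        name="income",
        description="Borrower income as agreed between lender and partner",
        aspect="net",
        period="annual",
        currency="USD",
    ),
}

PERIODS_PER_YEAR = {"monthly": 12, "annual": 1}


def to_annual(value: float, definition: FieldDefinition) -> float:
    """Convert a raw value to an annual figure using the documented period."""
    return value * PERIODS_PER_YEAR[definition.period]


income_def = DATA_DICTIONARY["income"]
annual_if_annual = to_annual(500, income_def)

# The same raw 500 means something very different if the field is monthly.
monthly_def = replace(income_def, period="monthly")
annual_if_monthly = to_annual(500, monthly_def)
```

The same raw value of 500 yields an annual figure of 500 under one definition and 6000 under the other, which is why both parties need to agree on the definition before modeling.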
3. Lack of Time-stamps in Data
The third common data mistake is a lack of time-stamps in your data. Time-stamps matter from a compliance standpoint because they help ensure that the correct data is used for credit scoring. They also make entity resolution easier and help ensure that only customer data available at loan application time is used for credit scoring, a core compliance requirement. In addition, time-stamped data (read blog part 3 on time stamps) helps us provide you with accurate and predictable custom scores through machine learning and AI, because it provides greater clarity when using or interpreting the data.
For example, a lender may have an LOS or LMS that includes borrower credit history information within the customer record. Information like bureau files is stored in the customer record rather than in the loan record. Fast-forward six months after the customer received a loan, and you’ll usually find that either the file has been replaced in the LOS/LMS customer record or the file exists in two places: the customer record and the loan record.
There are no time-stamps to clearly distinguish between the two files and this presents a problem if you’re looking to build a custom score. You wouldn’t know which file was used at the time of origination to best interpret the data in building the custom score through machine learning and AI.
Therefore it’s crucial to consider time-stamps in your data and how to capture time-stamps in your LOS or LMS in order to achieve optimized accuracy in your custom credit score built with machine learning.
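To illustrate why time-stamps resolve the two-file problem, here is a minimal Python sketch that picks the bureau file that was available at origination. The record layout, the file identifiers, the dates, and the function name `file_at_origination` are assumptions made for the example and do not reflect any particular LOS or LMS.

```python
from datetime import datetime
from typing import List, Optional, Tuple

# Hypothetical bureau-file records: (pulled_at, file_id).
bureau_files: List[Tuple[datetime, str]] = [
    (datetime(2023, 1, 10, 9, 30), "bureau_A"),   # pulled before origination
    (datetime(2023, 7, 15, 14, 0), "bureau_B"),   # pulled six months later
]


def file_at_origination(
    files: List[Tuple[datetime, str]], originated_at: datetime
) -> Optional[str]:
    """Return the most recent file pulled at or before origination, or None."""
    eligible = [(ts, fid) for ts, fid in files if ts <= originated_at]
    if not eligible:
        return None
    # Tuples compare by time-stamp first, so max() picks the latest eligible file.
    return max(eligible)[1]


originated_at = datetime(2023, 1, 12)
selected = file_at_origination(bureau_files, originated_at)
```

With time-stamps in place, the later file is excluded automatically, so only information known at origination feeds the custom score.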
4. Improper use of an LOS/LMS
The last data mistake on your road to adopting AI and machine learning is the improper use of an LOS or LMS. We’ve already mentioned one improper use of an LOS/LMS above, the replacement of data, but there could be many more ways you’re unknowingly introducing issues into your data through how you or your loan officers use the system.
Another example of improper use of an LOS/LMS is updating the interest rate to 0% in the system after a loan has been fully paid off. Any field that is updated without a time-stamped record of its original value is at risk of being unusable in a model. A column for interest rates seems fairly straightforward: it is expressed as an annual percentage rate (APR). However, this small detail can cause the data to be interpreted inaccurately, categorizing a loan as good if its APR is zero and as bad if it isn’t.
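One way to keep the original value recoverable is to append changes rather than overwrite them. The Python sketch below shows the idea with a hypothetical `TrackedField` class; the APR values and dates are invented for illustration, and most LOS/LMS products would expose this through their own audit-trail features rather than custom code.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, List, Tuple


@dataclass
class TrackedField:
    """Keeps every value a field has held, with the time each was recorded."""
    history: List[Tuple[datetime, Any]] = field(default_factory=list)

    def set(self, value: Any, recorded_at: datetime) -> None:
        # Append instead of overwriting so earlier values stay recoverable.
        self.history.append((recorded_at, value))

    def current(self) -> Any:
        """Latest recorded value (raises ValueError if nothing was recorded)."""
        return max(self.history, key=lambda entry: entry[0])[1]

    def original(self) -> Any:
        """Earliest recorded value (raises ValueError if nothing was recorded)."""
        return min(self.history, key=lambda entry: entry[0])[1]


apr = TrackedField()
apr.set(0.249, datetime(2022, 3, 1))  # hypothetical APR at origination
apr.set(0.0, datetime(2023, 3, 1))    # zeroed out after the loan was paid off

apr_for_model = apr.original()
apr_in_system = apr.current()
```

A model trained on `apr_for_model` sees the rate the borrower actually received, instead of a zero that merely encodes the fact that the loan was repaid.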
To reduce some of these inaccuracies, having a separate LOS and LMS could be an option. This separates the information used at the time of origination from the data that becomes available after the loan has started. The former is then easily identified as loan origination information for the purpose of building a machine learning model. However, if the systems are separate and a borrower has different unique identifiers in the LOS and the LMS, it will complicate the data. A computer would have trouble identifying the same John Smith in both systems.
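A common way to handle differing identifiers is to maintain an explicit crosswalk between the two systems. The following Python sketch assumes such a mapping exists; the identifiers, field names, and the function `link_borrower` are all hypothetical.

```python
# Hypothetical crosswalk from LOS applicant IDs to LMS borrower IDs.
LOS_TO_LMS = {"APP-1001": "BRW-77", "APP-1002": "BRW-78"}

# Hypothetical records held in each system.
los_records = {"APP-1001": {"name": "John Smith", "stated_income": 52000}}
lms_records = {"BRW-77": {"name": "John Smith", "days_past_due": 0}}


def link_borrower(los_id: str) -> dict:
    """Combine origination-time and servicing-time data for one borrower."""
    if los_id not in LOS_TO_LMS:
        # Fail loudly rather than guessing a match by name.
        raise KeyError(f"No LMS identifier mapped for LOS id {los_id!r}")
    lms_id = LOS_TO_LMS[los_id]
    return {
        "los_id": los_id,
        "lms_id": lms_id,
        "origination": los_records[los_id],
        "servicing": lms_records[lms_id],
    }


linked = link_borrower("APP-1001")
```

Matching on an explicit identifier mapping avoids relying on names, which is exactly where two different John Smiths would be confused.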
We hope you’ve gained a greater understanding of the 4 common data mistakes made by lenders looking to introduce AI and machine learning into their underwriting. We recommend that you revisit your business processes and data with these considerations in mind and ensure you’re setting your business up for success if AI/ML is part of your business strategy.
Considering alternative credit decisioning? Click below to learn more about automated credit underwriting.