This is the second blog in our four-part series on structured and unstructured data in big data loan underwriting. In the first blog of the series, we provided an overview of structured and unstructured data, their impact on data-driven underwriting, and how to integrate unstructured data into machine learning algorithms. The last two blogs of the series will cover the role of time-series data sets in lending and common data mistakes that hinder your ability to adopt AI and machine learning for credit underwriting.
The first blog covered the first step in integrating unstructured data: defining the objective of implementing data-driven underwriting. This blog covers the next step, data preprocessing (also called data cleaning), which requires understanding and implementing ontologies. This step is a precursor to using big data in loan underwriting and to building predictive machine learning credit underwriting models.
Ontology in Big Data and Machine Learning
What role does ontology play in big data loan underwriting and machine learning models? Ontologies give machines the ability to make sense of big data, which is a key foundation for how machines learn and understand. Without an understanding of the data, accurate and predictive machine learning models would be impossible.
What is an Ontology?
An ontology is a set of concepts and categories in a subject area or domain, expressed in a shared vocabulary that describes their properties and the relations between them. Ontologies focus on the meaning and shared understanding of data. The definitions of your data need to be the same for the entity sending the data (e.g. the lender) and the entity receiving the data (e.g. the machine learning credit scoring model).
For example, a lender provides the numerical amount of 500 in the field labelled “income”. However, income comes in many forms:
Aspect of pay, e.g. net vs gross
Time period, e.g. monthly vs annual
Currency, e.g. US dollars vs Japanese yen
The lender and the machine learning credit scoring model need to share the same definition of the “income” field, and of every other field in the big data used in your loan underwriting.
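As a minimal sketch of what a shared definition for “income” could look like in practice, the Python below attaches the three aspects listed above (pay basis, time period, currency) to the raw number and refuses to pass a value to the model when the definitions cannot be reconciled. The names `IncomeDefinition`, `MODEL_INCOME`, and `to_model_income`, and the choice of annual gross US dollars as the model's definition, are assumptions made for illustration and are not taken from any particular underwriting system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncomeDefinition:
    """What an 'income' number means to the party that holds it."""
    pay_basis: str  # "net" or "gross"
    period: str     # "monthly" or "annual"
    currency: str   # currency code, e.g. "USD" or "JPY"


# Hypothetical definition that the credit scoring model expects.
MODEL_INCOME = IncomeDefinition(pay_basis="gross", period="annual", currency="USD")

PERIODS_PER_YEAR = {"monthly": 12, "annual": 1}


def to_model_income(amount, source, target=MODEL_INCOME):
    """Convert a lender's income figure to the model's definition, or fail loudly."""
    # Net vs gross and currency cannot be fixed by simple arithmetic here,
    # so a mismatch is reported instead of silently guessed.
    if source.pay_basis != target.pay_basis:
        raise ValueError(f"pay basis mismatch: {source.pay_basis} vs {target.pay_basis}")
    if source.currency != target.currency:
        raise ValueError(f"currency mismatch: {source.currency} vs {target.currency}")
    # The time period is the one aspect that can be rescaled directly.
    annual_amount = amount * PERIODS_PER_YEAR[source.period]
    return annual_amount / PERIODS_PER_YEAR[target.period]


lender_income = IncomeDefinition(pay_basis="gross", period="monthly", currency="USD")
print(to_model_income(500, lender_income))
```

With these assumed definitions, the bare value 500 only becomes usable once the lender states that it means gross monthly US dollars. The same 500 reported in Japanese yen is rejected rather than scored as if it were dollars.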
An ontology also takes the use case into account when determining the appropriate relations or required actions. For example, you could easily create a CSV of tweets by recording the type of tweet, timestamp, content, and user name. If someone is only interested in what an individual has tweeted, the message itself is structured data. But if someone wants to interpret tweets to infer something from them, such as how often a negative statement was made about a company in an average month, that is a different use case for the same tweet. Ontologies factor in the use case when interpreting the meaning of unstructured data.
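The two tweet use cases can be sketched against the same CSV. In the hedged Python example below, the sample rows, the company name "Acme", and the `NEGATIVE_WORDS` list are all invented for illustration. The keyword match is only a stand-in for a real sentiment model, which the original text does not specify.

```python
import csv
import io
from collections import Counter

# Hypothetical tweet CSV using the four columns named in the text.
TWEET_CSV = """type,timestamp,content,user_name
original,2023-01-05T10:00:00,Loving the new Acme app,alice
reply,2023-01-09T12:30:00,Acme support is terrible,bob
original,2023-02-02T08:15:00,Acme outage again. Awful.,alice
original,2023-02-11T17:45:00,Great coffee this morning,bob
"""

# Illustrative stand-in for a sentiment model (an assumption, not a real lexicon).
NEGATIVE_WORDS = {"terrible", "awful", "worst"}


def load_tweets(text):
    """Parse the CSV text into a list of dicts, one per tweet."""
    return list(csv.DictReader(io.StringIO(text)))


def tweets_by_user(tweets, user_name):
    """Use case 1: the content column is read as-is, like any structured field."""
    return [t["content"] for t in tweets if t["user_name"] == user_name]


def negative_mentions_per_month(tweets, company):
    """Use case 2: the content column is interpreted as unstructured text."""
    counts = Counter()
    for t in tweets:
        words = {w.strip(".,!?").lower() for w in t["content"].split()}
        if company.lower() in words and words & NEGATIVE_WORDS:
            counts[t["timestamp"][:7]] += 1  # bucket by YYYY-MM
    return dict(counts)


tweets = load_tweets(TWEET_CSV)
print(tweets_by_user(tweets, "alice"))
print(negative_mentions_per_month(tweets, "Acme"))
```

The data is identical in both functions. Only the second one needs an agreed definition of what "negative" and "about a company" mean, which is the kind of meaning an ontology is meant to pin down.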
The Importance of Ontology in Big Data Loan Underwriting and Machine Learning
To develop a predictive machine learning credit underwriting model, we stress the importance of having the right big data. Another crucial step is data preprocessing, or cleansing, for machine learning models.
Yes, we must train models to understand concepts through supervised learning. However, cleaning all of your big loan data is a prerequisite to training a credit scoring model that produces accurate and predictive results. If you feed bad data into a machine learning credit underwriting model, you will inevitably get a bad model with poor results and predictions.
Therefore, an important step toward data-driven underwriting is for a lender to clean its big data, that is, the historical records of past loans. Data cleaning or preprocessing requires adherence to the data dictionary or ontology, so that the big data sent by the lender and the big data received by the machine learning credit scoring model mean the same thing to both parties. The benefit of ontologies is that they remove assumptions about the data, because both parties share the same understanding of the domain knowledge. This ultimately enhances data quality and produces better machine learning credit underwriting models, because the machines can make better sense of their data.
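To illustrate what adhering to a data dictionary during cleaning might involve, here is a rough Python sketch that checks historical loan records against an agreed set of fields before they reach model training. The dictionary contents, the field names, and the helpers `validate_record` and `split_clean` are hypothetical; a real data dictionary would be agreed between the lender and whoever builds the model.

```python
# Hypothetical data dictionary: field name -> (expected type, allowed values or None).
DATA_DICTIONARY = {
    "loan_id": (str, None),
    "income_annual_gross_usd": (float, None),
    "loan_status": (str, {"repaid", "defaulted"}),
}


def validate_record(record):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, (expected_type, allowed) in DATA_DICTIONARY.items():
        if field not in record or record[field] is None:
            problems.append(f"{field}: missing")
            continue
        value = record[field]
        if not isinstance(value, expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
        elif allowed is not None and value not in allowed:
            problems.append(f"{field}: unexpected value {value!r}")
    # Fields the dictionary does not define have no agreed meaning.
    for field in record:
        if field not in DATA_DICTIONARY:
            problems.append(f"{field}: not in data dictionary")
    return problems


def split_clean(records):
    """Separate conforming records from those that need review."""
    clean, rejected = [], []
    for record in records:
        problems = validate_record(record)
        if problems:
            rejected.append((record, problems))
        else:
            clean.append(record)
    return clean, rejected


past_loans = [
    {"loan_id": "A1", "income_annual_gross_usd": 52000.0, "loan_status": "repaid"},
    {"loan_id": "A2", "income": 500, "loan_status": "paid"},
]
clean, rejected = split_clean(past_loans)
print(len(clean), "clean;", len(rejected), "rejected")
for record, problems in rejected:
    print(record["loan_id"], problems)
```

In this sketch the second record is set aside rather than silently repaired, since its ambiguous "income" field and undefined status value are exactly the kind of assumptions a shared ontology is meant to remove.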
Stay tuned for the next blog in our series where we cover time-series data in lending.