Gartner estimates that 80% of enterprise data is unstructured. Yet until the recent advances in machine learning, structured data was the go-to for data analytics and models. Increases in computational power now let organizations analyze unstructured data and incorporate it into their business decisions. One could even say that unstructured data is the gateway to data-driven underwriting.
Nav Kesher, head of data sciences for the Facebook Marketplace Experience, said at the AI Summit in San Francisco, “...businesses have, in the past, ignored or forgotten about such [unstructured] data, that is slowly starting to change.”
This post is the first of a four-part blog series on structured and unstructured data for data-driven underwriting. It provides an overview of the main differences between structured and unstructured data, their impact on loan underwriting, and how unstructured data is incorporated into machine learning credit underwriting models.
The rest of the blog series will educate you on the following topics of unstructured data in underwriting:
The importance of ontology in data
The role of time series data in lending
Common data mistakes hindering your ability to adopt AI and machine learning
Structured vs. Unstructured Data
Two key attributes of unstructured data are that it doesn’t adhere to a standard form or a standard definition. Unlike structured data, unstructured data generally does not fit into traditional tabular form, so it often requires transformation into structured data before it can be used. In addition, unstructured data is readily interpretable by humans, whereas structured data is readily usable by computers.
Examples of structured data are data that fit into MS Excel sheets; examples of unstructured data are email conversations, audio or speech data, text messages, and images. Unstructured data provides a large volume of insights. With unstructured data, you need to decide what comprises the data, how to interpret it, and how to implement it.
We say a dataset is structured when each data element, that is, each attribute represented in the dataset, has an associated fixed structure. For underwriting purposes, we consider any data in tabular form to be structured data. Relational tables are the most popular type of structured data. Examples are data stored in MS Excel sheets or comma-delimited data stored in CSV files. This type of data can be stored in a classic relational database and thus can be queried with SQL and its variants.
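To make the point about querying tabular data concrete, here is a minimal sketch using Python's built-in sqlite3 module. The applicants table, its columns, and the sample rows are hypothetical and exist only for illustration; they are not drawn from any real underwriting dataset.

```python
import sqlite3

def query_applicants(min_income):
    """Return names of applicants whose monthly income meets a threshold."""
    # In-memory database; the table and rows below are made up for illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE applicants ("
        "id INTEGER PRIMARY KEY, name TEXT, monthly_income REAL)"
    )
    conn.executemany(
        "INSERT INTO applicants (id, name, monthly_income) VALUES (?, ?, ?)",
        [(1, "A. Smith", 4200.0), (2, "B. Jones", 2900.0), (3, "C. Lee", 5100.0)],
    )
    # Every row has the same fixed columns, so a plain SQL filter works.
    rows = conn.execute(
        "SELECT name FROM applicants WHERE monthly_income >= ? ORDER BY id",
        (min_income,),
    ).fetchall()
    conn.close()
    return [name for (name,) in rows]

names = query_applicants(4000)
```

Because every row shares the same fixed set of columns, no interpretation step is needed before the data can be filtered or aggregated.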
For underwriting purposes, we treat semistructured data as equivalent to unstructured data. Semistructured data has a flexible structure, while unstructured data is expressed (in writing, image or voice) in natural human languages with no specific structure defined. XML and JSON are formats commonly used to represent semistructured data; images, audio, videos, emails and PDFs are examples of unstructured data. Combined, the two types constitute about 80% of all generated data, and both require parsing before they can be used to train artificial intelligence/machine learning algorithms.
For example, consider a resume in PDF format. It is essentially a big blob of text and requires parsing to make sense of what the data means and how the information is organized (headers vs. body text, and so on). You also need to interpret what the information in the resume means. For example, a job title such as “Loans Manager” could be interpreted in various ways, because the roles and responsibilities behind that title differ from one individual to the next.
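As a rough illustration of the parsing that semistructured data needs, the sketch below flattens a nested JSON record into a single flat row of the kind a tabular model could consume. The record, its field names, and the flatten helper are hypothetical; real application data would also need handling for lists, missing fields, and inconsistent types.

```python
import json

def flatten(record, prefix=""):
    """Flatten nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the key path.
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            # Scalars (and lists, which this sketch leaves untouched) are kept as-is.
            flat[name] = value
    return flat

# Hypothetical semistructured application record.
raw = (
    '{"applicant": {"name": "A. Smith", '
    '"employment": {"title": "Loans Manager", "years": 4}}, '
    '"requested_amount": 5000}'
)
row = flatten(json.loads(raw))
```

Each dotted key becomes a column name, so many such records with differing shapes can be lined up in one table, with gaps where a field is absent.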
The Impact of Unstructured Data in Data-Driven Underwriting
If you only use structured data in your underwriting, then you’re making lending decisions based on only about 20% of the data available to you. Using unstructured data allows you to tap into a large volume of previously unavailable insights.
Traditional methods of credit decisioning make use of structured data like bureau data. In today’s digital age, only an estimated 20% of enterprise data is structured. Furthermore, according to IDC, unstructured data is outpacing structured data, growing at 29.8% YoY against 19.6% for structured data. In relative terms, that is a growth rate roughly 50% higher.
So how do you access all of the data available to you for credit underwriting? By incorporating unstructured data into your underwriting decisioning, you can make far better predictions of what’s a good loan vs. what’s a bad loan, giving you a 360-degree view of an applicant’s creditworthiness.
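The comparison between the two IDC growth rates can be checked with a few lines of arithmetic. The sketch below uses only the two figures quoted above; the variable names are illustrative.

```python
# Year-over-year growth rates quoted above, in percent.
unstructured_growth = 29.8
structured_growth = 19.6

# Absolute gap, in percentage points.
gap_points = unstructured_growth - structured_growth

# Relative gap: how much higher the unstructured rate is than the structured rate.
relative_gap_percent = (unstructured_growth / structured_growth - 1) * 100

print(f"Gap: {gap_points:.1f} percentage points")
print(f"Relative difference: {relative_gap_percent:.0f}%")
```

The gap is 10.2 percentage points, which works out to a growth rate about 52% higher in relative terms.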
Incorporating Unstructured Data into Machine Learning/AI-Based Credit Underwriting
Unfortunately, unstructured data is not immediately usable for machine learning/AI-based credit underwriting; it must first be parsed into structured data. Machine learning models developed by data scientists are computer algorithms that are monolingual: they speak in terms of binary digits, i.e., bits. Unstructured data needs to be broken down into structured form before a model can understand it and be trained on it.
The first step to incorporating unstructured data into your business and underwriting strategy is to align stakeholders around the desired benefits of data-driven underwriting enabled through the use of unstructured data. In other words, what does success look like for you as a lender? Is it to make better loan predictions and understand whether someone will default or not? Or is it to better identify your key customer profile? Or is it to acquire a new customer set like thin files and credit invisibles? Once the objective is set, it can guide how the unstructured data should be parsed into structured data for machine learning.
In the next blog of our four-part blog series, we cover a crucial component: the ontology of data, or agreement on the meaning of data.
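One simple way to picture this conversion is a word-count (bag-of-words) representation, where free text becomes a fixed-length list of numbers. The sketch below is only an illustration of the idea, not a description of any particular underwriting model; the vocabulary, the sample message, and the text_to_features helper are all hypothetical.

```python
import re
from collections import Counter

def text_to_features(text, vocabulary):
    """Count how often each vocabulary word appears in the text."""
    # Lowercase the text and split it into simple word tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # One number per vocabulary word, always in the same order.
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and message, for illustration only.
vocabulary = ["loan", "late", "payment", "salary"]
message = "My salary was late, so the loan payment will be late too."
features = text_to_features(message, vocabulary)
```

Every message, whatever its length, maps to a row with the same four columns, which is the fixed structure a model needs for training.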
In the meantime, learn how Trust Science can help you automate your credit underwriting.