I'm learning Data Science and I'm currently working on analyzing loan data across different regions. My goal is to build a model that can assess the quality of a given region based on the ratio $$\frac{\sum GCL}{\sum TAP},$$ where:
$GCL$ (Gross Credit Loss) = Total Amount Payable - (Issue Value + Bad Debt)
$TAP$ (Total Amount Payable) is the total amount due.
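To make the target concrete, here is a minimal sketch of how the ratio could be computed per region. The record layout, region names, and amounts are made up for illustration; only the $GCL$ formula comes from the definition above. Note that it is a ratio of sums per region, not an average of per-loan ratios.

```python
from collections import defaultdict

# Hypothetical loan records: (region, total_amount_payable, issue_value, bad_debt).
# The values are invented purely to illustrate the calculation.
loans = [
    ("North", 1200.0, 1000.0, 50.0),
    ("North", 2400.0, 2000.0, 0.0),
    ("South", 1100.0, 1000.0, 100.0),
    ("South", 3300.0, 3000.0, 150.0),
]


def region_ratios(records):
    """Return {region: sum(GCL) / sum(TAP)} using GCL = TAP - (issue value + bad debt)."""
    gcl_sum = defaultdict(float)
    tap_sum = defaultdict(float)
    for region, tap, issue_value, bad_debt in records:
        gcl_sum[region] += tap - (issue_value + bad_debt)
        tap_sum[region] += tap
    # Skip regions whose total amount payable is zero to avoid dividing by zero.
    return {r: gcl_sum[r] / tap_sum[r] for r in tap_sum if tap_sum[r] != 0}


ratios = region_ratios(loans)
```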
I have two approaches in mind:
Linear Regression Model: Given a set of variables $X_1, ..., X_k$, I can fit a linear regression model for the dependent variable $Y = \frac{\sum GCL}{\sum TAP}$. During the model-building phase, I'd select a subset of these variables, resulting in a prediction $\hat{Y} = a_0 + \sum_i a_i X_i$ over the selected variables. I can then segment the regions into quality classes (e.g., A, B, C, D, E) based on the predicted value of $Y$.
Segmentation followed by Modeling: First, I'd arbitrarily create quality classes (e.g., A, B, C, D, E) based on the observed value of $\frac{\sum GCL}{\sum TAP}$. Given the set of variables $X_1, ..., X_k$, I'd estimate the conditional distribution of a random variable $Z$ indicating the quality class a given region belongs to. Then, I'd define a function $f$ of that distribution's parameters (e.g., mean, median, standard deviation, skewness). The choice of $f$ would be determined by, e.g., business requirements (not crucial for this question), and the value $f(Z)$ would be the final model output indicating the quality of the region.
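The two approaches can be sketched side by side on the same data. This is only an illustration under several assumptions: the data are synthetic, the class cut points are hypothetical, the distribution of $Z$ is estimated with a simple nearest-neighbour frequency count (a stand-in for whatever classifier would actually be used), and $f$ is taken to be the expected class index. None of these choices are fixed by the problem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 60 regions, 3 explanatory variables X_1..X_3.
n_regions, k = 60, 3
X = rng.normal(size=(n_regions, k))
true_coefs = np.array([0.03, -0.02, 0.01])  # made up for the illustration
ratio = 0.10 + X @ true_coefs + rng.normal(scale=0.005, size=n_regions)

# Hypothetical cut points on the ratio; class index 0..4 maps to labels A..E.
thresholds = np.array([0.06, 0.09, 0.11, 0.14])
labels = np.array(["A", "B", "C", "D", "E"])

# Approach 1: regress the ratio on X (ordinary least squares), then segment.
X_design = np.column_stack([np.ones(n_regions), X])
coefs, *_ = np.linalg.lstsq(X_design, ratio, rcond=None)


def approach_1(x_new):
    """Predict the ratio for one region and map it to a quality class."""
    y_hat = coefs[0] + x_new @ coefs[1:]
    return labels[np.digitize(y_hat, thresholds)]


# Approach 2: segment the observed ratio first, then model the class Z given X.
z = np.digitize(ratio, thresholds)


def class_distribution(x_new, n_neighbors=10):
    """Estimate P(Z = c | X = x_new) as class frequencies among the nearest regions."""
    dist = np.linalg.norm(X - x_new, axis=1)
    nearest = np.argsort(dist)[:n_neighbors]
    counts = np.bincount(z[nearest], minlength=len(labels))
    return counts / counts.sum()


def f_expected_class(probs):
    """One possible f: the mean class index under the estimated distribution."""
    return float(np.dot(np.arange(len(probs)), probs))


x_new = np.array([0.5, -0.2, 0.1])  # a hypothetical new region
class_1 = approach_1(x_new)         # a single label from approach 1
probs = class_distribution(x_new)   # estimated distribution of Z from approach 2
score_2 = f_expected_class(probs)   # f applied to that distribution
```

With these choices, approach 1 outputs a single label, while approach 2 outputs a number between 0 and 4 derived from a whole distribution over classes, so the two outputs are not directly on the same scale.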
Question: If the two models yield different results (e.g., they disagree on the quality of a given region), is there an analytic way to compare them and determine which one is better?
I'd appreciate any insights or feedback!