Proposal – Home Loan Advisor

1. Problem Definition

Buying a home is the largest financial decision most people will ever make, yet consumers rarely have access to the same analytic tools that banks and lenders use when evaluating mortgage applications. Most prospective home buyers do not know how their personal financial profile — income, credit score, debt load, down payment — affects their real risk of being unable to repay a mortgage, or how changes in interest rates and economic conditions might impact that risk over time.

This project proposes the development of the Home Loan Advisor, a consumer-facing machine learning application that helps individuals understand whether taking on a home loan is financially advisable given their personal characteristics. Unlike traditional bank-side default prediction systems, this app flips the perspective: it empowers the borrower to make smarter, data-driven decisions before committing to a mortgage. The app will use a model trained on real mortgage performance data from Freddie Mac to provide personalized risk assessments and scenario-based projections.

2. Target Users & App Features

The primary target users are consumers — specifically individuals who are considering taking on a home mortgage and want an independent, data-driven second opinion on their financial readiness.

Target Users

First-time homebuyers who are unfamiliar with how mortgage risk is assessed and want guidance before approaching a lender.
Existing homeowners considering refinancing who want to understand how their current profile compares to historical default patterns.
Financial advisers who want a quick, visual tool to walk clients through mortgage readiness scenarios.

Core App Features

Personal profile input form: fields for annual income, credit score, loan amount, property value, down payment percentage, employment status, and debt-to-income ratio.
Risk assessment output: a clear affordability score and risk tier (Low / Moderate / High) with a plain-English explanation of what is driving the score.
Scenario simulator: an interactive panel allowing users to adjust variables (e.g., ‘What if my income drops by 15%?’ or ‘What if interest rates rise by 2 percentage points?’) and instantly see how their risk score changes.
Feature influence panel: a chart showing which personal factors most affect the prediction, helping users understand what to improve before applying.
Model transparency tab: displays model performance metrics (AUC-ROC, confusion matrix) so users can judge how much to rely on the underlying system.
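The what-if questions in the scenario simulator reduce to simple arithmetic on the loan terms before the model is re-scored. The following is a minimal sketch in Python (the app itself is planned in R Shiny), using the standard fixed-rate amortization formula. The function names, the 30-year term, and the borrower figures are illustrative assumptions, not values from the dataset.

```python
def monthly_payment(principal, annual_rate_pct, years=30):
    """Fixed-rate amortization payment (principal and interest only)."""
    n = years * 12                      # number of monthly payments
    r = annual_rate_pct / 100 / 12      # monthly interest rate
    if r == 0:
        return principal / n
    return principal * r / (1 - (1 + r) ** -n)


def debt_to_income(payment, other_monthly_debt, annual_income):
    """Debt-to-income ratio as a percentage of gross monthly income."""
    return 100 * (payment + other_monthly_debt) / (annual_income / 12)


# Hypothetical borrower, for illustration only.
loan, rate, income, other_debt = 300_000, 6.0, 90_000, 400

base = debt_to_income(monthly_payment(loan, rate), other_debt, income)
# "What if interest rates rise by 2 percentage points?"
rate_up = debt_to_income(monthly_payment(loan, rate + 2.0), other_debt, income)
# "What if my income drops by 15%?"
income_down = debt_to_income(monthly_payment(loan, rate), other_debt, income * 0.85)

print(f"DTI base {base:.1f}%, rate +2 pts {rate_up:.1f}%, income -15% {income_down:.1f}%")
```

In the app, the adjusted inputs (here the new interest rate and DTI) would be passed to the trained model to produce the updated risk score.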

3. Application Interface

The application will be built using R Shiny. Shiny is the recommended interface for this course and is well-suited for this project because the underlying model is trained in R using the tidymodels framework, the data is tabular, and Shiny’s reactive programming model naturally supports real-time updates as users adjust sliders and inputs in the scenario simulator.

The app layout will use the bslib package for a modern, clean UI and will consist of three tabs:

My Profile: the input form and risk assessment output
Scenario Simulator: interactive sliders for exploring what-if scenarios
About the Model: performance metrics and data transparency information

The app will be designed to be accessible to non-technical users, with plain-English labels and color-coded risk indicators.

4. Data Sources

The primary data source will be the Freddie Mac Single Family Loan-Level Dataset, publicly available at https://www.freddiemac.com/research/datasets/sf-loanlevel-dataset. This dataset requires a free registration but is fully accessible and contains loan-level origination and performance data on millions of mortgages from 1999 to the present. It is one of the most comprehensive and credible public mortgage datasets available and was specifically recommended by the course professor.

To keep the project manageable in scope, we will use loan origination vintages from 2018 to 2022, which provide a rich and recent snapshot of mortgage performance while remaining computationally feasible. Key features available in this dataset include:

Feature	Description	Type
original_interest_rate	Interest rate at origination	numeric
original_upb	Original unpaid principal balance (loan amount)	numeric
original_ltv	Loan-to-value ratio at origination	numeric
original_dti	Debt-to-income ratio at origination	numeric
borrower_credit_score	FICO credit score at origination	numeric
number_of_units	Number of property units	categorical
occupancy_status	Primary residence, second home, or investment	categorical
loan_purpose	Purchase, refinance, or cash-out refinance	categorical
ever_delinquent	Target: whether loan became 90+ days delinquent	binary

Pre-processing steps will include subsetting to the relevant vintage years, handling missing values, encoding categorical variables, constructing the binary target variable (ever 90+ days delinquent), and addressing class imbalance using SMOTE or weighted sampling.
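Two of these steps, constructing the binary target and weighting for class imbalance, can be sketched as follows. This is an illustrative Python sketch (the project pipeline is planned in R), and it rests on assumptions: monthly performance rows are reduced to hypothetical (loan_id, status) pairs, numeric status codes are taken to count 30-day buckets so that 3 or more means 90+ days, and non-numeric codes are not counted as delinquent. How the real non-numeric codes should be treated would need to be checked against the dataset's user guide. Inverse-frequency weights stand in here for the weighted option; SMOTE is not shown.

```python
from collections import Counter


def is_seriously_delinquent(status):
    """True if a monthly status code means 90+ days past due.

    Assumption: numeric codes count 30-day buckets (3 = 90-119 days);
    non-numeric codes are not counted as delinquent in this sketch.
    """
    code = str(status).strip()
    return code.isdigit() and int(code) >= 3


def ever_delinquent(monthly_records):
    """Collapse (loan_id, status) monthly rows into one 0/1 label per loan."""
    labels = {}
    for loan_id, status in monthly_records:
        flagged = is_seriously_delinquent(status)
        labels[loan_id] = int(labels.get(loan_id, 0) or flagged)
    return labels


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: n / (len(counts) * count) for cls, count in counts.items()}


# Hypothetical monthly rows for four loans.
records = [("A", "0"), ("A", "1"), ("A", "3"), ("B", "0"),
           ("B", "2"), ("C", "0"), ("D", "0")]
targets = ever_delinquent(records)
weights = balanced_class_weights(list(targets.values()))
print(targets, weights)
```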

5. Machine Learning Problem & Initial Models

This is a binary classification problem. The target variable is whether a mortgage borrower ever became seriously delinquent (90+ days past due) on their loan — a strong proxy for default risk from the consumer’s perspective. The following model types will be explored in order of increasing complexity:

Logistic Regression: an interpretable baseline model. Its coefficients show directly which features increase or decrease default probability, which supports the app’s explainability goals.
Random Forest: captures non-linear relationships and feature interactions. It provides built-in variable importance scores that can be visualized in the app’s feature influence panel.
Gradient Boosting (XGBoost): often among the best-performing models on tabular financial data. Its hyperparameters will be tuned using cross-validation, and it is the candidate production model.

All models will be implemented using the tidymodels framework in R. The final selected model will be serialized as an .rds file and loaded into the Shiny app for real-time inference.
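To illustrate why the logistic regression baseline is easy to explain, the sketch below shows how coefficients on the log-odds scale turn into a probability and into odds ratios. It is written in Python for illustration only (the models themselves are planned in tidymodels), and the intercept and coefficients are hypothetical placeholders, not fitted values.

```python
import math

# Hypothetical coefficients on the log-odds scale, for illustration only.
INTERCEPT = -3.0
COEFFICIENTS = {
    "original_dti": 0.03,
    "original_ltv": 0.02,
    "borrower_credit_score": -0.005,
}


def delinquency_probability(features):
    """Logistic model: sigmoid of intercept plus weighted feature sum."""
    log_odds = INTERCEPT + sum(
        COEFFICIENTS[name] * features[name] for name in COEFFICIENTS
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


def odds_ratio(feature, change):
    """Multiplicative change in the odds when `feature` moves by `change`."""
    return math.exp(COEFFICIENTS[feature] * change)


borrower = {"original_dti": 35, "original_ltv": 80, "borrower_credit_score": 740}
baseline = delinquency_probability(borrower)
higher_dti = delinquency_probability({**borrower, "original_dti": 45})

print(f"baseline {baseline:.3f}, with DTI +10 {higher_dti:.3f}")
print(f"odds ratio for DTI +10: {odds_ratio('original_dti', 10):.2f}")
```

An odds ratio above 1 means the change raises the odds of delinquency; below 1 means it lowers them. This is the kind of statement the plain-English explanation in the app could be built from.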

6. Model & App Performance Evaluation

Model Evaluation

AUC-ROC: primary metric, measuring the model’s ability to rank borrowers by delinquency risk across all classification thresholds.
Precision, Recall, and F1-Score: evaluated for the delinquent class to capture practical usefulness given class imbalance.
Confusion Matrix: to visualize the trade-off between false positives (overcautious advice) and false negatives (missed high-risk borrowers).
Calibration plot: to verify that the model’s predicted probabilities are well-calibrated and meaningful to end users.
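For reference, the ranking and threshold metrics above can be computed from first principles as sketched below. This is an illustrative Python sketch with made-up labels and scores; in the project these values would more likely come from the evaluation tooling in tidymodels. The pairwise AUC shown here is quadratic in the number of loans, so it is only suitable for small examples.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) with 1 = delinquent as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the delinquent class."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def auc_roc(y_true, scores):
    """Probability that a random positive outranks a random negative
    (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 0, 1, 0]
scores = [0.05, 0.20, 0.35, 0.40, 0.80, 0.10]
y_pred = [1 if s >= 0.3 else 0 for s in scores]  # hypothetical 0.3 threshold

auc = auc_roc(y_true, scores)
counts = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(auc, counts, precision, recall, f1)
```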

App Evaluation

Prediction latency: the app should return a risk score within 2 seconds of user input submission.
Usability testing: the app will be evaluated by 3–5 peers playing the role of first-time homebuyers, assessed on clarity and ease of use.
Scenario accuracy: the scenario simulator outputs will be verified to respond logically and consistently to input changes.
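The latency and scenario-accuracy checks could be automated along the following lines. This is a minimal Python sketch under stated assumptions: toy_score is a stand-in for the deployed model, the field names are hypothetical, and "responding logically" is interpreted as the score never falling when DTI rises and never rising when the credit score rises.

```python
import time


def check_monotone(score_fn, profile, field, values, increasing=True):
    """True if the score never moves the wrong way as `field` increases."""
    scores = [score_fn({**profile, field: v}) for v in sorted(values)]
    pairs = list(zip(scores, scores[1:]))
    if increasing:
        return all(later >= earlier for earlier, later in pairs)
    return all(later <= earlier for earlier, later in pairs)


def timed(score_fn, profile):
    """Return the score and the wall-clock seconds it took to compute."""
    start = time.perf_counter()
    result = score_fn(profile)
    return result, time.perf_counter() - start


def toy_score(profile):
    """Stand-in scorer (not the real model): risk rises with DTI and
    falls with credit score, clipped to [0, 1]."""
    raw = 0.01 * profile["dti"] - 0.001 * (profile["credit_score"] - 600)
    return min(1.0, max(0.0, raw))


profile = {"dti": 35, "credit_score": 720}
dti_ok = check_monotone(toy_score, profile, "dti", [20, 30, 40, 50])
fico_ok = check_monotone(toy_score, profile, "credit_score",
                         [600, 700, 800], increasing=False)
score, elapsed = timed(toy_score, profile)
print(dti_ok, fico_ok, elapsed < 2.0)
```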

7. Model Training Mode

Model training will be batch-based. The model will be trained once on the Freddie Mac 2018–2022 vintage data, validated on a held-out test set, and the final model object will be serialized to an .rds file that is loaded by the Shiny app at startup for inference. This approach is appropriate because historical mortgage performance data is stable, and real-time retraining is not necessary for a consumer advisory tool.

If time allows, a periodic retraining schedule (e.g., triggered annually as new Freddie Mac vintage data becomes available) could be explored using GitHub Actions as a stretch goal, but this is outside the core project scope.

8. Computational Needs & Hosting

This project does not require large neural networks, LLMs, GPUs, or TPUs. Logistic Regression, Random Forest, and XGBoost are all computationally lightweight models that train efficiently on a standard laptop. The Freddie Mac dataset for 2018–2022 vintages is large but manageable: a stratified random sample of 500,000 to 1,000,000 loans will be used for training to balance representativeness with training speed.
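A stratified sample draws the same fraction from each group so that rare outcomes keep their share of the data. The sketch below shows the idea in Python on synthetic rows. The stratification key (here the delinquency flag alone), the 10% fraction, and the fixed seed are assumptions for illustration; the project may also stratify by vintage year.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=42):
    """Sample the same fraction from each stratum defined by key(row)."""
    rng = random.Random(seed)           # fixed seed for reproducibility
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # keep at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample


# Synthetic data: 1,000 loans, 2% of them delinquent.
loans = [{"loan_id": i, "ever_delinquent": int(i < 20)} for i in range(1000)]
sample = stratified_sample(loans, key=lambda r: r["ever_delinquent"], fraction=0.1)

delinquent_in_sample = sum(r["ever_delinquent"] for r in sample)
print(len(sample), delinquent_in_sample)
```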

Hosting plan:

Primary: shinyapps.io free tier — sufficient for a course project with low concurrent traffic and a pre-trained model loaded at startup.
Backup: Render.com free tier or AWS EC2 t2.micro running open-source Shiny Server, if shinyapps.io resource limits are exceeded.

No significant cloud computing costs are anticipated. All model training will be performed locally, and only the serialized model file and Shiny app code will be deployed to the hosting platform.

9. Minimum Viable Product (MVP)

The MVP will be a functional end-to-end prototype demonstrating feasibility, to be delivered at the midterm checkpoint. It will include:

A cleaned and preprocessed subset of the Freddie Mac dataset (2018–2020 vintages) with the binary delinquency target variable defined.
A trained Logistic Regression baseline model with AUC-ROC reported on a held-out test set.
A deployed Shiny app with:
- a basic input form for borrower characteristics
- a real-time default probability output
- a simple risk tier label (Low/Moderate/High).
The app will be accessible via a public shinyapps.io URL for review.
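The risk tier label in the MVP is a simple mapping from the predicted default probability to one of three bands. A minimal Python sketch follows (the app itself is planned in R Shiny). The cut-off values are hypothetical placeholders; real thresholds would need to be chosen from the calibrated model and the observed delinquency rates.

```python
# Hypothetical cut-offs: (upper bound, label), checked in ascending order.
TIER_THRESHOLDS = [(0.02, "Low"), (0.08, "Moderate")]


def risk_tier(probability):
    """Map a predicted default probability to Low / Moderate / High."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    for upper, label in TIER_THRESHOLDS:
        if probability < upper:
            return label
    return "High"


for p in (0.01, 0.05, 0.30):
    print(p, risk_tier(p))
```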

Post-MVP iterations will add:

Random Forest and XGBoost model options, the scenario simulator with interactive sliders, the feature importance visualization, UI polish using bslib, and the model transparency tab with evaluation metrics.

Bibliography:

Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Chang, W., Cheng, J., Allaire, J., Xie, Y., & McPherson, J. (2023). shiny: Web application framework for R. https://shiny.posit.co

Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.

Freddie Mac. (n.d.). Single family loan-level dataset. https://www.freddiemac.com/research/datasets/sf-loanlevel-dataset

Kuhn, M., & Wickham, H. (2020). Tidymodels: A collection of packages for modeling and machine learning using tidyverse principles. https://www.tidymodels.org

McKee, A. (2023, December 15). Mastering Shiny for Python: A beginner’s guide to building interactive web applications. DataCamp. https://www.datacamp.com/tutorial/mastering-shiny-for-python-a-beginners-guide-to-building-interactive-web-applications