Multiple Linear Regression — Meaning, Definition & Full Explanation

Multiple linear regression is a statistical technique that predicts the value of a single dependent variable based on two or more independent variables by fitting a linear equation. It extends simple linear regression (which uses one predictor) to scenarios where multiple factors influence an outcome. In banking and finance, it is used to model relationships such as loan default risk based on income, credit score, and debt-to-income ratio simultaneously.

What is Multiple Linear Regression?

Multiple linear regression is a predictive modeling method that assumes a straight-line relationship between one outcome variable and several input variables. The dependent variable (also called the response or target variable) is typically continuous—such as loan amount, house price, or return on investment. The independent variables (also called predictors or features) are the factors that drive the outcome.

The core formula is: Y = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ + ε

Here, Y is the dependent variable, X₁ through Xₙ are the n independent variables, β₀ is the intercept (the predicted value of Y when every X is zero), β₁ through βₙ are coefficients (weights) showing how much each independent variable influences Y, and ε is the error term, the part of Y that the predictors do not explain. Multiple linear regression estimates these coefficients from historical data, then applies the fitted equation to new data for prediction. It differs from simple linear regression because it captures the combined effect of multiple factors rather than just one. This makes it more realistic for complex banking scenarios like credit scoring or portfolio return forecasting.
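As a minimal sketch of how the fitted equation is applied, the snippet below evaluates β₀ + β₁X₁ + ... + βₙXₙ for one observation. The function name `predict` and the coefficient values are hypothetical, chosen only for illustration; the error term ε is omitted because a prediction uses the expected value of Y.

```python
def predict(intercept, coefficients, features):
    """Evaluate intercept + sum(beta_i * x_i) for one observation."""
    if len(coefficients) != len(features):
        raise ValueError("need exactly one coefficient per feature")
    return intercept + sum(b * x for b, x in zip(coefficients, features))


# Hypothetical values: beta_0 = 1.0, beta_1 = 0.5, beta_2 = -2.0
y_hat = predict(1.0, [0.5, -2.0], [2.0, 1.0])
print(y_hat)  # 1.0 + 0.5*2.0 - 2.0*1.0 = 0.0
```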

How Multiple Linear Regression Works

Multiple linear regression follows these steps:

Data Collection: Gather historical data with one dependent variable and multiple independent variables. For example, collect 500 loan records with loan default status (Y) and borrower income, age, credit score, employment tenure, and existing debt (X variables).
Model Fitting: Use statistical software to calculate the best-fit coefficients (β values). The usual algorithm, ordinary least squares (OLS), chooses the coefficients that minimize the sum of squared residuals, that is, the squared differences between the actual and predicted values of Y.
Coefficient Interpretation: Each coefficient gives the marginal effect of its variable, expressed in that variable's units. For example, if income is measured in ₹ lakh, a coefficient of 0.05 on income means that a ₹1 lakh increase in income raises the predicted loan default probability by 0.05 (five percentage points), holding the other variables constant.
Residual Analysis: Check that prediction errors (residuals) follow a normal distribution with zero mean and constant variance. Violations suggest the model may be biased or unreliable.
Prediction: Apply the fitted equation to new data. Input a borrower's income, credit score, and other X values to predict their loan default risk.
Model Validation: Test the model on unseen data to ensure it generalizes well and does not overfit. Common metrics include R² (proportion of variance explained), adjusted R², and root mean squared error (RMSE).
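The steps above can be sketched end to end with NumPy. This is an illustrative sketch on synthetic data, not real loan records: the sample size, the "true" coefficients, the noise level, and the 400/100 train/test split are all assumptions made for the example, and `add_intercept` is a hypothetical helper.

```python
import numpy as np

# Data collection (stand-in): synthetic data with 3 predictors.
rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.normal(size=(n, p))
true_beta = np.array([2.0, 0.5, -1.0, 0.25])  # intercept first, then slopes
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.1, size=n)

# Hold out the last 100 rows as unseen data for validation.
X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]


def add_intercept(features):
    """Prepend a column of ones so the first coefficient is the intercept."""
    return np.column_stack([np.ones(len(features)), features])


# Model fitting: OLS via least squares on the training rows.
beta_hat, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)

# Residual analysis: with an intercept, OLS residuals average to zero.
residuals = y_train - add_intercept(X_train) @ beta_hat

# Prediction and validation on the held-out rows.
y_pred = add_intercept(X_test) @ beta_hat
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("estimated coefficients:", beta_hat)
print("mean training residual:", residuals.mean())
print("test RMSE:", rmse, "test R^2:", r2)
```

Because the synthetic data really are linear with small noise, the estimated coefficients land close to the assumed ones. On real data, the residual and validation checks are what show whether that is the case.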

Key Assumptions: Multiple linear regression assumes independent variables are not highly correlated with each other (multicollinearity check), relationships are linear, residuals are normally distributed, and prediction errors have constant variance across all levels of X (homoscedasticity).
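One common way to run the multicollinearity check is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 − R²). The sketch below assumes NumPy and uses made-up data in which one column is nearly a copy of another. The function name `vif` is hypothetical, and the often-quoted cut-offs of 5 or 10 are rules of thumb rather than fixed standards.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X (rows = observations)."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(len(others)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1 - resid.var() / target.var()  # R^2 of column j on the rest
        factors.append(1.0 / (1.0 - r2))     # undefined if r2 is exactly 1
    return np.array(factors)


# Made-up predictors: c is almost identical to a, b is unrelated.
rng = np.random.default_rng(1)
a = rng.normal(size=1000)
b = rng.normal(size=1000)
c = a + 0.1 * rng.normal(size=1000)
vifs = vif(np.column_stack([a, b, c]))
print(vifs)  # large values for a and c, a value near 1 for b
```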

Multiple Linear Regression in Indian Banking

Multiple linear regression is widely used by Indian banks and non-banking financial companies (NBFCs) for credit risk modeling and pricing. The Reserve Bank of India (RBI) requires banks to maintain robust credit assessment frameworks; multiple regression models help banks quantify how borrower characteristics (such as income, collateral value, industry sector, and loan-to-value ratio) collectively predict default risk. These models feed into the internal ratings-based (IRB) approaches used under Basel III capital adequacy norms, which RBI enforces for scheduled commercial banks.

In practice, Indian banks like SBI, HDFC Bank, and ICICI Bank employ multiple linear regression in their loan approval algorithms, credit pricing models, and portfolio risk management. Various RBI guidelines on retail credit portfolios reference statistical models of this type, though RBI does not mandate a specific methodology.

For insurance companies regulated by the Insurance Regulatory and Development Authority of India (IRDAI), multiple regression is used to model claims frequency and severity based on policyholder characteristics (age, location, health status) and underwriting factors. In investment research, equity analysts and mutual fund houses use multiple regression to forecast equity returns based on macroeconomic variables (inflation, interest rates, GDP growth) and firm-specific metrics.

The CAIIB (Certified Associate of Indian Institute of Bankers) curriculum covers regression analysis and statistical modeling as part of its risk management and advanced quantitative modules. Candidates studying for the JAIIB and CAIIB exams should understand regression concepts because they underpin modern credit policy and pricing frameworks.

Practical Example

Scenario: Deepak Kumar, the head of retail lending at a Bangalore-based private bank, needs to build a default prediction model for personal loans.

Deepak's team collects data on 2,000 personal loans approved over three years, recording: (Y) whether the loan defaulted within 24 months, (X₁) borrower's annual income, (X₂) CIBIL credit score, (X₃) existing monthly debt obligations, and (X₄) years employed at current job.

Using regression software, the team estimates the model:

Default Probability = 0.25 − 0.00001×Income + 0.003×ExistingDebt − 0.002×CIBILScore − 0.015×JobTenure

This tells Deepak that higher income lowers default risk, higher existing debt raises it, a better credit score reduces it, and longer job tenure reduces it. The coefficients are illustrative and only make sense in the units used when the model was fitted, so a new applicant's inputs must be expressed in those same units. The bank can then apply this equation to a new applicant (e.g., ₹8 lakh income, ₹25,000 monthly debt, 720 credit score, 4 years tenure) to predict their default probability and decide on loan approval and pricing. If predicted default risk exceeds 8%, the bank might deny the loan or charge a higher interest rate.
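To make the arithmetic concrete, here is a small sketch that plugs the example applicant into the equation above. The text does not state the units behind the coefficients, so the units here (income in ₹ lakh, existing debt in ₹ thousand per month) are an assumption, and the function name `default_score` is hypothetical.

```python
def default_score(income, existing_debt, cibil_score, job_tenure):
    """Evaluate the illustrative default equation from the example."""
    return (0.25
            - 0.00001 * income
            + 0.003 * existing_debt
            - 0.002 * cibil_score
            - 0.015 * job_tenure)


# Assumed units: income in Rs lakh, existing debt in Rs thousand per month.
score = default_score(income=8, existing_debt=25, cibil_score=720, job_tenure=4)
print(round(score, 5))  # -1.17508
```

Under these assumed units the result is negative, which cannot be a probability. That is the out-of-range behaviour of linear models on binary outcomes discussed in the comparison with logistic regression, so the equation is best read as an illustration of coefficient signs rather than a calibrated model.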

Multiple Linear Regression vs. Logistic Regression

Aspect	Multiple Linear Regression	Logistic Regression
Dependent Variable	Continuous (any real number)	Binary or categorical (0/1, yes/no, default/no default)
Output	Predicted numeric value (e.g., loan amount: ₹5,00,000)	Predicted probability (e.g., default risk: 0.15 = 15%)
Use Case	Forecasting loan amounts, interest rates, house prices	Predicting loan default, credit card fraud, loan approval (yes/no)
Output Range	Unbounded (−∞ to +∞)	Bounded (0 to 1)

Multiple linear regression is appropriate when you are predicting a quantity (e.g., "how much credit should be approved?"). Logistic regression is correct when predicting a probability or classification (e.g., "will this loan default?"). Indian banks use logistic regression more often for credit risk, because risk is inherently a probability, and linear regression would sometimes produce nonsensical predictions outside [0, 1].
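The difference in output range can be seen in a few lines. The sketch below is illustrative only: it compares a raw linear score with the same score passed through the logistic (sigmoid) function that logistic regression uses. The input scores are made-up numbers, not fitted values.

```python
import math


def linear_output(z):
    """A linear model reports the weighted sum as-is."""
    return z


def logistic_output(z):
    """Logistic regression squeezes the weighted sum into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


scores = [-3.0, -0.5, 0.0, 0.5, 3.0]  # hypothetical weighted sums
linear = [linear_output(z) for z in scores]
logistic = [logistic_output(z) for z in scores]

for z, lin, log in zip(scores, linear, logistic):
    print(f"score={z:+.1f}  linear={lin:+.2f}  logistic={log:.3f}")
```

The linear column includes values below 0 and above 1, while every logistic value stays strictly between 0 and 1.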

Key Takeaways

Multiple linear regression predicts one continuous dependent variable using two or more independent variables assuming a linear relationship.
The model equation is Y = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ + ε, where coefficients (β) show each variable's marginal impact on the outcome.
RBI-regulated banks use multiple regression in credit risk models, pricing frameworks, and portfolio stress testing in compliance with Basel III IRB approaches.
Four key assumptions must hold: no multicollinearity among predictors, linear relationships, normally distributed residuals with zero mean, and homoscedasticity (constant variance).
R² measures how much of the dependent variable's variance the model explains; adjusted R² penalizes over-fitting by accounting for the number of variables.
In Indian banking exams (JAIIB, CAIIB), multiple regression appears in risk management, quantitative methods, and credit policy modules.
Logistic regression, not multiple linear regression, is the correct choice for binary outcomes like loan default prediction.
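The model is usually fitted by ordinary least squares (OLS), which minimizes the sum of squared residuals on historical data; maximum likelihood estimation (MLE) yields the same coefficients when the errors are assumed to be normally distributed with constant variance.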
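As a short sketch of the adjusted R² takeaway, the standard formula is 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. The function name `adjusted_r2` and the example numbers below are hypothetical.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R^2: penalizes R^2 for the number of predictors used."""
    dof = n_samples - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n_samples - 1) / dof


# Same raw R^2 of 0.80 on 100 observations, with 4 versus 20 predictors.
few = adjusted_r2(0.80, 100, 4)
many = adjusted_r2(0.80, 100, 20)
print(round(few, 4), round(many, 4))  # 0.7916 0.7494
```

With the same raw R², the model that uses more predictors gets the lower adjusted value, which is the penalty the takeaway describes.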

Frequently Asked Questions

Q: Can I use multiple linear regression to predict whether a loan will default?

A: Not ideally. Multiple linear regression is designed for continuous outcomes (e.g., loan amount or profit). Loan default is a binary outcome (yes/no), so logistic regression is the better choice because it produces probabilities between 0 and 1. Linear regression may produce predictions outside this range, which are nonsensical for probability.

**Q: What happens if my independent variables are highly correlated with each
