Data Analysis in R: Loan Interest Rates

If you like school and are not already familiar with Coursera, I highly recommend you check it out. The internet has come a long way in evolving beyond its primordial functions (wasting time and checking facts), but I was surprised to discover that it can now be your free college education. Through Coursera, you sign up for courses being taught at universities around the world, and each week, you go through the lectures, quizzes, and assignments at your own pace. Of course, you don’t get actual college credit for your efforts (though that may soon change), just a certificate of completion. But that doesn’t really matter to school-lovers like myself, who often question the cognitive payoff of the time, money, and stress invested in pursuit of post-secondary degrees (i.e., could I have made off like Will Hunting if I had just spent enough time in the library? This is a foolish thought, as I spend most of my time in the library reading cookbooks, as well as Margaret Atwood and Zadie Smith novels, but I can’t help but wonder anyway).

I’m currently taking a Data Analysis course, taught by Jeff Leek through the Johns Hopkins Bloomberg School of Public Health. I took a couple of stats courses in college and am comfortable with most of the concepts, but I wanted to be better. Specifically, I wanted to learn how to use R, the open-source software for statistical analysis and computing.

The first assignment was due this morning: an exploration of the relationships between peer-to-peer loan interest rates and other characteristics of the loans and their applicants. If you’re interested (don’t worry, I’m not offended if you’re not), an excerpt of my own analysis is below:

Loan Amount and Length Affect Interest Rates for Applicants with Same Credit Scores

Introduction

Peer-to-peer lending involves money lent between individuals, without the involvement of traditional financial institutions such as banks [1]. Like all lending, peer-to-peer loans carry different levels of risk for the investor, depending on the characteristics of the loan and applicant. Higher risk is reflected in higher interest rates, which are undesirable for applicants as they increase the overall cost of the loan. An important characteristic reflecting risk to lenders is the applicant’s FICO credit score: lower scores reflect poor credit history and may therefore lead the lender to charge higher interest rates. Given this relationship between FICO score and interest rates, the purpose of this data analysis is to identify associations between peer-to-peer loans’ interest rates and other characteristics of the loans and their applicants (controlling for the applicants’ FICO scores). These associations are important, as they give insight into how peer-to-peer lenders weigh characteristics of an applicant and prospective loan in determining a loan’s interest rate.

Methods

Data on 2,500 peer-to-peer loans was obtained from the Lending Club [3], downloaded as a csv file on 2/10/2013 [4], and analyzed using R Statistical Computing Software [5].

To evaluate the relationship between the loans’ interest rate and characteristics of the borrowers, I used multivariate linear regression, pairing average credit score with each other independent variable, with interest rate as the outcome variable.

Multivariate linear regression demonstrates the unique contribution of each categorical or continuous independent variable to variation in the continuous dependent variable: loan interest rate (IntRate) [6].

I began with a univariate linear regression model, estimating interest rates (IntRate) as a function of average FICO score (FICOavg). I interpreted the relative contribution of each variable to the model through its adjusted coefficient of determination, or adjusted R2 value, which only increases in multiple regression if a new variable added to the model increases the model’s explanatory power more than would be expected by chance [7].
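To make the adjustment concrete, here is a minimal R sketch of the standard adjusted R2 formula, checked against the value that summary() reports for an lm() fit. The data are simulated purely for illustration (they are not the loan data), and the helper name adjusted_r2 is hypothetical.

```r
# Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1),
# where n = number of observations and p = number of predictors
# (not counting the intercept).
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Simulated data for illustration only (NOT the loan data).
set.seed(1)
n <- 200
x <- rnorm(n)
noise <- rnorm(n)          # generated independently of y
y <- 2 * x + rnorm(n)

fit1 <- lm(y ~ x)
fit2 <- lm(y ~ x + noise)  # adds a predictor with no real signal

s1 <- summary(fit1)
s2 <- summary(fit2)

# The hand computation should match what summary() reports.
manual1 <- adjusted_r2(s1$r.squared, n, p = 1)
manual2 <- adjusted_r2(s2$r.squared, n, p = 2)

print(c(manual = manual1, reported = s1$adj.r.squared))
print(c(manual = manual2, reported = s2$adj.r.squared))
```

Plain R2 can never go down when a predictor is added, which is why the adjusted version, with its penalty for each extra predictor, is the more useful yardstick when comparing models of different sizes.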

Results

The data set includes records of loans granted in 46 states across the U.S. Besides state, other categorical variables in the data set included: applicants’ employment length, home ownership status, and FICO score range, as well as the loan purpose, and loan length. Quantitative variables included: applicants’ monthly income, debt-to-income ratio, revolving credit balance, the number of open credit lines, the number of inquiries within the last 6 months, the loan amount requested by applicants, and the loan amount funded by investors.

Loans funded by investors range in value from $0 to $35,000, with the mean value being $12,000. Loans are either 36 or 60 months in duration, and are categorized into 14 purposes, the most common being debt consolidation (n=1307) and credit card (n=444).

To determine the baseline explanatory power of average FICO score (FICOavg) for interest rates (IntRate, measured as a decimal from 0 to 1 rather than a percentage), I tested the following univariate linear regression model:

IntRate = b1 FICOavg + error

where b1 = -8.46e-4, indicating that an increase of 1 point in the applicant’s average FICO score corresponds to a reduction of 0.0846 percentage points in the loan’s interest rate. This relationship was highly significant at the P<0.0001 level, and the adjusted R2 value of the model was 0.5026.
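A baseline model of this shape can be fitted in R with lm(). The sketch below is not the original analysis code: it runs on simulated data, with the slope borrowed from the text and the intercept and noise level chosen arbitrarily. Note that lm() also fits an intercept by default, which the equation above does not show.

```r
# Simulated stand-in for the loan data (NOT the real Lending Club file).
# The slope used to generate it is borrowed from the text; the intercept
# and noise level are arbitrary assumptions.
set.seed(42)
n <- 500
loans <- data.frame(FICOavg = round(runif(n, 640, 830)))
loans$IntRate <- 0.73 - 8.46e-4 * loans$FICOavg + rnorm(n, sd = 0.02)

# Baseline model. lm() adds an intercept term automatically.
base_fit <- lm(IntRate ~ FICOavg, data = loans)
base_sum <- summary(base_fit)

b1 <- coef(base_fit)[["FICOavg"]]
p_value <- coef(base_sum)["FICOavg", "Pr(>|t|)"]
adj_r2 <- base_sum$adj.r.squared

# IntRate is stored as a decimal, so multiply by 100 for percentage points.
cat("b1:", signif(b1, 3), "\n")
cat("Change per FICO point (percentage points):", signif(100 * b1, 3), "\n")
cat("P value:", signif(p_value, 3), " adjusted R2:", round(adj_r2, 4), "\n")
```

Because the data are simulated, the printed adjusted R2 will not match the 0.5026 reported above; only the workflow carries over.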

I then tested each other variable’s effect (independent of FICOavg) on IntRate by adding them to the regression model, one at a time:

IntRate = b1 FICOavg + b2 Var2 + error

where Var2 was the added or “new” variable. If Var2 was a categorical variable, then it was added to the regression model as a factor. A variable’s independent effect on interest rates was considered “strong” if it met two criteria: (1) the new variable coefficient’s P value was less than 0.05, and (2) the variable added to the model’s predictive power for interest rates, interpreted in this analysis as an increase of at least 0.10 over the baseline adjusted R2 value of 0.5026 (that is, an adjusted R2 value of at least 0.6026).
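The one-variable-at-a-time procedure and its two criteria could be sketched in R roughly as follows. This is an illustration on simulated data, not the original code. The helper screen_variables and the Noise column are hypothetical, and for a factor with several levels the sketch assumes the smallest P value among its coefficients is the one that counts.

```r
# Hypothetical helper: add each candidate to the baseline model one at a
# time and apply the two "strong effect" criteria from the text.
screen_variables <- function(data, candidates, min_gain = 0.10, alpha = 0.05) {
  base_adj <- summary(lm(IntRate ~ FICOavg, data = data))$adj.r.squared
  rows <- lapply(candidates, function(v) {
    fit <- lm(reformulate(c("FICOavg", v), response = "IntRate"), data = data)
    s <- summary(fit)
    coefs <- coef(s)
    # Rows belonging to the new variable (a factor can contribute several).
    new_rows <- setdiff(rownames(coefs), c("(Intercept)", "FICOavg"))
    # Assumption: for factors, use the smallest P value among its levels.
    p_min <- min(coefs[new_rows, "Pr(>|t|)"])
    gain <- s$adj.r.squared - base_adj
    data.frame(variable = v, p_value = p_min, adj_r2 = s$adj.r.squared,
               gain = gain, strong = p_min < alpha && gain >= min_gain)
  })
  do.call(rbind, rows)
}

# Simulated data (NOT the real loans). Effect sizes are loosely borrowed
# from the text; everything else is an arbitrary assumption.
set.seed(7)
n <- 500
loans <- data.frame(
  FICOavg = round(runif(n, 640, 830)),
  Loan.Length = factor(sample(c("36 months", "60 months"), n, replace = TRUE)),
  Noise = rnorm(n)  # deliberately unrelated to the outcome
)
loans$IntRate <- 0.73 - 8.46e-4 * loans$FICOavg +
  0.0427 * (loans$Loan.Length == "60 months") + rnorm(n, sd = 0.02)

result <- screen_variables(loans, c("Loan.Length", "Noise"))
print(result)
```

In this simulated setup only Loan.Length should pass both criteria. A pure-noise column can occasionally slip under P<0.05 by chance, but it cannot plausibly add 0.10 to the adjusted R2, which is what the second criterion guards against.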

Finding 1: Longer loans have higher interest rates

The length of the loan (Loan.Length) was the only categorical variable that met these criteria. Loan.Length had two possible values: 36 months and 60 months. Including Loan.Length as a factor in the regression model increased the adjusted R2 value to 0.6896 (an increase of 0.187 from the univariate model’s adjusted R2 value). The b2 coefficient for the Loan.Length factor of 60 months was 0.0427, indicating that 60-month loans have the effect of increasing interest rates by 4.37% (P<0.0001) as opposed to 36-month loans among applicants with the same FICO score. The figure below represents the differential effects of loan length on interest rates, while controlling for applicants’ FICO scores.

[Figure: Loan Length]

Finding 2: The more money that is requested/funded, the higher the interest rates

Among quantitative variables, only two contributed to the model’s adjusted R2 value by at least 0.10: the value of the loan requested by the applicant (Amount.Requested), and the value of the loan funded by the investors (Amount.Funded). It is intuitive that these variables are highly correlated (larger requests are more likely to result in larger amounts funded), and indeed these variables have a Pearson’s correlation coefficient of 0.9698 (P<0.0001).
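A correlation check like this one can be run in R with cor.test(). The sketch below uses simulated amounts, not the real data: funded amounts are generated as requested amounts minus an occasional shortfall, which is an assumption made only so that the two series are strongly correlated without being identical.

```r
# Simulated amounts for illustration (NOT the real data): funded amounts
# are generated as requested amounts minus an occasional shortfall.
set.seed(3)
n <- 500
requested <- round(runif(n, 1000, 35000), -2)
shortfall <- ifelse(runif(n) < 0.2, runif(n, 0, 5000), 0)
funded <- pmax(requested - shortfall, 0)

# Pearson correlation with its significance test.
ct <- cor.test(requested, funded, method = "pearson")
r <- unname(ct$estimate)
cat("Pearson r:", round(r, 4), " P value:", signif(ct$p.value, 3), "\n")
```

When two candidate predictors are this strongly correlated, they carry nearly the same information, so it is unsurprising if they behave almost identically when added to a model.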

The b2 coefficient was exactly the same whether Amount.Requested or Amount.Funded was added to the multivariate regression model: 2.11e-6 (P<0.0001 for each variable). This means that for each $1 increase in the loan amount (whether the amount is being requested by the applicant or funded by the investor), the interest rate will increase by 0.000211 percentage points, or equivalently about 2.11 percentage points per $10,000. The adjusted R2 value for the model was 0.6564 when Amount.Requested was Var2, and 0.6551 when Amount.Funded was Var2. Thus, each of these variables increased the adjusted R2 value by at least 0.15 compared to the baseline model considering only FICOavg. The figure below demonstrates visually how the value of the loan requested (Amount.Requested, broken into quartiles) stratifies a scatterplot of interest rates versus credit scores.

[Figure: Amount Requested]
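Breaking a continuous variable into quartiles, as in the figure, takes only a couple of lines of base R. This is a sketch on simulated amounts (not the real data), and the labels Q1 to Q4 are my own choice.

```r
set.seed(11)
amount <- round(runif(400, 1000, 35000), -2)   # simulated, NOT real data

# Quartile cut points, then one bin label per loan.
breaks <- quantile(amount, probs = seq(0, 1, by = 0.25))
quartile <- cut(amount, breaks = breaks, include.lowest = TRUE,
                labels = c("Q1", "Q2", "Q3", "Q4"))

print(table(quartile))
```

The resulting factor can then be used to colour or facet a scatterplot, for example by passing it as the col argument to plot().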

Other (Weaker) Relationships

When added to the multivariate regression model, other variables in the data set did have statistically significant (P<0.05) b2 coefficients, though none raised the adjusted R2 above the baseline of 0.5026 by more than 0.001, and thus none meaningfully increased the model’s predictive power for interest rates, independent of average FICO scores. Variables that had positive and significant b2 coefficients (and thus were associated with increased interest rates for applicants with the same average FICO score) included monthly income, revolving credit balance, inquiries in the past 6 months, employment length of 10+ years, and loan purposes that were “house” or “small business”. Only one factor had a significant negative relationship with interest rates: applicants whose home ownership status was “rent”.

Conclusions

Loan length and the value of the loan (amount requested and amount funded) were the variables with the strongest effects on interest rates, independent of credit scores. Loans with a length of 60 months were shown to increase interest rates by over 4% compared to loans with a length of 36 months, while each dollar added to the loan amount requested/funded was associated with a .000211% increase in interest rates. There are several weaknesses in this analysis. The associations should be treated with caution because of the wide variation in units: interest rates varying by fractions of a percent, while FICO scores, incomes, and other variables are measured on different and larger scales. In addition, my multiple regression equations did not include interaction terms between FICO scores and other explanatory variables included in the analysis, which may affect how associations are interpreted [8].
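For readers curious what adding an interaction term would look like, here is a minimal sketch in R. It was not part of the analysis above and runs on simulated data in which the FICO slope is deliberately made to differ by loan length, so every number in it is an assumption. The formula FICOavg * Loan.Length expands to both main effects plus their interaction, and anova() compares the two nested models.

```r
# Simulated data (NOT the real loans); here the FICO slope is made to
# differ by loan length so that an interaction exists to be found.
set.seed(5)
n <- 600
loans <- data.frame(
  FICOavg = round(runif(n, 640, 830)),
  Loan.Length = factor(sample(c("36 months", "60 months"), n, replace = TRUE))
)
long <- loans$Loan.Length == "60 months"
loans$IntRate <- 0.73 - 8.46e-4 * loans$FICOavg +
  long * (0.20 - 2.2e-4 * loans$FICOavg) + rnorm(n, sd = 0.015)

additive <- lm(IntRate ~ FICOavg + Loan.Length, data = loans)
interact <- lm(IntRate ~ FICOavg * Loan.Length, data = loans)

# F test: does the interaction term improve on the additive model?
comparison <- anova(additive, interact)
print(coef(interact))
print(comparison)
```

If the interaction coefficient is significant, the effect of loan length is no longer a single number: it depends on the applicant’s FICO score, and the additive model’s b2 should be read as an average over that range.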

References

[1] Wikipedia “Peer-to-peer lending” Page. URL: http://en.wikipedia.org/wiki/Peer-to-peer_lending. Accessed 2/15/2013.

[2] MyFICO “Credit Basics” Page. URL: http://www.myfico.com/CreditEducation/Articles// Accessed 2/16/2013.

[3] The Lending Club. URL: http://www.lendingclub.com/. Accessed 2/16/2013.

[4] Project Data Set. URL: https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv. Accessed 2/12/2013.

[5] The R Project for Statistical Computing. URL: http://www.r-project.org/. Accessed 2/16/2013.

[6] Virginia Tech Department of Statistics “Interpreting Multiple Regression: A Short Overview” Page. URL: http://www.lisa.stat.vt.edu/sites/default/files/Multiple_Regression_Analysis.pdf. Accessed 2/15/2013.

[7] Wikipedia “Coefficient of Determination” Page. URL: http://en.wikipedia.org/wiki/Coefficient_of_determination. Accessed 2/16/2013.

[8] Macquarie University PSYSTAT “Testing and Interpreting Interactions in Regression – In a Nutshell” Page. URL: http://www.psy.mq.edu.au/psystat/documents/interaction.pdf/. Accessed 2/16/2012.

About katosulliv

Transportation, mapping, and cities enthusiast.

One Response to Data Analysis in R: Loan Interest Rates

  1. Ajmal Shahbaz says:

    Hey, Good work. I am just a beginner in R and your post has helped me alot. basically I am electrical Engineer. I love to do software work. I have a little request. Can you give me your R code?
