Great Reset

Richard Florida’s theories about economic development and the “creative class” don’t always resonate with me, but his latest book The Great Reset most certainly did. He makes some important points about the role of the service sector in our economy, as well as high-speed rail and urban development. My head was nodding vigorously at almost every page.

The thesis of the book is that we are on the brink of a “Great Reset”, or a massive reorganization of the old economic and social order that led to the 2008 recession. Just like the urban, industrial era following the Long Depression of 1873, and the suburban, mass-production boom after the Great Depression of the 1930’s, this Reset will be a sweeping transformation of where we live and work, what our jobs are, and how we spend our money.

Megaregions as the new economic geography

Each Reset culminates in a “spatial fix,” or geographic resettlement: the industrial city of the late 1800’s and the suburbs from the 1940’s onwards are examples. Florida projects that the spatial fix of the current Reset is the Megaregion: clusters of major metro areas, secondary cities, and their suburbs. The largest in North America is the “Bos-Wash” corridor (encompassing Boston, New York, Philadelphia, Baltimore, and Washington, D.C., with over 50 million people and more than $2 trillion in economic activity). Megaregions, more than nations, run the global economy: the world’s 40 largest megaregions comprise 67% of all economic activity and 85% of all technological innovation, yet only 18% of the population.

People have been concentrating into cities for a while (as of 2010, the majority of the world’s population is urban), and they will continue to. Megaregions serve as magnets for Florida’s “creative class” and have weathered the recession better than other areas, sustaining high economic and population growth while smaller cities have shrunk. A new theory of urban economics helps to explain why larger cities sustain higher “metabolisms” (a metaphor for the pace of innovation, economic growth, and social life) without collapsing into congested and inefficient messes. The study’s authors describe this mechanism as “accelerated innovation cycles,” but I prefer Florida’s wording: “As globalization has increased the financial return on innovation (by widening the consumer market), the pull of innovative places, which are already dense with highly talented workers, has only grown stronger” (p. 152). Megaregions have a bright future.

The new economy: fulfilling jobs in the service sector and beyond

While higher-paying knowledge/professional/creative jobs are growing and generate substantial wealth, catering solely to the educated “creative class” employed in this sector is an elitist, tunnel-visioned approach to economic development. Florida points out that the service economy, comprised of routine, low-paying, and generally disdained jobs such as food service, hospitality, cleaning, and home health aides, is bigger than any other sector. It comprises over 45% of U.S. jobs, and it isn’t going anywhere, because unlike manufacturing, it’s impossible to outsource overseas the tasks of cleaning buildings, walking dogs, or cashing people out at the grocery store.

Florida argues that the service sector is a huge untapped source of jobs that could be made better if companies paid front-line workers more livable wages, offered promotion potential, and made better use of the analytical and social skills of all workers, not just the managers. Companies pioneering this approach have already shown that extending creative input to all workers yields innovations that help the bottom line and makes for a more fulfilling work experience for employees, reducing costly turnover.

High-speed rail infrastructure is an investment in the new economy

If the highway and private automobile provided the framework for the suburban spatial fix, high-speed rail will be the backbone of the megaregion economy. High-speed rail is the fastest and most convenient ground transportation available – and is often quicker than air travel when you account for security procedures and wait times. Rail increases connectivity within and between major metropolises and their secondary cities. It facilitates the exchange of people, ideas, and economic functions, broadening labor markets and providing a framework for in-fill development along rail corridors. As I noted in a previous post, economists and politicians have argued that the cost of high-speed rail infrastructure is not justifiable. I was very pleased with Florida’s counterargument: that it is less justifiable to use federal dollars to bail out the auto industries and banks that fueled the recession – in the form of the sprawled, suburban landscape and accompanying housing bubble – in the first place. A new economic order, a Great Reset, calls for new infrastructure. He goes on to say:

“Infrastructure is always expensive, and there’s no clear way to measure the overall future return on the investment, whether it’s in the form of innovation, development, or new communities and jobs. Infrastructure provides a skeleton on which to grow a new economic model. The infrastructure investments we make now will determine the kind of economy we have in the future…In some ways, infrastructure is analogous to government support for basic research in medicine or the social sciences. Such investments, which are either too large or too risky for private companies to undertake, offer a significant social rate of return that can drive future invention, productivity, and growth” (p. 170).

High-speed rail is an example of such infrastructure, a critical adaptation and complement to a new economic order of megaregions, that could offer a high quality of life and fulfilling employment opportunities for those in the professional and service sectors alike.


New York’s Adult Tobacco Survey

I quit, and it feels so good! Not tobacco: a job. For the past 7 months, I’ve been humbled and also mortified by working the front lines of what is sometimes glorified as “primary data collection for health behavior research.” In reality, it’s phone peddling of an unexpected sort: rather than sales, donations, or debt collection, the desired outcome is completed surveys. But instead of collecting data, telephone interviewers spend most of their shifts getting yelled at or hung up on. And who could blame the world at large for reacting this way to calls from strangers with stilted and over-eager introductions?

I’ve worked in data collection before, measuring things like traffic volume, pavement cracks, salt marsh redox potential, and all the species of bugs and grubs scraped from riverbed rocks. But this is the first time I’ve had to interface (intervoice?) with fellow humans, and it’s painful. A lot of data these days is collected by pained people like me in call centers like this one, where teams of 100+ interviewers cold-call random households and sometimes businesses for research studies commissioned by all sorts of clients (e.g., universities, government health departments) about topics like tobacco use, physical activity, nutrition, and smoking policies in apartment buildings. One of the surveys the office does each year is the New York State Adult Tobacco Survey (ATS). When I discovered that this project’s data is published online, I jumped at the opportunity to examine the finished product of all those hours of tedious dialing. Below, I analyze the 2009 and 2010 data.

The purpose of ATS is to help the New York State Tobacco Control Program monitor how attitudes about smoking change over time. The program uses this information to better target its activities (smoking cessation services, media campaigns, and policy work promoting tobacco control), and also to brag that desired behavior/attitude changes reported through ATS are an indicator of its effective programming (FALLACY!). In any case, this is important to understand, because tobacco use is currently the leading cause of preventable death in the U.S.

National Context:

The Centers for Disease Control and Prevention published a 2010 study of 33 Adult Tobacco Surveys administered in 19 states from 2003-2007. The sample size (or number of survey respondents) for each ATS ranged from 1,300 to 12,000 (NY’s was just over 4,000 in both 2009 and 2010). However, all analysis is done after the data is weighted by each respondent’s probability of selection within the state, according to race, ethnicity, sex, and household size. Bottom line: the numbers presented here are in terms of projected state population rather than simply the number of survey respondents.
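To make the weighting step concrete, here’s a toy calculation in Ruby (hypothetical respondents and weights; the real ATS weighting scheme is more involved): each respondent counts in proportion to the share of the state population they represent, not as one raw response.

```ruby
# Toy weighted-prevalence calculation. Each respondent is a
# [weight, smoker?] pair, where the weight is the number of state
# residents that respondent is taken to represent.
def weighted_prevalence(respondents)
  total_weight  = respondents.sum { |weight, _smoker| weight }
  smoker_weight = respondents.sum { |weight, smoker| smoker ? weight : 0 }
  smoker_weight / total_weight.to_f
end

# Two respondents: a smoker representing 100 people and a non-smoker
# representing 300. Weighted prevalence is 100/400 = 25%, even though
# half the raw sample smokes.
weighted_prevalence([[100, true], [300, false]])  # => 0.25
```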

The prevalence of current smokers within New York State is somewhere around 17% (16.3% in 2009, 17.7% in 2010), which is fairly low nationally: the CDC study showed a median prevalence of 19.2%, with states ranging from 13.3% (Hawaii in 2006) to 25.4% (West Virginia in 2005). Tobacco use prevalence for cigarettes, cigars (including cigarillos and little filtered cigars), and smokeless tobacco (chew, dip, snuff) is graphed below, comparing the lowest, highest, and median rates from the CDC review to New York’s 2009/2010 ATS results.

Tobacco use prevalence comparison

Smokin’ Maps?

But how does cigarette smoking prevalence vary across the state?

It’s a tricky question to answer through the ATS. The most precise geographic information in the 2010 data set is the postal/zip code of each respondent. However, only about half of New York State’s 4,000-plus zip codes are represented in the data, as shown below:

2010 geographic coverage

To avoid bias, I merged the zip code data for county-level analysis, comparing the projected prevalence of current smokers (shown below). Keep in mind that this map is imprecise: the estimates come from a sample comprising only 0.03% of the state’s adult population, inhabiting only half of its zip codes. It’s unlikely, for example, that over half of adults in Yates County (adult population of about 19,000) are smokers. However, the map does illustrate some broader trends, such as a cluster of counties radiating south and west from Albany, but north of NYC, with prevalence rates above the state average (17%). The coverage map above reveals, though, that these counties also happen to have a large proportion of their geographic area unrepresented in the data.

county smoking prevalence

The Cost

The cost of addiction (to the addicted, but also to the rest of the population) is a subject that I find particularly interesting.

Making cigarettes more expensive is meant to discourage smoking, and New York has the highest cigarette excise tax in the nation: bumped up to $4.35 per pack in July 2010. But it’s important to keep in mind that this tax revenue comes disproportionately from the pockets of lower-income residents.

Smoking Status by Income Bracket

That’s because people with annual household incomes under $30,000 are more than twice as likely to be current smokers (42%) compared to the general population (20%), as shown above, and among smokers, those with lower incomes are also 20% more likely to be at least moderately concerned about the cost of cigarettes. But the rising costs are here to stay, as the link between increased cigarette prices and lower smoking prevalence has been rigorously proven (rigorous = a meta-analysis of over 500 studies!) across low, middle, and high income groups.

Some health economists argue that the regressive nature of the excise tax (burdening the poor more than anyone else) is addressed by directing that tax revenue into cessation services that target low-income smokers. Indeed, NY’s ATS data shows that among smokers, those with lower incomes are most likely to be aware of and have used the state’s quit-line (which offers free counseling and Nicotine Replacement Therapy).  So low-income smokers may be making most use of cessation services, but they are also paying for it big-time, and it’s not helping much: the disparity in smoking status between income-groups remains.

Health disparity by social class is nothing new. But when it comes to addiction, nutrition, and other lifestyle factors, discussion tends to gravitate strongly toward the responsibility of the individual. And I agree, individual responsibility is an important factor. But it’s also important to remember what we are collectively responsible for: the barriers to employment, childcare, transportation, and other societal circumstances beyond an individual’s control that may fuel chronic stress and drive them toward certain health behaviors.


Visual Display of Quantitative Information

I just finished Edward Tufte’s Visual Display of Quantitative Information (2nd ed), a classic modern text (or so I hear) on how to design data graphics, which “visually display measured quantities by means of the combined use of points, lines, a coordinate system, numbers, symbols, words, shading, and color”.

This is a great book that I’m sorry I didn’t read sooner. Some key points I took away:

  • Tables are better suited for displaying data sets with 20 objects or fewer, while visual graphics are better suited for summarizing a lot of information.
  • Color often muddles rather than clarifies data graphics, as the human eye does not easily give visual ordering to colors. Gray-scale shading, however, does convey a natural visual hierarchy, and so better represents varying quantities than color does.
  • If color is used, avoid red/green contrasts in consideration of color-blind viewers (5-10% of population). I am VERY guilty of this. Green/yellow/red scales are often my default. Contrasts with blue are a safer bet: color-blind people can generally differentiate blue from all other colors.
  • Regarding typography: the more that letters are differentiated, the easier the reading. This means that “serif” rather than “sans serif” fonts are preferable, and all-caps writing should be avoided (the more equal height/width/volume of capital letters makes for more difficult reading).
  • Do not vary the design of the graphic (i.e., the scale, symbology, etc.), because this distorts how the viewer perceives variation in the data. Variation in the data is, after all, what the graphic is there to illustrate – and truthfully so.
  • The number of dimensions in the graphic should not exceed the number of dimensions in the data. So, for example, don’t use area (2-D, e.g., through differently-sized circles) to represent a 1-D measure, such as the value of a dollar over time.
  • Less is more, or, maximize the Data-Ink Ratio. This is the ratio of ink used to represent the data (essential to the graphic) to the total ink used to print the graphic (which includes the grid, frame, axis/scale bar ticks, etc.).
  • Maximize the “data density” of the graphic – or the number of entries in the data matrix within the area of the graphic.
  • “Small multiples”, or a series/block of small graphics indexed by changes in a particular variable, are an effective graphic format because they are inherently comparative and tend to have high data densities.
  • Pie charts should never be used, because of their low data-density, and because the human eye is not adept at detecting differences in angles.

Maps in the History of Information Visualization:

One of the most interesting parts of the book outlines the history of data graphics. If you have any doubts that geography is awesome and always has been, consider this: before charts, before graphs and plots, there were MAPS!

Geographic maps were the first form of data graphics, at least as far as historians can tell. While the first maps found on clay tablets date prior to 3500 BC, thousands of years passed before precise cartographic maps with full grids were created (the 1100’s AD in China, and 1550 AD in Western civilization), and it wasn’t until 1686 that cartography and statistics merged to create the first thematic map (which Tufte refers to as a “data map”, but this has come to mean something else in the IT world, so I avoid the term for clarity’s sake). This early thematic map, courtesy of Edmond Halley, shows the location of trade winds and monsoons. Geographic analysis really blossomed after John Snow’s famous, even mythologized 1854 map of cholera deaths and water pumps in London – some say this kick-started the fields of health geography and spatial epidemiology. Charles Joseph Minard’s multivariate map (published 1869, shown below) of Napoleon’s 1812 Russian campaign is another famous merger of cartography and data visualization that Tufte says “may well be the best statistical graph ever drawn.”



Geocoding in Ruby

Sans confidence or competence, I program! Computer languages have always seemed prohibitively complex and very boring. Given my interest in working with data and ‘information’, taking the leap to explore Information Technology is a predictable and necessary next step…that I’ve avoided for years. Having I.T. bigshots for both a boyfriend and a brother has done little to alleviate my dread of the terminal and its command lines.

To bulk up my GIS muscles, I tried and failed to teach myself Python from a book this past Fall. The failure was purely one of attention span…how to trudge through mundane code rules, vaguely applicable to matters of interest, when much friendlier reading material was calling to me from the library shelves? Then in January I started to learn Ruby at awesome (& free!) Learning-to-Code classes held every other Monday night at the Co-Work Buffalo office, which use Chris Pine’s Learn to Program book/website as a guide. This learning attempt was much more successful, in that I actually accomplished stuff – geocoding being the biggest deal to me and my mapping.

Geocoding is one of the most important processes in spatial analysis: matching a place-descriptive data field (like an address or city name) to a precise location in terms of latitude/longitude. To create a geocoder in Ruby I first installed RubyGems, a framework for managing other packages/libraries of code. I then installed a geocoding package aptly named Geocoder, which operates within the RubyGems framework and uses Google’s Geocoding API by default to look up addresses and/or geographic coordinates.

I used these packages to write the program below, which reads in a list of addresses (the file Addresses.txt), and writes a new file (LatLong.txt) listing the latitude and longitude coordinates for each address.
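A minimal sketch of that script (Geocoder.search and the result’s .coordinates method are the gem’s lookup interface; the “NOT FOUND” marker and helper names are my own additions):

```ruby
# Format one output line: "latitude, longitude", or a marker for failed lookups.
def format_line(address, coords)
  coords ? "#{coords[0]}, #{coords[1]}" : "NOT FOUND: #{address}"
end

# Look up [latitude, longitude] for an address via the Geocoder gem.
def coords_for(address)
  require "geocoder"                       # gem install geocoder
  result = Geocoder.search(address).first  # network call, Google's API by default
  result && result.coordinates             # [lat, lon] or nil
end

# Geocode every address in infile, writing one coordinate line per address.
def geocode_file(infile = "Addresses.txt", outfile = "LatLong.txt")
  File.open(outfile, "w") do |out|
    File.readlines(infile, chomp: true).each do |address|
      out.puts format_line(address, coords_for(address))
    end
  end
end
```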


My next programming goal is to venture into the endless possibilities of web-scraping, using the Ruby Mechanize package. Inspired by this hilarious map, I want to mine Craigslist for all the rich social data it has to offer.


Criminal Geography

Even if you do not have a mother/relative who is Jewish and from the Bronx, with a maiden-name that literally translates from German as “worry,” I’ll bet that you can empathize in some way with a childhood spent lovingly smothered by warnings of the dangers lurking within any given scenario – with special emphasis on random and violent crime. Abductions! muggings! rape! oh my!! (Love you mom, you’re great.)

More than crime itself, fear of crime has extraordinary power to transform the character of a place. Based on hearsay, news reports, or even personal judgment of people and/or the built environment, we come to know certain parks, alleys, and stretches of blocks as ‘safe’ or ‘unsafe’. If you accept how Mr. Maslow prioritizes human needs, then you agree that the need for personal safety is a very fundamental one – right above basic physiological functions like breathing and heartbeat. Thus, few things affect our behavior and world-view as pervasively as fear of crime. Collectively, these attitudes help drive the investment or disinvestment, and the presence or absence of casual surveillance, that make neighborhoods thrive or…dive.

I obtained crime data from the past 6 months within the city of Buffalo from a web service that Buffalo’s police department has partnered with in making crime data publicly available. However, this service only allows users to see 500 crimes at a time, and/or up to a maximum 30-day time period. I wanted to see more general trends, and so compiled this data for the past 6 months (the farthest back for which data is available) by copying and pasting the listings in 500-crime increments into a text file, which I then geocoded.


Of the 8,084 crimes reported within Buffalo within the past 6 months (9/24/2012-3/23/2013), 3,644 or 45% of them were thefts. The most crime-ridden day was 10/10/2012 (81 crimes reported), and the most common time associated with a crime was 12 pm (692 crimes), followed by 9 am (207). This time data is a little funky – a huge chunk of the day is missing (1 pm to midnight), and I can’t find any metadata to specify whether this field refers to the time the crime was committed or when it was reported. So I haven’t looked at any ‘time of day’ patterns too closely.
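The day and time tallies above boil down to “most common value in a column” counts; a Ruby sketch of that bookkeeping (toy rows, not the real data file):

```ruby
# Find the most common value in one column of the crime table.
# rows is an array of arrays like [date, time, type]; index picks the column.
def most_common(rows, index)
  counts = Hash.new(0)
  rows.each { |row| counts[row[index]] += 1 }   # tally values in that column
  counts.max_by { |_value, count| count }       # => [value, count]
end

rows = [
  ["10/10/2012", "12 pm", "Theft"],
  ["10/10/2012", "9 am",  "Assault"],
  ["10/11/2012", "12 pm", "Theft"],
]
most_common(rows, 0)  # busiest day in the toy data
most_common(rows, 1)  # most common time in the toy data
```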


Buffalo’s crimes are classified into the following categories, noted with the % of crimes reported within the past 6 months that fall within each category. I paraphrased the crime type definitions from the FBI’s Uniform Crime Reporting Handbook.

  • Theft (45%): Completed or attempted theft of property or cash without personal contact.
  • Breaking & Entering (22%): The unlawful entry of a structure to commit a felony or a theft.
  • Assault (18%): Attack on one person by another. Includes both aggravated assault (for the purpose of inflicting severe or aggravated bodily injury) and simple assault (does not feature weapons or serious injury).
  • Robbery (7%): Involves a theft/larceny but is aggravated by the element of force or threat of force.
  • Theft of Vehicle (7%): …yep, like it sounds, theft of a vehicle.
  • Other (1%): Includes theft from vehicle, homicide, sexual offense, and property crime (both commercial and residential).


The chart below shows the number of crimes reported by week, from late September to late March. I expected to see a spike in crime (particularly thefts) in November and December, preceding the holiday season, but this was not the case. In fact, there has been a slight overall downward trend in the number of weekly crimes reported since late September.

Crimes by Date


The single location with the most crimes reported (42) was the 2100 block of Elmwood Ave – which is not in the Elmwood Village, but north of Hertel, surrounded by the parking lots of a Home Depot and other strip-mall outlets. The 2600 block of Delaware Ave came in 2nd place (41 crimes), and is a similarly sprawled parking-lot-scape, as is the 600 block of Amherst Street (35 crimes). In each of these top-three crime locales, over 85% of reported crimes were thefts, presumably from the surrounding big-box stores.

If you are interested in zooming in to particular parts of the city to look at the data in greater detail, see the Google map below. Crimes are color-coded by type (the three most common types), and if you click on each point you can also see the crime date, time, other notes, and the number of crimes committed at that location. At this time, Google Maps is unable to differentiate points that occur in the same location. So, for example, the 42 crimes that occurred on the 2100 block of Elmwood Ave are represented on this map as one point, though the #Crimes@Location field does indicate that 42 crimes were reported there.

Crimes Reported in Buffalo, NY, 9/24/2012 – 3/23/2013:
Buffalo Crimes

To examine the distribution of crime beyond the scope of singular addresses, I created heat maps. These show the relative concentrations of crime incidents within a half-kilometer radius of each point throughout the city. The map below represents total crime, and is followed by maps showing the relative concentrations of the three most common types of crimes: thefts, breaking & entering, and assaults. Across all crime types, the two main hot spots in the city are the center of downtown (centered around City Hall) and the intersection of Genesee and the railroad east of Bailey Ave.

Buffalo Total Crime
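Under the hood, a heat map like this is a density surface: for any location, count (or kernel-weight) the incidents within the radius. A simplified Ruby sketch that does plain counting with the haversine distance (real kernel density estimation weights nearer incidents more heavily; helper names are my own):

```ruby
EARTH_RADIUS_KM = 6371.0

# Great-circle distance in km between two lat/long points (haversine formula).
def haversine_km(lat1, lon1, lat2, lon2)
  to_rad = Math::PI / 180
  dlat = (lat2 - lat1) * to_rad
  dlon = (lon2 - lon1) * to_rad
  a = Math.sin(dlat / 2)**2 +
      Math.cos(lat1 * to_rad) * Math.cos(lat2 * to_rad) * Math.sin(dlon / 2)**2
  2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a))
end

# Density at (lat, lon): number of incidents within radius_km (default 0.5 km).
def density_at(lat, lon, incidents, radius_km = 0.5)
  incidents.count { |ilat, ilon| haversine_km(lat, lon, ilat, ilon) <= radius_km }
end
```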





Data Analysis in R: Loan Interest Rates

If you like school and are not already familiar with Coursera, I highly recommend you check it out. The internet has come a long way in evolving beyond its primordial functions (wasting time and checking facts), but I was surprised to discover that it can now be your free college education. Through Coursera, you sign up for courses being taught at universities around the world, and each week, you go through the lectures, quizzes, and assignments at your own pace. Of course, you don’t get actual college credit for your efforts (though that may soon change), just a certificate of completion. But that doesn’t really matter to school-lovers like myself, who often question the cognitive payoff of the time, money, and stress invested in pursuit of post-secondary degrees (i.e., could I have made off like Will Hunting if I just spent enough time in the library? This is a foolish thought, as I spend most of my time in the library reading cookbooks, as well as Margaret Atwood and Zadie Smith novels, but I can’t help but wonder anyway).

I’m currently taking a Data Analysis course, taught by Jeff Leek through the Johns Hopkins Bloomberg School of Public Health. I took a couple of stats courses in college and am comfortable with most of the concepts, but I wanted to be better. Specifically, I wanted to learn how to use R, the open-source software for statistical analysis and computing.

The first assignment was due this morning: an exploration of the relationships between peer-to-peer loan interest rates and other characteristics of the loans and their applicants. If you’re interested (don’t worry, I’m not offended if you’re not), an excerpt of my analysis is below:

Loan Amount and Length Affect Interest Rates for Applicants with Same Credit Scores


Peer-to-peer lending involves money lent between individuals, without the involvement of traditional financial institutions, such as banks [1]. Like all lending, peer-to-peer loans involve different risk levels to the investor, depending on the characteristics of the loan and applicant. Higher risk is reflected in higher interest rates, which are undesirable for applicants as they increase the overall cost of the loan. An important characteristic reflecting risk to lenders is the applicant’s FICO credit score: lower scores reflect poor credit history and may therefore warrant higher interest rates from the lender [2]. Given this relationship between FICO score and interest rates, the purpose of this data analysis is to identify associations between peer-to-peer loans’ interest rates and other characteristics of the loans and their applicants (controlling for the applicants’ FICO scores). These associations are important, as they give insight as to how peer-to-peer lenders consider characteristics of an applicant and prospective loan in determining a loan’s interest rate.


Data on 2,500 peer-to-peer loans was obtained from the Lending Club [3], downloaded as a csv file on 2/10/2013 [4], and analyzed using R Statistical Computing Software [5].

To evaluate the relationship between the loans’ interest rate and characteristics of the borrowers, I used multivariate linear regression, pairing average credit score with each other independent variable, with interest rate as the outcome variable.

Multivariate linear regression demonstrates the unique contribution of each categorical or continuous independent variable to variation in the continuous dependent variable: loan interest rate (IntRate) [6].

I began with a univariate linear regression model, estimating interest rates (IntRate) as a function of average FICO score (FICOavg). I interpreted the relative contribution of each variable to the model through its adjusted coefficient of determination, or adjusted-R2 value, which only increases in multiple regression if a new variable added to the model increases the model’s explanatory power more than would be expected by chance [7].


The data set includes records of loans granted in 46 states across the U.S. Besides state, other categorical variables in the data set included: applicants’ employment length, home ownership status, and FICO score range, as well as the loan purpose, and loan length. Quantitative variables included: applicants’ monthly income, debt-to-income ratio, revolving credit balance, the number of open credit lines, the number of inquiries within the last 6 months, the loan amount requested by applicants, and the loan amount funded by investors.

Loans funded by investors range in value from $0 to $35,000, with the mean value being $12,000. Loans are either 36 or 60 months in duration, and are categorized into 14 purposes, the most common being debt consolidation (n=1307) and credit card (n=444).

To determine the baseline explanatory power of average FICO score (FICOavg) for interest rates (IntRate, measured as a decimal from 0 to 1 rather than a percentage), I tested the following univariate linear regression model:

IntRate = b1 FICOavg + error

where b1 = -8.46e-4, indicating that an increase of 1 point in the applicant’s average FICO score corresponds to a reduction of 0.0846% in the loan’s interest rate. This relationship was highly significant at the P<0.0001 level, and the adjusted R2 value of the model was 0.5026.

I then tested each other variable’s effect (independent of FICOavg) on IntRate by adding them to the regression model, one at a time:

IntRate = b1 FICOavg + b2 Var2 + error

where Var2 was the added or “new” variable. If Var2 was a categorical variable, then it was added to the regression model as a factor. A variable’s independent effect on interest rates was considered “strong” according to the following two criteria: (1) the new variable coefficient’s P value was less than 0.05, and (2) the variable contributed to the model’s predictive power for interest rates (interpreted in this analysis as an increase of at least 0.10 over the baseline adjusted R2 value of 0.5026; thus, the adjusted R2 value must be at least 0.6026).
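The actual fitting was done in R (lm() handles all of this), but the mechanics can be sketched in Ruby, the language from my geocoding post: ordinary least squares via the normal equations, plus the adjusted R2 used here for model comparison. Toy data only, not the loan data set.

```ruby
require "matrix"

# Ordinary least squares via the normal equations.
# x_rows: one array of predictor values per observation; y: outcomes.
# Returns [coefficients (intercept first), adjusted R2].
def ols(x_rows, y)
  n = y.length
  x = Matrix[*x_rows.map { |row| [1.0] + row }]   # prepend an intercept column
  yv = Vector[*y]
  beta = (x.transpose * x).inverse * x.transpose * yv
  residuals = yv - x * beta
  ss_res = residuals.inner_product(residuals)     # residual sum of squares
  mean = y.sum / n.to_f
  ss_tot = y.sum { |v| (v - mean)**2 }            # total sum of squares
  p = x.column_count - 1                          # predictors, excluding intercept
  r2 = 1 - ss_res / ss_tot
  adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1).to_f
  [beta.to_a, adj_r2]
end

# Toy data where y = 2 + 3x exactly: the fit recovers the coefficients
# and the adjusted R2 is (essentially) 1.
coeffs, adj_r2 = ols([[0], [1], [2], [3]], [2, 5, 8, 11])
```

Adding a second predictor is just a longer row in x_rows; comparing the adjusted R2 before and after is the model-comparison criterion described above.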

Finding 1: Longer loans have higher interest rates

The length of the loan (Loan.Length) was the only categorical variable that met these criteria. Loan.Length had two possible values: 36 months and 60 months. Including Loan.Length as a factor in the regression model increased the adjusted R2 value to 0.6896 (an increase of 0.187 over the univariate model’s adjusted R2 value). The b2 coefficient for the Loan.Length factor of 60 months was 0.0427, indicating that 60-month loans have the effect of increasing interest rates by 4.27% (P<0.0001) as opposed to 36-month loans among applicants with the same FICO score. The figure below represents the differential effects of loan length on interest rates, while controlling for applicants’ FICO scores.

Loan Length

Finding 2: The more money that is requested/funded, the higher the interest rates

Among quantitative variables, only two contributed to the model’s adjusted R2 value by at least 0.10: the value of the loan requested by the applicant (Amount.Requested), and the value of the loan funded by the investors (Amount.Funded). It is intuitive that these variables are highly correlated (larger requests are more likely to result in larger amounts funded), and indeed these variables have a Pearson’s correlation coefficient of 0.9698 (P<0.0001).
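That correlation figure is Pearson’s r (in R, simply cor(x, y)); the computation itself is short enough to sketch in Ruby with toy vectors:

```ruby
# Pearson correlation coefficient between two equal-length arrays:
# the covariance of x and y divided by the product of their spreads.
def pearson(xs, ys)
  n  = xs.length.to_f
  mx = xs.sum / n
  my = ys.sum / n
  cov = xs.zip(ys).sum { |x, y| (x - mx) * (y - my) }
  sx  = Math.sqrt(xs.sum { |x| (x - mx)**2 })
  sy  = Math.sqrt(ys.sum { |y| (y - my)**2 })
  cov / (sx * sy)
end

# Perfectly proportional vectors correlate at (essentially) 1.0,
# analogous to the tight Amount.Requested/Amount.Funded relationship.
pearson([1, 2, 3], [2, 4, 6])
```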

The b2 coefficient was exactly the same when Amount.Requested and when Amount.Funded were added to the multivariate regression model: 2.11e-6 (P<0.0001 for each variable). This means that for each $1 increase in the loan amount (whether the amount is being requested by the applicant or funded by the investor), the interest rate will increase by 0.000211%. The adjusted R2 value for the model was 0.6564 when Amount.Requested was Var2, and 0.6551 when Amount.Funded was Var2. Thus, each of these variables increased the adjusted R2 value by at least 0.15 compared to the baseline model considering only FICOavg. The figure below demonstrates visually how the value of the loan requested (Amount.Requested, broken into quartiles) stratifies a scatterplot of interest rates versus credit scores.

Amount Requested
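Because the coefficient is per dollar, it is easiest to read at larger increments. A quick sketch using the reported 2.11e-6 coefficient (with interest rates expressed as proportions, as in the model above):

```python
B_AMOUNT = 2.11e-6  # reported per-dollar coefficient (rates as proportions)

def rate_increase_pct_points(extra_dollars, b=B_AMOUNT):
    """Predicted interest-rate increase, in percentage points,
    for a loan that is extra_dollars larger (same FICO score)."""
    return b * extra_dollars * 100

# Requesting $10,000 more predicts a rate about 2.11 percentage points higher
gap = rate_increase_pct_points(10_000)
```

In other words, a tiny per-dollar coefficient still matters at the scale of a typical loan.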

Other (Weaker) Relationships

When added to the multivariate regression model, other variables in the data set did have statistically significant (P<0.05) b2 coefficients, though none increased the baseline adjusted R2 of 0.5026 by more than 0.001, and thus none meaningfully improved the model’s predictive power for interest rates, independent of average FICO scores. Variables that had positive and significant b2 coefficients (and thus were associated with increased interest rates for applicants with the same average FICO score) included monthly income, revolving credit balance, inquiries in the past 6 months, employment length of 10+ years, and loan purposes that were “house” or “small business”. Only one factor had a significant negative relationship with interest rates: applicants whose home ownership status was “rent”.
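Since adjusted R2 is the yardstick used throughout [7], it may help to see the formula: it discounts R2 for the number of predictors, so a variable only raises it if it explains more than adding a random predictor would. A small sketch with hypothetical values (not from the data set):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors [7]."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical: n = 100 observations, one-predictor model with R2 = 0.5
one = adjusted_r2(0.5, 100, 1)   # about 0.4949
# A second predictor that leaves R2 unchanged *lowers* the adjusted value
two = adjusted_r2(0.5, 100, 2)
```

This penalty is why a variable can have a significant coefficient yet barely move the adjusted R2, as with the weaker relationships above.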


Loan length and the value of the loan (amount requested and amount funded) were the variables with the strongest effects on interest rates, independent of credit scores. Loans with a length of 60 months were shown to increase interest rates by over 4% compared to loans with a length of 36 months, while each dollar added to the loan amount requested/funded was associated with a .000211% increase in interest rates. There are several weaknesses in this analysis. The associations should be treated with caution because of the wide variation in units: interest rates vary by fractions of a percent, while FICO scores, incomes, and other variables are measured on much larger scales. In addition, my multiple regression equations did not include interaction terms between FICO scores and the other explanatory variables included in the analysis, which may affect how the associations are interpreted [8].


[1] Wikipedia “Peer-to-peer lending” Page. URL: Accessed 2/15/2013.

[2] MyFICO “Credit Basics” Page. URL: Accessed 2/16/2013.

[3] The Lending Club. URL: Accessed 2/16/2013.

[4] Project Data Set. Accessed 2/12/2013.

[5] The R Project for Statistical Computing. URL: Accessed 2/16/2013.

[6] Virginia Tech Department of Statistics “Interpreting Multiple Regression: A Short Overview” Page. URL: Accessed 2/15/2013.

[7] Wikipedia “Coefficient of Determination” Page. URL: Accessed 2/16/2013.

[8] Macquarie University PSYSTAT “Testing and Interpreting Interactions in Regression – In a Nutshell” Page. URL: Accessed 2/16/2013.


Buffalo 2012 In Rem Auction Results

A shot of the auction I took 10/29/2012


In September, the list of tax-delinquent properties that the city first published to be auctioned October 29-31, 2012 had over 5,000 addresses. By auction day, that list had shrunk by 40%: over 2,000 property owners paid their taxes, and by the afternoon of October 31st, 3,192 properties were offered up for auction. Roughly one third, or 1,090 of those properties, were bid on and sold to auction attendees. 1,600 were “adjourned,” or returned to property owners, while just over 1,400 were “struck to the city,” that is, they are now city-owned. A breakdown of auctioned properties according to whether they were bought, adjourned, or struck to the city, and whether they were vacant lots or had homes/buildings, is provided in the chart below:

2012 Auction results


Among the 265 lots that were purchased, the average winning bid was $876.

222, or 84%, of those lots were sold for $500, while 29 (11%) were sold for over $1,000. The most expensive lot, at 783 Niagara Street, was purchased on behalf of D’Youville College for $29,000 (91% of its tax-assessed value of $32,000).

Below is a map of all LOTS sold at the 2012 auction, color-coded by the value of the winning bid. When you click on each lot, you will see the exact value of the winning bid along with the lot’s address and its tax-assessed property value.

Lots purchased at the 2012 auction
2012 Lots sold


Among the 826 buildings/homes that were purchased, the average winning bid was $7,071, though 50% of the winning bids were $3,000 or less.

218, or 26%, of those homes were purchased for only $500. The most expensive purchase, at 1740 Hertel Avenue, was $158,000 – nearly double the property’s tax-assessed value ($82,000).

Below is a map of all BUILDINGS sold at the 2012 auction, color-coded by the value of the winning bid. When you click on each building, you will see the exact value of the winning bid along with the lot’s address and its tax-assessed property value.

Buildings purchased at the 2012 auction
2012 buildings sold


The properties below were either adjourned or struck to the city because nobody bid on them. The opening bid was usually $500, though for some properties it was considerably more. For example, Brian and I had our eyes on a vacant lot on Masten Avenue for gardening, but the opening bid was $6,500 (forget THAT). In hindsight, it may have been worth it: purchasing vacant land for gardening from the city the regular way is proving much more arduous than anticipated, mainly because the city doesn’t see gardening as the “best use” of vacant residential land; they expect you to build upon it. I think that’s a little unrealistic given Buffalo’s decline and overabundance of existing housing stock, but anyway, the map is below:

Properties not sold
