An Exploration of the Public Library System in the United States

An Analysis of the Public Library System in Python


Introduction

Some of my earliest memories as a child revolve around going to the public library in my hometown. Born into a family of readers, I remember going every week to pick up a new stack of as many picture books as a library card would allow and tearing through them, sometimes alone, more often not. Sunday afternoons would often find my parents, my brother, and me sitting together in the living room, each with a book in hand.

The impact that this had on me as a child is hard to quantify, but I strongly believe that this weekly (sometimes more frequent) pilgrimage to the library had lasting effects on my education and led to a life-long love of reading.

Fast forward 12 years to my freshman year of college, when I discovered another passion: data science. Typing strange words into a computer to reveal hidden patterns seemed to me not a contradictory hobby to reading and research, but a complementary one. Instead of synthesizing information from a book, I was synthesizing graphs and linear regressions. While the world sees these as opposite ends of some 'right brain/left brain' spectrum, I have always found the methodologies more similar than different.

As I progressed in my data science journey, I often heard one thing repeated over and over again:

"You can't just say you can do data science, you need to be able to prove it."

As I considered this, I tried to find something I was interested in working on. I don't find sports very interesting and, given the number of politics-centered think pieces out there, found that realm overdone. So I kept coming back to the thing that has remained true since I was a child: the importance of public libraries.

"Libraries are a cornerstone of democracy—where information is free and equally available to everyone. People tend to take that for granted, and they don’t realize what is at stake when that is put at risk." —Carla Hayden

This became my newfound goal, the beginnings of which are realized here: making the case for libraries not just with a public-service argument but with the most compelling one I can offer, a data-centric one. The journey that follows focuses on exploring the national library system and what it means to quantify a library's impact.


Data

The dataset I am focusing on here is a 2016 survey of US public libraries conducted by the Institute of Museum and Library Services. The data is extensive, broken into three files, two of which I use in this analysis. The first is broken down by state: 53 entries covering the 50 states, the District of Columbia, American Samoa, and Guam (the outlying areas of the Northern Mariana Islands, Puerto Rico, and the U.S. Virgin Islands did not participate). The second file is even more detailed, with an entry for each main library. Of the 9,234 libraries surveyed, 9,024 responded, a response rate of 97.7 percent.

Those interested in accessing this data can find it here.

As this analysis focuses on the 50 US states, I removed the data for DC, American Samoa, and Guam.
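For anyone reproducing this, here is a minimal sketch of that filtering step, done by postal abbreviation rather than by row position (it assumes DC, American Samoa, and Guam appear in the STABR column as 'DC', 'AS', and 'GU'; the notebook below achieves the same thing by dropping rows by index):

import pandas as pd

# Keep only the 50 states; DC, American Samoa, and Guam are assumed to be
# coded as 'DC', 'AS', and 'GU' in the STABR column.
state_df = pd.read_csv('PLS_FY2016_State_pusum16a.csv')
state_df = state_df[~state_df['STABR'].isin(['DC', 'AS', 'GU'])]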


Exploration

I used a host of libraries throughout the exploration, beginning with:

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  
import folium
import os

Then, a high-level overview of the data:

In [3]:
state_df = pd.read_csv('PLS_FY2016_State_pusum16a.csv')
state_df = state_df.drop([3, 8, 12])  # drop American Samoa (3), DC (8), and Guam (12) by row index

state_df.head()
Out[3]:
STABR POPU_LSA F_POPLSA POPU_UND POPU_ST CENTLIB F_CENLIB BRANLIB F_BRLIB BKMOB ... GPTERMS F_GPTERM PITUSR F_PITUSR WIFISESS STARTDAT ENDDATE INCITSST YR_SUB OBEREG
0 AK 655885 R_16 649966 739828 88 R_16 16 R_16 1 ... 1053 IF16 617259 IF16 705742 01/01/2015 06/30/2016 2 2017 8
1 AL 4849377 R_16 4849377 4849377 223 R_16 72 R_16 16 ... 5062 IF16 3915279 IF16 5223893 10/01/2015 09/30/2016 1 2017 5
2 AR 2923845 R_16 2643928 2915918 54 R_16 181 R_16 3 ... 2800 IF16 1712997 IF16 1078341 01/01/2016 12/31/2016 5 2017 5
4 AZ 10783977 R_16 6835518 6835518 85 R_16 136 R_16 11 ... 7525 IF16 6625990 IF16 8489326 07/01/2015 06/30/2016 4 2017 6
5 CA 39239277 R_16 39239277 39255883 166 R_16 953 R_16 49 ... 22832 IF16 29309523 IF16 18028138 07/01/2015 06/30/2016 6 2017 8

5 rows × 114 columns

There are 114 columns listed in the data, which we will define as we use them.

The first aspect of the data I was curious about was the number of visits to each state's libraries.

In [6]:
state_df.nlargest(5, 'VISITS')
Out[6]:
STABR POPU_LSA F_POPLSA POPU_UND POPU_ST CENTLIB F_CENLIB BRANLIB F_BRLIB BKMOB ... GPTERMS F_GPTERM PITUSR F_PITUSR WIFISESS STARTDAT ENDDATE INCITSST YR_SUB OBEREG
5 CA 39239277 R_16 39239277 39255883 166 R_16 953 R_16 49 ... 22832 IF16 29309523 IF16 18028138 07/01/2015 06/30/2016 6 2017 8
36 NY 19529286 R_16 19378102 19378102 755 R_16 313 R_16 10 ... 19424 R_16 18013306 R_16 15288377 04/01/2015 12/31/2016 36 2017 2
37 OH 11510467 R_16 11510467 11510467 239 R_16 475 R_16 56 ... 13413 R_16 16002416 IF16 18598158 01/01/2016 12/31/2016 39 2017 3
10 FL 20206713 R_16 20106352 20450729 60 R_16 473 R_16 20 ... 16874 IF16 15367156 IF16 14868341 10/01/2015 09/30/2016 12 2017 5
45 TX 25324740 IF16 25266651 27469114 556 R_16 323 R_16 11 ... 20641 IF16 15217894 IF16 16557891 02/01/2015 12/31/2016 48 2017 6

5 rows × 114 columns

In [176]:
state_geo = os.path.join('', 'us-states.json')
m = folium.Map(location=[37, -102], zoom_start=4)

m.choropleth(
 geo_data=state_geo,
 name='choropleth1',
 data=state_df,
 columns=['STABR', 'VISITS'],
 key_on='feature.id',
 fill_color='BuGn',
 fill_opacity=0.7,
 line_opacity=0.2,
 legend_name='Number of Visits'
)
folium.LayerControl().add_to(m)
 
m
Out[176]:

Of course this data is influenced by the population of each of these states, but a few entries stuck out immediately: New York and Ohio rank in the top 5 above Florida and Texas, both of which have larger populations. In order to mitigate this influence, I graphed this in terms of visits per person in the "LSA," or legal service area.
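A quick aside on the mapping code: the Map.choropleth method used above has since been deprecated in newer folium releases in favor of the folium.Choropleth class. A sketch of the same map with the newer API (same arguments; the result should be equivalent):

m = folium.Map(location=[37, -102], zoom_start=4)

# Newer folium API: build a Choropleth layer and add it to the map explicitly.
folium.Choropleth(
    geo_data=state_geo,
    name='choropleth1',
    data=state_df,
    columns=['STABR', 'VISITS'],
    key_on='feature.id',
    fill_color='BuGn',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Number of Visits'
).add_to(m)

folium.LayerControl().add_to(m)
m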

In [7]:
import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)
In [23]:
mapByPop = {"STABR": state_df["STABR"], "Per100": (state_df['VISITS'] / state_df['POPU_LSA']) *100}
mapByPop = pd.DataFrame(data=mapByPop)

mapByPop.nlargest(10, "Per100")
Out[23]:
STABR Per100
37 OH 643.933204
15 ID 616.981725
52 WY 603.437256
21 MA 601.596867
6 CO 590.075381
14 IA 576.166461
16 IL 574.738342
49 WA 546.964902
50 WI 546.929716
18 KS 546.370011
In [24]:
m = folium.Map(location=[37, -102], zoom_start=4)

m.choropleth(
 geo_data=state_geo,
 name='choropleth',
 data=mapByPop,
 columns=['STABR', 'Per100'],
 key_on='feature.id',
 fill_color='BuGn',
 fill_opacity=0.7,
 line_opacity=0.2,
 legend_name='Visits per 100 residents'
)
folium.LayerControl().add_to(m)
 
m
Out[24]:

As you can see, the number of visits varies wildly by state. One potential factor that could influence this is how far people live and work from their closest library. However, even in stereotypically rural states like Wyoming, visitation is strong.

This became the point of interest in my exploration: figuring out what was driving this variability in the number of visits per 100 people. I assumed that most of these visits were being made by a segment of the population that visited frequently. Thus, the first question I had was whether there were simply more registered users of the libraries in these states.

In [46]:
mapByReg = {"STABR": state_df["STABR"], "Per100": (state_df["REGBOR"] / state_df['POPU_LSA']) *100}
mapByReg = pd.DataFrame(data=mapByReg)

mapByReg.nlargest(10, "Per100")
Out[46]:
STABR Per100
37 OH 75.742696
13 HI 69.704031
18 KS 68.518165
31 NE 67.041174
25 MN 66.768929
14 IA 66.739458
6 CO 66.696852
34 NM 66.665474
15 ID 64.525159
52 WY 61.718423
In [149]:
m = folium.Map(location=[37, -102], zoom_start=4)

m.choropleth(
 geo_data=state_geo,
 name='choropleth',
 data=mapByReg,
 columns=['STABR', 'Per100'],
 key_on='feature.id',
 fill_color='BuGn',
 fill_opacity=0.7,
 line_opacity=0.2,
 legend_name='Registered Users per 100 residents'
)
folium.LayerControl().add_to(m)
 
m
Out[149]:
In [177]:
sns.set(style="darkgrid")

g = sns.jointplot(mapByPop["Per100"],mapByReg["Per100"], kind="reg",
                  xlim=(0, 1000), ylim=(0, 100), color="m", height=7)
g.set_axis_labels('Visits per 100 Residents', 'Registered Users per 100 Residents', fontsize=16)
Out[177]:
<seaborn.axisgrid.JointGrid at 0x1c257a3588>

The percentage of registered users varies widely by state, from just under 30% to just over 75%. How well can this explain the differences in the number of visits?

In [50]:
import statsmodels.api as sm
x = mapByReg["Per100"]
y = mapByPop["Per100"]

# Note the difference in argument order
model = sm.OLS(y, x).fit()

# Print out the statistics
model.summary()
Out[50]:
OLS Regression Results
Dep. Variable: Per100 R-squared: 0.949
Model: OLS Adj. R-squared: 0.948
Method: Least Squares F-statistic: 905.2
Date: Fri, 15 Mar 2019 Prob (F-statistic): 2.98e-33
Time: 16:48:52 Log-Likelihood: -302.86
No. Observations: 50 AIC: 607.7
Df Residuals: 49 BIC: 609.6
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Per100 8.0895 0.269 30.086 0.000 7.549 8.630
Omnibus: 0.485 Durbin-Watson: 1.938
Prob(Omnibus): 0.785 Jarque-Bera (JB): 0.635
Skew: -0.163 Prob(JB): 0.728
Kurtosis: 2.555 Cond. No. 1.00


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The R-squared from a least squares regression model, calculated here, is just under 95%. This suggests that about 95% of the variability in the number of visits is associated with the number of registered users.
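One caveat on the fit itself: sm.OLS(y, x) with no added constant regresses through the origin, and statsmodels then reports the uncentered R-squared, which tends to run higher than the usual centered version. A minimal sketch of the same fit with an intercept, purely for comparison (this is not the fit reported above):

import statsmodels.api as sm

# Add an explicit intercept; the centered R-squared this reports is the
# figure most readers expect and is typically lower than the uncentered
# value from the no-constant fit above.
X = sm.add_constant(mapByReg["Per100"])
model_with_intercept = sm.OLS(mapByPop["Per100"], X).fit()
print(model_with_intercept.rsquared)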


Going Deeper

This seems intuitive: the more registered users you have, the more visits you have. But what's driving the number of registered users? One of the first things I thought of was the possibility that some libraries are keeping up with technology better than others (I personally began using my library's services at a higher rate after buying a Kindle; being able to download ebooks from the library eliminated the hassle of going in person for every book).


First, let's find the portion of circulated material that's generally categorized as 'electronic':

In [61]:
d = {'STABR': state_df['STABR'],
     'BKPER': state_df['BKVOL'] / state_df['TOTCIR'],
     'EBOOKPER': state_df['EBOOK'] / state_df['TOTCIR'],
     'AUDIOPER': (state_df['AUDIO_PH'] + state_df['AUDIO_DL']) / state_df['TOTCIR'],
     'VIDEOPER': (state_df['VIDEO_PH'] + state_df['VIDEO_DL']) / state_df['TOTCIR']}

circPercent = pd.DataFrame(data=d)

# Whatever share of total circulation is not covered by the categories above
circPercent['Other'] = 1 - circPercent.sum(axis=1)

circPercent.head()
Out[61]:
STABR BKPER EBOOKPER AUDIOPER VIDEOPER Other
0 AK 0.457393 0.140770 0.113271 0.059675 0.228891
1 AL 0.455853 0.199255 0.066610 0.035556 0.242726
2 AR 0.435515 0.043316 0.031944 0.039202 0.450023
4 AZ 0.172226 0.054170 0.288394 0.024809 0.460400
5 CA 0.300766 0.051426 0.095558 0.027911 0.524339
In [174]:
d = {'STABR': state_df['STABR'], 'KIDCIRCL_PER': state_df['KIDCIRCL']/state_df['TOTCIR'], 'ELMATCIR_PER': state_df['ELMATCIR']/state_df['TOTCIR'], 'PHYSCIR_PER': state_df['PHYSCIR']/state_df['TOTCIR']}

physPercent = pd.DataFrame(data=d)
physPercent.head()
Out[174]:
STABR KIDCIRCL_PER ELMATCIR_PER PHYSCIR_PER
0 AK 0.325355 0.073734 0.970427
1 AL 0.324894 0.112982 0.887018
2 AR 0.311987 0.129488 0.870512
4 AZ 0.316838 0.132425 0.867565
5 CA 0.422199 0.087738 0.912273
In [91]:
chart = sns.barplot(x="STABR", y="ELMATCIR_PER", data=physPercent)
sns.set(rc={'figure.figsize':(15.7,5)})
chart.set_title('Percent of Material that is Electronic')
chart.set_ylabel('Percent Electronic')
chart.set_xlabel("State")
chart
Out[91]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c21273c50>

Now that we can see the percent of total electronic material, what's the specific breakdown of the circulated material?

In [95]:
x = circPercent['STABR']
y1 = circPercent['BKPER']
y2 = circPercent['EBOOKPER']
y3 = circPercent['AUDIOPER']
y4 = circPercent['VIDEOPER']
y5 = circPercent['Other']

# total across categories, used for normalization
snum = y1+y2+y3+y4+y5

# normalization
y1 = y1/snum*100.
y2 = y2/snum*100.
y3 = y3/snum*100.
y4 = y4/snum*100.
y5 = y5/snum*100.
In [103]:
plt.figure(figsize=(4,3))

# stack bars
plt.bar(x, y1, label='Books (%)')
plt.bar(x, y2 ,bottom=y1,label='eBooks (%)')
plt.bar(x, y3 ,bottom=y1+y2,label='Audio Books (%)')
plt.bar(x, y4 ,bottom=y1+y2+y3,label='Videos (%)')
plt.bar(x, y5 ,bottom=y1+y2+y3+y4,label='Unidentified (%)')

plt.ylim(0,100)

plt.legend(bbox_to_anchor=(1.01,0.5), loc='center left')
fig = plt.gcf()
fig.set_size_inches(18.5, 10.5)

This graph shows the rough distribution of media types but suffers from a large "unidentified" category, as many of the libraries surveyed did not provide a breakdown of all items in circulation. We can remove this category to get a better look at the identified material, but the resulting graph must be treated with caution:

In [58]:
x = circPercent['STABR']
y1 = circPercent['BKPER']
y2 = circPercent['EBOOKPER']
y3 = circPercent['AUDIOPER']
y4 = circPercent['VIDEOPER']

snum = y1+y2+y3+y4

# normalization
y1 = y1/snum*100.
y2 = y2/snum*100.
y3 = y3/snum*100.
y4 = y4/snum*100.
In [59]:
plt.figure(figsize=(4,3))

# stack bars
plt.bar(x, y1, label='Books (%)')
plt.bar(x, y2 ,bottom=y1,label='eBooks (%)')
plt.bar(x, y3 ,bottom=y1+y2,label='Audio Books (%)')
plt.bar(x, y4 ,bottom=y1+y2+y3,label='Videos (%)')

plt.ylim(0,100)

plt.legend(bbox_to_anchor=(1.01,0.5), loc='center left')
fig = plt.gcf()
fig.set_size_inches(18.5, 10.5)
plt.show()
In [69]:
d = {'STABR': state_df['STABR'], 'NonBookPercent': (1-(circPercent['BKPER']))*100, 'RegPer100': mapByReg["Per100"]}

RegUserWithBk = pd.DataFrame(data=d)


RegUserWithBk.head()
Out[69]:
STABR NonBookPercent RegPer100
0 AK 54.260731 52.087485
1 AL 54.414746 56.661979
2 AR 56.448470 56.577897
4 AZ 82.777373 29.800379
5 CA 69.923393 57.415750
In [70]:
sns.set(style="darkgrid")

g = sns.jointplot(RegUserWithBk["RegPer100"],RegUserWithBk['NonBookPercent'], kind="reg",
                  xlim=(0, 100), ylim=(0, 100), color="m", height=7)
g.set_axis_labels('Registered Users per 100 Residents', 'Circulation Not Identified as Books (%)', fontsize=16)
Out[70]:
<seaborn.axisgrid.JointGrid at 0x1c1f90d160>

What about the R-squared value?

In [73]:
import statsmodels.api as sm
x = RegUserWithBk["RegPer100"]
y = RegUserWithBk['NonBookPercent']

# Note the difference in argument order
model = sm.OLS(y, x).fit()

# Print out the statistics
model.summary()
Out[73]:
OLS Regression Results
Dep. Variable: NonBookPercent R-squared: 0.928
Model: OLS Adj. R-squared: 0.927
Method: Least Squares F-statistic: 633.9
Date: Fri, 15 Mar 2019 Prob (F-statistic): 1.09e-29
Time: 17:11:48 Log-Likelihood: -211.91
No. Observations: 50 AIC: 425.8
Df Residuals: 49 BIC: 427.7
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
RegPer100 1.0982 0.044 25.178 0.000 1.011 1.186
Omnibus: 1.284 Durbin-Watson: 1.961
Prob(Omnibus): 0.526 Jarque-Bera (JB): 0.540
Skew: -0.020 Prob(JB): 0.764
Kurtosis: 3.507 Cond. No. 1.00


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

0.928. That means 92.8% of the variation in the share of circulation not identified as traditional books is associated with the number of registered users. Again, we need to take this with a grain of salt, as the response rate for this question was not great.
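The same no-intercept caveat applies to this fit. As a quick, assumption-light cross-check, the plain Pearson correlation between the two series can be computed directly (a sketch):

# Pearson correlation between registered users per 100 residents and the
# non-book share of circulation; its square is the centered R-squared of a
# simple regression that includes an intercept.
r = RegUserWithBk["RegPer100"].corr(RegUserWithBk["NonBookPercent"])
print(r, r ** 2)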

Internet Accessibility

The next factor I wanted to consider is internet accessibility, often touted as the 'future' of public libraries. Library computers and WiFi give patrons a way to get online for self-improvement, job hunting, or homework after school.

In [170]:
compUse = {'SessionsPer100': state_df["PITUSR"]/state_df["POPU_LSA"]}  # note: despite the name, this is sessions per resident, not per 100
compUse = pd.DataFrame(data=compUse)
compUse['STABR'] = state_df["STABR"]
compUse.head()
Out[170]:
SessionsPer100 STABR
0 0.941109 AK
1 0.807378 AL
2 0.585871 AR
4 0.614429 AZ
5 0.746944 CA
In [171]:
sns.set(style="darkgrid")
funding_per_per = sns.barplot(x=compUse["STABR"], y=compUse['SessionsPer100'])
sns.set(rc={'figure.figsize':(15.7,5)})
funding_per_per.set_title("Computer Uses")
funding_per_per.set(xlabel='State', ylabel='Public Internet Computer Uses per Resident per Year')
plt.show(funding_per_per)
In [172]:
m = folium.Map(location=[37, -102], zoom_start=4)

m.choropleth(
 geo_data=state_geo,
 name='choropleth',
 data=compUse,
 columns=['STABR', 'SessionsPer100'],
 key_on='feature.id',
 fill_color='BuGn',
 fill_opacity=0.7,
 line_opacity=0.2,
 legend_name='Computer Sessions Per Resident'
)
folium.LayerControl().add_to(m)
 
m
Out[172]:
In [175]:
x = mapByReg["Per100"]
y = compUse['SessionsPer100']

# Note the difference in argument order
model = sm.OLS(y, x).fit()

# Print out the statistics
model.summary()
Out[175]:
OLS Regression Results
Dep. Variable: SessionsPer100 R-squared: 0.941
Model: OLS Adj. R-squared: 0.939
Method: Least Squares F-statistic: 776.3
Date: Fri, 15 Mar 2019 Prob (F-statistic): 1.05e-31
Time: 18:45:23 Log-Likelihood: 3.1755
No. Observations: 50 AIC: -4.351
Df Residuals: 49 BIC: -2.439
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Per100 0.0165 0.001 27.862 0.000 0.015 0.018
Omnibus: 0.066 Durbin-Watson: 2.501
Prob(Omnibus): 0.968 Jarque-Bera (JB): 0.019
Skew: -0.001 Prob(JB): 0.990
Kurtosis: 2.903 Cond. No. 1.00


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

This provides one of the most correlated relationships thus far.

Funding

The last factor I wanted to look at was the one I suspected would have the greatest impact: funding. The data has funding broken out by local, state, and federal sources.

In [105]:
funding = pd.concat([state_df['STABR'], state_df['FEDGVT'], state_df['STGVT'], state_df['LOCGVT'], (state_df['FEDGVT'] + state_df['LOCGVT'] + state_df['STGVT'])], axis=1)
funding.columns = ['STABR', 'FEDGVT', 'STGVT', 'LOCGVT', 'TOTAL']
funding['fund_per_person'] = funding['TOTAL']/ state_df['POPU_LSA']
funding.head()
Out[105]:
STABR FEDGVT STGVT LOCGVT TOTAL fund_per_person
0 AK 898716 1007391 34499883 36405990 55.506667
1 AL 1082131 3817533 92145508 97045172 20.011884
2 AR 9126 5150706 68935417 74095249 25.341716
4 AZ 1129144 1420461 175748761 178298366 16.533637
5 CA 5292521 14674047 1405950231 1425916799 36.339018
In [158]:
sns.set(style="darkgrid")
funding_per_per = sns.barplot(x="STABR", y="fund_per_person", data=funding)
sns.set(rc={'figure.figsize':(15.7,5)})
funding_per_per.set_title("Funding")
funding_per_per.set(xlabel='State', ylabel='Funding Per Person')
plt.show(funding_per_per)
In [178]:
m = folium.Map(location=[37, -102], zoom_start=4)

m.choropleth(
 geo_data=state_geo,
 name='choropleth',
 data=funding,
 columns=['STABR', 'fund_per_person'],
 key_on='feature.id',
 fill_color='BuGn',
 fill_opacity=0.7,
 line_opacity=0.2,
 legend_name='Funding Per Resident'
)
folium.LayerControl().add_to(m)
 
m
Out[178]:
In [181]:
x = funding['STABR']
y1 = funding['FEDGVT']
y2 = funding['STGVT']
y3 = funding['LOCGVT']

snum = y1+y2+y3

# normalization
y1 = y1/snum*100.
y2 = y2/snum*100.
y3 = y3/snum*100.
In [183]:
plt.figure(figsize=(4,3))

# stack bars
plt.bar(x, y3 ,label='Local Funding(%)')
plt.bar(x, y2 ,bottom=y3,label='State Funding(%)')
plt.bar(x, y1 ,bottom=y3+y2,label='Federal Funding(%)')

plt.ylim(0,100)

plt.legend(bbox_to_anchor=(1.01,0.5), loc='center left')
fig = plt.gcf()
fig.set_size_inches(18.5, 10.5)

This graph surprised me at first but made sense after thinking about it. The vast majority of a library's funding comes from local government. The more a city invests in its library, the more funding the library has. This pattern is broken entirely by Hawaii, which doesn't use local funding at all and instead relies on state dollars. Many libraries are also funded for specific initiatives rather than receiving blanket funding.
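To read the Hawaii case off the numbers rather than the bars, the share of each funding source can be computed directly from the funding frame built above (a short sketch):

# Share of each state's total library funding by source, in percent.
fund_shares = funding[['FEDGVT', 'STGVT', 'LOCGVT']].div(funding['TOTAL'], axis=0) * 100
fund_shares['STABR'] = funding['STABR']

# States where state government, rather than local government, is the
# dominant funding source.
fund_shares[fund_shares['STGVT'] > fund_shares['LOCGVT']]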

In [131]:
sns.set(style="darkgrid")

g = sns.jointplot(mapByReg["Per100"], funding['fund_per_person'], kind="reg",
                  xlim=(0, 100), ylim=(0, 100), color="m", height=7)
g.set_axis_labels('Registered Users per 100 Residents', 'Funding per Person', fontsize=16)
Out[131]:
<seaborn.axisgrid.JointGrid at 0x1c247398d0>
In [123]:
x = mapByReg["Per100"]
y = funding['fund_per_person']

# Note the difference in argument order
model = sm.OLS(y, x).fit()

# Print out the statistics
model.summary()
Out[123]:
OLS Regression Results
Dep. Variable: fund_per_person R-squared: 0.887
Model: OLS Adj. R-squared: 0.885
Method: Least Squares F-statistic: 385.1
Date: Fri, 15 Mar 2019 Prob (F-statistic): 7.39e-25
Time: 17:44:28 Log-Likelihood: -199.40
No. Observations: 50 AIC: 400.8
Df Residuals: 49 BIC: 402.7
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Per100 0.6665 0.034 19.624 0.000 0.598 0.735
Omnibus: 1.993 Durbin-Watson: 1.767
Prob(Omnibus): 0.369 Jarque-Bera (JB): 1.819
Skew: 0.453 Prob(JB): 0.403
Kurtosis: 2.771 Cond. No. 1.00


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The value for this regression is 0.887, or 88.7% of the variability explained. That's not as strong as the earlier relationships. This surprised me, as I was expecting funding to be connected with higher registration. Out of curiosity, I plotted funding against the total number of visits per 100 residents.

In [132]:
sns.set(style="darkgrid")

g = sns.jointplot(mapByPop["Per100"], funding['fund_per_person'], kind="reg",
                  xlim=(0, 1000), ylim=(0, 100), color="m", height=7)
g.set_axis_labels('Visits per 100 people', 'Funding per Person', fontsize=16)
Out[132]:
<seaborn.axisgrid.JointGrid at 0x1c24942320>
In [130]:
x = mapByPop["Per100"]
y = funding['fund_per_person']

# Note the difference in argument order
model = sm.OLS(y, x).fit()

# Print out the statistics
model.summary()
Out[130]:
OLS Regression Results
Dep. Variable: fund_per_person R-squared: 0.952
Model: OLS Adj. R-squared: 0.951
Method: Least Squares F-statistic: 977.4
Date: Fri, 15 Mar 2019 Prob (F-statistic): 4.98e-34
Time: 17:48:18 Log-Likelihood: -177.89
No. Observations: 50 AIC: 357.8
Df Residuals: 49 BIC: 359.7
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Per100 0.0831 0.003 31.263 0.000 0.078 0.088
Omnibus: 1.497 Durbin-Watson: 1.983
Prob(Omnibus): 0.473 Jarque-Bera (JB): 1.491
Skew: 0.361 Prob(JB): 0.475
Kurtosis: 2.560 Cond. No. 1.00


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Wow: 0.952. I wonder why funding accounts for so much more of the variability in attendance than in registration.

My best guess is that well-funded libraries attract guests back more frequently but don't change the share of residents who come at all. Interesting.

It is also worth noting that funding for libraries is directly connected to tax revenue raised by municipalities. Thus, wealthier areas almost always have better-funded libraries. This can be problematic, as the populations most in need of libraries often don't have fully funded ones.


The Nitty Gritty

Since so much of the funding for libraries comes locally, this is where I began using the second file, which breaks the information down by individual library. To speed up processing, I pared this file down to just the columns I'm interested in (a sketch of that step follows below). Let's see how much impact funding has per library.
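For reference, here is a sketch of how that pared-down file could be produced from the full library-level file with pandas' usecols. The source filename is assumed to follow the same naming pattern as the state summary file used earlier; adjust it to match the actual download:

# Columns kept in smaller2016.csv (matching the head() output below).
cols = ['STABR', 'FSCSKEY', 'LIBID', 'LIBNAME', 'ADDRESS', 'CITY', 'ZIP',
        'CNTY', 'POPU_LSA', 'BRANLIB', 'LOCGVT', 'STGVT', 'FEDGVT',
        'VISITS', 'BKVOL', 'EBOOK', 'AUDIO_PH', 'TOTCIR', 'TOTCOLL']

# Filename assumed; the IMLS library-level ("administrative entity") file
# for FY2016 appears to follow the same pattern as the state file.
full_df = pd.read_csv('PLS_FY2016_AE_pupld16a.csv', usecols=cols)
full_df.to_csv('smaller2016.csv', index=False)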

In [133]:
indiv_df = pd.read_csv('smaller2016.csv')
indiv_df.head()
Out[133]:
STABR FSCSKEY LIBID LIBNAME ADDRESS CITY ZIP CNTY POPU_LSA BRANLIB LOCGVT STGVT FEDGVT VISITS BKVOL EBOOK AUDIO_PH TOTCIR TOTCOLL
0 AK AK0001 AK0001-002 ANCHOR POINT PUBLIC LIBRARY 34020 NORTH FORK ROAD ANCHOR POINT 99556 KENAI PENINSULA 2043 0 0 13784 0 6679 17453 0 97 13214 13195
1 AK AK0002 AK0002-011 ANCHORAGE PUBLIC LIBRARY 3600 DENALI STREET ANCHORAGE 99503 ANCHORAGE 299037 4 11861783 51592 192990 860751 367652 13694 37639 1831505 2040319
2 AK AK0003 AK0003-002 ANDERSON COMMUNITY LIBRARY 101 FIRST STREET ANDERSON 99744 DENALI 238 0 5000 6650 0 1145 15708 0 68 640 640
3 AK AK0006 AK0006-002 KUSKOKWIM CONSORTIUM LIBRARY 420 CHIEF EDDIE HOFFMAN HIGHWAY BETHEL 99559 BETHEL 6244 0 73600 6650 0 44775 33200 13694 317 11378 12669
4 AK AK0007 AK0007-002 BIG LAKE PUBLIC LIBRARY 3140 SOUTH BIG LAKE ROAD BIG LAKE 99652 MATANUSKA-SUSITNA 10259 0 350080 6650 3475 62283 26311 13696 2020 41094 41011
In [140]:
each_lib = {'State': indiv_df['STABR'], 'City': indiv_df['CITY'], 'Per100': (indiv_df['VISITS']/indiv_df['POPU_LSA'])*100, 'totalFunding': indiv_df['LOCGVT']+indiv_df['STGVT']+indiv_df['FEDGVT']}
each_lib = pd.DataFrame(data=each_lib)
each_lib['fund_per_person'] = each_lib['totalFunding']/indiv_df['POPU_LSA']

each_lib.head()
Out[140]:
State City Per100 totalFunding fund_per_person
0 AK ANCHOR POINT 326.921194 13784 6.746941
1 AK ANCHORAGE 287.840970 12106365 40.484505
2 AK ANDERSON 481.092437 11650 48.949580
3 AK BETHEL 717.088405 80250 12.852338
4 AK BIG LAKE 607.105956 360205 35.111122
In [186]:
sns.set(style="darkgrid")


g = sns.jointplot(each_lib["Per100"], each_lib['fund_per_person'], kind="scatter",
                  xlim=(0, 10000), ylim=(0, 1000), color="m", height=7)
g.set_axis_labels('Visits per 100 people', 'Funding per Person', fontsize=16)
Out[186]:
<seaborn.axisgrid.JointGrid at 0x1c27cdd5c0>
In [142]:
x = each_lib["Per100"]
y = each_lib['fund_per_person']

# Note the difference in argument order
model = sm.OLS(y, x).fit()

# Print out the statistics
model.summary()
Out[142]:
OLS Regression Results
Dep. Variable: fund_per_person R-squared: 0.847
Model: OLS Adj. R-squared: 0.847
Method: Least Squares F-statistic: 5.102e+04
Date: Fri, 15 Mar 2019 Prob (F-statistic): 0.00
Time: 18:02:50 Log-Likelihood: -65477.
No. Observations: 9252 AIC: 1.310e+05
Df Residuals: 9251 BIC: 1.310e+05
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Per100 0.3596 0.002 225.886 0.000 0.356 0.363
Omnibus: 15538.379 Durbin-Watson: 1.265
Prob(Omnibus): 0.000 Jarque-Bera (JB): 312255376.796
Skew: 10.141 Prob(JB): 0.00
Kurtosis: 902.772 Cond. No. 1.00


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Not as good as when we were looking at entire states. Curious about those outliers, I looked up the highest-funded and most-visited individual libraries:

In [143]:
each_lib.nlargest(10, "Per100")
Out[143]:
State City Per100 totalFunding fund_per_person
4024 MI PONTIAC 161110.526316 1328960 69945.263158
8295 TX ROUND TOP 20145.161290 5150 55.376344
78 AK LAKE MINCHUMINA 13818.181818 12865 1169.545455
3377 MA PROVINCETOWN 10417.498314 319004 107.553608
8145 TX GUTHRIE 9164.539007 90556 321.120567
5553 NJ AVALON 7394.902549 2627676 1969.772114
4211 MN GRAND MARAIS 7159.124088 260343 190.031387
3779 ME MONHEGAN 6580.882353 500 7.352941
5721 NY CHAUTAUQUA 6252.888889 34951 31.067556
733 CO SILVERTON 6230.841121 112646 175.461059
In [145]:
each_lib.nlargest(10, "fund_per_person")
Out[145]:
State City Per100 totalFunding fund_per_person
4024 MI PONTIAC 161110.526316 1328960 69945.263158
1801 IL BEDFORD PARK 3788.793103 1571560 2709.586207
6904 OR COOS BAY 156.122449 504623 2574.607143
2055 IL MCCOOK 2013.157895 535504 2348.701754
5553 NJ AVALON 7394.902549 2627676 1969.772114
78 AK LAKE MINCHUMINA 13818.181818 12865 1169.545455
87 AK PEDRO BAY 1784.375000 36599 1143.718750
85 AK KLUKWAN 1471.578947 81339 856.200000
6281 NY AMAGANSETT 3002.124542 944619 692.028571
6320 NY QUOGUE 4661.111111 723646 670.042593

Unless the city of Pontiac, MI really, really cares about its public library, I'm guessing there are some flaws in the data. Without more information I can't decide what to do about these rows, so the results will just need to be taken with a grain of salt.
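One lightweight way to act on that grain of salt is to set aside the most implausible rows before regressing, for example any library reporting more visits per 100 residents than some generous cutoff. The threshold below is arbitrary and purely illustrative:

# Drop rows with implausibly high visits per 100 residents (cutoff is
# arbitrary), then refit the same no-constant model on the trimmed data.
trimmed = each_lib.dropna(subset=['Per100', 'fund_per_person'])
trimmed = trimmed[trimmed['Per100'] < 5000]

model_trimmed = sm.OLS(trimmed['fund_per_person'], trimmed['Per100']).fit()
print(model_trimmed.rsquared)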

Conclusion

So what does all of this mean?

In the analysis above, the factors most strongly connected with library visitation are funding per person and computer usage. Thus, my recommendation for increasing library attendance would be to lobby for higher funding (surprise) and to direct more of that funding toward computer and WiFi access.

Unfortunately, the level of impact libraries have on people's lives is much harder to measure, especially with the data provided. One other major aspect that could affect attendance and engagement is programming for teens and children, which was not explored here (though the sketch below shows where such an analysis might start). We also cannot account for the errors in the dataset.
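For what it's worth, the survey files do include program counts and attendance figures (the documentation lists columns such as TOTPRO, KIDPRO, and KIDATTEN), so a follow-up on children's programming might start with something like this sketch (column names taken from the documentation and worth verifying against the file in hand):

# Children's program attendance per 100 residents, as a starting point for
# the programming question. KIDATTEN is assumed from the survey
# documentation; verify it exists in the file before relying on this.
programs = pd.DataFrame({
    'STABR': state_df['STABR'],
    'KidAttendPer100': state_df['KIDATTEN'] / state_df['POPU_LSA'] * 100,
})
programs.nlargest(10, 'KidAttendPer100')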

In the future I hope to continue this work, in the hope of giving other children the head start that my local library gave me.