Some of my earliest memories as a child revolve around going to the public library in my hometown. Born into a family of readers, I remember going every week to pick up a new stack of as many picture books as a library card would allow and tearing through them, sometimes alone, more often not. Sunday afternoons would often find my parents, my brother, and me sitting together in the living room, each with a book in hand.
The impact this had on me as a child is hard to quantify, but I strongly believe that this weekly (sometimes more frequent) pilgrimage to the library had lasting effects on my education and led to a lifelong love of reading.
Fast forward 12 years. As a freshman in college, I discovered another passion: data science. Typing strange words into a computer to reveal hidden patterns seemed to me not a contradictory hobby to reading and research, but a complementary one. Instead of synthesizing information from a book, I was synthesizing graphs and linear regressions. While the world sees these as opposite ends of some 'right brain/left brain' spectrum, I always found the methodology more similar than different.
As I progressed in my data science journey, I often heard one thing repeated over and over again:
"You can't just say you can do data science, you need to be able to prove it."¶
As I considered this, I tried to find something I was interested in to work on. I don't find sports very interesting and, given the number of politics-centered thinkpieces, found that realm overdone. So I kept coming back to the thing that has remained true since I was a child: the importance of public libraries.
"Libraries are a cornerstone of democracy—where information is free and equally available to everyone. People tend to take that for granted, and they don’t realize what is at stake when that is put at risk." —Carla Hayden¶
This became my newfound goal, the beginnings of which are realized here: making the case for libraries not just with a public-service argument but with the most compelling one I can offer, a data-centric one. The journey that follows focuses on exploring the national library system and what it means to quantify a library's impact.
The dataset I focus on here is a 2016 survey of US public libraries conducted by the Institute of Museum and Library Services. The data is extensive, broken into three files, two of which I use in this analysis. The first is broken down by state: 53 entries covering the 50 states, the District of Columbia, American Samoa, and Guam (the outlying areas of the Northern Mariana Islands, Puerto Rico, and the U.S. Virgin Islands did not participate). The second file is even more detailed, with an entry for each main library. Of the 9,234 libraries surveyed, 9,024 responded, a response rate of 97.7 percent.
Those interested in accessing this data can find it here.
As this analysis focuses on the 50 US states, I removed the data from DC, American Samoa and Guam.
I used a host of libraries throughout the exploration, beginning with:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import folium
import os
Then, a high-level overview of the data:
state_df = pd.read_csv('PLS_FY2016_State_pusum16a.csv')
# Drop the non-state rows (District of Columbia, American Samoa, and Guam)
state_df = state_df.drop([3, 8, 12])
state_df.head()
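Dropping by positional index works here, but it is fragile if the file order ever changes. A more explicit alternative, a minimal sketch assuming the summary file's STABR column uses the standard two-letter postal abbreviations, would be to filter by state code:
# Keep only the 50 states by excluding the non-state postal codes
non_states = ['DC', 'AS', 'GU']
state_df = state_df[~state_df['STABR'].isin(non_states)].reset_index(drop=True)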
There are 114 columns listed in the data, which we will define as we use them.
The first aspect of the data I was curious about was the number of visits at each state's libraries.
state_df.nlargest(5, 'VISITS')
state_geo = os.path.join('', 'us-states.json')
m = folium.Map(location=[37, -102], zoom_start=4)
m.choropleth(
geo_data=state_geo,
name='choropleth1',
data=state_df,
columns=['STABR', 'VISITS'],
key_on='feature.id',
fill_color='BuGn',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Number of Visits'
)
folium.LayerControl().add_to(m)
m
Of course this data is influenced by the population of each state, but a few results stuck out immediately: New York and Ohio rank in the top five, above Florida and Texas, both of which have much larger populations. To mitigate this influence, I graphed visits per 100 residents of each state's "LSA," or legal service area.
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
mapByPop = {"STABR": state_df["STABR"], "Per100": (state_df['VISITS'] / state_df['POPU_LSA']) *100}
mapByPop = pd.DataFrame(data=mapByPop)
mapByPop.nlargest(10, "Per100")
m = folium.Map(location=[37, -102], zoom_start=4)
m.choropleth(
geo_data=state_geo,
name='choropleth',
data=mapByPop,
columns=['STABR', 'Per100'],
key_on='feature.id',
fill_color='BuGn',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Visits per 100 residents'
)
folium.LayerControl().add_to(m)
m
As you can see, the number of visits per 100 residents varies wildly by state. One potential factor influencing this is how far people live and work from their closest library. However, even in more stereotypically rural states like Wyoming, visitation is strong.
This became the point of interest in my exploration: figuring out what was driving this variability in the number of visits per 100 people. I assumed that most of these visits were being done by a section of the population who visited frequently. Thus, the first question I had was whether there were simply more registered users of the libraries in these states.
mapByReg = {"STABR": state_df["STABR"], "Per100": (state_df["REGBOR"] / state_df['POPU_LSA']) *100}
mapByReg = pd.DataFrame(data=mapByReg)
mapByReg.nlargest(10, "Per100")
m = folium.Map(location=[37, -102], zoom_start=4)
m.choropleth(
geo_data=state_geo,
name='choropleth',
data=mapByReg,
columns=['STABR', 'Per100'],
key_on='feature.id',
fill_color='BuGn',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Registered Users per 100 residents'
)
folium.LayerControl().add_to(m)
m
sns.set(style="darkgrid")
g = sns.jointplot(mapByPop["Per100"],mapByReg["Per100"], kind="reg",
xlim=(0, 1000), ylim=(0, 100), color="m", height=7)
g.set_axis_labels('Visits per 100 Residents', 'Registered Users per 100 Residents', fontsize=16)
The percentage of registered users varies widely by state, from just under 30% to just over 75%. How well can this explain the differences in the number of visits?
import statsmodels.api as sm
x = mapByReg["Per100"]
y = mapByPop["Per100"]
# Note the difference in argument order
model = sm.OLS(y, x).fit()
# Print out the statistics
model.summary()
The R-squared from the least squares regression, calculated here, is just under 95%. This suggests that roughly 95% of the variability in the number of visits can be explained by the number of registered users (with the caveat that this model is fit without an intercept, so the reported R-squared is the uncentered version and tends to run high).
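As a sanity check, one could refit the same relationship with an intercept term; sm.OLS does not add one automatically, so sm.add_constant is needed. This is a minimal sketch reusing the mapByReg and mapByPop frames built above, and the centered R-squared it reports will generally be lower than the no-intercept value quoted here:
import statsmodels.api as sm
# Refit with an explicit intercept; add_constant prepends a column of ones
X = sm.add_constant(mapByReg["Per100"])
y = mapByPop["Per100"]
model_with_intercept = sm.OLS(y, X).fit()
model_with_intercept.summary()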
This seems intuitive: the more registered users you have, the more visits you get. But what's driving the number of registered users? One of the first things I thought of was the possibility that some libraries are keeping up with technology better than others. (I personally began using the library's services at a higher rate after buying a Kindle; being able to download ebooks from the library eliminated the hassle of going in person for every book.)
First, let's find the portion of circulated material that's generally categorized as 'electronic':
d = {'STABR': state_df['STABR'], 'BKPER': state_df['BKVOL']/state_df['TOTCIR'], 'EBOOKPER': state_df['EBOOK']/state_df['TOTCIR'], 'AUDIOPER': (state_df['AUDIO_PH']+state_df['AUDIO_DL'])/state_df['TOTCIR'], 'VIDEOPER': (state_df['VIDEO_PH']+state_df['VIDEO_DL'])/state_df['TOTCIR']}
circPercent = pd.DataFrame(data=d)
# Remaining share of circulation not captured by the four media columns
circPercent['Other'] = 1 - circPercent[['BKPER', 'EBOOKPER', 'AUDIOPER', 'VIDEOPER']].sum(axis=1)
circPercent.head()
d = {'STABR': state_df['STABR'], 'KIDCIRCL_PER': state_df['KIDCIRCL']/state_df['TOTCIR'], 'ELMATCIR_PER': state_df['ELMATCIR']/state_df['TOTCIR'], 'PHYSCIR_PER': state_df['PHYSCIR']/state_df['TOTCIR']}
physPercent = pd.DataFrame(data=d)
physPercent.head()
chart = sns.barplot(x="STABR", y="ELMATCIR_PER", data=physPercent)
sns.set(rc={'figure.figsize':(15.7,5)})
chart.set_title('Percent of Material that is Electronic')
chart.set_ylabel('Percent Electronic')
chart.set_xlabel("State")
chart
Now that we can see what share of circulation is electronic, what's the specific breakdown of the circulated material?
x = circPercent['STABR']
y1 = circPercent['BKPER']
y2 = circPercent['EBOOKPER']
y3 = circPercent['AUDIOPER']
y4 = circPercent['VIDEOPER']
y5 = circPercent['Other']
# total of all shares, used to normalize each state's bars to 100%
snum = y1+y2+y3+y4+y5
# normalization
y1 = y1/snum*100.
y2 = y2/snum*100.
y3 = y3/snum*100.
y4 = y4/snum*100.
y5 = y5/snum*100.
plt.figure(figsize=(4,3))
# stack bars
plt.bar(x, y1, label='Books (%)')
plt.bar(x, y2, bottom=y1, label='eBooks (%)')
plt.bar(x, y3, bottom=y1+y2, label='Audio Books (%)')
plt.bar(x, y4, bottom=y1+y2+y3, label='Videos (%)')
plt.bar(x, y5, bottom=y1+y2+y3+y4, label='Unidentified (%)')
plt.ylim(0,100)
plt.legend(bbox_to_anchor=(1.01,0.5), loc='center left')
fig = plt.gcf()
fig.set_size_inches(18.5, 10.5)
This graph shows the rough distribution of media types but suffers from a large "unidentified" category, as many of the libraries surveyed did not provide a breakdown of all items in circulation. We can remove this category to get a better look at the identified material, but the resulting graph must be treated with caution:
x = circPercent['STABR']
y1 = circPercent['BKPER']
y2 = circPercent['EBOOKPER']
y3 = circPercent['AUDIOPER']
y4 = circPercent['VIDEOPER']
snum = y1+y2+y3+y4
# normalization
y1 = y1/snum*100.
y2 = y2/snum*100.
y3 = y3/snum*100.
y4 = y4/snum*100.
plt.figure(figsize=(4,3))
# stack bars
plt.bar(x, y1, label='Books (%)')
plt.bar(x, y2, bottom=y1, label='eBooks (%)')
plt.bar(x, y3, bottom=y1+y2, label='Audio Books (%)')
plt.bar(x, y4, bottom=y1+y2+y3, label='Videos (%)')
plt.ylim(0,100)
plt.legend(bbox_to_anchor=(1.01,0.5), loc='center left')
fig = plt.gcf()
fig.set_size_inches(18.5, 10.5)
plt.show()
d = {'STABR': state_df['STABR'], 'NonBookPercent': (1-(circPercent['BKPER']))*100, 'RegPer100': mapByReg["Per100"]}
RegUserWithBk = pd.DataFrame(data=d)
RegUserWithBk.head()
sns.set(style="darkgrid")
g = sns.jointplot(RegUserWithBk["RegPer100"],RegUserWithBk['NonBookPercent'], kind="reg",
xlim=(0, 100), ylim=(0, 100), color="m", height=7)
g.set_axis_labels('Registered Users per 100 Residents', 'Non-Book Share of Circulation (%)', fontsize=16)
What about the R-squared value?
import statsmodels.api as sm
x = RegUserWithBk["RegPer100"]
y = RegUserWithBk['NonBookPercent']
# Note the difference in argument order
model = sm.OLS(y, x).fit()
# Print out the statistics
model.summary()
0.928. That means 92.8% of the variation in the non-book share of circulation can be explained by the number of registered users per 100 residents. Again, we need to take this with a grain of salt, as the response rate for this question was not great.
The next factor I wanted to consider was internet accessibility, often touted as the 'future' of public libraries. Public internet access gives residents a way to get online for self-improvement, job hunting, or homework after school.
# PITUSR / POPU_LSA gives public internet computer uses per resident (not per 100, despite the column name)
compUse = {'SessionsPer100': state_df["PITUSR"]/state_df["POPU_LSA"]}
compUse = pd.DataFrame(data=compUse)
compUse['STABR'] = state_df["STABR"]
compUse.head()
sns.set(style="darkgrid")
comp_chart = sns.barplot(x=compUse["STABR"], y=compUse['SessionsPer100'])
sns.set(rc={'figure.figsize':(15.7,5)})
comp_chart.set_title("Computer Uses")
comp_chart.set(xlabel='State', ylabel='Public Internet Computer Uses Per Resident Per Year')
plt.show()
m = folium.Map(location=[37, -102], zoom_start=4)
m.choropleth(
geo_data=state_geo,
name='choropleth',
data=compUse,
columns=['STABR', 'SessionsPer100'],
key_on='feature.id',
fill_color='BuGn',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Computer Sessions Per Resident'
)
folium.LayerControl().add_to(m)
m
x = mapByReg["Per100"]
y = compUse['SessionsPer100']
# Note the difference in argument order
model = sm.OLS(y, x).fit()
# Print out the statistics
model.summary()
This yields one of the strongest relationships so far.
The last factor I wanted to look at was the one I suspected would have the greatest impact: funding. The data breaks funding out by local, state, and federal sources.
funding = pd.concat([state_df['STABR'], state_df['FEDGVT'], state_df['STGVT'], state_df['LOCGVT'], (state_df['FEDGVT'] + state_df['LOCGVT'] + state_df['STGVT'])], axis=1)
funding.columns = ['STABR', 'FEDGVT', 'STGVT', 'LOCGVT', 'TOTAL']
funding['fund_per_person'] = funding['TOTAL']/ state_df['POPU_LSA']
funding.head()
sns.set(style="darkgrid")
funding_per_per = sns.barplot(x="STABR", y="fund_per_person", data=funding)
sns.set(rc={'figure.figsize':(15.7,5)})
funding_per_per.set_title("Funding")
funding_per_per.set(xlabel='State', ylabel='Funding Per Person')
plt.show()
m = folium.Map(location=[37, -102], zoom_start=4)
m.choropleth(
geo_data=state_geo,
name='choropleth',
data=funding,
columns=['STABR', 'fund_per_person'],
key_on='feature.id',
fill_color='BuGn',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Funding Per Resident'
)
folium.LayerControl().add_to(m)
m
x = funding['STABR']
y1 = funding['FEDGVT']
y2 = funding['STGVT']
y3 = funding['LOCGVT']
snum = y1+y2+y3
# normalization
y1 = y1/snum*100.
y2 = y2/snum*100.
y3 = y3/snum*100.
plt.figure(figsize=(4,3))
# stack bars
plt.bar(x, y3, label='Local Funding (%)')
plt.bar(x, y2, bottom=y3, label='State Funding (%)')
plt.bar(x, y1, bottom=y3+y2, label='Federal Funding (%)')
plt.ylim(0,100)
plt.legend(bbox_to_anchor=(1.01,0.5), loc='center left')
fig = plt.gcf()
fig.set_size_inches(18.5, 10.5)
This graph surprised me at first but made sense after some thought. The vast majority of a library's funding comes from local government: the more a city invests in its library, the more funding it has. Hawaii breaks this rule entirely, using no local funding at all and relying instead on state dollars. Many libraries are also funded for specific initiatives rather than receiving blanket funding.
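To confirm the Hawaii pattern, one can pull its row out of the funding frame built above; a quick check:
# Hawaii's funding mix: the local share should be essentially zero
funding[funding['STABR'] == 'HI']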
sns.set(style="darkgrid")
g = sns.jointplot(mapByReg["Per100"], funding['fund_per_person'], kind="reg",
xlim=(0, 100), ylim=(0, 100), color="m", height=7)
g.set_axis_labels('Registered Users per 100 Residents', 'Funding per Person', fontsize=16)
x = mapByReg["Per100"]
y = funding['fund_per_person']
# Note the difference in argument order
model = sm.OLS(y, x).fit()
# Print out the statistics
model.summary()
The R-squared for this regression is 0.887, meaning 88.7% of the variability is explained. That's not as strong as the others. This surprised me, as I was expecting funding to be connected with higher registration. Out of curiosity, I plotted funding against the total number of visits per 100 residents.
sns.set(style="darkgrid")
g = sns.jointplot(mapByPop["Per100"], funding['fund_per_person'], kind="reg",
xlim=(0, 1000), ylim=(0, 100), color="m", height=7)
g.set_axis_labels('Visits per 100 people', 'Funding per Person', fontsize=16)
x = mapByPop["Per100"]
y = funding['fund_per_person']
# Note the difference in argument order
model = sm.OLS(y, x).fit()
# Print out the statistics
model.summary()
Wow: 0.952. I wonder why funding accounts for so much more of the variability in attendance than in registration.
My best guess is that well-funded libraries bring guests back more frequently but don't change the share of residents who come at all. Interesting.
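To line these relationships up side by side, one could compute the same no-intercept R-squared for each candidate factor in a loop. This is a minimal sketch reusing the per-state frames built above; note that it standardizes on visits per 100 residents as the response in every case, whereas the individual regressions in this post vary in which variable they treat as the response:
import statsmodels.api as sm
# Candidate predictors, all built earlier in this analysis
factors = {
    'Registered users per 100': mapByReg["Per100"],
    'Computer sessions per resident': compUse['SessionsPer100'],
    'Funding per person': funding['fund_per_person'],
}
visits = mapByPop["Per100"]
for name, series in factors.items():
    r2 = sm.OLS(visits, series).fit().rsquared
    print(f"{name}: R-squared = {r2:.3f}")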
It is also worth noting that library funding is directly tied to the tax revenue raised by municipalities. Wealthier areas therefore almost always have better-funded libraries. This can be problematic, as the populations most in need of libraries often don't have fully funded ones.
Since so much of the funding for libraries comes locally, this is where I began using the second file, which breaks the information down by individual library. In order to increase processing speed, I pared this file down to just the columns I'm interested in (a sketch of that step follows). Let's see how much impact funding has per library.
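The paring step itself isn't shown in the notebook; a minimal sketch might look like the following, where 'PLS_FY2016_Library.csv' is a placeholder name for the full library-level export (use whatever the IMLS download is called on your machine), and the kept columns are exactly the ones used below:
# Keep only the columns used below and write a smaller file for faster loading
cols = ['STABR', 'CITY', 'VISITS', 'POPU_LSA', 'LOCGVT', 'STGVT', 'FEDGVT']
full_df = pd.read_csv('PLS_FY2016_Library.csv', usecols=cols)
full_df.to_csv('smaller2016.csv', index=False)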
indiv_df = pd.read_csv('smaller2016.csv')
indiv_df.head()
each_lib = {'State': indiv_df['STABR'], 'City': indiv_df['CITY'], 'Per100': (indiv_df['VISITS']/indiv_df['POPU_LSA'])*100, 'totalFunding': indiv_df['LOCGVT']+indiv_df['STGVT']+indiv_df['FEDGVT']}
each_lib = pd.DataFrame(data=each_lib)
each_lib['fund_per_person'] = each_lib['totalFunding']/indiv_df['POPU_LSA']
each_lib.head()
sns.set(style="darkgrid")
g = sns.jointplot(each_lib["Per100"], each_lib['fund_per_person'], kind="scatter",
xlim=(0, 10000), ylim=(0, 1000), color="m", height=7)
g.set_axis_labels('Visits per 100 people', 'Funding per Person', fontsize=16)
x = each_lib["Per100"]
y = each_lib['fund_per_person']
# Note the difference in argument order
model = sm.OLS(y, x).fit()
# Print out the statistics
model.summary()
Not as strong as when we were looking at entire states. Curious about those outliers, I pulled the highest-funded and most heavily visited libraries:
each_lib.nlargest(10, "Per100")
each_lib.nlargest(10, "fund_per_person")
Unless the city of Pontiac, MI really cares about its public libraries, I'm guessing there are some flaws in our data. Without more information I can't decide what to do about this, so we will just need to take these results with a grain of salt.
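To look at the suspicious rows directly, one could filter the library-level frame by city; a quick sanity check (assuming city names appear in the CITY column roughly as expected):
# Inspect the outlier rows flagged above
indiv_df[indiv_df['CITY'].str.upper() == 'PONTIAC']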
So what does all of this mean?
In the analysis above we can see that the factors most connected with library visitation are funding per person and computer usage. Thus, my recommendation for increasing library attendance would be to lobby for higher funding (surprise) and to direct more of that funding toward computer and wifi access.
Unfortunately, the level of impact libraries have on people's lives is much harder to measure, especially with the data provided. One other major aspect that could drive attendance and engagement, programming for teens and children, was not explored here. We also cannot account for the errors in our dataset.
In the future I hope to continue this work, in the hope of giving other children the head start that my local library gave me.