Hello!

This project aims to predict the genre of a book given only its title. The following code was all written and designed by Allison Kahn but fit into a larger group project for UPenn's CIS 519 class, Applied Machine Learning. The models produced by all teammates were then combined into an ensemble method.

In [ ]:
import csv
from google.colab import drive

drive.mount('/content/gdrive')
Mounted at /content/gdrive

Reading in Dataset from JSON

The data we are using for our project is UCSD's Book Graph dataset. The main dataset is stored in a JSON with 2.3M entries and 29 fields, comprised of a variety of datatypes from integers and strings to dictionaries and lists. This JSON is too large to load into memory on colab, so we will process through the JSON line by line, appending the relevant data to a csv 10,000 lines at a time.

In [ ]:
import json
import gzip
import csv

fields_to_take = ['isbn', 'average_rating', 'description', 'link', 'authors',
                  'publisher', 'num_pages', 'isbn13', 'publication_year', 'image_url', 
                  'book_id', 'title', 'title_without_series', 'language_code'] #these are the fields that we want to keep
dict_chuck = []

i = 0
first = True
for line in gzip.open("gdrive/MyDrive/CIS 519 Project/Data/Raw Data/goodreads_books.json.gz", 'r'):
  json_line = json.loads(line.decode("utf-8"))
  line_dict = {}
  keep = False
  for field in fields_to_take: #for each field we want, extract the data associated with it and store in dict
    if field == "authors":
      if json_line[field] == []:
        out = ''
      else:
        out = json_line[field][0]['author_id']
    #in the image_url datafield, Good Reads uses a default image when it doesn't have access to a cover; we want to remove these
    elif field == "image_url" and json_line[field] == 'https://s.gr-assets.com/assets/nophoto/book/111x148-bcc042a9c91a29c1d680899eff700a03.png':
      out = '' 
    elif field == "language_code": #if the language is set to a langauage other than english, we aren't using it in our analysis 
      if json_line[field] == "eng" or json_line[field] == "":
        keep = True
        out = json_line[field]
    else:
      out = json_line[field]

    line_dict[field] = out
  if keep: 
    dict_chuck.append(line_dict)

  i+=1

  # append rows every 10000 lines
  if i % 10000 == 0:
    print(i)
    if first:
      with open('gdrive/MyDrive/CIS 519 Project/Data/goodreads_books_cleaned.csv', 'w') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames = fields_to_take)
        writer.writeheader()
        writer.writerows(dict_chuck)
      first = False
      dict_chuck = []
    else:
      with open('gdrive/MyDrive/CIS 519 Project/Data/goodreads_books_cleaned.csv', 'a') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames = line_dict.keys())
        writer.writerows(dict_chuck)
      dict_chuck = []

After we extract all of the data we want, we're left with the following csv with 1,768,097 rows and 14 columns:

In [ ]:
import pandas as pd
df = pd.read_csv('gdrive/MyDrive/CIS 519 Project/Data/goodreads_books_cleaned.csv')
df.head()
Out[ ]:
isbn average_rating description link authors publisher num_pages isbn13 publication_year image_url book_id title title_without_series language_code
0 0312853122 4.00 NaN https://www.goodreads.com/book/show/5333265-w-... 604031.0 St. Martin's Press 256.0 9780312853129 1984.0 https://images.gr-assets.com/books/1310220028m... 5333265 W.C. Fields: A Life on Film W.C. Fields: A Life on Film NaN
1 0743509986 3.23 Anita Diamant's international bestseller "The ... https://www.goodreads.com/book/show/1333909.Go... 626222.0 Simon & Schuster Audio NaN 9780743509985 2001.0 NaN 1333909 Good Harbor Good Harbor NaN
2 NaN 4.03 Omnibus book club edition containing the Ladie... https://www.goodreads.com/book/show/7327624-th... 10333.0 Nelson Doubleday, Inc. 600.0 NaN 1987.0 https://images.gr-assets.com/books/1304100136m... 7327624 The Unschooled Wizard (Sun Wolf and Starhawk, ... The Unschooled Wizard (Sun Wolf and Starhawk, ... eng
3 0743294297 3.49 Addie Downs and Valerie Adler were eight when ... https://www.goodreads.com/book/show/6066819-be... 9212.0 Atria Books 368.0 9780743294294 2009.0 NaN 6066819 Best Friends Forever Best Friends Forever eng
4 0850308712 3.40 NaN https://www.goodreads.com/book/show/287140.Run... 149918.0 NaN NaN 9780850308716 NaN https://images.gr-assets.com/books/1413219371m... 287140 Runic Astrology: Starcraft and Timekeeping in ... Runic Astrology: Starcraft and Timekeeping in ... NaN

Genre Identification Dataset

The next task we have is to include the genres in our data. The UCSD data pulls genre from user tags in the following format:

In [ ]:
import pandas as pd
genre_created = pd.read_json('gdrive/MyDrive/CIS 519 Project/Data/Raw Data/goodreads_book_genres_initial.json.gz',lines=True)
genre_created.head()
Out[ ]:
book_id genres
0 5333265 {'history, historical fiction, biography': 1}
1 1333909 {'fiction': 219, 'history, historical fiction,...
2 7327624 {'fantasy, paranormal': 31, 'fiction': 8, 'mys...
3 6066819 {'fiction': 555, 'romance': 23, 'mystery, thri...
4 287140 {'non-fiction': 3}

We then clean this data and process it into a usable format by sorting the genres by the number of tags and convert them into a columns where the 'first' column has the genre most often tagged, the 'second' column has the genre tagged the second most, etc. We also want to explore different genre conventions, so we add in the 'genre_cleaned' column which uses the second most common tag in place of the first if the first tag is 'fiction'. This is done in an attempt to make the genres more specific. In future parts of the project, models are trained against the true most common genre ("first") as well as this created columns.

In [ ]:
def cleanGenreList(originalList):
  out = sorted(dict(originalList['genres']), key=originalList['genres'].get, reverse=True)
  if len(out) == 0:
    out =  ['','','']
  elif len(out) == 1:
    out.append('')
    out.append('')
  elif len(out) == 2:
    out.append('')
  else:
    out = out[:3]
  return out
In [ ]:
genre_created_cleaned = genre_created.copy()
genre_created_cleaned[['first', 'second', 'third']] = genre_created.apply(lambda x: cleanGenreList(x), axis=1, result_type='expand')
genre_created_cleaned.head()
Out[ ]:
book_id genres first second third
0 5333265 {'history, historical fiction, biography': 1} history, historical fiction, biography
1 1333909 {'fiction': 219, 'history, historical fiction,... fiction history, historical fiction, biography
2 7327624 {'fantasy, paranormal': 31, 'fiction': 8, 'mys... fantasy, paranormal fiction mystery, thriller, crime
3 6066819 {'fiction': 555, 'romance': 23, 'mystery, thri... fiction romance mystery, thriller, crime
4 287140 {'non-fiction': 3} non-fiction
In [ ]:
genre_lookup[['first','book_id']].groupby(by=['first']).count()
Out[ ]:
book_id
first
children 116941
comics, graphic 91185
fantasy, paranormal 221908
fiction 435045
history, historical fiction, biography 177837
mystery, thriller, crime 194964
non-fiction 335632
poetry 43175
romance 276087
young-adult 58368
In [ ]:
genre_created_cleaned.to_csv('gdrive/MyDrive/CIS 519 Project/Data/genre_lookup_cleaned.csv')

#from google.colab import drive
#drive.mount('/content/gdrive')
#import pandas as pd
#genre_lookup = pd.read_csv('gdrive/MyDrive/Grad School/Spring 2022/CIS 519/CIS 519 Project/Data/Raw Data/OLD_genre_lookup_cleaned.csv')

Joined together

Next we need to join these files together

In [ ]:
df_w_genre = df.merge(genre_lookup, how="inner", on="book_id")
df_w_genre = df_w_genre.drop(columns=['Unnamed: 0', 'genres'])
df_w_genre.head()
Out[ ]:
isbn average_rating description link authors publisher num_pages isbn13 publication_year image_url book_id title title_without_series language_code first second third genre_cleaned
0 0312853122 4.00 NaN https://www.goodreads.com/book/show/5333265-w-... 604031.0 St. Martin's Press 256.0 9780312853129 1984.0 https://images.gr-assets.com/books/1310220028m... 5333265 W.C. Fields: A Life on Film W.C. Fields: A Life on Film NaN history, historical fiction, biography NaN NaN history, historical fiction, biography
1 0743509986 3.23 Anita Diamant's international bestseller "The ... https://www.goodreads.com/book/show/1333909.Go... 626222.0 Simon & Schuster Audio NaN 9780743509985 2001.0 NaN 1333909 Good Harbor Good Harbor NaN fiction history, historical fiction, biography NaN history, historical fiction, biography
2 NaN 4.03 Omnibus book club edition containing the Ladie... https://www.goodreads.com/book/show/7327624-th... 10333.0 Nelson Doubleday, Inc. 600.0 NaN 1987.0 https://images.gr-assets.com/books/1304100136m... 7327624 The Unschooled Wizard (Sun Wolf and Starhawk, ... The Unschooled Wizard (Sun Wolf and Starhawk, ... eng fantasy, paranormal fiction mystery, thriller, crime fantasy, paranormal
3 0743294297 3.49 Addie Downs and Valerie Adler were eight when ... https://www.goodreads.com/book/show/6066819-be... 9212.0 Atria Books 368.0 9780743294294 2009.0 NaN 6066819 Best Friends Forever Best Friends Forever eng fiction romance mystery, thriller, crime romance
4 0850308712 3.40 NaN https://www.goodreads.com/book/show/287140.Run... 149918.0 NaN NaN 9780850308716 NaN https://images.gr-assets.com/books/1413219371m... 287140 Runic Astrology: Starcraft and Timekeeping in ... Runic Astrology: Starcraft and Timekeeping in ... NaN non-fiction NaN NaN non-fiction
In [ ]:
df_w_genre.to_csv('gdrive/MyDrive/CIS 519 Project/Data/books_with_genre.csv.gz', compression='gzip')

Cleaning

Closer inspection of the data shows that some rows missing the language tags are in a language other than English. As the scope of our project only extends to English language books, we need to identify and remove them.

Cleaning Language

In [ ]:
#!pip install langdetect
from langdetect import detect
In [ ]:
def detectLang(x):
  if x != x:
    return ''
  elif type(x) == str:
    try:
      return detect(x)
    except:
      print('') #if the description cannot be read by the language detection package, skip it. Most of these are only punctuation or links
  else:
    print(type(x))
In [ ]:
df_w_genre['detected_lang'] = df_w_genre.loc[(df_w_genre['language_code'] != df_w_genre['language_code']) & \
                                             (df_w_genre['description'] == df_w_genre['description']) & \
                                             (~df_w_genre['description'].isin(['<>', '<', '>', '.', ','])), 'description']\
                                             .apply(detectLang)

Now that all the rows in question have a language assigned, what do they look like?

In [ ]:
from collections import Counter

Counter(df_w_genre['detected_lang'])
Out[ ]:
Counter({None: 145,
         'af': 129,
         'ca': 135,
         'cs': 170,
         'cy': 4343,
         'da': 347,
         'de': 2877,
         'en': 754055,
         'es': 7630,
         'et': 284,
         'fi': 424,
         'fr': 4041,
         'hr': 361,
         'hu': 85,
         'id': 1563,
         'it': 1652,
         'lt': 98,
         'lv': 40,
         nan: 984276,
         'nl': 809,
         'no': 189,
         'pl': 353,
         'pt': 1454,
         'ro': 277,
         'sk': 184,
         'sl': 418,
         'so': 303,
         'sq': 108,
         'sv': 431,
         'sw': 66,
         'tl': 156,
         'tr': 537,
         'vi': 157})

Most of the nans in this list were values that were assigned English by the dataset and thus skipped in the last step. The values of None are the values that were exceptions in the language assignment code so we want to remove those.

In [ ]:
df_w_genre = df_w_genre[~df_w_genre['detected_lang'].isin([None])]
df_w_genre = df_w_genre[df_w_genre['first'] == df['first']]

Prepping Data for the Model

In [ ]:
#!pip install langdetect

import csv
from google.colab import drive
import pandas as pd
import numpy as np
from collections import Counter
import joblib
from wordcloud import WordCloud
import matplotlib.pyplot as plt




#drive.mount('/content/gdrive')
Mounted at /content/gdrive
In [ ]:
df = pd.read_csv('gdrive/MyDrive/Grad School/Spring 2022/CIS 519/CIS 519 Project/Data/books_clean.csv.gz', compression='gzip')

Before we start processing the titles, we need to remove some of the punctuation that we don't want to separate words. For example, in the book title "10,000 Leagues Under the Sea", we want the unigram representation to be "10000, leagues, under, the, sea" not "10, 000, leagues, under, the, sea".

We also don't want words to split on apostrophes--for example, "Charlotte's Web" should be "charlottes, web" not "charlotte, s, web".

In [ ]:
df = df[df['title'] == df['title']]
df['title'] = df['title'].str.replace(',','')
df['title'] = df['title'].str.replace('\'','')

Next we can visualize common words in each genre

In [ ]:
df_word_cloud = df[['title', 'first']].drop_duplicates()
for genre in df['first'].unique():
  print(genre,": ")
  text = df_word_cloud[df_word_cloud['first'] == genre]['title'].values 
  wordcloud = WordCloud().generate(str(' '.join(text)))

  plt.imshow(wordcloud)
  plt.axis("off")
  plt.show()
history, historical fiction, biography : 
fiction : 
fantasy, paranormal : 
non-fiction : 
romance : 
mystery, thriller, crime : 
children : 
poetry : 
young-adult : 
comics, graphic : 

Balancing

As seen in the chart below, our data is fairly unbalanced. In order to improve our models, we can experiment with oversampling and undersampling. During the testing phase we found that oversampling worked the best, so the models below are trained with the balanced dataset.

In [ ]:
class_size = dict(Counter(df['first']))
plt.bar(class_size.keys(), class_size.values())
plt.xticks(rotation=90)
Out[ ]:
([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], <a list of 10 Text major ticklabel objects>)
In [ ]:
min_val = round(min(class_size.values()) * 1.5)
undersampling_list = []

for i in df['first'].unique():
  if len(df[df['first'] == i]) < min_val:
    undersampling_list.extend(df[df['first'] == i]['book_id'])
  else:
    undersampling_list.extend(df[df['first'] == i].sample(min_val)['book_id'])

df_under = df[df['book_id'].isin(undersampling_list)]
In [ ]:
max_val = max(class_size.values())
oversampling_list = []

for i in df['first'].unique():
  oversampling_list.extend(df[df['first'] == i].sample(max_val, replace=True)['book_id'])

df_over = pd.DataFrame({'book_id':oversampling_list})
df_over = df_over.merge(df, on='book_id', how='left')

TF-IDF

After balancing our dataset, we need to transform our list of titles into a format that can be more easily used. In this application, we used TF-IDF. When creating this model, we experimented with using different preprocessing steps. In the end, we used the following parameters:

  • Removing english stopwords
  • Splitting the data into both unigrams and bigrams
  • Removing words that don't exist more than 5 times in the dataset
  • Using a logarithmic scale for the term frequency
  • Transforming to lowercase
  • TfidfVectorizer's default tokenizer
  • L2 unit norm
  • IDF-smoothing to prevent divide by zero errors
In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import metrics
In [ ]:
X_train, X_test, y_train, y_test = train_test_split(df_over['title'], df_over['first'], test_size=0.2, random_state=42, stratify=df_over['first'])
In [ ]:
tf_idf = TfidfVectorizer(stop_words='english', ngram_range=(1,2), min_df=5, sublinear_tf=True)

X_train_tf = tf_idf.fit_transform(X_train)
In [ ]:
X_test_tf = tf_idf.transform(X_test)
In [ ]:
print("n_samples: %d, n_features: %d" % X_train_tf.shape)
print("n_samples: %d, n_features: %d" % X_test_tf.shape)
n_samples: 2381088, n_features: 379797
n_samples: 595272, n_features: 379797

With the parameters set, there are ~380k features in our model. We can see what a slice of them look like here:

In [ ]:
tf_idf.get_feature_names_out()[20000:20010]
Out[ ]:
array(['architect unruly', 'architecting', 'architects',
       'architects fear', 'architects sketchbooks', 'architects vol',
       'architectural', 'architectural beginnings',
       'architectural cultural', 'architectural drawing'], dtype=object)
In [ ]:
#joblib.dump(tf_idf, 'gdrive/MyDrive/Grad School/Spring 2022/CIS 519/CIS 519 Project/Models/TFDIF_Model.pkl')

Naive Bayes

The first model we created in a Naive Bayes classifier with alpha=0.1, indentified through grid search. We can then print out the accuracies achieved with this model.

In [ ]:
from sklearn.naive_bayes import MultinomialNB

nb_clf = MultinomialNB(alpha=0.1)
nb_clf.fit(X_train_tf, y_train)
nb_clf.score(X_test_tf, y_test)
pred_bayes = nb_clf.predict(X_test_tf)
In [ ]:
print("Train F1 Score: ", metrics.f1_score(y_train, nb_clf.predict(X_train_tf), average='macro'))
print("Test F1 Score: ", metrics.f1_score(y_test, pred_bayes, average='macro'))

print("\nTrain Top 3 Accuracy: ", metrics.top_k_accuracy_score(y_train, nb_clf.predict_log_proba(X_train_tf), k=3))
print("Test Top 3 Accuracy: ", metrics.top_k_accuracy_score(y_test, nb_clf.predict_log_proba(X_test_tf), k=3))
Train F1 Score:  0.7793222196358716
Test F1 Score:  0.73955259994709

Train Top 3 Accuracy:  0.9328084472308458
Test Top 3 Accuracy:  0.9118403015764222
In [ ]:
print(metrics.classification_report(y_test, pred_bayes))
                                        precision    recall  f1-score   support

                              children       0.68      0.80      0.73     59527
                       comics, graphic       0.89      0.85      0.87     59527
                   fantasy, paranormal       0.73      0.75      0.74     59527
                               fiction       0.60      0.49      0.54     59528
history, historical fiction, biography       0.73      0.71      0.72     59527
              mystery, thriller, crime       0.77      0.73      0.75     59527
                           non-fiction       0.70      0.72      0.71     59528
                                poetry       0.89      0.86      0.87     59527
                               romance       0.66      0.70      0.68     59527
                           young-adult       0.78      0.80      0.79     59527

                              accuracy                           0.74    595272
                             macro avg       0.74      0.74      0.74    595272
                          weighted avg       0.74      0.74      0.74    595272

In [ ]:
#joblib.dump(nb_clf, 'gdrive/MyDrive/Grad School/Spring 2022/CIS 519/CIS 519 Project/Models/Title_NB_Model.pkl')
print("Model Saved")
Model Saved

Logistic Regression

The second model is a Logistic Regression model which achieved a higher accuracy than the naive bayes model. The breakout by genre also reveals very similar strengths and weaknesses.

In [ ]:
from sklearn import linear_model


lr_clf = linear_model.LogisticRegression(solver='sag',max_iter=200,random_state=450)
lr_clf.fit(X_train_tf, y_train)
pred_lr = lr_clf.predict(X_test_tf)
In [ ]:
print("Train F1 Score: ", metrics.f1_score(y_train, lr_clf.predict(X_train_tf), average='macro'))
print("Test F1 Score: ", metrics.f1_score(y_test, pred_lr, average='macro'))

print("\nTrain Top 3 Accuracy: ", metrics.top_k_accuracy_score(y_train, lr_clf.predict_log_proba(X_train_tf), k=3))
print("Test Top 3 Accuracy: ", metrics.top_k_accuracy_score(y_test, lr_clf.predict_log_proba(X_test_tf), k=3))
Train F1 Score:  0.8090957995221911
Test F1 Score:  0.7680985345492684

Train Top 3 Accuracy:  0.9423864216694217
Test Top 3 Accuracy:  0.9206950772084022
In [ ]:
print(metrics.classification_report(y_test, pred_lr))
                                        precision    recall  f1-score   support

                              children       0.77      0.81      0.79     59527
                       comics, graphic       0.92      0.88      0.90     59527
                   fantasy, paranormal       0.75      0.76      0.76     59527
                               fiction       0.57      0.56      0.56     59528
history, historical fiction, biography       0.76      0.73      0.75     59527
              mystery, thriller, crime       0.78      0.76      0.77     59527
                           non-fiction       0.75      0.73      0.74     59528
                                poetry       0.90      0.90      0.90     59527
                               romance       0.68      0.72      0.70     59527
                           young-adult       0.81      0.84      0.82     59527

                              accuracy                           0.77    595272
                             macro avg       0.77      0.77      0.77    595272
                          weighted avg       0.77      0.77      0.77    595272

In [ ]:
#joblib.dump(lr_clf, 'gdrive/MyDrive/Grad School/Spring 2022/CIS 519/CIS 519 Project/Models/Title_LR_Model.pkl')
print("Model Saved")
Model Saved

Exploring Results

Now that we have 2 models created to classify our titles, we can do a little experimentation of the models to see what the outputs look like. First, we can predict against some of my favorite books that exist in the dataset to see how they perform.

In [ ]:
for book in ["speaker for the dead", "the things they carried",  "the anybodies", "the picture of dorian gray"]:
  book_vectorized = tf_idf_model.transform([book])
  probs = NB_model.predict_log_proba(book_vectorized)[0]
  probs_ordered = np.sort(probs)
  print("\nName: ", book)
  print("Best Guess: ", NB_model.predict(book_vectorized)[0])

  for i in range(len(NB_model.classes_)):
    print(NB_model.classes_[i], ": ", probs[i])
Name:  speaker for the dead
Best Guess:  fiction
children :  -7.202452864288361
comics, graphic :  -6.590877028784092
fantasy, paranormal :  -3.0981893286126834
fiction :  -0.11164482977698498
history, historical fiction, biography :  -5.139801570978118
mystery, thriller, crime :  -3.581728209746842
non-fiction :  -4.222365293157111
poetry :  -4.921005049464977
romance :  -6.873400194395522
young-adult :  -6.369728651079626

Name:  the things they carried
Best Guess:  fiction
children :  -4.4024142841590255
comics, graphic :  -6.749376873614814
fantasy, paranormal :  -4.823854054087409
fiction :  -0.07885623549950083
history, historical fiction, biography :  -5.171407225119516
mystery, thriller, crime :  -7.235850262512322
non-fiction :  -5.168640407990345
poetry :  -4.269672742745847
romance :  -4.008362753747207
young-adult :  -4.5919104531874595

Name:  the anybodies
Best Guess:  children
children :  -2.302584253042242
comics, graphic :  -2.302584253042242
fantasy, paranormal :  -2.302584253042242
fiction :  -2.3025884528083163
history, historical fiction, biography :  -2.302584253042242
mystery, thriller, crime :  -2.302584253042242
non-fiction :  -2.3025884528083163
poetry :  -2.302584253042242
romance :  -2.302584253042242
young-adult :  -2.302584253042242

Name:  the picture of dorian gray
Best Guess:  fiction
children :  -5.446529187512873
comics, graphic :  -4.045415165818671
fantasy, paranormal :  -5.800954926745046
fiction :  -0.02742612711042014
history, historical fiction, biography :  -11.113486064018222
mystery, thriller, crime :  -8.352969281001268
non-fiction :  -11.743402140945218
poetry :  -11.88020522218753
romance :  -6.334048479409098
young-adult :  -8.651898742073932
  • The first book I chose is "Speaker for the Dead", a science fiction book by Orson Scott Card. The model correctly identifed the book as fiction with the second highest guess being "fantasy, paranormal". Considering how close sci-fi and fantasy are, I would call that a good guess.

  • The second book is "The Things They Carried" by Tim O'Brien. I would categorize the book as historical fiction and the model identified it as fiction. The model did not perform as well on this title; the correct genre is listed as the 3rd from the bottom in terms of likelihood. Given how "historical fiction" and "fiction" aren't necessarily mutually exclusive, this is an understandable and expected mistake.

  • The third book is "The Anybodies" by NE Bode, a childrens fantasy book (that happened to be my favorite book growing up). This was correctly identifed as a childrens book, although it looks like that might have just been lucky as it was an 8-way tie for first, likely because the word "anybodies" is not a word that appears often in the dataset.

  • The last book is "The Picture of Dorian Gray" by Oscar Wilde which was overwhelming predicted as fiction, the correct genre.



Now we can experiment with some books off Allison's shelf that were not present in the dataset at all:

In [ ]:
for book in ['atlas of the national parks', "the complete idiots guide to socially responsible investing", "Havana Bay"]:
  if len(df[df['title'].str.lower().str.contains(book)]['title'].drop_duplicates()) == 0:
    book_vectorized = tf_idf_model.transform([book])
    probs = NB_model.predict_log_proba(book_vectorized)[0]
    probs_ordered = np.sort(probs)
    print("\nName: ", book)
    print("Best Guess: ", NB_model.predict(book_vectorized)[0])

    for i in range(len(NB_model.classes_)):
      print(NB_model.classes_[i], ": ", probs[i])
  else:
    print("------Book in Dataset------")
Name:  atlas of the national parks
Best Guess:  non-fiction
children :  -2.914691801088388
comics, graphic :  -4.649831277718796
fantasy, paranormal :  -5.469519122192249
fiction :  -4.100779727136285
history, historical fiction, biography :  -1.0643994343524419
mystery, thriller, crime :  -5.337200190229783
non-fiction :  -0.5770570989028023
poetry :  -5.760586053631393
romance :  -7.831249616224213
young-adult :  -7.412184081588485

Name:  the complete idiots guide to socially responsible investing
Best Guess:  non-fiction
children :  -10.409373887388636
comics, graphic :  -10.526771258985189
fantasy, paranormal :  -11.628512989273712
fiction :  -11.517580489145704
history, historical fiction, biography :  -7.591677647584145
mystery, thriller, crime :  -11.876128212138045
non-fiction :  -0.0006070970605449588
poetry :  -13.10100441170977
romance :  -11.316181694106984
young-adult :  -12.150156864145682

Name:  Havana Bay
Best Guess:  mystery, thriller, crime
children :  -5.192015356199537
comics, graphic :  -7.825919022012517
fantasy, paranormal :  -7.097174434751295
fiction :  -3.4270597398856033
history, historical fiction, biography :  -4.238687348065518
mystery, thriller, crime :  -0.08686278733319597
non-fiction :  -4.974634623433914
poetry :  -7.407662704632955
romance :  -4.4983574582832055
young-adult :  -4.523044349467366

All of these books were correctly identified! The first two are non-fiction books of various types and the last is a crime novel by Martin Cruz Smith.



Looking at Bias

While exploring the predictions the model made regarding some books, I began to wonder what assumptions the model had made regarding gendered names. I took the 50 most common baby names of the past century and predicted against them as if they were novel titles: https://www.ssa.gov/oact/babynames/decades/century.html

In [ ]:
womens_names = ["Mary","Patricia","Jennifer","Linda","Elizabeth","Barbara","Susan",
                "Jessica","Sarah","Karen","Nancy","Lisa","Betty","Margaret","Sandra",
                "Ashley","Kimberly","Emily","Donna","Michelle","Dorothy","Carol","Amanda",
                "Melissa","Deborah","Stephanie","Rebecca","Sharon","Laura","Cynthia",
                "Kathleen","Amy","Shirley","Angela","Helen","Anna","Brenda","Pamela",
                "Nicole","Emma","Samantha","Katherine","Christine","Debra","Rachel",
                "Catherine","Carolyn","Janet","Ruth","Maria"]
mens_names = ["James","Robert","John","Michael","William","David","Richard","Joseph",
              "Thomas","Charles","Christopher","Daniel","Matthew","Anthony","Mark",
              "Donald","Steven","Paul","Andrew","Joshua","Kenneth","Kevin","Brian",
              "George","Edward","Ronald","Timothy","Jason","Jeffrey","Ryan","Jacob",
              "Gary","Nicholas","Eric","Jonathan","Stephen","Larry","Justin","Scott",
              "Brandon","Benjamin","Samuel","Gregory","Frank","Alexander","Raymond",
              "Patrick","Jack","Dennis","Jerry"]
In [ ]:
women_names_vectors = tf_idf.transform(womens_names)
women_predicted = nb_clf.predict(women_names_vectors)
Counter(women_predicted)
Out[ ]:
Counter({'children': 3,
         'comics, graphic': 2,
         'fantasy, paranormal': 1,
         'fiction': 5,
         'history, historical fiction, biography': 7,
         'mystery, thriller, crime': 22,
         'poetry': 3,
         'young-adult': 7})
In [ ]:
men_names_vectors = tf_idf.transform(mens_names)
men_predicted = nb_clf.predict(men_names_vectors)
Counter(men_predicted)
Out[ ]:
Counter({'children': 3,
         'comics, graphic': 5,
         'fantasy, paranormal': 1,
         'fiction': 1,
         'history, historical fiction, biography': 21,
         'mystery, thriller, crime': 15,
         'poetry': 3,
         'romance': 1})

The traditionally male names are overwhelmingly more likely to be associated with "history, historical fiction, biography" while the traditionally female names are more likely to be associated with "mystery, thriller, crime". The fact that male names are likely to be associated with historical figures is not suprising, while we found it interesting that female names are associated with mystery novels


Exporting Predictions

In [ ]:
tf_idf_model = joblib.load('gdrive/MyDrive/Grad School/Spring 2022/CIS 519/CIS 519 Project/Models/TFDIF_Model.pkl')
NB_model = joblib.load('gdrive/MyDrive/Grad School/Spring 2022/CIS 519/CIS 519 Project/Models/Title_NB_Model.pkl')
logreg_model = joblib.load('gdrive/MyDrive/Grad School/Spring 2022/CIS 519/CIS 519 Project/Models/Title_LR_Model.pkl')
In [ ]:
df_to_predict = df[['title','book_id','first']].drop_duplicates()
In [ ]:
to_predict = tf_idf_model.transform(df_to_predict['title'])
In [ ]:
nb_pred = NB_model.predict(to_predict)
In [ ]:
logreg_pred = logreg_model.predict(to_predict)
In [ ]:
df_to_predict['NB Predictions'] = nb_pred
df_to_predict['LogReg Predictions'] = logreg_pred
In [ ]:
df_to_predict.to_csv('gdrive/MyDrive/Grad School/Spring 2022/CIS 519/CIS 519 Project/Final_Data/book_title_predictions.csv')
In [ ]:
df[df['description'] == df['description']].shape
Out[ ]:
(1233575, 20)
In [ ]: