A short script that uses generative AI to turn a PyCharm console session into an annotated Jupyter notebook. Most of the time, it works every time.
The purpose of this notebook is to investigate, compare, and combine methodologies in order to determine whether a movie passes the Bechdel Test using its script and metadata gathered from various APIs.
The test, used to measure women's representation in film and other fiction, requires two female characters to have a conversation about something other than a male character in order to 'pass' the test. (Wikipedia)
It's worth noting that while the Bechdel Test can be a good way to survey the media landscape, it certainly doesn't tell the whole story of gender representation, especially on a film-by-film basis. A conversation can include or be about men and also represent real, complex female characters, while a short, throwaway scene having no effect on the plot can be enough to 'pass' the test. This is important because a model is only as good as its data.
In other words, no model trained to classify movie scripts, using Bechdel Test ratings as targets, can provide more insight on gender representation in media than the Bechdel Test itself.
This first set of data comes from the Bechdel Test Movie List, via the API. The API documentation asks that the 'getAllMovies' method be used sparingly, which is why the following code is commented out, but it can be used to recreate the csv file imported below.
Creating a dataset of movie scripts will prove much more difficult than a single API call, and manually labeling data by watching movies (or reading their scripts) would be prohibitively time consuming. Therefore, the best approach to creating a full, labeled dataset for this project is to begin with this labeled Bechdel Test data, and then attempt to find and attach a script for as many records as possible.
from urllib.error import HTTPError
from concurrent.futures import ThreadPoolExecutor, as_completed
import pandas as pd
import numpy as np
import os
import json
import requests
import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns
import warnings
from urllib.request import urlopen
import html2text
import math
import re
import ast
warnings.filterwarnings("ignore")
'''
r = requests.get('http://bechdeltest.com/api/v1/getAllMovies')
d = r.text
d = json.loads(d)
id, imdbid, rating, title, year = [], [], [], [], []
for i in d:
    id.append(i['id'])
    imdbid.append(i['imdbid'])
    rating.append(i['rating'])
    title.append(i['title'])
    year.append(i['year'])
df = pd.DataFrame({'id':id, 'imdbid':imdbid, 'title':title, 'rating':rating, 'year':year})
df.to_csv('BechdelData.csv')
'''
"\nr = requests.get('http://bechdeltest.com/api/v1/getAllMovies')\nd = r.text\nd = json.loads(d)\nid, imdbid, rating, title, year = [], [], [], [], []\nfor i in data:\n id.append(i['id'])\n imdbid.append(i['imdbid'])\n rating.append(i['rating'])\n title.append(i['title'])\n year.append(i['year'])\ndf = pd.DataFrame({'id':id, 'imdbid':imdbid, 'title':title, 'rating':rating, 'year':year}) \ndf.to_csv('BechdelData.csv') \n"
targets = pd.read_csv('BechdelData.csv').drop(columns='Unnamed: 0')
The first movie listed as passing the Bechdel Test is Cendrillon, which is French for Cinderella. The full film is available on Wikipedia; it has no script or words, and I'm not sure the Bechdel Test even really applies. However, it's only a few minutes long, and when it was released in 1899, its special effects and production were considered state-of-the-art.
targets.loc[83:87]
id | imdbid | title | rating | year | |
---|---|---|---|---|---|
83 | 5411 | 224240.0 | Temptation of St. Anthony, The | 0 | 1898 |
84 | 4994 | 246.0 | A Turn of the Century Illusionist | 0 | 1899 |
85 | 5914 | 230.0 | Cinderella | 3 | 1899 |
86 | 1594 | 291476.0 | Sherlock Holmes Baffled | 0 | 1900 |
87 | 4271 | 300.0 | Enchanted Drawing, The | 0 | 1900 |
For each movie or TV show in the dataset, the data contains an ID from the API, an ID from IMDb, the movie's title, the year it was released, and most importantly, the rating. The ratings follow the original comic; per the Bechdel Test Movie List API, they are assigned as follows: 0 means the movie doesn't have two (named) women, 1 means the women don't talk to each other, 2 means they only talk to each other about a man, and 3 means the movie passes the test.
A couple of attributes that are available when querying by IMDb ID or title are left out when querying the entire dataset, notably a boolean 'dubious' column representing whether a movie's rating is, well, dubious. This data could definitely be useful in fine-tuning a model, so it's worth referring back to later with a narrower set of movies to query for more information (rather than making 10,000 requests right off the bat).
r = requests.get('http://bechdeltest.com/api/v1/getMovieByImdbId?imdbid=0000230')
pd.Series(json.loads(r.text))
visible                          1
imdbid                     0000230
rating                           3
dubious                          0
title                   Cinderella
submitterid                  11200
date           2014-11-12 06:46:54
id                            5914
year                          1899
dtype: object
Grouping the data by decade, there appear to be roughly as many movies passing as failing the test going as far back as 1930, with movies that pass the test gaining a slight edge in the mid 1980s. The yellow line represents 1985, the year Alison Bechdel originally published her comic strip.
plot = sns.histplot(data=targets, x='year', hue='rating', multiple='stack', binwidth=10, palette='bright')
plot.axvline(x=1985, color='y', linestyle='--')
Of the 10,286 rows in the data, 57% pass the Bechdel test, with the remaining 43% divided between the other categories. Though the main goal of this project is to create a binary classifier, these more 'granular' ratings may be very useful in determining where a process is going wrong, especially when involving generative AI. Moreover, they can be used as targets for various models as part of a larger ensemble approach.
df = pd.DataFrame(targets.groupby('rating')['id'].count())
df['percentage'] = round((df['id'] / sum(df['id']) * 100), 2)
df.rename(columns={'id':'count'}, inplace=True)
df
rating | count | percentage
---|---|---
0 | 1130 | 10.99
1 | 2223 | 21.61
2 | 1063 | 10.33
3 | 5870 | 57.07
The first dataset of movie scripts comes from Aveek Saha's Movie Script Database. The GitHub repository for that project features a very comprehensive readme file with information on how to recreate the data. It runs into a few errors on my machine; many movie scripts are no longer available at the given links, for example, and of the 3,500+ unprocessed scripts downloaded by the first Python script, only 229 are fully parsed by the end of the process.
The IMDb API on which this program is built has apparently been deprecated, so the movies are only labeled with the TMDb ID (and only in some cases). Those IDs can be found in the metadata, and conveniently, TMDb's API has a 'Find By ID' method that can be used to return movies based on an external ID such as an IMDb ID. Using this method, TMDb IDs can be appended to the target dataset as the first step in linking movie scripts to those labeled targets. This method will also return some more metadata about the movie.
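As a rough sketch of that 'Find By ID' call (a minimal illustration, assuming a TMDb read-access token saved in tmdbauth.txt, the same file used by the metadata code later in this notebook; the function name here is just for illustration):

import json
import requests

def find_movie_by_imdb_id(imdb_id: str) -> list:
    # Query TMDb's 'find' endpoint with an external IMDb ID such as 'tt0000230'
    headers = {
        'accept': 'application/json',
        'Authorization': 'Bearer ' + open('tmdbauth.txt').read()
    }
    url = 'https://api.themoviedb.org/3/find/' + imdb_id + '?external_source=imdb_id'
    response = requests.get(url, headers=headers)
    # 'movie_results' holds any movie records TMDb matched to that ID, including
    # the TMDb id, overview, genre_ids, and other metadata
    return json.loads(response.text).get('movie_results', [])

The get_tmsdb_metadata function further down wraps this same request and reshapes the result into a DataFrame.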
For the next bit of code to work, Saha's Movie Script Database must be in the same folder as this notebook. The code is commented out because its scripts cannot be linked programmatically to the records in the Bechdel dataset. While this makes it a less appealing choice for an exploratory dataset, it may yet prove useful, especially if LLMs can be leveraged to do that linking.
'''
filenames = os.listdir('.\Movie-Script-Database\scripts\parsed\\dialogue')
scripts = []
for f in filenames:
scripts.append(open('.\Movie-Script-Database\scripts\parsed\\dialogue\\' + f).read())
df = pd.DataFrame({'script': scripts, 'filename': filenames})
'''
'''
ls = []
for i in df['filename']:
i = i.split('_')[0]
ls.append(i)
df['filename'] = ls
'''
'''
f = open('.\Movie-Script-Database\scripts\metadata\clean_files_meta.json')
data = json.load(f)
'''
In order to create an exploratory dataset, this script scrapes scripts from IMSDb by matching each film's title to the IMSDb URL format. The first 1,000 records run in 4-5 minutes on my machine, so it should take slightly less than an hour to get through the whole dataset.
This script uses a very broad try-except statement, and keeps only 18 of the first 1,000 records it processes. 200 records will be enough for an exploratory dataset, but if scraping other websites doesn't yield better results, IMSDb can definitely be revisited with a more intensive approach.
'''
titles = []
scripts = []
url = 'https://imsdb.com/scripts/asd.html'
x = urlopen(url)
notFound = x.read()
for i in targets['title']:
i = i.replace(' ', '-')
url = 'https://imsdb.com/scripts/' + i + '.html'
try:
page = urlopen(url)
html_bytes = page.read()
if html_bytes == notFound:
continue
else:
html = html_bytes.decode('utf-8')
scripts.append(html)
titles.append(i)
except (HTTPError, UnicodeDecodeError):
continue
'''
"\ntitles = []\nscripts = []\nurl = 'https://imsdb.com/scripts/asd.html'\nx = urlopen(url)\nnotFound = x.read()\nfor i in targets['title']:\n i = i.replace(' ', '-')\n url = 'https://imsdb.com/scripts/' + i + '.html'\n try: \n page = urlopen(url)\n html_bytes = page.read()\n if html_bytes == notFound:\n continue\n else:\n html = html_bytes.decode('utf-8')\n scripts.append(html)\n titles.append(i)\n \n except (HTTPError, UnicodeDecodeError): \n continue\n"
'''
df = pd.DataFrame({'title': titles, 'script': scripts})
df.to_csv('save.csv')
'''
"\ndf = pd.DataFrame({'title': titles, 'script': scripts})\ndf.to_csv('save.csv')\n"
df = pd.read_csv('save.csv')
The actual scripts are wrapped in <pre> and </pre> tags, so splitting on those returns the raw HTML for the script itself. An IndexError occurs when there is nothing between the tags (these pages usually have no script), and an AttributeError occurs when the script field is missing entirely.
scripts = []
for sc in df['script']:
    try:
        sc = sc.split('<pre>')[1].split('</pre>')[0]
    except (IndexError, AttributeError):
        # IndexError: no <pre> block on the page; AttributeError: the script field is NaN (a float)
        sc = np.nan
    scripts.append(sc)
df['script'] = scripts
df
Unnamed: 0 | title | script | |
---|---|---|---|
0 | 0 | It | \r\n\r\n\r\n<b> ... |
1 | 1 | Frankenstein | |
2 | 2 | Grand-Hotel | \r\n\r\n\r\n<b> ... |
3 | 3 | Scarface | #00766\r\n\r\n\r\n\r\n\r\n "Enjoy your... |
4 | 4 | Wizard-of-Oz,-The | FADE IN -- Title:\r\n\r\nFor nearly forty year... |
... | ... | ... | ... |
536 | 536 | Mulan | Disney's Mulan\r\nCompiled by Barry Adams <bja... |
537 | 537 | Dune | DUNE\r\n\r... |
538 | 538 | Scream | \r\n<b> ... |
539 | 539 | Willow | <html>\r\n<head>\r\n<script>\r\n<b><!--\r\n</b... |
540 | 540 | Little-Mermaid,-The | THE LITTLE MERMAID\r\n<b> -------... |
541 rows × 3 columns
This converts the HTML to Markdown:
x = html2text.html2text(df['script'][10])
And this prints a nice, user-friendly version of the Markdown script:
'''
from IPython.display import display, Markdown
display(Markdown(x))
'''
"\nfrom IPython.display import display, Markdown\ndisplay(Markdown(df['script'][10]))\n"
The code above can be sped up a little bit with multi-threading. This code runs in just over 7 minutes:
# notFound is evaluated once, at definition time, so the 'script not found' page is
# downloaded a single time and compared against every response.
def scrape_imsdb(title, notFound=urlopen('https://imsdb.com/scripts/asd.html').read()):
t = title.replace(' ', '-')
url = 'https://imsdb.com/scripts/' + t + '.html'
try:
page = urlopen(url)
html_bytes = page.read()
if html_bytes == notFound:
tup = (title, np.nan)
else:
html = html_bytes.decode('utf-8')
tup = (title, html)
except (HTTPError, UnicodeDecodeError, IndexError):
tup = (title, np.nan)
finally:
return tup
'''
processes = []
with ThreadPoolExecutor(max_workers=10) as executor:
for title in targets['title']:
processes.append(executor.submit(scrape_imsdb, title))
dict = {}
for task in processes:
dict[task.result()[0]] = task.result()[1]
'''
"\nprocesses = []\nwith ThreadPoolExecutor(max_workers=10) as executor:\n for title in targets['title']:\n processes.append(executor.submit(scrape_imsdb, title))\n\n\ndict = {}\nfor task in processes:\n dict[task.result()[0]] = task.result()[1]\n"
'''
scripts = pd.DataFrame.from_dict(dict, orient='index')
scripts.to_csv('scraped_imsdb_scripts.csv')
'''
scripts = pd.read_csv('scraped_imsdb_scripts.csv', index_col=0)
data = targets.join(scripts, on='title').rename(columns={'0': 'html'})
data['html'].value_counts()
(Output omitted: the value counts print long blocks of raw IMSDb page HTML, which are not readable here.)
Exploring the data, a couple of data quality issues are evident:
data[data['title'] == 'It']
id | imdbid | title | rating | year | html | |
---|---|---|---|---|---|---|
202 | 1227 | 18033.0 | It | 3 | 1927 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... |
2663 | 454 | 99864.0 | It | 1 | 1990 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... |
8852 | 7799 | 1396484.0 | It | 3 | 2017 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... |
Unfortunately, digging further on IMSDb's website leads only to more potential issues:
data.dropna(subset='html', inplace=True)
def scrape_imsdb_metadata(title):
t = title.replace(' ', '%20')
url = 'https://imsdb.com/Movie%20Scripts/' + t + '%20Script.html'
try:
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode('utf-8')
movie_date_pattern = r'(<b>Movie Release Date</b> : )(.*)(<br>)'
try: movie_date = re.search(movie_date_pattern, html)[2]
except TypeError: movie_date = 'not listed'
script_date_pattern = r'(<b>Script Date</b> : )(.*)(<br>)'
try: script_date = re.search(script_date_pattern, html)[2]
except TypeError: script_date = 'not listed'
tup = (title, movie_date, script_date)
except (HTTPError, UnicodeDecodeError, IndexError):
tup = (title, np.nan, np.nan)
finally:
return tup
'''
processes = []
with ThreadPoolExecutor(max_workers=10) as executor:
for title in data['title']:
processes.append(executor.submit(scrape_imsdb_metadata, title))
dict = {}
for task in processes:
dict[task.result()[0]] = (task.result()[1], task.result()[2])
'''
"\nprocesses = []\nwith ThreadPoolExecutor(max_workers=10) as executor:\n for title in data['title']:\n processes.append(executor.submit(scrape_imsdb_metadata, title))\n\n\ndict = {}\nfor task in processes:\n dict[task.result()[0]] = (task.result()[1], task.result()[2])\n"
'''
data = data.join(pd.DataFrame.from_dict(dict, orient='index'), on='title').rename(columns={0:'release_date', 1:'script_date'})
data.to_csv('exploratory_data.csv')
'''
"\ndata = data.join(pd.DataFrame.from_dict(dict, orient='index'), on='title').rename(columns={0:'release_date', 1:'script_date'})\ndata.to_csv('exploratory_data.csv')\n"
data = pd.read_csv('exploratory_data.csv', index_col=0)
data
id | imdbid | title | rating | year | html | release_date | script_date | |
---|---|---|---|---|---|---|---|---|
202 | 1227 | 18033.0 | It | 3 | 1927 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | September 2017 | March 2014 |
252 | 1317 | 21884.0 | Frankenstein | 1 | 1931 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | November 1994 | February 1993 |
276 | 1328 | 22958.0 | Grand Hotel | 3 | 1932 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | April 1932 | not listed |
292 | 6063 | 23427.0 | Scarface | 1 | 1932 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | not listed | not listed |
416 | 174 | 32138.0 | Wizard of Oz, The | 3 | 1939 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | not listed | March 1939 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
9665 | 9265 | 4566758.0 | Mulan | 3 | 2020 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | not listed | December 1998 |
9858 | 10052 | 1160419.0 | Dune | 3 | 2021 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | NaN | NaN |
9971 | 10221 | 11245972.0 | Scream | 3 | 2022 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | not listed | July 1995 |
10101 | 10684 | 10278918.0 | Willow | 3 | 2022 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | not listed | not listed |
10182 | 10926 | 5971474.0 | Little Mermaid, The | 3 | 2023 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | not listed | December 1989 |
541 rows × 8 columns
This next bit of code separates the month and year associated with the movies and scripts.
data['release_year'] = pd.Series()
data['release_month'] = pd.Series()
data['script_year'] = pd.Series()
data['script_month'] = pd.Series()
pattern = r'(.*)(\d\d\d\d)'
for i in data.index:
try:
match = re.search(pattern, data.loc[i].release_date)
data['release_year'][i] = match[2]
data['release_month'][i] = match[1]
except TypeError: pass
try:
match = re.search(pattern, data.loc[i].script_date)
data['script_year'][i] = match[2]
data['script_month'][i] = match[1]
except TypeError: pass
209 scraped pages can be matched to their movies programmatically this way, split almost evenly between passing and failing the Bechdel Test!
data[data['year'] == data['release_year'].fillna(0).astype(int)]
id | imdbid | title | rating | year | html | release_date | script_date | release_year | release_month | script_year | script_month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
276 | 1328 | 22958.0 | Grand Hotel | 3 | 1932 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | April 1932 | not listed | 1932 | April | NaN | NaN |
775 | 4071 | 45793.0 | From Here to Eternity | 3 | 1953 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | September 1953 | August 1952 | 1953 | September | 1952 | August |
865 | 10527 | 47849.0 | Bad Day at Black Rock | 0 | 1955 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | January 1955 | not listed | 1955 | January | NaN | NaN |
1634 | 4610 | 69704.0 | American Graffiti | 2 | 1973 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | August 1973 | not listed | 1973 | August | NaN | NaN |
1639 | 5315 | 70379.0 | Mean Streets | 1 | 1973 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | October 1973 | not listed | 1973 | October | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8791 | 7508 | 5052448.0 | Get Out | 3 | 2017 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | February 2017 | not listed | 2017 | February | NaN | NaN |
8852 | 7799 | 1396484.0 | It | 3 | 2017 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | September 2017 | March 2014 | 2017 | September | 2014 | March |
9082 | 8157 | 6644200.0 | A Quiet Place | 3 | 2018 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | April 2018 | not listed | 2018 | April | NaN | NaN |
9112 | 8368 | 7349662.0 | BlacKkKlansman | 3 | 2018 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | August 2018 | not listed | 2018 | August | NaN | NaN |
9142 | 8452 | 1502407.0 | Halloween | 3 | 2018 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | October 2018 | not listed | 2018 | October | NaN | NaN |
209 rows × 12 columns
data[data['year'] == data['release_year'].fillna(0).astype(int)]['rating'].value_counts()
rating
3    105
1     63
2     24
0     17
Name: count, dtype: int64
def get_tmsdb_metadata(imdb_id):
    # Some Bechdel records have no IMDb ID; skip those rather than failing on int(NaN)
    if pd.isna(imdb_id):
        return pd.DataFrame()
    auth = 'Bearer ' + open('tmdbauth.txt').read()
    headers = {
        'accept' : 'application/json',
        'Authorization' : auth
    }
    url = "https://api.themoviedb.org/3/find/tt" + str(int(imdb_id)).zfill(7) + "?external_source=imdb_id"
    response = requests.get(url, headers=headers)
    try: dict = json.loads(response.text)['movie_results'][0]
    except IndexError: return pd.DataFrame()
    dict['genre_ids'] = [tuple(dict['genre_ids'])]
    dict['imdbid'] = imdb_id
    return pd.DataFrame.from_dict(dict)
'''
processes = []
with ThreadPoolExecutor(max_workers=32) as executor:
for id in data['imdbid']:
processes.append(executor.submit(get_tmsdb_metadata, id))
df = pd.DataFrame()
for task in processes:
df = pd.concat([df, task.result()])
'''
'''df.to_csv('tmdb_data.csv')'''
df = pd.read_csv('tmdb_data.csv', index_col=0, converters={"genre_ids": ast.literal_eval})
df['imdbid'] = df['imdbid'].astype(int)
df.set_index('imdbid', inplace=True)
data['pass_fail'] = pd.Series()
for i in data.index:
if data['rating'][i] == 3:
data['pass_fail'][i] = 'pass'
else:
data['pass_fail'][i] = 'fail'
data = data.join(df, on='imdbid', lsuffix='bechdel', rsuffix='tmdb')
url = "https://api.themoviedb.org/3/genre/movie/list"
auth = 'Bearer ' + open('tmdbauth.txt').read()
headers = {
'accept' : 'application/json',
'Authorization' : auth
}
response = requests.get(url, headers=headers)
genre_dict = {}
for genre in json.loads(response.text)['genres']:
genre_dict[genre['id']] = genre['name']
genre_dict
{28: 'Action', 12: 'Adventure', 16: 'Animation', 35: 'Comedy', 80: 'Crime', 99: 'Documentary', 18: 'Drama', 10751: 'Family', 14: 'Fantasy', 36: 'History', 27: 'Horror', 10402: 'Music', 9648: 'Mystery', 10749: 'Romance', 878: 'Science Fiction', 10770: 'TV Movie', 53: 'Thriller', 10752: 'War', 37: 'Western'}
for i in genre_dict.keys():
data[genre_dict[i]] = pd.Series()
for i in data.index:
try:
for j in data['genre_ids'][i]:
data[genre_dict[j]][i] = 1
except (TypeError, KeyError): continue
data.loc[:,'Action':] = data.loc[:,'Action':].fillna(0)
data.loc[:,'Action':].sum()
data
idbechdel | imdbid | titlebechdel | rating | year | html | release_datebechdel | script_date | release_year | release_month | ... | History | Horror | Music | Mystery | Romance | Science Fiction | TV Movie | Thriller | War | Western | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
202 | 1227 | 18033.0 | It | 3 | 1927 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | September 2017 | March 2014 | 2017 | September | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
252 | 1317 | 21884.0 | Frankenstein | 1 | 1931 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | November 1994 | February 1993 | 1994 | November | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
276 | 1328 | 22958.0 | Grand Hotel | 3 | 1932 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | April 1932 | not listed | 1932 | April | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
292 | 6063 | 23427.0 | Scarface | 1 | 1932 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | not listed | not listed | NaN | NaN | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
416 | 174 | 32138.0 | Wizard of Oz, The | 3 | 1939 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | not listed | March 1939 | NaN | NaN | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9665 | 9265 | 4566758.0 | Mulan | 3 | 2020 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | not listed | December 1998 | NaN | NaN | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
9858 | 10052 | 1160419.0 | Dune | 3 | 2021 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | NaN | NaN | NaN | NaN | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
9971 | 10221 | 11245972.0 | Scream | 3 | 2022 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | not listed | July 1995 | NaN | NaN | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
10101 | 10684 | 10278918.0 | Willow | 3 | 2022 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | not listed | not listed | NaN | NaN | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
10182 | 10926 | 5971474.0 | Little Mermaid, The | 3 | 2023 | <html>\r\n<head>\r\n<!-- Google tag (gtag.js) ... | not listed | December 1989 | NaN | NaN | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
543 rows × 47 columns
Thanks to Alex Belengeanu for this code!
This notebook is getting long and convoluted, and I haven't even gotten to the actual data exploration yet. I've gone ahead and used the above code to scrape TMDb for metadata for the entire Bechdel set, and I'll build a SQL database to start structuring the data before continuing much further. In the meantime, here's a raincloud plot!
def raincloud_plot(data, column: str, x_limit: tuple=None, column_lab: str='none'):
if column_lab == 'none':
column_lab = column
data_x = [data[data['pass_fail'] == 'pass'][column].dropna(), data[data['pass_fail'] == 'fail'][column].dropna()]
fig, ax = plt.subplots(figsize=(14, 7))
# Create a list of colors for the boxplots based on the number of features you have
boxplots_colors = ['yellowgreen', 'olivedrab']
# Boxplot data
bp = ax.boxplot(data_x, patch_artist = True, vert = False)
# Change to the desired color and add transparency
for patch, color in zip(bp['boxes'], boxplots_colors):
patch.set_facecolor(color)
patch.set_alpha(0.4)
# Create a list of colors for the violin plots based on the number of features you have
violin_colors = ['thistle', 'orchid']
# Violinplot data
vp = ax.violinplot(data_x, points=500,
showmeans=False, showextrema=False, showmedians=False, vert=False)
for idx, b in enumerate(vp['bodies']):
# Get the center of the plot
m = np.mean(b.get_paths()[0].vertices[:, 0])
# Modify it so we only see the upper half of the violin plot
b.get_paths()[0].vertices[:, 1] = np.clip(b.get_paths()[0].vertices[:, 1], idx+1, idx+2)
# Change to the desired color
b.set_color(violin_colors[idx])
# Create a list of colors for the scatter plots based on the number of features you have
scatter_colors = ['tomato', 'darksalmon']
# Scatterplot data
for idx, features in enumerate(data_x):
# Add jitter effect so the features do not overlap on the y-axis
y = np.full(len(features), idx + .8)
idxs = np.arange(len(y))
out = y.astype(float)
out.flat[idxs] += np.random.uniform(low=-.09, high=.09, size=len(idxs))
y = out
plt.scatter(features, y, s=.3, c=scatter_colors[idx])
plt.yticks(np.arange(1,3,1), ['Pass', 'Fail']) # Set text labels.
plt.xlabel(column_lab)
plt.title('Distributions of ' + column_lab + ' Among Movies Passing and Failing the Bechdel Test')
plt.xlim(x_limit)
plt.show()
raincloud_plot(data, 'vote_average', column_lab='Average Rating')
import nltk
import statistics
import pandas as pd
import numpy as np
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import psycopg2
import warnings
from sklearn.model_selection import train_test_split
import math
warnings.filterwarnings("ignore")
The methodology for constructing this database can be found in the 'Building a Database' notebook on GitHub.
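Since cur.fetchall() returns plain tuples, the DataFrame built below ends up with integer column labels that are then renamed by position. As a small sketch (assuming the same local bechdel_test database), the cursor's description attribute can be used to print the real column names behind those positions:

import psycopg2

conn = psycopg2.connect(dbname='bechdel_test', user='postgres', password='guest')
cur = conn.cursor()
for table in ['imsdb_scripts', 'bechdel_ratings', 'tmdb_data', 'genre']:
    # LIMIT 0 returns no rows but still populates cur.description with column metadata
    cur.execute('SELECT * FROM ' + table + ' LIMIT 0;')
    print(table, [col.name for col in cur.description])
cur.close()
conn.close()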
conn = psycopg2.connect(dbname='bechdel_test', user='postgres', password='guest')
cur = conn.cursor()
cur.execute('SELECT * FROM imsdb_scripts JOIN bechdel_ratings ON imsdb_scripts.imdb_id = bechdel_ratings.imdb_id JOIN tmdb_data ON tmdb_data.imdb_id = imsdb_scripts.imdb_id;')
data = pd.DataFrame(cur.fetchall())
df = data.copy()
df.set_index(0, inplace=True)
cur.execute('SELECT genre.imdb_id, genre FROM genre JOIN imsdb_scripts ON imsdb_scripts.imdb_id = genre.imdb_id;')
genre = pd.DataFrame(cur.fetchall())
cur.close()
conn.close()
for genre_ in genre[1].unique():
df[genre_] = pd.Series()
for row in genre.iterrows():
df[row[1][1]][row[1][0]] = 1
df.rename(columns={0:'imdb_id',
1:'script_date',
2:'script',
3:'bechdel_id',
5:'title',
6:'release_year',
7:'bechdel_rating',
11:'language',
13:'popularity',
14:'vote_average',
15:'vote_count',
16:'overview'
},
inplace=True)
df.drop(columns=[4, 8, 9, 10, 12], inplace=True)
df.fillna(0, inplace=True)
df.replace('none', np.nan, inplace=True)
This function will clean and tokenize each script, eliminating stop words and punctuation.
def clean_text(text: str) -> list[str]:
    # Lowercase and tokenize, then drop punctuation, stop words, and leftover markup tokens
    tokens = word_tokenize(text.lower())
    drop = set(list(string.punctuation) + stopwords.words('english') + ['...', '--', '\'\'', '``'])
    return [token for token in tokens if token not in drop]
A couple of leftover NaNs remain in the dataset; after dropping those, we can go ahead and run the function on every script.
df = df.dropna(subset='script')
df['clean_text'] = [clean_text(text) for text in df['script']]
This function updates the weights of the naive Bayes classifier for a single row of data.
genres = list(df.columns[11:-1])
def UpdateWeights(row: pd.Series,
weights: dict[str: dict[str, int]],
total_words_per_genre: dict[str: int],
genres: list[str]=genres) -> dict[str: dict[str, int]]:
genre_list = []
for genre in genres:
if row[genre] == 1:
total_words_per_genre[genre] += len(row['clean_text'])
genre_list.append(genre)
for token in row['clean_text']:
if token in weights:
for genre in genre_list:
weights[token][genre] += 1
else:
weights[token] = dict.fromkeys(genres, 0)
for genre in genre_list:
weights[token][genre] = 1
A couple of duplicates remain in the dataset:
x = df.duplicated(subset='script')
df = df.drop(list(x[x==True].index))
X_train, X_test, y_train, y_test = train_test_split(df['clean_text'], df.loc[:,'Drama':'History'], test_size=0.2, random_state=42)
train_df = y_train.join(X_train)
This function initializes the weights variable and updates it for each row in the dataframe.
def NaiveBayes(df: pd.DataFrame) -> dict[str: dict[str, int]]:
total_words_per_genre = dict.fromkeys(genres, 0)
weights = {}
for i in list(df.index):
UpdateWeights(df.loc[i], weights, total_words_per_genre)
for word in weights:
for genre in weights[word]:
weights[word][genre] /= total_words_per_genre[genre]
return weights
weights = NaiveBayes(train_df)
This function replaces each weight with its natural logarithm, or -10,000 if the weight is 0.
def LogWeights(weights: dict[str: dict[str: float]]):
for word in weights.keys():
for genre in weights[word]:
if weights[word][genre] == 0:
weights[word][genre] = -10000
else:
weights[word][genre] = math.log(weights[word][genre])
LogWeights(weights)
These functions compute the feature counts and prediction scores for a script. The n highest-scoring genres will be considered the model's predictions, where n is the number of genres listed for the movie.
def FeatureFunction(tokens: list[str]) -> list[tuple[str, int]]:
return [(token, tokens.count(token)) for token in set(tokens)]
def Score(script: list[str], weights: dict[str: dict[str: float]]=weights, genres: list[str]=genres) -> dict[str: int]:
score = dict.fromkeys(genres, 0)
for word, count in FeatureFunction(script):
for genre in score:
if word in weights: score[genre] += weights[word][genre] * count
return score
For the first entry, index #349903, Crime and Thriller are listed as the movie's genres. The Score function scores those two genres highest by an order of magnitude!
train_df
Drama | Romance | Adventure | Fantasy | Family | Mystery | Crime | Thriller | War | Comedy | Music | Western | Horror | Science Fiction | Action | Animation | History | clean_text | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ||||||||||||||||||
349903 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [ocean, 's, twelve, written, george, nolfi, ro... |
43014 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [sunset, boulevard, charles, brackett, billy, ... |
86510 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [fire, screenplay, clayton, frohman, ron, shel... |
114369 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [seven, andrew, kevin, walker, january, 27,199... |
758758 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [wild, written, sean, penn, based, book, jon, ... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
100405 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [p, r, e, w, n, jonathan, lawton, stephen, met... |
110632 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [natural, born, killers, written, quentin, tar... |
448157 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | [hancock, written, vincent, ngo, vince, gillig... |
1441326 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [martha, marcy, may, marlene, written, sean, d... |
109830 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [forrest, gump, screenplay, eric, roth, based,... |
330 rows × 18 columns
pd.Series(Score(train_df['clean_text'][349903])).sort_values(ascending=False)
Crime             -1.169829e+05
Thriller          -1.187586e+05
Drama             -6.394165e+06
Comedy            -9.610512e+06
Romance           -1.141861e+07
Action            -1.331699e+07
Adventure         -1.500677e+07
Science Fiction   -1.607495e+07
Mystery           -1.653374e+07
Fantasy           -1.806365e+07
Horror            -1.845203e+07
History           -2.054818e+07
Family            -2.386636e+07
Music             -2.577168e+07
Animation         -2.709325e+07
Western           -4.132910e+07
War               -4.240746e+07
dtype: float64
test_df = y_test.join(X_test)
test_df['genres_listed'] = pd.Series()
for i in test_df.index:
test_df['genres_listed'][i] = sum(test_df.loc[i][:'History'])
test_df
Drama | Romance | Adventure | Fantasy | Family | Mystery | Crime | Thriller | War | Comedy | Music | Western | Horror | Science Fiction | Action | Animation | History | clean_text | genres_listed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | |||||||||||||||||||
1126590 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [big, eyes, written, scott, alexander, larry, ... | 1 |
1655420 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [week, marilyn, written, adrian, hodges, 1, ex... | 2 |
1365050 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [beasts, nation, written, cary, joji, fukunaga... | 2 |
1067774 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [monte, carlo, written, ron, bass, based, nove... | 3 |
164052 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | [hollow, man, written, andrew, w., marlowe, re... | 3 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1201167 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [funny, people, written, judd, apatow, april, ... | 2 |
1027718 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [wall, street, money, never, sleeps, written, ... | 2 |
162346 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [ghost, world, daniel, clowes, terry, zwigoff,... | 2 |
824747 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [changeling, true, story, written, j., michael... | 3 |
2473602 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | [get, written, steven, baigelman, jez, butterw... | 2 |
83 rows × 19 columns
We can define accuracy as the number of the model's top n predictions that are correct, divided by n, where n is the number of genres listed for the movie. Informally, this accuracy represents the percentage of correct genres the model is able to identify. For example, a movie listing Crime and Thriller whose two highest-scoring genres are Crime and Drama contributes a score of 1/2.
def PredictionAccuracy(test_df: pd.DataFrame) -> tuple[int, float]:
total_score = 0
for i in test_df.index:
score = 0
num_genres = test_df.loc[i]['genres_listed']
preds = list(pd.Series(Score(test_df.loc[i]['clean_text'])).sort_values(ascending=False).index)
for genre in preds[:num_genres]:
if df.loc[i][genre] == 1:
score += 1
score /= num_genres
total_score += score
return total_score / len(test_df)
PredictionAccuracy(test_df)
0.40843373493975893
Calculating the precision, recall, and F score can give a more complete picture of the model's accuracy.
These scores will all be the same with the previously described mode of making predictions, because it takes into account the correct number of labels to be predicted: each false positive is accompanied by a false negative. This approach can be useful for tuning a larger generative model, which is how I ultimately plan to use this code. In order to calculate the accuracy in a more granular way, however, we can define a prediction threshold, either as a discrete quantity or as a function of the predicted probabilities for the entire set of classes.
def Precision_Recall_F(test_df: pd.DataFrame, threshold_function, thresh_func_args: tuple) -> tuple[float, float, float]:
total_score = 0
true_positives = 0
true_negatives = 0
false_positives = 0
false_negatives = 0
    for i in test_df.index:
        # declared global so that correct_number_of_preds (defined below) can read the value
        global num_genres
        num_genres = test_df.loc[i]['genres_listed']
p = pd.Series(Score(test_df.loc[i]['clean_text'])).sort_values(ascending=False)
preds = threshold_function(p, thresh_func_args)
#print(preds)
for genre in list(test_df.loc[:,'Drama':'History'].columns):
if genre in preds: pred = True #Positive prediction
else: pred = False #Negative prediction
if test_df.loc[i, genre] == 0: obs = False #Negative observed value
else: obs = True #Positive observed Value
match (pred, obs):
case (True, True):
true_positives += 1
case (True, False):
false_positives += 1
case (False, False):
true_negatives += 1
case (False, True):
false_negatives += 1
'''
print('preds: ', [genre for genre in preds])
print('row: ', test_df.loc[i])
print('TP: ', true_positives)
print('TN: ', true_negatives)
print('FP: ', false_positives)
print('FN: ', false_negatives)
print('-------')
'''
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f = 2 * ((precision * recall) / (precision + recall))
return (precision, recall, f)
def thresh_stdev(p, args=(1,)):
'''Threshold is defined at the given Z score for each row's predicted probabilities'''
return list(p[p > statistics.mean(p) + (statistics.stdev(p) * args[0])].index)
def correct_number_of_preds(p, args=(0,)):
'''Model will make the correct number of predictions, plus the given amount of extra predictions'''
global num_genres
return list(p.index)[:num_genres + args[0]]
def thresh_constant(p, args=(-3000000,)):
    '''Threshold is a given constant'''
    return list(p[p > args[0]].index)
def thresh_linear_wrt_mean(p, args=(10,)):
'''Threshold is a given constant multiplied by the mean of each row's predicted probabilities'''
return list(p[p > statistics.mean(p) / args[0]].index)
(precision, recall, f) = Precision_Recall_F(test_df, correct_number_of_preds, (0,))
Ideally, I would have done this type of hyperparameter tuning before making predictions on the test set. As it is, it would be hard to pick an ideal model without overfitting the available data.
However, this data is very limiting to begin with. The model leans heavily in favor of predicting certain categories, predicting drama and thriller significantly more often than the other classes. Looking at the training data, these categories are heavily overrepresented.
for genre in train_df.columns[:17]:
print(genre, len(train_df[train_df[genre] == 1]))
Drama 160 Romance 53 Adventure 59 Fantasy 40 Family 19 Mystery 45 Crime 70 Thriller 118 War 5 Comedy 90 Music 7 Western 4 Horror 46 Science Fiction 61 Action 81 Animation 15 History 14
With more data this model can likely be made more accurate, and with the addition of some fresh validation and testing data, a 'final' model can be tuned. More than likely, this simple 'bag-of-words' classifier will only be useful as part of a larger ensemble or GAN, if at all.
In this notebook is code to create the movie_overview_classification model. The model accepts an overview of a movie and returns a prediction regarding whether the movie will pass the Bechdel Test. It only achieves an accuracy (measured via F1 score) of .77, but it will be implemented as part of a larger ensemble algorithm.
import pandas as pd
import warnings
from sklearn.model_selection import train_test_split
import numpy as np
warnings.filterwarnings("ignore")
import datasets
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, DataCollatorWithPadding
from datasets import load_metric
import BechdelDataImporter as data
df = data.NoScripts()
First, instantiate a tokenizer, data collator, and model:
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
df['overview_tokenized'] = pd.Series()
df['label'] = pd.Series()
df = df.drop_duplicates(subset=['overview']).dropna(subset=['overview'])
for i in df.index:
df['overview_tokenized'][i] = tokenizer(df.loc[i, 'overview'], return_tensors="pt")
if df['bechdel_rating'][i] == 3: df['label'][i] = 1
else: df['label'][i] = 0
Split off a test set:
X_train, X_test, y_train, y_test = train_test_split(df[['overview', 'overview_tokenized']], df['label'], test_size=0.2, random_state=42)
from huggingface_hub import notebook_login
notebook_login()
def compute_metrics(eval_pred):
load_accuracy = load_metric("accuracy")
load_f1 = load_metric("f1")
logits, labels = eval_pred
predictions = np.argmax(logits, axis=-1)
accuracy = load_accuracy.compute(predictions=predictions, references=labels)["accuracy"]
f1 = load_f1.compute(predictions=predictions, references=labels)["f1"]
return {"accuracy": accuracy, "f1": f1}
def processing(X: pd.DataFrame, y: pd.Series) -> datasets.Dataset:
X['input_ids'] = pd.Series()
X['attention_mask'] = pd.Series()
for i in X.index:
X['input_ids'][i], X['attention_mask'][i] = X.loc[i, 'overview_tokenized'].input_ids.tolist()[0], X.loc[i, 'overview_tokenized'].attention_mask.tolist()[0]
return datasets.Dataset.from_pandas(X.join(y).drop(columns=['overview_tokenized']).rename(columns={'overview':'text'}))
train_df, test_df = processing(X_train, y_train), processing(X_test, y_test)
from transformers import TrainingArguments, Trainer
repo_name = "movie_overview_classification"
training_args = TrainingArguments(
output_dir=repo_name,
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=2,
weight_decay=0.01,
save_strategy="epoch",
push_to_hub=True
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_df,
eval_dataset=test_df,
tokenizer=tokenizer,
data_collator=data_collator,
compute_metrics=compute_metrics
)
trainer.train()
TrainOutput(global_step=1010, training_loss=0.5062790998137824, metrics={'train_runtime': 5485.4763, 'train_samples_per_second': 2.944, 'train_steps_per_second': 0.184, 'total_flos': 545236330318872.0, 'train_loss': 0.5062790998137824, 'epoch': 2.0})
trainer.evaluate()
{'eval_loss': 0.5222412347793579, 'eval_accuracy': 0.7439326399207529, 'eval_f1': 0.7701200533570476, 'eval_runtime': 272.8653, 'eval_samples_per_second': 7.399, 'eval_steps_per_second': 0.465, 'epoch': 2.0}
Pushing the model to the Hugging Face Hub:
trainer.push_to_hub()
CommitInfo(commit_url='https://huggingface.co/mocboch/movie_overview_classification/commit/059aaf4a08b21825ae434df60b2c7af682dc5f6f', commit_message='End of training', commit_description='', oid='059aaf4a08b21825ae434df60b2c7af682dc5f6f', pr_url=None, pr_revision=None, pr_num=None)
from transformers import pipeline
final_model = pipeline(model="mocboch/movie_overview_classification")
test_data = pd.DataFrame(test_df)
test_data['preds'] = pd.Series()
for i in test_data.index:
test_data['preds'][i] = final_model(test_df['text'][i])
test_data
text | input_ids | attention_mask | label | __index_level_0__ | preds | |
---|---|---|---|---|---|---|
0 | A young and devoted morning television produce... | [101, 1037, 2402, 1998, 7422, 2851, 2547, 3135... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | 1 | 6240 | [{'label': 'LABEL_1', 'score': 0.8961271643638... |
1 | Don Birnam, a long-time alcoholic, has been so... | [101, 2123, 12170, 12789, 2213, 1010, 1037, 21... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | 0 | 574 | [{'label': 'LABEL_0', 'score': 0.7674303054809... |
2 | One peaceful day on Earth, two remnants of Fri... | [101, 2028, 9379, 2154, 2006, 3011, 1010, 2048... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | 0 | 8304 | [{'label': 'LABEL_0', 'score': 0.6576490402221... |
3 | Dominic Toretto and his crew battle the most s... | [101, 11282, 9538, 9284, 1998, 2010, 3626, 264... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | 1 | 9737 | [{'label': 'LABEL_0', 'score': 0.8074591755867... |
4 | The Martins family are optimistic dreamers, qu... | [101, 1996, 19953, 2155, 2024, 21931, 24726, 2... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | 1 | 10031 | [{'label': 'LABEL_1', 'score': 0.8910151124000... |
... | ... | ... | ... | ... | ... | ... |
2014 | Seven short films - each one focused on the pl... | [101, 2698, 2460, 3152, 1011, 2169, 2028, 4208... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | 1 | 4951 | [{'label': 'LABEL_1', 'score': 0.6277556419372... |
2015 | After an unprecedented series of natural disas... | [101, 2044, 2019, 15741, 2186, 1997, 3019, 186... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | 1 | 8786 | [{'label': 'LABEL_1', 'score': 0.6020154356956... |
2016 | Girl Lost tackles the issue of underage prosti... | [101, 2611, 2439, 10455, 1996, 3277, 1997, 210... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | 1 | 8693 | [{'label': 'LABEL_1', 'score': 0.9668087363243... |
2017 | Loosely based on the true-life tale of Ron Woo... | [101, 11853, 2241, 2006, 1996, 2995, 1011, 216... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | 1 | 7366 | [{'label': 'LABEL_1', 'score': 0.9210842251777... |
2018 | A young black pianist becomes embroiled in the... | [101, 1037, 2402, 2304, 9066, 4150, 7861, 1261... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | 0 | 2045 | [{'label': 'LABEL_1', 'score': 0.9472063183784... |
2019 rows × 6 columns
import psycopg2
import pandas as pd
import numpy as np
import warnings
import google.generativeai as genai
from sklearn.base import BaseEstimator
from sklearn.model_selection import train_test_split
warnings.filterwarnings("ignore")
conn = psycopg2.connect(dbname='bechdel_test', user='postgres', password='guest')
cur = conn.cursor()
cur.execute('SELECT * FROM imsdb_scripts JOIN bechdel_ratings ON imsdb_scripts.imdb_id = bechdel_ratings.imdb_id JOIN tmdb_data ON tmdb_data.imdb_id = imsdb_scripts.imdb_id;')
data = pd.DataFrame(cur.fetchall())
df = data.copy()
df.set_index(0, inplace=True)
cur.execute('SELECT genre.imdb_id, genre FROM genre JOIN imsdb_scripts ON imsdb_scripts.imdb_id = genre.imdb_id;')
genre = pd.DataFrame(cur.fetchall())
cur.close()
conn.close()
for genre_ in genre[1].unique():
df[genre_] = pd.Series()
for row in genre.iterrows():
df[row[1][1]][row[1][0]] = 1
df.rename(columns={0:'imdb_id',
1:'script_date',
2:'script',
3:'bechdel_id',
5:'title',
6:'release_year',
7:'bechdel_rating',
11:'language',
13:'popularity',
14:'vote_average',
15:'vote_count',
16:'overview'
},
inplace=True)
df.drop(columns=[4, 8, 9, 10, 12], inplace=True)
df.fillna(0, inplace=True)
df.replace('none', np.nan, inplace=True)
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['bechdel_rating']), df['bechdel_rating'], test_size=0.234, random_state=42)
def y_transform(y):
y = pd.DataFrame(y)
y['pass_fail'] = y['bechdel_rating'].map({0:0, 1:0, 2:0, 3:1})
return y
y_train, y_test = y_transform(y_train), y_transform(y_test)
X_train
script_date | script | bechdel_id | title | release_year | language | popularity | vote_average | vote_count | overview | ... | Thriller | War | Comedy | Music | Western | Horror | Science Fiction | Action | Animation | History | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | |||||||||||||||||||||
472033 | NaN | \n\n ... | 494 | 9 | 2009 | en | 71.590 | 6.921 | 3407 | When 9 first comes to life, he finds himself i... | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
120780 | March 1998 | "Out of Sight"\r\n\r\n\r\n ... | 2247 | Out of Sight | 1998 | en | 24.781 | 6.682 | 1203 | Meet Jack Foley, a smooth criminal who bends t... | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1706593 | NaN | CHRONICLE\r\n\r\n\r\... | 3037 | Chronicle | 2012 | en | 40.036 | 6.816 | 5119 | Three high school students make an incredible ... | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2911666 | NaN | JOHN WICK\r\n\r\n\... | 5897 | John Wick | 2014 | en | 105.961 | 7.430 | 18679 | Ex-hitman John Wick comes out of retirement to... | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
61722 | March 1967 | \t\t\t\t"THE GRADUATE"\r\n\r\n\r\n\t\t\t\tScre... | 616 | Graduate, The | 1967 | en | 30.980 | 7.700 | 3206 | Benjamin, a recent college graduate very worri... | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
100814 | June 1988 | \n\nS. S. Wilson & Brent Maddock's "Tremors"\n... | 1663 | Tremors | 1990 | en | 77.463 | 6.896 | 3105 | Val McKee and Earl Bassett are in a fight for ... | ... | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
109506 | September 1992 | The CROW\r\n\r\n\tby\r\n\r\n\tDavis Schow\r\n\... | 3820 | Crow, The | 1994 | en | 54.672 | 7.527 | 3786 | Exactly one year after young rock guitarist Er... | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
765443 | NaN | EASTERN PROMISES\r\... | 3069 | Eastern Promises | 2007 | en | 35.119 | 7.362 | 3194 | A Russian teenager living in London dies durin... | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
816462 | October 2009 | NaN | 2636 | Conan the Barbarian | 2011 | en | 35.716 | 5.299 | 1792 | A quest that begins as a personal vendetta for... | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
110148 | NaN | Interview with the Vampire\r\n\r\n\tScreenplay... | 120 | Interview with the Vampire | 1994 | en | 83.427 | 7.387 | 5627 | A vampire relates his epic life story of love,... | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
326 rows × 27 columns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [1000],
'max_depth': [7],
'min_samples_split': [15],
'min_samples_leaf': [4]}
grid = GridSearchCV(RandomForestClassifier(random_state=42),
param_grid=param_grid,
cv=3)
grid.fit(X_train[['release_year', 'popularity', 'vote_average', 'vote_count', 'Drama', 'Romance', 'Adventure', 'Animation', 'Fantasy', 'Science Fiction', 'Family', 'Mystery', 'Crime', 'Thriller', 'War', 'Western', 'Comedy', 'Music', 'Horror', 'Action', 'History']], y_train['pass_fail'])
grid.best_params_
{'max_depth': 7, 'min_samples_leaf': 4, 'min_samples_split': 15, 'n_estimators': 1000}
random_forest_clf = grid.best_estimator_
y_test['pass_fail'] == random_forest_clf.predict(X_test[['release_year', 'popularity', 'vote_average', 'vote_count', 'Drama', 'Romance', 'Adventure', 'Animation', 'Fantasy', 'Science Fiction', 'Family', 'Mystery', 'Crime', 'Thriller', 'War', 'Western', 'Comedy', 'Music', 'Horror', 'Action', 'History']])
0
6644200     True
100477      True
124315     False
78748      False
480687     False
            ...
349903      True
481499     False
905372      True
43014      False
86510       True
Name: pass_fail, Length: 100, dtype: bool
import google.generativeai as genai
apikey = open('apikey.txt').read()
genai.configure(api_key=apikey)
model = 'models/embedding-001'
from google.api_core import retry
from tqdm.auto import tqdm
tqdm.pandas()
def make_embed_text_fn(model):
@retry.Retry(timeout=300.0)
def embed_fn(text: str) -> list[float]:
embedding = genai.embed_content(model=model,
content=text,
task_type='classification')
return embedding['embedding']
return embed_fn
def create_embeddings(model, df):
df['embeddings'] = df['overview'].progress_apply(make_embed_text_fn(model))
return df
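Embedding a few hundred overviews means a few hundred API calls, so it can be worth caching the results locally between runs. A minimal sketch of that idea; the cache file name and the create_embeddings_cached helper are assumptions, not part of the original pipeline:
CACHE_PATH = 'embeddings_cache.csv'  # hypothetical cache file
def create_embeddings_cached(model, df, cache_path=CACHE_PATH):
    if os.path.exists(cache_path):
        cached = pd.read_csv(cache_path, index_col=0)
        # Embeddings round-trip through csv as strings, so parse them back into lists
        df['embeddings'] = cached['embeddings'].apply(ast.literal_eval)
    else:
        df = create_embeddings(model, df)
        df[['embeddings']].to_csv(cache_path)
    return df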
X_train_embedded = create_embeddings(model, X_train)
  0%|          | 0/326 [00:00<?, ?it/s]
X_train_embedded.head()
script_date | script | bechdel_id | title | release_year | language | popularity | vote_average | vote_count | overview | ... | War | Comedy | Music | Western | Horror | Science Fiction | Action | Animation | History | embeddings | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | |||||||||||||||||||||
472033 | NaN | \n\n ... | 494 | 9 | 2009 | en | 71.590 | 6.921 | 3407 | When 9 first comes to life, he finds himself i... | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | [0.009933546, 0.028054273, -0.027433202, 0.011... |
120780 | March 1998 | "Out of Sight"\r\n\r\n\r\n ... | 2247 | Out of Sight | 1998 | en | 24.781 | 6.682 | 1203 | Meet Jack Foley, a smooth criminal who bends t... | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [0.054206412, 0.025492756, 0.036431577, -0.048... |
1706593 | NaN | CHRONICLE\r\n\r\n\r\... | 3037 | Chronicle | 2012 | en | 40.036 | 6.816 | 5119 | Three high school students make an incredible ... | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | [0.0023177029, -0.011539809, -0.0100571215, 0.... |
2911666 | NaN | JOHN WICK\r\n\r\n\... | 5897 | John Wick | 2014 | en | 105.961 | 7.430 | 18679 | Ex-hitman John Wick comes out of retirement to... | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | [0.048869684, 0.029862285, -0.0062307036, -0.0... |
61722 | March 1967 | \t\t\t\t"THE GRADUATE"\r\n\r\n\r\n\t\t\t\tScre... | 616 | Graduate, The | 1967 | en | 30.980 | 7.700 | 3206 | Benjamin, a recent college graduate very worri... | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | [0.04377451, -0.034744043, 0.011599346, -0.005... |
5 rows × 28 columns
from sklearn.neural_network import MLPClassifier, MLPRegressor
mlp_clf = MLPClassifier()
def emb_arr(col=X_train_embedded['embeddings']):
    # Stack the per-row embedding lists into a single (n, 768) array;
    # rows with a missing or malformed embedding are left as zero vectors.
    embeddings = np.zeros((len(col), 768))
    j = 0
    for i in col.index:
        try:
            embeddings[j] = col[i]
        except (TypeError, ValueError):
            pass
        j += 1
    return embeddings
embeddings = emb_arr(X_train_embedded['embeddings'])
mlp = MLPClassifier()
mlp.fit(embeddings, y_train['pass_fail'].reset_index().drop(0, axis=1))
MLPClassifier()
from sklearn.model_selection import cross_val_score, cross_val_predict
mlp = MLPClassifier()
cross_val_score(mlp, embeddings, y_train['pass_fail'].reset_index().drop(0, axis=1), cv=5)
array([0.56060606, 0.70769231, 0.64615385, 0.64615385, 0.61538462])
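A quick summary computed from the printed fold scores above, just to reduce them to a single number:
scores = np.array([0.56060606, 0.70769231, 0.64615385, 0.64615385, 0.61538462])
print(scores.mean(), scores.std())  # roughly 0.64 mean accuracy across the five folds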
param_grid = {
'hidden_layer_sizes':[(32,),(50,),(64,),(32,32,)],
'solver':['lbfgs']
}
grid = GridSearchCV(MLPClassifier(random_state=0),
param_grid=param_grid,
cv=3
)
grid.fit(embeddings, y_train['pass_fail'].reset_index().drop(0, axis=1))
grid.best_params_
{'hidden_layer_sizes': (50,), 'solver': 'lbfgs'}
cross_val_score(grid.best_estimator_, embeddings, y_train['pass_fail'].reset_index().drop(0, axis=1), cv=5)
array([0.51515152, 0.72307692, 0.64615385, 0.63076923, 0.64615385])
X_train_embedded['neural_net_preds'] = cross_val_predict(MLPClassifier(hidden_layer_sizes=(50,), solver='lbfgs'), embeddings, y_train['pass_fail'].reset_index().drop(0, axis=1), cv=6)
cross_val_score(RandomForestClassifier(max_depth=7,
min_samples_leaf=4,
min_samples_split=15,
n_estimators=1000), X_train_embedded[['release_year', 'popularity', 'vote_average', 'vote_count', 'Drama', 'Romance', 'Adventure', 'Animation', 'Fantasy', 'Science Fiction', 'Family', 'Mystery', 'Crime', 'Thriller', 'War', 'Western', 'Comedy', 'Music', 'Horror', 'Action', 'History', 'neural_net_preds']], y_train['pass_fail'])
array([0.54545455, 0.70769231, 0.73846154, 0.58461538, 0.63076923])
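Feeding the MLP's out-of-fold predictions to the forest is a hand-rolled form of stacking. For comparison, here is a sketch of the same idea using scikit-learn's StackingClassifier, under the assumption that the metadata columns and the raw 768-dimensional embeddings are simply concatenated into one feature matrix (which differs from the two-stage setup above):
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

feature_cols = ['release_year', 'popularity', 'vote_average', 'vote_count', 'Drama', 'Romance',
                'Adventure', 'Animation', 'Fantasy', 'Science Fiction', 'Family', 'Mystery', 'Crime',
                'Thriller', 'War', 'Western', 'Comedy', 'Music', 'Horror', 'Action', 'History']
# Concatenate metadata features and embedding vectors into one matrix
X_stack = np.hstack([X_train_embedded[feature_cols].to_numpy(), embeddings])
stack = StackingClassifier(
    estimators=[('mlp', MLPClassifier(hidden_layer_sizes=(50,), solver='lbfgs')),
                ('rf', RandomForestClassifier(max_depth=7, min_samples_leaf=4,
                                              min_samples_split=15, n_estimators=1000))],
    final_estimator=LogisticRegression(),
    cv=5)
print(cross_val_score(stack, X_stack, y_train['pass_fail'], cv=5).mean())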
from sklearn.tree import DecisionTreeClassifier
tree_clf = GridSearchCV(DecisionTreeClassifier(),
param_grid={'max_depth': [1,2,3,4,5],
'min_samples_leaf': [1,2,3,4,5,40,50],
'min_samples_split': [2,3,4,40,50],
'max_features': [1,2,3,12,30]},
cv=10)
tree_clf.fit(X_train_embedded[['release_year', 'popularity', 'vote_average', 'vote_count', 'Drama', 'Romance', 'Adventure', 'Animation', 'Fantasy', 'Science Fiction', 'Family', 'Mystery', 'Crime', 'Thriller', 'War', 'Western', 'Comedy', 'Music', 'Horror', 'Action', 'History', 'neural_net_preds']], y_train['pass_fail'])
tree_clf.best_params_
{'max_depth': 3, 'max_features': 12, 'min_samples_leaf': 3, 'min_samples_split': 50}
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(18,5))
plot_tree(tree_clf.best_estimator_, fontsize=11, feature_names=['release_year', 'popularity', 'vote_average', 'vote_count', 'Drama', 'Romance', 'Adventure', 'Animation', 'Fantasy', 'Science Fiction', 'Family', 'Mystery', 'Crime', 'Thriller', 'War', 'Western', 'Comedy', 'Music', 'Horror', 'Action', 'History', 'neural_net_preds'], ax=ax, rounded=True)
plt.show()
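To see which inputs the selected tree actually relies on, its feature importances can be listed against the same feature names. A small sketch (output omitted):
feature_names = ['release_year', 'popularity', 'vote_average', 'vote_count', 'Drama', 'Romance',
                 'Adventure', 'Animation', 'Fantasy', 'Science Fiction', 'Family', 'Mystery', 'Crime',
                 'Thriller', 'War', 'Western', 'Comedy', 'Music', 'Horror', 'Action', 'History',
                 'neural_net_preds']
importances = pd.Series(tree_clf.best_estimator_.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))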
X_test_embedded = create_embeddings(model, X_test)
  0%|          | 0/100 [00:00<?, ?it/s]
test_embeddings = emb_arr(col=X_test_embedded['embeddings'])
mlp_trained = MLPClassifier(hidden_layer_sizes=(50,),solver='lbfgs')
mlp_trained.fit(embeddings, y_train['pass_fail'].reset_index().drop(0, axis=1))
X_test_embedded['neural_net_preds'] = mlp_trained.predict(test_embeddings)
test_preds = tree_clf.best_estimator_.predict(X_test_embedded[['release_year', 'popularity', 'vote_average', 'vote_count', 'Drama', 'Romance', 'Adventure', 'Animation', 'Fantasy', 'Science Fiction', 'Family', 'Mystery', 'Crime', 'Thriller', 'War', 'Western', 'Comedy', 'Music', 'Horror', 'Action', 'History', 'neural_net_preds']])
X_test_embedded['neural_net_preds'] == y_test['pass_fail']
0
6644200     True
100477     False
124315      True
78748       True
480687     False
            ...
349903      True
481499      True
905372      True
43014       True
86510       True
Length: 100, dtype: bool
test_preds == y_test['pass_fail']
0
6644200     True
100477      True
124315      True
78748      False
480687     False
            ...
349903      True
481499      True
905372      True
43014      False
86510       True
Name: pass_fail, Length: 100, dtype: bool
X_train_embedded['neural_net_preds'] == y_train['pass_fail']
0
472033      True
120780      True
1706593    False
2911666     True
61722      False
            ...
100814      True
109506     False
765443      True
816462      True
110148      True
Length: 326, dtype: bool
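Each boolean comparison above averages to an accuracy, which makes the three setups easier to compare at a glance (a sketch; the numbers aren't reproduced here):
print('MLP alone, test:       ', (X_test_embedded['neural_net_preds'] == y_test['pass_fail']).mean())
print('Tree + MLP preds, test: ', (test_preds == y_test['pass_fail']).mean())
print('MLP out-of-fold, train: ', (X_train_embedded['neural_net_preds'] == y_train['pass_fail']).mean())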
import psycopg2
conn = psycopg2.connect(dbname='bechdel_test', user='postgres', password='guest')
cur = conn.cursor()
cur.execute('SELECT * FROM bechdel_ratings JOIN tmdb_data ON tmdb_data.imdb_id = bechdel_ratings.imdb_id;')
data = pd.DataFrame(cur.fetchall())
df = data.copy()
df.set_index(0, inplace=True)
cur.close()
conn.close()
df
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | |||||||||||||
9804 | 14495706 | La Rosace Magique | 1877 | 0 | 766094 | 14495706 | The Magic Rosette | xx | 1878-05-07 | 2.194 | 5.800 | 19 | Praxinoscope strip of a shifting rosette. Seri... |
9806 | 12592084 | Le singe musicien | 1878 | 0 | 751212 | 12592084 | The Musician Monkey | xx | 1878-05-07 | 2.560 | 5.900 | 25 | A pre-cinematograph colour animation of the mo... |
9832 | 8588366 | L'homme machine | 1885 | 0 | 585297 | 8588366 | L'Homme Machine | xx | 1885-01-01 | 1.149 | 4.629 | 31 | Animated stick drawings representing a man wal... |
9614 | 2075247 | Man Walking Around the Corner | 1887 | 0 | 159897 | 2075247 | Man Walking Around a Corner | xx | 1887-08-18 | 5.529 | 4.900 | 80 | The last remaining production of Le Prince's L... |
9841 | 7754902 | Man Riding Jumping Horse | 1887 | 0 | 1191584 | 7754902 | Man Riding Jumping Horse | en | None | 0.187 | 4.000 | 5 | A man riding a horse jumps over an obstacle. |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
11302 | 21235248 | Ghostbusters: Frozen Empire | 2024 | 3 | 967847 | 21235248 | Ghostbusters: Frozen Empire | en | 2024-03-20 | 603.739 | 6.671 | 873 | When the discovery of an ancient artifact unle... |
11303 | 3359350 | Road House | 2024 | 3 | 359410 | 3359350 | Road House | en | 2024-03-08 | 483.627 | 7.024 | 1810 | Ex-UFC fighter Dalton takes a job as a bouncer... |
11317 | 14539740 | Godzilla x Kong: New Empire | 2024 | 3 | 823464 | 14539740 | Godzilla x Kong: The New Empire | en | 2024-03-27 | 3853.790 | 7.278 | 2211 | Following their explosive showdown, Godzilla a... |
11318 | 19356262 | Drive-Away Dolls | 2024 | 3 | 957304 | 19356262 | Drive-Away Dolls | en | 2024-02-22 | 81.501 | 5.531 | 208 | Jamie, an uninhibited free spirit bemoaning ye... |
11322 | 26658104 | Imaginary | 2024 | 1 | 1125311 | 26658104 | Imaginary | en | 2024-03-06 | 154.811 | 6.210 | 312 | When Jessica moves back into her childhood hom... |
10133 rows × 13 columns
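Because fetchall() returns plain tuples, the DataFrame comes back with integer column labels. The cursor's description attribute can supply the real column names instead; a sketch of the same query with named columns:
conn = psycopg2.connect(dbname='bechdel_test', user='postgres', password='guest')
cur = conn.cursor()
cur.execute('SELECT * FROM bechdel_ratings JOIN tmdb_data ON tmdb_data.imdb_id = bechdel_ratings.imdb_id;')
# cursor.description has one entry per returned column; the first field is the column name
colnames = [desc[0] for desc in cur.description]
df = pd.DataFrame(cur.fetchall(), columns=colnames).set_index(colnames[0])
cur.close()
conn.close()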
model = genai.GenerativeModel('gemini-1.5-flash', safety_settings=[
{
"category": "HARM_CATEGORY_HARASSMENT",
"threshold": "BLOCK_NONE"
},
{
"category": "HARM_CATEGORY_HATE_SPEECH",
"threshold": "BLOCK_NONE"
},
{
"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
"threshold": "BLOCK_NONE"
},
{
"category": "HARM_CATEGORY_DANGEROUS_CONTENT",
"threshold": "BLOCK_NONE"
},
])
import time
timer = time.time()
gemini_flash_guesses = []
for i in X_train.index[:10]:
    chat = model.start_chat()  # not used below; generate_content is called directly on the model
    # Crude rate limiting: leave at least one second between successive API calls
    if (1 - float(time.time() - timer)) > 0:
        time.sleep(1 - float(time.time() - timer))
    timer = time.time()
    try:
        response = model.generate_content('How many female characters does the following script contain?'
                                          'Script:'
                                          ''
                                          '' + X_train['script'][i])
        r_text = response.text
    except Exception as e:
        # Accessing .text raises when the response was blocked; record the prompt feedback instead
        r_text = response.prompt_feedback
    gemini_flash_guesses.append((i, r_text))
gemini_flash_guesses
[(472033, 'This script features **one** named female character: **7**. \n\nWhile the scientist is referred to as "he" and is presumably male, there are no other female characters mentioned in the script. \n'), (120780, 'The script "Out of Sight" contains the following female characters:\n\n* **Loretta:** A bank teller who is robbed by Foley.\n* **Lulu:** Chino\'s "wife" and accomplice in his escape plan. \n* **Adele:** Foley\'s ex-wife.\n* **Karen Sisco:** A federal marshal who becomes involved in Foley\'s escape and later attempts to apprehend him.\n* **Moselle:** Maurice Miller\'s girlfriend.\n* **Midge:** Richard Ripley\'s maid. \n* **Yonelle:** A transsexual who is murdered at Eddie Solomon\'s house. \n* **Regina Mary Bragg:** Buddy\'s sister, a born-again Christian who calls the FBI to report Buddy and Foley.\n* **Celeste:** A waitress at the Westin Hotel in Detroit.\n\nIt\'s important to note that some of these characters are only briefly mentioned and do not have any lines in the script. \n'), (1706593, 'The script "Chronicle" features **two** female characters:\n\n1. **Sandra Detmer:** Andrew\'s mother, who is ill and confined to bed.\n2. **Casey Letter:** Matt\'s girlfriend, who is a videoblogger and later becomes a love interest for Matt. \n'), (2911666, "This script contains only one female character: **Norma Wick**, John Wick's deceased wife. \n\nWhile there are female characters mentioned, like the Delivery Woman and the Waitress at the Red Circle, they don't have speaking roles or significant actions in the script. \n"), (61722, 'The script "The Graduate" features **three** female characters:\n\n1. **Mrs. Robinson:** The older woman who has an affair with Benjamin.\n2. **Elaine Robinson:** Mrs. Robinson\'s daughter, who Benjamin falls in love with.\n3. **Mrs. Braddock:** Benjamin\'s mother. \n'), (1311071, "This script features 7 female characters: \n\n1. **Naomi Ginsberg:** Allen's mother, struggling with a mental health condition.\n2. **Edie Parker:** Jack Kerouac's girlfriend, an art student.\n3. **Permissions Librarian:** A librarian at Columbia University.\n4. **Gwendolyn:** A page at Columbia University's library.\n5. **Edith Cohen:** A woman accompanying Louis Ginsberg.\n6. **Marion Carr:** Lucien Carr's mother.\n7. **Grandma Frankie:** Jack Kerouac's grandmother. \n"), (115632, "The script you provided has **two** female characters:\n\n1. **Matilde:** Jean Michel Basquiat's mother, who appears in a dream sequence and later in a mental hospital.\n2. **Gina Cardinale:** Jean Michel's girlfriend, who plays a significant role throughout the script. \n"), (499549, "This script contains **2** female characters:\n\n* **Dr. Grace Augustine:** The head of the Avatar Program and a renowned Pandoran botanist.\n* **Neytiri:** A fierce and beautiful Na'vi warrior who becomes Jake's teacher and love interest. \n"), (113101, block_reason: OTHER), (64665, block_reason: OTHER)]
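These answers are free text, so turning them into a numeric feature takes some parsing. A rough, heuristic sketch (not part of the original pipeline) that looks for a bolded digit or spelled-out count in each response:
number_words = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7}
def parse_count(text):
    # Blocked responses carry prompt_feedback objects rather than strings
    if not isinstance(text, str):
        return None
    m = re.search(r'\*\*\s*(\d+|' + '|'.join(number_words) + r')\s*\*\*', text.lower())
    if not m:
        return None
    token = m.group(1)
    return int(token) if token.isdigit() else number_words[token]

parsed_counts = [(i, parse_count(t)) for i, t in gemini_flash_guesses]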
X_train.loc[88944]
script_date                                               April 1985
script                                                           NaN
bechdel_id                                                      2576
title                                                       Commando
release_year                                                    1985
language                                                          en
popularity                                                    46.667
vote_average                                                   6.678
vote_count                                                      2677
overview           John Matrix, the former leader of a special co...
Drama                                                              0
Romance                                                            0
Adventure                                                          1
Fantasy                                                            0
Family                                                             0
Mystery                                                            0
Crime                                                              0
Thriller                                                           1
War                                                                0
Comedy                                                             0
Music                                                              0
Western                                                            0
Horror                                                             0
Science Fiction                                                    0
Action                                                             1
Animation                                                          0
History                                                            0
embeddings         [0.03867363, 0.038765118, -0.029283423, -0.043...
neural_net_preds                                                   0
Name: 88944, dtype: object
response
response.close()
r
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[117], line 1
----> 1 response.__dict__
AttributeError: 'coroutine' object has no attribute '__dict__'
from flask import * import google.generativeai as genai from flask_pymongo import PyMongo import os from sendgrid import SendGridAPIClient from sendgrid.helpers.mail import Mail from copy import deepcopy as copy from pymongo import MongoClient from langchain_community.vectorstores import MongoDBAtlasVectorSearch from langchain_google_genai import ChatGoogleGenerativeAI from langchain import hub from langchain_core.runnables import RunnablePassthrough, RunnableLambda from langchain_core.output_parsers import StrOutputParser import threading from user_agents import parse from GoogleEmbeddings import Embeddings from datetime import datetime from TinyDBRetriever import TinyDBRetriever application = app = Flask(__name__) app.secret_key = 'your_secret_key_here' app.langchain_api_key = open('langchain_api_key.txt').read() def logChat(): print('logchat') def inputUserData(name: str = None, email: str = None, how_found: str = None, is_recruiter: bool = None, company_name: str = None, job_hiring_for: str = None, job_description: str = None): ''' Extracts data from a chat log. Called when asked to extract data from a chat log. Args: name: the name of the user, or None if the user has not provided their name. email: the user's email address, or None if the user has not provided their email address how_found: how the user was directed to the website, or None if the user has not provided this information is_recruiter: A boolean variable representing whether this person is a recruiter, or None if the user has not provided this information company_name: the name of the company the user works for, or None if the user has not provided this information job_hiring_for: the job the user is hiring for, or None if the user has not provided this information job_description: a brief description of the job if the user provides the information; otherwise None ''' return {name: name, email: email, how_found: how_found, is_recruiter: is_recruiter, company_name: company_name, job_hiring_for: job_hiring_for, job_description: job_description} summary_model = genai.GenerativeModel('gemini-1.5-flash-latest', safety_settings=[ { "category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE" }, { "category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE" }, { "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE" }, { "category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE" }, ], tools=[inputUserData]) chat_history = session.get('chat_hist', [{'role': 'model', 'parts': [{ 'text': 'Hi there! I\'m Mark\'s AI assistant. 
How can I help you?'}]}]) chat_history_string = '' chat_history_string = [f'{chat_history_string} {message["role"]}: {message["parts"][0]["text"]}' for message in chat_history] response = summary_model.generate_content(f'Extract data from the following chat log, by calling your function:' f'{chat_history_string}') args = {} print(response) for key in list(response.parts[0].function_call.args.keys()): # serializes the arguments from gemini as a list args[key] = response.parts[0].function_call.args[key] doc = locals()[response.parts[0].function_call.name](**args) client = MongoClient(app.mongo_uri) db = client['website-database'] collection = db['chat_logs'] clean_doc = {key: args[key] for key in args.keys() if key is not None} clean_doc['log'] = chat_history_string clean_doc['timestamp'] = datetime.now().isoformat() # print(clean_doc) result = collection.insert_one(clean_doc) def is_mobile(request): user_agent = parse(request.headers.get('User-Agent')) return user_agent.is_mobile @app.route('/refresh_chat', methods=['POST']) def refresh_chat(): logChat() session['chat_hist'] = [{'role': 'model', 'parts': [{'text': 'Hi there! I\'m Mark\'s AI assistant. How can I help you?'}]}] # chat_hist is stored as a list of dicts in session memory return jsonify({'reload': 1}) @app.before_request def startup(): #Sets up chat and vectorstore on startup app.retriever_ready = False app.projects_retriever_ready = False app.is_first_refresh = True app.current_page = 'home' with open('mongo_info.txt') as f: (user, password, url) = f.readlines() #Assemble the mongo connection URI mongo_uri = f'mongodb+srv://{user.strip()}:{password.strip()}@{url.strip()}/?retryWrites=true&w=majority&appName=website-database&tlsCAFile=isrgrootx1.pem' app.mongo_uri = mongo_uri app.config["MONGO_URI"] = os.environ.get('MONGODB_URI', mongo_uri) mongo = PyMongo(app) #Configure and run PyMongo app.google_api_key = open('google_api_key.txt').read() genai.configure(api_key=app.google_api_key) refresh = refresh_chat() def create_retriever(mongo_uri=mongo_uri): embeddings = Embeddings(api_key=app.google_api_key) #Instantiate a HuggingFaceEmbeddings model- this is taking too long vector_search = MongoDBAtlasVectorSearch.from_connection_string( mongo_uri, 'website-database.education-v2', #Create a vector search object embeddings, index_name="vector_index" ) app.retriever = vector_search.as_retriever(search_type="similarity", search_kwargs={"k": 15}) #Store it as a retriever for use later app.retriever_ready = True threading.Thread(target=create_retriever, daemon=True).start() app.before_request_funcs[None].remove(startup) def before_request(): if app.config.get('PREFERRED_URL_SCHEME', 'http') == 'https': from flask import _request_ctx_stack if _request_ctx_stack is not None: reqctx = _request_ctx_stack.top reqctx.url_adapter.url_scheme = 'https' @app.route('/') #Home page- robot image and chatbar def home(): print('/home called') if is_mobile(request): return render_template('mobile_resume.html') else: return render_template('home.html') @app.route('/projects') #Projects and Github browser page def projects(): print('/projects called') app.current_page = 'projects' if is_mobile(request): return render_template('mobile_projects.html') else: return render_template('projects.html') @app.route('/Resume', methods=['GET']) #Dynamic resume page def Resume(): print('/resume called') app.current_page = 'resume' if is_mobile(request): return render_template('mobile_resume.html') else: return render_template('Resume.html') 
@app.route('/mobile_contact', methods=['GET']) def contact(): return render_template('mobile_contact.html') @app.route('/handleMemo', methods=['POST']) def handleMemo(): message = request.json['message'] session['chat_hist'].append([ {'role': 'user', 'parts': {'text': message}}, {'role': 'model', 'parts': {'text': 'Ok, I\'ll let Mark know. What else can I help you with?'}} ]) msg = Mail( from_email='deskofmarkbotner@gmail.com', to_emails='markbochner1@gmail.com', subject='Message from Mr. Botner!', plain_text_content=message ) sg = SendGridAPIClient(open('Twilio.txt').readlines()[0].strip()) response = sg.send(msg) @app.route('/chat', methods=['POST']) def chat(): #Chat response logic genai.configure(api_key=app.google_api_key) #showResume returns a stock messsage and reroutes to the resume page count = session.get('count', 0) count += 1 print(count) if count == 10: count = 0 logChat() session['count'] = count def goHome(text: str=None): ''' Returns the user to the homepage. Called when the user requests to return to the home page. Args: text: ignore this argument Returns: rendered template ''' print('goHome called') if text is None or text.strip() == '': text = session['chat_hist'][-1]['parts'][0]['text'] = 'No problem. Is there anything else I can help you with today?' else: text = session['chat_hist'][-1]['parts'][0]['text'] = text return jsonify({'response': text, 'redirect_url': url_for('home'), 'type': 2}) def showProjects(text: str=None): ''' Shows Mark's projects to the used. Called when the user requests to see Mark's projects, or asks about Mark's project experience. Args: text: ignore this argument Returns: rendered template showing Mark's projects ''' print('showProjects called') if text is None or text.strip() == '': text = session['chat_hist'][-1]['parts'][0]['text'] = 'Mark has a few projects available on his github; you can browse the highlights here!' else: session['chat_hist'][-1]['parts'][0]['text'] = text return jsonify({'response': text, 'redirect_url': url_for('projects'), 'type': 2}) def showResume(job_title: str=None, text: str=None): ''' Shows Mark's Resume page to users. Called whenever a user inquires about Mark's resume. Args: text: ignore this argument job_title: A string containing the job title or description. If none is available pass None. 
Returns: rendered template ''' print('showResume called') if job_title is None: if text is None or text.strip() == '' or text.strip() is None: text = session['chat_hist'][-1]['parts'][0]['text'] = 'Sure, here is Mark\'s resume' else: session['chat_hist'][-1]['parts'][0]['text'] = text category = 'none' else: category = None if 'student' in job_title.lower() or 'intern' in job_title.lower() or 'science' in job_title.lower() or 'scientist' in job_title.lower(): category = 'student' elif 'apprentice' in job_title.lower(): category = 'mentorship' elif 'engineer' in job_title.lower() or 'integration' in job_title.lower() or ' AI ' in job_title or ' ML ' in job_title: category = 'engineer' elif 'manager' in job_title.lower() or 'leader' in job_title.lower(): category = 'leader' elif 'data' in job_title.lower(): category = 'data' else: if text is None or text.strip() == '': text = session['chat_hist'][-1]['parts'][0]['text'] = 'Sure, here is Mark\'s resume' else: session['chat_hist'][-1]['parts'][0]['text'] = text category = 'none' return jsonify({'response': text, 'redirect_url': url_for('Resume'), 'type':5, 'category': category, 'job_title':job_title}) #It searches documents related to my education and returns them as context #Also updates chat history for the model def sendMemo(): ''' Leaves a memo for Mark. Called when the user requests contact info or to get in touch with Mark. Returns: Sends Mark a note ''' session['chat_hist'][-1]['parts'][0]['text'] = 'Sure, I can definitely take a note for Mark! Go ahead and leave your message below, and I\'ll pass it along.' return jsonify({'response': 'Sure, I can definitely take a note for Mark! ' 'Go ahead and leave your message below, and I\'ll pass it along.', 'type':3}) def showContact(text: str=None): ''' Provides Mark's contact info to the user. Called when the user asks for Mark's contact information or when the user asks how to get in touch with Mark. Args: text: ignore this argument Returns: Mark's contact information ''' if text is None or text.strip() == '': text = session['chat_hist'][-1]['parts'][0]['text'] = 'Here\'s Mark\'s contact info; I can also take a message or set up a meeting if you\'d like.' else: session['chat_hist'][-1]['parts'][0]['text'] = text return jsonify({'response': 'Here\'s Mark\'s contact info; I can also take a message or set up a meeting if you\'d like.', 'type': 4, 'request': 'contact'}) def setInterview(text: str=None): ''' Sets an interview with Mark. Called when the user requests to speak with Mark or inquires about his schedule. Args: text: ignore this argument Returns: a calendly interface to interact with the customer ''' if text is None or text.strip() == '': text = session['chat_hist'][-1]['parts'][0]['text'] = 'I can set up a meeting with Mark- Have a look at his calendar and let me know what works for you.' else: session['chat_hist'][-1]['parts'][0]['text'] = text return jsonify({'response': 'I can set up a meeting with Mark- Have a look at his calendar and let me know what works for you.', 'type': 4, 'request': 'calendar'}) def askMark(query: str): ''' Answers any question a user may have about Mark. Called when no other tool is appropriate to answer a user's question. 
Args: query: A string containing the user's request Returns: A response to the user's request ''' print('DM Called') model = ChatGoogleGenerativeAI(model='gemini-1.5-flash', api_key=app.google_api_key) prompt = hub.pull('mocboch/rag-modified') def format_docs(docs): return "\n\n".join(doc.page_content for doc in docs) TDBR = TinyDBRetriever(tinydb_filepath='personal-info.json',google_api_key=app.google_api_key,k=3) retriever = RunnableLambda(TDBR._get_relevant_documents) rag_chain = ( {"context": retriever | format_docs, "question": RunnablePassthrough()} | prompt | model | StrOutputParser() ) r = ' '.join([chunk for chunk in rag_chain.stream(query)]) session['chat_hist'][-1]['parts'][0]['text'] = r print(query) print(r) return jsonify({'response': r, 'type': 1}) def discussEducation(query: str): ''' Do not ask followup questions before calling this function. Returns information about Mark's Education. Called when the user asks for information about Mark's education, including the Applied Business Analytics or Salesforce programs at ASU, the MS in Data Science at Eastern University, Alfred University, or the Academy for Information Technology. Args: query: A string, either containing the user's request or a slightly modified version of it if appropriate. returns: A response to the user's request ''' client = MongoClient(app.mongo_uri) model = ChatGoogleGenerativeAI(model='gemini-1.5-flash', api_key=app.google_api_key) prompt = hub.pull('rlm/rag-prompt') def format_docs(docs): return "\n\n".join(doc.page_content for doc in docs) rag_chain = ( {"context": app.retriever | format_docs, "question": RunnablePassthrough()} | prompt | model | StrOutputParser() ) r = ' '.join([chunk for chunk in rag_chain.stream(query)]) session['chat_hist'][-1]['parts'][0]['text'] = r print(query) return jsonify({'response': r, 'type':1}) tools = [askMark, showContact, sendMemo, setInterview, goHome, showResume, showProjects] if app.retriever_ready == True: tools.append(discussEducation) #if app.projects_retriever_ready == True: tools.append(discussProject) model = genai.GenerativeModel('gemini-1.5-flash-latest', safety_settings=[ { "category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE" }, { "category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE" }, { "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE" }, { "category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE" }, ], tools=tools) #Instantiate a model with tools hist = session['chat_hist'] = session.get('chat_hist', [{'role': 'model', 'parts': [{'text': 'Hi there! I\'m Mark\'s AI assistant. How can I help you?'}]}]) #Get the chat history, which can be fed to the model as a list of dicts chat = model.start_chat(history=[{'role': 'user', 'parts': [{'text': 'You are Mark\'s assistant. You will attend politely to the user\'s requests by calling the appropriate tools. If an opportunity presents itself, you will ask what role the user is hiring for, at what company, who they are, etc. 
The user is currently looking at Mark\'s ' + app.current_page + ' page.'}]}] + hist[-6:]) #Start a chat with the model and history message = request.json['message'] #Gets the message (prompt) from the front end response = chat.send_message(message) print(response)#Calls gemini for a response to the prompt bypass_response_functions = ['discussEducation', 'askMark'] if not response.parts[0].function_call and len(response.parts) == 1: #Returns a text based response directly to the front end session['chat_hist'] += [{'role': msg.role, #Serializes the chat history as a list of dicts and puts it away in the session storage 'parts': [{'text': part.text} for part in msg.parts]} for msg in chat.history[-2:]] return jsonify({'response': response.text, 'type': 1}) elif len(response.parts) == 1: #else handles function calls fn_call = response.parts[0].function_call.name #gets the name of the function called args = {} for key in list(response.parts[0].function_call.args.keys()): #serializes the arguments from gemini as a list args[key] = response.parts[0].function_call.args[key] h = copy(chat.history[0]) #Makes a deepcopy of the first message, which is user submitted and therefore text chat.history[-1] = h #Uses that as a blank to fill in the appropriate response once it's generated chat.history[-1].parts[0].text = ' ' chat.history[-1].role = 'model' session['chat_hist'] += [{'role': msg.role, #Serialize and store the chat history 'parts': [{'text': part.text} for part in msg.parts]} for msg in chat.history[-2:]] return locals()[fn_call](**args) else: fn_call = response.parts[1].function_call.name # gets the name of the function called args = {} for key in list(response.parts[1].function_call.args.keys()): # serializes the arguments from gemini as a list args[key] = response.parts[1].function_call.args[key] print(fn_call) if fn_call not in bypass_response_functions: args['text'] = response.parts[0].text h = copy(chat.history[0]) # Makes a deepcopy of the first message, which is user submitted and therefore text chat.history[-1] = h # Uses that as a blank to fill in the appropriate response once it's generated chat.history[-1].parts[0].text = ' ' chat.history[-1].role = 'model' session['chat_hist'] += [{'role': msg.role, # Serialize and store the chat history 'parts': [{'text': part.text} for part in msg.parts]} for msg in chat.history[-2:]] return locals()[fn_call](**args) @app.route('/get_chat_history', methods=['GET']) def get_chat_history(): #route to update chat history after changing pages hist = session.get('chat_hist', [{'role': 'model', 'parts': [{'text': 'Hi there! I\'m Mark\'s AI assistant. How can I help you?'}]}] # chat_hist is stored as a list of dicts in session memory ) return jsonify({'history': hist}) if __name__ == '__main__': app.run(debug=True, port=8000)
var seq = 0 function loadChatHistory() { $.get('/get_chat_history', function (data) { data.history.forEach(function (message) { if (message.role === 'model') { addMessage('Mr. Botner: ' + message.parts[0].text); } else { addMessage('You: ' + message.parts[0].text); } }); }); } function sendMessage() { if (seq === 1) { var userInput = document.getElementById('userInput'); var message = userInput.value; userInput.value = ''; addMessage('Note for Mark: ' + message); fetch('/handleMemo', { method: 'POST', headers: { 'Content-Type': 'application/json' }, body: JSON.stringify({message: message}), }).then(seq = 0).then(addMessage('Mr. Botner: Ok, I\'ll let Mark know. What else can I help you with?')) } else { var userInput = document.getElementById('userInput'); var message = userInput.value; if (message.trim() !== '') { addMessage('You: ' + message); userInput.value = ''; // Send the message to the server fetch('/chat', { method: 'POST', headers: { 'Content-Type': 'application/json', }, body: JSON.stringify({message: message}), }) .then(response => response.json()) .then(data => { if (data.type === 2) { window.location.href = data.redirect_url; } addMessage('Mr. Botner: ' + data.response); if (data.type === 3) { seq = 1; } if (data.type === 4) { switch (data.request) { case 'calendar': openCalendly(); break; case 'contact': openContact(); break; } } }); } } } function addMessage(message) { var chatMessages = document.getElementById('chatMessages'); var messageElement = document.createElement('p'); messageElement.textContent = message; chatMessages.appendChild(messageElement); chatMessages.scrollTop = chatMessages.scrollHeight; } function openCalendly() { const browser = document.getElementById('calendly'); const content = document.getElementById('calendly-content'); browser.style.display = 'block'; } function openContact() { const browser = document.getElementById('contact'); const content = document.getElementById('contact-content'); browser.style.display = 'block'; } function closeCalendly() { const browser = document.getElementById('calendly'); browser.style.display = 'none'; } function closeContact() { const browser = document.getElementById('contact'); browser.style.display = 'none'; } window.onload = function () { Calendly.initInlineWidget({ url: 'https://calendly.com/markbochner1', parentElement: document.getElementById('calendly-embed') }); }; $(document).ready(function() { console.log('Document ready'); loadChatHistory(); $('.chat-input input').keypress(function(e) { if (e.which == 13) { sendMessage(); return false; } }); }); window.onbeforeunload = () => fetch('/stop'); event.stopPropagation();
<!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <link rel="stylesheet" href="static/chat.css"></link> <script src="https://code.jquery.com/jquery-3.6.0.min.js"></script> <script type="text/javascript" src="https://assets.calendly.com/assets/external/widget.js"></script> <script src="static/chatbot.js"></script> <script src="https://platform.linkedin.com/badges/js/profile.js" async defer type="text/javascript"></script> <title>Mark Bochner - Projects</title> <style> #close-browser { position: absolute; top: 10px; right: 10px; cursor: pointer; } .white { color: white; } body { line-height: 1.6; margin: 0; padding: 0; color: #333; font-family: "Geist Mono Variable", sans-serif; background-color: #c4e1e7; } .page-container { margin: 0 auto; padding: 20px; box-shadow: 0 0 10px rgba(0, 0, 0, 0.1); display: flex; flex-direction: row; } .page-container div { } h1 { text-align: center; color: #333; font-size: 2.5em; margin-bottom: 30px; } .section { margin-bottom: 20px; padding: 15px; background-color: #f9f9f9; border-radius: 5px; cursor: pointer; transition: background-color 0.3s; border: 1px solid #e0e0e0; max-width: 800px; } .s{ margin-bottom: 20px; background-color: #F0F0F0; cursor: pointer; transition: background-color 0.3s; border: 2px solid #000000; max-width: 588px; font-weight: bold; color: #000000; font-size: 18px; } .chatbot-sidebar { max-width: 400px; height: 100vh; position: fixed; right: 0; top: 0; background-color: #0F68B6; padding: 20px; box-sizing: border-box; display: flex; flex-direction: column; } .section:hover { background-color: #e9e9e9; } .section h2 { margin-top: 0; color: #2c3e50; font-size: 1.5em; } .modal { display: none; position: fixed; z-index: 1; left: 0; top: 0; width: 100%; height: 100%; background-color: rgba(0,0,0,0.4); } .modal-content { background-color: #c4e1e7; margin: 5% auto; padding: 20px; border: 1px solid #888; width: 80%; max-width: 900px; border-radius: 5px; height: 80%; overflow-y: auto; } .close { color: #aaa; float: right; font-size: 28px; font-weight: bold; cursor: pointer; } .close:hover, .close:focus { color: black; text-decoration: none; cursor: pointer; } .modal-content h2 { color: #2c3e50; border-bottom: 2px solid #2c3e50; padding-bottom: 10px; } .modal-content h3 { color: #34495e; } .modal-content ul { padding-left: 20px; } .modal-content li { margin-bottom: 10px; } .links-container { align-items: center; position: relative; width: 100%; } .links-container div { margin: 8px; width: auto; } .chat-messages { flex: 1; overflow-y: auto; border: 1px solid #ccc; padding: 1rem; margin-bottom: 1rem; color: white; width: auto; } .buttons-container { display: flex; flex-direction: row; flex-wrap: wrap; } .button{ display: inline-flex; padding: 10px 20px; font-size: 16px; font-weight: bold; text-align: center; text-decoration: none; border-radius: 5px; transition: all 0.3s ease; cursor: pointer; } .b{ display: inline-flex; border-radius: 5px; margin-left: 16px; transition: all 0.3s ease; cursor: pointer; height: 100% } </style> </head> <body> <div class="page-container"> <div class="links-container"> <h1>Projects</h1> <div class="section" onclick="openModal('MovieReadme')"> <h2>Movie Script Analysis</h2> <div class="buttons-container"> <button class='button' onclick="openModal('BechdelTest');event.stopPropagation()"> <img style="height:16px;width:16px;margin-right:16px;" src="../static/jupyter-logo.svg"> Initial Data Collection</button> <button class='button' 
onclick="openModal('NaiveBayes');event.stopPropagation()"> <img style="height:16px;width:16px;margin-right:16px;" src="../static/jupyter-logo.svg"> Genre Classification with Naive Bayes</button> <button class='button' onclick="openModal('PredictiveModeling');event.stopPropagation()"> <img style="height:16px;width:16px;margin-right:16px;" src="../static/jupyter-logo.svg"> Predictive Modeling</button> <button class='button' onclick="openModal('DistilBERT');event.stopPropagation()"> <img style="height:16px;width:16px;margin-right:16px;" src="../static/jupyter-logo.svg"> Building a Classifier with DistilBERT</button> <button class='button' onclick="openModal('DataCard');event.stopPropagation()"> <img style="height:16px;width:16px;margin-right:16px;" src="../static/huggingface-logo.svg"> Data Card </button> </div> </div> <div class="section" onclick="openModal('WebsiteReadme')"> <h2>View the Code Behind this Website</h2> <div class="buttons-container"> <button class='button' onclick="openModal('AppPy');event.stopPropagation()"> <img style="height:16px;width:16px;margin-right:16px;" src="../static/python-logo.svg"> Server-Side Scripts</button> <button class='button' onclick="openModal('ThisPage');event.stopPropagation()"> <img style="height:16px;width:16px;margin-right:16px;" src="../static/html-logo.svg"> This Page</button> <button class='button' onclick="openModal('ChatBot');event.stopPropagation()"> <img style="height:16px;width:16px;margin-right:16px;" src="../static/js-logo.svg"> Chatbot Script</button> <button class='button' onclick="openModal('ChatCSS');event.stopPropagation()"> <img style="height:16px;width:16px;margin-right:16px;" src="../static/css-3-logo.svg"> Chatbot Style</button> <button class='button' onclick="openModal('LangchainNB');event.stopPropagation()"> <img style="height:16px;width:16px;margin-right:16px;" src="../static/jupyter-logo.svg"> Custom Implementation of the LangChain Embeddings Class</button> <button class='button' onclick="openModal('TDBR');event.stopPropagation()"> <img style="height:16px;width:16px;margin-right:16px;" src="../static/python-logo.svg"> TinyDBRetriever - A Custom Retriever for TinyDB and LangChain</button> <button class='button' onclick="openModal('TDBR_NB');event.stopPropagation()"> <img style="height:16px;width:16px;margin-right:16px;" src="../static/jupyter-logo.svg"> Building a Local Database for RAG with TinyDB</button> </div> </div> <div class="section" onclick="openModal('QuantStudyReadme')"> <h2>Quantitative Analysis Replication</h2> <div class="buttons-container"> <button class='button' onclick="openModal('DataPython');event.stopPropagation()"> <img style="height:16px;width:16px;margin-right:16px;" src="../static/jupyter-logo.svg"> Data Cleaning and Exploration in Python</button> <button class='button' onclick="openModal('DataR');event.stopPropagation()"> <img style="height:16px;width:16px;margin-right:16px;" src="../static/R-logo.svg"> Modeling in R</button> </div> </div> <div class="section" onclick="openModal('TikTok')"> <h2>TikTok Comment Analysis</h2> <div class="buttons-container"> <button class='button' onclick="openModal('TikTokData');event.stopPropagation()"> <img style="height:16px;width:16px;margin-right:16px;" src="../static/datalore-logo.svg"> Data Sample</button> </div> </div> <div class="section" onclick="openModal('ConsoleReporter')"> <h2>ConsoleReporter</h2> <p>A short script that uses generative AI to turn a pycharm console session into an annotated Jupyter notebook. 
Most of the time, it works every time.</p> </div> </div> <div class="chatbot-container"> <div id="calendly"> <div id="calendly-embed" class="calendly" style="min-width:1px;min-height:1px;display:inline-flex;"></div> <button id="close-browser" onclick="closeCalendly()">Close</button> </div> <div class="chatbot-sidebar"> <div style="display:flex;flex-direction:row;"> <a href="./Resume"><button>Resume</button></a> <a href="./"><button>Home</button></a> </div> <h3 class="white">Mark Botner, Office Manager</h3> <div class="chat-messages" id="chatMessages"></div> <div class="chat-input"> <input type="text" id="userInput" placeholder="Type your message..."> <button onclick="sendMessage()">Send</button> </div> </div> </div> </div> <div id="BechdelTestModal" class="modal"> <div class="modal-content"> <span class="close" onclick="closeModal('BechdelTest')">×</span> <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fmocboch%2FMovie-Script-Data-Analysis%2Fblob%2Fmaster%2FBechdel%2520Test.ipynb&style=default&type=ipynb&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script> </div> </div> <div id="DataCardModal" class="modal"> <div class="modal-content"> <span class="close" onclick="closeModal('DataCard')">×</span> <iframe src="https://huggingface.co/datasets/mocboch/movie_scripts/embed/viewer/default/train" frameborder="2" width="100%" height="560px"></iframe> </div> </div> <div id="NaiveBayesModal" class="modal"> <div class="modal-content"> <span class="close" onclick="closeModal('NaiveBayes')">×</span> <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fmocboch%2FMovie-Script-Data-Analysis%2Fblob%2Fmaster%2FNaive%2520Bayes%2520for%2520Genre%2520Identification.ipynb&style=default&type=ipynb&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script> </div> </div> <div id="DistilBERTModal" class="modal"> <div class="modal-content"> <span class="close" onclick="closeModal('DistilBERT')">×</span> <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fmocboch%2FMovie-Script-Data-Analysis%2Fblob%2Fmaster%2FBuilding%2520a%2520Classifier%2520with%2520DistilBert.ipynb&style=routeros&type=ipynb&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script> </div> </div> <div id="PredictiveModelingModal" class="modal"> <div class="modal-content"> <span class="close" onclick="closeModal('PredictiveModeling')">×</span> <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fmocboch%2FMovie-Script-Data-Analysis%2Fblob%2Fmaster%2FData%2520Exploration.ipynb&style=routeros&type=ipynb&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script> </div> </div> <div id="AppPyModal" class="modal"> <div class="modal-content"> <span class="close" onclick="closeModal('AppPy')">×</span> <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fmocboch%2Fwebsite%2Fblob%2Fmaster%2Fapp.py&style=intellij-light&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script> </div> </div> <div id="ChatBotModal" class="modal"> <div class="modal-content"> <span class="close" onclick="closeModal('ChatBot')">×</span> <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fmocboch%2Fwebsite%2Fblob%2Fmaster%2Fstatic%2Fchatbot.js&style=intellij-light&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script> 
</div> </div> <div id="ThisPageModal" class="modal"> <div class="modal-content"> <span class="close" onclick="closeModal('ThisPage')">×</span> <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fmocboch%2Fwebsite%2Fblob%2Fmaster%2Ftemplates%2Fprojects.html&style=intellij-light&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script> </div> </div> <div id="ChatCSSModal" class="modal"> <div class="modal-content"> <span class="close" onclick="closeModal('ChatCSS')">×</span> <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fmocboch%2Fwebsite%2Fblob%2Fmaster%2Fstatic%2Fchat.css&style=intellij-light&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script> </div> </div> <div id="ConsoleReporterModal" class="modal"> <div class="modal-content"> <span class="close" onclick="closeModal('ConsoleReporter')">×</span> <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fmocboch%2FConsole-Reporter%2Fblob%2Fmaster%2FConsoleReporter.py&style=intellij-light&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script> </div> </div> <div id="DataPythonModal" class="modal"> <div class="modal-content"> <span class="close" onclick="closeModal('DataPython')">×</span> <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fmocboch%2FStudy-Replication%2Fblob%2Fmain%2FQuantitative%2520Study%2520Replication.ipynb&style=intellij-light&type=ipynb&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script> </div> </div> <div id="DataRModal" class="modal"> <div class="modal-content"> <span class="close" onclick="closeModal('DataR')">×</span> <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fmocboch%2FStudy-Replication%2Fblob%2Fmain%2FQuantitative%2520Study%2520Replication.rmd&style=intellij-light&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script> </div> </div> <div id="TikTokDataModal" class="modal"> <div class="modal-content"> <span class="close" onclick="closeModal('TikTokData')">×</span> <iframe style="width:100%;height:100%;overflow:scroll"src="https://datalore.jetbrains.com/report/embed/vuaFxFrLZ4lQPRqnFZ8JlN/SP2PYD4RKXnuuh5wkeWv4a/f3XiYK62JKF1bIb1nyAqF9?height=517" frameborder="0"></iframe> </div> </div> <div id="LangchainPyModal" class="modal"> <div class="modal-content"> <span class="close" onclick="closeModal('LangchainPy')">×</span> <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fmocboch%2Fwebsite%2Fblob%2Fmaster%2FGoogleEmbeddings.py&style=default&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script> </div> </div> <div id="LangchainNBModal" class="modal"> <div class="modal-content"> <span class="close" onclick="closeModal('LangchainNB')">×</span> <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fmocboch%2Fwebsite%2Fblob%2Fmaster%2FBuilding%2520a%2520Custom%2520Implementation%2520of%2520the%2520LangChain%2520Embeddings%2520Class.ipynb&style=default&type=ipynb&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script> </div> </div> <div id="TDBRModal" class="modal"> <div class="modal-content"> <span class="close" onclick="closeModal('TDBR')">×</span> <script 
src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fmocboch%2Fwebsite%2Fblob%2Fmaster%2FTinyDBRetriever.py&style=default&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script> </div> </div> <div id="TDBR_NBModal" class="modal"> <div class="modal-content"> <span class="close" onclick="closeModal('TDBR_NB')">×</span> <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fmocboch%2Fwebsite%2Fblob%2Fmaster%2FSetting%2520Up%2520a%2520Local%2520Database%2520in%2520TinyDB.ipynb&style=default&type=ipynb&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script> </div> </div> <div id="WebsiteReadmeModal" class="modal"> <div class="modal-content"> <span class="close" onclick="closeModal('WebsiteReadme')">×</span> <h2>Created a Website Controlled by Natural Language</h2> <ul> <li>Created a website controlled by natural language prompts using Flask</li> <li>Integrated Google Gemini for function calling to control navigation and other UX elements</li> <li>Built a NoSQL database with MongoDB Atlas including a vector search index with Hugging Face embeddings</li> <li>Used Langchain for retrieval augmented generation to provide context for certain responses</li> <li>Integrated Calendly and Sendgrid to establish communication with users</li> <li>Deployed to Amazon Web Services using Elastic Beanstalk</li> </ul> </div> </div> <div id="TikTokModal" class="modal"> <div class="modal-content"> <span class="close" onclick="closeModal('TikTok')">×</span> <h2>Collecting and Analyzing a set of TikTok Comments</h2> <ul> <li>Scraping a set of millions of TikTok comments from political accounts such as @kamalaharris and @realdonaldtrump</li> <li>Using polling data from 538 as targets, will analyze comments' sentiments and try to predict changes in polling results</li> <li>Ongoing project- check back for more soon!</li> </ul> </div> </div> <div id="MovieReadmeModal" class="modal"> <div class="modal-content"> <span class="close" onclick="closeModal('MovieReadme')">×</span> <h2>Collecting and Analyzing Unstructured Movie Script Data</h2> <ul> <li>Integrated data from various sources into a PostgreSQL database using web scraping and APIs</li> <li>Created visualizations in matplotlib and seaborn</li> <li>Cleaned and analyzed a large volume of structured and unstructured data, applying a variety of ML and NLP techniques</li> </ul> </div> </div> <div id="QuantStudyReadmeModal" class="modal"> <div class="modal-content"> <span class="close" onclick="closeModal('QuantStudyReadme')">×</span> <h2>Replicated the Analytical Process of a Published Paper</h2> <ul> <li>Applied the same and modified processes to new data</li> <li>Cleaned and transformed data according to established procedures</li> </ul> </div> </div> <div id="LangchainReadmeModal" class="modal"> <div class="modal-content"> <span class="close" onclick="closeModal('LangchainReadme')">×</span> <h2>Built a Custom Implementation of the LangChain Embeddings Class</h2> <ul> <li>Custom class is very lightweight and will run on a small AWS server</li> <li>Vector search using only 64-dimensional vectors will power RAG for Mr. 
Botner</li> <li>More to come!</li> </ul> </div> </div> <div id="contact"> <div id="contact-embed" class="contact" style="min-width:320px;height:500px;display:inline-flex;"> <img src="https://genqrcode.com/embedded?style=7&inner_eye_style=3&outer_eye_style=5&logo=de7388f4e7d5c492aebc315844138144&color=%230f68b6FF&background_color=%23c4e1e7FF&inner_eye_color=%230f68b6&outer_eye_color=%23000000&imageformat=svg&language=en&frame_style=0&frame_text=SCAN%20ME&frame_color=%23000000&invert_colors=false&gradient_style=0&gradient_color_start=%230057B8&gradient_color_end=%23FFD700&gradient_start_offset=50&gradient_end_offset=50&stl_type=1&logo_remove_background=false&stl_size=100&stl_qr_height=1.5&stl_base_height=2&stl_qr_magnet_type=3&stl_qr_magnet_count=0&type=0&text=https%3A%2F%2Fqri.lu%2FN9HzcOI&width=500&height=500&bordersize=2" alt="qr code" /> <div class="contact-details"> <div class="github-card" data-github="mocboch" data-width="400" data-height="150" data-theme="default"></div> <script src="//cdn.jsdelivr.net/github-cards/latest/widget.js"></script> <br><span>(908) 403-1660</span><br> <span>markbochner1@gmail.com</span><br> <script src="https://platform.linkedin.com/badges/js/profile.js" async defer type="text/javascript"></script> <div class="badge-base LI-profile-badge" data-locale="en_US" data-size="large" data-theme="light" data-type="HORIZONTAL" data-vanity="mark-bochner-618071263" data-version="v1"><a class="badge-base__link LI-simple-link" href="https://www.linkedin.com/in/mark-bochner-618071263?trk=profile-badge">LinkedIn</a></div> </div> </div> <button id="close-browser" onclick="closeContact()">Close</button> </div> <script> function openModal(id) { document.getElementById(id + 'Modal').style.display = 'block'; } function closeModal(id) { document.getElementById(id + 'Modal').style.display = 'none'; } window.onclick = function(event) { if (event.target.className === 'modal') { event.target.style.display = 'none'; } } </script> </body> </html>
@font-face { font-family: "Geist Mono Variable"; src: url('../static/GeistMono-UltraLight.woff2'); } @font-face { font-family: "Geist Mono Variable"; font-weight: bold; src: url('../static/GeistMono-SemiBold.woff2'); } * { font-family: "Geist Mono Variable"; } .chatbot-container { flex: 1; background-color: #0F68B6; color: white; padding: 2rem; display: flex; flex-direction: column; } .chat-messages { flex: 1; overflow-y: auto; border: 1px solid #ccc; padding: 1rem; margin-bottom: 1rem; color: white; } .chat-input { display: flex; align-items: center; } .chat-input input { flex: 1; padding: 0.5rem; border: 1px solid #ccc; border-radius: 4px; margin-right: 0.5rem; } .chat-input button { background-color: #2f7d86; color: white; border: none; padding: 0.5rem 1rem; border-radius: 4px; cursor: pointer; transition: background-color 0.3s; } .chat-input button:hover { background-color: #3c9aa5; } .main-container { display: flex; overflow-x: auto; } .content-container { flex: 1; min-width: 300px; } .chatbot-sidebar { width: auto; flex-shrink: 0; } .close { color: #aaa; float: right; font-size: 28px; font-weight: bold; cursor: pointer; } .close:hover, .close:focus { color: black; text-decoration: none; cursor: pointer; } #contact-embed { height: 400px; width: 800px; } #calendly-embed { min-width: 320px; height: 630px; } #calendly, #contact { display: none; position: fixed; top: 10%; left: 10%; transform: translate(0%, -10%); background-color: #0F68B6; border-radius: 10px; box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1); padding: 20px; z-index: 1000; color: white; font-size: 26px; } #calendly h2 { margin-top: 0; } #contact-embed{ display: flex; flex-direction: row; } #calendly-content { height: calc(100% - 60px); overflow-y: auto; } .contact-details { display: flex; flex-direction: column; align-items: center; } .contact img { width: 400px; height: 400px; margin-right: 25px; }
import google.generativeai as genai
import os
from datetime import date
import nbformat as nbf
import re

genai.configure(api_key=open('apikey.txt').read())
model = genai.GenerativeModel('gemini-pro')

def createReport(console_log, specified_outcome, outcome_line,
                 output_name=('auto_console_report_' + str(date.today()) + '.ipynb')):
    """Accepts a pycharm console history and the outcome as input and outputs a report of the correct steps taken to achieve said outcome"""
    # console_log (string): name of a .txt file containing the console log
    # specified_outcome (string): the end result of the console session. The model will attempt to find all code
    #     necessary to achieve the outcome in the log.
    # outcome_line (integer): the number of the line where the outcome was produced.
    # output_name (string): the file name for the output file. (.ipynb)
    chat = model.start_chat(history=[])
    nb = nbf.v4.new_notebook()
    console_log = open(console_log).read()
    outcome_line = str(outcome_line)
    response = chat.send_message(
        'Eliminate all code from the console log which is not necessary to achieve the specified outcome as '
        'demonstrated in the specified line. Ensure all code is included that is necessary to run the specified line. '
        'Use only code from the console log. Specified Outcome: ' + specified_outcome + '. Specified Line: ' + str(
            outcome_line) + '. Console Log: ' + console_log)
    response = chat.send_message(
        'Explain each section of the code you returned. Label the explanations and code chunks \' ~E~ <explanation '
        'text>. ~C~ <code>')
    nb['cells'].append(nbf.v4.new_markdown_cell('# Console Session Report (Automatically Generated)\n'
                                                '**Created by:** ' + os.environ.get('USERNAME') +
                                                '\n**Date:** ' + date.today().strftime("%B %d, %Y") +
                                                '\n**Model Version:** ' + model.model_name +
                                                '\n\n**Session outcome:** *' + specified_outcome + '*'))
    splits = chat.history[3].parts[0].text.split('~')
    for i in range(len(splits)):
        if i == 0: continue
        if (i == 1) | (i % 4 == 1): assert (splits[i] == 'E')
        elif (i == 2) | (i % 4 == 2): nb['cells'].append(nbf.v4.new_markdown_cell(splits[i]))
        elif (i == 3) | (i % 4 == 3): assert (splits[i] == 'C')
        elif i % 4 == 0: nb['cells'].append(nbf.v4.new_code_cell(re.search(r'(```python\n)([\d\D\n]*)(\n```)', splits[i])[2]))
    with open(output_name, 'w') as f:
        nbf.write(nb, f)
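A hypothetical call, assuming the console history was saved to a file named console_log.txt (both the file name and the line number below are made up for illustration):
createReport(console_log='console_log.txt',
             specified_outcome='Compared a random forest, an MLP over Gemini embeddings, and a stacked decision tree on the Bechdel Test data',
             outcome_line=250)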
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
dat = pd.read_csv('StudyDataset.csv', encoding='windows-1252', index_col=0)
df = pd.read_csv('Quantitative Study Replication/Non-State Actors Dataset/nsa_v3.4_21November2013.asc', sep='\t', encoding='windows-1252')
df_3 = pd.read_csv('StudyDataset2.csv', encoding='windows-1252', index_col=0)
The DCJ dataset contains rows that list more than one rebel group in the 'Side B' column. To deal with this, sample weights were first introduced. These rows were then split into one row per group and weighted such that the total weight of the newly created rows for each original row sums to one.
In other words, if a row lists 2 groups, it will be split into 2 rows, each with one of the two groups and otherwise identical, weighted at 0.5 each.
dat
acdid | year | gwno | location | sidea | sideb | incomp | territory | startdate | epstartdate | ... | exile_erank | exile_sender | exile_scope | exile_scount | exile_implement | exile_rDCJ | exile_peaceagr | exile_start | exile_end | exile_perm | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1946 | 145.0 | Bolivia | Bolivia | Popular Revolutionary Movement | 2 | NaN | 1946-06-30 | 1946-06-30 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 1 | 1952 | 145.0 | Bolivia | Bolivia | MNR | 2 | NaN | 1946-06-30 | 1952-04-09 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 1 | 1967 | 145.0 | Bolivia | Bolivia | ELN | 2 | NaN | 1946-06-30 | 1967-03-31 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 2 | 1946 | 811.0 | Cambodia (Kampuchea) | France | Khmer Issarak | 1 | Cambodia | 1946-08-31 | 1946-08-31 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 | 2 | 1947 | 811.0 | Cambodia (Kampuchea) | France | Khmer Issarak | 1 | Cambodia | 1946-08-31 | 1946-08-31 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3406 | 271 | 2011 | 620.0 | Libya | Libya | NTC, Forces of Muammar Gaddafi | 2 | NaN | 2011-01-28 | 2011-03-04 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3407 | 271 | 2011 | 620.0 | Libya | Libya | NTC, Forces of Muammar Gaddafi | 2 | NaN | 2011-01-28 | 2011-03-04 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3408 | 271 | 2011 | 620.0 | Libya | Libya | NTC, Forces of Muammar Gaddafi | 2 | NaN | 2011-01-28 | 2011-03-04 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3409 | 271 | 2011 | 620.0 | Libya | Libya | NTC, Forces of Muammar Gaddafi | 2 | NaN | 2011-01-28 | 2011-03-04 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3410 | 271 | 2011 | 620.0 | Libya | Libya | NTC, Forces of Muammar Gaddafi | 2 | NaN | 2011-01-28 | 2011-03-04 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3410 rows × 168 columns
dat['model_weight'] = 1
indexes_to_delete = []
for i in dat.index:
if len(dat.loc[i]['sideb'].split(',')) > 1:
for actor in dat.loc[i]['sideb'].split(','):
new_row = dat.loc[i]
new_row['model_weight'] /= len(dat.loc[i]['sideb'].split(','))
new_row['sideb'] = actor.strip()
dat.loc[len(dat) + 1] = new_row
indexes_to_delete.append(i)
dat
acdid | year | gwno | location | sidea | sideb | incomp | territory | startdate | epstartdate | ... | exile_sender | exile_scope | exile_scount | exile_implement | exile_rDCJ | exile_peaceagr | exile_start | exile_end | exile_perm | model_weight | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1946 | 145.0 | Bolivia | Bolivia | Popular Revolutionary Movement | 2 | NaN | 1946-06-30 | 1946-06-30 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 |
2 | 1 | 1952 | 145.0 | Bolivia | Bolivia | MNR | 2 | NaN | 1946-06-30 | 1952-04-09 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 |
3 | 1 | 1967 | 145.0 | Bolivia | Bolivia | ELN | 2 | NaN | 1946-06-30 | 1967-03-31 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 |
4 | 2 | 1946 | 811.0 | Cambodia (Kampuchea) | France | Khmer Issarak | 1 | Cambodia | 1946-08-31 | 1946-08-31 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 |
5 | 2 | 1947 | 811.0 | Cambodia (Kampuchea) | France | Khmer Issarak | 1 | Cambodia | 1946-08-31 | 1946-08-31 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5861 | 271 | 2011 | 620.0 | Libya | Libya | Forces of Muammar Gaddafi | 2 | NaN | 2011-01-28 | 2011-03-04 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.5 |
5862 | 271 | 2011 | 620.0 | Libya | Libya | NTC | 2 | NaN | 2011-01-28 | 2011-03-04 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.5 |
5863 | 271 | 2011 | 620.0 | Libya | Libya | Forces of Muammar Gaddafi | 2 | NaN | 2011-01-28 | 2011-03-04 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.5 |
5864 | 271 | 2011 | 620.0 | Libya | Libya | NTC | 2 | NaN | 2011-01-28 | 2011-03-04 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.5 |
5865 | 271 | 2011 | 620.0 | Libya | Libya | Forces of Muammar Gaddafi | 2 | NaN | 2011-01-28 | 2011-03-04 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.5 |
5865 rows × 169 columns
dat.drop(index=indexes_to_delete, inplace=True)
Some of the spellings and acronyms used differ between the DCJ and NSA datasets. This step matches up the spellings so the datasets can be joined.
name_map = {
'Peoples Liberation Army': 'PLA',
'Republic of Kurdistan/KDPI': 'KDPI',
'Indonesian Peoples Army': "Indonesian People's Army",
'Communist Party of the Philippines': 'CPP',
'Military Faction (forces of Honasan, Abenina & Zumel)': 'CPP, Military Faction (forces of Honasan, Abenina & Zumel)',
'LTS[p]A': 'LTS(p)A',
'Viet Nam Doc Dong Min Hoi': 'Viet minh',
'Opposition coalition [Febreristas, Liberals and Communists]': 'Opposition coalition (Febreristas, Liberals and Communists)',
'Military Faction (forces of General Rodriguez)': 'Military faction (forces of Andres Rodriguez)',
'Military faction (forces of General Alfredo Stroessner)': 'Military faction (forces of Alfredo Stroessner)',
'PVO - White Band faction': 'PVO - "White Band" faction',
"Arakan People's Liberation Party": 'APLP',
'Communist Party of Arakan': 'CPA',
'Rohingya Solidarity Organisation': 'RSO',
'Arakan Rohingya Islamic Front': 'ARIF',
'Naxalites/PWG': 'PWG',
'Naxalites/CPI [-Marxist]': 'CPI-ML',
'CPI–Maoist': 'CPI-Maoist',
'Military Faction [Navy]': 'Military faction (Navy)',
'Military Faction - 26th of July Movement': 'M-26-7',
'National Revolutionary Council': 'Cuban Revolutionary Council',
'Darul Islam Movement': 'Darul Islam',
'Military faction (forces of Eduardo A. Lonardi Doucet)': 'Military faction (forces of Samuel Toranzo Calderón), Military faction (forces of Eduardo A. Lonardi Doucet)',
'Supreme Council for the Islamic Revolution in Iraq (SCIRI)': 'SCIRI',
'ISI/Jama\'at Al-Tawhid wa Al-Jihad': 'ISI',
'RJF/Al-Jaysh al-Islami fi Iraq': 'RJF',
'Independent Nasserite Movement /Mourabitoun militia': 'Independent Nasserite Movement /Mourabitoun militia',
'Lebanese National Movement': 'LNM',
'Shan State Army - South (SSA-S)': 'SSA',
'Independent Mining State of South Kasai': 'Independent Mining State of South Kasai',
'Military faction (forces of Amsha Desta and Merid Negusie)': 'EPRDF, Military faction (forces of Amsha Desta and Merid Negusie)',
'CPN-M/UPF': 'CPN-M',
'KDP/DPK': 'KDP',
'North Kalimantan Liberation Army': 'North Kalimantan Liberation Army',
'Military faction (forces of Hugo Chávez)': 'Military faction (forces of Hugo Chávez)',
"Military faction (forces loyal to Léon M'Ba)": "Military faction (forces loyal to Léon M'Ba)",
'Military faction (forces loyal to Gervais Nyangoma)': 'Military faction (forces loyal to Gervais Nyangoma)',
'First Liberation Army': 'First Liberation Army',
'Second Liberation Army (Frolinat)': 'Second Liberation Army',
'Military faction (forces of Maldoum Bada Abbas)': 'MDD, Military faction (forces of Maldoum Bada Abbas)',
'Military faction (Constitutionalists)': 'Military faction (Constitutionalists)',
'Military faction (forces of Jerry John Rawlings)': 'Military faction (forces of Jerry John Rawlings)',
'Military faction (forces of Ekow Dennis and Edward Adjei-Ampofo) ': 'Military faction (forces of Ekow Dennis and Edward Adjei-Ampofo)',
'Military faction (forces of Patrick Nzeogwu)': 'Military faction (forces of Patrick Nzeogwu)',
'Boko Haram': "Jama'atu Ahlis Sunna Lidda'awati wal-Jihad",
'Military faction (forces loyal to Nureddin Atassi and Youssef Zeayen)': 'Military faction (forces loyal to Nureddin Atassi and Youssef Zeayen)',
'Khmer Rouge/PDK': 'KR',
'FUNCINPEC/ANS': 'FUNCINPEC',
'MIM/Mindanao Independence Movement': 'MIM',
'MNLF – NM': 'MNLF - NM',
'MNLF – HM': 'MNLF - HM',
'SLM/A': 'SLM/A',
'SLM/A – MM': 'SLM/A - MM',
'SPLM/A-North': 'SPLM/A-North',
'Military faction (forces of Mohamed Madbouh)': 'Military faction (forces of Mohamed Madbouh)',
'Mukti Bahini: Liberation Force': 'Mukti Bahini',
'Military faction (forces of Idi Amin)': 'Military faction (forces of Idi Amin)',
'Military faction (forces of Charles Arube)': 'Military faction (forces of Charles Arube)',
"Lord's Army": "Lord's Army",
'PIRA/IRA': 'PIRA',
'Military faction (forces of Benjamin Mejia)': 'Military faction (forces of Benjamin Mejia)',
'MLN or Tupamaros': 'MLN/Tupamaros',
'Military faction (forces of Augusto Pinochet, Toribio Merino and Leigh Guzman)': 'Military faction (forces of Augusto Pinochet, Toribio Merino and Leigh Guzman)',
'JSS/SB/Shanti Bahini': 'JSS/SB',
'BLA/Baluchistan Liberation Army': 'BLA',
'BRA/Baluchistan Republican Army': 'BRA',
'Hezb-i-Islami': 'Hizb-i Islami-yi Afghanistan',
'Hezb-i-Wahdat': 'Hizb-i Wahdat',
'Jamiat-i-Islami': "Jam'iyyat-i Islami-yi Afghanistan",
'Junbish-i Milli-yi Islami': 'Junbish-i Milli-yi Islami',
'Hizb-i Demokratik-i Khalq-i Afghanistan': 'PDPA',
'Harakat-i Inqilab-i Islami-yi Afghanistan': 'Harakat-i Inqilab-i Islami-yi Afghanistan',
'Mahaz-i Milli-yi Islami-yi Afghanistan': 'Mahaz-i Milli-yi Afghanistan',
'Jabha-yi Nijat-i Milli-yi Afghanistan': 'Jabha-yi Nijat-i Milli-yi Afghanistan',
'Ittihad-i Islami Bara-yi Azadi-yi Afghanistan': 'Ittihad-i Is',
'Harakat-i Islami-yi Afghanistan': 'Harakat-i Islami-yi Afghanistan',
'Hizb-i Islami-yi Afghanistan - Khalis faction': 'Hizb-i Islami-yi Afghanistan - Khalis faction',
'FDN/Contras': 'Contras/FDN',
'USC Faction': 'USC/SNA',
'ARS/UIC': 'ARS/UIC',
'Mujahideen e Khalq': 'MEK',
'Military faction (forces of Samuel Doe)': 'Military faction (forces of Samuel Doe)',
'Resistance Armee Tunisienne': 'Résistance Armée Tunisienne',
'Military faction (forces of Hezekiah Ochuka)': 'Military faction (forces of Hezekiah Ochuka)',
'PKK/Kadek': 'PKK',
'Yemenite Socialist Party - Abdul Fattah Ismail faction': 'Yemenite Socialist Party - Abdul Fattah Ismail faction',
'Military faction (forces of Moisés Giroldi)': 'Military faction (forces of Moisés Giroldi)',
'Military faction (forces of Himmler Rebu and Guy Francois)': 'Military faction (forces of Himmler Rebu and Guy Francois)',
'Military faction (forces of Raol Cédras)': 'Military faction (forces of Raol Cédras)',
'OP Lavalas (Chimères)': 'OP Lavalas (Chimères)',
'Government of Armenia and ANM': 'Republic of Armenia',
'Azerbaijani Popular Front': 'APF',
'FRUD – AD': 'FRUD - AD',
'Croatian irregulars': 'Croatian irregulars',
'Exile and Redemption': "Takfir wa'l Hijra",
'FLEC–FAC': 'FLEC-FAC',
'Republic of Nagorno-Karabakh': 'Republic of Nagorno-Karabakh',
'Serbian Republic of Bosnia and Herzegovina': 'Serbian Republic of Bosnia-Herzegovina',
'Serbian Republic of Krajina': 'Serbian Republic of Krajina',
'al-Gamaa al-Islamiyya': "al-Gama'a al-Islamiyya",
'Republic of Abkhazia': 'Republic of Abkhazia',
'Republic of South Ossetia': 'Republic of South Ossetia',
'Dniestr Republic': 'PMR',
'Movement for Peace in Tajikistan': 'Movement for Peace in Tajikistan',
'Husseinov Military Faction': 'Military faction (forces of Suret Husseinov)',
'OPON forces': 'OPON forces',
'Autonomous Province of Western Bosnia': 'Autonomous Province of Western Bosnia',
'Croatian Republic of Bosnia and Herzegovina': 'Croatian Republic of Bosnia-Herzegovina',
'Republic of Chechnya': 'Chechen Republic of Ichkeria',
'Military Junta for the Consolidation of Democracy, Peace and Justice': 'Military Junta for the Consolidation of Democracy, Peace and Justice',
'National Liberation Army (UCK)': 'UCK',
'al-Qaida [The Base]': 'al-Qaida',
'Forces of the Caucasus Emirate': 'Forces of the Caucasus Emirate',
'NDFB – RD': 'NDFB - RD',
'Republic of South Sudan': 'Republic of South Sudan',
'Forces of Muammar Gaddafi': 'Forces of Muammar Gaddafi'
}
df = df.groupby(['ucdpid', 'side_b']).first().reset_index()
df.replace(name_map, inplace=True)
This cell merges the three datasets into one complete dataset
data = pd.merge(left=dat, right=df, left_on=['acdid','sideb'], right_on=['ucdpid', 'side_b'], suffixes=('_DCJ', '_NSA'), how='inner')
data = data.join(df_3.set_index(['acdid', 'styear', 'endyear', 'year']), on=['acdid', 'styear', 'endyear', 'year'], rsuffix='_DCJ2', how='left')
There are a couple of data cleaning steps to get the data in the form that's described in the study:
The original paper mentions 36 leftist groups. The full list of involved groups was given to Claude Sonnet 3.5, which was asked to identify the leftist groups, and the resulting list was then confirmed with basic Google searches. Rows are labeled 1/True if the group is a leftist group and 0 if not.
data['is_leftist_group'] = pd.Series()
leftist_groups = ['CPP', 'CPI', 'CPI-ML', 'PWG', 'CPI-Maoist', 'CPT', 'CPM', 'CPN-M', 'FARC', 'Sendero Luminoso',
                  'JVP', 'MEK', 'MCC', 'Viet minh', 'Pathet Lao', 'TPLF', 'EPRDF', 'Frelimo', 'MPLA', 'SWAPO',
                  'ZAPU', 'ZANU', 'MLN/Tupamaros', 'FSLN', 'PDPA', 'UNLF', 'PFLP', 'PFLP-GC', 'ELN', 'PLA',
                  'DSE', 'Huk', 'M-26-7', 'EPL', 'MIR', 'EPRP']
for i in data.index:
    if data.loc[i, 'sideb'] in leftist_groups:
        data.loc[i, 'is_leftist_group'] = 1
    else:
        data.loc[i, 'is_leftist_group'] = 0
'Dataset' contains just the necessary elements from 'data', and includes all the information needed for this analysis.
dataset = data[['acdid', 'location', 'sidea', 'sideb', 'model_weight', 'mobcap', 'fightcap', 'intens', 'polity', 'is_leftist_group', 'trial', 'truth', 'rep', 'amnesty', 'purge', 'exile', 'incomp', 'terrcont', 'year']]
This variable represents how many years a conflict has been ongoing, as described in the original study. It is increased by 1 for every consecutive year that the same conflict appears in the dataset.
dataset['year_of_conflict'] = pd.Series()
dataset.loc[0, 'year_of_conflict'] = 1
i = 1
for idx in dataset.index[1:]:
    # Same conflict (same acdid/location/sides) in the next consecutive year: increment the counter
    if list(dataset.loc[idx][['acdid', 'location', 'sidea', 'sideb']]) == list(dataset.loc[idx-1][['acdid', 'location', 'sidea', 'sideb']]) and dataset.loc[idx, 'year'] == dataset.loc[idx-1]['year'] + 1:
        i += 1
    # Same conflict, same year (multiple records for one conflict-year): keep the counter as-is
    elif list(dataset.loc[idx][['acdid', 'location', 'sidea', 'sideb']]) == list(dataset.loc[idx-1][['acdid', 'location', 'sidea', 'sideb']]) and dataset.loc[idx, 'year'] == dataset.loc[idx-1]['year']:
        pass
    else:
        i = 1
    dataset.loc[idx, 'year_of_conflict'] = i
dataset
acdid | location | sidea | sideb | model_weight | mobcap | fightcap | intens | polity | is_leftist_group | ... | amnesty | purge | exile | incomp | terrcont | year | year_of_conflict | conciliatory | coercive | DCJ_used | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bolivia | Bolivia | Popular Revolutionary Movement | 1.0 | moderate | moderate | 2 | 0 | 0 | ... | 0 | 0 | 0 | 2 | no | 1946 | 1 | 0 | 0 | 0 |
1 | 1 | Bolivia | Bolivia | MNR | 1.0 | moderate | moderate | 1 | 0 | 0 | ... | 0 | 0 | 0 | 2 | no | 1952 | 1 | 0 | 0 | 0 |
2 | 1 | Bolivia | Bolivia | ELN | 1.0 | low | low | 1 | 0 | 1 | ... | 0 | 0 | 0 | 2 | no | 1967 | 1 | 0 | 0 | 0 |
3 | 2 | Cambodia (Kampuchea) | France | Khmer Issarak | 1.0 | low | low | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | yes | 1946 | 1 | 0 | 0 | 0 |
4 | 2 | Cambodia (Kampuchea) | France | Khmer Issarak | 1.0 | low | low | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | yes | 1947 | 2 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4302 | 271 | Libya | Libya | Forces of Muammar Gaddafi | 0.5 | low | moderate | 2 | 0 | 0 | ... | 0 | 0 | 0 | 2 | yes | 2011 | 1 | 1 | 0 | 1 |
4303 | 271 | Libya | Libya | Forces of Muammar Gaddafi | 0.5 | low | moderate | 2 | 0 | 0 | ... | 0 | 0 | 0 | 2 | yes | 2011 | 1 | 0 | 1 | 1 |
4304 | 271 | Libya | Libya | Forces of Muammar Gaddafi | 0.5 | low | moderate | 2 | 0 | 0 | ... | 0 | 0 | 0 | 2 | yes | 2011 | 1 | 0 | 1 | 1 |
4305 | 271 | Libya | Libya | Forces of Muammar Gaddafi | 0.5 | low | moderate | 2 | 0 | 0 | ... | 1 | 0 | 0 | 2 | yes | 2011 | 1 | 1 | 0 | 1 |
4306 | 271 | Libya | Libya | Forces of Muammar Gaddafi | 0.5 | low | moderate | 2 | 0 | 0 | ... | 0 | 0 | 0 | 2 | yes | 2011 | 1 | 0 | 1 | 1 |
4307 rows × 23 columns
The authors code their 'democratic regime' variable based on the polity2 scores found in the data, with 6 and up being considered a democratic regime. This process is repeated in the cell below, returning the results as a binary variable in place in the polity column.
dataset['polity'] = dataset['polity'].apply(lambda x : 1 if x >= 6 else 0)
Creates columns for coercive judicial processes (1/True if trial, exile, or purge was used), conciliatory processes (1/True if truth, rep, or amnesty was used), and DCJ_used (1/True if either coercive or conciliatory is true). Note that the elif structure below makes the two categories mutually exclusive: a conflict-year with both types of process is coded as conciliatory.
dataset['conciliatory'], dataset['coercive'] = pd.Series(), pd.Series()
dataset['DCJ_used'] = pd.Series()
for idx in dataset.index:
    if (dataset.loc[idx, 'truth'] == 1
            or dataset.loc[idx, 'rep'] == 1
            or dataset.loc[idx, 'amnesty'] == 1):
        dataset.loc[idx, 'conciliatory'] = 1
        dataset.loc[idx, 'coercive'] = 0
        dataset.loc[idx, 'DCJ_used'] = 1
    elif (dataset.loc[idx, 'trial'] == 1
            or dataset.loc[idx, 'exile'] == 1
            or dataset.loc[idx, 'purge'] == 1):
        dataset.loc[idx, 'coercive'] = 1
        dataset.loc[idx, 'conciliatory'] = 0
        dataset.loc[idx, 'DCJ_used'] = 1
    else:
        dataset.loc[idx, 'coercive'] = 0
        dataset.loc[idx, 'conciliatory'] = 0
        dataset.loc[idx, 'DCJ_used'] = 0
Mobility Capacity is coded, per the authors, as a binary variable with moderate and above being 1/True.
mobcap_dict = {
'low': 0,
'no': 0,
'moderate': 1,
'high': 1
}
dataset['mobcap'].replace(mobcap_dict, inplace=True)
Incomp is rated 1 in the original data if the incompatibility is territory, and 2 if it is government. Because we're only interested in territory as the 'True' category, we'll just change the 2s to 0s.
dataset['incomp'].replace({2:0}, inplace=True)
This variable just needs to be converted from 'yes'/'no' to 1/0 notation
dataset['terrcont'].replace({'yes': 1, 'no': 0}, inplace=True)
Adds a cold_war column based on whether the year is before or after 1989, as per the study
dataset['cold_war'] = pd.Series()
for idx in dataset.index:
    if dataset.loc[idx, 'year'] >= 1989:
        dataset.loc[idx, 'cold_war'] = 0
    else:
        dataset.loc[idx, 'cold_war'] = 1
The authors use a binary variable for intensity, which is coded as 1- not intense and 2- intense in the original data. We can just subtract 1 from each row to get to the binary notation we need.
dataset['intens'] -= 1
This cell codes all binary variables as boolean true/false variables to prepare the data for regression.
for i in ['mobcap', 'intens', 'polity', 'is_leftist_group', 'trial', 'truth', 'rep', 'amnesty', 'exile', 'purge', 'conciliatory', 'coercive', 'cold_war', 'incomp', 'terrcont', 'DCJ_used']:
dataset[i].replace({1: True, 0: False}, inplace=True)
Fighting capacity remains a 4-category column in the data, so we'll use get_dummies to one-hot encode it. Each category will end up in its own column.
dataset = pd.get_dummies(dataset, columns=['fightcap'])
dataset.to_csv('CleanData.csv')
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(penalty=None)
There are just a couple of NA rows in the data; we'll fill them with False.
dataset['terrcont'].fillna(False, inplace=True)
dataset['mobcap'].fillna(False, inplace=True)
logreg.fit(dataset[['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity', 'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war']], dataset['DCJ_used'], sample_weight=dataset['model_weight'])
LogisticRegression(penalty=None)
coefs = pd.DataFrame(logreg.coef_)
for i in range(11):
coefs.rename({i: ['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity', 'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war'][i]}, axis=1, inplace=True)
coefs.index = ['All DCJ Processes']
logreg2 = LogisticRegression(penalty=None)
logreg2.fit(dataset[['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity', 'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war']], dataset['conciliatory'], sample_weight=dataset['model_weight'])
coefs2 = pd.DataFrame(logreg2.coef_)
for i in range(11):
coefs2.rename({i: ['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity', 'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war'][i]}, axis=1, inplace=True)
coefs2.index = ['Conciliatory Processes']
logreg3 = LogisticRegression(penalty=None)
logreg3.fit(dataset[['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity', 'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war']], dataset['coercive'], sample_weight=dataset['model_weight'])
coefs3 = pd.DataFrame(logreg3.coef_)
for i in range(11):
coefs3.rename({i: ['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity', 'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war'][i]}, axis=1, inplace=True)
coefs3.index = ['Coercive Processes']
These next two analyses include only rows where a DCJ process was used, in order to compare conciliatory process use directly to coercive process use among conflict years where some DCJ process occurred.
dataset2 = dataset[dataset['DCJ_used'] == True]
logreg4 = LogisticRegression(penalty=None)
logreg4.fit(dataset2[['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity', 'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war']], dataset2['conciliatory'], sample_weight=dataset2['model_weight'])
coefs4 = pd.DataFrame(logreg4.coef_)
for i in range(11):
coefs4.rename({i: ['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity', 'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war'][i]}, axis=1, inplace=True)
coefs4.index = ['Conciliatory Processes vs. Coercive']
logreg5 = LogisticRegression(penalty=None)
logreg5.fit(dataset2[['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity', 'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war']], dataset2['coercive'], sample_weight=dataset2['model_weight'])
coefs5 = pd.DataFrame(logreg5.coef_)
for i in range(11):
coefs5.rename({i: ['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity', 'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war'][i]}, axis=1, inplace=True)
coefs5.index = ['Coercive Processes vs. Conciliatory']
coefs['constant'] = logreg.intercept_
coefs2['constant'] = logreg2.intercept_
coefs3['constant'] = logreg3.intercept_
coefs4['constant'] = logreg4.intercept_
coefs5['constant'] = logreg5.intercept_
from sklearn.metrics import log_loss
coefs['log_likelihood'] = -log_loss(logreg.predict(dataset[['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity', 'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war']]), dataset['DCJ_used'])
coefs2['log_likelihood'] = -log_loss(logreg2.predict(dataset[['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity', 'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war']]), dataset['conciliatory'])
coefs3['log_likelihood'] = -log_loss(logreg3.predict(dataset[['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity', 'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war']]), dataset['coercive'])
coefs4['log_likelihood'] = -log_loss(logreg4.predict(dataset2[['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity', 'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war']]), dataset2['conciliatory'])
coefs5['log_likelihood'] = -log_loss(logreg5.predict(dataset2[['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity', 'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war']]), dataset2['coercive'])
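As an aside, a minimal sketch of computing a total (weighted) log-likelihood directly from predicted class probabilities; `features` is just shorthand introduced here for the predictor list used above, and this is not the computation used for the tables below:

```python
# Sketch: total log-likelihood from predicted probabilities (log_loss averages by default;
# normalize=False returns the sum, and sample_weight applies the model weights used in fitting).
features = ['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity',
            'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war']
probs = logreg.predict_proba(dataset[features])
ll_dcj = -log_loss(dataset['DCJ_used'], probs, normalize=False, sample_weight=dataset['model_weight'])
```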
results = pd.concat([coefs, coefs2, coefs3, coefs4, coefs5]).T
results
All DCJ Processes | Conciliatory Processes | Coercive Processes | Conciliatory Processes vs. Coercive | Coercive Processes vs. Conciliatory | |
---|---|---|---|---|---|
mobcap | 0.661976 | 0.388605 | 0.241040 | 0.240883 | -0.240883 |
fightcap_high | -0.449099 | 0.457280 | -0.829236 | 0.958880 | -0.958879 |
fightcap_moderate | 0.103389 | -0.053578 | 0.222631 | -0.200837 | 0.200837 |
fightcap_low | 0.167188 | -0.516068 | 0.515424 | -0.650171 | 0.650172 |
intens | 0.634839 | 0.191754 | 0.336639 | -0.070971 | 0.070971 |
polity | 1.551596 | -0.125052 | 1.304639 | -0.743485 | 0.743485 |
is_leftist_group | -0.472057 | -0.310735 | -0.119309 | -0.307852 | 0.307852 |
incomp | -0.976216 | -0.347354 | -0.527997 | -0.059748 | 0.059748 |
terrcont | -0.769757 | 0.134406 | -0.792893 | 0.599034 | -0.599034 |
year_of_conflict | 0.079885 | 0.024635 | 0.030991 | 0.014480 | -0.014480 |
cold_war | -1.734773 | -0.812434 | -0.942777 | 0.049247 | -0.049247 |
constant | 0.788109 | -0.868502 | -0.627620 | -0.231353 | 0.231352 |
log_likelihood | -7.054748 | -7.121697 | -10.276667 | -10.238452 | -10.238452 |
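For interpretation, the fitted log-odds coefficients above can be converted to odds ratios; a minimal sketch (variable names here are placeholders, and `results` is the coefficient table just built), mirroring the `exp(coef(model)) - 1` step in the R replication further below:

```python
# Sketch: exponentiating a logit coefficient gives the multiplicative change in odds.
import numpy as np
odds_ratios = np.exp(results.drop(index='log_likelihood'))
pct_change_in_odds = odds_ratios - 1  # e.g. a value of 0.25 means +25% odds when the predictor is True
pct_change_in_odds
```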
dataset3 = dataset[dataset['conciliatory'] == True]
logreg6 = LogisticRegression(penalty=None)
logreg6.fit(dataset3[
['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity', 'is_leftist_group',
'incomp', 'terrcont', 'year_of_conflict', 'cold_war']], dataset3['truth'],
sample_weight=dataset3['model_weight'])
coefs6 = pd.DataFrame(logreg6.coef_)
for i in range(11):
coefs6.rename({i: ['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity',
'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war'][i]}, axis=1,
inplace=True)
coefs6.index = ['Truth Commissions']
logreg7 = LogisticRegression(penalty=None)
logreg7.fit(dataset3[
['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity', 'is_leftist_group',
'incomp', 'terrcont', 'year_of_conflict', 'cold_war']], dataset3['rep'],
sample_weight=dataset3['model_weight'])
coefs7 = pd.DataFrame(logreg7.coef_)
for i in range(11):
coefs7.rename({i: ['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity',
'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war'][i]}, axis=1,
inplace=True)
coefs7.index = ['Reparations']
logreg8 = LogisticRegression(penalty=None)
logreg8.fit(dataset3[
['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity', 'is_leftist_group',
'incomp', 'terrcont', 'year_of_conflict', 'cold_war']], dataset3['amnesty'],
sample_weight=dataset3['model_weight'])
coefs8 = pd.DataFrame(logreg8.coef_)
for i in range(11):
coefs8.rename({i: ['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity',
'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war'][i]}, axis=1,
inplace=True)
coefs8.index = ['Amnesty']
dataset4 = dataset[dataset['coercive'] == True]
logreg9 = LogisticRegression(penalty=None)
logreg9.fit(dataset4[
['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity', 'is_leftist_group',
'incomp', 'terrcont', 'year_of_conflict', 'cold_war']], dataset4['trial'],
sample_weight=dataset4['model_weight'])
coefs9 = pd.DataFrame(logreg9.coef_)
for i in range(11):
coefs9.rename({i: ['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity',
'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war'][i]}, axis=1,
inplace=True)
coefs9.index = ['Trials']
logreg10 = LogisticRegression(penalty=None)
logreg10.fit(dataset4[
['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity', 'is_leftist_group',
'incomp', 'terrcont', 'year_of_conflict', 'cold_war']], dataset4['exile'],
sample_weight=dataset4['model_weight'])
coefs10 = pd.DataFrame(logreg10.coef_)
for i in range(11):
coefs10.rename({i: ['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity',
'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war'][i]}, axis=1,
inplace=True)
coefs10.index = ['Exiles']
logreg11 = LogisticRegression(penalty=None)
logreg11.fit(dataset4[
['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity', 'is_leftist_group',
'incomp', 'terrcont', 'year_of_conflict', 'cold_war']], dataset4['purge'],
sample_weight=dataset4['model_weight'])
coefs11 = pd.DataFrame(logreg11.coef_)
for i in range(11):
coefs11.rename({i: ['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity',
'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war'][i]}, axis=1,
inplace=True)
coefs11.index = ['Purges']
coefs6['constant'] = logreg6.intercept_
coefs7['constant'] = logreg7.intercept_
coefs8['constant'] = logreg8.intercept_
coefs9['constant'] = logreg9.intercept_
coefs10['constant'] = logreg10.intercept_
coefs11['constant'] = logreg11.intercept_
dataset3
acdid | location | sidea | sideb | model_weight | mobcap | intens | polity | is_leftist_group | trial | ... | terrcont | year | year_of_conflict | conciliatory | coercive | DCJ_used | cold_war | fightcap_high | fightcap_low | fightcap_moderate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
31 | 6 | Iran | Iran | KDPI | 1.0 | False | True | False | False | False | ... | False | 1979 | 1 | True | False | True | True | False | True | False |
32 | 6 | Iran | Iran | KDPI | 1.0 | False | True | False | False | False | ... | True | 1979 | 1 | True | False | True | True | False | True | False |
33 | 6 | Iran | Iran | KDPI | 1.0 | False | True | False | False | False | ... | False | 1979 | 1 | True | False | True | True | False | True | False |
34 | 6 | Iran | Iran | KDPI | 1.0 | False | True | False | False | False | ... | True | 1979 | 1 | True | False | True | True | False | True | False |
43 | 6 | Iran | Iran | KDPI | 1.0 | False | False | False | False | False | ... | False | 1984 | 6 | True | False | True | True | False | True | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4295 | 271 | Libya | Libya | NTC | 0.5 | True | True | False | False | False | ... | True | 2011 | 1 | True | False | True | False | False | False | True |
4298 | 271 | Libya | Libya | NTC | 0.5 | True | True | False | False | False | ... | True | 2011 | 1 | True | False | True | False | False | False | True |
4301 | 271 | Libya | Libya | Forces of Muammar Gaddafi | 0.5 | False | True | False | False | False | ... | True | 2011 | 1 | True | False | True | False | False | False | True |
4302 | 271 | Libya | Libya | Forces of Muammar Gaddafi | 0.5 | False | True | False | False | False | ... | True | 2011 | 1 | True | False | True | False | False | False | True |
4305 | 271 | Libya | Libya | Forces of Muammar Gaddafi | 0.5 | False | True | False | False | False | ... | True | 2011 | 1 | True | False | True | False | False | False | True |
848 rows × 26 columns
coefs6['log_likelihood'] = -log_loss(y_pred=logreg6.predict(dataset3[['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity', 'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war']]), y_true=dataset3['truth'], labels=[True, False])
coefs7['log_likelihood'] = -log_loss(logreg7.predict(dataset3[['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity', 'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war']]), dataset3['rep'])
coefs8['log_likelihood'] = -log_loss(logreg8.predict(dataset3[['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity', 'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war']]), dataset3['amnesty'])
coefs9['log_likelihood'] = -log_loss(logreg9.predict(dataset4[['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity', 'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war']]), dataset4['trial'], labels=[True, False])
coefs10['log_likelihood'] = -log_loss(logreg10.predict(dataset4[['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity', 'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war']]), dataset4['exile'], labels=[True, False])
coefs11['log_likelihood'] = -log_loss(logreg11.predict(dataset4[['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity', 'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war']]), dataset4['purge'], labels=[True, False])
results = pd.concat([coefs, coefs2, coefs3, coefs4, coefs5, coefs6, coefs7, coefs8, coefs9, coefs10, coefs11]).T
results
All DCJ Processes | Conciliatory Processes | Coercive Processes | Conciliatory Processes vs. Coercive | Coercive Processes vs. Conciliatory | Truth Commissions | Reparations | Amnesty | Trials | Exiles | Purges | |
---|---|---|---|---|---|---|---|---|---|---|---|
mobcap | 0.661976 | 0.388605 | 0.241040 | 0.240883 | -0.240883 | 0.388710 | 0.115131 | -0.258683 | 0.490600 | -0.475132 | -0.460048 |
fightcap_high | -0.449099 | 0.457280 | -0.829236 | 0.958880 | -0.958879 | 9.692238 | -1.852436 | 0.296176 | -1.269715 | -4.782811 | 7.576780 |
fightcap_moderate | 0.103389 | -0.053578 | 0.222631 | -0.200837 | 0.200837 | 8.170877 | -0.838711 | 0.501842 | -0.671203 | -0.996337 | 6.767420 |
fightcap_low | 0.167188 | -0.516068 | 0.515424 | -0.650171 | 0.650172 | 8.116678 | -1.293090 | 0.898837 | -0.252559 | -0.554519 | 5.739381 |
intens | 0.634839 | 0.191754 | 0.336639 | -0.070971 | 0.070971 | 0.289315 | 0.222522 | -0.324709 | -0.351910 | 0.299519 | 0.431241 |
polity | 1.551596 | -0.125052 | 1.304639 | -0.743485 | 0.743485 | 0.703958 | 0.497264 | -0.732421 | 0.362496 | -0.065839 | -0.673826 |
is_leftist_group | -0.472057 | -0.310735 | -0.119309 | -0.307852 | 0.307852 | -1.536770 | 0.749356 | -0.194584 | 0.710211 | -1.167977 | -0.176958 |
incomp | -0.976216 | -0.347354 | -0.527997 | -0.059748 | 0.059748 | 0.043450 | 0.155515 | -0.156774 | -0.032337 | 0.021043 | -0.007348 |
terrcont | -0.769757 | 0.134406 | -0.792893 | 0.599034 | -0.599034 | -0.333139 | -0.730111 | 0.747777 | -0.112496 | 0.529108 | -0.322486 |
year_of_conflict | 0.079885 | 0.024635 | 0.030991 | 0.014480 | -0.014480 | 0.018288 | 0.026035 | -0.029845 | -0.003418 | 0.015373 | -0.016281 |
cold_war | -1.734773 | -0.812434 | -0.942777 | 0.049247 | -0.049247 | -0.255991 | -0.913336 | 0.816814 | -0.473723 | 0.546352 | 0.361043 |
constant | 0.788109 | -0.868502 | -0.627620 | -0.231353 | 0.231352 | -10.741586 | -0.221894 | 0.217306 | 2.356162 | -2.356438 | -8.609149 |
log_likelihood | -7.054748 | -7.121697 | -10.276667 | -10.238452 | -10.238452 | -3.910396 | -8.500862 | -11.986215 | -3.190586 | -1.678381 | -1.512205 |
Note: Truth, exile and purge each contain fewer than 100 "true" rows.
import statsmodels.api as sm
# Add a constant term to the features (intercept)
X_with_intercept = sm.add_constant(np.array(dataset3[['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity', 'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war']]))
# Fit the model using statsmodels
logit_model = sm.Logit(np.array(dataset3['truth']), X_with_intercept).fit()
# Get the p-values
p_values = logit_model.pvalues
print(p_values)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[75], line 5
      2 X_with_intercept = sm.add_constant(np.array(dataset3[['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity', 'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war']]))
      4 # Fit the model using statsmodels
----> 5 logit_model = sm.Logit(np.array(dataset3['truth']), X_with_intercept).fit()
...
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
import numpy as np
dataset5 = dataset.replace({True: 1, False: 0})
dataset5.to_csv('CleanData1_0.csv')
dataset5[dataset5['conciliatory'] == 1].to_csv('CleanData_conc.csv')
dataset5[dataset5['coercive'] == 1].to_csv('CleanData_coer.csv')
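The error above appears to come from the boolean (object-dtype) columns; below is a minimal sketch of the same p-value check on the 1/0 version of the data, assuming the converted frame resolves the dtype issue. `features` is shorthand for the predictor list used above, and this fit is unweighted, so it will not exactly match the weighted models elsewhere.

```python
# Sketch: refit Model 1 (all DCJ processes) with statsmodels on the 1/0 data to get p-values.
features = ['mobcap', 'fightcap_high', 'fightcap_moderate', 'fightcap_low', 'intens', 'polity',
            'is_leftist_group', 'incomp', 'terrcont', 'year_of_conflict', 'cold_war']
X_with_intercept = sm.add_constant(dataset5[features].astype(float))
logit_model = sm.Logit(dataset5['DCJ_used'].astype(float), X_with_intercept).fit(disp=0)
print(logit_model.pvalues)
```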
df_3
acdid | year | location | gwno | region | epid | styear | endyear | epend | dcjdummy | ... | epdum_govdcj | epdum_rebdcj | polity | regime | rebstrength | bdeadbes | bdeadchgrel | chgbdeadrel | outcome | conflterm | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 100 | 1966 | Nigeria | 475.0 | 4 | 100_1966 | 1966 | 1966 | 1 | 0 | ... | 0 | 0 | -7.0 | 3.0 | 4.0 | 20.0 | NaN | NaN | 4.0 | 2.0 |
2 | 100 | 2009 | Nigeria | 475.0 | 4 | 100_2009 | 2009 | 2009 | 0 | 3 | ... | 1 | 0 | 4.0 | 2.0 | 1.0 | NaN | NaN | NaN | NaN | NaN |
3 | 100 | 2011 | Nigeria | 475.0 | 4 | 100_2011 | 2011 | 2011 | 0 | 14 | ... | 1 | 0 | 4.0 | 2.0 | 1.0 | NaN | NaN | NaN | NaN | NaN |
4 | 101 | 1966 | South Africa | 560.0 | 4 | 101_1966-1988 | 1966 | 1988 | 0 | 0 | ... | 1 | 0 | 4.0 | 2.0 | 2.0 | NaN | NaN | NaN | NaN | NaN |
5 | 101 | 1967 | South Africa | 560.0 | 4 | 101_1966-1988 | 1966 | 1988 | 0 | 1 | ... | 1 | 0 | 4.0 | 2.0 | 2.0 | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1933 | 9 | 1949 | Laos | 812.0 | 3 | 9_1946-1953 | 1946 | 1953 | 0 | 0 | ... | 0 | 0 | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN |
1934 | 9 | 1950 | Laos | 812.0 | 3 | 9_1946-1953 | 1946 | 1953 | 0 | 0 | ... | 0 | 0 | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN |
1935 | 9 | 1951 | Laos | 812.0 | 3 | 9_1946-1953 | 1946 | 1953 | 0 | 0 | ... | 0 | 0 | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN |
1936 | 9 | 1952 | Laos | 812.0 | 3 | 9_1946-1953 | 1946 | 1953 | 0 | 0 | ... | 0 | 0 | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN |
1937 | 9 | 1953 | Laos | 812.0 | 3 | 9_1946-1953 | 1946 | 1953 | 1 | 0 | ... | 0 | 0 | NaN | NaN | 1.0 | NaN | NaN | NaN | 5.0 | 3.0 |
1937 rows × 66 columns
Quantitative Study Replication

Model 1: All DCJ used

```{r}
data <- read.csv('CleanData1_0.csv')
model <- glm(DCJ_used ~ mobcap + fightcap_high + fightcap_moderate + fightcap_low + intens + polity + is_leftist_group + incomp + terrcont + year_of_conflict + cold_war, family = binomial(link = "logit"), data = data)
coefs <- summary(model)$coefficients
odds <- exp(coef(model)) - 1
summary(model)
```

Model 2: Conciliatory Processes

```{r}
model <- glm(conciliatory ~ mobcap + fightcap_high + fightcap_moderate + fightcap_low + intens + polity + is_leftist_group + incomp + terrcont + year_of_conflict + cold_war, family = binomial(link = "logit"), data = data)
summary(model)
coefs <- summary(model)$coefficients
odds <- exp(coef(model)) - 1
data.frame(coefs, odds)
```

Model 3: Coercive Processes

```{r}
model <- glm(coercive ~ mobcap + fightcap_high + fightcap_moderate + fightcap_low + intens + polity + is_leftist_group + incomp + terrcont + year_of_conflict + cold_war, family = binomial(link = "logit"), data = data)
summary(model)
coefs <- summary(model)$coefficients
odds <- exp(coef(model)) - 1
data.frame(coefs, odds)
```

Model 4: Truth Commissions

```{r}
data <- read.csv('CleanData_conc.csv')
model <- glm(truth ~ mobcap + fightcap_high + fightcap_moderate + fightcap_low + intens + polity + is_leftist_group + incomp + terrcont + year_of_conflict + cold_war, family = binomial(link = "logit"), data = data)
summary(model)
coefs <- summary(model)$coefficients
odds <- exp(coef(model)) - 1
data.frame(coefs, odds)
```

Model 5: Reparations

```{r}
model <- glm(rep ~ mobcap + fightcap_high + fightcap_moderate + fightcap_low + intens + polity + is_leftist_group + incomp + terrcont + year_of_conflict + cold_war, family = binomial(link = "logit"), data = data)
summary(model)
coefs <- summary(model)$coefficients
odds <- exp(coef(model)) - 1
data.frame(coefs, odds)
```

Model 6: Amnesty

```{r}
model <- glm(amnesty ~ mobcap + fightcap_high + fightcap_moderate + fightcap_low + intens + polity + is_leftist_group + incomp + terrcont + year_of_conflict + cold_war, family = binomial(link = "logit"), data = data)
summary(model)
coefs <- summary(model)$coefficients
odds <- exp(coef(model)) - 1
data.frame(coefs, odds)
```

Model 7: Exile

```{r}
data <- read.csv('CleanData_coer.csv')
model <- glm(exile ~ mobcap + fightcap_high + fightcap_moderate + fightcap_low + intens + polity + is_leftist_group + incomp + terrcont + year_of_conflict + cold_war, family = binomial(link = "logit"), data = data)
summary(model)
coefs <- summary(model)$coefficients
odds <- exp(coef(model)) - 1
data.frame(coefs, odds)
```

Model 8: Purge

```{r}
model <- glm(purge ~ mobcap + fightcap_high + fightcap_moderate + fightcap_low + intens + polity + is_leftist_group + incomp + terrcont + year_of_conflict + cold_war, family = binomial(link = "logit"), data = data)
summary(model)
coefs <- summary(model)$coefficients
odds <- exp(coef(model)) - 1
data.frame(coefs, odds)
```

Model 9: Trial

```{r}
model <- glm(trial ~ mobcap + fightcap_high + fightcap_moderate + fightcap_low + intens + polity + is_leftist_group + incomp + terrcont + year_of_conflict + cold_war, family = binomial(link = "logit"), data = data)
summary(model)
coefs <- summary(model)$coefficients
odds <- exp(coef(model)) - 1
data.frame(coefs, odds)
```
import google.generativeai as genai


class Embeddings():
    def __init__(self, api_key, model='models/text-embedding-004', dim=64):
        self.model, self.dim = model, dim
        genai.configure(api_key=api_key)

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        embeddings = [genai.embed_content(model=self.model, content=text,
                                          task_type='RETRIEVAL_DOCUMENT',
                                          output_dimensionality=self.dim)['embedding']
                      for text in texts]
        return embeddings

    def embed_query(self, text: str) -> list[float]:
        return genai.embed_content(model=self.model, content=text, task_type='RETRIEVAL_DOCUMENT',
                                   output_dimensionality=self.dim)['embedding']
This notebook will document the steps involved in creating a custom implementation of the LangChain embeddings class. The idea of this implementation is to be a lightweight alternative to the HuggingFaceEmbeddings class, which I was previously using for this integration but which takes up a great deal of disk space during installation.
import google.generativeai as genai
from pymongo import MongoClient
from langchain.vectorstores import MongoDBAtlasVectorSearch
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
with open('google_api_key.txt') as f:
api_key = f.read()
with open('mongo_info.txt') as f:
(user, password, url) = f.readlines()
mongo_uri = f'mongodb+srv://{user.strip()}:{password.strip()}@{url.strip()}/?retryWrites=true&w=majority&appName=website-database'
This class works by getting embeddings from Google's Gecko model. It follows the abstract methods outlined on LangChain's github, and will serve my needs just fine. Most importantly, this class accomplishes in just a few lines of code what I was previously unable to fit onto the server space I have available with the AWS free tier.
class Embeddings():
def __init__(self, model='models/text-embedding-004', api_key=api_key, dim=64):
self.model, self.dim = model, dim
genai.configure(api_key=api_key)
def embed_documents(self, texts: list[str]) -> list[list[float]]:
embeddings = [genai.embed_content(model=self.model, content=text,
task_type='RETRIEVAL_DOCUMENT',
output_dimensionality=self.dim)['embedding']
for text in texts]
return embeddings
def embed_query(self, text: str) -> list[float]:
return genai.embed_content(model=self.model, content=text, task_type='RETRIEVAL_DOCUMENT', output_dimensionality=self.dim)['embedding']
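As a quick sanity check (the query text here is arbitrary), each call should come back as a 64-dimensional vector, since output_dimensionality is set to 64:

```python
# Hypothetical sanity check: the returned embedding length should match dim=64.
emb = Embeddings()
vec = emb.embed_query('What programs does Eastern University offer?')
print(len(vec))  # expected: 64
```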
This is a list of webpages with program descriptions and other pages related to my education:
ed = [
'https://www.eastern.edu/academics/colleges-seminary/college-health-and-sciences/departments/department-mathematical-5',
'https://www.eastern.edu/academics/colleges-seminary/college-health-and-sciences/departments/department-mathematical-6',
'https://www.eastern.edu/academics/colleges-seminary/college-health-and-sciences/departments/ms-data-faqs',
'https://www.eastern.edu/academics/colleges-seminary/college-health-and-sciences/departments/department-mathematical-10',
'https://news.asu.edu/20210322-university-news-asu-will-lead-effort-upskill-reskill-workforce-through-8m-grant',
'https://degrees.apps.asu.edu/minors/major/ASU00/BABDACERT/applied-business-data-analytics?init=false&nopassive=true',
'https://aznext.pipelineaz.com/static_assets/sites/aznext.pipelineaz.com/AZNext.Brochure.-.ASU.Salesforce.Developer.Academy.participants.pdf',
'https://www.alfred.edu/academics/undergrad-majors-minors/environmental-studies.cfm',
'https://www.alfred.edu/about/',
'https://www.ucvts.org/domain/300'
]
from langchain.document_loaders import WebBaseLoader
pages = [WebBaseLoader(url).load() for url in ed]
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0, separators=[
"\n\n", "\n", "(?<=\. )", " "], length_function=len)
docs = [text_splitter.split_documents(page) for page in pages]
client = MongoClient(mongo_uri)
collection = client['website-database']['education-v2']
embeddings = Embeddings()
docsearches = [MongoDBAtlasVectorSearch.from_documents(
doc, embeddings, collection=collection
) for doc in docs]
This is an object in the Python code that allows LangChain to connect to MongoDB and search its records
vector_search = MongoDBAtlasVectorSearch.from_connection_string(
mongo_uri,
'website-database.education-v2',
embeddings,
index_name="vector_index"
)
retriever = vector_search.as_retriever(search_type="similarity", search_kwargs={"k": 15})
model = ChatGoogleGenerativeAI(model='gemini-1.5-flash', api_key=api_key)
prompt = hub.pull('rlm/rag-prompt')
def format_docs(docs):
return "\n\n".join(doc.page_content for doc in docs)
rag_chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| model
| StrOutputParser()
)
query = 'Tell me about Eastern University\'s Masters in Data Science program'
response = ' '.join([chunk for chunk in rag_chain.stream(query)])
response
"Eastern University offers a Master's in Data Science program that has been highly ranked by several organizations. The program includes a curriculum that covers various aspects of data science , and the university provides information about admissions requirements and student learning outcomes. You can find more details on the Eastern University website. \n"
query = 'Tell me about the Advanced Business Data Analytics program at ASU'
response = ' '.join([chunk for chunk in rag_chain.stream(query)])
response
'The Applied Business Data Analytics certificate program at Arizona State University (ASU) is offered by the W. P. Carey School of Business. It is available both online and in person in Tempe. The program focuses on practical applications of computer-based tools for managing and analyzing large datasets, including predictive analytics, big data techniques, and visualization. \n'
This retriever is lightweight, will fit on my website, and does a pretty good job with only 64-dimensional vectors. I'd call this project a success!
import numpy as np
import pandas as pd
from GoogleEmbeddings import Embeddings
from tinydb import TinyDB, Query
from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda


class TinyDBRetriever():
    def __init__(self, tinydb_filepath: str, google_api_key: str, k: int):
        self.tinydb_filepath = tinydb_filepath
        self.google_api_key = google_api_key
        self.k = k

    def embedQuery(self, query: str):
        # Embed the incoming query with the same Google embedding model used for the stored questions
        embeddings = Embeddings(api_key=self.google_api_key)
        embedded_query = embeddings.embed_query(query)
        return embedded_query

    def getVecSearch(self) -> list[tuple[int, list[float]]]:
        # Load every stored (doc_id, question-embedding) pair into memory
        db = TinyDB(self.tinydb_filepath)
        table = db.table('_default')
        vec_search = [tuple((t.doc_id, t['question-embedded'])) for t in table.all()]
        db.close()
        return vec_search

    def getSimilarityScores(self, query: list[float], keys: list[tuple[int, list[float]]]) -> pd.DataFrame:
        # Cosine similarity between the query vector and each stored question vector
        scores = []
        for tup in keys:
            num = np.dot(query, tup[1])
            denom = np.sqrt(np.dot(query, query) * np.dot(tup[1], tup[1]))
            scores.append(tuple((tup[0], num / denom)))
        return pd.DataFrame(scores).set_index(0).sort_values(ascending=False, by=1).rename(columns={0: 'doc_id', 1: 'score'})

    def _get_relevant_documents(self, query: str) -> list[Document]:
        embedded_query = self.embedQuery(query)
        vecsearch = self.getVecSearch()
        scores = self.getSimilarityScores(embedded_query, vecsearch)[:self.k]
        db = TinyDB(self.tinydb_filepath)
        table = db.table('_default')
        docs = [Document(
            page_content=table.get(doc_id=doc[0])['answer'],
            metadata={"question": table.get(doc_id=doc[0])['question']})
            for doc in scores.iterrows()]
        return docs

    def as_retriever(self):
        return RunnableLambda(self._get_relevant_documents)
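A hypothetical way to wire this retriever into a chain, mirroring the Mongo-backed retriever earlier (the k value here is arbitrary):

```python
# Hypothetical usage sketch: build the retriever over the TinyDB file created below
# and ask it for the k most similar stored answers.
retriever = TinyDBRetriever(tinydb_filepath='personal-info.json',
                            google_api_key=open('google_api_key.txt').read(),
                            k=3).as_retriever()
docs = retriever.invoke('Tell me about Mark')
```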
In testing and receiving feedback from some of my initial users, one thing I found was that most of my users expected the AI to be able to answer basic questions about me in a way that it wasn't. To solve this problem, I'm going to create a local RAG system of questions and answers to common interview questions. This database is built with TinyDB, which offers efficient and lightweight algorithms for keeping a small, local DB. I will continue adding my answers to common questions to this dataset to improve the quality of answers.
from tinydb import TinyDB, Query
from GoogleEmbeddings import Embeddings
This code splits the text read in from the file, and embeds it using the GoogleEmbeddings class.
with open('personal_info.txt') as f:
text = f.read()
text = text.split('\n* ~')
ls = [t.strip('~* ') for t in text]
dicts = []
embeddings = Embeddings(api_key=open('google_api_key.txt').read())
for i in range(len(ls)):
if i%2 == 0:
dicts.append({'question': ls[i],'answer': ls[i+1],
'question-embedded': embeddings.embed_query(ls[i]),
'answer-embedded': embeddings.embed_query(ls[i+1])})
Here is an example of a question and answer that I'll be inserting into the database.
dicts[0]['question']
'Tell me about Mark'
dicts[0]['answer']
'Mark grew up in New Jersey, and first got interested in code when he learned to program using DarkBASIC in middle school. He attended an IT-focused high school program where he continued to develop his skills, adding programming in Python and Java to his skillset along with certifications in SQL, Microsoft Excel, and Comptia A+. He went on to attend Alfred University for Environmental Science. After that, Mark had a couple of jobs in that field, including working as a Park Ranger, and as an intern working with critically endangered birds. Following that, he discovered a passion for cooking and pursued it for years, eventually becoming a kitchen manager. When he decided it was time for a change, he went back to school for a Master’s degree in data science, which he is now finishing. Mark is very interested in Natural Language Processing and Bayesian Statistics.'
This cell creates a DB and a query object in TinyDB, and the next inserts the dictionaries created in the previous step as documents. That function returns a list of the doc_ids associated with the inserts.
db = TinyDB('personal-info.json')
User = Query()
db.insert_multiple(dicts)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
The next two cells create a table object with the default table created by TinyDB, and then a vector search object with a pair of vector and scalar indices for each entry. For now, I have a small enough dataset that I can hold all of these vectors in memory, and use a naive algorithm to find the best match, but I will revisit this in the near future to create a more efficient search algorithm that considers time and space complexity.
table = db.table('_default')
vec_search = [tuple((t.doc_id, t['question-embedded'])) for t in table.all()]
This cell defines a function to provide cosine similarity scores. Cosine similarity scores are a commonly used metric of comparison between vector embeddings.
import numpy as np
def cosine_similarity(vec1: list[float], vec2: list[float]) -> float:
num = np.dot(vec1, vec2)
denom = np.sqrt(np.dot(vec1, vec1) * np.dot(vec2, vec2))
return num/denom
Now, I can test it out. Below are some test queries, the resulting cosine similarity scores, and the highest scoring questions in the dataset.
import pandas as pd
query = 'Give me a reason I should hire Mark'
embedded_query = embeddings.embed_query(query)
scores = [tuple((row[0], cosine_similarity(embedded_query, row[1]))) for row in vec_search]
pd.DataFrame(scores).set_index(0).sort_values(ascending=False, by=1).rename(columns={0:'doc_id', 1:'score'})
doc_id | score |
---|---|
4 | 0.952122 |
1 | 0.890730 |
7 | 0.884191 |
13 | 0.851925 |
6 | 0.841705 |
8 | 0.833763 |
3 | 0.822064 |
5 | 0.821628 |
9 | 0.819010 |
12 | 0.804606 |
2 | 0.762136 |
10 | 0.757881 |
14 | 0.748708 |
11 | 0.746675 |
table.get(doc_id=4)['question']
'Why should we hire Mark?'
query = 'What should I know about Mark?'
embedded_query = embeddings.embed_query(query)
scores = [tuple((row[0], cosine_similarity(embedded_query, row[1]))) for row in vec_search]
pd.DataFrame(scores).set_index(0).sort_values(ascending=False, by=1).rename(columns={0:'doc_id', 1:'score'})
doc_id | score |
---|---|
1 | 0.951057 |
8 | 0.932492 |
9 | 0.925750 |
7 | 0.915157 |
4 | 0.913031 |
3 | 0.898955 |
6 | 0.893020 |
5 | 0.884995 |
10 | 0.855615 |
2 | 0.839801 |
13 | 0.837229 |
11 | 0.804068 |
14 | 0.799615 |
12 | 0.763757 |
table.get(doc_id=1)['question']
'Tell me about Mark'
table.get(doc_id=8)['question']
'What are Mark’s goals?'
table.get(doc_id=9)['question']
'What are Mark’s interests?'
query = 'What does Mark like to do on the weekends?'
embedded_query = embeddings.embed_query(query)
scores = [tuple((row[0], cosine_similarity(embedded_query, row[1]))) for row in vec_search]
pd.DataFrame(scores).set_index(0).sort_values(ascending=False, by=1).rename(columns={0:'doc_id', 1:'score'})
doc_id | score |
---|---|
3 | 0.929328 |
9 | 0.897411 |
1 | 0.879798 |
5 | 0.871701 |
8 | 0.868609 |
4 | 0.844823 |
7 | 0.840330 |
13 | 0.838112 |
2 | 0.817903 |
6 | 0.814839 |
11 | 0.814703 |
10 | 0.804551 |
12 | 0.760181 |
14 | 0.753478 |
table.get(doc_id=3)['question']
'What are Mark’s hobbies?'
In addition to implementing a more efficient search algorithm and adding more questions and answers to the dataset, in the near future I would like to create a training set of queries paired with 'golden documents.' This will allow me to get a concrete assessment of the retriever's accuracy, which in turn will let me investigate reducing the dimensionality of the vector index further with a reliable approach. I would also like to implement a 'skills' section of this database, where I will write up paragraphs about each of my skills, pulled from job descriptions I'm looking at, and then allow the model to search them using string matching and TinyDB's own search algorithms.
The final system will allow the chatbot to respond to users with context from both the vector store of scraped webpages and this local question-and-answer database. This system can also be integrated into the personalized resume creation process by generating skill descriptions that match keywords in the job description context.