Digging into Hillary Clinton's Emails - USA Presidential Candidate




A few months ago, Hillary Clinton released the emails she sent and received during her tenure as Secretary of State, in response to a FOIA request. You can download the extracted and normalized version of the raw email data here. Now, let's see what interesting information we can dig out of it.

To start off, what are the most common words in her emails? Using PyEnchant to filter out ordinary English dictionary words leaves us with some interesting names and incidents. Here are the script and the result.

import pandas as pd
import enchant
import re
from collections import Counter

# Read the raw email bodies and split them into lowercase word tokens.
email_dataframe = pd.read_csv("../input/Emails.csv")
raw_text_list = email_dataframe['RawText'].tolist()
word_list = []
for raw_text in raw_text_list:
    word_list.extend(re.findall(r'\w+', raw_text.lower()))

# Keep only the tokens the en_US dictionary does not recognise, then count them.
english_checker = enchant.Dict("en_US")
high_frequency_words = Counter(x for x in word_list if not english_checker.check(x)).most_common(100)


Word Count
cheryl 5981
huma 5735
abedin 5195
clintonemail 4997
b6 4951
sullivan 4027
fw 3847
obama 3610
clinton 3478
jacob 3459
hdr22 3022
abedinh 2786
american 2410
benghazi 2394
millscd 2383
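
The script above needs the Kaggle CSV and a PyEnchant dictionary to run. To see the counting idea in isolation, here is a minimal sketch on a few made-up lines. Both the sample text and the `COMMON_WORDS` set are assumptions for illustration only; the set is a crude stand-in for the dictionary check, not the real en_US word list.

```python
import re
from collections import Counter

# Hypothetical stand-in for enchant.Dict("en_US").check(); not the real dictionary.
COMMON_WORDS = {"the", "to", "from", "call", "please", "me", "about", "sent", "by"}


def uncommon_word_counts(texts, top_n=3):
    """Count tokens that are not in COMMON_WORDS, most frequent first."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"\w+", text.lower())
        counts.update(t for t in tokens if t not in COMMON_WORDS)
    return counts.most_common(top_n)


# Made-up sample lines, not taken from the dataset.
sample_texts = [
    "From Huma: please call Cheryl",
    "Sent by Cheryl to Huma about Benghazi",
    "Call me about Benghazi, Cheryl",
]
top_words = uncommon_word_counts(sample_texts)
print(top_words)
```

Updating a single Counter per email avoids building one giant word list first, which may matter if the input grows.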


How about the countries she communicated about the most? Here's the result of that attempt, with the assistance of pycountry. There is some noise in the final data (such as 'tv' and 'fm', which most probably refer to the respective media and not to Tuvalu and the Federated States of Micronesia). But on the whole, we can see which countries received most of her attention.

import pandas as pd
import re
import enchant
import pycountry
from collections import Counter

# Read all the text in the raw email bodies and split it into words.
email_dataframe = pd.read_csv("../input/Emails.csv")
raw_text_list = email_dataframe['RawText'].tolist()
word_list = []
for raw_text in raw_text_list:
    word_list.extend(re.findall(r'\w+', raw_text.lower()))

# Use the pycountry module to list all the countries, including their short forms.
# Example: all 3 forms of Pakistan are considered (pakistan, pk, pak).
# A short form that is also an English word, such as 'us', is left out of the search.
# Note: current pycountry releases name these attributes alpha_2 and alpha_3;
# older releases used alpha2 and alpha3.
english_checker = enchant.Dict("en_US")
country_list = set()  # a set keeps the membership test below fast
for country in pycountry.countries:
    country_list.add(country.name.strip().lower())
    if not english_checker.check(country.alpha_2.lower()):
        country_list.add(country.alpha_2.lower())
    if not english_checker.check(country.alpha_3.lower()):
        country_list.add(country.alpha_3.lower())

# Single-word tokens can only match single-word country names and codes.
high_frequency_countries = Counter(x for x in word_list if x in country_list).most_common(100)


Country Count
haiti 2154
israel 1827
al 1432
afghanistan 1324
pakistan 1285
libya 1256
china 1247
ve 1065
iraq 770
turkey 602
de 573
af 546
india 541
honduras 492
fm 490
egypt 465
ireland 388
mexico 355
germany 318
armenia 285
sudan 269
brazil 267
france 249
tv 244
au 235
bangladesh 210
colombia 207
indonesia 200
palau 196
ni 195
japan 191
ben 189
canada 178
usa 175
cuba 165
yemen 154
na 135
greece 133
spain 133
italy 132
congo 131
morocco 130
qatar 129
poland 128
kenya 121
uganda 119
guinea 116
jordan 112
kyrgyzstan 111
somalia 106
argentina 106
angola 106
pak 104
ie 103
singapore 92
tunisia 86
lebanon 81
ga 78
rwanda 77
australia 74
chile 71
im 69
ecuador 68
georgia 68
ki 64
jm 64
il 60
nigeria 60
liberia 59
jersey 58
malaysia 58
peru 55
ph 55
se 54
png 54
uruguay 53
ukraine 51
fo 51
dk 51
norway 50
qa 48
portugal 48
philippines 47
jamaica 47
azerbaijan 46
slovenia 46
aq 46
mali 44
thailand 43
switzerland 42
lt 41
tm 40
ps 40
tt 37
md 37
tuv 36
va 36
belgium 36
samoa 35
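
One rough way to cut the noise from short codes such as 'tv' and 'fm' is to count only full country names. The sketch below is not part of the original script: the counts are a handful of rows copied from the table above, and `FULL_NAMES` is a small hand-written set assumed for illustration (in practice it could be built from pycountry's country names).

```python
from collections import Counter

# A few (token, count) rows copied from the result table above.
raw_counts = Counter({
    "haiti": 2154, "israel": 1827, "al": 1432,
    "ve": 1065, "fm": 490, "tv": 244,
})

# Assumed, hand-written sample of full country names (hypothetical helper data).
FULL_NAMES = {"haiti", "israel", "afghanistan", "pakistan"}


def keep_full_names(counts, names):
    """Drop every token that is not a full country name, e.g. 2-letter codes."""
    return Counter({token: n for token, n in counts.items() if token in names})


filtered = keep_full_names(raw_counts, FULL_NAMES)
print(filtered.most_common())
```

The trade-off is that genuine abbreviations such as 'pak' are dropped too, so this favours precision over coverage.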


In addition to this basic information, although it's not a huge amount of data, it could be the source for more sophisticated work, such as sentiment analysis to find out how happy, sad or frustrated Hillary was during this period.
