Digging into Hillary Clinton's Emails - US Presidential Candidate
A few months ago, Hillary Clinton released the email communications (sent and received) during her tenure as Secretary of State in response to a FOIA request. You can download the extracted and normalized version of this raw email data here. Now, let's see what interesting information we can dig out of it.
To start off, what are the most common words in her emails? Using PyEnchant to filter out standard English dictionary words leaves us with some interesting names and incidents. Here are the script and the result.
import pandas as pd
import enchant
import re
from collections import Counter

# Read the raw body of every email and split it into lowercase words.
word_list = []
email_dataframe = pd.read_csv("../input/Emails.csv")
raw_text_list = email_dataframe['RawText'].tolist()
for raw_text in raw_text_list:
    word_list = word_list + re.findall(r'\w+', raw_text.lower())

# Keep only words that are not in the en_US dictionary (names, abbreviations, etc.)
# and count the 100 most frequent ones.
english_checker = enchant.Dict("en_US")
high_frequency_words = Counter(x for x in word_list if not english_checker.check(x)).most_common(100)
Word | Count
---|---
cheryl | 5981
huma | 5735
abedin | 5195
clintonemail | 4997
b6 | 4951
sullivan | 4027
fw | 3847
obama | 3610
clinton | 3478
jacob | 3459
hdr22 | 3022
abedinh | 2786
american | 2410
benghazi | 2394
millscd | 2383
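These counts are easier to compare at a glance as a chart. Here is a minimal sketch (not part of the original script) that plots the top 15 entries of `high_frequency_words` from the snippet above, assuming matplotlib is available.

```python
import matplotlib.pyplot as plt

# high_frequency_words is the list of (word, count) tuples produced above.
top_words = high_frequency_words[:15]
words = [word for word, count in top_words]
counts = [count for word, count in top_words]

plt.figure(figsize=(8, 6))
plt.barh(words[::-1], counts[::-1])  # reverse so the most frequent word ends up on top
plt.xlabel("Occurrences")
plt.title("Most common non-dictionary words in the emails")
plt.tight_layout()
plt.show()
```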
How about the countries she communicated about most? Here's the result of that attempt, with the assistance of PyCountry. There is some noise in the final data (for example, 'tv' and 'fm' most probably refer to the respective media, not to Tuvalu and the Federated States of Micronesia), but as a whole we can see which countries received most of her attention. One way to reduce this noise is sketched after the table below.
import pandas as pd
import enchant
import re
import pycountry
from collections import Counter

# Reading all the text in the raw email bodies and partitioning it into words.
word_list = []
email_dataframe = pd.read_csv("../input/Emails.csv")
raw_text_list = email_dataframe['RawText'].tolist()
for raw_text in raw_text_list:
    word_list = word_list + re.findall(r'\w+', raw_text.lower())

# Using the pycountry module to list all the countries. We consider the shortened
# forms as well, e.g. all three formats of Pakistan (pakistan, pk, pak).
# But if a code is also an ordinary English word, such as 'us', we skip it.
# Note: recent pycountry releases name these attributes alpha_2 / alpha_3;
# older releases used alpha2 / alpha3.
english_checker = enchant.Dict("en_US")
country_list = []
for country in pycountry.countries:
    country_list.append(country.name.strip().lower())
    if not english_checker.check(country.alpha_2.lower()):
        country_list.append(country.alpha_2.lower())
    if not english_checker.check(country.alpha_3.lower()):
        country_list.append(country.alpha_3.lower())

high_frequency_countries = Counter(x for x in word_list if x in country_list).most_common(100)
Country | Count
---|---
haiti | 2154
israel | 1827
al | 1432
afghanistan | 1324
pakistan | 1285
libya | 1256
china | 1247
ve | 1065
iraq | 770
turkey | 602
de | 573
af | 546
india | 541
honduras | 492
fm | 490
egypt | 465
ireland | 388
mexico | 355
germany | 318
armenia | 285
sudan | 269
brazil | 267
france | 249
tv | 244
au | 235
bangladesh | 210
colombia | 207
indonesia | 200
palau | 196
ni | 195
japan | 191
ben | 189
canada | 178
usa | 175
cuba | 165
yemen | 154
na | 135
greece | 133
spain | 133
italy | 132
congo | 131
morocco | 130
qatar | 129
poland | 128
kenya | 121
uganda | 119
guinea | 116
jordan | 112
kyrgyzstan | 111
somalia | 106
argentina | 106
angola | 106
pak | 104
ie | 103
singapore | 92
tunisia | 86
lebanon | 81
ga | 78
rwanda | 77
australia | 74
chile | 71
im | 69
ecuador | 68
georgia | 68
ki | 64
jm | 64
il | 60
nigeria | 60
liberia | 59
jersey | 58
malaysia | 58
peru | 55
ph | 55
se | 54
png | 54
uruguay | 53
ukraine | 51
fo | 51
dk | 51
norway | 50
qa | 48
portugal | 48
philippines | 47
jamaica | 47
azerbaijan | 46
slovenia | 46
aq | 46
mali | 44
thailand | 43
switzerland | 42
lt | 41
tm | 40
ps | 40
tt | 37
md | 37
tuv | 36
va | 36
belgium | 36
samoa | 35
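As noted above, some of the two-letter hits ('tv', 'fm', 'al', 've', and so on) are almost certainly not country references. Below is a minimal sketch of one way to reduce that noise: matching full country names only and skipping the ISO codes entirely. It reuses the `word_list` built in the script above, and the trade-off is that genuine shorthand mentions such as 'pak' are no longer counted.

```python
import pycountry
from collections import Counter

# word_list is the lowercase word list built from the email bodies in the script above.
# Count only full, single-word country names, so ambiguous ISO codes such as
# 'tv' or 'fm' never enter the tally. (Multi-word names like 'united states'
# or 'sri lanka' would need a separate phrase-level search and are skipped here.)
single_word_names = {
    country.name.strip().lower()
    for country in pycountry.countries
    if ' ' not in country.name.strip()
}

name_counts = Counter(word for word in word_list if word in single_word_names)
high_frequency_countries_clean = name_counts.most_common(50)
```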
In addition to this basic information, and although it's not a huge amount of data, this corpus could be the source for much more sophisticated sentiment analysis, such as finding out how happy, sad, or frustrated Hillary was during this period.
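As a starting point for that kind of analysis, here is a minimal sketch (not part of the original scripts) that uses NLTK's VADER sentiment analyzer to score each email body. Only the `RawText` column from the same Emails.csv used above is assumed; the thresholds for "clearly positive/negative" are arbitrary illustration values.

```python
import pandas as pd
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time download of the VADER lexicon

email_dataframe = pd.read_csv("../input/Emails.csv")
sia = SentimentIntensityAnalyzer()

# VADER's compound score ranges from -1 (very negative) to +1 (very positive).
# Scoring the raw body is crude, since headers and forwarded text are included,
# but it gives a first impression of the overall tone.
email_dataframe['compound'] = email_dataframe['RawText'].fillna('').apply(
    lambda text: sia.polarity_scores(text)['compound']
)

print("Mean compound score:", email_dataframe['compound'].mean())
print("Clearly positive emails (> 0.5):", (email_dataframe['compound'] > 0.5).sum())
print("Clearly negative emails (< -0.5):", (email_dataframe['compound'] < -0.5).sum())
```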