Terrorism is the use of violence and intimidation in the pursuit of political or ideological goals. It can take many forms and can be carried out by individuals, groups, or governments. The effects of terrorism are far-reaching and can have a profound impact on individuals, communities, and nations.
Terrorism often seeks to create fear and chaos, and it can have a devastating impact on the physical and emotional well-being of those who are directly affected by it. In addition to the physical harm caused by attacks, terrorism can also lead to economic disruption, as businesses and tourism can be negatively affected. It can also lead to social and political instability, as governments and societies may struggle to respond to and recover from attacks.
On a global scale, terrorism can also have significant international implications, as it can lead to tensions and conflicts between nations and can threaten international stability and security. The fight against terrorism is an ongoing challenge for governments and international organizations, and it requires a combination of efforts to address the root causes of terrorism and to prevent and respond to attacks.
For our final project, we decided to use the Global Terrorism Database. This database is maintained by UMD and has information on attacks from 1970 to 2017. It can be found at https://www.kaggle.com/datasets/START-UMD/gtd. The database also comes with a codebook, which explains how to read and understand the information provided in the database and can be found at https://www.start.umd.edu/gtd/downloads/Codebook.pdf. The first thing we'll do is some exploratory analysis of the information we found interesting. For our model production, it is important to know that while the database does record which groups were responsible for attacks, many attacks are labeled as unknown. We will try to predict which group was most likely responsible for an attack.
#All the libraries that we will be using to complete this project
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import plotly.express as px
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import plotly.io as pio
pio.renderers.default='notebook'
import folium
import requests
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import classification_report
from IPython.display import Image
import warnings
##warnings.filterwarnings("ignore", category=DtypeWarning)
#reading in data set from local machine
#dataset can be found at https://www.kaggle.com/datasets/START-UMD/gtd
#dataset codebook can be found at https://www.start.umd.edu/gtd/downloads/Codebook.pdf
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=pd.errors.DtypeWarning)
    df = pd.read_csv('globalterrorismdb_0718dist.csv', encoding='ISO-8859-1')
df
eventid | iyear | imonth | iday | approxdate | extended | resolution | country | country_txt | region | region_txt | provstate | city | latitude | longitude | specificity | vicinity | location | summary | crit1 | crit2 | crit3 | doubtterr | alternative | alternative_txt | ... | nhostkid | nhostkidus | nhours | ndays | divert | kidhijcountry | ransom | ransomamt | ransomamtus | ransompaid | ransompaidus | ransomnote | hostkidoutcome | hostkidoutcome_txt | nreleased | addnotes | scite1 | scite2 | scite3 | dbsource | INT_LOG | INT_IDEO | INT_MISC | INT_ANY | related | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 197000000001 | 1970 | 7 | 2 | NaN | 0 | NaN | 58 | Dominican Republic | 2 | Central America & Caribbean | NaN | Santo Domingo | 18.456792 | -69.951164 | 1.0 | 0 | NaN | NaN | 1 | 1 | 1 | 0.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | PGIS | 0 | 0 | 0 | 0 | NaN |
1 | 197000000002 | 1970 | 0 | 0 | NaN | 0 | NaN | 130 | Mexico | 1 | North America | Federal | Mexico city | 19.371887 | -99.086624 | 1.0 | 0 | NaN | NaN | 1 | 1 | 1 | 0.0 | NaN | NaN | ... | 1.0 | 0.0 | NaN | NaN | NaN | Mexico | 1.0 | 800000.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | PGIS | 0 | 1 | 1 | 1 | NaN |
2 | 197001000001 | 1970 | 1 | 0 | NaN | 0 | NaN | 160 | Philippines | 5 | Southeast Asia | Tarlac | Unknown | 15.478598 | 120.599741 | 4.0 | 0 | NaN | NaN | 1 | 1 | 1 | 0.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | PGIS | -9 | -9 | 1 | 1 | NaN |
3 | 197001000002 | 1970 | 1 | 0 | NaN | 0 | NaN | 78 | Greece | 8 | Western Europe | Attica | Athens | 37.997490 | 23.762728 | 1.0 | 0 | NaN | NaN | 1 | 1 | 1 | 0.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | PGIS | -9 | -9 | 1 | 1 | NaN |
4 | 197001000003 | 1970 | 1 | 0 | NaN | 0 | NaN | 101 | Japan | 4 | East Asia | Fukouka | Fukouka | 33.580412 | 130.396361 | 1.0 | 0 | NaN | NaN | 1 | 1 | 1 | -9.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | PGIS | -9 | -9 | 1 | 1 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
181686 | 201712310022 | 2017 | 12 | 31 | NaN | 0 | NaN | 182 | Somalia | 11 | Sub-Saharan Africa | Middle Shebelle | Ceelka Geelow | 2.359673 | 45.385034 | 2.0 | 0 | The incident occurred near the town of Balcad. | 12/31/2017: Assailants opened fire on a Somali... | 1 | 1 | 0 | 1.0 | 1.0 | Insurgency/Guerilla Action | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | "Somalia: Al-Shabaab Militants Attack Army Che... | "Highlights: Somalia Daily Media Highlights 2 ... | "Highlights: Somalia Daily Media Highlights 1 ... | START Primary Collection | 0 | 0 | 0 | 0 | NaN |
181687 | 201712310029 | 2017 | 12 | 31 | NaN | 0 | NaN | 200 | Syria | 10 | Middle East & North Africa | Lattakia | Jableh | 35.407278 | 35.942679 | 1.0 | 1 | The incident occurred at the Humaymim Airport. | 12/31/2017: Assailants launched mortars at the... | 1 | 1 | 0 | 1.0 | 1.0 | Insurgency/Guerilla Action | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | "Putin's 'victory' in Syria has turned into a ... | "Two Russian soldiers killed at Hmeymim base i... | "Two Russian servicemen killed in Syria mortar... | START Primary Collection | -9 | -9 | 1 | 1 | NaN |
181688 | 201712310030 | 2017 | 12 | 31 | NaN | 0 | NaN | 160 | Philippines | 5 | Southeast Asia | Maguindanao | Kubentog | 6.900742 | 124.437908 | 2.0 | 0 | The incident occurred in the Datu Hoffer distr... | 12/31/2017: Assailants set fire to houses in K... | 1 | 1 | 1 | 0.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | "Maguindanao clashes trap tribe members," Phil... | NaN | NaN | START Primary Collection | 0 | 0 | 0 | 0 | NaN |
181689 | 201712310031 | 2017 | 12 | 31 | NaN | 0 | NaN | 92 | India | 6 | South Asia | Manipur | Imphal | 24.798346 | 93.940430 | 1.0 | 0 | The incident occurred in the Mantripukhri neig... | 12/31/2017: Assailants threw a grenade at a Fo... | 1 | 1 | 1 | 0.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | "Trader escapes grenade attack in Imphal," Bus... | NaN | NaN | START Primary Collection | -9 | -9 | 0 | -9 | NaN |
181690 | 201712310032 | 2017 | 12 | 31 | NaN | 0 | NaN | 160 | Philippines | 5 | Southeast Asia | Maguindanao | Cotabato City | 7.209594 | 124.241966 | 1.0 | 0 | NaN | 12/31/2017: An explosive device was discovered... | 1 | 1 | 1 | 0.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | "Security tightened in Cotabato following IED ... | "Security tightened in Cotabato City," Manila ... | NaN | START Primary Collection | -9 | -9 | 0 | -9 | NaN |
181691 rows × 135 columns
At this point, we can see that the database is big and contains a lot of information that we don't necessarily need, as a number of columns weren't useful. For example, the location column usually holds a street, even though a latitude and longitude are also given for the attack. Similarly, we're not interested in the scite1, scite2, scite3, resolution, multiple, and approxdate columns. While the approxdate column might seem useful, the year, month, and day are also given as columns, so the approxdate column is redundant. Pruning the dataset is therefore necessary.
#dropping unnecessary columns and renaming columns for more clarity
df.drop(['approxdate', 'location', 'resolution', 'multiple',
         'scite1', 'scite2', 'scite3'], axis=1, inplace=True) #Dropping all useless columns
#Renaming columns for ease of access
df = df.rename(columns={"country": "country_id", "alternative": "alternative_id", "region": "region_id", "gname": "group_name"})
rows = df.shape[0]
#Dropping all rows without a latitude or longitude value
df = df[df['latitude'].notna()]
df = df[df['longitude'].notna()]
dropped_rows = df.shape[0]
noloc_rows = rows - dropped_rows
#print("The number of rows with no latitude/longitude information is {}".format(noloc_rows))
#Checking for any null values in country and group_name columns
df['group_name'].isna().sum() #no null values for terrorism group name
df['country_txt'].isna().sum() #no null values for country
#Making sure all the year, month, and day columns have the same value, so that we don't have to worry about missing dates
df['iyear'].isna().sum()
df['imonth'].isna().sum()
df['iday'].isna().sum()
#no null values for date columns, so I can merge columns accurately
0
dtypes = df.dtypes
dtypes
#creating a date-time column
df['iday'] = df['iday'].replace(0,1)
df['imonth'] = df['imonth'].replace(0,1)
df["Date"] = df["iyear"].apply(str) + "/" + df["imonth"].apply(str) + "/" + df["iday"].apply(str)
df['Date'] = pd.to_datetime(df['Date'])
#moving datetime column to the front of the dataframe:
date_col = df.pop("Date")
df.insert(0, date_col.name, date_col)
#quickly observing unique values of important columns
#df.attacktype1_txt.unique()
#df.targtype1_txt.unique()
#df.targsubtype1_txt.unique()
#df.weaptype1_txt.unique()
#df.propextent_txt.unique()
#df.iyear.unique()
#df.imonth.unique()
First, we dropped a couple of the columns that we were not interested in and renamed some of the columns we were interested in using. We also decided to drop all the rows without a latitude or longitude value, as they would cause further headaches down the line. Finally, we removed the redundant date column and added one unified date column that gave us the date in a manner that we found more helpful. At this point, we decided the dataframe was good enough for our purposes, and it was time to explore the data.
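The unified date column described above can also be built without concatenating strings. The snippet below is a minimal sketch on a three-row toy frame, not the notebook's own code: the `toy` and `parts` names are hypothetical, and it assumes, as the cleaning code does, that a month or day of 0 should be treated as 1.

```python
import pandas as pd

# Toy stand-in for the iyear/imonth/iday columns (values are illustrative only)
toy = pd.DataFrame({
    "iyear": [1970, 1970, 2017],
    "imonth": [7, 0, 12],
    "iday": [2, 0, 31],
})

# pd.to_datetime can assemble dates from columns named year, month, and day
parts = toy[["iyear", "imonth", "iday"]].rename(
    columns={"iyear": "year", "imonth": "month", "iday": "day"}
)

# Assumption: a month or day of 0 is recoded to 1, mirroring the notebook
parts[["month", "day"]] = parts[["month", "day"]].replace(0, 1)

toy["Date"] = pd.to_datetime(parts)
print(toy)
```

Working from numeric columns avoids any ambiguity about how a string such as "1970/1/1" gets parsed.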
The first question we decided to ask was "What were some of the common words people have used to describe terrorist attacks?"
df["summary"]=df["summary"].astype(str)
summary_str = " ".join(summ for summ in df.summary)
stopwords = set(STOPWORDS)
stopwords.update(["the", "and", "so", "are", "because", "at", "in", "no", "however", "nan", "near", "incident",
"unkown", "one"])
wordcloud = WordCloud(stopwords=stopwords, background_color="black").generate(summary_str)
plt.figure(figsize=(12,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Most of the words that were found fit the question, as most people would describe the area, what occurred, who was responsible, and potential casualties. This wordcloud produced nothing that we found surprising or out of the norm. However, it was interesting to see that the biggest term on the wordcloud is "claimed responsibility", which would indicate that a lot of terrorist attacks are claimed by terrorist groups/organizations. The next question we decided to ask was "What words are used to describe each terrorist group's motives?"
try:
    df["motive"]=df["motive"].astype(str)
except KeyError as ke:
    pass
summary_str = " ".join(summ for summ in df.motive)
stopwords = set(STOPWORDS)
stopwords.update(["nan nan", "nan", "sources speculated", "unknown", "sources posited", "Unkown",
"January"])
wordcloud = WordCloud(stopwords=stopwords, background_color="black").generate(summary_str)
plt.figure(figsize=(10,8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
This word cloud also produced nothing out of the ordinary. Some of the words that could be important include sectarian violence, intimidate, protest, death, as well as groups, locations, and affiliations. One important word to note is "larger trend", which indicates the motives of many terrorist attacks are linked together and could be a part of a common ideology and goal.
The first thing that we decided to plot was the countries that had the most terrorist attacks from 1970, the earliest point the database kept track of, to 2017. We then decided to compare it to the last two years of the database, to see if there were any significant differences.
df['country_txt'].value_counts(sort=True)[:30].plot.bar()
<AxesSubplot:>
recent_df = df.loc[df['iyear'] >= 2016] #the last two years of the database (2016 and 2017)
recent_df['country_txt'].value_counts(sort=True)[:30].plot.bar()
<AxesSubplot:>
Notably, Iraq far outpaces every other country in the number of terrorist attacks in both of these graphs. While some of the top countries in the 1970-2017 graph stay close to their position, almost every other country rises or falls sharply within the top 30. Countries like the DRC, Chile, and Ukraine (to name a few) all belong in the top 30 for 2016-2017, but not for 1970 to 2017. The only country that clearly keeps its position is Iraq. This lines up with Iraq's history of an unstable government, with various groups trying to gain control and authority, and general instability in the area for many years now.
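The comparison between the two rankings can also be made explicit with set operations rather than by reading the two bar charts side by side. This is a rough sketch on invented rows: `toy`, `TOP_N`, and the resulting set names are hypothetical, and the toy uses a top 2 instead of a top 30 to keep the data small.

```python
import pandas as pd

# Invented rows that only mimic the country_txt and iyear columns
toy = pd.DataFrame({
    "country_txt": ["Iraq", "Iraq", "Iraq", "Iraq",
                    "Peru", "Peru", "Peru",
                    "Ukraine", "Ukraine"],
    "iyear": [1990, 2005, 2016, 2017,
              1985, 1990, 1992,
              2016, 2017],
})

TOP_N = 2  # assumption: the notebook uses 30

# Top countries over the whole period
overall_top = set(toy["country_txt"].value_counts().head(TOP_N).index)

# Top countries in the last two years of the data
recent = toy[toy["iyear"] >= 2016]
recent_top = set(recent["country_txt"].value_counts().head(TOP_N).index)

newcomers = recent_top - overall_top    # in the recent ranking only
dropped_out = overall_top - recent_top  # in the overall ranking only
stayed = overall_top & recent_top       # in both rankings

print("newcomers:", newcomers)
print("dropped out:", dropped_out)
print("stayed:", stayed)
```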
The next piece of information that we were interested in was the number of terrorist attacks per year, and how they fluctuated from year to year.
#creating a new column for the count of attacks by year
df['year_count'] = df.groupby('iyear')['iyear'].transform('count')
#seaborn plot edits
sns.set_style("darkgrid")
sns.set(rc={"figure.figsize":(12,8)})
sns.set(font_scale=1.75)
#making lineplot
g = sns.lineplot(data=df, x="iyear", y="year_count")
g.set_xlabel("Year")
g.set_ylabel("Number of Terrorist Attacks")
g.set_title("Amount of Terrorist Attacks per Year")
Text(0.5, 1.0, 'Amount of Terrorist Attacks per Year')
There are many fluctuations on a year-to-year basis, and the most notable feature is a huge jump in the number of attacks around 2014. The number of attacks went from around 5000 to around 12000, before peaking at around 16000 attacks in one year. The number of attacks has since started to fall, but terrorist attacks after 2010 are still far more prevalent than they used to be. This could potentially be attributed to the end of the Iraq war, which was in 2011 (where the graph spiked). After the United States ended the war, it is likely that terrorist groups and organizations in Iraq became more active. The war in Iraq could also be a reason for the dip in attacks in the early 2000s. After the US invaded Iraq in 2003, attacks that would previously have been classified as terrorist attacks may instead have been treated as attacks that occurred during the war. In addition, groups that were under attack by the US during the war were likely to reduce their activity for fear of being targeted.
#converting column to string type
df["group_name"]=df["group_name"].astype(str)
#dropping terrorist group names of 'unknown'
threat_df = df.drop(df[df.group_name == "Unknown"].index)
#creating a column: "killsPerAttack" which shows the average number of deaths per terrorist attack
threat_df['group_success'] = threat_df.groupby(['group_name','nkill'])['nkill'].transform('sum')
threat_df['group_count'] = threat_df.groupby('group_name')['group_name'].transform('count')
pd.set_option('display.max_columns', None)
threat_df["killsPerAttack"] = threat_df["group_success"]/threat_df["group_count"]
#keeping one row per group name and taking the top 25 in sorted order
#(note: keep=False drops every group name that appears more than once)
threat_plot = threat_df.drop_duplicates(subset=['group_name'], keep=False)
threat_plot = threat_plot.sort_values(by=['killsPerAttack'], ascending = False)
threat_plot = threat_plot.head(25)
#creating barchart using seaborn
g = sns.catplot(data=threat_plot, y='group_name', x='killsPerAttack',kind='bar',
                ci=None, legend_out=True, height = 10, aspect = 1.75, orient = "h")
g.set_axis_labels("Number of Fatalities caused on average per Terrorist Attack", "Terrorist Groups/Organizations", size = 20)
plt.title("Top 25 Most Deadly Terrorist Groups and Organizations", y=1, fontsize = 25)
Text(0.5, 1, 'Top 25 Most Deadly Terrorist Groups and Organizations')
From the above graph, we can see that some of the deadliest groups are ones that many people in the United States likely haven't heard of without significant research. Many people in the United States have likely only heard of groups such as ISIL, Al-Qaeda, and others that are commonly covered by news outlets. Neither of the aforementioned groups is present in this list. Many of the groups in the top 25 have some sort of ideological motivation, either religious (such as Christianity or Islam) or political (MDJT in Chad). It is also worth noting that Ahmad Jibril, the second bar on this chart, is actually a person. Jibril was a radical Islamic speaker, and he and his followers carried out attacks that landed them on this graph. Keep in mind that this graph does not plot the groups with the most kills. It plots the groups with the most fatalities per attack, which is a different metric.
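The fatalities-per-attack metric can also be computed with a single aggregation per group. The following is a minimal sketch on made-up rows, not the notebook's own code: the `toy`, `per_group`, and `ranked` names are hypothetical, and it assumes that an attack with a missing `nkill` still counts as an attack with zero recorded deaths.

```python
import pandas as pd

# Made-up rows that only mimic the group_name and nkill columns
toy = pd.DataFrame({
    "group_name": ["A", "A", "B", "Unknown", "B", "C"],
    "nkill": [2.0, 4.0, 10.0, 7.0, None, 1.0],
})

# Leave out attacks with no identified group
known = toy[toy["group_name"] != "Unknown"]

# One row per group: total deaths (NaN is skipped by sum) and number of attacks
per_group = known.groupby("group_name")["nkill"].agg(
    total_killed="sum", attacks="size"
)
per_group["kills_per_attack"] = per_group["total_killed"] / per_group["attacks"]

# Highest average first
ranked = per_group.sort_values("kills_per_attack", ascending=False)
print(ranked)
```

Aggregating at the group level means every attack by a group contributes to its average, so groups with many attacks are ranked alongside groups with only one.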
#heatmap of all terrorist attacks representing the number of casualties, hover over the heatmap to inspect the specific
#terrorist organization
fig = px.density_mapbox(df, lat='latitude', lon='longitude', z='nkill', hover_name="group_name",
                        mapbox_style="stamen-terrain", zoom=0)
fig.show("notebook")
In terms of North America, the US and Canada have not seen many terrorist attacks. The United States had one major attack (9/11), and the rest are few and far between. Most of the lethal attacks occurred on the East Coast, while most other attacks are sparse, had no casualties, and are spread across the US. It can also be noted that while Al-Qaeda accounts for quite a big spread because of 9/11, many of the other attacks were carried out by domestic "groups". The word group is used loosely here, as many of these "groups" are not actually organized. While there were some deaths from these attacks, most of them had few or no casualties and were not significant enough to end up on the heat map.
The same cannot be said for the rest of the world. While there are plenty of attacks that led to no casualties, there are plenty more with 1 or more casualties, and the heat map shows as much.
Terrorism relies on the use of weapons to carry out attacks of deadly force, so breaking down weapon usage in attacks holds merit. We decided to use 5 regions: Central America & Caribbean, North America, Middle East & North Africa, Central Asia, and Eastern Europe. Each of these regions has at least one "relevant" terrorist group.
#only using the 5 most interesting/relevant regions
regions = ['Central America & Caribbean', 'North America', 'Middle East & North Africa', 'Central Asia', 'Eastern Europe']
pie_df = df[df['region_txt'].isin(regions)]
pie_df = pie_df[pie_df['weaptype1_txt'] != "Unknown"]
pie_df['weap_count'] = pie_df.groupby(['weaptype1_txt', 'region_txt'])['weaptype1_txt'].transform('count')
pie_df = pie_df.drop_duplicates(subset=['weaptype1_txt', 'region_txt'], keep = 'last')
pie_df
pie1 = pie_df[pie_df['region_txt'] == 'Central America & Caribbean']
pie1
fig = px.pie(pie1, values='weap_count', names='weaptype1_txt',
title='Split of Attack Method in Central America & Caribbean')
fig.show("notebook")
pie2 = pie_df[pie_df['region_txt'] == 'North America']
fig = px.pie(pie2, values='weap_count', names='weaptype1_txt',
title='Split of Attack Method in North America')
fig.show("notebook")
pie3 = pie_df[pie_df['region_txt'] == 'Middle East & North Africa']
pie3
fig = px.pie(pie3, values='weap_count', names='weaptype1_txt',
title='Split of Attack Method in Middle East & North Africa')
fig.show("notebook")
pie4 = pie_df[pie_df['region_txt'] == 'Central Asia']
fig = px.pie(pie4, values='weap_count', names='weaptype1_txt',
title='Split of Attack Method in Central Asia')
fig.show("notebook")
pie5 = pie_df[pie_df['region_txt'] == 'Eastern Europe']
fig = px.pie(pie5, values='weap_count', names='weaptype1_txt',
title='Split of Attack Method in Eastern Europe')
fig.show("notebook")
These pie charts show a lot of interesting information. The only region where firearms have a majority is Central America and the Caribbean. In every other region, explosives are the primary attack method. It is also interesting to note that every chart has the same top 4 methods. In no particular order, those 4 are Firearms, Explosives, Incendiary, and Melee. This could potentially be correlated with their ease of access. Compared to chemical or biological agents, explosives (which can be made), firearms (relatively easily acquired), incendiary weapons (which can also be made), and melee weapons (no explanation required) are all significantly easier to acquire, which could explain why they are more commonly used.
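The shares behind pie charts like these can also be tabulated directly, which makes it easier to check which method leads in each region. This is only a sketch on invented rows: `toy`, `shares`, and `top_weapon` are hypothetical names, and the counts are not taken from the database.

```python
import pandas as pd

# Invented rows that only mimic the region_txt and weaptype1_txt columns
toy = pd.DataFrame({
    "region_txt": ["North America", "North America", "North America",
                   "Central Asia", "Central Asia", "Central Asia"],
    "weaptype1_txt": ["Explosives", "Explosives", "Firearms",
                      "Firearms", "Firearms", "Melee"],
})

# Row-normalised crosstab: each row holds one region's weapon shares and sums to 1
shares = pd.crosstab(toy["region_txt"], toy["weaptype1_txt"], normalize="index")

# Most common weapon type in each region
top_weapon = shares.idxmax(axis=1)

print(shares.round(2))
print(top_weapon)
```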
Finally, we're going to look at some graphs and maps that are more locally based to us, and we will look at some of the attacks classified as terrorism in the United States.
#Making a dataframe where all attackers are known
threat_df_for_map = df[df['group_name']!= "Unknown"]
#print(threat_df_for_map)
#Making a map and adding points to it.
map_osm_for_US = folium.Map(location=[39.14, -101.2996], zoom_start=4.5)
threat_df_for_US = threat_df_for_map[threat_df_for_map["country_txt"] == "United States"]
#threat_plot = threat_plot.sort_values(by=['killsPerAttack'], ascending = False)
threat_df_for_US = threat_df_for_US.sort_values(by =["Date"],ascending = False)
#threat_df_for_US["group_name"]
aae = 0
faln = 0
we = 0
lwm = 0
for index, row in threat_df_for_US.iterrows():
    if row["group_name"] == "Anti-Abortion extremists":
        if aae == 50:
            continue
        else:
            aae = aae + 1
            folium.Marker(location=[row["latitude"], row["longitude"]],
                          popup = row["group_name"] + " " + row["Date"].strftime("%m/%d/%Y"),
                          icon=folium.Icon(color='red')).add_to(map_osm_for_US)
    if row["group_name"] == "Left-Wing Militants":
        if lwm == 50:
            continue
        else:
            lwm = lwm + 1
            folium.Marker(location=[row["latitude"], row["longitude"]],
                          popup = row["group_name"] + " " + row["Date"].strftime("%m/%d/%Y") ,
                          icon=folium.Icon(color='blue')).add_to(map_osm_for_US)
    if row["group_name"] == "Fuerzas Armadas de Liberacion Nacional (FALN)":
        if faln == 50:
            continue
        else:
            faln = faln + 1
            folium.Marker(location=[row["latitude"], row["longitude"]],
                          popup = row["group_name"] + " " + row["Date"].strftime("%m/%d/%Y") ,
                          icon=folium.Icon(color='gray')).add_to(map_osm_for_US)
    if row["group_name"] == "White extremists":
        if we == 50:
            continue
        else:
            we = we + 1
            folium.Marker(location=[row["latitude"], row["longitude"]],
                          popup = row["group_name"] + " " + row["Date"].strftime("%m/%d/%Y") ,
                          icon=folium.Icon(color='purple')).add_to(map_osm_for_US)
map_osm_for_US
This map shows the 50 most recent attacks for the top 4 unique groups in the United States. This was done by first removing every other country from the dataset, isolating and finding out the top 4 groups responsible for attacks, and then sorting the dataframe by descending date. Initially the map had the first 50 of each group that appeared in the dataset, and the dates were all pretty much from the 1970-80s. After sorting, the dates were more in line with what we wanted to see. As a result, we were able to see attacks from as late as 2017 on the map.
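The selection described above (top groups first, then the most recent attacks for each) can also be expressed without per-group counters. The snippet below is a sketch under stated assumptions: the rows are invented, `toy`, `TOP_GROUPS`, and `N_PER_GROUP` are hypothetical names, and the toy keeps 2 groups and 2 attacks each where the map uses 4 and 50.

```python
import pandas as pd

# Invented rows that only mimic the group_name and Date columns
toy = pd.DataFrame({
    "group_name": ["A", "A", "A", "B", "B", "C"],
    "Date": pd.to_datetime([
        "2001-01-01", "2015-06-01", "2017-03-01",
        "1999-01-01", "2010-01-01", "2016-01-01",
    ]),
})

TOP_GROUPS = 2   # assumption: the map uses the top 4 groups
N_PER_GROUP = 2  # assumption: the map uses the 50 most recent attacks

# Groups with the most attacks
top_groups = toy["group_name"].value_counts().head(TOP_GROUPS).index

# Sort newest first, then keep the first N rows of each group
recent = (
    toy[toy["group_name"].isin(top_groups)]
    .sort_values("Date", ascending=False)
    .groupby("group_name")
    .head(N_PER_GROUP)
)
print(recent)
```

The resulting frame could then be iterated once to add the markers, with the marker color looked up from the group name.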
While making this map of the locations of top groups in the United States, we found it noteworthy that most of the terrorist attacks in the US were not carried out by the organizations most people think of when they think of terrorism (Al-Qaeda, ISIL, etc.) and were instead attributed to undefined groups like "white extremists" or "anti-abortion extremists". This led us to believe that it would be useful to compare the top contributors to attacks in the US with those somewhere like Iraq, where the most attacks occur. Also, while the FALN (Fuerzas Armadas de Liberación Nacional) does have some attacks on the mainland US, most of its attacks were carried out in the Caribbean, specifically Puerto Rico.
In this section, we are going to compare the groups responsible for the most attacks in their respective country.
#Getting a dataframe that only has the group name and how many occurrences they have
group_counts_for_US = threat_df_for_US['group_name'].value_counts().reset_index()
group_counts_for_US.columns = ['group_name', 'count']
#print(group_counts_for_US)
#Getting the top ten groups in the US, and printing only those
group_counts_for_US.head(10)
group_counts_for_US = group_counts_for_US[group_counts_for_US['count'] >= 66]
#df.loc[row_index] = df.loc[row_index].rename('new_index_name')
#Renaming group names so that they fit better on the bar chart
group_counts_for_US.at[2, 'group_name'] = 'FALN'
group_counts_for_US.at[4, 'group_name'] = 'NWLF'
group_counts_for_US.at[6, 'group_name'] = 'ALF'
group_counts_for_US.at[7, 'group_name'] = 'JDL'
group_counts_for_US.at[9, 'group_name'] = 'ELF'
group_counts_for_US.at[0, 'group_name'] = 'Anti-Abortion'
group_counts_for_US.at[1, 'group_name'] = 'Left-Wing'
#print(group_counts
#Making the graph for the US
sns.set(rc={"figure.figsize":(14, 12)})
g = sns.barplot(data=group_counts_for_US, x="group_name", y='count')
g.set_xlabel("Group Name", fontsize = 20)
g.set_ylabel("Number of Attacks", fontsize = 20)
g.set_title("Number of Attacks by Terrorist Group in the United States", fontsize = 30)
Text(0.5, 1.0, 'Number of Attacks by Terrorist Group in the United States')
Many of the terrorist groups responsible for attacks in the United States are not actually groups at all. Instead, they are mostly ideologies, like Anti-Abortion, Left-Wing, and White extremists. This suggests that there are few organized terrorist groups in the US, which is a far cry from the top groups found in Iraq. Interestingly, many of these ideologies are not religiously based and are instead points of contention that can still be found in American politics today.
threat_df_for_Iraq = threat_df_for_map[threat_df_for_map["country_txt"] == "Iraq"]
group_counts_for_Iraq = threat_df_for_Iraq['group_name'].value_counts().reset_index()
group_counts_for_Iraq.columns = ['group_name', 'count']
group_counts_for_Iraq = group_counts_for_Iraq[group_counts_for_Iraq['count'] >= 20]
group_counts_for_Iraq
group_counts_for_Iraq.at[0,"group_name"] = 'ISIL'
group_counts_for_Iraq.at[1,"group_name"] = 'Al-Qaida'
group_counts_for_Iraq.at[2,"group_name"] = 'ISI'
group_counts_for_Iraq.at[5,"group_name"] = "JTJ"
group_counts_for_Iraq.at[6,"group_name"] = 'JRTN'
group_counts_for_Iraq.at[7,"group_name"] = 'Muslim Ex.'
group_counts_for_Iraq.at[9,"group_name"] = 'MCTR'
group_counts_for_Iraq
g = sns.barplot(data=group_counts_for_Iraq, x="group_name", y='count')
g.set_xlabel("Group Name", fontsize = 20)
g.set_ylabel("Number of Attacks", fontsize = 20)
g.set_title("Number of Attacks by Terrorist Group in Iraq", fontsize = 30)
Text(0.5, 1.0, 'Number of Attacks by Terrorist Group in Iraq')
The top 10 groups responsible for attacks in Iraq are what we expected when we chose to analyze the groups responsible. The top 10 for Iraq consists mostly of religiously affiliated, established terrorist groups. It is also important to note that ISIL/ISIS has been responsible for over 5x as many attacks as the next nearest group (Al-Qaeda). Comparing this bar chart to the previous one, it becomes clear that religious groups account for a significantly larger share of attacks, and therefore have a stronger hold, in Iraq than in the US. As previously stated, many of the groups responsible for attacks in the US are not established groups with clear hierarchies and leaders. In contrast, many of the groups in Iraq do have clear leadership structures and a public figure leading them. This, along with their previously established reputations, allows them to wreak havoc as they do in Iraq and other Middle Eastern countries.
For the machine learning part of our final tutorial, we decided to try to solve the unknown group problem. While many terrorist organizations have claimed responsibility for a plethora of attacks, there are a lot of attacks in which there was no identified perpetrator/group responsible. Using the random forest algorithm, we tried to categorize each group and classify their attack patterns based on a number of variables: the country the attack took place in, the goal of the attack (crit1-3), whether it was successful, whether or not it was a suicide mission, and a couple more. The specifics can be found in the codebook linked above. For both of these models, we used a random forest classifier, made available by the sklearn Python package.
A random forest classifier is a machine learning algorithm that belongs to the ensemble learning family. It is a type of decision tree classifier, but it uses multiple decision trees and combines their predictions to improve the overall accuracy of the model. In a random forest classifier, a large number of decision trees are trained on randomly selected subsets of the training data. Each decision tree makes a prediction, and the random forest classifier combines these predictions by taking the mode (for classification) or mean (for regression) of the predictions. This combination of predictions helps to reduce overfitting and improve the generalization performance of the model.
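To make the idea of combining trees concrete, here is a small sketch on synthetic data (generated with `make_classification`, not the terrorism dataset; all variable names are ours). It fits a forest and then rebuilds the forest's class probabilities by averaging the probabilities of its individual trees. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted probabilities and picking the most probable class, which is a soft version of the majority vote described above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-class problem, used only for illustration
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)

forest = RandomForestClassifier(n_estimators=25, max_depth=4, random_state=0)
forest.fit(X, y)

# Class probabilities from every individual tree: shape (trees, samples, classes)
tree_probs = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# Averaging over the trees reproduces the forest's own probabilities
avg_probs = tree_probs.mean(axis=0)
print(np.allclose(avg_probs, forest.predict_proba(X)))

# The predicted class is the one with the highest averaged probability
manual_pred = forest.classes_[avg_probs.argmax(axis=1)]
print((manual_pred == forest.predict(X)).mean())
```

Because each tree sees a different bootstrap sample and random subsets of features, their individual errors tend to differ, and averaging smooths them out.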
#Trying to predict the terrorist organization based on selected predictor columns
ml_df = df[df.group_name != 'Unknown'].copy() #copy so the edits below do not touch df
enc=OneHotEncoder()
pd.set_option('display.max_columns', 50)
#Making the columns categorical and recoding the -9 (unknown) values in the 'ishostkid' and 'INT_ANY' columns as 0
ml_df['specificity'] = pd.Categorical(ml_df.specificity)
ml_df['vicinity'] = pd.Categorical(ml_df.vicinity)
ml_df['success'] = pd.Categorical(ml_df.success)
ml_df['ishostkid'] = ml_df['ishostkid'].replace(-9, 0)
ml_df['INT_ANY'] = ml_df['INT_ANY'].replace(-9, 0)
#choosing predictor (independent variable) columns
enc_data=pd.DataFrame(enc.fit_transform(ml_df[['extended','country_txt', 'specificity', 'vicinity', 'crit1',
                                               'crit2','crit3','success','suicide','attacktype1_txt','targtype1_txt', 'guncertain1','weaptype1_txt',
                                               'property','ishostkid','ransom','INT_ANY']]).toarray())
X = enc_data
y = ml_df["group_name"] #target variable
#training the model
#The max depth determines how deep each tree in the random forest will go down to before it must make a conclusion
# The number of estimators is the number of decision trees that the classifier will make
SEED = 99
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)
rfc = RandomForestClassifier(n_estimators=100, max_depth=9,random_state=SEED)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
#evaluation metrics
final_df = pd.DataFrame(classification_report(y_test,y_pred, zero_division = 0,output_dict = True))
#dropping the last row ('support') so only precision, recall and f1-score remain
final_df = final_df.drop(final_df.index[-1])
final_df.head(50)
| | 16 January Organization for the Liberation of Tripoli | 1920 Revolution Brigades | 1st of May Group | 2 April Group | 20 December Movement (M-20) | 23rd of September Communist League | 2nd of June Movement | 31 January People's Front (FP-31) | 9 February | 9 May People's Liberation Force | Abbala extremists | Abd al-Krim Commandos | Abdul Qader Husseini Battalions of the Free Palestine movement | Abdullah Azzam Brigades | Abida Tribe | Abkhazian Separatists | Abkhazian guerrillas | Abu Amarah Battalion | Abu Nidal Organization (ANO) | Abu Obaida bin Jarrah Brigade | Abu Sayyaf Group (ASG) | Achik National Cooperative Army (ANCA) | Achik National Liberation Army (ANLA) | Achik National Volunteer Council-B (ANVC-B) | Achik Songna An'pachakgipa Kotok (ASAK) | ... | Workers' Self-Defense Movement (MAO) | World Church of the Creator | Xhosa Tribal Workers | Yakariya Bango Insurgent Group | Yemenis | Young Communist League | Young Pioneers | Youth Action Group | Youth for Revolution | Youths | Zapatista National Liberation Army | Zawiya Martyrs Brigade | Zebra killers | Zeliangrong United Front | Zero Tolerance | Zimbabwe African Nationalist Union (ZANU) | Zimbabwe African People's Union | Zimbabwe Guerrillas | Zimbabwe Patriotic Front | Zulu Miners | Zuwar al-Imam Rida | leftist guerrillas-Bolivarian militia | accuracy | macro avg | weighted avg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
precision | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.818182 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.568792 | 0.030449 | 0.417401 |
recall | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.088235 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.568792 | 0.026104 | 0.568792 |
f1-score | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.159292 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.568792 | 0.024629 | 0.448185 |
3 rows × 1541 columns
Unfortunately, this machine learning model had mixed success at best, though it does show some potential. Overall accuracy on the test set was about 0.57, but the macro-averaged F1 score was only about 0.02, which shows that performance on the typical group was close to zero. For some of the groups, the model produced relatively high precision, meaning that when it did predict that group, it was usually right. Abu Sayyaf Group (ASG), for example, had a precision of about 0.82 but a recall of only about 0.09, so the model still missed most of that group's attacks. For many of the groups, however, the sample size isn't big enough. In the whole dataset, there are about 27 groups with more than 100 attacks. Over three quarters of the groups had attack counts in the single digits, which doesn't give the model enough data points to learn a group's behavior and predict whether an attack was conducted by it. As a result, the model was unable to predict many of the groups in the dataset. While it was fairly precise for some groups, it had a precision of 0.00 for too many groups for us to conclude that it can predict groups from the given predictor variables.
However, even among the 27 groups that had over 100 attacks to train on, precision was all over the place. The model predicted some of these groups with high precision, while for others it had a precision of 0.00. Unfortunately, this means that the model was unsuccessful in its goal of predicting the responsible group from variables about the attack.
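One way to act on the sample-size problem described above would be to keep only the groups with enough recorded attacks before training. We did not do this in our pipeline; the sketch below uses a hypothetical toy DataFrame (toy_df) and an illustrative threshold (MIN_ATTACKS), and only the group_name column name comes from this tutorial.

```python
import pandas as pd

# Hypothetical toy data standing in for ml_df
toy_df = pd.DataFrame({
    "group_name": ["A"] * 5 + ["B"] * 3 + ["C"] * 1 + ["D"] * 1,
    "success": [1, 1, 0, 1, 1, 1, 0, 1, 1, 0],
})

# Illustrative threshold; 100 would correspond to the cutoff discussed above
MIN_ATTACKS = 3

# Count attacks per group and keep only the groups at or above the threshold
counts = toy_df["group_name"].value_counts()
frequent_groups = counts[counts >= MIN_ATTACKS].index
filtered_df = toy_df[toy_df["group_name"].isin(frequent_groups)]

print(sorted(filtered_df["group_name"].unique()))  # ['A', 'B']
print(len(filtered_df))                            # 8
```

The trade-off is that a model trained this way can only ever predict the frequent groups, so attacks by rarer groups would always be misattributed.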
Given that our first model was not a resounding success, we decided to go with a simpler idea. The database contains a column that records whether a given attack was a success or not. This categorical variable's value (either 0 or 1) is determined by whether or not the attack successfully took place; the attack does not necessarily need to achieve its broader objective. The only exception is assassination attempts, which require the target to have been killed; otherwise, the attack is labeled a failure. Each type of attack has its own definition of success and failure, and more details can be found in the codebook linked above.
For the second attempt, we decided to use the same method (a random forest classifier) together with 10-fold cross-validation to determine whether the model could predict the outcome of an attack given certain variables.
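As a rough illustration of what 10-fold cross-validation does before we apply it to the real data: the rows are split into ten parts, and each part takes one turn as the held-out test set while the other nine are used for training. The sketch below uses twenty placeholder rows; X_demo, cv_demo, and times_tested are names made up for this example.

```python
import numpy as np
from sklearn.model_selection import KFold

# Twenty placeholder samples; only their row indices matter here
X_demo = np.arange(20).reshape(-1, 1)

cv_demo = KFold(n_splits=10, shuffle=True, random_state=1)

# Track how many times each row ends up in a test fold
times_tested = np.zeros(len(X_demo), dtype=int)
for fold, (train_idx, test_idx) in enumerate(cv_demo.split(X_demo)):
    times_tested[test_idx] += 1
    print(f"fold {fold}: train on {len(train_idx)} rows, test on {len(test_idx)} rows")

# Every row is held out exactly once across the ten folds
print(times_tested.min(), times_tested.max())
```

Because every row is tested exactly once, the ten fold scores together give a less split-dependent picture than a single train/test split.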
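#Predicting whether the terrorist attack was a success or not
#Dropping attacks with an unknown weapon type; .copy() gives us an independent DataFrame
ml1_df = df[df.weaptype1_txt != 'Unknown'].copy()
pd.set_option('display.max_columns', 50)
#Treating unknowns (coded as -9) in the 'ishostkid' and 'INT_ANY' columns as 0
#replace() returns a new Series, so the result has to be assigned back
ml1_df['ishostkid'] = ml1_df['ishostkid'].replace(-9, 0)
ml1_df['INT_ANY'] = ml1_df['INT_ANY'].replace(-9, 0)
#choosing the target ('success') and the predictor columns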
enc_data = ml1_df[['success','weaptype1_txt','region_txt',
'attacktype1_txt','targtype1_txt', 'guncertain1', 'specificity']]
#one-hot encoding
df_dummies = pd.get_dummies(data=enc_data, columns=enc_data.columns[1:])
y = df_dummies['success']
X = df_dummies.drop(['success'], axis = 1)
#training the model
SEED = 99
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=SEED)
rfc = RandomForestClassifier(n_estimators=5, max_depth=9,random_state=SEED)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
#evaluation metrics
print(classification_report(y_test,y_pred))
#10-fold cross validation
cv = KFold(n_splits=10, random_state=1, shuffle=True)
scores = cross_val_score(rfc, X, y,cv=cv, n_jobs=-1)
print(scores)
              precision    recall  f1-score   support

           0       0.72      0.24      0.35      2681
           1       0.91      0.99      0.95     21745

    accuracy                           0.91     24426
   macro avg       0.82      0.61      0.65     24426
weighted avg       0.89      0.91      0.88     24426

[0.8940678  0.90770081 0.89670843 0.90407762 0.90241356 0.90241356
 0.90339618 0.89940429 0.89756187 0.89246453]
Unfortunately, this model was not very successful at predicting attack outcomes either. It predicted successful attacks with relatively high precision (0.91) and found nearly all of them (a recall of 0.99), but it could not do the same for failures. For failed attacks, precision was 0.72 and recall was 0.24: when the model predicted a failure it was right about 72% of the time, but it identified only about a quarter of the attacks that actually failed. Because 21,745 of the 24,426 attacks in the test set were successes, the overall accuracy of 0.91 mostly reflects the majority class.
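To put the accuracy figures in context, it helps to compare them with a trivial model that always predicts "success". The short calculation below uses only the support counts and fold scores copied from the printed output above; the variable names are ours.

```python
# Support counts copied from the classification report above
n_failures = 2681
n_successes = 21745
total = n_failures + n_successes

# Accuracy of a trivial model that always predicts "success" (1)
baseline_accuracy = n_successes / total

# Fold accuracies copied from the 10-fold cross-validation output above
cv_scores = [0.8940678, 0.90770081, 0.89670843, 0.90407762, 0.90241356,
             0.90241356, 0.90339618, 0.89940429, 0.89756187, 0.89246453]
mean_cv = sum(cv_scores) / len(cv_scores)

print(f"majority baseline: {baseline_accuracy:.3f}")  # majority baseline: 0.890
print(f"mean CV accuracy:  {mean_cv:.3f}")            # mean CV accuracy:  0.900
```

The model beats the always-success baseline by only about one percentage point, which is consistent with its low recall on failed attacks.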
Overall, it was a great learning experience working with the Global Terrorism Database while leveraging what we learned in CMSC320 to produce visuals and graphics that we found intriguing. The only downside of this project is that we weren't able to make reliable predictions for either the group responsible for an attack or whether or not an attack was successful. Hopefully, this tutorial made you a little more interested in analyzing data, whether it be about terrorism or something else you find interesting. While we covered a lot of information in this tutorial, the Global Terrorism Database holds far more information than we were able to touch on, and if you are interested in looking at it yourself, the links can be found under our Explanation header. We hope you enjoyed looking through our work, and we thank you for taking the time to do so.
Omeed and AJ