CMSC320 Final Project: Analyzing the Global Terrorism Database¶

By Omeed Zarrabian and Adithya Raj¶

Introduction
¶

Terrorism is the use of violence and intimidation in the pursuit of political or ideological goals. It can take many forms and can be carried out by individuals, groups, or governments. The effects of terrorism are far-reaching and can have a profound impact on individuals, communities, and nations.

Terrorism often seeks to create fear and chaos, and it can have a devastating impact on the physical and emotional well-being of those who are directly affected by it. In addition to the physical harm caused by attacks, terrorism can also lead to economic disruption, as businesses and tourism can be negatively affected. It can also lead to social and political instability, as governments and societies may struggle to respond to and recover from attacks.

On a global scale, terrorism can also have significant international implications, as it can lead to tensions and conflicts between nations and can threaten international stability and security. The fight against terrorism is an ongoing challenge for governments and international organizations, and it requires a combination of efforts to address the root causes of terrorism and to prevent and respond to attacks.

Explanation
¶

For our final project, we decided to use the Global Terrorism Database. This database is maintained by UMD, and has information on attacks from 1970 to 2017. It can be found at https://www.kaggle.com/datasets/START-UMD/gtd. The database also comes with a codebook, which contains information about how to read and understand the information provided in the database, which can be found at https://www.start.umd.edu/gtd/downloads/Codebook.pdf. The First thing we'll do is some exploratory analysis and look at some information we found interesting. For our model production, it is important to know that while the database does classify what groups were responsible for attacks, there are many attacks that were labeled as unknown. We will try to predict which group was most likely responsible for an attack.

Reading and Cleaning the Data¶

In [59]:
#All the libararies that we will be using to complete this project 
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import plotly.express as px
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import plotly.io as pio
pio.renderers.default='notebook'
import folium
import requests
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import classification_report
from IPython.display import Image
import warnings
##warnings.filterwarnings("ignore", category=DtypeWarning)


#reading in data set from local machine
#dataset can be found at https://www.kaggle.com/datasets/START-UMD/gtd
#dataset cookbook can be found at https://www.start.umd.edu/gtd/downloads/Codebook.pdf
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=pd.errors.DtypeWarning)
    df = pd.read_csv('globalterrorismdb_0718dist.csv', encoding='ISO-8859-1')
df
Out[59]:
eventid iyear imonth iday approxdate extended resolution country country_txt region region_txt provstate city latitude longitude specificity vicinity location summary crit1 crit2 crit3 doubtterr alternative alternative_txt ... nhostkid nhostkidus nhours ndays divert kidhijcountry ransom ransomamt ransomamtus ransompaid ransompaidus ransomnote hostkidoutcome hostkidoutcome_txt nreleased addnotes scite1 scite2 scite3 dbsource INT_LOG INT_IDEO INT_MISC INT_ANY related
0 197000000001 1970 7 2 NaN 0 NaN 58 Dominican Republic 2 Central America & Caribbean NaN Santo Domingo 18.456792 -69.951164 1.0 0 NaN NaN 1 1 1 0.0 NaN NaN ... NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN PGIS 0 0 0 0 NaN
1 197000000002 1970 0 0 NaN 0 NaN 130 Mexico 1 North America Federal Mexico city 19.371887 -99.086624 1.0 0 NaN NaN 1 1 1 0.0 NaN NaN ... 1.0 0.0 NaN NaN NaN Mexico 1.0 800000.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN PGIS 0 1 1 1 NaN
2 197001000001 1970 1 0 NaN 0 NaN 160 Philippines 5 Southeast Asia Tarlac Unknown 15.478598 120.599741 4.0 0 NaN NaN 1 1 1 0.0 NaN NaN ... NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN PGIS -9 -9 1 1 NaN
3 197001000002 1970 1 0 NaN 0 NaN 78 Greece 8 Western Europe Attica Athens 37.997490 23.762728 1.0 0 NaN NaN 1 1 1 0.0 NaN NaN ... NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN PGIS -9 -9 1 1 NaN
4 197001000003 1970 1 0 NaN 0 NaN 101 Japan 4 East Asia Fukouka Fukouka 33.580412 130.396361 1.0 0 NaN NaN 1 1 1 -9.0 NaN NaN ... NaN NaN NaN NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN PGIS -9 -9 1 1 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
181686 201712310022 2017 12 31 NaN 0 NaN 182 Somalia 11 Sub-Saharan Africa Middle Shebelle Ceelka Geelow 2.359673 45.385034 2.0 0 The incident occurred near the town of Balcad. 12/31/2017: Assailants opened fire on a Somali... 1 1 0 1.0 1.0 Insurgency/Guerilla Action ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN "Somalia: Al-Shabaab Militants Attack Army Che... "Highlights: Somalia Daily Media Highlights 2 ... "Highlights: Somalia Daily Media Highlights 1 ... START Primary Collection 0 0 0 0 NaN
181687 201712310029 2017 12 31 NaN 0 NaN 200 Syria 10 Middle East & North Africa Lattakia Jableh 35.407278 35.942679 1.0 1 The incident occurred at the Humaymim Airport. 12/31/2017: Assailants launched mortars at the... 1 1 0 1.0 1.0 Insurgency/Guerilla Action ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN "Putin's 'victory' in Syria has turned into a ... "Two Russian soldiers killed at Hmeymim base i... "Two Russian servicemen killed in Syria mortar... START Primary Collection -9 -9 1 1 NaN
181688 201712310030 2017 12 31 NaN 0 NaN 160 Philippines 5 Southeast Asia Maguindanao Kubentog 6.900742 124.437908 2.0 0 The incident occurred in the Datu Hoffer distr... 12/31/2017: Assailants set fire to houses in K... 1 1 1 0.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN "Maguindanao clashes trap tribe members," Phil... NaN NaN START Primary Collection 0 0 0 0 NaN
181689 201712310031 2017 12 31 NaN 0 NaN 92 India 6 South Asia Manipur Imphal 24.798346 93.940430 1.0 0 The incident occurred in the Mantripukhri neig... 12/31/2017: Assailants threw a grenade at a Fo... 1 1 1 0.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN "Trader escapes grenade attack in Imphal," Bus... NaN NaN START Primary Collection -9 -9 0 -9 NaN
181690 201712310032 2017 12 31 NaN 0 NaN 160 Philippines 5 Southeast Asia Maguindanao Cotabato City 7.209594 124.241966 1.0 0 NaN 12/31/2017: An explosive device was discovered... 1 1 1 0.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN "Security tightened in Cotabato following IED ... "Security tightened in Cotabato City," Manila ... NaN START Primary Collection -9 -9 0 -9 NaN

181691 rows × 135 columns

At this point, we can see that the database is big, and also has a lot of information that we don't necessarily need as there were a bunch of columns that weren't useful. For example, the location column usually had a street on it, even though there is also a longitude and latitude given for the attack. Similarly, we're not interested in the scite1, scite2, scite3, resolution, multiple, and approxdate columns. while the approxdate column might seem useful, they also give the year, month, and date as columns, so the approxdate column is mundane. So pruning the dataset is necessary.

Data Cleaning¶

In [60]:
#dropping unecessary columns and renaming columns for more clarity
df.drop(['approxdate', 'location', 'resolution', 'multiple',
        'scite1', 'scite2', 'scite3'], axis=1, inplace=True) #Droppping all useless columns
#Renaming columns for ease of access
df = df.rename(columns={"country": "country_id", "alternative": "alternative_id", "region": "region_id", "gname": "group_name"})
rows = df.shape[0]
In [61]:
#Dropping all lat and longtide rows without a value 
df = df[df['latitude'].notna()]
df = df[df['longitude'].notna()]
dropped_rows = df.shape[0]
noloc_rows = rows - dropped_rows
#print("The number of rows with no latitude/longitude informatiun is {}".format(noloc_rows))
In [62]:
#Checking for any null values in country and group_name columns
df['group_name'].isna().sum() #no null values for terrorism group name
df['country_txt'].isna().sum() #no null values for country

#Making sure all the year, month, and day columns have the same value, so that we don't have to worry about missing dates
df['iyear'].isna().sum()
df['imonth'].isna().sum()
df['iday'].isna().sum()

#no null values for date columns, so I can merge columns accurately
Out[62]:
0
In [63]:
dtypes = df.dtypes
dtypes

#creating a date-time column
df['iday'] = df['iday'].replace(0,1)
df['imonth'] = df['imonth'].replace(0,1)
df["Date"] = df["iyear"].apply(str) + "/" + df["imonth"].apply(str) + "/" + df["iday"].apply(str)
df['Date'] =  pd.to_datetime(df['Date'])

#moving datetime column to the front of the dataframe:
date_col = df.pop("Date")
df.insert(0, date_col.name, date_col)
In [64]:
#quickly observing unique values of importnat columns

#df.attacktype1_txt.unique()
#df.targtype1_txt.unique()
#df.targsubtype1_txt.unique()
#df.weaptype1_txt.unique()b
#df.propextent_txt.unique()
#df.iyear.unique()
#df.imonth.unique()

First, we dropped a couple of the columns that we were not interested in and renamed some of the columns we were interested in using. We also decided to drop all the rows without a latitude or longitude value, as it would cause further headaches down the line. Finally, we removed the redundant date columns and added one unified date that gave us the date in a matter that we found to be more helpful. At this point, we decided the dataframe was good enough for our purposes, and it was time for exploring the data.

Data Exploration and Visualization
¶

Wordclouds¶

The first question we decided to ask was "What were some of the common words people have used to describe terrorist attacks?"

In [65]:
df["summary"]=df["summary"].astype(str)
summary_str = " ".join(summ for summ in df.summary)
stopwords = set(STOPWORDS)
stopwords.update(["the", "and", "so", "are", "because", "at", "in", "no", "however", "nan", "near", "incident",
                 "unkown", "one"])
wordcloud = WordCloud(stopwords=stopwords, background_color="black").generate(summary_str)

plt.figure(figsize=(12,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Most of the words that were found fit in with the question, as most people would describe the area, what occurred, as well as who was responsible and potential casualties. This wordcloud produced nothing that we found surprising/ out of the norm. However, it was interesting to see that the biggest word on the wordcloud is "claimed responsibility" which would indicate that a lot of terrorist attacks are being claimed by terrorist groups/organizations. The next question we decided to ask was "what are the words being described for each terrorist group's motives?"

In [66]:
try:
    df["motive"]=df["motive"].astype(str)
except KeyError as ke:
    pass
    
summary_str = " ".join(summ for summ in df.motive)
stopwords = set(STOPWORDS)
stopwords.update(["nan nan", "nan", "sources speculated", "unknown", "sources posited", "Unkown", 
                  "January"])
wordcloud = WordCloud(stopwords=stopwords, background_color="black").generate(summary_str)

plt.figure(figsize=(10,8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

This word cloud also produced nothing out of the ordinary. Some of the words that could be important include sectarian violence, intimidate, protest, death, as well as groups, locations, and affiliations. One important word to note is "larger trend", which indicates the motives of many terrorist attacks are linked together and could be a part of a common ideology and goal.

Graphing¶

The first thing that we decided to plot was the countries that had the most terrorist attacks from 1970, the earliest point the database kept track of, to 2017. We then decided to compare it to the last two years of the database, to see if there were any significant differences.

Top 30 countries with the most terrorist attacks since 1970:¶
In [67]:
df['country_txt'].value_counts(sort=True)[:30].plot.bar()
Out[67]:
<AxesSubplot:>
Top 30 countries with the most terrorist between 2016-2017 (last 2 years in dataset):¶
In [68]:
recent_df = df.loc[df['iyear'] > 2016]
recent_df['country_txt'].value_counts(sort=True)[:30].plot.bar()
Out[68]:
<AxesSubplot:>

It is of note that Iraq blows every other country out of the water when it comes to the number of terrorist attacks in both of these graphs. While some of the top countries in the graph from 1970 - 2017 stay close to their position, almost every other country has a major drop or gain in their positioning in terms of the top 30. Countries like the DRC, Chile, Ukraine( to name a few) all belong in the top 30 of 2016-2017, but not 1980 to 2017. The only one that keeps its position, and with no doubt, is Iraq. This also lines up with Iraq's history of an unstable government, with various groups trying to gain control and authority, and general instability in the area for many years now.

Graphic Terrorist Attacks per Year¶

The next piece of information that we were interested in was the number of terrorist attacks per year, and how they fluctuated from year to year.

In [69]:
#creating a new column for the count of attacks by year
df['year_count'] = df.groupby('iyear')['iyear'].transform('count')

#seaborn plot edits
sns.set_style("darkgrid")
sns.set(rc={"figure.figsize":(12,8)})
sns.set(font_scale=1.75)

#making lineplot
g = sns.lineplot(data=df, x="iyear", y="year_count")
g.set_xlabel("Year")
g.set_ylabel("Number of Terrorist Attacks")
g.set_title("Amount of Terrorist Attacks per Year")
Out[69]:
Text(0.5, 1.0, 'Amount of Terrorist Attacks per Year')
Picture of the Iraq War from NPR

There are many fluctuations on a year-to-year basis, and the biggest thing of note is that there was a huge jump in the number of attacks around 2014. The number of attacks went from around 5000 to around 12000, before peaking around 16000 attacks in one year. The number of attacks since then has started to reduce, but post-2010, terrorist attacks are now a lot more prevalent than they used to be. This could potentially be attributed to the end of the Iraq war, which was in 2011 (where the graph spiked). After the United States ended the war, it is likely terrorist groups and organizations in Iraq became more active. The war in Iraq could also be a reason why there was a dip of attacks in the early 2000s. After the US declared war on Iraq, attacks that were previously classified as terrorist attacks were now just attacks that occurred during the war. As well as that, groups that were under attack by the US during the war were also likely to reduce their activity, for fear of being targeted by the US.

Graphing the Most Deadly Terrorist Organizations¶

In [70]:
#convertin column to string type
df["group_name"]=df["group_name"].astype(str)
#dropping terrorist group names of 'unknown'
threat_df = df.drop(df[df.group_name == "Unknown"].index)
#creating a column: "killsPerAttack" which shows the average amount of deaths per terrorist attack
threat_df['group_success'] = threat_df.groupby(['group_name','nkill'])['nkill'].transform('sum')
threat_df['group_count'] = threat_df.groupby('group_name')['group_name'].transform('count')
pd.set_option('display.max_columns', None)
threat_df["killsPerAttack"] = threat_df["group_success"]/threat_df["group_count"]

#filtering the dataframe by removing duplicate org names and taking the top 25 in sorted order
threat_plot = threat_df.drop_duplicates(subset=['group_name'], keep=False)
threat_plot = threat_plot.sort_values(by=['killsPerAttack'], ascending = False)
threat_plot = threat_plot.head(25)

#creating barchart using seaborn
g = sns.catplot(data=threat_plot, y='group_name',  x='killsPerAttack',kind='bar',
            ci=None, legend_out=True, height = 10, aspect = 1.75, orient = "h")
g.set_axis_labels("Number of Fatalities caused on average per Terrorist Attack", "Terrorist Groups/Organizations", size = 20)
plt.title("Top 25 Most Deadly Terrorist Groups and Organizations", y=1, fontsize = 25)
Out[70]:
Text(0.5, 1, 'Top 25 Most Deadly Terrorist Groups and Organizations')

From the above graph, we can see that some of the deadliest groups are groups that a lot of the United States likely hasn't heard of without significant research. Many people in the United States have likely only heard of groups such as ISIL, Al-Qaeda, and other groups that are commonly covered by news outlets. Neither of the aforementioned groups are present in this list. It is noteworthy to note that many of the groups in the top 30 have some sort of ideological motivation. These motivations include religious beliefs(such as Christianity or Islam), or political(MDJT in Chad). It is also worthwhile to note that Ahmad Jibril, the second bar on this chart, is actually a person. Jibril was a radical islamic speaker, and he and his followers carried out attacks that landed them on this graph. It is important to note that this graph is not plotting groups with the most kills. It is graphing the groups with the most fatalities per attack, which is a different metric.

In [71]:
#heatmap of all terrorist attacks representing amounnt of casualties, hover over the heatmap to inspect the specific 
#terrorist organization

fig = px.density_mapbox(df, lat='latitude', lon='longitude', z='nkill', hover_name="group_name", 
                        mapbox_style="stamen-terrain", zoom=0)

fig.show("notebook")

In terms of North America, the US and Canada have not seen many terrorist attacks. The United States had one major attack (9/11) and the rest are few and far between. Most of the lethal attacks occurred on the East Coast, and most other attacks are sparse, had no casualties, and spread across the US. It can also be noted that while Al-Qaeda does result in quite a big spread for 9/11, many of the other attacks were held by domestic "groups". The word group is used lightly here, as many of these "groups" are not actually organized. While there were some deaths from these attacks, most of them had little to no casualties and were not relevant enough to end up on the heat map.

The same cannot be said for the rest of the world. While there are plenty of attacks that led to no casualties, there are plenty more with 1 or more casualties and the heat map shows as such.

Weapon Usage¶

Terrorism relies on the use of weapons in order to carry out attacks of deadly force, and breaking down weapon usage in attacks holds merit. We decided to use 5 regions: Central American and the Caribbean, North America, The Middle East, and North Africa, Central Asia, and Eastern Europe. Each of these regions has at least one or more "relevant" terrorist groups.

In [72]:
#only using the 5 most interesting/relevant regions
regions = ['Central America & Caribbean', 'North America', 'Middle East & North Africa', 'Central Asia', 'Eastern Europe']
pie_df = df[df['region_txt'].isin(regions)]
pie_df = pie_df[pie_df['weaptype1_txt'] != "Unknown"]
pie_df['weap_count'] = pie_df.groupby(['weaptype1_txt', 'region_txt'])['weaptype1_txt'].transform('count')
pie_df = pie_df.drop_duplicates(subset=['weaptype1_txt', 'region_txt'], keep = 'last')
pie_df


pie1 = pie_df[pie_df['region_txt'] == 'Central America & Caribbean']
pie1
fig = px.pie(pie1, values='weap_count', names='weaptype1_txt', 
             title='Split of Attack Method in Central America & Caribbean')
fig.show("notebook")

pie2 = pie_df[pie_df['region_txt'] == 'North America']
fig = px.pie(pie2, values='weap_count', names='weaptype1_txt', 
             title='Split of Attack Method in North America')
fig.show("notebook")

pie3 = pie_df[pie_df['region_txt'] == 'Middle East & North Africa']
pie3
fig = px.pie(pie3, values='weap_count', names='weaptype1_txt', 
             title='Split of Attack Method in Middle East & North Africa')
fig.show("notebook")
    
pie4 = pie_df[pie_df['region_txt'] == 'Central Asia']
fig = px.pie(pie4, values='weap_count', names='weaptype1_txt', 
             title='Split of Attack Method in Central Asia')
fig.show("notebook")
    
pie5 = pie_df[pie_df['region_txt'] == 'Eastern Europe']
fig = px.pie(pie5, values='weap_count', names='weaptype1_txt', 
             title='Split of Attack Method in Eastern Europe')
fig.show("notebook")
    

These pie charts show a lot of interesting information. The only region where firearms have a majority is in Central America and the Caribbean. In every other region, explosives are the primary attack method. It is also interesting to note that every circle has the same top 4 methods. In no particular order, those 4 are Firearms, Explosives, Incendiary, and Melee. This could potentially be correlated to their ease of access. Compared to chemical or biological agents, explosives(which can be made) firearms(relatively easily acquired), incendiary(can also be made), and melee(no explanation required) are all significantly easier to acquire and could explain why they are more commonly used than weapons that are not as easily acquired.

Some More (Localized) Information¶

Finally, we're going to look at some graphs and maps that are more locally based to us, and we will look at some of the attacks classified as terrorism in the United States.

In [73]:
#Making a dataframe where all attackers are known
threat_df_for_map = df[df['group_name']!= "Unknown"]
#print(threat_df_for_map)
#Making a map and adding points to it.
map_osm_for_US = folium.Map(location=[39.14, -101.2996], zoom_start=4.5)
threat_df_for_US = threat_df_for_map[threat_df_for_map["country_txt"] == "United States"]
#threat_plot = threat_plot.sort_values(by=['killsPerAttack'], ascending = False)
threat_df_for_US = threat_df_for_US.sort_values(by =["Date"],ascending = False)
#threat_df_for_US["group_name"]
aae = 0;
faln = 0;
we = 0;
lwm = 0;
for index, row in threat_df_for_US.iterrows():
    if row["group_name"] == "Anti-Abortion extremists":
        if aae == 50:
            continue
        else: 
            aae = aae + 1
            folium.Marker(location=[row["latitude"], row["longitude"]],
                    popup = row["group_name"] + " " + row["Date"].strftime("%m/%d/%Y"),
                    icon=folium.Icon(color='red')).add_to(map_osm_for_US)
        

            
    if row["group_name"] == "Left-Wing Militants":
        if lwm == 50:
            continue
        else: 
            lwm = lwm + 1
            folium.Marker(location=[row["latitude"], row["longitude"]],
                    popup = row["group_name"] + " " + row["Date"].strftime("%m/%d/%Y") ,
                    icon=folium.Icon(color='blue')).add_to(map_osm_for_US)
            
    if row["group_name"] == "Fuerzas Armadas de Liberacion Nacional (FALN)":
        if faln == 50:
            continue
        else: 
            faln = faln + 1
            folium.Marker(location=[row["latitude"], row["longitude"]],
                    popup = row["group_name"] + " " + row["Date"].strftime("%m/%d/%Y") ,
                    icon=folium.Icon(color='gray')).add_to(map_osm_for_US)
            
    if row["group_name"] == "White extremists":
        if we == 50:
            continue
        else: 
            we = we + 1
            folium.Marker(location=[row["latitude"], row["longitude"]],
                    popup = row["group_name"] + " " + row["Date"].strftime("%m/%d/%Y") ,
                    icon=folium.Icon(color='purple')).add_to(map_osm_for_US)
            
        
map_osm_for_US
Out[73]:
Make this Notebook Trusted to load map: File -> Trust Notebook

This map shows the 50 most recent attacks for the top 4 unique groups in the United States. This was done by first removing every other country from the dataset, isolating and finding out the top 4 groups responsible for attacks, and then sorting the dataframe by descending date. Initially the map had the first 50 of each group that appeared in the dataset, and the dates were all pretty much from the 1970-80s. After sorting, the dates were more in line with what we wanted to see. As a result, we were able to see attacks from as late as 2017 on the map.

While making this map of locations of top groups in the United States, we found it noteworthy that most of the terrorist attacks in the US were not organizations that most people think of when they think of terrorism(Al-Qaeda, ISIL, etc) and were instead undefined groups like "white extremists" or "anti-abortion extremists". This led to the belief that it would be useful to see the top contributors to attacks in the US, vs somewhere like Iraq, where the most attacks occur. Also, while the FALN (Fuerzas Armadas de Liberación Nacional) does have some attacks on Mainland US, most of the attacks were done in the Caribbean, specifically Puerto Rico.

Comparing Groups in the United States vs Groups in Iraq¶

In this section, we are going to compare the groups responsible for the most attacks in their respective country.

In [74]:
#Getting a dataframe that only has the group name and how many occurrences they have
group_counts_for_US = threat_df_for_US['group_name'].value_counts().reset_index()
group_counts_for_US.columns = ['group_name', 'count']
#print(group_counts_for_US)
#Getting the top ten groups in the Us, and printing only those
group_counts_for_US.head(10)
group_counts_for_US = group_counts_for_US[group_counts_for_US['count'] >= 66]
#df.loc[row_index] = df.loc[row_index].rename('new_index_name')
#Renaming group names so that they fit better on the bar chart
group_counts_for_US.at[2, 'group_name'] = 'FALN'
group_counts_for_US.at[4, 'group_name'] = 'NWLF'
group_counts_for_US.at[6, 'group_name'] = 'ALF'
group_counts_for_US.at[7, 'group_name'] = 'JDL'
group_counts_for_US.at[9, 'group_name'] = 'ELF'
group_counts_for_US.at[0, 'group_name'] = 'Anti-Abortion'
group_counts_for_US.at[1, 'group_name'] = 'Left-Wing'
#print(group_counts
#Making the graph for the US
sns.set(rc={"figure.figsize":(14, 12)})
g = sns.barplot(data=group_counts_for_US, x="group_name", y='count')
g.set_xlabel("Group Name", fontsize = 20)
g.set_ylabel("Number of Attacks", fontsize = 20)
g.set_title("Number of Attacks by Terrorist Group in the United States", fontsize = 30)
Out[74]:
Text(0.5, 1.0, 'Number of Attacks by Terrorist Group in the United States')
Picture of Anti-Abortion Activists from NPR

Many of the terrorism groups responsible for attacks in the United States are not actually groups at all. Instead, they are mostly ideologies, like Anti-Abortion, Left-Wing, and White extremists. This shows that there aren't really organized terrorism groups in the US. This is a far cry from the top groups that are found in Iraq. Interestingly, many of them are not religiously based and are instead points of contention that can still be found in American politics today.

In [75]:
threat_df_for_Iraq = threat_df_for_map[threat_df_for_map["country_txt"] == "Iraq"]
group_counts_for_Iraq = threat_df_for_Iraq['group_name'].value_counts().reset_index()
group_counts_for_Iraq.columns = ['group_name', 'count']
group_counts_for_Iraq = group_counts_for_Iraq[group_counts_for_Iraq['count'] >= 20]
group_counts_for_Iraq
group_counts_for_Iraq.at[0,"group_name"] = 'ISIL'
group_counts_for_Iraq.at[1,"group_name"] = 'Al-Qaida'
group_counts_for_Iraq.at[2,"group_name"] = 'ISI'
group_counts_for_Iraq.at[5,"group_name"] = 'T&J'
group_counts_for_Iraq.at[6,"group_name"] = 'JRTN'
group_counts_for_Iraq.at[5,"group_name"] = "JTJ"
group_counts_for_Iraq.at[7,"group_name"] = 'Muslim Ex.'
group_counts_for_Iraq.at[9,"group_name"] = 'MCTR'

group_counts_for_Iraq
g = sns.barplot(data=group_counts_for_Iraq, x="group_name", y='count')
g.set_xlabel("Group Name", fontsize = 20)
g.set_ylabel("Number of Attacks", fontsize = 20)
g.set_title("Number of Attacks by Terrorist Group in Iraq", fontsize = 30)
Out[75]:
Text(0.5, 1.0, 'Number of Attacks by Terrorist Group in Iraq')
Picture of ISIL from The Seattle Times

The top 10 groups responsible for attacks in Iraq is what we expected when we chose to analyze the groups responsible. The top 10 for Iraq consists mostly of religiously affiliated, established terrorism groups. It is also important to note that ISIL/ISIS has been responsible for over 5x as many attacks as the next nearest group (Al-Qaeda). Comparing this bar chart to the previous bar chart, it becomes clear that religious groups have a significantly larger share of attacks, and therefore stronger chokehold, than the US. As previously stated, many of the groups responsible for attacks in the US are not established groups, with clear hierarchies and leaders. In contrast, many of the groups in Iraq do have clear leadership structures and a public figure leading them. This, along with their previously established reputations, allows them to wreak havoc as they do in Iraq and other Middle Eastern countries.

Producing our Machine Learning Model(s)
¶

For our machine learning part of the final tutorial, we decided to try and solve the unknown group problem. While many terrorist organizations have claimed responsibility for a plethora of attacks, there are a lot of attacks in which their was no identified perpetrator/group responsible. By using the random forest algorithm, we tried to categorize each group, and classify their attack patterns based on a number of variables: The country the attack took place in, the goal of the attack(criit1-3), whether it was successful, whether or not it was a suicidal mission, and a couple more. The specifics can be found at the codebook linked above. For both of these models, we used a random forest classifier, made available with the sklearn python package.

A random forest classifier is a machine learning algorithm that belongs to the ensemble learning family. It is a type of decision tree classifier, but it uses multiple decision trees and combines their predictions to improve the overall accuracy of the model. In a random forest classifier, a large number of decision trees are trained on randomly selected subsets of the training data. Each decision tree makes a prediction, and the random forest classifier combines these predictions by taking the mode (for classification) or mean (for regression) of the predictions. This combination of predictions helps to reduce overfitting and improve the generalization performance of the model.

First Attempt (Predicting which group was responsible for an attack)¶

In [76]:
#Tyring to predict the terrorist organization based on selected predictor columns

ml_df = df[df.group_name != 'Unknown']
enc=OneHotEncoder()
pd.set_option('display.max_columns', 50)

#Making the columns categorical and removing any unknowns from the 'ihostkid' and 'intany' columns
df['specificity'] = pd.Categorical(df.specificity)
df['vicinity'] = pd.Categorical(df.specificity)
df['success'] = pd.Categorical(df.success)
ml_df['ishostkid'].replace('-9','0')
ml_df['INT_ANY'].replace('-9','0')

#choosing dependent variable columns
enc_data=pd.DataFrame(enc.fit_transform(ml_df[['extended','country_txt', 'specificity', 'vicinity', 'crit1',
    'crit2','crit3','success','suicide','attacktype1_txt','targtype1_txt', 'guncertain1','weaptype1_txt',
    'property','ishostkid','ransom','INT_ANY']]).toarray())

X = enc_data
y = ml_df["group_name"] #target variable

#training the model
#The max depth determines how deep each tree in the random forest will go down to before it must make a conclusion 
# The number of estimators is the number of decision trees that the classifier will make
SEED = 99
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)
rfc = RandomForestClassifier(n_estimators=100, max_depth=9,random_state=SEED)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)

#evaluation metrics

final_df = pd.DataFrame(classification_report(y_test,y_pred, zero_division = 0,output_dict = True))
final_df = final_df.drop(final_df.index[-1])
final_df.head(50)
Out[76]:
16 January Organization for the Liberation of Tripoli 1920 Revolution Brigades 1st of May Group 2 April Group 20 December Movement (M-20) 23rd of September Communist League 2nd of June Movement 31 January People's Front (FP-31) 9 February 9 May People's Liberation Force Abbala extremists Abd al-Krim Commandos Abdul Qader Husseini Battalions of the Free Palestine movement Abdullah Azzam Brigades Abida Tribe Abkhazian Separatists Abkhazian guerrillas Abu Amarah Battalion Abu Nidal Organization (ANO) Abu Obaida bin Jarrah Brigade Abu Sayyaf Group (ASG) Achik National Cooperative Army (ANCA) Achik National Liberation Army (ANLA) Achik National Volunteer Council-B (ANVC-B) Achik Songna An'pachakgipa Kotok (ASAK) ... Workers' Self-Defense Movement (MAO) World Church of the Creator Xhosa Tribal Workers Yakariya Bango Insurgent Group Yemenis Young Communist League Young Pioneers Youth Action Group Youth for Revolution Youths Zapatista National Liberation Army Zawiya Martyrs Brigade Zebra killers Zeliangrong United Front Zero Tolerance Zimbabwe African Nationalist Union (ZANU) Zimbabwe African People's Union Zimbabwe Guerrillas Zimbabwe Patriotic Front Zulu Miners Zuwar al-Imam Rida leftist guerrillas-Bolivarian militia accuracy macro avg weighted avg
precision 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.818182 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.568792 0.030449 0.417401
recall 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.088235 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.568792 0.026104 0.568792
f1-score 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.159292 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.568792 0.024629 0.448185

3 rows × 1541 columns

Unfortunately, this machine learning model had mixed success at best. It does show some potential, though. For some of the groups, the model was able to produce relatively high values in terms of precision, meaning that it was able to predict the individual group relatively well. However, for many of the groups, there isn't a big enough sample size. In the whole dataset, there are ~ 27 groups with more than 100 attacks. Over 3/4's of the groups involved had attack numbers in the single digits, and there aren't enough data points for the model to learn the behavior of the group, and predict if an attack was conducted by them. This led to the model being unable to predict many of the groups in the dataset. While it was kind of precise with some groups, it had a precision of 0.00 for too many groups in the dataset in order to properly conclude that the model was able to predict groups based on the given dependent variables.

However, even with the 27 groups that had over 100 attacks to train on, the precision of those 27 were all over the place. The model was able to predict some of the groups with high precision, while with other groups, the model was not precise at all, with a precision value of 0.00. Unfortunately, this means that the model was unsuccessful in its goal of predicting a group given variables about the attack.

Second Attempt at a Model (Predicting whether or not an attack was successful)¶

Given that our first model was not a resounding success, we decided to go with a simpler idea. The database contains a column that determines whether a given attack was a success or not. This categorical variable’s value (either a 0 or 1) is determined by whether or not the attack successfully took place. The attack does not necessarily need to complete its objective. The only exception to this are assassination attempts. These attempts require the target to have been killed. Otherwise, the attack would be labeled as a failure. Each specific type of attack has their own description of what is a success and what is a failure, and more details can be found in the codebook linked above.

For the second attempt, we decided to use the same method(random forest classifier) and use 10 fold cross validation in order to determine whether or not the model would be able to successfully predict the outcome of an attack given certain variables.

In [77]:
#Predicting whether the terrorist attack was a success or not

ml1_df=df
ml1_df = df[df.weaptype1_txt != 'Unknown']
pd.set_option('display.max_columns', 50)
ml1_df['ishostkid'].replace('-9','0')
ml1_df['INT_ANY'].replace('-9','0')

#choosing dependent variable columns
enc_data = ml1_df[['success','weaptype1_txt','region_txt',
                   'attacktype1_txt','targtype1_txt', 'guncertain1', 'specificity']]
#one-hot encoding
df_dummies = pd.get_dummies(data=enc_data, columns=enc_data.columns[1:])
y = df_dummies['success']
X = df_dummies.drop(['success'], axis = 1)

#training the model
SEED = 99
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=SEED)
rfc = RandomForestClassifier(n_estimators=5, max_depth=9,random_state=SEED)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)

#evaluation metrics
print(classification_report(y_test,y_pred))

#10-fold cross validation
cv = KFold(n_splits=10, random_state=1, shuffle=True)
scores = cross_val_score(rfc, X, y,cv=cv, n_jobs=-1)
print(scores)
              precision    recall  f1-score   support

           0       0.72      0.24      0.35      2681
           1       0.91      0.99      0.95     21745

    accuracy                           0.91     24426
   macro avg       0.82      0.61      0.65     24426
weighted avg       0.89      0.91      0.88     24426

[0.8940678  0.90770081 0.89670843 0.90407762 0.90241356 0.90241356
 0.90339618 0.89940429 0.89756187 0.89246453]

Unfortunately, it doesn't seem like this model was successful in determing attacks either. While the model was able to predict successful attacks with relatively high precision, and was able to find all relevant information for successes, it was not able to do the same for failures. Predicting failures produces a precision value of .72, and a recall of .24, meaning it was not very precise and it was unable to find relevant information in the dataset going back.

Conclusion and Closing Thoughts
¶

Overall, it was a great learning experience utilizing the Global Terrorism Database while levaraging the knowledge we learned in CMSC320 to produce visuals and graphics that we found intriguing. The only downside of this project is that we weren't able to successfully make predictions for either the group responsible for an attack or whether or not an attack was successful. Hopefully, this tutorial made you a little bit more interested in analyzing data, whether it be about terrorism or something else you find interesting. While we did go through a lot of information through this tutorial, the Global Terrorism database holds magnitudes more information that we didn't begin to touch on in this tutorial, and if you are interested in looking at it yourself, the links can be found under our Explanation header. We hoped you enjoyed looking through or work, and we thank you for taking the time to do so.

Omeed and AJ