import calendar
from collections import Counter
import datetime as dt
import h3
import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np
import networkx as nx
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import pycountry
import scipy as sp
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import spacy
import torch
from transformers import DistilBertForQuestionAnswering, DistilBertTokenizer, pipeline
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
from wordcloud import WordCloud
from ydata_profiling import ProfileReport
import warnings
warnings.filterwarnings('ignore')

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased', return_token_type_ids=True)
model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased-distilled-squad')
qa_pipeline = pipeline('question-answering', model=model, tokenizer=tokenizer)

nlp = spacy.load("en_core_web_sm")

Introduction.

The Armed Conflict Location & Event Data Project (ACLED) is an open-source dataset that collects real-time data on political violence and conflict events worldwide. The dataset consists of actor names, locations, attack times, the types of attack observed, and descriptive notes outlining the attack methodology & results. This project aims to track the dynamics of the attacks over time and to investigate the notes text using NLP to further analyse the outcome of the attacks. As this is a trimmed-down project secondary to my main notebook, its potential impact will be minimal, and it serves only to uncover some minor insights. I am working on this locally and have uploaded as much of the project as my DN subscription's memory will allow (but more will be added here if I feel like pushing my luck).

Due to the condensed nature of this project, the EDA has only unveiled a handful of specific keywords and groups. I have created a general outline of the information and found finer insights into certain groups where processing power would allow for things like NER.

Packages used in this project.

H3's k_ring for spatial analysis (to calculate the distances to neighbouring attacks in a location using concentric rings), not displayed on this public version.

Ydata for profile reporting, descriptive and quantile statistics. This is paired with IpyWidgets for rendering.

NetworkX, which I ordinarily use for graph / network analysis, but this time used to calculate the geospatial data displayed on the Plotly world map.

SpaCy for junk word removal, POS-tagging, count-vectorising and named entity recognition; parsing organisational & human entities from text.

PyCountry for geo-codes.

I resorted to transfer learning to analyse the notes column more efficiently, implementing the question-answering pipeline from HuggingFace's Transformers library. This approach leverages the model's existing knowledge to answer questions based on the context I provide after cleaning the text column, such as:

Question: "What is the main issue where [df['location'] == 'Sidnaya Military Prison']?"

Answer: {'score': 1.69609779732127, 'start': 61, 'end': 68, 'answer': 'torture'}, etc.

And finally, HuggingFace's Pegasus was used to summarise the longest notes in the dataset, for example to convert a 2000 char note into, "A chronology of key events in a police raid on a suspected drug dealer's apartment in New York City, which led to the death of a pregnant woman and the wounding of two officers. Copyright (c) The Vancouver Sun".

The data.

Data cleaning.

Dropping 11 duplicate rows.

df = df.drop_duplicates()

Creating a country code column for geo-visualisation.

country_codes = []
for country in df['country']:
    try:
        country_codes.append(pycountry.countries.lookup(country).alpha_3)
    except LookupError:
        country_codes.append("N/A")
df['country_codes'] = country_codes

Adding day and month abbreviations to the dataframe, dropping the index, the temporary date column and event_type (as there is only one event type), stripping whitespace from column labels, and repositioning the day and month columns so they sit in sequential order next to the event date column.

df['date'] = pd.to_datetime(df['event_date'], errors='coerce')

df['day'] = df['date'].dt.day_name().str[:3]
df['month'] = df['date'].dt.month.apply(lambda x: calendar.month_abbr[x])

df = df.drop(['index', 'date', 'event_type'], axis = 1)

df.columns = df.columns.str.strip()

column_a = df.pop('day')
column_b = df.pop('month')
df.insert(2, 'day', column_a)
df.insert(3, 'month', column_b)

The new dataframe.

Profile report.

(Skipping over the 'ISO', 'latitude', 'longitude', 'Notes', 'Timestamp' and 'geo_precision' columns)

1: Overview.

• 1.1: There are 19 remaining columns with 115,886 observations in total, giving an average memory consumption of 166 B per record, 0% missing information, 5 numeric columns, 7 categorical and 6 text columns.

2: Event date

(The information frame is a scrollable element with a button labelled 'more information', which will return a histogram of fixed-size bins displaying data of increasing frequency running up to 2022)

• 2.1: There are 1,622 distinct dates in the dataset, ranging from the 1st of Jan 2018 to the 10th of June 2022.

3: Weekday.

(Clicking on 'more details' and the tab labeled 'categories' below that returns the daily figures for the armed conflicts)

• 3.1: Wednesday is the most common day overall with a figure of 15%. Monday, Tuesday, Thursday, Sunday, Saturday & Friday follow in that order, descending through a figure range of 14.7% - 13.%.

4: Month.

• 4.1: Similar to the day variables, the month column has its own discernible set of characteristics, with the months of May, Jan, Mar and Apr holding the most distinct figures (between 9.9% and 9.5%).

• 4.2: Again, in descending order, these months are followed by Feb, Jun, Oct, Jul, Aug and Nov, all holding figures between 8.8% and 7.3%.

5: Year.

• 5.1: The year column follows the same pattern as the event date feature, for obvious reasons: we see a general uptick in information from 2018, dipping by a figure of 0.2% in 2019 and taking off considerably in 2021.

6: Sub event type.

• 6.1: Plenty of imbalance in this column, as most sub events are 'attack'. The two remaining sub event types are 'Abduction/forced disappearance' (10.3% of the data) and 'Sexual violence' (1.5%).

7: Actor1.

• 7.1: There are 3,689 distinct threat actors in the data.

• 7.2: The top words for the actor names are "Unidentified Armed Group", "Armed Forces", "Police", "Military", "Militia" and "Mexico".

• 7.2.1: (Which order these words appear in will become more apparent later).

8: Region.

• 8.1: There are 16 distinct regions here.

• 8.2: North America is the most common regional entry with 22.8% of the data. This is followed by South America (15.2%) and the Middle East (11.1%) as the two most distinct features.

• 8.3: Then we see Eastern Africa (8.4%), Western Africa (8.2%), Southeast Asia (7.0%), Middle Africa (6.9%), South Asia (5.1%), Central America (4.7%), Caucasus and Central Asia (2.8%), and Other values (6).

9: Country.

• 9.1: There are 179 distinct countries in the dataset.

• 9.2: Mexico is the most common entry with a count of 22515, 18.2% of the data collected.

• 9.3: In descending order from there, Brazil, Syria, the DRC, Nigeria, the Philippines and India each hold between 9.4% and 2.5% of the entries, with 'other' holding 45%.

10: Location.

• 10.1: There are 29,139 distinct locations.

• 10.2: Among the top location words are "San", "De", "Ciudad", "Tijuana", "Los", "City" and "Manaus".

11: Source.

• 11.1: There are 13,729 distinct sources in the data.

• 11.2: The top word hits here are "News", "Nigeria", "Daily", "g1", "El", "La", "De" and "Undisclosed Source".

12: Source scale.

• 12.1: Source scale consists of 26 distinct features.

• 12.2: The largest share (47.7%) of those are at a National level, followed by Sub-national at 18.4%.

• 12.3: Then we see "Local", "Other", "New media", "Subnational-National", "Local partner-National", "International" and "Regional" making up the rest of the data in a descending range of 8.3% to 1.3%, and "Other values" to the tune of 4.9%.

13: Fatalities.

• 13.1: There are 90 distinct fatality values in the dataset.

• 13.2: With a mean of 1.194, the majority of fatalities are in low figure ranges (attacks on an individual) with the odd large group attack here and there.

• 13.3: Looking at the quantile statistics, the median figure backs that up. The kurtosis figure indicates a very heavy tail in the distribution, and the coefficient of variation (CV) indicates a relatively large spread around the mean. Finally, the histogram says it all...
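For anyone wanting to see what those summary figures measure, here is a minimal sketch in plain Python. The function name and the toy fatality counts are made up for illustration, and it uses population formulas, so the values will not match the profile report's estimators exactly.

```python
import statistics


def describe_fatalities(values):
    """Return mean, median, coefficient of variation and excess kurtosis."""
    n = len(values)
    mean = statistics.fmean(values)
    median = statistics.median(values)
    std = statistics.pstdev(values)  # population standard deviation
    # CV: spread relative to the mean (undefined when the mean is zero).
    cv = std / mean if mean else float("nan")
    # Excess kurtosis: fourth central moment over variance squared, minus 3.
    m4 = sum((v - mean) ** 4 for v in values) / n
    kurtosis = m4 / std ** 4 - 3 if std else float("nan")
    return {"mean": mean, "median": median, "cv": cv, "kurtosis": kurtosis}


# Toy, made-up counts: mostly single-victim events plus one large attack.
toy_fatalities = [1, 1, 1, 0, 1, 2, 1, 0, 1, 40]
stats = describe_fatalities(toy_fatalities)
print(stats)
```

In this toy case a single large event pulls the mean well above the median and pushes both the CV and the kurtosis up, which is the same shape described for the real column.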

Analysis.

Network clusters of actors by geographical location.

(Where "Unidentified" does not appear in the actor title).

I purposefully omitted the continent outlines to emphasise the scale of the world's issues at large. This is a relatively small dataset and only contains armed conflicts, but that being said, the volume of attacks is so apparent that you get an idea of where the continents (and some countries) exist on the map just by seeing the data points on the white background. Although the majority groups aren't included in this map ("Unknown Group / ..."), large clusters can be witnessed in the Middle East, South America and Africa.

Fatality sum per location (including the country's location fatality rate).

Querying locations per country, the sum of fatalities for that country, and the fatality rate for each location as a percentage of the total number of locations for each country in the dataset.

Tijuana's sum of 2,055 fatalities works out at a rate of 8.05% against the Mexican locations present here, so it's evident that Tijuana is only a small part of the puzzle for Mexican authorities.

The figure of 777 fatalities in Caracas accounts for 66.8% of fatalities across all of Venezuela's 148 locations in the dataset.

Likewise, May Caldera's single fatality figure of 660 accounts for 62.3% of Ethiopia's 400 present locations.

SELECT DISTINCT
    df.location AS location,
    SUM(df.fatalities) OVER (PARTITION BY df.location) AS fatality_sum,
    df.country AS country,
    CONCAT(
        ROUND(
            (SUM(df.fatalities) OVER (PARTITION BY df.location) * 1.0
             / COUNT(df.location) OVER (PARTITION BY df.country)) * 100,
            1
        ),
        '%'
    ) AS fatality_rate
FROM df
ORDER BY fatality_sum DESC;
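The arithmetic behind that query can be sketched in plain Python. This is only an illustration on made-up rows: `fatality_rates` and the toy town names are hypothetical, and it assumes each location name belongs to a single country.

```python
from collections import Counter, defaultdict


def fatality_rates(rows):
    """rows: iterable of (location, country, fatalities) tuples."""
    sum_by_location = defaultdict(int)   # SUM(fatalities) per location
    rows_by_country = Counter()          # COUNT(location) per country
    country_of = {}
    for location, country, fatalities in rows:
        sum_by_location[location] += fatalities
        rows_by_country[country] += 1
        country_of[location] = country
    table = []
    for location, total in sum_by_location.items():
        country = country_of[location]
        rate = round(total / rows_by_country[country] * 100, 1)
        table.append((location, total, country, f"{rate}%"))
    # ORDER BY fatality_sum DESC
    return sorted(table, key=lambda row: row[1], reverse=True)


toy_rows = [
    ("Town A", "X", 3), ("Town A", "X", 1),
    ("Town B", "X", 0), ("Town B", "X", 1),
    ("Town C", "Y", 2),
]
fatality_table = fatality_rates(toy_rows)
for row in fatality_table:
    print(row)
```

Because the denominator counts location records per country rather than fatalities, the rate is not bounded by 100%, which is worth keeping in mind when reading the figures.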

The ten most common threat actors & corresponding country of origin.

Actor counts per location.

Actor counts per location show Manaus in Brazil; Tijuana, Monterrey, Leon de los Aldama, Culiacán Rosales and Acapulco de Juarez in Mexico; Hole Camp in Syria; and Guatemala City all holding group values above 100.

A time-series analysis: average monthly fatalities.

The average monthly fatalities spike to around or above 1.5 in July of 2018, November 2020 and March 2022.

There are visible slumps in activity in the autumn of 2018, 2019, 2020 and 2021 before slight or major upticks in activity leading to all three of the aforementioned spikes. Further intelligence reports have recently outlined similar trends, but with no assumptions as to why these slumps & spikes happen. My guess was that the slumps are "planning phases" for strategic attacks.
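As a rough sketch of how a monthly average like this can be computed, assuming "average monthly fatalities" means the mean number of fatalities per recorded event in each calendar month (the function name and toy events below are made up):

```python
from collections import defaultdict
from datetime import date


def average_monthly_fatalities(events):
    """events: iterable of (date, fatalities). Returns {(year, month): mean}."""
    totals = defaultdict(lambda: [0, 0])  # (year, month) -> [fatality sum, event count]
    for event_date, fatalities in events:
        bucket = totals[(event_date.year, event_date.month)]
        bucket[0] += fatalities
        bucket[1] += 1
    # Sort by (year, month) so the result reads chronologically.
    return {month: s / n for month, (s, n) in sorted(totals.items())}


toy_events = [
    (date(2018, 6, 3), 1), (date(2018, 6, 20), 0),
    (date(2018, 7, 1), 2), (date(2018, 7, 15), 1),
]
monthly_mean = average_monthly_fatalities(toy_events)
print(monthly_mean)
```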

The top ten actors creating those spikes in the date ranges of (June - July 2018), (October - November 2021) and (February - March 2022) consist of:

• 1: Unidentified gangs in Mexico.

• 2: Unidentified Armed Groups in Mexico.

• 3: Unidentified Gang and / or Police Militia.

• 4: Military Forces of Russia (2000-).

• 5: Military Forces of Syria (2000-).

• 6: Fulani Ethnic Militia of Nigeria.

• 7: The Allied Democratic Forces.

• 8: Military Forces of Mali (2021-).

• 9: Military Forces of Myanmar (2021-).

• 10: Unidentified Armed Groups in Brazil.

Those worldwide fatality figures translated to a "month-on-month pct change", scrollable using the arrow keys at the bottom of the chart:
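The percentage change itself is simple arithmetic. A minimal sketch on made-up monthly values (the function name and the numbers are illustrative only):

```python
def month_on_month_pct_change(series):
    """series: list of (label, value) pairs in chronological order."""
    changes = []
    for (_, previous), (label, current) in zip(series, series[1:]):
        if previous == 0:
            # Percentage change is undefined when the previous month is zero.
            changes.append((label, None))
        else:
            changes.append((label, (current - previous) / previous * 100))
    return changes


toy_series = [("2018-05", 1.0), ("2018-06", 1.2), ("2018-07", 1.5), ("2018-08", 0.9)]
pct_changes = month_on_month_pct_change(toy_series)
for label, change in pct_changes:
    print(label, None if change is None else round(change, 1))
```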

Sources by volume.

"G1" of Brazil is the top source in the dataset, followed by "Undisclosed source". Milenio of Mexico, the Syrian Observatory for Human Rights and Colombia's Zona Franca are among the five most common media entries. We also see the Facebook & Twitter platforms included here, outlining the requirement for social media in times of crisis.

NLP.

Entities involved in the attacks where actor1 title is "Unidentified Gang (Mexico)".

unidentified_gang_mexico_df = df[df['actor1'] == "Unidentified Gang (Mexico)"]
ner_results = []
for note in unidentified_gang_mexico_df['notes']:
    doc = nlp(note)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    ner_results.append(entities)

org_entities = [entity[0] for sublist in ner_results for entity in sublist if entity[1] == 'ORG']

The NER results for organisations where "Unidentified Gang (Mexico)" is the threat actor return reports including the Sinaloa Cartel, among others.

The ten maximum fatality figures by country and month for the annual range of 2018-2022.

• The DRC, 2020: 600 total fatalities. The actor responsible for the largest singular figure: Nduma Defence of Congo (NDC-R).

• Brazil, 2018: 348 total fatalities. The actor responsible for the largest singular figure: Unidentified Gang and / or Police Militia.

• Burkina Faso, 2019: 270 total fatalities. The actor responsible for the largest singular figure: Military Forces of Burkina.

• Philippines, 2020: 225 total fatalities. The actor responsible for the largest singular figure: Unidentified Clan Militia.

• Nigeria, 2021: 160 total fatalities. The actor responsible for the largest singular figure: Fulani Ethnic Militia.

• Ethiopia, 2019: 157 total fatalities. The actor responsible for the largest singular figure: Unidentified Ethnic Militia.

• South Sudan, 2019: 95 total fatalities. The actor responsible for the largest singular figure: Rek Clan Militia.

• Mexico, 2018: 46 total fatalities. The actor responsible for the largest singular figure: Unidentified Gang.

• Syria, 2019: 31 total fatalities. The actor responsible for the largest singular figure: Military Forces of Syria.

• Mali, 2018: 9 total fatalities. The actor responsible for the largest singular figure: Dozo Communal Militia.

Cleaned dataframe with stopwords removed and POS-tagged junk words.

Common words "kill", "injure" and "attack" also removed.

df_final['cleaned_notes'] = df_final['cleaned_notes'].fillna('')
vectorizer = CountVectorizer(max_features=1000, stop_words='english')
X = vectorizer.fit_transform(df_final['cleaned_notes']).toarray()
feature_names = vectorizer.get_feature_names_out()

Keywords by volume.

Judging by the instances of 'motorcycle' (6195 occurrences) and 'car' (8092 occurrences) etc. in the reports, there is plenty of vehicular use in these attacks. Some of these could be victims in / on vehicles, so this will be worth expanding on. I would assume Brazil, or South America in general holds the largest counts for these entity types.

The word 'Abduct' appears 8340 times, 'Woman' 12,662 times, and 'Man' 42,747 times.
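Counts like these boil down to tallying whole tokens across the cleaned notes. The sketch below is a plain-Python stand-in for the vectoriser step, not the code used above; the function name and toy notes are made up.

```python
import re
from collections import Counter


def keyword_counts(notes, keywords):
    """Count whole-token occurrences of each keyword across all notes."""
    counts = Counter()
    for note in notes:
        tokens = re.findall(r"[a-z]+", note.lower())
        counts.update(token for token in tokens if token in keywords)
    return counts


toy_notes = [
    "gunman motorcycle shoot man car",
    "woman abduct car",
    "man body car motorcycle",
]
counts = keyword_counts(toy_notes, {"car", "motorcycle", "man", "woman", "abduct"})
print(counts.most_common())
```

Matching on whole tokens matters here: a substring search for 'man' would also hit 'woman' and 'gunman' and inflate the figure.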

Vehicular-based entity counts per country.

The majority of these correlate with "Unidentified armed groups", "Unidentified gangs" and "Unidentified Gang and/or Police Militia" in each country (insights also uncovered using DistilBERT):

• Mexico: 'vehicle' (1524), 'car' (3606), 'motorcycle' (1490).

• Brazil: 'vehicle' (183), 'car' (2671), 'motorcycle' (2772).

• Puerto Rico: 'vehicle' (142), 'car' (129).

• Mali: 'vehicle' (138).

• Colombia: 'vehicle' (113), 'car' (109), 'motorcycle' (170).

Keyword analysis per country.

Identifying the ten most common countries and their prominent keywords.

Purely looking at the most common keywords for each country now, we can get a sense of the attack styles. Afghanistan's attacks primarily involve militia and civilians. Attacks in Myanmar involve governmental / military figures. Colombia's entities are mostly unidentified, possibly due to the nature of local cartels. Civilians and villages are common targets in the DRC. Police in India and the Philippines feature heavy involvement, with drugs appearing related to the latter. Abductions in Nigeria are always common methods of fund-raising for terrorist entities, and Syria's "civilian" keyword suggests military / militia involvement, especially in the countryside.

• Afghanistan: civilian, report, unknown, shoot, militant.

• Brazil: man, shoot, armed, motivation, fatality.

• Myanmar: military, village, shoot, region, arrest.

• Colombia: armed, area, victim, fatality, unidentified.

• DRC: man, armed, unidentified, village, civilian.

• India: police, village, unidentified, shoot, man.

• Mexico: fatality, man, body, shoot, armed.

• Nigeria: abduct, unidentified, armed, resident, gunman.

• Philippines: police, suspect, drug, raid, unidentified.

• Syria: countryside, civilian, unidentified, shoot, man.
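A per-country keyword ranking of this kind can be sketched as a counter per country. This is an illustrative stand-in on made-up rows, with a hypothetical function name, and it assumes the notes are already cleaned and whitespace-tokenised.

```python
from collections import Counter, defaultdict


def top_keywords_by_country(rows, k=5):
    """rows: iterable of (country, cleaned_note). Returns {country: top-k words}."""
    per_country = defaultdict(Counter)
    for country, cleaned_note in rows:
        per_country[country].update(cleaned_note.split())
    return {
        country: [word for word, _ in word_counts.most_common(k)]
        for country, word_counts in per_country.items()
    }


toy_rows = [
    ("Nigeria", "abduct armed resident"),
    ("Nigeria", "abduct gunman"),
    ("Philippines", "police drug raid"),
    ("Philippines", "police suspect"),
]
top_words = top_keywords_by_country(toy_rows, k=1)
print(top_words)
```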

Identifying entities where keyword == 'woman'.

df_woman_mentions = df_final[df_final['cleaned_notes'].str.contains('woman', case=False, na=False)]
df_woman_mentions[['cleaned_notes']].drop_duplicates()

Separating person NER entities from the above dataframe.

Armed kidnapping is the most common entity related to 'woman' in the NER results. We also see 'Gang member shoot', 'ransom hostage' and 'homeless tran'.

SELECT
    df_ner_results_women.Entity AS Entity,
    COUNT(df_ner_results_women.Entity) AS Count
FROM df_ner_results_women
WHERE df_ner_results_women.Label = 'PERSON'
GROUP BY df_ner_results_women.Entity
ORDER BY Count DESC
LIMIT 20;

The most unsafe areas for women.

After a little SQL querying of the NER entities and further digging into the location data, "Colonia" is the top hit and is associated with Tijuana as the least safe place for a woman. Yangon in Myanmar is almost level-pegging with Tijuana.

NER results where label == 'GPE':

Expanding on those entities with the addition of their corresponding locations:

Resulting instances from Tijuana where 'woman' in text.

Using DistilBERT's QA model to query the data. The women here are primarily gunshot victims, and more soberingly (possibly due to being narco-related)...

context = " ".join(df_colonia_and_woman_instances_notes)
question = "Where are female bodies mostly found in Colonia?"  # "What is the primary cause of death in Tijuana?"
answer = qa_pipeline({'question': question, 'context': context})
print(answer)

Occurrences of political entities in the data over the years.

(These are accounts directly involving civil servants, not to say that any other attacks in the dataset aren't politically-motivated).

Querying the NER results to investigate a range of political keywords and visualising them according to which year in the dataset they occur by count.

• 2018 is the leading year for reports of politically-related attacks, followed quite closely by 2021.

• 2019 and 2020 see a notable slump in reports of this nature, with 2022 seeing the absolute least.
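A yearly tally like this can be sketched as follows. The term list is a hypothetical stand-in (the actual range of political keywords is not listed here), and the entity rows are made up.

```python
from collections import Counter

# Hypothetical keyword list, for illustration only.
POLITICAL_TERMS = {"mayor", "councillor", "governor", "senator"}


def political_mentions_by_year(entities):
    """entities: iterable of (year, entity_text). Returns {year: count}."""
    counts = Counter()
    for year, text in entities:
        tokens = set(text.lower().split())
        if tokens & POLITICAL_TERMS:  # at least one political term present
            counts[year] += 1
    return dict(sorted(counts.items()))


toy_entities = [
    (2018, "Mayor of X"),
    (2018, "former governor"),
    (2019, "local farmer"),
    (2021, "Senator Y"),
]
mentions = political_mentions_by_year(toy_entities)
print(mentions)
```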

N-Grams in reports posted on Twitter and Facebook.

Using the Tf-idf vectoriser to parse important n-grams of length 8 from the notes column where [source == 'Facebook'] and [source == 'Twitter'] returns some of the same topics almost word-for-word. This means either that a media source's social media team has posted the same article on both platforms to raise maximum awareness of the attack, that different sources have posted a report on the same incident, or both.
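The word-for-word overlap can be illustrated without Tf-idf weighting at all. The sketch below is a simplification, not the vectoriser approach described above: it just intersects the sets of 8-word n-grams from two groups of notes. The function names and both toy notes are made up.

```python
def word_ngrams(text, n=8):
    """Return the set of n-word sequences in a text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngrams(notes_a, notes_b, n=8):
    """Return n-grams that appear in both collections of notes."""
    grams_a = set().union(*(word_ngrams(note, n) for note in notes_a))
    grams_b = set().union(*(word_ngrams(note, n) for note in notes_b))
    return grams_a & grams_b


toy_facebook = ["on 3 may armed men attacked a village market and burned several shops"]
toy_twitter = ["reports say armed men attacked a village market and burned several shops overnight"]
shared = shared_ngrams(toy_facebook, toy_twitter)
for gram in sorted(shared):
    print(gram)
```

Any non-empty result at a length of 8 words is a strong hint that both platforms carry the same report, since independent write-ups rarely share such long exact sequences.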

Answer: "Property Destruction"
