Introduction.
The Armed Conflict Location & Event Data Project (ACLED) is an open-source dataset that collects real-time data on political violence and conflict events worldwide. The dataset consists of actor names, locations, attack times, the types of attack observed, and descriptive notes outlining the attack methodology and results. This project aims to track the dynamics of the attacks over time and to investigate the notes text with NLP to further analyse the outcome of the attacks. As this is a trimmed-down project secondary to my main notebook, its potential impact will be minimal, and it serves only to uncover some minor insights. I am working on this locally and have uploaded as much of the project as my DN subscription's memory will allow (but more will be added here if I feel like pushing my luck).
Due to the condensed nature of this project, the EDA has only unveiled a handful of specific keywords and groups. I have created a general outline of the information and found finer insights into certain groups where processing power would allow for things like NER.
Packages used in this project.
H3's k_ring for spatial analysis (to calculate the distances to neighbouring attacks in a location using concentric rings), not displayed on this public version.
Ydata for profile reporting, descriptive and quantile statistics. This is paired with IpyWidgets for rendering.
NetworkX, which I ordinarily use for graph / network analysis, but which I used this time around to calculate the geospatial data displayed on the Plotly world map.
SpaCy for junk word removal, POS-tagging, count-vectorising and named entity recognition; parsing organisational & human entities from text.
PyCountry for geo-codes.
I resorted to transfer learning to further analyse the notes column more efficiently, implementing the question_answering_pipeline from HuggingFace's Transformers library. This approach leveraged the model's existing knowledge to answer questions based on the context I provided after cleaning the text column, such as:
Question: "What is the main issue where [df['location'] == 'Sidnaya Military Prison']?"
Answer: {'score': 1.69609779732127, 'start': 61, 'end': 68, 'answer': 'torture'}, etc.
And finally, HuggingFace's Pegasus was used to summarise the longest notes in the dataset, for example to convert a 2,000-character note into "A chronology of key events in a police raid on a suspected drug dealer's apartment in New York City, which led to the death of a pregnant woman and the wounding of two officers. Copyright (c) The Vancouver Sun".
The data.
Data cleaning.
Dropping 11 duplicate rows.
Creating a country code column for geo-visualisation.
Adding day and month abbreviations to the dataframe, dropping the index, date and event_type columns (as there is only one event type), stripping whitespace from column labels and repositioning the date columns so they sit in sequential order next to the date column.
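The cleaning steps above can be sketched roughly as follows. This is a plain-Python illustration rather than the notebook's actual code (which presumably uses pandas); the sample rows, the column names and the helper name `clean` are all assumptions made for the example.

```python
from datetime import date

# Fixed abbreviation tables, so the output does not depend on the system locale.
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Hypothetical rows standing in for the raw export (note the padded labels).
rows = [
    {" event_date": date(2018, 1, 1), "country ": "Mexico", "fatalities": 2},
    {" event_date": date(2018, 1, 1), "country ": "Mexico", "fatalities": 2},  # duplicate
    {" event_date": date(2022, 6, 10), "country ": "Brazil", "fatalities": 0},
]

def clean(records):
    """Strip column labels, drop exact duplicates, add weekday/month/year columns."""
    seen = set()
    cleaned = []
    for record in records:
        # Strip stray whitespace from the column labels.
        row = {key.strip(): value for key, value in record.items()}
        # A sorted tuple of (label, value) pairs is hashable and marks rows already seen.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        day = row.pop("event_date")
        # Rebuild the row so the date-derived columns sit next to the date itself.
        ordered = {
            "event_date": day,
            "weekday": DAYS[day.weekday()],
            "month": MONTHS[day.month - 1],
            "year": day.year,
        }
        ordered.update(row)
        cleaned.append(ordered)
    return cleaned

result = clean(rows)
```

Here `result` holds two rows, since the exact duplicate is dropped, and each row starts with the date followed by its derived weekday, month and year.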
The new dataframe.
Profile report.
(Skipping over the 'ISO', 'latitude', 'longitude', 'Notes', 'Timestamp' and 'geo_precision' columns)
1: Overview.
• 1.1: There are 19 remaining columns and 115,886 observations in total, with an average memory consumption of 166 B per record and 0% missing information. There are 5 numeric, 7 categorical and 6 text columns.
2: Event date
(The information frame is a scrollable element with a button labelled 'more information', which will return a histogram of fixed-size bins displaying data of increasing frequency running up to 2022)
• 2.1: There are 1,622 distinct dates in the dataset, ranging from the 1st of Jan 2018 to the 10th of June 2022.
3: Weekday.
(Clicking on 'more details' and the tab labelled 'categories' below that returns the daily figures for the armed conflicts)
• 3.1: Wednesday is the most common day overall with a figure of 15%. Monday, Tuesday, Thursday, Sunday, Saturday and Friday follow in that order, descending from 14.7% to roughly 13%.
4: Month.
• 4.1: Similar to the day variables, the month column has its own discernible set of characteristics, with the months of May, Jan, Mar and Apr holding the highest shares (between 9.9% and 9.5%).
• 4.2: Again, in descending order, these months are followed by Feb, Jun, Oct, Jul, Aug and Nov, all holding figures between 8.8% and 7.3%.
5: Year.
• 5.1: As expected, the pattern mirrors the event date feature: we see a general uptick in information from 2018, dipping by 0.2% in 2019 and taking off considerably in 2021.
6: Sub event type.
• 6.1: There is plenty of imbalance in this column, as the dominant sub event type is 'attack'. The two remaining sub event types are 'Abduction/forced disappearance' (10.3% of the data) and 'Sexual violence' (1.5%).
7: Actor1.
• 7.1: There are 3,689 distinct threat actors in the data.
• 7.2: The top words for the actor names are "Unidentified Armed Group", "Armed Forces", "Police", "Military", "Militia" and "Mexico".
• 7.2.1: (Which order these words appear in will become more apparent later).
8: Region.
• 8.1: There are 16 distinct regions here.
• 8.2: North America is the most common regional entry with 22.8% of the data, followed by South America (15.2%) and the Middle East (11.1%).
• 8.3: Then we see Eastern Africa (8.4%), Western Africa (8.2%), Southeast Asia (7.0%), Middle Africa (6.9%), South Asia (5.1%), Central America (4.7%), Caucasus and Central Asia (2.8%), and Other values (6).
9: Country.
• 9.1: There are 179 distinct countries in the dataset.
• 9.2: Mexico is the most common entry with a count of 22,515, 18.2% of the data collected.
• 9.3: In descending order from there, Brazil, Syria, the DRC, Nigeria, the Philippines and India each hold between 9.4% and 2.5% of the entries, with 'other' holding 45%.
10: Location.
• 10.1: There are 29,139 distinct locations.
• 10.2: Among the top location words are "San", "De", "Ciudad", "Tijuana", "Los", "City" and "Manaus".
11: Source.
• 11.1: There are 13,729 distinct sources in the data.
• 11.2: The top word hits here are "News", "Nigeria", "Daily", "g1", "El", "La", "De" and "Undisclosed Source".
12: Source scale.
• 12.1: Source scale consists of 26 distinct values.
• 12.2: The largest share (47.7%) of those are at a National level, followed by Sub-national at 18.4%.
• 12.3: Then we see "Local", "Other", "New media", "Subnational-National", "Local partner-National", "International" and "Regional" making up the rest of the data in a descending range of 8.3% - 1.3%, with "Other values" accounting for 4.9%.
13: Fatalities.
• 13.1: There are 90 distinct fatality counts in the dataset.
• 13.2: With a mean of 1.194, the majority of fatalities are in low figure ranges (attacks on an individual) with the odd large group attack here and there.
• 13.3: Looking at the quantile statistics, the median figure backs that up. The kurtosis figure indicates a very heavy tail in the distribution and the coefficient of variation (CV) indicates a relatively large spread around the mean. Finally, the histogram says it all...
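To make the fatalities statistics above more concrete, here is a minimal sketch of how the mean, median, CV and kurtosis relate on a heavily skewed column. The sample values are invented for illustration, and the kurtosis here is the simple population excess kurtosis, which is an assumption and may differ slightly from the bias-corrected figure the profile report shows.

```python
from statistics import mean, median, pstdev

def describe(values):
    """Return mean, median, coefficient of variation and excess kurtosis."""
    m = mean(values)
    sd = pstdev(values)  # population standard deviation
    # Fourth standardised moment minus 3: large positive values mean a heavy tail.
    fourth_moment = sum((v - m) ** 4 for v in values) / len(values)
    return {
        "mean": m,
        "median": median(values),
        "cv": sd / m,  # spread relative to the mean
        "excess_kurtosis": fourth_moment / sd ** 4 - 3,
    }

# Hypothetical fatality counts: mostly 0-2 per event, plus one large attack.
fatalities = [0] * 40 + [1] * 50 + [2] * 8 + [5, 120]
stats = describe(fatalities)
```

With data shaped like this, a single large event pulls the mean above the median and inflates both the CV and the kurtosis, which is the pattern described in 13.2 and 13.3.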
Analysis.
Network clusters of actors by geographical location.
(Where "Unidentified" does not appear in the actor title).
I purposefully omitted the continent outlines to emphasise the scale of the world's issues at large. This is a relatively small dataset and only contains armed conflicts, but that being said, the volume of attacks is so apparent that you get an idea of where the continents (and some countries) exist on the map just by seeing the data points on the white background. Although the majority groups aren't included in this map ("Unknown Group / ..."), large clusters can be witnessed in the Middle East, South America and Africa.
Fatality sum per location (including the country's location fatality rate).
Querying the locations per country, the sum of fatalities for each location, and each location's fatalities as a percentage of the total fatalities for its country in the dataset.
So with a sum of 2,055 fatalities, contributing 8.05% of the fatalities across the Mexican locations present here, it's evident that Tijuana is only a small part of the puzzle for Mexican authorities.
The figure of 777 fatalities in Caracas accounts for 66.8% of fatalities across all of Venezuela's 148 locations in the dataset.
Likewise, May Caldera's single fatality figure of 660 accounts for 62.3% of fatalities across Ethiopia's 400 locations in the dataset.
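The percentages above can be reproduced with a calculation along these lines. This is a plain-Python sketch under the assumption that the rate is a location's fatalities divided by its country's total fatalities; the function name `location_share` and the placeholder rows are hypothetical.

```python
from collections import defaultdict

def location_share(records):
    """Map (country, location) to (fatalities, % of that country's fatalities)."""
    per_location = defaultdict(int)
    per_country = defaultdict(int)
    for country, location, fatalities in records:
        per_location[(country, location)] += fatalities
        per_country[country] += fatalities
    shares = {}
    for (country, location), total in per_location.items():
        country_total = per_country[country]
        # Guard against countries whose events recorded no fatalities at all.
        pct = round(100 * total / country_total, 2) if country_total else 0.0
        shares[(country, location)] = (total, pct)
    return shares

# Placeholder events: (country, location, fatalities).
events = [
    ("CountryA", "Town1", 6),
    ("CountryA", "Town1", 2),
    ("CountryA", "Town2", 2),
    ("CountryB", "Town3", 5),
]
shares = location_share(events)
```

In this toy data, Town1 accounts for 8 of CountryA's 10 fatalities, so its share is 80%.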
The ten most common threat actors & corresponding country of origin.
Actor counts per location.
Actor counts per location show Manaus in Brazil; Tijuana, Monterrey, Leon De Los Aldama, Culiacán Rosales and Acapulco de Juarez in Mexico; Hole Camp in Syria; and Guatemala City all holding group values above 100.
A time-series analysis: average monthly fatalities.
The average monthly fatalities spike to 1.5 or above in July 2018, November 2020 and March 2022.
There are visible slumps in activity in the autumn of 2018, 2019, 2020 and 2021 before slight or major upticks in activity leading to all three of the aforementioned spikes. Further intelligence reports have recently outlined similar trends, but with no assumptions as to why these slumps and spikes happen. My guess is that the slumps are "planning phases" for strategic attacks.
The top ten actors creating those spikes in the date ranges of (June - July 2018), (October - November 2021) and (February - March 2022) consist of:
• 1: Unidentified gangs in Mexico.
• 2: Unidentified Armed Groups in Mexico.
• 3: Unidentified Gang and / or Police Militia.
• 4: Military Forces of Russia (2000-).
• 5: Military Forces of Syria (2000-).
• 6: Fulani Ethnic Militia of Nigeria.
• 7: The Allied Democratic Forces.
• 8: Military Forces of Mali (2021-).
• 9: Military Forces of Myanmar (2021-).
• 10: Unidentified Armed Groups in Brazil.
Those worldwide fatality figures translated to a "month-on-month pct change", scrollable using the arrow keys at the bottom of the chart:
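As a rough illustration of the two time-series steps in this section, the sketch below averages fatalities per month and then converts those averages to a month-on-month percentage change. It is plain Python with invented sample records, not the notebook's own code; the function names and the "YYYY-MM" month keys are assumptions.

```python
from collections import defaultdict

def monthly_mean(records):
    """records: iterable of ("YYYY-MM", fatalities). Returns {month: mean}, oldest first."""
    totals = defaultdict(lambda: [0, 0])  # month -> [sum of fatalities, event count]
    for month, fatalities in records:
        totals[month][0] += fatalities
        totals[month][1] += 1
    # "YYYY-MM" strings sort chronologically, so a plain sort orders the months.
    return {month: s / n for month, (s, n) in sorted(totals.items())}

def pct_change(series):
    """Percentage change of each month relative to the previous one."""
    months = list(series)
    changes = {}
    for prev, cur in zip(months, months[1:]):
        base = series[prev]
        # A zero base has no defined percentage change, so it is reported as None.
        changes[cur] = None if base == 0 else 100 * (series[cur] - base) / base
    return changes

# Invented sample: two events per month.
records = [
    ("2018-06", 1), ("2018-06", 1),
    ("2018-07", 3), ("2018-07", 0),
    ("2018-08", 1), ("2018-08", 2),
]
averages = monthly_mean(records)
changes = pct_change(averages)
```

The first month has no predecessor, so it does not appear in `changes`.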
Sources by volume.
"G1" of Brazil is the top source in the dataset, followed by "Undisclosed source". Milenio of Mexico, the Syrian Observatory for Human Rights and Colombia's Zona Franca are among the five most common media entries. We also see the Facebook and Twitter platforms included here, underlining the role of social media in times of crisis.
NLP.
Entities involved in the attacks where actor1 title is "Unidentified Gang (Mexico)".
The NER processing results for organisations where "Unidentified Gang (Mexico)" is the threat actor return reports that include the Sinaloa Cartel, among others.
The ten maximum fatality figures by country and month for the annual range of 2018-2022.
• The DRC, 2020: 600 total fatalities. The actor responsible for the largest singular figure: Nduma Defence of Congo (NDC-R).
• Brazil, 2018: 348 total fatalities. The actor responsible for the largest singular figure: Unidentified Gang and / or Police Militia.
• Burkina Faso, 2019: 270 total fatalities. The actor responsible for the largest singular figure: Military Forces of Burkina.
• Philippines, 2020: 225 total fatalities. The actor responsible for the largest singular figure: Unidentified Clan Militia.
• Nigeria, 2021: 160 total fatalities. The actor responsible for the largest singular figure: Fulani Ethnic Militia.
• Ethiopia, 2019: 157 total fatalities. The actor responsible for the largest singular figure: Unidentified Ethnic Militia.
• South Sudan, 2019: 95 total fatalities. The actor responsible for the largest singular figure: Rek Clan Militia.
• Mexico, 2018: 46 total fatalities. The actor responsible for the largest singular figure: Unidentified Gang.
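A ranking like the one above could be produced along the following lines. This is a hedged plain-Python sketch that assumes the grouping is by country and year (as the list entries suggest) and that "largest singular figure" means the single event with the most fatalities; the function name and the placeholder records are hypothetical.

```python
from collections import defaultdict

def top_country_years(records, n=10):
    """records: iterable of (country, year, actor, fatalities).

    Returns up to n tuples (country, year, total fatalities, actor behind the
    largest single event), ordered by total fatalities, highest first.
    """
    totals = defaultdict(int)
    worst = {}  # (country, year) -> (actor, fatalities) of the largest single event
    for country, year, actor, fatalities in records:
        key = (country, year)
        totals[key] += fatalities
        if key not in worst or fatalities > worst[key][1]:
            worst[key] = (actor, fatalities)
    ranked = sorted(totals, key=totals.get, reverse=True)[:n]
    return [(c, y, totals[(c, y)], worst[(c, y)][0]) for c, y in ranked]

# Placeholder events, not real figures.
sample = [
    ("CountryA", 2020, "Actor X", 40),
    ("CountryA", 2020, "Actor Y", 10),
    ("CountryB", 2019, "Actor Z", 30),
    ("CountryA", 2021, "Actor Y", 5),
]
ranking = top_country_years(sample, n=2)
```

For the placeholder data, CountryA in 2020 ranks first with 50 fatalities in total, and Actor X is credited with the largest single event.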