Automation of Violent Activity Recognition Utilising CCTV Video Data

Parris, Matthew M. G. (2024) Automation of Violent Activity Recognition Utilising CCTV Video Data. Doctoral thesis, The University of Buckingham.

Text
Final Thesis_Parris_M_Towards_The_Automation_Of_Violetn_Activty_Recognition-5.pdf - Submitted Version
Available under License Creative Commons Attribution Non-commercial No Derivatives.


Abstract

Security trends identify violence as a significant issue plaguing society globally. Statistics depict alarming levels of violence, establishing it as a momentous challenge for homeland security and defence institutions, predominantly in schools and other public locations. State-of-the-art closed-circuit television (CCTV) surveillance solutions exist to help limit the manifestations of violence and its impact. However, most institutions lack proper analysis mechanisms that lead to prevention, apprehension, or conviction in a timely fashion. Manually monitoring and collectively analysing the anthropometric data generated by CCTV surveillance devices has proved impractical and time-consuming, and its outcome increases the complexity of identifying violent behavioural patterns as substantial evidence. Despite innovative CCTV sensor improvements, the challenge of adequately analysing vast amounts of CCTV data adds to the monitoring burden. This thesis proposes the amalgamation of the "You Only Look Once version five medium" model (YOLOv5m) and the Three-Dimensional Convolutional Neural Network, single level (3DCNNsl), two state-of-the-art artificial intelligence models incorporating weight-embedding procedures to identify primitive stages of violence and weapon artefacts. The approach integrates classification support to confirm the existence of specific weapon objects of interest (knives, bladed instruments, clubs, and guns) belonging to a specific class of violence (beating, shooting, stabbing). It also validates the presence of primitive stages of violence by using the existence of weapons belonging to each category group to infer the activity outcome. Utilising classification-support concepts to validate the existence of primitive stages of violence enhances the classification outcome of violent activity recognition with robust results.
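The weapon-to-violence-class mapping described above can be sketched as follows. This is a minimal illustration of the classification-support idea only; the class names and the `validate_activity` helper are illustrative assumptions, not code from the thesis.

```python
# Hypothetical sketch of classification support: a detected weapon
# artefact confirms (or fails to confirm) a predicted violence class.
# Class names and groupings are illustrative assumptions.

# Each violence class is supported by the weapon artefacts in its group.
WEAPON_SUPPORT = {
    "stabbing": {"knife", "bladed_instrument"},
    "beating": {"club"},
    "shooting": {"gun"},
}

def validate_activity(predicted_class, detected_weapons):
    """Return True when at least one detected weapon belongs to the
    category group of the predicted violence class."""
    support = WEAPON_SUPPORT.get(predicted_class, set())
    return any(weapon in support for weapon in detected_weapons)

print(validate_activity("stabbing", ["knife"]))  # True
print(validate_activity("shooting", ["club"]))   # False
```

A detection that fails validation would then be treated as unsupported, rather than confirmed as the primitive stage of that violence class.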
This thesis commenced with a two-stage literature investigation to satisfy the research objectives, which identified the state-of-the-art 3DCNNsl at stage one and the YOLOv5m framework for activity and artefact recognition towards violence at stage two. The proposed one-stage solution (simultaneously performing object localisation and classification) combines the models' processing, reducing the impact of their architectural limitations. 3DCNNsl facilitates behavioural pattern classification, generically associating sub-class labels that suggest the presence of violence at high accuracy. In addition, the YOLOv5m architecture serves two functions: operating in an activity recognition capacity to fortify the 3DCNNsl activity output, and detecting artefacts that establish the presence of weapons, enhancing the action classification and overall accuracy. The thesis optimised the deep learning model selection by identifying violence in scenarios and validating its presence through a redundant weapon-artefact classification weight-embedding procedure. The concept allows the classification of violence in its primitive stages before its impact escalates to lethal outcomes. The proposal was extensively reviewed via transfer learning in multiple fusion scenarios to identify the optimal strategies for realising the research objective. The evaluation dataset encompassed samples drawn from the University of Central Florida (UCF) dataset and several social media forums. The violent action samples reflect multifaceted real-world scenarios with sporadic, accelerated motion attributes in various environments, which helps reduce the risk of biased results and improves the model's robustness. The thesis makes three contributions:

1. Conducted performance testing of two known machine learning techniques (YOLOv5m and 3DCNNsl) in independently recognising violent and non-violent activities in CCTV video footage.
2. Demonstrated violent activity recognition performance in such videos when both machine learning techniques operate in tandem.
3. Implemented performance enhancement by further incorporating threat object detection in the combined solution.

Contribution one disclosed the effectiveness of YOLOv5m activity recognition at 74% and the state-of-the-art 3DCNN at 75%, conceding high misclassification rates on data with and without augmentations and resolution modifications. These results emphasised the need to explore alternative processing measures to alleviate the disadvantages of the two machine learning models. Contribution two demonstrated the effectiveness of fusion enhancement via decision-level voting at 85.20%, over 3DCNNsl and YOLOv5m activity recognition alone. As a validation strategy, the operations incorporated surplus data comprising 50 samples designed to increase the classification complexity; the approach was rigorously appraised, confirming its applicability. Contribution three showcased the amalgamation of fusion's activity recognition and object detection, establishing the effectiveness of concatenating weight embedding. The experiments maintained data consistency with contribution two. Analysis disclosed the dominance of fusion incorporating threat object detection at 88.20% over 3DCNNsl, YOLOv5m activity recognition, and fusion without threat-object enhancement. The results underscore the robustness of the proposed method, which has proven its classification competence, particularly in scenarios with surplus data, from an overall accuracy perspective.
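The decision-level voting used in contribution two can be sketched as below. The confidence-weighted voting scheme, model names, and class labels are illustrative assumptions, not the exact fusion procedure from the thesis.

```python
# Hypothetical sketch of decision-level fusion: each model votes for a
# class with a confidence score, and the class with the highest summed
# (optionally weighted) confidence wins. Weights and labels are
# illustrative assumptions, not values from the thesis.
from collections import defaultdict

def decision_level_vote(predictions, model_weights=None):
    """predictions maps model name -> (class label, confidence).
    Returns the fused class label."""
    model_weights = model_weights or {}
    scores = defaultdict(float)
    for model, (label, confidence) in predictions.items():
        scores[label] += model_weights.get(model, 1.0) * confidence
    return max(scores, key=scores.get)

fused = decision_level_vote(
    {"yolov5m": ("violent", 0.81), "3dcnnsl": ("non_violent", 0.55)}
)
print(fused)  # "violent"
```

With equal model weights this reduces to comparing summed confidences; unequal weights would let one model dominate when its per-class accuracy is known to be higher.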
While the thesis discusses the efficiency of individual processing compared with fusion without support, the research accentuates the effectiveness of integrating classification redundancy through weight embedding, whereby the presence of artefacts confirms the occurrence of violent actions. The findings show the proposed method achieving 85.20% without artefact processing, while incorporating threat-object support analysis, concatenating weapon detections (knife, club, and gun) in the videos, improved the accuracy to 88.20%. This evidence substantiates the solution's robustness, fulfilling the research objectives and concluding the investigations.

Item Type: Thesis (Doctoral)
Uncontrolled Keywords: Security ; violence ; artificial intelligence ; "You Only Look Once version five medium" model ; Three Dimensional Convolution Neural Network Single level ; classification ; artefact recognition.
Subjects: H Social Sciences > HN Social history and conditions. Social problems. Social reform
Q Science > Q Science (General)
T Technology > T Technology (General)
Divisions: School of Computing
Depositing User: Freya Tyrrell
Date Deposited: 28 Aug 2025 15:45
Last Modified: 28 Aug 2025 15:45
URI: http://bear.buckingham.ac.uk/id/eprint/704
