Automatic Identification of Relevant Moments in Security Force Videos Using Multimodal Analysis
DOI: https://doi.org/10.22456/2175-2745.143487
Keywords: Egocentric Vision, Video Understanding, Semantic Information, Security Forces
Abstract
Due to the increasing requirement for police officers to wear body cameras, there is a growing need for algorithms that can automatically detect relevant moments in the resulting footage. This paper presents an automated system that uses audio and video inputs to highlight key events in recordings, reducing the need for operators to watch entire videos. Our method applies object detection to find firearms and crowd gatherings, and pose estimation to identify people raising their hands. On the audio side, it detects sound patterns such as sirens, gunshots, and shouts, and uses Automatic Speech Recognition to transcribe conversations and spot keywords associated with relevant events. Evaluated on videos from YouTube channels such as PMTVSP, PoliceActivity, and Code Blue Cam, our system effectively identifies moments in security footage where agents are engaged in activities beyond routine patrol, sparing a human operator from watching the entire video to locate them.
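The final stage described in the abstract, flagging transcript segments that contain alert keywords, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the keyword list and the `(start, end, text)` segment format are assumptions, chosen to mimic the per-utterance timestamps an ASR system such as Whisper can produce.

```python
# Hypothetical sketch of keyword spotting over ASR output.
# ALERT_KEYWORDS and the segment format are illustrative assumptions.

ALERT_KEYWORDS = {"gun", "weapon", "hands up", "shots fired"}

def flag_segments(segments):
    """Return the (start, end, text) segments whose text contains an alert keyword."""
    flagged = []
    for start, end, text in segments:
        lowered = text.lower()
        if any(kw in lowered for kw in ALERT_KEYWORDS):
            flagged.append((start, end, text))
    return flagged

# Example: two transcribed segments, one containing an alert keyword.
transcript = [
    (0.0, 4.2, "Dispatch, we are on routine patrol."),
    (4.2, 7.8, "Shots fired, requesting backup!"),
]
print(flag_segments(transcript))
```

In a full pipeline, the flagged timestamps would be merged with the visual and acoustic detections to build the highlight timeline; plain substring matching is only a starting point, since real transcripts require handling ASR errors and paraphrases.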
References
SULTANI, W.; CHEN, C.; SHAH, M. Real-world anomaly detection in surveillance videos. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. [s.n.], 2018. p. 6479–6488. Available at: ⟨https://doi.org/10.1109/CVPR.2018.00678⟩.
YELLAPRAGADA, S. et al. CCTV-Gun: Benchmarking Handgun Detection in CCTV Images. 2023. Available at: ⟨https://doi.org/10.48550/arXiv.2303.10703⟩.
GIRSHICK, R. Fast R-CNN. In: 2015 IEEE International Conference on Computer Vision (ICCV). [s.n.], 2015. p. 1440–1448. Available at: ⟨https://doi.org/10.1109/ICCV.2015.169⟩.
REDMON, J. et al. You only look once: Unified, real-time object detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). [s.n.], 2016. p. 779–788. Available at: ⟨https://doi.org/10.1109/CVPR.2016.91⟩.
OLMOS, R.; TABIK, S.; HERRERA, F. Automatic handgun detection alarm in videos using deep learning. Neurocomputing, v. 275, p. 66–72, 2018. ISSN 0925-2312. Available at: ⟨https://www.sciencedirect.com/science/article/pii/S0925231217308196⟩.
GONZÁLEZ, J. L. S. et al. Real-time gun detection in CCTV: An open problem. Neural Networks, v. 132, p. 297–308, 2020. ISSN 0893-6080. Available at: ⟨https://www.sciencedirect.com/science/article/pii/S0893608020303361⟩.
LIM, J. et al. Deep multi-level feature pyramids: Application for non-canonical firearm detection in video surveillance. Engineering Applications of Artificial Intelligence, v. 97, p. 104094, 2021. ISSN 0952-1976. Available at: ⟨https://www.sciencedirect.com/science/article/pii/S0952197620303456⟩.
KONG, Q. et al. PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, IEEE, v. 28, p. 2880–2894, 2020. Available at: ⟨https://doi.org/10.48550/arXiv.1912.10211⟩.
GEMMEKE, J. F. et al. Audio Set: An ontology and human-labeled dataset for audio events. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). [s.n.], 2017. p. 776–780. Available at: ⟨https://doi.org/10.1109/ICASSP.2017.7952261⟩.
CHEN, S. et al. BEATs: Audio pre-training with acoustic tokenizers. In: KRAUSE, A. et al. (Ed.). Proceedings of the 40th International Conference on Machine Learning. PMLR, 2023. (Proceedings of Machine Learning Research, v. 202), p. 5178–5193. Available at: ⟨https://proceedings.mlr.press/v202/chen23ag.html⟩.
HUANG, P.-Y. et al. Masked autoencoders that listen. Advances in Neural Information Processing Systems, v. 35, p. 28708–28720, 2022. Available at: ⟨https://doi.org/10.48550/arXiv.2207.06405⟩.
GULATI, A. et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100, 2020. Available at: ⟨https://doi.org/10.48550/arXiv.2005.08100⟩.
CHAN, W. et al. SpeechStew: Simply mix all available speech recognition data to train one large neural network. arXiv preprint arXiv:2104.02133, 2021. Available at: ⟨https://doi.org/10.48550/arXiv.2104.02133⟩.
RADFORD, A. et al. Robust speech recognition via large-scale weak supervision. In: PMLR. International Conference on Machine Learning. 2023. p. 28492–28518. Available at: ⟨https://doi.org/10.48550/arXiv.2212.04356⟩.
MOLINO, A. G. del et al. Summarization of egocentric videos: A comprehensive survey. IEEE Transactions on Human-Machine Systems, v. 47, n. 1, p. 65–76, 2017. Available at: ⟨https://doi.org/10.1109/THMS.2016.2623480⟩.
LEE, Y. J.; GHOSH, J.; GRAUMAN, K. Discovering important people and objects for egocentric video summarization. In: CVPR. [s.n.], 2012. p. 1346–1353. Available at: ⟨https://doi.org/10.1109/CVPR.2012.6247820⟩.
NEVES, A. C.; SILVA, M. M.; CAMPOS, M. F. M.; NASCIMENTO, E. R. A gaze driven fast-forward method for first-person videos. In: EPIC@Computer Vision and Pattern Recognition. [s.n.], 2020. p. 1–4. Available at: ⟨https://doi.org/10.48550/arXiv.2006.05569⟩.
LIN, T.-Y. et al. Feature pyramid networks for object detection. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). [s.n.], 2017. p. 936–944. Available at: ⟨https://doi.org/10.1109/CVPR.2017.106⟩.
LIU, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). [s.n.], 2021. p. 9992–10002. Available at: ⟨https://doi.org/10.1109/ICCV48922.2021.00986⟩.
License
Copyright (c) 2025 Luísa Ferreira, Michel Silva

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
I authorize the editors to publish my article, if accepted, in electronic media in accordance with the rules of the Public Knowledge Project.