Automatic Identification of Relevant Moments in Security Force Videos Using Multimodal Analysis
DOI: https://doi.org/10.22456/2175-2745.143487
Keywords: Egocentric Vision, Video Understanding, Semantic Information, Security Forces
Abstract
Due to the increasing requirement for police officers to wear body cameras, there is a growing need for algorithms that can automatically detect relevant moments in the resulting footage. This paper presents an automated system that uses audio and video inputs to highlight key events in recordings, reducing the need for operators to watch entire videos. Our method applies object detection to find firearms and crowd gatherings, and pose estimation to identify people raising their hands. On the audio side, it detects sound patterns such as sirens, gunshots, and shouts, and uses Automatic Speech Recognition to transcribe conversations and spot keywords associated with relevant events. Evaluated on videos from YouTube channels such as PMTVSP, PoliceActivity, and Code Blue Cam, our system effectively identifies moments in security footage where agents are engaged in activities beyond routine patrol, sparing a human operator from watching the entire video to locate them.
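The final stage described in the abstract, flagging transcript segments that contain alert keywords, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the keyword list and the `(start, end, text)` segment format are assumptions, chosen to mimic the per-utterance timestamps an ASR system such as Whisper can produce.

```python
# Hypothetical sketch of keyword spotting over ASR output.
# ALERT_KEYWORDS and the segment format are illustrative assumptions.

ALERT_KEYWORDS = {"gun", "weapon", "hands up", "shots fired"}

def flag_segments(segments):
    """Return the (start, end, text) segments whose text contains an alert keyword."""
    flagged = []
    for start, end, text in segments:
        lowered = text.lower()
        if any(kw in lowered for kw in ALERT_KEYWORDS):
            flagged.append((start, end, text))
    return flagged

# Example: two transcribed segments, one containing an alert keyword.
transcript = [
    (0.0, 4.2, "Dispatch, we are on routine patrol."),
    (4.2, 7.8, "Shots fired, requesting backup!"),
]
print(flag_segments(transcript))
```

In a full pipeline, the flagged timestamps would be merged with the visual and acoustic detections to build the highlight timeline; plain substring matching is only a starting point, since real transcripts require handling ASR errors and paraphrases.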
References
SULTANI, W.; CHEN, C.; SHAH, M. Real-world anomaly detection in surveillance videos. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. [s.n.], 2018. p. 6479–6488. Available at: ⟨https://doi.org/10.1109/CVPR.2018.00678⟩.
YELLAPRAGADA, S. et al. CCTV-Gun: Benchmarking Handgun Detection in CCTV Images. 2023. Available at: ⟨https://doi.org/10.48550/arXiv.2303.10703⟩.
GIRSHICK, R. Fast R-CNN. In: 2015 IEEE International Conference on Computer Vision (ICCV). [s.n.], 2015. p. 1440–1448. Available at: ⟨https://doi.org/10.1109/ICCV.2015.169⟩.
REDMON, J. et al. You only look once: Unified, real-time object detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). [s.n.], 2016. p. 779–788. Available at: ⟨https://doi.org/10.1109/CVPR.2016.91⟩.
OLMOS, R.; TABIK, S.; HERRERA, F. Automatic handgun detection alarm in videos using deep learning. Neurocomputing, v. 275, p. 66–72, 2018. ISSN 0925-2312. Available at: ⟨https://www.sciencedirect.com/science/article/pii/S0925231217308196⟩.
GONZÁLEZ, J. L. S. et al. Real-time gun detection in CCTV: An open problem. Neural Networks, v. 132, p. 297–308, 2020. ISSN 0893-6080. Available at: ⟨https://www.sciencedirect.com/science/article/pii/S0893608020303361⟩.
LIM, J. et al. Deep multi-level feature pyramids: Application for non-canonical firearm detection in video surveillance. Engineering Applications of Artificial Intelligence, v. 97, p. 104094, 2021. ISSN 0952-1976. Available at: ⟨https://www.sciencedirect.com/science/article/pii/S0952197620303456⟩.
KONG, Q. et al. PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, IEEE, v. 28, p. 2880–2894, 2020. Available at: ⟨https://doi.org/10.48550/arXiv.1912.10211⟩.
GEMMEKE, J. F. et al. Audio Set: An ontology and human-labeled dataset for audio events. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). [s.n.], 2017. p. 776–780. Available at: ⟨https://doi.org/10.1109/ICASSP.2017.7952261⟩.
CHEN, S. et al. BEATs: Audio pre-training with acoustic tokenizers. In: KRAUSE, A. et al. (Ed.). Proceedings of the 40th International Conference on Machine Learning. PMLR, 2023. (Proceedings of Machine Learning Research, v. 202), p. 5178–5193. Available at: ⟨https://proceedings.mlr.press/v202/chen23ag.html⟩.
HUANG, P.-Y. et al. Masked autoencoders that listen. Advances in Neural Information Processing Systems, v. 35, p. 28708–28720, 2022. Available at: ⟨https://doi.org/10.48550/arXiv.2207.06405⟩.
GULATI, A. et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100, 2020. Available at: ⟨https://doi.org/10.48550/arXiv.2005.08100⟩.
CHAN, W. et al. SpeechStew: Simply mix all available speech recognition data to train one large neural network. arXiv preprint arXiv:2104.02133, 2021. Available at: ⟨https://doi.org/10.48550/arXiv.2104.02133⟩.
RADFORD, A. et al. Robust speech recognition via large-scale weak supervision. In: PMLR. International Conference on Machine Learning. 2023. p. 28492–28518. Available at: ⟨https://doi.org/10.48550/arXiv.2212.04356⟩.
MOLINO, A. G. del et al. Summarization of egocentric videos: A comprehensive survey. IEEE Transactions on Human-Machine Systems, v. 47, n. 1, p. 65–76, 2017. Available at: ⟨https://doi.org/10.1109/THMS.2016.2623480⟩.
LEE, Y. J.; GHOSH, J.; GRAUMAN, K. Discovering important people and objects for egocentric video summarization. In: CVPR. [s.n.], 2012. p. 1346–1353. Available at: ⟨https://doi.org/10.1109/CVPR.2012.6247820⟩.
NEVES, A. C.; SILVA, M. M.; CAMPOS, M. F. M.; NASCIMENTO, E. R. A gaze driven fast-forward method for first-person videos. In: EPIC@Computer Vision and Pattern Recognition. [s.n.], 2020. p. 1–4. Available at: ⟨https://doi.org/10.48550/arXiv.2006.05569⟩.
LIN, T.-Y. et al. Feature pyramid networks for object detection. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). [s.n.], 2017. p. 936–944. Available at: ⟨https://doi.org/10.1109/CVPR.2017.106⟩.
LIU, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). [s.n.], 2021. p. 9992–10002. Available at: ⟨https://doi.org/10.1109/ICCV48922.2021.00986⟩.
License
Copyright (c) 2025 Luísa Ferreira, Michel Silva

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
I authorize the editors to publish my article, if accepted, in electronic media in accordance with the rules of the Public Knowledge Project.