Generalized VLM-Based In-Car Violence Detection Leveraging Synthetic Data
In: IEEE (Ed.). Proceedings of the International Telecommunications Conference (ITC-2026), 6th International Telecommunications Conference (ITC-Egypt 2026), July 27-30, Cairo, Egypt, IEEE, Piscataway, New Jersey, USA, 7/2026.
- Abstract:
- As autonomous vehicles move toward fully driverless operation, ensuring passenger safety through in-cabin monitoring is critical, yet violence detection remains challenging because real-world datasets are scarce. Traditional approaches rely on supervised CNN-based architectures, which often fail to generalize across environments. In this paper, we evaluate the transition from task-specific models to large-scale Video-Language Models (VLMs), specifically InternVideo2-Chat, for identifying aggressive interactions. We demonstrate that while the Temporal Shift Module (TSM) achieves high accuracy on seen distributions, its performance drops sharply on out-of-distribution data. In contrast, a VLM fine-tuned with Low-Rank Adaptation (LoRA) matches state-of-the-art performance (98.7% accuracy) while maintaining generalization. Furthermore, we investigate whether AI-generated content can replace real-world captured datasets by building a synthetic in-car violence dataset with the open-source Wan2.2 diffusion model. The results demonstrate that fine-tuning solely on generated video data improves recall by approximately 29 percentage points over the zero-shot baseline while maintaining a high accuracy of 90.5% on real-world test data, suggesting that generative models can effectively mitigate data scarcity for rare and sensitive events. Our findings highlight that semantics-driven VLMs offer a more scalable and robust solution for security in future shared autonomous mobility.
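
The abstract reports accuracy and a recall gain in percentage points. As a minimal sketch of how these quantities relate, the Python snippet below computes both metrics for a binary violence/non-violence task and expresses a recall change in percentage points. The function names and the toy label lists are hypothetical illustrations, not the paper's evaluation code or data.

```python
def accuracy(y_true, y_pred):
    """Fraction of clips whose predicted label matches the ground truth."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of truly violent clips (label `positive`) that were detected."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0


def percentage_point_gain(before, after):
    """Absolute difference between two rates, expressed in percentage points."""
    return (after - before) * 100.0


# Toy labels only (1 = violent, 0 = non-violent); NOT results from the paper.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
zero_shot_pred = [1, 0, 0, 0, 0, 0, 0, 0]   # hypothetical baseline that misses most positives
fine_tuned_pred = [1, 1, 1, 0, 0, 0, 0, 1]  # hypothetical model after fine-tuning

gain = percentage_point_gain(
    recall(y_true, zero_shot_pred),
    recall(y_true, fine_tuned_pred),
)
print(f"recall gain: {gain:.1f} percentage points")
print(f"accuracy after fine-tuning: {accuracy(y_true, fine_tuned_pred):.3f}")
```

A percentage-point gain is an absolute difference between two rates, so a rise in recall from 25% to 75% in the toy data is a gain of 50 percentage points, not a relative increase of 50%.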