Automatic Learning of a Semantic Scene Model

Aim

To automatically learn activity-based semantic features of a scene from large sets of observations.

Introduction

In traditional surveillance, human operators are responsible for monitoring the activity of a scene through a set of cameras and looking out for unusual events. Quite often, they also have to go through recorded video data to gather information and evidence about specific events. Both tasks are tedious and monotonous, so it is inevitable that interesting events are overlooked.
These problems could potentially be tackled by automated visual surveillance systems with high-level capabilities, such as detecting suspicious events and alerting personnel, or annotating and encoding events in a way that facilitates interaction with human operators.
Here, we discuss how a semantic scene model can be learnt automatically from observations and how such a model can be used for a variety of applications.

Semantic Dictionary for Visual Surveillance

The semantics of interest in a surveillance system can be classified into three categories: Targets (e.g. pedestrians, cars, large vehicles), Actions (e.g. move, stop, enter/exit, accelerate, turn left/right)
and Static Scene Features (e.g. road, corridor, door, gate, ATM, desk, bus stop). The proposed general scheme is that Targets perform Actions in an environment consisting of other targets and static scene features. This work is mainly about learning the static scene features.
Semantic Dictionary Diagram
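
As a minimal illustration of this scheme (a sketch of our own, not part of the original system), the semantic dictionary can be written down as three Python enumerations:

    from enum import Enum, auto

    class Target(Enum):
        PEDESTRIAN = auto()
        CAR = auto()
        LARGE_VEHICLE = auto()

    class Action(Enum):
        MOVE = auto()
        STOP = auto()
        ENTER = auto()
        EXIT = auto()
        ACCELERATE = auto()
        TURN_LEFT = auto()
        TURN_RIGHT = auto()

    class StaticFeature(Enum):
        ROAD = auto()
        CORRIDOR = auto()
        DOOR = auto()
        GATE = auto()
        ATM = auto()
        DESK = auto()
        BUS_STOP = auto()

    # A target performs an action in the context of a static scene feature.
    event = (Target.PEDESTRIAN, Action.ENTER, StaticFeature.DOOR)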

Learning in Visual Surveillance

We suggest that learning in visual surveillance should be performed mainly in an unsupervised manner, for two reasons:
Firstly, to exploit the vast amount of observations that are available due to the continuous operation of the surveillance system.
Secondly, to allow the development of systems that can automatically learn their environment so they can be easily installed (plug’n’play) and adapt.

A Reverse Engineering Approach

We want our system to learn the structure of the scene (static scene features). Scene features usually influence the activity of the targets, directly or indirectly (e.g. traffic lanes indicate where vehicles can move, pathways encourage pedestrians to walk along them, bus stops indicate where pedestrians should wait for the bus and where the bus should stop). Therefore, these features can be labelled by observing the activity in the scene (e.g. a traffic lane is where vehicles move along, a pathway is where pedestrians walk, a bus stop is where pedestrians wait and where buses stop).
Reverse Engineering diagram
This approach is adopted because the "visual perception" of our surveillance system is based on a low-level motion detection module. In other words, the system is equipped only with a "motion-attention" mechanism and cannot see motionless objects. Therefore, the scene can only be "understood" through the activity related to it.

Entrances/Exits/Stop areas

For instance, a surveillance system can learn the entrances/exits of a scene (which may correspond to doors, gates, etc.) from sets of points that indicate entry/exit events, as derived by a motion tracking algorithm. An unsupervised learning algorithm based on the Expectation-Maximisation (EM) algorithm for Gaussian Mixture Models (GMMs) has been used to produce the results shown below.
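
A rough sketch of this step is given below, using scikit-learn's GaussianMixture as a stand-in for the EM/GMM learner; the synthetic points, the number of components and all parameter values are illustrative assumptions, not the original implementation:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Hypothetical 2D image coordinates of entry/exit events from a tracker.
    rng = np.random.default_rng(0)
    points = np.vstack([
        rng.normal(loc=(50, 200), scale=5, size=(100, 2)),   # e.g. around a door
        rng.normal(loc=(300, 40), scale=8, size=(100, 2)),   # e.g. around a gate
    ])

    # EM fits the Gaussian Mixture Model; each component models one zone.
    gmm = GaussianMixture(n_components=2, covariance_type="full").fit(points)

    # Component means/covariances give the position and extent of each zone.
    print(gmm.means_)
    print(gmm.covariances_)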




Route models / Paths / Junctions

We also developed a route model and an associated unsupervised, online learning algorithm to represent scene entities such as roads, traffic lanes, pathways and corridors. The learning algorithm clusters similar trajectories to form route models.
Further analysis of the route models, based on computational geometry, allows the extraction of other features, such as path segments and junctions.
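
The actual route-learning algorithm is described in the publications below; as a loose, simplified sketch of the idea (the fixed-length resampling, the distance threshold and the running-mean update are illustrative assumptions), trajectories can be clustered online as follows:

    import numpy as np

    def resample(traj, n=20):
        """Resample a trajectory of (x, y) points to n points by arc length."""
        traj = np.asarray(traj, dtype=float)
        seg = np.linalg.norm(np.diff(traj, axis=0), axis=1)
        s = np.concatenate([[0.0], np.cumsum(seg)])
        t = np.linspace(0.0, s[-1], n)
        return np.stack([np.interp(t, s, traj[:, 0]),
                         np.interp(t, s, traj[:, 1])], axis=1)

    def cluster_online(trajectories, threshold=30.0, n=20):
        """Greedily assign each trajectory to the nearest route model
        (a mean trajectory) or start a new route if none is close enough."""
        routes = []  # list of [mean trajectory, count] pairs
        for traj in trajectories:
            r = resample(traj, n)
            dists = [np.mean(np.linalg.norm(r - mean, axis=1))
                     for mean, _ in routes]
            if dists and min(dists) < threshold:
                i = int(np.argmin(dists))
                mean, cnt = routes[i]
                routes[i] = [(mean * cnt + r) / (cnt + 1), cnt + 1]  # running mean
            else:
                routes.append([r, 1])
        return [mean for mean, _ in routes]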

Scene Trajectories | Scene Routes | Scene Paths and Junctions

Applications

Database

We designed a hierarchical, context-based surveillance database that is built on the extracted semantic scene features. The benefit is that it allows the motion history of an object to be summarised with very few parameters. Because humans interpret object motion in relation to other objects in the scene, the semantic description layer provides the basis for such a content-based description of motion. It also allows human operators to communicate with the surveillance database through context-based queries. For instance, the database can be queried for pedestrians that used a particular route.
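
As an illustration only (the record fields and query below are hypothetical, not the actual database schema), a summarised motion history and a context-based query might look like this:

    from dataclasses import dataclass

    @dataclass
    class TrackSummary:
        """Compact, context-based summary of one object's motion history."""
        object_id: int
        target_class: str   # e.g. "pedestrian", "car"
        route_id: int       # learnt route the trajectory was assigned to
        entry_zone: int     # learnt entry zone
        exit_zone: int      # learnt exit zone
        start_time: float   # seconds since start of recording
        end_time: float

    def pedestrians_on_route(db, route_id):
        """Context-based query: pedestrians that used a particular route."""
        return [t for t in db
                if t.target_class == "pedestrian" and t.route_id == route_id]

    db = [TrackSummary(1, "pedestrian", 3, 0, 1, 10.0, 55.0),
          TrackSummary(2, "car", 1, 2, 3, 12.0, 20.0)]
    print(pedestrians_on_route(db, 3))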


Video Annotation

An automatic video annotation system can be constructed if the visual semantics learnt by our system are linked to appropriate words. The EPSRC-funded project REVEAL aims to address this issue by integrating visual learning with natural language processing.

Activity Analysis

The static features of the scene are used as the basis for probabilistic models (Hidden Markov Models, HMMs) that allow the activity of pedestrians and vehicles to be assessed. These models capture typical patterns of activity; their outliers correspond to atypical activity, which may be suspicious.
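
The concrete HMM formulation is given in the publications below; the self-contained sketch that follows (the matrices, the discretisation of trajectories into path-segment labels and the decision threshold are all illustrative assumptions) scores an observation sequence with the forward algorithm and flags it as atypical when its likelihood is low:

    import numpy as np

    def forward_loglik(obs, pi, A, B):
        """Log-likelihood of a discrete observation sequence under an HMM,
        computed with the (scaled) forward algorithm.
        pi: initial probs (N,), A: transitions (N, N), B: emissions (N, M)."""
        alpha = pi * B[:, obs[0]]
        loglik = 0.0
        for o in obs[1:]:
            c = alpha.sum()
            loglik += np.log(c)
            alpha = (alpha / c) @ A * B[:, o]
        return loglik + np.log(alpha.sum())

    # Hypothetical 2-state HMM over 3 discretised path-segment labels.
    pi = np.array([0.9, 0.1])
    A = np.array([[0.8, 0.2],
                  [0.3, 0.7]])
    B = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])

    # A low per-step log-likelihood marks a sequence as atypical.
    for seq in ([0, 0, 1, 1, 2], [2, 0, 2, 0, 2]):
        score = forward_loglik(np.array(seq), pi, A, B) / len(seq)
        print(score, "atypical" if score < -1.2 else "typical")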

Typical activity of a pedestrian: similar trajectories have been seen by the system and captured by the route model. Atypical activity of a pedestrian who "climbs up" the wall: the red circle indicates the detection of the atypical event.
Typical Activity | Atypical Activity

Conclusions

Automated visual surveillance can benefit significantly from a high-level "semantic" understanding of the scene and its activity. The focus of the work above is the visual learning of semantic static features of the scene. Our approach shows that visual semantics can be learnt automatically by exploiting the motion observations extracted by a motion tracking module.

Publications

D. Makris, T.J. Ellis, "Learning Semantic Scene Models from Observing Activity in Visual Surveillance" in 'IEEE Transactions on Systems Man and Cybernetics - Part B', 35(3) June, pp. 397-408. ISBN/ISSN 1083-4419 (2005) abstract download
D. Makris, T.J. Ellis, "Path Detection in Video Surveillance" in 'Image and Vision Computing', 20(12) Elsevier, October, pp. 895-903. ISBN/ISSN 0262-8856 (2002) abstract download
D. Makris, T.J. Ellis, J. Black, Chapter "Mapping an Ambient Environment" in 'Ambient Intelligence - A Novel Paradigm', Edited by P. Remagnino, G. Foresti, T.J. Ellis, Springer, pp. 139-164. ISBN/ISSN 0-387-22990-6 (2005)
D. Makris, T.J. Ellis, J. Black, "Learning Scene Semantics", ECOVISION 2004 Early Cognitive Vision Workshop, May, Isle of Skye, Scotland, UK, (2004) abstract download
D. Makris, T.J. Ellis, "Automatic Learning of an Activity-Based Semantic Scene Model", IEEE International Conference on Advanced Video and Signal Based Surveillance, July, Miami, FL, USA, pp. 183-188. (2003) abstract download
D. Makris, T.J. Ellis, "Spatial and Probabilistic Modelling of Pedestrian Behaviour", British Machine Conference 2002, September, Cardiff, UK, pp. 557-566. (2002) abstract download
D. Makris, T.J. Ellis, "Finding Paths in Video Sequences", British Machine Vision Conference 2001, September, Manchester, UK, pp. 263-272. (2001) abstract download

About this work

This work is part of the EPSRC-funded projects IMCASM and REVEAL, conducted by Dimitrios Makris, Tim Ellis and James Black.