Automatic Learning of a Semantic Scene Model

Aim

To automatically learn activity-based semantic features of a scene from large sets of observations.

Introduction

In traditional surveillance, human operators are responsible for monitoring the activity of a scene through a set of cameras and looking out for unusual events. Quite often, they also have to go through recorded video data to gather information and evidence about specific events. Both tasks are tedious and monotonous, so it is inevitable that interesting events are overlooked.
These problems could potentially be tackled by automated visual surveillance systems with high-level capabilities, such as detecting suspicious events and alerting personnel, or annotating and encoding events in a way that facilitates interaction with human operators.
Here, we discuss how a semantic scene model can be learnt automatically from observations and how such a model can be used for a variety of applications.

Semantic Dictionary for Visual Surveillance

The semantics of interest in a surveillance system can be classified into three categories: Targets (e.g. pedestrians, cars, large vehicles), Actions (e.g. move, stop, enter/exit, accelerate, turn left/right)
and Static Scene Features (e.g. road, corridor, door, gate, ATM, desk, bus stop). The proposed general scheme is that Targets perform Actions in an environment consisting of other targets and static scene features. This work is mainly about learning the static scene features.
Semantic Dictionary Diagram
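
As a minimal illustration of this scheme (a sketch of our own, not part of the original system), the semantic dictionary can be written down as three Python enumerations:

    from enum import Enum, auto

    class Target(Enum):
        PEDESTRIAN = auto()
        CAR = auto()
        LARGE_VEHICLE = auto()

    class Action(Enum):
        MOVE = auto()
        STOP = auto()
        ENTER = auto()
        EXIT = auto()
        ACCELERATE = auto()
        TURN_LEFT = auto()
        TURN_RIGHT = auto()

    class StaticFeature(Enum):
        ROAD = auto()
        CORRIDOR = auto()
        DOOR = auto()
        GATE = auto()
        ATM = auto()
        DESK = auto()
        BUS_STOP = auto()

    # A target performs an action in the context of a static scene feature.
    event = (Target.PEDESTRIAN, Action.ENTER, StaticFeature.DOOR)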

Learning in Visual Surveillance

We suggest that learning in visual surveillance should be performed mainly in an unsupervised manner, for two reasons:
Firstly, to exploit the vast amount of observations that are available due to the continuous operation of the surveillance system.
Secondly, to allow the development of systems that can automatically learn their environment so they can be easily installed (plug’n’play) and adapt.

A Reverse Engineering Approach

We want our system to learn the structure of the scene (static scene features). Scene features usually influence the activity of the targets, directly or indirectly (e.g. traffic lanes indicate where vehicles can move, pathways encourage pedestrians to walk along them, bus stops indicate where pedestrians should wait for the bus and where the bus should stop). Therefore, these features can be labelled by observing the activity in the scene (e.g. a traffic lane is where vehicles move along, a pathway is where pedestrians walk, a bus stop is where pedestrians wait and where buses stop).
Reverse Engineering diagram
This approach is adopted because the "visual perception" of our surveillance system is based on a low-level motion detection module. In other words, the system is equipped only with a "motion-attention" mechanism and cannot see motionless objects. Therefore, the scene can only be "understood" through the activity related to it.

Entrances/Exits/Stop areas

For instance, a surveillance system can learn the entrances/exits of a scene (which may correspond to doors, gates, etc.) from sets of points that indicate entry/exit events, as derived by a motion tracking algorithm. An unsupervised learning algorithm based on the Expectation-Maximisation (EM) algorithm for Gaussian Mixture Models (GMMs) has been used to produce the results shown below.
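
A rough sketch of this step is given below, using scikit-learn's GaussianMixture as a stand-in for the EM/GMM learner; the synthetic points, the number of components and all parameter values are illustrative assumptions, not the original implementation:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Hypothetical 2D image coordinates of entry/exit events from a tracker.
    rng = np.random.default_rng(0)
    points = np.vstack([
        rng.normal(loc=(50, 200), scale=5, size=(100, 2)),   # e.g. around a door
        rng.normal(loc=(300, 40), scale=8, size=(100, 2)),   # e.g. around a gate
    ])

    # EM fits the Gaussian Mixture Model; each component models one zone.
    gmm = GaussianMixture(n_components=2, covariance_type="full").fit(points)

    # Component means/covariances give the position and extent of each zone.
    print(gmm.means_)
    print(gmm.covariances_)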




Route models / Paths / Junctions

We also developed a route model and an associated unsupervised, online learning algorithm to represent scene entities such as roads, traffic lanes, pathways and corridors. The learning algorithm clusters similar trajectories to form route models.
Further analysis of the route models, based on computational geometry, allows the extraction of other features, such as path segments and junctions.
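
The actual route-learning algorithm is described in the publications below; as a loose, simplified sketch of the idea (the fixed-length resampling, the distance threshold and the running-mean update are illustrative assumptions), trajectories can be clustered online as follows:

    import numpy as np

    def resample(traj, n=20):
        """Resample a trajectory of (x, y) points to n points by arc length."""
        traj = np.asarray(traj, dtype=float)
        seg = np.linalg.norm(np.diff(traj, axis=0), axis=1)
        s = np.concatenate([[0.0], np.cumsum(seg)])
        t = np.linspace(0.0, s[-1], n)
        return np.stack([np.interp(t, s, traj[:, 0]),
                         np.interp(t, s, traj[:, 1])], axis=1)

    def cluster_online(trajectories, threshold=30.0, n=20):
        """Greedily assign each trajectory to the nearest route model
        (a mean trajectory) or start a new route if none is close enough."""
        routes = []  # list of [mean trajectory, count] pairs
        for traj in trajectories:
            r = resample(traj, n)
            dists = [np.mean(np.linalg.norm(r - mean, axis=1))
                     for mean, _ in routes]
            if dists and min(dists) < threshold:
                i = int(np.argmin(dists))
                mean, cnt = routes[i]
                routes[i] = [(mean * cnt + r) / (cnt + 1), cnt + 1]  # running mean
            else:
                routes.append([r, 1])
        return [mean for mean, _ in routes]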

Scene Trajectories | Scene Routes | Scene Paths and Junctions

Applications

Database

We designed a hierarchical, context-based surveillance database that is built on the extracted semantic scene features. The benefit is that it allows the motion history of an object to be summarised with very few parameters. Because humans interpret object motion in relation to other objects in the scene, the semantic description layer provides the basis for such a content-based description of motion. It also allows human operators to communicate with the surveillance database through context-based queries. For instance, the database can be queried for pedestrians that used a particular route.
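
As an illustration only (the record fields and query below are hypothetical, not the actual database schema), a summarised motion history and a context-based query might look like this:

    from dataclasses import dataclass

    @dataclass
    class TrackSummary:
        """Compact, context-based summary of one object's motion history."""
        object_id: int
        target_class: str   # e.g. "pedestrian", "car"
        route_id: int       # learnt route the trajectory was assigned to
        entry_zone: int     # learnt entry zone
        exit_zone: int      # learnt exit zone
        start_time: float   # seconds since start of recording
        end_time: float

    def pedestrians_on_route(db, route_id):
        """Context-based query: pedestrians that used a particular route."""
        return [t for t in db
                if t.target_class == "pedestrian" and t.route_id == route_id]

    db = [TrackSummary(1, "pedestrian", 3, 0, 1, 10.0, 55.0),
          TrackSummary(2, "car", 1, 2, 3, 12.0, 20.0)]
    print(pedestrians_on_route(db, 3))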


Video Annotation

An automatic video annotation system can be constructed if the visual semantics learnt by our system are linked to appropriate words. The EPSRC-funded project REVEAL aims to address this issue by integrating visual learning with natural language processing.

Activity Analysis

The static features of the scene are used as the basis for probabilistic models (Hidden Markov Models, HMMs) that allow the activity of pedestrians and vehicles to be assessed. These models capture typical patterns of activity; their outliers correspond to atypical activity, which may be suspicious.
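
The concrete HMM formulation is given in the publications below; the self-contained sketch that follows (the matrices, the discretisation of trajectories into path-segment labels and the decision threshold are all illustrative assumptions) scores an observation sequence with the forward algorithm and flags it as atypical when its likelihood is low:

    import numpy as np

    def forward_loglik(obs, pi, A, B):
        """Log-likelihood of a discrete observation sequence under an HMM,
        computed with the (scaled) forward algorithm.
        pi: initial probs (N,), A: transitions (N, N), B: emissions (N, M)."""
        alpha = pi * B[:, obs[0]]
        loglik = 0.0
        for o in obs[1:]:
            c = alpha.sum()
            loglik += np.log(c)
            alpha = (alpha / c) @ A * B[:, o]
        return loglik + np.log(alpha.sum())

    # Hypothetical 2-state HMM over 3 discretised path-segment labels.
    pi = np.array([0.9, 0.1])
    A = np.array([[0.8, 0.2],
                  [0.3, 0.7]])
    B = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])

    # A low per-step log-likelihood marks a sequence as atypical.
    for seq in ([0, 0, 1, 1, 2], [2, 0, 2, 0, 2]):
        score = forward_loglik(np.array(seq), pi, A, B) / len(seq)
        print(score, "atypical" if score < -1.2 else "typical")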

Typical activity of a pedestrian: similar trajectories have been seen by the system and captured by the route model. Atypical activity of a pedestrian who "climbs up" the wall: the red circle indicates the detection of the atypical event.
Typical Activity | Atypical Activity

Conclusions

Automated visual surveillance can benefit significantly from a high-level "semantic" understanding of the scene and its activity. The focus of the work above is the visual learning of semantic static features of the scene. Our approach shows that visual semantics can be learnt automatically by exploiting the motion observations extracted by a motion tracking module.

Publications

D. Makris, T.J. Ellis, "Learning Semantic Scene Models from Observing Activity in Visual Surveillance" in 'IEEE Transactions on Systems Man and Cybernetics - Part B', 35(3) June, pp. 397-408. ISBN/ISSN 1083-4419 (2005) abstract download
D. Makris, T.J. Ellis, "Path Detection in Video Surveillance" in 'Image and Vision Computing', 20(12) Elsevier, October, pp. 895-903. ISBN/ISSN 0262-8856 (2002) abstract download
D. Makris, T.J. Ellis, J. Black, Chapter "Mapping an Ambient Environment" in 'Ambient Intelligence - A Novel Paradigm', Edited by P. Remagnino, G. Foresti, T.J. Ellis, Springer, pp. 139-164. ISBN/ISSN 0-387-22990-6 (2005)
D. Makris, T.J. Ellis, J. Black, "Learning Scene Semantics", ECOVISION 2004 Early Cognitive Vision Workshop, May, Isle of Skye, Scotland, UK, (2004) abstract download
D. Makris, T.J. Ellis, "Automatic Learning of an Activity-Based Semantic Scene Model", IEEE International Conference on Advanced Video and Signal Based Surveillance, July, Miami, FL, USA, pp. 183-188. (2003) abstract download
D. Makris, T.J. Ellis, "Spatial and Probabilistic Modelling of Pedestrian Behaviour", British Machine Conference 2002, September, Cardiff, UK, pp. 557-566. (2002) abstract download
D. Makris, T.J. Ellis, "Finding Paths in Video Sequences", British Machine Vision Conference 2001, September, Manchester, UK, pp. 263-272. (2001) abstract download

About this work

This work is part of the EPSRC-funded projects IMCASM and REVEAL, conducted by Dimitrios Makris, Tim Ellis and James Black.