This research is in the area of computer vision: making computers that can understand what is happening in photographs and video. Most learning-based approaches, however, require annotated data, which can be expensive to acquire. This project seeks to develop automated tools that allow temporal visual content, such as a human gesturing, using sign language, or interacting with objects or other humans, to be learnt from standard TV broadcast signals using the high-level annotation provided by subtitles or scripts. This requires the development of models of the visual appearance and dynamics of actions, and learning methods that can train such models using the weak supervision provided by the text. As such, there are two main domains to this work: Sign Language Recognition and the more general understanding of actions and behaviour in broadcast footage. More details on some of these elements are given below.
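To make the weak-supervision idea concrete, the following is a minimal, illustrative sketch (not the project's actual method or features, and all names are hypothetical): each subtitle that mentions a target word only tells us the corresponding sign or action occurs somewhere within that subtitle's time window, so each window is treated as a bag of candidate clips, and training alternates between fitting a classifier and re-selecting the best-scoring clip in each bag, in the spirit of multiple-instance learning.

```python
# Hypothetical sketch of learning from subtitle-level weak supervision.
# A subtitle mentioning the target word gives a "bag" of candidate clips;
# training alternates between classifier fitting and per-bag clip selection.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_weakly_supervised(positive_bags, negative_clips, n_iters=5):
    """positive_bags: list of arrays (n_clips_i, n_features), one bag per
    subtitle containing the target word; negative_clips: array of clip
    features from subtitles that never mention the word."""
    # Initialise by treating every clip in every positive bag as positive.
    pos = np.vstack(positive_bags)
    clf = LogisticRegression(max_iter=1000)
    for _ in range(n_iters):
        X = np.vstack([pos, negative_clips])
        y = np.concatenate([np.ones(len(pos)), np.zeros(len(negative_clips))])
        clf.fit(X, y)
        # Re-select the single highest-scoring clip from each positive bag.
        pos = np.vstack([bag[np.argmax(clf.predict_proba(bag)[:, 1])][None, :]
                         for bag in positive_bags])
    return clf
```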
Sign Language
  Upper Body Pose Estimation and Tracking
  Learning Sign Language by Watching TV
  Sign Recognition using Sequential Pattern Trees
  Additional pose data and CNN body pose software

Action Recognition
  Recognising actions in 2D
  Recognising actions in 3D

Tracking and Character Identification
  2D detection and tracking
  Tracking 3D objects from 2D footage
  Tracking Hands in 3D
  Identifying Characters in Footage