Abstract: Human visual scene understanding is remarkable: with only a brief
glance at an image, an abundance of information becomes available, including
the spatial structure, the scene category, and the identities of the main
objects in the scene. In traditional computer vision, scene and object
recognition are two visual tasks that are generally studied separately.
However, it remains unclear whether it is possible to build robust systems
for scene and object recognition that match human performance based only on
local representations. Another key component of machine vision algorithms is
access to data that describes the content of images. As the field moves
toward integrated systems that aim to recognize many object classes and to
learn about contextual relationships between objects, the lack of large
annotated datasets hinders the rapid development of robust
solutions. In the early days, the first challenge a computer vision
researcher encountered was the difficult task of digitizing a
photograph. Even once a picture was in digital form, storing a large
number of pictures (say, six) consumed most of the available
computational resources. In addition to the algorithmic advances
required to solve object recognition, a key component of progress is
access to data with which to train computational models for the different
object classes. This situation has dramatically changed in the last
decade, especially via the internet, which has given computer vision
researchers access to billions of images and videos. In this talk I
will describe recent work on visual scene understanding that tries to
build integrated models for scene and object recognition, emphasizing
the power of large databases of annotated images in computer vision.