Image Sequence Processing Paradigms

Contrary to the approach chosen almost everywhere else, where "computational vision" was conceived as a direct inversion of the perspective mapping process, our approach relied from the beginning on spatio-temporal modeling and local linear approximations to these models. It was conceived initially as an extension of the well-known (Luenberger) observer technique from control engineering, obtained by local linearization of the perspective mapping equations [MeD 83]; in 1983/84 the stochastic formulation as an estimation process was evaluated carefully in comparison and introduced as the new standard at UBM by Wuensche [Wue 86]. It took several years until this approach was adopted as a standard for image sequence processing worldwide. One reason for this delay was that solving the initialization problem of vision over and over again had become the standard in the AI community: the goal was to understand each image as well as possible, irrespective of the processing time needed per image; spatial changes over time were extracted afterwards from the last two (or more) images analyzed. The 4-D approach of UBM, by contrast, exploits process models with spatial velocity components as states. Both poses and spatial velocity components are thus iterated in a least-squares sense following the approach of Kalman [Wue 87; Wue 88; DiG 88; DiW 99] (recursive estimation with a dynamical model of the process observed and with a number of measurement models, one specific to each sensor). This approach allows much more efficient image sequence interpretation when the evaluation frequency is kept high; therefore, again in contrast to most other approaches at the time, the processes selected for visual interpretation and the hardware for parallel data evaluation were designed so that cycle times never exceeded 140 ms, in order not to violate the validity and linearity conditions of the models used.
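The recursive estimation cycle described above (predict with a dynamical process model, then correct with sensor measurements) can be sketched as follows. This is a minimal illustration only: a planar constant-velocity model and a linear position measurement stand in for the full spatio-temporal vehicle models and linearized perspective measurement models of the 4-D approach, and all matrix entries and noise levels are assumptions chosen for the sketch.

```python
import numpy as np

# Sketch of the recursive (Kalman) estimation loop: pose AND spatial
# velocity components are carried jointly in the state and iterated in a
# least-squares sense. A planar constant-velocity model stands in for the
# full spatio-temporal models of the 4-D approach.

dt = 0.08  # cycle time in seconds; hypothetical, below the 140 ms bound in the text

# State: [x, y, vx, vy] -- position and velocity estimated together.
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)   # dynamical (process) model
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)   # measurement model: position only
Q = 1e-3 * np.eye(4)                        # process noise (assumed)
R = 1e-2 * np.eye(2)                        # measurement noise (assumed)

def predict(x, P):
    """Extrapolate state and covariance with the dynamical model."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    """Correct the prediction with a new measurement (least-squares sense)."""
    S = H @ P @ H.T + R                  # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
    x = x + K @ (z - H @ x)              # innovation-weighted correction
    P = (np.eye(4) - K @ H) @ P
    return x, P

# One cycle: predict, then update with a (synthetic) measurement.
x = np.array([0.0, 0.0, 1.0, 0.5])       # initial estimate
P = np.eye(4)
x, P = predict(x, P)
x, P = update(x, P, np.array([0.09, 0.05]))
```

In a full implementation one measurement model per sensor would be linearized about the predicted state each cycle, which is what ties the estimation loop to the perspective mapping equations mentioned above.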
The good predictions made possible in this way increased feature-extraction efficiency by orders of magnitude [DiW 99]. In this approach, three levels of knowledge are important:
  1. On the feature level it is important to know which features of certain object classes are robust against changes in photometric parameters such as lighting conditions (e.g. edges, homogeneous blobs); in addition, it is important to know which collections of features in the image are indicative of an object in the scene (for bottom-up object hypothesis generation).
  2. On the object level, classes of objects with a wide range of parameters may be characterized by a given structure of their shape or even of their motion (roads, cars, trucks, buildings, animals, etc.). This structural knowledge allows object hypotheses to be checked efficiently top-down. The phenomenon called the "Gestalt idea of perception" by psychologists can thus be implemented easily. Motion models introduce continuity in aspect conditions.
  3. Several objects in the environment, in conjunction with one's own goals (the mission), form the situation, which influences behavior decisions for successful action. In addition, environmental properties such as lighting conditions for sensing and weather conditions for vehicle control and sensing are part of the situation.
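The efficiency gain from prediction mentioned above comes from restricting feature search to small windows around the image positions predicted by the dynamical model, instead of scanning the whole frame. A minimal sketch follows; the function names, window sizes, and the simple gradient-based edge detector are illustrative assumptions, not taken from the cited work.

```python
import numpy as np

# Sketch of prediction-bounded feature search: an edge feature is sought
# only inside a small window around its predicted image position, sized by
# the predicted uncertainty. All names and sizes are illustrative.

def search_window(pred_uv, sigma_uv, n_sigma=3):
    """Rectangular search region around a predicted image position.
    pred_uv: predicted (u, v) pixel position; sigma_uv: 1-sigma prediction
    uncertainty per axis. Returns (u_min, u_max, v_min, v_max)."""
    u, v = pred_uv
    su, sv = sigma_uv
    return (u - n_sigma * su, u + n_sigma * su,
            v - n_sigma * sv, v + n_sigma * sv)

def strongest_edge_in_window(image, window):
    """Pixel of maximal horizontal gradient inside the window."""
    u0, u1, v0, v1 = (int(round(c)) for c in window)
    patch = image[v0:v1, u0:u1].astype(float)
    grad = np.abs(np.diff(patch, axis=1))        # horizontal differences
    dv, du = np.unravel_index(np.argmax(grad), grad.shape)
    return (int(u0 + du), int(v0 + dv))

# Toy image with a vertical intensity edge at column 12.
img = np.zeros((32, 32))
img[:, 12:] = 1.0
win = search_window((11.0, 16.0), (2.0, 4.0))    # prediction near the edge
print(strongest_edge_in_window(img, win))
```

Because the window shrinks as the state estimate converges, the per-cycle measurement cost drops sharply, which is what makes the high evaluation frequency of the 4-D approach affordable.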
In the domain of knowledge representation and intelligent control, again three levels of development may be distinguished, as with the hardware: initially, with the BVVx, feasibility in principle had to be demonstrated in well-constrained environments. With the second-generation (transputer) systems, first real-world applications in well-structured normal environments (such as existing highway traffic) were the goal [Dic 95a]. The third-generation systems, just entering the test phase, have been designed to lift the performance level closer to the human one (see EMS-Vision below). Joint inertial and visual sensing and interpretation, as well as active/reactive viewing-direction control including saccades exploiting foveal/peripheral scaling, have been introduced. Due to object orientation (coded in C++) and modular design, these systems have growth potential towards larger knowledge bases and a wide selection of behavioral capabilities (taking advantage of the characteristics of vertebrate vision) [Gr et al. 00; LüD 00; Ho et al. 00; Mau 00; SiD 00; PeD 00; GrD 00; Dic 00].