Published in Machine Vision and Applications

Disparity Disambiguation by Fusion of Signal- and Symbolic-Level Information

Jarno Ralli¹, Javier Díaz¹, Sinan Kalkan², Norbert Krüger³, and Eduardo Ros¹

jarno@ralli.fi, jdiaz@atc.ugr.es, skalkan@ceng.metu.edu.tr, norbert@mip.sdu.dk, eros@atc.ugr.es

¹ Departamento de Arquitectura y Tecnología de Computadores*
² KOVAN Research Lab†
³ Cognitive Vision Lab‡

Abstract

We describe a method for resolving ambiguities in low-level disparity calcu-

lations in a stereo-vision scheme by using a recurrent mechanism that we call

signal-symbol loop. Due to the local nature of low-level processing it is not al-

ways possible to estimate the correct disparity values produced at this level. Sym-

bolic abstraction of the signal produces robust, high conﬁdence, multimodal image

features which can be used to interpret the scene more accurately and therefore

disambiguate low-level interpretations by biasing the correct disparity. The fusion

process is capable of producing more accurate dense disparity maps than the low-

and symbolic-level algorithms can produce independently. To this end, we describe
an efficient fusion scheme that allows symbolic- and low-level cues to complement
each other, resulting in a more accurate and dense disparity representation of the
scene.

1 Introduction

Visual perception is a complex process that transforms image signals into cognitive in-

formation. The complexity of the vision system is due to multiple levels of abstraction

that must be taken into account when interpreting the image scene. In order to under-

stand the vision system better we can represent schematically the different levels of

the process. Vision researchers tend to classify vision algorithms and representations

into three levels: low (sensory/signal), middle (symbolic) and high (knowledge based)

[28][3][22]. Low-level vision deals with local operations such as spatio-temporal ﬁl-

ters to extract low-level cues. In biological systems this is done by cells in the retina

* Escuela Técnica Superior de Ingeniería Informática y de Telecomunicación, Universidad de Granada, Calle Periodista Daniel Saucedo Aranda s/n, E-18071 Granada, Spain
† Dept. of Computer Engineering, Middle East Technical University, 06531 Ankara, Turkey
‡ The Maersk Mc-Kinney Moller Institute, University of Southern Denmark, Niels Bohrs Alle 1, DK-5230 Odense M, Denmark
www.jarnoralli.fi

and the primary visual cortex. From a set of basic spatio-temporal ﬁlters of different

sizes and temporal characteristics, low-level vision models generate information about

stereopsis, motion within the scene, local contrast and so on. Low-level operations on

the image signal are local in nature and can produce several possible interpretations due

to the lack of a more global scene interpretation. At middle-level vision, visual cues

and segmentation mechanisms are integrated, thus allowing the efﬁcient and construc-

tive combination of different visual modalities (motion, stereo, orientation and so on)

or the segmentation of abstracted information such as independently moving objects

(IMOs)[27][5][19]. High-level vision is a cognitive processing stage, where scenes

are interpreted via more speciﬁc sub-tasks, such as object recognition, prediction and

comparison with already perceived scenarios. We use the terms low- or signal-level

algorithm when referring to algorithms that work at the signal-level (i.e. in a pixel-

wise representation) without trying to arrive at a higher-level description of the scene

and the terms middle- or symbolic-level to refer to algorithms that arrive at a higher

description of the scene using semantically meaningful and discrete symbolic descrip-

tors. Even though at this level reasoning based on the symbolic descriptors takes place,

we are still far from high-level processes where actual scene understanding happens.

In this paper we propose a disambiguation mechanism for creating coherent disparity estimations by fusing signal- and symbolic-level information, i.e. fusing estimations of different levels of abstraction within a cross-validation scheme. There are several kinds of disambiguation mechanisms used in disparity calculation algorithms [25], both

local and global, such as: aggregation of evidence, search for salient features [14][13],

combination of both monocular and binocular cues [15] and so on. Where our work

differs from earlier studies is that before disambiguation we arrive at a symbolic-level

scene description using robust, cross-validated, biologically motivated, multimodal im-

age features that we shall refer to as primitives in the rest of the paper [12] [18] [16].

Our main contributions in this paper are as follows. First we show that the proposed

disambiguation mechanism can greatly enhance the quality of the resulting disparity estimations: coherency is increased by accepting those feedback values that fit the evidence suggested by the data and by rejecting those that are not consistent with the low-level data-driven estimations. Secondly we show that hardware implementations

suffering from numerical restrictions will also beneﬁt from the proposed scheme.

Our system consists of two parallel data streams preceded by a process that trans-

forms the signal into harmonic representation [23]. By “harmonic representation” we

mean a multichannel, band-pass representation of the image, achieved by ﬁltering op-

erations with complex valued band-pass kernels. From this perspective the visual stim-

ulus is represented locally by phase, energy and orientation on several scales [23]. The

two parallel data streams are the following: a signal-level process that calculates the

dense disparity map; and a symbolic-level process that arrives at a scene reconstruction

using perceptual grouping constraints for the multimodal primitives.
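As a rough illustration of the harmonic representation, the following Python/NumPy sketch computes, for one image patch, the per-orientation complex Gabor responses whose magnitude and argument give the local energy and phase. The filter parameters (peak frequency, envelope width, kernel size) are illustrative assumptions, not the values of the actual implementation:

```python
import numpy as np

def gabor_kernel(theta, f0=0.25, sigma=4.0, size=21):
    """Complex Gabor kernel: Gaussian envelope times a complex carrier
    oriented at angle theta (all parameter values are illustrative)."""
    ys, xs = np.mgrid[-(size // 2):size // 2 + 1, -(size // 2):size // 2 + 1]
    u = xs * np.cos(theta) + ys * np.sin(theta)   # coordinate along the carrier direction
    envelope = np.exp(-(xs**2 + ys**2) / (2.0 * sigma**2))
    return envelope * np.exp(2j * np.pi * f0 * u)

def local_harmonics(patch, n_orient=7):
    """Harmonic description of one image patch: per orientation channel,
    the response magnitude gives the local energy and its argument the
    local phase; the highest-energy channel indicates the local orientation."""
    out = []
    for i in range(n_orient):
        theta = i * np.pi / n_orient
        resp = np.sum(patch * np.conj(gabor_kernel(theta)))   # complex filter response
        out.append((theta, np.abs(resp), np.angle(resp)))
    return out
```

For example, a patch containing a vertical grating at the filters' peak frequency yields its highest energy in the theta = 0 channel, with a phase that tracks the grating's position.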

Fig. 1 illustrates the parallel data streams, both using a common harmonic rep-

resentation with colour information, and the feedback signal from the symbolic-level

to the signal-level. Due to the lack of a more general interpretation of the scene, the

low-level process is prone to error when several different interpretations are possible.

The symbolic-level process, on the other hand, generates a more robust and descrip-

tive representation of the scene, capable of reﬁning the estimates for a better overall

semantic coherence. Coherent interpretation at this level is possible through seman-

tic reasoning using concepts such as co-linearity, co-colority, co-planarity and so on

[18]. The dense signal-level disparity is disambiguated by feeding the symbolic-level

information back into the low-level process [10][11] and by biasing those low-level interpretations that are coherent with the symbolic-level.

Figure 1: The signal- and symbolic-level disparity calculation processes, which are based on the harmonic representation (together with colour information) obtained by the harmonic transformation; a recurrent feedback path connects the symbolic level back to the signal level.

Since the feedback takes place

at several image scales (multi-scale), the sparse symbolic-level information is propa-

gated spatially, thus being capable of ‘guiding’ the low-level process over a far greater

area than the original density [20]. We refer to the symbolic-level information used in

the disambiguation as feedback maps. The system presented in this paper has only one

’direction’ of feedback (from symbolic- to signal-level) and extending the feedback

into the other direction (signal- to symbolic-level) is left for future work.

1.1 Signal-symbol Loop

As mentioned above, we use the concept of signal-symbol loop as a feedback mech-

anism by which discrete symbolic descriptors obtained from the harmonic representa-

tion are fed back into the signal-level so as to enhance the extraction of desired features.

To the best of our knowledge the term ‘signal-symbol loop’ was ﬁrst introduced in [11]

to describe a way of dealing with three dilemmas that computer vision encounters when

interpreting a scene. It is argued that such interpretations require the original signal to

be turned into semantic tokens or symbols, which, however, involves a number of prob-

lems. The ﬁrst problem (known as the interpretation/decision dilemma) is of particular

relevance in the context of this paper. It deals with the need to interpret the input sig-

nal, which in turn requires binary decisions. These decisions concern, for example,

setting thresholds for edge detection or discrete selection of feature positions. More-

over, decisions about which features are relevant for a speciﬁc task often need to be

made. Without making further assumptions about the input signal or the task in hand,

these decisions are difﬁcult to justify. Hence it is important that they become veriﬁed

and guided by higher level processes that operate on the symbolic level. In [11] it is

argued that feedback mechanisms in terms of signal-symbol loops can moderate be-

tween the different levels of information and be used for enhancing the image signal

to detect desired features and disambiguate unclear interpretations of the local cues. In

[10] a ﬁrst example is given for the application of a signal-symbol loop in the context

of taking advantage of the regularity of rigid motion for edge detection. In this paper

we give a further example by addressing the interaction of sparse and dense stereo by

signal-symbol loops.


1.2 Hardware based real-time low-level processing

Low level stages (extraction engines of primitives) can be efﬁciently implemented

through special-purpose hardware such as reconfigurable devices [1][2][3]. However, in order to maximise on-chip parallel processing capabilities, only restricted fixed-point arithmetic is allowed in the model. Furthermore, the models are usually simplified in

order to adapt better to the technological substrate in which they will be implemented.

Therefore these kinds of low-level processing engines produce noisier results than their

respective software implementations. In this work we study whether the signal-symbol fusion mechanisms described in this paper help to enhance the system accuracy by con-

structively integrating higher level information and thus allowing designs with lower

resource requirements and power consumption (critical in embedded systems).

1.3 Structure of the Document

We proceed by describing briefly both the low- and symbolic-level algorithms, followed by a description of the fusion process. As the low-level algorithm we have chosen a method based on the phase component of a band-pass filtered image, due to the robustness of the phase information. We cannot overstress the fact that we are not trying to come up with a new stereo algorithm, but to validate our concept that by fusing information from several different visual representation levels, more robust and meaningful interpretations can be achieved. After this we demonstrate quantitative results of the fusion process using several well-known stereo-images. Testing was done using both a software implementation and a simulation of the hardware (FPGA) implementation of the system. Due to the increasing interest, in both the scientific community and the commercial sector, in implementing artificial vision systems that solve complex tasks in real time, we feel that such results should be of interest to anyone implementing real-time vision systems on chip. After presenting the results we proceed to conclusions, future work and acknowledgements.

2 Method

In this section we describe the low-level method used for generating the disparity es-

timations, followed by a description of the symbolic-level process used for generating

robust, sparse, feature-based disparities employing multimodal primitives. After the

low- and symbolic-level algorithms have been covered, the fusion process will be in-

troduced.

2.1 Low-level algorithm description

For the dense, low-level disparity estimation we have used a method based on the

phase component of band-pass ﬁltered versions of the stereo-images (input). Phase in-

formation was used for reasons of efﬁciency and stability [24][3][26][6][23]. Fleet and

Jepson showed the stability of the phase component with respect to small geometric

deformations [7], making phase more robust than amplitude for computing disparity

based on binocular cues. If the cameras have almost identical orientation and the base-

line is not too big whilst the distance to the object being observed is sufﬁcient then the

geometric deformations induced by motion parallax due to a change of viewpoint will

be small. In this case, the phase-based estimations obtained might be expected to be

accurate.


The model used is inspired by the optical-ﬂow calculation model of Gautama and

Van Hulle [8] and by the single-scale disparity calculation of Solari et al. [26]. The

ﬁnal model combines the advantages of both methods, using a coarse-to-ﬁne multi-

resolution computation scheme with warping [23]. In the chosen implementation, dis-

parity is calculated based on the phase-difference between stereo-images ﬁltered by a

bank of seven Gabor ﬁlters with different orientations, without explicitly calculating

the phase, thus rendering the method both hardware friendly and suitable for real-time

computations [3][2]. In addition to the above-mentioned properties (density, efﬁciency

and stability in the face of small geometric deformations), the phase-difference method

works explicitly at sub-pixel accuracy. For a more detailed explanation of the algo-

rithm, see Section 4.3. The stages of the algorithm are the following:

1. If on the coarsest scale skip this stage, otherwise:
   - expansion of results to current scale: D^k(x) = expand(D^{k+1}(x))
   - warping of right stereo-image: I_r(x) = warp(I_r(x + D^k(x)))

2. Convolution of the input images I_r^k(x) and I_l^k(x) with the Gabor filters to obtain the Gabor filter responses. Each image scale is convolved with the same set of filters, tuned to seven different orientations and with a spatial frequency peak of 0.25.

3. Filtering out those responses that are below a given energy threshold: those responses that do not tune well with the filters, corresponding to low energy, are considered unreliable and thus are filtered out.

4. Disparity calculation using the remaining responses (those that have not been filtered out). Since there are seven filters, each image position receives several disparity estimations.

5. Choosing the disparity estimation for each image position using a median filter, as indicated by (1), in order to obtain D_new^k(x).

6. Merging of valid disparity estimations: D^k(x) = merge(D^k(x), D_new^k(x)).

7. If not at the final scale, return to 1.

where I_l^k(x) and I_r^k(x) are the left and right stereo-images and D^k(x) is the disparity map corresponding to scale (resolution) k and position x = (x, y). The disparity estimation for each image position is chosen using a median filter, as indicated by (1):

D^k(x) = median_θ( D_θ^k(x, dP; f_0) )     (1)

where D^k(x) is the final resulting disparity for each image position x = (x, y), D_θ^k(x, dP; f_0) are the disparity responses corresponding to filter orientation θ and scale k, dP is the phase difference, and f_0 is the peak frequency of the filter.
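As an illustration of steps 2-5 at a single scale, the following Python/NumPy sketch computes disparity from the phase difference arg(G_l · conj(G_r)), so that the phase of either response is never extracted explicitly. The filter parameters, the energy threshold, and the projection by cos θ onto the epipolar line are our own simplifications, not the exact formulation of the implementation:

```python
import numpy as np

def gabor_bank(img, f0=0.25, n_orient=7, sigma=4.0, size=21):
    """Complex Gabor responses of an image for n_orient orientations
    (circular-boundary FFT convolution; parameter values are illustrative)."""
    h, w = img.shape
    responses = []
    for i in range(n_orient):
        theta = i * np.pi / n_orient
        ys, xs = np.mgrid[-(size // 2):size // 2 + 1, -(size // 2):size // 2 + 1]
        u = xs * np.cos(theta) + ys * np.sin(theta)
        kern = np.exp(-(xs**2 + ys**2) / (2.0 * sigma**2)) * np.exp(2j * np.pi * f0 * u)
        pad = np.zeros((h, w), dtype=complex)
        pad[:size, :size] = kern
        pad = np.roll(pad, (-(size // 2), -(size // 2)), axis=(0, 1))  # centre kernel at origin
        responses.append(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(pad)))
    return responses

def phase_disparity(left, right, f0=0.25, energy_thr=10.0):
    """Single-scale phase-difference disparity (steps 2-5 of the text)."""
    n_orient = 7
    ests = []
    for i, (gl, gr) in enumerate(zip(gabor_bank(left), gabor_bank(right))):
        theta = i * np.pi / n_orient
        if abs(np.cos(theta)) < 0.2:          # near-horizontal structure gives no horizontal cue
            continue
        dphase = np.angle(gl * np.conj(gr))   # phase difference; phases never made explicit
        d = dphase / (2.0 * np.pi * f0 * np.cos(theta))   # project onto the epipolar line
        d[np.abs(gl) * np.abs(gr) < energy_thr] = np.nan  # step 3: drop low-energy responses
        ests.append(d)
    return np.nanmedian(np.stack(ests), axis=0)  # step 5 / Eq. (1): median over orientations
```

For a horizontally shifted grating the recovered value matches the shift with sub-pixel resolution; a real implementation embeds this single-scale step in the coarse-to-fine scheme with warping described above.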

2.1.1 Hardware implementation

In this section we describe how the low-level algorithm, without fusion at this stage, has been implemented in a hardware design. The reason for including this part is that we have studied the effectiveness of the proposed fusion scheme in a simulation of the hardware implementation; by combining this information with the results, the feasibility of implementing the fusion in hardware can be estimated. The hardware architecture was implemented in a Xilinx Virtex XC4VFX1000 FPGA using a high-level hardware description language (high-level HDL), which permits description of the functionality at the algorithmic level.

The system consists of two different main stages:

1. Stage 1: rectiﬁcation and image pyramid creation.

2. Stage 2: processing loop, coarse to ﬁne scale.

By rectification we mean stereo-rectification using epipolar geometry, and by image pyramid we refer to a multi-resolution strategy. The design aims at a finely pipelined circuit benefitting from the high parallelism of the FPGA. The initial processing circuits for the left and right images are replicated and work in parallel. Nevertheless, inside these processing blocks the work is done sequentially, combining stages where possible: image rectification and the first down-scaling are done simultaneously, as soon as enough rectified pixels are available. Once the image pyramids have been created, the processing loop starts from the coarsest scale and advances towards the finest by repeating the same block sequentially.

The main steps of the processing loop are:

1. Expansion of results to the next scale: D^k(x) = expand(D^{k+1}(x)).

2. Warping of input images as per the expanded disparity.

3. Disparity calculation for the current scale, D_new^k(x).

4. Merging of disparity estimations: D^k(x) = merge(D_new^k(x), D^k(x)).
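The four steps above can be sketched in software as follows; the nearest-pixel warping, NaN-based validity coding and the `expand`/`warp`/`merge` helpers are our own simplified stand-ins for the pipelined hardware blocks:

```python
import numpy as np

def expand(d):
    """Upsample a disparity map to the next finer scale (nearest-neighbour),
    doubling both the grid and the disparity values."""
    return 2.0 * np.repeat(np.repeat(d, 2, axis=0), 2, axis=1)

def warp(img, d):
    """Shift each pixel horizontally by the current disparity estimate
    (nearest-pixel warping, clipped at the borders, for simplicity)."""
    h, w = img.shape
    xs = np.clip(np.arange(w)[None, :] + np.round(d).astype(int), 0, w - 1)
    return img[np.arange(h)[:, None], xs]

def merge(d_old, d_new):
    """Combine the propagated estimate with the residual computed at this
    scale; invalid (NaN) residuals keep the propagated value."""
    return np.where(np.isnan(d_new), d_old, d_old + d_new)

def coarse_to_fine(left_pyr, right_pyr, compute_residual):
    """left_pyr/right_pyr: image pyramids ordered coarse to fine;
    compute_residual: any single-scale disparity routine (e.g. phase-based)."""
    d = np.zeros_like(left_pyr[0])
    for k, (il, ir) in enumerate(zip(left_pyr, right_pyr)):
        if k > 0:
            d = expand(d)                          # step 1: expand to current scale
        ir_w = warp(ir, d)                         # step 2: warp right image
        d = merge(d, compute_residual(il, ir_w))   # steps 3-4: compute residual and merge
    return d
```

Each scale only has to recover the residual disparity left over after warping, which is what keeps the per-scale phase estimates within the filters' working range.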

The architecture works at a data rate of one pixel per clock cycle. Table 1 displays

the amount of resources consumed by the implementation. The implementation uses

a fixed-point representation, and the number of bits used for representing fractions is given in Table 1.

Table 1: Implementation details for the Xilinx Virtex XC4VFX1000 FPGA (available resources in parentheses).

LUTs (50560)             : 15810
Slice Flip Flops (50560) : 11693
Slices (25280)           : 12464
DSP (128)                : 80
Block RAM (232)          : 16
Freq. (MHz)              : 60.0
Fraction bits            : 2

2.2 Symbolic-level algorithm description

The multi-modal visual primitives are local visual feature descriptors that were described in [12]. They are semantically and geometrically meaningful descrip-

tions of local image patches, motivated by the hyper-columnar structures in V1 [9].

Primitives can be edge-like or homogeneous and either 2D or 3D. In this work, only

edge-like primitives are relevant and for other deﬁnitions the reader should consult [12].

An edge-like 2D primitive is deﬁned by equation 2:

π = (x, θ, ω, (c_l, c_m, c_r)),     (2)

where x is the image position of the primitive; θ is the 2D orientation; ω represents the contrast transition; and (c_l, c_m, c_r) is the representation of the colour, corresponding to the left (c_l), the middle (c_m) and the right side (c_r) of the primitive. Fig. 2 shows the extracted primitives for an example scene.


Figure 2: Extracted primitives (b) for the example image in (a). Magniﬁed primitives

in (d) and edge primitives in (c) for the marked region of interest in (b).

A 2D edge primitive π is a 2D feature which can be used to find correspondences in a stereo framework to create 3D edge primitives (as introduced in [17]), the formula for which is given in equation 3:

Π = (X, Θ, Ω, (c_l, c_m, c_r)),     (3)

where X is the 3D position; Θ is the 3D orientation; Ω is the phase (i.e., contrast transition); and (c_l, c_m, c_r) is the representation of the colour, corresponding to the left (c_l), the middle (c_m) and the right side (c_r) of the 3D primitive.
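As a purely illustrative data sketch, the 2D and 3D edge primitives of equations (2) and (3) can be modelled as plain records; the field names below are our own, not those of the original implementation:

```python
from dataclasses import dataclass
from typing import Tuple

Color = Tuple[float, float, float]  # e.g. an RGB triple

@dataclass
class EdgePrimitive2D:
    """pi = (x, theta, omega, (c_l, c_m, c_r)) from equation (2)."""
    x: Tuple[float, float]               # image position
    theta: float                         # 2D orientation
    omega: float                         # contrast transition
    colors: Tuple[Color, Color, Color]   # left, middle, right side of the edge

@dataclass
class EdgePrimitive3D:
    """Pi = (X, Theta, Omega, (c_l, c_m, c_r)) from equation (3), obtained
    by stereo correspondence of two 2D primitives [17]."""
    X: Tuple[float, float, float]        # 3D position
    Theta: Tuple[float, float, float]    # 3D orientation
    Omega: float                         # phase (contrast transition)
    colors: Tuple[Color, Color, Color]
```

The point of the record view is that each primitive is a small, discrete symbol carrying several modalities at once, which is what makes the symbolic-level reasoning of the next section possible.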

2.3 Fusion process

The density of sparse algorithms is considerably lower than that of dense algorithms,

typically well below 15%. Without propagating the sparse symbolic-level disparity before feeding it back at the signal level, such low-density maps would be able to disambiguate only locally, over a very limited area, thus making the improvements depend directly upon both the accuracy and the density of the symbolic disparity. In a multi-scale

approach, however, fusion is done at each scale, meaning that the sparse disparity map


has to be scaled down to match the downscaled image sizes. We scale the symbolic-

level disparity map down using median ﬁltering (ignoring positions that do not contain

disparity values) which results in the natural propagation of the disparities. Neverthe-

less, before scaling down the symbolic-level disparity, it is densiﬁed by applying voting

mask propagation (VMP) [20]. Ralli et al. show in their VMP paper that signiﬁcant

densiﬁcation of a sparse-disparity is possible, producing only a very minor increase in

error, using mask propagation with a voting scheme (cf. [20]). Once the disparity map

provided by the symbolic-level algorithm is densiﬁed by applying VMP and scaled

down, it is used for disambiguation of the interpretations generated by the low-level al-

gorithm by biasing the corresponding nearest (most similar) value and thus maximising

the coherency. Pseudo-code of the fusion process is given in Algorithm 1.

Algorithm 1 If the difference between the most similar low- and symbolic-level values is above a given rejection threshold, then the symbolic-level value is discarded and the decision is made as indicated by Equation (1).

if min_θ |D_sym(x) − D_θ^low(x)| > thr  OR  D_sym(x) = ∅ then
    D = median(D_θ^low(x))
else
    D = nearest(D_sym(x), D_θ^low(x))
end if

where D_sym(x) is the symbolic-level disparity approximation, D_θ^low(x) are the energy-filtered, low-level disparity approximations per orientation, thr is the rejection threshold, ∅ is the empty set, and the function nearest(A, B) returns the value from B that

is nearest to A (nearest in Euclidean sense). Therefore, if there are no symbolic-level

disparity approximations or the difference between the closest symbolic- and low-level

disparities is greater than the rejection threshold, then the disparity is chosen ‘nor-

mally’ as deﬁned by Equation (1). If the difference between the closest symbolic-level

and low-level disparity is below the rejection threshold, then the closest low-level dis-

parity is chosen. This selection mechanism can be understood as biasing the closest

low-level disparity value in such a way as to have more likelihood of being chosen.

Biasing can be done in other ways, possibly using either a cost function or a reliability

measure for instance. Since the disparities provided by the symbolic-level are based on

multimodal visual primitives that already have local support [11][18], there is no need

to aggregate local evidence in the fusion process when biasing the disparity calculated

by the low-level. If aggregation of evidence is needed, it should be carried out at the symbolic level, since the multimodal visual primitives arrive at a higher scene description: for example, by grouping the primitives into sets describing the same objects and/or object contours [18]. Figs. 3 and 4 display the data flow without and with fusion.
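The selection rule of Algorithm 1 can be sketched per pixel as follows; the array shapes, the NaN-coding of missing values and the vectorised form are our own illustrative choices:

```python
import numpy as np

def fuse_disparity(d_sym, d_low, thr):
    """Fusion of Algorithm 1, vectorised over the image.

    d_sym : (H, W) symbolic-level disparity, NaN where no primitive exists
    d_low : (n_orient, H, W) energy-filtered low-level disparities per
            orientation, NaN where filtered out
    thr   : rejection threshold
    """
    default = np.nanmedian(d_low, axis=0)            # Eq. (1): median over orientations
    diff = np.abs(d_low - d_sym[None])               # |D_sym - D_theta^low| per orientation
    diff = np.where(np.isnan(diff), np.inf, diff)    # missing estimates never win
    idx = np.argmin(diff, axis=0)                    # orientation of the nearest estimate
    nearest = np.take_along_axis(d_low, idx[None], axis=0)[0]
    accept = np.min(diff, axis=0) <= thr             # reject: fall back to the median
    return np.where(accept, nearest, default)
```

Note that the symbolic value itself is never copied into the output; it only biases which low-level candidate is chosen, which is exactly the biasing interpretation given above.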

3 Experiments

The proposed fusion model was tested both quantitatively, using well-known benchmark images from the Middlebury¹ database, and qualitatively, using images from the DRIVSCO² project. In the Middlebury case the results are given in two different ways:

¹ http://vision.middlebury.edu/stereo/data
² http://www.pspc.dibe.unige.it/~drivsco/