ARTIFICIAL PARIETAL CORTEX NEURONS FOR 3D RECONSTRUCTION
JEREMIAH J. NEUBERT
Mechanical Engineering, University of Wisconsin, Madison, Wisconsin

NICOLA J. FERRIER*
Mechanical Engineering, University of Wisconsin, Madison, Wisconsin

YU H. HU
Electrical & Computer Engineering, University of Wisconsin, Madison, Wisconsin

ABSTRACT

A neural network is used for three-dimensional (3D) reconstruction of a point from a pair of images obtained with an active stereo system. Our active stereo system describes the position of a point with eight parameters: two pan angles, two tilt angles, and the two-dimensional coordinates of the projected point in each image. 3D reconstruction consists of learning the function that maps these eight parameters to the 3D world coordinates (x_w, y_w, and z_w). This paper outlines two possible networks for learning the mapping function. One contains simple Gaussian neurons in the hidden layer; the other is composed of neurons based on a model of the neurons in the parietal cortex of the human brain thought to be involved in 3D reconstruction from visual data. The paper compares the performance of each network. We evaluate the size of training set required and the effect of the selection method used to obtain the training examples.

INTRODUCTION

Three-dimensional reconstruction with active stereo cameras is a difficult problem. Traditional kinematic model-based approaches can involve more than forty different parameters, and finding these parameters requires a complicated calibration procedure and the collection of several different images [Knight and Reid, 2000, Guse, 1999]. Neubert et al. [Neubert et al., 2001] suggest that a neural network (NN) may provide a more accurate and robust method for 3D reconstruction. This paper further explores the ability of artificial neural networks to estimate three-dimensional coordinates from a dynamic stereo head. Two neural network structures were investigated. The networks were compared using sum squared error and the optimum size of the training set.
Feed-forward neural networks are often used to approximate functions and have several advantages over the standard model-based approach to 3D reconstruction from stereo images. They do not require a presupposed camera and kinematic model, and they are robust to noise in the training data. This makes them well suited to the reconstruction problem, but regression with a neural network can require a rather large set of training data to create a quality model of the function and avoid overfitting. One of the most important factors in using a neural network to approximate a function is the proper choice of the network structure and neurons [Poggio and Girosi, 1990]. Simple traditional neural networks are well understood, but may not be capable of providing a good approximation of the desired function. This paper will justify the added complexity
* During part of the period of this research, the first author also worked as a KTI Fellow, and received fellowship support under the NSF GK-12 Program (NSF-9979628). This research was supported in part by NSF IRI-9703352.
of the biologically inspired neural network by comparing it to a simple Gaussian radial basis function network similar to those proposed by others [Moody and Darken, 1989, Poggio and Girosi, 1990]. A description of the data selection methods is given next, followed by a description of the biologically inspired network, the Gaussian network, and the methods used to train them. The findings are then summarized and discussed.

DATA SELECTION METHODS

Each network was trained with 1000, 1776, 2000, 4000, and 8000 training examples. For each set, data was selected both randomly and systematically from approximately 16,000 examples gathered using the method described by Neubert et al. [Neubert et al., 2001]. The random selection was performed without replacement, with each sample having the same probability of being selected. The systematic method attempts to minimize the maximum amount of space between training examples; because the positions of the points are known, this is a trivial task. We conjectured that this method would provide a better representation of the variable space than random selection, especially with smaller data sets, but as will be shown, this was found not to be the case.

BIOLOGICALLY INSPIRED NEURAL NETWORK

The Biologically Inspired Neural Network (BioNet) was utilized because it is based on a model of a system that has been honed by tens of thousands of years of evolution. This paper demonstrates that the active stereo system shares enough similarities with human eyes that this model of the neurons in the parietal cortex performs well on this data set.

BioNet Structure: The network was composed of three layers of neurons. The first layer was the input layer, which consisted of the eight random variables: the pan and tilt of each camera, and the position of the blob centroid in each image. A subset of the input layer was connected to each of the neurons in the hidden layer.
The structure of the hidden neurons was based on Pouget's model of those contained in the parietal cortex of the human brain [Pouget, 1994]. These neurons were modeled as the product of a sigmoid and a Gaussian. This product, referred to as a basis function, had the following form:

h_i(x, \theta) = e^{-\frac{(x - x_i)^2}{2\sigma_i^2}} \cdot \frac{1}{1 + e^{-(\theta - \theta_i)/T}} \qquad (1)
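For concreteness, the basis function above can be transcribed directly. The following sketch is our illustration rather than the authors' code; the parameter names (sigma_i for the Gaussian radius, T for the sigmoid radius) follow the notation in the text.

```python
import math

def bionet_neuron(x, theta, x_i, theta_i, sigma_i, T):
    """Parietal-cortex-style hidden unit: a Gaussian in the image
    coordinate x multiplied by a sigmoid in the PTU angle theta.
    (x_i, theta_i) is the neuron's center; sigma_i and T are the radii."""
    gaussian = math.exp(-((x - x_i) ** 2) / (2.0 * sigma_i ** 2))
    sigmoid = 1.0 / (1.0 + math.exp(-(theta - theta_i) / T))
    return gaussian * sigmoid
```

At its center (x = x_i, θ = θ_i) the unit outputs 0.5, since the Gaussian factor is 1 and the sigmoid sits at its midpoint; the output grows toward 1 as θ increases past θ_i and decays as x moves away from x_i.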
The hidden neuron h_i had two inputs, x and θ. Input x was one of the coordinates of the target centroid in an image, and θ was one of the pan-tilt unit (PTU) angles. The image-position inputs were each paired with one of the PTU angle inputs, leading to four sets of neurons: (u_left, θ_pan,left), (v_left, θ_tilt,left), (u_right, θ_pan,right), and (v_right, θ_tilt,right). These groupings of hidden neurons will be referred to as quadrants. Each quadrant is laid out in a grid based on the neuron centers, x_i and θ_i (see Figure 1). The hidden neurons were fully connected to the output neurons, x_world, y_world, and z_world. The output neurons were simple linear neurons; their value was a weighted sum of the outputs of the hidden neurons.

BioNet Training: Four parameters were tuned by the training algorithm: the Gaussian radius, the sigmoid radius, the number of neurons, and the weights from the hidden layer to the output layer. This training method is a modification of the work of Fritzke [Fritzke, 1994] and Blanzieri et al. [Blanzieri et al., 1995]. The training data was divided into two sets, tuning and training (training refers to the portion of the training set not allocated to tuning). The tuning set, 10% of the total training data, was used to predict the performance of the network on examples that were not in the training set, allowing training to be stopped before the network begins to generalize poorly. The neuron centers were initialized by choosing the desired number of center values to position along each component of the input vector. The center values were evenly spaced
Figure 1: The grid on the left represents the initial network and the one on the right represents a trained network. Several new neurons added through training can be seen on the right. The neurons are the large dots, and their positions are determined by their centers, x_i and θ_i. The lines connecting the neurons represent neighbor relationships.
over the range of the training data. All possible combinations of the values corresponding to a quadrant became the initial centers of the neurons in that quadrant. Figure 1 shows a quadrant with three initial center values along each component of the input vector that corresponds to that quadrant. The neighbors of each neuron were found by locating the neurons with the closest center in each of the four directions (see Figure 1). These neighbors were later used in the placement of new neurons. Next, the radii of the sigmoid and Gaussian were determined. The radius of each function was defined as the average distance to the neighbors in the direction of that function. For example, the radius of the Gaussian of neuron B on the left side of Figure 1 would be the average distance of E and C from B, while the radius of the sigmoid would be the average distance of A and D from B. If a neuron didn't have any neighbors in one of the directions, that radius was set to the average radius of the neighbors it did have. There are two widely used methods to determine the proper weight vector for each output neuron: matrix inversion (least squared error) and back propagation. Matrix inversion is in general an O(n^3) operation, and large amounts of training data can require an undesirable number of CPU cycles. Matrix inversion also produced weights on the order of 10^9, causing poor generalization in initial testing. Thus we used a back propagation approach, which uses gradient descent to minimize the error function:

\epsilon = \frac{1}{2} \sum_i (d_i - y_i)^2 \qquad (2)
where y_i was one of the outputs of the network and d_i was its corresponding desired output. After each training example presented during back propagation, three quantities were updated: A_j, the activity of hidden neuron j; E_ji, the sum of the error values from output neuron i attributed to hidden neuron j; and E_ji^abs, the sum of the magnitudes of the error attributed to hidden neuron j. For j = s, where neuron s was defined as the neuron whose center was closest to the input in a quadrant, the changes were ΔA_s, ΔE_si, and ΔE_si^abs:

\Delta A_s = a_s \qquad (3)

where a_s was the output of hidden neuron s, and

\Delta E_{si} = (d_i - y_i) \qquad (4)

\Delta E_{si}^{abs} = |d_i - y_i|. \qquad (5)

For the remaining hidden neurons, j ≠ s, the changes were defined as

\Delta A_j = -\alpha A_j \qquad (6)

\Delta E_{ji} = -\alpha E_{ji} \qquad (7)

\Delta E_{ji}^{abs} = -\alpha E_{ji}^{abs} \qquad (8)
where α was the rate of decay of the past values, with α ∈ [0, 1]; the greater the magnitude of α, the shorter the memory of past values. After every G examples, each quadrant was checked to see if any neuron had an activation A_j greater than the user-defined threshold. If one or more such neurons were found, their accumulated error E_j was calculated and used to determine where the new neuron should be inserted in each quadrant:

E_j = \sqrt{\sum_{i \in \text{outputs}} \left( |E_{ji}| - E_{ji}^{abs} \right)^2} \qquad (9)
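As a concrete illustration, the accumulator updates of equations (3)-(8) and the growth criterion of equation (9) can be sketched as follows. This is our sketch, not the authors' implementation; the function names, list-based data layout, and the default decay value are assumptions.

```python
import math

def update_accumulators(A, E, E_abs, s, a_s, desired, outputs, alpha=0.1):
    """One per-example update.  The winning neuron s (closest center in its
    quadrant) accumulates its activation and the signed/absolute output
    errors (Eqs. 3-5); every other hidden neuron decays by alpha (Eqs. 6-8)."""
    for j in range(len(A)):
        if j == s:
            A[j] += a_s                                      # Eq. (3)
            for i in range(len(desired)):
                err = desired[i] - outputs[i]
                E[j][i] += err                               # Eq. (4)
                E_abs[j][i] += abs(err)                      # Eq. (5)
        else:
            A[j] -= alpha * A[j]                             # Eq. (6)
            for i in range(len(desired)):
                E[j][i] -= alpha * E[j][i]                   # Eq. (7)
                E_abs[j][i] -= alpha * E_abs[j][i]           # Eq. (8)

def accumulated_error(E_j, E_j_abs):
    """Eq. (9): L2 norm over outputs of |E_ji| - E_ji^abs.  Near zero when
    the per-example errors shared a sign (a weight problem); large when the
    error sign alternated (a missing-neuron problem)."""
    return math.sqrt(sum((abs(e) - e_abs) ** 2
                         for e, e_abs in zip(E_j, E_j_abs)))
```

For example, two presentations with errors +1 and -1 leave E_ji = 0 but E_ji^abs = 2, so E_j = 2 and the region is flagged for a new neuron, whereas two errors of +1 each give E_ji = E_ji^abs = 2 and E_j = 0.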
Equation (9) is a slight variation on that developed by Blanzieri et al. [Blanzieri et al., 1995], who describe the quantity |E_j| - E_j^abs, intended for a single-output network; Equation (9) is the L2 norm of all the E_ji associated with hidden neuron j. The accumulated error is an indirect measure of the variation in the error. If the weights were the source of the error, E_j was small because all the errors had the same sign, so the magnitude of the summed error matched the sum of the error magnitudes. If a new neuron was needed to reduce the error, the sign of the error should alternate, driving the summed error toward zero and producing a large E_j. Once the neuron with the largest E_j was found in each quadrant, a new neuron was added between that neuron and its neighbor with the greatest E_j. The addition of the neuron results in a decrease of A_j, E_ji, and E_ji^abs by a factor f for the parent neurons. The parent neurons were no longer neighbors, but each became a neighbor of the new neuron. The weights on the output of the newly inserted neuron were initialized to the average of its parents' weights. Then the neighbors in the direction orthogonal to its parents were located. This process was repeated until a user-specified number of epochs had occurred; the algorithm then returned the network that performed best on the tuning set.

GAUSSIAN RADIAL BASIS NETWORK

The Gaussian radial basis network consisted of three layers: input, hidden, and output. This was the same basic network used by Fritzke [Fritzke, 1994], whose paper states that a Gaussian network with this structure should, in principle, be able to approximate any smooth function.

Gaussian Network Structure: The input layer was a replica of that described above, but unlike the biologically inspired NN, the input layer was fully connected to the hidden layer. The hidden layer was made up of neurons with Gaussian activation functions:

h_i(x) = e^{-\|x - x_i\|^2 / \sigma_i^2}
Here x is the input vector containing all the input neuron outputs, and x_i, referred to as the center, is a vector with the same dimensionality as the input. The term σ_i determines the spread of the Gaussian and is referred to as the radius of the function.

Gaussian Network Training: Four parameters were tuned by the training algorithm: the centers of the hidden neurons, their radii, the number of hidden neurons, and the weights from the hidden layer to the output layer. This training method was taken from Blanzieri et al. [Blanzieri et al., 1995] with only minor modifications.
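A minimal transcription of this hidden-unit activation (our sketch; the function name and plain-list vector handling are assumptions, and the exact placement of σ_i in the exponent follows the form given above):

```python
import math

def gaussian_rbf(x, center, radius):
    """Gaussian radial basis activation exp(-||x - x_i||^2 / sigma_i^2),
    where x and center are equal-length vectors and radius is sigma_i."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-sq_dist / radius ** 2)
```

The activation is 1 at the center and falls off with squared distance; a larger radius widens the region of the input space to which the neuron responds.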
Using the method described above, the training data set was divided into two subsets. These sets were used to implement early stopping and prevent overfitting. Our training involved the following steps. First, the centers were initialized using the k-means clustering algorithm described by Alsabti et al. [Alsabti et al., 1998]; the means found via k-means were used as the initial centers of the hidden neurons. Second, the neighbors of each hidden neuron were found. A neuron's neighbors were defined as the four closest neurons, with the distance between neurons given by the L2 norm of the difference in their centers. The average distance to a neuron's neighbors was used as the radius, σ, of its Gaussian. Next, the weights on the outputs of the hidden neurons were trained using the same back propagation that was used to train the biologically inspired neural network. During back propagation the values of A_j, E_ji, and E_ji^abs were updated in the manner described for the biologically inspired NN, but because there were no quadrants in the Gaussian network, only one neuron had its activation and error parameters updated via equations (3), (4), and (5). Then the neuron closest to the input of the example, along with its neighbors, was moved toward the input; the centers in the biologically inspired neural network were not given this degree of freedom. The change in the center of the nearest neuron s is Δx_s = ε_b(x - x_s), while the centers of its neighbors move Δx_i = ε_n(x - x_i), where ε_b is a small positive number and ε_b ≫ ε_n with ε_n > 0. The final step in training was to add a new neuron after every G examples if there was a neuron with A_j greater than the user-specified threshold. The neuron was added using the same method described for the BioNet. Once the neuron was added to the network, its neighbors were re-calculated.
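The center-movement step just described can be sketched as follows. This is our illustration; the function name and the particular step sizes are arbitrary examples of ε_b ≫ ε_n > 0, not values from the paper.

```python
def move_centers(centers, nearest, neighbor_ids, x, eps_b=0.05, eps_n=0.005):
    """Pull the nearest neuron's center toward the input x by a fraction
    eps_b, and its neighbors' centers by the much smaller fraction eps_n."""
    for k in range(len(x)):
        centers[nearest][k] += eps_b * (x[k] - centers[nearest][k])
    for j in neighbor_ids:
        for k in range(len(x)):
            centers[j][k] += eps_n * (x[k] - centers[j][k])
```

Each update moves a center a fixed fraction of its remaining distance to the input, so repeated presentations of similar inputs gradually drag the winning neuron (and, more slowly, its neighbors) into the populated region of the input space.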
This process was repeated until the desired number of epochs or a minimum error was reached, and the network that performed best on the tuning data was output.

EXPERIMENTAL RESULTS

The first question addressed was the quantity of data needed to train the neural networks and which method of data selection was optimum. This was determined by training each of the networks with differing amounts of data. In addition, the networks were trained first with data chosen randomly, then with data selected systematically. The performance of the networks on the testing set can be seen in Figure 2, which shows the influence of both data selection and training set size on the average sum squared error over the testing set. The average sum squared error was used because it is proportional to the distance of the network's estimate from the true coordinates of the point. The plot revealed that, as expected, the error increased as the amount of data decreased. The increase in error was relatively small until about .125 of the gathered data (2000 samples) was reached; at this point the error increased significantly. The comparison of the data selection methods had some interesting results. The data showed that the effect of systematic selection was statistically insignificant for the biologically inspired neural network, and that the systematic method was detrimental to the performance of the Gaussian network. These observations were used to determine the optimal data set size (.125) and data selection method (random) for training the two networks so a comparison could be made. This was accomplished by first creating 10 different training and testing sets, each with .125 of the total data selected at random. Then each of the networks was trained and tested. The biologically inspired neural network vastly outperformed the Gaussian network (see Table 1).
DISCUSSION

In this paper we established that both types of neural networks require at least 2000 examples to obtain optimal performance. It was shown that the error in the neural networks may increase by as much as 60% when 1000 examples rather than the optimal 2000 were used to train the networks. This seems to indicate that training sets of fewer than 2000 examples did not contain enough information to produce a good approximation of the mapping function. The random data selection method was shown to be as effective as, if not better than, the systematic data selection method. This is likely due to improper handling of interactions between variables. A solution may be to use a design of experiments (DOE) approach to select data; this approach is commonly used in the engineering community to vary the control variables and find the best solution with the least amount of data, and it will be explored in future work. This paper also demonstrated that the more complicated biologically based neural network is capable of a far better approximation of the three-dimensional mapping function. The Gaussian network described here did not have the expressive power to properly approximate the mapping function.
Gaussian   BioNet   Diff.
26.69      15.92    10.76
26.91      14.94    11.96
27.39      14.98    12.41
27.46      15.42    12.03
27.89      14.76    13.13
27.17      15.25    11.92
27.38      14.69    12.69
26.93      16.05    10.88
27.05      15.39    11.66
27.38      16.02    11.36

Average Diff.: 11.88    Std: 0.238
Figure 2: Results of tests to determine the amount of data needed to train the neural networks.
Table 1: The average sum squared error, in cm^2, over the test set with the optimal training size and random data selection. Std is the standard deviation of the differences.
REFERENCES

[Alsabti et al., 1998] Alsabti, Ranka, and Singh (1998). An efficient parallel algorithm for high dimensional similarity join. In IPPS: 11th International Parallel Processing Symposium. IEEE Computer Society Press.

[Blanzieri et al., 1995] Blanzieri, E., Katenkamp, P., and Giordana, A. (1995). Growing radial basis function networks. In Proc. of the 4th Workshop on Learning Robots, Karlsruhe, Germany.

[Fritzke, 1994] Fritzke, B. (1994). Supervised learning with growing cell structures. In Cowan, J., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6, pages 255-262. Morgan Kaufmann Pub., San Mateo, CA.

[Guse, 1999] Guse, N. (1999). A neural network for visual kinematics. M.Sc. thesis, University of Wisconsin, Madison.

[Knight and Reid, 2000] Knight, J. and Reid, I. (2000). Active visual alignment of a mobile stereo camera platform. In IEEE International Conference on Robotics and Automation, pages 3203-3208, San Francisco, CA.

[Moody and Darken, 1989] Moody, J. and Darken, C. J. (1989). Fast learning in networks of locally-tuned processing units. Neural Computation, 1:281-293.

[Neubert et al., 2001] Neubert, J., Hammond, A., Guse, N., Do, Y., Hu, Y., and Ferrier, N. (2001). Automatic training of a neural net for active stereo 3D reconstruction. In IEEE Intl. Conf. on Robotics and Automation, pages 2140-2146.

[Poggio and Girosi, 1990] Poggio, T. and Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9):1481-1497.

[Pouget, 1994] Pouget, A. (1994). Computational Models of Spatial Representations. PhD thesis, University of California, San Diego.