Neural Information Processing Systems
NIPS*2000 Online Abstracts and Papers
This is a list of the papers that were presented at NIPS*2000. They are
listed in (approximately) alphabetical order according to the last
name of the first author. The title is a link to a Postscript or PDF version
of the paper.
BibTeX entries for all papers can be found
here.
What Can a Single Neuron Compute? -- Blaise Ag{\"u}era y Arcas, Adrienne L. Fairhall, William Bialek
In this paper we formulate a description of the computation
performed by a neuron as a combination of dimensional reduction
and nonlinearity. We evaluate this description for the
Hodgkin-Huxley model information-theoretically.
Who Does What? A Novel Algorithm to Determine Function Localization -- Ranit Aharonov-Barki, Isaac Meilijson, Eytan Ruppin
We introduce a novel algorithm, termed FDA (Functional Deficit Algorithm),
that quantitatively measures the contributions of elements of a neural
system to the tasks it performs. The algorithm identifies the neurons
or areas which participate in a cognitive or behavioral task, given data
about performance decrease in a small set of lesions. It also allows the
accurate prediction of deficits due to multi-element lesions. The
effectiveness of the new algorithm is demonstrated in two models of
recurrent neural networks with complex interactions among the elements.
The algorithm is scalable and applicable to the analysis of large neural
networks. Given the recent advances in reversible inactivation
techniques, it has the potential to significantly contribute to the
understanding of the organization of biological nervous systems, and to
shed light on the long-lasting debate about local versus distributed
computation in the brain.
Programmable Reinforcement Learning Agents -- David Andre, Stuart J. Russell
We present an expressive agent design language for reinforcement
learning that allows the user to constrain the policies considered
by the learning process by {\em partially} specifying a program for
the agent. The language includes standard programmatic features
such as parameterized subroutines, temporary interrupts, aborts, and
memory variables, and allows for the partial specification of
complex agent programs. For learning that which isn't specified, we
present provably convergent learning algorithms. We demonstrate
that agent programs written in the language are concise as well as
modular, which is important to facilitate state abstraction and the
transferability of learned skills.
From Mixtures of Mixtures to Adaptive Transform Coding -- Cynthia Archer, Todd K. Leen
Transform codes (e.g. JPEG, wavelet coding) compress data by
transforming data into coordinates presumed to be more favorable for
compression, allocating coding bits among the new coordinates, and
quantizing the transformed coordinate scalar values. Often the coding
procedure includes an ad hoc choice of transform (e.g. PCA, DCT) and
suboptimal bit allocation and quantizer design. As detrimental as the
suboptimal choices for these steps is the fact that designs proceed
piecemeal, with interactions between steps typically ignored. This paper
does three things: i) We establish a principled framework for adaptive
transform coding that starts from a probabilistic latent variable model
in the form of a mixture of constrained Gaussian mixtures. A
hard-clustering limit of this model reveals a truly optimal transform
coding algorithm, which is a constrained version of the generalized
Lloyd algorithm for vector quantizer design. ii) We introduce a new
transform that is required by the mathematical development. Unlike PCA,
DCT, and wavelet transforms, this new transform is explicitly optimized
for compression. iii) We exercise the new algorithm on image data. The
tested implementation reduces computational complexity, while sustaining
a small (estimated $< 0.2$ dB) decrease in signal-to-noise ratio, by using
the PCA transform and a greedy bit allocation procedure instead of our
new coding transform and optimal bit allocation. Adaptive transform
coders designed with our algorithm improve signal-to-noise ratio by 2.5
dB compared to global transform coding and 1 to 3 dB compared to other
adaptive transform coders.
Dendritic Compartmentalization Could Underlie Competition and Attentional Biasing of Simultaneous Visual Stimuli -- Kevin A. Archie, Bartlett W. Mel
Neurons in area V4 have relatively large receptive fields (RFs), so
multiple visual features are simultaneously ``seen'' by these cells.
Recordings from single V4 neurons suggest that simultaneously
presented stimuli compete to set the output firing rate, and that
attention acts to isolate individual features by biasing the
competition in favor of the attended object. We propose that both
stimulus competition and attentional biasing arise from the spatial
segregation of afferent synapses onto different regions of the
excitable dendritic tree of V4 neurons. The pattern of feedforward,
stimulus-driven inputs follows from a Hebbian rule: excitatory
afferents with similar RFs tend to group together on the dendritic
tree, avoiding randomly located inhibitory inputs with similar RFs.
The same principle guides the formation of inputs that mediate
attentional modulation. Using both biophysically detailed
compartmental models and simplified models of computation in single
neurons, we demonstrate that such an architecture could account for
the response properties and attentional modulation of V4 neurons.
Our results suggest an important role for nonlinear dendritic
conductances in extrastriate cortical processing.
Place Cells and Spatial Navigation Based on 2D Visual Feature Extraction, Path Integration, and Reinforcement Learning -- Angelo Arleo, Fabrizio Smeraldi, St\'ephane Hug, Wulfram Gerstner
We model hippocampal place cells and head-direction cells by combining
allothetic
(visual) and idiothetic (proprioceptive) stimuli. Visual input, provided by a
video camera
on a miniature robot, is preprocessed by a set of Gabor filters on 31 nodes of a
log-polar
retinotopic graph. Unsupervised Hebbian learning is employed to incrementally
build a
population of localized overlapping place fields. Place cells serve as basis
functions for
reinforcement learning. Experimental results for goal-oriented navigation of a
mobile
robot are presented.
Speech Denoising and Dereverberation Using Probabilistic Models -- Hagai Attias, John C. Platt, Alex Acero, Li Deng
This paper presents a unified probabilistic framework for denoising and
dereverberation of speech signals. The framework transforms the denoising
and dereverberation problems into Bayes-optimal signal estimation. A key
point in our approach is the use of a strong speech model as a prior. The
speech model is a mixture model pre-trained on a large data set of clean
speech. The framework covers both single and multiple microphones. We
apply this approach to a noisy reverberant speech signal, and get results
substantially better than standard methods.
Combining ICA and Top-Down Attention for Robust Speech Recognition -- Un-Min Bae, Soo-Young Lee
We present an algorithm which compensates for the mismatches between characteristics of real-world problems and assumptions of independent component analysis algorithm. To provide additional information to the ICA network, an MLP classifier is added to the separated signal channel and the error of the classifier is backpropagated to the ICA network. Then, the unmixing matrix is retrained according to a new cost function representing the backpropagated error as well as independence. It modifies the density of recovered signals to the density appropriate for classification. Under a stationary mixing environment, the averaged adaptation of the unmixing system over the total data can roughly approximate the non-ideal mixing characteristics, such as the nonlinearity of microphones. For noisy speech signal recorded in real environments, the algorithm improved the recognition performance and showed robustness against parametric changes.
Modelling Spatial Recall, Mental Imagery and Neglect -- Suzanna Becker, Neil Burgess
We present a computational model of the neural mechanisms in the
parietal and temporal lobes that support spatial navigation, imagery,
and episodic recall. Long term representations are stored in the
hippocampus, and are associated with local spatial and object-related
features in the parahippocampal region. Viewer-centered representations are
dynamically generated from long term memory in the parietal part of the model.
The model thereby simulates recall and imagery of locations and objects in
complex environments. After parietal damage, the model
exhibits hemispatial neglect in mental imagery that rotates with the
imagined perspective of the observer, as in the famous Milan Square
experiment reported by Bisiach and Luzatti (1978).
Our model makes novel predictions for the neural
representations in the parahippocampal and parietal regions and
for behavior in healthy volunteers and neuropsychological patients.
Shape Context: A New Descriptor for Shape Matching and Object Recognition -- Serge Belongie, Jitendra Malik, Jan Puzicha
We introduce a new shape descriptor, the {\em shape context}, for
correspondence recovery and shape-based object recognition. The
shape context at a point captures the distribution over
relative positions of other shape points and thus summarizes
global shape in a rich, local descriptor. Shape contexts
greatly simplify recovery of correspondences between points of
two given shapes. Moreover, the shape context leads to a robust
score for measuring shape similarity, once shapes are aligned.
The shape context descriptor is tolerant to all common shape
deformations. As a key advantage no special landmarks or
key-points are necessary. It is thus a generic method with
applications in object recognition, image registration and
point set matching. Using examples involving both handwritten
digits and 3D objects, we illustrate its power for object
recognition.
Efficient Learning of Linear Perceptrons -- Shai Ben-David, Hans Ulrich Simon
We consider the existence of efficient algorithms
for learning the class of half-spaces in ${\bf R}^n$ in the
agnostic learning model (i.e., making
no prior assumptions on the example-generating
distribution). The resulting combinatorial problem - finding
the best agreement half-space over an input sample - is
NP hard to approximate to within some constant factor.
We suggest a way to circumvent this theoretical bound
by introducing a new measure of success for such algorithms.
An algorithm is $\mu$-successful if the agreement ratio of
the half-space it outputs is as good as that of any
half-space once training points that are inside the
$\mu$-margins of its separating hyper-plane are disregarded.
We prove crisp computational complexity results with respect
to this success measure; On one hand, for every positive
$\mu$, there exist efficient (poly-time) $\mu$-successful
learning algorithms. On the other hand, we prove
that unless P$=$NP, there is no algorithm that runs in
time polynomial in the sample size and in $1/\mu$ that is
$\mu$-successful for all $\mu$.
A Support Vector Method for Clustering -- Asa Ben-Hur, David Horn, Hava T. Siegelmann, Vladimir Vapnik
We present a novel method for clustering
using the support vector machine approach.
Data points are mapped
to a high dimensional feature space, where support vectors are used
to define a sphere enclosing them.
The boundary of the sphere forms in data space a set of closed
contours containing the data.
As the kernel parameter is varied these contours fit the data more tightly
and splitting of contours occurs.
The contours are interpreted as cluster boundaries and the points within each
disconnected contour are defined as a cluster.
Cluster boundaries can take on arbitrary geometrical shapes and clusters
are separated by valleys in the underlying probability distribution.
As in other SV algorithms, outliers can be dealt with by introducing a soft
margin constant leading to smoother cluster boundaries.
The hierarchical structure of the data is explored by varying the two
parameters. We investigate the dependence of our method
on these parameters and apply it to several data sets.
A Neural Probabilistic Language Model -- Yoshua Bengio, R\'ejean Ducharme, Pascal Vincent
Modeling high-dimensional discrete random variables without
significant a priori knowledge of conditional independence is
intrinsically difficult because of the curse of dimensionality:
combinations of values on which the model will be tested are likely to
be different from all the combinations of values seen in the training
data. Neural networks can capture, with few parameters, dependencies of
any order between the variables (but not all of them, of course). We
consider here the case in which each variable can take a large number
of values, as in statistical language modeling. The proposed model
learns the conditional distribution of the next word at the same time
as learning a distributed representation for words (in a way that is
similar to latent semantic analysis), allowing to generalize from
contexts (combinations of words) seen in the training set to new
combinations of words, because of the similarity of their internal
representations. Successful comparisons against smoothed trigrams are
reported.
A Variational Mean-Field Theory for Sigmoidal Belief Networks -- Chiranjib Bhattacharyya, S. Sathiya Keerthi
A new variational derivation of Plefka's mean-field theory is
presented. This theory is then applied to sigmoidal belief networks
with the aid of further approximations. Empirical evaluation on small
scale networks shows that the proposed approximations are quite
competitive with other mean field approximations proposed in the
literature.
Stability and Noise in Biochemical Switches -- William Bialek
Many processes in biology, from the regulation of gene expression in
bacteria to memory in the brain, involve switches constructed from
networks of biochemical reactions. Crucial molecules are present in small
numbers, raising questions about noise and stability. Analysis of noise
in simple reaction schemes indicates that switches stable for years and
switchable in milliseconds can be built from fewer than one hundred
molecules. Prospects for direct tests of this prediction, as well as
implications, are discussed.
Emergence of Movement Sensitive Neurons' Properties by Learning a Sparse Code for Natural Moving Images -- Rafal Bogacz, Malcolm W. Brown, Christophe Giraud-Carrier
Olshausen \& Field demonstrated that a learning algorithm that attempts to
generate a sparse code for natural scenes develops a complete family of
localised, oriented, bandpass receptive fields, similar to those of 'simple
cells' in V1. This paper describes an algorithm which finds a sparse code for
sequences of images that preserves information about the input. This algorithm
when trained on natural video sequences develops bases representing the
movement in particular direction with particular speed, similar to the
receptive fields of the movement-sensitive cells observed in cortical visual
areas. Furthermore, in contrast to previous approaches to learning direction
selectivity, the timing of neuronal activity encodes the phase of the movement,
so the precise timing of spikes is crucially important to the information
encoding.
New Approaches Towards Robust and Adaptive Speech Recognition {\rm (invited paper)} -- Herv\'e Bourlard, Samy Bengio, Katrin Weber
In this paper, we discuss some new research directions in
automatic speech recognition (ASR), and which somewhat deviate
from the usual approaches. More specifically, we will motivate
and briefly describe new approaches based on multi-stream and
multi/band ASR. These approaches extend the standard hidden Markov
model (HMM) based approach by assuming that the different
(frequency) channels representing the speech signal
are processed by
different (independent) ``experts'', each expert focusing on a
different characteristic of
the signal, and that the different stream likelihoods (or
posteriors) are combined at some (temporal) stage to yield a
global recognition output. As a further
extension to multi-stream ASR, we will finally
introduce a new approach, referred to as HMM2, where the
HMM emission probabilities are
estimated via state specific feature based HMM responsible for
merging the stream information and modeling their possible
correlation.
Algorithmic Stability and Generalization Performance -- Olivier Bousquet, Andr\'e Elisseeff
We present a novel way of obtaining PAC-style bounds on the generalization
error of learning algorithms, explicitly using their stability properties. A
'stable' learner is one for which the learned solution does not change much
with small changes in the training set.
The bounds we obtain do not depend on any measure of the complexity of the
hypothesis space (e.g. VC dimension) but rather depend on how the learning
algorithm searches this space, and can thus be applied even when the VC
dimension is infinite. We demonstrate that regularization networks possess
the required stability property and apply our method to obtain new bounds on
their generalization performance.
Exact Solutions to Time-Dependent MDPs -- Justin A. Boyan, Michael L. Littman
We describe an extension of the Markov decision process model in
which a continuous time dimension is included in the state space.
This allows for the representation and exact solution of a wide
range of problems in which transitions or rewards vary over time.
We examine problems based on route planning with public
transportation and telescope observation scheduling.
Direct Classification with Indirect Data -- Timothy X Brown
This paper studies the following problem: Given input/output examples
from a real-valued function, classify the input space according to when
the function output is in a given subset. Obvious approaches to the
problem are inconsistent or overly complex when the output examples
have skewed noise or when the examples are indirect measurements of the
function outputs. We contribute a general classifier that is both
consistent and avoids the unnecessary complexity of estimating the
function. The approach is demonstrated on a problem from
telecommunications.
Model Complexity, Goodness of Fit and Diminishing Returns -- Igor V. Cadez, Padhraic Smyth
We investigate a general characteristic of the trade-off in learning
problems between
goodness-of-fit and model complexity. Specifically we characterize
a general class of learning problems where the goodness-of-fit
function can be shown to be convex within first-order as a
function of model complexity. This general property of
``diminishing returns'' is illustrated on a number of real data
sets and learning problems, including finite mixture modeling
and multivariate linear regression.
A Linear Programming Approach to Novelty Detection -- Colin Campbell, Kristin P. Bennett
Novelty detection involves modeling the normal behaviour of a system
hence enabling detection of any divergence from normality. It
has potential applications in many areas such as detection of machine
damage or highlighting abnormal features in medical data.
One approach is to build a hypothesis estimating the support of
the normal data i.e. constructing a function which is positive in the
region where the data is located and negative elsewhere. Recently
kernel methods have been proposed for estimating the support of
a distribution and they have performed well in practice training
involves solution of a quadratic programming problem. In this paper
we propose a simpler kernel method for estimating the support
based on linear programming. The method is easy to implement
and can learn large datasets rapidly. We demonstrate the method
on medical and fault detection datasets.
Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes -- Jakob Carlstr{\"o}m
This paper presents predictive gain scheduling, a technique for
decomposition of reinforcement learning problems. Admission control of
self-similar call traffic in a communications network is used as an
example. The control problem is decomposed into on-line prediction of
near-future call arrival rates, and precomputation of policies for
Poisson call arrival processes. At decision time, predicted arrival
rates are used to select a policy. Simulations show that predictive
gain scheduling results in significantly faster learning without any
performance loss, compared to a reinforcement learning controller that
does not decompose the problem.
Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping -- Rich Caruana, Steve Lawrence, Lee Giles
The conventional wisdom is that backprop nets with excess hidden units
generalize poorly: they have low bias, but high variance. Empirically,
however, large nets often generalize better. We show that nets with
excess capacity generalize well if trained with backprop and early
stopping. Experiments suggest two reasons for this: 1) overfitting
can vary significantly in different regions of the model. Excess
capacity allows better fit to regions of high non-linearity, and
backprop often avoids overfitting the regions of low non-linearity.
2) regardless of size, nets learn task subcomponents in similar
sequence. Big nets pass through stages similar to those learned by
smaller nets. Early stopping can stop training the large net when it
generalizes comparably to a smaller net. We also show that conjugate
gradient can yield worse generalization because it overfits regions of
low non-linearity when learning to fit regions of high non-linearity.
Incremental and Decremental Support Vector Machine Learning -- Gert Cauwenberghs, Tomaso Poggio
An on-line recursive algorithm for training a soft-margin
classification support vector machine, one vector at a time, is
presented. Adiabatic increments retain the Kuhn-Tucker conditions on
all previously seen training data, in a number of steps each computed
analytically.
The incremental procedure is reversible, and decremental
``unlearning'' offers an efficient method to exactly evaluate
leave-one-out generalization performance. Interpretation of
decremental unlearning in feature space sheds new light on the
relationship between generalization and geometry of the training data.
Vicinal Risk Minimization -- Olivier Chapelle, Jason Weston, L\'eon Bottou, Vladimir Vapnik
The Vicinal Risk Minimization principle establishes a bridge between
generative models and methods derived from the Structural Risk Minimization
Principle such as Support Vector Machines or Statistical Regularization. We
explain how VRM provides a framework which integrates a number of existing
algorithms, such as Parzen windows, Support Vector Machines, Ridge
Regression, Constrained Logistic Classifiers and Tangent-Prop. We then show
how the approach implies new algorithms for solving problems usually
associated with generative models. New algorithms are described for dealing
with pattern recognition problems with very different pattern distributions
and dealing with unlabeled data. Preliminary empirical results are
presented.
Temporally Dependent Plasticity: An Information Theoretic Account -- Gal Chechik, Naftali Tishby
The fundamental paradigm of Hebbian learning has recently received a
novel interpretation with the discovery of synaptic plasticity that
depends on the relative timing of pre and post synaptic spikes.
[e.g. Markram et al. 1997; Zhang et al. 1998]. While this type of
plasticity has already been studied in various computational
frameworks, a generic computational motivation or its derivation
from first principles is still missing. In this paper we derive
temporally dependent learning from the basic principle of mutual
information maximization and study its relation to the
experimentally observed plasticity. The comparison shows that not
only is the biological learning rule similar in form to the
analytically derived one, but it also increases mutual information
up to a local optimum. The analysis yields new insights into the
temporal characteristics of the observed learning rule and
its dependency on neuronal biophysical parameters.
Gaussianization -- Scott Saobing Chen, Ramesh A. Gopinath
High dimensional data modeling is difficult because of the
so-called ``curse of dimensionality''. We propose a technique called
``Gaussianization'' for high dimensional density estimation, which
alleviates the curse of dimensionality by exploiting the independence
structures in the data. Gaussianization is motivated from recent
developments in the statistics literature: projection pursuit,
independent component analysis and Gaussian mixture models with
semi-tied covariances. We propose an iterative Gaussianization
procedure which converges weakly: at each iteration, the data is first
transformed to the least dependent coordinates and then each
coordinate is marginally Gaussianized by univariate techniques.
Gaussianization offers density estimation sharper than traditional
kernel methods and radial basis function methods. Gaussianization can
be viewed as efficient solution of nonlinear independent component
analysis and high dimensional projection pursuit.
The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity -- David Cohn, Thomas Hofmann
In this paper, we describe a joint probabilistic model for modeling
the contents and inter-connectivity of document collections such as sets
of web pages or research paper archives. The model is based on
a probabilistic factor decomposition and allows identifying principal
topics of the collection as well as authoritative documents within
those topics. Further, it allows mapping out the relationships of
those topics to build a predictive model of link content. Among the
many applications of this approach are information retrieval and
search, topic identification, query disambiguation, focused web
crawling, web authoring, and bibliometric analysis.
The Manhattan World Assumption: Regularities in Scene Statistics which Enable Bayesian Inference -- James M. Coughlan, Allen L. Yuille
Preliminary work by the authors made use of the so-called ``Manhattan
world'' assumption about the scene statistics of city and indoor
scenes. This assumption stated that such scenes were built on a
cartesian grid which led to regularities in the image edge gradient
statistics. In this paper we explore the general applicability of
this assumption and show that, surprisingly, it holds in a large
variety of less structured environments including rural scenes. This
enables us, from a single image, to determine the orientation of the
viewer relative to the scene structure and also to detect target
objects which are not aligned with the grid. These inferences are
performed using a Bayesian model with probability distributions
(e.g. on the image gradient statistics) learned from real data.
Improved Output Coding for Classification Using Continuous Relaxation -- Koby Crammer, Yoram Singer
Output coding is a general method for solving multiclass problems by
reducing them to multiple binary classification problems. Previous
research on output coding has employed, almost solely, predefined
discrete codes. We describe an algorithm that improves the performance of
output codes by relaxing them to continuous codes. The relaxation
procedure
is cast as an optimization problem and is reminiscent of the quadratic
program for support vector machines. We describe experiments with the
proposed algorithm, comparing it to standard discrete output codes. The
experimental results indicate that continuous relaxations of output codes
often improve the generalization performance, especially for short codes.
Sparse Representation for Gaussian Process Models -- Lehel Csat\'o, Manfred Opper
We develop an approach for a sparse representation
for Gaussian Process (GP) models in order to overcome the limitations
of GPs caused by large data sets. The method is based on a
combination of a Bayesian online algorithm together with a sequential
construction of a relevant subsample of the data which fully specifies
the prediction of the model.
Experimental results on toy examples and large real-world
datasets indicate the efficiency of the approach.
{\bf Lehel Csato received the NIPS Best Student Paper Award sponsored by
Athene Software}
Competition and Arbors in Ocular Dominance -- Peter Dayan
Hebbian and competitive Hebbian algorithms are
almost ubiquitous in modeling pattern formation in cortical
development. We analyse in theoretical detail a particular model
(adapted from Piepenbrock \& Obermayer, 1999) for the development of
1d stripe-like patterns, which places competitive and interactive
cortical influences, and free and restricted initial arborisation
onto a common footing.
Explaining Away in Weight Space -- Peter Dayan, Sham Kakade
Explaining away has mostly been considered in terms of inference of
states in belief networks. We show how it can also arise in a Bayesian
context in inference about the weights governing relationships such as
those between stimuli and reinforcers in conditioning experiments such
as backward blocking. We show how explaining away in weight
space can be accounted for using an extension of a Kalman filter
model; provide a new approximate way of looking at the Kalman gain
matrix as a whitener for the correlation matrix of the observation
process; suggest a network implementation of this whitener using an
architecture due to Goodall; and show that the resulting model
exhibits backward blocking.
Feature Correspondence: A Markov Chain Monte Carlo Approach -- Frank Dellaert, Steven M. Seitz, Sebastian Thrun, Charles Thorpe
When trying to recover 3D structure from a set of images, the most
difficult problem is establishing the correspondence between the
measurements.
Most existing approaches assume that features can be tracked across
frames, whereas methods that exploit rigidity constraints to facilitate
matching do so only under restricted camera motion. In this paper
we propose a Bayesian approach that avoids the brittleness associated
with singling out one ``best'' correspondence, and instead consider
the distribution over all possible correspondences. We treat both
a fully Bayesian approach that yields a posterior distribution, and
a MAP approach that makes use of EM to maximize this posterior. We
show how Markov chain Monte Carlo methods can be used to implement
these techniques in practice, and present experimental results on
real data.
A New Model of Spatial Representation in Multimodal Brain Areas -- Sophie Deneve, Jean-Rene Duhamel, Alexandre Pouget
Most model of spatial representation in the cortex assume
cells with limited receptive fields that are defined in a
particular egocentric frame of reference. However, cells outside
of primary sensory cortex are either gain modulated by postural
input, or partially shifting. We show that solving classical
spatial tasks, like sensory prediction, multi-sensory
integration, sensory-motor transformation and motor control
require more complicated intermediate representation that are not
invariant in one frame of reference. We present an iterative
basis function map that perform these spatial tasks optimally
with gain modulated and partially shifting units, and tests it
against neurophysiological and neuropsychological data.
An Adaptive Metric Machine for Pattern Classification -- Carlotta Domeniconi, Jing Peng, Dimitrios Gunopulos
Nearest neighbor classification assumes locally constant class
conditional probabilities. This assumption becomes invalid in high
dimensions with finite samples due to the curse of dimensionality.
Severe bias can be introduced under these conditions when using the
nearest neighbor rule. We propose a locally adaptive nearest neighbor
classification method to try to minimize bias. We use a {\em
Chi-squared} distance analysis to compute a flexible metric for
producing neighborhoods that are elongated along less relevant feature
dimensions and constricted along most influential ones. As a result,
the class conditional probabilities tend to be smoother in the
modified neighborhoods, whereby better classification performance can
be achieved. The efficacy of our method is validated and compared
against other techniques using a variety of real world
data.
High-temperature Expansions for Learning Models of Nonnegative Data -- Oliver B. Downs
Recent work has exploited boundedness of data for learning
new types of
generative model. For nonnegative data it has been shown that the
maximum-entropy generative model is a Nonnegative Boltzmann
Distribution not
a Gaussian distribution, when the model is constrained to
match the first
and second order statistics of the data. Learning for practical sized
problems is made difficult by computing expectations under this
distribution. Here I present a second-order approximation
for the model,
obtained using a ``high-temperature'' expansion. The result
is analogous to
the TAP-Onsager equations for the Ising model. The theory is
tested on
learning a bimodal 2-dimensional model, and in learning a
high-dimensional
translationally-invariant distribution.
Incorporating Second-Order Functional Knowledge for Better Option Pricing -- Charles Dugas, Yoshua Bengio, Fran\c{c}ois B\'elisle, Claude Nadeau, Ren\'e Garcia
Incorporating prior knowledge of a particular task into the
architecture of a learning algorithm can greatly improve
generalization performance. We study here a case where we know that
the function to be learned is non-decreasing in two of its
arguments and convex in one of them. For this purpose we propose a
class of functions similar to multi-layer neural networks but (1) that
has those properties, (2) is a universal approximator of continuous
functions with these properties. We apply this new class of functions
to the task of modeling the price of call options. Experiments show
improvements on regressing the price of call options using the new
types of function classes that incorporate the a-priori constraints.
A Productive, Systematic Framework for the Representation of Visual Structure -- Shimon Edelman, Nathan Intrator
We describe a unified framework, based on the MDL principle,
for the understanding of structure representation in primate vision. A
model derived from this framework is shown to be both productive and
effectively systematic. Specifically, it has the ability to associate
objects that are related through a rearrangement of ``middle-scale''
parts, represented as image fragments. The model addresses the same
concerns as previous work on compositional representation through the
use of {\it what+where} receptive fields and attentional gain
modulation. It does not require prior exposure to the individual
parts, and avoids the need for abstract symbolic binding.
Discovering Hidden Variables: A Structure-Based Approach -- Gal Elidan, Noam Lotner, Nir Friedman, Daphne Koller
A serious problem in learning probabilistic models is the presence of
hidden variables. These variables are not observed, yet interact with
several of the observed variables. As such, they induce seemingly complex
dependencies among the latter. In recent years, much attention has been
devoted to the development of algorithms for learning parameters,
and in some cases structure, in the presence of hidden variables. In this
paper, we address the related problem of detecting hidden variables that
interact with the observed variables. This problem is of interest both
for improving our understanding of the domain and as a preliminary step that
guides the learning procedure towards promising models. A very natural
approach is to search for ``structural signatures'' of hidden variables ---
substructures in the learned network that tend to suggest the presence of
a hidden variable. We make this basic idea concrete, and show how to integrate
it with structure-search algorithms. We evaluate this method on several
synthetic and real-life datasets, and show that it performs surprisingly
well.
Multiple Timescales of Adaptation in a Neural Code -- Adrienne L. Fairhall, Geoffrey D. Lewen, William Bialek, Robert R. de Ruyter van Steveninck
Many neural systems extend their dynamic range by adaptation. We
examine the timescales of adaptation in the context of dynamically
modulated rapidly-varying stimuli, and demonstrate in the fly
visual system that adaptation to the statistical ensemble of the
stimulus dynamically maximizes information transmission about the
time-dependent stimulus. Further, while the rate response has
long transients, the adaptation takes place on timescales
consistent with optimal variance estimation.
Learning Joint Statistical Models for Audio-Visual Fusion and Segregation -- John W. Fisher III, Trevor Darrell, William T. Freeman, Paul Viola
People can understand complex auditory and visual information, often
using one to disambiguate the other. Automated analysis, even at a
low-level, faces severe challenges, including the lack of accurate
statistical models for the signals, and their high-dimensionality and
varied sampling rates. Previous approaches [6] assumed simple parametric
models for the joint distribution, which, while tractable, cannot
capture the complex signal relationships. We learn the joint
distribution of the visual and auditory signals using a non-parametric
approach. First, we project the data into a maximally informative,
low-dimensional subspace, suitable for density estimation. Then we model
the complicated stochastic relationships between the signals using a
nonparametric density estimator. These learned densities allow
processing across signal modalities. We demonstrate, on synthetic and
real signals, localization in video of the face that is speaking in
audio, and, conversely, audio enhancement of a particular speaker
selected from the video.
Accumulator Networks: Suitors of Local Probability Propagation -- Brendan J. Frey, Anitha Kannan
One way to approximate inference in richly-connected graphical models
is to apply the sum-product algorithm (a.k.a. probability
propagation algorithm), while ignoring the fact that the graph has cycles.
The sum-product algorithm can be directly applied in Gaussian networks
and in graphs for coding, but for many conditional probability
functions -- including the sigmoid function -- direct application
of the sum-product algorithm is not possible.
We introduce ``accumulator networks'' that have low local complexity
(but exponential global complexity)
so the sum-product algorithm can be directly applied.
In an accumulator network, the probability of a child given its
parents is computed by accumulating the inputs from the parents in
a Markov chain or more generally a tree.
After giving expressions for inference and learning in accumulator
networks, we give results on the ``bars problem'' and on
the problem of extracting images of translated, overlapping faces from
an visual scene.
Sequentially Fitting ``Inclusive'' Trees for Inference in Noisy-OR Networks -- Brendan J. Frey, Relu Patrascu, Tommi S. Jaakkola, Jodi Moran
An important class of problems can be cast as inference in noisy-OR
Bayesian networks, where the binary state of each variable is a logical
OR of noisy versions of the states of the variable's parents. For example,
in medical diagnosis, the presence of a symptom can be expressed as a
noisy-OR of the diseases that may cause the symptom -- on some occasions,
a disease may fail to activate the symptom. Inference in richly-connected
noisy-OR networks is intractable, but approximate methods
(e.g., variational techniques) are showing increasing promise as
practical solutions. One problem with most approximations is that they
tend to concentrate on a relatively small number of modes in the true
posterior, ignoring other plausible configurations of the hidden variables.
We introduce a new sequential variational method for bipartite noisy-OR
networks, that favors including all modes of the true posterior
and models the posterior distribution as a tree.
We compare this method with other approximations using an ensemble
of networks with network statistics that are comparable to the QMR-DT
medical diagnostic network.
Factored Semi-Tied Covariance Matrices -- Mark J.F. Gales
A new form of covariance modelling for Gaussian mixture models and
hidden Markov models is presented. This is an extension to an
efficient form of covariance modelling used in speech recognition,
semi-tied covariance matrices. The transform associated with each
Gaussian component is now factored, reducing the number of free
parameters. Maximum likelihood estimation schemes for all the model
parameters are estimated including the component/transform assignment,
transform and component parameters. This new model form is evaluated
on a large vocabulary speech recognition task.
A New Approximate Maximal Margin Classification Algorithm -- Claudio Gentile
A new incremental learning algorithm is described which is shown to
approximate the maximal margin hyperplane w.r.t. norm $p \geq 2$ for
a set of linearly separable data.
Our algorithm, called ALMA$_p$ (Approximate Large Margin algorithm w.r.t. norm $p$),
takes $O(\frac{(p-1)\,X^2}{\alpha^2\,\gamma^2})$
corrections to separate the data with $p$-norm margin larger than $(1-\alpha)\,\gamma$,
where $\gamma$ is the $p$-norm margin of the data and $X$ is a bound on
the $p$-norm of the instances.
ALMA$_p$ avoids quadratic (or higher-order) programming methods. It is
very easy to implement and is as fast as on-line algorithms, such as
Rosenblatt's perceptron.
We report on some experiments comparing
a variant of ALMA$_p$ to two
incremental algorithms: Perceptron and Li and Long's ROMMA.
These experiments show that our algorithm performs quite better than both.
Propagation Algorithms for Variational Bayesian Learning -- Zoubin Ghahramani, Matthew J. Beal
Variational approximations are becoming a widespread tool for Bayesian
learning of graphical models. We provide some theoretical results for the
variational updates in a very general family of conjugate-exponential
graphical models. We show how the belief propagation and the junction tree
algorithms can be used in the inference step of variational Bayesian
learning. Applying these results to the Bayesian analysis of
linear-Gaussian state-space models we obtain a learning procedure that
exploits the Kalman smoothing propagation, while integrating over all
model parameters. We demonstrate how this can be used to infer the hidden
state dimensionality of the state-space model in a variety of synthetic
problems and one real high-dimensional data set.
Reinforcement Learning with Function Approximation Converges to a Region -- Geoffrey J. Gordon
Many algorithms for approximate reinforcement learning are
not known to
converge. In fact, there are counterexamples showing that
the
adjustable weights in some algorithms may oscillate within
a region
rather than converging to a point. This paper shows that,
for two
popular algorithms, such oscillation is the worst that can
happen: the
weights cannot diverge, but instead converge with
probability 1 to a
bounded region. The algorithms are SARSA(0) and V(0); the
latter is
the algorithm used in the well-known TD-Gammon program.
The Kernel Gibbs Sampler -- Thore Graepel, Ralf Herbrich
We present an algorithm that samples the hypothesis space of kernel
classifiers. Given a uniform prior over normalised weight vectors
and a likelihood based on a model of label noise leads to a piecewise
constant posterior that can be sampled by the kernel Gibbs sampler
(KGS). The KGS is a Markov Chain Monte Carlo method that chooses a
random direction in parameter space and samples from the resulting
piecewise constant density along the line chosen. The KGS can be used
as an analytical tool for the exploration of Bayesian transduction,
Bayes point machines, active learning, and evidence-based model selection
on small data sets that are contaminated with label noise. For a simple
toy example we demonstrate experimentally how a Bayes point machine
based on the KGS outperforms an SVM that is incapable of taking into
account label noise.
From Margin to Sparsity -- Thore Graepel, Ralf Herbrich, Robert C. Williamson
We present an improvement of Novikoff's perceptron convergence theorem.
Reinterpreting this mistake bound as a margin dependent sparsity guarantee
allows us to give a PAC-style generalisation error bound for the
classifier learned by the dual perceptron learning algorithm. The
bound value crucially depends on the margin a support vector machine
would achieve on the same data set using the same kernel. Ironically,
the bound yields better guarantees than are currently available for
the support vector solution itself.
`N-Body' Problems in Statistical Learning -- Alexander G. Gray, Andrew W. Moore
We present very fast algorithms for a large class of statistical
problems, which we call all-point-pairs problems, or `$N$-body'-like
problems. These are problems which abstractly require a comparison of
each of the $N$ points in a dataset with each other point and would
naively be solved using $N^2$ computations. Such algorithms cannot be
applied to practical problems where $N$ is large. In this paper we
focus on several examples of all-point-pairs problems within
nonparametric statistical learning: nearest-neighbor classification,
kernel density estimation, outlier detection, the two-point
correlation, multiple two-point correlation, and the n-point
correlation. We give an asymptotic analysis and we show empirically
that this algorithmic framework leads to several orders of magnitude
in speedup over the naive computation, even for small datasets. We
are aware of no exact algorithms for these problems which are faster
either empirically or theoretically. The methodology presented in
this paper is also applicable in principle to fast, large-scale
parametric statistics such as RBF neural networks, mixtures of
Gaussians, and Gaussian processes.
A Comparison of Image Processing Techniques for Visual Speech Recognition Applications -- Michael S. Gray, Terrence J. Sejnowski, Javier R. Movellan
We examine eight different techniques for developing visual
representations in machine vision tasks. In
particular we compare different versions of principal component and
independent component analysis in combination with stepwise regression
methods for variable selection. We found that local methods, based on
the statistics of image patches, consistently outperformed global
methods based on the statistics of entire images. This result
is consistent with previous work on emotion and facial expression
recognition. In addition, the use of a stepwise regression
technique for selecting variables and regions of interest
substantially boosted performance.
The Interplay of Symbolic and Subsymbolic Processes in Anagram Problem Solving -- David B. Grimes, Michael C. Mozer
Although connectionist models have provided insights into the nature of
perception and motor control, connectionist accounts of higher cognition
seldom go beyond an implementation of traditional symbol-processing theories.
We describe a connectionist constraint satisfaction model of how people solve
anagram problems. The model exploits statistics of English orthography, but
also addresses the interplay of subsymbolic and symbolic computation by a
mechanism that extracts approximate symbolic representations (partial
orderings of letters) from subsymbolic structures and injects the extracted
representation back into the model to assist in the solution of the anagram.
We show the computational benefit of this extraction-injection process and
discuss its relationship to conscious mental processes and working memory.
We also account for experimental data concerning the difficulty of anagram
solution based on the orthographic structure of the anagram string and the
target word.
Permitted and Forbidden Sets in Symmetric Threshold-Linear Networks -- Richard H.R. Hahnloser, H. Sebastian Seung
Today, Lyapunov theory is still one of the most widely used concepts for
describing emergent phenomena in recurrent neural networks. As a new
approach to ascribing computational principles to feedback circuits, we
have studied threshold-linear networks. By applying linear analysis to
subnetworks composed of coactive neurons, we determine the stability of
their potential steady states. We find that stability depends on two
types of eigenmodes. One type determines global stability and the other
type determines whether or not multistability is possible. We can prove
the equivalence of our stability criteria with criteria taken from
quadratic programming.
Our results go beyond Lyapunov-theory. We show that there are permitted
sets of neurons that can be coactive at a steady state and forbidden
sets that cannot. Permitted sets are clustered in the sense that subsets
of permitted sets are permitted and supersets of forbidden sets are
forbidden. By viewing permitted sets as memories stored in the synaptic
connections, we can provide a formulation of long-term memory that is
more general than the traditional perspective of attractor networks.
Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra -- Paul Hayton, Bernhard Sch{\"o}lkopf, Lionel Tarassenko, Paul Anuzis
A system has been developed to extract diagnostic information from jet
engine carcass vibration data. Support Vector Machines applied to
novelty detection provide a measure of how unusual the shape of a
vibration signature is, by learning a representation of normality. We
describe a novel method for Support Vector Machines of including
information from a second class for novelty detection and give results
from the application to Jet Engine vibration analysis.
Large Scale Bayes Point Machines -- Ralf Herbrich, Thore Graepel
The concept of averaging over classifiers is fundamental to the Bayesian
analysis of learning. Based on this viewpoint, it has recently been
demonstrated for linear classifiers that the centre of mass of version
space (the set of all classifiers consistent with the training set)
--- also known as the Bayes point --- exhibits excellent generalisation
abilities. However, the billiard algorithm as presented in [Herbrich et al.,
2000] is restricted to small sample size because it requires $\mathcal{O}(
m^{2})$ of memory and $\mathcal{O}( N\cdot m^2)$ computational
steps where $m$ is the number of training patterns and $N$ is the number
of random draws from the posterior distribution. In this paper we
present a method based on the simple perceptron learning algorithm
which allows to overcome this algorithmic drawback. The method is
algorithmically simple and is easily extended to the multi-class case.
We present experimental results on the MNIST data set of handwritten
digits which show that Bayes Point Machines are competitive with the
current world champion, the support vector machine. In addition, the
computational complexity of BPMs can be tuned by varying the number
of samples from the posterior. Finally, rejecting test points on the
basis of their (approximative) posterior probability leads to a rapid
decrease in generalisation error, e.g. 0.1\% generalisation error
for a given rejection rate of 10\%.
A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work -- Ralf Herbrich, Thore Graepel
We present a bound on the generalisation error of linear classifiers in terms
of a refined margin quantity on the training set. The result is obtained in a
PAC-Bayesian framework and is based on geometrical arguments in the space of
linear classifiers. The new bound constitutes an exponential improvement of
the so far tightest margin bound by Shawe-Taylor et al. and scales
logarithmically in the inverse margin. Even in the case of less training
examples than input dimensions sufficiently large margins lead to non-trivial
bound values and --- for maximum margins --- to a vanishing complexity term.
Furthermore, the classical margin is too coarse a measure for the essential
quantity that controls the generalisation error: the volume ratio between the
whole hypothesis space and the subset of consistent hypotheses. The practical
relevance of the result lies in the fact that the well-known support vector
machine is optimal w.r.t. the new bound only if the feature vectors are all of
the same length. As a consequence we recommend to use SVMs on normalised
feature vectors only --- a recommendation that is well supported by our
numerical experiments on two benchmark data sets.
Hierarchical Memory-Based Reinforcement Learning -- Natalia Hernandez-Gardiol, Sridhar Mahadevan
A key challenge for reinforcement learning is how to scale up to large
partially observable domains. In this paper, we show how a hierarchy
of behaviors can be used to create and select among variable length
short-term memories appropriate for a task. At higher levels in the
hierarchy, the agent abstracts over lower-level details and looks back
over a variable number of high-level decisions in time. We formalize
this idea in a framework for solving partially observable, sequential
decision tasks called Hierarchical Suffix Memory (HSM). HSM uses
a memory-based SMDP Q-learning method to rapidly propagate delayed
reward across long decision sequences. We show that the HSM framework
outperforms several related reinforcement learning techniques on a
realistic corridor navigation task.
Beyond Maximum Likelihood and Density Estimation: A Sample-Based Criterion for Unsupervised Learning of Complex Models -- Sepp Hochreiter, Michael C. Mozer
Many unsupervised learning procedures can be viewed as trying to bring
two probability distributions into alignment, including generative
models such as Gaussian mixtures and Boltzmann machines, and recoding
models such as ICA and projection pursuit. We propose a novel
sample-based error measure for these procedures, which works even for
situations in which maximum likelihood and probability density
estimation-based formulations cannot be applied, e.g., models that are
nonlinear or have intractable posterior. Furthermore, our
sample-based error measure avoids the difficulties of approximating a
density function. We prove that with an unconstrained model, (1) our
approach converges on the correct solution as the number of samples
goes to infinity, and (2) the expected solution of our approach in the
generative framework is the maximum likelihood solution. Finally, we
evaluate our sample-based error measure via simulations of linear and
nonlinear models on mixture-of-Gaussians and ICA problems. The
experiments show the wide applicability and generality of our
approach.
Ensemble Learning and Linear Response Theory for ICA -- Pedro A.d.F.R. H{\o}jen-S{\o}rensen, Ole Winther, Lars Kai Hansen
We propose a general framework for performing independent component
analysis (ICA) which relies on ensemble learning and linear response
theory known from statistical physics. We apply it to both discrete
and continuous sources. For the continuous source the underdetermined
(overcomplete) case is studied. The naive mean-field approach fails in
this case whereas linear response theory--which gives an improved
estimate of covariances--is very efficient. The examples given are
for sources without temporal correlations. However, this derivation
can easily be extended to treat temporal correlations. Finally, the
framework offers a simple way of generating new ICA algorithms without
needing to define the prior distribution of the sources explicitly.
A Silicon Primitive for Competitive Learning -- David Hsu, Miguel Figueroa, Chris Diorio
Competitive learning is a technique for training classification and
clustering networks. We have designed and fabricated an 11-transistor
primitive, that we term an automaximizing bump circuit, that implements
competitive learning dynamics. The circuit per-forms a similarity
computation, affords nonvolatile storage, and implements simultaneous
local adaptation and computation. We show that our primitive is suitable
for implementing competitive learning in VLSI, and demonstrate its
effectiveness in a standard clustering task.
On Reversing Jensen's Inequality -- Tony Jebara, Alex Pentland
Jensen's inequality is a powerful mathematical tool and one of the
workhorses in statistical learning. Its applications therein include
the EM algorithm, Bayesian estimation and Bayesian inference. Jensen
computes simple lower bounds on otherwise intractable quantities such
as products of sums and latent log-likelihoods. This simplification
then permits operations like integration and maximization. Quite
often, though, upper bounds are needed as well, especially in
discriminative learning. We derive and prove an efficient analytic
inequality that provides such variational upper bounds. This inequality
holds for latent variable mixtures of exponential family distributions
and thus spans a wide range of contemporary statistical models. We
also discuss applications of the upper bounds including maximum
conditional likelihood, large margin discriminative models and
conditional Bayesian inference. Convergence, efficiency and prediction
results are shown.
Automated State Abstraction for Options using the U-Tree Algorithm -- Anders Jonsson, Andrew G. Barto
The learning of a complex task can be significantly facilitated by defining a
hierarchy of subtasks. An agent can learn to choose between various temporally
abstract actions, each solving an assigned subtask, to achieve the overall
task. In this paper, we study hierarchical learning in the framework of
options. We argue that to take full advantage of the hierarchical structure,
one should perform option-specific state abstraction, and that if this is to
scale to larger tasks, state abstraction should be automated. We apply the
U-Tree algorithm, which automatically builds option-specific representations
of the state feature space, to a simple hierarchical task. Results show that
automatic state abstraction for options has a promising future.
Dopamine Bonuses -- Sham Kakade, Peter Dayan
Substantial data support a temporal difference (TD) model of dopamine
(DA) neuron activity in which the cells provide a global error signal
for reinforcement learning. However, in certain circumstances, DA
activity seems anomalous under the TD model, responding to
non-rewarding stimuli. We address these anomalies by suggesting that
DA cells multiplex information about reward bonuses, including
Sutton's exploration bonuses and Ng {\it et al\/}'s non-distorting
shaping bonuses. We interpret this additional role for DA in terms of
the unconditional attentional and psychomotor effects of dopamine,
having the computational role of guiding exploration.
Hippocampally-Dependent Consolidation in a Hierarchical Model of Neocortex -- Szabolcs K{\'a}li, Peter Dayan
In memory consolidation, declarative memories which initially
require the hippocampus for their recall, ultimately become
independent of it. Consolidation has been the focus of numerous
experimental and qualitative modeling studies, but only little
quantitative exploration. We present a consolidation model in which
hierarchical connections in the cortex, that initially instantiate
purely semantic information acquired through probabilistic
unsupervised learning, come to instantiate episodic information as
well. The hippocampus is responsible for helping complete partial
input patterns before consolidation is complete, while also training
the cortex to perform appropriate completion by itself.
Second Order Approximations for Probability Models -- Hilbert J. Kappen, Wim Wiegerinck
In this paper, we derive a second order mean field theory for
directed graphical probability models.
By using an information theoretic argument it is shown how this can be
done
in the absence of a partition function.
This method is the direct generalisation of the
well-known TAP approximation for Boltzmann Machines.
In a numerical example, it is shown that the method greatly improves the
first order mean field approximation.
The computational complexity of the method is comparable to the
first order mean field method for a restricted class of graphical models.
Generalizable Singular Value Decomposition for Ill-posed Datasets -- Ulrik Kjems, Lars Kai Hansen, Stephen C. Strother
We demonstrate that statistical analysis of ill-posed data sets is
subject to a bias, which can be observed when projecting independent
test set examples onto a basis defined by the training examples.
Because the training examples in an ill-posed data set do not fully
span the signal space the observed training set variances in each
basis vector will be too high compared to the average variance of
the test set projections onto the same basis vectors. On basis of
this understanding we introduce the Generalizable Singular Value
Decomposition (GSVD) as a means to reduce this bias by re-estimation
of the singular values obtained in a conventional Singular Value
Decomposition, allowing for a generalization performance increase of
a subsequent statistical model. We demonstrate that the algorithm
succesfully corrects bias in a data set from a functional PET
activation study of the human brain.
Some New Bounds on the Generalization Error of Combined Classifiers -- Vladimir Koltchinskii, Dmitriy Panchenko, Fernando Lozano
We develop a method of bounding the generalization error of a complex
classifier that is a combination of simpler classifiers from a base class
in terms of its empirical margin distribution. The bounds we have got
generalize and improve previously known results for neural networks
(Bartlett (1998)) and for convex combinations of classifiers (Schapire,
Freund, Bartlett and Lee (1998)). The methods of Gaussian and empirical
processes (including concentration and deviation inequalities) allowed
us to obtain substantial improvement of the previous bounds in some
of the problems and to understand much better the probabilistic
nature of the margin type bounds.
Sparsity of Data Representation of Optimal Kernel Machine and Leave-one-out Estimator -- Adam Kowalczyk
Vapnik's result that the expectation of the generalisation error of
the optimal hyperplane is bounded by the expectation of the ratio of the
number of support vectors to the number of training examples is extended
to a broad class of kernel machines. The class includes Support Vector
Machines for soft margin classification and regression, and Regularization
Networks with a variety of kernels and cost functions. We show that key
inequalities in Vapnik's result and its proof become equalities once
``the classification error'' is replaced by ``the margin
error'', with the latter defined as an instance with positive cost. In
particular we show that expectations of the true margin error and
the empirical margin error are equal, and that the sparse solutions
for kernel machines are possible only if the cost function is ``partially''
in-sensitive.
Keeping Flexible Active Contours on Track using Metropolis Updates -- Trausti T. Kristjansson, Brendan J. Frey
Condensation, a form of likelihood-weighted particle filtering, has
been successfully used to infer the shapes of highly constrained ``active contours'' in video sequences. However, when the contours are
highly flexible (e.g. for tracking fingers of a hand), a computationally
burdensome number of particles is needed to successfully approximate the
contour distribution over contours at each frame in a video sequence. We
compare this method to condensation using a video sequence that requires
highly flexible contours, and show that the new algorithm performs
dramatically better than the condensation algorithm. We discuss the
incorporation of this method into the ``active contour'' framework where a
shape-subspace is used to constrain shape variation.
Smart Vision Chip Fabricated Using Three Dimensional Integration Technology -- Hiroyuki Kurino, M. Nakagawa, Kang Wook Lee, Tomonori Nakamura, Yuusuke Yamada, Ki Tae Park, Mitsumasa Koyanagi
Smart vision chips have several great advantages compared to conventional
image
processing systems. However it also has several hindrances such as low
resolution,
low fill factor and large chip size. In order to overcome such drawbacks, we
propose a
new smart vision chip fabricated using three dimensional (3D) integration
technology.
In three dimensional technology, several LSI chips are stacked vertically
and electrically
connected with each other through vertical interconnect. The high speed,
high density
interconnection and high density packing can be realized by a novel 3D
integration technology.
3D LSI chips has vertical layered structures and is similar to the cortex
or the retina.
Therefore it is a suitable structure for neuromorphic or neural network
circuit.
A prototype of a 3D test chip was fabricated and showed good electrical
characteristics.
In this paper we describe the fabrication process and fabrication results of
the 3D test chips.
We also present our design of 3D neuromorphic device.
Algorithms for Non-negative Matrix Factorization -- Daniel D. Lee, H. Sebastian Seung
Non-negative matrix factorization (NMF) has previously been shown to
be a useful decomposition for multivariate data. Two different
multiplicative algorithms for NMF are analyzed. They differ only
slightly in the multiplicative factor used in the update rules. One
algorithm can be shown to minimize the conventional least squares
error while the other minimizes the generalized Kullback-Leibler
divergence. The monotonic convergence of both algorithms can be
proven using an auxiliary function analogous to that used for proving
convergence of the Expectation-Maximization algorithm. The algorithms
can also be interpreted as diagonally rescaled gradient descent, where
the rescaling factor is optimally chosen to ensure convergence.
Color Opponency Constitutes a Sparse Representation for the Chromatic Structure of Natural Scenes -- Te-Won Lee, Thomas Wachtler, Terrence J. Sejnowski
The human visual system encodes the chromatic signals conveyed by the
three types of retinal cone photoreceptors in an opponent fashion.
This color opponency has been shown to constitute an efficient
encoding by spectral decorrelation of the receptor signals.
We analyze the spatial and chromatic structure of natural
scenes by decomposing the spectral images into a set of linear basis
functions such that they constitute a representation with minimal
redundancy. Independent component analysis finds the basis functions
that transforms the spatiochromatic data such that the outputs
(activations) are statistically as independent as possible, i.e. least
redundant. The resulting basis functions show strong opponency along
an achromatic direction (luminance edges), along a blue-yellow
direction, and along a red-blue direction. Furthermore, the resulting
activations have very sparse distributions, suggesting that the use of
color opponency in the human visual system achieves a highly efficient
representation of colors. Our findings suggest that color opponency is
a result of the properties of natural spectra and not solely a
consequence of the overlapping cone spectral sensitivities.
Foundations for a Circuit Complexity Theory of Sensory Processing -- Robert A. Legenstein, Wolfgang Maass
We introduce total wire length as a salient complexity measure for
judging the biological feasibility of various proposed circuit
designs for
sensory processing tasks, such as translation-invariant
pattern recognition. Biological data show that the total
wire length of neural circuits in the cortex is in fact not very large
(if one puts it in relation to the number of inputs that these
circuits receive).
Our analysis shows that most commonly proposed
circuit architectures are in fact biologically unrealistic because they
would require too much total wire length.
In this paper we exhibit a number of alternative circuit
design techniques for computational tasks from translation- and
scale-invariant sensory processing, that can be implemented within
biologically realistic bounds for their total wire length.
We also discuss the relationship between total wire length and
other complexity measures that had previously been considered in the
context of VLSI-design.
A Tighter Bound for Graphical Models -- Martijn A. R. Leisink, Hilbert J. Kappen
We present a method to bound the partition function of a Boltzmann machine
neural network with any odd order polynomial. This is a direct extension
of the mean field bound, which is first order. We show that the third
order bound is strictly better than mean field. Additionally we show the
rough outline how this bound is applicable to sigmoid belief networks.
Numerical experiments indicate that an error reduction of a factor two
is easily reached in the region where expansion based approximations
are useful.
{\bf Martijn Leisink received an Honorable Mention for NIPS Best Student
Paper Award sponsored by Athene Software}
Position Variance, Recurrence and Perceptual Learning -- Zhaoping Li, Peter Dayan
Visual stimulus arrays are inevitably presented at different positions
on the retina in perceptual learning tasks, even tasks that
nominally require fixation. We show that inference in the face of
positional variance has a structurally different quality from inference
about fixed position stimuli, involving a particular, quadratic,
non-linearity rather than a purely linear discrimination. We show the
advantage this non-linearity has for discrimination, and suggest it
as a role for recurrent connections in area V1.
Homeostasis in a Silicon Integrate and Fire Neuron -- Shih-Chii Liu, Bradley A. Minch
In this work, we explore homeostasis in a silicon integrate-and-fire
neuron. The neuron adapts its firing rate over long time
periods on the order of seconds or minutes so that it returns to
its spontaneous firing rate after a lasting perturbation.
Homeostasis is implemented via two schemes. One scheme looks at
the presynaptic activity and adapts the synaptic weight
depending on the presynaptic spiking rate.
The second scheme adapts the synaptic
``threshold'' depending on the neuron's activity.
The threshold is lowered if the neuron's activity decreases over
a long time and is increased for prolonged
increase in postsynaptic activity. Both these mechanisms for
adaptation use floating-gate technology.
The results shown here are measured from a chip fabricated in a
2-$\mu$m CMOS process.
Text Classification using String Kernels -- Huma Lodhi, John Shawe-Taylor, Nello Cristianini, Chris Watkins
We introduce a novel kernel for comparing two text documents. The kernel
is an inner product in the feature space consisting of all subsequences of
length $k$. A subsequence is any ordered sequence of $k$ characters occurring
in the text though not necessarily contiguously. The subsequences are
weighted by an exponentially decaying factor of their full length in the
text, hence emphasising those occurrences which are close to contiguous. A
direct computation of this feature vector would involve a prohibitive
amount of computation even for modest values of k, since the dimension of
the feature space grows exponentially with $k$. The paper describes how
despite this fact the inner product can be efficiently evaluated by a
dynamic programming technique. A preliminary experimental comparison of
the performance of the kernel compared with a standard word feature space
kernel is made showing encouraging results.
Constrained Independent Component Analysis -- Wei Lu, Jagath C. Rajapakse
The paper presents a novel technique of constrained independent
component analysis (CICA) to eliminate indeterminacies existing in present
ICA.
CICA introduces constraints into ICA problem and solves the constrained
optimization by using Lagrange
multiplier methods. This paper shows that CICA can systematically
eliminate the ICA's indeterminacies on permutation and dilation by
ordering the resulted independent components in a specific manner and
normalizing the demixing matrix, respectively. The experiments demonstrate
ordering of components in complete as well as undercomplete ICA while
providing normalized demixing processes.
Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations -- D\"orthe Malzahn, Manfred Opper
Based on a statistical mechanics approach, we develop a method
for approximately computing average case learning curves
for Gaussian process regression models. The approximation
works well in the large sample size limit and for arbitrary
dimensionality of the input space. We explain how the approximation
can be systematically improved and argue that similar techniques
can be applied to general likelihood models.
Active Support Vector Machine Classification -- Olvi L. Mangasarian, David R. Musicant
An active set strategy is applied to the dual of
a simple reformulation of the standard quadratic program of
a linear support vector machine. This application generates a fast new
dual algorithm that consists of solving a finite number of linear equations,
with a typically large dimensionality equal to the number of points to
be classified. However, by making novel use of the
Sherman-Morrison-Woodbury formula, a much smaller matrix of the order
of the original input space is inverted at each step. Thus, a problem
with a 32-dimensional input space and 7 million points required
inverting positive definite symmetric matrices of size 33-by-33
with a total running time of 96 minutes on a 400 MHz Pentium II. The
algorithm requires no specialized quadratic or linear
programming code, but merely a linear equation solver which is
publicly available.
Weak Learners and Improved Rates of Convergence in Boosting -- Shie Mannor, Ron Meir
The problem of constructing weak classifiers for boosting algorithms is
studied. Roughly speaking, an effective weak learner is one that is
guaranteed to achieve, for any distribution on the data, classification
performance that is sufficiently better than random guessing. We show
that linear weak
learners always exist; however, they are not always useful for learning
purposes. Simple arguments are used to establish conditions for the
existence of effective weak learners for one-dimensional problems. We then
discuss extensions of the results to any dimension using methods from the
theory of combinatorial discrepancy. In particular, precise conditions are
established for the existence of effective linear weak learners in any
dimension. Additionally, we provide improved convergence rate bounds for
the generalization error in situations where the empirical error can be
made small, which is exactly the situation that occurs if weak learners
with guaranteed performance that are better than random guessing can be
established.
Recognizing Hand-written Digits Using Hierarchical Products of Experts -- Guy Mayraz, Geoffrey E. Hinton
The product of experts learning procedure (Hinton, 2000) can discover a set of
stochastic binary features that constitute a non-linear generative model of
handwritten images of digits. The quality of generative models learned in this
way can be assessed by learning a separate model for each class of digit and
then comparing the unnormalized probabilities of test images under the 10
different class-specific models. To improve discriminative performance, each of
the $10$ digit models can be given more layers of feature detectors. The layers
are trained sequentially and each layer learns a generative model of the
patterns of feature activities in the preceding layer. After training, each
layer of feature detectors produces a separate, unnormalized log probabilty
score. With three layers of feature detectors in each of the $10$ digit models,
a test image produces $30$ scores which can be used as inputs to a supervised,
logistic classification network that is trained on separate data. On the MNIST
database, our system is comparable with current state-of-the-art discriminative
methods, demonstrating that the product of experts learning procedure can
produce effective generative models of high-dimensional data.
Learning Segmentation by Random Walks -- Marina Meil\u{a}, Jianbo Shi
We present a new view of image segmentation by pairwise
similarities. We interpret the similarities as edge flows in a Markov
random walk and study the eigenvalues and eigenvectors of the walk's
transition matrix. This interpretation shows that spectral methods for
clustering and segmentation have a probabilistic foundation. In
particular, we prove that the Normalized Cut method arises naturally
from our framework. Finally, the framework provides a principled
method for learning the similarity function as a combination of
features.
The Unscented Particle Filter -- Rudolph van der Merwe, Arnaud Doucet, Nando de Freitas, Eric Wan
In this paper, we propose a new particle filter based on sequential
importance sampling.
The algorithm uses a bank of unscented filters to obtain the importance
proposal distribution. This proposal
has two very ``nice'' properties. Firstly, it makes efficient use of the
latest available information and, secondly,
it can have heavy tails. As a result, we find that the algorithm
outperforms standard particle filtering and other nonlinear filtering
methods by almost an order of magnitude. This experimental finding is in
agreement with the
theoretical convergence proof for the algorithm. The algorithm also
includes resampling and (possibly)
Markov chain Monte Carlo (MCMC) steps.
A Mathematical Programming Approach to the Kernel Fisher Algorithm -- Sebastian Mika, Gunnar R{\"a}tsch, Klaus-Robert M{\"u}ller
We investigate a new kernel--based classifier: the Kernel Fisher
Discriminant (KFD). A mathematical programming formulation based on
the observation that KFD maximizes the {\em average margin} permits
an interesting modification of the original KFD algorithm yielding
the sparse KFD. We find that both, KFD and the proposed sparse KFD,
can be understood in an unifying probabilistic context. Furthermore,
we show connections to Support Vector Machines and Relevance Vector
Machines. From this understanding, we are able to outline a very
intuitive kernel--regression technique based upon the KFD algorithm.
Simulations support the usefulness of our approach.
Automatic Choice of Dimensionality for PCA -- Thomas P. Minka
A central issue in principal component analysis (PCA) is choosing the
number of principal components to be retained. By interpreting PCA as
density estimation, this paper shows how to use Bayesian model selection to
determine the true dimensionality of the data. The resulting estimate is
simple to compute yet guaranteed to pick the correct dimensionality, given
enough data. In simulations, it is convincingly better than
cross-validation and other proposed algorithms, plus it runs much faster.
On Iterative Krylov-Dogleg Trust-Region Steps for Solving Neural Networks Nonlinear Least Squares Problems -- Eiji Mizutani, James W. Demmel
This paper describes a method of dogleg trust-region steps, or
restricted Levenberg-Marquardt steps, based on a projection
process onto a Krylov subspace for neural networks nonlinear
least squares problems. In particular, the linear conjugate
gradient (CG) method works as the inner iterative algorithm for
solving the linearized Gauss-Newton normal equation, whereas the
outer nonlinear algorithm repeatedly takes so-called ``Krylov-dogleg''
steps, relying only on matrix-vector multiplication without
explicitly forming the Jacobian matrix or the Gauss-Newton model
Hessian. That is, our iterative dogleg algorithm can reduce both
operational counts and memory space by a factor of O(n) (the number
of parameters) in comparison with a direct linear-equation
solver. This memory-less property is useful for large-scale
problems.
Sex with Support Vector Machines -- Baback Moghaddam, Ming-Hsuan Yang
Nonlinear Support Vector Machines (SVMs) are investigated for visual
sex classification with low resolution ``thumbnail'' faces processed
from 1,755 images from the FERET face database. The performance of
SVMs is shown to be superior to traditional pattern classifiers
(Linear, Quadratic, Fisher Linear Discriminant, Nearest-Neighbor) as
well as more modern techniques such as Radial Basis Function (RBF)
classifiers and large ensemble-RBF networks. Furthermore, the SVM
performance (3.4\% error) is currently the best reported in the
open literature.
Robust Reinforcement Learning -- Jun Morimoto, Kenji Doya
This paper proposes a new reinforcement learning (RL) paradigm that
explicitly takes into account input disturbance as well as modeling
errors. The use of environmental models in RL is quite popular for both
off-line learning by simulations and for on-line action planning.
However, the difference between the model and the real environment can
lead to unpredictable, often unwanted results.
Based on the theory of H-infinity control, we consider a differential game in
which a `disturbing' agent tries to make the worst possible disturbance
while a `control' agent tries to make the best control input. The
problem is formulated as finding a min-max solution of a value function
that takes into account the norm of the output deviation and the norm of
the disturbance. We derive on-line learning algorithms for estimating
the value function and for calculating the worst disturbance and the
best control in reference to the value function.
We tested the paradigm, which we call ``Robust Reinforcement Learning
(RRL),'' in the task of inverted pendulum. In the linear domain, the
policy and the value function learned by the on-line algorithms
coincided with those derived analytically by the linear H-infinity control theory.
For a fully nonlinear swing-up task, the control by RRL achieved robust
performance against two-fold changes in the pendulum length while a
standard RL control could not deal with such environmental changes.
Partially Observable SDE Models for Image Sequence Recognition Tasks -- Javier R. Movellan, Paul Mineiro, Ruth J. Williams
This paper explores a framework for recognition of image sequences
using partially observable stochastic differential equation (SDE)
models. Monte-Carlo importance sampling techniques are used for
efficient estimation of sequence likelihoods and sequence likelihood
gradients. Once the network dynamics are learned, we apply the SDE
models to sequence recognition tasks in a manner similar to the way
Hidden Markov models (HMMs) are commonly applied. The potential
advantage of SDEs over HMMS is the use of continuous state dynamics.
We present encouraging results for a video sequence recognition task
in which SDE models provided excellent performance when compared to
hidden Markov models.
The Use of MDL to Select among Computational Models of Cognition -- In J. Myung, Mark A. Pitt, Shaobo Zhang, Vijay Balasubramanian
How should we decide among competing explanations of a cognitive process given
limited observations? The problem of model selection is at the heart of progress
in cognitive science. In this paper, Minimum Description Length (MDL) is
introduced as a method for selecting among computational models of cognition. We
also show that differential geometry provides an intuitive understanding of what
drives model selection in MDL. Finally, adequacy of MDL is demonstrated in two
areas of cognitive modeling.
Probabilistic Semantic Video Indexing -- Milind R. Naphade, Igor Kozintsev, Thomas Huang
We propose a novel probabilistic framework for semantic video indexing.
We define probabilistic multimedia objects (multijects) to
map low-level media features to high-level semantic labels.
A graphical network of such multijects (multinet) captures
scene context by discovering intra-frame as well as inter-frame
dependency relations between the concepts. The main contribution is
a novel application of a factor graph framework to model this network.
We model relations between semantic concepts in terms of
their co-occurrence as well as the temporal dependencies
between these concepts within video shots. Using the sum-product
algorithm for approximate or exact inference in these factor graph multinets,
we attempt to correct errors made during isolated concept detection by
forcing high-level constraints. This results in a significant improvement in
the overall detection performance.
Finding the Key to a Synapse -- Thomas Natschl{\"a}ger, Wolfgang Maass
Experimental data have shown that synapses are heterogeneous:
different synapses respond with different sequences of amplitudes of
postsynaptic responses to the same spike train. Neither the role of
synaptic dynamics itself nor the role of the heterogeneity of synaptic
dynamics for computations in neural circuits is well understood.
We present in this article two computational methods that make it
feasible to compute for a given synapse with known synaptic parameters
the spike train that is optimally fitted to the synapse in a certain
sense. One of these methods is based on dynamic programming (similar
as in reinforcement learning), the other one on sequential quadratic
programming. With the help of these methods one can compute for
example the temporal pattern of a spike train (with a given number of
spikes) that produces the largest sum of postsynaptic responses for a
specific synapse. Several other applications are also discussed.
To our surprise we find that most of these optimally fitted spike
trains match common firing patterns of specific types of neurons that
are discussed in the literature. Furthermore optimally fitted spike
trains are rather specific to a certain synapse (``the key to this
synapse'') in the sense that they exhibit a substantially smaller
postsynaptic response on any other of the mayor types of synapses
reported in the literature. This observation provides the first
glimpse at a possible functional role of the specific combinations of
synapse types and neuron types that was recently found in (Gupta,
Wang, Markram, Science, 2000).
Our computational analysis provides the platform for a better
understanding of the specific role of different parameters that
control synaptic dynamics, because with the help of the computational
techniques that we have introduced one can now see directly how the
temporal structure of the optimal spike train for a synapse depends on
the individual synaptic parameters. We believe that this inverse
analysis is essential for understanding the computational role of
neural circuits.
Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics -- Thomas Natschl{\"a}ger, Wolfgang Maass, Eduardo D. Sontag, Anthony Zador
Experimental data show that biological synapses behave quite
differently from the symbolic synapses in common artificial neural
network models. Biological synapses are dynamic, i.e., their
``weight'' changes on a short time scale by several hundred percent in
dependence of the past input to the synapse. In this article we
explore the consequences that this synaptic dynamics entails for the
computational power and adaptive capability of feedforward neural
networks.
Our analytical results show that the class of nonlinear filters that
can be approximated by neural networks with dynamic synapses, even
with just a single hidden layer of sigmoidal neurons, is remarkably
rich. It contains every time invariant filter with fading memory,
hence arguable every filter that is potentially useful for a
biological organism. This result is robust with regard to various
changes in the model for synaptic dynamics. Furthermore we show that
simple gradient descent suffices to approximate a given quadratic
filter by a rather small neural network with dynamic synapses.
The computer simulations we performed show that in fact their
performance is slightly better than that of previously considered
artificial neural networks that were designed for the purpose of
yielding efficient processing of temporal signals, without aiming at
biological realism. We have tested dynamic networks on tasks such as
the learning of a randomly chosen quadratic filter, as well as on the
system identification task used in (Back and Tsoi, 1993), to
illustrate the potential of our new architecture.
We also address the question which synaptic parameters are essential
for a network with dynamic synapses to be able to learn a particular
target filter. We found that neither just plasticity in the synaptic
dynamics nor just plasticity of the maximal amplitude alone yields
satisfactory results. However a simple gradient descent learning
algorithm that tunes both types of parameters simultaneously yields
good approximation capabilities.
Active Inference in Concept Learning -- Jonathan D. Nelson, Javier R. Movellan
People are active information gatherers, constantly experimenting and
seeking information relevant to their goals. A reasonable approach to
active information gathering is to ask questions and conduct
experiments that maximize the expected information gain, given current
beliefs (Lindley, 1956; Good, 1966; MacKay, 1992). In this paper we
compare the behavior of human subjects with that of an optimal
information-gathering agent (infomax) in a concept induction task
(Tenenbaum, 1999, 2000). Results show high consistency between
subjects in their choices of numbers to sample. However infomax
generally fails to predict subject's behavior when information value
is measured with respect to Tenenbaum's model. It is unclear at this
time whether the failure of infomax to predict human behavior is due
to problems with Tenenbaum's concept induction model, or due to the
fact that subjects use suboptimal heuristics (e.g., confirmatory
sampling).
Learning Continuous Distributions: Simulations With Field Theoretic Priors -- Ilya Nemenman, William Bialek
Learning of a smooth but nonparametric probability density can be
regularized using methods of Quantum Field Theory. We implement a
field theoretic prior numerically, test its efficacy, and show that
the free parameter of the theory (`smoothness scale') can be
determined self consistently by the data; this forms an infinite
dimensional generalization of the MDL principle. Finally, we study
the implications of one's choice of the prior and the
parameterization and conclude that even wrong choices can be
advantageous for small data sets.
Interactive Parts Model: An Application to Recognition of On-line Cursive Script -- Predrag Neskovic, Philip C. Davis, Leon N. Cooper
In this work, we introduce an Interactive Parts (IP) model as an alternative to
hidden Markov models (HMMs). We tested both models on a database of on-line
cursive script. We show that implementations of HMMs and the IP model, in which
all letters are assumed to have the same average width, give comparable results.
However, in contrast to HMMs, the IP model can handle duration modeling without
an increase in computational complexity.
Learning Sparse Image Codes using a Wavelet Pyramid Architecture -- Bruno A. Olshausen, Phil Sallee, Michael S. Lewicki
We show how a wavelet basis may be adapted to best represent natural
images in terms of sparse coefficients. The wavelet basis, which may
be either complete or overcomplete, is specified by a small number of
spatial functions which are repeated across space and combined in a
recursive fashion so as to be self-similar across scale. These
functions are adapted to minimize the estimated code length under a
model that assumes images are composed as a linear superposition of
sparse, independent components. When adapted to natural images, the
wavelet bases become selective to different spatial orientations, and
they achieve a superior degree of sparsity on natural images as
compared with traditional wavelet bases.
Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice -- Dirk Ormoneit, Peter Glynn
Many approaches to reinforcement learning combine neural networks or
other parametric function approximators with a form of
temporal-difference learning to estimate the value function of a
Markov Decision Process. A significant disadvantage of those
procedures is that the resulting learning algorithms are frequently
unstable. In this work, we present a new, kernel-based approach to
reinforcement learning which overcomes this difficulty and provably
converges to a unique solution. By contrast to existing algorithms,
our method can also be shown to be consistent in the sense that its
costs converge to the optimal costs asymptotically. Our focus is on
learning in an average-cost framework and on a practical application
to the optimal portfolio choice problem.
Learning and Tracking Cyclic Human Motion -- Dirk Ormoneit, Hedvig Sidenbladh, Michael J. Black, Trevor Hastie
We present methods for learning and tracking human motion in video.
We estimate a statistical model of typical activities from a large set
of 3D periodic human motion data by segmenting these data
automatically into a sequence of ``cycles''. Then the mean and the
principal components of the cycles are computed using a new algorithm
that accounts for missing information and enforces smooth transitions
between cycles. The learned temporal model provides a prior
probability distribution over human motions that can be used in a
Bayesian framework for tracking human subjects in complex monocular
video sequences and recovering their 3D motion.
Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals -- Lucas Parra, Clay Spence, Paul Sajda
We present evidence that several higher-order statistical
properties of natural images and signals can be explained by a
stochastic model which simply varies scale of an otherwise stationary
Gaussian process. We discuss two interesting consequences. The first
is that a variety of natural signals can be related through a common
model of
spherically invariant random processes, which have the attractive
property that the joint densities can be constructed from
the one dimensional marginal. The second is that in some cases the
non-stationarity assumption and only second order methods can be
explicitly exploited to find a linear basis that is equivalent to
independent components obtained with higher-order methods. This is
demonstrated on spectro-temporal components of speech.
Learning Switching Linear Models of Human Motion -- Vladimir Pavlovi\'c, James M. Rehg, John MacCormick
The human figure exhibits complex and rich dynamic behavior that is
both nonlinear and time-varying. Effective models of human dynamics
can be learned from motion capture data using switching linear dynamic
system (SLDS) models. We present results for human motion synthesis,
classification, and visual tracking using learned SLDS models. Since
exact inference in SLDS is intractable, we present three approximate
inference algorithms and compare their performance. In particular, a
new variational inference algorithm is obtained by casting the SLDS
model as a Dynamic Bayesian Network. Classification experiments show
the superiority of SLDS over conventional HMM's for our problem
domain.
Bayes Networks on Ice: Robotic Search for Antarctic Meteorites -- Liam Pedersen, Dimi Apostolopoulos, Red Whittaker
A Bayes network based classifier for distinguishing terrestrial rocks
from meteorites is implemented onboard the Nomad robot. Equipped with a
camera, spectrometer and eddy current sensor, this robot searched the
ice sheets of Antarctica and autonomously made the first robotic
identification of a meteorite, in January 2000 at the Elephant Moraine.
This paper discusses rock classification from a mobile robotic platform,
and describes the system onboard Nomad.
Redundancy and Dimensionality Reduction in Sparse-Distributed Representations of Natural Objects in Terms of Their Local Features -- Penio S. Penev
Low-dimensional representations are key to solving problems in
high-level vision, such as face compression and recognition.
Factorial coding strategies for reducing the redundancy present in
natural images on the basis of their second-order statistics have been
successful in accounting for both psychophysical and
neurophysiological properties of early vision. Class-specific
representations are presumably formed later, at the higher-level
stages of cortical processing. Here we show that when retinotopic
factorial codes are derived for ensembles of natural objects, such as
human faces, not only redundancy, but also dimensionality is reduced.
We also show that objects are built from parts in a non-Gaussian
fashion which allows these local-feature codes to have
dimensionalities that are substantially lower than the respective
Nyquist sampling rates.
Fast Training of Support Vector Classifiers -- Fernando P\'erez-Cruz, Pedro L. Alarc\'on-Diana, Angel Navia-V\'azquez, Antonio Art\'es-Rodr\'{\i}guez
In this communication we present a new algorithm for solving
Support Vector Classifiers (SVC) with large training data sets.
The new algorithm is based on an Iterative Re-Weighted Least
Squares procedure in order to optimize the SVC. Moreover, a novel
selection strategy for the working set is presented, different
from the existing ones. The selection strategy randomly chooses
the working set among the training samples that do not fulfill the
stopping criteria. The validity of both approaches, the
optimization procedure and sample selection strategy, are shown
by means of computer experiments using well-known data sets.
The Use of Classifiers in Sequential Inference -- Vasin Punyakanok, Dan Roth
We study the problem of combining the outcomes of several different
classifiers in a way that provides a coherent inference that satisfies
some constraints. In particular, we develop two general approaches for an
important subproblem -- identifying phrase structure. The first is a
Markovian approach that extends standard HMMs to allow the use of a rich
observation structure and of general classifiers to model
state-observation dependencies. The second is an extension of constraint
satisfaction formalisms. We develop efficient combination algorithms
under both models and study them experimentally in the context of shallow
parsing of natural language.
Occam's Razor -- Carl Edward Rasmussen, Zoubin Ghahramani
Confounding the prior over data sets with the complexity of the machinery used
to implement the model causes an Occam's Razor behaviour in non-parametric
models. If instead the two complexities are carefully separated the effects of
Occam's Razor is reduced and the limit of complex models may be preferred. A
simple example illustrates the idea and the practical and theoretical
implications are discussed.
One Microphone Source Separation -- Sam T. Roweis
Source separation, or computational auditory scene analysis, attempts
to extract individual acoustic objects from input
which contains a mixture of sounds from
different sources, altered by the acoustic environment.
Unmixing algorithms such as ICA and its extensions recover sources
by reweighting multiple observation sequences, and thus cannot
operate when only a single observation signal is available. I present
a technique called {\em refiltering} which recovers sources by a
nonstationary reweighting (``masking'') of frequency sub-bands from a
single recording, and argue for the application of statistical
algorithms to learning this masking function. I present results of a
simple factorial HMM system which learns on recordings of single
speakers and can then separate mixtures using only one observation
signal by computing the masking during inference and then refiltering.
Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task -- Brian Sallans, Geoffrey E. Hinton
The problem of reinforcement learning in large factored Markov
decision processes is explored. The Q-value of a state-action pair
is approximated by the free energy of a product of experts network.
Network parameters are learned on-line using a modified SARSA algorithm
which minimizes the inconsistency of the Q-values of consecutive
state-action pairs. Actions are chosen based on the current value estimates
by fixing the current state and sampling actions from the network using
Gibbs sampling. The algorithm is tested on a co-operative multi-agent task.
The product of experts model is found to perform comparably to table-based
Q-learning for small instances of the task, and continues to perform well
when the problem becomes too large for a table-based representation.
{\bf Brian Sallans received an Honorable Mention for NIPS Best Student Paper
Award sponsored by Athene Software}
Minimum Bayes Error Feature Selection for Continuous Speech Recognition -- George Saon, Mukund Padmanabhan
We consider the problem of designing a linear transformation which
projects the n-dimensional features of a classifier onto
p-dimensional features ($p$ less than $n$) such as to achieve minimum Bayes
error (or probability of misclassification). Two avenues will be
explored: the first is to maximize the average divergence between
the class densities and the second is to minimize the union
Bhattacharyya bound on the error in the projected space. While
both approaches yield similar performance in practice, they
outperform standard LDA features and show a 10\% relative
improvement in the word error rate over state-of-the-art cepstral
features on a large vocabulary telephony speech recognition task.
Periodic Component Analysis: An Eigenvalue Method for Representing Periodic Structure in Speech -- Lawrence K. Saul, Jont B. Allen
An eigenvalue method is developed for analyzing periodic structure in
speech. Signals are analyzed by a matrix diagonalization reminiscent
of methods for principal component analysis and
independent component analysis. Our method---called
periodic component analysis---uses constructive
interference to enhance periodic components of the frequency spectrum
and destructive interference to cancel noise. The front end emulates
certain aspects of auditory processing, such as cochlear filtering,
nonlinear compression, and insensitivity to phase, with the aim of
matching the robustness of human listeners. The method avoids the
inefficiencies of autocorrelation at the pitch period: it does not
require long delay lines, and it correlates signals at a clock rate on
the order of the actual pitch, as opposed to the original sampling
rate. We derive its cost function and present some experimental
results.
Spike-Timing-Dependent Learning for Oscillatory Networks -- Silvia Scarpetta, Zhaoping Li, John Hertz
We apply to oscillatory networks a class of learning rules in which
synaptic weights change proportional to pre- and post-synaptic activity,
with a kernel $A(\tau)$ measuring the effect for a postsynaptic spike a
time $\tau$ after the presynaptic one. The resulting synaptic matrices
have an outer-product form in which the oscillating patterns are
represented as complex vectors.
In a simple model, the even part of $A(\tau)$ enhances the resonant
response to learned stimulus by reducing the effective damping, while the
odd part determines the frequency of oscillation. We relate our model to
the olfactory cortex and hippocampus and their presumed roles in forming
associative memories and input representations.
Universality and Individuality in a Neural Code -- Elad Schneidman, Naama Brenner, Naftali Tishby, Robert R. de Ruyter van Steveninck, William Bialek
The problem of neural coding is to understand how sequences
of action potentials (spikes) are related to sensory stimuli, motor
outputs, or (ultimately) thoughts and intentions. One clear question
is whether the same coding rules are used by different neurons, or
by corresponding neurons in different individuals. We present a
quantitative formulation of this problem using ideas from information
theory, and apply this approach to the analysis of experiments in the
fly visual system. We find significant individual differences in the
structure of the code, particularly in the way that temporal patterns
of spikes are used to convey information beyond that available from
variations in spike rate. On the other hand, all the flies in our ensemble
exhibit a high coding efficiency, so that every spike carries the same
amount of information in all the individuals. Thus the neural code has a
quantifiable mixture of individuality and universality.
Machine Learning for Video-Based Rendering -- Arno Sch{\"o}dl, Irfan Essa
We present techniques for rendering and animation of realistic
scenes by analyzing and training on short video sequences. This
work extends the new paradigm for computer animation, video
textures, which uses recorded video to generate novel animations
by replaying the video samples in a new order. Here we
concentrate on video sprites, which are a special type of
video texture. In video sprites, instead of storing whole images,
the object of interest is separated from the background and the
video samples are stored as a sequence of alpha-matted sprites
with associated velocity information. They can be rendered
anywhere on the screen to create a novel animation of the object.
We present methods to create such animations by finding a
sequence of sprite samples that is both visually smooth and
follows a desired path. To estimate visual smoothness, we train a
linear classifier to estimate visual similarity between video
samples. If the motion path is known in advance, we use beam
search to find a good sample sequence. We can specify the motion
interactively by precomputing the sequence cost function using
Q-learning.
The Kernel Trick for Distances -- Bernhard Sch{\"o}lkopf
A method is described which, like the kernel trick in support vector
machines (SVMs), lets us generalize distance-based algorithms to
operate in feature spaces, usually nonlinearly related to the input
space. This is done by identifying a class of kernels which can be
represented as norm-based distances in Hilbert spaces. It turns out
that common kernel algorithms, such as SVMs and kernel PCA, are
actually really distance based algorithms and can be run with that
class of kernels, too.
As well as providing a useful new insight into how these algorithms
work, the present work can form the basis for conceiving new
algorithms.
Natural Sound Statistics and Divisive Normalization in the Auditory System -- Odelia Schwartz, Eero P. Simoncelli
We explore the statistical properties of natural sound stimuli
pre-processed with a bank of linear filters. The responses of
such filters exhibit a striking form of statistical dependency,
in which the response variance of each filter grows with the
response amplitude of filters tuned for nearby frequencies.
These dependencies may be substantially reduced using an
operation known as divisive normalization, in which the response
of each filter is divided by a weighted sum of the rectified
responses of other filters. The weights may be chosen to maximize
the independence of the normalized responses for an ensemble of
natural sounds. We demonstrate that the resulting model accounts
for non-linearities in the response characteristics of the
auditory nerve, by comparing model simulations to
electrophysiological recordings. In previous work (NIPS*98)
we demonstrated that an analogous model derived from the
statistics of natural images accounts for non-linear properties
of neurons in primary visual cortex. Thus, divisive
normalization appears to be a generic mechanism for eliminating a
type of statistical dependency that is prevalent in natural
signals of different modalities.
Balancing Multiple Sources of Reward in Reinforcement Learning -- Christian R. Shelton
For many problems which would be natural for reinforcement learning, the
reward signal is not a single scalar value but has multiple scalar
components. Examples of such problems include agents with multiple goals
and agents with multiple users.
Creating a single reward value by combining the
multiple components can throw away vital information and can lead to
incorrect solutions. We describe the multiple reward source problem and
discuss the problems with applying traditional reinforcement learning. We
then present an new algorithm for finding a solution and results on
simulated environments.
An Information Maximization Approach to Overcomplete and Recurrent Representations -- Oren Shriki, Haim Sompolinsky, Daniel D. Lee
The principle of maximizing mutual information is applied to learning
overcomplete and recurrent representations. The underlying
probabilistic model consists of a network of input units driving a
larger number of output units with recurrent interactions. In the
limit of zero noise, the network is deterministic and the mutual
information can be related to the entropy of the output units.
Maximizing this entropy with respect to both the feedforward
connections as well as the recurrent interactions results in simple
learning rules for both sets of parameters. The conventional
independent components (ICA) learning algorithm can be recovered as a
special case where there is an equal number of output units and no
recurrent connections. The application of these new learning rules is
illustrated on a simple two-dimensional input example.
Development of Hybrid Systems: Interfacing a Silicon Neuron to a Leech Heart Interneuron -- Mario F. Simoni, Gennady S. Cymbalyuk, Michael Q. Sorensen, Ronald L. Calabrese, Stephen P. DeWeerth
We have developed a silicon neuron that is inspired by a mathematical
model of the leech heartbeat (HN) interneuron. The temporal and ionic
current behaviors of this silicon neuron are close to that of the living
cell. Because of this similarity we were able to interface this silicon
neuron
to a living HN cell using a dynamic clamp technique. We present
data showing dynamic behaviors of the hybrid half-center oscillator.
FaceSync: A Linear Operator for Measuring Synchronization of Video Facial Images and Audio Tracks -- Malcolm Slaney, Michele Covell
FaceSync is an optimal linear algorithm that finds the degree of
synchronization between the audio and image recordings of a human
speaker. Using canonical correlation, it finds the best direction to
combine all the audio and image data, projecting them onto a single
axis. FaceSync uses Pearson's correlation to measure the degree of
synchronization between the audio and image data. We derive the
optimal linear transform to combine the audio and visual information
and describe an implementation that avoids the numerical problems
caused by computing the correlation matrices.
The Early Word Catches the Weights -- Mark A. Smith, Garrison W. Cottrell, Karen L. Anderson
The strong correlation between the frequency of words and
their naming latency has been well documented. However, as
early as 1973, the Age of Acquisition (AoA) of a word was
alleged to be the actual variable of interest, but these
studies seem to have been ignored in most of the
literature. Recently, there has been a resurgence of
interest in AoA. While some studies have shown that
frequency has no effect when AoA is controlled for, more
recent studies have found an independent contribution of
frequency and AoA. Connectionist models have repeatedly
shown strong effects of frequency, but little attention has
been paid to whether they can also show AoA effects. Indeed,
several researchers have explicitly claimed that they cannot
show AoA effects. In this work, we explore these claims
using a simple feed forward neural network. We find a very
significant contribution of AoA to naming latency, as well
as the conditions under which frequency provides an
independent contribution.
Sparse Greedy Gaussian Process Regression -- Alex J. Smola, Peter Bartlett
We present a simple sparse greedy technique to approximate the maximum a
posteriori estimate of Gaussian Processes with much improved scaling
behaviour in the sample size $m$. In particular, computational requirements
are $O(n^2 m)$, storage is $O(n m)$, the cost for prediction is $O(n)$
and the cost to compute confidence bounds is $O(n m)$, where $n \ll m$.
We show how to compute a stopping criterion, give bounds on the approximation
error, and show applications to large scale problems.
Regularization with Dot-Product Kernels -- Alex J. Smola, Zolt\'an L. \'Ov\'ari, Robert C. Williamson
In this paper we give necessary and sufficient conditions under which
kernels of dot product type $k(x, y) = k(x \cdot y)$ satisfy Mercer's
condition and thus may be used in Support Vector Machines (SVM),
Regularization Networks (RN) or Gaussian Processes (GP). In particular, we
show that if the kernel is analytic (i.e.\ can be expanded in a Taylor
series), all expansion coefficients have to be nonnegative. We give an
explicit functional form for the feature map by calculating its
eigenfunctions and eigenvalues.
APRICODD: Approximate Policy Construction Using Decision Diagrams -- Robert St-Aubin, Jesse Hoey, Craig Boutilier
We propose a method of approximate dynamic programming for Markov
decision processes (MDPs) using algebraic decision diagrams (ADDs).
We produce near-optimal value functions and policies with much lower
time and space requirements than exact dynamic programming. Our
method reduces the sizes of the intermediate value functions generated
during value iteration by replacing the values at the terminals of the
ADD with ranges of values. Our method is demonstrated on a class of
large MDPS (with up to 2 billion states), and we compare the results
with the optimal value functions.
Four-legged Walking Gait Control Using a Neuromorphic Chip Interfaced to a Support Vector Learning Algorithm -- Susanne Still, Bernhard Sch{\"o}lkopf, Klaus Hepp, Rodney J. Douglas
To control the walking gaits of a four-legged robot we present a
novel neuromorphic VLSI chip that coordinates the relative phasing of the
robot's legs similar to how spinal Central Pattern Generators are believed
to control vertebrate locomotion. The chip controls the leg movements by
driving motors with time varying voltages which are the outputs of a
small network of coupled oscillators. The characteristics of the chip's
output voltages depend on a set of input parameters. The relationship
between input parameters and output voltages can be computed analytically
for an idealized system. In practice, however, this idealized relationship
is only approximately true. Fine tuning of the chip's input parameters is
done autonomously by the robotic system, using an unsupervised Support
Vector learning algorithm that was introduced recently. The machine learns
from (unlabeled) examples how to set the parameters to the chip in order
to obtain a desired motor behavior.
Kernel Expansions with Unlabeled Examples -- Martin Szummer, Tommi S. Jaakkola
Modern classification applications necessitate supplementing the few
available labeled examples with unlabeled examples to improve
classification performance. We present a new tractable
algorithm for exploiting unlabeled examples in discriminative
classification. This is achieved essentially by expanding the input
vectors into longer feature vectors via the unlabeled examples. The
resulting classification method can be interpreted as a discriminative
kernel density estimate and is readily trained via the EM algorithm,
which in this case is both discriminative and achieves the optimal
solution. We provide, in addition, a purely discriminative formulation
of the estimation problem by appealing to the maximum