\documentstyle{book}
\newtheorem{teo}{Theorem}[section]
\newtheorem{example}{Example}
\newtheorem{lemma}{Lemma}
\newtheorem{cor}{Corollary}
\newtheorem{defi}{Definition}
\newtheorem{dfn}{Definition}
\newcommand{\epf}{$_{_{\framebox(5,5){}}}$\vspace{.02in}\\}
\newcommand{\bpf}{{\it Proof: }}
\newcommand{\fullsq}[3]{\put(#1,#2){\rule{#3cm}{#3cm}}}
\newcommand{\voidsq}[3]{\put(#1,#2){\framebox(#3,#3){}}}
\newcommand{\psfigure}[2]{\begin{figure}[htb]\centering
\BoxedEPSF{#1.eps}
\caption{#2 \label{f:#1}}
\end{figure}}
\newcommand{\scaledpsfigure}[4]{\begin{figure}[#4]\centering
\BoxedEPSF{#1.eps scaled #3}
\caption{#2 \label{f:#1}}
\end{figure}}
\newcommand{\ascaledpsfigure}[4]{\begin{figure*}[#4]\centering
\BoxedEPSF{#1.eps scaled #3}
\caption{#2 \label{f:#1}}
\end{figure*}}
\newcommand{\oscaledpsfigure}[3]{\begin{figure}[hp]\centering
\BoxedEPSF{#1.eps scaled #3}
\caption{#2 \label{f:#1}}
\end{figure}}
\input ../../TESI-ITA/BoxedEPS.tex
\SetRokickiEPSFSpecial
\HideDisplacementBoxes
\input{../SCUOLA-VIETRI/mydefs.tex}
\begin{document}
\chapter{A Tutorial on Neurocomputing of Structures}
A.\ Sperduti\\
Department of Computer Science\\
University of Pisa\\
{\it perso@di.unipi.it}\\
\vspace{4mm}
\begin{center} {\em Abstract} \end{center}
\begin{quotation}
In this chapter we present a tutorial on neural networks for the processing
of structured information. Basic concepts on structured domains are
first introduced. Then earlier work on representing data structures within
traditional neural networks is briefly reviewed. This review introduces
some basic concepts which are exploited to realize graph transductions.
Specifically, we discuss neural realizations of a specific class of
graph transductions for classification of labeled acyclic graphs. We show how
the standard definition of neuron and neural learning algorithms
can be extended to deal with structured domains. A specific section
is then devoted to the case of cyclic graphs. In the second part of the
tutorial, computational and complexity results concerning this class
of neural networks are reviewed. Almost all these results exploit
Tree Automata Theory as a medium to assess the computational
capabilities and functional complexity
of this kind of neural networks.
\end{quotation}
\section{Introduction}
Structured representations are ubiquitous in different fields such as
knowledge representation, language modeling and pattern recognition.
Examples of application domains where structures are extensively used
are medical and technical diagnoses, molecular biology, chemistry, automated
reasoning, and so on.
While algorithms that manipulate symbolic information are capable of
dealing with highly structured data, adaptive neural networks are
mostly regarded as learning models for domains in which instances are
organized into {\em static} data structures, like records or
fixed-size arrays. Recurrent neural networks, that generalize
feedforward networks to sequences (a particular case of dynamically
structured data) are perhaps the best known exception.
The interest in developing connectionist architectures capable of
dealing with these rich representations (as opposed to ``flat'' or
vector-based representations) can be traced back to the end of the
80's \cite{touretzky90j,pollack90j,hinton90j,plate95j,smol90j}. However, only
during the last few years has a general framework for the development of
neural networks for the representation and processing of structures
been developed \cite{tr-16/95,framework}. In particular, these neural
networks are a generalization of recurrent networks for processing
sequences (i.e., linear chains from a graphical point of view) to the
case of (mainly) directed acyclic graphs.
There are several reasons for being interested in the processing of
structures by neural networks. First of all, neural networks are
universal approximators. Moreover, they are good classifiers and they
are robust to the presence of noise. These capabilities are important
when considering real-world applications in structured domains. In
order to exemplify, let us consider the classification of labeled graphs
which represent chemical compounds. The standard approach consists in
encoding each graph $X$ as a fixed-size vector which is then given as
input to a feed-forward neural network for classification. This
approach is motivated by the fact that neural networks only have a
fixed number of input units while graphs are variable in size. The
encoding process is usually defined a priori and does not depend on
the classification task, e.g., in molecular biology and chemistry, the
encoding process is performed through the definition of {\it
topological indexes} \cite{chimica2}, which are designed by a very
expensive trial and error approach.
The a priori definition of the encoding process faces the {\it
specificity-generality dilemma}, i.e., if the encoding is too
specific, new features must be devised (by suitable experts) for each
new computational problem; on the other hand, if the encoding is too
general, the representing features are weakly relevant with respect to
the specific target and so they turn out to be difficult to classify.
To overcome the dilemma, the encoding process should be adapted
automatically to the specific classification task at hand. This can
be done by implementing the encoding process through an additional
neural network which is trained, alongside the neural network
performing the classification, to learn the best way to encode the
graphs for the given classification task.
Of course, neither standard neurons nor recurrent neurons are powerful
enough to be used in such a network. This is because the former were
devised for processing unstructured patterns, while the latter can
only naturally process sequences. However, by extending the concept of
recurrent neuron it is possible to formalize several supervised models
for classification of structures which stem very naturally from well
known models, such as Back-Propagation Through Time networks,
Real-Time Recurrent networks, Simple Recurrent networks, Recurrent
Cascade Correlation networks, and Neural Trees.
When developing a new computational model, it is important to
understand its computational power and complexity. This evaluation is
typically done by resorting to known results in related models. For
example, it is well known that recurrent neural networks can simulate
any finite state automaton \cite{ADO91,OmlGil96a} as well as any
multi-stack Turing machine in real time \cite{siegelmann95j}. When
constraining the network architecture, however, this computational
power may no longer hold. For example, Elman-style Recurrent Networks
can simulate any finite state automaton \cite{goudreau94j,kremer95j},
while Recurrent Cascade-Correlation cannot \cite{limit-rcc,kremer96p}.
Here we discuss the computational capabilities of Elman-style
Recurrent Networks, Recurrent Cascade Correlation networks, and Neural
Trees with respect to the classification of structures. We will
relate them to Frontier-to-Root Tree Automata (FRA)
\cite{thatcher73c,sypr} and show that Elman-style Recurrent Networks
can simulate any FRA, while neither Cascade-Correlation networks, nor
Neural Trees can. Moreover, we briefly report a known result which
states that, if the set of possible labels (alphabet) is finite, then
any mapping from trees to the set of reals can be implemented by a
neural network for structures having a sufficient number of
parameters.
Computational complexity of the model is important as well. Here we
give both upper and lower bounds on the number of neurons (or
connections) needed to implement an arbitrary frontier-to-root Tree
Automaton into the most powerful model. Some results on the complexity
of learning are sketched as well.
\section{Basic Concepts}
Here we consider structured domains which are sets
of annotated directed ordered graphs (DOGs).
By a DOG we mean a directed graph $\BY$ with vertex set $\vert(\BY)$ and edge
set $\edg(\BY)$, where for each vertex $v\in \vert(\BY)$ a total
order on the edges leaving $v$ is defined. Moreover, vertices
are annotated by labels which
are tuples of variables.
The void DOG will
be denoted by the special symbol $\xi$.
For example, in the case of graphs representing logical terms with
variables (denoted by capital letters), the
order on outgoing edges is immediately induced by the order of the
arguments to a function; e.g., the logical term {\tt f(X,g(X,c))}
can be represented as
\[
\begin{array}{ccccccc}
& &\mbox{\tt f} & & & & \\
& \SW \hspace{-.5cm}& & \hspace{-.5cm}\SE & & &\\
\mbox{\tt X\hspace{-.5cm}} & & \stackrel{1}{\longleftarrow}& & \mbox{\hspace{-.5cm}\tt g} & & \\
& & & & & \hspace{-.5cm}\SE &\\
& & & & & & \mbox{\hspace{-.5cm}\tt c}
\end{array}
\]
Note that, since the {\tt X} variable is shared, a single node
is used to represent it.
In problems of structure classification, we shall require the DOG either to
be empty or to possess a supersource, i.e. a vertex $s \in \vert(\BY)$ such
that every vertex in $\vert(\BY)$ can be reached by a directed
path starting from $s$. Note that if a DOG does
not possess a supersource, it is still possible to define a convention for
adding an extra vertex $s$ (with a minimal number of outgoing edges), such
that $s$ is a supersource for the expanded DOG \cite{tr-16/95}.
When considering a DOAG, i.e., an acyclic DOG, the function $source(\BY)$
returns the (unique) supersource of $\BY$. We also assume that a convention is
defined in such a way that, given a cyclic DOG, the function $source(\BY)$
returns a predetermined supersource within the set of
supersources\footnote{A cyclic graph may have several supersources.}.
Given a DOAG $\BY$ and $v\in \vert(\BY)$, we denote by $\pa[v]$ the set of
parents of $v$, by $\ch[v]$ the set of children of $v$, by $\de[v]$ the
set of descendants of $v$, and by $\an[v]$ the set of ancestors of $v$.
We will use the notation $\ch_j[v]$ to refer to the $j$th child of $v$.
The {\em indegree} of $v$ is the cardinality of the set $\pa[v]$, while
the {\em outdegree} of $v$ is the
cardinality of the set $\ch[v]$.
In the following, we shall denote by
$\#^{(i,o)}$ the class of DOAGs with maximum indegree $i$ and
maximum outdegree $o$.
In our logical terms example, the maximum outdegree corresponds to the
maximum arity of the functions being considered, e.g.,
the maximum outdegree of {\tt f(X,g(X,c))} is 2.
A generic class of DOAGs with
bounded (but unspecified) indegree and outdegree, will simply be denoted
by $\#$.
Subscript notation will be used when referencing the labels attached
to vertices in a data structure.
Hence $\BY_{v}$ denotes the set of variables labeling vertex $v\in
\vert(\BY)$.
Given a data structure $\BY$, the DOG obtained by ignoring all
node labels will be referred to as the {\em skeleton} of $\BY$,
denoted $\skel(\BY)$.
Clearly, any two distinct data structures can be distinguished because they have
different skeletons, or, if they have the same skeleton, because they
have different node labels.
The class of all data structures defined over the label universe domain
$\CalY$ and skeleton in $\#^{(i,o)}$ will be denoted as
$\CalY^{\#^{(i,o)}}$.
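As a concrete (and purely illustrative) rendering of the definitions above, a labeled DOAG can be represented by vertices carrying a label and an ordered list of children; the {\tt Vertex} helper below is an assumption of this sketch, not notation from the chapter. The shared variable of the earlier logical-term example then becomes a single vertex with two parents:

```python
class Vertex:
    def __init__(self, label, children=()):
        self.label = label               # the annotation of the vertex
        self.children = list(children)   # ordered: children[j-1] is the jth child

def outdegree(v):
    return len(v.children)

# The logical term f(X, g(X, c)): the shared variable X is a single
# vertex referenced by both f and g, so the skeleton is a DOAG, not a tree.
x = Vertex("X")
c = Vertex("c")
g = Vertex("g", [x, c])
f = Vertex("f", [x, g])                  # the supersource
```

Here {\tt f} plays the role of the supersource: every vertex is reachable from it by a directed path.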
We define a {\it structured domain} ${\cal D}$
(over the set of labels $\Sigma$) as any
(possibly infinite)
set of graphs (over $\Sigma$). The {\it valence of a domain} ${\cal D}$
is defined as the
maximum among the out-degrees of the graphs belonging to ${\cal D}$.
Since we will deal
with learning, we need to define the target function we want to learn.
In approximation tasks, a {\it target function} $d()$ over $\CalD^{\#}$
is defined as any
function $d:\CalD^{\#}\rightarrow \RealSet^k$, where $k$ is the output
dimension, while
in (binary) classification tasks we have $d:\CalD^{\#}\rightarrow
\{0,1\}^k$ (or $d:\CalD^{\#}\rightarrow \{-1,1\}^k$).
A training set $T$ on a domain $\CalD^{\#}$ is defined as a set of
pairs $(\BX,d(\BX))$,
where $\BX\in \CalU^{\#}\subseteq \CalD^{\#}$ and $d()$ is a target function
defined on $\CalD^{\#}$.
Finally, for the sake of completeness, here we recall the basic definitions
of standard and recurrent neurons.
The output $o^{(s)}$ of a standard neuron is given by
\begin{equation}
o^{(s)} = f(\sum_{i}w_iI_i),
\end{equation}
where {\em f()} is some non-linear squashing function applied to the
weighted sum of inputs {\em I}\footnote{The threshold of the neuron
is included in the weight vector by expanding the input vector with
a component always set to 1.}.
A recurrent neuron with a single self-recurrent connection, on the other
hand,
computes its output $o^{(r)}(t)$ as follows
\begin{equation}
o^{(r)}(t) = f(\sum_{i}w_iI_i(t) + w_so^{(r)}(t-1)),
\end{equation}
where {\em f()} is applied to the weighted sum of inputs {\em I} plus the
self-weight, $w_s$, times the previous output. The above formula can be
extended both considering several interconnected recurrent neurons and
delayed versions of the outputs \cite{haykin}.
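The two definitions above translate directly into code. The following is a minimal sketch: the choice of the logistic sigmoid for {\em f()} and the zero initial state of the recurrent neuron are assumptions of the sketch, not prescriptions of the text.

```python
import math

def f(x):
    # a sigmoid-shaped squashing function
    return 1.0 / (1.0 + math.exp(-x))

def standard_neuron(w, inputs, theta=0.0):
    # o = f(sum_i w_i I_i + theta)
    return f(sum(wi * Ii for wi, Ii in zip(w, inputs)) + theta)

def recurrent_neuron(w, w_s, input_seq, theta=0.0):
    # o(t) = f(sum_i w_i I_i(t) + w_s * o(t-1)), with o(0) = 0 assumed
    o = 0.0
    for inputs in input_seq:
        o = f(sum(wi * Ii for wi, Ii in zip(w, inputs)) + w_s * o + theta)
    return o
```

The recurrent neuron simply folds the standard neuron over a sequence, feeding its previous output back through the self-weight $w_s$.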
\section{Representing Structures in Neural Networks}
Structures may be represented in a neural network by using either a {\it localist
representation} or a {\it distributed representation}. In the localist representation
each neuron represents a component of the structure, while
each connection represents a relation between components. This way of representing
structures is trivial and not efficient, since different structures need to
be represented by different networks. For this reason, several researchers
have developed neural systems using different types of distributed representation,
such as in DCPS \cite{tour88j}, BoltzCONS \cite{touretzky90j}, Tensor Products
\cite{smol90j}, RAAM \cite{pollack90j} and LRAAM \cite{sperduti94p,sperduti94j},
Convolution-based models (CHARM \cite{met93j},
TODAM2 \cite{murdock93j}),
Holographic Reduced Representations \cite{plate95j}.
In the following we discuss the RAAM family of models, since they can be
considered the precursors of the models presented in this chapter.
\subsection{The RAAM Family}
The task of a network in the RAAM family \cite{sperduti96c2} consists in devising a
reduced descriptor for each structure (and substructure) in a given
set of structures, i.e., the training set. The concept of
reduced descriptor was first introduced by Hinton \cite{hinton90j}.
The basic idea is to represent complex conceptual structures in distributed
representations. These representations, however, should satisfy four desiderata:
\begin{enumerate}
\item {\bf Representational adequacy:} it should be possible to
reconstruct the full representation of the structure from the reduced descriptor;
\item {\bf Reduction:} the reduced descriptor should require the allocation of fewer units
than the full explicit representation of the structure;
\item {\bf Systematicity:} the reduced descriptor should be related in a
systematic way to the full representation;
\item {\bf Informativeness:} the reduced descriptor should tell
us something about the concept that it refers to, without it
being necessary to reconstruct the full representation.
\end{enumerate}
All these desiderata are satisfied by the networks in the RAAM family.
Depending on the nature of the structures at hand and on how they are
represented, different networks of the RAAM family can be used.
A standard RAAM network is able to generate reduced descriptors for
trees with information stored in the leaves; an SRAAM network
is, instead, suited for lists and an LRAAM network (which is a
generalization of the SRAAM network) can deal with labeled directed
graphs.
The general form of a network of the RAAM family is defined
by an $N_I$--$N_H$--$N_O$ feed-forward encoder network, where both
the input and output layers are partitioned into one label field
and $n$ reduced descriptor (pointer) fields, i.e., $N_I = N_O = N_l+nN_H$, where
$N_l \geq 0$ is the size of the label field; moreover, the
reduced descriptor fields and the hidden layer have the same size.
In this formulation, the network for a RAAM network is obtained
by setting $N_l = 0$ and the network for an SRAAM by setting
$n=1$.
The set of equations describing the network is as follows:
\begin{equation}
\label{pah}
\mbox{\boldmath $F$}_E(\vec{i})=\mbox{\boldmath $F$}
(E\vec{i}+\vec{\theta}_H)\equiv\vec{h},
\end{equation}
\begin{equation}
\label{pao}
\mbox{\boldmath $F$}_D(\vec{h})=\mbox{\boldmath $F$}
(D\vec{h}+\vec{\theta}_O)\equiv\vec{o},
\end{equation}
where $\vec{h},\vec{\theta}_H\in \RealSet^{N_H}$,
$\vec{o},\vec{i},\vec{\theta}_O\in \RealSet^{N_l+nN_H}$, $\mbox{\boldmath
$F$}(\vec{i})_j=f(\vec{i}_j)$,
and $f()$ is a sigmoid-shaped function. The vectors $\vec{\theta}_H$ and
$\vec{\theta}_O$ are the bias vectors for the hidden and output layers,
respectively, $\vec{i}$ is the
input pattern to the network, $\vec{h}$ is the hidden activation, and
$\vec{o}$ is the output of the network.
The matrix $E\in \RealSet^{N_H \times N_I}$ is
the Encoding matrix, i.e., the
weight matrix between the input layer ($N_I=N_l+nN_H$ units)
and the hidden layer ($N_H$ units);
and $D\in \RealSet^{N_O \times N_H}$ is the Decoding matrix,
i.e., the
weight matrix between the hidden layer and the output layer ($N_O=N_I$).
In the following, the index $t$ will range over the output units,
$s$ over the hidden units, and $q$ over the input units.
A pattern\footnote{For the sake of notation, in the following, we omit the transpose
operator: the notation $\vec{p}=[\vec{y}_1,\ldots,\vec{y}_k]$ must be read as
$\vec{p}^{~t}=[\vec{y}^{~t}_1,\ldots,\vec{y}^{~t}_k]$.} in the training set of a general
RAAM network assumes the
form $\vec{p}=[\vec{c}_1,\ldots,\vec{c}_m]$, where
$\vec{c}_j\in \RealSet^{N_H}$ can
be either a reduced descriptor or a label. When considering an SRAAM
network the typical form of a pattern becomes
$\vec{p}=[\vec{l},\vec{rd}]$, where $\vec{l}\in \RealSet^{N_l}$
is a label and $\vec{rd}\in \RealSet^{N_H}$ is a reduced descriptor.
An LRAAM network generalizes
this form, allowing multiple reduced descriptors: $\vec{p}=
[\vec{l},\vec{rd}_1,\ldots,\vec{rd}_k]$.
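Equations (\ref{pah}) and (\ref{pao}) amount to a single encoder/decoder forward pass. The following sketch, with arbitrary small dimensions and random untrained matrices (both assumptions made only for illustration), shows the shapes involved for an LRAAM-style network:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Arbitrary small dimensions: label size N_l, n pointer fields, N_H hidden units.
N_l, n, N_H = 3, 2, 4
N_I = N_O = N_l + n * N_H

rng = np.random.default_rng(0)
E = rng.normal(size=(N_H, N_I))          # encoding matrix
D = rng.normal(size=(N_O, N_H))          # decoding matrix
theta_H, theta_O = np.zeros(N_H), np.zeros(N_O)

def F_E(i):                              # eq. (pah): hidden (reduced) representation
    return sigmoid(E @ i + theta_H)

def F_D(h):                              # eq. (pao): reconstructed pattern
    return sigmoid(D @ h + theta_O)

# A pattern p = [label, rd_1, rd_2] with void (zero) reduced descriptors.
p = np.concatenate([np.ones(N_l), np.zeros(n * N_H)])
h = F_E(p)
o = F_D(h)
```

Setting $N_l = 0$ recovers the RAAM layout and $n = 1$ the SRAAM layout, exactly as in the text.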
The learning algorithm for RAAM and SRAAM networks was originally developed by
J. B. Pollack \cite{pollack90j} and it essentially remains the same for the LRAAM.
The algorithm
combines a standard gradient descent with a dynamical adaptation of the
training set. Specifically, the goal of the gradient descent consists in
minimizing the following cost function:
\begin{equation}
C_0 =\frac{1}{2}\sum_{k=1}^P\|\vec{p}^{~(k)}-\mbox{\boldmath $F$}_D(\mbox{\boldmath $F$}_E(\vec{p}^{~(k)}))\|^2,
\end{equation}
where $P$ is the number of patterns (substructures) in the training set.
If the learning is able to reach zero error, then the network implements a perfect
compressor--reconstructor pair for the training set. Furthermore, if each reduced
descriptor $\vec{rd}^{(k)}$
in the training set is equal to $\vec{h}^{~(k)}=\mbox{\boldmath $F$}_E(\vec{p}^{~(k)})$, then the network
is also a member of the RAAM family, in the sense that each reduced descriptor
retains all the information contained in the pattern $\vec{p}^{~(k)}$. In fact, this
information can be reconstructed by decoding $\vec{rd}^{(k)}$, i.e., $\vec{p}^{~(k)}=
\mbox{\boldmath $F$}_D(\vec{rd}^{(k)})$. Pollack's algorithm fulfills this constraint by modifying
the training set during learning: at each epoch of learning every reduced descriptor
$\vec{rd}^{(k)}$ is replaced by the corresponding hidden layer activity $\vec{h}^{~(k)}$. Note
that this technique guarantees the consistency of the model only at global
minima of $C_0$. In practice, however, even a point in weight space which
is in the proximity of a global minimum works well.
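The moving-target aspect of the algorithm can be isolated in a short sketch. This is a simplification under stated assumptions: the gradient step on $C_0$ is omitted, and the data layout (a table of reduced descriptors plus a (label, children) record per substructure) is invented here for illustration.

```python
import numpy as np

def epoch_update(table, structures, encode, nil):
    """One moving-target step: table[k] is the current reduced descriptor of
       substructure k; structures[k] = (label vector, list of child keys,
       with None meaning the void descriptor); encode is the current F_E.
       Returns the new table with table[k] = F_E(p^(k)), where p^(k) is
       rebuilt from the labels and the previous descriptors of k's children."""
    new_table = {}
    for k, (label, kids) in structures.items():
        p = np.concatenate([label] + [nil if c is None else table[c]
                                      for c in kids])
        new_table[k] = encode(p)
    return new_table
```

Iterating this update together with gradient descent on $C_0$ is what, at a global minimum, makes every reduced descriptor decodable back into its pattern.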
For LRAAM networks, this author introduced an additional learning rule for dealing
with void descriptors \cite{sperduti94j}. This rule is necessary because in LRAAM networks there is
no commitment to a particular representation for the void descriptor, i.e.,
the void descriptor can get multiple representations. These representations
are distinguished from non-void descriptors thanks to a void condition bit
allocated in the label for each reduced descriptor field of the network.
The learning rule for void descriptors consists in copying the output representations for
the void descriptor into the training set at each epoch of learning.
\scaledpsfigure{fam-raam}{The models belonging to the RAAM family: the RAAM is
able to encode trees with labels on the leaves, the SRAAM encodes sequences,
while the LRAAM is able to encode labeled directed graphs.}{550}{tbhp}
The main theoretical concern about Pollack's algorithm (and its LRAAM generalization)
is whether it is possible
to give it a better mathematical characterization, since an improved understanding of the
underpinning mathematics should eventually either help in improving the efficiency
of the algorithm or in disclosing the limitations of the RAAM family.
\section{From Graph Representation to Graph Transductions}
\label{gra-trasd}
In the previous section, we have discussed a specific way of representing
labeled graphs within a RAAM neural network. The basic idea is to implement
a mechanism for encoding the graph into a vectorial representation,
preserving all the information needed for a subsequent decoding of the
vectorial representation in order to reconstruct the original graph.
In this section, we discuss how the encoding function can be fruitfully
combined with a transformation to obtain a specific type of
graph transduction\footnote{Some experimental results obtained on
combining an LRAAM network with a perceptron are reported in \cite{sperduti95p2}.}.
A functional graph transduction ${\cal T}:{\cal I}\rightarrow{\cal O}$ is defined
as a mapping from an input structured domain ${\cal I}$ (over
$\Sigma_{\cal I}$)
to an output structured domain ${\cal O}$ (over $\Sigma_{\cal O}$).
A special case of transduction that we will consider is the one where ${\cal O}$
is the trivial set of structures composed of a single labeled node. In
fact, it is not difficult to recognize that when this is the case, with
$\Sigma_{\cal O}=\{0,1\}$ the transduction can be interpreted as a
classification function of the structures in ${\cal I}$.
In general, however, graph transductions can assume several forms. For
the purpose of this tutorial we mainly focus on the class of
transductions, defined on domains of DOAGs with supersource, which can be
represented in the following form
\begin{equation}\label{tra}
{\cal T}=g\circ \hat{\tau}
\end{equation}
where $g$ is the {\it output} function and $\hat{\tau}$ is the {\it encoding}
(or {\it state transition}) function. Specifically, $\hat{\tau}$ is defined
recursively as
\begin{equation}\label{enc-fun}
\hat{\tau}(\BY)=
\left\{
\begin{array}{ll}
nil & \mbox{~~if } \BY=\xi\\
\tau(s,\BY_s,\hat{\tau}(\BY^{(1)}),\ldots,\hat{\tau}(\BY^{(o)})) & \mbox{~~otherwise}
\end{array}
\right.
\end{equation}
where $\tau:vert(\CalI^{\#})\times\CalI\times\CalX^o\rightarrow \CalX$
is the {\it c-model} function,
$s=source(\BY)$, $nil\in\CalX$ is used to represent the void graph, and
$\BY^{(1)},\ldots,\BY^{(o)}$ are the subgraphs
pointed by $s$, i.e., $source(\BY^{(i)})=ch_i[s]$.
The function $\tau$ is called c-model function since it
defines a computational model for the encoding function.
Note that, because of eq. (\ref{enc-fun}), $\transd$
is {\it causal} since $\tau$ only depends on the current node and
on the nodes descending from it. Moreover, when $\tau$ does not depend on any
specific vertex, i.e.,
$\tau(\BY_s,\hat{\tau}(\BY^{(1)}),\ldots,\hat{\tau}(\BY^{(o)}))$,
then $\transd$ is also {\it stationary}. We will mainly focus
on stationary transductions.
Of course, for general DOGs, the above definition of
$\hat{\tau}(\cdot)$ is not
adequate. In Section \ref{cgraphs} we will also discuss how to deal
with cyclic DOGs.
\begin{example}
Given a stationary encoding function $\hat{\tau}$, the encoding of the
labeled structure
\[
\begin{array}{ccccc}
& & \mbox{b} & & \\
& \stackrel{1}{\nearrow} & & & \\
\mbox{a} & & & & \\
& \stackrel{2}{\searrow} & & & \\
& & \mbox{c} & \stackrel{1}{\rightarrow} & \mbox{a}
\end{array}
\]
is defined by the following set of equations
\[
\hat{\tau}\left(
\begin{array}{ccccc}
& & \mbox{b} & & \\
& \stackrel{1}{\nearrow} & & & \\
\mbox{a} & & & & \\
& \stackrel{2}{\searrow} & & & \\
& & \mbox{c} & \stackrel{1}{\rightarrow} & \mbox{a}
\end{array} \right) = \tau(\mbox{a},\hat{\tau}(\mbox{\bf b}),
\hat{\tau}(\mbox{c}\stackrel{1}{\rightarrow}\mbox{a})),
\]
\[
\hat{\tau}(\mbox{\bf b})=\tau(\mbox{b},nil,nil),
\]
\[
\hat{\tau}(\mbox{c}\stackrel{1}{\rightarrow}\mbox{a})=
\tau(\mbox{c},\hat{\tau}(\mbox{\bf a}),nil),
\]
\[
\hat{\tau}(\mbox{\bf a})=\tau(\mbox{a},nil,nil),
\]
where {\bf b} and {\bf a} denote the graphs with a single node labeled b,
and a, respectively. By proper substitutions, the
encoding can finally be written as
\[
\hat{\tau}\left(
\begin{array}{ccccc}
& & \mbox{b} & & \\
& \stackrel{1}{\nearrow} & & & \\
\mbox{a} & & & & \\
& \stackrel{2}{\searrow} & & & \\
& & \mbox{c} & \stackrel{1}{\rightarrow} & \mbox{a}
\end{array} \right) = \]
\[ = \tau(\mbox{a},\tau(\mbox{b},nil,nil),
\tau(\mbox{c},\tau(\mbox{a},nil,nil),nil))
\]
\end{example}
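The recursive definition of $\hat{\tau}$ in eq. (\ref{enc-fun}) translates almost verbatim into code. In the sketch below the {\tt Vertex} helper, the padding of absent subgraphs with $nil$ (so that $\tau$ always receives $o$ arguments), and the symbolic choice of $\tau$ are all assumptions made for illustration; applied to the structure of the example above, it reproduces the nested expression obtained by substitution.

```python
class Vertex:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)   # ordered subgraphs

def hat_tau(vertex, tau, nil, o):
    """Encode a DOAG recursively: nil for the void graph, otherwise
       tau(label, hat_tau(Y^(1)), ..., hat_tau(Y^(o)))."""
    if vertex is None:                   # the void DOG xi
        return nil
    subs = [hat_tau(c, tau, nil, o) for c in vertex.children]
    subs += [nil] * (o - len(subs))      # absent subgraphs are void
    return tau(vertex.label, *subs)

# The structure of the example: a with children b and c -> a (o = 2).
# A symbolic tau makes the nesting visible.
tau = lambda l, *xs: (l,) + xs
root = Vertex('a', [Vertex('b'), Vertex('c', [Vertex('a')])])
enc = hat_tau(root, tau, 'nil', 2)
# enc == ('a', ('b', 'nil', 'nil'), ('c', ('a', 'nil', 'nil'), 'nil'))
```

Replacing the symbolic $\tau$ with a neural c-model function, as done later in the chapter, leaves the recursion unchanged.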
It should be noted that causal stationary transductions cannot
compute every function of the input DOAG. For example, consider the following
two DOAGs
\[
\begin{array}{ccccccccccc}
& &\mbox{f} & & & & & \mbox{f} & &\\
& \SW & & \SE & & & \SW & & \SE & &\\
\mbox{h}\hspace{-.5cm} & & & & \hspace{-.5cm}\mbox{q} &\ \ \ \ \ \ \ & \hspace{-.5cm}\mbox{h} & & & & \hspace{-1.3cm}\mbox{q}\\
\DW\hspace{-.5cm} & & & &\hspace{-.5cm} \DW & &\SEE & & & & \hspace{-1.7cm}\SWW \\
\mbox{a}\hspace{-.5cm} & & & &\hspace{-.5cm} \mbox{a} & & & &\hspace{-1.2cm}\mbox{a} & &
\end{array}
\]
Any given transduction described by equations (\ref{tra}-\ref{enc-fun})
will necessarily map these two graphs into the
same output, regardless of the form of functions $g(\cdot)$ and $\tau(\cdot)$.
This is because the graph's processing is bottom-up, so it is not
possible to recognize that in the right-hand graph $\pa[a]$ consists
of two nodes, i.e., the ones labeled h and q.
This computational limitation, however, makes it possible to improve the
efficiency of the computation.
In fact, when considering DOAGs, the training set can be optimized by allowing
only one occurrence of the same subgraph: if there are graphs
$\BX^{(1)},\ldots,\BX^{(q)}$ which share a common subgraph $\hat{\BX}$, then
$\hat{\BX}$ can be represented only once. This is obtained by merging all the
graphs in a single minimal DOAG. After that, a topological sort on the vertices
of the minimal DOAG is performed to
determine the updating order on the vertices for the recursive network.
The advantage of having this topological order is that all
the reduced representations (and also their derivatives with respect to the
weights) can be computed by a single ordered scan of the training set.
Both the merging\footnote{The merging operation can be performed
by using either hash tables or AVL trees.}
and sorting operations can be done efficiently
(almost linearly) with respect
to the size of all DOAGs and the size of the minimal DOAG, respectively.
Moreover, the use of the minimal DOAG leads to a considerable reduction in space
complexity. In some cases, such reduction can even be exponential.
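The merging step can be sketched by canonicalizing subgraphs bottom-up with a hash table keyed on (label, ids of the children), one of the data structures mentioned in the footnote; the function name and data layout are assumptions of this sketch. Since ids are assigned bottom-up, every child id is smaller than its parent's, so increasing id order is already a topological order from the leaves up, i.e., a valid updating order for the recursive network.

```python
class Vertex:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def merge(root, table, nodes):
    """Insert the subgraph rooted at `root` into the minimal DOAG and return
       its id. table: (label, child ids) -> id; nodes[id] = (label, child ids)."""
    key = (root.label, tuple(merge(c, table, nodes) for c in root.children))
    if key not in table:
        table[key] = len(nodes)          # new substructure: assign the next id
        nodes.append(key)
    return table[key]

# Two terms sharing the subgraph g(b): f(a, g(b)) and h(g(b)).
t1 = Vertex('f', [Vertex('a'), Vertex('g', [Vertex('b')])])
t2 = Vertex('h', [Vertex('g', [Vertex('b')])])
table, nodes = {}, []
r1, r2 = merge(t1, table, nodes), merge(t2, table, nodes)
# nodes holds a, b, g(b), f(a,g(b)), h(g(b)): g(b) is stored only once.
```

A single ordered scan of {\tt nodes} then suffices to compute all reduced representations, as stated in the text.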
\section{Neural Graph Transductions}
The implementation of graph transductions through standard and recurrent
neural networks can only be realized by resorting to complex and very
unnatural encoding protocols
which map structures onto fixed-size unstructured patterns or sequences.
A more natural approach, instead, can be obtained by proper instantiation
of the input and output domains for $g$, $\hat{\tau}$, and $\tau$.
Specifically, let $\CalI^{\#}$ be the input structured domain
with real valued vectors as labels ($\CalI=\RealSet^n$), and $\CalO=\RealSet^k$.
The encoding function $\hat{\tau}$ is completely defined by choosing a
representation for the void graph and by defining the c-model function
$\tau$. By choosing $\CalX=\RealSet^m$, any real valued vector would be
fine for representing the void graph, however, for computational
reasons which will be clear in the following, we choose the null vector for
representing the void graph, i.e., $nil=\mbox{\boldmath{$0$}}\in\RealSet^m$.
Consequently the c-model function $\tau$ will be defined as
\begin{equation}\label{general1}
\tau:
\RealSet^n\times\underbrace{\RealSet^m\times\cdots\times\RealSet^m}_{\mbox{$o$ times}}
\rightarrow \RealSet^m
\end{equation}
where the first domain of the Cartesian product denotes the label space, while
the remaining domains represent the encoded subgraph spaces up to the
maximum outdegree of
the input domain $\CalI^{\#}$. Finally, we can define the output function $g$ as
\begin{equation}\label{general2}
g:\RealSet^m\rightarrow \RealSet^k.
\end{equation}
Note that eqs. (\ref{general1}) and (\ref{general2}) only describe the general form
for $\tau$ and $g$. Different realizations can be given which satisfy the
above equations.
For example, both $\tau$ and $g$ can be implemented by feedforward neural networks.
Before reaching such a level of complexity, however, it is worth exploring
simpler realizations. Specifically, in the following section we study
recursive neurons.
\section{Recursive Neurons}
Let us first consider the case in which $m=1$.
The simplest non-linear neural realization for $\tau(\cdot)$ is given by
\begin{equation}
\label{single-crn}
\tau(\mbox{\boldmath{$l$}},x_1,\ldots,x_o) =
f(\sum_{i=1}^{n}w_il_i +
\sum_{j=1}^{o}\hat{w}_jx_j+\theta),
\end{equation}
where $f$ is a sigmoidal function, $w_i$ are the weights associated with the
label space, $\hat{w}_j$ are the weights associated with the
subgraph spaces, $\theta$ is the bias, \mbox{\boldmath{$l$}} is the current input label,
and $x_1,\ldots,x_o$ are the encoded representations of subgraphs.
\begin{example}[Logical term {\tt f(a,g(b),c)}.]
\label{esempio}
Let us define a coding function
\[\phi:\{\mbox{\tt a},\mbox{\tt b},\mbox{\tt
c},\mbox{\tt f},\mbox{\tt g}\}\rightarrow \{0,1\}^5\]
which encodes
the symbolic labels into binary vectors, i.e.,
$\phi(\mbox{\tt a})=[1,0,0,0,0]$, $\phi(\mbox{\tt b})=[0,1,0,0,0]$,
$\phi(\mbox{\tt c})=[0,0,1,0,0]$, $\phi(\mbox{\tt f})=[0,0,0,1,0]$,
$\phi(\mbox{\tt g})=[0,0,0,0,1]$, and $nil=0$. Thus,
$\tau(\cdot)$ will be defined by five parameters for the label, i.e.,
$w_1,\ldots,w_5$, three parameters for the subgraphs (maximum outdegree),
i.e., $\hat{w}_1$, $\hat{w}_2$, and $\hat{w}_3$, and the bias $\theta$.
Then,
\[
\hat{\tau}\left(
\begin{array}{ccccc}
& &\mbox{\tt f} & & \\
& ^1\hspace{-.2cm}\swarrow & \DWW &
\searrow\hspace{-.2cm}^3 & \\
\mbox{\tt a\hspace{-0.5cm}} & & \mbox{\tt g} & &\hspace{-0.5cm} \mbox{\tt c}\\
& & \DW & & \\
& & \mbox{\tt b} & &
\end{array}
\right) =
\]
\begin{eqnarray*}
& = &
f(\sum_{i=1}^5 w_i\phi_i(\mbox{\tt f})+
\hat{w}_1 \overbrace{f(\sum_{i=1}^5 w_i\phi_i(\mbox{\tt
a})+\theta)}^{\hat{\tau}(\mbox{\bf a})}+\\
& & \hat{w}_2 \overbrace{f(\sum_{i=1}^5 w_i\phi_i(\mbox{\tt g})+
\hat{w}_1 \underbrace{f(\sum_{i=1}^5
w_i\phi_i(\mbox{\tt b})+\theta)}_{\hat{\tau}(\mbox{\bf b})}+\theta)}^{\hat{\tau}(\mbox{\tt
g}\stackrel{1}{\rightarrow}\mbox{\tt b})} \\
& & +\hat{w}_3 \underbrace{f(\sum_{i=1}^5
w_i\phi_i(\mbox{\tt c})+\theta)}_{\hat{\tau}(\mbox{\bf c})}+\theta).
\end{eqnarray*}
A graphical representation of the above equation is given
in Figure \ref{f:esempio-mod}.
\scaledpsfigure{esempio-mod}{Graphical representation for the
encoding function of Example 2.}{500}{htbp}
\end{example}
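The nested expression of the example above can be evaluated numerically through eq. (\ref{single-crn}) with $nil=0$; the weight values below are illustrative only, not taken from the text.

```python
import math

def f(x):
    return 1.0 / (1.0 + math.exp(-x))

# The one-hot coding function phi of the example.
phi = {'a': [1, 0, 0, 0, 0], 'b': [0, 1, 0, 0, 0], 'c': [0, 0, 1, 0, 0],
       'f': [0, 0, 0, 1, 0], 'g': [0, 0, 0, 0, 1]}

# Illustrative parameters: five label weights, three subgraph weights, a bias.
w, wh, theta = [0.1, 0.2, 0.3, 0.4, 0.5], [0.6, 0.7, 0.8], 0.0

def tau(label, subs):
    # eq. (single-crn) with nil = 0: missing subgraphs contribute nothing
    s = sum(wi * li for wi, li in zip(w, phi[label]))   # label term
    s += sum(wj * xj for wj, xj in zip(wh, subs))       # subgraph terms
    return f(s + theta)

x_a = tau('a', [])                 # hat_tau(a)
x_b = tau('b', [])                 # hat_tau(b)
x_g = tau('g', [x_b])              # hat_tau(g -> b)
x_c = tau('c', [])                 # hat_tau(c)
x_f = tau('f', [x_a, x_g, x_c])    # encoding of f(a, g(b), c)
```

The order of the arguments to {\tt tau} mirrors the order of the subgraphs, so the first child always multiplies $\hat{w}_1$, the second $\hat{w}_2$, and so on.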
When $m>1$, $\tau(\cdot)\in\RealSet^m$ can be written as
\begin{equation}
\label{crn}
\tau(\mbox{\boldmath{$l$}},\mbox{\boldmath{$x$}}^{(1)},\ldots,\mbox{\boldmath{$x$}}^{(o)}) =
\mbox{\boldmath $F$}(\mbox{\boldmath $W$}
\mbox{\boldmath $l$} +
\sum_{j=1}^{o}\widehat{\mbox{\boldmath
$W$}}_j\mbox{\boldmath{$x$}}^{(j)}+\mbox{\boldmath{$\theta$}}),
\end{equation}
where $\mbox{\boldmath $F$}_i(\mbox{\boldmath $v$})=f(v_i)$,
$\mbox{\boldmath $l$}\in \RealSet^n$,
$\mbox{\boldmath $\theta$}\in \RealSet^m$,
$\mbox{\boldmath $W$}\in \RealSet^{m\times n}$, $\mbox{\boldmath{$x$}}^{(j)}\in
\RealSet^{m}$,
$\widehat{\mbox{\boldmath $W$}}_j \in \RealSet^{m\times m}$.
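As a concrete illustration, the bottom-up computation of eq. (\ref{crn}) on the tree of Example 2 can be sketched in Python. This is only a sketch under stated assumptions: all names and sizes are illustrative, $f=\tanh$ is an arbitrary choice of squashing function, and missing children contribute the nil code (the zero vector), as in the text.

```python
import numpy as np

def encode(node, W, What, theta):
    """Encode a labeled tree bottom-up: x = F(W l + sum_j What_j x_j + theta).

    node = (label, children); an absent child is simply omitted and
    contributes the nil code (the zero vector).
    """
    label, children = node
    net = W @ label + theta
    for j, child in enumerate(children):
        net += What[j] @ encode(child, W, What, theta)
    return np.tanh(net)  # F applies f = tanh component-wise

# Encode the tree f(a, g(b), c) of Example 2 with the one-hot coding phi.
rng = np.random.default_rng(0)
n, m, o = 5, 3, 3                       # label size, state size, max outdegree
W = rng.normal(size=(m, n))
What = [rng.normal(size=(m, m)) for _ in range(o)]
theta = rng.normal(size=m)
phi = {s: np.eye(n)[i] for i, s in enumerate("abcfg")}
tree = (phi["f"], [(phi["a"], []), (phi["g"], [(phi["b"], [])]), (phi["c"], [])])
x = encode(tree, W, What, theta)        # m-dimensional code of the whole graph
```

Note that each distinct subtree is encoded by the same function with the same weights, which is exactly the weight sharing exploited by the learning algorithms of the next section.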
Concerning the output function $g(\cdot)$, it can be defined as a layer of standard
neurons taking as input the encoded representation of the graph, i.e.,
\begin{equation}\label{out-func}
g(\mbox{\boldmath{$x$}})=
\mbox{\boldmath $F$}(\mbox{\boldmath $M$}\mbox{\boldmath $x$}
+\mbox{\boldmath{$\beta$}}),
\end{equation}
where $\mbox{\boldmath $M$}\in\RealSet^{k\times m}$ and
$\mbox{\boldmath{$\beta$}}\in\RealSet^{k}$ are the weight matrix and bias terms
defining $g(\cdot)$, respectively.
As a special case, $g(\cdot)$ can be defined as a function which
selects a subset of the state vector in $\RealSet^m$. For example,
\begin{equation}\label{select}
g(\mbox{\boldmath{$x$}})=\mbox{\boldmath $M$}\mbox{\boldmath $x$},
\end{equation}
where $\mbox{\boldmath $M$}\in\RealSet^{k\times m}$ is a 0-1 matrix
with a single 1 in each row and at most a single 1 in each
column. This corresponds to having the selected
components of the state vector contribute both to the output and to the
coding of the state.
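A tiny sketch of this selection (all sizes and the state vector are illustrative assumptions):

```python
import numpy as np

# m = 5 state units, k = 2 outputs; M has a single 1 in each row and at
# most a single 1 in each column, so g(x) = M x just selects components.
m, k = 5, 2
M = np.zeros((k, m))
M[0, 1] = 1.0        # first output reads state component 1
M[1, 4] = 1.0        # second output reads state component 4

x = np.array([0.1, 0.2, 0.3, 0.4, 0.5])   # an (assumed) state vector
y = M @ x                                  # -> [0.2, 0.5]
```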
The relationship between recurrent and recursive neurons
is clarified by the following example.
\begin{example}
A recurrent neuron can be considered as a special case of a
recursive neuron applied to lists, i.e., labeled
graphs with $o=1$; in that case, the position of a vertex
within the list corresponds to the time of processing; e.g.,
given the list
\[\BL\equiv \mbox{\boldmath $l$}^{(k)}\rightarrow
\mbox{\boldmath $l$}^{(k-1)}\rightarrow \ldots \rightarrow
\mbox{\boldmath $l$}^{(1)}\]
the set of
equations derived by the recursive application of
eq. (\ref{single-crn}) is
\[x_1=\tau(\mbox{\boldmath{$l$}}^{(1)},nil)
= f(\sum_{j=1}^{n}w_jl^{(1)}_j) \]
\[ x_i= \tau(\mbox{\boldmath{$l$}}^{(i)},x_{i-1}) = f(\sum_{j=1}^{n}w_jl^{(i)}_j +
\hat{w}_1x_{i-1})\ \ \ \ i=2,\ldots ,k\]
where $\hat{w}_1$ is the weight on the recursive connection;
by making the time $t$ of processing explicit,
the above equations can be rewritten as the output $o(\cdot)$ of
a recurrent neuron as follows
\[ o(t=1) = f(\sum_{j=1}^{n}w_jl^{(1)}_j) \]
\[ o(t) = f(\sum_{j=1}^{n}w_jl^{(t)}_j + \hat{w}_1o(t-1))\ \ \ \
t=2,\ldots ,k.\]
\end{example}
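The equivalence in the example above can be made explicit in code: recursion on the list and iteration over time compute the same value. This is a sketch with assumed names (scalar state, $f=\tanh$, nil encoded as 0):

```python
import numpy as np

def recursive_on_list(labels, w, w_hat):
    """x_i = tau(l^(i), x_{i-1}), computed by recursion on the list."""
    if not labels:
        return 0.0                       # nil code
    return np.tanh(w @ labels[-1] + w_hat * recursive_on_list(labels[:-1], w, w_hat))

def recurrent_neuron(labels, w, w_hat):
    """o(t) = f(w . l^(t) + w1^ o(t-1)), iterating over time t = 1, ..., k."""
    o_t = 0.0
    for l in labels:
        o_t = np.tanh(w @ l + w_hat * o_t)
    return o_t

rng = np.random.default_rng(1)
labels = [rng.normal(size=4) for _ in range(6)]  # l^(1), ..., l^(k)
w, w_hat = rng.normal(size=4), 0.5
r1 = recursive_on_list(labels, w, w_hat)
r2 = recurrent_neuron(labels, w, w_hat)          # r1 == r2
```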
\section{Learning Algorithms}
\label{models}
In this section, we discuss how several standard supervised algorithms
for neural networks can be extended to structures.
\subsection{Back-propagation through Structure}
\label{bpts}
The task addressed by back-propagation through time networks \cite{pdp} is to
produce particular output sequences in response to specific input
sequences. These networks are generally fully recurrent: any unit
may be connected to any other. Consequently, they cannot be trained
by plain back-propagation. A trick, however, can be used to
turn an arbitrary recurrent network into an equivalent feed-forward network
when the input sequences have a maximum length $T$. In this case, all the
units
of the network can be duplicated $T$ times ({\it unfolding of time}),
so that the state of a unit in the recurrent network at time $t$ is held
by the $t$th copy of the same unit in the feed-forward network.
By preserving the same weight values through the layers of the
feed-forward network, it is not difficult to see that the two networks
will behave identically for $T$ time steps.
The feed-forward network can be trained by back-propagation, taking care to
preserve the identity constraint between weights of different layers. This
can be guaranteed by adding together the individual gradient contributions
of
corresponding copies of the same weight and then changing all copies
by the total amount.
Back-propagation through Time can easily be extended
to structures. The basic idea is to use recursive neurons
to encode the structures, i.e., $\hat{\tau}(\cdot)$; the
representations obtained are then
classified or used to approximate an unknown function by a standard
feed-forward network, i.e., $g(\cdot)$.
Following our framework, given an input graph $\BX$, the network output
$\transd(\BX)$
can be expressed as the composition of the
encoding function $\hat{\tau}(\cdot)$ and
the classification (or approximation) function $g(\cdot)$:
\begin{equation}
\transd(\BX)=g(\hat{\tau}(\BX)).
\end{equation}
Learning for the set of weights $\mbox{\boldmath $W$}_{g}$
can be implemented by plain back-propagation on the
feed-forward network realizing $g(\cdot)$
\begin{equation}
\Delta \mbox{\boldmath $W$}_{g}=
-\eta \frac{\partial Error(g(\mbox{\boldmath $y$}))}
{\partial \mbox{\boldmath $W$}_{g}},
\end{equation}
where $\mbox{\boldmath $y$}=\hat{\tau}(\BX)$, i.e., the input to
the feed-forward
network, while learning for the set of weights,
$\mbox{\boldmath $W$}_{\hat{\tau}}$,
realizing $\hat{\tau}(\cdot)$ can be implemented by
\begin{equation}
\Delta \mbox{\boldmath $W$}_{\hat{\tau}}=
-\eta \frac{\partial Error(g(\mbox{\boldmath $y$}))}
{\partial \mbox{\boldmath $y$}}\frac{\partial \mbox{\boldmath $y$}}
{\partial \mbox{\boldmath $W$}_{\hat{\tau}}},
\end{equation}
where the first term represents the error coming from the feed-forward
network and the second one represents the error due to the encoding
function $\hat{\tau}(\cdot)$. Note that $\mbox{\boldmath $W$}_{\hat{\tau}}$
is actually referring to the weights associated with the c-model
function $\tau(\cdot)$. Thus, in order to clarify this point, we will
denote this weight matrix as $\mbox{\boldmath $W$}_{\tau}$.
Goller and K{\"u}chler \cite{goller96p} devised a learning
algorithm, namely {\it Backpropagation Through Structure},
on the basis of the above equations.
The basic idea is to observe that
$\frac{\partial \mbox{\boldmath $y$}}{\partial \mbox{\boldmath
$W$}_{\tau}}$
can be computed by backpropagating the error from the feed-forward
network implementing $g(\cdot)$ through the network implementing
the encoding function $\hat{\tau}(\BX)$. As in
back-propagation
through time, the gradient contributions of corresponding copies
of the same weight are collected for each structure. The total amount is
then
used to change all the copies of the same weight. If learning is
performed
by {\it structure}, then the weights are updated after the presentation
of each individual structure; otherwise, the gradient contributions are
collected
over the whole training set and the weights are changed after all
the structures in the training set have been presented to the network.
Recalling that the encoding network of a graph $\BX$ is defined
as the composition of the network implementing $\hat{\tau}(\BX)$
with the network implementing $g(\cdot)$, the learning algorithm
for an (optimized) training set of $L$ structures
can be summarized in the following way:
\begin{figure}[htp]
\noindent{\bf Backpropagation Through Structure}
\begin{description}
\item {\bf Repeat}
\item {\bf for} $l=1$ {\bf to} $L$
\begin{description}
\item Compute $\transd(\BX)$;
\item Compute the gradient using the Backpropagation Algorithm
over the encoding network of $\BX$;
\item Collect (add) the gradient for the different instances of
weights in $\mbox{\boldmath $W$}_{\tau}$;
\end{description}
\item Update $\mbox{\boldmath $W$}_{\tau}$ and $\mbox{\boldmath $W$}_{g}$;
\item {\bf Until} Convergence
\end{description}
\end{figure}
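The scheme above can be sketched for the simplest instance, a single recursive neuron ($m=1$) with the identity as output function $g(\cdot)$ and squared error at the supersource. All names are assumptions, and $f=\tanh$; the key point is that the gradient copies of the shared weights $w$ and $\hat{w}_j$ are summed over all vertices, as described in the text:

```python
import numpy as np

def forward(node, w, what):
    """Encode a tree bottom-up; return a cache (x, child caches, label)."""
    label, children = node
    ccs = [forward(c, w, what) for c in children]
    net = float(w @ label) + sum(what[j] * cc[0] for j, cc in enumerate(ccs))
    return (np.tanh(net), ccs, label)

def backward(cache, delta, what, gw, gwhat):
    """delta = dE/dnet at the current vertex; gradient contributions of
    the copies of the shared weights are summed over all vertices."""
    x, ccs, label = cache
    gw += delta * label                  # copies of the label weights
    for j, cc in enumerate(ccs):
        gwhat[j] += delta * cc[0]        # copies of the recursive weights
        backward(cc, delta * what[j] * (1.0 - cc[0] ** 2), what, gw, gwhat)

def bpts_gradient(tree, target, w, what):
    cache = forward(tree, w, what)
    x = cache[0]                         # g is taken to be the identity here
    gw, gwhat = np.zeros_like(w), np.zeros_like(what)
    backward(cache, (x - target) * (1.0 - x ** 2), what, gw, gwhat)
    return 0.5 * (x - target) ** 2, gw, gwhat

rng = np.random.default_rng(2)
n = 3                                    # label size; max outdegree o = 2
w, what = rng.normal(size=n), rng.normal(size=2)
leaf = lambda: (rng.normal(size=n), [])
tree = (rng.normal(size=n), [leaf(), (rng.normal(size=n), [leaf()])])
E, gw, gwhat = bpts_gradient(tree, 0.7, w, what)
```

In a learning-by-structure regime the returned gradients would be applied after each tree; in batch mode they would be accumulated over the whole training set first.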
\subsection{Extension of Real-Time Recurrent Learning}
The extension of Real-Time Recurrent Learning \cite{williams89j} to
recursive neurons does not present particular problems. Here we
consider a neural model where $g(\cdot)$ is implemented as in equation
(\ref{select}), while the encoding is performed by $m$ recursive
neurons, i.e., (by including the bias into the label weight matrix)
\begin{equation}
\label{crn2}
\hat{\tau}(\BX)=\tau(\mbox{\boldmath{$l$}},\mbox{\boldmath{$x$}}^{(1)},\ldots,\mbox{\boldmath{$x$}}^{(o)}) =
\mbox{\boldmath $F$}(\mbox{\boldmath $W$}
\mbox{\boldmath $l$} +
\sum_{j=1}^{o}\widehat{\mbox{\boldmath
$W$}}_j\mbox{\boldmath{$x$}}^{(j)}),
\end{equation}
where $\mbox{\boldmath{$x$}}^{(j)}=\hat{\tau}(\BX^{(j)})$ with
$source(\BX^{(j)})=ch_j[source(\BX)]$, and
\begin{equation}
\mbox{\boldmath $W$}_{\tau}=[\mbox{\boldmath
$W$},\widehat{\mbox{\boldmath
$W$}}_1, \ldots,\widehat{\mbox{\boldmath $W$}}_o].
\end{equation}
Here we show how to compute the derivatives of $\hat{\tau}(\BX)$
leaving to the reader the derivation of the final learning rule according to
the chosen error function.
The derivatives of $\hat{\tau}(\BX)$ with respect
to $\mbox{\boldmath $W$}$
and $\widehat{\mbox{\boldmath $W$}}_i$ ($i\in [1,\ldots,o]$)
can be computed as follows:
\begin{equation}\label{rtl-eq1}
\frac{\partial \hat{\tau}_p(\BX)}{\partial W_{tk}} =
\tau_p'(\mbox{\boldmath $l$}_{k}\delta_{pt}+
\sum_{j=1}^{o}(\widehat{\mbox{\boldmath $W$}}_j)_p
\frac{\partial \mbox{\boldmath{$x$}}^{(j)}}{\partial W_{tk}}),
\end{equation}
where $\tau_p'$ is the first derivative of the $p$th component of
$\tau(\cdot)$ computed on the supersource of $\BX$,
$\delta_{pt}$ is the Kronecker delta, $p=1,\ldots,m$, $t=1,\ldots,m$,
$k=0,\ldots,n$ and
$(\widehat{\mbox{\boldmath $W$}}_j)_p$
is the $p$th row of $\widehat{\mbox{\boldmath $W$}}_j$;
\begin{equation}\label{rtl-eq2}
\frac{\partial \hat{\tau}_p(\BX)}{\partial (\widehat{\mbox{\boldmath
$W$}}_i)_{tq}} = \tau_p'(\mbox{\boldmath{$x$}}^{(i)}_q\delta_{pt}+
\sum_{j=1}^{o}
(\widehat{\mbox{\boldmath $W$}}_j)_p
\frac{\partial \mbox{\boldmath{$x$}}^{(j)}}{\partial
(\widehat{\mbox{\boldmath $W$}}_i)_{tq}}),
\end{equation}
where $t=1,\ldots,m$, and $q=1,\ldots,m$.
These equations are recursive on the structure $\BX$, and can be computed
by noting that if $\BV$ is a graph composed by a single vertex, then
\begin{equation}
\frac{\partial \hat{\tau}_p(\BV)}{\partial W_{tk}} =
\tau_p'(\mbox{\boldmath $l$}_{source(\BV)})_k\delta_{pt},
\ \mbox{and}\ \
\frac{\partial \hat{\tau}_p(\BV)}{\partial
(\widehat{\mbox{\boldmath $W$}}_i)_{tq}} = 0.
\end{equation}
This allows the computation of the derivatives in real time
alongside the computation of the neural representations for
the graphs.
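These recursive derivatives can be sketched, for the scalar case $m=1$ with $f=\tanh$ (all names are illustrative assumptions), by propagating them bottom-up together with the states:

```python
import numpy as np

def rtrl_encode(node, w, what):
    """Return x(v) together with dx/dw and dx/dwhat, in one bottom-up pass."""
    label, children = node
    rs = [rtrl_encode(c, w, what) for c in children]
    net = float(w @ label) + sum(what[j] * r[0] for j, r in enumerate(rs))
    x = np.tanh(net)
    fp = 1.0 - x ** 2                  # tau', i.e. f'(net)
    # scalar form of eq. (rtl-eq1): direct term l_k plus the children's terms
    dw = fp * (label + sum(what[j] * r[1] for j, r in enumerate(rs)))
    # scalar form of eq. (rtl-eq2): direct term x^(i) plus the children's terms
    dwhat = np.zeros_like(what)
    for i in range(len(what)):
        direct = rs[i][0] if i < len(rs) else 0.0   # missing child -> nil
        dwhat[i] = fp * (direct + sum(what[j] * r[2][i] for j, r in enumerate(rs)))
    return x, dw, dwhat

rng = np.random.default_rng(3)
n = 3
w, what = rng.normal(size=n), rng.normal(size=2)
tree = (rng.normal(size=n),
        [(rng.normal(size=n), []),
         (rng.normal(size=n), [(rng.normal(size=n), [])])])
x, dw, dwhat = rtrl_encode(tree, w, what)
```

The base case for a single vertex falls out automatically: with no children the sums vanish, leaving $\tau' l_k$ and $0$ as in the text.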
\subsection{Recursive Cascade Correlation}
In this section we discuss how a neural graph transduction $\transd$ can be
learned using an extension of the Cascade Correlation algorithm.
The standard Cascade-Correlation algorithm \cite{fahlman90p} creates a
neural network using
an incremental approach for the classification (or regression) of unstructured patterns.
The starting network ${\cal N}_0$ is a network without hidden
nodes trained by a Least Mean Square algorithm.
If network ${\cal N}_0$ is not able
to solve the problem, a hidden unit $u_1$ is
added such that the {\it correlation} between the output of the
unit and the residual error of network ${\cal N}_0$ is
maximized\footnote{Since
the maximization of the correlation is obtained using a gradient
ascent technique on a surface with several maxima, a pool of
hidden units is trained and the best one selected.}. The weights
of $u_1$ are frozen and the remaining weights are retrained. If the
obtained network ${\cal N}_1$ cannot solve the problem,
new hidden units are added which are connected (with frozen
weights) with all the inputs and previously installed hidden units. The
resulting
network is a {\it cascade} of nodes. Fahlman extended the algorithm to
the classification of sequences, obtaining good results \cite{fahlman91p}.
In the following, we show that the Cascade-Correlation can be further extended to
structures by using our computational scheme \cite{sperduti96c1}. In fact,
the shape of the c-model function
can be expressed component-wise by the following set of equations:
\begin{eqnarray}
\tau_1 & = &
h_1(\mbox{\boldmath{$l$}},\hat{\tau}_1(\BY^{(1)}),\ldots,\hat{\tau}_1(\BY^{(o)}))\\
\tau_2 & = &
h_2(\mbox{\boldmath{$l$}},\hat{\tau}_1(\BY^{(1)}),\ldots,\hat{\tau}_1(\BY^{(o)}),\nonumber\\
& &\hat{\tau}_2(\BY^{(1)}),\ldots,\hat{\tau}_2(\BY^{(o)}),\hat{\tau}_1(\BY))\\
& \vdots &\nonumber \\
\tau_m & = &
h_m(\mbox{\boldmath{$l$}},\hat{\tau}_1(\BY^{(1)}),\ldots,\hat{\tau}_1(\BY^{(o)}),\nonumber\\
& &\hat{\tau}_2(\BY^{(1)}),\ldots,\hat{\tau}_2(\BY^{(o)}),\ldots\nonumber\\
& & ~~\ldots,\hat{\tau}_m(\BY^{(1)}),\ldots,\hat{\tau}_m(\BY^{(o)}),\nonumber\\
& &\hat{\tau}_1(\BY),\ldots,\hat{\tau}_{m-1}(\BY))
\label{ccs-full}
\end{eqnarray}
where the $h_i$ are suitable nonlinear functions of their arguments.
Specifically, in our framework, the output of the $k$th hidden unit can be computed as
\begin{eqnarray}
\label{hidden-scc}
\lefteqn{\tau_k(\mbox{\boldmath{$l$}},\mbox{\boldmath{$x$}}^{(1)},\ldots,\mbox{\boldmath{$x$}}^{(o)}) =}\nonumber\\
& & f(\sum_{i=0}^{n}w^{(k)}_il_i +
\sum_{v=1}^{k}\sum_{j=1}^{o}
\hat{w}^{(k)}_{(v,j)}\mbox{\boldmath{$x$}}^{(j)}_v+
\sum_{q=1}^{k-1}\bar{w}^{(k)}_q\tau_q(\mbox{\boldmath{$l$}},\mbox{\boldmath{$x$}}^{(1)},\ldots,\mbox{\boldmath{$x$}}^{(o)})),
\end{eqnarray}
where $\hat{w}^{(k)}_{(v,j)}$ is the weight of
the $k$th hidden unit associated with the output of
the $v$th hidden unit
computed on the $j$th subgraph code
$\mbox{\boldmath{$x$}}^{(j)}$, and $\bar{w}^{(k)}_q$
is the weight of
the connection from the $q$th (frozen) hidden unit,
$q<k$.
\begin{teo}
For each $\epsilon>0$ there exists a function $f^\epsilon \in {\cal F}$ such
that $\forall s\in \CalS,\ P(|f^\epsilon(s) - \bar{f}(s)|>\epsilon)<\epsilon$.
\end{teo}
This result is quite interesting, since it states that if the set of possible
labels is finite, then any mapping from trees to the set of reals can be
implemented by a recursive neural network having a sufficient number of
parameters. No universal mapping result for arbitrary real-valued vectors as
labels, however, has been established so far.
\section{Complexity Issues}
A usual problem in neural networks is the evaluation of how
many resources a network needs to implement the desired function.
In this section, we explore this problem from a node complexity
perspective when considering the implementation of
{\em Frontier-to-Root Tree Automata with Output} (FRAO)
into recursive neural networks.
Here we report upper bounds
on the number of units needed to implement a given FRAO in four-,
three-, and two-layer recursive networks. While the bound for four-layer networks
constitutes a generalization of the bound discussed in \cite{horne96j1} for the
implementation of FSM in recurrent networks, the bound for three-layer
networks produces, as a special case, a new bound on the implementation
of FSA in three-layer networks. Moreover, the bound for three-layer
networks is constructive and, thus, it may prove useful for practical
applications where a priori knowledge, available in the
form of formal languages or inferred by symbolic machine learning
approaches, can be injected into the neural network (for the case of recurrent
neural networks see Omlin and Giles~\cite{OmlGil96a}, Frasconi et
al.~\cite{FGS95}).
Its constructive proof is based on the encoding of
each possible input configuration of the FRAO by a different integer in
a predefined interval. This integer is then processed by
the repeated application
of the {\em telescopic} technique \cite{discrete-ncb}, so called
because it consists in the progressive activation of a set
of units according to the magnitude of the processed integer.
On the basis of this progressive encoding, it is possible to
implement both the transition and output function of the desired FRAO.
Finally, a lower bound based on a counting argument is reported, which
demonstrates the optimality of the upper bound obtained for
four-layer recursive networks.
\subsection{Representing FRAO as Boolean Functions}
In \cite{horne96j1} the $m$ states of an FSA are encoded through
$\log m$ bits. Hence, the state and output transition functions
can be implemented by a boolean function
\begin{equation}
\label{bool}
\{0,1 \}^{\log m+\log l} \rightarrow \{0,1 \}^{\log m+\log l}
\end{equation}
where the $\log m$ input bits are used to encode the current state,
while the remaining $\log l$ bits are used to encode the current input label.
Similarly, $\log m$ output bits are used to encode the new state,
and the remaining bits to encode the output.
In a similar way an FRAO can be implemented by a boolean function
\begin{equation}
\label{bool2}
\{0,1 \}^{\log l + N \log m} \rightarrow \{0,1 \}^{\log l + \log m}
\end{equation}
where $\log l$ is the number of bits encoding the input-output symbols,
and $N$ is the maximum rank for the symbols.
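As a small illustrative helper (the function name is an assumption, and the logarithms are rounded up to integer bit counts), the widths of this boolean function can be computed as:

```python
from math import ceil, log2

def frao_bits(m, l, N):
    """Input/output widths of the boolean function implementing an FRAO
    with m states, l input-output labels, and maximum rank N:
    {0,1}^(log l + N log m) -> {0,1}^(log l + log m)."""
    state, label = ceil(log2(m)), ceil(log2(l))
    return label + N * state, label + state

# e.g. m = 4 states, l = 2 labels, maximum rank N = 2:
in_bits, out_bits = frao_bits(4, 2, 2)   # -> (5, 3)
```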
\subsection{Upper Bounds}
Also in this case, the fact that a sigmoid function can approximate
a step function to an
arbitrary degree of precision by increasing the norm of the associated
weight vector implies that any complexity bound based on threshold gates
is also valid for sigmoidal networks.
It is known that Finite State Automata with $m$
states and a binary alphabet can be inserted into four-layer recurrent
neural networks
with $O(\sqrt{m})$ units \cite{horne96j1}. This result was obtained
by exploiting the following lemma by Lupanov \cite{Lupanov72}:
\begin{lemma}
Arbitrary boolean logic functions of the form $f:\{0,1\}^x
\rightarrow \{0,1\}^y$ can be implemented in a four-layer
network of perceptrons with a node complexity of
$O(\sqrt{\frac{y2^x}{x-\log y}}).$
\end{lemma}
Using the same lemma the following theorem can be proved:
\begin{teo}[Four Layers Insertion]
Any FRAO with
$m$ states, $l$ input-output labels, and maximum rank $N$,
can be implemented by a four-layer recursive neural network
with a node complexity of
$O(\sqrt{\frac{(\log l+\log m)lm^{N}}{\log l+N\log m}})$.
If $l