
Moisl [

Moisl [

Humans intuitively feel that they possess a head-internal meaningfulness, that is, an awareness of the self and its relationship to the perceived world which is independent of interpretation of one's behaviour by observers. This intuition is captured by the philosophical concept of intentionality [

In 1980 and subsequent publications [

The Room is, of course, a computer. Searle is the CPU, the list of English instructions is a program, and the input-output sequences are symbol strings; by concluding that the room understands Chinese, its observers have confirmed the Turing Test [

Searle's position remains controversial among philosophers of mind and cognitive scientists more generally after four decades [

The black box problem in system identification [

Applied to black boxes in general, the doctrine of emergence in the philosophy of science [

For linguistic meaning the black box is the human head and the input-output behaviour is conversation. The CTM view of what's in the head is that it is a Turing Machine whose program is cognition. When the box is opened, however, one looks in vain for the data structures and algorithms of CTM, and finds instead billions of interconnected neurons. Some have argued that study of the brain by cognitive neuroscience will supplant the theoretical ontology of CTM, but this is not the majority view [

Proposal of an implementation model implies clarity about what is being implemented. 'Meaning' is understood, and its theoretical characterization is approached, in a variety of ways, for an overview of which see [

According to Searle, 'intentionality in human beings (and animals) is a product of causal features of the brain', and 'any attempt literally to create intentionality artificially (strong AI) could not succeed just by designing programs but would have to duplicate the causal powers of the human brain'. Since the validity of Searle's position was and in the present discussion continues to be assumed, the choice of artificial neural networks (ANN) as the modelling framework was obvious: though radically simplified with respect to the biological brain, they do retain its fundamental architectural characteristics as a collection of massively interconnected processing units that learns to represent environmental inputs via synaptic strength modification. The research question that motivates the present discussion thereby becomes: How can the brain or a physical mechanism analogous to it implement intrinsic intentionality?

The solution proposed for implementation of lexical intentionality was the structure of interconnected ANNs shown in

The structure of the lexical intentionality model.

Spoken word and visual inputs from an environment are simultaneously presented to their eponymous subnets, where sequences of representations are generated in their respective hidden layers; the numbers of units in the various subnets are small for tractability of graphical presentation and would need to be much larger in a practical implementation. These representations are associated in the association subnet, whose hidden layer was argued to be the implementation of lexical intentionality, that is, of the meaning of the word.

Fundamental to the model is the autoassociative multilayer perceptron (aMLP), an example of which is shown in

Example of an autoassociative multilayer perceptron.

The aMLP appears as the audio input subnet in the model: presentation of an input vector I_{j} generates I_{j} in the output units. The hidden layer, which contains fewer units, is a compact representation of I_{j}, and when the representations of all the components of I were cluster analyzed, their structure was found to be similar to that of I; 'similar' as understood here is defined in Section 3. For example, if I is the collection of the 26 letter forms of the Roman alphabet represented as 12 × 12 bitmaps, as shown in

A letter bitmap.

Cluster trees for letter bitmaps and their aMLP representations.

The idea that the structure of head-internal representations generated by a cognitive agent's interaction with an agent-external environment is similar to the spatial and temporal structure of that environment has a long history in cognitive science. It was proposed in Antiquity by philosophers like Aristotle, Augustine, and Boethius [

The relevance of similarity-based models to present concerns is that they can be understood as implementation-level models of intrinsic intentionality in biological brains. Their implication is that the formal similarity structure of the neural activations which causally drive brain dynamics reflects the similarity structure of mind-external objects and their interactions, and is thereby 'about' the mind-external world without involvement of a system-external interpreter.

The model in

causally generates its own system-internal representations of external environmental input.

The physical form of these representations is determined by that which they represent.

For a given environmental domain, the structure of the representations is similar to that of the domain and thereby models it.

The representations play a causal role in the input-output behaviour of the system.

Assuming the validity of the foregoing comments about the relevance of structural similarity to modelling of intentionality, the structure in

Finally, it is freely admitted that _{val} scholastic philosophy called universals [

This part of the discussion describes a way of understanding the foregoing preservation of input similarity structure in system-internal representations as mathematical homomorphism; the references for standard mathematical topics used in what follows are Gowers et al [

In current mathematics a space is understood as a pair S = (Obj, Op), where Obj is a set of mathematical objects of some particular type and Op is a set of operations defined on Obj such as scalar multiplication defined on vectors. The input, hidden, and output layers of ANNs are mathematically represented as real-valued vectors, so what follows will focus on vector spaces.

A vector space is a set V of vectors and the associated operations are vector addition and multiplication of a vector by a scalar. In what follows, some way of characterizing distances among vectors in V will be required, and these two operations on their own are insufficient for that. What is required is a metric d, that is, a function that returns a measure of the distance between two vectors v and w, both in V, inclusion of which transforms a vector space into a metric space. The most familiar metric space is n-dimensional Euclidean space, in which the metric is Euclidean distance, the shortest distance between two points. Input, hidden, and output aMLP layers are here interpreted as Euclidean spaces.
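For concreteness, a metric must satisfy identity, non-negativity, symmetry, and the triangle inequality. The following minimal sketch (illustrative only, not from the model) checks these properties numerically for the Euclidean metric:

```python
import numpy as np

rng = np.random.default_rng(0)

def euclidean(v, w):
    """Euclidean metric d(v, w) = ||v - w||_2."""
    return float(np.linalg.norm(v - w))

v, w, u = rng.normal(size=(3, 5))  # three arbitrary vectors in R^5

assert euclidean(v, v) == 0.0                                # identity
assert euclidean(v, w) >= 0.0                                # non-negativity
assert np.isclose(euclidean(v, w), euclidean(w, v))          # symmetry
assert euclidean(v, u) <= euclidean(v, w) + euclidean(w, u)  # triangle inequality
```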

A homomorphism is a structure-preserving map between two spaces ([

Every linear transformation of vectors in a Euclidean space is homomorphic. This is exemplified in principal component analysis (PCA), which projects a Euclidean input space E_{1} into a Euclidean output space E_{2} by linear combination of the vectors in E_{1}; the distance structure of E_{1} is preserved in E_{2}. For example,

Cluster analyses of L and its PCA transformation, both 144-dimensional.

Hierarchical cluster analysis constructs its trees on the basis of relative Euclidean distances between and among vectors; the trees in

Cluster analyses of 144-dimensional L and the PCA-transformed 50-dimensional version.

Again, the trees are identical. But, as noted, there is a limit. PCA is often used to reduce high-dimensional matrices to dimensionality 2 or 3 for graphical display; reduction to dimensionality 3 is shown in

Cluster analyses of 144-dimensional L and the PCA-transformed 3-dimensional version.

There is a family resemblance between the trees, but they also differ substantially. What has happened to distance preservation is that dimensionality has been reduced to a value lower than the intrinsic dimensionality of the letter data matrix, where intrinsic dimensionality is the minimum number of variables required to represent a given data matrix without significant loss of information ([
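The effect described above can be checked numerically. The following sketch (synthetic data, not the letter matrix L) builds a dataset with intrinsic dimensionality 5 embedded in 20 dimensions, and shows that PCA projection preserves pairwise Euclidean distances at or above the intrinsic dimensionality but distorts them below it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: intrinsic dimensionality 5, embedded in 20 dimensions
Z = rng.normal(size=(30, 5))          # latent coordinates
W = rng.normal(size=(5, 20))          # linear embedding
X = Z @ W
Xc = X - X.mean(axis=0)               # mean-centre

def pairwise(A):
    """Vector of all pairwise Euclidean distances between rows of A."""
    diffs = A[:, None, :] - A[None, :, :]
    return np.sqrt((diffs ** 2).sum(-1))[np.triu_indices(len(A), k=1)]

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
full = Xc @ Vt.T                      # full-rank PCA: an orthogonal rotation
top5 = Xc @ Vt[:5].T                  # at the intrinsic dimensionality
top2 = Xc @ Vt[:2].T                  # below it

assert np.allclose(pairwise(Xc), pairwise(full))      # preserved exactly
assert np.allclose(pairwise(Xc), pairwise(top5))      # preserved: rank(Xc) = 5
assert not np.allclose(pairwise(Xc), pairwise(top2))  # distorted below intrinsic dim
```

This mirrors the behaviour of the cluster trees: identical down to the intrinsic dimensionality, divergent below it.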

PCA provides a criterion for identifying intrinsic dimensionality. One of the ways of calculating PCA is by matrix eigendecomposition which, briefly, works as follows. Given a mean-centred m × n matrix M, PCA creates a covariance matrix C = M^{T}M/(m-1) and then abstracts two matrices from C: E_{vect}, whose columns are the eigenvectors of C and constitute the orthogonal basis of a new vector space into which M will be projected, and E_{val}, which is a diagonal matrix containing the eigenvalues of C in descending order of magnitude and which represent the lengths of the new basis vectors, that is, the amount of variance in M that each of the basis vectors represents. It often happens that data is redundant in the sense that the variance of its variables overlaps. The eigenvalues in E_{val} make it possible to identify such redundancy: the largest eigenvalue and the corresponding eigenvector represent the largest direction of variance in M, the second-largest eigenvalue and the corresponding eigenvector represent the second-largest direction of variance in M, and so on to n. Plotting the eigenvalues provides an indication of intrinsic dimensionality, that is, of how many mutually orthogonal variables are required to represent the variability in M without significant loss of information. The eigenvalue plot for the covariance matrix abstracted from L is shown in

Distribution of E_{val}.

The intrinsic dimensionality is about 25; going below that compromises distance preservation.
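The eigendecomposition steps just described can be sketched as follows (numpy, random data in place of L; the explained-variance vector is what an eigenvalue plot visualizes):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(40, 6))
M = M - M.mean(axis=0)                         # mean-centre

C = M.T @ M / (M.shape[0] - 1)                 # covariance matrix C = M^T M / (m-1)
evals, E_vect = np.linalg.eigh(C)              # eigh returns ascending order
order = np.argsort(evals)[::-1]                # re-sort: descending magnitude
E_val, E_vect = evals[order], E_vect[:, order]

scores = M @ E_vect                            # M projected into the eigenvector basis
explained = E_val / E_val.sum()                # variance fraction per basis vector

# The projected data has diagonal covariance: the eigenvalues themselves
assert np.allclose(scores.T @ scores / (M.shape[0] - 1), np.diag(E_val))
```

Plotting `explained` (or its cumulative sum) and looking for the point where additional components contribute little further variance is the intrinsic-dimensionality criterion used in the text.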

The reason for going into all this is that, since Baldi & Hornik made the connection explicit in 1989, the aMLP architecture has been recognized as an implementation of PCA (for example, [

An alternative to the eigenvalue decomposition method for calculating PCA is singular value decomposition (SVD; [

SVD is a theorem in linear algebra which says that any real-valued matrix D with m rows and n columns can be represented as the product of three matrices:

where

U, S, and V are the matrices whose product gives D.

The column vectors of U are the eigenvectors of the square matrix which results from the multiplication of D by a transposition of itself, that is, DD^{T}, and these constitute an orthonormal basis for the column vectors of D.

The column vectors of V are the eigenvectors of the square matrix which results from the multiplication of D^{T} by D, that is, D^{T}D, and these constitute an orthonormal basis for the row vectors of D.

S is a diagonal matrix, that is, a matrix whose only non-zero entries lie on its main diagonal; here those entries are non-negative real values. They are the singular values of D in descending order of magnitude, and are the square roots of the eigenvalues of DD^{T} (equivalently, of D^{T}D).

When D is a covariance or correlation matrix, SVD and PCA are identical. SVD is more general than PCA because it can be applied to matrices of arbitrary dimensions with unrestricted numerical values, whereas PCA is restricted to square matrices containing covariances or correlations; in practice, however, it is a straightforward matter to calculate a covariance or correlation matrix for whatever matrix one wants to analyze, so the choice between SVD and PCA is a matter of preference.

Both eigenvalue decomposition and SVD simply restate the given matrix D in a new vector space having an orthogonal basis and the same dimensionality as D. But one of the main uses of PCA is dimensionality reduction. With the eigendecomposition approach this is achieved by selecting the k largest eigenvalues from E_{val} and deleting columns k+1...n from E_{vect} prior to multiplication by M, that is, ME_{vect}, thereby projecting the original n-dimensional data into a k-dimensional space. The corresponding SVD operation is to select the first k columns from S, yielding S_{k}, and then to multiply US_{k}, which results in an m × k matrix consisting of the largest k principal components.
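The correspondence between the two routes, and the truncation operation just described, can be verified numerically in a short sketch (numpy, synthetic mean-centred data):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.normal(size=(30, 8))
M = M - M.mean(axis=0)

U, s, Vt = np.linalg.svd(M, full_matrices=False)
assert np.allclose((U * s) @ Vt, M)                   # D = U S V^T

# Singular values squared, scaled by (m-1), are the covariance eigenvalues
C = M.T @ M / (M.shape[0] - 1)
evals = np.sort(np.linalg.eigvalsh(C))[::-1]
assert np.allclose(s ** 2 / (M.shape[0] - 1), evals)

# Projection onto the first k right singular vectors equals U_k S_k
k = 3
assert np.allclose(M @ Vt[:k].T, U[:, :k] * s[:k])
```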

How does all this relate to aMLP architecture? Given D, an aMLP with hidden layer dimensionality k approximates the US_{k} product and V matrices of SVD by using a gradient descent method, backpropagation, to optimize the standard mean squared error objective function, which minimizes the difference between the target output and the actual output of the aMLP with respect to D. Once trained, each of the m row vectors of D generates a k-dimensional hidden layer activation vector, and all m hidden layer vectors constitute an m × k matrix H which is an approximation to the US_{k} matrix of SVD; for details see Aggarwal ([
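As a hedged illustration (plain full-batch gradient descent, synthetic low-rank data standing in for D), the following sketch trains a linear aMLP-style network and shows the reconstruction error falling substantially as the hidden layer learns a compact representation:

```python
import numpy as np

rng = np.random.default_rng(2)

# Low-rank data plus a little noise: rank-3 structure in 8 dimensions
D = rng.normal(size=(30, 3)) @ rng.normal(size=(3, 8)) \
    + 0.05 * rng.normal(size=(30, 8))
D = D - D.mean(axis=0)
m, n, k = 30, 8, 3

W1 = rng.normal(scale=0.1, size=(n, k))   # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(k, n))   # hidden -> output weights
lr = 0.05
losses = []
for _ in range(3000):
    H = D @ W1                            # hidden layer activations (m x k)
    E = H @ W2 - D                        # output error
    losses.append(float((E ** 2).mean()))
    dW2 = H.T @ E / m                     # MSE gradients
    dW1 = D.T @ E @ W2.T / m
    W1 -= lr * dW1
    W2 -= lr * dW2
```

After training, the rows of H play the role of the m × k matrix H in the text, approximating the US_{k} subspace of the data.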

There is, however, a caveat. Homomorphism is a concept from linear algebra, and it applies to linear functions of which PCA is one. When the unit activation functions of an aMLP are uniformly linear the homomorphism property of PCA transfers directly [

Bourlard & Kamp [

The obvious way to test this in the present application is to train an aMLP with a nonlinear hidden layer and a linear output layer using the letter bitmap matrix L, and then to compare the cluster tree for the matrix of hidden layer vectors H generated by the trained net with that for L. The approach needs to be more nuanced than that, however. The number of hidden units k in an aMLP, and in artificial neural networks generally, is known to have a strong effect on convergence to whatever function is being implemented, and choice of that number has traditionally been heuristic: a heuristic choice of k that yields a negative result with respect to homomorphism doesn't necessarily mean that nonlinear aMLPs fail to preserve homomorphism, because a different choice of k might give a positive result. One approach is simply to try a series of random k. A more systematic approach, taken here, is as follows:

Train the net using a range of hidden layer sizes, say k = 1...200, and then generate the hidden layer matrix H as above.

Calculate a matrix D_{hidden} containing the Euclidean distances between all pairs of rows in H.

Calculate a matrix D_{input} containing the Euclidean distances between all pairs of rows in L.

Extract the values below the main diagonal of D_{hidden} and of D_{input} and concatenate them row-wise to yield two vectors dv_{hidden} and dv_{input} respectively.

Pearson-correlate dv_{hidden} and dv_{input} and save the correlation value in a vector v_{corr}.

Plot v_{corr}.

The underlying intuition is that the correlation vector captures the degree of distance structure similarity between L and H.
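The steps above can be sketched as follows (assumed details: tanh hidden units, linear output, plain gradient descent; a small random binary matrix stands in for the bitmap matrix L, and only a few hidden sizes are sampled):

```python
import numpy as np

rng = np.random.default_rng(3)
L = (rng.random((26, 20)) > 0.5).astype(float)   # stand-in for the bitmap matrix
m = len(L)

def lower_triangle_distances(X):
    """Vector of pairwise Euclidean distances (below-diagonal entries)."""
    diffs = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diffs ** 2).sum(-1))
    return D[np.tril_indices(m, k=-1)]

def train_amlp(X, k, epochs=2000, lr=0.05):
    """Autoassociative MLP: tanh hidden layer of size k, linear output."""
    n = X.shape[1]
    W1 = rng.normal(scale=0.1, size=(n, k))
    W2 = rng.normal(scale=0.1, size=(k, n))
    for _ in range(epochs):
        H = np.tanh(X @ W1)
        E = H @ W2 - X                                    # output error
        dW2 = H.T @ E / m
        dW1 = X.T @ ((E @ W2.T) * (1 - H ** 2)) / m       # tanh derivative
        W2 -= lr * dW2
        W1 -= lr * dW1
    return np.tanh(X @ W1)                                # hidden layer matrix H

dv_input = lower_triangle_distances(L)
v_corr = {}
for k in (1, 8, 32):
    dv_hidden = lower_triangle_distances(train_amlp(L, k))
    v_corr[k] = float(np.corrcoef(dv_input, dv_hidden)[0, 1])
```

Plotting the correlations in `v_corr` over the full range of k produces the kind of curve discussed below.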

Correlations of input and hidden layer distance vectors for hidden layer sizes 1...200.

Correlation is low for small hidden layer sizes but grows rapidly as the size increases, eventually flattening out at a correlation that fluctuates in the range 0.90...0.95; Spearman correlation gave the same distribution shape over the same numerical range. There is a strong correlation from hidden layer size c.25 onwards, so the conclusion is that the distance relations in the input data are preserved in the hidden layer for those values of k; that is, the aMLP generates a good approximation to homomorphism with respect to preservation of input distance structure in the present application.

Cluster trees (average linkage) for L and H.

Further empirical results indicate that the distance structure preservation of nonlinear single hidden layer aMLPs generalizes. Experiments using the foregoing methodology were conducted using randomly generated binary input matrices with various combinations of the numbers of rows and columns in the range 12-48, keeping these quantities small for tractability. The shape of the correlations was always very similar to that in

That said, it is the case that generalizations based on inductive inference from evidence cannot constitute proof [

Finally, a single hidden layer aMLP with a sigmoid hidden layer and a linear output layer has been shown to be a universal function approximator, as noted, given a sufficient number of hidden units, so this architecture should be all that's required to generate homomorphic representations of environmental input for the model which is the focus of this paper. In practice, however, 'a sufficient number of hidden units' in any given application may well turn out to be very large, and this raises the problem of overfitting, where the neural network learns the input data accurately but generalizes poorly to unseen data from the same input distribution [

The aim of the foregoing discussion was to show how the homomorphism between the environment and its system-internal representation in a model of intrinsic linguistic intentionality, as implemented by a three-layer autoassociative multilayer perceptron with nonlinear hidden and linear output layers, can be understood mathematically. Existing work which sees such aMLPs as implementations of principal component analysis was cited, the implication of which is that the homomorphism characteristic of linear functions in general applies also to aMLPs. To extend the range of identity functions that can be implemented by an aMLP, however, nonlinear activation functions can be, and in the model in question are, used in the hidden layer; because homomorphism is currently understood as a characteristic of specifically linear functions, its preservation in a nonlinear MLP is not guaranteed. Experimental results were used to show that it is preserved in example applications, but it was noted that generalizations based on inductive inference from evidence cannot constitute proof, and that a secure theoretical basis would be useful. The discussion also noted that use of a multilayer 'deep learning' aMLP would address the potential problem of overfitting when a single hidden layer aMLP is used, but that homomorphism in such a network would need to be demonstrated.

The author did all the research work of this study.

The author has declared that no competing interests exist.