Learning and Time

 

 

In many cases the evolution of the nervous system was paralleled by  a contemporary increase of the information amount about the external environment. If one compares the number of bits collected by skin photo-sensible cellules of primitive invertebrates with the information flux  carried by mammalian evolved eyes, such a change is clearly seen. Making use of large information fluxes from the environment represents an evident advantage. However, a gigantic increase of the learning tasks computing efforts is also implied. This is an important aspect to understand the constraints under which the evolution of nervous system and competence has taken place. In the following we focus on such a problem.

We consider in the following just binary vectors as input data, as it permits to simplify the discussion without any relevant loss of generality. In particular we consider the problem of classification learning by appropriate algorithms, having in mind the mammalian brain. We do not refer here to specific hardware, as neural networks. The points discussed here are independent of the specific devices by which learning is implemented. The problem we focus on is here the number of examples needed by a given algorithm to learn a specific binary classification. The number of examples needed, as well as their lower theoretical limits, give a good estimate of the time necessary to a living system to learn a correct binary classification. These estimates permit to discuss the utility of learning and its consequences in terms of evolutionary advantages.

We make use of the concept of “Probably Approximately Correct” model (PAC).

 

The external environment affects the system by a monitoring module M, which translates it into binary strings of length N. The number of possible seen states of affairs is therefore 2N.

Some of these states correspond to “positive states of affairs”, represented by 1, while the rest correspond to negative states of affairs (SOA) and are associated with 0.

Assuming that each SOA is classified and therefore corresponds to 0 or 1, there exist  possible classifications PC . Most of these possible classifications would be “very strange”; imagine we order the possible SOA’s and use a coin to decide if each of them belongs to 1 or 0 class. Common sense suggests that the probability that it could be a easy learnable classification will be very low. Given a standard method to order the 2N states of affairs, each classification will correspond to a 2N binary string of  possible ones (one for each possible classification). We can make use of Kolmogorov Complexity to define the complexity of a classification. As we have assumed that PC can be ordered, we can define the K(PC(i)) as the shortest UTM program capable of output the binary string corresponding to PC(i).

As the probability of a low complexity classification is very low, an arbitrary classification will have an extremely low probability of having a complexity less that 2N. This has  heavy consequences on the learning problem we are considering here, as it implies that you need to learn the value of the classification at each point of the 2N domain before “knowing” the classification. This is true for the vast majority of the possible classifications.

We can define on the 2N possible input states a probability function m, which represents the probability for the living system to experience, by its monitoring module M, a given state. Low m states will probably affect less the survival probability of the system itself. A second, but fundamental, point is the existence of a possible relation between the number of states experienced and the classification error.  Of course there can be different “experiences” of m examples, depending on what m states have been visited.

We can identify by S(m) the ensemble of possible stories long m time steps. Two stories will be equivalent if the set of visited states is the same. The number of possible m states is large and can be estimated easily. Moreover we can call the Concept Space (C) as the set of all possible classifications. As the living system has however to learn by some device, the capabilities of which can be somehow limited, we can define a Hypothesis Space (H) which roughly corresponds to the set of classifications that the device can implement.

 

Now we identify a “reasonable request” for the learning process. More precisely we ask the system to control its error probability by the length of its story length. This looks quite reasonable also in terms of everyday experience: normally (but not always) you need a period of training before reducing your error probability.

This “reasonable request” can be translated into a formal statement, which turns out to be as well the definition of PAC algorithms.

Lets call A(C,H) a learning algorithm.  Then for all binary classifications of C, A(C,H) is PAC if, given an error e and a parameter d, both positive and smaller than the unity, there exists a critical length mc(e,d) such that for:

 

                                                                                                   (1)

 

independently of m.

 

The meaning of the above definition is quite simple: if we want to limit the probability of an error of order e by  d, we need to know at least  mc(e,d) examples. This holds for whatever classification  t of C. C is the domain of validity of A. Of course everything will be easier if H and C are equivalent.

 

Now we will introduce a minimal learning request for A, namely that it is at least capable of remembering correctly the learned examples. We will call such an algorithm a consistent one. We mean that A, for each element of the example story seen, is capable to reproduce the correct classification. This is a form of minimal memory's property.

It can be shown that any consistent A(H,H) algorithm (C=H) is PAC and that:

 

                                                                                                                                                         (2)

 

This is a very nice formula, which permits to easily evaluate the minimum number of examples needed by a consistent (H,H) algorithm to limit the probability to make an error of order e.  For a binary classifier however, it is evident that there are some problems in the direct application of (2). The error has to be better defined.

Let us consider an approximating  function f, and a target function t. Then:

 

                                                                                                                          (3)

 

So the error is actually the probability that, on a given domain D, the approximating function f output fails to correctly reproduce t classification.

Then it is interesting to underline how a nice log function on the rhs of (2) is a good reason for optimism about the “solution” of the learning problem we are considering here, even if, as seen above, #(H) tends to be gigantic.

For the classification of binary strings of length N, we obtain, by (2):

 

mc= ae-1(2N-log2d)

 

where a depends on the log base. Then the minimal number of examples to be acquired grows exponentially with N. This effect can be, within very restrict limits,  balanced by changing values of e and d.

 

 

All the above results are true if the domain H is finite. Infinite H problem can be approached in terms of VC dimension.

VC (Vapnik-Chervonenkis) dimension is however of some importance when applied to Neural Networks also for finite H [1,2].

 

As we are essentially interested in classifying algorithms ( with output 0 or 1),  we consider this case here. We intend to investigate how an algorithm A performs for growing sets of examples. Let’s use L to represent an examples set. #L  is its cardinality.

The maximum possible number of classifications is 2#L. If A is capable to learn all such possible classifications, we say that A shatters  such ensemble.  The VC-dimension is defined as the largest set of examples shattered by A.

This is a quite general definition, that can be applied to a variety of cases. Having however in mind the evolution of nervous system, we are interested in VC-dimension of neural networks, as they seem, notwithstanding the large number of different signals observed in the brains, to play a major role in the learning processes. See here for a brief list of observed signals.

We consider here feedforward networks with binary outputs with the standard sigmoid function as activation function for hidden units. We further represent by #W the number of the neural network’s parameters (weights and thresholds) and by K the number of computational units.

It has been shown [3] that such a network has a finite VC-dimension, at most

 

VC@#W K (#W K –1) + O(W4)                                                                                                                                                               (4)

 

Many others specific networks have been studied [4]. Here we are not interested in the complex details of VC-dimension. We just need to understand those general features of the theory that can help us in understanding the problem we are trying to depict. The reader interested in the advances in VC-dimension theory of Neural Networks can find updated information at NeuroCOLT site.

 

We consider a very simplified case for our discussion.

 

1)     The concept space is finite

2)     The hypothesis space coincides with the concept space

3)     All network parameters are discrete and finite

4)     The VC-dimension d of the networks considered is positive and its value is approximately equivalent to the number of free parameters

 

We then make use of a very nice theorem by Blumer et al [5] and by Herenfeucht et al. [6]:

 

If a neural network N has finite VC-dimension d greater or equal to 1, then any consistent algorithm A for N is a PAC learning algorithm. Then there exists a constant K such that a sufficient sample length mc(e,d) for any such algorithm is given by:

 

mc(e,d)=Ke-1(d ln(e-1)+ln(d-1))                                                                                                                      (5)

 

Moreover, for each PAC learning algorithm for N, there exists a constant c such that

 

mc(e,d)=ce-1(d)+ln(d-1))                                                                                                                                (6)

 

for all e smaller or equal to 1/8 and d smaller or equal to 1/100.

 

Referring to Vc-dimension, we make use of  make use of an approximate version of a couple of theorems [7]. After this theorem (and some other refinements) the following relation holds:

 

                                                                                                                                                (7)

 

where W represents the number of parameters (thresholds and weights) of the network and N is the number of threshold units. K is a constant that can be shown to exist. (7) does not hold for all classes of networks, but it can be retained as a good general approximation [8].

 

References

 

[1] Vapnik V. N. and Chervonenkis A. Ya., (1971), On the uniform convergence of relative frequencies of events to their probabilities, theory of probability  and its applications, 16(2): pp: 264-280

 

[2] Valiant L. G., (1984), A theory of the learnable, Comm. ACM, 27, 11, pp: 1134-1142

 

[3] Karpinsky M. and Mcintyre A., (1995), Polynomial bounds for VC dimension of sigmoidal neural networks, Proceedings of the 27th annual ACM symposium on the theory of computing, pp:200-208

 

[4] Anthony M., (1994), Probabilistic analysis of learning in artificial neural networks: the PAC model and its variants, NeuroCOLT Tech. Rep. NC-TR-94-3

 

[5] Blumer A., Ehrenfeucht A., Haussler D. and Warmuth M. K., (1989), Learnability  and the Vapnik-Chervonenkis dimension, J. of the A.C.M., 36, 4, pp:929-965

 

[6] Ehrenfeucht A., Haussler D., Kearns M. and Valiant L., (1989), A general lower bound on the number of examples needed for learning, Information and Computation, 82, 3, pp:247-261

 

[7] Baum E. B. and Haussler D, (1989), What size net gives valid generalisation, Neural Computation, 1, pp:151-160

 

[8] Anthony M., Biggs N., (1995), PACA learning and Artificial Neural Networks, NeuroCOLT Technical Report, NC-TR-95-023

 

 

 

 

Next Page

 

 

Home Page