Information acquisition in view of survival: part 2

 

We introduce a utility function F(L), defined on the space BL of binary vectors of length L. Such a utility function represents the advantage the system gets by distinguishing a given vector and by implementing an appropriate reaction to it. The function can be subjected to the following scaling constraint:

 

                                                                                                                                                                                  (18)

 

which allows us to introduce a measure similar to entropy:

 

                                                                                                                                                 (19)

 

By inspection of (19), one observes that HF(L) reaches its maximum if the utility function is homogeneous on the BL space. At the same time, the system then has no advantage in distinguishing between different L’s.
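The observation above can be checked numerically. The sketch below is only illustrative: it assumes that the scaling constraint (18) normalises the utilities to sum to 1 and that the measure in (19) has the usual Shannon form, neither of which is confirmed by the text as it stands. The function name `utility_entropy` and the sample utilities are hypothetical.

```python
import math

def utility_entropy(utility):
    """Shannon-style entropy (in bits) of a utility vector.

    Assumption: the utilities are first rescaled to sum to 1,
    which is one possible reading of the scaling constraint (18).
    """
    total = sum(utility)
    if total <= 0:
        raise ValueError("utility must have a positive sum")
    shares = [u / total for u in utility]
    # Terms with zero share contribute nothing to the sum.
    return -sum(s * math.log2(s) for s in shares if s > 0)

# Hypothetical utilities over the 4 binary vectors of length L = 2.
flat = [1.0, 1.0, 1.0, 1.0]      # homogeneous utility
peaked = [0.1, 0.1, 0.1, 3.7]    # one vector matters far more than the others

h_flat = utility_entropy(flat)
h_peaked = utility_entropy(peaked)
print(h_flat, h_peaked)
```

Under these assumptions the homogeneous utility attains the largest possible value (L bits), while the peaked utility, where distinguishing vectors actually pays off, scores lower.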

The total utility on BL is defined as:

 

                                                                                                                                                      (20)

 

For both F(L) and P(L), the average corresponds to the coarse grain distribution. So we can define:

 

 

Then, by inserting the above equations into (20), we obtain:

 

                                                                                (21)

 

where the second term on the rhs represents a basic contribution given by the hazard, and c represents the correlation coefficient. The average utility depends on the correlation, on BL, between the deviations of F(L) and P(L) from the coarse grain hypothesis.

If one or both of F(L) and P(L) are flat (constant) on BL, the average utility reduces to the second term. The above relation holds for any interpretation of the utility function. While the first term is affected by the scaling (18) of the utility function, the second term does not depend on such scaling. Equation (21) shows how both large s’s and a large correlation increase the average utility. Probability and utility functions that are homogeneous and flat on BL lead to poor advantages.
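The role of the correlation can be illustrated with the standard covariance identity: for two functions on N points, the sum of their products equals N times their covariance plus N times the product of their means. The sketch below uses that identity only; it is not claimed to reproduce the exact form of (21). The function name and the sample values are hypothetical.

```python
import math

def decompose_total_utility(F, P):
    """Split sum(F*P) into a correlation term and a baseline term.

    Uses the identity sum(F*P) = n*cov(F, P) + n*mean(F)*mean(P),
    with cov(F, P) = corr * sd(F) * sd(P) (population statistics).
    """
    n = len(F)
    mean_f = sum(F) / n
    mean_p = sum(P) / n
    sd_f = math.sqrt(sum((f - mean_f) ** 2 for f in F) / n)
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in P) / n)
    cov = sum((f - mean_f) * (p - mean_p) for f, p in zip(F, P)) / n
    # The correlation is undefined for a flat function; its term is then zero.
    corr = cov / (sd_f * sd_p) if sd_f > 0 and sd_p > 0 else 0.0
    correlation_term = n * corr * sd_f * sd_p
    baseline_term = n * mean_f * mean_p
    return correlation_term, baseline_term

# Hypothetical utility and probability on the 4 vectors of length L = 2.
F = [0.1, 0.2, 0.3, 0.4]
P_aligned = [0.1, 0.2, 0.3, 0.4]   # probability follows the utility
P_flat = [0.25, 0.25, 0.25, 0.25]  # coarse grain (uniform) probability

aligned = decompose_total_utility(F, P_aligned)
flat = decompose_total_utility(F, P_flat)
print(aligned, flat)
```

With the flat probability the correlation term vanishes and only the baseline survives, which matches the statement that a flat F(L) or P(L) reduces the average utility to the second term.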

 

Below we study the possible interpretations of the utility function F(L).

We refer to a specific survival situation, where there exists a risky situation, associated with a given L, and the system’s knowledge module produces a competent reaction that has a cost.

We assume therefore that the function F(L) can be rewritten as:

 

                                                                                                                                                                                 (22)

 

where w is a parameter that we assume, for the moment, to be equal to unity. Thus, large values of F(L) refer to high-risk situations where the reaction is essentially a low-cost one. The cost function is strictly connected with two essential parameters, namely the capability of the system to predict a given risky state of affairs well in advance, and the number of available reactions that its knowledge module can choose between.

 

Therefore, the average capability of predicting a given state of affairs is important; it is well captured by the entropy H(L), which is an exact measure of the average surprise. As long as the system restricts itself to knowledge of L-blocks, its problem is that of reducing the average surprise to that intrinsic to the environment observed by its see(S).
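As a rough illustration of the block entropy H(L), the sketch below estimates it from the empirical frequencies of the L-blocks in a finite binary sequence. This is the usual plug-in estimate, not a procedure taken from the text; the function name `block_entropy` and the test sequence are hypothetical.

```python
import math
from collections import Counter

def block_entropy(symbols, L):
    """Empirical Shannon entropy (in bits) of the length-L blocks of a sequence."""
    if L == 0:
        return 0.0
    total = len(symbols) - L + 1
    if total <= 0:
        raise ValueError("sequence is shorter than the block length")
    # Count every overlapping window of length L.
    counts = Counter(tuple(symbols[i:i + L]) for i in range(total))
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A perfectly periodic environment: 0, 1, 0, 1, ...
periodic = [0, 1] * 50

h1 = block_entropy(periodic, 1)
h2 = block_entropy(periodic, 2)
gain = h2 - h1  # extra surprise carried by the second symbol of a block
print(h1, h2, gain)
```

For this periodic sequence a single symbol looks like a fair coin, yet the second symbol of a block adds almost no surprise once the first is known, which is the kind of advantage that moving to longer blocks provides.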

As aH(L) is the measure of information per symbol once the system has complete knowledge of the L-block statistics [2], the quantity of information it gets from n symbols is simply naH(L). By solving the following equation:

 

                                                                                                                                                              (23)

 

one can determine the discrete time interval needed to accumulate an amount of information equivalent to the coarse grain hypothesis of total ignorance. Beyond that limit, the surprise is equivalent to that of learning the result of a coin toss.

As we are considering the case of a binary vocabulary, we can conclude that

 

                                                                                                                                                                                   (24)

 

 

For n<nc(L), the predictive advantage is given by:

 

                                                                                                                                                             (25)

 

One can of course take the limit of (24) for L tending to infinity to evaluate the intrinsic time horizon due to the nature of the underlying environmental dynamics. As aH(L) decreases in modulus for increasing values of L, the forecasting advantage for the system of moving towards knowledge of larger L is evident.

As noticed above, however, to make evolutionarily advantageous use of its predictive competence the system needs a wider repertoire of possible reactions. Moreover, to use this repertoire competently, it needs the information necessary to distinguish between situations.

So far we have implicitly assumed that the see(S) is capable of delivering only very rough (binary) information about the instantaneous state of the world. The utility function depends much more on the temporal sequence of data obtained by a very rough see(S) function.

The change of see(S) corresponds essentially to a change in the vocabulary or, if you prefer, to a more detailed symbolic partition through which the system observes the environment. In order to point out the problems associated with such a vocabulary change, let us consider the following case. Assume that a certain external state of affairs implies a survival risk for the living system and that the already existing knowledge module is capable of identifying it and producing a competent primitive reaction. Such a reaction represents a useful competence.

Why should we call it primitive? This judgement necessarily refers to an external, omniscient observer, who could judge the reaction from the outside. Such an observer could for example “observe” that the reaction is too expensive from the energetic point of view and would give rise to negative consequences in the future. In any case, the observer would conclude that, if the living being had been able to take other information into proper account, it would have made a more appropriate choice. This is why the observer would call the reaction “competent, but primitive”.

This leads to the following interesting question: “why should the primitive being give up its implicit judgement, which was considered correct even by an omniscient external observer?” Of course we can translate this rather rhetorical question into a more pragmatic statement: there are no evolutionary reasons to modify or cancel the knowledge module that has produced a correct classification of a dangerous state of affairs, while there exist excellent evolutionary reasons to improve its reaction to such a state of affairs. The problem posed here becomes even more serious if one considers the time that would normally be needed to acquire a good statistical knowledge of the blocks formed from a vocabulary with larger cardinality.

This means essentially that, to optimise survival together with its possible improvement, the system must, in many senses, build the new knowledge on the basis of the old one.

The next problem is captured by the following question: “are all the symbolic partition refinements corresponding to a new, more sophisticated, see(S) function, equally useful to improve the survival competence of the system?”

Then a third problem emerges: even if genetics can imply catastrophic changes in the physical structure of the device used by the system to manage its knowledge, the information acquisition processes cannot take place on time scales ruled simply by genetics. This implies that, while we can suppose that evolution is capable of “abruptly” making available one or more new physical modules to implement the new task, 

 

In order to compare the changes associated with a homogeneous increase in the details captured by the see(S) function, one can simply assume a passage from a binary coding to a 4-letter alphabet coding, represented by two-dimensional binary vectors. There are various possibilities for representing such an operation. We adopt the table below as a schematic representation of the change.

 

                  Old vocabulary
                    0       1
New vocabulary     0 0     1 0
                   0 1     1 1

Table 1

 

For the old vocabulary, the information advantage is represented by:

 

                                                                                                                                                                                          (26)

Such an advantage is represented by the difference between the a priori ignorance and the residual one, once the L=1 statistics are acquired. Similarly, after the acquisition of an A=4 vocabulary, the same advantage is represented by:

 

 

If we ask for a relative advantage in ignorance reduction by passing from A=2 to A=4, we obtain:

                                                                                                                                                                        (27)

 

which can be rewritten as:

 

                                                                                                                                                                           (28)

 

This means that the average surprise for the second symbol is less than for the first.

The contributions to the rhs of the above equation can differ in value, and can therefore be ordered by their size. However, the first term represents a rather abstract measure, as a small value of H(j/i) turns out to be quite inessential if P(i) is very low. One is therefore led to consider the weighted sum, which gives a measure of the average (in time) information flux associated with the second symbol.

 

 

as a measure of efficiency for the adoption of A=4 vocabulary.
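The point about weighting can be made concrete with a small numerical sketch. It assumes that the weighted sum is the sum over old symbols i of P(i) times H(j/i), i.e. the ordinary conditional entropy of the second symbol; the formula itself is not legible in the text, so this is an interpretation. The joint probabilities and the function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a probability list (zero entries are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def refinement_report(joint):
    """joint[i][j] = probability of old symbol i refined by second symbol j.

    Returns P(i), the per-symbol surprises H(j/i), and the weighted
    sum over i of P(i) * H(j/i) (assumed reading of the weighted sum).
    """
    p_old = [sum(row) for row in joint]
    cond = [
        entropy([p / p_i for p in row]) if p_i > 0 else 0.0
        for row, p_i in zip(joint, p_old)
    ]
    weighted = sum(p_i * h_i for p_i, h_i in zip(p_old, cond))
    return p_old, cond, weighted

# Hypothetical A=4 statistics: old symbol 1 is rare but sharply refined.
joint = [[0.45, 0.45],
         [0.09, 0.01]]

p_old, cond, weighted = refinement_report(joint)
h_joint = entropy([p for row in joint for p in row])
print(p_old, cond, weighted, h_joint)
```

Here the second symbol is much less surprising after old symbol 1 than after old symbol 0, but because P(1) is small this barely lowers the weighted sum. The chain rule H(i,j) = H(i) + sum of P(i) H(j/i) ties the weighted sum to the entropies of the old and new vocabularies.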

 

 

                                                                  (29)

 

Or:

 

                                                                  (30)

 

Where:

 

 

By inspection of equation (30), one can observe how, in the presence of a doubling of the vocabulary (or an equivalent refinement of the monitoring partition of the environment), not all the elements of the previous partition contribute equally to the increase of the distance from the previous distribution. More precisely, the contribution of an old element of the partition depends on its role in the old probability distribution, with a weight that is given by the term on the rhs of (30). This term, which could be further simplified without any advantage in the understanding of its meaning, clearly captures the asymmetry between the two local sub-partitions.

Equation (30) can also be understood in terms of Effective Complexity [5]. This will be the subject of a future paper.

 

 

 

 

 

[1] Cover T. M., Thomas J. A. (1991), Elements of Information Theory, J. Wiley & Sons, New York

[2] Crutchfield J. P., Feldman D. P. (2001), Regularities Unseen, Randomness Observed: Levels of Entropy Convergence, Santa Fe Institute Working Paper 01-02-012

[3] Crutchfield J. P., Feldman D. P. (2001), Synchronizing to the Environment: Information Theoretic Constraints on Agent Learning, Santa Fe Institute Working Paper 01-03-020

[4] Iooss G., Joseph D. D. (1980), Elementary Stability and Bifurcation Theory, Springer-Verlag, New York

[5] Gell-Mann M., Lloyd S. (1996), Information Measures, Effective Complexity, and Total Information, Complexity, J. Wiley & Sons, New York

[6] V’yugin V. V. (1999), Algorithmic Complexity and Stochastic Properties of Finite Binary Sequences, The Computer J., 42, 4, pp. 294-317

 

 
