We introduce a utility function F(L), defined on the binary vector space BL of L-vectors.
The utility function represents the advantage the system gains by distinguishing a given L
and implementing an appropriate reaction to it. The function can be subjected to the
following scaling constraint:
(18)
which allows us to introduce a measure similar to entropy:
(19)
By inspection of (19), one observes that if the utility function is homogeneous
on the BL space, HF(L) reaches its maximum. At the same time, the system then gains
no advantage from distinguishing between different Ls.
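The behaviour of the entropy-like measure can be checked numerically. The sketch below assumes that the constraint (18) normalises F so that its values over BL sum to one, and that (19) has the Shannon form of minus the sum of F log2 F. Both forms are assumptions, since the equations are not reproduced here, and the function name `utility_entropy` is hypothetical.

```python
import math

def utility_entropy(F):
    """Shannon-like entropy (in bits) of a utility vector rescaled to sum to 1."""
    total = sum(F)
    if total <= 0:
        raise ValueError("utility values must have a positive sum")
    # Assumed form of the scaling constraint: the values of F sum to one.
    normalised = [f / total for f in F]
    return -sum(f * math.log2(f) for f in normalised if f > 0)

L = 3
n_blocks = 2 ** L                        # number of binary L-vectors
flat = [1.0] * n_blocks                  # homogeneous utility
peaked = [8.0] + [1.0] * (n_blocks - 1)  # one L-vector is much more useful

h_flat = utility_entropy(flat)
h_peaked = utility_entropy(peaked)
```

Under these assumptions the homogeneous utility gives the largest value, L bits, while any utility that singles out some L-vectors gives a smaller one.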
The
total utility on BL is defined as:
(20)
For both F(L) and P(L), the average corresponds to the coarse-grain distribution.
So we can define:
Then,
by inserting the above equations into (20), we obtain:
(21)
where the second term on the rhs represents a baseline contribution due to chance
and c represents the correlation coefficient. The average utility depends on the
correlation, on BL, between the deviations of F(L) and P(L) from the coarse-grain
hypothesis.
If one or both of F(L) and P(L) take a flat, constant value on BL,
the average utility reduces to the second term. The above relation holds
for any interpretation of the utility function. While the first term
is affected by the scaling (18) of the utility function,
the second term does not depend on such scaling. Equation (21) shows how both
large σ values and correlation increase the average utility. Probability and
utility functions that are homogeneous and flat on BL lead to poor advantages.
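The split of the total utility into a correlation term and a baseline term can be illustrated with the standard covariance identity. The sketch below assumes that (21) is of the form N c σP σF + N ⟨P⟩⟨F⟩, with N the number of L-vectors and σP, σF the standard deviations of P and F over BL. This form is an assumption, not a transcription of (21), and the name `utility_decomposition` is hypothetical.

```python
import math

def utility_decomposition(P, F):
    """Split sum(P*F) into a correlation term and a baseline (chance) term."""
    n = len(P)
    mean_p = sum(P) / n
    mean_f = sum(F) / n
    cov = sum((p - mean_p) * (f - mean_f) for p, f in zip(P, F)) / n
    sigma_p = math.sqrt(sum((p - mean_p) ** 2 for p in P) / n)
    sigma_f = math.sqrt(sum((f - mean_f) ** 2 for f in F) / n)
    # The correlation coefficient is undefined for a flat function; its term is zero.
    c = cov / (sigma_p * sigma_f) if sigma_p > 0 and sigma_f > 0 else 0.0
    correlation_term = n * c * sigma_p * sigma_f
    baseline_term = n * mean_p * mean_f
    return correlation_term, baseline_term

P = [0.4, 0.3, 0.2, 0.1]   # illustrative probabilities on four L-vectors
F = [0.5, 0.2, 0.2, 0.1]   # illustrative utilities, scaled to sum to one
total = sum(p * f for p, f in zip(P, F))
corr_term, base_term = utility_decomposition(P, F)
flat_corr, flat_base = utility_decomposition(P, [0.25] * 4)
```

With both functions summing to one, the baseline term equals 1/N regardless of their shape, and a flat F leaves only that term.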
Below we study the possible interpretations of the utility function F(L).
We refer to a specific survival situation in which a risky situation is
associated with a given L and the system's knowledge module
produces a competent reaction that has a cost.
We therefore assume that the function F(L) can be rewritten as:
(22)
where w is a parameter that we assume, for the moment, to be equal to unity. Large
values of F(L) thus refer to high-risk situations in which the reaction is essentially
a low-cost one. The cost function is strictly connected with two essential parameters:
the capability of the system to predict a given risky state of affairs
well in advance, and the number of available reactions among which its
knowledge module can choose.
Therefore, the average capability of predicting a given state of affairs is important;
it is well captured by the entropy H(L), which is an exact measure
of the average surprise. As long as the system restricts itself to knowledge of
L-blocks, its problem is that of reducing the average surprise to the level intrinsic
to the environment observed through its see(S).
As aH(L) is the measure of information per symbol
once the system has complete knowledge of the L-block statistics [2],
the quantity of information it gets from n symbols is simply naH(L).
By solving the following equation:
(23)
one can determine the discrete time interval needed to accumulate an amount of information
equivalent to the coarse-grain hypothesis of total ignorance. Beyond that limit,
the surprise is equivalent to that of the outcome of a coin toss.
As
we are considering the case of a binary vocabulary, we can conclude that
(24)
For
n<nc(L), the predictive advantage is given by:
(25)
One can of course take the limit of (24) for L tending to infinity to evaluate the intrinsic time horizon, which is due to the nature of the underlying environmental dynamics. As aH(L) decreases in modulus for increasing values of L, the forecasting advantage the system gains by moving towards knowledge of larger L is evident.
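The dependence of the information per symbol on L can be illustrated with empirical block entropies. The sketch below assumes that the information per symbol is estimated as the entropy gain H(L) − H(L−1), in the spirit of [2]; whether this coincides with the quantity written aH(L) in the text is an assumption. The function names are hypothetical.

```python
import math
from collections import Counter

def block_entropy(symbols, L):
    """Empirical Shannon entropy (bits) of the length-L blocks of a sequence."""
    if L == 0:
        return 0.0
    blocks = [tuple(symbols[i:i + L]) for i in range(len(symbols) - L + 1)]
    counts = Counter(blocks)
    total = len(blocks)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_gain(symbols, L):
    """Entropy added by the L-th symbol once the (L-1)-block statistics are known."""
    return block_entropy(symbols, L) - block_entropy(symbols, L - 1)

periodic = [0, 1] * 200              # a perfectly predictable binary environment
gain_1 = entropy_gain(periodic, 1)   # single symbols look like a fair coin
gain_2 = entropy_gain(periodic, 2)   # with 2-block knowledge almost nothing is new
```

For this periodic sequence a system that knows only the L=1 statistics is surprised by one full bit per symbol, while knowledge of the L=2 statistics removes essentially all of that surprise, which is the forecasting advantage of moving to larger L.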
As noted above, however, to make evolutionarily advantageous use of its predictive competence, the system needs a wider repertoire of possible reactions. Moreover, to use this repertoire competently, it needs the information necessary to distinguish between situations.
So far we have implicitly assumed that see(S) delivers only very rough (binary) information about the instantaneous state of the world. The utility function depends much more on the temporal sequence of data obtained by a very rough see(S) function.
The change of see(S) corresponds essentially to a change in the
vocabulary or, if you prefer, to a more detailed symbolic partition
through which the system can observe the environment.
In order to point out the problems associated with such a vocabulary
change, let us consider the following case. Assume
that a certain external state of affairs implies a survival risk for the living
system and that the already existing knowledge module is capable of identifying
it and producing a competent primitive reaction. Such a reaction represents
a useful competence. Why should we call it primitive? This judgement necessarily
refers to an external, omniscient observer, who could judge the reaction
from the outside. Such an observer could, for example, observe
that the reaction is too expensive from the energetic point of view and would
give rise to negative consequences in the future. In any case, the observer
would conclude that, if the living being had been able to take other information
into proper account, it would have made a more appropriate choice.
This is why the observer would call the reaction competent, but
primitive. This leads to the following interesting question: why
should the primitive being give up its implicit judgement, which was considered
correct even by an omniscient external observer? Of course, we can
translate this rather rhetorical question into a more pragmatic statement:
there are no evolutionary reasons to modify or cancel the knowledge module
that has produced a correct classification of a dangerous state of affairs,
while there exist excellent evolutionary reasons to improve its reaction to
such a state of affairs. The
problem posed here becomes even more serious if one considers the time that
would normally be needed to acquire good statistical knowledge about blocks
formed from a vocabulary with larger cardinality. This means essentially that,
to optimise survival together with its possible improvement, the system must,
in many senses, build the new knowledge on the basis of the old one.
The next problem is captured by the following question: are all the symbolic
partition refinements corresponding to a new, more sophisticated see(S)
function equally useful for improving the survival competence of the system?
Then a third problem emerges: even if genetics can imply catastrophic changes in
the physical structure of the device used by the system to manage its knowledge,
the information acquisition processes cannot take place in times simply ruled
by genetics. This implies that, while we can suppose that evolution is capable
of abruptly making available one or more new physical modules to
implement the new task,
In order to compare the changes associated with a homogeneous increase in the
details captured by the see(S) function, one can simply consider passing from
a binary coding to a 4-symbol alphabet coding represented by two-dimensional binary
vectors. There are various possibilities for representing such an operation. We adopt
the table below for the schematic representation of the change.
Old vocabulary | 0   | 1
New vocabulary | 0,0 | 1,0
               | 0,1 | 1,1
Table 1
For
the old vocabulary, the information advantage is represented by:
(26)
Such an advantage is represented by the difference between the a priori ignorance and
the residual ignorance once the L=1 statistics are acquired. Similarly, after the
acquisition of an A=4 vocabulary, the same advantage is represented by:
If
we ask for a relative advantage in ignorance reduction by passing from A=2
to A=4, we obtain:
(27)
which
can be rewritten as:
(28)
This
means that the average surprise for the second symbol is less than for the
first.
The contributions to the rhs of the above equation can differ in value.
They can therefore be ordered according to their contribution. However, the first
term represents a rather abstract measure, as a small value of H(j/i)
turns out to be quite inessential if P(i) is very low. One is therefore led
to consider the weighted sum, which gives a measure of the average (in time)
information flux associated with the second symbol,
as a measure of efficiency for the adoption of the A=4 vocabulary.
(29)
Or:
(30)
Where:
By inspection of equation (30), one can observe how, in the presence of a doubling
of the vocabulary (or an equivalent refinement of the monitoring partition
of the environment), not all the elements of the previous partition
contribute equally to the increase of the distance from the previous distribution.
More precisely, the contribution of an old element of the partition depends
on its role in the old probability distribution, with a weight that is given
by the term on the rhs of (30). This term, which could be further simplified
without any advantage in the understanding of its meaning, clearly captures the
asymmetry between the two local sub-partitions.
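The unequal contribution of the old partition elements can be made concrete with a small numerical sketch. It assumes that H(j/i) denotes the conditional entropy of the refining (second) symbol j given the old symbol i, and that the weighted sum referred to above is the sum over i of P(i) H(j/i). Both readings are assumptions, the joint distribution is invented for illustration, and the function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def refinement_terms(joint):
    """Per-old-symbol contributions P(i) * H(j/i); joint[i][j] = P(i, j)."""
    contributions = []
    for row in joint:
        p_i = sum(row)
        if p_i == 0:
            contributions.append(0.0)
            continue
        h_cond = entropy([p / p_i for p in row])  # H(j/i) for this old symbol
        contributions.append(p_i * h_cond)
    return contributions

# Illustrative A=4 distribution: rows are old symbols 0 and 1, columns the new bit.
joint = [[0.45, 0.45],   # frequent old symbol, evenly split by the refinement
         [0.09, 0.01]]   # rare old symbol, unevenly split
contrib = refinement_terms(joint)
h_old = entropy([sum(row) for row in joint])        # entropy of the A=2 vocabulary
h_new = entropy([p for row in joint for p in row])  # entropy of the A=4 vocabulary
```

By the chain rule, the A=4 entropy equals the A=2 entropy plus the sum of these contributions. In this example the frequent, evenly split old symbol accounts for almost all of the added information, while the rare one contributes very little.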
Equation
(30) can be understood also in terms of
Effective Complexity [5]. This will be the subject of a future paper.
[1] Cover T. M., Thomas J. A. (1991), Elements of Information Theory, J. Wiley & Sons, New York
[2] Crutchfield J. P., Feldman D. P. (2001), Regularities Unseen, Randomness Observed: Levels of Entropy Convergence, Santa Fe Institute Working Paper 01-02-012
[3] Crutchfield J. P., Feldman D. P. (2001), Synchronizing to the Environment: Information Theoretic Constraints on Agent Learning, Santa Fe Institute Working Paper 01-03-020
[4] Iooss G., Joseph D. D. (1980), Elementary Stability and Bifurcation Theory, Springer-Verlag, New York
[5] Gell-Mann M., Lloyd S. (1996), Information Measures, Effective Complexity, and Total Information, Complexity, J. Wiley & Sons, New York
[6] Vyugin V. V. (1999), Algorithmic Complexity and Stochastic Properties of Finite Binary Sequences, The Computer J., 42, 4, pp. 294-317