A system A has a detection module see(S), which creates a flux of data (here in binary form, for simplicity) towards A.
The flux continues, in principle, for the whole life of the system. The world's internal state is not accessible to the system: all it gets is the output of the see(S) function (Figure 1).
A single binary string can turn out to be a very poor representation of the messages reaching A. An alternative could be a flux of binary strings of finite length:
100100101011010110101101000101000110110
011011010010110110111010101101011101000
100101101001111010111010100101000101011
100101011
..
...
where the vertical arrows represent time.
FIG.1. The see(S) function.

We keep, for the moment, our simplified model. We assume that the system keeps track of the statistics of the blocks of symbols reaching it; Figure 1 shows only the statistics of 3-symbol blocks. In general, given a natural number L, the set W_L of possible blocks of length L is easily found. For a binary vocabulary such as the one considered here, we obtain:
$$\#W_L = 2^L$$
where # denotes the cardinality of a set.
Let us consider a long data string of length N, which we assume to be the complete data string obtained by the system during its life.
For each value of L, the system has therefore received
$$N - L + 1 \tag{1}$$
examples of blocks of length L, with possible repetitions.
The system can therefore evaluate an approximate probability distribution on L-blocks:
(2)
where s^L is a member of W_L.
For sufficiently large N, one can imagine reaching an evaluation of the block entropy H(L):
$$H(L) = -\sum_{s^L \in W_L} P(s^L) \log_2 P(s^L) \tag{3}$$
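The counting procedure described above can be sketched in a few lines of Python. This is only a minimal illustration: it assumes that the approximate distribution is the relative frequency of each overlapping L-block and that H(L) is the standard Shannon entropy in bits. The function names `block_distribution` and `block_entropy` are hypothetical.

```python
from collections import Counter
from math import log2


def block_distribution(data: str, L: int) -> dict[str, float]:
    """Relative frequencies of the overlapping length-L blocks in `data`."""
    n_blocks = len(data) - L + 1  # number of examples, N - L + 1
    if L < 1 or n_blocks < 1:
        raise ValueError("need 1 <= L <= len(data)")
    counts = Counter(data[i:i + L] for i in range(n_blocks))
    return {block: c / n_blocks for block, c in counts.items()}


def block_entropy(data: str, L: int) -> float:
    """Plug-in estimate of the block entropy H(L), in bits (assumed Shannon form)."""
    return -sum(p * log2(p) for p in block_distribution(data, L).values())


# First string of the flux shown earlier in the text.
data = "100100101011010110101101000101000110110"
for L in (1, 2, 3):
    print(L, len(data) - L + 1, round(block_entropy(data, L), 3))
```

With such a short string the estimate is reliable only for small L, since the number of possible blocks grows as 2^L while the number of examples is only N-L+1.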
In the coarse grain hypothesis, $P(s^L) = 2^{-L}$ for every block $s^L$, we can evaluate the maximum value for H(L):
$$H_{\max}(L) = L \tag{4}$$
So the maximum entropy for L-blocks is a simple linear function of L. By moving from L to L+1, we obtain a maximum entropy change of 1 bit.
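The linear growth can be checked directly. The short derivation below assumes that the coarse grain hypothesis assigns the same probability 2^{-L} to each of the 2^L possible blocks; the symbol H_max is introduced here only for illustration.

```latex
\begin{aligned}
H_{\max}(L) &= -\sum_{s^L \in W_L} 2^{-L} \log_2 2^{-L}
             = 2^{L} \cdot 2^{-L} \cdot L = L \\
H_{\max}(L+1) - H_{\max}(L) &= (L+1) - L = 1 \ \text{bit}
\end{aligned}
```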
As L tends to infinity, the maximum block entropy tends to infinity, but it is evident as well that:
$$\lim_{L \to \infty} \frac{H_{\max}(L)}{L} = 1 \tag{5}$$
The coarse grain hypothesis represents a very particular (and unlucky) case. It has however been shown that such a limit exists, and is finite, for all stationary stochastic processes [1]. So we can assume that, as long as the environment and see(S) are such that these conditions are satisfied, we have:
$$\lim_{L \to \infty} \frac{H(L)}{L} = a_H \tag{6}$$
By inspecting the above discussion, we can also infer that aH ≤ 1.
So, at least for stationary stochastic processes, H(L) has a linear asymptote in the (H,L) plane, with slope aH and intercept E at L=0, while the curve itself starts from H(0)=0.
aH can be rewritten as:
(7)
Its dimension is bits/symbol, and it represents the irreducible randomness in the observation sequences produced by the environment or, if you prefer, the randomness that persists even after statistics over longer and longer blocks of observations are accounted for by the agent. aH can also be interpreted in terms of Kolmogorov complexity [1].
The units of E are bits (it is therefore a measure of information), and it is normally called the excess entropy. The excess entropy tells us how much additional information needs to be collected from the environment in order to reveal the actual per-symbol uncertainty aH. We can refer to the local (in L) quantity aH(L) - aH as the redundancy (per symbol) in length-L sequences.

FIG.2. (From Reference 2)
So aH represents the average information per symbol that the system will get because of the nature of the environment and of see(S), rather than because of the limits of the information collected. aH(L) represents the average surprise per symbol due to the limits of the information collected.
It is therefore evident that
$$T(L) = E + a_H L - H(L) \tag{8}$$
represents a measure of the information still lacking once H(L) has been correctly estimated.
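As a rough numerical illustration of T(L), one can estimate the asymptote from a finite data string. The sketch below rests on strong assumptions that are not part of the text: H(L) is the plug-in Shannon block entropy, and the curve is taken to have already reached its asymptote at the largest block length considered, so that aH and E are read off its last two points. The names `block_entropy` and `lacking_information` are hypothetical.

```python
from collections import Counter
from math import log2


def block_entropy(data: str, L: int) -> float:
    """Plug-in Shannon entropy (bits) of overlapping length-L blocks; H(0) = 0."""
    if L == 0:
        return 0.0
    n_blocks = len(data) - L + 1
    counts = Counter(data[i:i + L] for i in range(n_blocks))
    return -sum(c / n_blocks * log2(c / n_blocks) for c in counts.values())


def lacking_information(data: str, L_max: int) -> list[float]:
    """Estimate T(L) = E + aH*L - H(L) for L = 0..L_max.

    Assumption: H(L) is already on its asymptote at L_max, so the slope aH
    and the intercept E are taken from the last two points of the curve.
    """
    H = [block_entropy(data, L) for L in range(L_max + 1)]
    a_H = H[L_max] - H[L_max - 1]   # slope of the assumed asymptote
    E = H[L_max] - a_H * L_max      # its intercept at L = 0
    return [E + a_H * L - H[L] for L in range(L_max + 1)]


# A strictly periodic environment: 0101...
T = lacking_information("01" * 500, 3)
print([round(t, 3) for t in T])
```

For this periodic string the estimated T(L) is close to 1 bit at L=0 and close to 0 for L≥1: once the single-symbol statistics are known, essentially nothing is still lacking.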
In principle, at least, one can observe that if T(L)=0 for a given L, the system has no further need of information as far as L-blocks are concerned [2].
This, of course, is a very intuitive statement. The H(L) curve shown in Figure 2 is the result of applying the concept of Shannon entropy, and has only a very loose connection with the finite experience of the system. The same holds for the whole discussion above, and one must keep this in mind.
Let us concentrate on how the information can be accumulated by the system, as represented in Figure 1. Given a certain value of L, the system, after having acquired a string of length N, has N-L+1 samples of L-blocks, with possible repetitions. It is therefore in a position to infer an approximate probability distribution over them.
Assume now that the system gets M more bits from its see(S). It will be able to work out a possibly different distribution:
We evidently need some estimate of the advantage associated with the new M bits acquired. This must depend on both distributions evaluated after N and N+M bits.
Such an information gain is well represented by:
(9)
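Assume M >> L.
If Q is divided by M, we obtain an estimate of the information acquired per bit. The M new bits also contribute M new examples of L-blocks: L-1 blocks that straddle the old and the new data, plus M-L+1 blocks that lie entirely within the new data, so that (L-1)+(M-L+1) = M. Dividing Q by the number of new examples therefore gives the same result for the information acquired per example.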
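A possible numerical reading of this information gain is sketched below. The expression for Q is not reproduced here, so the code assumes a Kullback-Leibler form, comparing the distribution after N+M bits with the one after N bits. It also assumes an additive smoothing (the hypothetical parameter `alpha`), which keeps the estimate finite when a block appears only in the new data. Both choices are illustrative, as are the function names.

```python
from collections import Counter
from itertools import product
from math import log2


def smoothed_distribution(data: str, L: int, alpha: float = 0.5) -> dict[str, float]:
    """Frequencies of overlapping L-blocks, with pseudocount alpha per block.

    The smoothing is an assumption of this sketch, not part of the text.
    """
    blocks = ["".join(b) for b in product("01", repeat=L)]
    counts = Counter(data[i:i + L] for i in range(len(data) - L + 1))
    total = (len(data) - L + 1) + alpha * len(blocks)
    return {b: (counts[b] + alpha) / total for b in blocks}


def information_gain(data: str, N: int, L: int, alpha: float = 0.5) -> float:
    """Assumed form of Q: KL divergence (bits) of the distribution after all
    the data from the distribution after the first N bits only."""
    p_old = smoothed_distribution(data[:N], L, alpha)
    p_new = smoothed_distribution(data, L, alpha)
    return sum(p_new[b] * log2(p_new[b] / p_old[b]) for b in p_new)


data = "0000000011111111"
N, L = 8, 2
M = len(data) - N
Q = information_gain(data, N, L)
print(Q, Q / M)  # total gain, and gain per new bit (or per new example)
```

In this example the first N bits contain only zeros, so the M new bits change the L-block statistics substantially and Q is strictly positive; with no new bits Q vanishes.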
One assumes, because of the law of large numbers, that acquiring new information improves knowledge. If we consider, however, an evolving, surviving system, subject to all the limits of its nature, the problem of the asymptotic values of the probability distributions takes on a specific aspect. In particular, some elements of W_L can be relevant to survival, while others can be irrelevant. So one can imagine, at least intuitively, that the above expression for information acquisition does not capture this aspect. It naturally suggests, in fact, that one should continue to acquire information until the variations of Q become negligible or, at least, reach a constant (possibly small) value. This would imply that the system has reached the best possible knowledge of the L-block statistics. Nevertheless, this point of view does not capture the whole problem, as the form of the asymptotic distribution of L-blocks is important as well.
If the actual asymptotic distribution corresponds to the coarse grain hypothesis, the advantage acquired by the system is purely epistemological, as L-blocks will essentially represent noise for its decision schemes.
So L-block statistics will be important if:
1) some L-blocks correspond to survival-relevant states of affairs
2) the L-block distribution does not exhibit a trend of convergence towards a coarse grain distribution
This point suggests the introduction of a measure of the distance of the acquired statistical distribution from that corresponding to coarse grain hypothesis:
where
R is commonly called redundancy and is clearly a measure of the information acquired by comparing the results with the coarse grain hypothesis.
3) the time needed for information acquisition must be tuned to the survival problem. This is quite a complex aspect.
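The expression for the redundancy R mentioned above is not reproduced here. Under the common definition, which is only assumed in this note, R is the gap between the maximum block entropy L and the actual H(L). It then coincides with the Kullback-Leibler distance of the acquired distribution P from the coarse grain (uniform) distribution U over the 2^L blocks:

```latex
\begin{aligned}
D(P \,\|\, U) &= \sum_{s^L \in W_L} P(s^L) \log_2 \frac{P(s^L)}{2^{-L}} \\
              &= L \sum_{s^L \in W_L} P(s^L) + \sum_{s^L \in W_L} P(s^L) \log_2 P(s^L) \\
              &= L - H(L)
\end{aligned}
```

With this reading, R vanishes exactly when the L-blocks are equiprobable, and reaches its maximum L when a single block occurs with certainty.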
We can now come back to the mathematics and observe that H(L) is a function of a discrete variable; therefore, if we want to capture its local behaviour, we cannot make use of standard derivatives and integrals. In particular, its derivative is another function of a discrete variable, defined as:
$$\Delta H(L) = H(L) - H(L-1) \tag{10}$$
It can be shown [2] that the following relation holds:
(11)
which is connected with the statistical dependence of the distribution of L-blocks on that of (L-1)-blocks:
It has to be noticed, however, that in this case the two distributions have different supports. It must not be confused with the definition given above, which refers to distributions with the same support.
We have thus given an exact definition of the function aH(L) introduced above.
As aH(L) is an entropy measure that assumes the system's knowledge to be exact up to the L-block distribution, it can easily be interpreted as an apparent entropy rate due to the limits of the system's knowledge or, if one prefers, as an epistemological estimate of randomness. The ontological one is represented, in some sense, by the asymptotic limit of aH(L).
It is evident that aH(L) does not capture the quantity we are interested in. We are interested in measuring the advantage obtained by moving our knowledge from L to L+1. The obvious question is: what is the advantage of making the effort to acquire full knowledge of longer blocks?
The difference aH(L) - aH(L-1) is evidently the right quantity. We have
(12)
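The two discrete differences can be illustrated on block-entropy curves that are known exactly. The sketch assumes that aH(L) is the backward difference H(L) - H(L-1), and that the quantity of interest is aH(L) - aH(L-1), as stated in the text. The function name `entropy_differences` and the two example curves are introduced here for illustration only.

```python
def entropy_differences(H):
    """First and second backward differences of a block-entropy curve.

    H[L] is the block entropy for L = 0, 1, 2, ...
    Returns two dicts keyed by L:
      first[L]  = H[L] - H[L-1]          (assumed to be aH(L)), for L >= 1
      second[L] = first[L] - first[L-1]  (aH(L) - aH(L-1)),     for L >= 2
    """
    first = {L: H[L] - H[L - 1] for L in range(1, len(H))}
    second = {L: first[L] - first[L - 1] for L in range(2, len(H))}
    return first, second


coarse_grain = [0, 1, 2, 3, 4]   # H(L) = L: equiprobable blocks
period_two = [0, 1, 1, 1, 1]     # H(L) = 1 for L >= 1: the string 0101...

for name, H in (("coarse grain", coarse_grain), ("period two", period_two)):
    first, second = entropy_differences(H)
    print(name, first, second)
```

In the coarse grain case every second difference is zero: knowing longer blocks brings no advantage. For the period-two string the only non-zero term is at L=2, where the apparent entropy rate drops from 1 bit to 0.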
It can be shown [2] that:
(13)
where the first terms inside the parentheses denote the name of the symbol.
As aH(L) is connected with the knowledge about the environment, and the above quantity represents the knowledge advantage acquired by the system by getting full knowledge of the next-order blocks, one can use it to evaluate the advantage the system would get by acquiring infinite knowledge about the system, namely:
(14)
G evidently represents a property of the environment, but there can be an important influence of the see(S).
If a system has acquired an almost exact knowledge of the statistics of blocks up to order L, it will have a lack of knowledge that is well represented by:
(15)
The nature of the convergence of this quantity to 0 is an important (but hardly accessible) piece of information for evaluating the quality of the knowledge acquired by the system.
While (13) is a local (in L) evaluation of the acquired level of knowledge, (15) represents an absolute evaluation.
We should now concentrate on the survival logic of the system, which is, by definition, incapable of carrying out the present (or any other formal) analysis. The assumption that the system exists is not a neutral choice in the perspective of evolution; its existence indirectly implies that it has a minimum level of survival competence. This can be translated, at least in principle, into the statement that its see(S) and its knowledge-reaction modules are capable of grasping some basic characteristics of the environment. As see(S) is not discussed here, one can translate the existence hypothesis into some formal facts; the most basic ones are the following:
1) the system has reasonable control of the L-block statistics
2) the L-blocks are sufficient pictures of the environment for the system's survival
3) the reaction repertoire is tuned so as to implement a minimal survival strategy
The evolutionary perspective can create several misunderstandings that one should do one's best to avoid. The first point is related to the possible confusion between the individual's experience of the environment and that accumulated by its ancestors. While chance has an essential role in the comprehension of the Adam system, the comprehension of the descendants' evolution is more articulated, as it has most probably passed through a (possibly very large) number of chance-amplification bifurcations [4]. We will try to simplify the approach by assuming that the system behaves as if it had acquired a string of N_G data about the environment, sufficient to produce survival-efficient statistics of L_G-blocks:
Making use of (1), one can define the following quantity:
which represents the equivalent number of examples needed by the system's ancestors to work out their statistical knowledge about L-blocks.
The a priori available knowledge is therefore:
This is some form of incorporated knowledge.
There are several possible strategies, based on different hypotheses:
Hp1: the L-block statistics can be further refined, and such a refinement can give a competitive advantage to the system
Hp2: the L-block statistics are well incorporated in the system's knowledge, and a significant competitive advantage can be obtained by moving to the statistics of some higher-order blocks.
To simplify the problem, one can concentrate on the statistics of (L+1)-blocks, but one must keep such a simplification in mind.
[1] Cover T. M., Thomas J. A. (1991), Elements of Information Theory, J. Wiley & Sons, New York
[2] Crutchfield J. P., Feldman D. P. (2001), Regularities Unseen, Randomness Observed: Levels of Entropy Convergence, Santa Fe Institute Working Paper 01-02-012
[3] Crutchfield J. P., Feldman D. P. (2001), Synchronizing to the Environment: Information Theoretic Constraints on Agent Learning, Santa Fe Institute Working Paper 01-03-020
[4] Iooss G., Joseph D. D. (1980), Elementary Stability and Bifurcation Theory, Springer-Verlag, New York