F C T S S T R N G R T H N F C T N

The fact that we can reconstruct the meaning of the message from these symbols alone shows
that what was left out did not convey any information that was essential to the
communication, i.e., the vowels and spaces were redundant for this message. (Don't push this
example too far ... one could get through a first grade reader without vowels or spaces, but I
doubt whether one could handle such an abridged version of *Finnegans Wake* [James Joyce]).
If redundancy is something that exists and can be compared [first grade reader > *Finnegans
Wake*], then we should be able to define it precisely and then measure it.

As with any other mathematical treatment of a real-world concept, we will create a mathematical model of the situation and make our definitions and take our measurements with respect to that model. How well this corresponds to the real world is then a question of how well the model fits, and if it is a good model we can tinker with it until we get whatever fit we need.

Rather than dealing with redundancy directly, let's consider the other side of the coin:
information. Rather than attempting a definition, consider what our intuition tells us about
information. Consider a horse race. If our friendly neighborhood bookie gives us a tip - say,
Finnegans Wake is a sure thing in the 5^{th} at Pimlico - then we would say that we have some
information about that race compared to not having this tip. Without the tip we have no
information concerning the race, our uncertainty about the outcome is maximal and the most
rational thing we can say about the outcome of the race is that each horse has the same
chance of winning. With the information (if we trust the source), our uncertainty is
diminished and the outcome no longer has an equiprobable distribution. If we had received
the tip after the race was over then it would have had no informational content because we
would now be certain about the outcome. What we see here is a reciprocal relationship
between information about the outcome and our uncertainty of the outcome. The more
information we have the less uncertain we are. Uncertainty is a concept that we can handle
mathematically with the theory of probabilities, so in the model we are creating we will
formally identify information with the reciprocal of uncertainty. This identification pares away
much of the semantic content of the concept of information but leaves us with a quantifiable
aspect of that concept. It leaves us open to the claim that we are tossing out the baby with the
bathwater; the vindication of this identification will come from the usefulness of the model we create.

Uncertainty in a physical system is a well-known concept. The measurement of this
uncertainty or randomness is called entropy by the physical scientists. Entropy is the subject
of one of the most fundamental of physical laws, the 2^{nd} Law of Thermodynamics. Claude
Shannon, with brilliant insight, saw this connection with information theory and called his
measure of information entropy as well. Before defining this measure, we need to make precise
the idea of what messages we are going to try to measure for information content.

We think of the source of our messages as a process that emits consecutive symbols from a
finite alphabet. Each symbol has a particular probability of being emitted at any given time.
These probabilities depend upon what has already been emitted. For instance, if our source is
producing English and the last two letters emitted were a "t" and an "h," then the probability
of the next letter being a "p" is very low while that for an "e" is much higher, but if the last
two letters were "o" and "o" then the probability of a "p" is higher than that of an "e." Such a
process is called a Markov process and may be classified by how much of the previous
history is needed to determine the probabilities of the next symbol to be emitted. Thus, a 4^{th}
order Markov process requires knowing the last 4 symbols before the probability of the next
symbol can be calculated. As a special case, a 0^{th} order Markov process assigns the
probabilities without reference to what has gone before. A property that we shall require of
our Markov process source is that it be ergodic. Ergodicity has a difficult technical definition,
but its meaning can be made clear. A process is said to be ergodic if almost all of its output
strings eventually have the same statistical properties. That is, after the process has run for a
while, any output string will have the same frequency counts and distribution patterns as any
other (with exceptions being so rare as to be disregarded). This assumption makes the
computational aspects of the Markov process tractable and there is some evidence from
cryptology that natural languages come close to being ergodic in nature. To build a source for
a natural language such as English we proceed as follows: We consider a series of ergodic
Markov sources of increasing order. As a 0^{th} order source we take as the probabilities for the
symbols the relative frequency of the letters in the language. For a 1^{st} order source we use
the relative frequency of letter pairs (digrams) together with the probabilities of the 0^{th} order
source to calculate the conditional probabilities (i.e., the probability that the next letter is a
"k" if the first letter is a "c" for instance) used in the 1^{st} order process. Then using the
relative frequency of trigrams we can construct a 2^{nd} order Markov process. Theoretically we
can use the statistics of the language to create higher and higher order Markov processes.
Now, passing to the limit as the order goes to infinity gives us an ergodic Markov process for
our natural language. It has been estimated that the limit is practically achieved around the
32^{nd} order process (i.e., letters more than 32 positions away have no discernible effect on the
choice of the next letter) for an English source.
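The step from digram frequencies to the conditional probabilities of a 1^{st} order process can be sketched in a few lines of Python. The toy corpus and names below are illustrative assumptions, not real tabulated English statistics:

```python
from collections import Counter

# Toy corpus standing in for tabulated English statistics.
text = "the theory of the thin thorn then at last"

# 0th order: relative frequencies of single symbols.
counts = Counter(text)
total = sum(counts.values())
p0 = {ch: n / total for ch, n in counts.items()}

# 1st order: conditional probabilities P(next | prev) from digram counts.
digrams = Counter(zip(text, text[1:]))

def p_next(prev, nxt):
    """Probability that nxt follows prev, estimated from digram counts."""
    row_total = sum(n for (a, _), n in digrams.items() if a == prev)
    return digrams[(prev, nxt)] / row_total if row_total else 0.0

# In this tiny sample, "h" is overwhelmingly likely after a "t" while "z"
# never follows it at all.
print(p_next("t", "h"), p_next("t", "z"))
```

With real digram tables in place of the toy corpus, `p_next` is exactly the conditional distribution a 1^{st} order source samples from.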

To make this discussion a little more concrete consider the following "approximations" to
the English language generated by Markov processes. In these examples we use a 27-letter
alphabet, the 26 English letters and a space. A 0^{th} order process with the outcomes
equiprobable (i.e., the probability of any letter appearing is 1/27) would give output like this:

XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD ZEWRTZYNSADXESYJRQY WGECIJJ OBVKRBQPOZBYMBUAWVLBTQCNIKFMP MKVUUGB M DM QASCJDGFOZYNX ZSDZLXIKUDA

A 0^{th} order process using the actual relative frequencies of the letters in English would give:

OCRO HLI RGWR NMIELWIS EU LL NBNESEBYATH EEI ALHENHTTPA OOBTTVA NAH BRL OR L RW NILI E NNSBATEI AI NGAE ITF NNR ASAEV OIE BAINTHA HYROO POER SETRYGAIETRWCO EHDUARU EU C FT NSREM DIY EESE F O SRIS R UNNASHOR

Notice how the "words" are about the right length and the proportion of vowels to consonants is more realistic. A 1^{st} order process with the probabilities calculated from the relative frequencies of digrams would give:

ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONSIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE

And here is a 2^{nd} order process based on trigram frequencies:

IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE RETAGIN IS REGOACTIONA OF CRE

While it is possible to continue in this vein to get higher order processes, the computational problem of determining the relative frequencies in English suffers from combinatorial explosion and becomes impractical. We can, however, get a glimpse of the higher order processes by using words instead of letters as the symbols for the process. Based on the relative frequencies of words in the English language we can get from a word 0^{th} order process:

REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE HAD MESSAGES BE THESE

And from a word 1^{st} order process:

THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHOEVER TOLD THE PROBLEM FOR AN UNEXPECTED

The basic frequencies used in the above examples are found in the literature; letter, digram and trigram frequencies, for example, have been tabulated by cryptologists. A word-level process of higher order can produce output such as:

THE BEST FILM ON TELEVISION TONIGHT IS THERE NO-ONE HERE WHO HAD A LITTLE BIT OF FLUFF

It is thus not a ridiculous approximation to regard a natural language, such as English, as a limit of some succession of Markov sources.
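A sketch of how such approximations can be generated: the short seed string below is a stand-in for real English frequency tables (an illustrative assumption), so the output only caricatures the samples above.

```python
import random
from collections import defaultdict

# A 1st order (digram-driven) letter source built from a small seed text.
seed = ("it is thus not a ridiculous approximation to regard a natural "
        "language as a limit of markov sources")

# table[prev] lists every observed successor of prev; choosing uniformly
# from that list reproduces the conditional digram frequencies P(next|prev).
table = defaultdict(list)
for a, b in zip(seed, seed[1:]):
    table[a].append(b)

def generate(n, start="t", rng=random.Random(0)):
    """Emit n symbols from the 1st order process."""
    out = [start]
    for _ in range(n - 1):
        succs = table.get(out[-1])
        out.append(rng.choice(succs) if succs else " ")  # fallback if stuck
    return "".join(out)

print(generate(60))
```

A larger seed, or counts over digrams of words rather than letters, gives the higher order word-level behavior described above.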

We now turn to the question of measuring information or uncertainty, i.e., the entropy of a source. Any reasonable measure H should satisfy the following requirements:

- 1. The measure should depend only on the probabilities of the output events.
  Thus, if we are dealing with a situation in which there are k possible events having probabilities of occurring equal to p_{1}, p_{2}, ..., p_{k}, then we are trying to define a function H(p_{1}, p_{2}, ..., p_{k}).
- 2. The function H should be continuous in each of its variables.
  Small changes in the probabilities should not cause our uncertainty to change very much.
- 3. In the special case of equiprobable events (each probability = 1/k), H should be a monotonically increasing function of k.
  Our uncertainty about the outcome of equiprobable events should increase if there are more events.
- 4. The entropy of a compound event should be the weighted sum of the entropies of its constituent simple events.
  The justification for this requirement is not unreasonable, but its chief effect is to make the function easily computable.

Shannon showed that the only functions satisfying these four requirements have the form

H(p_{1}, p_{2}, ..., p_{k}) = -K Σ_{i} p_{i} log p_{i}

for some positive constant K, and by adjusting this constant we may choose any base for the logarithms. Note that while these requirements seem reasonable, there are other sets of equally reasonable requirements that could give more flexibility in the form of this function, and other functions have been used in the literature.
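Taking H(p_{1}, ..., p_{k}) = -Σ p_{i} log_{2} p_{i} (the form these requirements force, up to the constant), a quick numerical check illustrates requirement 3 and the maximum-at-uniform behavior:

```python
from math import log2

def H(ps):
    """Shannon entropy -sum(p * log2(p)), skipping zero-probability events."""
    return -sum(p * log2(p) for p in ps if p > 0)

# Requirement 3: for k equiprobable events H = log2(k), increasing in k.
uniform = [H([1/k] * k) for k in (2, 4, 8, 16)]
print(uniform)  # [1.0, 2.0, 3.0, 4.0]

# For fixed k the maximum is attained at the uniform distribution: any
# skewed distribution on 4 events has entropy below log2(4) = 2.
print(H([0.7, 0.1, 0.1, 0.1]) < H([0.25] * 4))  # True
```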

We can now define the entropy of a 0^{th} order Markov process where the probability of the
appearance of the symbol i is p_{i} by:

The base 2 logarithms are fairly standard practice these days, but the choice is arbitrary. The
units of this measure are called *bits* (not to be confused with the term bit as it is used by
computer scientists - although, as we shall see below, in an important special case the two
concepts coincide). If natural logarithms had been used we would call the unit a *nat*. For base
10 (common) logarithms the unit is a* Hartley* (after R.V. Hartley who in 1928 suggested the
use of logarithms for the measure of information).

Consider some properties of this function. If one of the probabilities in the sum is 0 then we have introduced a 0 · ∞ form. This is dealt with either by taking the limit of the term (which is 0) or by restricting the sum to only those events that have positive probability.

The function takes its maximum value (for fixed k) iff all the probabilities are equal (try a little calculus), in which case the value of the entropy is log k. The function is always nonnegative and equals zero only in the case that one probability is 1 and the remaining are 0 (the sum of the probabilities must be 1). This just reflects the fact that there is no uncertainty in a sure thing.

In the special case that there are just two symbols (say 0 and 1), each with a probability of .5, the entropy of the process is 1 bit. Thus, a bit corresponds to the amount of information in a situation with two equally likely outcomes. It is here that the information theoretic bit and the computer scientist's bit coincide (when the need arises we can call the comp. sci. term a binit), but if the probabilities are changed then a binit will contain less than a bit of information.
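The bit/binit distinction can be seen numerically. This minimal sketch computes the entropy of a two-symbol source:

```python
from math import log2

def h2(p):
    """Entropy of a two-symbol source with probabilities p and 1 - p."""
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)

print(h2(0.5))            # 1.0   -> a fair binit carries exactly one bit
print(round(h2(0.9), 3))  # 0.469 -> a biased binit carries less than a bit
print(h2(1.0) == 0)       # True  -> no uncertainty in a sure thing
```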

We can use property 4 to extend the definition of entropy to higher order Markov processes.
For an m^{th} order process, the probabilities can be computed if we know the previous m
outputs. Thus we can calculate the entropy using the above formula for each string of m
symbols and then sum these entropies weighted by the probability that that particular string of
m symbols appears. This will give us the entropy of the m^{th} order process. A numerical
example should make this clear. Suppose that we have a two symbol alphabet (0 and 1) and
a 1^{st} order Markov process where the probability of a 0 following a 0 is 1/2 but following a
1 is 1/3. We can calculate from this that the probability of a 0 is 2/5 (and so, for a 1 would
be 3/5). Given a 0, the entropy for the next symbol would be

H_{0} = -( .5 log(.5) + .5 log(.5)) = - ( .5(-1) + .5(-1)) = - (-1) = 1

and given a 1 we have:

H_{1} = -((1/3)log(1/3) + (2/3)log(2/3)) = -((1/3)(-1.585) + (2/3)(-0.585))

= .528 + .390 = .918

The entropy for this 1^{st} order process is thus

H = .4 H_{0} + .6 H_{1}

H = (.4)(1) + (.6)(.918) = .951 bits/letter.
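The hand calculation above can be checked mechanically. This sketch carries full precision, so rounded hand arithmetic may land a few thousandths lower:

```python
from fractions import Fraction
from math import log2

def H(ps):
    """Entropy -sum(p * log2(p)) of a probability distribution."""
    return -sum(p * log2(p) for p in ps if p > 0)

# Transition probabilities from the text: P(0|0) = 1/2 and P(0|1) = 1/3.
p00 = Fraction(1, 2)
p10 = Fraction(1, 3)

# The stationary probability of a 0 solves p0 = p0*P(0|0) + (1 - p0)*P(0|1).
p0 = p10 / (1 - p00 + p10)
print(p0)  # 2/5

H0 = H([1/2, 1/2])  # entropy of the next symbol after a 0
H1 = H([1/3, 2/3])  # entropy of the next symbol after a 1
H_process = float(p0) * H0 + float(1 - p0) * H1
print(round(H1, 3), round(H_process, 3))  # 0.918 0.951
```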

For a fixed alphabet, the entropies of higher order processes form a decreasing sequence which, being bounded from below (by 0), has a limit. This limit is the entropy of a natural language modeled as the limit of Markov processes. Although clearly defined, there is no effective way to use this definition to compute the entropy of, say, English. Various attempts to approximate this entropy have placed its value at about 1 bit per letter.

It should be noted that entropy is not a measure that can be applied to individual messages; it
is a statement about the information rate of a source and so refers to all messages coming
from that source. Also, remember the reciprocal relationship between information and
uncertainty: the lower the entropy, the higher the informational content.

Finally, we return to the concept of redundancy. For a given alphabet (with k symbols), the maximum entropy is obtained from a 0^{th} order process with equiprobable symbols, and equals log k. For a source with entropy H we then define:

**Redundancy** = 1 - (H/log k).

With this measure we see that a 0^{th} order process on two letters with equal probabilities (i.e.,
bit strings) has redundancy 0 (H = 1, log 2 = 1) as we mentioned earlier. English would have
a redundancy of about .75 (taking H = 1 and log 27 ~ 4), or 75%. A word of caution about
this figure: while it is true that the language can be compressed to about 1/4 of its size
without loss of meaning, this compression has to be done carefully because of the way
redundancy has been built into the language. A simple random removal of 3/4 of a message
will not generally leave enough to be comprehensible.
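The redundancy formula is easy to evaluate directly; note that the exact log_{2} 27 ≈ 4.75 gives a slightly higher figure than the rounded estimate of 75%:

```python
from math import log2

def redundancy(H, k):
    """Redundancy = 1 - H / log2(k) for a source of entropy H on k symbols."""
    return 1 - H / log2(k)

# Equiprobable bits: H = 1 on a 2-symbol alphabet, so no redundancy.
print(redundancy(1.0, 2))             # 0.0

# English: H taken as ~1 bit/letter on 27 symbols. With the exact
# log2(27) this is ~0.79; the text's log 27 ~ 4 rounds it to 75%.
print(round(redundancy(1.0, 27), 2))  # 0.79
```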