LessWrong post: the Telephone Theorem
What, exactly, can you know about the world? That’s a big and difficult question, but one practical way of approaching it is to start by considering issues of communication: when are facts able to travel through the world without becoming irreversibly changed or corrupted before they reach someone who wants to learn them?
One Rationalist takes this approach, asking how information is (or is not) preserved when it is passed through a sequence of many people/stages/channels/etc. They reach the following conclusion, which they say can be proven mathematically:
when information is passed through many layers, one after another, any information not nearly-perfectly conserved through nearly-all the “messages” is lost.
They call this the “telephone theorem”. They also state it “more formally”:
The only way that [message] Mₙ₊₁ can contain exactly the same information as Mₙ about M₀ is if:
There's some functions fₙ, fₙ₊₁ for which fₙ(Mₙ) = fₙ₊₁(Mₙ₊₁) with probability 1; that's the deterministic constraint.
The deterministic constraint carries all the information about M₀ - i.e. P[M₀|Mₙ] = P[M₀|fₙ(Mₙ)]. (Or, equivalently, the mutual information of M₀ and Mₙ equals the mutual information of M₀ and fₙ(Mₙ).)
The above is actually sort of true, but not obviously so - it's a convoluted and incomplete way of describing basic concepts in information theory. A clearer and more general way of phrasing it is the following: if M₀, Mₙ are random variables and Mₙ₊₁ = g(Mₙ) for some function g, then I(M₀;Mₙ) = I(M₀;Mₙ₊₁) whenever g is invertible (here I(X;Y) is the mutual information between X and Y). The first bullet point of the “formal telephone theorem” follows directly from this by setting fₙ = Identity and fₙ₊₁ = g⁻¹. The second bullet point doesn't really mean anything.
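To make that concrete (this is my own toy example, nothing from the post): take a pair of correlated random variables, apply an invertible function to one of them, and the mutual information doesn't budge; apply a lossy function and it can vanish. A minimal sketch in Python:

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits, given a dict {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def apply_to_y(joint, g):
    """Joint distribution of (X, g(Y)), from the joint of (X, Y)."""
    out = {}
    for (x, y), p in joint.items():
        out[(x, g(y))] = out.get((x, g(y)), 0.0) + p
    return out

# X and Y share their first bit; Y's second bit is independent noise.
joint = {((a, b), (a, c)): 1 / 8
         for a in (0, 1) for b in (0, 1) for c in (0, 1)}

swap = lambda y: (y[1], y[0])  # invertible: just relabels Y's values
drop = lambda y: y[1]          # lossy: throws away the shared bit

print(mutual_information(joint))                    # -> 1.0
print(mutual_information(apply_to_y(joint, swap)))  # -> 1.0 (preserved)
print(mutual_information(apply_to_y(joint, drop)))  # -> 0.0 (destroyed)
```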
My rephrasing is a standard and widely known fact. It doesn't meaningfully answer the questions that the “telephone theorem” author is trying to address. In colloquial terms it's equivalent to saying “you don't lose any information when you edit a document in such a way that you know how to undo the changes you made”, which is obvious. Worst of all, the proof that the “telephone theorem” author offers is unnecessarily complicated, and I can't even tell whether it's correct: it's poorly motivated, it uses nonstandard notation, and it relies on another “lemma” for which the author cites himself elsewhere on LessWrong.
This isn’t the first time that someone has tried to figure out the conditions under which messages can be transmitted from one place to another with minimal error.
Here’s a different theorem, called the “noisy channel coding theorem”, which was first proven in the 1940s:
for any given degree of noise contamination of a communication channel, it is possible to communicate discrete data (digital information) nearly error-free up to a computable maximum rate through the channel.
It bears a kind of eerie resemblance to the plain-English version of the “telephone theorem” above. That is because the noisy channel coding theorem completely subsumes the “telephone theorem”, while also being far more general and actually offering clear, easily quantifiable prescriptions about the conditions under which messages can be transmitted without errors.
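To give a flavour of what “easily quantifiable” means (my own illustration, not from either post): for the textbook binary symmetric channel that flips each bit with probability p, the maximum nearly-error-free rate is simply C = 1 − H(p), where H is the binary entropy function. A few lines of Python:

```python
import math

def binary_entropy(p):
    """H(p) in bits for a coin with heads-probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Max bits per channel use that can be sent nearly error-free."""
    return 1 - binary_entropy(p)

for p in (0.0, 0.01, 0.11, 0.5):
    print(f"crossover {p:.2f}: capacity {bsc_capacity(p):.3f} bits/use")
# At p = 0.5 the capacity is exactly 0: nothing gets through at all.
```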
I’m sure that all of this has been pretty obscure for people who don’t have a particular kind of STEM background, and so I would like to strenuously emphasize the following: the noisy channel coding theorem is extremely famous. It is the basis for all modern communications technology. It is a fundamental fact about what can be communicated by one intelligent entity to another. People who do information theory professionally would agree that the noisy channel coding theorem has less cultural significance than Newton’s laws or special relativity, but they’d have to stop and think for a moment to figure that out, and they might follow up by telling you that its lack of cultural status is an injustice. A thousand years from now people will still speak of the noisy channel coding theorem in tones of awe and reverence.
The concept of mutual information is closely related to the noisy channel coding theorem. The “telephone theorem” LessWrong post talks repeatedly about mutual information, and even cites the Wikipedia article about it. Mutual information was first defined by Claude Shannon, the extremely famous original author of the noisy channel coding theorem, and he defined it specifically for the purpose of proving that theorem. The same paper also introduced the word “bit” (a coinage Shannon credited to John Tukey), as in the fundamental unit of measure for information that everyone is familiar with from e.g. “kilobits” or “megabits per second”.
It is truly astonishing to me that the “telephone theorem” author managed to write that entire post the way that they did without mentioning the noisy channel coding theorem even once. The mind boggles.
Someone who is trying to unravel the mystery of human intelligence is probably going to feel a bit bored if they have to read about things like encoding schemes for noise-robust wireless networks. That also happens to be the kind of context in which the noisy channel coding theorem is most likely to be discussed. It’s a completely general theorem that applies to any kind of information transmission under any circumstances, but you wouldn’t necessarily figure that out just by skimming wikipedia articles. And if you’ve managed to convince someone to pay you to be an “independent researcher” then there’s a very real risk that you’ll never find out that skimming wikipedia articles has led you astray.
Skimming wikipedia articles is bad enough, but god help you if you try to learn about this stuff from LessWrong posts. I came across the “telephone theorem” when I saw this LessWrong post, in which the author speculates about the possibility that the “telephone theorem” implies the existence of arbitrarily powerful general learning algorithms (it does not).
A comment on that same post suggests that the “telephone theorem” is additionally an explanation for the manifold hypothesis (it is not) and, delightfully, it further asserts that solid objects have color because of the spontaneous breaking of the Lorentz symmetry (they do not).
Rationalists seem to love the idea of information theory, but generally seem to be quite bad at actually engaging with it as it exists in the world today.
Ah yes, autodidacts, reinventing the wheel from first principles because teachers can’t teach someone as brilliant and special as me, maaaaaaan.
I’m skimming here, but did someone just claim to reinvent Information Theory? One of those things taught to all electrical engineers the world over?
And they called it - you know what? Nevermind.
God it’s so fun when they make actual substantive mathematical claims, but for some reason it happens so rarely
The post references the data processing inequality, so they’re not trying to reinvent it.
On the other hand I don’t really see how the post has any meat to it. If you unravel what they mean by “conserved quantities” this is just a tautology: the information that makes it through a channel is conserved by the channel. If we pass through many copies of the same or similar channels, then the conserved information over the long run must be roughly deterministically propagated. The language of “conserved quantities” seems to be confusing several people into thinking that this has some strong analogy with conserved quantities in physics, but there’s not really a reason for any link. One is conservation over time within the system itself, the other is conservation over some spatial propagation away from the system. Physically non-conserved quantities can be easily propagated over distance (we can all watch movies) and conserved quantities over time can easily be blocked from view.
The reason he's talking in terms of this other function fₙ is that he wants to structure the argument differently: he wants to compute the set of conserved quantities at the source (the image of fₙ) that will propagate in this way, given some knowledge of the rest of the network, analogous to how you can derive energy as a conserved quantity without simulating the system. The primary interest is in using the nature of the channel to determine the set of possible fₙ's that could work. But the whole idea is toothless unless you also have a non-tautologous way of figuring that out.
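As a sketch of why the long-run picture is tautological (my own toy model, with made-up parameters): give each hop one bit it copies perfectly and one bit it flips with probability q, and the mutual information with the original message decays to exactly the perfectly-copied part, because of course it does:

```python
import math

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

q = 0.1  # per-hop flip probability on the noisy bit
for n in (1, 5, 20, 100):
    # After n hops the noisy bit has been net-flipped with probability p_n.
    p_n = (1 - (1 - 2 * q) ** n) / 2
    # Conserved bit contributes 1 bit; noisy bit contributes 1 - H(p_n).
    info = 1 + (1 - binary_entropy(p_n))
    print(f"{n:3d} hops: I(M0; Mn) = {info:.3f} bits")
# Tends to exactly 1.000 bit: the deterministically propagated part.
```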
Let’s take a look at how much math is done correctly.
First off, what are the Mₙ? They are messages, sure, but “message” isn't a mathematically defined object. Are they random variables? Are they events in some probability space?
Secondly, what are the fₙ? They are deterministic functions, which is already weird, because in probability you talk about random variables, which aren't deterministic. (That's the whole point of doing probability.) And what kind of functions are the fₙ? All he has done up to this point is rewrite P[M₀|Mₙ] = P[M₀|Mₙ₊₁] in more words.
Now maybe the Appendix is a bit illuminating.
Nope, we don’t get a precise statement of what Mₙ or fₙ are. Let’s look at the jargon pretending to be a proof.
Both of the factorisations he writes down are false. The correct factorisations are P[M₂,M₁,X] = P[X|M₂,M₁] P[M₂|M₁] P[M₁] and P[M₂,M₁,X] = P[X|M₂,M₁] P[M₁|M₂] P[M₂].
Yeah, no shit you end up with very wrong statements once you've used the wrong factorisation. Had you written the correct P[X|M₁,M₂], you would see that you aren't getting anything new.
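If anyone doubts the chain rule here, a sanity check takes ten lines (toy numbers of my own; any normalised table works): both correct factorisations reproduce any joint distribution identically, which is exactly why they tell you nothing new:

```python
import itertools

# P[X, M1, M2] as a table over binary triples (weights sum to 1).
joint = dict(zip(itertools.product((0, 1), repeat=3),
                 (0.05, 0.10, 0.15, 0.20, 0.05, 0.15, 0.10, 0.20)))

def marginal(keep):
    """Marginalise the joint down to the coordinates in `keep`."""
    out = {}
    for k, p in joint.items():
        key = tuple(k[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

p_m1, p_m2, p_m1m2 = marginal([1]), marginal([2]), marginal([1, 2])

for (x, m1, m2), p in joint.items():
    pm = p_m1m2[(m1, m2)]
    lhs = (p / pm) * (pm / p_m1[(m1,)]) * p_m1[(m1,)]  # P[X|M2,M1] P[M2|M1] P[M1]
    rhs = (p / pm) * (pm / p_m2[(m2,)]) * p_m2[(m2,)]  # P[X|M2,M1] P[M1|M2] P[M2]
    assert abs(lhs - p) < 1e-12 and abs(rhs - p) < 1e-12
print("both factorisations reproduce the joint exactly")
```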
NOOO! Why would different messages contain the same information about X?? What if my M₂ is just “Shut the fuck up”? Why would this contain the same information about something as another M₁? Unless they both contain no information. And WHAT IS X?? Define your variables ffs.
Suffice it to say, there is nothing new here, and what is there is completely wrong.
This is what happens when you dismiss experts and then try to reinvent everything from scratch because you are smarter than everyone else. At some point they are literally going to reinvent the wheel.
this seems closer (brief skim and i’m tired) to what is called the “data-processing inequality”: https://en.wikipedia.org/wiki/Data_processing_inequality
in fact it seems like a reskin/retread/reinvention of it in shittier notation.
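if anyone actually wants to see the data-processing inequality in action, here's a quick empirical sketch (my own toy numbers, nothing from the post): make a Markov chain X -> Y -> Z by processing Y without looking at X, and I(X;Z) can't beat I(X;Y):

```python
import math
import random

random.seed(0)

def mutual_information(pairs):
    """Empirical I(A;B) in bits from a list of (a, b) samples."""
    n = len(pairs)
    pab, pa, pb = {}, {}, {}
    for a, b in pairs:
        pab[(a, b)] = pab.get((a, b), 0) + 1 / n
        pa[a] = pa.get(a, 0) + 1 / n
        pb[b] = pb.get(b, 0) + 1 / n
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in pab.items())

xs = [random.randint(0, 3) for _ in range(100_000)]
ys = [x if random.random() < 0.9 else random.randint(0, 3) for x in xs]
zs = [y % 2 for y in ys]  # further (lossy) processing of Y alone

print(mutual_information(list(zip(xs, ys))))  # larger (~1.5 bits)
print(mutual_information(list(zip(xs, zs))))  # smaller (~0.7 bits): DPI
```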
Shut the fuck up nerd
OK, but you have to admit that noisy channel coding theorem is a stupid name.