r/SneerClub archives
The "Telephone theorem": a convoluted, and probably incorrect, Rationalist attempt at reinventing extremely famous mathematics (https://www.reddit.com/r/SneerClub/comments/114g1gu/the_telephone_theorem_a_convoluted_and_probably/)
110

LessWrong post: the Telephone Theorem

The Telephone Theorem

What, exactly, can you know about the world? That’s a big and difficult question, but one practical way of approaching it is to start by considering issues of communication: when are facts able to travel through the world without becoming irreversibly changed or corrupted before they reach someone who wants to learn them?

One Rationalist takes this approach, asking how information is (or is not) preserved when it is passed through a sequence of many people/stages/channels/etc. They reach the following conclusion, which they say can be proven mathematically:

when information is passed through many layers, one after another, any information not nearly-perfectly conserved through nearly-all the “messages” is lost.

They call this the “telephone theorem”. They also state it “more formally”:

The only way that [message] Mₙ₊₁ can contain exactly the same information as Mₙ about M₀ is if:

  1. There’s some functions fₙ, fₙ₊₁ for which fₙ(Mₙ) = fₙ₊₁(Mₙ₊₁) with probability 1; that’s the deterministic constraint.

  2. The deterministic constraint carries all the information about M₀ - i.e. P[M₀|Mₙ] = P[M₀|fₙ(Mₙ)]. (Or, equivalently, mutual information of M₀ and Mₙ equals mutual information of M₀ and fₙ(Mₙ).)

The above is actually sort of true, but not obviously so - it’s a convoluted and incomplete way of describing basic concepts in information theory. A clearer and more general way of phrasing it is the following: if M₀, Mₙ are random variables and Mₙ₊₁ = g(Mₙ) for some function g, then I(M₀;Mₙ) = I(M₀;Mₙ₊₁) whenever g is invertible (I(X;Y) here is the mutual information between X and Y). The first bullet point of the “formal telephone theorem” follows directly from this by setting fₙ = Identity and fₙ₊₁ = g⁻¹. The second bullet point doesn’t really mean anything.
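To make the invertibility point concrete, here's a minimal sketch in Python (my own toy numbers, nothing from the original post) that computes mutual information directly from a joint PMF and checks that an invertible relabeling of Mₙ preserves I(M₀;Mₙ) while a lossy map destroys it:

```python
from collections import defaultdict
from math import log2

def mutual_information(joint):
    """I(A;B) in bits, computed from a dict {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def apply_to_second(joint, g):
    """Joint distribution of (A, g(B)), given the joint distribution of (A, B)."""
    out = defaultdict(float)
    for (a, b), p in joint.items():
        out[(a, g(b))] += p
    return dict(out)

# Toy joint distribution for (M0, Mn): Mn is a noisy copy of M0.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

invertible = lambda b: 1 - b   # relabel 0 <-> 1: invertible, so no information lost
lossy      = lambda b: 0       # constant map: throws everything away

print(mutual_information(joint))                               # ~0.278 bits
print(mutual_information(apply_to_second(joint, invertible)))  # same ~0.278 bits
print(mutual_information(apply_to_second(joint, lossy)))       # 0.0 bits
```

Nothing here depends on the particular numbers: any invertible relabeling of the second coordinate leaves the mutual information unchanged.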

My rephrasing is a standard and widely known fact. It doesn’t meaningfully answer the questions that the “telephone theorem” author is trying to address. In colloquial terms it’s equivalent to saying “you don’t lose any information when you edit a document in such a way that you know how to undo the changes that you made”, which is obvious. Worst of all, the proof that the author of the “telephone theorem” offers is unnecessarily complicated, and I can’t even tell if it’s correct; it’s poorly motivated, it uses some nonstandard notation, and it relies on another “lemma” for which the author cites himself elsewhere on LessWrong.

The Extremely Famous Math

This isn’t the first time that someone has tried to figure out the conditions under which messages can be transmitted from one place to another with minimal error.

Here’s a different theorem, called the “noisy channel coding theorem”, which was first proven in the 1940s:

for any given degree of noise contamination of a communication channel, it is possible to communicate discrete data (digital information) nearly error-free up to a computable maximum rate through the channel.

It bears a kind of eerie resemblance to the plain English version of the “telephone theorem” above. That is because the noisy channel coding theorem completely subsumes the “telephone theorem”, but it is also far more general, and it actually offers clear, easily quantifiable prescriptions about the conditions under which messages can be transmitted without errors.
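To give a sense of what "computable maximum rate" means in practice, here's a minimal sketch (my own illustration using the standard textbook formula, not something from either post) for the binary symmetric channel, whose capacity has the closed form C = 1 - H₂(p):

```python
from math import log2

def binary_entropy(p):
    """H2(p) in bits: the entropy of a coin that lands heads with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def bsc_capacity(p):
    """Shannon capacity of a binary symmetric channel with flip probability p."""
    return 1.0 - binary_entropy(p)

for p in (0.0, 0.01, 0.11, 0.5):
    print(f"flip probability {p:.2f}: capacity {bsc_capacity(p):.3f} bits per channel use")
```

The theorem says you can communicate with arbitrarily small error probability at any rate below that capacity by coding over many channel uses, and at no rate above it - which is exactly the kind of quantitative prescription the "telephone theorem" never gets anywhere near.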

I’m sure that all of this has been pretty obscure for people who don’t have a particular kind of STEM background, and so I would like to strenuously emphasize the following: the noisy channel coding theorem is extremely famous. It is the basis for all modern communications technology. It is a fundamental fact about what can be communicated by one intelligent entity to another. People who do information theory professionally would agree that the noisy channel coding theorem has less cultural significance than Newton’s laws or special relativity, but they’d have to stop and think for a moment to figure that out, and they might follow up by telling you that its lack of cultural status is an injustice. A thousand years from now people will still speak of the noisy channel coding theorem in tones of awe and reverence.

The concept of mutual information is closely related to the noisy channel coding theorem. The “telephone theorem” LessWrong post talks repeatedly about mutual information, and even cites the Wikipedia article about it. Mutual information was first defined by Claude Shannon, who is the extremely famous original author of the noisy channel coding theorem. He defined the concept of mutual information specifically for the purpose of proving the noisy channel coding theorem. In the same paper he also introduced the word “bit”, as in the fundamental unit of measure for information that everyone is familiar with, as in e.g. “kilobits” or “megabits per second”.

It is truly astonishing to me that the “telephone theorem” author managed to write that entire post the way that they did without mentioning the noisy channel coding theorem even once. The mind boggles.

So, wait…what went wrong here?

Someone who is trying to unravel the mystery of human intelligence is probably going to feel a bit bored if they have to read about things like encoding schemes for noise-robust wireless networks. That also happens to be the kind of context in which the noisy channel coding theorem is most likely to be discussed. It’s a completely general theorem that applies to any kind of information transmission under any circumstances, but you wouldn’t necessarily figure that out just by skimming Wikipedia articles. And if you’ve managed to convince someone to pay you to be an “independent researcher” then there’s a very real risk that you’ll never find out that skimming Wikipedia articles has led you astray.

Skimming Wikipedia articles is bad enough, but god help you if you try to learn about this stuff from LessWrong posts. I came across the “telephone theorem” when I saw this LessWrong post, in which the author speculates about the possibility that the “telephone theorem” implies the existence of arbitrarily powerful general learning algorithms (it does not).

A comment on that same post suggests that the “telephone theorem” is additionally an explanation for the manifold hypothesis (it is not) and, delightfully, it further asserts that solid objects have color because of the spontaneous breaking of the Lorentz symmetry (they do not).

Rationalists seem to love the idea of information theory, but generally seem to be quite bad at actually engaging with it as it exists in the world today.

That would require actual work and rigorous reasoning, though. Not to mention coming to terms with the fact that they're not the super geniuses they think they are. Being braggadocious and condescending about their half-baked ideas on Twitter is so much more fun and convenient.
I think it's a step further into dishonesty: I'm completely willing to acknowledge I'm a dumbass about information theory, but I also don't act as if I were anything else.

[deleted]

Oh boy it's a bit messier than I thought. I assumed the same thing that you did - that he's talking about something having to do with how the brain interprets information from the eyes or whatever - but this might be a much more prototypical overconfident undergrad thing. He knows just enough to get it weirdly wrong, but he doesn't know what he doesn't know and so he's gotten way ahead of himself. In a crystalline solid the color we see largely comes from the size of the separations in the energy levels of the electrons in that solid, which in turn is determined, in part, by the structure of the crystal, which in turn can be described by a symmetry group. This symmetry group is only a subset of all possible symmetries though, so you can say that symmetry breaking has occurred. But it's kind of asinine to say that color comes from symmetry breaking in this way, and it's also generally wrong. Even in the context of crystalline solids it's a lot more correct to say that the symmetries of the crystal make it easier to use the properties of the individual atoms to predict color. It's the symmetries that *do* exist that matter, not the symmetries that have been broken. And of course none of this applies to noncrystalline solids. It's not even true that color is always determined by electron energy levels; the color of anodized titanium is famously determined by the thickness of its oxide layer and not by any of the properties of titanium atoms themselves or their crystal structure.

Ah yes, autodidacts, reinventing the wheel from first principles because teachers can’t teach someone as brilliant and special as me, maaaaaaan.

if only they could solve one (1) problem...
I would probably qualify as an autodidact---I've mostly taught myself mathematics (I talk about it frequently as one can see in my comment history) and I have some results that a recently-minted PhD has told me definitely seem interesting enough to publish---but I would absolutely never refer to myself with that word because of people like this. They like the shiny word, but don't want to actually go through the arduous process of learning things and dealing with the important but mundane and tedious details that are ever-present at the high level of any subject.
I'm also taking a stab here and assuming part of your self-learning was reading what people in the field have published and reading commentary on those publications, rather than attempting to recreate the entire field of mathematics whole cloth from first principles, out of an earnest belief that nobody else could possibly tell you anything you haven't thought of first.
Oh I'm truly awful at actually reading things through, so it's been a lot of gleaning things from the internet (but from *good* sources online, along with other people studying math I can ask questions). But more to the point you're making: yes, I'm actually interested in learning what people have already done, because people way smarter than me have already had brilliant insights over many generations to get where we are today. Starting from scratch would only set me back centuries.

I’m skimming here, but did someone just claim to reinvent Information Theory? One of those things taught to all electrical engineers the world over?

And they called it - you know what? Nevermind.

God it’s so fun when they make actual substantive mathematical claims, but for some reason it happens so rarely

The post references the data processing inequality, so they’re not trying to reinvent it.

On the other hand I don’t really see how the post has any meat to it. If you unravel what they mean by “conserved quantities” this is just a tautology: the information that makes it through a channel is conserved by the channel. If we pass through many copies of the same or similar channels, then the conserved information over the long run must be roughly deterministically propagated. The language of “conserved quantities” seems to be confusing several people into thinking that this has some strong analogy with conserved quantities in physics, but there’s not really a reason for any link. One is conservation over time within the system itself, the other is conservation over some spatial propagation away from the system. Physically non-conserved quantities can be easily propagated over distance (we can all watch movies) and conserved quantities over time can easily be blocked from view.

The reason he’s talking in terms of this other function f_n is that he wants to structure the argument differently: he wants to compute the set of conserved quantities at the source (the image of f) which will propagate in this way, given some knowledge of the rest of the network, analogous to how you can derive energy as a conserved quantity without simulating the system. The primary interest is using the nature of the channel to determine the set of possible f’s that could work. But the whole idea is toothless unless you also have a non-tautologous way of figuring that out.
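To make the "conserved quantity" picture concrete, here's a minimal sketch (a toy channel and numbers I made up, not the post's construction): a channel that passes one bit of the message through untouched - that bit is the conserved quantity - and adds noise to the other bit. Iterating the channel, the information carried by the conserved bit persists while everything else washes out:

```python
from collections import defaultdict
from math import log2

def mutual_information(joint):
    """I(A;B) in bits, computed from a dict {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

FLIP = 0.3  # per-hop noise on the first bit; the second bit is the conserved quantity

def one_hop(joint):
    """Push the joint distribution of (M0, Mn) through one more use of the channel."""
    out = defaultdict(float)
    for (m0, (noisy_bit, kept_bit)), p in joint.items():
        out[(m0, (noisy_bit, kept_bit))] += p * (1 - FLIP)   # first bit survives
        out[(m0, (1 - noisy_bit, kept_bit))] += p * FLIP     # first bit flips
    return dict(out)

# M0 is two uniform bits; the chain starts with Mn = M0, so I(M0;Mn) = 2 bits.
joint = {((b1, b0), (b1, b0)): 0.25 for b1 in (0, 1) for b0 in (0, 1)}

for hop in range(0, 21, 5):
    print(f"after {hop:2d} hops: I(M0;Mn) = {mutual_information(joint):.3f} bits")
    for _ in range(5):
        joint = one_hop(joint)
```

I(M0;Mn) falls from 2 bits toward 1 bit: whatever isn't carried by the conserved quantity is eventually lost, which is the telephone-game behavior, and it follows from nothing deeper than the channel being the same at every hop.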

I say that they're attempting to reinvent things because the data processing inequality doesn't actually answer the questions that I think the author is trying to address. The data processing inequality tells us that information is either conserved or lost in transmission, but the author seems to be trying to figure out what you can actually learn about the original sent message. These are related but different issues, and the noisy channel coding theorem is the one that addresses what the author is trying to get at: it tells us how much the receiver can learn about the messages from the sender even in the case that information is lost with each transmitted message. I don't think it's even accurate to characterize what they're doing as "using the nature of the channel to determine the set of possible f's that could work"; that's probably too generous. If they knew that the specific details of the channel are what determine how information is transmitted through it then presumably they wouldn't have written any of this to begin with! I think the whole problem is that they don't actually have the mental picture that goes along with the noisy channel coding theorem, which is why they end up chasing their own tail and getting nowhere.
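The contrast is easy to make quantitative. Here's a minimal sketch (my own, with a made-up per-hop flip probability): a chain of n binary symmetric channels is itself a binary symmetric channel whose flip probability drifts toward 1/2, so the data processing inequality only says that end-to-end information shrinks, while the coding theorem gives the exact rate at which the receiver can still learn about the source at every chain length:

```python
from math import log2

def binary_entropy(p):
    """H2(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def cascade_flip_prob(p, hops):
    """Effective flip probability of `hops` binary symmetric channels in series."""
    return 0.5 * (1.0 - (1.0 - 2.0 * p) ** hops)

P = 0.05  # per-hop flip probability (made up for illustration)

for hops in (1, 5, 20, 100):
    p_end = cascade_flip_prob(P, hops)
    capacity = 1.0 - binary_entropy(p_end)
    print(f"{hops:3d} hops: end-to-end flip prob {p_end:.3f}, capacity {capacity:.4f} bits/use")
```

Capacity decays geometrically with chain length but stays positive for every finite chain, and the coding theorem says that rate is actually achievable with suitable coding over repeated uses.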

Let’s take a look at how much of the math is done correctly.

The only way that [message] Mₙ₊₁ can contain exactly the same information as Mₙ about M₀ is if: (1) There’s some functions fₙ, fₙ₊₁ for which fₙ(Mₙ) = fₙ₊₁(Mₙ₊₁) with probability 1; that’s the deterministic constraint. (2) The deterministic constraint carries all the information about M₀ - i.e. P[M₀|Mₙ] = P[M₀|fₙ(Mₙ)]. (Or, equivalently, mutual information of M₀ and Mₙ equals mutual information of M₀ and fₙ(Mₙ).)

First off, what are Mₙ? They are messages, sure, but “message” isn’t a mathematically defined quantity. Are they random variables? Are they events in some probability space?

Secondly, what are fₙ? They are deterministic functions, which is already weird because in probability you talk about random variables, functions that aren’t deterministic. (That’s the whole point of doing probability.) But what kind of functions are fₙ? All he has done up till here is just rewrite P[M₀|Mₙ] = P[M₀|Mₙ₊₁] in more words.

Now maybe the Appendix is a bit illuminating.

Nope, we don’t get a precise statement of what Mₙ or fₙ are. Let’s look at the jargon pretending to be a proof.

P[M₂, M₁,X] = P[X|M₁]P[M₁|M₂]P[M₂] and P[M₂,M₁,X] = P[X|M₂]P[M₂|M₁]P[M₁]

Both are false. The correct factorisations are P[M₂,M₁,X] = P[X|M₂,M₁]P[M₂|M₁]P[M₁] and P[M₂,M₁,X] = P[X|M₂,M₁]P[M₁|M₂]P[M₂].

First, we can equate the two factorizations and cancel P[M₁,M₂] from both sides to find P[X|M₁] = P[X|M₂]

Yeah, no shit you end up getting very wrong statements once you use the wrong factorisation. Had you written the correct P[X|M₁,M₂] then you would see that you aren’t getting anything new.

This makes sense: M₁ and M₂ contain the same information about X, so the distribution of X given M₁ is the same as the distribution of X given M₂.

NOOO! Why would different messages contain the same information about X?? What if my M₂ is just “Shut the fuck up”? Why would this contain the same information about something as another M₁? Unless they both contain no information. And WHAT IS X?? Define your variables ffs.

Suffice it to say, there is nothing new here, and what is here is completely wrong.

As the article gets to later, the intended scenario is to analyze information propagation through a series of nested Markov blankets. So the setup here is that the Mₙ are a collection of random variables in a chain, M₀ → M₁ → M₂ → M₃ → ..., where the arrows are Markov kernels. "Deterministic function" is used to refer to a Markov kernel that is a lift of a function on the underlying sample space, not a function on the random variable with a deterministic output.
I see, in this setting the concept makes sense, but from a quick search I wasn't able to find anyone defining Markov kernels as merely deterministic functions - though maybe that speaks more to my inexperience with the topic than to the accepted terminology. However, as the article goes on, he defines

> fₙ(Mₙ)(x) = P[X = x|Mₙ]

This doesn't seem to correspond to the notion you are referring to. At first glance, this cannot be defined everywhere, since X isn't defined everywhere and he has defined it in terms of X. Secondly, a Markov kernel is supposed to take a set as an input as well, whereas he takes a random variable (although, I guess, one might rectify this by replacing Mₙ with sets that are Mₙ-measurable, but I doubt he has that consideration in mind).
Some of the things you point out here as wrong are, in fact, perfectly fine. This is what I think is interesting about this post: it's actually kind of hard to sort out which parts are correct or incorrect because the author kind of understands some things, but they aren't explicit in describing what they're doing and they don't always use standard notation and terminology. It's quite common in probability to take deterministic functions of random variables, so it's fine if f_n is deterministic. But the most general theory of communication - the noisy channel coding theorem - also accommodates the case when f_n is stochastic, and in order to answer the questions that the author is trying to address, you really should be handling that case. *But it's not clear that the author understands that there is a distinction between deterministic or stochastic f_n*, and it's also not clear which case they intend to use!

Similarly, the factorizations they use are actually fine. It's valid to say e.g. P[M₂,M₁,X] = P[X|M₂]P[M₂|M₁]P[M₁], because this is what you get if you assume that X and M₁ are independent, which is certainly possible. But *the reason that they're using these factorizations* is what is suspect. In the best case scenario this is an overly complicated way of proving something that is actually very simple. In the worst case scenario it's a totally incorrect proof, but not because the factorizations themselves are invalid.
> It's quite common in probability to take deterministic functions of random variables, so it's fine if f_n is deterministic

Yes, it is fine to talk about deterministic functions, but as I state afterwards, the real issue is the absence of any further qualification on fₙ. Where these functions live and what qualifications they have is the real issue. Btw, I would disagree with the statement about commonality. It is very uncommon to use deterministic functions in probability. Take your favourite graduate probability book and ctrl+f the word "deterministic". Doing it for *Probability and Measure* by Billingsley gives you 0 results, as does *Probability: Theory and Examples* by Durrett. Doing it for *Stochastic Processes* by Bass gives 15 results in a 370-page book. Sure, in measure theory, when one talks about measurable functions, they are implicitly sort of treating deterministic functions, but no one will write that because it is a false statement. Our functions are (usually) taken from an Lp space, which is composed of equivalence classes of functions that are, in a sense, only defined up to a null set.

> Similarly, the factorizations they use are actually fine. It's valid to say e.g. P[M₂,M₁,X] = P[X|M₂]P[M₂|M₁]P[M₁], because this is what you get if you assume that X and M₁ are independent, which is certainly possible.

Well, I completely disagree. Just because an equality holds under a possible assumption doesn't mean it is *valid* to state the equality. For suppose I write X = X + X and so 0 = 1. With this information, I then produce an imprecise rant bashing the foundations of mathematics. Now is it *possible* that X = X + X? For sure, if X = 0. But am I being formal by writing X = X + X without stating my presumptions? No. The entire point of formalisation is to be absolutely certain about what we have to assume and what we can infer. Failure of either nullifies the benefits of formalisation. Being cautious of this assumption would have shown me that since X = 0, I can't divide by X to prove 0 = 1. Similarly, were he cautious of his assumption he wouldn't have made an error (and, most importantly, he would have realised that there is nothing new to be gained just from expanding a conditional probability). Not being critical when people use bogus formalisations to peddle bullshit ideas is how we got Terrence Howard to sell his ideas to Uganda.
A significant fraction of modern probability and statistics is about analyzing all of the nuances of the deterministic function f(x1,x2,...,xn) = (x1+x2+...+xn)/n under various assumptions on the (randomized) input. Nobody would refer to it in these terms, just like nobody refers to various extensions of this analysis in terms of deterministic functions (for example, concentration of Lipschitz functions on the sphere is a common extension; I did not say "deterministic" here, although said functions are deterministic).
Okay, in this case, I believe we have a difference of terminology. For me, a deterministic function of a random variable is a random variable which is defined everywhere. However, you and OP, I presume, are talking about *compositions of (deterministic) functions* with random variables. Under this context, I agree, it makes sense to talk about the concept, and I agree it is very common to find such constructions in a probability book. Regardless, this is very off-topic because, for one, the terminology of deterministic functions [isn't really used in probability](https://www.google.com/search?q=%22deterministic+function+of+random+variable%22&client=safari&rls=en&biw=1638&bih=1473&sxsrf=AJOqlzVc21ENIOUmsZQcx1-qcHQsCn7b0A%3A1676749974306&ei=lizxY4ieErCs5NoPtYK_mAs&ved=0ahUKEwiIweG07J_9AhUwFlkFHTXBD7MQ4dUDCA8&uact=5&oq=%22deterministic+function+of+random+variable%22&gs_lcp=Cgxnd3Mtd2l6LXNlcnAQAzIKCAAQRxDWBBCwAzIKCAAQRxDWBBCwAzIKCAAQRxDWBBCwAzIKCAAQRxDWBBCwAzIKCAAQRxDWBBCwA0oECEEYAFAAWABg4SNoAnAAeACAAQCIAQCSAQCYAQDIAQXAAQE&sclient=gws-wiz-serp) (hence the confusion) and secondly, the author didn't even define his deterministic fₙ's as the other concept we are talking about. His fₙ's are defined as fₙ(Mₙ)(x) = P[X = x|Mₙ], which isn't a composition of a deterministic function with a random variable. So in both senses, my definition and yours/OP's definition, the author's usage is incorrect.
It is used when discussing the data processing inequality though, where there are versions that explicitly model the randomness (phrased in terms of properties of Markov kernels) and simpler versions that don't. For particular kernels you can get stronger versions (generally called *strong data processing inequalities*) depending on particular details of the Markov kernel. See for example Neyman-Pearson, where one can prove the optimal hypothesis test (say for simple binary hypothesis testing) is given by a *deterministic* test, rather than some randomized test (at least under continuity assumptions on the log-likelihood ratio; without these assumptions you actually have to randomize over two tests for optimality).
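For what it's worth, here's a minimal sketch of that Neyman-Pearson point (toy distributions I made up): for simple binary hypothesis testing on a discrete alphabet, the optimal test just thresholds the likelihood ratio deterministically, and randomization only enters when the target false-alarm level falls between the values a deterministic threshold can attain:

```python
# Toy distributions (made up) for simple binary hypothesis testing on a discrete alphabet.
p0 = {"a": 0.5, "b": 0.3, "c": 0.2}   # null hypothesis H0
p1 = {"a": 0.2, "b": 0.3, "c": 0.5}   # alternative hypothesis H1

THRESHOLD = 1.0  # declare H1 when the likelihood ratio strictly exceeds this

def decide(x):
    """Deterministic likelihood-ratio test: return 1 to declare H1, else 0."""
    return 1 if p1[x] / p0[x] > THRESHOLD else 0

false_alarm = sum(p for x, p in p0.items() if decide(x) == 1)  # P(declare H1 | H0 true)
detection   = sum(p for x, p in p1.items() if decide(x) == 1)  # P(declare H1 | H1 true)
print(f"false-alarm rate {false_alarm:.2f}, detection rate {detection:.2f}")
```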
I see. Thanks for the added context. I mostly work in pure probability and analysis, so it is definitely a new-find for me. That said, I'll construe the absence of a comment on my last point as an agreement on that point.
Every textbook uses deterministic functions of random variables, they just don't bother to qualify them by calling them "deterministic". The fact that the outputs of deterministic functions of random variables are, themselves, also random variables is one of the first things that students learn. For example, pretty much every textbook has some variation of the following problem:

> Let X1 and X2 be the outcomes of rolling two fair dice. If Z = X1 + X2, what is the PMF of the random variable Z?

f(x1,x2) = x1+x2 is, of course, a deterministic function of x1 and x2.

Also, I promise that the factorizations are okay. If you're in any doubt you can have a look at the Wikipedia page for the data processing inequality, which uses these factorizations explicitly: [https://en.wikipedia.org/wiki/Data_processing_inequality](https://en.wikipedia.org/wiki/Data_processing_inequality)

I assume the author of the LessWrong post used these factorizations because they were inspired by that Wikipedia page, which they cite.
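Since the factorization question keeps coming up, here's a minimal sketch that checks it numerically on a toy Markov chain X -> M₁ -> M₂ of my own (made-up kernels, not the post's actual setup). The first factorization from the appendix follows from the chain structure alone; the second additionally needs M₂ to retain everything M₁ knows about X, which is exactly the "same information about X" condition under discussion:

```python
from itertools import product

# Binary alphabets; all numbers are made up for illustration.
p_x = {0: 0.5, 1: 0.5}                                    # source X
k1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}           # X  -> M1 (noisy hop)
k2_noisy = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.2, 1: 0.8}}     # M1 -> M2 (noisy hop)
k2_lossless = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}}  # M1 -> M2 (perfect relay)

def true_joint(k2):
    """Exact joint P[X, M1, M2] for the chain X -> M1 -> M2."""
    return {(x, m1, m2): p_x[x] * k1[x][m1] * k2[m1][m2]
            for x, m1, m2 in product((0, 1), repeat=3)}

def marginal(joint, keep):
    """Marginalize the joint down to the coordinates listed in `keep`."""
    out = {}
    for key, p in joint.items():
        sub = tuple(key[i] for i in keep)
        out[sub] = out.get(sub, 0.0) + p
    return out

def factorization_1(joint):
    """P[X|M1] P[M1|M2] P[M2], simplified to P[X,M1] P[M1,M2] / P[M1]."""
    p_xm1, p_m1m2, p_m1 = marginal(joint, (0, 1)), marginal(joint, (1, 2)), marginal(joint, (1,))
    return {(x, m1, m2): p_xm1[(x, m1)] * p_m1m2[(m1, m2)] / p_m1[(m1,)]
            for x, m1, m2 in product((0, 1), repeat=3)}

def factorization_2(joint):
    """P[X|M2] P[M2|M1] P[M1], simplified to P[X,M2] P[M1,M2] / P[M2]."""
    p_xm2, p_m1m2, p_m2 = marginal(joint, (0, 2)), marginal(joint, (1, 2)), marginal(joint, (2,))
    return {(x, m1, m2): p_xm2[(x, m2)] * p_m1m2[(m1, m2)] / p_m2[(m2,)]
            for x, m1, m2 in product((0, 1), repeat=3)}

def matches(candidate, truth):
    return all(abs(candidate[k] - truth[k]) < 1e-12 for k in truth)

for label, k2 in (("lossless second hop", k2_lossless), ("noisy second hop", k2_noisy)):
    joint = true_joint(k2)
    print(f"{label}: factorization 1 holds: {matches(factorization_1(joint), joint)}, "
          f"factorization 2 holds: {matches(factorization_2(joint), joint)}")
```

With a lossless second hop both factorizations reproduce the true joint; with a noisy second hop only the first does, so whether they're "okay" comes down to which conditional-independence assumptions you grant the author.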
> Every textbook uses deterministic functions of random variables, they just don't bother to qualify them by calling them "deterministic".

This isn't true. We define *measurable functions* of random variables. Can those functions be deterministic? Sure. The constant functions are an example of widely used deterministic functions. But, mostly, they are not deterministic. Like, think about it: if we mostly study deterministic mappings of random variables, why even use random variables in the first place? If the output is going to be deterministic, it will nullify any indeterminacy brought on by the concept of a random variable.

> f(x1,x2) = x1+x2 is, of course, a deterministic function of x1 and x2.

No!! f(x₁,x₂) is the *probability density/mass function* associated with the *distribution of the random variable*. Yes, this is a deterministic function of x₁,x₂, but x₁,x₂ aren't random variables. So this isn't an instance of a *deterministic function of a random variable*. Which is my point, given that's the context he used his fₙ in.

> Also, I promise that the factorizations are okay.

They are okay *given an assumption on the random variables.* Just like how X = X + X is okay *given X = 0.* Does this mean it is valid for me to say X = X + X without any qualification? Nope. The Wikipedia page does mention the assumption, but it's not an assumption he stated. And understanding the meaning of the assumption in this case would show him that deducing P[X|M₁] = P[X|M₂] is very trivial given X, M₁, M₂ are independent.
You gotta slow down friend, you're making the same mistakes as the people we're talking about.
If I have made a false statement, I would gladly rectify it if you can point it out. P.S: If I were you, I would revise that username.

This is what happens when you dismiss experts and then try to reinvent everything from scratch because you are smarter than everyone else. At some point they are literally going to reinvent the wheel.

this seems closer (brief skim and i’m tired) to what is called the “data-processing inequality”: https://en.wikipedia.org/wiki/Data_processing_inequality

in fact it seems like a reskin/retread/reinvention of it in shittier notation.

Shut the fuck up nerd

As much as we like making fun of incorrectly used technical terms and autodidacts reinventing the wheel, most of us actually do enjoy math and science when they are explained correctly.
Yeah, I thought they just copy and pasted a LW post, but it looks like they're actually analyzing it.
https://www.youtube.com/watch?v=2Y43lKngRQE

OK, but you have to admit that noisy channel coding theorem is a stupid name.