(comments)

原始链接: https://news.ycombinator.com/item?id=41037981

The limits of scientific progress stem from the growing complexity of new fields that demand broad expertise, making it hard for talented university students to master existing knowledge, let alone advance it. The advancement of knowledge is further hampered by falling standards for peer-reviewed research papers and PhD theses, driven by incentive structures that reward excessive paper production over intellectual depth. Addressing this requires new models and abstractions that simplify complex problems, analogous to the simplicity offered by programming languages, where sophisticated computer systems can be built without fully understanding the low-level details. Compared with technology, however, this kind of shift appears slow in the natural sciences, possibly because of resistance from gatekeepers protecting their status, funding, and traditions. Information theory offers a valuable tool for understanding both the objective and subjective aspects of knowledge. Cross-entropy quantifies the average surprise experienced upon observation, based on actual versus believed frequencies, providing a way to assess the relative accuracy of different belief systems. Improving the quality of the initial dataset or assumptions yields more accurate conclusions. Key factors affecting the accuracy of data analysis are the size and detail of the observed dataset; incomplete or insufficiently detailed data can obscure important patterns or relationships. Moreover, interpretations of entropy vary and can depend on context, underscoring the importance of carefully defining terms and concepts when communicating scientific ideas. The example of an unbroken versus a broken egg illustrates how a system's entropy reflects its organization or disorder: the unbroken egg has minimal entropy because it exists in a highly organized state, while the broken egg has maximal entropy because its shell fragments admit many possible arrangements. Likewise, in thermodynamics, entropy refers to the disorder or randomness within a system and is a fundamental concept for describing processes such as heat transfer, chemical reactions, and phase transitions. Finally, understanding entropy requires precise terminology and must take conservation laws, reversibility, and other relevant factors into account to describe natural phenomena accurately.

Related articles

Original text


A well known anecdote reported by Shannon:

"My greatest concern was what to call it. I thought of calling it 'information,' but the word was overly used, so I decided to call it 'uncertainty.' When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, 'You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage.'"

See the answers to this MathOverflow SE question (https://mathoverflow.net/questions/403036/john-von-neumanns-...) for references on the discussion whether Shannon's entropy is the same as the one from thermodynamics.



> Emil Kirkegaard is a self-described white nationalist

That's simply a lie.

> who thinks the age of consent is too high

Too high in which country? Such laws vary strongly, even by US state, and he is from Denmark. Anyway, this has nothing to do with the topic at hand.



In Spain it used to be as low as 13 a few decades ago; but that law was obviously written before the rural exodus from inner Spain into the cities (from the 60's to almost the 80's), when children from early puberty on had to work/help on the farm/fields or at home, and by age 14 they had far more duties and accountabilities than today. And yes, that yielded more maturity.

Thus, the law had to be fixed for more urban/civilized times, up to 16. Although depending on the age/mentality closeness (such as 15-19, as happened in a recent case), the young adult had their charges totally dropped.



He was really brilliant, made contributions all over the place in the math/physics/tech field, and had a sort of wild and quirky personality that people love telling stories about.

A funny quote about him from Edward “a guy with multiple equations named after him” Teller:

> Edward Teller observed "von Neumann would carry on a conversation with my 3-year-old son, and the two of them would talk as equals, and I sometimes wondered if he used the same principle when he talked to the rest of us."



Are there many von-Neumann-like multidisciplinaries nowadays? It feels like unless one is razor-sharp and fully into one field, one is not to be treated seriously by those who made careers in it (and who have the last word on it).



IMO they do exist, but the popular attitude that it's not possible anymore is the issue, not a lack of genius. If everyone has a built in assumption that it can't happen anymore, then we will naturally prune away social pathways that enable it.



I think there are none. The world has gotten too complicated for that. It was early days in quantum physics, information theory, and computer science. I don’t think it is early days in anything that consequential anymore.



Centuries ago, the limitation on most knowledge was the difficulty of discovery; once known, it was accessible to most scholars. Take calculus, which is taught in every high school in America. The problem is, we're getting to a point where new fields are built on such extreme prerequisites that even the known knowledge is extremely hard for talented university students to learn, let alone what is required to discover and advance that field. Until we are able to augment human intelligence, the days of the polymath advancing multiple fields are mostly over. I would also argue that the standards for peer-reviewed whitepapers and obtaining PhDs have significantly dropped (due to the incentive structure to spam as many papers as possible), which is only hurting the advancement of knowledge.



Sounds like the increased difficulty could be addressed with new models and the right abstraction layers. E.g., there's incredible complexity in modern computing, but you don't need to know assembly in order to build a Web app, to reason about architecture, to work with functional paradigms, etc. However, this doesn't seem to happen in the natural sciences. I wonder if adopting better models runs into the gatekeepers protecting their status, tenures, and the status quo.



More than that, as professionals' career paths in fields develop, the organisations they work for specialize, becoming less amenable to the generalist. ('Why should we hire this mathematician who is also an expert in legal research? Their attention is probably divided, and meanwhile we have a 100% mathematician in the candidate pool fresh from an expensive dedicated PhD program with a growing family to feed.')

I'm obviously using the archetype of Leibniz here as an example but pick your favorite polymath.



Is it fair to say that the number of publicly accomplished multidisciplinaries alive at a particular moment is not rising as might be expected, in proportion to the total number of suitably educated people?



Euler.

JvN was one of the smartest ever, but Euler was there centuries before and shows up in so many places.

If I had a Time Machine I'd love to get those two together for a stiff drink and a banter.



I felt like I finally understood Shannon entropy when I realized that it's a subjective quantity -- a property of the observer, not the observed.

The entropy of a variable X is the amount of information required to drive the observer's uncertainty about the value of X to zero. As a correlate, your uncertainty and mine about the value of the same variable X could be different. This is trivially true, as we could each have received different information about X. H(X) should be H_{observer}(X), or even better, H_{observer, time}(X).

As clear as Shannon's work is in other respects, he glosses over this.



What's often lost in the discussions about whether entropy is subjective or objective is that, if you dig a little deeper, information theory gives you powerful tools for relating the objective and the subjective.

Consider cross entropy of two distributions H[p, q] = -Σ p_i log q_i. For example maybe p is the real frequency distribution over outcomes from rolling some dice, and q is your belief distribution. You can see the p_i as representing the objective probabilities (sampled by actually rolling the dice) and the q_i as your subjective probabilities. The cross entropy is measuring something like how surprised you are on average when you observe an outcome.

The interesting thing is that H[p, p] <= H[p, q], which means that if your belief distribution is wrong, your cross entropy will be higher than it would be if you had the right beliefs, q=p. This is guaranteed by the concavity of the logarithm. This gives you a way to compare beliefs: whichever q gets the lowest H[p,q] is closer to the truth.

You can even break cross entropy into two parts, corresponding to two kinds of uncertainty: H[p, q] = H[p] + D[p||q]. The first term is the entropy of p and it is the aleatoric uncertainty, the inherent randomness in the phenomenon you are trying to model. The second term is the KL divergence and it tells you how much additional uncertainty you have as the result of having wrong beliefs, which you could call epistemic uncertainty.
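
A minimal Python sketch of this decomposition (the loaded-die numbers below are made up for illustration):

    import math

    # p: "true" outcome frequencies of a hypothetical loaded die
    # q: the belief that the die is fair
    p = [0.10, 0.10, 0.10, 0.10, 0.10, 0.50]
    q = [1/6] * 6

    def entropy(p):
        """H[p] = -sum p_i log p_i, in bits: the aleatoric part."""
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    def cross_entropy(p, q):
        """H[p, q] = -sum p_i log q_i: average surprise when reality is p but you believe q."""
        return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

    def kl(p, q):
        """D[p||q]: the extra surprise caused by holding the wrong beliefs q."""
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    print(cross_entropy(p, q))                 # total average surprise
    print(entropy(p) + kl(p, q))               # same number: H[p,q] = H[p] + D[p||q]
    assert cross_entropy(p, q) >= entropy(p)   # Gibbs' inequality: H[p,p] <= H[p,q]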



Thanks, that's an interesting perspective. It also highlights one of the weak points in the concept, I think, which is that this is only a tool for updating beliefs to the extent that the underlying probability space ("ontology" in this analogy) can actually "model" the phenomenon correctly!

It doesn't seem to shed much light on when or how you could update the underlying probability space itself (or when to change your ontology in the belief setting).



You can sort of do this over a suitably large (or infinite) family of models all mixed, but from an epistemological POV that’s pretty unsatisfying.

From a practical POV it’s pretty useful and common (if you allow it to describe non- and semi-parametric models too).



I think what you're getting at is the construction of the sample space - the space of outcomes over which we define the probability measure (e.g. {H,T} for a coin, or {1,2,3,4,5,6} for a die).

Let's consider two possibilities:

1. Our sample space is "incomplete"

2. Our sample space is too "coarse"

Let's discuss 1 first. Imagine I have a special die that has a hidden binary state which I can control, which forces the die to come up either even or odd. If your sample space is only which side faces up, and I randomize the hidden state appropriately, it appears like a normal die. If your sample space is enlarged to include the hidden state, the entropy of each roll is reduced by one bit. You will not be able to distinguish between a truly random coin and a coin with a hidden state if your sample space is incomplete. Is this the point you were making?

On 2: Now let's imagine I can only observe whether the die comes up even or odd. This is a coarse-graining of the sample space (we get strictly less information - or, we only get some "macro" information). Of course, a coarse-grained sample space is necessarily an incomplete one! We can imagine comparing the outcomes from a normal die, to one which with equal probability rolls an even or odd number, except it cycles through the microstates deterministically e.g. equal chance of {odd, even}, but given that outcome, always goes to next in sequence {(1->3->5), (2->4->6)}.

Incomplete or coarse sample spaces can indeed prevent us from inferring the underlying dynamics. Many processes can have the same apparent entropy on our sample space from radically different underlying processes.
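
A small numerical check of the one-bit gap described above (the hidden-bit die is the made-up device from this comment):

    import math

    def H(dist):
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    faces = [1, 2, 3, 4, 5, 6]

    # Incomplete/coarse sample space: only the face that shows up.
    # Marginalizing over the hidden fair bit, every face looks equally likely.
    p_face = {f: 1/6 for f in faces}
    print(H(p_face))          # ~2.585 bits, indistinguishable from a normal die

    # Enlarged sample space: condition on the hidden even/odd bit.
    p_given_odd  = {f: (1/3 if f % 2 == 1 else 0.0) for f in faces}
    p_given_even = {f: (1/3 if f % 2 == 0 else 0.0) for f in faces}
    h_conditional = 0.5 * H(p_given_odd) + 0.5 * H(p_given_even)
    print(h_conditional)      # ~1.585 bits: knowing the hidden state removes exactly one bit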



Right, this is exactly what I'm getting at - learning a distribution over a fixed sample space can be done with Bayesian methods, or entropy-based methods like the OP suggested, but I'm wondering if there are methods that can automatically adjust the sample space as well.

For well-defined mathematical problems like dice rolling and fixed classical mechanics scenarios and such, you don't need this I guess, but for any real-world problem I imagine half the problem is figuring out a good sample space to begin with. This kind of thing must have been studied already, I just don't know what to look for!

There are some analogies to algorithms like NEAT, which automatically evolves a neural network architecture while training. But that's obviously a very different context.



We could discuss completeness of the sample space, and we can also discuss completeness of the hypothesis space.

In Solomonoff Induction, which purports to be a theory of universal inductive inference, the "complete hypothesis space" consists of all computable programs (note that all current physical theories are computable, so this hypothesis space is very general). Then induction is performed by keeping all programs consistent with the observations, weighted by 2 terms: the program's prior likelihood, and the probability that program assigns to the observations (the programs can be deterministic and assign probability 1).

The "prior likelihood" in Solomonoff Induction is the program's complexity (well, 2^(-Complexity), where the complexity is the length of the shortest representation of that program.

Altogether, the procedure looks like: maintain a belief which is a mixture of all programs consistent with the observations, weighted by their complexity and the likelihood they assign to the data. Of course, this procedure is still limited by the sample/observation space!

That's our best formal theory of induction in a nutshell.
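
A toy sketch of the weighting scheme in Python. To be clear, this is not Solomonoff Induction itself (which mixes over all computable programs and is uncomputable); the "programs" here are a tiny hand-picked set of deterministic bit predictors, and the "complexity" is just the length of a made-up label:

    from fractions import Fraction

    # Hypothetical predictors: each maps a position n to the bit it predicts there.
    programs = {
        "0*":      lambda n: 0,             # always 0
        "1*":      lambda n: 1,             # always 1
        "(01)*":   lambda n: n % 2,         # 0,1,0,1,...
        "(0011)*": lambda n: (n // 2) % 2,  # 0,0,1,1,...
    }
    prior = {name: Fraction(1, 2 ** len(name)) for name in programs}  # 2^(-complexity)

    def predict_next(bits):
        """Keep the programs consistent with the observed bits (deterministic, so they
        assign the data probability 1), weight them by prior, and mix their predictions."""
        alive = {name: w for name, w in prior.items()
                 if all(programs[name](i) == b for i, b in enumerate(bits))}
        total = sum(alive.values())
        if total == 0:
            return None  # nothing in our tiny hypothesis set explains the data
        n = len(bits)
        return float(sum(w for name, w in alive.items() if programs[name](n) == 1) / total)

    print(predict_next([0, 1, 0, 1]))  # 0.0: only "(01)*" survives, and it predicts 0 next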



Someone else pointed me to Solomonoff induction too, which looks really cool as an "idealised" theory of induction and it definitely solves my question in abstract. But there are obvious difficulties with that in practice, like the fact that it's probably uncomputable, right?

I mean I think even the "Complexity" coefficient should be uncomputable in general, since you could probably use a program which computes it to upper bound "Complexity", and if there was such an upper bound you could use it to solve the halting problem etc. Haven't worked out the details though!

Would be interesting if there are practical algorithms for this. Either direct approximations to SI or maybe something else entirely that approaches SI in the limit, like a recursive neural-net training scheme? I'll do some digging, thanks!



Correct anything that's wrong here. Cross entropy is the comparison of two distributions, right? Is the objectivity sussed out in relation to the overlap cross section? And is the subjectivity sussed out not on average but in deviations from the average? Just trying to understand it in my framework, which might be wholly off the mark.



Cross entropy lets you compare two probability distributions. One way you can apply it is to let the distribution p represent "reality" (from which you can draw many samples, but whose numerical value you might not know) and to let q represent "beliefs" (whose numerical value is given by a model). Then by finding q to minimize cross-entropy H[p, q] you can move q closer to reality.

You can apply it other ways. There are lots of interpretations and uses for these concepts. Here's a cool blog post if you want to find out more: https://blog.alexalemi.com/kl-is-all-you-need.html
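
A sketch of the "minimize H[p, q] to move q closer to reality" idea, with made-up numbers: sample from the "real" p, then score candidate belief distributions by their estimated cross entropy; the one with the lower score is closer to reality:

    import math
    import random

    random.seed(0)
    faces = [1, 2, 3, 4, 5, 6]
    true_weights = [0.10, 0.10, 0.10, 0.10, 0.10, 0.50]   # "reality" p, assumed here
    samples = random.choices(faces, weights=true_weights, k=100_000)

    q_fair   = dict(zip(faces, [1/6] * 6))
    q_loaded = dict(zip(faces, [0.12, 0.12, 0.12, 0.12, 0.12, 0.40]))

    def est_cross_entropy(samples, q):
        """Monte Carlo estimate of H[p, q] = E_p[-log q(x)], in bits."""
        return -sum(math.log2(q[x]) for x in samples) / len(samples)

    print(est_cross_entropy(samples, q_fair))    # higher: the fair-die belief is further from p
    print(est_cross_entropy(samples, q_loaded))  # lower: this belief is closer to p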



I'm not sure what you mean by objectivity and subjectivity in this case.

With the example of beliefs, you can think of cross entropy as the negative expected value of the log probability you assigned to an outcome, weighted by the true probability of each outcome. If you assign larger log probabilities to more likely outcomes, the cross entropy will be lower.



This doesn't really make entropy itself observer dependent. (Shannon) entropy is a property of a distribution. It's just that when you're measuring different observers' beliefs, you're looking at different distributions (which can have different entropies the same way they can have different means, variances, etc).



Entropy is a property of a distribution, but since math does sometimes get applied, we also attach distributions to things (eg. the entropy of a random number generator, the entropy of a gas...). Then when we talk about the entropy of those things, those entropies are indeed subjective, because different subjects will attach different probability distributions to that system depending on their information about that system.



Some probability distributions are objective. The probability that my random number generator gives me a certain number is given by a certain formula. Describing it with another distribution would be wrong.

Another example, if you have an electron in a superposition of half spin-up and half spin-down, then the probability to measure up is objectively 50%.

Another example, GPT-2 is a probability distribution on sequences of integers. You can download this probability distribution. It doesn't represent anyone's beliefs. The distribution has a certain entropy. That entropy is an objective property of the distribution.



Of those, the quantum superposition is the only one that has a chance at being considered objective, and it's still only "objective" in the sense that (as far as we know) your description provided as much information as anyone can possibly have about it, so nobody can have a more-informed opinion and all subjects agree.

The others are both partial-information problems which are very sensitive to knowing certain hidden-state information. Your random number generator gives you a number that you didn't expect, and for which a formula describes your best guess based on available incomplete information, but the computer program that generated it knew which one to choose and would not have picked any other. Anyone who knew the hidden state of the RNG would also have assigned a different probability to that number being chosen.



You might have some probability distribution in your head for what will come out of GPT-2 on your machine at a certain time, based on your knowledge of the random seed. But that is not the GPT-2 probability distribution, which is objectively defined by model weights that you can download, and which does not correspond to anyone’s beliefs.



I'm of the view that strictly speaking, even a fair die doesn't have a probability distribution until you throw it. It just so happens that, unless you know almost every detail about the throw, the best you can usually do is uniform.

So I would say the same of GPT-2. It's not a random variable unless you query it. But unless you know unreasonably many details, the best you can do to predict the query is the distribution that you would call "objective."



I think this gets into unanswerable metaphysical questions about when we can say mathematical objects, propositions, etc. really exist.

But I think if we take the view that it's not a random variable until we query it, that makes it awkward to talk about how GPT-2 (and similar models) is trained. No one ever draws samples from the model during training, but the whole justification for the cross-entropy-minimizing training procedure is based on thinking about the model as a random variable.



A more plausible way to argue for objectiveness is to say that some probability distributions are objectively more rational than others given the same information. E.g. when seeing a symmetrical die it would be irrational to give 5 a higher probability than the others. Or it seems irrational to believe that the sun will explode tomorrow.



The probability distribution is subjective for both parts -- because it, once again, depends on the observer observing the events in order to build a probability distribution.

E.g. your random number generator generates 1, 5, 7, 8, 3 when you run it. It generates 4, 8, 8, 2, 5 when I run it. I.e. we have received different information about the random number generator to build our subjective probability distributions. The level of entropy of our probability distributions is high because we have so little information to be certain about the representativeness of our distribution sample.

If we continue running our random number generator for a while, we will gather more information, thus reducing entropy, and our probability distributions will both start converging towards an objective "truth." If we ran our random number generators for a theoretically infinite amount of time, we will have reduced entropy to 0 and have a perfect and objective probability distribution.

But this is impossible.



Sorry, this is a major misinterpretation, or at least a completely different one. I don't know how to put it in a more productive way; I think your comment is very confused. You don't need to run a random number generator "for a while" in order to build up a probability distribution.



This might be a frequentist vs bayesian thing, and I am bayesian. So maybe other people would have a different view.

I don't think you need to have any information to have a probability distribution; your distribution already represents your degree of ignorance about an outcome. So without even sampling it once, you already should have a uniform probability distribution for a random number generator or a coin flip. If you do personally have additional information to help you predict the outcome -- you're skilled at coin-flipping, or you wrote the RNG and know an exploit -- then you can compress that distribution to a lower-entropy one.

But you don't need to sample the distribution to do this. You can have that information before the first coin toss. Sampling can be one way to get information but it won't necessarily even help. If samples are independent, then each sample really teaches you barely anything about the next. RNGs eventually do repeat so if you sample it enough you might be able to find the pattern and reduce the entropy to zero, but in that case you're not learning the statistical distribution, you're deducing the exact internal state of the RNG and predicting the exact next outcome, because the samples are not actually independent. If you do enough coin flips you might eventually find that there's a slight bias to the coin, but that really takes an extreme number of tosses and only reduces the entropy a tiny tiny bit; not at all if the coin-tossing procedure had no bias to begin with.

However the objective truth is just that the next toss will land heads. That's the only truth that experiment can objectively determine. Any other doubt that it might-have-counterfactually-landed-tails is subjective, due to a subjective lack of sufficient information to predict the outcome. We can formalize a correct procedure to convert prior information into a corresponding probability distribution, we can get a unanimous consensus by giving everybody the same information, but the probability distribution is still subjective because it is a function of that prior information.



Would you say that all claims about the world are subjective, because they have to be based on someone’s observations?

For example my cat weighs 13 pounds. That seems objective, in the sense that if two people disagree, only one can be right. But the claim is based on my observations. I think your logic leads us to deny that anything is objective.



I do believe in objective reality, but probabilities are subjective. Your cat weighs 13 pounds, and now that you've told me, I know it too. If you asked me to draw a probability distribution for the weight of your cat, I'd draw a tight gaussian distribution around that, representing the accuracy of your scale. My cat weighs a different amount, but I won't tell you how much, so if we both draw a probability distribution, they'll be different. And the key thing is that neither of us has an objectively correct probability distribution, not even me. My cat's weight has an objectively correct value which even I don't know, because my scale isn't good enough.



Right, but the very interesting thing is it turns out that what's random to me might not be random to you! And the reason that "microscopic" is included is because that's a shorthand for "information you probably don't have about a system, because your eyes aren't that good, or even if they are, your brain ignored the fine details anyway."



Entropy in physics is usually the Shannon entropy of the probability distribution over system microstates given known temperature and pressure. If the system is in equilibrium then this is objective.



That's not a problem, as the GP's post is trying to state a mathematical relation, not a historical attribution. Often newer concepts shed light on older ones. As Baez's article says, Gibbs entropy is Shannon's entropy of an associated distribution (multiplied by the constant k).



It is a problem because all three come with baggage. Almost none of the things discussed in this thread are valid when discussing actual physical entropy, even though the equations are superficially similar. And then there are lots of people being confidently wrong because they assume that it's just one concept. It really is not.



I don't see how the connection is superficial. Even the classical macroscopic definition of entropy as ΔS = ∫ dQ/T can be derived from the information theory perspective, as Baez shows in the article (using entropy-maximizing distributions and Lagrange multipliers). If you have a more specific critique, it would be good to discuss.



In classical physics there is no real objective randomness. Particles have a defined position and momentum and those evolve deterministically. If you somehow learned these then the shannon entropy is zero. If entropy is zero then all kinds of things break down.

So now you are forced to consider e.g. temperature an impossibility without quantum-derived randomness, even though temperature does not really seem to be a quantum thing.



> If entropy is zero then all kinds of things break down.

Entropy is a macroscopic variable and if you allow microscopic information, strange things can happen! One can move from a high entropy macrostate to a low entropy macrostate if you choose the initial microstate carefully. But this is not a reliable process which you can reproduce experimentally, ie. it is not a thermodynamic process.

A thermodynamic process P is something which takes a macrostate A to a macrostate B, independent of which microstate a0, a1, a2, ... in A you started off with. If the process depended on the microstate, then it wouldn't be something we would recognize, as we are looking from the macro perspective.



> Particles have a defined position and momentum

Which we don’t know precisely. Entropy is about not knowing.

> If you somehow learned these then the shannon entropy is zero.

Minus infinity. Entropy in classical statistical mechanics is proportional to the logarithm of the volume in phase space. (You need an appropriate extension of Shannon’s entropy to continuous distributions.)

> So now you are forced to consider e.g. temperature an impossibility without quantum-derived randomness

Or you may study statistical mechanics :-)



> Which we don’t know precisely. Entropy is about not knowing.

No, it is not about not knowing. This is an instance where the intuition from Shannon’s entropy does not translate to statistical physics.

It is about the number of possible microstates, which is completely different. In Physics, entropy is a property of a bit of matter, it is not related to the observer or their knowledge. We can measure the enthalpy change of a material sample and work out its entropy without knowing a thing about its structure.

> Minus infinity. Entropy in classical statistical mechanics is proportional to the logarithm of the volume in phase space.

No, 0. In this case, there is a single state with p=1 and S = - k Σ p ln(p) = 0.

This is the same if you consider the phase space because then it is reduced to a single point (you need a bit of distribution theory to prove it rigorously but it is somewhat intuitive).

The probability p of a microstate is always between 0 and 1, therefore p ln(p) is always negative and S is always positive.

You get the same using Boltzmann’s approach, in which case Ω = 1 and S = k ln(Ω) is also 0.

> (You need an appropriate extension of Shannon’s entropy to continuous distributions.)

Gibbs’ entropy.

> Or you may study statistical mechanics

Indeed.



>>> Particles have a defined position and momentum [...] If you somehow learned these then the shannon entropy is zero.

>> Entropy in classical statistical mechanics is proportional to the logarithm of the volume in phase space [and diverges to minus infinity if you define precisely the position and momentum of the particles and the volume in phase sphere goes to zero]

> [It's zero also] if you consider the phase space because then it is reduced to a single point (you need a bit of distribution theory to prove it rigorously but it is somewhat intuitive).

> The probability p of an microstate is always between 0 and 1, therefore p ln(p) is always negative and S is always positive.

The points in the phase space are not "microstates" with probability between 0 and 1. It's a continuous distribution and if it collapses to a point (i.e. you somehow learned the exact positions and momentums) the density at that point is unbounded. The entropy is also unbounded and goes to minus infinity as the volume in phase space collapses to zero.

You can avoid the divergence by dividing the continuous phase space into discrete "microstates" but having a well-defined "microstate" corresponding to some finite volume in phase space is not the same as what was written above about "particles having a defined position and momentum" that is "somehow learned". The microstates do not have precisely defined positions and momentums. The phase space is not reduced to a single point in that case.

If the phase space is reduced to a single point I'd like to see your proof that S(ρ) = −k ∫ ρ(x) log ρ(x) dx = 0



> possible microstates

Conditional on the known macrostate. Because we don’t know the precise microstate - only which microstates are possible.

If your reasoning is that « experimental entropy can be measured so it’s not about that » then it’s not about macrostates and microstates either!



> In Physics, entropy is a property of a bit of matter, it is not related to the observer or their knowledge. We can measure the enthalpy change of a material sample and work out its entropy without knowing a thing about its structure.

Enthalpy is also dependent on your choice of state variables, which is in turn dictated by which observables you want to make predictions about: whether two microstates are distinguishable, and thus whether they are part of the same macrostate, depends on the tools you have for distinguishing them.



Sounds like a non-sequitur to me; what are you implying about the Maxwell's demon thought experiment vs the comparison between Shannon and stat-mech entropy?



Yeah but distributions are just the accounting tools to keep track of your entropy. If you are missing one bit of information about a system, your understanding of the system is some distribution with one bit of entropy. Like the original comment said, the entropy is the number of bits needed to fill in the unknowns and bring the uncertainty down to zero. Your coin flips may be unknown in advance to you, and thus you model it as a 50/50 distribution, but in a deterministic universe the bits were present all along.



To shorten this for you with my own (identical) understanding: "entropy is just the name for the bits you don't have".

Entropy + Information = Total bits in a complete description.



Trivial example: if you know the seed of a pseudo-random number generator, a sequence generated by it has very low entropy.

But if you don't know the seed, the entropy is very high.



> he glosses over this

All of information theory is relative to the channel. This bit is well communicated.

What he glosses over is the definition of "channel", since it's obvious for electromagnetic communications.



Can you explain this in more detail?

Entropy is calculated as a function of a probability distribution over possible messages or symbols. The sender might have a distribution P over possible symbols, and the receiver might have another distribution Q over possible symbols. Then the "true" distribution over possible symbols might be another distribution yet, call it R. The mismatch between these is what leads to various inefficiencies in coding, decoding, etc [1]. But both P and Q are beliefs about R -- that is, they are properties of observers.

[1] https://en.wikipedia.org/wiki/Kullback–Leibler_divergence#Co...



the subjectivity doesn't stem from the definition of the channel but from the model of the information source. what's the prior probability that you intended to say 'weave', for example? that depends on which model of your mind we are using. frequentists argue that there is an objectively correct model of your mind we should always use, and bayesians argue that it depends on our prior knowledge about your mind



(i mean, your information about what the channel does is also potentially incomplete, so the same divergence in definitions could arise there too, but the subjectivity doesn't just stem from the definition of the channel; and shannon entropy is a property that can be imputed to a source independent of any channel)



It's an objective quantity, but you have to be very precise in stating what the quantity describes.

Unbroken egg? Low entropy. There's only one way the egg can exist in an unbroken state, and that's it. You could represent the state of the egg with a single bit.

Broken egg? High entropy. There are an arbitrarily-large number of ways that the pieces of a broken egg could land.

A list of the locations and orientations of each piece of the broken egg, sorted by latitude, longitude, and compass bearing? Low entropy again; for any given instance of a broken egg, there's only one way that list can be written.

Zip up the list you made? High entropy again; the data in the .zip file is effectively random, and cannot be compressed significantly further. Until you unzip it again...

Likewise, if you had to transmit the (uncompressed) list over a bandwidth-limited channel. The person receiving the data can make no assumptions about its contents, so it might as well be random even though it has structure. Its entropy is effectively high again.



And this is what I prefer too, although with the clarification that it's the number of ways that a system can be arranged without changing its macroscopic properties.

It's, unfortunately, not very compatible with Shannon's usage in any but the shallowest sense, which is why it stays firmly in the land of physics.



> not very compatible with Shannon's usage in any but the shallowest sense

The connection is not so shallow, there are entire books based on it.

“The concept of information, intimately connected with that of probability, gives indeed insight on questions of statistical mechanics such as the meaning of irreversibility. This concept was introduced in statistical physics by Brillouin (1956) and Jaynes (1957) soon after its discovery by Shannon in 1948 (Shannon and Weaver, 1949). An immense literature has since then been published, ranging from research articles to textbooks. The variety of topics that belong to this field of science makes it impossible to give here a bibliography, and special searches are necessary for deepening the understanding of one or another aspect. For tutorial introductions, somewhat more detailed than the present one, see R. Balian (1991-92; 2004).”

https://arxiv.org/pdf/cond-mat/0501322



I don't dispute that the math is compatible. The problem is the interpretation thereof. When I say "shallowest", I mean the implications of each are very different.

Insofar as I'm aware, there is no information-theoretic equivalent to the 2nd or 3rd laws of thermodynamics, so the intuition a student works up from physics about how and why entropy matters just doesn't transfer. Likewise, even if an information science student is well versed in the concept of configuration entropy, that's 15 minutes of one lecture in statistical thermodynamics. There's still the rest of the course to consider.



Assuming each of the N microstates for a given macrostate is equally probable with probability p = 1/N, the Shannon entropy is -Σ p log(p) = -N·(1/N)·log(1/N) = log(N), which is the physics interpretation.

In the continuous version, you would get log(V) where V is the volume in phase space occupied by the microstates for a given macrostate.

Liouville's theorem, that volume is conserved in phase space, implies that a macroscopic process can move all the microstates of a macrostate A into a macrostate B only if the volume of B is bigger than the volume of A. This implies that the entropy of B must be bigger than the entropy of A, which is the Second Law.
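
A quick numerical check of the counting formula (the value of N is arbitrary):

    import math

    N = 10_000                       # number of equally likely microstates, chosen arbitrarily
    p = 1.0 / N
    shannon = -N * p * math.log(p)   # -Σ p log p with every p_i = 1/N
    boltzmann = math.log(N)          # ln Ω with Ω = N; multiply by k for S = k ln Ω
    print(shannon, boltzmann)        # both equal log(N)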



The second law of thermodynamics is time-asymmetric, but the fundamental physical laws are time-symmetric, so from them you can only predict that the entropy of B should be bigger than the entropy of A irrespective of whether B is in the future or the past of A. You need the additional assumption (Past Hypothesis) that the universe started in a low entropy state in order to get the second law of thermodynamics.

> If our goal is to predict the future, it suffices to choose a distribution that is uniform in the Liouville measure given to us by classical mechanics (or its quantum analogue). If we want to reconstruct the past, in contrast, we need to conditionalize over trajectories that also started in a low-entropy past state — that the “Past Hypothesis” that is required to get stat mech off the ground in a world governed by time-symmetric fundamental laws.

https://www.preposterousuniverse.com/blog/2013/07/09/cosmolo...



The second law of thermodynamics is about systems that are well described by a small set of macroscopic variables. The evolution of an initial macrostate prepared by an experimenter who can control only the macrovariables is reproducible. When a thermodynamical system is prepared in such a reproducible way the preparation is happening in the past, by definition.

The second law is about how part of the information that we had about a system - constrained to be in a macrostate - is “lost” when we “forget” the previous state and describe it using just the current macrostate. We know more precisely the past than the future - the previous state is in the past by definition.



The "can be arranged" is the tricky part. E.g. you might know from context that some states are impossible (where the probability distribution is zero), even though they combinatorially exist. That changes the entropy to you.

That is why information and entropy are different things. Entropy is what you know you do not know. That knowledge of the magnitude of the unknown is what is being quantified.

Also, this is the point where I think the article is wrong (or not precise enough), as it would include the unknown unknowns, which are not entropy IMO:

> I claim it’s the amount of information we don’t know about a situation



For information theory, I've always thought of entropy as follows:

"If you had a really smart compression algorithm, how many bits would it take to accurately represent this file?"

i.e., Highly repetitive inputs compress well because they don't have much entropy per bit. Modern compression algorithms are good enough on most data to be used as a reasonable approximation for the true entropy.
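
A rough way to see this, using zlib as a stand-in for the "really smart compression algorithm" (exact sizes depend on the zlib version, but the contrast is the point):

    import os
    import zlib

    repetitive = b"abc" * 10_000       # highly structured input
    random_ish = os.urandom(30_000)    # essentially incompressible input of the same length

    for label, data in [("repetitive", repetitive), ("random", random_ish)]:
        compressed = zlib.compress(data, level=9)
        # the compressed size is a loose upper bound on the source's entropy, in bytes
        print(label, len(data), "->", len(compressed))
    # The repetitive input shrinks to a tiny fraction of its size (little entropy per byte);
    # the random bytes barely compress at all (close to 8 bits of entropy per byte).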



I've always favored this down-to-earth characterization of the entropy of a discrete probability distribution. (I'm a big fan of John Baez's writing, but I was surprised glancing through the PDF to find that he doesn't seem to mention this viewpoint.)

Think of the distribution as a histogram over some bins. Then, the entropy is a measurement of, if I throw many many balls at random into those bins, the probability that the distribution of balls over bins ends up looking like that histogram. What you usually expect to see is a uniform distribution of balls over bins, so the entropy measures the probability of other rare events (in the language of probability theory, "large deviations" from that typical behavior).

More specifically, if P = (P1, ..., Pk) is some distribution, then the probability that throwing N balls (for N very large) gives a histogram looking like P is about 2^(-N * [log(k) - H(P)]), where H(P) is the entropy. When P is the uniform distribution, then H(P) = log(k), the exponent is zero, and the estimate is 1, which says that by far the most likely histogram is the uniform one. That is the largest possible entropy, so any other histogram has probability 2^(-c*N) of appearing for some c > 0, i.e., is very unlikely and exponentially moreso the more balls we throw, but the entropy measures just how much. "Less uniform" distributions are less likely, so the entropy also measures a certain notion of uniformity. In large deviations theory this specific claim is called "Sanov's theorem" and the role the entropy plays is that of a "rate function."

The counting interpretation of entropy that some people are talking about is related, at least at a high level, because the probability in Sanov's theorem is the number of outcomes that "look like P" divided by the total number, so the numerator there is indeed counting the number of configurations (in this case of balls and bins) having a particular property (in this case looking like P).

There are lots of equivalent definitions and they have different virtues, generalizations, etc, but I find this one especially helpful for dispelling the air of mystery around entropy.
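
A small sanity check of that estimate, with k = 2 bins and numbers chosen arbitrarily: the exact multinomial probability and 2^(-N·[log(k) − H(P)]) agree to exponential order, and the leftover gap is the polynomial-in-N prefactor the estimate ignores:

    import math

    k, N = 2, 400
    P = (0.75, 0.25)
    counts = (int(N * P[0]), int(N * P[1]))        # a histogram "looking like P"

    H = -sum(p * math.log2(p) for p in P)          # entropy of P in bits
    sanov = 2.0 ** (-N * (math.log2(k) - H))       # large-deviations estimate

    exact = math.comb(N, counts[0]) * (1.0 / k) ** N   # exact probability of that histogram

    # Compare the per-ball exponents: they nearly coincide, and get closer as N grows.
    print(math.log2(sanov) / N, math.log2(exact) / N)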



Hey, did you want to say relative entropy ~ rate function ~ KL divergence? That might be more familiar to the ML enthusiasts here and get them curious about Sanov or large deviations.



That's right, here log(k) - H(p) is really the relative entropy (or KL divergence) between p and the uniform distribution, and all the same stuff is true for a different "reference distribution" of the probabilities of balls landing in each bin.

For discrete distributions the "absolute entropy" (just sum of -p log(p) as it shows up in Shannon entropy or statistical mechanics) is in this way really a special case of relative entropy. For continuous distributions, say over real numbers, the analogous quantity (integral of -p log(p)) isn't a relative entropy since there's no "uniform distribution over all real numbers". This still plays an important role in various situations and calculations...but, at least to my mind, it's a formally similar but conceptually separate object.



Ah JCB, how I love your writing, you are always so very generous.

Your This Week's Finds were a hugely enjoyable part of my undergraduate education and beyond.

Thank you again.



"I have largely avoided the second law of thermodynamics, which says that entropy always increases. While fascinating, this is so problematic that a good explanation would require another book!"

For those interested I am currently reading "Entropy Demystified" by Arieh Ben-Naim which tackles this side of things from much the same direction.



Information entropy is literally the strict lower bound on how efficiently information can be communicated (expected number of transmitted bits) if the probability distribution which generates this information is known, that's it. Even in contexts such as calculating the information entropy of a bit string, or the English language, you're just taking this data and constructing some empirical probability distribution from it using the relative frequencies of zeros and ones or letters or n-grams or whatever, and then calculating the entropy of that distribution.
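
For instance, a sketch of that "empirical distribution from relative frequencies" step for a piece of text (the sample string is arbitrary):

    import math
    from collections import Counter

    text = "the quick brown fox jumps over the lazy dog"

    counts = Counter(text)
    total = sum(counts.values())
    freqs = {ch: n / total for ch, n in counts.items()}   # empirical character distribution

    # Entropy of that distribution: the lower bound, in bits per character, on how
    # compactly a source emitting i.i.d. draws from it could be encoded.
    H = -sum(p * math.log2(p) for p in freqs.values())
    print(round(H, 3), "bits per character")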

I can't say I'm overly fond of Baez's definition, but far be it from me to question someone of his stature.



Am I the only one who can't download the pdf, or is the file server down? I can see the blog page, but when I try downloading the ebook it just doesn't work.

If the file server is down.. anyone could upload the ebook for download?



I like the formulation of 'the amount of information we don't know about a system that we could in theory learn'. I'm surprised there's no mention of the Copenhagen interpretation's interaction with this definition, under a lot of QM theories 'unavailable information' is different from available information.



I sometimes ponder where new entropy/randomness comes from, like if we take the earliest state of the universe as an infinitely dense point particle which expanded. There must have been some randomness, or say variety, which led it to expand in a non-uniform way, which led to the dominance of matter over anti-matter, or the creation of galaxies, clusters, etc. If we take an isolated system in which certain static particles are present, can it be the case that a small subset of the particles will gain motion and thus introduce entropy? Can entropy be induced automatically, at least on a quantum level? If anyone can help me explain that it would be very helpful, and it could help explain the origin of the universe better.



Thanks for the reference will take some time before I see the whole video. Can you tell me what those quantum fluctuations are in short? Are they part of some physical law?



Symmetry breaking is the general phenomenon that underlies most of that.

The classic example is this:

Imagine you have a perfectly symmetrical sombrero[1], and there's a ball balanced on top of the middle of the hat. There's no preferred direction it should fall in, but it's _unstable_. Any perturbation will make it roll down hill and come to rest in a stable configuration on the brim of the hat. The symmetry of the original configuration is now broken, but it's stable.

1: https://m.media-amazon.com/images/I/61M0LFKjI9L.__AC_SX300_S...



After years of thought I dare to say the 2nd law of thermodynamics is a tautology. Entropy increasing means every system tends to higher probability, which means the most probable is the most probable.



I think that’s right, though it’s non-obvious that more probable systems are disordered. At least as non-obvious as Pascal’s triangle is.

Which is to say, worth saying from a first principles POV, but not all that startling.



The book might disappoint some..

>I have largely avoided the second law of thermodynamics ... Thus, the aspects of entropy most beloved by physics popularizers will not be found here.

But personally, this bit is the most exciting to me.

>I have tried to say as little as possible about quantum mechanics, to keep the physics prerequisites low. However, Planck’s constant shows up in the formulas for the entropy of the three classical systems mentioned above. The reason for this is fascinating: Planck’s constant provides a unit of volume in position-momentum space, which is necessary to define the entropy of these systems. Thus, we need a tiny bit of quantum mechanics to get a good approximate formula for the entropy of hydrogen, even if we are trying our best to treat this gas classically.



There's a fundamental nature to entropy, but as usual it's not very enlightening for the poor monkey brain, so to explain it you need to enumerate all its high-level behavior; but its high-level behavior is accidental and can't be summarized in a concise form.



My definition: Entropy is a measure of the accumulation of non-reversible energy transfers.

Side note: All reversible energy transfers involve an increase in potential energy. All non-reversible energy transfers involve a decrease in potential energy.



That definition doesn't work well because you can have changes in entropy even if no energy is transferred, e.g. by exchanging some other conserved quantity.

The side note is wrong in letter and spirit; turning potential energy into heat is one way for something to be irreversible, but neither of those statements is true.

For example, consider an iron ball being thrown sideways. It hits a pile of sand and stops. The iron ball is not affected structurally, but its kinetic energy is transferred (almost entirely) to heat energy. If the ball is thrown slightly upwards, potential energy increases but the process is still irreversible.

Also, the changes of potential energy in corresponding parts of two Carnot cycles are directionally the same, even if one is ideal (reversible) and one is not (irreversible).



The way I understand it is with an analogy to probability. To me, events are to microscopic states as a random variable is to entropy.



If I were to write a book with that title, I would get to the point a bit faster, probably as follows.

Entropy is just a number you can associate with a probability distribution. If the distribution is discrete, so you have a set p_i, i = 1..n, which are each positive and sum to 1, then the definition is:

S = - sum_i p_i log( p_i )

Mathematically we say that entropy is a real-valued function on the space of probability distributions. (Elementary exercises: show that S >= 0 and it is maximized on the uniform distribution.)

That is it. I think there is little need for all the mystery.
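
For what it's worth, the definition and both elementary exercises fit in a few lines of Python (the distributions below are randomly generated just to check the claims numerically):

    import math
    import random

    def S(p):
        """Entropy of a discrete distribution p, given as probabilities summing to 1."""
        return -sum(pi * math.log(pi) for pi in p if pi > 0)

    n = 5
    uniform = [1 / n] * n

    random.seed(1)
    for _ in range(1000):
        w = [random.random() for _ in range(n)]
        p = [wi / sum(w) for wi in w]           # a random distribution on n outcomes
        assert S(p) >= 0                        # exercise 1
        assert S(p) <= S(uniform) + 1e-12       # exercise 2: uniform maximizes S

    print(S(uniform), math.log(n))              # the maximum equals log(n)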



So the only thing you need to know about entropy is that it's a real-valued number you can associate with a probability distribution? And that's it? I disagree. There are several numbers that can be associated with probability distribution, and entropy is an especially useful one, but to understand why entropy is useful, or why you'd use that function instead of a different one, you'd need to know a few more things than just what you've written here.



Exactly, saying that's all there is to know about entropy is like saying all you need to know about chess are the rules and all you need to know about programming is the syntax/semantics.

Knowing the plain definition or the rules is nothing but a superficial understanding of the subject. Knowing how to use the rules to actually do something meaningful, having a strategy, that's where meaningful knowledge lies.



In particular, the expectation (or variance) of a real-valued random variable can also be seen as "a real-valued number you can associate with a probability distribution".

Thus, GP's statement is basically: "entropy is like expectation, but different".



The problem is that this doesn't get at many of the intuitive properties of entropy.

A different explanation (based on macro- and micro-states) makes it intuitively obvious why entropy is non-decreasing with time or, with a little more depth, what entropy has to do with temperature.



That doesn't strike me as a problem. Definitions are often highly abstract and counterintuitive, with much study required to understand at an intuitive level what motivates them. Rigour and intuition are often competing concerns, and I think definitions should favour the former. The definition of compactness in topology, or indeed just the definition of a topological space, are examples of this - at face value, they're bizarre. You have to muck around a fair bit to understand why they cut so brilliantly to the heart of the thing.



The above evidently only suffices as a definition, not as an entire course. My point was just that I don't think any other introduction beats this one, especially for a book with the given title.

In particular it has always been my starting point whenever I introduce (the entropy of) macro- and micro-states in my statistical physics course.



That definition is on page 18, I agree it could've been reached a bit faster but a lot of the preceding material is motivation, puzzles, and examples.

This definition isn't the end goal, the physics things are.



Correct! And it took me just one paragraph, not the 18 pages of meandering (and I think confusing) text that it takes the author of the pdf to introduce the same idea.



You didn’t introduce any idea. You said it’s “just a number” and wrote down a formula without any explanation or justification.

I concede that it was much shorter though. Well done!



Haha, you reminded me of that idea in software engineering that "it's easy to make an algorithm faster if you accept that at times it might output the wrong result; in fact you can make it infinitely fast".



Thanks for defining it rigorously. I think people are getting offended on John Baez's behalf because his book obviously covers a lot more - like why does this particular number seem to be so useful in so many different contexts? How could you have motivated it a priori? Etcetera, although I suspect you know all this already.

But I think you're right that a clear focus on the maths is useful for dispelling misconceptions about entropy.



Misconceptions about entropy are misconceptions about physics. You can't dispel them by focusing on the maths and ignoring the physics entirely - especially if you just write an equation without any conceptual discussion, not even mathematical.



I didn't say to only focus on the mathematics. Obviously wherever you apply the concept (and it's applied to much more than physics) there will be other sources of confusion. But just knowing that entropy is a property of a distribution, not a state, already helps clarify your thinking.

For instance, you know that the question "what is the entropy of a broken egg?" is actually meaningless, because you haven't specified a distribution (or a set of micro/macro states in the stat mech formulation).



Ok, I don’t think we disagree. But knowing that entropy is a property of a distribution given by that equation is far from “being it” as a definition of the concept of entropy in physics.

Anyway, it seems that - like many others - I just misunderstood the “little need for all the mystery” remark.



> is far from “being it” as a definition of the concept of entropy in physics.

I simply do not understand why you say this. Entropy in physics is defined using exactly the same equation. The only thing I need to add is the choice of probability distribution (i.e. the choice of ensemble).

I really do not see a better "definition of the concept of entropy in physics".

(For quantum systems one can nitpick a bit about density matrices, but in my view that is merely a technicality on how to extend probability distributions to Hilbert spaces.)



I’d say that the concept of entropy “in physics” is about (even better: starts with) the choice of a probability distribution. Without that you have just a number associated with each probability distribution - distributions without any physical meaning so those numbers won’t have any physical meaning either.

But that’s fine, I accept that you may think that it’s just a little detail.

(Quantum mechanics has no mystery either.

ih/2pi dA/dt = AH - HA

That’s it. The only thing one needs to add is a choice of operators.)



Sarcasm aside, I really do not think you are making much sense.

Obviously one first introduces the relevant probability distributions (at least the micro-canonical ensemble). But once you have those, your comment still does not offer a better way to introduce entropy other than what I wrote. What did you have in mind?

In other words, how did you think I should change this part of my course?



Many students will want to know where the minus sign comes from. I like to write the formula instead as S = sum_i p_i log( 1 / p_i ), where (1 / p_i) is the "surprise" (i.e., expected number of trials before first success) associated with a given outcome (or symbol), and we average it over all outcomes (i.e., weight it by the probability of the outcome). We take the log of the "surprise" because entropy is an extensive quantity, so we want it to be additive.



Everyone who sees that formula can immediately see that it leads to the principle of maximum entropy.

Just like everyone seeing Maxwell's equations can immediately see that you can derive the speed of light classically.

Oh dear. The joy of explaining the little you know.



As of this moment there are six other top-level comments which each try to define entropy, and frankly they are all wrong, circular, or incomplete. Clearly the very definition of entropy is confusing, and the definition is what my comment provides.

I never said that all the other properties of entropy are now immediately visible. Instead I think it is the only universal starting point of any reasonable discussion or course on the subject.

And lastly I am frankly getting discouraged by all the dismissive responses. So this will be my last comment for the day, and I will leave you in the careful hands of, say, the six other people who are obviously so extremely knowledgeable about this topic. /s



One could also say that it’s just a consequence of the passage of time (as in getting away from a boundary condition). The decay of radioactive atoms is also a measure of the arrow of time - of course we can say that’s the same thing.

CP violation may (or may not) be more relevant regarding the arrow of time.



This seems like a great resource for referencing the various definitions. I've tried my hand at developing an intuitive understanding: https://spacechimplives.substack.com/p/observers-and-entropy. TLDR - it's an artifact of the model we're using. In the thermodynamic definition, the energy accounted for in the terms of our model is information. The energy that's not is entropic energy. Hence why it's not "useable" energy, and the process isn't reversible.


Hmmm, I've noticed that list of things that contribute to entropy omits particles which under "normal circumstances" on earth exist in bound states; for example, it doesn't mention W bosons or gluons. But in some parts of the universe they're not bound but in a different state of matter, e.g. quark-gluon plasma. I wonder how, or if, this was taken into account.



Entropy is the distribution of potential over negative potential.

This could be said "the distribution of what ever may be over the surface area of where it may be."

This is erroneously taught in conventional information theory as "the number of configurations in a system" or the available information that has yet to be retrieved. Entropy includes the unforeseen and the out-of-scope.

Entropy is merely the predisposition to flow from high to low pressure (potential). That is it. Information is a form of potential.

Philosophically what are entropy's guarantees?

- That there will always be a super-scope, which may interfere in ways unanticipated;

- everything decays the only mystery is when and how.



This answer is as confident as it's wrong and full of gibberish.

Entropy is not a "distribution”, it's a functional that maps a probability distribution to a scalar value, i.e. a single number.

It's the negative mean log-probability of a distribution.

It's an elementary statistical concept, independent of physical concepts like “pressure”, “potential”, and so on.



It sounds like log-probability is the manifold surface area.

Distribution of potential over negative potential. Negative potential is the "surface area", and available potential distributes itself "geometrically". All this is iterative obviously, some periodicity set by universal speed limit.

It really doesn't sound like you disagree with me.



Baez seems to use the definition you call erroneous: "It’s easy to wax poetic about entropy, but what is it? I claim it’s the amount of information we don’t know about a situation, which in principle we could learn."



But it is possible to account for the unforeseen (or out-of-vocabulary) by, for example, a Good-Turing estimate. This satisfies your demand for a fully defined state space while also being consistent with GP's definition.



You are referring to the conceptual device you believe belongs to you and your equations. Entropy creates attraction and repulsion, even causing working bias. We rely upon it for our system functions.

Undefined is uncertainty is entropic.



All definitions of entropy stem from one central, universal definition: Entropy is the amount of energy unable to be used for useful work. Or better put grammatically: entropy describes the effect that not all energy consumed can be used for work.



There's a good case to be made that the information-theoretic definition of entropy is the most fundamental one, and the version that shows up in physics is just that concept as applied to physics.



My favorite course I took as part of my physics degree was statistical mechanics. It leaned way closer to information theory than I would have expected going in, but in retrospect should have been obvious.

Unrelated: my favorite bit from any physics book is probably still the introduction of the first chapter of "States of Matter" by David Goodstein: "Ludwig Boltzmann, who spent much of his life studying statistical mechanics, died in 1906, by his own hand. Paul Ehrenfest, carrying on the work, died similarly in 1933. Now it is our turn to study statistical mechanics."



Well it's part of math, which physics is already based on.

Whereas metaphysics is, imo, "stuff that's made up and doesn't matter". Probably not the most standard take.



Not really. Information theory applies to anything probability applies to, including many situations that aren't "physics" per se. For instance it has a lot to do with algorithms and data as well. I think of it as being at the level of geometry and calculus.



Yeah, people seemingly misunderstand that the entropy applied to thermodynamics is simply an aggregate statistic that summarizes the complex state of the thermodynamic system as a single real number.

The fact that entropy always rises, etc., has nothing to do with the statistical concept of entropy itself. It is simply an easier way to express the physics concept that individual atoms spread out their kinetic energy across a large volume.
