Information Entropy and Learning


I recently came across the concept of information entropy, which measures the average surprise in a source of information. The surprise of an individual event, its surprisal, grows as the event becomes less likely: learning that the sun rose today carries very low surprisal, whereas learning that the stock market crashed carries high surprisal.
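To make this concrete, surprisal is defined as the negative log of an event's probability. A minimal sketch, using made-up probabilities for the two examples above:

```python
import math

def surprisal(p: float) -> float:
    """Information content (surprisal) of an event with probability p, in bits."""
    return -math.log2(p)

# Illustrative probabilities, invented for the example:
print(surprisal(0.999))  # sun rose today -> ~0.0014 bits, barely any information
print(surprisal(0.01))   # market crashed -> ~6.64 bits, highly surprising
```

The base-2 logarithm gives units of bits; a fair coin flip, at probability 0.5, carries exactly 1 bit of surprise.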

This concept was initially introduced in Claude Shannon's 1948 paper A Mathematical Theory of Communication. In it, he describes how language can be approximated by looking at the probability distribution of each letter in a language's alphabet. In the case of English, we'd be looking at the probability distribution of the 26 letters, plus the space character (omitting punctuation and other characters for simplicity).
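Estimating that distribution is just counting. A toy sketch over one sentence rather than a real corpus:

```python
from collections import Counter

# Estimate the letter-plus-space distribution from a (tiny) sample of English.
text = "the quick brown fox jumps over the lazy dog".lower()
counts = Counter(c for c in text if c.isalpha() or c == " ")
total = sum(counts.values())
probs = {c: n / total for c, n in counts.most_common()}
print(probs[" "])  # space is the most frequent symbol in running English text
```

With a large enough corpus, these estimated frequencies converge on the familiar tables (e, t, a as the most common letters), which is exactly the structure Shannon's approximations build on.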

Randomly and independently selecting characters according to their probabilities produces strings of text that look like gibberish. However, if we select each character depending on the preceding N characters, we quickly begin to produce strings that look similar to English. Similarly, if we instead choose words based on their likelihood given the preceding M words, we begin to approximate coherent sentences.
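This is an order-N Markov model over characters. A minimal sketch, using a toy corpus (any real text would work in its place):

```python
import random
from collections import Counter, defaultdict

def sample_text(corpus: str, n: int, length: int, seed: int = 0) -> str:
    """Generate text by sampling each character conditioned on the preceding n characters."""
    rng = random.Random(seed)
    # Count which characters follow each n-character context in the corpus.
    counts = defaultdict(Counter)
    for i in range(len(corpus) - n):
        counts[corpus[i:i + n]][corpus[i + n]] += 1
    out = corpus[:n]  # seed the generation with the corpus opening
    for _ in range(length):
        ctx = out[-n:]
        if ctx not in counts:  # dead end: restart from the opening context
            ctx = corpus[:n]
        chars, weights = zip(*counts[ctx].items())
        out += rng.choices(chars, weights=weights)[0]
    return out

corpus = "the cat sat on the mat and the cat ate the rat "
print(sample_text(corpus, n=3, length=40))
```

At n=0 this degenerates to the independent-sampling gibberish; raising n makes the output look progressively more like the source text.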

This has a few implications. One is that you can exploit this structure in language to implement lossless compression, preserving the same information in fewer bits. Additionally, we can say that a string of text has higher surprisal if it contains many unlikely substrings. On its own, though, this is a poor proxy for meaning: pure noise has maximal entropy precisely because it is structureless.
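The connection to compression is that a source's entropy is a lower bound on the average bits per symbol any lossless code can achieve. A small sketch of the entropy calculation:

```python
import math
from collections import Counter

def entropy_bits(text: str) -> float:
    """Shannon entropy of a string's character distribution, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A uniform 4-symbol string needs 2 bits/char; a skewed distribution needs
# fewer, and that gap is the headroom a lossless compressor exploits.
print(entropy_bits("abcdabcdabcdabcd"))  # -> 2.0
print(entropy_bits("aaaaaaabbbbccd"))    # skewed -> below 2 bits/char
```

Note this only captures single-character structure; real compressors also exploit the conditional structure between neighboring characters described above.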

LLMs work in a similar fashion. During generation, each token is probabilistically selected as a function of the previous tokens: given the input tokens and the tokens produced so far, the transformer generates a probability distribution over its vocabulary and samples the next token from that distribution.
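The final sampling step can be sketched in a few lines. The vocabulary and logits below are hypothetical stand-ins for what a real model would emit:

```python
import math
import random

def softmax(logits):
    """Convert raw model scores (logits) into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token scores; a real model emits one logit per vocab entry.
vocab = ["the", "cat", "sat", "banana"]
logits = [2.0, 1.0, 0.5, -3.0]
probs = softmax(logits)
next_token = random.choices(vocab, weights=probs)[0]
print(dict(zip(vocab, (round(p, 3) for p in probs))), "->", next_token)
```

In practice this step is modulated by temperature, top-k, or nucleus sampling, but the core loop is the same: distribution over tokens, then a draw from it.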

This got me thinking about learning and reading technical or dense material. I'd developed a maladaptive habit of treating each sentence with equal priority, rather than skimming through the unimportant parts. This was exhausting and meant that I quickly burnt out attempting to read anything difficult. It also ignored that information retrieval is now easy, while discovery is still difficult.

I'm now skimming through much of the material and slowing down to focus my attention when I discover something surprising (which, fortunately, is often also interesting!). Through the information entropy lens, the subjective experience of surprise signals a misalignment between my internal world model and the "true" world model.

Optimizing for surprise means expanding, pruning, and correcting your mental models more efficiently. By prioritizing information this way, you paint the broad strokes first and fill in the details later if you need them. It's methodically building a map of the terrain and surfacing the unknown unknowns, so that you have more tools at your disposal.