Notes on Low-N Quantum Information Theory

(a report on work in progress)


It is well known that a quantum source has minimum
compressed size equal to the Von Neumann entropy, which
is closely related to the Shannon entropy.  The classical
and quantum noiseless coding results for an order-0
probabilistic source are identical in the limit of
large N (message size), assuming the entropy is
measured after the character of the source is known.
However, quantum and classical entropies necessarily
differ for small N, and when the entropy is measured
without knowledge of the source.

I conclude that the optimal quantum modeling algorithm
is uncomputable (whereas the optimal classical model may
be reached asymptotically by simply making the model
more and more comprehensive).

note : some of the considerations here may be too
imprecise; in particular, I should use the Holevo entropy
instead of the Von Neumann entropy when I start relaxing
some assumptions.  Many questions attacked near the end
are open questions in quantum info-theory.  Nonetheless,
the conclusion rests on a difference of entropies, which
removes the uncertainty from each, leaving a definite
result which I believe is unchanged in a more rigorous
treatment.


I considered low-N issues in classical information theory recently; see my earlier articles on the delicate matters therein. In conversation with Michael Schindler, I came to the viewpoint :

> with N -> infinity you won't get any quantum effects,
> and there are a lot of choosable parameters for small N.

I've gradually developed a concrete philosophy on these matters. I think that the small-N effects are due to a fundamental conceptual difficulty in information theory. There are two basic interpretations of entropy and strings :

I. S is a string chosen with a probability P(S) from a source of strings. The entropy is H(S) = -(1/N) log2 P(S) (here H(S) is rescaled to be the entropy per bit).

II. S is a string handed to "me" from god knows where. The entropy H(S) is a quantity computed based only on the properties of S which I observe. (for example, there exists no universal compressor which packs S to less than H(S) bits per bit)

Now for N -> large, it is known that views I and II are equal. For small N, we can interpret I and II in terms of the quote above : I is based on the actual probability of a bit being 0, while II is based on the actual number of bits which are 0. It is interesting to note that most information-theoretic proofs are done on the entropy defined in I, not that in II. (of course, even when using view I they also usually take N large).

Now in quantum information we have a key difference : we cannot use view II ! Why? Because if we look at our string S to compute H(S) by examining only S, we have destroyed the quantum entanglement in S (that is, looking at it required measuring it). Thus, quantum info theory must be based on view I - we must talk about random sources, not the more practical case of an unknown source and a given string! By carefully considering quantum compression we can see that this issue is more than philosophical.

Let me briefly review the "quantum noiseless coding theorem", to establish my notation.
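(An aside before the review: the view I / view II gap at small N can be checked numerically. This is only an illustrative sketch in plain Python; the source probability 0.25 is a made-up example value. The average view-II (empirical) entropy over strings from a known Bernoulli source sits strictly below the view-I entropy at finite N, and closes the gap as N grows.)

```python
import itertools, math

def shannon(p):
    """Shannon entropy (bits) of a Bernoulli(p) source: the view-I entropy."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def empirical_entropy(s):
    """View-II entropy: based only on the actual count of 0s in the string."""
    return shannon(s.count('0') / len(s))

def mean_empirical(N, p):
    """Average view-II entropy over all length-N strings, weighted by their
    view-I probability of being emitted by the order-0 source."""
    total = 0.0
    for bits in itertools.product('01', repeat=N):
        s = ''.join(bits)
        k = s.count('0')
        total += (p ** k) * ((1 - p) ** (N - k)) * empirical_entropy(s)
    return total

p = 0.25
print(shannon(p))             # view I:  0.811...
print(mean_empirical(4, p))   # averaged view II: strictly below view I
print(mean_empirical(12, p))  # the gap shrinks as N grows
```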
First, the classical theorem: given a message S of length N symbols and entropy H per bit from some *known* source, there are 2^N possible strings, but only 2^(NH) occur "often". Hence, we may find a mapping, C (the compressor), from the 2^(NH) most likely strings down to a set of NH-bit codewords. Thus our compressor is :

    if S is in C : send 0 + C(S)
    if S is not  : send 1 + S

The flag bit is too small to matter (even when we take N "small", it will be something like 1,000,000, so 1 bit is irrelevant), and the first case occurs with probability (1-e), where e is epsilon, so we have sent the symbols in NH bits.

*: the mapping C may be computed only if we know the character of the source (eg. a binary order-0 source (order-0 means no correlations) with known probabilities), in which case C is simply a list of the most probable sequences, in order from highest to lowest probability. BTW I may use the notation H(source|source), which should read "entropy of average strings from a source, given the specifications of that source".

Now the quantum theorem; I'll be careful about the assumptions in this one. We wish to encode a quantum string (S = a sequence of qubits) taken from a "quantum source" = a density matrix (r, for rho). rho may be diagonalized by a suitable unitary transform, U, so that

    U r U^+ = Sum[n] p_n |n><n|    (with <n|m> = delta(n,m))

#1. assumption: the qubits extracted from S are uncorrelated; that is, r is a tensor product of N identical one-qubit density matrices; q0 and q1 are the diagonal elements of the one-qubit density matrices (or {q}, read "the set of q's").

Now that we have diagonalized rho, it acts like an ensemble, hence a classical source; we draw a particular string (n) of pure qubits from rho with probability p_n. Hence, we may again find a mapping C; this is a mapping from N-qubit typical qustrings to NS-qubit qustrings (tensored with some placeholder of size N(1-S)). Again, this mapping is successful (1-e) of the time. (when it is not successful, we may (must?)
use a classical channel to send flag bits and handle the remainder, but this occurs so rarely that we may neglect it henceforth).

BTW S is the quantum entropy :

    S(r) = - Tr{ r log(r) } = - Sum[n] p_n log p_n = H(p_n)

where H is the Shannon entropy for the alphabet. So, we have found that :

    H( source | its probs ) = S( source | its rho )

where '|' reads "given". This is an old result, but we emphasize a detail : the conditional information of knowing the character of the source (probabilities for the classical theorem, rho for the quantum theorem) is different!

To make this difference most manifest, we see that for the decoder to compute the mapping 'C', he must know the probabilities, {p} (we will henceforth use {p} as a vector of probabilities in the space of the alphabet). This is true for quantum and classical alike, and causes an extra cost of about 2logN bits (we must transmit the probabilities up to an accuracy of 1/N ; they are real numbers in a true quantum source); this is an irrelevant factor which is no larger than the cost of transmitting N itself. Now, if the quantum decoder is to reproduce the original state, he must also undo the unitary transform that diagonalized rho; if you prefer, you may imagine that he computed the mapping C using the diagonal basis, and now needs to know how to transform that basis to match the input data.

Also, I will frequently cite the fact that "we are not allowed to measure the qustring until it has been used"; this is because measurement causes Copenhagen-collapse, and so destroys the qustring. This problem cannot be avoided, because of the no-copy theorem - no copying and no measurement are inherent properties of quantum data.
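The identity S(r) = -Tr{ r log(r) } = H(p_n) is easy to check numerically; here is a minimal sketch (plain Python with numpy; the rotation angle 0.3 and the probabilities {0.9, 0.1} are made-up illustration values):

```python
import numpy as np

def von_neumann_entropy(rho):
    """S(rho) = -Tr{ rho log2(rho) }, computed from the eigenvalues of rho."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]          # 0 log 0 = 0
    return float(-np.sum(evals * np.log2(evals)))

# a one-qubit density matrix, diagonal in a rotated basis
theta = 0.3
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
rho = U @ np.diag([0.9, 0.1]) @ U.T       # U r U^+ with {q} = {0.9, 0.1}

# S(rho) equals the Shannon entropy of the diagonal probabilities,
# no matter what basis rho is handed to us in:
q = np.array([0.9, 0.1])
H = float(-np.sum(q * np.log2(q)))
print(von_neumann_entropy(rho), H)        # both ~0.469
```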

An Adaptive Quantum Coder

The equality of H and S given the character of the source is broken when we construct a coder which does not rely on prior knowledge of the source. This is done by creating an adaptive coder (these are well known in the classical domain, and are known to be asymptotically optimal; we will examine the case of N finite, so the asymptote matters not). Let me emphasize that we want to think of rho (the quantum source) as a black box : we can take qubits from it, but we can do no other operations on it to find out information about it (we do not have the user's manual describing its probability distribution and basis).

#2. Assume the N-qubit qustring can be broken into packets (N/n of them) of length n, with n << N. We can certainly draw n qubits from the "source" rho, then wait and draw n more, etc. Our assumption is that we can draw n qubits, transmit them, let the decoder use them for whatever the application was, then *measure* them, and then draw the next set of n. This is certainly true when the qu-source is the simple tensor product (an order-0 source), but it may not be true if the source can produce long-range correlated qustrings, in which case this packetization would ruin the strings.

So, the adaptive coder algorithm is :

Step 1. put {p} = {1/A, 1/A, ..} (A is the alphabet size)
Step 2. compute C = the list of most likely n-qubit sequences (using {p} this is easy), such that C contains 2^(nH(p)) entries
Step 3. the encoder draws n qubits from rho into a string S
Step 4. the encoder sends C(S) to the decoder (over a quantum channel)
Step 5. the decoder computes S = C^-1(C(S)), and then uses S (note assumption #2 above)
Step 6. the decoder *measures* S in some basis, defined by a unitary transform U. We then count the number of occurrences of each symbol = {n1, n2, ..}. Then compute {p (new)} = 1/2 ( {p (old)} + {n}/n )
Step 7. the decoder sends {p (new)} back to the encoder over a classical channel
Step 8. if we've sent N qubits : end ; else, goto step 2.
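The steps above can be sketched as a classical simulation (plain Python; the source probability 0.85, N, and n are made-up illustration values, and the quantum channel and step-6 measurement are simulated by simply reading the bits):

```python
import itertools, math, random

def shannon(p):
    """Shannon entropy (bits) of a Bernoulli(p) distribution."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def codebook(p0, n):
    """Step 2: the most likely n-bit sequences under the current estimate
    {p}, keeping roughly 2^(n H(p)) of them (the typical set)."""
    seqs = [''.join(b) for b in itertools.product('01', repeat=n)]
    seqs.sort(key=lambda s: -(p0 ** s.count('0')) *
                            ((1 - p0) ** (n - s.count('0'))))
    return set(seqs[:2 ** max(1, round(n * shannon(p0)))])

def adaptive_coder(source_p0, N, n, seed=0):
    """Classical skeleton of the adaptive loop (steps 1-8); 'drawing a packet
    from rho' is simulated as random bits with P(0) = source_p0."""
    rng = random.Random(seed)
    p0 = 0.5                              # step 1: flat prior over the alphabet
    hits = 0
    for _ in range(N // n):
        C = codebook(p0, n)               # step 2: build the mapping C
        S = ''.join('0' if rng.random() < source_p0 else '1'
                    for _ in range(n))    # step 3: draw a packet
        hits += S in C                    # steps 4-5: typical packets get short codes
        n0 = S.count('0')                 # step 6: count symbol occurrences
        p0 = 0.5 * (p0 + n0 / n)          # step 6: {p new} = 1/2({p old} + {n}/n)
        # step 7: p0 goes back to the encoder over a classical channel
    return p0, hits

p_est, hits = adaptive_coder(source_p0=0.85, N=1200, n=12)
print(p_est, hits)   # p_est settles near 0.85; most packets are typical
```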
Let me emphasize some points :

Remark 1. in step 6, we had to make use of U, the operator that diagonalizes rho; I will discuss this more momentarily; for now you may assume it was known in advance.

Remark 2. This encoder is asymptotically optimal, for N and (N/n) both infinite. It is essentially the same as a classical adaptive probability coder in this sense. During the initial periods the probability estimation will be far from precise; however, I will not discuss this in detail because I think it is identical to the classical case, and what we seek here is delta(quantum - classical).

Remark 3. Step 7 is inherently quantum mechanical. In a classical coder, the encoder can also perform step 6, and thereby compute the {p} himself. However, on qustrings, the encoder cannot compute the {p}, because to do so requires measurement, and if the encoder measures S, then he cannot send it to the decoder (the qustring is ruined).

Now some remarks on the problem of U. This transform can also be adaptively estimated, and is also a uniquely quantum mechanical problem. First, consider what happens if we use the wrong U : imagine a qustring of 1/0 bits corresponding to up/down Z spins. If I measure these along X, then I will get a completely random (classical) result; if I had used the correct Z, then I would have found the qustring, with H(result) = S(qustring), which is the minimum. Hence, we can find U by varying it adaptively to minimize H. I do this by expanding the above coder :

add to step 1 :
    let U(theta) = rotate spin by theta
    put theta = 0 ; let epsilon be some small number
add to step 6 :
    let H_cur = H( {n}/n ), the Shannon entropy of the actually-observed most recent n qubits
    if H_cur > H_last then epsilon *= -1
    rotate theta by epsilon
    let H_last = H_cur

So, we have added a simple walk in step 6, so that U(theta) is shifted to minimize H; this will be subject to wiggles, but will eventually settle down and find the diagonal basis for rho.
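A toy simulation of this epsilon-walk (plain Python; the true angle 0.7, probabilities {0.95, 0.05}, packet size, and step size are all made-up illustration values, and the rotation here is a real-valued sketch rather than a full spin-1/2 treatment):

```python
import math, random

def shannon(p):
    """Shannon entropy (bits) of a Bernoulli(p) distribution."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def measure_packet(theta, theta_true, q0, n, rng):
    """Measure n qubits (each diagonal in the basis at angle theta_true,
    with probabilities {q0, 1-q0}) in the basis at angle theta; return the
    observed fraction of 0s.  Born rule for misalignment d:
    P(0) = q0 cos^2(d) + (1-q0) sin^2(d)."""
    d = theta - theta_true
    p0 = q0 * math.cos(d) ** 2 + (1 - q0) * math.sin(d) ** 2
    return sum(rng.random() < p0 for _ in range(n)) / n

def estimate_U(theta_true=0.7, q0=0.95, n=2000, steps=400, seed=1):
    """The epsilon-walk added to step 6: nudge theta each cycle, flipping the
    direction of the nudge whenever the observed entropy went up."""
    rng = random.Random(seed)
    theta, eps, H_last = 0.0, 0.05, 1.0
    for _ in range(steps):
        H_cur = shannon(measure_packet(theta, theta_true, q0, n, rng))
        if H_cur > H_last:
            eps = -eps
        theta += eps
        H_last = H_cur
    return theta

theta = estimate_U()
print(theta)  # wiggles, but settles near theta_true = 0.7
```

The walk never stops dithering (epsilon is not shrunk), which is one piece of the "cost of U adaptation" discussed below.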
Note that we cannot take the classical approach of changing U over and over until we find the right answer - we can only change it once per cycle, since testing U requires measuring the qustring.

Let me emphasize that estimating (U, {p}) is identical to estimating rho - I have separated the problem into two pieces for two reasons : it is easy to construct C from the set {p} (simply list classical strings), and this separation also shows the difference between the classical and quantum case. The intuitive interpretation is that a quantum source is harder to specify than a classical one, so if we know that S(qusource|source type) = H(source|source type), then S > H without knowledge of the sources.

Finally, some remarks to generalize assumption #1. If the source is some Markov stochastic process (for example; it could be anything non-order-0), and we assume that #2 still holds, then we can generalize our coder (conceptually, at least). We replace {p} by M, the internal model state (for, eg, a Markov model (PPM)). When we compute C, we use M instead of just {p}. Similarly, rather than just counting {n1,n2..} to update {p}, we pass the model M over the measured qustring and let it update itself in some elaborate way; ** this update "delta M" must then be transmitted back to the encoder! I emphasize that we are still leaving assumption #2 intact, but we have successfully relaxed assumptions #1 and #3.

Assumption #4 : even if rho is not a simple tensor product (that is, not a source of N qubits where each qubit is identically and independently distributed), each qubit source is still diagonal in the same basis. That is, rho is the unitary transform of a diagonal (ensemble) density matrix, which may be Markov in its {p} distribution. Thus we have loosened assumption #1 but not completely relaxed it.
Quantum theorists will be eager to point out that S does not depend on assumption #4, and this is true; however, the step-6 estimation of U in the adaptive coder (which allows modelling, which allows construction of C) cannot be done without specific knowledge of the U that diagonalizes rho. If we wished to relax assumption #4 we would have to proceed by estimating the distribution of U over symbols, just as the model M of probabilities tried to estimate the distribution in the diagonal source. Thus, we would have a probability model, M({p}), and a transform-basis model, M(U). These would conflict with each other, and both would have to be sent back to the encoder. The minimization technique for finding U would become quite poor when considering the possibility of different U's for each symbol. (for example, if U was random for each character drawn from rho, then we could never compress to better than 1 bit per bit. This is because S(source|rho) may be finite, but when U is random, |rho| is colossal, so the best we can do is to not compress at all).

More realistically, U might be U(n*pi/1000), where n is the symbol-number and U(theta) is a spin-rotation; this has low intrinsic entropy (knowing the form of U, |U| is small), so that as N -> infinity we would theoretically figure out U and achieve asymptotic optimality, but for N small, U would be very hard to estimate.

Phrased another way, consider a source rho of entropy S = H in its diagonal basis (S is basis-free, but S = H only in the diagonal basis), so that H(rho|U) = S. Let each qubit drawn from rho be diagonal either in the z spin or x spin direction, randomly. Then, we wish to classically measure spins drawn from rho. We may choose to measure each qubit along x or z = 2^N possible choices.
On each qubit we get wrong, the entropy is 1, while on each qubit we get right, the entropy is S; we have a 1/2 probability of getting each qubit right, so that the perceived entropy is H > (S+1)/2. (the greater sign is because we cannot know which measurements we got wrong, so the bits are inseparable, and we must mix the probabilities properly : if S = -p log(p) - q log(q) (p+q=1), then the observed probability of a 0 is (1/2)p + (1/2)(1/2) = (2p+1)/4, so

    H = -[(2p+1)/4] log((2p+1)/4) - [(2q+1)/4] log((2q+1)/4)
      = 2 - (1/4)[ (2p+1) log(2p+1) + (2q+1) log(2q+1) ]  )

Again, this is because the 2^N choices of ways to measure correspond to 2^N possible values of U, so that |U| = N bits. Note the inequality :

    prob(guess U)*S(rho|U) + prob(not guess U)*1 < S(rho) < S(rho|U) + |U|

The right-hand part follows because we can code S(rho) by sending U in |U| classical bits, and then coding S(rho|U); note that if rho is a black box, |U| must also include the estimation cost. The left-hand side follows because at each qubit, if we guess U correctly (prob(guess U) is for a single qubit, eg. 1/2 above) then we get S(rho|U), but if we fail we get a random bit (the 1), and the true S can only be greater than this sum. (ie, if S = 0 and prob = 1/2, the left side gives 1/2, but the right answer is 0.81). (prob(guess U) is more correctly the overlap of the true diagonal basis and the guessed diagonal basis = |<true|guess>|^2 )
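A quick numeric check (plain Python): mixing the correct-basis outcome probabilities {p, q} with the wrong-basis outcomes {1/2, 1/2} gives the observed probability (2p+1)/4, which reproduces the 0.81 figure quoted above for S = 0.

```python
import math

def shannon2(a, b):
    """Shannon entropy (bits) of a two-outcome distribution {a, b}."""
    return -sum(x * math.log2(x) for x in (a, b) if x > 0)

def perceived_entropy(p):
    """Measure in the right basis half the time (outcome probs p, q) and in
    the wrong basis the other half (outcome probs 1/2, 1/2); the mixed
    probability of seeing a 0 is (1/2)p + (1/2)(1/2) = (2p+1)/4."""
    q = 1 - p
    return shannon2((2 * p + 1) / 4, (2 * q + 1) / 4)

# the S = 0 example: p = 1, true entropy 0, perceived entropy ~0.81
print(perceived_entropy(1.0))
# concavity gives the stated bound H > (S + 1)/2 :
p = 0.8
print(perceived_entropy(p), (shannon2(p, 1 - p) + 1) / 2)
```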


The total channel capacity needed for the adaptive coder is then :

    TCC(classical) = H(source | M) + (cost of M adaptation)

where (source | M) is the source given the model, computed using the 'C' above, or with a more conventional compressor. (if you're unfamiliar with general models, simply replace M with {p}, so that |M| is the 2log(N) I discussed before). And :

    TCC(quantum) = S(source | M) + (cost of M adaptation) + (cost of U adaptation) + |M| + |U|
                 = TCC(classical) + (cost of U adaptation) + |M| + |U|

(Note that the |M| is sent in the decoder -> encoder feedback (step 7) over a classical channel). There is also the additional cost of the U adaptation, which cannot be quantitatively counted (and our scheme here is probably not the optimal one). (note also that the U adaptation cost is a finite number which depends on the source and n, but not on N; this cost is usually neglected because of that fact, though it may be quite large). (it is also roughly equal to the cost of M adaptation; furthermore, the M and U adaptations may interfere with each other, since they both use the 'local' entropy as a hint that they are improving; this would drive both of their costs up further; I merge this effect into the cost of U). (if we relax assumption #4 as discussed above, the cost of U adaptation becomes even larger).

This channel capacity is remarkably different for the case of a highly complicated source : the classical modeling paradigm is that the encoder and decoder can create baroque models to figure out what the source is. The quantum compressor must explicitly send the model, which puts severe constraints on it (we can probably make |M| a finite constant, but it is often of order 10*N in practical work).
This is not simply a practical issue; it means that the quantum coder cannot arbitrarily make his model fancier to get more compression (as a classical coder can) - he must solve the "optimal dictionary problem". This is a problem analogous to that in LZ77 coding of choosing the optimal dictionary such that { H(source | dict) + |dict| } is minimized; the first term goes down with a bigger dictionary (M) and the second goes up, so there is a well-defined minimum. This problem is well known to be unsolvable, except by brute force.

Note that this even means crisis for asymptotic compression of quantum strings : we achieve asymptotic optimality (arbitrarily close to the true entropy with correlations removed) in classical data by letting |M| -> infinity as N -> infinity, so that the model learns everything there is to know. This means that |M| = c N, for some c. If we use this in the quantum coder, we have increased the entropy by c, and hence not achieved asymptotic optimality. It is, however, possible that we can use |M| = c N for arbitrarily small c and still get optimality, though that is not obvious. This optimality is also certainly not true when N is small, where |M| = N is the upper bound on |M|, and is in fact attained by all known asymptotically-optimal models (on stochastic sources : LZ and PPM).

There is another way to discover these differences : in the limit of the optimal modelling algorithm, the classical total channel capacity becomes the Kolmogorov algorithmic entropy. In the quantum case we can certainly imagine a quantum computer generating our string, so that we could define a quantum Kolmogorov coder : send the smallest algorithm which generates the desired string.
This is well defined (and I conjecture that it would be the same as the classical Kolmogorov entropy in the diagonal basis; but since neither one is computable, and we don't know the relation between quantum and classical algorithm sizes yet, this conjecture is an open question), but this method does not work, and cannot work. Why? Because the encoder doesn't know which string to create! (that is, he cannot measure the qu-string until it's been used by the desired recipient).

We might imagine that there are clever ways to get around this. For example : enumerate all quantum machines; apply the inverse of one to our qustring (since qumachines are unitary); this goes back to the source string; if it's not the smallest, then run the machine forward to get our original string back, and repeat. But I've measured the string and ruined it without telling you, when I said the word "smallest". Similarly, I could just gather all short quantum machines and run them, and compare their output to my string to see if any match; of course it is obvious that this requires measurement of the string.
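Returning to the "optimal dictionary problem" above, the two-part cost { H(source|dict) + |dict| } can be brute-forced on a toy string. Everything here is a made-up illustration: the candidate words, the flat 8-bit charge per dictionary word, and the crude fixed-length symbol code standing in for H(source|dict).

```python
import itertools

def coded_length(s, dictionary):
    """H(source|dict), toy version: greedy parse of s; each emitted symbol
    (a dictionary index or an escape) costs (len(dict)+1).bit_length() bits,
    and an escaped literal costs one extra raw bit."""
    sym_bits = max(1, (len(dictionary) + 1).bit_length())
    i, bits = 0, 0
    while i < len(s):
        hit = next((w for w in dictionary if s.startswith(w, i)), None)
        if hit:
            bits += sym_bits
            i += len(hit)
        else:
            bits += sym_bits + 1      # escape symbol + one raw bit
            i += 1
    return bits

def optimal_dictionary(s, words):
    """Brute-force min over dictionaries of { H(source|dict) + |dict| },
    charging a flat 8 bits per dictionary word for |dict|."""
    best = (coded_length(s, ()), ())
    for r in range(1, len(words) + 1):
        for d in itertools.combinations(words, r):
            best = min(best, (coded_length(s, d) + 8 * len(d), d))
    return best

s = "0110" * 15                       # a highly repetitive 60-bit string
cost, d = optimal_dictionary(s, ["0110", "01100110", "0", "1"])
print(cost, d)
```

The cost curve behaves as described: the first term falls and the second rises as the dictionary grows, and only exhaustive search over dictionaries finds the minimum.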

Charles Bloom / cb at my domain
