TY - GEN
T1 - Always Good Turing
T2 - 44th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2003
AU - Orlitsky, A.
AU - Santhanam, N. P.
AU - Zhang, J.
N1 - Publisher Copyright: © 2003 IEEE. Copyright 2015 Elsevier B.V., All rights reserved.
PY - 2003
Y1 - 2003
N2 - While deciphering the German Enigma code during World War II, I.J. Good and A.M. Turing considered the problem of estimating a probability distribution from a sample of data. They derived a surprising and unintuitive formula that has since been used in a variety of applications and studied by a number of researchers. Borrowing an information-theoretic and machine-learning framework, we define the attenuation of a probability estimator as the largest possible ratio between the per-symbol probability assigned to an arbitrarily long sequence by any distribution, and the corresponding probability assigned by the estimator. We show that some common estimators have infinite attenuation and that the attenuation of the Good-Turing estimator is low, yet larger than one. We then derive an estimator whose attenuation is one, namely, as the length of any sequence increases, the per-symbol probability assigned by the estimator is at least the highest possible. Interestingly, some of the proofs use celebrated results by Hardy and Ramanujan on the number of partitions of an integer. To better understand the behavior of the estimator, we study the probability it assigns to several simple sequences. We show that for some sequences this probability agrees with our intuition, while for others it is rather unexpected.
AB - While deciphering the German Enigma code during World War II, I.J. Good and A.M. Turing considered the problem of estimating a probability distribution from a sample of data. They derived a surprising and unintuitive formula that has since been used in a variety of applications and studied by a number of researchers. Borrowing an information-theoretic and machine-learning framework, we define the attenuation of a probability estimator as the largest possible ratio between the per-symbol probability assigned to an arbitrarily long sequence by any distribution, and the corresponding probability assigned by the estimator. We show that some common estimators have infinite attenuation and that the attenuation of the Good-Turing estimator is low, yet larger than one. We then derive an estimator whose attenuation is one, namely, as the length of any sequence increases, the per-symbol probability assigned by the estimator is at least the highest possible. Interestingly, some of the proofs use celebrated results by Hardy and Ramanujan on the number of partitions of an integer. To better understand the behavior of the estimator, we study the probability it assigns to several simple sequences. We show that for some sequences this probability agrees with our intuition, while for others it is rather unexpected.
KW - Computer science
UR - http://www.scopus.com/inward/record.url?scp=33746333177&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33746333177&partnerID=8YFLogxK
U2 - 10.1109/SFCS.2003.1238192
DO - 10.1109/SFCS.2003.1238192
M3 - Conference contribution
AN - SCOPUS:33746333177
T3 - Proceedings - Annual IEEE Symposium on Foundations of Computer Science, FOCS
SP - 179
EP - 188
BT - Proceedings - 44th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2003
PB - IEEE Computer Society
Y2 - 11 October 2003 through 14 October 2003
ER -