TY - JOUR
T1 - Universal compression of memoryless sources over unknown alphabets
AU - Orlitsky, Alon
AU - Santhanam, Narayana P.
AU - Zhang, Junan
N1 - Funding Information: Manuscript received March 29, 2003; revised March 31, 2004. This work was supported by the National Science Foundation under Grant CCR-0313367. The material in this paper was presented in part at the IEEE International Symposium on Information Theory, Yokohama, Japan, June/July 2003.
PY - 2004/7
Y1 - 2004/7
N2 - It has long been known that the compression redundancy of independent and identically distributed (i.i.d.) strings increases to infinity as the alphabet size grows. It is also apparent that any string can be described by separately conveying its symbols, and its pattern - the order in which the symbols appear. Concentrating on the latter, we show that the patterns of i.i.d. strings over all, including infinite and even unknown, alphabets, can be compressed with diminishing redundancy, both in block and sequentially, and that the compression can be performed in linear time. To establish these results, we show that the number of patterns is the Bell number, that the number of patterns with a given number of symbols is the Stirling number of the second kind, and that the redundancy of patterns can be bounded using results of Hardy and Ramanujan on the number of integer partitions. The results also imply an asymptotically optimal solution for the Good-Turing probability-estimation problem.
AB - It has long been known that the compression redundancy of independent and identically distributed (i.i.d.) strings increases to infinity as the alphabet size grows. It is also apparent that any string can be described by separately conveying its symbols, and its pattern - the order in which the symbols appear. Concentrating on the latter, we show that the patterns of i.i.d. strings over all, including infinite and even unknown, alphabets, can be compressed with diminishing redundancy, both in block and sequentially, and that the compression can be performed in linear time. To establish these results, we show that the number of patterns is the Bell number, that the number of patterns with a given number of symbols is the Stirling number of the second kind, and that the redundancy of patterns can be bounded using results of Hardy and Ramanujan on the number of integer partitions. The results also imply an asymptotically optimal solution for the Good-Turing probability-estimation problem.
KW - Large and unknown alphabets
KW - Patterns
KW - Set and integer partitions
KW - Universal compression
UR - http://www.scopus.com/inward/record.url?scp=3042606358&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=3042606358&partnerID=8YFLogxK
U2 - 10.1109/TIT.2004.830761
DO - 10.1109/TIT.2004.830761
M3 - Article
AN - SCOPUS:3042606358
SN - 0018-9448
VL - 50
SP - 1469
EP - 1481
JO - IEEE Transactions on Information Theory
JF - IEEE Transactions on Information Theory
IS - 7
ER -