Universal compression of memoryless sources over unknown alphabets

Alon Orlitsky; Narayana P. Santhanam; Junan Zhang

doi:10.1109/TIT.2004.830761

Research output: Contribution to journal › Article › peer-review

108 Scopus citations

Abstract

It has long been known that the compression redundancy of independent and identically distributed (i.i.d.) strings increases to infinity as the alphabet size grows. It is also apparent that any string can be described by separately conveying its symbols, and its pattern - the order in which the symbols appear. Concentrating on the latter, we show that the patterns of i.i.d. strings over all, including infinite and even unknown, alphabets, can be compressed with diminishing redundancy, both in block and sequentially, and that the compression can be performed in linear time. To establish these results, we show that the number of patterns is the Bell number, that the number of patterns with a given number of symbols is the Stirling number of the second kind, and that the redundancy of patterns can be bounded using results of Hardy and Ramanujan on the number of integer partitions. The results also imply an asymptotically optimal solution for the Good-Turing probability-estimation problem.
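
The notion of a pattern and the two counting claims in the abstract can be checked on small cases. The following is a minimal Python sketch, not taken from the paper: it computes the pattern of a string by replacing each symbol with the index of its first appearance, then verifies by brute force that the number of length-n patterns matches the Bell number and that the number of patterns with m distinct symbols matches the Stirling number of the second kind. The function names (pattern, stirling2, bell, patterns_of_length) are illustrative assumptions, and the sketch does not implement the compression scheme itself.

```python
from itertools import product

# Illustrative sketch only; all names here are hypothetical, not from the paper.


def pattern(s):
    """Replace each symbol by the 1-based index of its first appearance."""
    index = {}
    out = []
    for symbol in s:
        if symbol not in index:
            index[symbol] = len(index) + 1
        out.append(index[symbol])
    return tuple(out)


def stirling2(n, m):
    """Stirling number of the second kind, via the standard recurrence."""
    if n == m:
        return 1
    if m == 0 or m > n:
        return 0
    return m * stirling2(n - 1, m) + stirling2(n - 1, m - 1)


def bell(n):
    """Bell number as the sum of Stirling numbers over all block counts."""
    return sum(stirling2(n, m) for m in range(n + 1))


def patterns_of_length(n):
    """Brute force: distinct patterns over all length-n strings.

    An alphabet of n symbols suffices, since a length-n string uses
    at most n distinct symbols.
    """
    return {pattern(w) for w in product(range(n), repeat=n)}


if __name__ == "__main__":
    print(pattern("abracadabra"))
    for n in range(1, 6):
        pats = patterns_of_length(n)
        assert len(pats) == bell(n)
        for m in range(1, n + 1):
            # max(p) equals the number of distinct symbols in pattern p
            assert sum(1 for p in pats if max(p) == m) == stirling2(n, m)
    print("counts agree for n = 1..5")
```

The brute-force enumeration grows as n^n, so it is only meant as a sanity check for small n.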

Original language	English (US)
Pages (from-to)	1469-1481
Number of pages	13
Journal	IEEE Transactions on Information Theory
Volume	50
Issue number	7
DOIs	https://doi.org/10.1109/TIT.2004.830761
State	Published - Jul 2004
Externally published	Yes

Keywords

Large and unknown alphabets
Patterns
Set and integer partitions
Universal compression

ASJC Scopus subject areas

Information Systems
Computer Science Applications
Library and Information Sciences

Cite this

@article{ba2d87252fd141429eb98acca334c54b,

title = "Universal compression of memoryless sources over unknown alphabets",

abstract = "It has long been known that the compression redundancy of independent and identically distributed (i.i.d.) strings increases to infinity as the alphabet size grows. It is also apparent that any string can be described by separately conveying its symbols, and its pattern - the order in which the symbols appear. Concentrating on the latter, we show that the patterns of i.i.d. strings over all, including infinite and even unknown, alphabets, can be compressed with diminishing redundancy, both in block and sequentially, and that the compression can be performed in linear time. To establish these results, we show that the number of patterns is the Bell number, that the number of patterns with a given number of symbols is the Stirling number of the second kind, and that the redundancy of patterns can be bounded using results of Hardy and Ramanujan on the number of integer partitions. The results also imply an asymptotically optimal solution for the Good-Turing probability-estimation problem.",

keywords = "Large and unknown alphabets, Patterns, Set and integer partitions, Universal compression",

author = "Orlitsky, Alon and Santhanam, Narayana P. and Zhang, Junan",

note = "Funding Information: Manuscript received March 29, 2003; revised March 31, 2004. This work was supported by the National Science Foundation under Grant CCR-0313367. The material in this paper was presented in part at the IEEE International Symposium on Information Theory, Yokohama, Japan, June/July 2003.",

year = "2004",

month = jul,

doi = "10.1109/TIT.2004.830761",

language = "English (US)",

volume = "50",

pages = "1469--1481",

journal = "IEEE Transactions on Information Theory",

issn = "0018-9448",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

number = "7",

}

TY - JOUR

T1 - Universal compression of memoryless sources over unknown alphabets

AU - Orlitsky, Alon

AU - Santhanam, Narayana P.

AU - Zhang, Junan

N1 - Funding Information: Manuscript received March 29, 2003; revised March 31, 2004. This work was supported by the National Science Foundation under Grant CCR-0313367. The material in this paper was presented in part at the IEEE International Symposium on Information Theory, Yokohama, Japan, June/July 2003.

PY - 2004/7

Y1 - 2004/7

AB - It has long been known that the compression redundancy of independent and identically distributed (i.i.d.) strings increases to infinity as the alphabet size grows. It is also apparent that any string can be described by separately conveying its symbols, and its pattern - the order in which the symbols appear. Concentrating on the latter, we show that the patterns of i.i.d. strings over all, including infinite and even unknown, alphabets, can be compressed with diminishing redundancy, both in block and sequentially, and that the compression can be performed in linear time. To establish these results, we show that the number of patterns is the Bell number, that the number of patterns with a given number of symbols is the Stirling number of the second kind, and that the redundancy of patterns can be bounded using results of Hardy and Ramanujan on the number of integer partitions. The results also imply an asymptotically optimal solution for the Good-Turing probability-estimation problem.

KW - Large and unknown alphabets

KW - Patterns

KW - Set and integer partitions

KW - Universal compression

UR - http://www.scopus.com/inward/record.url?scp=3042606358&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=3042606358&partnerID=8YFLogxK

U2 - 10.1109/TIT.2004.830761

DO - 10.1109/TIT.2004.830761

M3 - Article

AN - SCOPUS:3042606358

SN - 0018-9448

VL - 50

SP - 1469

EP - 1481

JO - IEEE Transactions on Information Theory

JF - IEEE Transactions on Information Theory

IS - 7

ER -
