AI Heap

Tokenisation is NP-Complete

arXiv:2412.15210 [arXiv, PDF]
Authors
  • Philip Whittington
  • Gregor Bachmann
  • Tiago Pimentel
Affiliation
  Department of Computer Science, ETH Zürich
In this work, we prove the NP-completeness of two variants of tokenisation, defined as the problem of compressing a dataset to at most $\delta$ symbols by either finding a vocabulary directly (direct tokenisation), or selecting a sequence of merge operations (bottom-up tokenisation).
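To make the two decision problems concrete, here is a minimal Python sketch of the bottom-up variant, in which a candidate sequence of merge operations is applied to a toy dataset and the resulting symbol count is compared against the budget $\delta$. The function names, the toy data, and the BPE-style pair-merge semantics are illustrative assumptions, not the paper's formal definitions; direct tokenisation instead asks whether some choice of vocabulary achieves the same bound.

```python
# Illustrative sketch (not from the paper): the bottom-up tokenisation
# decision problem evaluates whether a sequence of merge operations
# compresses a dataset to at most delta symbols.

from typing import List, Tuple

Seq = Tuple[str, ...]


def apply_merge(seq: Seq, pair: Tuple[str, str]) -> Seq:
    """Replace non-overlapping occurrences of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])  # concatenate into one symbol
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return tuple(out)


def bottom_up_size(dataset: List[Seq], merges: List[Tuple[str, str]]) -> int:
    """Total number of symbols after applying the merges in the given order."""
    for pair in merges:
        dataset = [apply_merge(seq, pair) for seq in dataset]
    return sum(len(seq) for seq in dataset)


# Verify one candidate merge sequence against a budget delta.
# (The NP-hardness result concerns finding whether *any* such sequence exists.)
data = [tuple("abab"), tuple("abba")]
merges = [("a", "b"), ("ab", "ab")]
delta = 5
print(bottom_up_size(data, merges) <= delta)  # True: 4 symbols remain
```

The sketch only checks a single candidate solution, which is easy; the hardness shown in the paper lies in deciding whether any vocabulary (direct variant) or any merge sequence (bottom-up variant) reaches the compression budget.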