World Map Visualization

When Every Token Counts: Optimal Segmentation for Low-Resource Language Models

Language Models for Low-Resource Languages Workshop (LoResLM) @ COLING 2025

* Equal Contribution
Please scroll to the bottom for the Algorithm.

Token-Saving Ratio (TSR) Visualization

Vocab. size:

Legend

Algorithm

We use a dynamic programming formulation similar to the Viterbi algorithm (Forney, 1973 ) and produce the optimal segmentation \(S^*\). Given a document \(d\), define \(\mathrm{dp}[i]\) as the minimal number of tokens needed to segment the prefix \(d_0 d_1 \dots d_i\) (positions 0 to \(i\)). We set \(\mathrm{dp}[-1] = 0\) as the base case, representing the empty prefix requiring zero tokens. The parent array \(\mathrm{par}\) serves as a backtracking mechanism, where \(\mathrm{par}[i]\) points to the end of the previous token in the optimal segmentation.
1: Input:
\(B = [B_0, B_1,\dots,B_{n-1}] \in \Sigma^*\) (byte sequence)
\(V \subset \Sigma^*\) (vocabulary)
\(T(V^R)\) with root \(r\) (trie on reversed vocabulary \(V^R\))
\(\delta(v) : T \to T \cup \{\varnothing\}\) (outputs child of \(v\) in trie \(T\))
\(I : V \to \{\mathrm{True}, \mathrm{False}\}\) (indicator function detecting if node is terminal)
2: Output: \(S^* \in V^*\) (optimal segmentation)
3: Initialize:
\(\displaystyle dp[i] \leftarrow i+1,\ \forall\,i \in [0,n-1]; \quad dp[n] \leftarrow 0\)
\(\displaystyle par[i] \leftarrow -1,\ \forall\,i \in [0,n-1]\)
4: for \(i \in [0,n-1]\) do
\(v \leftarrow r\)
for \(j \leftarrow i \downarrow 0\) do
\(v \leftarrow \delta\bigl(v,\,B[j]\bigr)\) (child of \(v\) for symbol \(B[j]\))
if \(v = \varnothing\) then break
if \(I(v)\ \wedge\ (\,dp[j-1] + 1 < dp[i]\,)\) then
  \(dp[i] \leftarrow dp[j-1] + 1\)
  \(par[i] \leftarrow j-1\)
end if
end for
end for
5:\(S \leftarrow \varnothing\) (initialize empty sequence)
6:\(k \leftarrow n\)
7: while \(k > 0\) do
\(S \leftarrow S \cup \bigl\{\ B[\,(\,par[k]\,+\,1)\ :\ (k+1)\ ]\bigr\}\)
\((B[i:j]\) denotes substring\()
\(k \leftarrow par[k]\)
end while
8:return \(S^R\) (reversed sequence)