When Every Token Counts: Optimal Segmentation for Low-Resource Language Models
Language Models for Low-Resource Languages Workshop (LoResLM) @ COLING 2025
* Equal Contribution
Please scroll to the bottom for the Algorithm.
Token-Saving Ratio (TSR) Visualization
Vocab. size:
Legend
Algorithm
We use a dynamic programming formulation similar to the Viterbi algorithm (Forney, 1973 ) and produce the optimal segmentation \(S^*\). Given a document \(d\), define \(\mathrm{dp}[i]\) as the minimal number of tokens needed to segment the prefix \(d_0 d_1 \dots d_i\) (positions 0 to \(i\)). We set \(\mathrm{dp}[-1] = 0\) as the base case, representing the empty prefix requiring zero tokens. The parent array \(\mathrm{par}\) serves as a backtracking mechanism, where \(\mathrm{par}[i]\) points to the end of the previous token in the optimal segmentation.
1: | Input: \(B = [B_0, B_1,\dots,B_{n-1}] \in \Sigma^*\) (byte sequence) \(V \subset \Sigma^*\) (vocabulary) \(T(V^R)\) with root \(r\) (trie on reversed vocabulary \(V^R\)) \(\delta(v) : T \to T \cup \{\varnothing\}\) (outputs child of \(v\) in trie \(T\)) \(I : V \to \{\mathrm{True}, \mathrm{False}\}\) (indicator function detecting if node is terminal) |
2: | Output: \(S^* \in V^*\) (optimal segmentation) |
3: | Initialize: \(\displaystyle dp[i] \leftarrow i+1,\ \forall\,i \in [0,n-1]; \quad dp[n] \leftarrow 0\) \(\displaystyle par[i] \leftarrow -1,\ \forall\,i \in [0,n-1]\) |
4: | for \(i \in [0,n-1]\) do \(v \leftarrow r\) end forfor \(j \leftarrow i \downarrow 0\) do \(v \leftarrow \delta\bigl(v,\,B[j]\bigr)\) (child of \(v\) for symbol \(B[j]\)) end forif \(v = \varnothing\) then break if \(I(v)\ \wedge\ (\,dp[j-1] + 1 < dp[i]\,)\) then \(dp[i] \leftarrow dp[j-1] + 1\) \(par[i] \leftarrow j-1\) end if |
5: | \(S \leftarrow \varnothing\) (initialize empty sequence) |
6: | \(k \leftarrow n\) |
7: | while \(k > 0\) do \(S \leftarrow S \cup \bigl\{\ B[\,(\,par[k]\,+\,1)\ :\ (k+1)\ ]\bigr\}\) end while\((B[i:j]\) denotes substring\() \(k \leftarrow par[k]\) |
8: | return \(S^R\) (reversed sequence) |