World Map Visualization

When Every Token Counts: Optimal Segmentation for Low-Resource Language Models

Language Models for Low-Resource Languages Workshop (LoResLM) @ COLING 2025

Bharath Raj S^*, Garvit Suri^*, Vikrant Dewangan^*, Raghav Sonavane

Sprinklr AI

^* Equal Contribution

arXiv Code

Please scroll to the bottom for the Algorithm.

Token-Saving Ratio (TSR) Visualization

Vocab. size: 50K 100K 200K

Legend

Algorithm

We use a dynamic programming formulation similar to the Viterbi algorithm (Forney, 1973 ) and produce the optimal segmentation \(S^*\). Given a document \(d\), define \(\mathrm{dp}[i]\) as the minimal number of tokens needed to segment the prefix \(d_0 d_1 \dots d_i\) (positions 0 to \(i\)). We set \(\mathrm{dp}[-1] = 0\) as the base case, representing the empty prefix requiring zero tokens. The parent array \(\mathrm{par}\) serves as a backtracking mechanism, where \(\mathrm{par}[i]\) points to the end of the previous token in the optimal segmentation.

1:	Input: \(B = [B_0, B_1,\dots,B_{n-1}] \in \Sigma^\) (byte sequence) \(V \subset \Sigma^\) (vocabulary) \(T(V^R)\) with root \(r\) (trie on reversed vocabulary \(V^R\)) \(\delta(v) : T \to T \cup \{\varnothing\}\) (outputs child of \(v\) in trie \(T\)) \(I : V \to \{\mathrm{True}, \mathrm{False}\}\) (indicator function detecting if node is terminal)
2:	Output: \(S^* \in V^*\) (optimal segmentation)
3:	Initialize: \(\displaystyle dp[i] \leftarrow i+1,\ \forall\,i \in [0,n-1]; \quad dp[n] \leftarrow 0\) \(\displaystyle par[i] \leftarrow -1,\ \forall\,i \in [0,n-1]\)
4:	for \(i \in [0,n-1]\) do \(v \leftarrow r\) for \(j \leftarrow i \downarrow 0\) do \(v \leftarrow \delta\bigl(v,\,B[j]\bigr)\) (child of \(v\) for symbol \(B[j]\)) if \(v = \varnothing\) then break if \(I(v)\ \wedge\ (\,dp[j-1] + 1 < dp[i]\,)\) then \(dp[i] \leftarrow dp[j-1] + 1\) \(par[i] \leftarrow j-1\) end if end for end for
5:	\(S \leftarrow \varnothing\) (initialize empty sequence)
6:	\(k \leftarrow n\)
7:	while \(k > 0\) do \(S \leftarrow S \cup \bigl\{\ B[\,(\,par[k]\,+\,1)\ :\ (k+1)\ ]\bigr\}\) \((B[i:j]\) denotes substring\() \(k \leftarrow par[k]\) end while
8:	return \(S^R\) (reversed sequence)