Filipino Tokenizer

A Morphology-Aware BPE Tokenizer for Philippine Languages

An open-source BPE tokenizer built with Philippine language morphology in mind.

Overview

Most tokenizers are trained on English-heavy corpora, so they tend to split Filipino words at boundaries that ignore the language's morphology, and the resulting pieces lose meaning. This tokenizer is built differently. It uses Byte-Pair Encoding (BPE) with awareness of Filipino morphological rules, so affixes like 'mag-', 'nag-', and '-an' are treated as meaningful units rather than being split arbitrarily. It's designed for anyone doing NLP work on Filipino or other Philippine languages who doesn't want to start from scratch.
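To make the idea concrete, here is a minimal sketch of what affix-aware pre-segmentation ahead of BPE could look like. It is not this project's code or API: the names `PREFIXES`, `SUFFIXES`, `MIN_ROOT`, `split_affixes`, and `pre_tokenize` are hypothetical, and the affix lists contain only the three affixes mentioned above. The assumption is that words are split at affix boundaries first, so that a BPE model trained afterwards never merges across them.

```python
# Illustrative sketch only; not the project's implementation.
# The affix lists and the minimum root length are assumptions.
PREFIXES = ("mag", "nag")
SUFFIXES = ("an",)
MIN_ROOT = 3  # assumed: do not strip an affix if fewer than 3 letters remain


def split_affixes(word):
    """Split one word into [prefix-, root, -suffix] pieces by plain string matching."""
    pieces = []
    root = word.lower()

    # Strip at most one known prefix, if enough of the word is left over.
    for prefix in PREFIXES:
        if root.startswith(prefix) and len(root) - len(prefix) >= MIN_ROOT:
            pieces.append(prefix + "-")
            root = root[len(prefix):]
            break

    # Strip at most one known suffix, under the same length check.
    suffix = None
    for candidate in SUFFIXES:
        if root.endswith(candidate) and len(root) - len(candidate) >= MIN_ROOT:
            suffix = "-" + candidate
            root = root[:-len(candidate)]
            break

    pieces.append(root)
    if suffix is not None:
        pieces.append(suffix)
    return pieces


def pre_tokenize(text):
    """Split text on whitespace, then split each word at affix boundaries."""
    return [piece for word in text.split() for piece in split_affixes(word)]


print(split_affixes("maglaro"))   # ['mag-', 'laro']
print(split_affixes("aklatan"))   # ['aklat', '-an']
print(pre_tokenize("Nagluto sila"))  # ['nag-', 'luto', 'sila']
```

Plain string matching like this over-segments: 'maganda' (ma- + ganda) would be split as 'mag-' + 'anda'. A practical morphology-aware tokenizer would presumably need a root lexicon or additional rules to avoid such cases; the sketch only shows where affix boundaries enter the pipeline.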