I have a predefined vocab built from the 3,500 most commonly used Chinese characters. Now I want to tokenize a dataset with this vocab, mapping each character to a fixed ID. Is there a mature class or function I can inherit from to build the data-reading pipeline?
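For context, here is a minimal sketch of what a character-level tokenizer over a fixed vocab looks like in plain Python; the class name `CharTokenizer` and the variable `vocab_chars` are placeholders I made up, not part of any library. In PyTorch you would typically wrap something like this in a `torch.utils.data.Dataset` subclass for the reading pipeline; the Hugging Face `tokenizers` library can also build a tokenizer from a custom vocab file.

```python
class CharTokenizer:
    """Map each character to an integer ID using a fixed vocab (sketch)."""

    def __init__(self, vocab_chars, unk_token="[UNK]"):
        # Reserve ID 0 for characters outside the 3,500-character vocab.
        self.unk_token = unk_token
        self.token_to_id = {unk_token: 0}
        for ch in vocab_chars:
            self.token_to_id.setdefault(ch, len(self.token_to_id))
        self.id_to_token = {i: t for t, i in self.token_to_id.items()}

    def encode(self, text):
        # Unknown characters fall back to the [UNK] ID.
        unk = self.token_to_id[self.unk_token]
        return [self.token_to_id.get(ch, unk) for ch in text]

    def decode(self, ids):
        return "".join(self.id_to_token.get(i, self.unk_token) for i in ids)


tok = CharTokenizer("你好世界")   # tiny stand-in for the 3,500-char list
ids = tok.encode("你好, 世界")    # punctuation/ASCII map to the [UNK] ID 0
```

Since every character is its own token, the whole "tokenizer" reduces to a dict lookup; the real question is only where to plug it into the data loader.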
He who fights dragons too long becomes a dragon himself; gaze long into the abyss, and the abyss gazes back into you…