BERT Tagalog Part of Speech Tagger (BERTTPOST)

Cite this repository

Saya-ang, K., Hamor, M. G., Gozum, D. J., & Mabansag, R. K. Bidirectional Encoder Representation from Transformer Tagalog Part of Speech Tagger [Computer software]. https://github.com/syke9p3/bert-tagalog-pos-tagger

Screenshot: https://github.com/syke9p3/bert-tagalog-pos-tagger/main/BERTTPOST%20Screenshot.jpg?raw=true

This repository contains the training and testing Python files for fine-tuning the gklmip/bert-tagalog-base-uncased model for Tagalog part-of-speech tagging.

  • Developed by: Saya-ang, Kenth G. (@syke9p3) | Gozum, Denise Julianne S. (@Xenoxianne) | Hamor, Mary Grizelle D. (@mnemoria) | Mabansag, Ria Karen B. (@riavx)
  • Model type: BERT Tagalog Base Uncased
  • Programming Language: Python
  • Languages (NLP): Tagalog, Filipino
  • Dataset: Sagum et al.'s annotated Tagalog corpora based on the MGNN tagset convention. The model was trained on 800 sentences and evaluated on 200 sentences.
  • Finetuned from model: Jiang et al.'s pre-trained bert-tagalog-base-uncased model

HuggingFace

Try the model: HuggingFace Spaces

Model source code: HuggingFace

Python Libraries

  1. PyTorch
  2. Regular Expressions
  3. Transformers
  4. SKLearn Metrics
  5. Datasets
  6. tqdm
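
A typical way to install these dependencies from PyPI is shown below; regular expressions ship with the Python standard library as `re`, and Gradio (used for the demo interface under Inference) is included here as an assumption rather than listed above.

```
pip install torch transformers scikit-learn datasets tqdm gradio
```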

Dataset and Preprocessing

The corpus used contains tagged sentences in the Tagalog language; each word is annotated with its corresponding POS tag in the format <TAG word>. To prepare the corpus for training, the following preprocessing steps were performed (a sketch of these steps follows the list):

  1. Removal of Line Identifier: the line identifier, such as SNT.108970.2066, was removed from each tagged sentence.
  2. Symbol Conversion: for the BERT model, special symbols such as hyphens, quotation marks, and commas were converted into special tokens (PMP, PMS, PMC) to preserve their meaning during tokenization.
  3. Alignment of Tokenization: the BERT subword tokens and their corresponding POS tags were aligned so that tokenization and tagging remain consistent.
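
The snippet below sketches these three steps, assuming the Hugging Face tokenizer for gklmip/bert-tagalog-base-uncased; the example corpus line, the symbol-to-token mapping, and the helper names are illustrative rather than taken from the original scripts.

```python
import re
from transformers import AutoTokenizer

# Hypothetical symbol-to-token mapping; the actual mapping in the training scripts may differ.
SYMBOL_MAP = {",": "PMC", ".": "PMP", "-": "PMS"}

def parse_tagged_sentence(line: str):
    """Parse one corpus line such as 'SNT.108970.2066 <DTC ang> <NNC bata>'."""
    line = re.sub(r"^SNT\.\S+\s*", "", line)          # step 1: drop the line identifier
    pairs = re.findall(r"<(\S+)\s+([^>]+)>", line)    # extract (TAG, word) pairs
    tags = [tag for tag, _ in pairs]
    words = [SYMBOL_MAP.get(word, word) for _, word in pairs]  # step 2: convert special symbols
    return words, tags

def align_tags(words, tags, tokenizer):
    """Step 3: align word-level tags with BERT subword tokens (first subword keeps the tag)."""
    encoding = tokenizer(words, is_split_into_words=True, truncation=True)
    aligned, previous = [], None
    for word_id in encoding.word_ids():
        if word_id is None or word_id == previous:
            aligned.append(None)          # special tokens and continuation subwords carry no tag
        else:
            aligned.append(tags[word_id])
        previous = word_id
    return encoding, aligned

tokenizer = AutoTokenizer.from_pretrained("gklmip/bert-tagalog-base-uncased")
words, tags = parse_tagged_sentence("SNT.108970.2066 <DTC ang> <NNC bata>")
encoding, aligned = align_tags(words, tags, tokenizer)
```

Positions without a tag (None here) are typically mapped to the label id -100 so that the token-classification loss skips them during training.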

Training

The BERT Tagalog POS Tagger was trained using the PyTorch library with the hyperparameters below; a minimal training sketch follows the table.

Hyperparameter    Value
Batch size        8
Training epochs   5
Learning rate     2e-5
Optimizer         Adam
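
The sketch below wires up these hyperparameters; the dummy tensors stand in for the 800 preprocessed training sentences, and NUM_TAGS is a placeholder for the actual size of the MGNN tag set.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForTokenClassification

NUM_TAGS = 20  # placeholder: use the actual number of MGNN tags in the corpus

model = AutoModelForTokenClassification.from_pretrained(
    "gklmip/bert-tagalog-base-uncased", num_labels=NUM_TAGS
)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)   # Adam optimizer, learning rate 2e-5
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Dummy stand-in for the preprocessed corpus; real training uses the annotated sentences
# encoded as (input_ids, attention_mask, label_ids), with -100 on ignored subword positions.
input_ids = torch.randint(0, model.config.vocab_size, (16, 32))
attention_mask = torch.ones_like(input_ids)
label_ids = torch.randint(0, NUM_TAGS, (16, 32))
loader = DataLoader(TensorDataset(input_ids, attention_mask, label_ids),
                    batch_size=8, shuffle=True)             # batch size 8

model.train()
for epoch in range(5):                                      # 5 training epochs
    for batch_ids, batch_mask, batch_labels in loader:
        outputs = model(input_ids=batch_ids.to(device),
                        attention_mask=batch_mask.to(device),
                        labels=batch_labels.to(device))      # cross-entropy over labeled tokens
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```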

Inference

For the test sentences, largely the same preprocessing and tokenization steps as in training were performed, except that no POS tags needed to be extracted from the sentences. The trained model was loaded to generate the tags for the input sentence, and Gradio was used to provide an interface for displaying the POS tag results.
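
A sketch of that inference path is shown below; the checkpoint directory is a placeholder, and the whitespace-based word splitting and first-subword tag selection are simplifications of the preprocessing described above.

```python
import torch
import gradio as gr
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_DIR = "path/to/fine-tuned-checkpoint"   # placeholder: directory of the saved fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForTokenClassification.from_pretrained(MODEL_DIR)
model.eval()

def tag_sentence(text: str) -> str:
    """Return the input sentence with a predicted POS tag attached to each word."""
    words = text.split()
    encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt", truncation=True)
    with torch.no_grad():
        predictions = model(**encoding).logits.argmax(dim=-1)[0].tolist()
    tagged, previous = [], None
    for index, word_id in enumerate(encoding.word_ids()):
        if word_id is not None and word_id != previous:       # take the first subword's prediction
            tag = model.config.id2label[predictions[index]]
            tagged.append(f"<{tag} {words[word_id]}>")
        previous = word_id
    return " ".join(tagged)

# Gradio interface for entering a sentence and viewing the tagged output
gr.Interface(fn=tag_sentence, inputs="text", outputs="text").launch()
```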
