A simple BERT-based baseline for DataThon @ IndoML'24.
Data & Details: After registering here, you can get the data from here; download the raw data and store it in a directory (ideally called data/).
Preprocess: Run
python src/preprocess.py --data_dir <your_data_directory>
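As a rough illustration of the step above (the actual src/preprocess.py may be structured differently), the --data_dir flag can be parsed with argparse; the build_parser helper and the data/ default below are illustrative assumptions, not the repo's real code:

```python
import argparse
from pathlib import Path

# Hypothetical sketch of the CLI surface of src/preprocess.py;
# the real script's arguments and defaults may differ.
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Preprocess raw DataThon data")
    parser.add_argument(
        "--data_dir",
        type=Path,
        default=Path("data"),  # matches the suggested data/ directory
        help="directory containing the raw downloaded files",
    )
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"Preprocessing files under {args.data_dir} ...")
```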
Download BERT model and tokenizer: You also need the BERT model and tokenizer in the appropriate directories; run
python src/downloadBERT.py
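A plausible sketch of what src/downloadBERT.py does, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (both are assumptions; the actual script may use a different model or directory layout):

```python
import os

# Hypothetical model name and output layout; the real script may differ.
MODEL_NAME = "bert-base-uncased"

def local_dir(model_name: str, root: str = "models") -> str:
    """Directory where the checkpoint and tokenizer are saved."""
    return os.path.join(root, model_name)

def download(model_name: str = MODEL_NAME) -> str:
    """Fetch the model and tokenizer once and save them for offline use."""
    # Imported lazily so the path helper above works without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    target = local_dir(model_name)
    os.makedirs(target, exist_ok=True)
    AutoModel.from_pretrained(model_name).save_pretrained(target)
    AutoTokenizer.from_pretrained(model_name).save_pretrained(target)
    return target

if __name__ == "__main__":
    print(f"Saved to {download()}")
```

Saving with save_pretrained lets the trainer later load the model and tokenizer from a local path instead of hitting the network on every run.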
Train & Test: The rest of the code works in all configurations, from a single CPU to multi-GPU and multi-machine setups.
python3 src/trainer.py --output <some_output_column>
The code will automatically pick up multiple GPUs, or you can restrict which GPUs it sees by prefixing the command with CUDA_VISIBLE_DEVICES=x,y,z.
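For context on the CUDA_VISIBLE_DEVICES prefix: CUDA (and hence PyTorch) only exposes the GPU ids listed in that variable to the process. The variable's semantics are standard CUDA behavior, but the small parser below is only an illustration, not part of this repo:

```python
import os

def visible_gpu_ids(env=None):
    """Return the GPU ids a process launched with CUDA_VISIBLE_DEVICES=x,y,z
    would see, or None when the variable is unset (all GPUs visible).

    Note: only the numeric x,y,z form is handled here; CUDA also accepts
    GPU UUIDs, which this sketch ignores.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # no restriction: every GPU is visible
    ids = [part.strip() for part in raw.split(",") if part.strip()]
    return [int(i) for i in ids]  # e.g. "0,2" maps to GPUs 0 and 2
```

So launching as CUDA_VISIBLE_DEVICES=0,2 python3 src/trainer.py --output <some_output_column> would restrict training to GPUs 0 and 2.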
Feel free to modify any components of this code as you see fit.