Paper Type

full

Description

One of the most pervasive challenges in adopting machine learning or deep learning is the scarcity of training data. This problem is amplified in IS research, where application domains usually require specialized knowledge. This study compares three systems for creating a large training dataset when only a small amount of human-labeled data is available: a high-precision LSTM classifier, a high-recall LSTM classifier, and a manually created rule-based system. Starting from fewer than 20,000 human-labeled training examples, we used automated labeling to add 100,000 examples to the training data. We found that combining a small human-labeled dataset with a system-labeled dataset improves classification performance. In our evaluation, adding training data labeled by the high-recall LSTM to the human-labeled dataset achieved an F1 of 0.578, and adding training data labeled by the rule-based system achieved an F1 of 0.598, an improvement of over 4% compared to a baseline system that uses only human-labeled data.
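The labeling setup described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the keyword list, the function names, and the thresholding of a classifier score to obtain a high-precision or high-recall labeler are all assumptions made for illustration, since the abstract does not specify the rules, the task domain, or how the two LSTM variants were obtained.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (text, label); 1 = positive class, 0 = negative class


def rule_based_label(text: str) -> int:
    """Hypothetical keyword rule; the paper's actual rules are not given."""
    keywords = ("refund", "broken")  # illustrative assumption only
    return int(any(k in text.lower() for k in keywords))


def threshold_labeler(score_fn: Callable[[str], float],
                      threshold: float) -> Callable[[str], int]:
    """Wrap a probability scorer as a labeler (an assumed mechanism).

    A high threshold favors precision; a low threshold favors recall.
    """
    return lambda text: int(score_fn(text) >= threshold)


def augment(human: List[Example], unlabeled: List[str],
            labeler: Callable[[str], int]) -> List[Example]:
    """Append system-labeled examples to the human-labeled set."""
    return human + [(text, labeler(text)) for text in unlabeled]


def f1_score(gold: List[int], pred: List[int]) -> float:
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy usage: two human-labeled examples plus two system-labeled ones.
human = [("I want a refund", 1), ("Great service", 0)]
unlabeled = ["My item arrived broken", "Thanks a lot"]
combined = augment(human, unlabeled, rule_based_label)
```

In this sketch the combined set would then be used to train the final classifier, and `f1_score` would be computed on a held-out human-labeled test set to compare against a baseline trained on `human` alone.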

Mechanisms for Automatic Training Data Labeling for Machine Learning
