Using Snorkel For Multi-Label Annotation
How to use Snorkel’s multi-class implementation to create multi-labels

Snorkel is a nice little package that lets us use labeling functions (LFs), ranging from simple heuristics and keyword matches to complex algorithms and human annotations, in order to create a labeled dataset for the purpose of training a classifier.
“Snorkel is a system for programmatically building and managing training datasets without manual labeling. In Snorkel, users can develop large training datasets in hours or days rather than hand-labeling them over weeks or months.” — Snorkel.org
Motivation
The motivation to use Snorkel in order to create multi-labels is simple. Snorkel allows us to use simple heuristics or keywords in order to create a supervised dataset. Using it as a labeling algorithm gives us a certain level of clarity when we need to understand why a sample was assigned a certain class. You can also imagine a scenario in which you already have a heuristic multi-labeling algorithm that assigns only a sparse set of multi-labels, while your product or feature requires large quantities of multi-labels in order to provide higher value to your clients.
Snorkel
It’s worth noting that throughout the process we actually create two classifiers. The first uses Snorkel’s MajorityLabelVoter or LabelModel: the former gives us a majority vote as a baseline, and the latter is the secret-sauce model that Snorkel is known for. The second is a machine- or deep-learning (ML/DL) classifier trained on the labels the first one produces, because we don’t want to rely solely on the labeling functions we created. We want a classifier that is not limited to our keywords, regexes, etc., one that generalizes beyond what we gave it. Ideally, it will find correlations, for example between tokens we didn’t account for in the labeling process and our final labels.
Using Snorkel For Multi-Class
First, we need to understand how to use Snorkel. Consider a sentiment classification task and the following sentences: 1. “the cake tasted really bad”, 2. “the cream is really good” & 3. “this food is fair”. These sentences are NEGATIVE, POSITIVE & NEUTRAL, respectively. Therefore, we will create several LFs that assign a label accordingly.
from snorkel.labeling import labeling_function

# Integer labels for Snorkel (ABSTAIN must be -1); 0/1/2 follow the
# alphabetical order NEGATIVE, NEUTRAL, POSITIVE so that they match
# MultiLabelBinarizer.classes_ later on.
ABSTAIN, NEGATIVE, NEUTRAL, POSITIVE = -1, 0, 1, 2

@labeling_function()
def lf_keyword_good(x):
    return POSITIVE if "good" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_keyword_bad(x):
    return NEGATIVE if "bad" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_keyword_fair(x):
    return NEUTRAL if "fair" in x.text.lower() else ABSTAIN
The rest of the process is easy: once you have your LFs, you apply them to your pandas.DataFrame and train one of the models, i.e., MajorityLabelVoter or LabelModel.
from snorkel.labeling import LabelModel, PandasLFApplier
# Define the set of labeling functions (LFs)
lfs = [lf_keyword_bad, lf_keyword_good, lf_keyword_fair]
# Apply the LFs to the unlabeled training data
applier = PandasLFApplier(lfs)
L_train = applier.apply(df_train)
# Train the label model and compute the training labels
label_model = LabelModel(cardinality=3, verbose=True)
label_model.fit(L_train, n_epochs=500, log_freq=50)
df_train["label"] = label_model.predict(L=L_train, tie_break_policy="abstain")
df_train["label"] will contain ABSTAIN labels (-1) as well; therefore, in order to further train our secondary classifier, we’ll have to filter them out.
df_train = df_train[df_train.label != ABSTAIN]
Again, the purpose of training a secondary classifier (a random forest in this example) on the filtered dataset is “to generalize beyond the coverage of the labeling functions and the LabelModel.” — snorkel.org
from sklearn.ensemble import RandomForestClassifier

# Note: fit expects numeric feature columns; raw text has to be
# vectorized first (see the sketch below).
clf = RandomForestClassifier()
clf.fit(df_train.drop(['label'], axis=1), df_train["label"])
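If your DataFrame holds raw text rather than numeric features, a minimal sketch (assuming the text lives in a column named text, which the original snippet doesn’t show) would vectorize it first, e.g., with TF-IDF:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Assumed "text" column; TF-IDF turns the text into numeric features
# before the random forest is fit on the Snorkel-generated labels.
clf = make_pipeline(TfidfVectorizer(), RandomForestClassifier())
clf.fit(df_train["text"], df_train["label"])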
Using Snorkel For Multi-Label
Out of the box, Snorkel can only be used as a multi-class labeler. To use it for multi-label, you can choose one of the following three methods:
1. Use MajorityLabelVoter’s predict_proba() and assign every class whose ‘probability’ is greater than 0, e.g., the first and the last classes in [0.5, 0, 0.5]. We can think of such a sample as residing in two clusters, or as containing keywords from two classes, which is what allows it to have multiple labels. For example, “The hamburger was good and bad”.
2. Use LabelModel’s predict_proba() and assign every class whose probability is above the ‘knee’; you can use the Kneed package to find it (see the sketch after this list). Essentially, the probabilities come out of a softmax, so only a handful of classes will receive high values. Please note that in my empirical tests there was a high correlation between MajorityLabelVoter’s probability values and LabelModel’s, i.e., the former is a “hard” softmax and the latter is what you would expect from a softmax, e.g., [0.5, 0, 0.5] vs. [0.45, 0.06, 0.49], respectively.
3. Train ‘one vs. all’ models for each class, using MajorityLabelVoter or LabelModel, and assign a multi-label according to their predictions. Note that you need to think of a strategy for handling ABSTAIN labels.
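For method 2, here is a minimal sketch of the knee-based cutoff, assuming the kneed package and the label_model fitted earlier; the helper labels_above_knee is mine, not part of Snorkel:
import numpy as np
from kneed import KneeLocator

def labels_above_knee(proba_row):
    # Sort probabilities in descending order and look for the curve's knee
    sorted_probs = np.sort(proba_row)[::-1]
    x = np.arange(1, len(sorted_probs) + 1)
    kl = KneeLocator(x, sorted_probs, curve="convex", direction="decreasing")
    # With only a few classes a knee may not exist; fall back to the top class
    threshold = sorted_probs[kl.knee - 1] if kl.knee else sorted_probs[0]
    return np.nonzero(proba_row >= threshold)[0]

probs = label_model.predict_proba(L=L_train)  # shape: (n_samples, n_classes)
multi_labels = [labels_above_knee(row) for row in probs]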

Multi-Label In Practice
Because ‘1’ & ‘2’ gave similar outputs and the calculation to get multi-labels from ‘1’ was simpler, I chose to assign multi-labels based on 1.
Please note that when a model assigns an ABSTAIN label, it can’t decide on a winner. We can observe this behavior in predict_proba()’s output. For example, consider the probability vector [1.0, 0, 0]: the winner is clearly the first class. Now consider [0.5, 0, 0.5]: there isn’t a clear winner, therefore Snorkel will assign an ABSTAIN label.
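We can see this with a toy LF-output matrix (the votes below are made up purely for illustration):
import numpy as np
from snorkel.labeling import MajorityLabelVoter

# Toy LF matrix: 2 samples x 2 LFs
L = np.array([
    [0, 0],  # both LFs vote for class 0 -> a clear winner
    [0, 2],  # one vote each for classes 0 and 2 -> a tie
])
mv = MajorityLabelVoter(cardinality=3)
print(mv.predict_proba(L))  # [[1.0, 0.0, 0.0], [0.5, 0.0, 0.5]]
print(mv.predict(L, tie_break_policy="abstain"))  # [0, -1] -> the tie ABSTAINs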
When we assign a label for each class that has a non-zero probability, we are in fact eliminating all ABSTAIN labels, labeling the entire dataset and assigning many multi-labels per sample.
The following code uses a MajorityLabelVoter classifier and assigns labels according to all the classes that have a score higher than zero (assuming L_test is the LF matrix for your test set and Y_test holds its gold multi-labels). It’s as simple as that :).
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.preprocessing import MultiLabelBinarizer
from snorkel.labeling import MajorityLabelVoter

# fit a MultiLabelBinarizer on the full set of class names;
# classes_ is sorted alphabetically, matching our 0/1/2 label integers
Y = [['POSITIVE', 'NEGATIVE', 'NEUTRAL']]
mlb = MultiLabelBinarizer()
mlb.fit(Y)

# create a majority vote model and predict
# (L_test is the LF matrix produced by applying the applier to df_test)
majority_model = MajorityLabelVoter(cardinality=3)
predictions = majority_model.predict_proba(L=L_test)

df_multilabel = pd.DataFrame()
df_multilabel['predict_proba'] = predictions.tolist()

# get all the non-zero indices, which are the multi-labels
df_multilabel['multi_labels'] = df_multilabel['predict_proba'].apply(
    lambda x: np.nonzero(x)[0])

# transform to the mlb format for the classification report
df_multilabel['mlb_pred'] = df_multilabel['multi_labels'].apply(
    lambda x: mlb.transform([mlb.classes_[x]])[0])
print(df_multilabel.head())

# convert to strings in order to see how many multi-labels we gained
multi_label_string = df_multilabel.multi_labels.apply(
    lambda x: ", ".join(mlb.classes_[x]))
print(multi_label_string.value_counts()[:50])

# print some metrics using classification_report;
# Y_test is assumed to hold the gold multi-labels for the test set,
# e.g., [['POSITIVE'], ['NEGATIVE', 'NEUTRAL'], ...]
y_pred = df_multilabel.mlb_pred.tolist()
y_true = mlb.transform(Y_test).tolist()
print(classification_report(y_true, y_pred, target_names=mlb.classes_))
That’s it, you have created multi-labels for each sample. Please keep in mind that this way of using Snorkel is very greedy in terms of labeling strategy, so you may end up with a very RECALL-ORIENTED model. Your experience will vary.
Dr. Ori Cohen has a Ph.D. in Computer Science with a focus on machine learning. He is a Senior Director of Data and the author of the ML & DL Compendium and StateOfMLOps.com.