HT LG Binary - Hybrid Model V2: a model trained by combining the results of previous models and the results of running GeoTopicParser on the ads. In this case, top 5 geolocations are selected.
HT Lead Generation Binary model: Using a MEMEX-provided JSON data file (contact mattmann@usc.edu for more information on the source file), the final training dataset was created using a JSON Parser, implemented separately within Apache Tika in conjunction with the Text-To-Tag-Ratio algorithm, and a Data Preparation Parser. The JSON Parser extracted the value of the raw_content field in the JSON file and its corresponding annotation, or label (NOT_RELEVANT or VERY_RELEVANT). Since the value of raw_content was the HTML source code of the entire ad, the parser's implementation within Tika, along with the Text-To-Tag-Ratio algorithm, was essential for stripping out site navigation and unnecessary tags, which made the model more reliable. The Data Preparation Parser made the training dataset more reliable still by ensuring an equal number of examples per label and shuffling the data. The result was an almost perfectly balanced training dataset of 22,246 lines in total (11,123 NOT_RELEVANT and 11,123 VERY_RELEVANT examples), where each line starts with a label (NOT_RELEVANT or VERY_RELEVANT) followed by the respective ad. As a result, the model is capable of analysing the given data for its relevancy to human trafficking.
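The balancing and shuffling performed by the Data Preparation Parser can be sketched as follows. This is a minimal Python sketch, not the actual Tika parser: the annotation and raw_content field names follow the description above, and the HTML stripping done by the Text-To-Tag-Ratio step is assumed to have already happened.

```python
import random

def build_training_lines(records, seed=42):
    """Build a balanced, shuffled 'LABEL ad-text' training file.

    records: parsed JSON objects with 'annotation' and 'raw_content'
    fields (field names follow the description above)."""
    by_label = {"NOT_RELEVANT": [], "VERY_RELEVANT": []}
    for rec in records:
        if rec["annotation"] in by_label:
            by_label[rec["annotation"]].append(rec["raw_content"])
    # Equal presence of examples per label: truncate to the smaller class.
    n = min(len(texts) for texts in by_label.values())
    lines = [f"{label} {text}"
             for label, texts in by_label.items()
             for text in texts[:n]]
    random.Random(seed).shuffle(lines)  # interleave the two labels
    return lines
```

Applied to the full MEMEX file, this yields the 22,246-line balanced dataset described above.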
To find out more about the models we trained please visit our Models and Datasets pages.
Dataset evaluated
HT Lead Generation Ads - pre-labeled (22,246): the MEMEX services ads examined for their relevance to human trafficking and their sentiment, and therefore possible issues.
HT Lead Generation Ads - NOT labeled (38,563): the MEMEX services ads examined for their sentiment and therefore possible issues.
Step II.A
HT LEAD GENERATION ANALYSIS - the HT LG Ground Truth Data
In this further analysis we will refer to the pre-labeled data from the training file discussed above as the Ground Truth of HT Lead Generation, since we are almost 100% certain that each label matches its ad. We will refer to the results acquired by running the model trained on that Ground Truth, as described above, as the results of the HT LG Model. Both the Ground Truth data and the results of running the HT LG Model can be categorised into RELEVANT and NOT RELEVANT.
Step II.A analyses the correlation between the Ground Truth of the HT Lead Generation Ads and their respective results when run through the first four models: binary Netflix and HT, categorical Stanford and HT. To perform this and other analyses, an Output Analyzer was created, which uses a brute-force approach to correlate the HT relevancy, as provided by the Ground Truth, with the results of running the four models. In short, given the 22,246 pre-labeled ads with their respective IDs, the Output Analyzer recorded the HT-relevancy label of each ad, as well as the result it produced when analysed with the given model, and calculated the number of RELEVANT and NOT RELEVANT ads within each sentiment label (e.g. the number of RELEVANT positive, NOT RELEVANT positive, RELEVANT negative and NOT RELEVANT negative ads as analysed by the Netflix model).
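The counting the Output Analyzer performs can be sketched in a few lines. This is a Python sketch only; the actual tool is a brute-force parser, and the dict-based interface here is an assumption for illustration.

```python
from collections import Counter

def correlate(ground_truth, sentiment):
    """Tally ads per (relevancy, sentiment) pair.

    ground_truth: ad ID -> RELEVANT / NOT RELEVANT label
    sentiment:    ad ID -> label from one of the four sentiment models
    """
    counts = Counter()
    for ad_id, relevancy in ground_truth.items():
        counts[(relevancy, sentiment[ad_id])] += 1
    return counts
```

Each (relevancy, sentiment) count corresponds to one cell in the correlation listings below, e.g. the number of RELEVANT positive ads under the Netflix model.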
Correlation between the binary sentiment and the ground truth of the HT data (Netflix model)
RELEVANT
Positive: 3,688
Negative: 7,435
NOT RELEVANT
Positive: 5,587
Negative: 5,536
Correlation between the binary sentiment and the ground truth of the HT data (HT Review model)
RELEVANT
Positive: 8,838
Negative: 2,285
NOT RELEVANT
Positive: 8,550
Negative: 2,573
Correlation between the categorical sentiment and the ground truth of the HT data (Stanford model)
RELEVANT
Angry: 0
Sad: 83
Neutral: 10,958
Like: 77
Love: 5
NOT RELEVANT
Angry: 30
Sad: 35
Neutral: 10,820
Like: 238
Love: 0
Correlation between the categorical sentiment and the ground truth of the HT data (HT Review model)
RELEVANT
Angry: 112
Sad: 3,433
Neutral: 1,536
Like: 5,904
Love: 138
NOT RELEVANT
Angry: 296
Sad: 3,795
Neutral: 1,494
Like: 5,538
Love: 0
Step II.B
HT LEAD GENERATION ANALYSIS - the HT LG Model
Step II.B analyses the correlation between the results of running the HT LG Model on dataset #2 (the non-labeled test ads) and their respective results when run through the first four models: binary Netflix and HT, categorical Stanford and HT. To perform this analysis, just like the previous one, an Output Analyzer was created, which uses a brute-force approach to correlate the HT relevancy, as predicted by the HT LG Model, with the results of running the four models. In short, given the 38,563 non-labeled ads with their respective IDs, the Output Analyzer recorded the HT relevancy label given by the newly produced HT LG Model for each ad, as well as the label given by one of the sentiment models, which thereafter made it possible to calculate the number of RELEVANT and NOT RELEVANT ads within each sentiment label (e.g. the number of RELEVANT positive, NOT RELEVANT positive, RELEVANT negative and NOT RELEVANT negative ads as analysed by the Netflix model).
Correlation between the binary sentiment and the truth of the HT data (Netflix model)
RELEVANT
Positive: 8,124
Negative: 10,160
NOT RELEVANT
Positive: 9,127
Negative: 11,152
Correlation between the binary sentiment and the truth of the HT data (HT Review model)
RELEVANT
Positive: 14,833
Negative: 3,451
NOT RELEVANT
Positive: 15,164
Negative: 5,115
Correlation between the categorical sentiment and the truth of the HT data (Stanford model)
RELEVANT
Angry: 0
Sad: 68
Neutral: 17,889
Like: 318
Love: 9
NOT RELEVANT
Angry: 3
Sad: 114
Neutral: 19,868
Like: 294
Love: 0
Correlation between the categorical sentiment and the truth of the HT data (HT Review model)
RELEVANT
Angry: 377
Sad: 5,030
Neutral: 1,796
Like: 11,088
Love: 93
NOT RELEVANT
Angry: 449
Sad: 6,994
Neutral: 1,183
Like: 11,553
Love: 0
Step III
HT LEAD GENERATION CLUSTER ANALYSIS - the HT LG Model
Step III analyses the relative distribution of categorical sentiment in the HT Lead Generation ads (dataset #2). To build the following two graphs, these steps were undertaken:
Run the chosen model (either #3 or #4) on the test dataset (HT LG dataset #2).
Record the cluster_id of each ad.
For each cluster, calculate the total number of angry, sad, neutral, like and love ads.
To calculate the contribution of each cluster to the total value of each label on the graph, for each label the number of ads in the cluster classified with that label was divided by the cluster's total number of ads, which was done using the Cluster Analysis parser.
For example, the cluster ce3fb900eb7ca45e4a1dab30624b822fa40418d1, as analysed by the Stanford Model, has a total of 62 ads with 0 love, like, sad and angry ads, and 62 neutral ads, which means that it contributes 62/62=1.0 to the neutral bar and 0/62=0 to the rest.
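The per-cluster arithmetic above can be sketched as follows. This is a Python sketch of the Cluster Analysis parser's calculation; the input shape (a list of cluster_id/sentiment pairs) is an assumption for illustration.

```python
from collections import Counter, defaultdict

LABELS = ["angry", "sad", "neutral", "like", "love"]

def cluster_contributions(ads):
    """ads: iterable of (cluster_id, sentiment_label) pairs.

    For each cluster, divide the per-label ad count by the cluster's
    total number of ads; each bar on the graph is the sum of these
    per-cluster fractions over all clusters."""
    per_cluster = defaultdict(Counter)
    for cluster_id, label in ads:
        per_cluster[cluster_id][label] += 1
    bars = dict.fromkeys(LABELS, 0.0)
    for counts in per_cluster.values():
        total = sum(counts.values())
        for label in LABELS:
            bars[label] += counts[label] / total
    return bars
```

A cluster of 62 all-neutral ads contributes 62/62 = 1.0 to the neutral bar and 0 to the rest, matching the example above.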
The relative distribution of categorical sentiment among HT LG ads using the Stanford Model (#3)
Angry: 0.01121
Sad: 0.17895
Neutral: 125.07799
Like: 4.70649
Love: 0.02534
The relative distribution of categorical sentiment among HT LG ads using the HT Categorical Model (#4)
Angry: 2.35796
Sad: 26.67812
Neutral: 18.13439
Like: 82.00351
Love: 0.82603
Step IV.A
HT LEAD GENERATION - OVERALL ANALYSIS
Step IV.A involves looking at Steps II.A and II.B from afar. In short, it portrays the results of those analyses in two tables, for Steps II.A and II.B respectively.
Using the HT LG Ground Truth Data
HT Relevant
Netflix - POS: 3,688; NEG: 7,435
Stanford - ANGRY: 0; SAD: 83; NEUTRAL: 10,958; LIKE: 77; LOVE: 5
HT binary - POS: 8,838; NEG: 2,285
HT categorical - ANGRY: 112; SAD: 3,433; NEUTRAL: 1,536; LIKE: 5,904; LOVE: 138
Not HT Relevant
Netflix - POS: 5,587; NEG: 5,536
Stanford - ANGRY: 30; SAD: 35; NEUTRAL: 10,820; LIKE: 238; LOVE: 0
HT binary - POS: 8,550; NEG: 2,573
HT categorical - ANGRY: 296; SAD: 3,795; NEUTRAL: 1,494; LIKE: 5,538; LOVE: 0
Using the HT LG Model
HT Relevant
Netflix - POS: 8,124; NEG: 10,160
Stanford - ANGRY: 0; SAD: 68; NEUTRAL: 17,889; LIKE: 318; LOVE: 9
HT binary - POS: 14,833; NEG: 3,451
HT categorical - ANGRY: 377; SAD: 5,030; NEUTRAL: 1,796; LIKE: 11,088; LOVE: 93
Not HT Relevant
Netflix - POS: 9,127; NEG: 11,152
Stanford - ANGRY: 3; SAD: 114; NEUTRAL: 19,868; LIKE: 294; LOVE: 0
HT binary - POS: 15,164; NEG: 5,115
HT categorical - ANGRY: 449; SAD: 6,994; NEUTRAL: 1,183; LIKE: 11,553; LOVE: 0
Step IV.B
HT LEAD GENERATION - ANALYSING THE PERFORMANCE OF THE MODEL
Step IV.B involves analysing the performance of the HT LG Model. To do this, the dataset #1 (the pre-labeled training dataset) was divided into 80% for training and 20% for testing. The NOT_RELEVANT and VERY_RELEVANT labels remained in the new training dataset, and were removed and saved separately in the test dataset. The new model was trained on the 80% of the data and run on the 20% testing dataset. The results of this run were thereafter compared to the actual saved labels, and the matches were calculated (as true positive, false positive, true negative, false negative). All of this was performed using the Test Model Parser.
To build the table below, and later on, the 20% of the dataset #1 (4,449 ads) were evaluated in total:
Before running the newly created 80% model on the 20% test set, the labels of all 4,449 ads were removed and saved separately.
The model was run on the 4,449 ads and the predicted labels were saved.
The number of ads whose initial (actual) label was RELEVANT and whose predicted label was also RELEVANT gives the total number of TRUE POSITIVE (true relevant) ads, in this case 2,213.
The number of ads whose initial (actual) label was RELEVANT but whose predicted label was NOT RELEVANT gives the total number of FALSE NEGATIVE (false not relevant) ads, in this case 29.
The number of ads whose initial (actual) label was NOT RELEVANT and whose predicted label was also NOT RELEVANT gives the total number of TRUE NEGATIVE (true not relevant) ads, in this case 2,165.
The number of ads whose initial (actual) label was NOT RELEVANT but whose predicted label was RELEVANT gives the total number of FALSE POSITIVE (false relevant) ads, in this case 42.
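The four counts above reduce to a simple tally over parallel lists of actual and predicted labels. This is a Python sketch of the Test Model Parser's counting step; the function name and argument layout are illustrative.

```python
def confusion(actual, predicted, positive="RELEVANT"):
    """Count (tp, fp, tn, fn) over parallel label lists."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # true relevant
        elif a == positive:
            fn += 1  # false not relevant
        elif p == positive:
            fp += 1  # false relevant
        else:
            tn += 1  # true not relevant
    return tp, fp, tn, fn
```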
The ROC Curve was later on built using the results of the 80:20 run, which was performed using the ROC creator. To get the X-axis and Y-axis coordinates for plotting the curve, for each test ad in the 20% evaluated, the False Positive Rate (FPR) and the True Positive Rate (TPR) were calculated respectively. To calculate the TPR, the running count of true positives (or, in our case, true relevant) was divided by the total count of true positives (as seen in the table). The same was done for the FPR, but using the false positive (or false relevant) count instead.
The same procedure was followed in Step V (Hybrid Model analysis).
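The coordinate calculation described above can be sketched as follows. This is a Python sketch, not the ROC creator's actual interface; the per-ad outcome strings are an assumed encoding.

```python
def roc_points(outcomes, total_tp, total_fp):
    """Walk the test ads in evaluation order, emitting one (FPR, TPR)
    point per ad: running TP count / total TP on the Y axis, running
    FP count / total FP on the X axis, as described above."""
    tp = fp = 0
    points = []
    for outcome in outcomes:
        if outcome == "TP":
            tp += 1
        elif outcome == "FP":
            fp += 1
        points.append((fp / total_fp, tp / total_tp))
    return points
```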
Dividing the train set into 80:20 (80% = training; 20% = testing)
                     Predicted RELEVANT   Predicted NOT RELEVANT
Actual RELEVANT      2,213                29
Actual NOT RELEVANT  42                   2,165
Plotting an ROC curve on the 80:20 data
Step V
HT LEAD GENERATION - THE HYBRID MODEL
Step V involves building a Hybrid Model using the Hybrid Model builder. The purpose of the model is to determine the HT Relevancy of an ad with more precision. In short, to build the training dataset the following steps were undertaken:
Use the pre-labeled dataset (#1) and save its labels only
Run the Stanford model on the ads from dataset #1 and give each ad a label of yes (indicating that the ad was labeled with "love") or no (indicating that the ad was labeled as either "like", "neutral", "sad" or "angry").
Run the HT Categorical model on the ads and give each ad a label the same way.
Run the GeoTopicParser on each ad and save the main geolocation found.
Use that main geolocation as the next label.
Parse each ad's text and search for the presence of negation ("no"). If it is present in the ad, save "yes" as the next label; otherwise save "no".
As a result, each line of the training file (22,246 lines in total) looked like this: NOT_RELEVANT no, no, New York City, no
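The steps above can be sketched as a small line-composing helper. This is an illustrative Python sketch, not the Hybrid Model builder itself; the function name and boolean flag arguments are assumptions.

```python
def hybrid_line(label, stanford_love, ht_love, geolocation, has_negation):
    """Compose one hybrid training line: the relevancy label followed by
    yes/no flags for the Stanford and HT Categorical 'love' labels, the
    main geolocation from GeoTopicParser, and a negation-presence flag."""
    yn = lambda flag: "yes" if flag else "no"
    return (f"{label} {yn(stanford_love)}, {yn(ht_love)}, "
            f"{geolocation}, {yn(has_negation)}")
```

For an ad labelled NOT_RELEVANT with no "love" labels, New York City as its main geolocation, and no negation, this yields the example line shown above.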
Running the Hybrid Model on the HT LG Data (dataset #2)
RELEVANT: 20,396
NOT RELEVANT: 18,167
Compared to HT LG Binary model (#5) results on the same data:
RELEVANT: 18,284
NOT RELEVANT: 20,279
Dividing the hybrid train set into 80:20 (80% = training; 20% = testing)
                     Predicted RELEVANT   Predicted NOT RELEVANT
Actual RELEVANT      1,979                228
Actual NOT RELEVANT  671                  1,571
Plotting an ROC curve on the 80:20 data
HT LEAD GENERATION - HYBRID MODEL VERSION II
In this case a new version of the Hybrid Model was built that includes the top 5 geolocations as outputted by the GeoTopicParser rather than just one, as seen in the first version of the Hybrid Model (#6). As a result, an example of an element in the training dataset looks like this: RELEVANT no; no; Oakland County, Cumberland East Bay, San Fernando Valley Division; no
Running the Hybrid Model Version II on the HT LG Data (dataset #2)
RELEVANT: 13,183
NOT RELEVANT: 25,380
Plotting an ROC curve on the 80:20 data for Hybrid Model V2
Step VI
COMPARING THE TWO HT LG MODELS (#5 AND #6)
Step VI involves looking at the two Binary HT Lead Generation models (#5 and #6) once again and giving a more precise depiction of their performance, to conclude which of the two acts as the better and more reliable model for HT relevancy.
During the training of the model, Apache OpenNLP provides the user with the accuracy rate for each of the 100 iterations. Looking at the source code, this accuracy is calculated by running the model on the training dataset (i.e. the dataset the model has already seen) and dividing the number of correct predictions by the total number of events. Curves #1 and #3 depict this accuracy for the HT LG Model (#5) and the Hybrid Model (#6) respectively. However, as mentioned earlier, the model had already seen the data it was being tested on, which makes such an analysis not very reliable. To calculate the same accuracy for the test set (20%), some of the OpenNLP source code was used and modified. The results of this analysis can be seen in curves #2 and #4 for the HT LG Model (#5) and the Hybrid Model (#6) respectively.
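The accuracy figure described above, correct predictions divided by total events, can be reproduced for a held-out set in a few lines. This is a Python sketch of the arithmetic only; the actual computation was done in modified OpenNLP Java code.

```python
def accuracy(predictions, gold):
    """Correct predictions divided by the total number of events, as in
    OpenNLP's per-iteration accuracy report."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```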
HT LG Model: Accuracy on the training set
Iteration 1: 0.625028096
Iteration 100: 0.979489773
HT LG Model: Accuracy on the test set
Iteration 1: 0
Iteration 100: 0.333258427
Hybrid Model: Accuracy on the training set
Iteration 1: 0.374971904
Iteration 100: 0.841818386
Hybrid Model: Accuracy on the test set
Iteration 1: 0.201797753
Iteration 100: 0.520674157
Conclusions:
Since the accuracy when running the model on the test set is much more reliable, it is safe to say that the Hybrid Model outperforms the HT LG Model
Not much overfitting can be seen, but 10-20 iterations for the Hybrid Model seems to be enough for training
~50 iterations for the HT LG Model seems to be enough for training
Comparing two Hybrid Models (#6 and #7)
Hybrid Model: Accuracy on the training set
Iteration 1: 0.374971904
Iteration 100: 0.841818386
Hybrid Model: Accuracy on the test set
Iteration 1: 0.201797753
Iteration 100: 0.520674157
Hybrid Model Version II: Accuracy on the training set
Iteration 1: 0.374971904
Iteration 100: 0.861373342
Hybrid Model Version II: Accuracy on the test set
Iteration 1: 0.19658879
Iteration 100: 0.516853933
Conclusions:
The overall accuracy for both models seems to be very similar
The accuracy on the training dataset is slightly higher for the Version II of the Hybrid Model, but since such accuracy is not the most reliable way to analyze the given models, this does not guarantee an overall higher accuracy of Version II
The test set accuracy seems to reach almost the same point for both models, but it does appear to be slightly higher for the first version of the Hybrid Model
The overall output when running on the unlabeled test dataset (#2) seems to differ drastically when comparing the results of two models
However, since we have seen that the test set accuracy is slightly higher for the first version of the Hybrid Model, we could assume that the results of running it on dataset #2 are slightly more reliable
Step VII.A (NEW)
DEBIASED HT LG DATA ANALYSIS - the HT LG Ground Truth
Step VII.A involves performing the same steps as in Step II.A, but on the debiased data.
Correlation between the binary sentiment and the ground truth of the debiased HT data (Netflix model)
RELEVANT
Positive: 2,131
Negative: 5,282
NOT RELEVANT
Positive: 37,102
Negative: 54,819
Correlation between the binary sentiment and the ground truth of the debiased HT data (HT Review model)
RELEVANT
Positive: 5,586
Negative: 1,827
NOT RELEVANT
Positive: 63,771
Negative: 28,150
Correlation between the categorical sentiment and the ground truth of the debiased HT data (Stanford model)
RELEVANT
Angry: 0
Sad: 72
Neutral: 7,297
Like: 44
Love: 0
NOT RELEVANT
Angry: 15
Sad: 1,514
Neutral: 89,105
Like: 1,280
Love: 0
Correlation between the categorical sentiment and the ground truth of the debiased HT data (HT Review model)
RELEVANT
Angry: 54
Sad: 2,752
Neutral: 699
Like: 3,853
Love: 55
NOT RELEVANT
Angry: 1,980
Sad: 36,917
Neutral: 5,827
Like: 45,989
Love: 0
Step VII.B (NEW)
DEBIASED HT LG DATA ANALYSIS - the HT LG Model II
Step VII.B involves performing the same steps as in Step II.B. The initial debiased labeled dataset consisted of 99,334 elements, only 7,431 of which were labeled as RELEVANT. When a new HT LG model was trained on that dataset, almost all of the analysed ads were classified as NOT_RELEVANT due to the completely unbalanced nature of the training dataset. Because of that, the training dataset was balanced out, resulting in a total of 14,826 elements (7,431 RELEVANT and 7,431 NOT_RELEVANT). A new model was trained on that dataset and run on the unlabeled dataset #2. The results were as follows:
Correlation between the binary sentiment and the truth of the debiased HT data (Netflix model)
RELEVANT
Positive: 4,285
Negative: 6,509
NOT RELEVANT
Positive: 12,966
Negative: 14,803
Correlation between the binary sentiment and the truth of the debiased HT data (HT Review model)
RELEVANT
Positive: 6,597
Negative: 4,197
NOT RELEVANT
Positive: 23,400
Negative: 4,369
Correlation between the categorical sentiment and the truth of the debiased HT data (Stanford model)
RELEVANT
Angry: 2
Sad: 109
Neutral: 10,550
Like: 132
Love: 1
NOT RELEVANT
Angry: 1
Sad: 73
Neutral: 27,204
Like: 480
Love: 0
Correlation between the categorical sentiment and the truth of the debiased HT data (HT Review model)
RELEVANT
Angry: 246
Sad: 4,663
Neutral: 775
Like: 5,048
Love: 62
NOT RELEVANT
Angry: 580
Sad: 7,261
Neutral: 2,104
Like: 17,571
Love: 0
Step VIII
EXPLAINING THE WORKFLOW
Step VIII involves analysing and explaining the flow of the work done for the previous steps.
Step II.A: HT LEAD GENERATION ANALYSIS - the HT LG Ground Truth Data
Step II.B: HT LEAD GENERATION ANALYSIS - the HT LG Model
Step III: HT LEAD GENERATION CLUSTER ANALYSIS - the HT LG Model
Step IV.B: HT LEAD GENERATION - ANALYSING THE PERFORMANCE OF THE MODEL
Step V: HT LEAD GENERATION - THE HYBRID MODEL
Step VI: COMPARING THE TWO HT LG MODELS (#5 AND #6)