HT LG Binary - Hybrid Model V2: a model trained by combining the results of previous models and the results of running GeoTopicParser on the ads. In this case, top 5 geolocations are selected.
HT Lead Generation Binary model: Using a MEMEX-provided JSON data file (contact mattmann@usc.edu for more information on the source file), the final training dataset was created using a JSON Parser, implemented separately within Apache Tika in conjunction with the Text-To-Tag-Ratio algorithm, and a Data Preparation Parser. The JSON Parser extracted the value of the raw_content field in the JSON file and its corresponding annotation, or label (NOT_RELEVANT or VERY_RELEVANT). Since the value of raw_content was the HTML source code of the entire ad, the parser's implementation within Tika, along with the Text-To-Tag-Ratio algorithm, was essential for stripping out site navigation and unnecessary tags, which made the model more reliable. The Data Preparation Parser made the training dataset more reliable still by ensuring an equal number of examples per label and shuffling the data. The result was an almost perfectly balanced training dataset of 22,246 lines in total (11,123 NOT_RELEVANT and 11,123 VERY_RELEVANT examples), where each line starts with a label (NOT_RELEVANT or VERY_RELEVANT) followed by the respective ad. As a result, the model is capable of analysing the given data for its relevancy to human trafficking.
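The balancing and shuffling performed by the Data Preparation Parser can be sketched as follows. This is a minimal Python sketch, not the actual Tika parser: the annotation and raw_content field names follow the description above, and the HTML stripping done by the Text-To-Tag-Ratio step is assumed to have already happened.

```python
import random

def build_training_lines(records, seed=42):
    """Build a balanced, shuffled 'LABEL ad-text' training file.

    records: parsed JSON objects with 'annotation' and 'raw_content'
    fields (field names follow the description above)."""
    by_label = {"NOT_RELEVANT": [], "VERY_RELEVANT": []}
    for rec in records:
        if rec["annotation"] in by_label:
            by_label[rec["annotation"]].append(rec["raw_content"])
    # Equal presence of examples per label: truncate to the smaller class.
    n = min(len(texts) for texts in by_label.values())
    lines = [f"{label} {text}"
             for label, texts in by_label.items()
             for text in texts[:n]]
    random.Random(seed).shuffle(lines)  # interleave the two labels
    return lines
```

Applied to the full MEMEX file, this yields the 22,246-line balanced dataset described above.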
To find out more about the models we trained please visit our Models and Datasets pages.
Dataset evaluated
HT Lead Generation Ads - pre-labeled (22,246): the MEMEX services ads examined for their relevance to human trafficking and their sentiment, and therefore possible issues.
HT Lead Generation Ads - NOT labeled (38,563): the MEMEX services ads examined for their sentiment and therefore possible issues.
Step II.A
HT LEAD GENERATION ANALYSIS - the HT LG Ground Truth Data
In this further analysis we will refer to the pre-labeled data from the training file discussed above as the Ground Truth of HT Lead Generation, since we are almost 100% certain that each label matches its ad. We will refer to the results acquired by running the model trained on that Ground Truth, as described above, as the results of the HT LG Model. Both the Ground Truth data and the results of running the HT LG Model can be categorised into RELEVANT and NOT RELEVANT.
Step II.A analyses the correlation between the Ground Truth of the HT Lead Generation Ads and their respective results when run through the first four models: binary Netflix and HT, categorical Stanford and HT. To perform this and other analyses, an Output Analyzer was created, which uses a brute-force approach to correlate the HT relevancy, as provided by the Ground Truth, with the results of running the four models. In short, given the 22,246 pre-labeled ads with their respective IDs, the Output Analyzer recorded the HT-relevancy label of each ad, as well as the result it produced when analysed with the given model, and calculated the number of RELEVANT and NOT RELEVANT ads within each sentiment label (e.g. the number of RELEVANT positive, NOT RELEVANT positive, RELEVANT negative and NOT RELEVANT negative ads as analysed by the Netflix model).
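The counting the Output Analyzer performs can be sketched in a few lines. This is a Python sketch only; the actual tool is a brute-force parser, and the dict-based interface here is an assumption for illustration.

```python
from collections import Counter

def correlate(ground_truth, sentiment):
    """Tally ads per (relevancy, sentiment) pair.

    ground_truth: ad ID -> RELEVANT / NOT RELEVANT label
    sentiment:    ad ID -> label from one of the four sentiment models
    """
    counts = Counter()
    for ad_id, relevancy in ground_truth.items():
        counts[(relevancy, sentiment[ad_id])] += 1
    return counts
```

Each (relevancy, sentiment) count corresponds to one cell in the correlation listings below, e.g. the number of RELEVANT positive ads under the Netflix model.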
Correlation between the binary sentiment and the ground truth of the HT data (Netflix model)
RELEVANT
Positive: 3,688
Negative: 7,435
NOT RELEVANT
Positive: 5,587
Negative: 5,536
Correlation between the binary sentiment and the ground truth of the HT data (HT Review model)
RELEVANT
Positive: 8,838
Negative: 2,285
NOT RELEVANT
Positive: 8,550
Negative: 2,573
Correlation between the categorical sentiment and the ground truth of the HT data (Stanford model)
RELEVANT
Angry: 0
Sad: 83
Neutral: 10,958
Like: 77
Love: 5
NOT RELEVANT
Angry: 30
Sad: 35
Neutral: 10,820
Like: 238
Love: 0
Correlation between the categorical sentiment and the ground truth of the HT data (HT Review model)
RELEVANT
Angry: 112
Sad: 3,433
Neutral: 1,536
Like: 5,904
Love: 138
NOT RELEVANT
Angry: 296
Sad: 3,795
Neutral: 1,494
Like: 5,538
Love: 0
Step II.B
HT LEAD GENERATION ANALYSIS - the HT LG Model
Step II.B analyses the correlation between the results of running the HT LG Model on dataset #2 (the non-labeled test ads) and their respective results when run through the first four models: binary Netflix and HT, categorical Stanford and HT. To perform this analysis, just like the previous one, an Output Analyzer was created, which uses a brute-force approach to correlate the HT relevancy, as predicted by the HT LG Model, with the results of running the four models. In short, given the 38,563 non-labeled ads with their respective IDs, the Output Analyzer recorded the HT relevancy label given by the newly produced HT LG Model for each ad, as well as the label given by one of the sentiment models, which thereafter made it possible to calculate the number of RELEVANT and NOT RELEVANT ads within each sentiment label (e.g. the number of RELEVANT positive, NOT RELEVANT positive, RELEVANT negative and NOT RELEVANT negative ads as analysed by the Netflix model).
Correlation between the binary sentiment and the truth of the HT data (Netflix model)
RELEVANT
Positive: 8,124
Negative: 10,160
NOT RELEVANT
Positive: 9,127
Negative: 11,152
Correlation between the binary sentiment and the truth of the HT data (HT Review model)
RELEVANT
Positive: 14,833
Negative: 3,451
NOT RELEVANT
Positive: 15,164
Negative: 5,115
Correlation between the categorical sentiment and the truth of the HT data (Stanford model)
RELEVANT
Angry: 0
Sad: 68
Neutral: 17,889
Like: 318
Love: 9
NOT RELEVANT
Angry: 3
Sad: 114
Neutral: 19,868
Like: 294
Love: 0
Correlation between the categorical sentiment and the truth of the HT data (HT Review model)
RELEVANT
Angry: 377
Sad: 5,030
Neutral: 1,796
Like: 11,088
Love: 93
NOT RELEVANT
Angry: 449
Sad: 6,994
Neutral: 1,183
Like: 11,553
Love: 0
Step III
HT LEAD GENERATION CLUSTER ANALYSIS - the HT LG Model
Step III analyses the relative distribution of categorical sentiment in the HT Lead Generation ads (dataset #2). To build the following two graphs, these steps were undertaken:
Run the chosen model (either #3 or #4) on the test dataset (HT LG dataset #2).
Record the cluster_id of each ad.
For each cluster, calculate the total number of angry, sad, neutral, like and love ads.
To calculate the contribution of each cluster to the total value of each label on the graph, for each label the number of ads in the cluster classified with that label was divided by the cluster's total number of ads, which was done using the Cluster Analysis parser.
For example, the cluster ce3fb900eb7ca45e4a1dab30624b822fa40418d1, as analysed by the Stanford Model, has a total of 62 ads with 0 love, like, sad and angry ads, and 62 neutral ads, which means that it contributes 62/62=1.0 to the neutral bar and 0/62=0 to the rest.
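The per-cluster arithmetic above can be sketched as follows. This is a Python sketch of the Cluster Analysis parser's calculation; the input shape (a list of cluster_id/sentiment pairs) is an assumption for illustration.

```python
from collections import Counter, defaultdict

LABELS = ["angry", "sad", "neutral", "like", "love"]

def cluster_contributions(ads):
    """ads: iterable of (cluster_id, sentiment_label) pairs.

    For each cluster, divide the per-label ad count by the cluster's
    total number of ads; each bar on the graph is the sum of these
    per-cluster fractions over all clusters."""
    per_cluster = defaultdict(Counter)
    for cluster_id, label in ads:
        per_cluster[cluster_id][label] += 1
    bars = dict.fromkeys(LABELS, 0.0)
    for counts in per_cluster.values():
        total = sum(counts.values())
        for label in LABELS:
            bars[label] += counts[label] / total
    return bars
```

A cluster of 62 all-neutral ads contributes 62/62 = 1.0 to the neutral bar and 0 to the rest, matching the example above.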
The relative distribution of categorical sentiment among HT LG ads using the Stanford Model (#3)
Angry: 0.01121
Sad: 0.17895
Neutral: 125.07799
Like: 4.70649
Love: 0.02534
The relative distribution of categorical sentiment among HT LG ads using the HT Categorical Model (#4)
Angry: 2.35796
Sad: 26.67812
Neutral: 18.13439
Like: 82.00351
Love: 0.82603
Step IV.A
HT LEAD GENERATION - OVERALL ANALYSIS
Step IV.A involves looking at Steps II.A and II.B from afar. In short, it portrays the results of those analyses in two tables, for Steps II.A and II.B respectively.
Using the HT LG Ground Truth Data
HT Relevant
Netflix - POS: 3,688; NEG: 7,435
Stanford - ANGRY: 0; SAD: 83; NEUTRAL: 10,958; LIKE: 77; LOVE: 5
HT binary - POS: 8,838; NEG: 2,285
HT categorical - ANGRY: 112; SAD: 3,433; NEUTRAL: 1,536; LIKE: 5,904; LOVE: 138
Not HT Relevant
Netflix - POS: 5,587; NEG: 5,536
Stanford - ANGRY: 30; SAD: 35; NEUTRAL: 10,820; LIKE: 238; LOVE: 0
HT binary - POS: 8,550; NEG: 2,573
HT categorical - ANGRY: 296; SAD: 3,795; NEUTRAL: 1,494; LIKE: 5,538; LOVE: 0
Using the HT LG Model
HT Relevant
Netflix - POS: 8,124; NEG: 10,160
Stanford - ANGRY: 0; SAD: 68; NEUTRAL: 17,889; LIKE: 318; LOVE: 9
HT binary - POS: 14,833; NEG: 3,451
HT categorical - ANGRY: 377; SAD: 5,030; NEUTRAL: 1,796; LIKE: 11,088; LOVE: 93
Not HT Relevant
Netflix - POS: 9,127; NEG: 11,152
Stanford - ANGRY: 3; SAD: 114; NEUTRAL: 19,868; LIKE: 294; LOVE: 0
HT binary - POS: 15,164; NEG: 5,115
HT categorical - ANGRY: 449; SAD: 6,994; NEUTRAL: 1,183; LIKE: 11,553; LOVE: 0
Step IV.B
HT LEAD GENERATION - ANALYSING THE PERFORMANCE OF THE MODEL
Step IV.B involves analysing the performance of the HT LG Model. To do this, the dataset #1 (the pre-labeled training dataset) was divided into 80% for training and 20% for testing. The NOT_RELEVANT and VERY_RELEVANT labels remained in the new training dataset, and were removed and saved separately in the test dataset. The new model was trained on the 80% of the data and run on the 20% testing dataset. The results of this run were thereafter compared to the actual saved labels, and the matches were calculated (as true positive, false positive, true negative, false negative). All of this was performed using the Test Model Parser.
To build the table below, and later on, the 20% of the dataset #1 (4,449 ads) were evaluated in total:
Before running the newly created 80% model on the 20% test set, the labels of all 4,449 ads were removed and saved separately.
The model was run on the 4,449 ads and the predicted labels were saved.
The number of ads whose initial (actual) label was RELEVANT and whose predicted label was also RELEVANT gives the total number of TRUE POSITIVE (true relevant) ads, in this case 2,213.
The number of ads whose initial (actual) label was RELEVANT but whose predicted label was NOT RELEVANT gives the total number of FALSE NEGATIVE (false not relevant) ads, in this case 29.
The number of ads whose initial (actual) label was NOT RELEVANT and whose predicted label was also NOT RELEVANT gives the total number of TRUE NEGATIVE (true not relevant) ads, in this case 2,165.
The number of ads whose initial (actual) label was NOT RELEVANT but whose predicted label was RELEVANT gives the total number of FALSE POSITIVE (false relevant) ads, in this case 42.
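The four counts above reduce to a simple tally over parallel lists of actual and predicted labels. This is a Python sketch of the Test Model Parser's counting step; the function name and argument layout are illustrative.

```python
def confusion(actual, predicted, positive="RELEVANT"):
    """Count (tp, fp, tn, fn) over parallel label lists."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # true relevant
        elif a == positive:
            fn += 1  # false not relevant
        elif p == positive:
            fp += 1  # false relevant
        else:
            tn += 1  # true not relevant
    return tp, fp, tn, fn
```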
The ROC Curve was later on built using the results of the 80:20 run, which was performed using the ROC creator. To get the X-axis and Y-axis coordinates for plotting the curve, for each test ad in the 20% evaluated, the False Positive Rate (FPR) and the True Positive Rate (TPR) were calculated respectively. To calculate the TPR, the running count of true positives (or, in our case, true relevant) was divided by the total count of true positives (as seen in the table). The same was done for the FPR, but using the false positive (or false relevant) count instead.
The same procedure was followed in Step V (Hybrid Model analysis).
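The coordinate calculation described above can be sketched as follows. This is a Python sketch, not the ROC creator's actual interface; the per-ad outcome strings are an assumed encoding.

```python
def roc_points(outcomes, total_tp, total_fp):
    """Walk the test ads in evaluation order, emitting one (FPR, TPR)
    point per ad: running TP count / total TP on the Y axis, running
    FP count / total FP on the X axis, as described above."""
    tp = fp = 0
    points = []
    for outcome in outcomes:
        if outcome == "TP":
            tp += 1
        elif outcome == "FP":
            fp += 1
        points.append((fp / total_fp, tp / total_tp))
    return points
```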
Dividing the train set into 80:20 (80% = training; 20% = testing)
                     Predicted RELEVANT   Predicted NOT RELEVANT
Actual RELEVANT      2,213                29
Actual NOT RELEVANT  42                   2,165
Plotting an ROC curve on the 80:20 data
Step V
HT LEAD GENERATION - THE HYBRID MODEL
Step V involves building a Hybrid Model using the Hybrid Model builder. The purpose of the model is to determine the HT Relevancy of an ad with more precision. In short, to build the training dataset the following steps were undertaken:
Use the pre-labeled dataset (#1) and save its labels only
Run the Stanford model on the ads from dataset #1 and give each ad a label of yes (indicating that the ad was labeled with "love") or no (indicating that the ad was labeled as either "like", "neutral", "sad" or "angry").
Run the HT Categorical model on the ads and give each ad a label the same way.
Run the GeoTopicParser on each ad and save the main geolocation found.
Use that main geolocation as the next label.
Parse each ad's text and search for the presence of negation ("no"). If it is present in the ad, save "yes" as the next label; otherwise save "no".
As a result, each line of the training file (22,246 lines in total) looked like this: NOT_RELEVANT no, no, New York City, no
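The steps above can be sketched as a small line-composing helper. This is an illustrative Python sketch, not the Hybrid Model builder itself; the function name and boolean flag arguments are assumptions.

```python
def hybrid_line(label, stanford_love, ht_love, geolocation, has_negation):
    """Compose one hybrid training line: the relevancy label followed by
    yes/no flags for the Stanford and HT Categorical 'love' labels, the
    main geolocation from GeoTopicParser, and a negation-presence flag."""
    yn = lambda flag: "yes" if flag else "no"
    return (f"{label} {yn(stanford_love)}, {yn(ht_love)}, "
            f"{geolocation}, {yn(has_negation)}")
```

For an ad labelled NOT_RELEVANT with no "love" labels, New York City as its main geolocation, and no negation, this yields the example line shown above.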
Running the Hybrid Model on the HT LG Data (dataset #2)
RELEVANT: 20,396
NOT RELEVANT: 18,167
Compared to HT LG Binary model (#5) results on the same data:
RELEVANT: 18,284
NOT RELEVANT: 20,279
Dividing the hybrid train set into 80:20 (80% = training; 20% = testing)
                     Predicted RELEVANT   Predicted NOT RELEVANT
Actual RELEVANT      1,979                228
Actual NOT RELEVANT  671                  1,571
Plotting an ROC curve on the 80:20 data
HT LEAD GENERATION - HYBRID MODEL VERSION II
In this case a new version of the Hybrid Model was built that includes the top 5 geolocations as outputted by the GeoTopicParser rather than just one, as seen in the first version of the Hybrid Model (#6). As a result, an example of an element in the training dataset looks like this: RELEVANT no; no; Oakland County, Cumberland East Bay, San Fernando Valley Division; no
Running the Hybrid Model Version II on the HT LG Data (dataset #2)
RELEVANT: 13,183
NOT RELEVANT: 25,380
Plotting an ROC curve on the 80:20 data for Hybrid Model V2
Step VI
COMPARING THE TWO HT LG MODELS (#5 AND #6)
Step VI involves looking at the two Binary HT Lead Generation models (#5 and #6) once again and giving a more precise depiction of their performance, to conclude which of the two acts as the better and more reliable model for HT relevancy.
During the training of the model, Apache OpenNLP provides the user with the accuracy rate for each of the 100 iterations. Looking at the source code, this accuracy is calculated by running the model on the training dataset (i.e. the dataset the model has already seen) and dividing the number of correct predictions by the total number of events. Curves #1 and #3 depict this accuracy for the HT LG Model (#5) and the Hybrid Model (#6) respectively. However, as mentioned earlier, the model had already seen the data it was being tested on, which makes such an analysis not very reliable. To calculate the same accuracy for the test set (20%), some of the OpenNLP source code was used and modified. The results of this analysis can be seen in curves #2 and #4 for the HT LG Model (#5) and the Hybrid Model (#6) respectively.
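The accuracy figure described above, correct predictions divided by total events, can be reproduced for a held-out set in a few lines. This is a Python sketch of the arithmetic only; the actual computation was done in modified OpenNLP Java code.

```python
def accuracy(predictions, gold):
    """Correct predictions divided by the total number of events, as in
    OpenNLP's per-iteration accuracy report."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```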
HT LG Model: Accuracy on the training set
Iteration 1: 0.625028096
Iteration 100: 0.979489773
HT LG Model: Accuracy on the test set
Iteration 1: 0
Iteration 100: 0.333258427
Hybrid Model: Accuracy on the training set
Iteration 1: 0.374971904
Iteration 100: 0.841818386
Hybrid Model: Accuracy on the test set
Iteration 1: 0.201797753
Iteration 100: 0.520674157
Conclusions:
Since the accuracy when running the model on the test set is much more reliable, it is safe to say that the Hybrid Model outperforms the HT LG Model
Not much overfitting can be seen, but 10-20 iterations for the Hybrid Model seems to be enough for training
~50 iterations for the HT LG Model seems to be enough for training
Comparing two Hybrid Models (#6 and #7)
Hybrid Model: Accuracy on the training set
Iteration 1: 0.374971904
Iteration 100: 0.841818386
Hybrid Model: Accuracy on the test set
Iteration 1: 0.201797753
Iteration 100: 0.520674157
Hybrid Model Version II: Accuracy on the training set
Iteration 1: 0.374971904
Iteration 100: 0.861373342
Hybrid Model Version II: Accuracy on the test set
Iteration 1: 0.19658879
Iteration 100: 0.516853933
Conclusions:
The overall accuracy for both models seems to be very similar
The accuracy on the training dataset is slightly higher for the Version II of the Hybrid Model, but since such accuracy is not the most reliable way to analyze the given models, this does not guarantee an overall higher accuracy of Version II
The test set accuracy seems to reach almost the same point for both models, but it does appear to be slightly higher for the first version of the Hybrid Model
The overall output when running on the unlabeled test dataset (#2) seems to differ drastically when comparing the results of two models
However, since we have seen that the test set accuracy is slightly higher for the first version of the Hybrid Model, we could assume that the results of running it on dataset #2 are slightly more reliable
Step VII.A (NEW)
DEBIASED HT LG DATA ANALYSIS - the HT LG Ground Truth
Step VII.A involves performing the same steps as in Step II.A, but on the debiased data.
Correlation between the binary sentiment and the ground truth of the debiased HT data (Netflix model)
RELEVANT
Positive: 2,131
Negative: 5,282
NOT RELEVANT
Positive: 37,102
Negative: 54,819
Correlation between the binary sentiment and the ground truth of the debiased HT data (HT Review model)
RELEVANT
Positive: 5,586
Negative: 1,827
NOT RELEVANT
Positive: 63,771
Negative: 28,150
Correlation between the categorical sentiment and the ground truth of the debiased HT data (Stanford model)
RELEVANT
Angry: 0
Sad: 72
Neutral: 7,297
Like: 44
Love: 0
NOT RELEVANT
Angry: 15
Sad: 1,514
Neutral: 89,105
Like: 1,280
Love: 0
Correlation between the categorical sentiment and the ground truth of the debiased HT data (HT Review model)
RELEVANT
Angry: 54
Sad: 2,752
Neutral: 699
Like: 3,853
Love: 55
NOT RELEVANT
Angry: 1,980
Sad: 36,917
Neutral: 5,827
Like: 45,989
Love: 0
Step VII.B (NEW)
DEBIASED HT LG DATA ANALYSIS - the HT LG Model II
Step VII.B involves performing the same steps as in Step II.B. The initial debiased labeled dataset consisted of 99,334 elements, only 7,431 of which were labeled as RELEVANT. When a new HT LG model was trained on that dataset, almost all of the analysed ads were classified as NOT_RELEVANT due to the completely unbalanced nature of the training dataset. Because of that, the training dataset was balanced out, resulting in a total of 14,826 elements (7,431 RELEVANT and 7,431 NOT_RELEVANT). A new model was trained on that dataset and run on the unlabeled dataset #2. The results were as follows:
Correlation between the binary sentiment and the truth of the debiased HT data (Netflix model)
RELEVANT
Positive: 4,285
Negative: 6,509
NOT RELEVANT
Positive: 12,966
Negative: 14,803
Correlation between the binary sentiment and the truth of the debiased HT data (HT Review model)
RELEVANT
Positive: 6,597
Negative: 4,197
NOT RELEVANT
Positive: 23,400
Negative: 4,369
Correlation between the categorical sentiment and the truth of the debiased HT data (Stanford model)
RELEVANT
Angry: 2
Sad: 109
Neutral: 10,550
Like: 132
Love: 1
NOT RELEVANT
Angry: 1
Sad: 73
Neutral: 27,204
Like: 480
Love: 0
Correlation between the categorical sentiment and the truth of the debiased HT data (HT Review model)
RELEVANT
Angry: 246
Sad: 4,663
Neutral: 775
Like: 5,048
Love: 62
NOT RELEVANT
Angry: 580
Sad: 7,261
Neutral: 2,104
Like: 17,571
Love: 0
Step VIII
EXPLAINING THE WORKFLOW
Step VIII involves analysing and explaining the flow of the work done for the previous steps.
Step II.A: HT LEAD GENERATION ANALYSIS - the HT LG Ground Truth Data
Step II.B: HT LEAD GENERATION ANALYSIS - the HT LG Model
Step III: HT LEAD GENERATION CLUSTER ANALYSIS - the HT LG Model
Step IV.B: HT LEAD GENERATION - ANALYSING THE PERFORMANCE OF THE MODEL
Step V: HT LEAD GENERATION - THE HYBRID MODEL
Step VI: COMPARING THE TWO HT LG MODELS (#5 AND #6)