Human Trafficking - Lead Generation Analysis



Step I

Training models used



  1. Binary - Netflix Review Model: a model trained by examining film and TV show reviews on Netflix.
  2. Binary - HT Provider Review Model: a model trained by examining services reviews.
  3. Categorical - Stanford Model: a model trained by examining data provided by Stanford University.
  4. Categorical - HT Provider Review Model: a model trained by examining HT services reviews.
  5. HT LG Binary - HT Lead Generation Binary: a model trained by examining HT Lead Generation ads classified as HT Relevant or Not HT Relevant.
  6. HT LG Binary - Hybrid Model: a model trained by combining the results of previous models and the results of running GeoTopicParser on the ads.
  7. HT LG Binary - Hybrid Model V2: a model trained by combining the results of previous models and the results of running GeoTopicParser on the ads. In this case, the top 5 geolocations are selected.
  8. HT LG Binary - HT LG V2: a model trained on the balanced debiased dataset.


HT Lead Generation Binary model: Using a MEMEX-provided JSON data file (contact mattmann@usc.edu for more information on the source file), the final training dataset was created with two tools: a JSON Parser, which was also implemented separately within Apache Tika in conjunction with the Text-To-Tag-Ratio algorithm, and a Data Preparation Parser. The JSON Parser extracted the value of the raw_content field in the JSON file together with its corresponding annotation, or label (NOT_RELEVANT or VERY_RELEVANT). Since the raw_content value was the HTML source code of the entire ad, the parser's implementation within Tika, along with the Text-To-Tag-Ratio algorithm, was essential for stripping out site navigation and unnecessary tags, which made the model more reliable. The Data Preparation Parser made the training dataset more reliable still by ensuring an equal number of examples per label and by shuffling the data. In the end we obtained an almost perfect training dataset of 22,246 lines in total (11,123 NOT_RELEVANT and 11,123 VERY_RELEVANT examples), where each line starts with a label (NOT_RELEVANT or VERY_RELEVANT) followed by the respective ad. As a result, the model is capable of analysing the given data for its relevancy to human trafficking.
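The balancing and shuffling done by the Data Preparation Parser can be sketched as follows. This is a minimal Python sketch, not the original implementation (which was Java/Tika-based); the function name and the input format of (label, text) pairs are illustrative assumptions:

```python
import random

def build_balanced_training_set(examples, seed=42):
    """Balance a list of (label, text) pairs so each label appears
    equally often, then shuffle the result (illustrative sketch)."""
    by_label = {}
    for label, text in examples:
        by_label.setdefault(label, []).append((label, text))
    # Downsample every label to the size of the smallest class.
    n = min(len(items) for items in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, n))
    rng.shuffle(balanced)
    # One training line per example: "<LABEL> <ad text>"
    return ["{} {}".format(label, text) for label, text in balanced]
```

Applied to the MEMEX data, a step like this yields the 11,123 + 11,123 line training file described above.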



To find out more about the models we trained please visit our Models and Datasets pages.





Dataset evaluated



  1. HT Lead Generation Ads - pre-labeled (22,246): the MEMEX services ads examined for their relevance to human trafficking and sentiment, and therefore possible issues.
  2. HT Lead Generation Ads - NOT labeled (38,563): the MEMEX services ads examined for their sentiment and therefore possible issues.




Step II.A

HT LEAD GENERATION ANALYSIS - the HT LG Ground Truth Data



In this further analysis we will refer to the pre-labeled data from the given training file discussed above as the Ground Truth of HT Lead Generation, since we are almost 100% certain that each label matches its ad. We will refer to the results acquired by running the model trained on that Ground Truth, as described above, as the results of the HT LG Model. Both the Ground Truth data and the results of running the HT LG Model can be categorised into RELEVANT and NOT RELEVANT.



Step II.A analyses the correlation between the Ground Truth of the HT Lead Generation Ads and their respective results when run through the first four models: binary Netflix and HT, categorical Stanford and HT. To perform this and other analyses, an Output Analyzer was created, which uses a brute-force approach to correlate the HT relevancy, as provided by the Ground Truth, with the results of running the four models. In short, given the 22,246 pre-labeled ads with their respective IDs, the Output Analyzer recorded the HT-relevancy label of each ad, as well as the result it produced when analysed with the given model, and calculated the number of RELEVANT and NOT RELEVANT ads within each sentiment label (e.g. the number of RELEVANT positive ads, NOT RELEVANT positive, RELEVANT negative and NOT RELEVANT negative ads as analysed by the Netflix model).
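The counting performed by the Output Analyzer can be sketched as a simple cross-tabulation. This is an illustrative Python sketch under the assumption that both the Ground Truth labels and the sentiment-model outputs are available keyed by ad ID; the function name is hypothetical:

```python
from collections import Counter

def correlate(ground_truth, sentiments):
    """Count ads per (HT-relevancy, sentiment) pair.

    ground_truth: {ad_id: "RELEVANT" or "NOT RELEVANT"}
    sentiments:   {ad_id: sentiment label output by one of the four models}
    """
    counts = Counter()
    for ad_id, relevancy in ground_truth.items():
        if ad_id in sentiments:
            counts[(relevancy, sentiments[ad_id])] += 1
    return counts
```

For example, counts[("RELEVANT", "positive")] would give the number of RELEVANT positive ads for the chosen model.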



  1. Correlation between the binary sentiment and the ground truth of the HT data (Netflix model)
  2. RELEVANT
    NOT RELEVANT




  3. Correlation between the binary sentiment and the ground truth of the HT data (HT Review model)
  4. RELEVANT
    NOT RELEVANT




  5. Correlation between the categorical sentiment and the ground truth of the HT data (Stanford model)
  6. RELEVANT
    NOT RELEVANT




  7. Correlation between the categorical sentiment and the ground truth of the HT data (HT Review model)
  8. RELEVANT
    NOT RELEVANT








Step II.B

HT LEAD GENERATION ANALYSIS - the HT LG Model



Step II.B analyses the correlation between the results of running the HT LG Model on dataset #2 (the not-labeled test ads) and their respective results when run through the first four models: binary Netflix and HT, categorical Stanford and HT. To perform this analysis, just like the previous one, an Output Analyzer was created, which uses a brute-force approach to correlate the HT relevancy, as predicted by the HT LG Model, with the results of running the four models. In short, given the 38,563 non-labeled ads with their respective IDs, the Output Analyzer recorded the HT-relevancy label given to each ad by the newly produced HT LG Model, as well as the label given by one of the sentiment models, which made it possible to calculate the number of RELEVANT and NOT RELEVANT ads within each sentiment label (e.g. the number of RELEVANT positive ads, NOT RELEVANT positive, RELEVANT negative and NOT RELEVANT negative ads as analysed by the Netflix model).



  1. Correlation between the binary sentiment and the truth of the HT data (Netflix model)
  2. RELEVANT
    NOT RELEVANT




  3. Correlation between the binary sentiment and the truth of the HT data (HT Review model)
  4. RELEVANT
    NOT RELEVANT




  5. Correlation between the categorical sentiment and the truth of the HT data (Stanford model)
  6. RELEVANT
    NOT RELEVANT




  7. Correlation between the categorical sentiment and the truth of the HT data (HT Review model)
  8. RELEVANT
    NOT RELEVANT








Step III

HT LEAD GENERATION CLUSTER ANALYSIS - the HT LG Model



Step III analyses the relative distribution of categorical sentiment in the HT Lead Generation ads (dataset #2). To build the following two graphs, these steps were undertaken:

  1. Run the chosen model (either #3 or #4) on the test dataset (HT LG dataset #2).
  2. Record the cluster_id of each ad.
  3. For each cluster, calculate the total number of angry, sad, neutral, like and love ads.
  4. To calculate the contribution of each cluster to the total value of each label on the graph, divide the number of the cluster's ads classified with that label by the cluster's total number of ads; this was done using the Cluster Analysis parser.
  5. For example, the cluster ce3fb900eb7ca45e4a1dab30624b822fa40418d1, as analysed by the Stanford Model, has a total of 62 ads with 0 love, like, sad and angry ads, and 62 neutral ads, which means that it contributes 62/62=1.0 to the neutral bar and 0/62=0 to the rest.
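The per-cluster relative distribution described in steps 2-5 can be sketched as follows. This is an illustrative Python sketch, assuming the input is a list of (cluster_id, sentiment_label) pairs; the original Cluster Analysis parser was a separate tool:

```python
from collections import Counter, defaultdict

LABELS = ["angry", "sad", "neutral", "like", "love"]

def cluster_contributions(ads):
    """ads: iterable of (cluster_id, sentiment_label) pairs.
    Returns {cluster_id: {label: fraction of that cluster's ads}}."""
    per_cluster = defaultdict(Counter)
    for cluster_id, label in ads:
        per_cluster[cluster_id][label] += 1
    result = {}
    for cluster_id, counts in per_cluster.items():
        total = sum(counts.values())
        # Each cluster's fractions sum to 1.0 across the five labels.
        result[cluster_id] = {lbl: counts[lbl] / total for lbl in LABELS}
    return result
```

A cluster containing 62 ads, all neutral, would thus contribute 1.0 to the neutral bar and 0 to the rest, matching the example above.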


  1. The relative distribution of categorical sentiment among HT LG ads using the Stanford Model (#3)




  2. The relative distribution of categorical sentiment among HT LG ads using the HT Categorical Model (#4)








    Step IV.A

    HT LEAD GENERATION - OVERALL ANALYSIS



    Step IV.A involves looking at Steps II.A and II.B from afar. In short, it portrays the results of those analyses in two tables, for Steps II.A and II.B respectively.



    Using the HT LG Ground Truth Data



    Indicator       | Netflix: POS, NEG | Stanford: ANGRY, SAD, NEUTRAL, LIKE, LOVE | HT binary: POS, NEG | HT categorical: ANGRY, SAD, NEUTRAL, LIKE, LOVE
    HT Relevant     | 3,688, 7,435      | 0, 83, 10,958, 77, 5                      | 8,838, 2,285        | 112, 3,433, 1,536, 5,904, 138
    Not HT Relevant | 5,587, 5,536      | 30, 35, 10,820, 238, 0                    | 8,550, 2,573        | 296, 3,795, 1,494, 5,538, 0


    Using the HT LG Model



    Indicator       | Netflix: POS, NEG | Stanford: ANGRY, SAD, NEUTRAL, LIKE, LOVE | HT binary: POS, NEG | HT categorical: ANGRY, SAD, NEUTRAL, LIKE, LOVE
    HT Relevant     | 8,124, 10,160     | 0, 68, 17,889, 318, 9                     | 14,833, 3,451       | 377, 5,030, 1,696, 11,088, 93
    Not HT Relevant | 9,127, 11,152     | 3, 114, 19,868, 294, 0                    | 15,164, 5,115       | 449, 6,994, 1,283, 11,553, 0




    Step IV.B

    HT LEAD GENERATION - ANALYSING THE PERFORMANCE OF THE MODEL



    Step IV.B involves analysing the performance of the HT LG Model. To do this, the dataset #1 (the pre-labeled training dataset) was divided into 80% for training and 20% for testing. The NOT_RELEVANT and VERY_RELEVANT labels remained in the new training dataset, and were removed and saved separately in the test dataset. The new model was trained on the 80% of the data and run on the 20% testing dataset. The results of this run were thereafter compared to the actual saved labels, and the matches were calculated (as true positive, false positive, true negative, false negative). All of this was performed using the Test Model Parser.

    To build the table below (and, later on, the ROC curve), the 20% portion of dataset #1 (4,449 ads in total) was evaluated:

    1. Before running the newly created 80% model on the 20% test set, the labels of all 4,449 ads were removed and saved separately.
    2. The model was run on the 4,449 ads and the predicted labels were saved.
    3. The number of ads the initial (actual) label of which was RELEVANT, and the label predicted by the model was also RELEVANT, was calculated to give the total number of TRUE POSITIVE (true relevant) ads, which in this case is 2,213.
    4. The number of ads the initial (actual) label of which was RELEVANT, and the label predicted by the model was NOT RELEVANT, was calculated to give the total number of FALSE NEGATIVE (false not relevant) ads, which in this case is 29.
    5. The number of ads the initial (actual) label of which was NOT RELEVANT, and the label predicted by the model was also NOT RELEVANT, was calculated to give the total number of TRUE NEGATIVE (true not relevant) ads, which in this case is 2,165.
    6. The number of ads the initial (actual) label of which was NOT RELEVANT, and the label predicted by the model was RELEVANT, was calculated to give the total number of FALSE POSITIVE (false relevant) ads, which in this case is 42.
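The counting in steps 3-6 above can be sketched as a small confusion-matrix tally. This is an illustrative Python sketch, assuming the saved actual labels and the model's predicted labels are available as parallel lists; the original Test Model Parser was a separate tool:

```python
def confusion_counts(actual, predicted, positive="RELEVANT"):
    """Compare saved (actual) labels with the model's predictions
    and tally the four confusion-matrix cells."""
    tp = fn = tn = fp = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1   # true relevant
            else:
                fn += 1   # false not relevant
        else:
            if p != positive:
                tn += 1   # true not relevant
            else:
                fp += 1   # false relevant
    return {"TP": tp, "FN": fn, "TN": tn, "FP": fp}
```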


    The ROC Curve was later on built from the results of the 80:20 run using the ROC creator. To get the X-axis and Y-axis coordinates for plotting the curve, the False Positive Rate (FPR) and the True Positive Rate (TPR) were calculated for each test ad in the evaluated 20%. To calculate the TPR, the current (running) count of true positives (or, in our case, true relevant ads) was divided by the total count of true positives (as seen in the table). The same was done for the FPR, but using the false positive (or false relevant) count instead.
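The coordinate calculation just described can be sketched as follows. This is an illustrative Python sketch of the procedure as stated (running counts divided by the totals from the confusion table); the input format, a per-ad outcome tag, is an assumption:

```python
def roc_points(outcomes, total_tp, total_fp):
    """outcomes: per-ad tags "TP", "FP", "TN" or "FN", in evaluation order.
    Returns one (FPR, TPR) point per ad, where TPR is the running
    true-positive count over the total true positives, and FPR is the
    running false-positive count over the total false positives."""
    tp = fp = 0
    points = []
    for o in outcomes:
        if o == "TP":
            tp += 1
        elif o == "FP":
            fp += 1
        points.append((fp / total_fp, tp / total_tp))
    return points
```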



    The same procedure was followed in Step V (Hybrid Model analysis).



    Dividing the train set into 80:20 (80% = training; 20% = testing)



                        | Predicted RELEVANT | Predicted NOT RELEVANT
    Actual RELEVANT     | 2,213              | 29
    Actual NOT RELEVANT | 42                 | 2,165
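From the four cells of this table the standard summary metrics follow directly; a quick worked calculation (the metric names are standard, not from the original report):

```python
# Figures taken directly from the 80:20 confusion table above.
tp, fn, fp, tn = 2213, 29, 42, 2165

accuracy  = (tp + tn) / (tp + fn + fp + tn)   # fraction of all 4,449 ads labeled correctly
precision = tp / (tp + fp)                    # fraction of predicted RELEVANT that were RELEVANT
recall    = tp / (tp + fn)                    # fraction of actual RELEVANT that were found
```

All three come out above 0.98 on this split.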




    Plotting an ROC curve on the 80:20 data







    Step V

    HT LEAD GENERATION - THE HYBRID MODEL



    Step V involves building a Hybrid Model using the Hybrid Model builder. The purpose of the model is to determine the HT Relevancy of an ad with more precision. In short, to build the training dataset the following steps were undertaken:

    1. Use the pre-labeled dataset (#1) and save its labels only
    2. Run the Stanford model on the ads from dataset #1 and give each ad a label of yes (indicating that the ad was labeled with "love") or no (indicating that the ad was labeled as either "like", "neutral", "sad" or "angry").
    3. Run the HT Categorical model on the ads and give each ad a label the same way.
    4. Run the GeoTopicParser on each ad and save the main geolocation found.
    5. Use that main geolocation as the next label.
    6. Parse each ad's text and search for the presence of negation ("no"). If it is present in the ad, save "yes" as the next label; otherwise save "no".
    7. As a result each line of the training file (of 22,246 lines in total) looked like this: NOT_RELEVANT no, no, New York City, no
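Assembling one such training line can be sketched as follows. This is an illustrative Python sketch; the four feature values are assumed to have already been produced by the Stanford model, the HT categorical model, GeoTopicParser and the negation check respectively:

```python
def hybrid_line(label, stanford_love, ht_love, geolocation, has_negation):
    """Assemble one hybrid-model training line in the format described
    above, e.g. 'NOT_RELEVANT no, no, New York City, no'."""
    def yn(flag):
        return "yes" if flag else "no"
    features = ", ".join([yn(stanford_love), yn(ht_love),
                          geolocation, yn(has_negation)])
    return "{} {}".format(label, features)
```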


    1. Running the Hybrid Model on the HT LG Data (dataset #2)
      • RELEVANT: 20,396
      • NOT RELEVANT: 18,167

      Compared to HT LG Binary model (#5) results on the same data:

      • RELEVANT: 18,284
      • NOT RELEVANT: 20,279






    Dividing the hybrid train set into 80:20 (80% = training; 20% = testing)



                        | Predicted RELEVANT | Predicted NOT RELEVANT
    Actual RELEVANT     | 1,979              | 228
    Actual NOT RELEVANT | 671                | 1,571




    Plotting an ROC curve on the 80:20 data







    HT LEAD GENERATION - HYBRID MODEL VERSION II



    In this case a new version of the Hybrid Model was built that includes the top 5 geolocations as outputted by the GeoTopicParser rather than just one, as seen in the first version of the Hybrid Model (#6). As a result, an example of an element in the training dataset looks like this: RELEVANT no; no; Oakland County, Cumberland East Bay, San Fernando Valley Division; no



    Running the Hybrid Model Version II on the HT LG Data (dataset #2)



    Plotting an ROC curve on the 80:20 data for Hybrid Model V2







    Step VI

    COMPARING THE TWO HT LG MODELS (#5 AND #6)



    Step VI involves looking at the two Binary HT Lead Generation models (#5 and #6) once again and giving a more precise depiction of their performance, to conclude which of the two acts as the better and more reliable model for HT relevancy.

    During the training of the model, Apache OpenNLP provides the user with the accuracy rate for each of the 100 iterations. Looking at the source code, this accuracy is calculated by running the model on the training dataset (i.e. the dataset the model has already seen) and computing the number of correct predictions divided by the total number of events. Curves #1 and #3 depict this accuracy for the HT LG Model (#5) and the Hybrid Model (#6) respectively. However, as mentioned earlier, the model had already seen the data it was being tested on, which makes such an analysis not very reliable. To calculate the same accuracy for the test set (20%), some of the OpenNLP source code was used and modified. The results of this analysis can be seen in curves #2 and #4 for the HT LG Model (#5) and the Hybrid Model (#6) respectively.
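The accuracy measure described, correct predictions over total events, is straightforward; a minimal sketch (Python rather than the modified OpenNLP Java code):

```python
def event_accuracy(predictions, actual):
    """OpenNLP-style accuracy: correct predictions / total events."""
    correct = sum(p == a for p, a in zip(predictions, actual))
    return correct / len(actual)
```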



    1. HT LG Model: Accuracy on the training set
      • Iteration 1: 0.625028096
      • Iteration 100: 0.979489773


    2. HT LG Model: Accuracy on the test set
      • Iteration 1: 0
      • Iteration 100: 0.333258427


    3. Hybrid Model: Accuracy on the training set
      • Iteration 1: 0.374971904
      • Iteration 100: 0.841818386


    4. Hybrid Model: Accuracy on the test set
      • Iteration 1: 0.201797753
      • Iteration 100: 0.520674157






    Conclusions:





    Comparing two Hybrid Models (#6 and #7)





    1. Hybrid Model: Accuracy on the training set
      • Iteration 1: 0.374971904
      • Iteration 100: 0.841818386


    2. Hybrid Model: Accuracy on the test set
      • Iteration 1: 0.201797753
      • Iteration 100: 0.520674157


    3. Hybrid Model Version II: Accuracy on the training set
      • Iteration 1: 0.374971904
      • Iteration 100: 0.861373342


    4. Hybrid Model Version II: Accuracy on the test set
      • Iteration 1: 0.19658879
      • Iteration 100: 0.516853933








    Conclusions:





    Step VII.A (NEW)

    DEBIASED HT LG DATA ANALYSIS - the HT LG Ground Truth



    Step VII.A involves performing the same steps as in Step II.A



    1. Correlation between the binary sentiment and the ground truth of the debiased HT data (Netflix model)
    2. RELEVANT
      • Positive: 2,131
      • Negative: 5,282
      NOT RELEVANT
      • Positive: 37,102
      • Negative: 54,819




    3. Correlation between the binary sentiment and the ground truth of the debiased HT data (HT Review model)
    4. RELEVANT
      • Positive: 5,586
      • Negative: 1,827
      NOT RELEVANT
      • Positive: 63,771
      • Negative: 28,150




    5. Correlation between the categorical sentiment and the ground truth of the debiased HT data (Stanford model)
    6. RELEVANT
      • Angry: 0
      • Sad: 72
      • Neutral: 7,297
      • Like: 44
      • Love: 0
      NOT RELEVANT
      • Angry: 15
      • Sad: 1,514
      • Neutral: 89,105
      • Like: 1,280
      • Love: 0




    7. Correlation between the categorical sentiment and the ground truth of the debiased HT data (HT Review model)
    8. RELEVANT
      • Angry: 54
      • Sad: 2,752
      • Neutral: 699
      • Like: 3,853
      • Love: 55
      NOT RELEVANT
      • Angry: 1,980
      • Sad: 36,917
      • Neutral: 5,827
      • Like: 45,989
      • Love: 0








    Step VII.B (NEW)

    DEBIASED HT LG DATA ANALYSIS - the HT LG Model II



    Step VII.B involves performing the same steps as in Step II.B. The initial debiased labeled dataset consisted of 99,334 elements, only 7,431 of which were labeled as RELEVANT. When a new HT LG model was trained on that dataset, almost all of the analysed ads were classified as NOT_RELEVANT due to the completely unbalanced nature of the training dataset. Because of that, the training dataset was balanced out, resulting in a total of 14,826 elements (7,431 RELEVANT and 7,431 NOT_RELEVANT). A new model was trained on that dataset and run on the unlabeled dataset #2. The results were as follows:



    1. Correlation between the binary sentiment and the truth of the debiased HT data (Netflix model)
    2. RELEVANT
      • Positive: 4,285
      • Negative: 6,509
      NOT RELEVANT
      • Positive: 12,966
      • Negative: 14,803




    3. Correlation between the binary sentiment and the truth of the debiased HT data (HT Review model)
    4. RELEVANT
      • Positive: 6,597
      • Negative: 4,197
      NOT RELEVANT
      • Positive: 23,400
      • Negative: 4,369




    5. Correlation between the categorical sentiment and the truth of the debiased HT data (Stanford model)
    6. RELEVANT
      • Angry: 2
      • Sad: 109
      • Neutral: 10,550
      • Like: 132
      • Love: 1
      NOT RELEVANT
      • Angry: 1
      • Sad: 73
      • Neutral: 27,204
      • Like: 480
      • Love: 0




    7. Correlation between the categorical sentiment and the truth of the debiased HT data (HT Review model)
    8. RELEVANT
      • Angry: 246
      • Sad: 4,663
      • Neutral: 775
      • Like: 5,048
      • Love: 62
      NOT RELEVANT
      • Angry: 580
      • Sad: 7,261
      • Neutral: 2,104
      • Like: 17,571
      • Love: 0












    Step VIII

    EXPLAINING THE WORKFLOW



    Step VIII involves analysing and explaining the flow of the work done in the previous steps.



    1. Step II.A: HT LEAD GENERATION ANALYSIS - the HT LG Ground Truth Data




    2. Step II.B: HT LEAD GENERATION ANALYSIS - the HT LG Model




    3. Step III: HT LEAD GENERATION CLUSTER ANALYSIS - the HT LG Model




    4. Step IV.B: HT LEAD GENERATION - ANALYSING THE PERFORMANCE OF THE MODEL




    5. Step V: HT LEAD GENERATION - THE HYBRID MODEL




    6. Step VI: COMPARING THE TWO HT LG MODELS (#5 AND #6)