USC IRDS - CS599 - Content Detection and Analysis for Big Data - Spring 2018

CS599: Content Detection and Analysis for Big Data

Class Info

Spring Semester, 2018
Location: MHP 101
Time: Th 3:30-6:50pm
Class number: 30132D

Instructor

Dr. Chris Mattmann
sunset.usc.edu/~mattmann
E-Mail: chris.a.mattmann@jpl.nasa.gov
Office Hours: 2:30pm-3:30pm RTH 512 (right before class)

Teaching Assistant

Simin Karvigh
ahmadika@usc.edu
Office Hours: Thursday 9:00am-10:00am KAP 217

Grader

Ming-Chang (Eric) Chiu
mingchac@usc.edu

CS599 Overview

This course is designed as an advanced course in data analytics and big data. The course introduces students to the area of content detection and analysis. This involves understanding digital file formats, detecting them, and extracting data from them. Emphasis areas include Document Type Detection; Parsing and Extraction; Metadata Understanding and Analysis; Language Identification and Detection from files; and File Formats and Representation. The class also has a specific focus on Content Detection and Analysis from large data sets. Datasets used in the course are publicly collected by the instructor or his collaborators involved in national Big Data initiatives, including DARPA, NASA and other projects. The course is designed to be accessible to students with experience programming in Java and in Python at an intermediate level. The first half of the course focuses on Java, using the Tika framework as the core technology for instruction. The instructor is the co-inventor of Tika and has deep experience in the technology and in search engine technology from Apache. The second half of the course introduces students to the use of Python programming for Content Detection and Analysis using Tika, ElasticSearch™, Solr, Nutch and Apache Hadoop™. The course will be a combination of lecture, in-class discussion, readings, group-based assignments and a final exam.

The objective of this course is to train students to be able to understand file formats, their representation, and how to automatically extract information from large datasets of files. Specifically, students successfully completing this course will achieve three main objectives:

Develop sufficient proficiency in the Tika framework to write software capable of automatically identifying files, extracting information from them including their text and metadata and language.
Develop sufficient proficiency in Information Retrieval and Data Extraction techniques with Large Data sets collected from the Web and other places (Intranet, Science Data Sets, Public Data Sets).
Develop sufficient proficiency in Java and Python to write and execute software that is “File Aware” and that automatically extracts text and metadata from large data sets.

The primary teaching methods will be discussion, case studies, and lectures. Students are expected to perform directed self-learning outside of class, which encompasses, among other things, a considerable amount of literature review. In addition, the class will directly leverage open source software and partnerships from the instructor, who sits on the Board of Directors at the Apache Software Foundation. Projects associated with the course make direct contributions to Apache Licensed (“ALv2”) open source software projects at the student’s discretion. Leadership training in open source is provided and encouraged, and students leave with experience in open source that makes them more marketable to companies and institutions looking to hire in content detection and analysis, Big Data, and Data Science.

In addition to foundations and practical experience with content detection and analysis, the class will also introduce the student to the state of the art in content detection research, future trends, and the state of the practice. Students are expected to attend class regularly, participate (as directed) in all class discussions, and, most importantly, have fun!

USC ACADEMIC INTEGRITY

Statement on Academic Conduct and Support Systems

Academic Conduct

Plagiarism - presenting someone else's ideas as your own, either verbatim or recast in your own words - is a serious academic offense with serious consequences. Please familiarize yourself with the discussion of plagiarism in SCampus in Section 11, Behavior Violating University Standards. Other forms of academic dishonesty are equally unacceptable. See additional information in SCampus and university policies on scientific misconduct. Discrimination, sexual assault, and harassment are not tolerated by the university. You are encouraged to report any incidents to the Office of Equity and Diversity or to the Department of Public Safety. This is important for the safety of the whole USC community. Another member of the university community - such as a friend, classmate, advisor, or faculty member - can help initiate the report, or can initiate the report on behalf of another person. The Center for Women and Men provides 24/7 confidential support, and the sexual assault resource center webpage sarc@usc.edu describes reporting options and other resources.

Support Systems

A number of USC's schools provide support for students who need help with scholarly writing. Check with your advisor or program staff to find out more. Students whose primary language is not English should check with the American Language Institute, which sponsors courses and workshops specifically for international graduate students. The Office of Disability Services and Programs provides certification for students with disabilities and helps arrange the relevant accommodations. If an officially declared emergency makes travel to campus infeasible, USC Emergency Information will provide safety and other updates, including ways in which instruction will be continued by means of Blackboard, teleconferencing, and other technology.

Statement on Diversity

The diversity of the participants in this course is a valuable source of ideas, problem solving strategies, and engineering creativity. We encourage and support the efforts of all of our students to contribute freely and enthusiastically. We are members of an academic community where it is our shared responsibility to cultivate a climate where all students and individuals are valued and where both they and their ideas are treated with respect, regardless of their differences, visible or invisible.

TEXTBOOK

Chris A. Mattmann and Jukka Zitting. Tika in Action, 256 pages. New York: Manning Publications, November 2011. ISBN: 9781935182856.

ASSIGNMENTS and EXAMINATIONS

Name	Description	Weight
Exam	An exam testing your understanding of the lecture materials including digital file formats, their detection, parsing and extraction, metadata, language identification and translation, etc.	25%
Assignments	Assignments where you will build on the content detection and analysis topics in the course and make a contribution to one of the existing Apache content detection and IR technologies (Nutch, Lucene, Solr, Tika, OODT, etc.).	45%
Individual Presentation	An individual presentation demonstrating the student's understanding of one of the required paper readings in the course.	25%
Participation	Participation in lectures, by asking questions and contributing to the conversation. Showing up to lectures and positively contributing to the class experience.	5%

Project Submission Guidelines

Submission guidelines will be specified in each assignment.

Schedule (subject to change; check regularly)

Week	Lecture Topic	Assigned Readings	Assignments & Deadlines
1 (Jan 11th, 2018)	Course Introduction Introduction to Big Data DARPA XDATA Program - Overview Slides Breakout Groups on Big Data	Tika in Action, Chapter 1 Mattmann, Chris. A vision for data science. Nature, Vol. 493, No. 7433, pp. 473-475, January 24, 2013. Lynch, Clifford. "Big data: How do your data grow?." Nature 455.7209 (2008): 28-29. Howe, Doug, et al. "Big data: The future of biocuration." Nature 455.7209 (2008): 47-50.(Presentation by: Mantripragada, Anurag Sai Raghav Krishna) Wigan, Marcus R., and Roger Clarke. "Big data's big unintended consequences." Computer 46.6 (2013): 46-53. Schwartz, J. A. N. A., et al. "Measuring the value of Big Data exploitation systems: Quantitative, non-subjective metrics with the user as a key component." Parsons Journal for Information Mapping 6 (2014): 1-12.(Presentation by: Chopra, Prince) Sotera Defense Solutions. A Survey of Big Data Methods, Assessments, and Approaches. November 2012 De Mauro, Andrea, Marco Greco, and Michele Grimaldi. "What is big data? A consensual definition and a review of key research topics." AIP conference proceedings. Vol. 1644. No. 1. AIP, 2015.	Resources: DARPA I2O (DARPA Dan) Video Zero Dark Thirty Workbench Video
2 (Jan 18th, 2018)	Report out from Big Data Breakouts A Taxonomy of File Formats Content Detection Libraries Language Bindings for Apache Tika Individual Student Presentations - Week 1 Papers	Tika in Action, Chapter 2 Crocker, David. RFC 822 "Standard for the format of ARPA Internet text messages." (1982).(Presentation by: Ashcraft, Teague, Kristian) Freed, Ned and Nathaniel Borenstein. RFC 1341. MIME (Multipurpose Internet Mail Extensions). Mechanisms for Specifying and Describing the Format of Internet Message Bodies. June 1992. Freed, Ned, and Nathaniel Borenstein. RFC 2045. Multipurpose internet mail extensions (MIME) part one: Format of internet message bodies. 1996.(Presentation by: Ghaisas, Sayali, Makarand) Freed, Ned, and Nathaniel Borenstein. RFC 2046 Multipurpose internet mail extensions (MIME) part two: Media types, November, 1996.(Presentation by: Hanumegowda, Vikas) Freed, Ned. RFC 2048 "Multipurpose internet mail extensions (MIME) part four: Registration procedures." ISI (1996). Hicks, Ben J., et al. "Organizing and managing personal electronic files: A mechanical engineer's perspective." ACM Transactions on Information Systems (TOIS) 26.4 (2008): 23. Shim, Jungwon Roy. "Arium: Beyond the Desktop Metaphor: A new way of navigating, searching, and organizing personal digital data." Master's Thesis, Carnegie Mellon University (2012). Crowder, Jerome, Jonathan Marion, and Michele Reilly. "File Naming in Digital Media Research: Examples from the Humanities and Social Sciences." Journal of Librarianship and Scholarly Communication 3.3 (2015). Jackson, Andrew N. "Formats over time: Exploring UK web history." arXiv preprint arXiv:1210.1714 (2012).
3 (Jan 25th, 2018)	Finish report out from Big Data Breakouts In class discussion around classifying files and the MIME taxonomy Document Similarity and Deduplication Individual Presentations - Week 2 Papers	Tika in Action, Chapter 3 Bik, Elisabeth M., Casadevall, Arturo, Fang, Ferrie C. The Prevalence of Inappropriate Image Duplication in Biomedical Research Publications.(Presentation by: Megotia, Abhishek) Manku, Gurmeet Singh, Arvind Jain, and Anish Das Sarma. "Detecting near-duplicates for web crawling." Proceedings of the 16th international conference on World Wide Web. ACM, 2007.(Presentation by: Pei, Yulong) Henzinger, Monika. "Finding near-duplicate web pages: a large-scale evaluation of algorithms." Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2006.(Presentation by: Prakunhungsit, Surasit) Cooper, Matthew, Jonathan Foote, and Andreas Girgensohn. "Automatically organizing digital photographs using time and content." Image Processing, 2003. ICIP 2003. Proceedings. 2003 International Conference on. Vol. 3. IEEE, 2003.(Presentation by: Venkataram, Hamsa Shwetha) Manber, Udi. "Finding similar files in a large file system." Usenix Winter. Vol. 94. 1994. Chim, Hung, and Xiaotie Deng. "Efficient phrase-based document similarity for clustering." IEEE Transactions on Knowledge and Data Engineering 20.9 (2008): 1217-1229.	Resources: The MIME guys: How two Internet gurus changed e-mail forever IEEE Computer Interview with Nathaniel Borenstein
4 (Feb 1st, 2018)	Remaining Big Data Presentation by Prerana Group breakout session on MIME taxonomy Document Type Detection Individual Presentations - Week 3 papers Advanced File System Statistics and Understanding	Tika in Action, Chapter 4 Amirani, Mehdi Chehel, Mohsen Toorani, and A. Beheshti. A new approach to content-based file type detection. Computers and Communications, 2008. ISCC 2008. IEEE Symposium on. IEEE, 2008.(Presentation by: Mukherjee, Koustav) McDaniel, Mason, and M. Hossain Heydari. Content based file type detection algorithms. System Sciences, 2003. Proceedings of the 36th Annual Hawaii International Conference on. IEEE, 2003.(Presentation by: Vishwanath Bhadrappa, Kavya) Alamri, Nasser S., and William H. Allen. "A comparative study of file-type identification techniques." SoutheastCon 2015. IEEE, 2015.(Presentation by: Gumblapura Balasubramanya, Sachin Kumar) Li, Wei-Jen, et al. "Fileprints: Identifying file types by n-gram analysis." Information Assurance Workshop, 2005. IAW'05. Proceedings from the Sixth Annual IEEE SMC. IEEE, 2005.(Presentation by: Kothari, Dipti, Dineshkumar) Shahi, Ashim. "Classifying the classifiers for file fragment classification." Masters Thesis, Universiteit van Amsterdam (2012). Ahmed, Irfan, et al. "Fast file-type identification." Proceedings of the 2010 ACM Symposium on Applied Computing. ACM, 2010. Pierris, Georgios, and Stilianos Vidalis. "Forensically classifying files using HSOM algorithms." Emerging Intelligent Data and Web Technologies (EIDWT), 2012 Third International Conference on. IEEE, 2012. Harris, Ryan M. "Using artificial neural networks for forensic file type identification." Master's Thesis, Purdue University (2007). Douceur, John R., and William J. Bolosky. A large-scale study of file-system contents. ACM SIGMETRICS Performance Evaluation Review 27.1 (1999): 59-70.	Resources: The Economist - Digital Bit Rot Bit Rot - on Digital Vellum (by Vint Cerf, Google) TEDx Talk
5 (Feb 8th, 2018)	Read out from group breakout sessions on MIME taxonomy Video about BitRot by Vint Cerf Content Extraction Individual Presentations - Week 4 papers	Tika in Action, Chapter 5 Kilicoglu, Halil, et al. "Semantic MEDLINE: a web application for managing the results of PubMed Searches." Proceedings of the third international symposium for semantic mining in biomedicine. Vol. 2008. 2008.(Presentation by: Patel, Dixita, Vasanbhai) Kobayashi, Mei, and Koichi Takeda. "Information retrieval on the web." ACM Computing Surveys (CSUR) 32.2 (2000): 144-173.(Presentation by: Patel, Shrushti Shaileshkumar) Voorhees, Ellen M., and Donna Harman. "Overview of the sixth text retrieval conference (TREC-6)." Information Processing & Management 36.1 (2000): 3-35.(Presentation by: Sunder, Arathi Mary) Arasu, Arvind, and Hector Garcia-Molina. Extracting structured data from web pages. Proceedings of the 2003 ACM SIGMOD international conference on Management of data. ACM, 2003.(Presentation by: Taekasem, Soravis) Lewandowski, Dirk. "Web searching, search engines and Information Retrieval." Information Services & Use 25.3, 4 (2005): 137-147. Weninger, Tim, William H. Hsu, and Jiawei Han. "CETR: content extraction via tag ratios." Proceedings of the 19th international conference on World wide web. ACM, 2010. Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.(Presentation by: Rohira, Vritti Sanjesh)	Resources: Introduction to Information Retrieval - Chris Mattmann
6 (Feb 15th, 2018)	Introduction to Assignment 1 Make up Week 4 Individual Presentations (Vishwanath Bhadrappa, Kavya, Gumblapura Balasubramanya, Sachin Kumar, Kothari, Dipti, Dineshkumar) Individual Presentations - week 5 papers	Tika in Action, Chapter 6 Gowda, Thamme, and Chris A. Mattmann. "Clustering Web Pages Based on Structure and Style Similarity (Application Paper)." Information Reuse and Integration (IRI), 2016 IEEE 17th International Conference on. IEEE, 2016.(Presentation by: Jadwani, Kavish) Anquetil, Nicolas, and Timothy Lethbridge. File clustering using naming conventions for legacy systems. Proceedings of the 1997 conference of the Centre for Advanced Studies on Collaborative research. IBM Press, 1997.(Presentation by: Mohana, Akshatha) Swierk, Edward, et al. "The Roma personal metadata service." Mobile Networks and Applications 7.5 (2002): 407-418.(Presentation by: Kishore Kumar, Vaishnavi) Steinbach, Michael, George Karypis, and Vipin Kumar. "A comparison of document clustering techniques." KDD workshop on Text Mining. 2000.(Presentation by: Shi, Yue) Marchionini, Gary. "Exploratory search: from finding to understanding." Communications of the ACM 49.4 (2006): 41-46.
7 (Feb 22nd, 2018)	Video - Understanding Metadata TedX talk Understanding Metadata Information Clustering Make up Week 5 Individual Presentations (Rohira, Vritti Sanjesh) Week 6 Individual Presentations	Tika in Action, Chapter 7 Koehn, Philipp, et al. "Moses: Open source toolkit for statistical machine translation." Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions. Association for Computational Linguistics, 2007.(Presentation by: Chirumamilla, Lakshmi, Priyanka) Post, Matt, et al. "Joshua 5.0: Sparser, better, faster, server." Proceedings of the Eighth Workshop on Statistical Machine Translation. 2013.(Presentation by: Giudice, Pablo, Javier) Lins, Rafael Dueire, and Paulo Gonçalves. Automatic language identification of written texts. Proceedings of the 2004 ACM symposium on Applied computing. ACM, 2004.(Presentation by: Lodha, Sanchit) Papineni, Kishore, et al. "BLEU: a method for automatic evaluation of machine translation." Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002. Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014). Tromp, Erik, and Mykola Pechenizkiy. "Graph-based n-gram language identification on short texts." Proc. 20th Machine Learning conference of Belgium and The Netherlands. 2011. Lopez-Moreno, Ignacio, et al. "Automatic language identification using deep neural networks." Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014.(Presentation by: Halageri, Akhilesh) Bertoldi, Nicola, et al. "MMT: New open source MT for the translation industry." Proceedings of The 20th Annual Conference of the European Association for Machine Translation (EAMT). 2017.	Resources: The Power of Metadata: Deepak Jagdish and Daniel Smilkov at TEDxCambridge 2013
8 (March 1st, 2018)	Video - Linguistic Forensics Group In Class Activity - Understanding Metadata Language Identification Week 7 Individual Presentations Named Entity Recognition Machine Translation	Tika in Action, Chapter 8 Tjong Kim Sang, Erik F., and Fien De Meulder. "Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition." Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4. Association for Computational Linguistics, 2003.(Presentation by: Hachuel Baldridge, Eric, John Dylan) Nadeau, David, and Satoshi Sekine. "A survey of named entity recognition and classification." Lingvisticae Investigationes 30.1 (2007): 3-26.(Presentation by: Mukar, Pavneet Kaur) Ritter, Alan, Sam Clark, and Oren Etzioni. "Named entity recognition in tweets: an experimental study." Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011.(Presentation by: Natarajan, Aditya) Mattmann, Chris A., and Madhav Sharan. "An automatic approach for discovering and geocoding locations in domain-specific web data." Proceedings of the 2016 IEEE 17th International Conference on Information Reuse and Integration (IRI’16). 2016. Khodak, Mikhail, Nikunj Saunshi, and Kiran Vodrahalli. "A Large Self-Annotated Corpus for Sarcasm." arXiv preprint arXiv:1704.05579 (2017).(Presentation by: Thakur, Vinita, Ram) Hutto, Clayton J., and Eric Gilbert. "Vader: A parsimonious rule-based model for sentiment analysis of social media text." Eighth international AAAI conference on weblogs and social media. 2014. Geyer, Kelly, et al. "Named Entity Recognition in 140 Characters or Less." # Microposts. 2016.	Resources: Forensic Linguistic Profiling & What Your Language Reveals About You | Harry Bradford | TEDxStoke The Shallowness of Google Translate - The Atlantic - Douglas Hofstadter Apache cTAKES Apache UIMA Apache OpenNLP Stanford Core NER/NLP NLTK
9 (March 8th, 2018)	Video on Open Software Development - Wayne Moses Burke TedX talk Discussion on Homework 1 Outcomes Exam Review Exam	Location: MHP 101	Resources: How to Save the World | Wayne Moses Burke | TEDxBeaconStreet
10 (March 15th, 2018)	No Class This Week	No Class This Week - Spring Break (March 11-18, 2018)
11 (March 22nd, 2018)	Readout - In Class Activity on Understanding Metadata - Group Presentations Individual Presentations - Week 8 papers (Eric John Dylan Hachuel Baldridge, Pavneet Kaur Mukar, Aditya Natarajan, Vinita Ram Thakur) Discussion on Named Entity Recognition Hadoop Spark and Tika: Large Scale Content Detection and Analysis	Tika in Action, Chapter 9 Dean, Jeffrey, and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM 51.1 (2008): 107-113.(Presentation by: Nandakumar, Srinidhi) Zaharia, Matei, et al. Spark: cluster computing with working sets. Proceedings of the 2nd USENIX conference on Hot topics in cloud computing. Vol. 10. 2010.(Presentation by: Nadhavajhala, Sanjay) Elsayed, Tamer, Jimmy Lin, and Douglas W. Oard. "Pairwise document similarity in large collections with MapReduce." Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers. Association for Computational Linguistics, 2008.(Presentation by: Singh, Akashdeep) M. Bernaschi, M. Cianfriglia, A. Di Marco, A. Sabellico, G. Me, G. Carbone, G. Totaro. Forensic Disk Image Indexing and Search in an HPC environment. IEEE International Conference on High Performance Computing & Simulation (HPCS), 2014.(Presentation by: Wu, Yifan) Meusel, Robert, Peter Mika, and Roi Blanco. "Focused crawling for structured data." Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. ACM, 2014. Niu, Feng, et al. "DeepDive: Web-scale Knowledge-base Construction using Statistical Learning and Inference." VLDS 12 (2012): 25-28. Mattmann, C. A., Oh, J. H., Palsulich, T., McGibbney, L. J., Gil, Y., & Ratnakar, V. (2015, November). DRAT: An Unobtrusive, Scalable Approach to Large Scale Software License Analysis. In Automated Software Engineering Workshop (ASEW), 2015 30th IEEE/ACM International Conference on (pp. 97-101). IEEE.	Resources: Apache Distributed Release Audit Tool (DRAT) DRAT Video
12 (March 29th, 2018)	Introduction to Assignment 2 Open Source Content Detection Technologies TedXTalk Video: Analyzing and modeling complex and big data Discussion on open source technologies – ApacheCon 2015 talk In-Class Group Activity on Named Entity Recognition Individual Presentations - Thakur, Vinita, Ram (week 8 unfinished), one week 12 presentation (Akarsh Goyal) and week 11 presentations, Nandakumar, Srinidhi, Nadhavajhala, Sanjay, Singh, Akashdeep, Wu, Yifan	Tika in Action, Chapter 10 Białecki, Andrzej, et al. "Apache lucene 4." SIGIR 2012 workshop on open source information retrieval. 2012.(Presentation by: Goyal, Akarsh) Turtle, Howard, Yatish Hegde, and S. Rowe. "Yet another comparison of lucene and indri performance." SIGIR 2012 Workshop on Open Source Information Retrieval. 2012.(Presentation by: Raju, Renuka) Bontcheva, Kalina, et al. "TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text." RANLP. 2013.(Presentation by: Nataraju, Akshay) Cunningham, Hamish. "GATE, a general architecture for text engineering." Computers and the Humanities 36.2 (2002): 223-254.(Presentation by: Sundaram, Aditya) Atserias, Jordi, et al. "FreeLing 1.3: Syntactic and semantic services in an open-source NLP library." Proceedings of LREC. Vol. 6. 2006. Manning, Christopher D., et al. "The stanford corenlp natural language processing toolkit." ACL (System Demonstrations). 2014.(Presentation by: Deviah, Shiva) Savova, Guergana K., et al. "Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications." Journal of the American Medical Informatics Association 17.5 (2010): 507-513.	Resources: TedXTalk Analyzing and modeling complex and big data | Professor Maria Fasli | TEDxUniversityofEssex Textract Scrapy Mattmann ApacheCon 2015 Content Talk
13 (April 5th, 2018)	Readout on in-class group discussion: Named Entity Recognition Evaluating Content Detection Individual Presentations - week 11 presentations - Singh, Akashdeep, Wu, Yifan, week 12 presentations - Raju, Renuka, Nataraju, Akshay, Sundaram, Aditya, Deviah, Shiva	Tika in Action, Chapter 11 Nowell, Lucy Terry, et al. "Visualizing search results: some alternatives to query-document similarity." Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 1996.(Presentation by: Bhardwaj, Arpit) Shneiderman, Ben. "The eyes have it: A task by data type taxonomy for information visualizations." Visual Languages, 1996. Proceedings., IEEE Symposium on. IEEE, 1996.(Presentation by: Duan, Weiwei) Gottron, Thomas. "Evaluating content extraction on HTML documents." Proceedings of the 2nd International Conference on Internet Technologies and Applications (ITA’07). 2007.(Presentation by: Radhakrishnan, Rahul) Leuski, Anton. "Evaluating document clustering for interactive information retrieval." Proceedings of the tenth international conference on Information and knowledge management. ACM, 2001.(Presentation by: Le, Khanh, Duy) Bailey, Peter, et al. "Evaluating search systems using result page context." Proceedings of the third symposium on Information interaction in context. ACM, 2010.(Presentation by: Teligi Harapanahalli Math, Prerana)	Resources: This is What happens when you reply to SPAM e-Mail. - James Veitch - Ted Talk
14 (April 12th, 2018)	Lecture Evaluating Content Detection & Analysis Lecture on NoSQL Discussion of Homework #3 Individual Presentations – week 12 - Deviah, Shiva, Individual Presentations – week 13 - Bhardwaj, Arpit, Duan, Weiwei, Radhakrishnan, Rahul, Le, Khanh, Duy, Teligi Harapanahalli Math, Prerana	Palamuttam, Rahul, et al. "SciSpark: Applying in-memory distributed computing to weather event detection and tracking." Big Data (Big Data), 2015 IEEE International Conference on. IEEE, 2015.(Presentation by: Ganatra, Khyati, Kamlesh) Leavitt, Neal. "Will NoSQL databases live up to their promise?." Computer 43.2 (2010).(Presentation by: Priya, Parul) Stonebraker, Michael. "SQL databases v. NoSQL databases." Communications of the ACM 53.4 (2010): 10-11.(Presentation by: Sood, Aashish) Stonebraker, Michael. "Stonebraker on NoSQL and enterprises." Communications of the ACM 54.8 (2011): 10-11.(Presentation by: Mazetti, Bruno, Augusto Branco) Rafique, Ansar, et al. "On the performance impact of data access middleware for nosql data stores." IEEE Transactions on Cloud Computing (2015). Moniruzzaman, A. B. M., and Syed Akhter Hossain. "Nosql database: New era of databases for big data analytics-classification, characteristics and comparison." arXiv preprint arXiv:1307.0191 (2013).	Resources:
15 (April 19th, 2018)	Video - Scientific Data: Water and Snow in the Western US Searching Scientific Datasets Week 14 Individual Presentations - Ganatra, Khyati, Kamlesh, Riya, Parul, Sood, Aashish, Mazetti, Bruno, Augusto Branco, Week 15 Individual Presentations - Pujara, Dhairya, Atul, Asfaw, Matheos, Chawannakul, Theerapat, Le, Kevin, Khoi Nam	Tika in Action, Chapter 12 - 14 C. Mattmann, D. Freeborn, D. Crichton, B. Foster, A. Hart, D. Woollard, S. Hardman, P. Ramirez, S. Kelly, A. Y. Chang, C. E. Miller. A Reusable Process Control System Framework for the Orbiting Carbon Observatory and NPP Sounder PEATE missions. In Proceedings of the 3rd IEEE Intl Conference on Space Mission Challenges for Information Technology (SMC-IT 2009), pp. 165-172, July 19 - 23, 2009. Wilkinson, Mark D., et al. "The FAIR Guiding Principles for scientific data management and stewardship." Scientific data 3 (2016): 160018.(Presentation by: Pujara, Dhairya, Atul) Buneman, Peter, et al. "Archiving scientific data." ACM Transactions on Database Systems (TODS) 29.1 (2004): 2-42.(Presentation by: Asfaw, Matheos) Fox, Peter, and James Hendler. "Changing the equation on scientific data visualization." Science 331.6018 (2011): 705-708.(Presentation by: Chawannakul, Theerapat) Plale, Beth, et al. "Active management of scientific data." IEEE Internet Computing 9.1 (2005): 27-34.(Presentation by: Le, Kevin, Khoi Nam) Gray, Jim, et al. "Scientific data management in the coming decade." ACM SIGMOD Record 34.4 (2005): 34-41. Ailamaki, Anastasia, Verena Kantere, and Debabrata Dash. "Managing scientific data." Communications of the ACM 53.6 (2010): 68-78.	Resources: Thomas Painter of NASA JPL speaks at TEDxIS on a thought provoking take on Climate Change.
16 (April 26th, 2018)	Big Data with an Eye Towards the Future: Discussion	No required papers!