The Information Retrieval and Data Science Group’s (I.R.D.S.) mission is to research and develop new methodology and open source software to analyze, ingest, process, and manage Big Data and to turn it into information. We contribute to the world’s largest and most often downloaded open source software projects, and we apply tried and true techniques including content detection and analysis, crawling, deduplication, similarity, named entity recognition, construction of inverted indices, query analysis, search, relevancy and ranking, interactive query analysis, and management of large data sets. We have expertise in data collection, working with NASA, DARPA, DHS, and NIH across a number of domains, including Earth Science, Planetary Science, Astronomy, defense, and private industry.
He is a Principal Data Scientist and the Chief Architect in the Instrument and Data Systems section at the Jet Propulsion Laboratory (JPL) in Pasadena, California, and an Adjunct Associate Professor in the Computer Science Department within USC's Viterbi School of Engineering.
At JPL, he developed the third generation of the Apache Object Oriented Data Technology (OODT) data processing and information integration system. OODT is open source data-grid middleware used across many scientific domains, such as planetary science, cancer research (go figure), and computer modeling, simulation, and visualization. For more detail on OODT, see his ICSE 2006 paper, which appeared in the Software Engineering Challenges and Achievements track, and his 2009 IEEE Space Mission Challenges for Information Technology (SMC-IT) paper describing the refactoring and re-architecting of the data processing framework.
A command-line gazetteer built around the Geonames.org dataset. It uses the Apache Lucene library to build a searchable index of place names.
The Geonames.org dataset contains over 10,000,000 geographical names corresponding to over 7,500,000 unique features. Beyond names of places in various languages, data stored include latitude, longitude, elevation, population, administrative subdivision and postal codes. All coordinates use the World Geodetic System 1984 (WGS84).
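To make the idea of a searchable gazetteer concrete, here is a minimal sketch in Python. It is not the project's code and it does not use Apache Lucene: a plain in-memory inverted index stands in for the Lucene index. It assumes the tab-separated, 19-column layout of the Geonames dump files, and the function names (build_index, search, make_row) and the sample rows are hypothetical, made up for illustration.

```python
import csv
from collections import defaultdict

# Column positions in a tab-separated Geonames dump row (19 fields per row).
GEONAME_ID, NAME, ASCII_NAME, ALT_NAMES, LAT, LON = 0, 1, 2, 3, 4, 5
COUNTRY, POPULATION = 8, 14
FIELD_COUNT = 19


def build_index(lines):
    """Return (features, index): features keyed by id, and token -> set of ids."""
    features = {}
    index = defaultdict(set)
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < FIELD_COUNT:
            continue  # skip malformed rows
        fid = row[GEONAME_ID]
        features[fid] = {
            "name": row[NAME],
            "lat": float(row[LAT]),   # WGS84 latitude
            "lon": float(row[LON]),   # WGS84 longitude
            "country": row[COUNTRY],
            "population": int(row[POPULATION] or 0),
        }
        # Index the primary name, the ASCII name, and every alternate name.
        names = [row[NAME], row[ASCII_NAME]] + row[ALT_NAMES].split(",")
        for name in names:
            for token in name.lower().split():
                index[token].add(fid)
    return features, index


def search(features, index, query):
    """Return features matching every query token, most populous first."""
    tokens = query.lower().split()
    if not tokens:
        return []
    ids = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted((features[i] for i in ids), key=lambda f: -f["population"])


def make_row(fid, name, alt_names, lat, lon, country, population):
    """Build one illustrative row (made-up values, not real Geonames records)."""
    fields = [""] * FIELD_COUNT
    fields[GEONAME_ID] = fid
    fields[NAME] = fields[ASCII_NAME] = name
    fields[ALT_NAMES] = alt_names
    fields[LAT], fields[LON] = lat, lon
    fields[COUNTRY] = country
    fields[POPULATION] = str(population)
    return "\t".join(fields)


if __name__ == "__main__":
    rows = [
        make_row("1", "Example City", "Ciudad Ejemplo", "10.0", "20.0", "XX", 500),
        make_row("2", "Example Town", "", "11.0", "21.0", "XX", 9000),
    ]
    features, index = build_index(rows)
    for hit in search(features, index, "example"):
        print(hit["name"], hit["lat"], hit["lon"])
```

Ranking by population is only one plausible way to order ambiguous place names. A Lucene-backed gazetteer would normally rely on Lucene's own scoring and analyzers instead.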
A distributed, parallelized (MapReduce) wrapper around Apache™ RAT (Release Audit Tool). RAT is used to check for proper licensing in software projects. However, RAT takes a prohibitively long time to analyze large repositories of code because it can only run on one JVM. Furthermore, RAT isn't customizable by file type or file size, and it provides no incremental output. This wrapper dramatically speeds up the process by leveraging Apache™ OODT to parallelize the workflow.
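The partition, map, and reduce pattern described above can be sketched in a few lines of Python. This is an illustration only: it calls neither RAT nor OODT, it uses a thread pool on one machine where the real wrapper distributes work through OODT, and the single LICENSE_MARKER string is a hypothetical stand-in for RAT's license matching. All function names and default values are assumptions.

```python
import os
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical marker; a real audit tool recognizes many license headers.
LICENSE_MARKER = "Licensed under the Apache License"


def select_files(root, extensions, max_bytes):
    """Walk root and keep files that pass the type and size filters."""
    for dirpath, _dirs, filenames in os.walk(root):
        for fname in filenames:
            path = os.path.join(dirpath, fname)
            if fname.endswith(tuple(extensions)) and os.path.getsize(path) <= max_bytes:
                yield path


def partition(paths, batch_size):
    """Split the file list into fixed-size batches, one per task."""
    paths = list(paths)
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]


def audit_batch(batch):
    """Map step: count files in one batch with and without the marker."""
    counts = Counter()
    for path in batch:
        with open(path, encoding="utf-8", errors="replace") as fh:
            head = fh.read(4096)  # license headers sit near the top of a file
        counts["licensed" if LICENSE_MARKER in head else "unlicensed"] += 1
    return counts


def audit(root, extensions=(".java", ".py"), max_bytes=1_000_000,
          batch_size=50, workers=4):
    """Reduce step: yield a running total each time a batch finishes."""
    total = Counter()
    batches = partition(select_files(root, extensions, max_bytes), batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(audit_batch, batch) for batch in batches]
        for future in as_completed(futures):
            total += future.result()
            yield Counter(total)  # copy, so callers can keep each snapshot


if __name__ == "__main__":
    for snapshot in audit("."):
        print(dict(snapshot))
```

Yielding a snapshot after each batch is what provides incremental output, and the extensions and max_bytes arguments correspond to the file type and file size filters mentioned above.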
TACC's SSI is a week-long workshop that introduces researchers, faculty, staff, students, and industrial partners to high performance computing, data analytics, and scientific visualization. TACC's technology experts teach attendees how to use advanced computing resources and technologies such as Stampede, Maverick, and Wrangler effectively.