Anderson Data Science Research Lab

Machine Learning + Big Data + Data Science

Geeks stay up all night disassembling the world so they can put it back together with new features. They tinker and fix things that aren't broken. Geeks abandon the world around them because they're busy soldering together a new one. They obsess and, in many cases, they suffer. — Matthew Inman


The Lab

In collaboration with internationally recognized researchers and clinicians, the Anderson Data Science Research Lab searches for patterns to diagnose aggressive forms of lung cancer, attempts to unlock the molecular mechanisms behind breast cancer, develops genome theory of marine mammals, mines knowledge from health and medical records, and works to save the coral reefs. More specifically, we research the underlying computational algorithms that make many of the discoveries of modern science possible. We specialize in applying data mining, machine learning, and artificial intelligence to the fields of bioinformatics, biomedical informatics, genomics, and metabolomics. We develop algorithms and software to tackle some of the most challenging and interesting data-intensive problems in the life sciences. Our research interests include data science, big data, pattern analysis in high-dimensional data sets, evolutionary computation and optimization, machine learning, computational genomics, cloud computing, computational metabolomics, and eScience. We currently have multidisciplinary projects underway in metabolomics, human cognition, toxicology, marine biology, medical genomics, biomedical informatics, and marine genomics.

Research Groups

Data Science Foundations

The Data Science Foundations Group researches fundamental problems and solutions for the general field of data science, specializing in machine learning, big data, pattern analysis in high-dimensional data sets, and deep learning. Two example ongoing projects are research into kernel approximation methods for supervised learning and distributed deep learning algorithms on Apache Spark.

Distributed Deep Learning with Apache Spark

Training larger models with deep learning techniques has been shown to increase their performance and classification power. A group at Google pioneered two algorithms for training such models at scale: Downpour SGD, an asynchronous stochastic gradient descent method, and Sandblaster, a framework built on the idea of distributed batch optimization. These algorithms improve training by running many model replicas that share a set of parameters hosted on an independent server. Additionally, each model replica is a DistBelief model, a set of machines that communicate in order to build even larger networks for classification. However, this framework has not been made available to the public, and it relies on consistent and rapid communication and message passing. Frameworks that simplify these communication mechanisms are growing in popularity, most notably Apache Spark. Spark’s popularity is partially attributed to its seamless integration with other cluster frameworks, such as Hadoop, as well as its capability to act as a stand-alone cluster. Additionally, Spark has been shown to adapt well to the parallelization of popular machine learning algorithms, as demonstrated by Spark’s native machine learning library. We have implemented the original system and set of algorithms in Python, using XML remote procedure calls as the method of communication. We plan to compare the efficiency and ease of use of this system to a Spark implementation of the framework.
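The parameter-server idea behind Downpour SGD can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the lab's implementation: the class names (`ParameterServer`, `Replica`), the toy linear model, the learning rate, and the `n_fetch` / `n_push` values are all hypothetical, and asynchrony is emulated by stepping a randomly chosen replica rather than by real network calls.

```python
import random


class ParameterServer:
    """Holds the shared parameters and applies gradients as they arrive."""

    def __init__(self, n_params, lr):
        self.params = [0.0] * n_params
        self.lr = lr

    def fetch(self):
        return list(self.params)

    def push(self, grad):
        for i, g in enumerate(grad):
            self.params[i] -= self.lr * g


class Replica:
    """One model replica that trains on its own shard of the data."""

    def __init__(self, server, shard, n_fetch, n_push):
        self.server = server
        self.shard = shard
        self.n_fetch = n_fetch  # refresh local parameters every n_fetch steps
        self.n_push = n_push    # send accrued gradients every n_push steps
        self.params = server.fetch()
        self.accrued = [0.0, 0.0]
        self.steps = 0

    def step(self, rng):
        if self.steps % self.n_fetch == 0:
            self.params = self.server.fetch()
        x, y = rng.choice(self.shard)
        w, b = self.params
        err = (w * x + b) - y            # toy model: y = w * x + b
        grad = [err * x, err]            # gradient of 0.5 * err ** 2
        for i, g in enumerate(grad):
            self.accrued[i] += g
            self.params[i] -= self.server.lr * g  # local update between fetches
        self.steps += 1
        if self.steps % self.n_push == 0:
            self.server.push(self.accrued)
            self.accrued = [0.0, 0.0]


def train(n_replicas=4, total_steps=4000, seed=0):
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(400)]
    data = [(x, 2.0 * x + 1.0) for x in xs]          # synthetic, noise-free
    shards = [data[i::n_replicas] for i in range(n_replicas)]
    server = ParameterServer(n_params=2, lr=0.02)
    replicas = [Replica(server, s, n_fetch=3, n_push=2) for s in shards]
    for _ in range(total_steps):
        rng.choice(replicas).step(rng)   # arbitrary interleaving of replicas
    return server.params


if __name__ == "__main__":
    print(train())
```

In a distributed setting, `fetch` and `push` would become remote calls to the parameter server, for example through Python's standard `xmlrpc` modules, which is one way to read the XML remote procedure call design mentioned above.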

Fast Food Elastic Net

Bowick et al. extended the application of Fastfood (FF) kernel approximations to neural networks (NNs) by creating a multilayer perceptron (MLP) that learns nonlinear FF feature transforms at each layer, providing an additional nonlinearity in the MLP algorithm. They compared the performance of FF-optimized NNs (FONNs), which optimize the FF parameters alongside the weight and bias parameters during MLP training (for example, with backpropagation), against that of FF-randomized NNs (FRNNs), which generate the FF parameters randomly without optimization, saving computational resources during training. Dai et al. extended the model that Le and Smola and Rahimi and Recht constructed on top of NNs (creating a doubly nonlinear NN) by creating the Doubly Stochastic Kernel Machine. We propose to apply FF to various machine learning algorithms that are traditionally not kernel-based. On a dataset with a large number of samples n, we compare the performance of a support vector machine equipped with elastic net (SVEN), which is computationally impractical on such datasets, against that of a novel model in which the input patterns are transformed by the nonlinear FF feature transforms before being passed through an elastic net.
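To make the randomized feature transform more concrete, here is a minimal pure-Python sketch of a single Fastfood block approximating a Gaussian (RBF) kernel, using the usual construction V = S H G Π H B / (σ√d) with an unnormalized Walsh-Hadamard transform H. The names `fwht` and `FastfoodBlock`, the choice of cosine and sine features, and the zero padding of inputs are assumptions made for illustration; the elastic net stage is not shown.

```python
import math
import random


def fwht(v):
    """Unnormalized fast Walsh-Hadamard transform of a length-2^k list."""
    v = list(v)
    n = len(v)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    return v


class FastfoodBlock:
    """One d x d Fastfood block for an RBF kernel with bandwidth sigma."""

    def __init__(self, d, sigma, rng):
        if d < 1 or d & (d - 1):
            raise ValueError("d must be a power of two")
        self.d = d
        self.sigma = sigma
        self.B = [rng.choice((-1.0, 1.0)) for _ in range(d)]   # random signs
        self.perm = list(range(d))
        rng.shuffle(self.perm)                                 # permutation
        self.G = [rng.gauss(0.0, 1.0) for _ in range(d)]       # Gaussian diagonal
        g_norm = math.sqrt(sum(g * g for g in self.G))
        # Chi-distributed row lengths, rescaled by the Frobenius norm of G.
        self.S = [
            math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))) / g_norm
            for _ in range(d)
        ]

    def project(self, x):
        """Compute V x in O(d log d) without forming V explicitly."""
        x = list(x) + [0.0] * (self.d - len(x))   # zero-pad short inputs
        v = fwht([b * xi for b, xi in zip(self.B, x)])
        v = [v[p] for p in self.perm]
        v = fwht([g * vi for g, vi in zip(self.G, v)])
        scale = 1.0 / (self.sigma * math.sqrt(self.d))
        return [scale * s * vi for s, vi in zip(self.S, v)]

    def features(self, x):
        """Map x to 2d random features whose inner products approximate the kernel."""
        z = self.project(x)
        norm = 1.0 / math.sqrt(self.d)
        return [norm * math.cos(zi) for zi in z] + [norm * math.sin(zi) for zi in z]


if __name__ == "__main__":
    block = FastfoodBlock(d=8, sigma=1.0, rng=random.Random(0))
    print(len(block.features([0.5, -1.0, 2.0])))
```

The feature vectors produced this way are ordinary fixed-length inputs, so in principle they could be handed to any linear elastic net solver. In practice several independent blocks would usually be stacked to reduce the variance of the kernel approximation.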

Bioinformatics

Biomedical Informatics

Translational Dashboard

The translational science dashboard has been an evolving project since January 2015. The project has been developed under the direction of the Biomedical Informatics Center at the Medical University of South Carolina (MUSC), with the cooperation of the Data Science Research Group at the College of Charleston. From the start, this project sought to leverage MUSC's participation in Open Linked Data. All data used is available via the MUSC SPARQL endpoint, which can be accessed with SPARQL queries conforming to the VIVO ontology.

The primary visualization uses a ternary plot and a slight variation of Weber's method of mapping publications to three general areas of research: Human (H), Animal and complex organisms (A), or Cells and molecules (C). We examine the MeSH tree numbers of the MeSH terms associated with a given publication and group each term into one of these three categories. Following Weber's mapping scheme, A is mapped to all tree numbers under the Eukaryota (B01) subtree, with the exception of the human tree number B01.050.150.900.649.801.400.112.400.400, which is mapped to H. The Persons (M01) subtree is also mapped to H. C is mapped to Cells (A11), Archaea (B02), Bacteria (B03), Viruses (B04), Molecular Structures (G02.111.570), and Chemical Processes (G02.149). In addition to Weber's mapping scheme, we have added all tree numbers under the Diseases (C) tree.
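The prefix rules above can be sketched in a few lines of Python. This is an illustrative reading of the mapping, not the dashboard's actual code: the function names are hypothetical, and the Disease tree extension is left out because the category it maps to is not stated here.

```python
# MeSH tree number for Humans, which is carved out of the Eukaryota (B01) subtree.
HUMAN_TREE_NUMBER = "B01.050.150.900.649.801.400.112.400.400"

# Subtree roots per category, following the mapping described in the text.
H_ROOTS = (HUMAN_TREE_NUMBER, "M01")
A_ROOTS = ("B01",)
C_ROOTS = ("A11", "B02", "B03", "B04", "G02.111.570", "G02.149")


def is_under(tree_number, root):
    """True if tree_number is the root itself or lies anywhere below it."""
    return tree_number == root or tree_number.startswith(root + ".")


def categorize(tree_number):
    """Return 'H', 'A', 'C', or None for a single MeSH tree number."""
    # H is checked first so the human exception wins over the broader B01 rule.
    if any(is_under(tree_number, r) for r in H_ROOTS):
        return "H"
    if any(is_under(tree_number, r) for r in A_ROOTS):
        return "A"
    if any(is_under(tree_number, r) for r in C_ROOTS):
        return "C"
    return None  # not codified by the mapping


if __name__ == "__main__":
    for tn in (HUMAN_TREE_NUMBER, "B01.050", "A11.251", "G02.111"):
        print(tn, categorize(tn))
```

Matching on `root + "."` rather than on a bare string prefix avoids treating an unrelated tree number that merely begins with the same characters as part of the subtree.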

As we processed publications, we treated the identifiable MeSH terms, those that could be codified by our mapping, as proportions of the total MeSH terms cited for the given publication. The publications are then grouped by author, and each author is represented by a node in the ternary plot above. These aggregated values represent the percentage of the author's work that falls into each category. Each value is treated as the magnitude of a vector pointing from the plot's centroid toward the vertex of the category the value represents, giving one vector per category. The sum of these vectors gives the coordinates of the node in the ternary plot.
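A small sketch of this computation follows. Several details are assumptions made for illustration rather than facts about the dashboard: the vertex layout (H at the top, A at the lower left, C at the lower right, each at unit distance from a centroid at the origin), the use of a simple mean to aggregate an author's publications, and all function names.

```python
import math
from collections import Counter

CATEGORIES = ("H", "A", "C")

# Assumed layout: direction (in degrees) from the centroid to each vertex.
VERTEX_ANGLE_DEG = {"H": 90.0, "A": 210.0, "C": 330.0}


def publication_proportions(term_categories):
    """Proportion of a publication's MeSH terms in each category.

    term_categories has one entry per MeSH term: 'H', 'A', 'C', or None
    when the term could not be codified. The denominator is the total
    number of terms, including the uncodified ones.
    """
    total = len(term_categories)
    counts = Counter(c for c in term_categories if c in CATEGORIES)
    return {c: (counts[c] / total if total else 0.0) for c in CATEGORIES}


def author_proportions(publications):
    """Average the per-publication proportions (averaging is an assumption)."""
    per_pub = [publication_proportions(p) for p in publications]
    n = len(per_pub)
    return {c: (sum(p[c] for p in per_pub) / n if n else 0.0) for c in CATEGORIES}


def node_coordinates(proportions):
    """Vector sum of the category vectors, giving the node's (x, y) position."""
    x = y = 0.0
    for c in CATEGORIES:
        theta = math.radians(VERTEX_ANGLE_DEG[c])
        x += proportions[c] * math.cos(theta)
        y += proportions[c] * math.sin(theta)
    return x, y


if __name__ == "__main__":
    pubs = [["H", "H", None, "A"], ["C", "C", None, None]]
    props = author_proportions(pubs)
    print(props, node_coordinates(props))
```

With this layout, an author whose work is split evenly across the three categories lands on the centroid, and an author working purely in one category lands on that category's vertex.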

By taking a cohort based on the population of nodes within discrete regions of the ternary plot, the specific research trends of the cohort can be viewed over time. By default, we focus on a cohort of authors in the two regions of the ternary plot furthest from the human research category. Because cell and animal research can be considered basic science, we can follow the movement of the cohort population and see whether it gravitates towards more translational research over time. For this specific cohort, there is indeed a shift in the direction of human research.

Computational Metabolomics

C2G2

The goal of the Charleston Computational Genomics Group, or C2G2, is to develop computational techniques, methodology, and infrastructure to efficiently analyze genomic and bioinformatic data, with the express purpose of training undergraduate students to excel as scientists, software engineers, computer scientists, and bioinformaticians. Our objectives are to (1) build cyberinfrastructure for Charleston area genomics and bioinformatics projects, (2) develop novel software and algorithms for data mining, data acquisition, data storage, data management, data integration, and data visualization, (3) train students in the genomic and bioinformatic sciences for work at local and foreign institutions, and (4) collaborate with Charleston area scientists studying genomic medicine, genome theory, gene expression, marine genomics, etc.

Investigators

The PI for the project is Paul Anderson of the College of Charleston Computer Science Department. Co-PIs are Andrew Shedlock of the College of Charleston Biology Department and Dennis Watson of MUSC's Department of Pathology and Laboratory Medicine. Other members of the team include Bob Wilson, Director of the Genomics Core at MUSC. Alumni of the group are Jeremy Morgan, Connor Stanley, Matt Paul, and Tori McCaffrey.

Education

In addition to this research program, we have several ongoing academic initiatives to introduce students to this exciting new field and prepare them for it. Dr. Paul Anderson is the Director of the Data Science Program at the College of Charleston and routinely teaches Data Science courses. Dr. Andy Shedlock teaches a Vertebrate Genomics course for graduate and advanced undergraduate students, which includes an intensive laboratory experience (co-taught with Dr. Paul Anderson). In addition to this formal coursework, we meet regularly to discuss ongoing research projects and encourage interested students to contact one of the PIs.

Big Data Infrastructure

The C2G2 group currently maintains two compute clusters configured and optimized for computational genomics. The original commodity-based cluster consists of 4 nodes, with a total of 32 cores, 32 GB of RAM per node, and 12 TB of raw storage. The recently awarded GEAR: CI grant funded the purchase of a new high-performance cluster consisting of 10 nodes, with more than 200 cores, 200+ TB of raw storage, 64 to 400 GB of RAM per node, and 4 Gb network connections. The C2G2 group also maintains cloud storage resources for the transfer, management, and organization of scientific data sets.

Principal Investigator

Dr. Paul Anderson graduated in 2004 from Wright State University with a B.S. degree in Computer Engineering. He received his Master of Computer Science in 2006 and his Ph.D. in Computer Science & Engineering in June 2010. After graduation, Dr. Anderson was awarded a Consortium of Universities Research Fellowship to study as a Bioinformatics and Computational Research Scientist for the Air Force Research Laboratory (AFRL). Dr. Anderson has published 24+ peer-reviewed articles in the fields of genomics, data mining, machine learning, computational intelligence, metabolomics, e-Science, bioinformatics, cloud computing, biomedical informatics, cancer informatics, and computer science & engineering education.

At present, Dr. Anderson is an assistant professor in the Computer Science Department at the College of Charleston. He is the director of the Data Science Program, the first such undergraduate program in the country. Dr. Anderson is the Director and Principal Investigator (PI) of the Data Science Research Group. He is also the PI for the Charleston Computational Genomics Group (C2G2). His research lab at the College of Charleston specializes in applying data mining, machine learning, and artificial intelligence to the fields of bioinformatics, genomics, biomedical informatics, and metabolomics. His lab develops algorithms and software to tackle some of the most challenging and interesting data-intensive problems in the life sciences. Dr. Anderson’s research interests include data science, big data, pattern analysis in high-dimensional data sets, evolutionary computation and optimization, machine learning, computational genomics, cloud computing, computational metabolomics, and e-Science. He currently has multidisciplinary projects underway in metabolomics, human cognition and fatigue, toxicology, marine biology, cancer informatics, and medical and marine genomics. Dr. Anderson is also the principal investigator for the Omics NSF Research Experience for Undergraduates at the College of Charleston (http://omics.cofc.edu).