Posters
Tuesday, June 28
 

2:30pm PDT

A Large Scale Regression Model Incorporating Networks using Aster and R
Poster #26

Leveraging the Aster platform and the TeradataAsterR package, end users can overcome the memory and scalability limitations of R and the cost of transferring large amounts of data between platforms. We explore the integration of R with Aster, an MPP database from Teradata, focusing on a predictive analytics case study from Wells Fargo. It is crucial for Wells Fargo to understand customer behaviors and what drives them. In this analysis, we utilized Aster's graph analysis functionality to explore customer relationships and examine how network effects change customer behavior. A logistic regression model was built, and an R Shiny application was used to visually represent the impact of important attributes from the model.

Speakers

Brian Kreeger

Sr. Data Scientist, Teradata Corporation


Tuesday June 28, 2016 2:30pm - 3:30pm PDT
Sponsor Pavilion

2:30pm PDT

All-inclusive but Practical Multivariate Stochastic Forecasting for Electric Utility Portfolio
Poster #2

Electric utility portfolio risk simulation requires stochastically forecasting various time series: power and gas prices, peak and off-peak loads, thermal, solar and wind generation, and other covariates, at different time granularities. Taken together, these present modeling issues of autocorrelation, linear and non-linear covariate relationships, non-normal distributions, outliers, seasonal and weekly shapes, heteroskedasticity, temporal disaggregation and dispatch optimization. As a practitioner, I'll discuss how to organize and assemble such a portfolio model, from data scraping and simulation modeling all the way to deployment through a Shiny UI, while pointing out what worked and what didn't.
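The abstract itself contains no code; as a rough illustration of the simulation layer described above, here is a minimal base R sketch (all series names and parameter values are made up) that layers an AR(1) disturbance over a monthly seasonal price shape and summarizes the scenario spread:

    set.seed(1)
    n_months <- 24; n_scen <- 500
    seasonal <- 40 + 10 * sin(2 * pi * (1:n_months) / 12)   # hypothetical $/MWh seasonal shape
    scenarios <- replicate(n_scen,
      seasonal + as.numeric(arima.sim(model = list(ar = 0.7), n = n_months, sd = 5)))
    bands <- apply(scenarios, 1, quantile, probs = c(0.05, 0.5, 0.95))   # per-month bands
    matplot(t(bands), type = "l", lty = c(2, 1, 2), xlab = "Month", ylab = "Price ($/MWh)")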

Speakers

Eina Ooka

The Energy Authority


Tuesday June 28, 2016 2:30pm - 3:30pm PDT
Sponsor Pavilion

2:30pm PDT

Applied Biclustering Using the BiclustGUI R Package
Poster #15

Big, high-dimensional data with complex structures have been emerging steadily and rapidly over the last few years. A relatively new data analysis method that aims to discover meaningful patterns in a big data matrix is biclustering. This method applies clustering simultaneously to two dimensions of a data matrix and aims to find a subset of rows for which the response profile is similar across a subset of columns; the result is a submatrix called a bicluster. The package RcmdrPlugin.BiclustGUI is a GUI plug-in for R Commander for biclustering. It combines different biclustering packages to provide many algorithms for data analysis, visualisations and diagnostics tools in one unified framework. By building on R Commander, the BiclustGUI produces the original R code in the background while the interface is used; this is useful for more experienced R users who would like to transition from the interface to actual R code after using the algorithms. Further, the BiclustGUI package contains template scripts that allow future developers to create their own biclustering windows and include them in the package. The BiclustGUI is available on CRAN and on R-Forge. The GUI also has a Shiny implementation including all the main functionalities. Lastly, the template scripts have been generalized in the REST package, a new helper tool for creating R Commander plug-ins.
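The GUI generates ordinary R code for the wrapped biclustering packages in the background; a minimal command-line sketch of the kind of call it produces, using the biclust package and its bundled BicatYeast example data (the choice of the Plaid method here is arbitrary), might look like:

    library(biclust)
    data(BicatYeast)                                       # example expression matrix shipped with biclust
    res <- biclust(BicatYeast, method = BCPlaid())
    summary(res)                                           # number and sizes of biclusters found
    if (res@Number > 0) drawHeatmap(BicatYeast, res, 1)    # heatmap of the first bicluster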

Speakers

Ewoud De Troyer

Hasselt University (Center of Statistics)


Tuesday June 28, 2016 2:30pm - 3:30pm PDT
Sponsor Pavilion

2:30pm PDT

Bridging the Data Visualization to Digital Humanities gap: Introducing the Interactive Text Mining Suite
Poster #24

In recent years, there has been growing interest in data visualization for text analysis. While text mining and visualization tools have been successfully integrated into research methods in many fields, their use still remains infrequent in mainstream Digital Humanities. Many tools require extensive programming skills, which can be a roadblock for some literary scholars. Furthermore, while some visualization tools provide graphical user interfaces, many humanities researchers desire more interactive and user-friendly control of their data. In this talk we introduce the Interactive Text Mining Suite (ITMS), an application designed to facilitate visual exploration of digital collections. ITMS provides a dynamic interface for performing topic modeling, cluster detection, and frequency analysis. With this application, users gain control over model selection, text segmentation, and graphical representation. Given the considerable variation in literary genres, we have also designed our graphical user interface to reflect the choice of study: scholarly articles, literary genres, and sociolinguistic studies. For documents with metadata, we include tools to extract the metadata for further analysis. Development with the Shiny web framework provides a set of clean user interfaces, hopefully freeing researchers from the limitations of memory or platform dependency.
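ITMS itself is a Shiny application; the operations it exposes (topic modeling, clustering, frequency analysis) rest on standard R tooling. A minimal sketch of the topic-modeling step with the tm and topicmodels packages, using tm's small bundled Reuters corpus rather than a digital-humanities collection, could be:

    library(tm)
    library(topicmodels)
    data("crude")                                        # 20 Reuters articles bundled with tm
    docs <- tm_map(crude, content_transformer(tolower))
    docs <- tm_map(docs, removePunctuation)
    docs <- tm_map(docs, removeWords, stopwords("en"))
    dtm  <- DocumentTermMatrix(docs)
    fit  <- LDA(dtm, k = 4, control = list(seed = 1))    # 4 topics, chosen arbitrarily
    terms(fit, 8)                                        # top terms per topic
    head(sort(colSums(as.matrix(dtm)), decreasing = TRUE))   # simple term-frequency view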

Speakers

Jefferson Davis

Indiana University


Tuesday June 28, 2016 2:30pm - 3:30pm PDT
Sponsor Pavilion

2:30pm PDT

Community detection in multiplex networks: An application to the C. elegans neural network
Poster #31

We explore data from the neuronal network of the nematode C. elegans, a tiny hermaphroditic roundworm. The data consist of 279 neurons and 5863 directed connections between them, represented by three connectomes of electrical and chemical synapses. Our approach uses a fully Bayesian two-stage clustering method, based on Dirichlet processes, that borrows information across the connectomes to identify communities of neurons via stochastic block modeling. This structure allows us to understand the communication patterns between the motor neurons, interneurons, and sensory neurons of the C. elegans nervous system.

Speakers

Brenda Betancourt

Duke University


Tuesday June 28, 2016 2:30pm - 3:30pm PDT
Sponsor Pavilion

2:30pm PDT

curde: Analytical curve detection
Poster #21

The main aim of our work is to develop a new R package, curde, for detecting lines and conic curves in a digital image. The package implements the Hough transform for line detection using an accumulator. The Hough transform is a feature extraction technique whose purpose is to find imperfect instances of objects within a certain class of shapes; it is not suitable for curves with more than three parameters. For conic fitting, robust regression is used, and for noisy data a solution based on Least Median of Squares (LMedS) is highly recommended. The package also implements algorithms for automated (non-interactive) image evaluation. This process includes image preparation, which consists of methods such as grayscaling, thresholding and histogram estimation. The conversion from the grayscale image to a binary image is realised by computing the Sobel operator convolution and applying a threshold. The new R package curde integrates all of these techniques into a single package.
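curde is not yet released, so its own functions cannot be shown; the preprocessing pipeline described above (grayscaling, gradient computation, thresholding to a binary edge image) can be sketched with the imager package as a stand-in (the file name is hypothetical, and imager's generic gradient is used in place of the package's Sobel step):

    library(imager)
    im    <- load.image("example.png")      # hypothetical input image
    gray  <- grayscale(im)
    grad  <- imgradient(gray, "xy")         # x/y gradient components
    mag   <- sqrt(grad$x^2 + grad$y^2)      # gradient magnitude
    edges <- threshold(mag, "90%")          # keep the strongest 10% of pixels as edges
    plot(edges)                             # binary image passed on to line/conic detection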

Speakers

Simon Gajzler

CTU in Prague


Tuesday June 28, 2016 2:30pm - 3:30pm PDT
Sponsor Pavilion

2:30pm PDT

DiLeMMa - Distributed Learning with Markov Chain Monte Carlo Algorithms with the ROAR Package
Poster #29

Markov Chain Monte Carlo algorithms are a general technique for learning probability distributions. However, they tend to mix slowly in complex, high-dimensional models, and scale poorly to large datasets. This package arose from the need to conduct high-dimensional inference in large models using R. It provides distributed versions of stochastic-gradient variants of common continuous Metropolis algorithms, and utilizes the theory of optimal acceptance rates of Metropolis algorithms to automatically tune the proposal distribution toward its optimal value. We describe how to use the package to learn complex distributions, and compare it to other packages such as RStan.
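The ROAR API is not shown in the abstract; the core idea of tuning a Metropolis proposal toward its theoretical optimal acceptance rate can be sketched in base R (a generic single-chain illustration, not the package's distributed, stochastic-gradient implementation):

    # Random-walk Metropolis with the proposal scale adapted toward a target acceptance rate
    metropolis_tuned <- function(log_post, init, n_iter = 5000, target_acc = 0.44) {
      x <- init; scale <- 1; acc <- 0
      out <- numeric(n_iter)
      for (i in seq_len(n_iter)) {
        prop <- x + rnorm(1, sd = scale)
        if (log(runif(1)) < log_post(prop) - log_post(x)) { x <- prop; acc <- acc + 1 }
        out[i] <- x
        scale <- scale * exp((acc / i - target_acc) / sqrt(i))   # Robbins-Monro style adaptation
      }
      list(samples = out, acceptance = acc / n_iter)
    }
    res <- metropolis_tuned(function(x) dnorm(x, log = TRUE), init = 0)   # standard normal target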

Speakers

Ali Zaidi

Data Scientist, Microsoft


Tuesday June 28, 2016 2:30pm - 3:30pm PDT
Sponsor Pavilion

2:30pm PDT

High-performance R with FastR
Poster #12

R is a highly dynamic language that employs a unique combination of data type immutability, lazy evaluation, argument matching, a large amount of built-in functionality, and interaction with C and Fortran code. While these are straightforward to implement in an interpreter, it is hard to compile R functions to efficient bytecode or machine code. Consequently, applications that spend a lot of time in R code often have performance problems. Common solutions are to try to apply primitives to large amounts of data at once and to convert R code to a native language like C. FastR is a novel approach to solving R's performance problem. It makes extensive use of the dynamic optimization features provided by the Truffle framework to remove the abstractions that the R language introduces, and can use the Graal compiler to create optimized machine code on the fly. This talk introduces FastR and the basic concepts behind Truffle's optimization features. It provides examples of the language constructs that are particularly hard to implement using traditional compiler techniques, and shows how to use FastR to improve performance without compromising on language features.
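For context, the kind of scalar R loop that is expensive for an interpreter but amenable to Truffle/Graal-style dynamic compilation looks like the following; the code runs unchanged on GNU R and FastR, and the usual GNU R workaround is the vectorized form shown for comparison (timings are purely illustrative):

    scalar_sum <- function(n) {
      total <- 0
      for (i in seq_len(n)) total <- total + sin(i) * cos(i)   # tight scalar loop
      total
    }
    system.time(scalar_sum(1e6))
    system.time(sum(sin(1:1e6) * cos(1:1e6)))                  # vectorized equivalent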

Speakers

Adam Welc

Oracle Labs


Tuesday June 28, 2016 2:30pm - 3:30pm PDT
Sponsor Pavilion

2:30pm PDT

IMGTStatClonotype: An R package with integrated web tool for pairwise evaluation and visualization of IMGT clonotype diversity and expression from IMGT/HighV-QUEST output
Poster #11

The adaptive immune response is our ability to produce up to 2 × 10^12 different immunoglobulins (IG) or antibodies and T cell receptors (TR) per individual to fight pathogens. IMGT®, the international ImMunoGeneTics information system® (http://www.imgt.org), was created in 1989 by Marie-Paule Lefranc (Montpellier University and CNRS) to manage the huge and complex diversity of these antigen receptors and is at the origin of immunoinformatics, a science at the interface between immunogenetics and bioinformatics. Next generation sequencing (NGS) generates millions of IG and TR nucleotide sequences, and there is a need for standardized analysis and statistical procedures in order to compare immune repertoires. IMGT/HighV-QUEST is the unique web portal for the analysis of IG and TR high throughput sequences. Its standardized statistical outputs include the characterization and comparison of the clonotype diversity in up to one million sequences. IMGT® has recently defined a procedure for evaluating the statistical significance of pairwise differences in proportions of IMGT clonotype diversity and expression, per gene of a given IG or TR V, D or J group. The procedure is generic and suitable for detecting significant changes in IG and TR immunoprofiles in protective (vaccination, cancers and infections) or pathogenic (autoimmunity and lymphoproliferative disorders) immune responses. In this talk, I will present the new R package ('IMGTStatClonotype'), which incorporates the IMGT/StatClonotype tool developed by IMGT® to perform pairwise comparisons of sets from IMGT/HighV-QUEST output through a user-friendly web interface in the user's own browser.
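The package's own procedure is not reproduced here; as a simplified stand-in for the kind of pairwise comparison it standardizes, a plain two-proportion test on simulated clonotype counts for one gene across two IMGT/HighV-QUEST sets would be:

    # Hypothetical counts: occurrences of one IGHV gene among total assigned sequences per set
    count_set1 <- 120; total_set1 <- 5000
    count_set2 <- 340; total_set2 <- 9000
    prop.test(x = c(count_set1, count_set2), n = c(total_set1, total_set2))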

Speakers

Safa Aouinti

IMGT® - the international ImMunoGeneTics information system®, Institute of Human Genetics


Tuesday June 28, 2016 2:30pm - 3:30pm PDT
Sponsor Pavilion

2:30pm PDT

Imputing Gene Expression to Maximise Platform Compatibility
Poster #9

Microarray measurements of gene expression constitute a large fraction of publicly shared biological data, and are available in the Gene Expression Omnibus (GEO). Many studies use GEO data to shape hypotheses and improve statistical power. Within GEO, the Affymetrix HG-U133A and HG-U133 Plus 2.0 are the two most commonly used microarray platforms for human samples; the HG-U133 Plus 2.0 platform contains 54,220 probes and the HG-U133A array contains a proper subset (21,722 probes). When different platforms are involved, the subset of common genes is most easily compared. This approach results in the exclusion of substantial measured data and can limit downstream analysis. To predict the expression values for the genes unique to the HG-U133 Plus 2.0 platform, we constructed a series of gene expression inference models based on genes common to both platforms. Our model predicts gene expression values that are within the variability observed in controlled replicate studies and are highly correlated with measured data. Using six previously published studies, we also demonstrate the improved performance of the enlarged feature space generated by our model in downstream analysis.
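A minimal sketch of the modeling idea, on simulated data rather than the authors' GEO training sets: learn a linear model for a probe unique to HG-U133 Plus 2.0 from probes shared with HG-U133A, then check agreement on held-out samples.

    set.seed(1)
    n <- 100
    shared <- matrix(rnorm(n * 50), n, 50, dimnames = list(NULL, paste0("shared_", 1:50)))
    unique_probe <- as.numeric(shared %*% runif(50, -0.2, 0.2) + rnorm(n, sd = 0.3))
    train <- 1:80; test <- 81:100
    fit  <- lm(unique_probe[train] ~ ., data = as.data.frame(shared[train, ]))
    pred <- predict(fit, newdata = as.data.frame(shared[test, ]))
    cor(pred, unique_probe[test])   # agreement between imputed and "measured" values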

Speakers

Weizhuang Zhou

Stanford University


Tuesday June 28, 2016 2:30pm - 3:30pm PDT
Sponsor Pavilion

2:30pm PDT

Integrating R & Tableau
Poster #27

Tableau is regularly used by our clients for the purposes of visualization and dashboarding, but they also often require the analytics and statistical functionality of R to analyze their data. While Tableau supports the integration of R, it is not always a straightforward process to blend the functionality of the two together. We plan to discuss our lessons learned from building Tableau applications that integrate with R, including best practices for performance optimization, sessionizing interaction on Tableau production servers, and reducing network latency issues. We will also discuss the limitations of Tableau’s R integration capability.

Our goal is to help others avoid common frustrations and roadblocks when integrating R and Tableau.
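On the R side, Tableau's R integration works by connecting to an Rserve instance; a minimal sketch of that setup (default port, no production hardening) is:

    # install.packages("Rserve")   # once
    library(Rserve)
    Rserve(args = "--no-save")     # listens on localhost:6311; point Tableau's
                                   # "External Service Connection" at this host and port
    # In Tableau, R is then called from calculated fields, e.g.
    # SCRIPT_REAL("mean(.arg1)", SUM([Sales]))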

Speakers

Douglas Friedman

Booz Allen Hamilton


Tuesday June 28, 2016 2:30pm - 3:30pm PDT
Sponsor Pavilion

2:30pm PDT

Making Shiny Seaworthy: A weighted smoothing model for validating oceanographic data at sea.
Poster #30

The City of San Diego conducts one of the largest ocean monitoring programs in the world, covering ~340 square miles of coastal waters and sampling at sea ~150 days each year. Water quality monitoring is a cornerstone of the program and requires the use of sophisticated instrumentation to measure a suite of oceanographic parameters (e.g., temperature, depth, salinity, dissolved oxygen, pH). The various sensors or probes can be episodically temperamental, and oceanographic data can be inherently non-linear, especially within stratifications (i.e., where the water properties change rapidly with small changes in depth). This makes it difficult to distinguish between extreme observations due to natural events (anomalous data) and those due to instrumentation error (erroneous data), thus requiring manual data validation at sea.
This Shiny app improves the manual validation process by providing a smoothing model to flag erroneous data points while including anomalous data. Standard smoothing models were unable to model stratification without including erroneous data, so we elected to use a custom weighted average model where observations with a greater deviation from the local mean have less weight.
We coupled this model with an interactive Shiny session using ggplot2 and R Portable to create an offline web application for use at sea. This Shiny app takes in a raw data file, presents a series of interactive graphs for removing/restoring potentially erroneous data, and exports a new data file. Additional customization of the Shiny interface using the shinyBS package, Javascript, and HTML improve the user experience.
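A minimal sketch of the weighting idea described above, with a made-up window size and weight function rather than the program's production code: a rolling weighted mean in which points far from the local mean get little weight, followed by flagging of large residuals.

    weighted_smooth <- function(y, width = 11) {
      half <- floor(width / 2); n <- length(y); sm <- numeric(n)
      for (i in seq_len(n)) {
        idx <- max(1, i - half):min(n, i + half)
        w <- 1 / (1 + abs(y[idx] - mean(y[idx])))   # down-weight local outliers
        sm[i] <- sum(w * y[idx]) / sum(w)
      }
      sm
    }
    set.seed(42)
    temp <- 15 - 8 * plogis((1:200 - 80) / 10) + rnorm(200, sd = 0.05)  # stratified profile
    temp[c(50, 120)] <- temp[c(50, 120)] + 3                            # injected sensor errors
    sm   <- weighted_smooth(temp)
    flag <- abs(temp - sm) > 4 * mad(temp - sm)    # candidate erroneous points for review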

Speakers

Kevin Wayne Byron

Marine Biologist, City of San Diego
I am interested in developing software and statistical tools for supporting biological research. As a Marine Biologist for the City of San Diego's Ocean Monitoring Program's IT/GIS team, my group is responsible for data base management, low-level IT support, GIS, and R coordination...


Tuesday June 28, 2016 2:30pm - 3:30pm PDT
Sponsor Pavilion

2:30pm PDT

Multi-stage Decision Method To Generate Rules For Student Retention
Poster #13

The retention of college students is an important problem that may be analyzed with computing techniques, such as data mining, to identify students who may be at risk of dropping out. The importance of the problem has grown as institutions must meet legislative retention mandates, face budget shortfalls due to decreased tuition or state-based revenue, and fall short of producing enough graduates in fields of need, such as computing. While data mining techniques have been applied with some success, this work shows how R can be used to develop a hybrid methodology that creates rules for the minority class with a coverage and accuracy range not previously reported in the literature. A multi-stage decision methodology (MSDM) used data mining techniques to extract rules from an institution's student data set, enabling administrators to identify at-risk students. The data mining techniques included partial decision trees, K-means clustering, and Apriori association mining, all implemented in R. MSDM was able to identify students with up to 89% accuracy on student datasets in which the number of at-risk students was far smaller than the number of retained students, which made the at-risk model difficult to build. The motivation for using R was twofold: first, to generate rules for the minority class, and second, to make the analysis reproducible.
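Only the Apriori stage of MSDM is sketched below (the decision-tree and clustering stages are omitted), on a made-up student data frame; rules are constrained so that the minority "not retained" class appears on the right-hand side:

    library(arules)
    set.seed(1)
    students <- data.frame(
      gpa_band  = factor(sample(c("low", "mid", "high"), 500, replace = TRUE)),
      financial = factor(sample(c("aid", "no_aid"), 500, replace = TRUE)),
      retained  = factor(sample(c("yes", "no"), 500, replace = TRUE, prob = c(0.85, 0.15)))
    )
    trans <- as(students, "transactions")
    rules <- apriori(trans,
                     parameter  = list(supp = 0.02, conf = 0.5),
                     appearance = list(rhs = "retained=no", default = "lhs"))
    inspect(head(sort(rules, by = "confidence"), 5))   # strongest rules for the minority class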

Speakers

Soma Datta

Assistant Professor, University Of Houston Clear Lake
R in decision trees and Apriori, Controlled decision trees, Teaching R in school.


Tuesday June 28, 2016 2:30pm - 3:30pm PDT
Sponsor Pavilion

2:30pm PDT

mvarVis: An R package for Visualization of Multivariate Analysis Results
Poster #28

mvarVis is an R package for visualization of diverse multivariate analysis methods. We implement two new tools to facilitate analyses that are cumbersome with existing software. The first uses htmlwidgets and d3 to create interactive ordination plots; the second makes it easy to bootstrap multivariate methods and align the resulting scores. The interactive visualizations offer an alternative to printing multiple plots with different supplementary information overlaid, and bootstrapping enables a qualitative assessment of the uncertainty underlying the application of exploratory multivariate methods to particular data sets.

Our approach is to leverage existing packages -- FactoMineR, ade4, and vegan -- to perform the actual dimension reduction, and build a new layer for visualizing and bootstrapping their results. This allows our tools to wrap a variety of existing methods, including one table, multitable, and distance-based approaches -- principal components, multiple factor analysis, and multidimensional scaling, for example. Since our package uses htmlwidgets, it is possible to embed our interactive plots in Rmarkdown pages and Shiny apps. All code and many examples are available on our github.
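The package's API is not shown in the abstract; the bootstrapping-and-alignment idea can be sketched in base R (crude sign alignment only, on the iris data, not the mvarVis implementation):

    pca0 <- prcomp(iris[, 1:4], scale. = TRUE)
    boot_scores <- replicate(200, {
      idx   <- sample(nrow(iris), replace = TRUE)
      p     <- prcomp(iris[idx, 1:4], scale. = TRUE)
      signs <- sign(colSums(p$rotation * pca0$rotation))   # align axis signs to the original fit
      sweep(predict(p, newdata = iris[, 1:4]), 2, signs, "*")
    }, simplify = FALSE)
    pc1_sd <- apply(sapply(boot_scores, function(s) s[, 1]), 1, sd)   # per-point PC1 uncertainty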

Tuesday June 28, 2016 2:30pm - 3:30pm PDT
Sponsor Pavilion

2:30pm PDT

Prediction of key parameters in the production of biopharmaceuticals using R
Poster #4

In this contribution we present our workflow for model prediction in E. coli fed-batch production processes using R. The major challenges in this context are the fragmentary understanding of bioprocesses and the severely limited real-time access to process variables related to product quality and quantity. Data driven modeling of process variables in combination with model predictive process control concepts represent a potential solution to these problems. In R the statistical techniques best qualified for bioprocess data analysis and modeling are readily available.

In a benchmark study, the performance of a number of machine learning methods is evaluated, i.e., random forest, neural networks, partial least squares and structured additive regression models. For that purpose a series of recombinant E. coli fed-batch production processes with varying cultivation conditions, employing a comprehensive on- and offline process monitoring platform, was conducted. The prediction of cell dry mass and recombinant protein based on online available process parameters and two-dimensional multi-wavelength fluorescence spectroscopy is investigated. Parameter optimization and model validation are performed in the framework of leave-one-fermentation-out cross-validation. Computations are performed using, among others, the R packages robfilter, boost, nnet, randomForest, pls and caret. The results clearly argue for a combined approach: neural networks as the modeling technique and random forest as the variable selection tool.
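A minimal caret sketch of the validation scheme on simulated process data (variable names, fold counts and the two methods shown are illustrative; the study compares more methods): grouped folds give a leave-one-fermentation-out cross-validation.

    library(caret)
    set.seed(1)
    dat <- data.frame(run = rep(paste0("F", 1:6), each = 50),
                      x1 = rnorm(300), x2 = rnorm(300), x3 = rnorm(300))
    dat$cdm <- 2 * dat$x1 - dat$x2 + rnorm(300, sd = 0.3)      # cell dry mass (response)
    folds <- groupKFold(dat$run, k = length(unique(dat$run)))  # leave one fermentation out
    ctrl  <- trainControl(method = "cv", index = folds)
    fit_rf <- train(cdm ~ x1 + x2 + x3, data = dat, method = "rf", trControl = ctrl)
    fit_nn <- train(cdm ~ x1 + x2 + x3, data = dat, method = "nnet",
                    trControl = ctrl, linout = TRUE, trace = FALSE)
    summary(resamples(list(rf = fit_rf, nnet = fit_nn)))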

Speakers

Theresa Scharl

BOKU Vienna


Tuesday June 28, 2016 2:30pm - 3:30pm PDT
Sponsor Pavilion

2:30pm PDT

R Microplots in Tables with the latex() Function
Poster #16

Microplots are often used within cells of a tabular array. We describe several simple R functions that simplify the use of microplots within LaTeX documents constructed within R. These functions are coordinated with the latex() function in the Hmisc package or the xtable() function in the xtable package. We show examples using base graphics and three grid-based graphics systems: lattice, ggplot2, and vcd graphics. These functions work smoothly with standalone LaTeX documents and with Sweave, knitr, org mode, and Rmarkdown.
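As a rough illustration of the mechanics (not the authors' functions), one can write one small plot file per table row and place \includegraphics commands in a character column; the xtable route is shown here with sanitization turned off so the LaTeX commands pass through untouched:

    library(xtable)
    for (sp in levels(iris$Species)) {
      pdf(paste0("hist-", sp, ".pdf"), width = 1.2, height = 0.6)
      par(mar = rep(0, 4))
      hist(iris$Sepal.Length[iris$Species == sp], main = "", axes = FALSE, col = "grey")
      dev.off()
    }
    tab <- data.frame(
      Species   = levels(iris$Species),
      Mean      = round(tapply(iris$Sepal.Length, iris$Species, mean), 2),
      Histogram = paste0("\\includegraphics[height=1em]{hist-", levels(iris$Species), ".pdf}")
    )
    print(xtable(tab), sanitize.text.function = identity,
          include.rownames = FALSE, file = "microplot-table.tex")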

Speakers

Richard Heiberger

Professor Emeritus, Temple University, Department of Statistics, Fox School of Business
Censuses and Surveys of the Jewish People. Rich Heiberger is co-chair of P'nai Or Philadelphia, and was on the Board of the NHC some years ago.


Tuesday June 28, 2016 2:30pm - 3:30pm PDT
Sponsor Pavilion

2:30pm PDT

R Shiny Application for the Evaluation of Surrogacy in Clinical Trials
Poster #18

In clinical trials, the determination of the true endpoint or the effect of a new therapy on the true endpoint may be difficult, requiring an expensive, invasive or uncomfortable procedure. Furthermore, in some trials the primary endpoint of interest (the “true endpoint”), for example death, is rare and/or takes a long period of time to reach. In such trials, there would be benefit in finding a more proximate endpoint (the “surrogate endpoint”) to determine more quickly the effect of an intervention.

We present a new R Shiny application for the evaluation of surrogate endpoints in randomized clinical trials using patient data. The Shiny application for surrogacy consists of a set of user-friendly functions which allow the evaluation of different types of endpoints (i.e., continuous, categorical, binary, and survival endpoints) and produce a unified and interoperable output. With this new Shiny app, the user does not need to have R installed locally; it is a web-based application that can be run from any device with an internet connection.

We demonstrate the usage and capabilities of this Shiny app for surrogacy using several example clinical trials in which validation of a surrogate for the primary endpoint was of interest.
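The app itself is not reproduced here; as a simplified stand-in for the kind of analysis it automates, the classical two-stage (trial-level) surrogacy check on simulated multi-trial treatment effects is:

    set.seed(1)
    n_trial <- 20
    effect_surrogate <- rnorm(n_trial, 0.5, 0.3)                         # per-trial effect on surrogate
    effect_true      <- 0.9 * effect_surrogate + rnorm(n_trial, 0, 0.1)  # per-trial effect on true endpoint
    fit <- lm(effect_true ~ effect_surrogate)
    summary(fit)$r.squared   # trial-level R^2, a common measure of surrogacy strength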

Speakers

Theophile Bigirumurame

Hasselt University


Tuesday June 28, 2016 2:30pm - 3:30pm PDT
Sponsor Pavilion

2:30pm PDT

RCAP Designer: An RCloud Package to create Analytical Dashboards
Poster #25

RCloud is an open-source social coding environment for big data analytics and visualization developed by AT&T Labs. We discuss RCAP Designer, an RCloud package that provides a way for data scientists to build R web applications, similar to what Shiny offers in the RStudio environment.

RCAP Designer creates a workflow where the source R code is created within the RCloud environment in an R notebook. The package allows the data scientist to transform this notebook into an R dashboard application without developing web code (JavaScript, CSS, etc.). A number of widgets have been developed for creating the page design, several kinds of content (R plots, interactive plots, iframes, etc.) and the different event controls for the page. For example, to include an R plot, one would drag and drop the RPlot widget onto the canvas. After sizing the plot window appropriately, the widget is configured to select the R plot function from the current workspace and automatically link it to the control parameters. Once the design elements are saved, RCAP uses RCloud to render the page on the fly.

High-level RCAP design considerations: on the server (R) side, RCAP produces the appropriate wrapping for the user's R code with the necessary templates to push the results back to the client side. This includes all of the RCloud commands and various error-catching mechanisms. These wrapped functions are exposed to the JavaScript via OCAP. The user can write normal plotting code and RCAP makes sure it appears on the page. The JavaScript supplied by the widgets is in charge of the layout: it lays out the grid and loads the text, iframes and any other static content. The event controller widgets in RCAP use the reactive programming paradigm. RCAP is a statistician's convenient web publishing tool for R analytics and visualizations developed within the RCloud environment.

References:
Subramaniam, G., Larchuk, T., Urbanek, S. and Archibad, R. (2014). iwplot: An R Package for Creating Web Based Interactive. In useR! 2014, The R User Conference (UCLA, USA), July 2014.
Woodhull, G. (2014). RCloud – Integrating Exploratory Visualization, Analysis and Deployment. In useR! 2014, The R User Conference (UCLA, USA), July 2014.
R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. URL: http://www.R-project.org.
RStudio, Inc. (2014). shiny: Easy web applications in R. URL: http://shiny.rstudio.com

Speakers

Ganesh K Subramaniam

AT&T Labs - Research


Tuesday June 28, 2016 2:30pm - 3:30pm PDT
Sponsor Pavilion

2:30pm PDT

Sequence Analysis with Package TraMineR
Poster #6

Sequence analysis started in the biological sciences to examine patterns in protein and DNA sequences and was subsequently applied in the social sciences to study patterns in sequences from individuals' life courses. Many social science studies concerned with time-ordered data record them as sequences. Past studies using sequence analysis include dance steps, class careers, employment biographies, family histories, school-to-work transitions, occupational career patterns, and other life-course trajectories.

TraMineR is a package specially designed for carrying out sequence analysis in the social sciences (Gabadinho, Studer, Muller, Buergin, & Ritschard, 2015). It is a data mining tool well suited to mining and grouping social sequence data. It contains a toolbox for the manipulation, description and rendering of sequences, and functions to produce graphical output describing state sequences and categorical sequences, visualizing sequences, and measuring sequence complexity. It also offers functions for computing distances between sequences with different metrics, including optimal matching, longest common prefix and longest common subsequence. In combination with cluster analysis and multidimensional scaling, a typology can be formed to understand life-course trajectories by grouping similar sequences.

I will briefly outline the key functionalities of TraMineR and demonstrate the procedure for carrying out social sequence analysis with real life examples to highlight the usefulness of the TraMineR package. Other R packages related to sequence analysis will also be covered during the session.
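A minimal sketch of a typical TraMineR workflow, using the mvad school-to-work data shipped with the package (the distance settings and the number of clusters are arbitrary choices for illustration):

    library(TraMineR)
    data(mvad)                                       # school-to-work transition data
    mvad_seq <- seqdef(mvad, var = 17:86)            # monthly activity states as a sequence object
    seqdplot(mvad_seq, border = NA)                  # state distribution plot
    dist_om <- seqdist(mvad_seq, method = "OM", indel = 1, sm = "TRATE")   # optimal matching
    cl <- cutree(hclust(as.dist(dist_om), method = "ward.D"), k = 4)       # trajectory typology
    seqIplot(mvad_seq, group = cl, sortv = "from.start")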

Speakers

Teck Kiang Tan

National University of Singapore
Dr. Teck Kiang Tan is a senior research fellow at the National University of Singapore. His research interests involving R packages include R graphics, doubly classified models, multilevel modeling, cognitive diagnostic models, sequence analysis, informative hypotheses, and longitudinal...


Tuesday June 28, 2016 2:30pm - 3:30pm PDT
Sponsor Pavilion

2:30pm PDT

shinyGEO: a web application for analyzing Gene Expression Omnibus (GEO) datasets using shiny
Poster #7

Identifying associations between patient gene expression profiles and clinical data provides insight into the biological processes associated with health and disease. The Gene Expression Omnibus (GEO) is a public repository of gene expression and sequence-based datasets, and currently includes >42,000 datasets with gene expression profiles obtained by microarray. Although GEO has its own analysis tool (GEO2R) for identifying differentially expressed genes, the tool is not designed for advanced data analysis and does not generate publication-ready graphics. In this work, we describe a web-based, easy-to-use tool for biomarker analysis in GEO datasets, called shinyGEO.

shinyGEO is a web-based tool that provides a graphical user interface for users without R programming experience to quickly analyze GEO datasets. The tool is developed using 'shiny', a web application framework for R. Specifically, shinyGEO allows a user to download the expression and clinical data from a GEO dataset, to modify the dataset correcting for spelling and misaligned data frame columns, to select a gene of interest, and to perform a survival or differential expression analysis using the available data. The tool uses the Bioconductor package 'GEOquery' to retrieve the GEO dataset, while survival and differential expression analyses are carried out using the 'survival' and 'stats' packages, respectively. For both analyses, shinyGEO produces publication-ready graphics using 'ggplot2' and generates the corresponding R code to ensure that all analyses are reproducible. We demonstrate the capabilities of the tool by using shinyGEO to identify diagnostic and prognostic biomarkers in cancer.
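A minimal sketch of the retrieval step that shinyGEO wraps (GSE2034 is a real, public breast-cancer series; the probe shown targets ESR1 and is chosen only for illustration, and the commented t-test stands in for the app's differential-expression analysis with a hypothetical clinical column):

    library(GEOquery)
    library(Biobase)
    eset  <- getGEO("GSE2034", GSEMatrix = TRUE)[[1]]
    expr  <- exprs(eset)    # probe x sample expression matrix
    pheno <- pData(eset)    # clinical annotations used for grouping or survival analysis
    hist(expr["205225_at", ], main = "Probe 205225_at", xlab = "Expression")
    # t.test(expr["205225_at", ] ~ pheno$some_two_level_column)   # hypothetical grouping column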

Speakers

Jasmine Dumas

Graduate Student & Data Scientist
Formerly a Graduate MS Predictive Analytics student at DePaul University; Currently a Graduate MS student at Johns Hopkins Engineering For Professionals studying Computer Science and Data Science. Currently an Associate Data Scientist at The Hartford Insurance Group working on building...


Tuesday June 28, 2016 2:30pm - 3:30pm PDT
Sponsor Pavilion

2:30pm PDT

Statistics and R for Analysis of Elimination Tournaments
Poster #8

There is keen interest in statistical methodology in sports. Such methods are valuable not only to sports sociologists but also those in sports themselves, as exemplified in the book and movie “Moneyball.” These statistics enhance comparisons among players and possibly even enable prediction of games. However, elimination tournaments present special statistical challenges. This paper explores data from the national high school debate circuit, in which the first author was an active national participant. All debaters participate in the 6 pre-elimination rounds, but subsequently the field successively narrows in the elimination rounds. This atypical format makes it difficult to use classical statistical methods, and also requires more sophisticated data wrangling. This paper will use R to explore questions such as: Does gender affect the outcome of rounds? Does geography play a role in wins/losses? What constitutes an upset? Is there a so-called “shadow effect,” in which the weaker the expected competitor in the next round, the greater the probability that the stronger player will win in the current stage? Among the purposes of this project is to use it as an R-based teaching tool, and help the debate community understand the inequalities that exist in relation to gender, region, and school. Typical graphs that can be generated may be viewed at https://github.com/ariel-shin/tourn. Our R software will be available in a package “tourn.”
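As a sketch of the kind of question listed above (simulated rounds, not the actual debate data or the tourn package), a logistic regression of round outcome on gender and region would be:

    set.seed(1)
    rounds <- data.frame(win    = rbinom(1000, 1, 0.5),
                         gender = sample(c("F", "M"), 1000, replace = TRUE),
                         region = sample(c("MW", "NE", "S", "W"), 1000, replace = TRUE))
    summary(glm(win ~ gender + region, family = binomial, data = rounds))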

Speakers

Ariel Shin

University of California, Davis


Tuesday June 28, 2016 2:30pm - 3:30pm PDT
Sponsor Pavilion

2:30pm PDT

Teaching statistics to medical students with R and OpenCPU
Poster #1

In general, medical students do not have, or aim at, a deep understanding of statistics. Nevertheless, some knowledge of basic statistical reasoning and methodology is indispensable to apprehend the meaning of the results of scientific studies published in medical journals. Also, some familiarity with the correct interpretation of probability statements concerning medical tests is crucial for physicians.

In order to supplement our regular statistics classes at the medical faculty, we started to develop an online system providing a pool of assignments. Each student gets an individual assignment with a modified data set, therefore requiring a slightly different solution. This enables the system to verify each student's personal achievement, and a database may keep a record of his or her performance.

Our system utilizes OpenCPU installed on a Linux server. The front-end is developed with HTML and JavaScript, while the back-end involves R and MySQL.

The state of development, the problems encountered, and the students' response will be presented.
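For illustration, the front end reaches the R back end through OpenCPU's HTTP API; the same call can be made from R with httr. The package name gradeR and the function check_answer below are hypothetical stand-ins for the course's own grading code.

    library(httr)
    resp <- POST("http://localhost/ocpu/library/gradeR/R/check_answer/json",
                 body = list(student_id = "12345", answer = 3.14),
                 encode = "json")
    content(resp)   # JSON result produced by the R function on the server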

Speakers

Joern Pons-Kuehnemann

Institute for Medical Informatics, Justus Liebig University, Giessen, Germany


Tuesday June 28, 2016 2:30pm - 3:30pm PDT
Sponsor Pavilion

2:30pm PDT

Urban Mobility Modeling using R and Big Data from Mobile Phones
Poster #22

There has been rapid urbanization as more and more people migrate into cities. The World Health Organization (WHO) estimates that by 2017, a majority of people will be living in urban areas. By 2030, 5 billion people (60 percent of the world's population) will live in cities, compared with 3.6 billion in 2013. Developing nations must cope with this rapid urbanization while developed ones wrestle with aging infrastructures and stretched budgets. Transportation and urban planners must estimate travel demand for transportation facilities and use this to plan transportation infrastructure. Presently, the technique used for transportation planning is the conventional four-step transportation planning model, which makes use of data inputs from local and national household travel surveys. However, local and national household surveys are expensive to conduct, cover smaller areas of cities, and the time between surveys ranges from 5 to 10 years even in some of the most developed cities. This calls for new and innovative ways of transportation planning using new data sources.

In recent years, we have witnessed the proliferation of ubiquitous mobile computing devices (inbuilt with sensors, GPS, Bluetooth) that capture the movement of vehicles and people in near real time and generate massive amounts of new data. This study utilizes Call Detail Records (CDR) data from mobile phones and the R programming language to infer travel/mobility patterns. These CDR data contain the locations, time, and dates of billions of phone calls or Short Message Services (SMS) sent or received by millions of anonymized users in Cape Town, South Africa. By analyzing relational dependencies of activity time, duration, and land use, we demonstrate that these new “big” data sources are cheaper alternatives for activity-based modeling and travel behavior studies.
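A minimal dplyr sketch of the kind of aggregation used in such studies, on a simulated CDR extract (the column names and the "night-time tower as home location" heuristic are illustrative, not the study's actual pipeline):

    library(dplyr)
    set.seed(1)
    cdr <- data.frame(
      user  = sample(1e4, 1e5, replace = TRUE),
      time  = as.POSIXct("2016-01-01", tz = "UTC") + runif(1e5, 0, 7 * 24 * 3600),
      tower = sample(paste0("T", 1:300), 1e5, replace = TRUE)
    )
    home <- cdr %>%
      filter(format(time, "%H") %in% sprintf("%02d", c(0:6, 22, 23))) %>%   # night-time records
      count(user, tower) %>%
      group_by(user) %>%
      slice(which.max(n))   # most frequent night-time tower per user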

Speakers

Daniel Emaasit

Graduate Research Assistant, University of Nevada Las Vegas
Broadly, my research interests involve the development of probabilistic machine learning methods for high-dimensional data, with applications to Urban Mobility, Transport Planning, Highway Safety, & Traffic Operations.


Tuesday June 28, 2016 2:30pm - 3:30pm PDT
Sponsor Pavilion

2:30pm PDT

Using R in the evaluation of psychological tests
Poster #17

Psychological tests are used in many fields, including medicine and education, to assess the cognitive abilities of test takers. According to international standards for psychological testing, psychological tests are required to be reliable, fair, and valid. This presentation illustrates how R can be used to assess the reliability, fairness, and validity of psychological tests using the Tower of London task as an example. In clinical neuropsychology, the Tower of London task is widely used to assess a person’s planning ability. Our data consist of 798 respondents who worked on the 24 test items of the Tower of London – Freiburg Version. By employing the framework of factor analysis and item response theory, it is demonstrated that the number of correctly solved problems in this test can be considered as a reliable and sound indicator for the planning ability of the test takers. It is further demonstrated that the individual problem difficulties remain stable across different levels of age, sex and education, which provides evidence for the test’s fairness. All computations were carried out with the R packages psych, lavaan and eRm, all of which are freely available on CRAN.
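The Tower of London data themselves are not public, so the sketch below uses simulated binary responses for 24 items; it shows the kind of reliability and Rasch analyses named above with the psych and eRm packages.

    library(psych)
    library(eRm)
    set.seed(1)
    n_person <- 500; n_item <- 24
    theta <- rnorm(n_person); beta <- seq(-2, 2, length.out = n_item)
    items <- matrix(rbinom(n_person * n_item, 1, plogis(outer(theta, beta, "-"))),
                    n_person, n_item)
    alpha(items)$total                 # reliability (Cronbach's alpha)
    fit <- RM(items)                   # Rasch model
    summary(fit)
    plotICC(fit, item.subset = 1:4)    # item characteristic curves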

Speakers

Rudolf Debelak

University of Zurich


Tuesday June 28, 2016 2:30pm - 3:30pm PDT
Sponsor Pavilion

2:30pm PDT

Using R with Taiwan Government Open Data to create a tool for monitoring a city's age-friendliness
Poster #20

Due to the rapidly growing aging population, creating an age-friendly city is an important goal of modern government. Indexes reflecting a city's age-friendliness may help the local government monitor and improve policy practice, and this information should also be open and interactive for citizens who care about the issue. The R language provides great flexibility in dealing with the diversity of file formats from government sources, and the data visualization and web application tools supported by R make the analysis results more understandable and interactive.

According to the WHO 2015 age-friendly city guidelines, there are eight aspects of comfortable elder living (outdoor spaces, transportation, housing, social participation, social respect, civic participation, communication, and health and community support). We use Taiwan open government data to integrate normalized indexes and to visualize them geographically. Finally, we create a Shiny application with interactive Plotly graphics to make the results easy to explore. The result shows how R can readily utilize government data and turn the WHO guidelines into a monitoring tool that helps the government practice age-friendly policy.
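A minimal sketch of the normalization-and-integration step (district names and indicators are made up; the real platform draws them from Taiwan open government data):

    set.seed(1)
    ind <- data.frame(district    = paste0("D", 1:10),
                      green_space = runif(10),
                      bus_stops   = runif(10),
                      clinics     = runif(10))
    norm01 <- function(x) (x - min(x)) / (max(x) - min(x))    # min-max normalization
    ind$age_friendly_index <- rowMeans(sapply(ind[, -1], norm01))
    ind[order(-ind$age_friendly_index), ]                     # districts ranked by composite index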

Speakers

TING-WEI LIN

National Taiwan University


Tuesday June 28, 2016 2:30pm - 3:30pm PDT
Sponsor Pavilion

2:30pm PDT

Video Tutorials in Introductory Statistics Instruction
Poster #14

I use the Rcmdr package in the introductory statistics course I teach for non-majors. For the past several years I've used video tutorials, in addition to written documents covering the same material, for the lab portion of the course where students use Rcmdr and R to analyze data. All course materials are made available via a content management system that allows me to analyze to what degree students are utilizing various delivery mechanisms. This poster will present how I've assembled the video tutorials as well as usage patterns over the last three course offerings. The associations between tutorial usage type/frequency and student performance in the course are also explored.

Speakers

Tom Burk

Professor, University of Minnesota


Tuesday June 28, 2016 2:30pm - 3:30pm PDT
Sponsor Pavilion

2:30pm PDT

Visualization of health and population indicators within urban African populations using R
Poster #23

The Demographic and Health Surveys (DHS) Program has collected and disseminated open data on population and health through more than 300 surveys from various countries. One of our research interests is to investigate the linkage between urban poverty and health in African countries. Using the raw DHS data, we have computed indicators focusing on how they differ between groups in urban areas. These groups are based on wealth tertiles and consist of the urban poor, the urban middle and the urban rich.

Following the analysis we have developed the Urban Population and Health Data Visualization Platform which is an interactive web application using Shiny. Online deployment of the platform through the APHRC website is underway and we believe it will assist policymakers and researchers to perform data explorations and gather actionable insights. By sharing the code through github we hope that it will contribute towards promoting the adoption of R particularly by universities and researchers in Africa as an alternative to costly proprietary statistical software.

The platform showcases the power of R and is developed using R and various R packages, including shiny, ggplot2, googleVis, rCharts and DT for graphics, and dplyr for data manipulation.
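A small sketch of the indicator construction described above, on made-up household records rather than DHS microdata: wealth tertiles within each country define the urban poor, middle and rich groups, and an indicator is summarized by group.

    library(dplyr)
    library(ggplot2)
    set.seed(1)
    hh <- data.frame(country = sample(c("Kenya", "Nigeria"), 2000, replace = TRUE),
                     wealth  = rlnorm(2000),
                     stunted = rbinom(2000, 1, 0.25))
    hh %>%
      group_by(country) %>%
      mutate(group = c("urban poor", "urban middle", "urban rich")[ntile(wealth, 3)]) %>%
      group_by(country, group) %>%
      summarise(stunting_rate = mean(stunted)) %>%
      ggplot(aes(group, stunting_rate, fill = country)) +
      geom_col(position = "dodge")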

Speakers

Amos Mbugua Thairu

African Population and Health Research Center (APHRC)


Tuesday June 28, 2016 2:30pm - 3:30pm PDT
Sponsor Pavilion

2:30pm PDT

Visualizations and Machine Learning in R with Tessera and Shiny
Poster #19

In a divide and recombine (D&R) paradigm, the Tessera tool suite of packages (https://tessera.io), developed at Pacific Northwest National Laboratory, presents a method for dynamic and flexible exploratory data analysis and visualization. At the front end of Tessera, analysts program in the R programming language, while the back end utilizes a distributed parallel computational environment. Using these tools, we have created an interactive display where users can explore visualizations and statistics on a large dataset from the National Football League (NFL). These visualizations allow any user to interact with the data in meaningful ways, leading to an in-depth analysis of the data through general summary statistics as well as insights into fine-grained information. In addition, we have incorporated an unsupervised machine learning scheme utilizing an interactive R Shiny application that predicts positional rankings for NFL players. We have showcased these tools using a variety of available data from the NFL in order to make the displays easily interpretable to a wide audience. Our results, fused into an interactive display, illustrate Tessera's efficient exploratory data analysis capabilities and provide examples of the straightforward programming interface.

Speakers

Sarah Reehl

Pacific Northwest National Lab


Tuesday June 28, 2016 2:30pm - 3:30pm PDT
Sponsor Pavilion

2:30pm PDT

Writing a dplyr backend to support out-of-memory data for Microsoft R Server
Poster #10

Over the last two years, the dplyr package has become very popular in the R community for the way it streamlines and simplifies many common data manipulation tasks. A feature of dplyr is that it’s extensible; by defining new methods, one can make it work with data sources other than those it supports natively. The dplyrXdf package is a backend that extends dplyr functionality to Microsoft R Server’s xdf files, which are a way of overcoming R’s in-memory limitations. dplyrXdf supports all the major dplyr verbs, pipeline notation, and provides some additional features to make working with xdfs easier. In this talk, I’ll share my experiences writing a new back-end for dplyr, and demonstrate how to use dplyr and dplyrXdf to carry out data wrangling tasks on large datasets that exceed the available memory.
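A minimal sketch of the workflow (mtcars stands in for a genuinely large dataset; dplyrXdf and RevoScaleR require Microsoft R Server or R Client): write the data to an .xdf file and use ordinary dplyr verbs on it.

    library(dplyrXdf)   # loads dplyr and RevoScaleR
    mtc_xdf <- rxDataStep(mtcars, outFile = "mtcars.xdf", overwrite = TRUE)
    smry <- mtc_xdf %>%
      filter(hp > 100) %>%
      group_by(cyl) %>%
      summarise(avg_mpg = mean(mpg))
    rxDataStep(smry)    # bring the (small) summary back into memory as a data frame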

Speakers

Hong Ooi

Microsoft


Tuesday June 28, 2016 2:30pm - 3:30pm PDT
Sponsor Pavilion
 