Keynote #2 |
09:00–10:00 |
Penn 1 |
R in the AI Era: Leveraging Modern Technologies in Practice
R is popular in part due to its extensive package ecosystem, which allows for the incorporation of new technologies and statistical methods. The language has been designed explicitly to focus on data, making it easy for users to apply diverse tools and methods in the course of data analysis. Although R itself is more than two decades old, the combination of those two major elements enables it to adapt to tools and techniques quickly, benefiting both researchers and data analysts. In this talk, we will illustrate the use of R on real examples with current technologies, moving from the Big Data era of distributed data analysis to the AI era of fitting and evaluating attention models and leveraging large language models for both analysis and retrieval.
Date and time: Sat, Aug 9, 2025 - 09:00–10:00
Author(s): Simon Urbanek (University of Auckland + R Core)
Keyword(s): NA
Video recording available after conference: ✅ |
Simon Urbanek (University of Auckland + R Core) |
Data visualization |
10:30–12:00 |
Penn 1 |
From #EconTwitter to the White House: Real-Time Economic Data with R
It’s not just financial markets: policy and economics reporters, commentators, and public officials all rely on real-time analysis of leading economic data the moment it becomes available. Minutes after the release of jobs numbers, inflation rates, or GDP figures, policymakers, journalists, and commentators dive into real-time interpretation and visualization. In this high-speed environment, the right tools are essential, and R stands out as particularly powerful. Join Mike Konczal as he shares his firsthand experiences using R in real time following data releases to create viral graphics on #EconTwitter, prepare quotes for reporters and materials for media appearances, and even coordinate analysis at the White House, where he served covering economic data for the National Economic Council. You’ll learn the process, from how to access and manipulate government economic data to making your own economic work clear and accessible to the broader public.
Date and time: Sat, Aug 9, 2025 - 10:30–12:00
Author(s): Mike Konczal (Economic Security Project)
Keyword(s): economics, politics, finance, macroeconomics, public communications
Video recording available after conference: ✅ |
Mike Konczal (Economic Security Project) |
10:30–12:00 |
Penn 1 |
Visualising Uncertainty with ggdibbler
Adding uncertainty representation to a data visualisation can aid decision-making. There is an existing wealth of software designed to visualise uncertainty as a distribution or probability. These visualisations are excellent for helping understand the uncertainty in our data, but they may not be effective at incorporating uncertainty to prevent false conclusions. Successfully preventing false conclusions requires us to communicate the estimate and its error as a single “validity of signal” variable, and doing so proves difficult with current methods. In this talk, we introduce ggdibbler, a ggplot2 extension that makes it easier to visualise uncertainty in plots for the purpose of preventing these “false signals”. We illustrate how ggdibbler can be seamlessly integrated into existing visualisation workflows and highlight the effect of these changes by showing the alternative visualisations ggdibbler produces for a choropleth map.
Date and time: Sat, Aug 9, 2025 - 10:30–12:00
Author(s): Harriet Mason (Monash University); Dianne Cook (Monash University, Australia), Sarah Goodwin (Monash University), Susan Vanderplas (University of Nebraska - Lincoln)
Keyword(s): uncertainty, data visualisation, ggplot, r package
Video recording available after conference: ✅ |
Harriet Mason (Monash University) |
10:30–12:00 |
Penn 1 |
Visualizing time with ggtime’s grammar of temporal graphics
While several commonly used plots exist for visualizing time series, little work has been done to formalize them into a unified grammar of temporal graphics. Re-expressing traditional time series graphics such as time plots and seasonal plots with grammatical elements supports deeper customization options. Composable grammatical elements provide the flexibility needed to easily visualize multiple seasonality, cycles, and other complex temporal patterns. These modular elements can be composed together to create familiar time series graphics, and also recombined to create new informative plots. The ggtime package extends the ggplot2 ecosystem with new grammar elements and plot helpers for visualising time series data. These additions leverage calendar structures to visually align time points across different granularities and timezones, warp time to standardize irregular durations, and wrap time into compact calendar layouts. In this talk, I will introduce ggtime and demonstrate how its grammar of temporal graphics enables a flexible visualization of time series patterns.
Date and time: Sat, Aug 9, 2025 - 10:30–12:00
Author(s): Mitchell O’Hara-Wild (Monash University); Cynthia Huang (Monash University)
Keyword(s): grammar of graphics, time series, calendars, package design, ggplot2 extension
Video recording available after conference: ✅ |
Mitchell O’Hara-Wild (Monash University) |
10:30–12:00 |
Penn 1 |
tinyplot: convenient and customizable base R plots
The {tinyplot} package (https://grantmcdermott.com/tinyplot/) provides a lightweight extension of the base R graphics system. It aims to pair the concise syntax and flexibility of base R plotting with the convenience features pioneered by newer ({grid}-based) visualization packages like {ggplot2} and {lattice}. This includes the ability to plot grouped data with automatic legends and/or facets, advanced visualization types, and easy customization via ready-made themes. This talk will provide an introduction to {tinyplot} in the form of various plotting examples, describe its motivating use cases, and contrast its advantages (and disadvantages) relative to other R visualization libraries. The package is available on CRAN. An illustrative code sketch follows this listing.
Date and time: Sat, Aug 9, 2025 - 10:30–12:00
Author(s): Grant McDermott (Amazon); Vincent Arel-Bundock (Université de Montréal), Achim Zeileis (Universität Innsbruck)
Keyword(s): data viz, base graphics
Video recording available after conference: ✅ |
Grant McDermott (Amazon) |
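To give a flavor of the syntax, here is a minimal sketch based on {tinyplot}'s documented formula interface; the facet shorthand is assumed from the package documentation and may differ across versions.
```r
library(tinyplot)

# Scatter plot of iris measurements, grouped by species.
# The `| Species` term splits the data by group and adds an automatic legend.
tinyplot(Sepal.Length ~ Petal.Length | Species, data = iris)

# The same plot split into per-group facets instead of a single panel.
tinyplot(Sepal.Length ~ Petal.Length | Species, data = iris, facet = "by")
```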
Modeling 1 |
10:30–12:00 |
Penn 2 |
Adding new algorithms to {tidyclust}
The {tidyclust} package, released in 2022, brings unsupervised learning to the {tidymodels} framework. This talk will share an overview of the process by which new models and algorithms are added to the {tidyclust} collection, based on recent work adding five new models for clustering and data mining (DBSCAN, GMM, BIRCH, itemset mining, and association rules). We will discuss in depth the complications - programmatic, algorithmic, and philosophical - of adapting a supervised learning framework to unsupervised and semi-supervised settings. For example, what does it mean to tune a parameter in the absence of validating prediction metrics? How should row-based clustering be processed differently from column-based clustering? This talk is aimed at R users and developers who want to think deeply about the intersection between code design choices and methodological principles in unsupervised learning, and who want to peek behind the curtain of the {tidyclust} package framework. An illustrative specification sketch follows this listing.
Date and time: Sat, Aug 9, 2025 - 10:30–12:00
Author(s): Kelly Bodwin (California Polytechnic State University)
Keyword(s): tidymodels, tidyclust, unsupervised learning, clustering, package development
Video recording available after conference: ✅ |
Kelly Bodwin (California Polytechnic State University) |
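For context on the specification framework the new algorithms plug into, here is a sketch of {tidyclust}'s existing k-means interface; the functions shown are documented, but treat the details as illustrative.
```r
library(tidyclust)

# Specify a k-means model in tidymodels style...
km_spec <- k_means(num_clusters = 3) |>
  set_engine("stats")

# ...fit it to numeric data (no outcome variable in unsupervised learning)...
km_fit <- fit(km_spec, ~ ., data = mtcars)

# ...and extract cluster assignments as a tidy tibble.
extract_cluster_assignment(km_fit)
```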
10:30–12:00 |
Penn 2 |
Modeling Eviction Trends in Virginia
Virginia is home to 5 of the top 10 cities in the country with the highest rates of eviction. Using civil court records, we are able to analyze the behavior of landlords, so that we can hold those in power accountable to make effective and just change. Where do landlords engage in more eviction actions? What characteristics of renters or landlords increase the practice of serial filing? Using administrative data – information collected by government and agencies in the implementation of public programs – we are able to evaluate systems and promote more just outcomes. Working with the Civil Court Data Initiative of Legal Services Corporation, we use data collected from civil court records in Virginia to analyze the behavior of landlords. Expanding on our Virginia Evictors Catalog, we use data on court evictions to build additional data tools to support the work of legal and housing advocates and model key eviction outcomes to contribute to our understanding of landlord behavior. First, we visualized eviction activity across the state in an interactive Shiny app to address questions and needs of organizations providing legal, policy, and community advocacy. In addition, we estimated landlord actions – eviction filings and serial filings – as a function of community and landlord characteristics. Using a series of mixed-effects models, with data aggregated to zip codes nested in counties, we estimated the impact of community characteristics and landlord attributes on the likelihood of eviction filings. Participants will walk away with a better understanding of what influences landlord behavior, and will have a framework for investigating the practice in their own communities. A model-structure sketch follows this listing.
Date and time: Sat, Aug 9, 2025 - 10:30–12:00
Author(s): Michele Claibourn (Center for Community Partnerships), Samantha Toet (Center for Community Partnerships)
Keyword(s): shiny, data visualization, mixed-effects modeling, geography, social science
Video recording available after conference: ✅ |
Michele Claibourn (Center for Community Partnerships) Samantha Toet (Center for Community Partnerships) |
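The abstract describes mixed-effects models of filing counts with zip codes nested in counties. A minimal sketch of that structure with {lme4} follows; the data frame and variable names are hypothetical, not the project's actual schema.
```r
library(lme4)

# Poisson mixed-effects model of eviction filing counts.
# `(1 | county/zip)` gives random intercepts for counties and
# for zip codes nested within counties.
m <- glmer(
  filings ~ renter_share + median_rent + corporate_landlord +
    offset(log(rental_units)) + (1 | county/zip),
  family = poisson,
  data = evictions   # hypothetical zip-level aggregated data
)
summary(m)
```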
10:30–12:00 |
Penn 2 |
Predictive Modeling with Missing Data
Most predictive modeling strategies require there to be no missing data for model estimation. When data are missing, there are generally two strategies: 1.) exclude the variables (columns) or observations (rows) with missing data; or 2.) impute the missing data. However, data is often missing in systematic ways. Excluding data from training ignores potentially predictive information, and for many imputation procedures the missing completely at random (MCAR) assumption is violated. The medley package implements a solution to modeling when there are systematic patterns of missingness. A working example of predicting student retention from a larger study of the Diagnostic Assessment and Achievement of College Skills (DAACS) will be explored. In this study, demographic data was collected at enrollment from all students, and then students completed diagnostic assessments in self-regulated learning (SRL), writing, mathematics, and reading during their first few weeks of the semester. Although all students were expected to complete DAACS, there were no consequences, and therefore a large percentage of students completed none or only some of the assessments. The resulting dataset has three predominant response patterns: 1.) students who completed all four assessments, 2.) students who completed only the SRL assessment, and 3.) students who did not complete any of the assessments. The goal of the medley algorithm is to take advantage of missing data patterns. For this example, the medley algorithm trained three predictive models: 1.) demographics plus all four assessments, 2.) demographics plus the SRL assessment, and 3.) demographics only. For both training and prediction, the model used for each student is based upon what data is available. That is, if a student only completed SRL, model 2 would be used. The medley algorithm can be used with most statistical models. For this study, both logistic regression and random forest are used. The accuracy of the medley algorithm was 3.5% better than using only the complete data and 3.1% better than using a dataset where missing data was imputed using the mice package. The medley package provides an approach for predictive modeling using the same training and prediction framework R users are accustomed to using. There are numerous parameters that can be modified, including which underlying statistical models are used for training. Additional diagnostic functions are available to explore missing data patterns. An illustrative sketch of the pattern-based idea follows this listing.
Date and time: Sat, Aug 9, 2025 - 10:30–12:00
Author(s): Jason Bryer (City University of New York)
Keyword(s): predictive modeling, r package
Video recording available after conference: ✅ |
Jason Bryer (City University of New York) |
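The core idea, fitting one model per observed missing-data pattern and routing each case to the model that matches its available variables, can be sketched in a few lines of base R. This illustrates the strategy only; it is not {medley}'s actual interface, and all object names are hypothetical.
```r
# Illustrative sketch of pattern-based modeling (not the {medley} API).
# `dat` is a hypothetical data frame with outcome `retained`,
# always-observed demographics, and assessment scores missing in blocks.
assess <- c("srl", "writing", "math", "reading")

# 1. Identify each row's missingness pattern across the assessments.
pattern <- apply(!is.na(dat[assess]), 1, paste, collapse = "")

# 2. Fit one logistic regression per pattern, using only observed columns.
models <- lapply(split(dat, pattern), function(d) {
  avail <- assess[colSums(is.na(d[assess])) == 0]
  glm(reformulate(c("age", "gender", avail), response = "retained"),
      family = binomial, data = d)
})

# 3. Predict each case with the model matching its own pattern.
pred <- unsplit(
  Map(function(m, d) predict(m, d, type = "response"),
      models, split(dat, pattern)),
  pattern
)
```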
10:30–12:00 |
Penn 2 |
jarbes: an R package for Bayesian parametric and nonparametric bias correction in meta-analysis
Meta-analysis methods help researchers answer questions that require combining statistical results across several studies. Very often, the only available studies are of different types and of varied quality. Therefore, when we combine disparate evidence at face value, we are not only combining results of interest but also potential biases that might threaten the quality of the results. Consequently, the results of the meta-analysis could be misleading. This work presents the R package jarbes, “Just a Rather Bayesian Evidence Synthesis.” This package has been designed explicitly for Bayesian evidence synthesis and meta-analysis. It implements a family of Bayesian parametric and nonparametric models for meta-analysis that account for multiple biases. A model in jarbes is built upon two submodels: one that contains the parameters of interest (e.g., a pooled mean across studies) and another that accounts for biases. The biases submodel addresses hidden factors that may distort study results (e.g., selection bias, dilution bias, reporting bias) and are not directly observable. This model-building strategy allows the model of bias to correct the meta-analysis affected by biased evidence. We present two real examples of applying the Bayesian nonparametric modeling functionality of jarbes. The first combines studies of different types and quality, and the second shows the effect of bias correction in nonparametric meta-regression. References: Verde, P. E. (2024), “jarbes: An R Package for Bayesian Evidence Synthesis.” Version 2.2.3. https://CRAN.R-project.org/package=jarbes; Verde, P. E. and Rosner, G. L. (2025), A Bias-Corrected Bayesian Nonparametric Model for Combining Studies With Varying Quality in Meta-Analysis. Biometrical Journal, 67: e70034. https://doi.org/10.1002/bimj.70034
Date and time: Sat, Aug 9, 2025 - 10:30–12:00
Author(s): Pablo Verde (University of Dusseldorf)
Keyword(s): meta-analysis, bayesian nonparametrics, bias-correction, evidence synthesis
Video recording available after conference: ✅ |
Pablo Verde (University of Dusseldorf) |
Case studies |
10:30–12:00 |
Penn Garden |
From Copy-Paste Chaos to Reproducible Workflows: A Wet Lab Researcher’s Journey into R
As a wet lab researcher, I used to struggle with fragmented data analysis workflows. I was taught: You do your experiments, you get your data, you copy-paste into separate software packages for descriptive statistics, visualisation, and documentation. I was constantly frustrated with data analysis: Change something early in the analysis? Go back and copy-paste. How did I analyse similar data sets previously while working at a different institute? Good luck opening that proprietary file format without that software and the license. Learning R transformed how I approach data, not just by replacing individual tools but reshaping my entire understanding of analysis. Beyond statistics, R introduced me to better data organisation, reproducible analysis, meaningful visualisation, and a community dedicated to improving data analysis and reporting. Working with R taught me more than any course on data analysis ever did. Now I use RMarkdown and Quarto daily to document and report my research. These tools allow me to standardise workflows, making my analyses reproducible and independent of proprietary software that might not be available in all research settings. Beyond improving my own work, these tools have become invaluable for guiding students, e.g. providing example workflows for common assays, and visualisations to help them better understand their data. In my talk, I will share my journey from chaotic spreadsheets to a reproducible, streamlined workflow. I will showcase the specific tools I use and how they have improved my research. Lastly, I will invite other wet lab researchers to discuss how these tools can help address reproducibility challenges in data analyses.
Date and time: Sat, Aug 9, 2025 - 10:30–12:00
Author(s): Anna Jaeschke
Keyword(s): wet lab research, workflow, experimental research
Video recording available after conference: ✅ |
Anna Jaeschke |
10:30–12:00 |
Penn Garden |
Readability: New ways to improve communication at the Central Bank of Chile
This study presents the development of a Shiny application, created entirely within the Central Bank of Chile, to improve the readability of its monetary policy communications. Effective communication is essential for central banks, as it influences expectations and decision-making. However, technical language and complex sentence structures often hinder comprehension. Initially, readability was assessed using the perspicuity index, an adaptation of the Flesch-Kincaid index. However, this method does not identify the specific sources of difficulty, especially in Spanish. To address this, a new theoretical framework was developed, identifying five key complexity dimensions: (1) nominalization, (2) gerunds, (3) depth of dependency, (4) subordinations, and (5) language complexity. Using Natural Language Processing (NLP), the Shiny application detects readability challenges by: 1. Calculating the percentage of sentences with readability issues. 2. Highlighting complex structures within the text. 3. Providing sentence-level breakdowns of readability difficulties. 4. Comparing language complexity against graded dictionaries. Applying this tool to monetary and financial policy reports since 2018 revealed that approximately 30% of the content contains readability challenges. The monetary policy summaries correlate strongly with the perspicuity index, indicating that most readability issues stem from syntactic complexity. In contrast, financial policy summaries show lower correlation, as their difficulty arises from long words and technical terms. Since its first use in December 2022, the application has played a key role in reducing text complexity in official reports. However, an increase in complexity in June 2023, following a change in report authorship, underscores the importance of user adoption in ensuring consistent readability improvements. Ultimately, this initiative highlights the need for tailored readability strategies across different policy instruments. While monetary policy documents benefit from structural simplifications, financial policy texts require a more nuanced approach that considers both syntax and terminology. Additionally, the study demonstrates that institutional willingness to adopt readability tools significantly impacts communication effectiveness. By developing this Shiny application, the Central Bank of Chile has taken a significant step toward improving policy communication, ensuring greater clarity and accessibility for diverse audiences.
Date and time: Sat, Aug 9, 2025 - 10:30–12:00
Author(s): Valentina Cortes Ayala (Central Bank of Chile); Karlla Munoz (Central Bank of Chile)
Keyword(s): shiny, communication, central bank, readability
Video recording available after conference: ✅ |
Valentina Cortes Ayala (Central Bank of Chile) |
10:30–12:00 |
Penn Garden |
Using R to Track, Monitor and Detect Changes in Movement and Diving Patterns of Beaked Whales off Cape Hatteras, NC
Beaked whales can regularly dive to depths over 2,000m and during these dives hold their breath for over an hour. Understanding this physiological feat, as well as how individuals might alter their behavior when confronted with anthropogenic noise in the form of naval sonar, is a daunting task that requires a diverse team of biologists, data scientists, and statisticians. Here we report how we use R as part of a multiyear experiment off Cape Hatteras, NC, where we have monitored the behavior of 117 individual whales across 23 sonar exposures. Using biologging devices that are attached to individual whales, we record data on their acoustic behavior, diving kinematics, and swimming behavior across multiple temporal and spatial scales. Using R, we focus our analysis on records detailing diving data every five minutes for two weeks and coarser movement data for approximately one month. Our workflow includes using structured EDA with bespoke R code to examine patterns before and after exposure; R packages (ctmcmove, walkMI) to fit continuous-time discrete-space models to movement; and R packages (momentuHMM) to fit multi-state hidden Markov models to the dive data. We bring these together with 4D modeled data on sound propagation in the water column. This workflow allows us to parameterize dose-response models within a Bayesian model written in JAGS to quantify how exposure impacts behavior in this family of deep-diving whales. An illustrative hidden Markov model sketch follows this listing.
Date and time: Sat, Aug 9, 2025 - 10:30–12:00
Author(s): Rob Schick (Southall Environmental Associates, Inc.)
Keyword(s): animal movement, diving, dose-response, hierarchical bayes, workflows
Video recording available after conference: ✅ |
Rob Schick (Southall Environmental Associates Inc.) |
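The multi-state hidden Markov modeling step mentioned above typically follows {momentuHMM}'s documented prepData()/fitHMM() workflow; in this sketch the `tracks` data frame and all starting values are placeholders, not the project's actual data or settings.
```r
library(momentuHMM)

# Convert raw tracks (id, x, y columns) into step lengths and turning angles.
hmm_data <- prepData(tracks, type = "UTM", coordNames = c("x", "y"))

# Two-state HMM: gamma-distributed steps, von Mises turning angles.
# Par0 holds rough starting values (state means and sds for steps,
# then a concentration parameter per state for angles).
fit <- fitHMM(
  hmm_data,
  nbStates = 2,
  dist = list(step = "gamma", angle = "vm"),
  Par0 = list(step = c(100, 500, 100, 500), angle = c(1, 5))
)
plotStates(fit)
```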
10:30–12:00 |
Penn Garden |
useR to Analyze Emergency Medical and Trauma Data
Emergency Medical Services (EMS) and trauma centers provide life-saving care in critical moments. To support data-driven quality improvement in these high-stakes environments, the nemsqar and traumar R packages were developed to automate performance metric calculations for EMS and trauma care. This talk introduces nemsqar and traumar, which help researchers, data analysts, and public health professionals efficiently process standardized data and generate actionable insights. The nemsqar package simplifies the implementation of National EMS Quality Alliance (NEMSQA) performance measures. It processes National Emergency Medical Services Information System (NEMSIS) data, automating complex quality metric calculations to reduce errors, save time, and support prehospital care decision-making. The traumar package focuses on in-hospital trauma care, offering functions for risk-adjusted mortality metrics and other trauma quality indicators. Designed for flexibility, it supports multiple data sources and advanced statistical modeling to improve patient outcome assessments. This presentation will showcase real-world applications of both packages, demonstrating how they streamline quality reporting and enhance research efficiency. Attendees will see key functionalities, practical use cases, and integration strategies. Finally, the talk will highlight opportunities for community involvement, including contributions to package development, validation efforts, and feature expansion to meet evolving needs.
Date and time: Sat, Aug 9, 2025 - 10:30–12:00
Author(s): Nicolas Foss (Bureau of Emergency Medical and Trauma Services, Division of Public Health, Iowa Health and Human Services)
Keyword(s): ems, trauma, mortality, quality improvement, healthcare
Video recording available after conference: ✅ |
Nicolas Foss (Bureau of Emergency Medical and Trauma Services Division of Public Health Iowa Health and Human Services) |
Clinical trials |
10:30–12:00 |
Gross 270 |
Identifying Adverse Event Under-Reporting in Clinical Trials: A Statistical Approach
Adverse event (AE) detection is a critical component of clinical trials, yet we know that AE under-reporting is a concern with traditional reporting methods. This project reviews AE under-reporting best practices and introduces a new AI/ML framework for detecting un-reported AEs using R. This effort is being implemented under the PHUSE OpenRBQM project. OpenRBQM is a collaborative effort to create open-source R packages focused on risk-based quality management (RBQM). First, we introduce the {gsm} and {simaerep} packages, which facilitate site- and country-level assessments of AEs. The {gsm} (Good Statistical Monitoring) package provides a standardized framework for calculating Key Risk Indicators (KRIs) across all aspects of RBQM, including AE monitoring. The {simaerep} package, developed by the IMPALA consortium, uses advanced statistical methodologies to simulate AE reporting in clinical trials to detect under-reporting sites. The IMPALA and OpenRBQM teams have collaborated to create the {gsm.simaerep} package for use in the {gsm} framework. Finally, we present a new approach that leverages AI/ML techniques to identify specific missed AEs by analyzing data from other clinical trial domains. Using R, we develop models that detect patterns and highlight anomalies indicative of unreported AEs. By applying these methods to real-world clinical trial datasets, we demonstrate how AI/ML can enhance RBQM efforts. This presentation introduces tools that combine standard RBQM methodologies for evaluating adverse event under-reporting with AI methods for identifying specific missed AEs. Attendees will gain insights into implementing R-based techniques to uncover hidden safety signals in clinical research data. A {simaerep} usage sketch follows this listing.
Date and time: Sat, Aug 9, 2025 - 10:30–12:00
Author(s): Laura Maxwell (Atorus Research), Jeremy Wildfire (Gilead Sciences)
Keyword(s): clinical trials, pattern recognition, simulation, ai/ml, biostatistics
Video recording available after conference: ✅ |
Laura Maxwell (Atorus Research) Jeremy Wildfire (Gilead Sciences) |
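As a taste of the {simaerep} workflow referenced above, here is a compact sketch built on the package's documented demo-data helper; exact argument names may vary by version.
```r
library(simaerep)

# Simulate a demo study: patient-level visits with cumulative AE counts
# per site (columns like study_id, site_number, patnum, visit, n_ae).
df_visit <- sim_test_data_study(
  n_pat = 1000, n_sites = 50,
  frac_site_with_ur = 0.1, ur_rate = 0.5  # 10% of sites under-report half their AEs
)
df_visit$study_id <- "demo"

# Flag sites whose AE reporting is improbably low versus the study average.
aerep <- simaerep(df_visit)
plot(aerep, study = "demo")
```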
10:30–12:00 |
Gross 270 |
Implementing function factories for flexible clinical trial simulations
The R package {simtrial} simulates clinical trial results using fixed or group sequential designs. One of its advantages is that it provides the user with sufficient flexibility to define complex stopping rules specifying when intermediate analyses are to be performed and which tests are to be applied at each of these analyses. However, this flexibility in the design generates complexity when automating the simulations. In order to provide the desired flexibility while implementing a maintainable simulation framework, I applied a function factory strategy. Function factories are functions that return another function. This enables the user to define any arbitrary set of argument values but delay the execution of the function until the simulation is performed. In this presentation, I will provide an overview of function factories and explain how I implemented them in {simtrial}. A minimal illustration of the pattern follows this listing.
Date and time: Sat, Aug 9, 2025 - 10:30–12:00
Author(s): John Blischak
Keyword(s): functional programming, simulations, function factories, clinical trials, group sequential design
Video recording available after conference: ✅ |
John Blischak |
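The pattern itself is plain R. Below is a minimal, self-contained illustration of a function factory for analysis cutoffs; it reflects the style of design the abstract describes, not {simtrial}'s actual function names.
```r
# A function factory: called now with configuration, it returns a
# function to be called later, when the simulated trial data exist.
create_cut <- function(target_events) {
  function(data) {
    # Return the calendar time at which `target_events` events accrue.
    sort(data$event_time[data$event == 1])[target_events]
  }
}

# Define stopping rules up front, before any data are simulated...
cut_interim <- create_cut(target_events = 100)
cut_final   <- create_cut(target_events = 200)

# ...then apply them to each simulated dataset inside the simulation loop.
trial <- data.frame(event_time = rexp(300), event = rbinom(300, 1, 0.8))
c(interim = cut_interim(trial), final = cut_final(trial))
```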
10:30–12:00 |
Gross 270 |
Reproducible integrated processing of a large investigator-initiated, randomized-controlled multicenter clinical trial using Quarto and R
Non-pharmaceutical clinical research often lacks reproducibility in data processing and analysis. In investigator-initiated trials, where financial resources are scarce, medical researchers must handle data management and analysis themselves, often using suboptimal tools. We present here the use case of a large, multicenter randomized-controlled trial in anesthesiology with over 2,500 enrolled patients. Embedded in a single Quarto-based project using tidyverse-style R, we processed the complete dataset from the electronic case report form, from data tidying and analysis through plotting, report drafting, and presentation preparation. Our workflow is fully transparent, reproducible, and adaptive, following approaches demonstrated by Mine Çetinkaya-Rundel at R/Medicine and Joshua Cook at posit::conf in 2024. To our knowledge, this represents the largest clinical trial managed using this methodology. This work demonstrates that accessible tools for tidy and reproducible scientific data processing are available even to researchers who are not native data scientists.
Date and time: Sat, Aug 9, 2025 - 10:30–12:00
Author(s): Benedikt Schmid (University Hospital Würzburg, Department of Anaesthesiology, Intensive Care, Emergency and Pain Medicine, Würzburg, Germany); Robert Werdehausen (Department of Anesthesiology and Intensive Care Medicine, University Hospital Leipzig, Germany), Christopher Neuhaus (Department of Anesthesiology, University Hospital Heidelberg, Heidelberg, Germany), Linda Grüßer (Department of Anaesthesiology, RWTH Aachen University Hospital, Germany), Peter Paal (Department of Anaesthesiology and Intensive Care Medicine, Hospitallers Brothers Hospital, Paracelsus Medical University, Salzburg, Austria), Patrick Meybohm (University Hospital Würzburg, Department of Anaesthesiology, Intensive Care, Emergency and Pain Medicine, Würzburg, Germany), Peter Kranke (University Hospital Würzburg, Department of Anaesthesiology, Intensive Care, Emergency and Pain Medicine, Würzburg, Germany), Gregor Massoth (Department of Anaesthesiology and Intensive Care Medicine, University Hospital Bonn, Germany)
Keyword(s): medical research, reproducible workflow, randomized controlled clinical trial
Video recording available after conference: ✅ |
Benedikt Schmid (University Hospital Würzburg Department of Anaesthesiology Intensive Care Emergency and Pain Medicine Würzburg Germany) |
10:30–12:00 |
Gross 270 |
Retrospective clinical data harmonisation reporting using R and Quarto
There has been an increase in projects involving data pooling from multiple sources, because combining data is an economical way to increase the statistical power of an analysis of a rare outcome that could not be addressed using data from a single project. Prior to statistical or machine learning analysis, a data steward must be able to sort through these heterogeneous inputs and document the process in a coherent way for different stakeholders. Despite its importance in the big data environment, there are limited resources on how to document this process in a structured, efficient, and robust way. This presentation will provide an overview of how I create clinical data harmonisation reports using several R packages and a Quarto book project. A small preview can be found at https://github.com/JauntyJJS/harmonisation. Attendees will learn the basic framework for creating a Quarto book or website to document data harmonisation processes; the basic workflow during harmonisation; how to validate data when writing harmonisation code so that the workflow is robust to changes in the input data; ways to show higher management (with limited programming experience) that the code works (it is not enough to say that unit tests were used); and how to write an R script that creates many harmonisation reports (one technical report for each pooled cohort, plus one report summarising the harmonisation process across all cohorts). A rendering sketch follows this listing.
Date and time: Sat, Aug 9, 2025 - 10:30–12:00
Author(s): Jeremy Selva (National Heart Centre Singapore)
Keyword(s): data harmonisation, data validation, report making automation, quarto
Video recording available after conference: ✅ |
Jeremy Selva (National Heart Centre Singapore) |
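One report per cohort is a natural fit for parameterized Quarto rendering from R. A sketch using quarto::quarto_render() follows; the file name and parameter are invented for illustration.
```r
library(quarto)

cohorts <- c("cohort_a", "cohort_b", "cohort_c")  # hypothetical cohort IDs

# Render one technical harmonisation report per cohort from a single
# parameterized template (report.qmd declares a `cohort` parameter).
for (id in cohorts) {
  quarto_render(
    input = "report.qmd",
    output_file = paste0("harmonisation_", id, ".html"),
    execute_params = list(cohort = id)
  )
}
```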
Package lifecycle |
13:00–14:15 |
Penn 1 |
ARcenso: A Package Born from Chaos, Powered by Community
Historical census data in Argentina is scattered across multiple formats: books, spreadsheets, PDFs, and REDATAM, without a standardized structure. This lack of organization complicates analysis, requiring manual cleaning and integration of records before working with the data. As R users, we recognized an opportunity to transform this chaos into a meaningful solution, not only for personal use but for all R users. That is how {arcenso} was born: a way to provide structured, ready-to-use census data, eliminating repetitive pre-processing and allowing users to focus on analysis with harmonized datasets. The goal is to make national census data in Argentina more accessible. Through the rOpenSci Champions program, the original idea turned into a functional R package. Thanks to the support of the R community, we learned how to structure the package, document datasets, and ensure reproducibility. This journey demonstrated the value of community learning, and those principles are embedded in {arcenso}, making it accessible and user-friendly. {arcenso} is currently in development and has released its first dataset along with three core functions. However, this is just the beginning: there are more datasets to integrate, additional features to develop, and improvements to be made to enhance the user experience. In this talk, we will introduce the package for users from both public and private sectors, including academics and researchers facing data challenges. We will explain the framework used for turning problems into solutions, highlight tools and community resources, and try to inspire others to tackle their own data challenges.
Date and time: Sat, Aug 9, 2025 - 13:00–14:15
Author(s): Emanuel Ciardullo, Andrea Gomez Vargas
Keyword(s): package, community, workflow, census, official statistics
Video recording available after conference: ✅ |
Emanuel Ciardullo, Andrea Gomez Vargas |
13:00–14:15 |
Penn 1 |
Curating a Community of Packages: Lessons from a Decade of rOpenSci Peer Review
Themed collections of packages have long been a common feature of the R ecosystem, from the CRAN Task Views to today’s “universes”. These range from tightly integrated toolboxes engineered by a single team, to journal-like repositories of packages passing common standards, or loose collections of packages organized around communities, themes, or development approaches. This talk will share insights for managing package collections, and their communities of developers, gleaned from a decade of rOpenSci’s software peer-review initiatives. I will cover best practices for governing and managing collections, determining scope and standards for packages, onboarding and offboarding, and supporting continuing maintenance. Finally, I will discuss the essential role of mentorship and inclusive practices that support a diverse community of package maintainers and contributors.
Date and time: Sat, Aug 9, 2025 - 13:00–14:15
Author(s): Noam Ross (rOpenSci)
Keyword(s): standards, interoperability, maintenance, mentorship, community
Video recording available after conference: ✅ |
Noam Ross (rOpenSci) |
13:00–14:15 |
Penn 1 |
rtables: Challenges, Advances and Lessons Learned Going Into Production At J&J
rtables is an open-source framework for the creation of complex, multi-faceted tables, developed by the author while at Roche. Here, we will discuss the process of adopting rtables at J&J as the lynchpin of a larger transition to R and open-source tools for the creation of production outputs in clinical trials. In particular, we will touch on three aspects: development of novel features in rtables required to meet J&J’s specific needs, development of additional tooling around rtables for use by the company’s SPAs, and lessons learned during the process. An illustrative layout sketch follows this listing.
Date and time: Sat, Aug 9, 2025 - 13:00–14:15
Author(s): Gabe Becker (Independent)
Keyword(s): clinical trials, tables, tlg, visualization
Video recording available after conference: ✅ |
Gabe Becker (Independent) |
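For attendees new to rtables, the layout-based API at the core of the framework looks like the following minimal sketch, using standard documented functions and toy data.
```r
library(rtables)

# Toy analysis dataset with a treatment arm and a numeric variable.
adsl <- data.frame(
  ARM = rep(c("Drug", "Placebo"), each = 50),
  AGE = c(rnorm(50, 45, 8), rnorm(50, 47, 9))
)

# Build a layout, then realize it against the data: one column per arm,
# summary statistics of AGE in the rows.
lyt <- basic_table() |>
  split_cols_by("ARM") |>
  analyze("AGE", afun = function(x) {
    in_rows(
      "Mean (SD)" = rcell(c(mean(x), sd(x)), format = "xx.x (xx.x)"),
      "Median"    = rcell(median(x), format = "xx.x")
    )
  })

build_table(lyt, adsl)
```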
Teaching 1 |
13:00–14:15 |
Penn 2 |
Coursework RStudio Infrastructure at scale: Duke and NCShare
Two case studies covering lessons learned running large-scale RStudio infrastructure for coursework at Duke University and NCShare (https://ncshare.org/), an NSF-funded consortium to advance scientific computing and innovate STEM education at North Carolina’s historically marginalized institutions. Each semester, Duke provides containerized RStudio instances for over 1,200 students. Similar infrastructure is used in the NCShare consortium to provide advanced computing environments to less-resourced higher-ed institutions. This talk covers best practices and pitfalls for automation, packaging, management, and support of RStudio, and how cross-institutional collaboration can make these environments more widely available.
Date and time: Sat, Aug 9, 2025 - 13:00–14:15
Author(s): Mark McCahill (Duke University)
Keyword(s): educational consortia, coursework infrastructure, automation
Video recording available after conference: ✅ |
Mark McCahill (Duke University) |
13:00–14:15 |
Penn 2 |
Enhancing R Instruction: Adapting Workshops for Time-Constrained Learners
The Data & Visualization Services Department at North Carolina State University Libraries offers data science support to faculty, staff, students, and the broader community. This support includes data science consulting, workshops and instruction on data science and programming topics, as well as specialized computer lab spaces equipped with hardware and software for data and visualization work. Among these, our introductory workshops are particularly popular. Our Intro to R workshop series consists of three sessions covering basic programming, data cleaning, and data visualization. Participants come from diverse academic and professional backgrounds, with varying levels of coding experience—from no prior exposure to limited familiarity with R or other programming languages. Additionally, they must balance academic, professional, and personal commitments, making it essential to provide efficient yet comprehensive instruction. We recently refined our curriculum to address these challenges in response to direct and observed student feedback. This presentation will explore the specific curriculum changes, the challenges they aim to resolve, and the role of instructor-led workshops in supporting early-stage R learners.
Date and time: Sat, Aug 9, 2025 - 13:00–14:15
Author(s): Selene Schmittling (North Carolina State University); Shannon Ricci (North Carolina State University), Alp Tezbasaran
Keyword(s): instruction, curriculum development, learner diversity, workshops
Video recording available after conference: ✅ |
Selene Schmittling (North Carolina State University) |
13:00–14:15 |
Penn 2 |
Rhapsody in R: Exploring Probability Through Music
Probability is often introduced with applications from the natural and social sciences, but its role in the arts is less frequently explored. One example is stochastic music, pioneered by avant-garde 20th-century composers like Iannis Xenakis (https://youtu.be/nvH2KYYJg-o?feature=shared), who used probabilistic models and computer simulations to generate musical structures. While the aesthetic appeal of such music is subjective, its mathematical foundations offer a compelling way to engage students with probability and randomness. This talk presents an assignment for an introductory probability course where students compose their own stochastic music using R. By applying their knowledge of probability distributions and computer simulation, they explore randomization in pitch, rhythm, meter, instrumentation, and harmony, observing emergent patterns along the way. The R package gm (https://cran.r-project.org/web/packages/gm/index.html) by Renfei Mao provides a user-friendly framework for layering musical elements, while integration with MuseScore allows students to generate sheet music and MIDI playback. This activity not only reinforces key concepts, but also offers students a fun and creative way to apply probability, engaging a different part of their brain than traditional scientific applications. A short compositional sketch follows this listing.
Date and time: Sat, Aug 9, 2025 - 13:00–14:15
Author(s): John Zito (Duke University)
Keyword(s): teaching, probability, music
Video recording available after conference: ✅ |
John Zito (Duke University) |
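A hedged sketch of the kind of student composition described above, using {gm}'s Music()/Meter()/Line() constructors as documented; the sampling scheme is invented for illustration, and rendering assumes a working MuseScore installation.
```r
library(gm)

set.seed(2025)

# Draw eight pitches from a C-major scale with unequal probabilities,
# and random durations: a tiny stochastic composition.
pitches   <- sample(c("C4", "D4", "E4", "G4", "A4"), 8,
                    replace = TRUE, prob = c(.3, .15, .25, .2, .1))
durations <- sample(c(0.5, 1, 2), 8, replace = TRUE)

m <- Music() +
  Meter(4, 4) +
  Line(pitches, durations)

# Renders sheet music (and playback) via MuseScore.
show(m)
```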
Web APIs |
13:00–14:15 |
Penn Garden |
Automating CDISC Metadata Retrieval: An R-Based Approach Using the CDISC Library API
The CDISC Library API provides a programmatic gateway to clinical data standards, including SDTM and ADaM domains, variables, and controlled terminology. This presentation showcases an R-based approach to integrating the API for automated retrieval and structuring of CDISC metadata and controlled terminology, eliminating the need for manual extraction from PDFs or Excel files. Leveraging R packages such as shiny, httr2, jsonlite, and tidyverse, we demonstrate a reproducible workflow that queries the /mdr/sdtmig/{version} and /mdr/ct/{version} endpoints, parses JSON responses into structured data frames, and presents the results in a web application. Key topics include authentication via API keys, handling nested JSON structures, and ensuring seamless interaction with CDISC’s evolving standards. This approach enhances efficiency, reduces manual effort, and improves traceability in clinical data workflows. An illustrative request sketch follows this listing.
Date and time: Sat, Aug 9, 2025 - 13:00–14:15
Author(s): Jagadish Katam
Keyword(s): cdisc, sdtm, adam, controlled terminology, shiny, api
Video recording available after conference: ✅ |
Jagadish Katam |
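The request/parse cycle described above looks roughly like the following with {httr2}; the endpoint path and `api-key` header follow CDISC Library conventions, but the JSON structure accessed at the end is illustrative.
```r
library(httr2)

# Query the SDTMIG v3.4 product endpoint of the CDISC Library API.
resp <- request("https://library.cdisc.org/api") |>
  req_url_path_append("mdr", "sdtmig", "3-4") |>
  req_headers(`api-key` = Sys.getenv("CDISC_API_KEY")) |>
  req_perform()

# Parse the nested JSON body; the exact structure below is illustrative.
body <- resp_body_json(resp)
vapply(body$classes, function(x) x$name, character(1))
```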
13:00–14:15 |
Penn Garden |
Web APIs for useRs: Getting data from websites, databases, and LLMs
Many websites and services provide APIs, and useRs can take advantage of them to get data, perform database operations, and talk to Large Language Models (LLMs). The httr2 package, with its support for sequential and parallel requests, is a great tool for efficient API interactions. I will demonstrate its use through two real-world examples. First, I will introduce the frstore package, which I developed to interact with Google Firestore, a NoSQL database. While client libraries exist for Python and JavaScript, R users were left out—until now. frstore enables create, read, update, and delete (CRUD) operations using httr2, making it a powerful tool for R users working with Firestore. The second example is a Shiny app designed to create an immersive storytelling experience. Users provide the first sentence of a children’s story, and the app uses httr2 to interact with multiple APIs. Cloudflare’s Workers AI API is used to send requests to text generation and image generation models. Moreover, Eleven Labs’ API converts text to speech for audiobook-like narration. The results are integrated into a Quarto revealjs slide deck that yields a delightful, interactive storytime experience. This talk is aimed at R users of all levels who want to expand their toolkit for web data access and API interactions. Whether you’re scraping data, working with APIs, or building interactive applications, this session will provide practical examples to enhance your R workflows. A parallel-requests sketch follows this listing.
Date and time: Sat, Aug 9, 2025 - 13:00–14:15
Author(s): Umair Durrani (Presage Group)
Keyword(s): api, httr2, llm, database, shiny
Video recording available after conference: ✅ |
Umair Durrani (Presage Group) |
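One httr2 feature highlighted above is parallel request execution. A minimal sketch with placeholder URLs:
```r
library(httr2)

# Build several independent requests (placeholder endpoints)...
reqs <- lapply(
  c("https://api.example.com/story",
    "https://api.example.com/image",
    "https://api.example.com/audio"),
  request
)

# ...and perform them concurrently instead of one at a time.
resps <- req_perform_parallel(reqs)

# Inspect the status code of each response.
vapply(resps, resp_status, integer(1))
```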
13:00–14:15 |
Penn Garden |
{plumber2}: Streamlining Web API Development in R
Over the past nine years, the R package {plumber} has simplified the creation of web APIs by using annotations over existing R source code with roxygen2-like comments. During this time, the community has gathered valuable insights and identified numerous areas for improvement. To invest in a way forward, a new package called {plumber2} has been created. {plumber2} is designed from the ground up to be highly extensible, enabling developers to easily integrate custom decorators to modify the behavior of their APIs. Furthermore, {plumber2} is built on a modern foundation, leveraging the latest packages associated with the {fiery} framework. This architecture is built upon middleware (the ability to introduce custom logic at specific points within the API’s request-handling process), one of many fine-grained controls over how your API behaves. By incorporating these improvements and embracing a modern framework, {plumber2} offers a sustainable path forward for building web APIs in R. This new approach avoids the need for short-term fixes and ensures that {plumber2} can continue to evolve and adapt to the changing needs of developers. An annotation-style sketch follows this listing.
Date and time: Sat, Aug 9, 2025 - 13:00–14:15
Author(s): Barret Schloerke (Posit, PBC); Thomas Pedersen (Posit, PBC)
Keyword(s): api, plumber2, plumber, package, web api
Video recording available after conference: ✅ |
Barret Schloerke (Posit PBC) |
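For orientation, here is the classic annotation style that {plumber} popularized and that {plumber2} builds on; {plumber2}'s own annotation syntax may differ in detail, so treat this as the established idiom rather than the new package's API.
```r
# api.R: a classic {plumber}-style annotated endpoint
# (the annotation model that {plumber2} refines)

#* Return summary statistics for n random draws
#* @param n number of draws
#* @get /draws
function(n = 100) {
  x <- rnorm(as.integer(n))
  list(n = length(x), mean = mean(x), sd = sd(x))
}

# Launch the API (run from a separate script):
# plumber::pr("api.R") |> plumber::pr_run(port = 8000)
```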
High-dimensional data |
13:00–14:15 |
Gross 270 |
Introducing riemmtan
The statistical analysis of random variables that take values in Riemannian manifolds is a rapidly growing area of research. Its main application is the study of connectomes obtained from brain imaging, which belong to the manifold of symmetric positive definite matrices. A large amount of work has been devoted to addressing a variety of issues, including the development of new metrics, new statistical models, and visualization techniques. Unfortunately, the tools offered by R to handle this type of data have not evolved with the speed necessary to match the momentum of this growing area of the statistical literature. The R packages Riemann and frechet are important steps in that direction, but new tools are necessary to incorporate recent developments. That is why we are introducing riemmtan, a new R package. Its main goal is to offer a high-level interface that abstracts away many day-to-day operations of this kind of analysis. In addition, it allows the user to exploit the growing capabilities of modern computer clusters by making use of parallelism in several parts of its implementation, including the computation of Fréchet means. Finally, it makes use of the object-oriented programming tools in R to make Riemannian metrics self-contained modules, allowing users to easily implement and experiment with new metrics. We hope riemmtan will become the foundation for an ecosystem of tools that allow for efficient and user-friendly analysis of Riemannian manifold-valued data.
Date and time: Sat, Aug 9, 2025 - 13:00–14:15
Author(s): Nicolas Escobar (Indiana University); Jaroslaw Harezlak (Indiana University)
Keyword(s): riemannian manifolds, connectomics, fmri imaging
Video recording available after conference: ✅ |
Nicolas Escobar (Indiana University) |
13:00–14:15 |
Gross 270 |
Machine Learning-Powered Metabolite Identification in R: An Automated Workflow for Identifying Metabolomics Dark Matter
Upwards of 90% of small molecules detected in LC-MS/MS-based untargeted metabolomics are unidentified due to limitations in current analytical techniques. Although this “dark matter” can significantly contribute to disease diagnosis and biomarker discovery, current identification methods are costly and resource-intensive. This study addresses these challenges by developing a computational workflow in R to encode tandem mass spectra into simplified structural fingerprints, which can be predicted and related to known fingerprints in molecular databases. The developed pipeline uses R packages such as RSQLite, SF, rcdk, ChemmineR, caret, sparsepca, rinchi, and rpubchem to improve metabolite identification in untargeted metabolomics. A total of 2,973 mass spectra of known and unknown molecules from an in-house high-resolution LC-MS/MS study were extracted from an SQL database (mzVault) using the RSQLite package. The collected spectra were converted into machine-readable numbers using the rawToHex and readBin functions from the SF package. SMILES representations of known molecules were obtained by querying their names against PubChem using the rpubchem package. The set of 166 Molecular ACCess System (MACCS) fingerprints was computed for known molecules based on their SMILES using the rcdk and ChemmineR packages. In the next step, 166 random forest (RF) models were trained on MS2 spectra of known molecules to model the MACCS fingerprints using the caret package. Before training, spectral data were normalized and subjected to dimensionality reduction using robust sparse principal component analysis (rSPCA) via the sparsepca package. The trained RF models were applied to high-resolution MS2 spectra of unknown molecules to predict their MACCS fingerprints, which were then used for similarity searches in the Human Metabolome Database (HMDB) using the Tanimoto coefficient. Retrieved candidates from HMDB were further refined based on LogP, topological polar surface area (TPSA), molecular mass, and retention time. The workflow was tested on an LC-MS/MS dataset containing 1,071 known and 1,902 unknown compounds. Despite the high dimensionality, rSPCA reduced the data to 25 principal components, preserving 97% of variance. RF models achieved a mean accuracy of 0.87 in 3-fold cross-validation. On average, 4.1±11.31 unique HMDB molecules were listed for each unknown molecule, and the retrieved list was prioritized using a hybrid scoring function. Applying a Tanimoto similarity threshold (>0.7), this workflow identified at least one HMDB match for 1,079 unknowns, improving metabolite identification by 57%. The incorporation of a hybrid scoring system based on Tanimoto similarity and physicochemical properties enhanced candidate ranking and structural elucidation of unknown metabolites. A schematic model-training sketch follows this listing.
Date and time: Sat, Aug 9, 2025 - 13:00–14:15
Author(s): Ahmad Manivarnosfaderani (University of Arkansas for Medical Sciences); Sree V. Chintapalli (University of Arkansas for Medical Sciences), Renny Lan (University of Arkansas for Medical Sciences), Hailemariam Abrha Assress (University of Arkansas for Medical Sciences), Brian D. Piccolo (University of Arkansas for Medical Sciences), Colin Kay (University of Arkansas for Medical Sciences)
Keyword(s): metabolomics, cheminformatics, machine learning, identification
Video recording available after conference: ✅ |
Ahmad Manivarnosfaderani (University of Arkansas for Medical Sciences) |
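To make the fingerprint-modeling step concrete, here is a schematic of one of the 166 per-bit random-forest fits using {caret}; `spec_pcs`, `maccs`, and `unknown_pcs` are hypothetical objects standing in for the reduced spectral features and fingerprint matrix.
```r
library(caret)

# `spec_pcs`: matrix of rSPCA-reduced MS2 features (rows = spectra);
# `maccs`: 0/1 matrix of MACCS fingerprint bits for the known molecules.
ctrl <- trainControl(method = "cv", number = 3)

# Train one random-forest classifier per fingerprint bit (bit 1 shown).
fit_bit1 <- train(
  x = spec_pcs,
  y = factor(maccs[, 1], labels = c("absent", "present")),
  method = "rf",
  trControl = ctrl
)

# Predict bit 1 for unknown spectra reduced with the same rSPCA loadings.
pred_bit1 <- predict(fit_bit1, newdata = unknown_pcs)
```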
13:00–14:15 |
Gross 270 |
Multi-omics Integration with GAUDI: A Novel R Package for Non-linear Dimensionality Reduction and Interpretable Clustering Analysis
Integrating high-dimensional multi-omics data presents significant challenges in computational biology, particularly when handling complex non-linear relationships across diverse biological layers. We present GAUDI (Group Aggregation via UMAP Data Integration), a novel R package that leverages Uniform Manifold Approximation and Projection (UMAP) for the concurrent analysis of multiple omics data types. GAUDI addresses key limitations of existing methods by enabling non-linear integration while maintaining interpretability and mitigating bias from datasets with vastly different dimensionalities. The GAUDI R package implements a straightforward yet powerful workflow: (1) independent UMAP embeddings are applied to each omics dataset, creating standardized representations that preserve dataset-specific structures; (2) these embeddings are concatenated; (3) a second UMAP transformation integrates these embeddings into a unified space; (4) hierarchical density-based clustering identifies sample groups; and (5) feature importance analysis via XGBoost and SHAP values enables biological interpretation. Our benchmarking against six state-of-the-art multi-omics integration methods demonstrates GAUDI’s superior performance across diverse datasets. Using simulated multi-omics data with known ground truth, GAUDI achieved perfect clustering accuracy across all tested scenarios. In cancer datasets from TCGA, GAUDI identified clinically relevant patient subgroups with significant survival differences, particularly in acute myeloid leukemia, where it detected high-risk subgroups missed by other methods. At the single-cell level, GAUDI not only correctly classified cell lines but uniquely identified biologically meaningful substructures within them, confirmed by differential expression and pathway enrichment analyses. When evaluating large-scale functional genomics datasets from the Cancer Dependency Map (DepMap) Project, GAUDI demonstrated superior lineage identification accuracy. In a benchmark integrating gene expression, DNA methylation, miRNA expression, and metabolomics across 258 cancer cell lines, GAUDI achieved the highest score for lineage discrimination, approximately 15% better than the next-best performing method, MOFA+, underscoring its effectiveness with complex, heterogeneous multi-omics data. The GAUDI R package provides a user-friendly interface with extensive documentation, visualization tools, and compatibility with standard bioinformatics workflows. By combining the strengths of non-linear dimensionality reduction with interpretable machine learning approaches, the GAUDI R package offers researchers a powerful new tool for exploring complex relationships across multiple biological data types, potentially revealing novel insights in systems biology, precision medicine, and biomarker discovery. Package: https://github.com/hirscheylab/gaudi Benchmark: https://github.com/hirscheylab/umap_multiomics_integration A workflow sketch follows this listing.
Date and time: Sat, Aug 9, 2025 - 13:00–14:15
Author(s): Pol Castellano Escuder (Heureka Labs)
Keyword(s): multi-omics integration, dimension reduction, clustering, statistical learning, interpretable machine learning, benchmarking
Video recording available after conference: ✅ |
Pol Castellano Escuder (Heureka Labs) |
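The numbered recipe above can be approximated with general-purpose packages. The sketch below mirrors steps 1-4 using {uwot} and {dbscan} on hypothetical matrices; it is not GAUDI's own interface (see the package link for that).
```r
library(uwot)
library(dbscan)

# Hypothetical omics matrices with matched rows (samples).
omics <- list(expr = expr_mat, meth = meth_mat, mirna = mirna_mat)

# (1) Embed each omics layer independently with UMAP, (2) concatenate.
embeddings <- lapply(omics, umap, n_components = 2)
combined <- do.call(cbind, embeddings)

# (3) A second UMAP integrates the concatenated embeddings...
integrated <- umap(combined, n_components = 2)

# (4) ...and hierarchical density-based clustering finds sample groups.
clusters <- hdbscan(integrated, minPts = 10)
table(clusters$cluster)
```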
Pragmatic programmer |
14:45–16:00 |
Penn 1 |
“How did you even think of that???” Techniques to code much faster
This talk presents a very different way of thinking about writing R code, unlike anything I have seen in the R community (or any data science community). It is the method I used to write four R packages: NumericEnsembles, ClassificationEnsembles, LogisticEnsembles and ForecastingEnsembles. The largest part of the code, approximately 15,000 lines, was written in 15 months, and no AI was used in any of the development. The tools are the same set of R tools available to everyone; what differs is the thinking that goes into the code development. Come prepared to see that the methods you use to think through solutions and write code that achieves reproducible results can be improved very significantly by improving your thinking, not necessarily your tools. Several practical examples and live demonstrations will show how you may use these methods in real coding situations, as demonstrated in the development of four packages that automatically build ensembles as part of the analysis process.
Date and time: Sat, Aug 9, 2025 - 14:45–16:00
Author(s): Russ Conte (Owner@dataaip.com)
Keyword(s): code better, efficient coding, fast coding
Video recording available after conference: ✅ |
Russ Conte (Owner@dataaip.com) |
14:45–16:00 |
Penn 1 |
Reusing ‘ggplot2’ code: how to design better plot helper functions
Wrapping ‘ggplot2’ code into plot helper functions is a common way to make multiple versions of a custom plot without copying and pasting the same code over and over again. Helper functions can replace long and complex ‘ggplot2’ code chunks with just a single function call. However, if that single function is not designed carefully, the initial convenience can often turn into frustration. While helper functions can reduce the amount of code needed to remake a complicated plot, they often mask the underlying layered grammar of graphics, complicating further customisation and tweaking of the plot. This talk addresses how to design effective ‘ggplot2’ plot helper functions that maximise reuse convenience whilst preserving access to the elegant flexibility of layered plot composition. By studying existing ‘ggplot2’ extensions for producing calendar plots, we identify a number of common pitfalls, including overly specific function arguments and hidden data manipulations. Then, we discuss how to avoid these pitfalls and retain the benefits of ‘ggplot2’ by: separating data preparation from plotting, utilising list arguments for customisation, and providing transparent documentation. We illustrate these strategies using examples from the design of the ‘ggtilecal’ package, which provides helper functions for plotting calendars using the geom_tile() geometry from ggplot2. A design sketch follows this listing.
Date and time: Sat, Aug 9, 2025 - 14:45–16:00
Author(s): Cynthia Huang (Monash University)
Keyword(s): r package and function design, layered grammar of graphics, data visualisation, ggplot2 extensions
Video recording available after conference: ✅ |
Cynthia Huang (Monash University) |
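Below is a minimal sketch of the design principles this talk describes, separating data preparation from plotting and passing layer options as a list. The helper names are hypothetical illustrations, not the actual ggtilecal API.

```r
library(ggplot2)

# 1. Keep data preparation separate and exported, so users can inspect
#    or modify the intermediate data frame themselves.
#    (weekdays() assumes an English locale here.)
prep_calendar_data <- function(dates) {
  data.frame(
    date = dates,
    week = as.integer(format(dates, "%U")),
    wday = factor(weekdays(dates), levels = c(
      "Monday", "Tuesday", "Wednesday", "Thursday",
      "Friday", "Saturday", "Sunday"))
  )
}

# 2. Accept layer options as a list instead of hard-coding them, and
#    return a ggplot object so users can keep composing layers.
plot_calendar <- function(dates, tile_args = list(colour = "white")) {
  ggplot(prep_calendar_data(dates), aes(wday, week)) +
    do.call(geom_tile, tile_args)
}

# Because the helper returns a ggplot, normal composition still works:
plot_calendar(seq(as.Date("2025-08-01"), as.Date("2025-08-31"), "day")) +
  scale_y_reverse() +
  theme_minimal()
```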
14:45–16:00 |
Penn 1 |
The Language of Data: How R Package Syntax Shapes Analysis and Thought
For most users in data science, analytics, and research, a package’s syntax or API is their primary interface with the software. While R provides a well-defined framework for creating packages that make programming accessible, syntax choices serve as key connection points between users and their data. R packages exhibit a range of syntax styles—from explicit to implicit, verbose to symbolic, and structured to flexible. Drawing on research on language, cognition, and user experience, this talk explores how syntax design in R packages shapes the way we interact with data, approach analysis, and solve complex problems. In this talk, I will examine syntax design in powerful and popular data wrangling software in R (data.table, dplyr, polars, and base R), comparing their approaches and discussing their impact on usability, interpretation, and problem-solving in data workflows. Attendees will leave with an understanding of syntax design, how current leaders in data wrangling design their syntax, and considerations for how these designs can impact user behavior. (A short illustrative sketch follows this listing.)
Date and time: Sat, Aug 9, 2025 - 14:45–16:00
Author(s): Tyson Barrett (Highmark Health)
Keyword(s): data wrangling, programming, syntax, analytics
Video recording available after conference: ✅ |
Tyson Barrett (Highmark Health) |
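As a concrete illustration of the range of styles the talk examines, here is the same group-wise summary written in three of the syntaxes it compares (base R, dplyr, and data.table):

```r
library(dplyr)
library(data.table)

# base R: explicit and function-oriented
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# dplyr: verbose, pipeline-oriented verbs
mtcars |>
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg))

# data.table: compact, symbolic indexing
as.data.table(mtcars)[, .(mean_mpg = mean(mpg)), by = cyl]
```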
Teaching 2 |
14:45–16:00 |
Penn 2 |
Expanding Data Science’s Reach through Interdisciplinarity and the Humanities
What does data science mean for those disciplines that don’t traditionally align themselves with this work? More specifically, how might instructors in the Humanities define — and teach — data science? How can the Humanities use data science to resist academic siloing and promote alignment across disciplines and methodologies? What do traditional data science programs stand to gain from a transdisciplinary understanding and application of data science? This presentation explores three courses developed by English instructors at North Carolina State University’s Data Science and AI Academy: Data Visualization, Introduction to AI Ethics, and Storytelling with Data and AI. The presenters will explain how their Humanities backgrounds help them create courses that extend data science beyond traditional applications. They’ll share examples of assignments that incorporate their disciplinary expertise while integrating core data science principles from the ADAPT model. Furthermore, by offering alternative perspectives on data science, they create “gateway” courses that attract students who might not otherwise enter the field. The presenters will also discuss how these courses achieve interdisciplinarity both through content and the student participants. The presenters will demonstrate how the three representative courses complement the traditional data science curriculum (coding) by broadening the field’s reach in two ways: 1. enhancing the overall educational experience for students and 2. creating access points for faculty who don’t typically identify with data science, thus attracting instructors without traditional data science backgrounds. The presentation will conclude with reflections on lessons learned, challenges encountered, and strategies for institutions seeking to implement similar cross-disciplinary approaches. The presenters will share preliminary assessment data demonstrating student outcomes and discuss implications for the future of data science education across diverse academic contexts.
Date and time: Sat, Aug 9, 2025 - 14:45–16:00
Author(s): Kelsey Dufresne (North Carolina State University), James Harr (Christian Brothers University), Christin Phelps (North Carolina State University)
Keyword(s): interdisciplinary, data science, outreach, data visualization, ai
Video recording available after conference: ✅ |
Kelsey Dufresne (North Carolina State University), James Harr (Christian Brothers University), Christin Phelps (North Carolina State University) |
14:45–16:00 |
Penn 2 |
Leveraging LLMs for student feedback in introductory data science courses
A considerable recent challenge for learners and teachers of data science courses is the proliferation of LLM-based tools for generating answers. In this talk, I will introduce an R package that leverages LLMs to produce immediate feedback on student work, with the goal of motivating students to attempt the work themselves first. I will discuss technical details of augmenting models with course materials, backend and user interface decisions, challenges around incorrect evaluations produced by the LLM, and student feedback from the first set of users. Finally, I will touch on incorporating this tool into low-stakes assessment, and on the ethical considerations of relying on LLMs within the course’s formal assessment structure. (A short illustrative sketch follows this listing.)
Date and time: Sat, Aug 9, 2025 - 14:45–16:00
Author(s): Mine Cetinkaya-Rundel (Duke University + Posit, PBC)
Keyword(s): r-package, teaching, education, feedback, ai, llm
Video recording available after conference: ✅ |
Mine Cetinkaya-Rundel (Duke University + Posit PBC) |
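The abstract does not name the package, so the sketch below shows only the general pattern, augmenting a model with course materials through its system prompt, using the ellmer package. The file name, prompt wording, and example answer are all illustrative assumptions.

```r
library(ellmer)

# Course materials folded into the system prompt ("augmenting the model");
# style-guide.md is a hypothetical course file.
feedback_bot <- chat_openai(
  system_prompt = paste(
    "You are a teaching assistant for an introductory data science course.",
    "Give formative feedback on student R code without revealing the answer.",
    "Course style guide:",
    paste(readLines("style-guide.md"), collapse = "\n")
  )
)

# Immediate, low-stakes feedback on a submitted answer:
student_code <- 'mean(penguins$body_mass_g)'  # likely missing na.rm = TRUE
feedback_bot$chat(paste("Give feedback on this answer:", student_code))
```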
14:45–16:00 |
Penn 2 |
Teaching Statistical Computing with R and Python
Computing courses can be daunting for students for a variety of reasons, including programming anxiety, the difficulty of learning a programming language while working in a second spoken language, and unfamiliarity with assumed computer knowledge. In an ongoing attempt to teach statistical computing effectively, I developed a textbook intended for use in a flipped classroom setting where R and Python are taught concurrently. This approach allows students to learn programming concepts applicable to most languages, while developing skills in both R and Python that can be used in an increasingly multilingual field. In this talk, I discuss the book’s design and how it integrates into a sequence of undergraduate and graduate computing courses. Along the way, we will talk about opinionated coding decisions; the use of memes, comics, and YouTube tutorials; and other features integrated into this open-source textbook built with Quarto and hosted on GitHub.
Date and time: Sat, Aug 9, 2025 - 14:45–16:00
Author(s): Susan Vanderplas (University of Nebraska - Lincoln)
Keyword(s): data science, education, statistical computing, python, reproducibility
Video recording available after conference: ✅ |
Susan Vanderplas (University of Nebraska - Lincoln) |
Workflows |
14:45–16:00 |
Penn Garden |
Building Agentic Workflows in R with axolotr
Large Language Models (LLMs) have revolutionized how we approach computational tasks, yet R users often face significant barriers when integrating these powerful tools into their workflows. Managing multiple API providers, handling authentication, and orchestrating complex interactions typically requires substantial boilerplate code and specialized knowledge across different service ecosystems. This presentation introduces axolotr, an R package that provides a unified interface for interacting with leading LLM APIs including OpenAI’s GPT, Google’s Gemini, Anthropic’s Claude, and Groq. Through progressive examples of increasing complexity, we demonstrate how R users can seamlessly incorporate LLMs into their data science workflows, from simple one-off queries to sophisticated agentic systems. We begin with fundamental LLM interactions, showing how axolotr simplifies credential management and API calls across providers. Next, we explore function-based implementations that transform raw LLM capabilities into reusable analytical tools. Finally, we demonstrate how to build true agentic workflows where multiple LLM calls work together to maintain state, make decisions, and accomplish complex tasks autonomously. Attendees will learn:
- How to quickly incorporate LLMs into existing R projects using a consistent interface
- Techniques for creating functions that leverage LLM capabilities for data analysis and interpretation
- Approaches for building agentic systems that can reason about data, maintain context, and operate iteratively
- Practical strategies for managing costs, optimizing performance, and selecting appropriate models for different tasks
This presentation provides both newcomers and experienced R users with the practical knowledge needed to harness the power of LLMs through a streamlined, R-native approach. By the end, attendees will have a roadmap for transforming their interaction with LLMs from simple API calls to sophisticated autonomous workflows that can dramatically enhance productivity and analytical capabilities. (A short illustrative sketch follows this listing.)
Date and time: Sat, Aug 9, 2025 - 14:45–16:00
Author(s): Matthew Hirschey
Keyword(s): llms, ai, agents, natural language processing, workflow automation
Video recording available after conference: ✅ |
Matthew Hirschey |
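To make "agentic" concrete, here is a hedged sketch of a generate-critique-retry loop in R. The llm_call() helper is a hypothetical stand-in rather than axolotr's actual interface; consult the package documentation for its real functions.

```r
# Hypothetical stand-in for a provider call (e.g. one made via axolotr).
llm_call <- function(prompt) {
  stop("replace with a real LLM call")
}

# One call drafts a summary, a second call acts as a critic, and
# rejected drafts are fed back into the next prompt as context.
summarise_with_retry <- function(df, max_tries = 3) {
  rejected <- character()
  answer <- NA_character_
  for (i in seq_len(max_tries)) {
    prompt <- paste0(
      "Summarise this data frame in two sentences:\n",
      paste(capture.output(str(df)), collapse = "\n"),
      if (length(rejected) > 0)
        paste0("\nEarlier drafts were rejected: ",
               paste(rejected, collapse = "; "))
      else ""
    )
    answer  <- llm_call(prompt)
    verdict <- llm_call(paste("Answer only yes or no:",
                              "is this summary accurate?", answer))
    if (grepl("yes", tolower(verdict))) return(answer)
    rejected <- c(rejected, answer)
  }
  warning("no draft accepted after ", max_tries, " tries")
  answer
}
```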
14:45–16:00 |
Penn Garden |
Data as code, packaging data as code with duckdb and S3
DuckDB and object storage (S3) offer a powerful and cost-effective way to store and access data. Wrapping that data in an R package provides an efficient way to document data processing, simplify user access, incorporate business logic, increase reproducibility, and leverage both code and data. This talk will use cori.data.fcc (https://ruralinnovation.github.io/cori.data.fcc/), featuring the US FCC National Broadband Data, as a case study for a data package. We will discuss the advantages discovered during its development, the challenges we encountered, and tips for others who wish to adapt these methods for their own needs. (A short illustrative sketch follows this listing.)
Date and time: Sat, Aug 9, 2025 - 14:45–16:00
Author(s): Olivier Leroy; John Hall (Center on Rural Innovation)
Keyword(s): duckdb, s3, data package, broadband
Video recording available after conference: ✅ |
Olivier Leroy |
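A minimal sketch of the underlying pattern, querying Parquet files on S3 directly from R through DuckDB's httpfs extension; the bucket path and column names are illustrative, not the actual cori.data.fcc layout.

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())
dbExecute(con, "INSTALL httpfs;")  # one-time install of the S3 extension
dbExecute(con, "LOAD httpfs;")

# DuckDB scans the remote Parquet files; only the needed bytes are read.
res <- dbGetQuery(con, "
  SELECT state, COUNT(*) AS n_locations
  FROM read_parquet('s3://example-bucket/fcc/broadband/*.parquet')
  GROUP BY state
  ORDER BY n_locations DESC
")

dbDisconnect(con, shutdown = TRUE)
```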
14:45–16:00 |
Penn Garden |
Small boosts, here and there
Rather than having a large language model (LLM) write an entire R package or carry out a data analysis in one fell swoop, I’m interested in LLMs doing things for me that I don’t like to do: tedious little refactors, transitioning from deprecated APIs, and templating out boilerplate. Chores, if you will. This talk introduces chores, an R package implementing an extensible library of LLM assistants to help with repetitive but hard-to-automate tasks. I’ll demonstrate that LLMs are quite good at turning some 45-second tasks into 5-second ones and show you how to start automating drudgery from your work with a markdown file and a dollar on an API key. (A short illustrative sketch follows this listing.)
Date and time: Sat, Aug 9, 2025 - 14:45–16:00
Author(s): Simon Couch (Posit, PBC)
Keyword(s): ai, llm, workflow, productivity
Video recording available after conference: ✅ |
Simon Couch (Posit PBC) |
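As a hedged illustration of the "markdown file" extension point mentioned above, the sketch below writes a custom prompt file from R. The directory and file-naming convention are assumptions about how chores discovers prompts, not its documented layout; check the package documentation before relying on them.

```r
# Assumed prompt location and "<slug>-<interface>.md" naming; both are
# guesses for illustration, not the documented chores conventions.
prompt_dir <- file.path("~", ".config", "chores", "prompts")
dir.create(prompt_dir, recursive = TRUE, showWarnings = FALSE)

writeLines(
  c("You are a careful R developer. Given a function definition,",
    "write complete roxygen2 documentation for it: @param for every",
    "argument, @returns, and a minimal @examples block.",
    "Reply with only the roxygen comment block and the function."),
  file.path(prompt_dir, "roxygen-replace.md")
)
```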
Life sciences |
14:45–16:00 |
Gross 270 |
Co-occurrence Analysis And Knowledge Graphs For Biomedical Research
The analysis of data from large hospitals and healthcare providers comes with unique challenges. Electronic health records document information from patients’ visits, such as the diagnoses made and the medications prescribed. To discover the best treatment options, facilitate early diagnosis, and understand co-morbidities and adverse effects, biomedical researchers extensively use co-occurrence analysis, which measures how features such as diagnoses and medications are correlated with each other over time at the patient level. Results can then be merged between independent health systems while maintaining patient data privacy, in a process called transfer learning, and insights can be organized, visualized, and interpreted using knowledge graphs. Knowledge graphs model relationships between concepts, e.g., one medication “may treat” one disease. Biomedical research consistently shows that while large language models perform very well at discovering similar concepts, such as synonyms or closely related diagnoses, co-occurrence analysis and knowledge graphs perform better at discovering related concepts, such as best treatment options or adverse effects. A large part of contemporary biomedical research is thus dedicated to merging results from pre-trained large language models with study-specific co-occurrence analyses. To help researchers perform co-occurrence analysis and build knowledge graphs efficiently, we developed the nlpembeds and kgraph R packages. The nlpembeds package efficiently computes co-occurrence matrices between tens of thousands of concepts from millions of patients over many years, which can prove challenging when taking into account not only codified data such as diagnoses and medications but also natural language processing concepts extracted from clinicians’ notes (comments justifying why specific diagnoses were made or medications prescribed). The kgraph package measures the performance of the results, builds the corresponding knowledge graphs, and visualizes them as interactive JavaScript networks. We used the packages in several studies, including the analysis of insurance claims from 213 million patients (Inovalon), the visualization of Mendelian randomization meta-analyses performed by the Veterans Affairs, and transfer learning between several institutions involved in the Center for Suicide Research and Prevention to build risk prediction models. In this talk, I will showcase the highlights of these packages, introduce their use, and demonstrate how to perform real-world interpretations useful for clinical research. Co-occurrence analysis and knowledge graphs make it possible to discover insights from large databases of electronic health records and so improve our understanding of biomedical processes and the realities of large-scale, long-term patient care. (A short illustrative sketch follows this listing.)
Date and time: Sat, Aug 9, 2025 - 14:45–16:00
Author(s): Thomas Charlon (Harvard Medical School)
Keyword(s): embeddings, knowledge graph, biomedical research, patient care, mental health
Video recording available after conference: ✅ |
Thomas Charlon (Harvard Medical School) |
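To make the core object concrete, here is a toy construction of a patient-level co-occurrence matrix using sparse matrix algebra. The data are invented; nlpembeds computes this kind of matrix at far larger scale.

```r
library(Matrix)

# Toy long-format EHR events: one row per (patient, concept) occurrence.
events <- data.frame(
  patient = c(1L, 1L, 2L, 2L, 3L, 3L),
  code    = c("T2D", "metformin", "T2D", "metformin", "T2D", "lisinopril")
)

# Sparse patient x concept indicator matrix.
concepts <- factor(events$code)
m <- sparseMatrix(
  i = events$patient,
  j = as.integer(concepts),
  x = 1,
  dimnames = list(NULL, levels(concepts))
)

# Concept x concept co-occurrence counts across patients:
# crossprod(m) is t(m) %*% m, so entry (a, b) counts patients with both.
crossprod(m)
```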
14:45–16:00 |
Gross 270 |
Counting Birds Two Ways: Joint models of species abundance
Joint species distribution models (JSDMs) enable ecologists to characterize relationships between species and their environment, infer interspecific dependencies, and predict the occurrence or abundance of entire ecological communities. Although several popular JSDM frameworks exist, modeling sparse relative abundance data remains an inferential and computational challenge for many of them. We describe two approaches and corresponding implementations within the context of a case study involving a large community of bird species surveyed across Finland. The first approach, hierarchical modeling of species communities, employs a generalized linear latent variable model and supports diverse data and sampling designs, but falters when faced with sparse and overdispersed count data. The second approach, binary and real count decompositions, directly addresses limitations of log-linear multivariate count models but lacks some of that generality and extensibility. (A short illustrative sketch follows this listing.)
Date and time: Sat, Aug 9, 2025 - 14:45–16:00
Author(s): Braden Scherting (Duke University)
Keyword(s): NA
Video recording available after conference: ✅ |
Braden Scherting (Duke University) |
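A minimal sketch of the first approach, a generalized linear latent variable model for species counts, using the gllvm package as one standard implementation; the talk's own software, data, and settings may differ.

```r
library(gllvm)

# Counts of 12 hunting spider species at 28 sites, with site-level
# environmental covariates (classic example data from mvabund).
data("spider", package = "mvabund")

fit <- gllvm(
  y       = spider$abund,
  X       = as.data.frame(spider$x),
  formula = ~ soil.dry + moss,
  family  = "negative.binomial",
  num.lv  = 2   # two latent variables capture residual co-occurrence
)

summary(fit)
ordiplot(fit)  # ordination of sites on the fitted latent variables
```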
14:45–16:00 |
Gross 270 |
Detecting Read Coverage Patterns Indicative of Genetic Variation and Mobility
Read coverage data is commonly used in bioinformatics analyses of sequenced samples. Read coverage represents the count of short DNA sequences that align to specific locations in a reference sequence. Plotting it lets one visualize how read coverage changes along the reference sequence. Some read coverage patterns, such as gaps and elevations in coverage, are associated with real biological phenomena such as mobile genetic elements (MGEs) and structural variants (SVs). MGEs are genetic sequences capable of transferring to new genomic locations, where they may disrupt functioning genes. Structural variants refer to small genetic differences between individuals or microbial populations caused by deletions, insertions, and duplications of gene sequences. MGEs and SVs are important to host health, and while many tools have been developed to detect them, the vast majority are either database-dependent or limited to detecting specific types of MGEs and SVs. Using gaps and elevations in read coverage is a more general detection method for diverse MGEs and SVs; however, manual inspection of coverage graphs is tedious, time-consuming, and subjective. We developed an algorithm that detects distinct patterns in read coverage data and implemented it in two R packages, TrIdent and ProActive, that automatically identify, classify, and characterize read coverage patterns indicative of genetic variation and mobilization. Our read-coverage pattern-matching algorithm offers a unique approach to sequence data analysis, and our tools enable researchers to efficiently incorporate read coverage inspections into their standard bioinformatics pipelines. (A short illustrative sketch follows this listing.)
Date and time: Sat, Aug 9, 2025 - 14:45–16:00
Author(s): Jessie Maier (North Carolina State University); Craig Gin (North Carolina State University), Benjamin Callahan (North Carolina State University), Manuel Kleiner (North Carolina State University)
Keyword(s): pattern-matching, bioinformatic tools, read coverage data, mobile genetic elements, structural variants
Video recording available after conference: ✅ |
Jessie Maier (North Carolina State University) |
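To illustrate the underlying idea, here is a deliberately naive gap-and-elevation detector for a simulated coverage vector; it is a toy sketch, not the TrIdent or ProActive algorithm.

```r
set.seed(1)
coverage <- rpois(1000, lambda = 30)          # baseline read coverage
coverage[400:450] <- rpois(51, lambda = 2)    # simulated gap (e.g. deletion)
coverage[700:720] <- rpois(21, lambda = 120)  # simulated elevation

# Classify each position relative to the typical coverage level.
med <- median(coverage)
state <- cut(coverage, breaks = c(-Inf, med / 4, med * 3, Inf),
             labels = c("gap", "normal", "elevation"))

# Collapse consecutive positions into candidate regions with rle().
runs   <- rle(as.character(state))
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1
candidates <- data.frame(start = starts, end = ends, type = runs$values)

# Keep non-normal regions long enough to be interesting.
subset(candidates, type != "normal" & end - start + 1 >= 10)
```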
Keynote #3 |
17:00–18:00 |
Penn 1 |
We R Together. How to learn, use and improve a programming language as a community.
Communities of practice are powerful spaces for learning, collaboration, and innovation—especially in the context of coding and data science. In this talk, I’ll share what I’ve learned from leading and supporting R communities, with concrete examples of strategies, content, and programs that encourage participation and skill-sharing. Informed by a range of approaches to community building, I’ll explore how collective efforts can strengthen not only technical skills, but also support long-term career growth and visibility. This talk will be relevant to anyone interested in creating more inclusive, sustainable, and impactful technical communities—across research, education, and open source.
Date and time: Sat, Aug 9, 2025 - 17:00–18:00
Author(s): Yanina Bellini Saibene (rOpenSci + R-Ladies + Universidad Austral)
Keyword(s): NA
Video recording available after conference: ✅ |
Yanina Bellini Saibene (rOpenSci + R-Ladies + Universidad Austral) |