useR! 2025

The program is now announced, and registration for both the in-person and virtual conferences is open.

In-person conference program

Register for the in-person conference now!

Go to the registration page to register for the in-person conference!

Keynotes

Hadley Wickham
Posit, PBC

I wrote this talk with an LLM

In this keynote, I’ll explore the evolving relationship between data scientists, statisticians, and large language models through a unique experiment: this entire talk was…

Keynote #1

Simon Urbanek
University of Auckland + R Core

R in the AI Era: Leveraging Modern Technologies in Practice

R is popular in part due to its extensive package ecosystem, which allows for the incorporation of new technologies and statistical methods. The language has been designed…

Keynote #2

Frauke Kreuter
LMU Munich / University of Maryland

Data in the Balance: Incentives, Independence, and Public Statistics

Government statistics—and the data behind them—are the backbone of scientific research and evidence-based policymaking. Yet chronic under-funding, fragmented systems, and…

Keynote #3

Yanina Bellini Saibene
rOpenSci + R-Ladies + Universidad Austral

We R Together. How to learn, use and improve a programming language as a community.

Communities of practice are powerful spaces for learning, collaboration, and innovation—especially in the context of coding and data science. In this talk, I’ll share what…

Keynote #4

Will Landau
Eli Lilly and Company

Powerful simulation pipelines with {targets}

When designing clinical trials, simulations are essential for comparing options and optimizing features like sample size, allocation, randomization, milestones, and decision…

Keynote #5

Schedule

Please note that all session times are listed below in EDT and the schedule is subject to change.

  • Day 1 - August 8, 2025
  • Day 2 - August 9, 2025
  • Day 3 - August 10, 2025
Day 1: Friday, August 8, 2025
Room | Title, abstract, and more info | Presenter(s)
Morning tutorial
08:30–12:00 TBD Causal Machine Learning in R

More info: In both data science and academic research, prediction modeling is often not enough; many questions need to be approached causally. However, we can augment and improve causal inferences using machine learning techniques. In this workshop, we’ll teach the essential elements of combining machine learning and causal techniques to answer causal questions in R. We’ll cover causal diagrams and doubly robust causal modeling techniques that allow for valid inferences with ML models via targeted maximum likelihood estimation (TMLE). We’ll also show that we can better take advantage of both tools by distinguishing predictive models from causal models. This workshop assumes you have a basic understanding of prediction modeling and R.

Learning goals:

* Understand why prediction models can’t answer causal questions

* Understand how causal diagrams allow us to improve causal queries and how to use them in R

* Develop doubly robust models in R to answer causal questions using machine learning techniques

Target audience: Intermediate to advanced; we assume basic experience with R and machine learning. Previous experience with causal inference will be helpful but not required. Interested users can consult our book ahead of time.

Prerequisites: None

Date and time: Fri, Aug 8, 2025 - 08:30–12:00

Author(s): Malcolm Barrett (Stanford University)

Keyword(s): causal inference, tmle, machine learning

Video recording available after conference: ❌
Malcolm Barrett (Stanford University)
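The tutorial materials themselves are not reproduced here. As a rough illustration of the causal-diagram step described in the abstract, a DAG can be specified and queried with the {ggdag} package (one option among several; the variable names below are hypothetical):

```r
# Sketch of specifying and querying a causal diagram with {ggdag}.
# The DAG below (exposure, outcome, one confounder) is hypothetical
# and not taken from the tutorial materials.
library(ggdag)
library(ggplot2)

dag <- dagify(
  outcome  ~ exposure + confounder,
  exposure ~ confounder,
  exposure = "exposure",
  outcome  = "outcome"
)

ggdag_adjustment_set(dag)   # which variables must be adjusted for
ggdag(dag) + theme_dag()    # draw the DAG
```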
08:30–12:00 TBD Debugging Tools for Functions in R

More info: If you write functions but are unsure of efficient strategies for identifying the source of errors, join this workshop to unlock your programming superpower with debugging techniques! In this workshop, we will review code troubleshooting tips, discuss debugging functions (traceback(), browser(), debug(), trace(), and recover()), and distinguish between strategies for debugging your own code versus someone else’s code.

Learning goals: 1. Review code troubleshooting tips. 2. Apply debugging functions (traceback(), browser(), debug(), trace(), and recover()) and identify the additional benefits of employing some of these strategies within RStudio. 3. Distinguish between strategies for debugging your own code versus someone else’s code.

Target audience: Individuals with experience writing functions but new to debugging.

Prerequisites: None

Date and time: Fri, Aug 8, 2025 - 08:30–12:00

Author(s): E. David Aja (Posit, PBC), Shannon Pileggi (The Prostate Cancer Clinical Trials Consortium)

Keyword(s): debugging, functions

Video recording available after conference: ❌
E. David Aja (Posit, PBC)
Shannon Pileggi (The Prostate Cancer Clinical Trials Consortium)
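For orientation, the base R debugging helpers named in the abstract are typically invoked as follows; the buggy function is a made-up example, and the session is meant to be run interactively:

```r
# Hypothetical buggy function used only to illustrate the base R
# debugging helpers named in the abstract (run interactively).
f <- function(x) {
  y <- log(x)
  stop("something went wrong")   # simulated bug
}

f(10)                 # triggers the error
traceback()           # show the call stack after the error

debug(f)              # step through f() line by line on the next call
f(10)
undebug(f)

trace(f, quote(print(x)), at = 2)   # inject a print() without editing f()
f(10)
untrace(f)

options(error = recover)   # on any error, choose a frame to browse
# browser() can also be placed directly inside your own functions.
```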
08:30–12:00 TBD From Model to Meaning: How to use the marginaleffects package to interpret results from statistical or machine learning models

More infoOur world is complex. To make sense of it, data analysts routinely fit sophisticated statistical or machine learning models. Interpreting the results produced by such models can be challenging, and researchers often struggle to communicate their findings to colleagues and stakeholders. This tutorial is designed to bridge that gap. It offers a practical guide to model interpretation for analysts who wish to communicate their results in a clear and impactful way. Tutorial attendees will be introduced to the marginaleffects package and to the conceptual framework that underpins it. The marginaleffects package for R offers a single point of entry for computing and plotting predictions, counterfactual comparisons, slopes, and hypothesis tests for over 100 different types of models. The package provides a simple and unified interface, is well-documented with extensive tutorials, and is model-agnostic—ensuring that users can extract meaningful quantities regardless of the modeling framework they use. The book Model to Meaning: How to Interpret Statistical Results Using marginaleffects for R (forthcoming with CRC Chapman & Hall) introduces a powerful conceptual framework to help analysts make sense of complex models. It demonstrates how to extract meaningful quantities from model outputs and communicate findings effectively using marginaleffects. This tutorial will provide participants with a deep understanding of how to use marginaleffects to improve model interpretation. Attendees will learn how to compute and visualize key statistical summaries, including marginal means, contrasts, and slopes, and how to leverage marginaleffects for hypothesis and equivalence testing. The package follows tidy principles, ensuring that results integrate seamlessly with workflows in R, and with other packages such as ggplot2, Quarto, and modelsummary. This tutorial is suitable for data scientists, researchers, analysts, and students who fit statistical models in R and seek an easy, reliable, and transparent approach to model interpretation. No advanced mathematical background is required, but familiarity with generalized linear models like logistic regression is assumed.

Learning goals: None

Target audience: None

Prerequisites: None

Date and time: Fri, Aug 8, 2025 - 08:30–12:00

Author(s): Vincent Arel-Bundock (Université de Montréal)

Keyword(s): model interpretation, statistical analysis, marginaleffects, regression modeling, causal inference

Video recording available after conference: ❌
Vincent Arel-Bundock (Université de Montréal)
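A minimal sketch of the {marginaleffects} workflow described above, applied to a built-in dataset rather than the tutorial's materials:

```r
# Sketch of the {marginaleffects} workflow described in the abstract:
# fit a GLM, then compute predictions, comparisons, and slopes.
library(marginaleffects)

mod <- glm(am ~ hp + wt, data = mtcars, family = binomial)

avg_predictions(mod)                      # average predicted probability
avg_comparisons(mod, variables = "wt")    # average effect of a change in wt
avg_slopes(mod)                           # average marginal effects
plot_predictions(mod, condition = "hp")   # predicted probability across hp
```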
08:30–12:00 TBD Tidy manipulation of genomic data

More info: tidyomics is an open-source project to enable a tidy data analysis framework for omics data, such as single-cell gene expression, genomic annotation, chromatin interactions, and more. tidyomics enables the use of familiar tidyverse verbs (select, filter, mutate, etc.) to manipulate rich data objects in the R/Bioconductor ecosystem. In this workshop, we will give a high-level overview of the project, and then work through a number of examples involving experimental datasets and typical bioinformatics tasks, showing how these can be cast as tidy data analyses.

Learning goals:

* Basic tidy operations on experimental data and genome annotation

* Bulk and single cell expression with tidy manipulation and visualization

* Examples of how to integrate diverse genomic datasets (ChIP-seq and RNA-seq)

Target audience: Bioinformatics audience at any level, some knowledge of dplyr is helpful

Prerequisites: None

Date and time: Fri, Aug 8, 2025 - 08:30–12:00

Author(s): Justin Landis (UNC), Michael Love (UNC-Chapel Hill)

Keyword(s): tidy data, genomics, bioinformatics, bioconductor

Video recording available after conference: ❌
Justin Landis (UNC)
Michael Love (UNC-Chapel Hill)
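The tidyomics stack spans several Bioconductor packages. As one small illustration of tidy verbs applied to genomic ranges, here is a sketch using {plyranges}, one of the packages in the project; the ranges are made up:

```r
# Illustration of dplyr-style verbs on genomic ranges via {plyranges}.
# The ranges below are made up.
library(plyranges)

gr <- data.frame(
  seqnames = "chr1",
  start    = c(100, 500, 900),
  width    = 200,
  gene     = c("a", "b", "c"),
  score    = c(1.2, 3.4, 0.7)
) |>
  as_granges()

gr |>
  filter(score > 1) |>             # keep high-scoring ranges
  mutate(log_score = log(score))   # add a computed metadata column
```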
Afternoon tutorial
13:00–16:30 TBD Complex Survey Data Analysis: A Tidy Introduction with {srvyr} and {survey}

More infoThis interactive tutorial will introduce how to conduct analysis of survey data in R. We will first introduce a unifying workflow of tidy survey analysis in R for analysis of survey microdata with weights. We will cover topics of descriptive analysis, including functions to obtain weighted proportions, means, quantiles, and correlations from survey data. Then, we will discuss some statistical testing, including t-tests for comparing means and χ-squared tests for comparing proportions. Finally, we will discuss common probability sampling designs and how to create the survey design objects in R to account for the sampling design. The tutorial will include time for exercises using data from the 2020 American National Election Study and the 2020 Residential Energy Consumption Survey, so you can get hands-on experience with the functions. We will be using Posit Cloud, so you do not need to have R or RStudio preinstalled on your computer. For the best learning experience, we recommend you have some prior experience with R and the tidyverse, including familiarity with mutate, summarize, count, and group_by.

Learning goals: 1. Interpret documentation accompanying survey data and set up a survey design object in R. 2. Calculate weighted means, quantiles, proportions, and correlations along with standard errors and confidence intervals for survey data. 3. Specify t-tests for continuous survey data and understand the difference between two-sample t-tests and paired t-tests. Along the way, implement “dot” notation for passing design objects into the test function. 4. Conduct goodness of fit tests, tests of independence, and tests of homogeneity for categorical survey data.

Target audience: Analysts who want to analyze survey microdata (record-level data) with weights and disseminate results. May already use another language for survey analysis or are just starting out in survey analysis.

Prerequisites: None

Date and time: Fri, Aug 8, 2025 - 13:00–16:30

Author(s): Rebecca Powell (Fors Marsh), Isabella Velásquez (Posit, PBC), Stephanie Zimmer (RTI International)

Keyword(s): survey analysis, statistical testing, weighted analysis

Video recording available after conference: ❌
Rebecca Powell (Fors Marsh)
Isabella Velásquez (Posit, PBC)
Stephanie Zimmer (RTI International)
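A minimal sketch of the tidy survey workflow described above, using the example data shipped with the {survey} package instead of the ANES/RECS datasets used in the tutorial:

```r
# Sketch of a tidy survey analysis with {srvyr}, using the API example
# data from the {survey} package instead of the tutorial datasets.
library(srvyr)
library(survey)

data(api)   # loads apistrat, a stratified sample of California schools

dstrat <- apistrat |>
  as_survey_design(strata = stype, weights = pw, fpc = fpc)

dstrat |>
  group_by(stype) |>
  summarise(
    mean_enroll = survey_mean(enroll, vartype = "ci"),
    prop_yr_rnd = survey_mean(yr.rnd == "Yes")
  )

# Design-based t-test, passing the design object to the test function.
svyttest(growth ~ awards, design = dstrat)
```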
13:00–16:30 TBD Getting Started with Positron: A Next-Generation IDE for data science

More infoPositron is a next-generation data science IDE built by Posit PBC that combines the best features of RStudio and Visual Studio Code. This tutorial will introduce R users to Positron’s core capabilities, with a special emphasis on helping RStudio users while highlighting its seamless integration with the R ecosystem. For R programmers coming from RStudio, Positron delivers a familiar yet enhanced environment for data analysis and package development, while offering a path to Python when needed. Unlike traditional software-development oriented IDEs, Positron provides first-class support for data science-specific workflows through its native support for R (via the Ark kernel), along with designated areas for Variables (Environment), Connections, Plots, Help, and more that RStudio users have come to rely on. During this hands-on tutorial, participants will learn how to: - Install, configure, and update Positron for an R-focused workflow - Navigate Positron’s interface and understand how it compares to RStudio - Use Positron’s advanced features for interactive R coding and data exploration - Customize Positron with useful settings, extensions, and keyboard shortcuts - Implement a project-based workflow with a workspace We’ll explore Positron’s innovative features that enhance R productivity, such as the improved interactive Data Explorer and the ability to switch between different R installations. The tutorial will include practical demonstrations of key workflows, such as developing and publishing Shiny apps and Quarto documents, package development, and data visualization. For R users curious about Python, we’ll briefly demonstrate how Positron makes Python accessible within a familiar environment. We’ll also briefly cover compatibility with VS Code extensions relevant to R users and how to leverage them through the Open VSX Registry. We’ll survey various ways to access GenAI for coding assistance and a high-level overview of more specialized topics, such as remote development and integrations with cloud providers. By the end of this tutorial, participants should understand how to accomplish their most-used workflows in Positron and how to tailor the IDE to their specific needs.

Learning goals: Participants will learn how to set up Positron for their data science work and navigate its interface. They will gain practical experience with Positron’s data-focused tools including the Variables pane, Data Explorer, and Plot viewer. By the end of the tutorial, participants will be able to customize Positron to maintain their familiar RStudio workflow patterns while gaining access to new capabilities and understand how to effectively transition their R-based data science projects to this new environment.

Target audience: This tutorial is designed specifically for R users familiar with the RStudio IDE who want to explore Positron as an alternative or additional tool. It’s particularly valuable for R programmers who occasionally need to use Python or other languages (Rust, C++, Lua, etc), or who collaborate with Python users. Both experienced R users looking to expand their toolset and RStudio enthusiasts curious about the next generation of R development environments will benefit from this session.

Prerequisites: None

Date and time: Fri, Aug 8, 2025 - 13:00–16:30

Author(s): Jennifer Bryan (Posit, PBC), Julia Silge (Posit, PBC); Julia Silge (Posit, PBC)

Keyword(s): positron, r, ide, data science, rstudio

Video recording available after conference: ❌
Jennifer Bryan (Posit, PBC)
Julia Silge (Posit, PBC)
13:00–16:30 TBD R You Out of Memory Again? Level Up Your Data Game with Arrow and DuckDB

More info“I can’t analyze this dataset—R keeps running out of memory!” This common frustration signals a critical gap in the R analyst’s toolkit. This hands-on tutorial empowers tidyverse users to break through memory limitations by leveraging two game-changing technologies: Apache Arrow and DuckDB. Arrow provides a cross-language columnar memory format that enables efficient processing of large datasets without full memory loading, while DuckDB offers an embeddable analytical database engine that excels at complex aggregations and joins. When combined with dplyr’s grammar of data manipulation, these tools create a powerful framework for scalable data analysis. The beauty of this approach? You can keep using the dplyr syntax you already know and love. Arrow and DuckDB work seamlessly with dplyr’s grammar of data manipulation, translating familiar verbs into high-performance operations that process data outside of R’s memory constraints. This means analyzing gigabytes of data on your laptop without rewriting your existing code or learning entirely new frameworks. Through practical examples with real-world datasets, participants will discover how to: Transform existing dplyr pipelines to process larger-than-memory datasets Navigate the complementary strengths of Arrow (streaming operations, columnar processing) and DuckDB (complex aggregations, efficient joins) Integrate SQL when needed for specialized operations Optimize query performance through execution strategies like predicate pushdown and parallel processing We’ll focus on immediately applicable techniques rather than theory. Each concept is paired with hands-on exercises where participants implement patterns they can directly transfer to their own projects. You’ll experience firsthand the thrill of processing datasets 10-100x larger than previously possible with standard R. By the tutorial’s end, participants will confidently decide which tool fits each analytical challenge and implement scalable workflows that grow with their data needs. The days of “cannot allocate vector of size…” errors will be behind you. All materials, including code examples and datasets, will be available in a GitHub repository, ensuring continued learning beyond the workshop. Join us to transform your data analysis capabilities and remove the memory ceiling that’s been holding back your R workflows

Learning goals: Configure and integrate Arrow and DuckDB within an R environment. Translate dplyr workflows to handle out-of-memory datasets. Optimize query performance and implement scalable data processing workflows. Combine R and SQL for complex analytical operations.

Target audience: Data scientists, analysts, and researchers who are comfortable with tidyverse packages (particularly dplyr) and are encountering performance limitations when working with larger datasets. The tutorial will benefit professionals across academia, industry, and government sectors who need to scale their R-based data analysis.

Prerequisites: None

Date and time: Fri, Aug 8, 2025 - 13:00–16:30

Author(s): Elyse Armstrong (Common App), Jeanne McClure (NC State University), Sheila Saia (R-Ladies RTP); Elyse Armstrong (Common App), Sheila Saia (R-Ladies RTP)

Keyword(s): big data, dplyr, apache arrow, duckdb, performance optimization

Video recording available after conference: ❌
Elyse Armstrong (Common App)
Jeanne McClure (NC State University)
Sheila Saia (R-Ladies RTP)
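A rough sketch of the pattern described above: open a larger-than-memory Parquet dataset with Arrow, manipulate it with dplyr verbs, and hand it to DuckDB for heavier aggregations. The directory path and column names are placeholders:

```r
# Sketch of larger-than-memory analysis with {arrow}, {duckdb}, and {dplyr}.
# "data/trips/" and the column names are placeholders.
library(arrow)
library(dplyr)

trips <- open_dataset("data/trips/")   # nothing is loaded into memory yet

# dplyr verbs are pushed down to Arrow; only the small aggregated
# result is pulled into R by collect().
trips |>
  filter(passenger_count > 1) |>
  group_by(year) |>
  summarise(n = n(), avg_fare = mean(fare_amount, na.rm = TRUE)) |>
  collect()

# Hand the same dataset to DuckDB for joins/aggregations it handles well.
trips |>
  to_duckdb() |>
  count(year, sort = TRUE) |>
  collect()
```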
13:00–16:30 TBD Teaching statistics and data science with R and GitHub

More info: In this tutorial, participants will learn about teaching R and GitHub in statistics and data science courses. We will discuss pedagogy and curriculum design for effectively teaching computing alongside statistical concepts. Participants will explore example in-class activities and assignments that demonstrate the student experience, while discussing strategies for implementing such activities from the instructor perspective. We will also discuss computing infrastructure options that enable students to use R and RStudio from a web browser with minimal setup. Lastly, we will show how instructors can use R and Quarto to make course materials and streamline their workflow in a reproducible way using GitHub. The tutorial will focus on teaching introductory-level undergraduate students with no previous computing experience, but the tutorial content is applicable for instructors teaching high school courses and courses throughout the undergraduate statistics and data science curriculum.

Learning goals: Learn pedagogical strategies for teaching R and GitHub in a statistics or data science course - Identify how computing can be integrated alongside statistical concepts in a course curriculum - Experience computing activities and assignments from both the student and instructor perspective - Consider the computing infrastructure that may be the best fit for your student population - Learn how to develop course materials with R and Quarto and develop a reproducible workflow with GitHub

Target audience: This workshop is for instructors interested in teaching R in their statistics and data science courses. The workshop will be presented from the perspective of teaching at the undergraduate level; however, the contents of this workshop will also be beneficial to instructors teaching high school statistics and data science.

Prerequisites: None

Date and time: Fri, Aug 8, 2025 - 13:00–16:30

Author(s): Elijah Meyer (North Carolina State University), Maria Tackett (Duke University)

Keyword(s): data science education, pedagogy, quarto, github, webr

Video recording available after conference: ❌
Elijah Meyer (North Carolina State University)
Maria Tackett (Duke University)
Keynote #1
17:15–18:15 Gross Hall 270 I wrote this talk with an LLM

More info: In this keynote, I’ll explore the evolving relationship between data scientists, statisticians, and large language models through a unique experiment: this entire talk was created in collaboration with an LLM. From outline to slides, from code examples to key insights, I’ll share the practical realities of using AI as a thought partner in the R ecosystem.

Drawing on my experience developing tidyverse packages and teaching data science, I’ll demonstrate how LLMs can augment (rather than replace) the R user’s workflow. We’ll examine specific examples where AI assistance shines—rapid prototyping, documentation generation, and creative ideation—alongside areas where human expertise remains irreplaceable.

Most importantly, I’ll reflect on what this experiment reveals about the future of our community: How might AI change the way we teach R? What new skills should we prioritize? And how can we ensure that the tools we build remain accessible and empowering for all users?

Join me for this meta-exploration of AI’s role in our work, with honest reflections on both the promise and limitations of these new collaborators in our statistical computing journey.

✨ This abstract was generated by Claude Sonnet 3.7 and lightly edited by me. I used the prompt: I am Hadley Wickham, chief scientist at RStudio/Posit and I’ve been invited to give a keynote on AI at the useR conference. Please write a talk abstract for a talk entitled ‘I wrote this talk with an LLM’ ✨

Date and time: Fri, Aug 8, 2025 - 17:15–18:15

Author(s): Hadley Wickham (Posit, PBC)

Keyword(s): NA

Video recording available after conference: ✅
Hadley Wickham (Posit, PBC)
Poster
18:15–19:30 Gross Hall Energy Hub Applying GAM and MARS using R to predict daily household electricity load curves

More info: We show how to achieve our objective of predicting daily household electricity load curves using the R statistical language and several packages that implement state-of-the-art techniques. The widespread deployment of smart meters in the residential and tertiary sectors has made it possible to collect high-frequency electricity consumption data at the consumer level (individuals, professionals, etc.). This data is a raw material for research on the prediction of electricity consumption at this level. The majority of this research is largely aimed at meeting the needs of industry, such as applications in the context of smart homes and programs for managing and reducing consumption. The objective of this work is to deploy or implement short-term (D + 1) electrical load forecasting models at the consumer level. The complexity of the subject lies in the fact that consumption data on this scale is very volatile. Indeed, it includes a large amount of noise and depends on the consumer’s lifestyle and consumption habits. We studied the influence of integrating outdoor temperature in different forms on the performance of a generalized additive model (GAM) and a multivariate adaptive regression splines (MARS) model. These two models are capable of modelling both linear relationships and non-linear interactions between influencing factors (independent variables), and were adapted to model the temperature sensitivity of load curves. The models were tested and evaluated on a large sample of disparate load curves in the residential sector. An approach was also proposed for the prediction of the most volatile load curves. We will wrap an example dataset as well as the scripts that were used in this work in a package that will be available online.

Date and time: Fri, Aug 8, 2025 - 18:15–19:30

Author(s): Frederic Bertrand (Troyes University of Applied Sciences); Fatima Fahs (ES), Myriam Maumy (EHESP)

Keyword(s): statistical learning, generalized additive models, multivariate adaptive regression splines, prediction, daily electricity load curves
Frederic Bertrand (Troyes University of Applied Sciences)
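The authors plan to release their dataset and scripts as a package. In the meantime, a generic sketch of fitting the two model classes named in the abstract ({mgcv} for GAM, {earth} for MARS) on simulated load data:

```r
# Generic sketch of the two model classes named in the abstract:
# a GAM via {mgcv} and a MARS model via {earth}. Data are simulated.
library(mgcv)
library(earth)

set.seed(1)
df <- data.frame(
  temp = runif(500, -5, 30),
  hour = sample(0:23, 500, replace = TRUE)
)
df$load <- 2 + 0.01 * (15 - df$temp)^2 +
  sin(2 * pi * df$hour / 24) + rnorm(500, sd = 0.3)

fit_gam  <- gam(load ~ s(temp) + s(hour, bs = "cc"), data = df)  # smooth terms
fit_mars <- earth(load ~ temp + hour, data = df)                 # adaptive splines

summary(fit_gam)
summary(fit_mars)
```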
18:15–19:30 Gross Hall Energy Hub Automating GCaMP Fluorescence Analysis for Neuronal Activity Quantification in Stress Response Studies

More info: Calcium imaging using genetically encoded calcium indicators (GECIs) such as GCaMP provides a critical window into neuronal activity, particularly in response to physiological stress. However, the large volume of imaging data presents challenges in efficient processing, normalization, and statistical analysis. This project introduces an automated R-based workflow that extracts, normalizes, and analyzes fluorescence intensity changes to identify significant neuronal responses. The pipeline standardizes intensity values against baseline fluorescence, filters for neurons exhibiting meaningful activity, and applies statistical modeling—including two-way ANOVA with post-hoc comparisons—to assess differences across experimental conditions. By integrating data processing and statistical analysis, this approach streamlines fluorescence quantification, reducing manual intervention while enhancing reproducibility. The methodology is broadly applicable across neuroscience, bioinformatics, and computational research, providing a scalable solution for analyzing calcium imaging data in diverse experimental settings.

Date and time: Fri, Aug 8, 2025 - 18:15–19:30

Author(s): Calvin Cho (Duke University, Trinity College of Arts & Sciences); Carlene Moore (Duke University School of Medicine), Christopher Wickware (Duke University School of Medicine)

Keyword(s): automation, statistical modeling, bioinformatics, medical research, data analysis
Calvin Cho (Duke University, Trinity College of Arts & Sciences)
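The pipeline code is not shown in the abstract. A generic sketch of the normalization and two-way ANOVA steps it describes, on simulated fluorescence data:

```r
# Generic sketch of the steps described in the abstract: normalize
# fluorescence to baseline (dF/F0), filter for responsive neurons, and
# run a two-way ANOVA with post-hoc comparisons. Data are simulated.
set.seed(1)
dat <- expand.grid(
  neuron    = 1:30,
  condition = c("control", "stress"),
  region    = c("anterior", "posterior")
)
dat$f0  <- rnorm(nrow(dat), mean = 100, sd = 5)          # baseline fluorescence
dat$f   <- dat$f0 + rnorm(nrow(dat), mean = 10, sd = 5) +
  ifelse(dat$condition == "stress", 15, 0)               # stimulus response
dat$dFF <- (dat$f - dat$f0) / dat$f0                     # normalize to baseline

responsive <- subset(dat, dFF > 0.05)                    # meaningful activity only

fit <- aov(dFF ~ condition * region, data = responsive)  # two-way ANOVA
summary(fit)
TukeyHSD(fit)
```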
18:15–19:30 Gross Hall Energy Hub Cloud-Based and AI-assisted Workflows in R: Case Studies from North Carolina State University’s Data Science Consulting Program

More info: The Data Science Consulting Program at North Carolina State University supports researchers in analytics, statistics, and data visualization by leveraging cloud-based and AI-assisted workflows that streamline collaboration and eliminate infrastructure challenges. For smaller projects, we use Google Colab, which supports both R and Python, enabling multiple consultants to work together without requiring extensive version control expertise. Since Colab operates entirely in the cloud, patrons can execute workflows seamlessly without the burden of local setup. Additionally, Gemini AI integration provides real-time coding and environment support, reducing time spent troubleshooting and navigating documentation. For larger, scalable projects, we turn to Posit Cloud, a cloud-based IDE for building and deploying interactive dashboards in R and Python. With Shiny Assistant, consultants and researchers receive AI-powered guidance on UI/UX design, backend development, and deployment, ensuring an efficient workflow. To maintain structured collaboration, we manage a modular codebase through a private GitHub repository, allowing for better version control and teamwork. In this poster, we will highlight case studies from our program, illustrating how these cloud-based and AI-assisted workflows enhance research in R by improving collaboration, reducing technical overhead, and increasing efficiency.

Date and time: Fri, Aug 8, 2025 - 18:15–19:30

Author(s): Ishti Sikder (North Carolina State University); Shannon Ricci (North Carolina State University), Alp Tezbasaran (North Carolina State University)

Keyword(s): collaboration, cloud, generative ai, consulting
Ishti Sikder (North Carolina State University)
18:15–19:30 Gross Hall Energy Hub Leveraging R for Multi-State Modeling in Real-World Oncology Research

More infoSurvival analysis is fundamental to oncology research, enabling the estimation of time-to-event outcomes such as overall survival (OS), progression-free survival (PFS), time to treatment discontinuation (TTD), and time to next treatment (TTNT). Multi-state modeling (MSM) extends survival analysis by incorporating dynamic transitions between treatment lines, while accounting for censoring and competing risks. This study demonstrates how R and the mstate package can be used to facilitate the development of complex MSM frameworks for analyzing oncology treatment pathways. This study used the nationwide Flatiron Health electronic health record (EHR)-derived deidentified database. The Flatiron Health database is a longitudinal database, comprising deidentified patient-level structured and unstructured data, curated via technology-enabled abstraction. We implemented a multi-state model in R to track patient transitions from first-line therapy (1L) to subsequent lines (2L, 3L+) and death, incorporating irreversible transitions to reflect real-world treatment pathways. The model adjusts for key baseline clinical covariates, including age, sex assigned at birth, ALK and EGFR biomarker status, and cancer stage, as well as covariates present at the time of state transition including percent change in weight from baseline and most recent serum albumin. The analysis was conducted using R packages survival, ggplot2, tidyverse, dplyr, and mstate, which provide a robust framework for data cleaning, transition probability estimation, patient trajectory visualization, and statistical inference. Non-parametric methods, including the Aalen-Johansen estimator and Kaplan-Meier estimator, were used to estimate transition probabilities, while the semi-parametric Cox proportional hazards model was applied to identify significant clinical factors influencing transitions. This study demonstrates how multi-state modeling techniques can enhance the ability to assess prognosis of time-to-event outcomes in oncology RWE.

Date and time: Fri, Aug 8, 2025 - 18:15–19:30

Author(s): Aashay Mahesh Mehta; Spencer Langerman

Keyword(s): multi-state modeling, real-world evidence, statistical methods, cox regression, oncology research
Aashay Mahesh Mehta
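The Flatiron data cannot be shown. A rough sketch of the multi-state setup described above with {mstate}, under the line-of-therapy state structure; the patient-level data frame and its column names are hypothetical:

```r
# Rough sketch of the multi-state setup described in the abstract, using
# {mstate} and {survival}. States: 1L -> 2L -> 3L+ -> death, with death
# reachable from every line. The 'patients' data frame and its columns
# are hypothetical, not the Flatiron data.
library(mstate)
library(survival)

tmat <- transMat(
  x     = list(c(2, 4), c(3, 4), c(4), c()),
  names = c("1L", "2L", "3L+", "death")
)

long <- msprep(
  time   = c(NA, "t_2L", "t_3L", "t_death"),
  status = c(NA, "s_2L", "s_3L", "s_death"),
  data   = patients,                       # hypothetical patient-level data
  trans  = tmat,
  keep   = c("age", "stage")
)

# Transition-specific Cox model with baseline covariates.
fit <- coxph(Surv(Tstart, Tstop, status) ~ age + stage + strata(trans),
             data = long, method = "breslow")

# Non-parametric (Aalen-Johansen) transition probabilities from 1L.
fit0 <- coxph(Surv(Tstart, Tstop, status) ~ strata(trans), data = long)
probtrans(msfit(fit0, trans = tmat), predt = 0)
```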
18:15–19:30 Gross Hall Energy Hub Optical Character Recognition (OCR) Screening in R for PFAS in Project Documents

More infoRamboll is frequently retained to analyze large batches of project documents as part of identifying per- and polyfluoroalkyl substances (PFAS) in client operations, such as for compliance with ongoing reporting requirements. Due to the scale of the effort needed to manually search project documents for thousands of terms that may be associated with PFAS (numerous compounds and variations in nomenclature), the need for automation can be very helpful in minimizing human error while reducing costs for our clients. The Optical Character Recognition (OCR) Screening Tool uses R to execute the screening of documents to identify and index CAS numbers, chemical names, trade names, and other PFAS-related keywords. The R tool was built for use within Ramboll for this purpose. This tool prepares the PDFs by organizing them into categories: those that can be read initially, those that need to have OCR applied, and those that will need human review, minimizing upfront effort. A large keyword list was developed by Ramboll’s PFAS subject matter expert team. The keyword list is organized in groups and can be customized based on client needs. It is important to note that the tool can be adapted for any keyword list. The R tool lists specific instances of relevant search terms within the document, along with approximate page numbers, building an index organized by the groupings provided by the user. Once complete, the output of the tool is summarized for use in a Shiny dashboard for easy viewing. Excel reports are also generated with the files returning results being hyperlinked automatically. A triage task is often implemented to screen the output for false positives and for identifications that require further action.

Date and time: Fri, Aug 8, 2025 - 18:15–19:30

Author(s): Bruce Franz (Ramboll); Brian Drollette (Ramboll), Jon Hunt (Ramboll)

Keyword(s): ocr, pfas, shiny, environmental science, health sciences
Bruce Franz (Ramboll)
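Ramboll's internal tool is not public. The core pattern the abstract describes (extract PDF text, fall back to OCR for scanned pages, then index keyword hits by page) can be sketched with {pdftools} and {stringr}; paths and keywords below are placeholders:

```r
# Sketch of the document-screening pattern described in the abstract:
# read PDF text, fall back to OCR for image-only pages, and index
# keyword hits by page. Paths and keywords are placeholders.
library(pdftools)
library(stringr)

keywords <- c("PFOA", "PFOS", "335-67-1", "fluorosurfactant")  # example terms

screen_pdf <- function(path, terms) {
  pages <- pdf_text(path)                     # one string per page
  if (all(!nzchar(str_trim(pages)))) {
    pages <- pdf_ocr_text(path)               # OCR for scanned documents
  }
  hits <- lapply(seq_along(pages), function(p) {
    found <- terms[str_detect(pages[p], fixed(terms, ignore_case = TRUE))]
    if (length(found)) data.frame(file = path, page = p, term = found)
  })
  do.call(rbind, hits)
}

files   <- list.files("docs", pattern = "\\.pdf$", full.names = TRUE)
results <- do.call(rbind, lapply(files, screen_pdf, terms = keywords))
```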
18:15–19:30 Gross Hall Energy Hub Partially automated driving, EEG signals, and eye-tracking data: Using R to consolidate multiple subject files for machine learning models

More info: This multimodal driving simulation study explores the relationship between drivers’ attention and conversational prompts to enhance performance while using partially automated cars. Each participant produced five separate eye-tracking datasets and a Muse Headset EEG dataset for each of four separate driving scenarios with differing cognitive workloads. The present work highlights how R was used to consolidate these datasets so they can be used with machine learning packages, including caret, randomForest, and gbm. Critically, generating an initial dataset containing the names of the available CSV files allows this process to be iterated across all files in the working directory. All eye-tracking data for a participant in a given scenario were read into a pooled dataset to determine, on a second-by-second basis, whether any screen had a positive indicator that the participant was looking at the proper screen. This second-by-second consolidated eye-tracking dataset was then combined with the second-by-second EEG dataset for the participant for that scenario. Once consolidated datasets had been made for all participants’ scenarios, they were stacked and prepared for analysis. Using an 80/20 training/test split, five EEG signals (Alpha, Beta, Gamma, Delta, and Theta) from four channels (AF7, AF8, TP9, and TP10) were used to predict whether the participant was looking at the proper screen with around 85% accuracy. Implications of the attentional considerations reflected in these data for driver safety will be discussed.

Date and time: Fri, Aug 8, 2025 - 18:15–19:30

Author(s): Jesse DeLaRosa (Duke Clinical Research Institute); Xiaolu Bai (North Carolina State University), Jing Feng (North Carolina State University)

Keyword(s): data wrangling, statistical learning, eeg, eye-tracking, self-driving cars
Jesse DeLaRosa (Duke Clinical Research Institute)
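The study data are not shown. A stripped-down sketch of the consolidation-then-classification pattern described above, with placeholder file paths and column names:

```r
# Stripped-down sketch of the pattern described in the abstract:
# read per-subject CSV files, bind them into one dataset, and fit a
# random forest with an 80/20 split. Paths and columns are placeholders.
library(readr)
library(dplyr)
library(purrr)
library(caret)
library(randomForest)

files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)

eeg <- map_dfr(files, read_csv, show_col_types = FALSE) |>
  mutate(on_screen = factor(on_screen))   # second-by-second label (placeholder)

set.seed(42)
idx   <- createDataPartition(eeg$on_screen, p = 0.8, list = FALSE)
train <- eeg[idx, ]
test  <- eeg[-idx, ]

fit  <- randomForest(on_screen ~ ., data = train)
pred <- predict(fit, test)
confusionMatrix(pred, test$on_screen)
```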
18:15–19:30 Gross Hall Energy Hub R-Ladies Global: Promoting Diversity and Inclusion in the R Community for Nearly a Decade

More infoR-Ladies Global is a worldwide organization focused on achieving proportionate representation of genders currently underrepresented in the R programming community. To meet this goal, we support a network of local chapters who organize events that encourage, inspire, and empower individuals to meet their programming potential. Since R-Ladies Global was founded in 2016, it has grown to provide training and mentoring to over 100,000 members, in 244 chapters, and 63 countries. Local chapters have held over 4,200 events focused on a wide range of topics from workshops on popular R programming libraries (e.g., data.table, Shiny) and R package development to data science panels on various topics (e.g., Ethics in Data Science, Women in Tech) to hackathons (e.g., #TidyTuesday) to speaking opportunities (e.g., Lightning Talks) to networking events (e.g., dinner meet-ups, book clubs). The organization also maintains a directory of speakers, an abstract review system for conferences, and a YouTube channel with recordings of events, among other valuable resources. There are many leadership, training, career development, and mentoring opportunities for professionals who join R-Ladies Global. Please stop by our poster or visit our website (https://rladies.org/about-us) to learn more about how you can get involved and/or support our mission.

Date and time: Fri, Aug 8, 2025 - 18:15–19:30

Author(s): Sheila Saia (R-Ladies RTP); R-Ladies Global Team (R-Ladies Global)

Keyword(s): r-ladies, coding, community, diversity
Sheila Saia (R-Ladies RTP)
18:15–19:30 Gross Hall Energy Hub Reproducible Research at Scale using R

More infoReproducibility and collaboration are essential for scalable scientific research. In this presentation, we outline a workflow that integrates open-source tools with Posit’s proprietary products to enable and streamline reproducible research. Our approach leverages Backstage for automating project setup via template workflows, GitLab for version control and code review, and renv and Posit Package Manager for R environment management (internally- and externally-developed packages). Putting it all together, analytic outputs are created via Quarto in Posit Workbench and then shared with our clinical experts via Posit Connect. These templated workflows and tools allow Flatiron Health scientists to consistently produce high-quality analytic reports that can easily be reproduced and distributed across the organization. Our templates further ensure that developer setup is streamlined and output is styled consistently across projects and teams. By integrating industry-leading, open source tools, we create a robust, scalable workflow that embeds reproducibility and enhances collaboration across research teams.

Date and time: Fri, Aug 8, 2025 - 18:15–19:30

Author(s): Nicole Nasrallah (Flatiron Health), Benjamin Wagner (Flatiron Health); Nicole Nasrallah (Flatiron Health), Michael Thomson (Flatiron Health), Erica Yim (Flatiron Health)

Keyword(s): reproducibility, workflow, research
Nicole Nasrallah (Flatiron Health)
Benjamin Wagner (Flatiron Health)
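The internal Backstage and GitLab templates are not shown; the open-source pieces named in the abstract ({renv} for environments, Quarto for outputs) follow their standard commands:

```r
# Standard usage of the open-source building blocks named in the abstract.
# "analysis.qmd" is a placeholder report.
renv::init()        # create a project-local library and lockfile
renv::snapshot()    # record exact package versions in renv.lock
renv::restore()     # reproduce that library on another machine or in CI

quarto::quarto_render("analysis.qmd")   # render the analytic report
```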
18:15–19:30 Gross Hall Energy Hub Safety first: Design-informed inference for treatment effects via the propertee package for R

More infoWhen treatments are allocated by cluster, it is vital for correct inference that the clustering structure be tracked and appropriately attended to. In randomized trials and observational studies modeled on RCTs, clustering is determined at the early stage of study design, with subtle but important implications for the later stage of treatment effect estimation. A first contribution of our “propertee” R package is to make analysis safer by providing self-standing functions to record treatment allocations, with the thus-encoded study design informing subsequent calculations of inverse probability weights, if requested, and of standard errors. A second contribution is to facilitate the use of precision-enhancing predictions from models fitted to external or partly external samples. The user experience is kept simple by adapting such familiar R mechanisms as predict(), lm(), offset(), the sandwich package and summary(); under the hood it stacks estimating equations for sandwich estimates of variance. The propertee package makes it easy and safe to produce Hajek- or block fixed effect estimates with appropriate standard errors, even in the presence of grouped assignment to treatment, repeated measures, subgroup-level estimation and/or covariance adjustment.

Date and time: Fri, Aug 8, 2025 - 18:15–19:30

Author(s): Ben Hansen (University of Michigan); Adam Sales (Worcester Polytechnic Institute), Xinhe Wang (University of Michigan)

Keyword(s): causal inference, conditional growth status model, design-based, direct adjustment, workflow
Ben Hansen (University of Michigan)
18:15–19:30 Gross Hall Energy Hub Scaling R Support in an Academic Library: Data-Driven Insights from North Carolina State University’s Data Science Consulting Program

More infoNorth Carolina State University’s Data Science Consulting Program, housed within the NC State University Libraries, has seen a steady increase in R-related research requests over the past few years. This poster showcases how our team systematically tracks and analyzes consultation data to identify emerging trends, refine service offerings, and guide staffing decisions. We illustrate key patterns—such as the steady increase in demand for tidyverse-based data wrangling, reproducible workflows with R Markdown, and interactive Shiny applications—and detail how these insights inform our tailored workshops and one-on-one support. In addition to quantitative metrics, we highlight the collaborative and inclusive environment that underpins our consulting approach. From assisting novice users with their very first script to supporting advanced modeling for cross-disciplinary projects, our goal is to maintain a no-judgment, solution-oriented culture that empowers researchers at all skill levels. By sharing anonymized case studies and lessons learned, we demonstrate how a blend of data-driven planning and human-centered consulting can help institutions efficiently scale R support services.

Date and time: Fri, Aug 8, 2025 - 18:15–19:30

Author(s): Claire Murphy (North Carolina State University); Franziska Bickel (North Carolina State University), Alp Tezbasaran (North Carolina State University), Selene Schmittling (North Carolina State University), Shannon Ricci (North Carolina State University)

Keyword(s): research, data-driven analysis, consulting, r support
Claire Murphy (North Carolina State University)
18:15–19:30 Gross Hall Energy Hub The Workplace Wellbeing Assessment: Using R to evaluate organizational factors impacting mental health and wellbeing of international aid workers

More info: The project presented here details the development of a diagnostic instrument, the Workplace Wellbeing Assessment (WWA), to help international aid organizations assess organizational structures that impact employee wellbeing. Background: Humanitarian and development organizations (aid organizations) operate in environments that expose employees to unique stressors, including armed conflict, natural disasters, and working with traumatized clients. However, internal organizational factors have been shown to have a similarly large impact on employee mental health and wellbeing as the aforementioned acute stressors. These factors include unsupportive policies, unclear role expectations, poor work-life balance, inadequate compensation, and a lack of mental health resources. There is an urgent need for a practical tool to 1) help organizations evaluate how their policies and practices affect employee mental health, and 2) provide evidence-based recommendations for amending these policies and practices. The instrument: The WWA consists of three distinct parts. 1. The questionnaire, built in Qualtrics and based on the Workplace Mental Health & Well-Being framework (https://www.hhs.gov/sites/default/files/workplace-mental-health-well-being.pdf), allows employees to provide input on key organizational structures, policies, and cultural aspects that influence employee wellbeing. 2. The assessment tool uses the qualtRics R package to retrieve the survey data via the Qualtrics API. It then decodes the questionnaire responses and provides a score for each of the five “Essentials” of the framework as well as their individual components, offering an objective evaluation of the organization’s strengths and weaknesses with regard to mental health and wellbeing. Finally, it uses the OpenAI API to summarize the survey’s text responses, providing organizations with anonymized qualitative data. 3. The recommendation tool consists of a Shiny app that acts as a data dashboard, providing organizations with key charts, tables, and insights. Furthermore, for components with low scores, the dashboard provides tailored, actionable, evidence-based guidance (written by the author) on how to improve these problem areas. While the codebase is relatively simple, this project serves as an example for public health practitioners that R is a versatile tool that can be used for more than biostatistics and academic research - in this case, to drive employee mental health and wellness improvements for international aid organizations.

Date and time: Fri, Aug 8, 2025 - 18:15–19:30

Author(s): Julius Torres Kellinghusen (New York University - School of Global Public Health)

Keyword(s): mental health, humanitarian aid, survey analysis, shiny, ai
Julius Torres Kellinghusen (New York University - School of Global Public Health)
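The WWA codebase is not shown. The Qualtrics retrieval step it describes uses the {qualtRics} package, roughly as follows; the credentials and survey selection are placeholders:

```r
# Sketch of the survey-retrieval step described in the abstract, using
# {qualtRics}. The API key, base URL, and survey choice are placeholders.
library(qualtRics)

qualtrics_api_credentials(
  api_key  = "YOUR_API_KEY",
  base_url = "yourdatacenter.qualtrics.com"
)

surveys <- all_surveys()                        # list available surveys
wwa_raw <- fetch_survey(surveyID = surveys$id[1])

# Downstream: score each framework component, then surface results in Shiny.
head(wwa_raw)
```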
18:15–19:30 Gross Hall Energy Hub Using R and explainable machine learning to estimate green value of household

More info: We show how to achieve our objective of estimating the green value of housing by focusing on energy performance labels, in order to understand how housing prices evolve when energy performance improves, using the R statistical language and several packages that implement state-of-the-art techniques. Instead of fitting a hedonic model (a special kind of linear model), as was done in previous work, we fit random forests or XGBoost models. Unlike linear models, which directly reveal the relative importance of the variables via coefficients, these complex models require alternative methods to quantify the impact of the input variables. Shapley values are often used to tackle this issue for random forests and XGBoost models, which do not provide explicit coefficients. Their calculation guarantees that each feature is fairly represented, taking into account all possible combinations of variables. However, with non-linear and complex models such as random forests and XGBoost, the exact calculation of Shapley values becomes computationally prohibitive. As a consequence, we used more efficient approximation methods such as SHAP, KernelSHAP, and FastSHAP to interpret the predictions given by the models, and we propose an estimate of the “green value” of a home. We will wrap an example dataset as well as the scripts that were used in this work in a package that will be available online.

Date and time: Fri, Aug 8, 2025 - 18:15–19:30

Author(s): Myriam Maumy (EHESP); Frederic Bertrand (Troyes University of Applied Sciences), Elizaveta Logosha

Keyword(s): machine learning, explainable ai, random forests, shapley values, green value
Myriam Maumy (EHESP)
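The housing data are not public. As a generic illustration of SHAP-style attribution for an XGBoost model, the sketch below uses xgboost's built-in predcontrib output on a toy dataset; the KernelSHAP and FastSHAP approximations mentioned in the abstract are alternatives not shown here:

```r
# Generic illustration of SHAP-style attribution for an XGBoost model.
# Uses xgboost's built-in 'predcontrib' output on mtcars, not the
# housing data or the KernelSHAP/FastSHAP methods from the abstract.
library(xgboost)

X <- as.matrix(mtcars[, c("wt", "hp", "disp", "drat")])
y <- mtcars$mpg

bst <- xgboost(data = X, label = y, nrounds = 50,
               objective = "reg:squarederror", verbose = 0)

# Per-observation SHAP contributions (last column is the bias term).
shap <- predict(bst, X, predcontrib = TRUE)
colMeans(abs(shap))   # rough ranking of feature importance
```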
18:15–19:30 Gross Hall Energy Hub Utilizing R and Terra to describe the geographic distribution of patients with Early and Advanced Non-Small Cell Lung Cancer

More info: In this case study we demonstrate the use of the terra package to create a choropleth map describing the distribution of patients with early and advanced stage Non-Small Cell Lung Cancer (NSCLC) within the continental United States, using data from the nationwide Flatiron Health electronic health record (EHR)-derived deidentified database. The study included adults aged ≥18 years who were diagnosed with NSCLC between January 2011 and December 2023. US 3-digit ZIP code area boundaries derived from the US Census Bureau’s ZIP Code Tabulation Areas were processed in R using the terra package and then linked to de-identified patient addresses. Within each ZIP3 boundary, the number of patients with early or advanced stage at diagnosis was summarized within 8 levels: less than 20, 21-50, 51-75, 76-100, 101-500, 501-1000, 1001-2000, and 2001 or greater. This visualization approach provides a clear representation of the geographic availability of data within the clinical data source and can be used to inform targeted clinical trial enrollment efforts or assess geographic representativeness of observational real-world studies.

Date and time: Fri, Aug 8, 2025 - 18:15–19:30

Author(s): Yunzhi Qian; Spencer Langerman (Flatiron Health)

Keyword(s): gis, real-world data, data visualization, oncology
Yunzhi Qian
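Patient-level data cannot be shared. The mapping step described above follows the usual {terra} pattern of reading boundaries, joining a count table, and plotting by attribute; the shapefile path and counts below are placeholders:

```r
# Sketch of the mapping step described in the abstract with {terra}:
# read ZIP3 boundaries, join patient counts, plot a choropleth.
# The shapefile path and the count table are placeholders.
library(terra)

zcta3 <- vect("zip3_boundaries.shp")            # ZIP3 polygons (placeholder)

counts <- data.frame(                           # hypothetical patient counts
  zip3 = c("100", "190", "606", "770"),
  n    = c(35, 120, 640, 1500)
)

zcta3 <- merge(zcta3, counts, by = "zip3")

plot(zcta3, "n",
     breaks = c(0, 20, 50, 75, 100, 500, 1000, 2000, Inf),
     main   = "NSCLC patients per ZIP3 area")
```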
Day 2: Saturday, August 9, 2025
Room | Title, abstract, and more info | Presenter(s)
Keynote #2
09:00–10:00 Penn 1 R in the AI Era: Leveraging Modern Technologies in Practice

More info: R is popular in part due to its extensive package ecosystem, which allows for the incorporation of new technologies and statistical methods. The language has been designed explicitly to focus on data, making it easy for users to apply diverse tools and methods in the course of data analysis. Although R itself is more than two decades old, the combination of those two major elements enables it to adapt to tools and techniques quickly, benefitting both researchers and data analysts. In this talk we will illustrate using R on real examples with current technologies, moving from the Big Data era of distributed data analysis to the AI era of fitting and evaluating attention models and leveraging large language models for both analysis and retrieval.

Date and time: Sat, Aug 9, 2025 - 09:00–10:00

Author(s): Simon Urbanek (University of Auckland + R Core)

Keyword(s): NA

Video recording available after conference: ✅
Simon Urbanek (University of Auckland + R Core)
Data visualization
10:30–12:00 Penn 1 From #EconTwitter to the White House: Real-Time Economic Data with R

More infoIt’s not just financial markets. Policy and economics reporters, commentators, and public officials all use real-time analysis of leading economic data as soon as it is available. Moments after the release of jobs numbers, inflation rates, or GDP data, policymakers, journalists, and commentators dive into real-time interpretation and visualization. In this high-speed environment, the right tools are essential, and R stands out as particularly powerful. Join Mike Konczal as he shares his firsthand experiences using R in real-time following data releases to create viral graphics on #EconTwitter, prepare quotes for reporters and materials for media appearances, and even coordinate analysis at the White House, where he served covering economic data for the National Economic Council. You’ll learn the process, from how to access and manipulate government economic data to making your own economic work clear and accessible to the broader public.

Date and time: Sat, Aug 9, 2025 - 10:30–12:00

Author(s): Mike Konczal (Economic Security Project)

Keyword(s): economics, politics, finance, macroeconomics, public communications

Video recording available after conference: ✅
Mike Konczal (Economic Security Project)
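The talk's own code is not shown. One common way to pull a government data series into R shortly after release is the {fredr} client for the FRED API; this is an illustrative choice, not necessarily the speaker's workflow, and the API key is a placeholder:

```r
# Illustrative pull of a government economic series via {fredr} (FRED API).
# This is one common option, not necessarily the speaker's tooling.
library(fredr)
library(ggplot2)

fredr_set_key("YOUR_FRED_API_KEY")   # placeholder key

unrate <- fredr(series_id = "UNRATE",
                observation_start = as.Date("2019-01-01"))

ggplot(unrate, aes(date, value)) +
  geom_line() +
  labs(title = "US unemployment rate", x = NULL, y = "Percent")
```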
10:30–12:00 Penn 1 Visualising Uncertainty with ggdibbler

More infoAdding uncertainty representation in a data visualisation can help in decision-making. There is an existing wealth of software designed to visualise uncertainty as a distribution or probability. These visualisations are excellent for helping understand the uncertainty in our data, but they may not be effective at incorporating uncertainty to prevent false conclusions. Successfully preventing false conclusions requires us to communicate the estimate and its error as a single “validity of signal” variable, and doing so proves to be difficult with current methods. In this talk, we introduce ggdibbler, a ggplot extension that makes it easier to visualise uncertainty in plots for the purposes of preventing these “false signals”. We illustrate how ggdibbler can be seamlessly integrated into existing visualisation workflows and highlight the effect of these changes by showing the alternative visualisations ggdibbler produces for a choropleth map.

Date and time: Sat, Aug 9, 2025 - 10:30–12:00

Author(s): Harriet Mason (Monash University); Dianne Cook (Monash University, Australia), Sarah Goodwin (Monash University), Susan Vanderplas (University of Nebraska - Lincoln)

Keyword(s): uncertainty, data visualisation, ggplot, r package

Video recording available after conference: ✅
Harriet Mason (Monash University)
10:30–12:00 Penn 1 Visualizing time with ggtime’s grammar of temporal graphics

More infoWhile several commonly used plots exist for visualizing time series, little work has been done to formalize them into a unified grammar of temporal graphics. Re-expressing traditional time series graphics such as time plots and seasonal plots with grammatical elements supports deeper customization options. Composable grammatical elements provide the flexibility needed to easily visualize multiple seasonality, cycles, and other complex temporal patterns. These modular elements can be composed together to create familiar time series graphics, and also recombined to create new informative plots. The ggtime package extends the ggplot2 ecosystem with new grammar elements and plot helpers for visualising time series data. These additions leverage calendar structures to visually align time points across different granularities and timezones, warp time to standardize irregular durations, and wrap time into compact calendar layouts. In this talk, I will introduce ggtime and demonstrate how its grammar of temporal graphics enables a flexible visualization of time series patterns.

Date and time: Sat, Aug 9, 2025 - 10:30–12:00

Author(s): Mitchell O’Hara-Wild (Monash University); Cynthia Huang (Monash University)

Keyword(s): grammar of graphics, time series, calendars, package design, ggplot2 extension

Video recording available after conference: ✅
Mitchell O’Hara-Wild (Monash University)
10:30–12:00 Penn 1 tinyplot: convenient and customizable base R plots

More info: The {tinyplot} package (https://grantmcdermott.com/tinyplot/) provides a lightweight extension of the base R graphics system. It aims to pair the concise syntax and flexibility of base R plotting with the convenience features pioneered by newer ({grid}-based) visualization packages like {ggplot2} and {lattice}. This includes the ability to plot grouped data with automatic legends and/or facets, advanced visualization types, and easy customization via ready-made themes. This talk will provide an introduction to {tinyplot} in the form of various plotting examples, describe its motivating use-cases, and also contrast its advantages (and disadvantages) compared to other R visualization libraries. The package is available on CRAN.

Date and time: Sat, Aug 9, 2025 - 10:30–12:00

Author(s): Grant McDermott (Amazon); Vincent Arel-Bundock (Université de Montréal), Achim Zeileis (Universität Innsbruck)

Keyword(s): data viz, base graphics

Video recording available after conference: ✅
Grant McDermott (Amazon)
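A small taste of the {tinyplot} syntax described above: a grouped scatter plot with an automatic legend, using a built-in dataset:

```r
# A small taste of {tinyplot}: grouped base-R-style plotting with an
# automatic legend, using the built-in iris data.
library(tinyplot)

tinyplot(Sepal.Length ~ Petal.Length | Species, data = iris,
         pch = 19, grid = TRUE,
         main = "Grouped scatter with automatic legend")
```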
Modeling 1
10:30–12:00 Penn 2 Adding new algorithms to {tidyclust}

More info: The {tidyclust} package, released in 2022, brings unsupervised learning to the {tidymodels} framework. This talk will share an overview of the process by which new models and algorithms are added to the {tidyclust} collection, based on recent work adding five new models for clustering and data mining (DBSCAN, GMM, BIRCH, itemset mining, and association rules). We will discuss in depth the complications - programmatic, algorithmic, and philosophical - of adapting a supervised learning framework to unsupervised and semi-supervised settings. For example, what does it mean to tune a parameter in the absence of validating prediction metrics? How should row-based clustering be processed differently than column-based clustering? This talk is aimed at R users and developers who want to think deeply about the intersection between code design choices and methodological principles in unsupervised learning, and who want to peek behind the curtain of the {tidyclust} package framework.

Date and time: Sat, Aug 9, 2025 - 10:30–12:00

Author(s): Kelly Bodwin (California Polytechnic State University)

Keyword(s): tidymodels, tidyclust, unsupervised learning, clustering, package development

Video recording available after conference: ✅
Kelly Bodwin (California Polytechnic State University)
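For orientation, the existing {tidyclust} interface that new algorithms plug into looks like this (k-means shown; the newly added models follow the same specify/fit/extract pattern):

```r
# The existing {tidyclust} interface that new algorithms plug into,
# shown with k-means; the newly added models follow the same pattern.
library(tidyclust)

spec <- k_means(num_clusters = 3) |>
  set_engine("stats")

fit <- fit(spec, ~ ., data = mtcars)

extract_cluster_assignment(fit)          # cluster label per row
extract_centroids(fit)                   # cluster centers
predict(fit, new_data = mtcars[1:5, ])   # assign new observations
```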
10:30–12:00 Penn 2 Modeling Eviction Trends in Virginia

More info: Virginia is home to 5 of the top 10 cities in the country with the highest rates of eviction. Using civil court records, we are able to analyze the behavior of landlords, so that we can hold those in power accountable to make effective and just change. Where do landlords engage in more eviction actions? What characteristics of renters or landlords increase the practice of serial filing? Using administrative data – information collected by government and agencies in the implementation of public programs – we are able to evaluate systems and promote more just outcomes. Working with the Civil Court Data Initiative of the Legal Services Corporation, we use data collected from civil court records in Virginia to analyze the behavior of landlords. Expanding on our Virginia Evictors Catalog, we use data on court evictions to build additional data tools to support the work of legal and housing advocates and model key eviction outcomes to contribute to our understanding of landlord behavior. First, we visualized eviction activity across the state in an interactive Shiny app to address questions and needs of organizations providing legal, policy, and community advocacy. In addition, we estimated landlord actions – eviction filings and serial filings – as a function of community and landlord characteristics. Using a series of mixed-effects models, with data aggregated to ZIP codes nested in counties, we estimated the impact of community characteristics and landlord attributes on the likelihood of eviction filings. Participants will walk away with a better understanding of what influences landlord behavior, and will have a framework for investigating the practice in their own communities.

Date and time: Sat, Aug 9, 2025 - 10:30–12:00

Author(s): Michele Claibourn (Center for Community Partnerships), Samantha Toet (Center for Community Partnerships)

Keyword(s): shiny, data visualization, mixed-effects modeling, geography, social science

Video recording available after conference: ✅
Michele Claibourn (Center for Community Partnerships)
Samantha Toet (Center for Community Partnerships)
10:30–12:00 Penn 2 Predictive Modeling with Missing Data

More infoMost predictive modeling strategies require there to be no missing data for model estimation. When there is missing data, there are generally two strategies for working with missing data: 1.) exclude the variables (columns) or observations (rows) where there is missing data; or 2.) impute the missing data. However, data is often missing in systematic ways. Excluding data from training ignores potentially predictive information, and for many imputation procedures the missing completely at random (MCAR) assumption is violated. The medley package implements a solution to modeling when there are systematic patterns of missingness. A working example of predicting student retention from a larger study of the Diagnostic Assessment and Achievement of College Skills (DAACS) will be explored. In this study, demographic data was collected at enrollment from all students and then students completed diagnostic assessments in self-regulated learning (SRL), writing, mathematics, and reading during their first few weeks of the semester. Although all students were expected to complete DAACS, there were no consequences, and therefore a large percentage of students completed none or only some of the assessments. The resulting dataset has three predominant response patterns: 1.) students who completed all four assessments, 2.) students who completed only the SRL assessment, and 3.) students who did not complete any of the assessments. The goal of the medley algorithm is to take advantage of missing data patterns. For this example, the medley algorithm trained three predictive models: 1.) demographics plus all four assessments, 2.) demographics plus SRL assessment, and 3.) demographics only. For both training and prediction, the model used for each student is based upon what data is available. That is, if a student only completed SRL, model 2 would be used. The medley algorithm can be used with most statistical models. For this study, both logistic regression and random forest are used. The accuracy of the medley algorithm was 3.5% better than using only the complete data and 3.1% better than using a dataset where missing data was imputed using the mice package. The medley package provides an approach for predictive modeling using the same training and prediction framework R users are accustomed to using. There are numerous parameters that can be modified including what underlying statistical models are used for training. Additional diagnostic functions are available to explore missing data patterns.
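
The core idea, separate models per missingness pattern with the richest applicable model chosen at prediction time, can be sketched in a few lines of base R. This is a conceptual illustration only, not the medley API, and the data set and column names are hypothetical:

```r
# Conceptual sketch (not the medley API); 'students' is a hypothetical data frame
# with an outcome 'retained' and predictors matching the three response patterns
patterns <- list(
  full       = c("demog", "srl", "writing", "math", "reading"),
  srl_only   = c("demog", "srl"),
  demog_only = c("demog")
)

# Train one logistic model per pattern on the rows where those predictors are observed
fits <- lapply(patterns, function(vars) {
  keep <- complete.cases(students[, vars])
  glm(reformulate(vars, "retained"), family = binomial, data = students[keep, ])
})

# At prediction time, use the richest model whose predictors are all observed
predict_medley <- function(newrow) {
  for (p in names(patterns)) {
    if (all(!is.na(newrow[patterns[[p]]]))) {
      return(predict(fits[[p]], newdata = newrow, type = "response"))
    }
  }
  NA_real_
}
```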

Date and time: Sat, Aug 9, 2025 - 10:30–12:00

Author(s): Jason Bryer (City University of New York)

Keyword(s): predictive modeling, r package

Video recording available after conference: ✅
Jason Bryer (City University of New York)
10:30–12:00 Penn 2 jarbes: an R package for Bayesian parametric and nonparametric bias correction in meta-analysis

More infoMeta-analysis methods help researchers answer questions that require combining statistical results across several studies. Very often, the only available studies are of different types and of varied quality. Therefore, when we combine disparate evidence at face value, we are not only combining results of interest but also potential biases that might threaten the quality of the results. Consequently, the results of the meta-analysis could be misleading. This work presents the R package jarbes, “Just a rather Bayesian Evidence synthesis.” This package has been designed explicitly for Bayesian evidence synthesis and meta-analysis. It implements a family of Bayesian parametric and nonparametric models for meta-analysis that account for multiple biases. A model in jarbes is built upon two submodels: one that contains the parameters of interest (e.g., a pooled mean across studies) and another that accounts for biases. The biases submodel addresses hidden factors that may distort study results (e.g., selection bias, dilution bias, reporting bias) and are not directly observable. This model-building strategy allows the model of bias to correct the meta-analysis affected by biased evidence. We present two real examples of applying the Bayesian nonparametric modeling functionality of jarbes. The first combines studies of different types and quality, and the second shows the effect of bias correction in nonparametric meta-regression. References Verde, P. E. (2024), “jarbes: An R Package for Bayesian Evidence Synthesis.” Version 2.2.3. https://CRAN.R-project.org/package=jarbes Verde, P. E. and Rosner, G. L. (2025), A Bias-Corrected Bayesian Nonparametric Model for Combining Studies With Varying Quality in Meta-Analysis. Biometrical Journal., 67: e70034. https://doi.org/10.1002/bimj.70034

Date and time: Sat, Aug 9, 2025 - 10:30–12:00

Author(s): Pablo Verde (University of Dusseldorf)

Keyword(s): meta-analysis, bayesian nonparametrics, bias-correction, evidence synthesis

Video recording available after conference: ✅
Pablo Verde (University of Dusseldorf)
Case studies
10:30–12:00 Penn Garden From Copy-Paste Chaos to Reproducible Workflows: A Wet Lab Researcher’s Journey into R

More infoAs a wet lab researcher, I used to struggle with fragmented data analysis workflows. I was taught: You do your experiments, you get your data, you copy-paste into separate software packages for descriptive statistics, visualisation, and documentation. I was constantly frustrated with data analysis: Change something early in the analysis? Go back and copy-paste. How did I analyse similar data sets previously while working at a different institute? Good luck opening that proprietary file format without that software and the license. Learning R transformed how I approach data, not just by replacing individual tools but reshaping my entire understanding of analysis. Beyond statistics, R introduced me to better data organisation, reproducible analysis, meaningful visualisation, and a community dedicated to improving data analysis and reporting. Working with R taught me more than any course on data analysis ever did. Now I use RMarkdown and Quarto daily to document and report my research. These tools allow me to standardise workflows, making my analyses reproducible and independent of proprietary software that might not be available in all research settings. Beyond improving my own work, these tools have become invaluable for guiding students, e.g. providing example workflows for common assays, and visualisations to help them better understand their data. In my talk, I will share my journey from chaotic spreadsheets to a reproducible, streamlined workflow. I will showcase the specific tools I use and how they have improved my research. Lastly, I will invite other wet lab researchers to discuss how these tools can help address reproducibility challenges in data analyses.

Date and time: Sat, Aug 9, 2025 - 10:30–12:00

Author(s): Anna Jaeschke

Keyword(s): wet lab research, workflow, experimental research

Video recording available after conference: ✅
Anna Jaeschke
10:30–12:00 Penn Garden Readability: New ways to improve communication at the Central Bank of Chile

More infoThis study presents the development of a Shiny application, created entirely within the Central Bank of Chile, to improve the readability of its monetary policy communications. Effective communication is essential for central banks, as it influences expectations and decision-making. However, technical language and complex sentence structures often hinder comprehension. Initially, readability was assessed using the perspicuity index, an adaptation of the Flesch-Kincaid index. However, this method does not identify the specific sources of difficulty, especially in Spanish. To address this, a new theoretical framework was developed, identifying five key complexity dimensions: (1) nominalization, (2) gerunds, (3) depth of dependency, (4) subordinations, and (5) language complexity. Using Natural Language Processing (NLP), the Shiny application detects readability challenges by: 1. Calculating the percentage of sentences with readability issues. 2. Highlighting complex structures within the text. 3. Providing sentence-level breakdowns of readability difficulties. 4. Comparing language complexity against graded dictionaries. Applying this tool to monetary and financial policy reports since 2018 revealed that approximately 30% of the content contains readability challenges. The monetary policy summaries correlate strongly with the perspicuity index, indicating that most readability issues stem from syntactic complexity. In contrast, financial policy summaries show lower correlation, as their difficulty arises from long words and technical terms. Since its first use in December 2022, the application has played a key role in reducing text complexity in official reports. However, an increase in complexity in June 2023, following a change in report authorship, underscores the importance of user adoption in ensuring consistent readability improvements. Ultimately, this initiative highlights the need for tailored readability strategies across different policy instruments. While monetary policy documents benefit from structural simplifications, financial policy texts require a more nuanced approach that considers both syntax and terminology. Additionally, the study demonstrates that institutional willingness to adopt readability tools significantly impacts communication effectiveness. By developing this Shiny application, the Central Bank of Chile has taken a significant step toward improving policy communication, ensuring greater clarity and accessibility for diverse audiences.

Date and time: Sat, Aug 9, 2025 - 10:30–12:00

Author(s): Valentina Cortes Ayala (Central Bank of Chile); Karlla Munoz (Central Bank of Chile)

Keyword(s): shiny, communication, central bank, readability

Video recording available after conference: ✅
Valentina Cortes Ayala (Central Bank of Chile)
10:30–12:00 Penn Garden Using R to Track, Monitor and Detect Changes in Movement and Diving Patterns of Beaked Whales off Cape Hatteras, NC

More infoBeaked whales can regularly dive to depths over 2,000m and during these dives hold their breath for over an hour. Understanding this physiological feat, as well as how individuals might alter their behavior when confronted with anthropogenic noise in the form of naval sonar, is a daunting task that requires a diverse team of biologists, data scientists and statisticians. Here we report how we use R as part of a multiyear experiment off Cape Hatteras, NC, where we have monitored the behavior of 117 individual whales across 23 sonar exposures. Using biologging devices that are attached to individual whales, we record data on their acoustic behavior, diving kinematics and swimming behavior across multiple temporal and spatial scales. Using R, we focus our analysis on records detailing diving data every five minutes for two weeks and coarser movement data for approximately one month. Our workflow includes using structured EDA with bespoke R code to examine patterns before and after exposure; R packages (ctmcmove, walkMI) to fit continuous-time discrete space models to movement; and R packages (momentuHMM) to fit multi-state hidden Markov models to the dive data. We bring these together with 4D modeled data on sound propagation in the water column. This workflow allows us to parameterize dose-response models within a Bayesian model written in JAGS to quantify how exposure impacts behavior in this family of deep diving whales.
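
As a flavor of the hidden Markov model step, a heavily hedged {momentuHMM} sketch; the track data, number of states, and starting values are illustrative only and would need tuning for real dive records:

```r
library(momentuHMM)

# 'tracks' is a hypothetical data frame with columns ID, lon, lat (plus covariates)
d <- prepData(tracks, type = "LL", coordNames = c("lon", "lat"))

# Two-state HMM; Par0 gives illustrative starting values
# (step: gamma means and sds per state; angle: von Mises concentrations;
#  add zero-mass parameters if step lengths contain zeros)
m <- fitHMM(
  d,
  nbStates = 2,
  dist = list(step = "gamma", angle = "vm"),
  Par0 = list(step = c(0.5, 3, 0.5, 3), angle = c(0.5, 2))
)
plotStates(m)
```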

Date and time: Sat, Aug 9, 2025 - 10:30–12:00

Author(s): Rob Schick (Southall Environmental Associates, Inc.)

Keyword(s): animal movement, diving, dose-response, hierarchical bayes, workflows

Video recording available after conference: ✅
Rob Schick (Southall Environmental Associates, Inc.)
10:30–12:00 Penn Garden useR to Analyze Emergency Medical and Trauma Data

More infoEmergency Medical Services (EMS) and trauma centers provide life-saving care in critical moments. To support data-driven quality improvement in these high-stakes environments, the nemsqar and traumar R packages were developed to automate performance metric calculations for EMS and trauma care. This talk introduces nemsqar and traumar, which help researchers, data analysts, and public health professionals efficiently process standardized data and generate actionable insights. The nemsqar package simplifies the implementation of National EMS Quality Alliance (NEMSQA) performance measures. It processes National Emergency Medical Services Information System (NEMSIS) data, automating complex quality metric calculations to reduce errors, save time, and support prehospital care decision-making. The traumar package focuses on in-hospital trauma care, offering functions for risk-adjusted mortality metrics and other trauma quality indicators. Designed for flexibility, it supports multiple data sources and advanced statistical modeling to improve patient outcome assessments. This presentation will showcase real-world applications of both packages, demonstrating how they streamline quality reporting and enhance research efficiency. Attendees will see key functionalities, practical use cases, and integration strategies. Finally, the talk will highlight opportunities for community involvement, including contributions to package development, validation efforts, and feature expansion to meet evolving needs.

Date and time: Sat, Aug 9, 2025 - 10:30–12:00

Author(s): Nicolas Foss (Bureau of Emergency Medical and Trauma Services, Division of Public Health, Iowa Health and Human Services)

Keyword(s): ems, trauma, mortality, quality improvement, healthcare

Video recording available after conference: ✅
Nicolas Foss (Bureau of Emergency Medical and Trauma Services, Division of Public Health, Iowa Health and Human Services)
Clinical trials
10:30–12:00 Gross 270 Identifying Adverse Event Under-Reporting in Clinical Trials: A Statistical Approach

More infoAdverse event (AE) detection is a critical component of clinical trials, yet we know that AE under-reporting is a concern with traditional reporting methods. This project reviews AE under-reporting best practices and introduces a new AI/ML framework for detecting unreported AEs using R. This effort is being implemented under the Phuse OpenRBQM project. OpenRBQM is a collaborative effort to create open-source R packages focused on risk-based quality management (RBQM). First, we introduce the {gsm} and {simaerep} packages, which facilitate site- and country-level assessments of AEs. The {gsm} or Good Statistical Monitoring package provides a standardized framework for calculating Key Risk Indicators (KRIs) across all aspects of RBQM, including AE monitoring. The {simaerep} package, developed by the IMPALA consortium, uses advanced statistical methodologies to simulate AE reporting in clinical trials to detect under-reporting sites. The IMPALA and OpenRBQM teams have collaborated to create the {gsm.simaerep} package for use in the {gsm} framework. Finally, we present a new approach that leverages AI/ML techniques to identify specific missed AEs by analyzing data from other clinical trial domains. Using R, we develop models that detect patterns and highlight anomalies indicative of unreported AEs. By applying these methods to real-world clinical trial datasets, we demonstrate how AI/ML can enhance RBQM efforts. This presentation introduces tools that combine standard RBQM methodologies for evaluating adverse event under-reporting with AI methods for identifying specific missed AEs. Attendees will gain insights into implementing R-based techniques to uncover hidden safety signals in clinical research data.
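
For readers new to {simaerep}, a hedged sketch of the site-level under-reporting check using the package's own simulated test data (argument values are illustrative, and the interface may differ across versions):

```r
library(simaerep)

# Simulate a study in which ~5% of sites under-report AEs at a 40% rate
df_visit <- sim_test_data_study(
  n_pat = 1000, n_sites = 100,
  frac_site_with_ur = 0.05, ur_rate = 0.4
)
df_visit$study_id <- "A"

df_site      <- site_aggr(df_visit)
df_sim_sites <- sim_sites(df_site, df_visit)
df_eval      <- eval_sites(df_sim_sites)

plot_study(df_visit, df_site, df_eval, study = "A")
```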

Date and time: Sat, Aug 9, 2025 - 10:30–12:00

Author(s): Laura Maxwell (Atorus Research), Jeremy Wildfire (Gilead Sciences)

Keyword(s): clinical trials, pattern recognition, simulation, ai/ml, biostatistics

Video recording available after conference: ✅
Laura Maxwell (Atorus Research)
Jeremy Wildfire (Gilead Sciences)
10:30–12:00 Gross 270 Implementing function factories for flexible clinical trial simulations

More infoThe R package {simtrial} simulates clinical trial results using fixed or group sequential designs. One of its advantages is that it provides the user with sufficient flexibility to define complex stopping rules for specifying when intermediate analyses are to be performed and which tests are to be applied at each of these analyses. However, this flexibility in the design generates complexity when automating the simulations. In order to provide the desired flexibility while implementing a maintainable simulation framework, I applied a function factory strategy. Function factories are functions that return another function. This enables the user to define an arbitrary set of argument values, but then delay the execution of the function until the simulation is performed. In this presentation, I will provide an overview of function factories and explain how I implemented them in {simtrial}.
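
For readers new to the pattern, a minimal generic sketch of a function factory (not {simtrial}'s internals): the outer call fixes argument values now, and the returned function runs later, for example inside a simulation loop.

```r
# A function factory: make_cutoff_test() returns a new function that remembers
# the alpha chosen now but is only executed later by the simulation
make_cutoff_test <- function(alpha) {
  force(alpha)                 # capture the argument value immediately
  function(pvalues) {
    any(pvalues <= alpha)      # evaluated when the simulation calls it
  }
}

stop_early <- make_cutoff_test(alpha = 0.025)
stop_early(c(0.40, 0.01))      # TRUE: an interim p-value crosses the bound
```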

Date and time: Sat, Aug 9, 2025 - 10:30–12:00

Author(s): John Blischak

Keyword(s): functional programming, simulations, function factories, clinical trials, group sequential design

Video recording available after conference: ✅
John Blischak
10:30–12:00 Gross 270 Reproducible integrated processing of a large investigator-initiated, randomized-controlled multicenter clinical trial using Quarto and R

More infoNon-pharmaceutical clinical research often lacks reproducibility in data processing and analysis. In investigator-initiated trials, where financial resources are scarce, medical researchers must handle data management and analysis themselves, often using suboptimal tools. We present here the use case of a large, multicenter randomized-controlled trial in anesthesiology with over 2,500 enrolled patients. Embedded in a single Quarto-based project using tidyverse-style R, we processed the complete dataset from the electronic case report form from data tidying and analysis through plotting, report drafting, and presentation preparation. Our workflow is fully transparent, reproducible, and adaptive, following approaches demonstrated by Mine Çetinkaya-Rundel at R/medicine and Joshua Cook at posit:conf in 2024. To our knowledge, this represents the largest clinical trial managed using this methodology. This work demonstrates that accessible tools for tidy and reproducible scientific data processing are available even to researchers who are not native data scientists.

Date and time: Sat, Aug 9, 2025 - 10:30–12:00

Author(s): Benedikt Schmid (University Hospital Würzburg, Department of Anaesthesiology, Intensive Care, Emergency and Pain Medicine, Würzburg, Germany); Robert Werdehausen (Department of Anesthesiology and Intensive Care Medicine, University Hospital Leipzig, Germany), Christopher Neuhaus (Department of Anesthesiology, University Hospital Heidelberg, Heidelberg, Germany), Linda Grüßer (Department of Anaesthesiology, RWTH Aachen University Hospital, Germany), Peter Paal (Department of Anaesthesiology and Intensive Care Medicine, Hospitallers Brothers Hospital, Paracelsus Medical University, Salzburg, Austria), Patrick Meybohm (University Hospital Würzburg, Department of Anaesthesiology, Intensive Care, Emergency and Pain Medicine, Würzburg, Germany), Peter Kranke (University Hospital Würzburg, Department of Anaesthesiology, Intensive Care, Emergency and Pain Medicine, Würzburg, Germany), Gregor Massoth (Department of Anaesthesiology and Intensive Care Medicine, University Hospital Bonn, Germany)

Keyword(s): medical research, reproducible workflow, randomized controlled clinical trial

Video recording available after conference: ✅
Benedikt Schmid (University Hospital Würzburg, Department of Anaesthesiology, Intensive Care, Emergency and Pain Medicine, Würzburg, Germany)
10:30–12:00 Gross 270 Retrospective clinical data harmonisation reporting using R and Quarto

More infoThere has been an increase in projects that involve data pooling from multiple sources, because combining data is an economical way to increase the statistical power of an analysis of a rare outcome that could not be addressed using data from a single project. Prior to statistical or machine learning analysis, a data steward must be able to sort through these heterogeneous inputs and document the process in a coherent way for different stakeholders. Despite its importance in the big data environment, there are limited resources on how to document this process in a structured, efficient and robust way. This presentation will provide an overview of how I create clinical data harmonisation reports using several R packages and a Quarto book project. A small preview can be found at https://github.com/JauntyJJS/harmonisation. Attendees will learn the basic framework for creating a Quarto book or website to document data harmonisation processes; the basic workflow of data harmonisation; how to apply data validation while writing harmonisation code so that the workflow is robust to changes in the input data; ways to show higher management (with limited programming experience) in the harmonisation report that the code works (it is not enough to say that unit tests were used); and how to write an R script that creates many data harmonisation reports (one technical report for each pooled cohort and one report summarising the harmonisation process across all cohorts).
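
A hedged sketch of scripting many parameterised reports with the {quarto} package; the template file name, parameter, and cohort labels below are hypothetical:

```r
library(quarto)

cohorts <- c("cohort_a", "cohort_b", "cohort_c")   # hypothetical cohort labels

for (cohort in cohorts) {
  quarto_render(
    input          = "harmonisation_report.qmd",   # hypothetical parameterised template
    execute_params = list(cohort = cohort),
    output_file    = paste0("harmonisation_", cohort, ".html")
  )
}
```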

Date and time: Sat, Aug 9, 2025 - 10:30–12:00

Author(s): Jeremy Selva (National Heart Centre Singapore)

Keyword(s): data harmonisation, data validation, report making automation, quarto

Video recording available after conference: ✅
Jeremy Selva (National Heart Centre Singapore)
Package lifecycle
13:00–14:15 Penn 1 ARcenso: A Package Born from Chaos, Powered by Community

More infoHistorical census data in Argentina is scattered across multiple formats: books, spreadsheets, PDFs, and REDATAM, without a standardized structure. This lack of organization complicates analysis, requiring manual cleansing and integration of records before working with the data. As R users, we recognized an opportunity to transform this chaos into a meaningful solution not only for personal use but for all R users. That is how {arcenso} was born, a way to provide structured, ready-to-use census data, eliminating repetitive pre-processing and allowing users to focus on analysis with harmonized datasets. The goal is to make national census data in Argentina more accessible. Through the rOpenSci Champions program, the original idea turned into a functional R package. Thanks to the support of the R community, we learned how to structure the package, document datasets, and ensure reproducibility. This journey demonstrated the value of community learning, and those principles are embedded in {arcenso}, making it accessible and user-friendly. {arcenso} is still under active development and has released its first dataset along with three core functions. However, this is just the beginning. There are more datasets to integrate, additional features to develop, and improvements to be made to enhance the user experience. In this talk, we will introduce the package for users from both public and private sectors, including academics and researchers facing data challenges. We will explain the framework used for turning problems into solutions, highlight tools and community resources, and try to inspire others to tackle their own data challenges.

Date and time: Sat, Aug 9, 2025 - 13:00–14:15

Author(s): EMANUEL CIARDULLO, ANDREA GOMEZ VARGAS

Keyword(s): package, community, workflow, census, official statistics

Video recording available after conference: ✅
EMANUEL CIARDULLO
ANDREA GOMEZ VARGAS
13:00–14:15 Penn 1 Curating a Community of Packages: Lessons from a Decade of rOpenSci Peer Review

More infoThemed collections of packages have long been a common feature of the R ecosystem, from the CRAN Task Views to today’s “universes”. These range from tightly integrated toolboxes engineered by a single team, to journal-like repositories of packages passing common standards, or loose collections of packages organized around communities, themes, or development approaches. This talk will share insights for managing package collections, and their communities of developers, gleaned from a decade of rOpenSci’s software peer-review initiatives. I will cover best practices for governing and managing collections, determining scope and standards for packages, onboarding and offboarding, and supporting continuing maintenance. Finally, I will discuss the essential role of mentorship and inclusive practices that support a diverse community of package maintainers and contributors.

Date and time: Sat, Aug 9, 2025 - 13:00–14:15

Author(s): Noam Ross (rOpenSci)

Keyword(s): standards, interoperability, maintenance, mentorship, community

Video recording available after conference: ✅
Noam Ross (rOpenSci)
13:00–14:15 Penn 1 rtables: Challenges, Advances and Lessons Learned Going Into Production At J&J

More infortables is an open-source framework for the creation of complex, multi-faceted tables developed by the author while at Roche. Here, we will discuss the process of adopting rtables at J&J as the linchpin of a larger transition to R and open-source tools for the creation of production outputs in clinical trials. In particular, we will touch on three aspects: development of novel features in rtables required to meet J&J’s specific needs, development of additional tooling around rtables for use by the company’s SPAs, and lessons learned during the process.
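
As background for readers who have not used the package, the {rtables} layouting idiom in a minimal form, using the DM demographics data bundled with the package:

```r
library(rtables)

lyt <- basic_table() |>
  split_cols_by("ARM") |>
  split_rows_by("SEX") |>
  analyze("AGE", afun = mean, format = "xx.x")

build_table(lyt, DM)
```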

Date and time: Sat, Aug 9, 2025 - 13:00–14:15

Author(s): Gabe Becker (Independent)

Keyword(s): clinical trials, tables, tlg, visualization,

Video recording available after conference: ✅
Gabe Becker (Independent)
Teaching 1
13:00–14:15 Penn 2 Coursework RStudio Infrastructure at scale: Duke and NCShare

More infoTwo case studies covering lessons learned running large-scale RStudio infrastructure for coursework at Duke University and [NCShare][1] (an NSF-funded consortium to advance scientific computing and innovate STEM education at North Carolina’s historically marginalized institutions). Each semester Duke provides containerized RStudio instances for over 1200 students. Similar infrastructure is used in the NCShare consortium to provide advanced computing environments to less-resourced higher-education institutions. This talk covers best practices and pitfalls for automation, packaging, management, and support of RStudio and how cross-institutional collaboration can make these environments more widely available. [1]: https://ncshare.org/

Date and time: Sat, Aug 9, 2025 - 13:00–14:15

Author(s): Mark McCahill (Duke University)

Keyword(s): educational consortia, coursework infrastructure, automation

Video recording available after conference: ✅
Mark McCahill (Duke University)
13:00–14:15 Penn 2 Enhancing R Instruction: Adapting Workshops for Time-Constrained Learners

More infoThe Data & Visualization Services Department at North Carolina State University Libraries offers data science support to faculty, staff, students, and the broader community. This support includes data science consulting, workshops and instruction on data science and programming topics, as well as specialized computer lab spaces equipped with hardware and software for data and visualization work. Among these, our introductory workshops are particularly popular. Our Intro to R workshop series consists of three sessions covering basic programming, data cleaning, and data visualization. Participants come from diverse academic and professional backgrounds, with varying levels of coding experience—from no prior exposure to limited familiarity with R or other programming languages. Additionally, they must balance academic, professional, and personal commitments, making it essential to provide efficient yet comprehensive instruction. We recently refined our curriculum to address these challenges in response to direct and observed student feedback. This presentation will explore the specific curriculum changes, the challenges they aim to resolve, and the role of instructor-led workshops in supporting early-stage R learners.

Date and time: Sat, Aug 9, 2025 - 13:00–14:15

Author(s): Selene Schmittling (North Carolina State University); Shannon Ricci (North Carolina State University), Alp Tezbasaran

Keyword(s): instruction, curriculum development, learner diversity, workshops

Video recording available after conference: ✅
Selene Schmittling (North Carolina State University)
13:00–14:15 Penn 2 Rhapsody in R: Exploring Probability Through Music

More infoProbability is often introduced with applications from the natural and social sciences, but its role in the arts is less frequently explored. One example is stochastic music, pioneered by avant garde 20th-century composers like [Iannis Xenakis][1], who used probabilistic models and computer simulations to generate musical structures. While the aesthetic appeal of such music is subjective, its mathematical foundations offer a compelling way to engage students with probability and randomness. This talk presents an assignment for an introductory probability course where students compose their own stochastic music using R. By applying their knowledge of probability distributions and computer simulation, they explore randomization in pitch, rhythm, meter, instrumentation, and harmony — observing emergent patterns along the way. The R package [gm][2] by Renfei Mao provides a user-friendly framework for layering musical elements, while integration with MuseScore allows students to generate sheet music and MIDI playback. This activity not only reinforces key concepts, but also offers students a fun and creative way to apply probability, engaging a different part of their brain than traditional scientific applications. [1]: https://youtu.be/nvH2KYYJg-o?feature=shared [2]: https://cran.r-project.org/web/packages/gm/index.html
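
A small, hedged sketch of the kind of randomised composition described, using {gm} (the pitch set and rhythm are illustrative; rendering the score or audio requires a MuseScore installation, and argument forms may vary between {gm} versions):

```r
library(gm)

set.seed(42)
pitches <- sample(c("C4", "D4", "E4", "G4", "A4"), 8, replace = TRUE)  # random pentatonic pitches

music <- Music() +
  Meter(4, 4) +
  Line(pitches = pitches, durations = rep(0.5, 8))   # eight eighth notes

show(music)   # renders notation / playback via MuseScore
```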

Date and time: Sat, Aug 9, 2025 - 13:00–14:15

Author(s): John Zito (Duke University)

Keyword(s): teaching, probability, music

Video recording available after conference: ✅
John Zito (Duke University)
Web APIs
13:00–14:15 Penn Garden Automating CDISC Metadata Retrieval: An R-Based Approach Using the CDISC Library API

More infoThe CDISC Library API provides a programmatic gateway to clinical data standards, including SDTM and ADaM domains, variables, and controlled terminology. This presentation showcases an R-based approach to integrating the API for automated retrieval and structuring of CDISC metadata and controlled terminology, eliminating the need for manual extraction from PDFs or Excel files. Leveraging R packages such as shiny, httr2, jsonlite, and tidyverse, we demonstrate a reproducible workflow that queries the /mdr/sdtmig/{version} and /mdr/ct/{version} endpoints, parses JSON responses into structured data frames, and presents the results in a web application. Key topics include authentication via API keys, handling nested JSON structures, and ensuring seamless interaction with CDISC’s evolving standards. This approach enhances efficiency, reduces manual effort, and improves traceability in clinical data workflows.
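
A hedged sketch of the request-and-parse step with {httr2}; the version segment, header name, and environment variable are illustrative, so consult the CDISC Library documentation for the exact endpoints and authentication details:

```r
library(httr2)

resp <- request("https://library.cdisc.org/api") |>
  req_url_path_append("mdr", "sdtmig", "3-4") |>   # illustrative version segment
  req_headers(
    `api-key` = Sys.getenv("CDISC_API_KEY"),       # illustrative header / env var
    Accept    = "application/json"
  ) |>
  req_perform()

meta <- resp_body_json(resp)
str(meta, max.level = 1)
```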

Date and time: Sat, Aug 9, 2025 - 13:00–14:15

Author(s): Jagadish Katam

Keyword(s): cdisc, sdtm, adam, controlled terminology, shiny, api

Video recording available after conference: ✅
Jagadish Katam
13:00–14:15 Penn Garden Web APIs for useRs: Getting data from websites, databases, and LLMs

More infoMany websites and services provide APIs, and useRs can take advantage of them to get data, perform database operations, and talk to Large Language Models (LLMs). The httr2 package, with its support for sequential and parallel requests, is a great tool for efficient API interactions. I will demonstrate its use through two real-world examples. First, I will introduce the frstore package, which I developed to interact with Google Firestore, a NoSQL database. While client libraries exist for Python and JavaScript, R users were left out—until now. frstore enables create, read, update, and delete (CRUD) operations using httr2, making it a powerful tool for R users working with Firestore. The second example is a Shiny app designed to create an immersive storytelling experience. Users provide the first sentence of a children’s story, and the app uses httr2 to interact with multiple APIs. Cloudflare’s Workers Model API is used to send requests to text generation and image generation models. Moreover, Eleven Labs’ API converts text to speech for audiobook-like narration. These results are integrated into a Quarto revealjs slide deck that yields a delightful, interactive storytime experience. This talk is aimed at R users of all levels who want to expand their toolkit for web data access and API interactions. Whether you’re scraping data, working with APIs, or building interactive applications, this session will provide practical examples to enhance your R workflows.
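
As a taste of the parallel-request support mentioned above, a minimal {httr2} sketch (placeholder URLs; requires a recent {httr2} release):

```r
library(httr2)

urls <- c("https://example.com/a", "https://example.com/b")  # placeholder endpoints
reqs <- lapply(urls, request)

# Perform the requests in parallel, then parse each JSON body
resps   <- req_perform_parallel(reqs)
results <- lapply(resps, resp_body_json)
```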

Date and time: Sat, Aug 9, 2025 - 13:00–14:15

Author(s): Umair Durrani (Presage Group)

Keyword(s): api, httr2, llm, database, shiny

Video recording available after conference: ✅
Umair Durrani (Presage Group)
13:00–14:15 Penn Garden {plumber2}: Streamlining Web API Development in R

More infoOver the past nine years, the R package {plumber} has simplified the creation of web APIs using annotations over existing R source code with roxygen2-like comments. During this time, the community has gathered valuable insights and identified numerous areas for improvement. To invest in a way forward, a new package called {plumber2} has been created. {plumber2} is designed from the ground up to be highly extensible, enabling developers to easily integrate custom decorators to modify the behavior of their APIs. Furthermore, {plumber2} is built using a modern foundation, leveraging the latest packages associated with the {fiery} framework. This modern architecture is built upon middleware (the ability to introduce custom logic at specific points within the API’s request handling process), one of many fine-grained controls over how your API behaves. By incorporating these improvements and embracing a modern framework, {plumber2} offers a sustainable path forward for building web APIs in R. This new approach avoids the need for short-term fixes and ensures that {plumber2} can continue to evolve and adapt to the changing needs of developers.
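
For context, the annotation approach itself is shown below in classic {plumber} syntax; the {plumber2} equivalents may differ in detail:

```r
# plumber.R -- classic {plumber} annotation style

#* Echo back the input
#* @param msg The message to echo
#* @get /echo
function(msg = "") {
  list(msg = paste0("The message is: '", msg, "'"))
}
```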

Date and time: Sat, Aug 9, 2025 - 13:00–14:15

Author(s): Barret Schloerke (Posit, PBC); Thomas Pedersen (Posit, PBC)

Keyword(s): api, plumber2, plumber, package, web api

Video recording available after conference: ✅
Barret Schloerke (Posit, PBC)
High-dimensional data
13:00–14:15 Gross 270 Introducing riemmtan

More infoThe statistical analysis of random variables that take values in Riemannian manifolds is a rapidly growing area of research. Its main application is the study of connectomes obtained from brain imaging, which belong to the manifold of symmetric positive definite matrices. Large amounts of work have been devoted to addressing a variety of issues including the development of new metrics, new statistical models and visualization techniques. Unfortunately, the tools offered by R to handle this type of data have not evolved with the speed necessary to match the momentum of this growing area of the statistical literature. The R packages Riemann and frechet are important steps in that direction, but new tools are necessary to incorporate recent developments. That is why we are introducing riemmtan, a new R package. Its main goal is to offer a high-level interface that abstracts away many day-to-day operations of this kind of analysis. In addition, it allows the user to exploit the growing capabilities of modern computer clusters by making use of parallelism in several parts of its implementation, including the computation of Fréchet means. Finally, it makes use of the object-oriented programming tools in R to make Riemannian metrics self-contained modules, allowing users to easily implement and experiment with new metrics. We hope riemmtan will become the foundation for an ecosystem of tools that allow for efficient and user-friendly analysis of Riemannian manifold-valued data.

Date and time: Sat, Aug 9, 2025 - 13:00–14:15

Author(s): Nicolas Escobar (Indiana University); Jaroslaw Harezlak (Indiana University)

Keyword(s): riemannian manifolds, connectomics, fmri imaging

Video recording available after conference: ✅
Nicolas Escobar (Indiana University)
13:00–14:15 Gross 270 Machine Learning-Powered Metabolite Identification in R: An Automated Workflow for Identifying Metabolomics Dark Matter

More infoUpwards of 90% of small molecules detected in LC-MS/MS-based untargeted metabolomics are unidentified due to limitations in current analytical techniques. Although this “dark matter” can significantly contribute to disease diagnosis and biomarker discovery, current identification methods are costly and resource-intensive. This study addresses these challenges by developing a computational workflow in R to encode the tandem mass spectra into simplified structural fingerprints, which can be predicted and related to known fingerprints in molecular databases. The developed pipeline includes different R packages such as RSQLite, SF, rcdk, chemminer, caret, sparsepca, rinchi, and rpubchem which finally improves metabolite identification in untargeted metabolomics. A total of 2,973 mass spectra of known and unknown molecules from an in-house high resolution LC-MS/MS study were extracted from an SQL database (mzVault) using the RSQLite package. The collected spectra were converted into machine-readable numbers using the rawToHex and readBin functions from the SF package. SMILES representations of known molecules were obtained by querying their names against PubChem using the rpubchem package. The set of 166 Molecular ACCess System (MACCS) fingerprints were computed for known molecules based on their SMILES using rCDK and ChemmineR packages. In the next step, 166 random forest (RF) models were trained on MS2 spectra of known molecules to model the MACCS fingerprints using the caret package. Before training, spectral data were normalized and subjected to dimensionality reduction using robust sparse principal component analysis (rSPCA) via the sparsepca package. The trained RF models were applied to high-resolution MS2 spectra of unknown molecules to predict their MACCS fingerprints, which were then used for similarity searches in the Human Metabolome Database (HMDB) using the Tanimoto coefficient. Retrieved candidates from HMDB were further refined based on LogP, topological polar surface area (TPSA), molecular mass, and retention time. The workflow was tested on an LC-MS/MS dataset containing 1,071 known and 1,902 unknown compounds. Despite the high dimensionality, rSPCA reduced the data to 25 principal components, preserving 97% of variance. RF models achieved a mean accuracy of 0.87 in 3-fold cross-validation. On average, 4.1±11.31 unique HMDB molecules were listed for each unknown molecule, and the retrieved list was prioritized using a hybrid scoring function. Applying a Tanimoto similarity threshold (>0.7), this workflow identified at least one HMDB match for 1,079 unknowns, improving metabolite identification by 57%. The incorporation of a hybrid scoring system based on Tanimoto similarity and physicochemical properties enhanced candidate ranking and structural elucidation of unknown metabolites.
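
One hedged fragment of the fingerprint step, using {rcdk} on a single illustrative SMILES string (caffeine); the spectra handling and model training described above are beyond a short sketch:

```r
library(rcdk)

# Compute the 166-bit MACCS structural fingerprint from a SMILES string
mol <- parse.smiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")[[1]]   # caffeine, as an example
fp  <- get.fingerprint(mol, type = "maccs")
fp   # a 166-bit fingerprint; the set bits can be expanded into a 0/1 feature vector
```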

Date and time: Sat, Aug 9, 2025 - 13:00–14:15

Author(s): Ahmad Manivarnosfaderani (University of Arkansas for Medical Sciences); Sree V. Chintapalli (University of Arkansas for Medical Sciences), Renny Lan (University of Arkansas for Medical Sciences), Hailemariam Abrha Assress (University of Arkansas for Medical Sciences), Brian D. Piccolo (University of Arkansas for Medical Sciences), Colin Kay (University of Arkansas for Medical Sciences)

Keyword(s): metabolomics, cheminformatics, machine learning, identification

Video recording available after conference: ✅
Ahmad Manivarnosfaderani (University of Arkansas for Medical Sciences)
13:00–14:15 Gross 270 Multi-omics Integration with GAUDI: A Novel R Package for Non-linear Dimensionality Reduction and Interpretable Clustering Analysis

More infoIntegrating high-dimensional multi-omics data presents significant challenges in computational biology, particularly when handling complex non-linear relationships across diverse biological layers. We present GAUDI (Group Aggregation via UMAP Data Integration), a novel R package that leverages Uniform Manifold Approximation and Projection (UMAP) for the concurrent analysis of multiple omics data types. GAUDI addresses key limitations of existing methods by enabling non-linear integration while maintaining interpretability and mitigating bias from datasets with vastly different dimensionalities. The GAUDI R package implements a straightforward yet powerful workflow: (1) independent UMAP embeddings are applied to each omics dataset, creating standardized representations that preserve dataset-specific structures; (2) these embeddings are concatenated; (3) a second UMAP transformation integrates these embeddings into a unified space; (4) hierarchical density-based clustering identifies sample groups; and (5) feature importance analysis via XGBoost and SHAP values enables biological interpretation. Our benchmarking against six state-of-the-art multi-omics integration methods demonstrates GAUDI’s superior performance across diverse datasets. Using simulated multi-omics data with known ground truth, GAUDI achieved perfect clustering accuracy across all tested scenarios. In cancer datasets from TCGA, GAUDI identified clinically relevant patient subgroups with significant survival differences, particularly in acute myeloid leukemia where it detected high-risk subgroups missed by other methods. At the single-cell level, GAUDI not only correctly classified cell lines but uniquely identified biologically meaningful substructures within them, confirmed by differential expression and pathway enrichment analyses. When evaluating large-scale functional genomics datasets from the Cancer Dependency Map (DepMap) Project, GAUDI demonstrated superior lineage identification accuracy. In a benchmark integrating gene expression, DNA methylation, miRNA expression, and metabolomics across 258 cancer cell lines, GAUDI achieved the highest score for lineage discrimination, approximately 15% better than the next-best performing method, MOFA+, underscoring its effectiveness with complex, heterogeneous multi-omics data. The GAUDI R package provides a user-friendly interface with extensive documentation, visualization tools, and compatibility with standard bioinformatics workflows. By combining the strengths of non-linear dimensionality reduction with interpretable machine learning approaches, the GAUDI R package offers researchers a powerful new tool for exploring complex relationships across multiple biological data types, potentially revealing novel insights in systems biology, precision medicine, and biomarker discovery. Package: https://github.com/hirscheylab/gaudi Benchmark: https://github.com/hirscheylab/umap_multiomics_integration
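
The described integration steps can be sketched generically with {uwot} and {dbscan}; this is a conceptual illustration of the workflow, not GAUDI's own interface, and the input matrices are random stand-ins for real omics layers:

```r
library(uwot)
library(dbscan)

set.seed(1)
# Toy stand-ins for three omics layers measured on the same 100 samples (rows)
omics <- replicate(3, matrix(rnorm(100 * 20), nrow = 100), simplify = FALSE)

# (1) independent UMAP embedding per omics layer
emb <- lapply(omics, function(x) umap(x, n_components = 2))

# (2)-(3) concatenate the embeddings, then integrate them with a second UMAP
integrated <- umap(do.call(cbind, emb), n_components = 2)

# (4) hierarchical density-based clustering of the integrated space
clusters <- hdbscan(integrated, minPts = 10)$cluster
table(clusters)
```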

Date and time: Sat, Aug 9, 2025 - 13:00–14:15

Author(s): Pol Castellano Escuder (Heureka Labs)

Keyword(s): multi-omics integration, dimension reduction, clustering, statistical learning, interpretable machine learning, benchmarking

Video recording available after conference: ✅
Pol Castellano Escuder (Heureka Labs)
Pragmatic programmer
14:45–16:00 Penn 1 “How did you even think of that???” Techniques to code much faster

More infoThis talk will present a totally different way of thinking about writing R code. This method is completely different from anything I have ever seen in the R community (or any data science community). This is the method I used to write four R packages - NumericEnsembles, ClassificationEnsembles, LogisticEnsembles and ForecastingEnsembles. The largest part of the code was written in 15 months, and was approximately 15,000 lines at that time. No AI was used in any of the code development. This is a totally different style of thinking, using the same set of R tools that everyone else can use. What is totally different is the thinking that goes into the code development, compared to what I’ve seen everywhere else. This talk will show how the same method may be applied to the work you are doing. Come prepared to see that the methods you’ve been using to think through solutions and write code that achieves reproducible results can be improved very significantly by improving your thinking, not necessarily your tools. There will be several practical examples and live demonstrations to show how you may use these methods in real coding situations. Improving your thinking can do much more for improving how you code than adding the latest tools. This presentation will demonstrate how that was done in the development of four packages that automatically build ensembles as part of the analysis process, and how you can use the same methods in your work.

Date and time: Sat, Aug 9, 2025 - 14:45–16:00

Author(s): Russ Conte (Owner@dataaip.com)

Keyword(s): code better, efficient coding, fast coding

Video recording available after conference: ✅
Russ Conte (Owner@dataaip.com)
14:45–16:00 Penn 1 Reusing ‘ggplot2’ code: how to design better plot helper functions

More infoWrapping ‘ggplot2’ code into plot helper functions is a common way to make multiple versions of a custom plot without copying and pasting the same code over and over again. Helper functions can replace long and complex ‘ggplot2’ code chunks with just a single function call. However, if that single function is not designed carefully, the initial convenience can often turn into frustration. While helper functions can reduce the amount of code needed to remake a complicated plot, they often mask the underlying layered grammar of graphics, complicating further customisation and tweaking of the plot. This talk addresses how to design effective ‘ggplot2’ plot helper functions that maximise reuse convenience whilst preserving access to the elegant flexibility of layered plot composition. By studying existing ‘ggplot2’ extensions for producing calendar plots, we identify a number of common pitfalls, including overly specific function arguments and hidden data manipulations. Then, we discuss how to avoid these pitfalls and retain the benefits of ‘ggplot2’ by: separating data preparation from plotting, utilising list arguments for customisation, and providing transparent documentation. We illustrate these strategies using examples from the design of the ‘ggtilecal’ package, which provides helper functions for plotting calendars using the geom_tile() geometry from ggplot2.
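
A small illustration of that design advice (the helper and data below are hypothetical): the helper returns a plain ggplot object and exposes layer options as a list, so callers keep full access to the layered grammar afterwards.

```r
library(ggplot2)

# Helper separates data preparation from plotting and takes tile options as a list
plot_tiles <- function(data, x, y, fill, tile_args = list(colour = "white")) {
  ggplot(data, aes({{ x }}, {{ y }}, fill = {{ fill }})) +
    do.call(geom_tile, tile_args)
}

# Toy calendar-like data; the caller can keep layering and theming as usual
cal <- expand.grid(week = 1:5, weekday = factor(1:7))
cal$n_events <- rpois(nrow(cal), 3)

plot_tiles(cal, week, weekday, n_events) +
  scale_fill_viridis_c() +
  theme_minimal()
```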

Date and time: Sat, Aug 9, 2025 - 14:45–16:00

Author(s): Cynthia Huang (Monash University)

Keyword(s): r package and function design, layered grammar of graphics, data visualisation, ggplot2 extensions

Video recording available after conference: ✅
Cynthia Huang (Monash University)
14:45–16:00 Penn 1 The Language of Data: How R Package Syntax Shapes Analysis and Thought

More infoFor most users in data science, analytics, and research, a package’s syntax or API is their primary interface with the software. While R provides a well-defined framework for creating packages that make programming accessible, syntax choices serve as key connection points between users and their data. R packages exhibit a range of syntax styles—from explicit to implicit, verbose to symbolic, and structured to flexible. Drawing on research on language, cognition, and user experience, this talk explores how syntax design in R packages shapes the way we interact with data, approach analysis, and solve complex problems. In this talk, I will examine syntax design in powerful and popular data wrangling software in R–data.table, dplyr, polars, and base R, comparing their approaches and discussing their impact on usability, interpretation, and problem-solving in data workflows. Attendees will leave with an understanding of syntax design, how current leaders in data wrangling design their syntax, and considerations for how these designs can impact user behavior.
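
To make the comparison concrete, here is the same grouped aggregation written in three of the syntaxes discussed (using the built-in mtcars data; {polars} is omitted for brevity):

```r
# base R
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# dplyr
library(dplyr)
mtcars |> group_by(cyl) |> summarise(mean_mpg = mean(mpg))

# data.table
library(data.table)
as.data.table(mtcars)[, .(mean_mpg = mean(mpg)), by = cyl]
```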

Date and time: Sat, Aug 9, 2025 - 14:45–16:00

Author(s): Tyson Barrett (Highmark Health)

Keyword(s): data wrangling, programming, syntax, analytics

Video recording available after conference: ✅
Tyson Barrett (Highmark Health)
Teaching 2
14:45–16:00 Penn 2 Expanding Data Science’s Reach through Interdisciplinarity and the Humanities

More infoWhat does data science mean for those disciplines that don’t traditionally align themselves with this work? More specifically, how might instructors in the Humanities define — and teach — data science? How can the Humanities use data science to resist academic siloing and promote alignment across disciplines and methodologies? What is to be gained for traditional data science programs with a transdisciplinary understanding and application of data science? This presentation explores three courses developed by English instructors at North Carolina State University’s Data Science and AI Academy: Data Visualization, Introduction to AI Ethics, and Storytelling with Data and AI. The presenters will explain how their Humanities backgrounds help them create courses that extend data science beyond traditional applications. They’ll share examples of assignments that incorporate their disciplinary expertise while integrating core data science principles from the ADAPT model. Furthermore, by offering alternative perspectives on data science, we create “gateway” courses that attract students who might not otherwise enter the field. The presenters will also discuss how these courses achieve interdisciplinarity both through content and the student participants. The presenters will demonstrate how the three representative courses complement traditional data science curriculum (coding) by broadening the field’s reach in two ways: 1. enhancing the overall educational experience for students and 2. creating access points for faculty who don’t typically identify with data science, thus attracting instructors without traditional data science backgrounds. The presentation will conclude with reflections on lessons learned, challenges encountered, and strategies for institutions seeking to implement similar cross-disciplinary approaches. The presenters will share preliminary assessment data demonstrating student outcomes and discuss implications for the future of data science education across diverse academic contexts.

Date and time: Sat, Aug 9, 2025 - 14:45–16:00

Author(s): Kelsey Dufresne (North Carolina State University), James Harr (Christian Brothers University), Christin Phelps (North Carolina State University)

Keyword(s): interdisciplinary, data science, outreach, data visualization, ai

Video recording available after conference: ✅
Kelsey Dufresne (North Carolina State University)
James Harr (Christian Brothers University)
Christin Phelps (North Carolina State University)
14:45–16:00 Penn 2 Leveraging LLMs for student feedback in introductory data science courses

More infoA considerable recent challenge for learners and teachers of data science courses is the proliferation of the use of LLM-based tools in generating answers. In this talk, I will introduce an R package that leverages LLMs to produce immediate feedback on student work to motivate them to give it a try themselves first. I will discuss technical details of augmenting models with course materials, backend and user interface decisions, challenges around evaluations that are not done correctly by the LLM, and student feedback from the first set of users. Finally, I will touch on incorporating this tool into low-stakes assessment and ethical considerations for the formal assessment structure of the course relying on LLMs.

Date and time: Sat, Aug 9, 2025 - 14:45–16:00

Author(s): Mine Cetinkaya-Rundel (Duke University + Posit, PBC)

Keyword(s): r-package, teaching, education, feedback, ai, llm

Video recording available after conference: ✅
Mine Cetinkaya-Rundel (Duke University + Posit, PBC)
14:45–16:00 Penn 2 Teaching Statistical Computing with R and Python

More infoComputing courses can be daunting for students for a variety of reasons, including programming anxiety, difficulty learning a programming language in a second language, and unfamiliarity with assumed computer knowledge. In an ongoing attempt to teach statistical computing effectively, I developed a textbook intended for use in a flipped classroom setting where R and Python are taught concurrently. This approach allows students to learn programming concepts applicable to most languages, while developing skills in both R and Python that can be used in an increasingly multilingual field. In this talk, I discuss the book’s design and how it integrates into a sequence of undergraduate and graduate computing courses. Along the way, we will talk about opinionated coding decisions, use of memes, comics, and YouTube tutorials, and other features integrated into this open-source textbook built with quarto and hosted on GitHub.

Date and time: Sat, Aug 9, 2025 - 14:45–16:00

Author(s): Susan Vanderplas (University of Nebraska - Lincoln)

Keyword(s): data science, education, statistical computing, python, reproducibility

Video recording available after conference: ✅
Susan Vanderplas (University of Nebraska - Lincoln)
Workflows
14:45–16:00 Penn Garden Building Agentic Workflows in R with axolotr

More infoLarge Language Models (LLMs) have revolutionized how we approach computational tasks, yet R users often face significant barriers when integrating these powerful tools into their workflows. Managing multiple API providers, handling authentication, and orchestrating complex interactions typically requires substantial boilerplate code and specialized knowledge across different service ecosystems. This presentation introduces axolotr, an R package that provides a unified interface for interacting with leading LLM APIs including OpenAI’s GPT, Google’s Gemini, Anthropic’s Claude, and Groq. Through progressive examples of increasing complexity, we demonstrate how R users can seamlessly incorporate LLMs into their data science workflows - from simple one-off queries to sophisticated agentic systems. We begin with fundamental LLM interactions, showing how axolotr simplifies credential management and API calls across providers. Next, we explore function-based implementations that transform raw LLM capabilities into reusable analytical tools. Finally, we demonstrate how to build true agentic workflows where multiple LLM calls work together to maintain state, make decisions, and accomplish complex tasks autonomously. Attendees will learn: - How to quickly incorporate LLMs into existing R projects using a consistent interface - Techniques for creating functions that leverage LLM capabilities for data analysis and interpretation - Approaches for building agentic systems that can reason about data, maintain context, and operate iteratively - Practical strategies for managing costs, optimizing performance, and selecting appropriate models for different tasks This presentation provides both newcomers and experienced R users with the practical knowledge needed to harness the power of LLMs through a streamlined, R-native approach. By the end, attendees will have a roadmap for transforming their interaction with LLMs from simple API calls to sophisticated autonomous workflows that can dramatically enhance productivity and analytical capabilities.

Date and time: Sat, Aug 9, 2025 - 14:45–16:00

Author(s): Matthew Hirschey

Keyword(s): llms, ai, agents, natural language processing, workflow automation

Video recording available after conference: ✅
Matthew Hirschey
14:45–16:00 Penn Garden Data as code, packaging data as code with duckdb and S3

More infoDuckDB and object storage (S3) offer a powerful and cost-effective way to store and access data. Packaging the data as an R package provides an efficient method to document data processing, simplify user access, incorporate business logic, increase reproducibility, and leverage both code and data. This talk will use [cori.data.fcc][1], featuring the US FCC National Broadband Data, as a case study for a data package. We will discuss the advantages discovered during its development, challenges we encountered, and tips for others who wish to adapt these methods for their own needs. [1]: https://ruralinnovation.github.io/cori.data.fcc/
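
A hedged sketch of the underlying access pattern, querying a Parquet file on S3 directly from R via DuckDB's httpfs extension; the bucket and object path are hypothetical, not the package's actual data location:

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())
dbExecute(con, "INSTALL httpfs;")
dbExecute(con, "LOAD httpfs;")

# Hypothetical bucket/object path; public buckets need no credentials
dbGetQuery(con, "
  SELECT state, COUNT(*) AS n
  FROM read_parquet('s3://example-bucket/fcc_broadband.parquet')
  GROUP BY state
")

dbDisconnect(con, shutdown = TRUE)
```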

Date and time: Sat, Aug 9, 2025 - 14:45–16:00

Author(s): Olivier Leroy; John Hall (Center on Rural Innovation)

Keyword(s): duckdb, s3, data package, broadband

Video recording available after conference: ✅
Olivier Leroy
14:45–16:00 Penn Garden Small boosts, here and there

More infoRather than writing an entire R package or carrying out a data analysis in one fell swoop, I’m interested in large language models (LLMs) doing things for me that I don’t like to do: tedious little refactors, transitioning from deprecated APIs, and templating out boilerplate. Chores, if you will. This talk introduces chores, an R package implementing an extensible library of LLM assistants to help with repetitive but hard-to-automate tasks. I’ll demonstrate that LLMs are quite good at turning some 45-second tasks into 5-second ones and show you how to start automating drudgery from your work with a markdown file and a dollar on an API key.

Date and time: Sat, Aug 9, 2025 - 14:45–16:00

Author(s): Simon Couch (Posit, PBC)

Keyword(s): ai, llm, workflow, productivity

Video recording available after conference: ✅
Simon Couch (Posit, PBC)
Life sciences
14:45–16:00 Gross 270 Co-occurrence Analysis And Knowledge Graphs For Biomedical Research

More infoThe analysis of data from large hospitals and healthcare providers comes with unique challenges. Electronic health records document information from patients' visits, such as the diagnoses made, medications prescribed, and more. To discover the best treatment options, facilitate early diagnosis, and understand co-morbidities and adverse effects, biomedical researchers extensively use co-occurrence analysis, which measures how features such as diagnoses and medications are correlated with each other over time at the patient level. Results can then be merged between independent health systems while maintaining patient data privacy, in a process called transfer learning, and insights can be organized, visualized, and interpreted using knowledge graphs. Knowledge graphs model relationships between concepts, e.g. one medication “may treat” one disease. Biomedical research consistently shows that while large language models perform very well at discovering similar concepts, such as synonyms or closely related diagnoses, co-occurrence analysis and knowledge graphs perform better at discovering related concepts, such as best treatment options or adverse effects. A large part of contemporary biomedical research is thus dedicated to merging results from pre-trained large language models with study-specific co-occurrence analyses. To help researchers efficiently perform co-occurrence analysis and build knowledge graphs, we developed the nlpembeds and kgraph R packages. The nlpembeds package efficiently computes co-occurrence matrices between tens of thousands of concepts from millions of patients over many years, which can prove challenging when taking into account not only codified data such as diagnoses and medications but also natural language processing concepts extracted from clinicians' notes (comments justifying why specific diagnoses were made or medications prescribed). The kgraph package measures the performance of the results, builds the corresponding knowledge graphs, and visualizes them as interactive JavaScript networks. We used the packages in several studies, such as the analysis of insurance claims of 213 million patients (Inovalon), the visualization of Mendelian randomization meta-analyses performed by the Veterans Affairs, and transfer learning between several institutions involved in the Center for Suicide Research and Prevention to build risk prediction models. In this talk, I will showcase the highlights of these packages, introduce their use, and demonstrate how to perform real-world interpretations useful for clinical research. Co-occurrence analysis and knowledge graphs make it possible to discover insights from large databases of electronic health records and thereby improve our understanding of biomedical processes and the realities of large-scale, long-term patient care.

Date and time: Sat, Aug 9, 2025 - 14:45–16:00

Author(s): Thomas Charlon (Harvard Medical School)

Keyword(s): embeddings, knowledge graph, biomedical research, patient care, mental health

Video recording available after conference: ✅
Thomas Charlon (Harvard Medical School)
14:45–16:00 Gross 270 Counting Birds Two Ways: Joint models of species abundance

More infoJoint species distribution models (JSDMs) enable ecologists to characterize relationships between species and their environment, infer interspecific dependencies, and predict the occurrence or abundance of entire ecological communities. Although several popular JSDM frameworks exist, the problem of modeling sparse relative abundance data remains an inferential and computational challenge for many. We describe two approaches and corresponding implementations within the context of a case study involving a large community of bird species surveyed across Finland. The first approach, hierarchical modeling of species communities, employs a generalized linear latent variable model and supports diverse data and sampling designs but falters when faced with sparse and overdispersed count data. The second approach, binary and real count decompositions, directly addresses limitations of log-linear multivariate count models but lacks some of the generality and extensibility.

Date and time: Sat, Aug 9, 2025 - 14:45–16:00

Author(s): Braden Scherting (Duke University)

Keyword(s): NA

Video recording available after conference: ✅
Braden Scherting (Duke University)
14:45–16:00 Gross 270 Detecting Read Coverage Patterns Indicative of Genetic Variation and Mobility

More infoRead coverage data is commonly used in bioinformatics analyses of sequenced samples. Read coverage represents the count of short DNA sequences that align to specific locations in a reference sequence. When plotted, one can visualize how read coverage changes along the reference sequence. Some read coverage patterns, like gaps and elevations in coverage, are associated with real biological phenomena such as mobile genetic elements (MGEs) and structural variants (SVs). MGEs are genetic sequences capable of transferring to new genomic locations, where they may disrupt functioning genes. SVs refer to small genetic differences between individuals or microbial populations caused by deletions, insertions, and duplications of gene sequences. MGEs and SVs are important to host health, and while many tools have been developed to detect them, the vast majority are either database-dependent or limited to detecting specific types of MGEs and SVs. Using gaps and elevations in read coverage is a more general detection method for diverse MGEs and SVs; however, the manual inspection of coverage graphs is tedious, time-consuming, and subjective. We developed an algorithm that detects distinct patterns in read coverage data and implemented it in two R packages, TrIdent and ProActive, that automatically identify, classify, and characterize read coverage patterns indicative of genetic variation and mobilization. Our read coverage pattern-matching algorithm offers a unique approach to sequence data analysis, and our tools enable researchers to efficiently incorporate read coverage inspections into their standard bioinformatics pipelines.

Date and time: Sat, Aug 9, 2025 - 14:45–16:00

Author(s): Jessie Maier (North Carolina State University); Craig Gin (North Carolina State University), Benjamin Callahan (North Carolina State University), Manuel Kleiner (North Carolina State University)

Keyword(s): pattern-matching, bioinformatic tools, read coverage data, mobile genetic elements, structural variants

Video recording available after conference: ✅
Jessie Maier (North Carolina State University)
Keynote #3
17:00–18:00 Penn 1 We R Together.  How to learn, use and improve a programming language as a community.

More infoCommunities of practice are powerful spaces for learning, collaboration, and innovation—especially in the context of coding and data science. In this talk, I’ll share what I’ve learned from leading and supporting R communities, with concrete examples of strategies, content, and programs that encourage participation and skill-sharing. Informed by a range of approaches to community building, I’ll explore how collective efforts can strengthen not only technical skills, but also support long-term career growth and visibility. This talk will be relevant to anyone interested in creating more inclusive, sustainable, and impactful technical communities—across research, education, and open source.

Date and time: Sat, Aug 9, 2025 - 17:00–18:00

Author(s): Yanina Bellini Saibene (rOpenSci + R-Ladies + Universidad Austral)

Keyword(s): NA

Video recording available after conference: ✅
Yanina Bellini Saibene (rOpenSci + R-Ladies + Universidad Austral)
Day 3: Sunday, August 10, 2025
Room Title, abstract, and more info Presenter(s)
Keynote #4
09:00–10:00 Penn 1 Data in the Balance: Incentives, Independence, and Public Statistics

More infoTo be announced.

Date and time: Sun, Aug 10, 2025 - 09:00–10:00

Author(s): Frauke Kreuter (University of Maryland + Ludwig Maximilian University)

Keyword(s): NA

Video recording available after conference: ✅
Frauke Kreuter (University of Maryland + Ludwig Maximilian University)
Lightning
10:30–12:00 Penn 1 An Interactive webR Approach to Teaching Statistical Inference to Behavioral Science Students

More infoIn many applied data analysis courses, null hypothesis significance testing (NHST) is introduced using a frequentist framework based on theoretical probability distributions. Yet, behavioral science students often struggle with NHST because its logic can seem counterintuitive. A common error is viewing the p-value as the probability that the null hypothesis is true, instead of recognizing it as the chance of obtaining data as extreme (or more extreme) than observed, assuming the null hypothesis is true. In contrast, a permutation test lets students derive p-values directly from data, making core ideas such as randomness, variability, and extreme outcomes more tangible. In my applied data science course for students of Psychology, I use webR to create interactive R-based activities that immerse students in statistical concepts, uncertainty visualization, and hands-on experimentation. This presentation showcases a webR-based tutorial where students explore permutation tests to examine differences between experimental conditions in a recent Psychological study published on the Center for Open Science repository. Through interactive resampling and dynamic visualizations of empirical null distributions, students gain insight into how random variation influences statistical results. They then use the generated empirical null distribution to assess the extremity of their observed test statistic, calculate p-values, and construct confidence intervals – deepening their understanding of NHST. Running simulations within webR enables interactive, self-contained learning modules, allowing students to experiment with code in real time within structured educational materials that scaffold their learning. I will discuss how this approach boosts engagement, supports replicability, and lowers barriers to learning R-based analysis. I will share insights from student feedback, challenges encountered, and best practices for integrating webR into applied data science education.
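As a rough illustration of the statistical idea the tutorial is built around (my own base-R sketch on simulated placeholder data, not the webR materials themselves), a two-group permutation test can be written in a few lines:

# Illustrative base-R permutation test (simulated placeholder data,
# not the tutorial's webR exercises).
set.seed(2025)
scores <- c(rnorm(30, mean = 5), rnorm(30, mean = 5.6))
group  <- rep(c("control", "treatment"), each = 30)

observed <- diff(tapply(scores, group, mean))   # treatment minus control

perm_diffs <- replicate(5000, {
  shuffled <- sample(group)                     # break any real association
  diff(tapply(scores, shuffled, mean))
})

# Two-sided p-value: share of permuted differences at least as extreme
mean(abs(perm_diffs) >= abs(observed))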

Date and time: Sun, Aug 10, 2025 - 10:30–12:00

Author(s): Kimberly Henry (Colorado State University)

Keyword(s): webr, statistical inference, applied data science, behavioral science

Video recording available after conference: ✅
Kimberly Henry (Colorado State University)
10:30–12:00 Penn 1 Bringing the fun of hex stickers to your R session

More infoOver the years, R users have embraced logos and stickers for packages and communities of practice as a fun way to show support for open-source projects. The logos themselves often reflect a project's core attributes and not just vague visual branding. This talk describes the development and functionality of the hexsession package, which creates interactive hexagonal tiles for logos or custom images. Similar to arranging stickers on a laptop, we can tessellate the logos for our installed packages or for any arbitrary set of images and produce a responsive HTML tile with each image linking to its respective web page. This output, created using CSS and JavaScript behind the scenes, now integrates with web-based documentation platforms that use Quarto or RMarkdown for websites, vignettes, and online books. This can be useful for documenting metapackages, developer portfolios, or showcasing any interrelated sets of packages. Developing hexsession was a challenging but rewarding process, which was greatly facilitated by existing open-source resources and feedback from its small but helpful user base. What started as a silly but ambitious idea will hopefully mature into a way of visually showcasing the tools that power our projects and also bring us together as a community.

Date and time: Sun, Aug 10, 2025 - 10:30–12:00

Author(s): Luis D. Verde Arregoitia (Instituto de Ecología AC - INECOL)

Keyword(s): hex, stickers, quarto, documentation

Video recording available after conference: ✅
Luis D. Verde Arregoitia (Instituto de Ecología AC - INECOL)
10:30–12:00 Penn 1 Celebrating R: Code Snippets from NC Public Health Epidemiology

More infoUNC Injury Prevention Research Center collaborates with the NC Division of Public Health Injury & Violence Prevention Branch to track and prevent injuries. Injury epidemiology is broad; its scope includes: self-harm and other violence; motor vehicle, bicycle, and pedestrian crashes; overdoses, alcohol, and cannabis harms; and social drivers like child maltreatment and homelessness. Using short code examples that span these topic areas, this lightning talk will celebrate R by sharing favorite code patterns and snippets from a decade of public health epidemiology programming in research, public health practice, and graduate coursework settings here in North Carolina.

Date and time: Sun, Aug 10, 2025 - 10:30–12:00

Author(s): Michael Fliss (UNC Injury Prevention Research Center / NC Division of Public Health)

Keyword(s): epidemiology, north carolina, public health

Video recording available after conference: ✅
Michael Fliss (UNC Injury Prevention Research Center / NC Division of Public Health)
10:30–12:00 Penn 1 Facilitating Open-Source Transitions: Lessons Learned from a Hands-On R Training Initiative

More infoThe increasing adoption of open-source tools like R is transforming clinical and real-world data analysis, improving visualization and reporting, reducing costs, and fostering industry-academia collaboration. However, transitioning from proprietary to open-source solutions presents cultural and operational challenges for organizations, underscoring the need for proper training to overcome reproducibility problems. To address this, the Duke Clinical Research Institute, an academic research organization, developed a six-module introductory R training program tailored to varying experience levels. The curriculum begins by exploring the distinction between R and RStudio, including installation guidance, before progressing to core R fundamentals such as data structures and manipulation. Building on these fundamentals, the subsequent modules focus on practical applications, covering table generation, function customization, R Markdown for reproducible reporting, and advancing skills in data visualization with ggplot2 and statistical modeling. In the final module, learners consolidate their training by applying key concepts to real-world scenarios, thereby acquiring practical skills vital for clinical research. Several lessons were identified from this training program, providing a strategic foundation for practical open-source training initiatives in a work environment. One key insight was the challenge of securing in-person attendance due to the predominantly remote workforce. To overcome this, hybrid sessions consisting of online and in-person workshops with interactive exercises were implemented. During exercises, participants were divided into small breakout rooms, each led by an R expert, allowing for personalized support and creating a more engaging environment. Another key takeaway was the difficulty participants faced in understanding the distinction between R and RStudio, particularly when it came to the complexities of package management, which emerged as one of the more challenging aspects of the introductory session. In retrospect, providing overview materials beforehand could have better prepared attendees for these challenging concepts. To ensure accessibility, all sessions were recorded, enabling participants to revisit the content if they were unable to follow the live session or wished to refer back in the future. Lastly, to sustain engagement and foster continuous improvement, post-session feedback was systematically collected and used to refine future modules. Additionally, training materials were shared in advance to ensure participants were set up for success. Sustained support is vital for successful open-source adoption. We aim to foster continuous growth and drive meaningful contributions to the R community by offering open-source office hours, incorporating R proficiency into annual performance goals, and planning advanced training on package management, development, and container-based reproducibility.

Date and time: Sun, Aug 10, 2025 - 10:30–12:00

Author(s): Anna Giczewska; Brooke Alhanti, DaJuanicia Holmes (Duke Clinical Research Institute), Ronald Kamusiime (Duke Clinical Research Institute), Miloni Shah (Duke Clinical Research Institute)

Keyword(s): open-source adoption, r training program, clinical research, reproducibility, continuous learning

Video recording available after conference: ✅
Anna Giczewska
10:30–12:00 Penn 1 Rediscovering R for Library Data Instruction with Google Colab

More infoAs the first data literacies lead at my university, I support users with a wide range of skills and preferences, from those working with spreadsheets to those comfortable writing code. While I have used R extensively in my research, I found that most users prefer Python for coding, so I adapted my workshops and services to meet that demand. However, when a faculty member recently requested an R data visualization workshop for a data analysis course, I welcomed the opportunity to revisit R in a meaningful way. In this lightning talk, I will share insights on balancing comprehensive data services with the limitations of time and resources. I will also highlight how Google Colab’s built-in R support has transformed the delivery of hands-on sessions by removing the necessity for software installations or IT approvals. This is especially significant at our Google-centric campus, where Colab integration is seamless and readily accessible to everyone. I hope to encourage discussion on strategies for integrating R into academic settings, particularly in environments where software access and institutional preferences for other tools present challenges.

Date and time: Sun, Aug 10, 2025 - 10:30–12:00

Author(s): Ahmad Pratama (Stony Brook University)

Keyword(s): library data services, data visualization, data instruction, google colab, barriers to r adoption

Video recording available after conference: ✅
Ahmad Pratama (Stony Brook University)
10:30–12:00 Penn 1 Storytelling with ggplot2: Using custom functions to sequence visualizations

More infoCustom functions using {ggplot2} can decrease repetition in the exploratory phase of an analysis, but they can also be incredibly useful for highlighting and telling stories when communicating final results. This session features a visualization sequence demonstrating how functions can break {ggplot2} figures into pieces for ease of interpretation in a presentation setting. Audience members will come away from the session with a broadened view of how transparency levels and color can be used as function arguments to reveal sections of visuals piece by piece.
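A minimal sketch of the kind of function the talk describes (my own illustration on the iris data, not the presenter's code) exposes the highlighted group and the dimming level as arguments:

# Hypothetical helper: redraw the same scatterplot while dimming every
# group except the one being highlighted, so a presenter can reveal the
# story one group at a time (illustration only).
library(ggplot2)

plot_highlight <- function(data, focus = NULL, dim_alpha = 0.15) {
  data$alpha <- if (is.null(focus)) 1 else ifelse(data$Species == focus, 1, dim_alpha)
  ggplot(data, aes(Sepal.Length, Sepal.Width, colour = Species)) +
    geom_point(aes(alpha = alpha), size = 2) +
    scale_alpha_identity() +
    theme_minimal()
}

plot_highlight(iris)                      # the full picture
plot_highlight(iris, focus = "setosa")    # step 1 of the sequence
plot_highlight(iris, focus = "virginica") # step 2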

Date and time: Sun, Aug 10, 2025 - 10:30–12:00

Author(s): McCall Pitcher (Duke University)

Keyword(s): data visualization, functions, ggplot2, data storytelling

Video recording available after conference: ✅
McCall Pitcher (Duke University)
10:30–12:00 Penn 1 The Art of the Question: Building an Effective R Consultation Program in an Academic Library

More infoIn data science consultations, the path to success often begins not with providing immediate technical solutions, but with asking the right questions. This presentation explores how the Data Science Consulting Program at North Carolina State University has developed a question-based framework for R programming consultations that transforms how researchers approach data analysis problems. We'll share our structured consultation methodology that helps patrons clarify research objectives, challenge underlying assumptions, and refine analytical approaches before writing a single line of code. Through case studies spanning multiple academic disciplines, we'll demonstrate how strategic questioning fueled by critical thinking has led researchers to revise their analytical strategies, discover more elegant solutions, and sometimes completely reframe their research questions—ultimately producing more robust and meaningful results. This talk will provide practical insights for anyone supporting R users across skill levels, including a typology of questions that promote deeper analytical thinking, strategies for training consultation staff in this approach, and assessment methods to measure success. Whether you're supporting colleagues in industry, mentoring students, teaching workshops, or building a consultation program, you'll learn skills for identifying someone's real versus stated need and for communicating technical information back effectively. Join us to explore how the art of asking the right questions can transform technical consultations and position R experts as valuable contributors to the entire research process.

Date and time: Sun, Aug 10, 2025 - 10:30–12:00

Author(s): Abhinandra Singh (North Carolina State University); Selene Schmittling (North Carolina State University), Alp Tezbasaran (North Carolina State University), Shannon Ricci (North Carolina State University), Mara Blake (North Carolina State University)

Keyword(s): consulting, research, critical thinking

Video recording available after conference: ✅
Abhinandra Singh (North Carolina State University)
10:30–12:00 Penn 1 gfwr, an R package to access data from the Global Fishing Watch APIs

More infoAt Global Fishing Watch, we create and publicly share knowledge about human activity at sea to enable fair and sustainable use of our ocean. By processing terabytes of global vessel position data transmitted via the Automatic Identification System (AIS) and applying machine learning models, we create the most comprehensive view of vessel activities around the world. Our datasets are open to everyone, and our aim is to make them easily accessible to the broader scientific community. As part of this goal, we created gfwr, an R package that communicates with our public APIs and retrieves our data through three main functions:
- get_vessel_info() provides access to vessel information from hundreds of thousands of fishing and non-fishing vessels from all over the world, returning identity information based on AIS self-reported data, public registries, and authorizations.
- get_event() retrieves event information calculated by our algorithms, including encounters at sea, loitering, port visits, and fishing events by vessel.
- get_raster() returns fishing effort based on AIS data.
In the package documentation, we created vignettes that show how to concatenate these functions into comprehensive workflows that can be adapted depending on the researcher's needs. By offering access through R, we also contribute to more open, transparent, and reproducible science. Since 2022, more than 400 users have used gfwr, mostly from institutions in North America and Europe. We are now conducting multilingual training workshops to expand this user base and help promote a culture of transparency in ocean governance.
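Putting the three functions together, a workflow might look roughly like the sketch below. The function names come from the abstract, but the argument names are assumptions to be checked against the package documentation, and an API token is assumed to live in the GFW_TOKEN environment variable.

# Rough gfwr workflow sketch: the three functions are named in the
# abstract; argument names are assumptions, so consult the gfwr docs
# for the exact signatures.
library(gfwr)

key <- Sys.getenv("GFW_TOKEN")  # assumed: personal API token from Global Fishing Watch

# Identity information for a vessel (query value is a placeholder)
vessel <- get_vessel_info(query = "368045130", key = key)

# Fishing events over a time window
events <- get_event(event_type = "FISHING",
                    start_date = "2024-01-01",
                    end_date   = "2024-03-31",
                    key        = key)

# Gridded fishing-effort raster for the same period
effort <- get_raster(spatial_resolution  = "LOW",
                     temporal_resolution = "MONTHLY",
                     start_date = "2024-01-01",
                     end_date   = "2024-03-31",
                     key        = key)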

Date and time: Sun, Aug 10, 2025 - 10:30–12:00

Author(s): Andrea Sánchez-Tapia (Global Fishing Watch)

Keyword(s): apis, fisheries science, public data, r package, machine learning

Video recording available after conference: ✅
Andrea Sánchez-Tapia (Global Fishing Watch)
10:30–12:00 Penn 1 pretestcad: An R package to calculate PreTest Probability (PTP) scores for obstructive Coronary Artery Disease (CAD)

More infoMost clinicians in cardiology use an online portal such as HeartScore to calculate a risk score for a patient. However, as risk scores continue to evolve and be updated, it can be a tedious process to recalculate the risk scores of many patients, since these online portals can only do so one patient at a time. As such, there has been a rise of R packages dedicated to calculating a patient's risk of cardiovascular disease, such as CVrisk, RiskScorescvd, and whoishRisk, in an automated way. Despite this progress, support for pretest risk scores for obstructive CAD is lacking. Hence, an R package called pretestcad was made to fill this gap, allowing users to calculate these scores automatically for many patients. Examples of such scores are the 2012 CAD Consortium 2 (CAD2) PTP scores, the 2017 PROMISE Minimal-Risk Score, and the 2020 Winther et al. Risk-Factor-weighted Clinical Likelihood (RF-CL) and Coronary Artery Calcium Score-Weighted Clinical Likelihood (CACS-CL) PTPs, which are recommended in the 2024 ESC Guidelines. I hope that presenting this R package at this conference will not only raise awareness of the package in the medical field but also foster collaboration to make the package more accessible and user friendly, and expand my knowledge of other pretest scores.

Date and time: Sun, Aug 10, 2025 - 10:30–12:00

Author(s): Jeremy Selva (National Heart Centre Singapore)

Keyword(s): pretest probability, risk scores, r package, clinical/medical research

Video recording available after conference: ✅
Jeremy Selva (National Heart Centre Singapore)
10:30–12:00 Penn 1 propertee: Flexible Covariance Adjustment and Improved Standard Errors in Analyses of Intact Clusters

More infoIn studies with intact clusters, regressing outcomes on an indicator for treatment assignment along with covariates commonly provides an estimate of the intent-to-treat (ITT) effect, with robust sandwich standard errors addressing clustering and possibly heteroskedasticity. Even when treatment assignment is ignorable under the study design, the parametric structure necessary for consistent effect estimation limits the gains in precision one seeks by including covariates. In contrast, differencing estimators take the difference between treated and control individuals in their average difference between outcome and some proxy for confounding effects. The propertee package offers users the opportunity to use a prediction of the outcome under control from a flexible “first-stage” model fit as this proxy. The Neyman variance estimator, the default standard error for difference-in-means estimates, fails to account for sampling variability from the model predictions when applied to this ITT effect estimator. The propertee package addresses this issue by augmenting a novel cluster-robust jackknife estimate of the sampling variability of the difference-in-means with a heteroskedasticity-robust estimate of the variability of the model coefficient estimates. This standard error provides asymptotically valid inference for the ITT effect when the first-stage model dimension grows sufficiently slowly with the size of the fitting sample, while entirely removing downward bias in finite samples, even when the model dimension grows more quickly than the rate necessary for asymptotic unbiasedness.

Date and time: Sun, Aug 10, 2025 - 10:30–12:00

Author(s): Joshua Wasserman (University of Michigan - Ann Arbor); Ben Hansen (University of Michigan - Ann Arbor)

Keyword(s): cluster-randomized trials, clustered observational studies, cluster-robust standard errors, causal inference, intent-to-treat effect

Video recording available after conference: ✅
Joshua Wasserman (University of Michigan - Ann Arbor)
10:30–12:00 Penn 1 tinytable: A lightweight package to create simple and configurable tables in a wide variety of formats

More infoThe R ecosystem offers a wide range of packages for generating tables in various formats. However, existing solutions often suffer from complexity, excessive dependencies, or rigid formatting systems. In response to these challenges, we introduce tinytable, a lightweight yet powerful R package for producing high-quality tables in multiple formats, including HTML, LaTeX, Word, PDF, PNG, Markdown, and Typst. tinytable is designed with a minimalist and intuitive interface while providing extensive customization options. Unlike many existing table-drawing packages, tinytable adheres to a strict design philosophy centered on three principles: separation of data and style, flexibility, and lightweight implementation. First, tinytable ensures that table content remains distinct from formatting instructions, enabling users to generate clean, human-readable code that is easier to edit and debug. Second, the package leverages well-established frameworks such as Bootstrap for HTML and tabularray for LaTeX, providing robust and highly customizable styling capabilities. Third, tinytable prioritizes a lightweight implementation by importing zero third-party R packages by default, reducing computational overhead and improving maintainability. The package was developed to address key limitations observed in existing table-drawing tools, specifically aiming for a simple, flexible, concise, and safe user experience. tinytable provides a streamlined API with a minimal learning curve, allowing users to generate high-quality tables with less code while ensuring strong input validation and informative error messages. Its small and maintainable codebase avoids excessive reliance on regular expressions, making it both efficient and transparent. This presentation will showcase tinytable’s capabilities through applied examples, demonstrating how users can create aesthetically pleasing tables with minimal effort while retaining complete control over their formatting. By offering a zero-dependency, highly customizable, and human-readable approach to table generation, tinytable represents a valuable addition to the R ecosystem for data analysis and reporting.
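For readers unfamiliar with the package, a minimal usage sketch (kept deliberately simple; styling and output options go well beyond this, and the file name below is a placeholder) looks roughly like:

# Minimal tinytable sketch: one table object, rendered or saved to the
# format the output document needs.
library(tinytable)

x <- mtcars[1:4, 1:5]

tab <- tt(x, caption = "First rows of mtcars")
tab                                            # renders to HTML, LaTeX, etc. depending on context
save_tt(tab, "mtcars.html", overwrite = TRUE)  # or write an explicit file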

Date and time: Sun, Aug 10, 2025 - 10:30–12:00

Author(s): Vincent Arel-Bundock (Université de Montréal)

Keyword(s): table formatting, latex, html, markdown, reproducible research

Video recording available after conference: ✅
Vincent Arel-Bundock (Université de Montréal)
Modeling 2
10:30–12:00 Penn 2 Bayesian Variable Selection and Model Averaging in R using BAS

More info[Placeholder text]

Date and time: Sun, Aug 10, 2025 - 10:30–12:00

Author(s): Merlise Clyde (Duke University)

Keyword(s): NA

Video recording available after conference: ✅
Merlise Clyde (Duke University)
10:30–12:00 Penn 2 Integrating R Models into Automated Machine Learning Pipelines with SAS

More infoMany data scientists rely on automated pipelines to streamline predictive modeling and machine learning workflows. But did you know how easy it is to incorporate R models into these pipelines using SAS software? Whether you’re working in a mixed-language environment or want to enhance automation, integrating R with SAS Model Studio allows you to seamlessly compare, select, and deploy models within a structured framework. Join us to explore how open-source models can fit into an end-to-end machine learning process, enabling efficiency, reproducibility, and scalability.

Date and time: Sun, Aug 10, 2025 - 10:30–12:00

Author(s): Rachel McLawhon

Keyword(s): multi-language integration, model pipelines

Video recording available after conference: ✅
Rachel McLawhon
10:30–12:00 Penn 2 Multimedia: An R package for multimodal mediation analysis

More infoMediation analysis has emerged as a versatile tool for answering mechanistic questions in microbiome research because it provides a statistical framework for attributing treatment effects to alternative causal pathways. Using a series of linked regressions, this analysis quantifies how complementary data relate to one another and respond to treatments. Despite these advances, existing software’s rigid assumptions often result in users viewing mediation analysis as a black box. We designed the multimedia R package to make advanced mediation analysis techniques accessible, ensuring that statistical components are interpretable and adaptable. The package provides a uniform interface to direct and indirect effect estimation, synthetic null hypothesis testing, bootstrap confidence interval construction, and sensitivity analysis, enabling experimentation with various mediator and outcome models while maintaining a simple overall workflow. The software includes modules for regularized linear, compositional, random forest, hierarchical, and hurdle modeling, making it well-suited to microbiome data. We illustrate the package through two case studies. The first re-analyzes a study of Inflammatory Bowel Disease patients, uncovering potential mechanistic interactions between the microbiome and disease-associated metabolites, not found in the original study. The second analyzes new data about the influence of mindfulness practice on the microbiome. A gallery of examples and further documentation can be found at https://go.wisc.edu/830110.

Date and time: Sun, Aug 10, 2025 - 10:30–12:00

Author(s): Kris Sankaran

Keyword(s): causal inference, microbiome, biostatistics, r package, modularity

Video recording available after conference: ✅
Kris Sankaran
10:30–12:00 Penn 2 distfreereg: A New R Package for Distribution-Free Parametric Regression Testing

More infoGoodness-of-fit testing is a crucial step in verifying the reliability of inferences drawn from a parametric regression model. It helps avoid invalid conclusions based on false assumptions. For example, the p-value associated with a coefficient in a linear model is unreliable if the mean function being used does not agree sufficiently with the data. Until now, there has been no easy and reliable way in R to test formally whether or not the mean function of a parametric regression model agrees with the data. In my presentation, I shall discuss my new R package, distfreereg, that implements the distribution-free goodness-of-fit testing procedure for parametric regression models introduced by Estate Khmaladze in 2021. I shall outline Khmaladze’s algorithm, discuss the main features of the package, and illustrate its use with examples.

Date and time: Sun, Aug 10, 2025 - 10:30–12:00

Author(s): Jesse Miller (University of Minnesota)

Keyword(s): goodness-of-fit testing, parametric regression, regression modeling, empirical partial sum process, r package

Video recording available after conference: ❌
Jesse Miller (University of Minnesota)
R in organizations
10:30–12:00 Penn Garden Implementing Posit® Team: Lessons Learned

More infoWe are Smithfield Premium Genetics (SPG), a small department in Smithfield Foods that analyzes and reports on a wide variety of data. Our primary tools for reporting and visualization since 2012 have been open-source versions of R, RStudio Server, and Shiny Server. In the summer of 2022, we committed to installing Posit® Team in order to simplify package management, collaborative development, and report publication. Initially, SPG expected all components of Posit® Team to be operational within six months. The actual time was closer to two and a half years. We struggled with strict corporate IT protocols, missed deadlines, service provider selections, and legal issues. In hindsight, six months was an unreasonable expectation, but the actual waiting period could have been shortened dramatically had we prepared properly. The purpose of this presentation is to shed light on some of SPG's roadblocks to implementation and possibly help other small data science groups implement Posit® Team.

Date and time: Sun, Aug 10, 2025 - 10:30–12:00

Author(s): Lowell Gould (Smithfield Foods)

Keyword(s): posit team, collaboration, workflow

Video recording available after conference: ✅
Lowell Gould (Smithfield Foods)
10:30–12:00 Penn Garden Navigating the Transition: Lessons from Adopting an Open-Source Approach

More infoMany organizations are looking to switch to open-source software for its cost-effectiveness, greater flexibility, and better long-term maintainability compared to other software. For our project, this transition was driven by the need for a more cost-effective software tool and to reduce dependency on proprietary software services. However, this transition can present challenges such as training for team members, balancing training with other work commitments, and quality assurance for new programs. We describe the transition to R and RStudio over a 3-month period on a public health project with a team that had varying levels of R programming experience. Included in this discussion is our approach to training and implementation, and results for the project transition as we convert programs focused on data quality checks, data manipulation and processing, and statistical estimation. We touch on lessons learned on how to train new programmers to use R, what worked and did not work in terms of converting existing code, and resource planning needed to further continue this transition across other projects.

Date and time: Sun, Aug 10, 2025 - 10:30–12:00

Author(s): Carlos Petzold (RTI International); Adam Lee (RTI International)

Keyword(s): training, implementation, public health, open-source, transition

Video recording available after conference: ✅
Carlos Petzold (RTI International)
10:30–12:00 Penn Garden R & Python play nice, in production

More infoHi, I’m Claudia Peñaloza, a Data Scientist at Continental Tires, where going data-driven can be an adventure. What started as a proof of concept a few years ago, evolved into Conti’s first-ever Predictive Machine Learning Model for R&D! A wedding, two babies, three lateral moves, and four hires later, our team had also evolved… from mostly R to mostly Python developers. Rewriting 1000+ commits? No thanks. Instead, we got R and Python to play nice. With Renv, Poetry, and Docker, we keep things reproducible, portable, and deployable on various ML-Ops platforms. The takeaway? With the right tools, teams can mix and match languages, leveraging the best in each, and still build solid, scalable solutions.

Date and time: Sun, Aug 10, 2025 - 10:30–12:00

Author(s): Claudia Penaloza (Continental Tires)

Keyword(s): multi-lingual, r, python, docker, mlops

Video recording available after conference: ✅
Claudia Penaloza (Continental Tires)
10:30–12:00 Penn Garden Standardizing Institutional Research Operations Using R

More infoIn this presentation, we will demonstrate how a small Institutional Research (IR) team can leverage R and Quarto to consolidate and streamline its analytics platforms and reporting tools. We will highlight the versatility of R and Quarto in institutional research by showcasing their use in producing operational manuals, presentation slide decks, BI dashboards, internal and external reports, and institution-wide parameterized reports.

Date and time: Sun, Aug 10, 2025 - 10:30–12:00

Author(s): Chris Kao (Flagler College)

Keyword(s): institutional research

Video recording available after conference: ✅
Chris Kao (Flagler College)
Shiny
10:30–12:00 Gross 270 AI Execution Capability Assessment: A Shiny Web App for AI Strategy and Governance

More infoAs organizations adopt AI, assessing readiness, maturity, and governance is critical. This talk introduces a Shiny-based AI capability assessment tool that integrates R, Python (via reticulate), FAISS, and OpenAI APIs to evaluate AI execution across 15 key areas. The app enables organizations to: Conduct structured self-assessments on AI capabilities. Use FAISS-enhanced retrieval-augmented generation (RAG) to retrieve insights from NIST AI RMF and ISO/IEC 42001. Generate customized LLM-based recommendations, gap analysis reports, and automated project proposals. Visualize AI maturity through interactive radar charts and data-driven priority rankings. Built with Shiny, shinydashboard, DT, plotly, and OpenAI, this tool streamlines AI governance, helping stakeholders align strategy with regulatory and operational needs. This talk will showcase the app’s development, challenges in integrating R with LLMs, and its real-world impact on AI strategy execution.

Date and time: Sun, Aug 10, 2025 - 10:30–12:00

Author(s): Akbar Akbari Esfahani (Central California Alliance for Health)

Keyword(s): shiny, python, ai governance, generativeai, rag (retrieval-augmented generation).

Video recording available after conference: ✅
Akbar Akbari Esfahani (Central California Alliance for Health)
10:30–12:00 Gross 270 Every Eclipse Visible from Your Current Location for the Rest of Your Life (with shiny)

More infoA Shiny web app that calculates all solar and lunar eclipses visible at your current lon/lat coordinates for up to the next 75 years (https://tim-bender.shinyapps.io/shiny_all_eclipses/). The app leverages the swephR High Precision Swiss Ephemeris package for celestial body calculations.

Anecdote: I witnessed the sky darken for no apparent reason in the fall of 2023. Months later I realized I had witnessed the October 2023 annular solar eclipse. So I began to wonder: how many other eclipse events have I been missing?

Broad topics covered:

Julius Caesar vs Pope Gregory XIII: The Battle for Space-Time. While most of the western world adopted the Gregorian calendar (365.2425 days long) by the 20th century for agricultural and cultural reasons, astronomers track time off Earth using the older Julian calendar (365.25 days long). We will briefly touch on why this is, how to convert back and forth between Gregorian and Julian to perform astronomical calculations, and other notable phenomena to keep in mind when using R in space.

User Input: Flexibility vs Ease-of-Use. With regard to longitude/latitude coordinate inputs (required for calculating the alignment of celestial bodies relative to an Earth-bound viewer), a decision had to be made that affects the usability of the Shiny app, based on how easily the end user can enter an input location.

The Breadth of the swephR Package. The swephR package is useful for calculating not only solar and lunar eclipse events visible from Earth but all kinds of celestial alignments, including planetary positions, the crossing of planets over positions, fixed star positions, and orbital periods for the Earth, asteroids, etc.

Biography: Tim Bender, hobbyist. Bachelor of Urban Planning from the University of Cincinnati; ~15 years of local government experience as an urban planner and transit planner. Tim was part of a team that helped deploy Google Transit for his transit agency in Kentucky in 2008, among the first 50 or so agencies worldwide to go live. His journey with R began with a desire to log real-time transit vehicle location data from an API for analysis, with no programming experience or knowledge of how to approach the problem; he wouldn't successfully solve it until after about 5 years of self-guided learning. LinkedIn: https://www.linkedin.com/in/tim-bender-238870171/ GitHub: https://github.com/benda18
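As a small illustration of the Gregorian-to-Julian bookkeeping the talk touches on (my own base-R sketch using the standard Fliegel and Van Flandern integer formula, not the app's code):

# Julian Day Number from a Gregorian calendar date (Fliegel & Van Flandern),
# shown here only to illustrate the calendar conversion discussed above.
gregorian_to_jdn <- function(year, month, day) {
  a <- (14L - month) %/% 12L
  y <- year + 4800L - a
  m <- month + 12L * a - 3L
  day + (153L * m + 2L) %/% 5L + 365L * y +
    y %/% 4L - y %/% 100L + y %/% 400L - 32045L
}

gregorian_to_jdn(2000L, 1L, 1L)  # 2451545, the J2000.0 reference epoch (noon)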

Date and time: Sun, Aug 10, 2025 - 10:30–12:00

Author(s): Timothy Bender

Keyword(s): shiny, astronomy, leaflet, geocoding, communication

Video recording available after conference: ✅
Timothy Bender
10:30–12:00 Gross 270 Extending Shiny Dashboards to Mobile with Ionic and Rust: A Cross-Platform Approach

More infoShiny has long been a framework of choice for interactive dashboards on the web, but what if you need a mobile variant? This talk proposes new ways to extend existing Shiny dashboards to mobile by using Ionic and JavaScript, while making use of a Rust server for data manipulation and preparation. We'll analyze the integration of Shiny with a mobile frontend, the benefits offered by Rust in terms of backend efficiency, and the tradeoffs between web-based and mobile dashboard interfaces. Finally, we'll assemble everything with a live working proof-of-concept mobile application that streams a Shiny dashboard into a simplified and responsive user interface.

Date and time: Sun, Aug 10, 2025 - 10:30–12:00

Author(s): Anastasiia Kostiv

Keyword(s): ionic, shiny, rust, react.js

Video recording available after conference: ✅
Anastasiia Kostiv
10:30–12:00 Gross 270 paleopal: a highly modular and interactive Shiny app for building reproducible data science workflows in paleontology

More infoThe field of computational paleontology is rapidly advancing with many recently developed open-source R packages leading the charge for more standardized, reproducible, and open research. This push is a relief for many data-science-minded paleontologists who have previously toiled over writing their own scripts to download, clean, analyze, and visualize their data. Many of these steps are now covered by functions in these new packages (and those of other packages in the R universe). However, this push for more script-based research may throw a wrench into the existing scientific workflows of less technical researchers who lack a background in coding, or steepen the learning curve for new researchers entering the field. Therefore, bridging the gap between visual, hands-on workflows and digital, code-based workflows is imperative to the collaborative future of computational paleontology. Here I present a new Shiny app, paleopal, that provides a user-friendly interface to build paleontological data science workflows. The app connects existing paleontological R packages such as palaeoverse and deeptime with the tidyverse suite of R packages to encourage standardized scientific pipelines. Furthermore, the app uses the shinymeta R package to provide a live code and results panel and a downloadable RMarkdown script for the pipeline. Altogether, paleopal aims to spearhead the training of the next generation of computational paleontologists, regardless of age, background, or technical expertise. Further, the modular nature of the app introduces an avenue for other fields to fork and adapt the project for their own needs.

Date and time: Sun, Aug 10, 2025 - 10:30–12:00

Author(s): William Gearty (Syracuse University)

Keyword(s): shiny, workflow, earth sciences, biological sciences, reproducible research

Video recording available after conference: ✅
William Gearty (Syracuse University)
Quarto
13:00–14:30 Penn 1 Final Flourishes

More infoQuarto’s clean and simple API is one of its greatest strengths. This ease of use extends to creating Quarto extensions as well. In this talk, we will document the details and process we took to create our Quarto extension, Flourish. Flourish allows users to dynamically target text in code chunks, like functions or parameters, and apply styling (e.g. highlighting) in the rendered document. If you are familiar with the R package Flair, Flourish works similarly, but is language agnostic. We will speak in depth about how Flourish dynamically injects styling into a rendered report using Javascript, and the work it took to get to this process. This talk seeks to inform participants of our extension and its technical workings, as well as step them through our process of creating a Quarto extension, showing how simple it can be to extend Quarto’s functionality.

Date and time: Sun, Aug 10, 2025 - 13:00–14:30

Author(s): Visruth Srimath Kandali (Cal Poly – San Luis Obispo); Kelly Bodwin (California Polytechnic State University)

Keyword(s): quarto, extension development, pedagogy, formatting, technical

Video recording available after conference: ✅
Visruth Srimath Kandali (Cal Poly – San Luis Obispo)
13:00–14:30 Penn 1 From Frustration to Function: Tackling the Challenges of New Tech Adoption “Cracking the Code: Overcoming Early Hurdles with New Open-Source Tools”

More infoAdopting new open-source technology can be both exciting and challenging. While a tool may appear promising and seem like the perfect fit for a specific task, early-stage technologies often come with their own set of hurdles. One of the biggest challenges is the lack of comprehensive resources—such as detailed documentation, practical examples, active discussion boards, and community support—which can make it difficult to troubleshoot issues or fully understand the tool’s capabilities. This often requires additional effort, experimentation, and problem-solving to get things working as intended. At last year’s Posit Conference, I was introduced to a new tool called closeread, a Quarto extension designed for vertical scrollytelling. The concept immediately caught my interest because it seemed like an innovative way to enhance storytelling with data. Motivated by its potential, I decided to give it a try shortly after the conference. However, my initial experience was challenging. The tool was only partially functional, and when I encountered technical issues, I struggled to find enough supporting resources to resolve them. The available documentation was limited, examples were scarce, and there wasn’t much discussion happening in community forums. Frustrated by these obstacles, I eventually set the tool aside, unsure of how to move forward. Some time later, I came across a user-contribution contest that reignited my interest in closeread. This contest motivated me to tackle the tool again, but this time with a different mindset. Instead of relying solely on available resources, I approached the problem more systematically—digging into the code, experimenting with different configurations, and learning through trial and error. This hands-on approach, combined with the fresh motivation from the contest, helped me overcome the technical challenges I had faced earlier. Eventually, I was able to get the tool working successfully, and in the process, I gained a deeper understanding of how to navigate the common pitfalls associated with adopting new technology. In my talk, I will share this learning journey in detail, highlighting the strategies that helped me move from frustration to success. I’ll discuss practical approaches for overcoming deployment challenges, including how to troubleshoot effectively when documentation is limited, how to leverage community resources even when they seem sparse, and how to maintain motivation when progress feels slow. I’ll also offer insights into the mindset shifts that can make a big difference—like viewing challenges as opportunities to deepen your technical skills rather than as roadblocks.

Date and time: Sun, Aug 10, 2025 - 13:00–14:30

Author(s): Dror Berel

Keyword(s): quarto, closeread, best practices, community, scrollytelling

Video recording available after conference: ✅
Dror Berel
13:00–14:30 Penn 1 Parsing Quarto and R Markdown documents in R

More infoIn this talk, we will share recent work on the parsermd R package and its use for the programmatic manipulation of Quarto and R Markdown documents. The talk will include a brief overview of the underlying technical details of the parser and the abstract syntax tree (AST) representation of these documents in R. Additionally, we will present work to support a number of use cases for these tools to solve practical problems. Examples include building documents that have multiple output variants (e.g. assignments with or without solutions included) and utilities built to support grading and feedback for assignments based on these document formats. Finally, we will discuss our future development plans for the package and related tools.
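A rough sketch of the programmatic workflow described above (parse_rmd() is the package's documented entry point; the selection helpers shown are assumptions to verify against the package documentation, and the file name is hypothetical):

# Sketch: parse a document into an AST and inspect its code chunks.
library(parsermd)

ast <- parse_rmd("assignment.Rmd")   # hypothetical file name
ast                                   # printed tree of sections, chunks, and markdown

# Selection helpers assumed from the package docs; e.g. to audit which
# chunks would need removing from a "no solutions" variant.
chunks <- rmd_select(ast, has_type("rmd_chunk"))
as_tibble(chunks)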

Date and time: Sun, Aug 10, 2025 - 13:00–14:30

Author(s): Colin Rundel (Duke University)

Keyword(s): quarto, rmarkdown, literate programming, automation

Video recording available after conference: ✅
Colin Rundel (Duke University)
13:00–14:30 Penn 1 Reproducible pedagogy with R and Quarto

More infoReproducible teaching materials (1) help students understand the usefulness of reproducibility, (2) facilitate pedagogy around meaningful, data-driven statistical applications, and (3) allow other instructors to iterate on course materials. In this talk, I show how a variety of R packages and Quarto work together to produce useful, simple and aesthetic reproducible course materials. I discuss how to weave R-code into lectures, embed reproducible visualizations and animations, and easily host data sets as well as downloadable R scripts on a public course website developed with Quarto. I showcase an undergraduate Bayesian statistics course taught at Duke University in Spring 2025 as an example of reproducible statistics pedagogy and provide tutorial-like instructions to adapt the methods to other courses.

Date and time: Sun, Aug 10, 2025 - 13:00–14:30

Author(s): Alexander Fisher (Duke University)

Keyword(s): teaching, statistics, website, reproducible

Video recording available after conference: ✅
Alexander Fisher (Duke University)
Productivity boosters
13:00–14:30 Penn 2 Air - A blazingly fast R code formatter

More infoIn Python, Rust, Go, and many other languages, code formatters are widely loved. They run on every save, on every pull request, and in git pre-commit hooks to ensure code consistently looks its best at all times. In this talk, you'll learn about Air, a new R code formatter. Air is extremely fast, capable of formatting individual files so fast that you'll question if it's even running, and of formatting entire projects in under a second. Air integrates directly with your favorite IDEs, like Positron, RStudio, and VS Code, and is available on the command line, making it easy to standardize on one tool even for teams using various IDEs. Once you start using Air, you'll never worry about code style ever again!

Date and time: Sun, Aug 10, 2025 - 13:00–14:30

Author(s): Lionel Henry (Posit, PBC), Davis Vaughan (Posit, PBC)

Keyword(s): formatter, rust

Video recording available after conference: ✅
Lionel Henry (Posit, PBC)
Davis Vaughan (Posit, PBC)
13:00–14:30 Penn 2 Getting Things Logged

More infologger is a lightweight, modern, and flexible logging utility for R, with a clear concept and separation of log message formatter, layout renderer, and log record appender functions – which makes it effortless to log messages in various formats and to various destinations, such as your console, files, databases, or even Slack. The package was first released 6 years ago and has been widely used since then. Development recently spiked thanks to generous contributions from the community (over 100 pull requests), introducing an improved async logger; new formatter, helper, and appender functions; documentation updates; and a hex logo!
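The separation the abstract describes can be seen in a few lines (the log file path is a placeholder):

# Each concern is configured independently: how messages are built
# (formatter), how records are rendered (layout), and where they go
# (appender).
library(logger)

log_formatter(formatter_glue)
log_layout(layout_json())
log_appender(appender_file("app.log"))
log_threshold(INFO)

user <- "jane"
log_info("Session started for {user}")
log_warn("Cache miss rate at {round(runif(1) * 100)}%")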

Date and time: Sun, Aug 10, 2025 - 13:00–14:30

Author(s): Gergely Daroczi

Keyword(s): logging

Video recording available after conference: ✅
Gergely Daroczi
13:00–14:30 Penn 2 R4R: Reproducibility for R

More infoCreating a reproducible environment for data analysis pipelines is challenging, due to the wide range of dependencies involved—from data inputs and external tools to system libraries and R packages. Although various tools exist to simplify the process, they often focus exclusively on R package dependencies and omit the system ones, rely on user-supplied metadata, or create an unnecessarily large environment. We present r4r, a tool that automatically traces all dependencies in a pipeline using system call interception. Based on these traces, r4r generates a Docker image containing precisely the dependencies needed for reproducible execution. We demonstrate its effectiveness on a collection of R Markdown notebooks from Kaggle, illustrating how r4r helps ensure fully reproducible workflows.

Date and time: Sun, Aug 10, 2025 - 13:00–14:30

Author(s): Pierre Donat-Bouillud (Czech Technical University), Filip Křikava (Czech Technical University)

Keyword(s): reproducibility, docker, automation

Video recording available after conference: ✅
Pierre Donat-Bouillud (Czech Technical University)
Filip Křikava (Czech Technical University)
13:00–14:30 Penn 2 wizrd: Programming with LLMs in R

More infoLarge Language Models (LLMs) offer new opportunities for accelerating data analysis, providing flexibility through a natural language interface. The wizrd package was born to test the hypothesis that LLMs can be used to implement functions within larger data science tools and workflows. With wizrd, users can parameterize model inputs, implement logic that delegates to R functions, and constrain outputs to specific R data structures. Finally, configured models can be converted into actual R functions, which can in turn serve as tools for other models, forming a graph of agents. LLM-based programs can be imported from the LangSmith hub, realizing their portability across languages. We will give an overview and demo of the package.

Date and time: Sun, Aug 10, 2025 - 13:00–14:30

Author(s): Michael Lawrence

Keyword(s): llm, ai, programming

Video recording available after conference: ✅
Michael Lawrence
Too big to fail
13:00–14:30 Penn Garden Futureverse P2P: Peer-to-Peer Parallelization in R using Futureverse

More infoTL;DR: In this presentation, I will show how you can move from running your Futureverse R code in parallel on your local computer to a distributed peer-to-peer (P2P) network of computers shared among friends - all with a single change of settings. Any user with R installed can contribute their compute power when idle and harness others' when needed.

Abstract: The Futureverse framework revolutionized parallel computing in R by providing a simple, unified API for parallel evaluation of R code. At its core, the future package allows developers to write code once (e.g. f <- future(lm(x, y))), which may then be evaluated on any future-compatible parallel backend. For instance, plan(multisession) parallelizes on the local machine, plan(cluster) and plan(mirai_cluster) can parallelize on a set of local or remote machines, plan(batchtools_slurm) distributes the computations on a Slurm high-performance compute (HPC) cluster, and so on. Regardless of the backend used, getting the value of the future is always the same (e.g. v <- value(f)).

In this presentation, I introduce a novel peer-to-peer (P2P) future backend that enables distributed parallel computing among friends and colleagues using shared storage. The shared storage can be a local file system or cloud storage. I will illustrate this concept using the plan(p2p_gdrive) backend, which leverages Google Drive as a communication medium, where users can offload computational tasks to peers in a shared workspace. When a user creates a future, it ends up in the "todo" folder, where idle peers can detect it, download it, and execute it locally. Once completed, the result is uploaded to the "done" folder, making it available for retrieval by the original user.

The Futureverse ecosystem has a simple API, which makes it easy for anyone to write parallel R code. Because Futureverse is exceedingly well tested, you can easily and safely scale up code currently running on your local computer to run on distributed P2P clusters. This approach democratizes distributed computing, allowing R users to harness the collective power of their social network without requiring dedicated HPC infrastructure. Come to my talk and learn how you and your friends can get together and share your compute resources, allowing you to run y <- future_lapply(X, my_long_running_analysis) across their computers, while you share your idle compute resources with them. All this is available from your local R prompt.
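
For readers new to the API referenced above, here is a minimal sketch using a standard local backend; the p2p_gdrive backend is specific to the talk, so plan(multisession) stands in for it:

```r
# Minimal Futureverse sketch: the same code runs on any future-compatible backend.
library(future)
library(future.apply)

plan(multisession)                       # parallelize on the local machine

f <- future(mean(rnorm(1e6)))            # create a future...
v <- value(f)                            # ...and collect its value, identically on any backend

res <- future_lapply(1:8, function(i) sqrt(i))   # parallel map over a vector
```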

Date and time: Sun, Aug 10, 2025 - 13:00–14:30

Author(s): Henrik Bengtsson (University of California San Francisco (UCSF))

Keyword(s): programming, parallel processing, performance, reproducibility

Video recording available after conference: ✅
Henrik Bengtsson (University of California San Francisco (UCSF))
13:00–14:30 Penn Garden Outgrowing your laptop with Positron

More infoHave you ever run out of memory or time when tidying data, making a visualization, or training a model? An R user may find their laptop more than sufficient to start their journey with statistical computing, but as datasets grow in size and complexity, so does the necessity for more sophisticated tooling. This talk will step through a set of approaches to scale your tasks beyond in-memory analysis on your local machine, using the Positron IDE: adopting a lazy evaluation engine like DuckDB, connecting to remote databases with fluent workflows, and even migrating from desktop analysis entirely to server or cloud compute using SSH tunnelling. The transition away from a local, in-memory programming paradigm can be challenging for R users, who may not have much exposure to tools or training for these ways of working. This talk will explore available options that make crossing this boundary more approachable, and how they can be used with an advanced development environment suited for statistical computing with R. Integrations in the Positron IDE make all these tasks easier; for example, remote development in Positron allows an R user to seamlessly write code on their local machine and execute that code on a remote host without tedious interactions outside the IDE. Whether you train statistical models, build interactive apps, or work with large datasets, after this talk you'll walk away with techniques for doing it better with Positron.
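
As a small sketch of the "lazy evaluation engine" step mentioned above, here is one way to query data through DuckDB with dplyr; the table is just R's built-in mtcars, used purely for illustration:

```r
# Lazy querying with DuckDB via dplyr/dbplyr.
library(DBI)
library(duckdb)
library(dplyr)

con <- dbConnect(duckdb())               # in-process DuckDB database
duckdb_register(con, "cars", mtcars)     # expose an R data frame as a virtual table

tbl(con, "cars") |>
  group_by(cyl) |>
  summarise(avg_mpg = mean(mpg)) |>      # the query is built lazily...
  collect()                              # ...and only executed here

dbDisconnect(con, shutdown = TRUE)
```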

Date and time: Sun, Aug 10, 2025 - 13:00–14:30

Author(s): Julia Silge (Posit, PBC)

Keyword(s): ide, workflow, tooling, remote development

Video recording available after conference: ✅
Julia Silge (Posit, PBC)
13:00–14:30 Penn Garden Scaling Up Data Workflows with Arrow, Parquet, and DuckDB

More infoWhile R is an expressive language for exploring and manipulating data, it is not naturally suited to working with datasets that are larger than can fit into memory. However, modern tooling, including Parquet files, the Arrow format, and query engines like DuckDB, can expand what is possible to do with large datasets in R. Using practical examples, this talk will introduce several packages that bring these tools into R with intuitive interfaces, and demonstrate how to adopt them to work efficiently with large datasets. It will also show how these tools unlock new opportunities, such as easy access to data in cloud storage, and will explore recent developments in the Arrow and Parquet ecosystems, including for geospatial data.
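
A rough sketch of this workflow, assuming a directory of Parquet files at a hypothetical path with hypothetical column names, might look like:

```r
# Query larger-than-memory Parquet data with arrow + dplyr, then hand it to DuckDB.
library(arrow)
library(dplyr)

ds <- open_dataset("data/flights/")      # larger-than-memory Parquet dataset

ds |>
  filter(year == 2024) |>
  group_by(carrier) |>
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) |>
  collect()                              # data is only pulled into R here

# Hand the same dataset to DuckDB's query engine without copying it
ds |>
  to_duckdb() |>
  count(carrier)
```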

Date and time: Sun, Aug 10, 2025 - 13:00–14:30

Author(s): Neal Richardson (Posit, PBC)

Keyword(s): arrow, parquet, duckdb, big data

Video recording available after conference: ✅
Neal Richardson (Posit, PBC)
13:00–14:30 Penn Garden sparsity support in tidymodels, faster and less memory hungry models

More infoSparse data, data with a lot of 0s, appears quite often in modeling contexts. However, existing data structures such as data.frames or matrices don't have a good way of handling it: you were forced to represent all the data as either sparse or dense (non-sparse). This means that many modeling workflows use a non-optimal data structure, which at best slows down computation and at worst isn't computationally feasible. This talk will cover how we overcame these issues in tidymodels, starting with the creation of a sparse vector format that fits in tibbles, followed by the wiring needed to make it work across our packages. The best part is that most users don't need to change anything in their code to benefit from these speed improvements.

Date and time: Sun, Aug 10, 2025 - 13:00–14:30

Author(s): Emil Hvitfeldt (Posit, PBC)

Keyword(s): machine learning, tidymodels, sparse data

Video recording available after conference: ✅
Emil Hvitfeldt (Posit, PBC)
Keynote #5
15:00–16:00 Penn 1 Powerful simulation pipelines with {targets}

More infoWhen designing clinical trials, simulations are essential for comparing options and optimizing features like sample size, allocation, randomization, milestones, and decision criteria. However, simulations require thousands of replications and long execution times. Pipeline tools bring efficiency, reproducibility, and peace of mind to demanding computations such as simulations. The {targets} package is a friendly workflow manager that brings the automation and structure of pipelines to the uniquely fluid and interactive environments where R users thrive. This talk explores {targets} with a simulation pipeline for an example clinical trial.
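
As a rough sketch of what such a pipeline can look like (the simulate_trial() helper, patient count, and replication count here are hypothetical, not taken from the talk):

```r
# _targets.R: a minimal {targets} simulation pipeline sketch.
library(targets)

simulate_trial <- function(n, seed) {
  set.seed(seed)
  mean(rnorm(n))                         # stand-in for a real trial simulation
}

list(
  tar_target(n_patients, 200),
  tar_target(replications, seq_len(1000)),
  tar_target(
    results,
    simulate_trial(n_patients, seed = replications),
    pattern = map(replications)          # one dynamic branch per replication
  )
)
# Run with tar_make(); only outdated branches are recomputed on later runs.
```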

Date and time: Sun, Aug 10, 2025 - 15:00–16:00

Author(s): Will Landau (Eli Lilly and Company)

Keyword(s): NA

Video recording available after conference: ✅
Will Landau (Eli Lilly and Company)

Last updated 2025-05-25.

An R Foundation conference