R Guide

Name: r-guide
Rating: 92
Author: ar4mirez

Applies to: R 4.1+, Statistical Computing, Data Analysis, R Packages, Shiny Apps

Core Principles

•Tidyverse First: Use tidyverse conventions for data manipulation, visualization, and functional programming; fall back to base R only when performance demands it
•Vectorize Everything: Prefer vectorized operations and purrr::map() over explicit for loops; R is optimized for vector operations
•Reproducibility: Every analysis must be reproducible -- use renv for dependency management, set.seed() for stochastic operations, and R Markdown/Quarto for literate programming
•Functional Style: Write pure functions with no side effects; avoid modifying global state or relying on .GlobalEnv
•Explicit Over Implicit: No reliance on partial matching, implicit type coercion, or positional argument passing for non-trivial functions

Guardrails

Version & Dependencies

•Target R 4.1+ (native pipe |>, lambda shorthand \(x))
•Manage dependencies with renv -- always commit renv.lock
•For packages, declare all dependencies in DESCRIPTION (Imports:, Suggests:)
•Pin CRAN snapshot dates in renv for full reproducibility
•Audit new dependencies: check CRAN status, reverse dependencies, license (GPL compatibility)

Code Style

•Follow the tidyverse style guide
•Run styler::style_pkg() and lintr::lint_package() before every commit
•Naming: snake_case for functions/variables, PascalCase for R6/S4 classes
•Max line length: 80 characters
•Use <- for assignment (not = outside function arguments)
•Explicit library() at top of scripts; never use require()
•Always use TRUE/FALSE (never T/F -- they can be overwritten)
•No attach() or setwd() -- use here::here() for project-relative paths

Vectorization

•Prefer vectorized operations: x * 2 not for (i in seq_along(x)) x[i] * 2
•Use dplyr::mutate() / dplyr::summarise() for column-wise transformations
•Use purrr::map() family for list iteration (map_dbl(), map_chr(), map_dfr())
•Use dplyr::across() for applying functions to multiple columns
•Reserve for loops for side effects only (writing files, API calls)
•Use vapply() over sapply() when base R is required (explicit return type)

Error Handling

•Use rlang::abort() / cli::cli_abort() over stop() for structured conditions
•Validate inputs at the start of every exported function
•Use stopifnot() or rlang::arg_match() for argument validation
•Never use try() -- always tryCatch() or purrr::safely()

validate_dataframe <- function(df, required_cols) {
  if (!is.data.frame(df)) {
    cli::cli_abort("{.arg df} must be a data frame, not {.obj_type_friendly {df}}.")
  }
  missing_cols <- setdiff(required_cols, names(df))
  if (length(missing_cols) > 0) {
    cli::cli_abort(
      "Missing required column{?s}: {.field {missing_cols}}.",
      class = "validation_error"
    )
  }
  invisible(df)
}

Reproducibility

•Always use set.seed() before stochastic operations; document the seed
•Use renv::snapshot() after adding or updating packages
•Never use absolute paths -- use here::here() for project-relative paths
•Use R Markdown (.Rmd) or Quarto (.qmd) for analysis reports
•Include sessioninfo::session_info() at the end of reports

Project Structure

code

mypackage/                          myanalysis/
├── R/             # Source files   ├── R/             # Reusable functions
│   ├── data-clean.R                ├── analysis/      # Rmd/Quarto (numbered)
│   └── utils.R                     │   ├── 01-exploration.Rmd
├── tests/                          │   └── 02-modeling.qmd
│   ├── testthat.R  # Runner        ├── data/
│   └── testthat/                   │   ├── raw/       # Immutable input
│       └── test-data-clean.R       │   └── processed/ # Generated output
├── man/            # roxygen2      ├── output/        # Figures, reports
├── vignettes/                      ├── tests/testthat/
├── data-raw/       # Data scripts  ├── renv.lock
├── DESCRIPTION                     └── README.md
├── NAMESPACE       # roxygen2
├── renv.lock
└── README.md

•Use roxygen2 for all docs; never edit man/ or NAMESPACE by hand
•Raw data is immutable -- store in data/raw/, process into data/processed/

Key Patterns

Tidyverse Pipe Chains

# Prefer native pipe |> (R 4.1+) over magrittr %>%
result <- raw_data |>
  dplyr::filter(year >= 2020, !is.na(revenue)) |>
  dplyr::mutate(
    revenue_m = revenue / 1e6,
    growth = (revenue - dplyr::lag(revenue)) / dplyr::lag(revenue)
  ) |>
  dplyr::summarise(
    mean_revenue = mean(revenue_m, na.rm = TRUE),
    .by = region
  )

Tidy Evaluation

# Use {{ }} (embrace) for column names passed as arguments
summarise_by <- function(df, group_col, value_col) {
  df |>
    dplyr::summarise(
      mean_val = mean({{ value_col }}, na.rm = TRUE),
      n = dplyr::n(),
      .by = {{ group_col }}
    )
}

# Use .data pronoun for string column references
filter_column <- function(df, col_name, threshold) {
  df |> dplyr::filter(.data[[col_name]] > threshold)
}

# Use across() for multiple columns
standardize_numeric <- function(df) {
  df |>
    dplyr::mutate(dplyr::across(
      where(is.numeric),
      \(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
    ))
}

ggplot2 Grammar of Graphics

plot_distribution <- function(df, x_col, fill_col = NULL) {
  ggplot2::ggplot(df, ggplot2::aes(x = {{ x_col }})) +
    ggplot2::geom_histogram(ggplot2::aes(fill = {{ fill_col }}), bins = 30, alpha = 0.7) +
    ggplot2::labs(title = "Distribution", x = NULL, y = "Count") +
    ggplot2::theme_minimal(base_size = 14)
}

Functional Programming with purrr

# Type-stable map variants -- read and combine CSV files
results <- purrr::map_dfr(file_paths, \(path) {
  readr::read_csv(path, show_col_types = FALSE) |>
    dplyr::mutate(source_file = basename(path))
})

# Safe execution -- capture errors without stopping
safe_read <- purrr::safely(readr::read_csv)
reads <- purrr::map(file_paths, safe_read)
successes <- purrr::map(purrr::keep(reads, \(x) is.null(x$error)), "result")

Testing

Standards

•Use testthat 3rd edition (Config/testthat/edition: 3 in DESCRIPTION)
•Test files: test-*.R (mirror source: data-clean.R -> test-data-clean.R)
•Test names describe behavior: test_that("filter_active removes inactive users", ...)
•Coverage target: >80% for business logic, >60% overall (measured with covr)
•Use snapshot tests (expect_snapshot()) for complex output (plots, printed tables)
•No test interdependencies -- each test_that() block is self-contained
•Use withr::local_*() for temporary state changes (env vars, options, files)

testthat Examples

test_that("summarise_by computes correct group means", {
  df <- tibble::tibble(
    region = c("east", "east", "west", "west"),
    revenue = c(100, 200, 300, 400)
  )
  result <- summarise_by(df, region, revenue)
  expect_equal(nrow(result), 2)
  expect_equal(result$mean_val[result$region == "east"], 150)
})

test_that("validate_dataframe errors on missing columns", {
  df <- tibble::tibble(a = 1, b = 2)
  expect_error(validate_dataframe(df, c("a", "c")), class = "validation_error")
})

Tooling

Essential Commands

bash

Rscript -e 'styler::style_pkg()'        # Format package code
Rscript -e 'lintr::lint_package()'       # Lint package
Rscript -e 'devtools::test()'            # Run tests
Rscript -e 'covr::package_coverage()'    # Coverage report
Rscript -e 'devtools::check()'           # Full R CMD check
Rscript -e 'renv::snapshot()'            # Lock dependencies
Rscript -e 'devtools::document()'        # Rebuild roxygen2 docs
quarto render analysis/report.qmd        # Render Quarto document

References

For detailed patterns and examples, see:

•references/patterns.md -- dplyr pipelines, ggplot2 recipes, purrr functional patterns

R Guide

Core Principles

Guardrails

Version & Dependencies

Code Style

Vectorization

Error Handling

Reproducibility

Project Structure

Key Patterns

Tidyverse Pipe Chains

Tidy Evaluation

ggplot2 Grammar of Graphics

Functional Programming with purrr

Testing

Standards

testthat Examples

Tooling

Essential Commands

References

External References