Welcome to the cyCombine reference manual! This manual will show how to use the main functions of cyCombine - either using the recommended (all-in-one) workflow or the modular workflow, which is more customizable to fit specific data inputs.

cyCombine works on dataframes/tibbles in R. If you already have data in R or in a readable text format, the conversion to dataframe/tibble should be relatively straightforward. However, it is expected that most users will start their analysis from FCS files.

Alongside a directory of FCS files, a metadata and a panel file are assumed to be present. These files are helpful in generating the tibble structure, which is processable with cyCombine. I.e. besides containing the protein marker expression per cell, the tibble should also contain information regarding the batch of origin per cell, and generally it is easier to work with data that also encompasses the sample IDs - and potential conditions. This information should be contained in a metadata file.

The panel information is also nice to have for several reasons:

  1. FCS files contain some columns, which should not be included in batch correction - such as “Time” and “Event_length”. Furthermore, there may be “empty” channels, which should also be ignored during analysis
  2. Sometimes, FCS files do not contain the proper protein names and a panel file can help ensure that the FCS files are read correctly.

The metadata of our example has the following columns:

Filename batch condition Patient_id

And the panel should contain the columns:

Channel Antigen Type

By setting Type = ‘none’ in a panel file, the columns to exclude are easy to identify.

Prepare data

The first step of a cyCombine analysis is to convert the relevant FCS files into a work-able tibble. We introduce two approaches for this.

Modular workflow

In case you want more control, you can adjust any step of this approach. Feel free to skip this segment, if the method above worked just fine for you.

This example will include all possible input parameters to give an overview of what can be modified.

# Load packages

# Directory with FCS files
data_dir <- "~/data"

# Extract markers from panel
panel <- read_csv(file.path(data_dir, "panel.csv")) # Can also be .xlsx
metadata <- read_csv(file.path(data_dir, "metadata.csv"))

# Extract markers of interest
markers <- panel %>% 
  filter(Type != "none") %>% 

# Read FCS files
flowset <- compile_fcs(
  data_dir = data_dir,
  pattern = "\\.fcs" # Read all FCS files

# Convert flowset to tibble
df <- convert_flowset(
  flowset = flowset,
  metadata = metadata,
  filename_col = "Filename",
  sample_ids = "Filename", # By default the filename is used to get sample ids
  batch_ids = "batch",
  condition = "condition",
  down_sample = TRUE,
  sample_size = 2000000,
  seed = 101,
  panel = panel, # Can also be the filename. It is solely used to ensure the channel names match what you expect (i.e. what is in the panel_antigen column)
  panel_channel = "Channel",
  panel_antigen = "Antigen"

# Transform data - This function also de-randomizes the data
uncorrected <- transform_asinh(
  df = df,
  markers = markers,
  cofactor = 5,
  .keep = TRUE # Lets you keep all columns, in case they are useful to you
# Store result
saveRDS(uncorrected, file = file.path(data_dir, "uncorrected.RDS"))

Batch correction

Now that the data is converted to a tibble format, it is straightforward to perform batch correction with cyCombine. Again, we demonstrate two different workflows.

Modular workflow

# Load packages

# Load data (if not already loaded)
# uncorrected <- readRDS("data/uncorrected.RDS")
# markers <- get_markers(uncorrected)

# Create cell type labels using a SOM grid (if you want to use your own labels, they can be added manually and this step should not be run)
labels <- uncorrected %>% 
  normalize(markers = markers,
            norm_method = "rank", # "scale" is recommended in cases with light batch effects (e.g. when combining similar data)
            ties.method = "average") %>% # Can also be minimum
  create_som(markers = markers,
             rlen = 10, # If results are not convincing, consider using a higher value (e.g. 100)
             seed = 101,
             xdim = 8,
             ydim = 8)

# Batch correct
corrected <- correct_data(
  df = uncorrected,
  label = labels # Add custom labels here, if desired
  covar = "condition",
  markers = markers,
  parametric = TRUE

# Save result
saveRDS(corrected, file.path(data_dir, "corrected.RDS"))

Evaluate performance using Earth Mover’s Distance

The EMD reduction is implemented as the performance metric; EMDs are computed for both the uncorrected and corrected data, removing those values where both had an EMD < 2.

\[Reduction = \frac{\sum{EMD_{before}} - \sum{EMD_{after}}}{\sum{EMD_{before}}}\]

# Load packages

# Load data (if not already loaded)
# data_dir <- "~/data"
# uncorrected <- readRDS(file.path(data_dir, "uncorrected.RDS"))
# corrected <- readRDS(file.path(data_dir, "corrected.RDS"))
# markers <- get_markers(uncorrected)

# Re-run clustering on corrected data
labels <- corrected %>% 
  create_som(markers = markers,
             rlen = 10)
uncorrected$label <- corrected$label <- labels

# Evaluate EMD
emd <- evaluate_emd(uncorrected, corrected, cell_col = "label")

# Reduction

# Violin plot

# Scatter plot

Create UMAPs and density plots

This segment will demonstrate the built-in functions for generating UMAPs and density plots.

# Load packages

# Load data (if not already loaded)
# data_dir <- "~/data"
# uncorrected <- readRDS(file.path(data_dir, "uncorrected.RDS"))
# corrected <- readRDS(file.path(data_dir, "corrected.RDS"))
# markers <- get_markers(uncorrected)

# Create UMAPs
sam <- sample(1:nrow(uncorrected), 30000)
plot1 <- plot_dimred(uncorrected[sam, ], "Uncorrected", type = "umap", plot = "batch", markers = markers)
plot2 <- plot_dimred(corrected[sam, ], "Corrected", type = "umap", plot = "batch", markers = markers)
plot_save_two(plot1, plot2, "figs/umap.png")

# Density plots
             markers = markers,
             filename = "figs/density.png",
             y = "batch",
             ncol = 6,
             xlim = 10)