Welcome to the cyCombine reference manual! This manual will show how to use the main functions of cyCombine - either using the recommended (all-in-one) workflow or the modular workflow, which is more customizable to fit specific data inputs.
cyCombine works on dataframes/tibbles in R. If you already have data in R or in a readable text format, the conversion to dataframe/tibble should be relatively straightforward. However, it is expected that most users will start their analysis from FCS files.
Alongside a directory of FCS files, a metadata and a panel file are assumed to be present. These files are helpful in generating the tibble structure, which is processable with cyCombine. I.e. besides containing the protein marker expression per cell, the tibble should also contain information regarding the batch of origin per cell, and generally it is easier to work with data that also encompasses the sample IDs - and potential conditions. This information should be contained in a metadata file.
The panel information is also nice to have for several reasons:
The metadata of our example has the following columns:
Filename | batch | condition | Patient_id |
---|---|---|---|
In addition, when replicated samples/anchors are included, it may be relevant to also use the column anchor
as described below.
And the panel should contain the columns:
Channel | Antigen | Type |
---|---|---|
By setting Type = ‘none’ in a panel file, the columns to exclude are easy to identify.
The first step of a cyCombine analysis is to convert the relevant FCS files into a work-able tibble. We introduce two approaches for this.
# Load packages
library(cyCombine)
library(tidyverse)
# Directory with FCS files
data_dir <- "~/data"
# Extract markers from panel
panel_file <- file.path(data_dir, "panel.csv") # Can also be .xlsx
metadata_file <- file.path(data_dir, "metadata.csv") # Can also be .xlsx
# Extract markers of interest
markers <- read.csv(panel_file) %>%
filter(Type != "none") %>%
pull(Antigen)
# Prepare a tibble from directory of FCS files
uncorrected <- prepare_data(
data_dir = data_dir,
metadata = metadata_file,
filename_col = "Filename",
batch_ids = "batch",
condition = "condition",
down_sample = FALSE,
markers = markers
)
# Store result
saveRDS(uncorrected, file = file.path(data_dir, "uncorrected.RDS"))
In case you want more control, you can adjust any step of this approach. Feel free to skip this segment, if the method above worked just fine for you.
This example will include all possible input parameters to give an overview of what can be modified.
# Load packages
library(cyCombine)
library(tidyverse)
# Directory with FCS files
data_dir <- "~/data"
# Extract markers from panel
panel <- read_csv(file.path(data_dir, "panel.csv")) # Can also be .xlsx
metadata <- read_csv(file.path(data_dir, "metadata.csv"))
# Extract markers of interest
markers <- panel %>%
filter(Type != "none") %>%
pull(Antigen)
# Read FCS files
flowset <- compile_fcs(
data_dir = data_dir,
pattern = "\\.fcs" # Read all FCS files
)
# Convert flowset to tibble
df <- convert_flowset(
flowset = flowset,
metadata = metadata,
filename_col = "Filename",
sample_ids = "Filename", # By default the filename is used to get sample ids
batch_ids = "batch",
condition = "condition",
down_sample = TRUE,
sample_size = 2000000,
seed = 101,
panel = panel, # Can also be the filename. It is solely used to ensure the channel names match what you expect (i.e. what is in the panel_antigen column)
panel_channel = "Channel",
panel_antigen = "Antigen"
)
# Transform data - This function also de-randomizes the data
uncorrected <- transform_asinh(
df = df,
markers = markers,
cofactor = 5,
.keep = TRUE # Lets you keep all columns, in case they are useful to you
)
# Store result
saveRDS(uncorrected, file = file.path(data_dir, "uncorrected.RDS"))
Now that the data is converted to a tibble format, it is straightforward to perform batch correction with cyCombine. Again, we demonstrate two different workflows.
Besides the functionality used below, it is also possible to directly inform cyCombine about which samples are replicates. For this, please refer to the relevant section below.
Skip this segment if you are interested in more modularity.
# Load packages
library(cyCombine)
library(tidyverse)
# Load data (if not already loaded)
# uncorrected <- readRDS("data/uncorrected.RDS")
# markers <- get_markers(uncorrected)
# Batch correct
corrected <- batch_correct(
df = uncorrected,
covar = "condition",
markers = markers,
norm_method = "scale", # "rank" is recommended when combining data with heavy batch effects
rlen = 10, # Higher values are recommended if 10 does not appear to perform well
seed = 101 # Recommended to use your own random seed
)
# Save result
saveRDS(corrected, file.path(data_dir, "corrected.RDS"))
# Load packages
library(cyCombine)
library(tidyverse)
# Load data (if not already loaded)
# uncorrected <- readRDS("data/uncorrected.RDS")
# markers <- get_markers(uncorrected)
# Create cell type labels using a SOM grid (if you want to use your own labels, they can be added manually and this step should not be run)
labels <- uncorrected %>%
normalize(markers = markers,
norm_method = "rank", # "scale" is recommended in cases with light batch effects (e.g. when combining similar data)
ties.method = "average") %>% # Can also be minimum
create_som(markers = markers,
rlen = 10, # If results are not convincing, consider using a higher value (e.g. 100)
seed = 101,
xdim = 8,
ydim = 8)
# Batch correct
corrected <- correct_data(
df = uncorrected,
label = labels, # Add custom labels here, if desired
covar = "condition",
markers = markers,
parametric = TRUE,
)
# Save result
saveRDS(corrected, file.path(data_dir, "corrected.RDS"))
There are several ways to inform cyCombine which samples are replicates. One option is to add an additional column (e.g. “anchor”) to your metadata file, which can then be added as a column to your uncorrected data by using prepare_data()
. This information may also be added manually to the uncorrected data, or it is possible to provide the information directly to batch_correct()
using a vector.
To perform the batch correction, one should use batch_correct()
with the anchor
parameter, which should be set to the column name (e.g. “anchor”) or the name of the vector containing the information. cyCombine will automatically check whether the covariate and anchor are confounded with each other or the batch.
# Load packages
library(cyCombine)
library(tidyverse)
# Load data (if not already loaded)
# uncorrected <- readRDS("data/uncorrected.RDS")
# markers <- get_markers(uncorrected)
# Batch correct
corrected <- batch_correct(
df = uncorrected,
covar = "condition",
markers = markers,
norm_method = "scale", # "rank" is recommended when combining data with heavy batch effects
rlen = 10, # Higher values are recommended if 10 does not appear to perform well
seed = 101, # Recommended to use your own random seed
anchor = "anchor"
)
# Save result
saveRDS(corrected, file.path(data_dir, "corrected.RDS"))
The cyCombine package includes two performance metrics. The Earth Mover’s Distance (EMD) reduction and the Median Absolute Deviation (MAD) score
The EMD reduction is implemented as the first performance metric; EMDs are computed for both the uncorrected and corrected data, removing those values where both had an EMD < 2.
\[EMD_{reduction} = \frac{\sum_{i=1}^n {(EMD_{before_i} - EMD_{after_i})}} {\sum_{i=1}^n {EMD_{before_i}}}\]
The MAD score is implemented as the second performance metric; MADs are computed for both the uncorrected and corrected data per-cluster, per-marker, per-batch. The MAD score is then calculated as the median of the absolute difference in MAD per value:
\[MAD_{score} = \mathrm{median}_{i=1}^n (|MAD_{before_i} - MAD_{after_i}|)\]
Because the MAD score quantifies the information ‘loss’, the ideal tool has a small MAD score.
# Load packages
library(cyCombine)
library(tidyverse)
# Load data (if not already loaded)
# data_dir <- "~/data"
# uncorrected <- readRDS(file.path(data_dir, "uncorrected.RDS"))
# corrected <- readRDS(file.path(data_dir, "corrected.RDS"))
# markers <- get_markers(uncorrected)
# Re-run clustering on corrected data
labels <- corrected %>%
create_som(markers = markers,
rlen = 10)
uncorrected$label <- corrected$label <- labels
# Evaluate EMD
emd <- evaluate_emd(uncorrected, corrected, cell_col = "label")
# Reduction
emd$reduction
# Violin plot
emd$violin
# Scatter plot
emd$scatter
# Evaluate MAD
mad <- evaluate_mad(uncorrected, corrected, cell_col = "label")
# Score
mad$score
This segment will demonstrate the built-in functions for generating UMAPs and density plots.
# Load packages
library(cyCombine)
library(tidyverse)
# Load data (if not already loaded)
# data_dir <- "~/data"
# uncorrected <- readRDS(file.path(data_dir, "uncorrected.RDS"))
# corrected <- readRDS(file.path(data_dir, "corrected.RDS"))
# markers <- get_markers(uncorrected)
# Create UMAPs
sam <- sample(1:nrow(uncorrected), 30000)
plot1 <- plot_dimred(uncorrected[sam, ], "Uncorrected", type = "umap", plot = "batch", markers = markers)
plot2 <- plot_dimred(corrected[sam, ], "Corrected", type = "umap", plot = "batch", markers = markers)
plot_save_two(plot1, plot2, "figs/umap.png")
# Density plots
plot_density(uncorrected,
corrected,
markers = markers,
filename = "figs/density.png",
y = "batch",
ncol = 6,
xlim = 10)
Contact