This vignette will demonstrate the batch correction of a CyTOF dataset, where samples were measured using two different panels. Not only will batch correction be performed, but we will also impute the non-overlapping markers allowing for a much more direct integration of these data.
This is data from a study of CLL patients and healthy donors at the Dana-Farber Cancer Institute (DFCI). Protein expression was quantified using two different panels of proteins with an overlap. The data generated with each panel was processed in seven batches. The data is B-cell depleted.
In this dataset, it seems reasonable to start by looking at the two panels.
# Loading packages
library(cyCombine)
library(tidyverse)
# Set a seed for reproducibility
seed <- 123
# Set a combined size for UMAPs
umap_size = 50000Now, we have the panels - so let us extract the markers and identify the overlap.
# Extracting the markers
markers1 <- panel1 %>%
filter(Type != "none") %>%
pull(Marker) %>%
str_remove_all("[ _-]")
markers2 <- panel2 %>%
filter(Type != "none") %>%
pull(Marker) %>%
str_remove_all("[ _-]")
# Defining overlap
overlap_markers <- intersect(markers1, markers2)
overlap_markers [1] "CD20" "CD3" "CD45RA" "CD5" "CD19" "CD14" "CD33" "CD4"
[9] "CD8" "CD197" "CD56" "CD161" "FoxP3" "HLADR" "XCL1"
We observe that there is a total of 15 overlapping markers. These span a lot of the major cell types (eg. CD3, CD4, and CD8 for T cells, CD56 for NK cells and CD14 and CD33 for myeloid cell types).
The workflow presented in this vignette can be visualized with the following schematic. Dataset a and b are the the datasets for the two panels here.
We are now ready to load the CyTOF data. We convert it to a tibble format, which is easy to process. We use derandomization and cofactor = 5 (default) for asinh-transformation.
# Preparing the expression data
dfci1 <- prepare_data(data_dir = "Panel1_renamed",
metadata = "metadata.csv",
filename_col = "FCS_name",
batch_ids = "Batch",
condition = "Set",
markers = markers1,
derand = TRUE,
down_sample = FALSE)Reading 128 files to a flowSet..
Extracting expression data..
Your flowset is now converted into a dataframe.
Transforming data using asinh with a cofactor of 5..
Done!
# Preparing the expression data
dfci2 <- prepare_data(data_dir = "Panel2_renamed",
metadata = "metadata.csv",
filename_col = "FCS_name",
batch_ids = "Batch",
condition = "Set",
markers = markers2,
derand = TRUE,
down_sample = FALSE)Reading 112 files to a flowSet..
Extracting expression data..
Your flowset is now converted into a dataframe.
Transforming data using asinh with a cofactor of 5..
Done!
In this case, the dataset for each panel is, as mentioned, run in eight batches - this means that there are likely some batch effects to correct for within each panel as well. We take of this first, before we start integrating across panels!
# Run batch correction for panel 1
dfci1_bc <- dfci1 %>%
batch_correct(covar = "condition",
xdim = 8,
ydim = 8,
norm_method = 'scale',
markers = markers1)
dfci1_bc_som <- dfci1_bc$labelLet us have a quick look at some UMAPs to visualize the correction for each batch. We downsample so it is easier to see what is going on.
# Downsampling and making pre- and post-correction UMAPs for panel 1 data
set.seed(seed)
dfci1_sliced <- dfci1 %>%
dplyr::slice_sample(n = umap_size)
dfci1_bc_sliced <- dfci1_sliced %>%
select(id) %>%
dplyr::left_join(dfci1_bc, by = "id")
# UMAP plot uncorrected
umap1 <- dfci1_sliced %>%
plot_dimred(name = "uncorrected (panel 1)", type = "umap", markers = markers1)
# UMAP plot corrected
umap2 <- dfci1_bc_sliced %>%
plot_dimred(name = "corrected (panel 1)", type = "umap", markers = markers1)
# Show plots
cowplot::plot_grid(umap1, umap2)Now, let us view the expression distributions for all the markers in panel 1.
# Plot density before/after
plot_density(uncorrected = dfci1,
corrected = dfci1_bc,
markers = markers1,
format = 1, ncol = 6)