This vignette will demonstrate the batch correction of a CyTOF dataset, where samples were measured using two different panels. Not only will batch correction be performed, but we will also impute the non-overlapping markers allowing for a much more direct integration of these data.

This is data from a study of CLL patients and healthy donors at the Dana-Farber Cancer Institute (DFCI). Protein expression was quantified using two different panels of proteins with an overlap. The data generated with each panel was processed in seven batches. The data is B-cell depleted.

Pre-processing data

In this dataset, it seems reasonable to start by looking at the two panels.

# Loading packages
library(cyCombine)
library(tidyverse)

# Set a seed for reproducibility
seed <- 123

# Set a combined size for UMAPs
umap_size = 50000

# Panel 1 reading
panel1 <- read_csv("panel1.csv")
panel1

# Panel 2 reading
panel2 <- read_csv("panel2.csv")
panel2

Now, we have the panels - so let us extract the markers and identify the overlap.

# Extracting the markers
markers1 <- panel1 %>%
  filter(Type != "none") %>%
  pull(Marker) %>%
  str_remove_all("[ _-]")

markers2 <- panel2 %>%
  filter(Type != "none") %>%
  pull(Marker) %>%
  str_remove_all("[ _-]")

# Defining overlap
overlap_markers <- intersect(markers1, markers2)
overlap_markers

 [1] "CD20"   "CD3"    "CD45RA" "CD5"    "CD19"   "CD14"   "CD33"   "CD4"   
 [9] "CD8"    "CD197"  "CD56"   "CD161"  "FoxP3"  "HLADR"  "XCL1"

We observe that there is a total of 15 overlapping markers. These span a lot of the major cell types (eg. CD3, CD4, and CD8 for T cells, CD56 for NK cells and CD14 and CD33 for myeloid cell types).

The workflow presented in this vignette can be visualized with the following schematic. Dataset a and b are the the datasets for the two panels here.

We are now ready to load the CyTOF data. We convert it to a tibble format, which is easy to process. We use derandomization and cofactor = 5 (default) for asinh-transformation.

# Preparing the expression data
dfci1 <- prepare_data(data_dir = "Panel1_renamed",
                      metadata = "metadata.csv",
                      filename_col = "FCS_name",
                      batch_ids = "Batch",
                      condition = "Set",
                      markers = markers1,
                      derand = TRUE,
                      down_sample = FALSE)

Reading 128 files to a flowSet..

Extracting expression data..

Your flowset is now converted into a dataframe.

Transforming data using asinh with a cofactor of 5..

Done!

# Preparing the expression data
dfci2 <- prepare_data(data_dir = "Panel2_renamed",
                      metadata = "metadata.csv",
                      filename_col = "FCS_name",
                      batch_ids = "Batch",
                      condition = "Set",
                      markers = markers2,
                      derand = TRUE,
                      down_sample = FALSE)

Reading 112 files to a flowSet..

Extracting expression data..

Your flowset is now converted into a dataframe.

Transforming data using asinh with a cofactor of 5..

Done!

Processing data - batch correction

Panel 1 - batch correction

In this case, the dataset for each panel is, as mentioned, run in eight batches - this means that there are likely some batch effects to correct for within each panel as well. We take of this first, before we start integrating across panels!

# Run batch correction for panel 1
dfci1_bc <- dfci1 %>%
  batch_correct(covar = "condition",
                xdim = 8,
                ydim = 8,
                norm_method = 'scale',
                markers = markers1)

dfci1_bc_som <- dfci1_bc$label

Let us have a quick look at some UMAPs to visualize the correction for each batch. We downsample so it is easier to see what is going on.

# Downsampling and making pre- and post-correction UMAPs for panel 1 data
set.seed(seed)
dfci1_sliced <- dfci1 %>%
  dplyr::slice_sample(n = umap_size)

dfci1_bc_sliced <- dfci1_sliced %>%
  select(id) %>%
  dplyr::left_join(dfci1_bc, by = "id")


# UMAP plot uncorrected
umap1 <- dfci1_sliced %>%
  plot_dimred(name = "uncorrected (panel 1)", type = "umap", markers = markers1)

# UMAP plot corrected
umap2 <- dfci1_bc_sliced %>%
  plot_dimred(name = "corrected (panel 1)", type = "umap", markers = markers1)

# Show plots
cowplot::plot_grid(umap1, umap2)

Now, let us view the expression distributions for all the markers in panel 1.

# Plot density before/after
plot_density(uncorrected = dfci1,
             corrected = dfci1_bc,
             markers = markers1,
             format = 1, ncol = 6)

Finally, we will evaluate the EMD reduction for the batch correction of panel 1. To do this, we first need to perform a clustering of the corrected set, and we will transfer the labels to the uncorrected set for direct comparison.

# Generate labels for EMD evaluation
labels <- dfci1_bc %>%
  create_som(seed = seed,
             xdim = 8,
             ydim = 8,
             markers = markers1)

# Add labels to datasets
dfci1_bc <- dfci1_bc %>%
  dplyr::mutate(som = labels)
celltype_col <- "som"

dfci1 <- dfci1_bc %>%
  dplyr::select(id, all_of(celltype_col)) %>%
  dplyr::left_join(dfci1, by = "id")


# Evaluate Earth Movers Distance
emd_val1 <- dfci1 %>%
  evaluate_emd(dfci1_bc, markers = markers1, cell_col = celltype_col)

# Show plots
cowplot::plot_grid(emd_val1$violin, emd_val1$scatterplot)

# Evaluation using MAD
mad_val1 <- dfci1 %>%
  cyCombine::evaluate_mad(dfci1_bc,
                          filter_limit = NULL,
                          markers = markers1,
                          cell_col = celltype_col)

cat('The MAD score is:', mad_val1$score, '\n')

The MAD score is: 0.02

Panel 2 - batch correction

Now, it is time to do the same with the data from panel 2. First, we batch correct:

# Run batch correction for panel 2
dfci2_bc <- dfci2 %>%
  batch_correct(covar = "condition",
                xdim = 8,
                ydim = 8,
                norm_method = 'scale',
                markers = markers2)

dfci2_bc_som <- dfci2_bc$label

We look at the UMAPs.

# Downsampling and making pre- and post-correction UMAPs for panel 2 data
set.seed(seed)
dfci2_sliced <- dfci2 %>%
  dplyr::slice_sample(n = umap_size)

dfci2_bc_sliced <- dfci2_sliced %>%
  select(id) %>%
  dplyr::left_join(dfci2_bc, by = "id")


# UMAP plot uncorrected
umap1 <- dfci2_sliced %>%
  plot_dimred(name = "uncorrected (panel 2)", type = "umap", markers = markers2)

# UMAP plot corrected
umap2 <- dfci2_bc_sliced %>%
  plot_dimred(name = "corrected (panel 2)", type = "umap", markers = markers2)


# Show plots
cowplot::plot_grid(umap1, umap2)

And we take a look at the marker distributions for panel 2

# Plot density before/after
plot_density(uncorrected = dfci2,
             corrected = dfci2_bc,
             markers = markers2,
             format = 1, ncol = 5)

Lastly, we evaluate the EMD reduction for the batch correction of panel 2 in the same manner as presented above.

# Generate labels for EMD evaluation
labels <- dfci2_bc %>%
  create_som(seed = seed,
             xdim = 8,
             ydim = 8,
             markers = markers2)

# Add labels
dfci2_bc <- dfci2_bc %>%
  dplyr::mutate(som = labels)
celltype_col <- "som"

dfci2 <- dfci2_bc %>%
  dplyr::select(id, all_of(celltype_col)) %>%
  dplyr::left_join(dfci2, by = "id")


# Evaluate Earth Movers Distance
emd_val2 <- dfci2 %>%
  evaluate_emd(dfci2_bc, markers = markers2, cell_col = celltype_col)

# Show plots
cowplot::plot_grid(emd_val2$violin, emd_val2$scatterplot)

# Evaluation using MAD
mad_val2 <- dfci2 %>%
  cyCombine::evaluate_mad(dfci2_bc,
                          filter_limit = NULL,
                          markers = markers2,
                          cell_col = celltype_col)

cat('The MAD score is:', mad_val2$score, '\n')

The MAD score is: 0.02

For both panels, we find EMD reductions of 0.66 and MAD scores of 0.02, when considering the batch corrections separately.

Combined batch correction

Based on the marker distributions after correction and the UMAPs, it looks like the batch effects within the data for each panel are minimized. Now we can focus on the integration of the two sets. The first step here, is to batch correct the datasets based on the overlapping markers.

# Generate a combined tibble for the two datasets
df <- rbind(dfci1_bc[,c(overlap_markers, 'sample', 'condition')], 
            dfci2_bc[,c(overlap_markers, 'sample', 'condition')]) %>%
  mutate('batch' = c(rep(1, nrow(dfci1_bc)), rep(2, nrow(dfci2_bc))),
         'id' = 1:(nrow(dfci1_bc)+nrow(dfci2_bc)))

# Make sample names unique (some of the same samples are stained with panel 1 and panel 2)
df$sample <- paste0(df$batch, '-', df$sample)


# Batch correct based on overlapping markers
co_corrected <- batch_correct(df,
                              covar = "condition",
                              xdim = 8,
                              ydim = 8,
                              norm_method = 'scale',
                              markers = overlap_markers)

co_corrected_som <- co_corrected$label

Similarly to the corrections within each panel’s data, we can now look at the UMAPs and marker distributions before and after correction - and we can also calculate the EMD reduction.

# Downsampling and making pre- and post-correction UMAPs for the combined data (on overlapping markers)
set.seed(seed)
df_sliced <- df %>%
  dplyr::slice_sample(n = umap_size)

co_corrected_sliced <- df_sliced %>%
  select(id) %>%
  dplyr::left_join(co_corrected, by = "id")


# UMAP plot uncorrected
umap1 <- df_sliced %>%
  plot_dimred(name = "uncorrected (combined)", type = "umap", markers = overlap_markers)

# UMAP plot corrected
umap2 <- co_corrected_sliced %>%
  plot_dimred(name = "corrected (combined)", type = "umap", markers = overlap_markers)


# Show plots
cowplot::plot_grid(umap1, umap2)

# Plot density before/after
plot_density(uncorrected = df,
             corrected = co_corrected,
             markers = overlap_markers,
             format = 2, ncol = 5)

And the EMD reduction:

# Generate labels for EMD evaluation
labels <- co_corrected %>%
  create_som(seed = seed,
             xdim = 8,
             ydim = 8,
             markers = overlap_markers)

# Add labels
co_corrected <- co_corrected %>%
  dplyr::mutate(som = labels)
celltype_col <- "som"

df <- co_corrected %>%
  dplyr::select(id, all_of(celltype_col)) %>%
  dplyr::left_join(df, by = "id")


# Evaluate Earth Movers Distance
emd_val_co <- df %>%
  evaluate_emd(co_corrected, markers = overlap_markers, cell_col = celltype_col)

# Show plots
cowplot::plot_grid(emd_val_co$violin, emd_val_co$scatterplot)

# Evaluation using MAD
mad_val_co <- df %>%
  cyCombine::evaluate_mad(co_corrected,
                          filter_limit = NULL,
                          markers = overlap_markers,
                          cell_col = celltype_col)

cat('The MAD score is:', mad_val_co$score, '\n')

The MAD score is: 0.01

For this correction, we obtain an EMD of 0.83 and an MAD score of 0.01.

Imputing non-overlapping markers

Now that we have batch corrected the overlapping markers from the two panels, they are directly comparable. However, when limiting ourselves only to the overlapping set, we also remove all information contained in the non-overlapping markers. In this case, panel 1 contains 21 markers not found in panel 2 - and panel 2 has 19 markers not found in panel 1. These markers include CD16, which is important for NK cells and monocyte distinction and Granzyme A, which is important to deeply characterize cytotoxic T cells and NK cells.

We want to include these markers in our dataset, but because the non-overlapping markers were only measured on roughly half of the cells, we have to use imputation to provide a value for the other panel’s cells.

We start by defining the sets of non-overlapping markers and then add the values for these to the batch corrected values for the overlapping markers for each panel.

# Define the non-overlaping markers
missing1 <- markers2[!markers2 %in% markers1] # 19 markers measured only in panel 2
missing2 <- markers1[!markers1 %in% markers2] # 21 markers measured only in panel 1


# Add non-overlapping markers back to df
co_corrected1 <- bind_cols(co_corrected[co_corrected$batch == 1,], dfci1_bc[,missing2])
co_corrected2 <- bind_cols(co_corrected[co_corrected$batch == 2,], dfci2_bc[,missing1])

We are now ready to impute the values for the markers unique panel 2 for the panel 1 data - and vice versa.

# Imputation for whole panels
imputed <- impute_across_panels(dataset1 = co_corrected1, 
                                dataset2 = co_corrected2,
                                overlap_channels = overlap_markers, 
                                impute_channels1 = missing1,
                                impute_channels2 = missing2)

Creating SOM grid..

Performing density draws for dataset1.

Performing density draws for dataset2.

Caution! Analysis based on imputed values can lead to false inferences. We recommend only using imputed marker expressions for visualization and not for differential expression analysis.

# Extract data for each panel
final1 <- imputed$dataset1
final2 <- imputed$dataset2

Now, we are ready to use this combined dataset to answer the biological questions and make nice visualizations. Let us look at all of the markers after batch corrections and imputation.

plot_density(final1, 
             final2, 
             y = 'Type',
             dataset_names = paste('Panel', 1:2),
             ncol = 5)

This looks very nice in terms of obtaining comparable distributions between the cells originating from each panel.

Let us make a UMAP for the combined set - based on all 55 markers (15 overlapping + 21 unique to panel 1 + 19 unique to panel 2).

# Combining dataset
final <- rbind(final1,
               final2)

# Down-sampling for plot
set.seed(seed)
final_sliced <- final %>%
  dplyr::slice_sample(n = umap_size)

# UMAP plot corrected
umap <- final_sliced %>%
  cyCombine::plot_dimred(name = "corrected", 
                         type = "umap", 
                         markers = c(overlap_markers, missing1, missing2), 
                         return_coord = T)

umap$plot

After running this process, it is possible to perform any processing one finds relevant for the dataset. The next step could be to perform clustering on the lineage markers and then performing differential abundance testing.

Contact

Batch correction of two-panel CyTOF data

Christina Bligaard Pedersen