This vignette demonstrates batch correction of a CyTOF dataset consisting of 128 samples in seven batches using cyCombine. It also includes a short discussion of the grid size used during batch correction.


This is data from a study of CLL patients and healthy donors at the Dana-Farber Cancer Institute (DFCI). Protein expression was quantified using mass cytometry for 128 samples (20 healthy donors). The samples were run in seven batches using a panel measuring the expression of 36 proteins.


Pre-processing data

We start by loading some packages.
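
For reference, a minimal setup could look like the sketch below; here we assume the tidyverse packages are used alongside cyCombine for data wrangling and plotting.

# Load cyCombine and the tidyverse (assumed here for wrangling/plotting)
library(cyCombine)
library(tidyverse)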


We are now ready to load the CyTOF data. We have set up a panel file in CSV format, so the relevant information can be extracted from there. Let us have a look at its contents:
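
A sketch of how the panel could be read and inspected is shown below; the file path and the column names (Channel, Antigen, Type) are assumptions for illustration and should be replaced with the actual layout of the panel file.

# Read the panel file (path and column names are placeholders)
panel <- readr::read_csv("data/panel.csv")
panel

# Extract the markers to use downstream (assuming a 'Type' column flags non-markers)
markers <- panel %>%
  dplyr::filter(Type != "none") %>%
  dplyr::pull(Antigen)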


We then proceed to read the CyTOF dataset and convert it to a tibble, which is easy to work with. We use the default cofactor = 5 for the asinh transformation in this case.
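
A sketch of this step is given below. The call is based on cyCombine's prepare_data(); the file paths, metadata column names, and panel column names are assumptions for illustration.

# Read the FCS files, extract expression data, and asinh-transform with cofactor 5
dfci <- prepare_data(data_dir = "data/raw",           # directory with the 128 FCS files (assumed path)
                     metadata = "data/metadata.csv",  # sample-level metadata incl. batch (assumed path)
                     filename_col = "Filename",       # assumed metadata column names
                     batch_ids = "Batch",
                     condition = "Condition",
                     panel = panel,                   # panel read above
                     panel_channel = "Channel",       # assumed panel column names
                     panel_antigen = "Antigen",
                     markers = markers,
                     down_sample = FALSE,
                     cofactor = 5)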

Reading 128 files to a flowSet..
Extracting expression data..
Your flowset is now converted into a dataframe.
Transforming data using asinh with a cofactor of 5..
Done!


Checking for batch effects

Now, let us use a cyCombine function to check whether there are any batch effects to correct for at all. cyCombine will run even on data without real batch effects, and in those cases the batch correction should have minimal effect. However, there is no reason to run the algorithm if the data contain no batch effects.
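
The check below uses the express version of the batch effect detection; the downsampling size and output directory are chosen to match the output shown, and the remaining defaults are assumed.

# Quick(er) detection of batch effects - plots are written to the out_dir
detect_batch_effect_express(dfci,
                            downsample = 10000,
                            out_dir = "batch_effect_check")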

Starting the quick(er) detection of batch effects.
Downsampling to 10000 cells.
Making distribution plots for all markers in each batch.
Saved marker distribution plots here: batch_effect_check/distributions_per_batch.png.
Applying global EMD-based batch effect detection.
Saved EMD plot here: batch_effect_check/emd_per_marker.png.
CD69 has clear outlier batch(es): 7
Summary of the distribution in the OUTLIER batch(es):
Min. = 0, 1st Qu. = 0, Median = 0.57, Mean = 0.85, 3rd Qu. = 1.25, Max. = 9.1
Summary of the distribution in the non-outlier batches:
Min. = 0, 1st Qu. = 0, Median = 0, Mean = 0.29, 3rd Qu. = 0.39, Max. = 9.24
CD4 has clear outlier batch(es): 1
Summary of the distribution in the OUTLIER batch(es):
Min. = 0, 1st Qu. = 0.2, Median = 2.7, Mean = 2.39, 3rd Qu. = 4.25, Max. = 5.19
Summary of the distribution in the non-outlier batches:
Min. = 0, 1st Qu. = 0, Median = 1.35, Mean = 1.97, 3rd Qu. = 4.05, Max. = 5.93
CD127 has clear outlier batch(es): 7
Summary of the distribution in the OUTLIER batch(es):
Min. = 0, 1st Qu. = 0, Median = 1.88, Mean = 2.09, 3rd Qu. = 2.89, Max. = 9.59
Summary of the distribution in the non-outlier batches:
Min. = 0, 1st Qu. = 0, Median = 0.2, Mean = 0.82, 3rd Qu. = 1.53, Max. = 9.27
HLADR has clear outlier batch(es): 2
Summary of the distribution in the OUTLIER batch(es):
Min. = 0, 1st Qu. = 0, Median = 0.2, Mean = 1.43, 3rd Qu. = 2.92, Max. = 7.09
Summary of the distribution in the non-outlier batches:
Min. = 0, 1st Qu. = 0, Median = 1.14, Mean = 2.13, 3rd Qu. = 4.24, Max. = 7.66
XCL1 has clear outlier batch(es): 1
Summary of the distribution in the OUTLIER batch(es):
Min. = 0, 1st Qu. = 2.14, Median = 3.32, Mean = 3.54, 3rd Qu. = 4.89, Max. = 8.45
Summary of the distribution in the non-outlier batches:
Min. = 0, 1st Qu. = 0, Median = 0.39, Mean = 1.1, 3rd Qu. = 1.35, Max. = 9.39
Making MDS plot for median protein expression per sample.
Saved MDS plot here: batch_effect_check/MDS_per_batch.png
Done!


In the printed output, we already get some pointers to potential problems with batch effects. But let us look at each of the generated plots for this dataset. First, we have the EMD-per-marker plot, which shows the mean Earth Mover's Distance across all pairwise batch-to-batch comparisons; the distribution of each marker is considered globally for each comparison, and the error bars represent the standard deviation. In this dataset, we observe a relatively high mean EMD for XCL1, and this marker also has a large standard deviation. This indicates that there may be a batch effect to consider for this marker - perhaps it is strongly over- or under-stained in one or more batches compared to the rest? According to the printed output, batch 1 is the problem.


To figure out if that is really the case, we can look at the second generated plot. This is the distribution of each marker in each batch. Quantiles are shown as vertical bars.

When looking at XCL1, we clearly observe the batch effect indicated before - batch 1 looks very different from the rest! Also, have a look at TBet. This marker had the second-highest mean EMD above, and here the distributions for batches 6 and 7 look different from the rest.


Now for the final plot - a multidimensional scaling (MDS) plot. This form of dimensionality reduction can be used to detect outlier batches (or samples) based on the median marker expression per sample. Each dot corresponds to a sample, and the colors represent batches. If there were no batch effects, the pattern would be random - and though it can be hard to judge, notice how the red batch 1 samples almost exclusively appear in the top part of the plot. In other words, it looks a bit like batch effects!


We could dwell more on this, and we could also use the full detect_batch_effect() function of cyCombine to look at the data more carefully. However, for now we are convinced that batch effects exist, and we will correct them with cyCombine.
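
A sketch of the correction call is given below. It uses batch_correct() with the default 8x8 grid; the normalization method, covariate column, and seed are assumptions for illustration.

# Batch correction using the default 8x8 SOM grid
corrected <- dfci %>%
  batch_correct(markers = markers,
                norm_method = 'scale',   # assumed normalization method
                covar = 'condition',     # assumed name of the condition column
                xdim = 8,
                ydim = 8,
                rlen = 10,
                seed = 473)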


Evaluating performance

We start with some quantitative measurements of the correction performance.
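
These numbers can be obtained with a workflow like the one in the evaluate() helper defined later in this vignette: cluster the corrected data with an 8x8 SOM, transfer the labels to the uncorrected data, and compute the EMD reduction and MAD score. The sketch below assumes that setup.

# Cluster the corrected data and transfer labels to the uncorrected data
labels <- corrected %>%
  cyCombine::create_som(rlen = 10, xdim = 8, ydim = 8, markers = markers)

corrected_eval   <- corrected %>% dplyr::mutate(som = labels)
uncorrected_eval <- dfci %>% dplyr::mutate(som = labels)

# EMD reduction and MAD score (cluster-wise evaluation)
emd_val <- uncorrected_eval %>%
  cyCombine::evaluate_emd(corrected_eval, binSize = 0.1, markers = markers, cell_col = 'som')
mad_val <- uncorrected_eval %>%
  cyCombine::evaluate_mad(corrected_eval, filter_limit = NULL, markers = markers, cell_col = 'som')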

The MAD score is: 0.02 

Let us also look at some plots to visualize the correction. First, the marker distributions before and after:
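
The density plots can be produced with plot_density(), analogously to the (commented-out) call in the evaluation function further below; the panel layout and dataset names are illustrative.

# Marker distributions before and after correction
plot_density(dfci, corrected,
             markers = markers,
             ncol = 6,
             dataset_names = c('Uncorrected', 'Corrected 8x8'))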


Finally, some UMAPs to visualize the correction. We will downsample to 5,000 cells from each batch so it is easier to see what is going on.
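
The downsampling and UMAPs can be made as sketched below, mirroring the approach used in the evaluation function later in this vignette (the seed and the arrangement with cowplot are illustrative).

# Sample 5,000 cells per batch and make UMAPs before/after correction
inds <- split(seq_along(dfci$batch), dfci$batch)
set.seed(6157)
subsample <- unlist(lapply(inds, sample, 5000))

umap_before <- plot_dimred(dfci[subsample, ], name = 'Uncorrected', type = 'umap')
umap_after  <- plot_dimred(corrected[subsample, ], name = 'Corrected 8x8', type = 'umap')
cowplot::plot_grid(umap_before, umap_after)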


Based on the marker distributions after correction and the UMAPs, it looks like the batch effects have been eliminated, and we can now address our biological questions. However, we simply used the default grid size of 8x8 in this example. The self-organizing map (SOM) clustering is used to group similar cells so that the batch correction is performed within more homogeneous sets of cells. While 64 clusters should be enough for most PBMC analyses, the default grid size may not always be appropriate.



Grid size discussion

Rare subsets can be missed when using smaller grids, depending on their specific expression patterns and how distinct the rare cell type is from the other cells in the data. This problem is not unique to cyCombine, but is rather a property of the SOM itself (which also happens to be the foundation of the widely used FlowSOM clustering approach).


If greater heterogeneity is anticipated in a dataset, we recommend increasing the grid size. Generally, one should aim for a grid size that captures the variance of the data, but strong over-clustering is not a goal in itself: it increases the risk of clusters that do not contain cells from all batches, resulting in a worse correction (or clusters containing cells from only a single batch, resulting in no correction at all). As in any analysis involving a clustering step, we recommend an initial exploration of the dataset to get an idea of its heterogeneity, and then setting the grid size accordingly when running cyCombine.


One way to get an idea of the heterogeneity in a dataset is to run a clustering tool that automatically determines the number of clusters (e.g. Phenograph) on all the markers and let the obtained number of clusters serve as a guide. Alternatively, one could use ConsensusClusterPlus (the meta-clustering engine of FlowSOM) with the elbow criterion as discussed here. Note that this investigation of clusters has to take place before applying cyCombine, and because of the batch effects it can be problematic to cluster across batches on the uncorrected dataset, as this can lead to an over-estimation of the cluster count. In this example, we normalize the data using per-batch scaling before clustering (analogously to the cyCombine workflow). We select a 10x10 grid for the initial clustering and then use ConsensusClusterPlus with up to 90 (the maximum allowed) meta-clusters.
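
A sketch of this exploration is shown below. It assumes cyCombine's normalize() helper for the per-batch scaling and clusters the median expression of the 10x10 SOM nodes with ConsensusClusterPlus; the clustering algorithm, distance, and seed settings are illustrative.

library(ConsensusClusterPlus)

# Per-batch scaling, analogous to the normalization step in the cyCombine workflow
scaled <- dfci %>%
  cyCombine::normalize(markers = markers, norm_method = 'scale')

# Initial 10x10 SOM clustering on the scaled data
som_labels <- scaled %>%
  cyCombine::create_som(rlen = 10, xdim = 10, ydim = 10, markers = markers)

# Median marker expression per SOM node (rows = markers, columns = nodes)
node_medians <- scaled %>%
  dplyr::mutate(som = som_labels) %>%
  dplyr::group_by(som) %>%
  dplyr::summarise(dplyr::across(dplyr::all_of(markers), median)) %>%
  dplyr::select(-som) %>%
  t()

# Consensus meta-clustering of the SOM nodes - the delta area plot is part of the output
cc <- ConsensusClusterPlus(node_medians,
                           maxK = 90,
                           reps = 100,
                           clusterAlg = 'hc',
                           distance = 'euclidean',
                           seed = 473,
                           plot = 'png')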


The delta area plot shows the relative change in cluster stability (area under the CDF curve) when adding one extra meta-cluster. The goal is to choose a k at which there is no longer a substantial increase. In this case, that occurs around k = 25 or k = 30, depending on how strict the requirement is. In any case, this indicates that 64 SOM nodes should be enough to capture the total variance of the data for batch correction.



Testing different grid sizes

One could also address the question of the optimal grid size by simply trying multiple sizes. We will demonstrate the results for 2x2, 4x4, 12x12, and 16x16 grids below (8x8 was used above). We will also time each correction:
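
A sketch of these runs is shown below; the batch_correct() arguments mirror the 8x8 correction above, and system.time() is used for the timing (the covariate and seed remain assumptions).

# Run the correction for a given grid size and report the elapsed time
run_correction <- function(df, grid, markers) {
  runtime <- system.time({
    corrected <- df %>%
      cyCombine::batch_correct(markers = markers,
                               norm_method = 'scale',
                               covar = 'condition',
                               xdim = grid,
                               ydim = grid,
                               rlen = 10,
                               seed = 473)
  })
  message('For the ', grid, 'x', grid, ' grid, the runtime was: ',
          round(runtime[['elapsed']] / 60, 2), ' minutes')
  return(corrected)
}

corrected2  <- run_correction(dfci, 2, markers)
corrected4  <- run_correction(dfci, 4, markers)
corrected12 <- run_correction(dfci, 12, markers)
corrected16 <- run_correction(dfci, 16, markers)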


Now, we can evaluate the performance using the EMD reduction and MAD score.

# Make a function for evaluation
evaluate <- function(uncorrected, corrected, gridsize, markers, celltype_col = 'som') {
  
  # Cluster corrected data (still using 8x8 so we can compare directly to the 8x8 grid correction)
  labels <- corrected %>%
            cyCombine::create_som(rlen = 10,
                                  xdim = 8,
                                  ydim = 8,
                                  markers = markers)
  # Add labels
  corrected <- corrected %>%
    dplyr::mutate(som = labels)
  uncorrected <- uncorrected %>%
    dplyr::mutate(som = labels)
  
  # Evaluation using EMD
  emd_val <- uncorrected %>%
        cyCombine::evaluate_emd(corrected,
                                binSize = 0.1,
                                markers = markers,
                                cell_col = celltype_col)
  
  # Evaluation using MAD
  mad_val <- uncorrected %>%
        cyCombine::evaluate_mad(corrected,
                                filter_limit = NULL,
                                markers = markers,
                                cell_col = celltype_col)

  # UMAP - 5,000 cells per batch
  inds <- split(seq_along(uncorrected$batch), uncorrected$batch)
  set.seed(6157)
  sample_idx <- unlist(lapply(inds, sample, 5000))
  
  plot1 <- plot_dimred(uncorrected[sample_idx, ], name = 'Uncorrected ', type = 'umap')
  plot2 <- plot_dimred(corrected[sample_idx, ], name = paste0('Corrected ', gridsize, 'x', gridsize), type = 'umap')
  
  # umaps <- cowplot::plot_grid(plot1, plot2)
  
  # # Density plots
  # densities <- plot_density(dfci, corrected2, ncol = 6, markers = markers,
  #                           dataset_names = c('Uncorrected', paste0('Corrected ', gridsize, 'x', gridsize)))

  
  return(list('emd_val' = emd_val, 'mad_val' = mad_val, 'umaps' = list(plot1, plot2)))#, 'densities' = densities))
}

# Use function to evaluate
eval2 <- evaluate(dfci, corrected2, gridsize = 2, markers)
eval4 <- evaluate(dfci, corrected4, gridsize = 4, markers)
eval12 <- evaluate(dfci, corrected12, gridsize = 12, markers)
eval16 <- evaluate(dfci, corrected16, gridsize = 16, markers)
For the 2x2 grid, the EMD reduction is: 0.48 and the MAD score is: 0 
For the 4x4 grid, the EMD reduction is: 0.6 and the MAD score is: 0.01 
For the 8x8 grid, the EMD reduction is: 0.66 and the MAD score is: 0.02 
For the 12x12 grid, the EMD reduction is: 0.68 and the MAD score is: 0.02 
For the 16x16 grid, the EMD reduction is: 0.67 and the MAD score is: 0.02 


What we observe here is that even with the very small 2x2 grid, there is still a reduction in the batch effects for this dataset, but it is not as strong as for the other settings. For 4x4, the reduction is slightly worse than for the 8x8 grid, which was the setting actually used in the study, while 12x12 and 16x16 appear to be marginally better, with higher EMD reductions and the same MAD score as the 8x8 grid. However, the differences between 8x8, 12x12, and 16x16 are very small.


We can also compare the runtimes of the different settings.

For the 2x2 grid, the runtime was: 2.46 minutes
For the 4x4 grid, the runtime was: 3.01 minutes
For the 8x8 grid, the runtime was: 4.68 minutes
For the 12x12 grid, the runtime was: 10.6 minutes
For the 16x16 grid, the runtime was: 16.32 minutes


The runtimes do, however, increase with larger grid sizes, and since there is no substantial performance gain from increasing the grid size beyond 8x8, we have used 8x8 for the analyses in the cyCombine article. We can also conclude that when a dataset consists of a large number of cells, the grid size is not a highly important parameter, so the user does not need to worry too much about finding an optimal value.


We have included the UMAPs for the different settings below:
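
The stored UMAPs from the evaluation objects can be arranged with cowplot, for example:

# Arrange the uncorrected/corrected UMAPs for each tested grid size
cowplot::plot_grid(eval2$umaps[[1]],  eval2$umaps[[2]],
                   eval4$umaps[[1]],  eval4$umaps[[2]],
                   eval12$umaps[[1]], eval12$umaps[[2]],
                   eval16$umaps[[1]], eval16$umaps[[2]],
                   ncol = 2)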

 

Contact