This vignette will demonstrate the performance benchmarking applied in the cyCombine article.


Performance metrics - discussion

In the cyCombine article, the Earth Mover’s Distance (EMD) reduction is used as the most important performance metric. The EMD reduction relies on the EMD, which is a distance measure for probability distributions. As with most metrics, the EMD reduction has both advantages and disadvantages. First, the formula:

\[EMD_{reduction} = \frac{\sum\limits_{i=1}^{n}(EMD_{before_i} - EMD_{after_i})}{\sum\limits_{i=1}^{n}EMD_{before_i}}\] Where n is the total number of comparisons (number of SOM nodes times the number of markers times the number of pairwise batch comparisons).

As mentioned in the article, the EMD was used to compare the distribution of each marker within SOM nodes across batches. The metric relies on binning the values at 0.1-increments (default) and computes the EMD for every marker in each pairwise batch comparison. This means that if you have seven batches and 15 markers, you have 21 (7-choose-2) * 15 = 315 EMD calculations per SOM node. The choice to calculate the EMD per SOM node deserves some discussion: if you calculate the EMD globally, across all cells, you essentially assume that the distribution of cell types is the same between batches - i.e. that you have an equal fraction of CD8+ cells in each batch. This may not hold true. Instead, when clustering (in this case by the use of a SOM) is applied, you avoid such assumptions. We use a relatively large SOM of 8x8 = 64 nodes, so we are likely to over-cluster the dataset in question. This is done to compare only highly similar cells when the EMDs are calculated. This approach has potential problems, which we will return to in the next section.
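
To make this concrete, here is a minimal sketch of a binned 1D EMD and the resulting reduction. This is an illustration under the assumptions above (0.1-increment bins), not cyCombine's internal implementation; the emd_1d helper and the toy values are ours.

# Minimal sketch: EMD between two binned 1D distributions (bin size 0.1)
emd_1d <- function(x, y, bin_size = 0.1) {
  breaks <- seq(floor(min(x, y)), ceiling(max(x, y)), by = bin_size)
  p <- hist(x, breaks = breaks, plot = FALSE)$counts
  q <- hist(y, breaks = breaks, plot = FALSE)$counts
  # For 1D distributions, the EMD is the area between the two CDFs
  sum(abs(cumsum(p / sum(p)) - cumsum(q / sum(q)))) * bin_size
}

# Toy usage: one marker in two batches with a shifted positive peak
set.seed(1)
emd_1d(rnorm(1000, mean = 2), rnorm(1000, mean = 2.5))

# The EMD reduction then aggregates such values over all nodes, markers,
# and batch pairs (emd_before and emd_after would be vectors of these):
# reduction <- sum(emd_before - emd_after) / sum(emd_before)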


In practice, we perform the SOM clustering on the corrected data, regardless of the batch correction tool. The idea behind this is that batch effects should be removed after correction, allowing for more accurate clustering. The assigned SOM node for each cell is then transferred from the corrected data to the uncorrected data, allowing for direct comparison of single cells before and after batch correction.

This does, however, have implications for the analysis, including the fact that the EMD values for an uncorrected dataset will vary between batch correction methods: depending on the correction, cells may be clustered differently, leading to different EMD computations in the uncorrected data.


Why do we do this? Because the clustering of uncorrected data might not be ‘correct’ - and while we make the assumption that scaling/ranking is necessary to align populations, not all the tested tools apply such measures.

Another approach could be to use pre-defined cell labels for each cell instead of a SOM clustering and then quantify the batch effect with EMD calculations within these cell groups. This relies strongly on the user providing a robust labeling before testing and can also be considered biased. However, once labels are in place, this solution is quick to implement and also lowers the computational run time, as shown in the sketch below.
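
A hedged sketch of this variant, reusing the evaluate_emd call from the benchmarking section below and assuming both tibbles carry a hypothetical 'label' column with the user's annotations:

# Skip the SOM step and evaluate the EMD within pre-defined labels
# ('label' is a hypothetical column holding the user's cell annotations)
emd_val <- uncorrected %>%
  cyCombine::evaluate_emd(corrected,
                          binSize = 0.1,
                          markers = markers,
                          cell_col = "label")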


However, there are some more general caveats to using the EMD reduction term, which we should also touch upon.


EMD caveats

We will split this discussion into three parts:

  1. Cluster-wise vs. global EMD
  2. Small batch effects => low reduction
  3. Systemic information loss => high reduction


Regarding the use of cluster-wise vs. global EMD, we have already discussed the issues that prevent the use of global EMD calculations. However, there may be cases where cluster-wise calculations provide overly good results. This can happen if we, after correction, still have a very batch-driven SOM-clustering. That would mean that each cluster mostly contains cells from single batch, and the EMD calculations between batches will be based on only few cells from some batches.
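
A small diagnostic sketch: tabulate the batch composition of each SOM node after correction to flag nodes dominated by a single batch. This assumes a corrected tibble with the 'som' and 'batch' columns used elsewhere in this vignette.

# Flag SOM nodes dominated by a single batch
corrected %>%
  dplyr::count(som, batch) %>%
  dplyr::group_by(som) %>%
  dplyr::summarise(max_batch_frac = max(n / sum(n))) %>%
  dplyr::arrange(dplyr::desc(max_batch_frac))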


For small batch effects, one may achieve very small EMD reductions, since this is a relative measure comparing batch effects before and after correction. Essentially, a tool should not correct a dataset with no batch effects, so a low EMD reduction can be considered a mark of quality - but at first glance it may seem like a tool is not performing well. When comparing corrections of the same dataset across multiple tools, this is not a large problem. But in real use cases, the EMD reduction should not stand alone: one should always inspect some plots to judge A) whether batch correction is necessary (see our detection module) and B) whether the batch correction worked.


The last case is a tricky one, which we demonstrate using a small synthetic example. We consider only a single marker, 'exp'. Looking at it before correction, we see that the positive peak is misaligned between the two batches:


You then perform batch correction using two different tools, which produce the following two plots after correction:

## The EMD for tool 1 is 2.28
## The EMD for tool 2 is 0.1
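
The data behind these numbers is not shown here, but the two failure modes can be mimicked with a toy simulation. A sketch reusing the emd_1d helper from above; the simulated values are illustrative only and will not reproduce the numbers printed above.

# Tool 1: the positive peak remains misaligned between the two batches
set.seed(42)
tool1_b1 <- c(rnorm(500, 0, 0.3), rnorm(500, 3, 0.4))
tool1_b2 <- c(rnorm(500, 0, 0.3), rnorm(500, 4, 0.4))

# Tool 2: all variation is squeezed into one uninformative peak
tool2_b1 <- rnorm(1000, 1.5, 0.1)
tool2_b2 <- rnorm(1000, 1.5, 0.1)

emd_1d(tool1_b1, tool1_b2) # large EMD despite 'correction'
emd_1d(tool2_b1, tool2_b2) # tiny EMD, but the biology is gone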


You calculate the EMD between the two batches for both tools. Tool 1 yields a much larger EMD - but looking at the plot for tool 2, it is clear that this is only because of systemic information loss: tool 2 has removed all the potentially interesting variation for the marker in question. This is yet another reason to always inspect some plots when performing batch correction.

Because of this, we introduced a second performance metric, the Median Absolute Deviation (MAD) score. This score is meant to detect potential systemic information loss by quantifying the variability of each marker in the dataset before and after correction. In practice, it is calculated very similarly to the EMD reduction: after a SOM-based clustering, the MAD is calculated per cluster, per marker, per batch. So, the MAD is calculated per batch, whereas the EMD calculations are performed for each pairwise batch-batch comparison. This means that the MAD score quantifies intra-batch effects of the correction, while the EMD reduction quantifies inter-batch effects. After calculating the MAD per cluster, per marker, per batch for both the corrected and uncorrected datasets, the MAD score is computed as the median of the absolute differences in MAD, value by value:


\[MAD_{score} = \mathrm{median}_{i=1}^n (|MAD_{before_i} - MAD_{after_i}|)\] Where n is the total number of comparisons (number of SOM nodes times the number of markers times the number of batches).


Because the MAD score quantifies the information ‘loss’, the ideal tool has a small MAD score.
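
A minimal sketch of this computation, mirroring the definition above (not cyCombine's internal code), assuming uncorrected and corrected tibbles with the 'som' and 'batch' columns used in the benchmarking section below:

# MAD per SOM node, per marker, per batch
mad_per_group <- function(df, markers) {
  df %>%
    dplyr::group_by(som, batch) %>%
    dplyr::summarise(dplyr::across(dplyr::all_of(markers), mad),
                     .groups = "drop") %>%
    tidyr::pivot_longer(dplyr::all_of(markers),
                        names_to = "marker", values_to = "mad")
}

# The MAD score is the median absolute difference, value by value
mads <- dplyr::inner_join(mad_per_group(uncorrected, markers),
                          mad_per_group(corrected, markers),
                          by = c("som", "batch", "marker"),
                          suffix = c("_before", "_after"))
mad_score <- median(abs(mads$mad_before - mads$mad_after))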


In the simple example, the MAD score calculations for tools 1 and 2 would be:

## The MADs for the uncorrected set are 0.78 and 0.91
## The MADs for tool 1 are 0.77 and 0.94
## The MADs for tool 2 are 0.49 and 0.51
## The MAD score for tool 1 is 0.02
## The MAD score for tool 2 is 0.35


Here, it is clear that the MAD score efficiently detects the systemic information loss of tool 2, which has a much higher MAD score. Tool 1 has a MAD score very close to 0, which fits our idea that this tool does not remove biological variation, but only corrects batch effects. For calculating the MAD score on a real dataset, we recommend using the evaluate_mad function.
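
A usage sketch with the objects defined in the benchmarking section below; the argument names are assumed to mirror the evaluate_emd interface, so consult the cyCombine documentation for the exact signature.

# Assumed interface, analogous to evaluate_emd
mad_val <- uncorrected %>%
  cyCombine::evaluate_mad(corrected,
                          markers = markers,
                          cell_col = celltype_col)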


With the EMD reduction and the MAD score, we have provided two quantitative measures for batch effect and information removal. They each hold some strong positive qualities for this purpose and could prove useful for many future studies on the topic. We would also like to mention the approach presented in BatchBench, which relies on entropy metrics. This is a strong alternative to the EMD/MAD approaches presented here.



Different tools - discussion

We compare the performance of cyCombine to that of four other tools: CytoNorm, CytofRUV, iMUBAC, and CytofBatchAdjust.

All tools are written in R, and with the exception of CytofBatchAdjust, they are published as installable packages. CytofBatchAdjust has to be loaded from a script.


Three of the tools (CytoNorm, CytofRUV, and CytofBatchAdjust) require technical replicates to perform batch correction. CytoNorm only performs correction of the non-replicated samples, and as such assumes that the technical replicates are only included for technical purposes. iMUBAC, conversely, excludes patient samples from the actual correction and instead has a scheme for projecting the patient samples into the corrected 'healthy' space post-correction. iMUBAC also downsamples in the batch correction step, as it would otherwise be too computationally heavy to run. Similarly, CytofRUV is relatively memory-heavy, but it does not require downsampling for any of our tested datasets.


CytoNorm, CytofRUV, and iMUBAC all rely on clustering of the data before performing cluster-wise corrections. This is not the case for CytofBatchAdjust, which performs global corrections. This means that CytofBatchAdjust does not account for the fact that batch effects may affect different cell types in different ways.



Benchmarking: CytoNorm data

CytoNorm

# Define the samples to correct and the reference samples
train_data <- meta %>% 
  dplyr::filter(startsWith(`Sample ID`, "Control1"))
val_data <- meta %>% 
  dplyr::filter(startsWith(`Sample ID`, "Control2"))


# Code for setting up the run - taken from the CytoNorm GitHub
transformList <- flowCore::transformList(panel$channel,
                                         cytofTransform)

transformList.reverse <- flowCore::transformList(panel$channel,
                                                 cytofTransform.reverse)


# Building the model for correction using Control 1
model <- CytoNorm.train(files = paste0(data_dir, "/", train_data$FCS_name, ".fcs"),
                        labels = train_data$Batch,
                        channels = panel$channel,
                        transformList = transformList,
                        FlowSOM.params = list(nCells = 2842208, # All cells in the Control1 samples
                                              xdim = 10,
                                              ydim = 10,
                                              nClus = 25,
                                              scale = FALSE),
                        normMethod.train = QuantileNorm.train,
                        normParams = list(nQ = 101,
                                          goal = "mean"),
                        seed = 101,
                        verbose = TRUE)

# Apply model to Control 2
CytoNorm.normalize(model = model,
                   files = paste0(data_dir, "/", val_data$FCS_name, ".fcs"),
                   labels = val_data$Batch,
                   transformList = transformList,
                   transformList.reverse = transformList.reverse,
                   normMethod.normalize = QuantileNorm.normalize,
                   outputDir = 'Normalized',
                   prefix = "Norm_",
                   clean = TRUE,
                   verbose = TRUE)


# Read corrected data - CytoNorm outputs FCS files, so we have to read them into R
corrected <- prepare_data(data_dir = 'Normalized',
                          metadata = "Normalized/cytonorm_cohort.xlsx",
                          filename_col = "FCS_name",
                          batch_ids = "Batch",
                          condition = "Set",
                          sample_ids = "Sample ID",
                          markers = markers,
                          down_sample = FALSE,
                          cofactor = 5)


# Because only the Control2 samples are corrected, we evaluate the correction for those samples only
uncorrected_down <- uncorrected %>% 
  dplyr::select(-any_of(celltype_col)) %>%
  dplyr::filter(sample %in% c("Control2_Unstim", "Control2_IFNa_LPS"))

# Capping values at 300 to avoid very large values
corrected_down <- corrected %>%
  dplyr::mutate(id = uncorrected_down$id) %>%
  dplyr::mutate_at(dplyr::vars(all_of(markers)),
                   function(x) {
                     x[x > 300] <- 300
                     return(x)
                   })


# Cluster on corrected data
labels <- corrected_down %>%
          cyCombine::create_som(rlen = 10,
                                xdim = 8,
                                ydim = 8,
                                markers = markers)

# Add labels
corrected_down <- corrected_down %>%
  dplyr::mutate(som = labels)

# Transfer labels to uncorrected data
uncorrected_down <- corrected_down %>%
  dplyr::select(id, all_of(celltype_col)) %>%
  dplyr::left_join(uncorrected_down, by = "id")

# Evaluation using EMD
emd_val <- uncorrected_down %>%
      cyCombine::evaluate_emd(corrected_down,
                              binSize = 0.1,
                              markers = markers,
                              cell_col = celltype_col)

# Show plots
cowplot::plot_grid(emd_val$violin, emd_val$scatterplot)

The MAD score is: 0.03 



CytofRUV

We run CytofRUV with k = 5, after having tested the performance for values of 5, 10, 15, and 20, and concluding that the choice of 5 yields the most biologically meaningful results.

The MAD score is: 0.11 



iMUBAC

The MAD score is: 0.02 



CytofBatchAdjust

CytofBatchAdjust is not implemented as an R package; instead, the function is found in a script that users must download. Furthermore, it is written to handle only filenames that follow specific patterns. Consequently, we have made a folder with the datasets specifically for this tool. We start by taking a look at the contents to get an idea of the filenames used.
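
A minimal way to do so (data_dir as defined for this benchmark):

# List the renamed files in the folder prepared for CytofBatchAdjust
list.files(data_dir)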

# Load script for CytofBatchAdjust
source("BatchAdjust.R")

# Run CytofBatchAdjust - filenames are used to specify anchors etc.
BatchAdjust(
  basedir = data_dir,
  outdir = out_dir,
  channelsFile = file.path(data_dir, "ChannelsToAdjust.txt"),
  batchKeyword = "Batch", # Adjustments for all batches are relative to batch 1, and samples in batch 1 are not changed. Therefore, one batch must be labeled as batch 1, and the reference batch should not be 'an outlier'.
  anchorKeyword = "Anchor", # One sample from each batch must also contain the anchor keyword, to indicate the control sample expected to be consistent across batches.
  method = "quantile",
  transformation = FALSE,
  addExt = '',
  plotDiagnostics = FALSE)


# Read fcs files - before and after correction (we read the before-files once more because 
# of the different file ordering after renaming files for CytofBatchAdjust)
uncorrected <- prepare_data(data_dir = data_dir,
                            metadata = "cytonorm_cohort.xlsx",
                            filename_col = "FCS_name",
                            batch_ids = "Batch",
                            sample_ids = "FCS_name",
                            markers = markers,
                            down_sample = FALSE,
                            cofactor = 5)

corrected <- prepare_data(data_dir = out_dir,
                          metadata = file.path(data_dir, 'cytonorm_cohort.xlsx'),
                          filename_col = "FCS_name",
                          batch_ids = "Batch",
                          sample_ids = "FCS_name",
                          markers = markers,
                          down_sample = FALSE,
                          cofactor = 5)

# CytofBatchAdjust introduces some Inf values in the corrected data - they inhibit the SOM/EMD calculations, so we cap them to a large value
corrected[,markers][corrected[,markers] > 300] <- 300


# Cluster on corrected data
labels <- corrected %>%
          cyCombine::create_som(rlen = 10,
                                xdim = 8,
                                ydim = 8,
                                markers = markers)
# Add labels
corrected <- corrected %>%
  dplyr::mutate(som = labels)

# Transfer labels to uncorrected data
uncorrected <- corrected %>%
  dplyr::select(id, all_of(celltype_col)) %>%
  dplyr::left_join(uncorrected, by = "id")

# Evaluation using EMD
emd_val <- uncorrected %>%
      cyCombine::evaluate_emd(corrected,
                              binSize = 0.1,
                              markers = markers,
                              cell_col = celltype_col)

# Show plots
cowplot::plot_grid(emd_val$violin, emd_val$scatterplot)

The MAD score is: 0.01 


Now that we have run all five tools, we can conclude that cyCombine achieves the largest EMD reduction (0.84) on the CytoNorm dataset.


Visualizations

We can also look at the data using dimensionality reductions and density plots to make the results more visual.
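
For the density plots, a hedged sketch using cyCombine's plotting helper; the exact arguments are assumptions, so see the package documentation.

# Marker distributions per batch, before vs. after correction
cyCombine::plot_density(uncorrected, corrected, ncol = 4)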


Looking at this set of plots, we observe that CytofRUV has a slight tendency towards truncation of the distributions (e.g. for CD25), which is also reflected in the MAD value for this tool. iMUBAC tends to move some distributions for all batches (e.g. p38 and STAT3). CytofBatchAdjust (CBA) also has an issue for STAT3 and MAPKAPK2, where it appears to introduce an extra peak.


Finally, let us look at some UMAPs for the different batch correction tools.
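
A hedged sketch of how such UMAPs can be generated with cyCombine's helper (interface assumed; see the package documentation):

# UMAP of uncorrected vs. corrected data
p1 <- cyCombine::plot_dimred(uncorrected, name = "Uncorrected", type = "umap")
p2 <- cyCombine::plot_dimred(corrected, name = "Corrected", type = "umap")
cowplot::plot_grid(p1, p2)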

 

Contact