This vignette demonstrates batch correction of a CyTOF dataset consisting of 128 samples in seven batches using cyCombine. It also includes a short discussion of the grid size used during batch correction.


The data come from a study of CLL patients and healthy donors at the Dana-Farber Cancer Institute (DFCI). Protein expression was quantified using mass cytometry for 128 samples (20 healthy donors). The samples were run in seven batches with a panel measuring the expression of 36 proteins.


Pre-processing data

We start by loading some packages.


We are now ready to load the CyTOF data. We have set up a panel file in csv format, so the correct information is extractable from there. Let us have a look at the contents:


We then progress with reading the CyTOF dataset and converting it to a tibble format, which is easy to process. We use cofactor = 5 (default) in this case.

Reading 128 files to a flowSet..
Extracting expression data..
Your flowset is now converted into a dataframe.
Transforming data using asinh with a cofactor of 5..
Done!
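
To make the transformation step concrete, here is a minimal sketch of an asinh transform with a cofactor of 5, written in plain Python rather than with cyCombine itself. The function name `asinh_transform` and the input values are made up for illustration, and cyCombine's own implementation may include additional steps that are not shown here.

```python
import math


def asinh_transform(values, cofactor=5.0):
    """Divide each raw intensity by the cofactor, then apply asinh."""
    return [math.asinh(v / cofactor) for v in values]


# Hypothetical raw intensities for one marker (not taken from the DFCI data).
raw_intensities = [0.0, 5.0, 50.0, 500.0]
transformed = asinh_transform(raw_intensities)

for raw, value in zip(raw_intensities, transformed):
    print(f"raw = {raw:7.1f}  ->  asinh(raw / 5) = {value:.3f}")
```

The transform is roughly linear for intensities near zero and roughly logarithmic for large intensities. A smaller cofactor moves the switch between the two regimes to lower intensities.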


Checking for batch effects

Now, let us use a cyCombine function to check whether there are any batch effects to correct at all. cyCombine will run on data even when there are no real batch effects, and in those cases the batch correction should have minimal effect. However, there is no reason to run the algorithm if the data contain no batch effects.

Starting the quick(er) detection of batch effects.
Downsampling to 10000 cells.
Making distribution plots for all markers in each batch.
Saved marker distribution plots here: batch_effect_check/distributions_per_batch.png.
Applying global EMD-based batch effect detection.
Saved EMD plot here: batch_effect_check/emd_per_marker.png.
CD69 has clear outlier batch(es): 7
Summary of the distribution in the OUTLIER batch(es):
Min. = 0, 1st Qu. = 0, Median = 0.57, Mean = 0.85, 3rd Qu. = 1.25, Max. = 9.1
Summary of the distribution in the non-outlier batches:
Min. = 0, 1st Qu. = 0, Median = 0, Mean = 0.29, 3rd Qu. = 0.39, Max. = 9.24
CD4 has clear outlier batch(es): 1
Summary of the distribution in the OUTLIER batch(es):
Min. = 0, 1st Qu. = 0.2, Median = 2.7, Mean = 2.39, 3rd Qu. = 4.25, Max. = 5.19
Summary of the distribution in the non-outlier batches:
Min. = 0, 1st Qu. = 0, Median = 1.35, Mean = 1.97, 3rd Qu. = 4.05, Max. = 5.93
CD127 has clear outlier batch(es): 7
Summary of the distribution in the OUTLIER batch(es):
Min. = 0, 1st Qu. = 0, Median = 1.88, Mean = 2.09, 3rd Qu. = 2.89, Max. = 9.59
Summary of the distribution in the non-outlier batches:
Min. = 0, 1st Qu. = 0, Median = 0.2, Mean = 0.82, 3rd Qu. = 1.53, Max. = 9.27
HLADR has clear outlier batch(es): 2
Summary of the distribution in the OUTLIER batch(es):
Min. = 0, 1st Qu. = 0, Median = 0.2, Mean = 1.43, 3rd Qu. = 2.92, Max. = 7.09
Summary of the distribution in the non-outlier batches:
Min. = 0, 1st Qu. = 0, Median = 1.14, Mean = 2.13, 3rd Qu. = 4.24, Max. = 7.66
XCL1 has clear outlier batch(es): 1
Summary of the distribution in the OUTLIER batch(es):
Min. = 0, 1st Qu. = 2.14, Median = 3.32, Mean = 3.54, 3rd Qu. = 4.89, Max. = 8.45
Summary of the distribution in the non-outlier batches:
Min. = 0, 1st Qu. = 0, Median = 0.39, Mean = 1.1, 3rd Qu. = 1.35, Max. = 9.39
Making MDS plot for median protein expression per sample.
Saved MDS plot here: batch_effect_check/MDS_per_batch.png
Done!


In the printed output, we already get some pointers to potential problems with batch effects. But let us look at each of the plots for this dataset. First we have the EMD-per-marker plot, which shows the mean Earth Mover's Distance (EMD) across all pairwise batch-to-batch comparisons. For each comparison, the distribution of each marker is considered globally, that is, across all cells in a batch. The error bars represent the standard deviation. In this dataset, we observe a relatively high mean EMD for XCL1, and this marker also has a large standard deviation. This indicates that there may be a batch effect to consider for this marker - perhaps it is markedly over- or under-stained in one or more batches compared to the rest? According to the printed output, batch 1 is the outlier.
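
The summary described above can be sketched for a single marker as follows. This is an illustrative Python sketch, not cyCombine's code: the helper names `emd_1d` and `pairwise_emd_summary` and the toy values are assumptions, and the EMD is computed directly on equal-sized samples, whereas cyCombine's actual computation may differ (for example by binning the values first).

```python
import itertools
import statistics


def emd_1d(a, b):
    """Earth Mover's Distance between two equal-sized 1-D samples.

    For equal sample sizes this equals the mean absolute difference
    between the sorted values of the two samples.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-sized samples")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def pairwise_emd_summary(batches):
    """Mean and standard deviation of the EMD over all batch pairs."""
    distances = [
        emd_1d(batches[i], batches[j])
        for i, j in itertools.combinations(sorted(batches), 2)
    ]
    return statistics.mean(distances), statistics.stdev(distances)


# Toy expression values for one marker in three batches (made up);
# batch 1 is shifted upwards relative to batches 2 and 3.
toy_batches = {
    1: [2.0, 3.0, 4.0, 5.0],
    2: [0.0, 0.0, 1.0, 1.0],
    3: [0.0, 0.0, 1.0, 2.0],
}

mean_emd, sd_emd = pairwise_emd_summary(toy_batches)
print(f"mean EMD = {mean_emd:.2f}, standard deviation = {sd_emd:.2f}")
```

A single shifted batch raises the mean, because every comparison involving that batch has a large EMD. It also raises the standard deviation, because the comparisons among the remaining batches stay small. That is the pattern described for XCL1.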


To figure out if that is really the case, we can look at the second generated plot. This is the distribution of each marker in each batch. Quantiles are shown as vertical bars.

When looking at XCL1, we clearly observe the batch effect indicated before - batch 1 looks very different from the rest! Also, have a look at TBet. This marker had the second-highest mean EMD above, and here the distributions for batches 6 and 7 differ from the rest.
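
The same comparison can be made numerically by computing the quartiles of a marker within each batch. The sketch below uses only Python's standard library; the helper `batch_quartiles` and the toy values are hypothetical, and the quantiles drawn in the cyCombine plot may be computed with a different method.

```python
import statistics


def batch_quartiles(batches):
    """Return the three quartile cut points of one marker for each batch."""
    return {
        batch: statistics.quantiles(values, n=4, method="inclusive")
        for batch, values in batches.items()
    }


# Toy expression values for one marker in two batches (made up).
toy_marker = {
    1: [2.0, 3.0, 4.0, 5.0, 6.0],
    2: [0.0, 0.0, 1.0, 1.0, 2.0],
}

quartiles = batch_quartiles(toy_marker)
for batch, (q1, median, q3) in quartiles.items():
    print(f"batch {batch}: Q1 = {q1}, median = {median}, Q3 = {q3}")
```

When all three quartiles of one batch lie well away from those of the other batches, the whole distribution is shifted, which is what the vertical bars in the distribution plot make visible.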