This vignette demonstrates the batch correction of a CyTOF dataset in which samples were measured using two different panels. In addition to performing batch correction, we will impute the non-overlapping markers, allowing for a much more direct integration of these data.


This is data from a study of chronic lymphocytic leukemia (CLL) patients and healthy donors at the Dana-Farber Cancer Institute (DFCI). Protein expression was quantified using two different, partially overlapping panels of proteins. The data generated with each panel was processed in seven batches. The data is B-cell depleted.


Pre-processing data

In this dataset, it seems reasonable to start by looking at the two panels.

Now, we have the panels - so let us extract the markers and identify the overlap.

 [1] "CD20"   "CD3"    "CD45RA" "CD5"    "CD19"   "CD14"   "CD33"   "CD4"   
 [9] "CD8"    "CD197"  "CD56"   "CD161"  "FoxP3"  "HLADR"  "XCL1"  

We observe that there are 15 overlapping markers in total. These cover many of the major cell types (e.g. CD3, CD4, and CD8 for T cells, CD56 for NK cells, and CD14 and CD33 for myeloid cell types).
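Conceptually, identifying the overlap is an ordered set intersection of the marker names of the two panels. Below is a minimal Python sketch of that idea, for illustration only. The two panel lists are shortened, hypothetical stand-ins (in particular, which panel CD16 and Granzyme A belong to is assumed here), and the function name is our own.

```python
def overlapping_markers(panel_a, panel_b):
    """Return the markers present in both panels, in panel_a's order."""
    in_b = set(panel_b)
    return [marker for marker in panel_a if marker in in_b]


# Hypothetical, shortened panels for illustration (not the real DFCI panels)
panel1 = ["CD20", "CD3", "CD16", "CD4", "CD8", "CD56"]
panel2 = ["CD3", "CD4", "CD8", "GranzymeA", "CD56", "CD20"]

overlap = overlapping_markers(panel1, panel2)
print(overlap)
```

Keeping the order of the first panel makes the result reproducible, which a plain set intersection would not guarantee.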


The workflow presented in this vignette can be visualized with the following schematic. Datasets a and b are the datasets for the two panels used here.


We are now ready to load the CyTOF data. We convert it to a tibble format, which is easy to process. We use derandomization and cofactor = 5 (default) for asinh-transformation.

Reading 128 files to a flowSet..
Extracting expression data..
Your flowset is now converted into a dataframe.
Transforming data using asinh with a cofactor of 5..
Done!
Reading 112 files to a flowSet..
Extracting expression data..
Your flowset is now converted into a dataframe.
Transforming data using asinh with a cofactor of 5..
Done!
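To make the transformation step concrete, here is a small Python sketch of an asinh transformation with a cofactor of 5. It is not the package's code. We assume here that derandomization means rounding the raw values up to the next integer before transforming; the function name and the toy values are hypothetical.

```python
import math


def asinh_transform(values, cofactor=5, derand=True):
    """Apply asinh(x / cofactor) to raw intensities.

    Assumption: derandomization rounds each raw value up to the next
    integer before the transformation.
    """
    transformed = []
    for x in values:
        if derand:
            x = math.ceil(x)
        transformed.append(math.asinh(x / cofactor))
    return transformed


raw = [0.0, 4.2, 49.7, 500.0]  # toy raw intensities
transformed = asinh_transform(raw)
print(transformed)
```

The cofactor sets the scale below which the transformation behaves roughly linearly; above it, the transformation behaves roughly logarithmically.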


Processing data - batch correction

Panel 1 - batch correction

In this case, the dataset for each panel was, as mentioned, run in seven batches. This means that there are likely some batch effects to correct for within each panel as well. We take care of this first, before we start integrating across panels!


Let us have a quick look at some UMAPs to visualize the correction for each batch. We downsample so it is easier to see what is going on.
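The downsampling itself can be as simple as drawing a fixed number of cells at random. The Python sketch below draws the same number of cells from every batch with a fixed seed so that the plots are reproducible. Sampling per batch, the field names `id` and `batch`, and the numbers are assumptions made for illustration.

```python
import random


def downsample_per_batch(cells, n_per_batch, seed=42):
    """Randomly keep at most n_per_batch cells from every batch."""
    rng = random.Random(seed)
    by_batch = {}
    for cell in cells:
        by_batch.setdefault(cell["batch"], []).append(cell)
    sampled = []
    for batch in sorted(by_batch):
        group = by_batch[batch]
        sampled.extend(rng.sample(group, min(n_per_batch, len(group))))
    return sampled


# Toy data: 700 cells spread evenly over seven batches
cells = [{"id": i, "batch": i % 7 + 1} for i in range(700)]
subset = downsample_per_batch(cells, n_per_batch=20)
print(len(subset))
```

Sampling an equal number of cells per batch keeps large batches from visually dominating the UMAP.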

Now, let us view the expression distributions for all the markers in panel 1.

Finally, we will evaluate the EMD reduction for the batch correction of panel 1. To do this, we first need to perform a clustering of the corrected set, and we will transfer the labels to the uncorrected set for direct comparison.

The MAD score is: 0.02 
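To give an intuition for what an EMD reduction expresses, the Python sketch below computes the Earth Mover's Distance between two batches for a single marker before and after correction, and then the fraction of the distance that was removed. This is a simplified illustration with toy values: it assumes equal-sized samples and a single cluster and marker, and the package's own evaluation may differ in how distributions are binned and aggregated.

```python
def emd_1d(a, b):
    """Earth Mover's Distance between two equal-sized 1D samples."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def emd_reduction(emd_before, emd_after):
    """Fraction of the summed EMD that was removed by the correction."""
    total_before = sum(emd_before)
    return (total_before - sum(emd_after)) / total_before


# Toy values for one marker in one cluster, measured in two batches
batch1_raw, batch2_raw = [0.0, 1.0, 2.0], [1.5, 2.5, 3.5]
batch1_cor, batch2_cor = [0.5, 1.5, 2.5], [1.0, 2.0, 3.0]

before = emd_1d(batch1_raw, batch2_raw)
after = emd_1d(batch1_cor, batch2_cor)
reduction = emd_reduction([before], [after])
print(before, after, reduction)
```

A reduction close to 1 means the distributions of the batches have become nearly identical within the compared groups, while a value near 0 means the correction changed little.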


Panel 2 - batch correction

Now, it is time to do the same with the data from panel 2. First, we batch correct:


We look at the UMAPs.

And we take a look at the marker distributions for panel 2.

Lastly, we evaluate the EMD reduction for the batch correction of panel 2 in the same manner as presented above.

The MAD score is: 0.02 


For both panels, we find EMD reductions of 0.66 and MAD scores of 0.02, when considering the batch corrections separately.
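The MAD (median absolute deviation) score can be read as a measure of how much the spread of the data changed during correction, so a low value suggests that little biological variation was removed. The Python sketch below illustrates the idea on toy values. It is a simplification with hypothetical function names; how the package groups cells (for example by cluster, marker, and batch) is not reproduced here.

```python
from statistics import median


def mad(values):
    """Median absolute deviation from the median."""
    center = median(values)
    return median(abs(v - center) for v in values)


def mad_score(before_groups, after_groups):
    """Median absolute change in MAD across matched groups of values."""
    return median(
        abs(mad(b) - mad(a)) for b, a in zip(before_groups, after_groups)
    )


# Two toy groups (e.g. one marker in two batches), before and after correction
before = [[1.0, 2.0, 3.0, 4.0, 5.0], [0.0, 2.0, 4.0, 6.0, 8.0]]
after = [[2.0, 2.5, 3.0, 3.5, 4.0], [0.0, 2.0, 4.0, 6.0, 8.0]]

score = mad_score(before, after)
print(score)
```

In this toy example the first group is compressed by the correction while the second is untouched, so the score lands between the two changes.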


Combined batch correction

Based on the marker distributions after correction and the UMAPs, it looks like the batch effects within the data for each panel are minimized. Now we can focus on the integration of the two sets. The first step here is to batch correct the datasets based on the overlapping markers.


Similarly to the corrections within each panel’s data, we can now look at the UMAPs and marker distributions before and after correction - and we can also calculate the EMD reduction.

And the EMD reduction:

The MAD score is: 0.01 

For this correction, we obtain an EMD reduction of 0.83 and a MAD score of 0.01.


Imputing non-overlapping markers

Now that we have batch corrected the overlapping markers from the two panels, they are directly comparable. However, when limiting ourselves only to the overlapping set, we also remove all information contained in the non-overlapping markers. In this case, panel 1 contains 21 markers not found in panel 2 - and panel 2 has 19 markers not found in panel 1. These markers include CD16, which is important for NK cells and monocyte distinction and Granzyme A, which is important to deeply characterize cytotoxic T cells and NK cells.

We want to include these markers in our dataset, but because the non-overlapping markers were only measured on roughly half of the cells, we have to use imputation to provide a value for the other panel’s cells.

We start by defining the sets of non-overlapping markers and then add the values for these to the batch corrected values for the overlapping markers for each panel.
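As an illustration of this bookkeeping step, the Python sketch below derives the panel-specific markers as a set difference and attaches their values to the corrected overlapping markers, matching cells on an identifier. The column name `id`, the shortened panel, and all values are hypothetical.

```python
def unique_markers(panel, overlap):
    """Markers in the panel that are not in the overlap, in panel order."""
    shared = set(overlap)
    return [marker for marker in panel if marker not in shared]


def attach_unique(corrected_rows, original_rows, markers):
    """Add panel-specific marker values to corrected rows, matched on id."""
    by_id = {row["id"]: row for row in original_rows}
    return [
        {**row, **{m: by_id[row["id"]][m] for m in markers}}
        for row in corrected_rows
    ]


overlap = ["CD3", "CD4"]
panel1 = ["CD3", "CD4", "CD16"]  # hypothetical, shortened panel
original = [{"id": 1, "CD3": 2.0, "CD4": 1.0, "CD16": 0.3}]
corrected = [{"id": 1, "CD3": 1.8, "CD4": 1.1}]  # overlap only, corrected

extra = unique_markers(panel1, overlap)
combined = attach_unique(corrected, original, extra)
print(combined)
```

Matching on a cell identifier rather than on row position guards against the two tables having been reordered or downsampled differently.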


We are now ready to impute the values of the markers unique to panel 2 for the panel 1 data - and vice versa.

Creating SOM grid..
Performing density draws for dataset1.
Performing density draws for dataset2.
Caution! Analysis based on imputed values can lead to false inferences. We recommend only using imputed marker expressions for visualization and not for differential expression analysis.
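The general idea behind this kind of imputation can be sketched as follows: cells from both panels are grouped by their overlapping markers, and each cell lacking a marker receives a value drawn from cells in the same group that do have it. The Python sketch below is a deliberately simplified stand-in. It uses fixed, hypothetical centroids instead of a SOM grid and draws from the observed donor values rather than from an estimated density, so it should not be read as the package's implementation.

```python
import random


def nearest(centroids, point):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((c - p) ** 2 for c, p in zip(centroids[i], point)),
    )


def impute_from_donors(recipients, donors, donor_values, centroids, seed=0):
    """Draw a value for each recipient cell from donors in the same group.

    recipients, donors: overlapping-marker vectors of the cells.
    donor_values: measured values of the marker to impute, one per donor.
    """
    rng = random.Random(seed)
    pools = {}
    for cell, value in zip(donors, donor_values):
        pools.setdefault(nearest(centroids, cell), []).append(value)
    imputed = []
    for cell in recipients:
        pool = pools.get(nearest(centroids, cell))
        imputed.append(rng.choice(pool) if pool else None)
    return imputed


centroids = [[0.0, 0.0], [5.0, 5.0]]  # hypothetical group centers
donors = [[0.1, 0.0], [5.0, 5.2], [4.9, 5.0]]  # cells with the marker
donor_values = [0.0, 3.0, 3.5]
recipients = [[0.0, 0.2], [5.1, 4.8]]  # cells lacking the marker

imputed = impute_from_donors(recipients, donors, donor_values, centroids)
print(imputed)
```

Because each imputed value is a random draw conditioned only on the group, it reproduces the distribution of the marker at the population level but says little about an individual cell, which is why the caution above applies.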


Now, we are ready to use this combined dataset to answer the biological questions and make nice visualizations. Let us look at all of the markers after batch corrections and imputation.

This looks very nice in terms of obtaining comparable distributions between the cells originating from each panel.

Let us make a UMAP for the combined set - based on all 55 markers (15 overlapping + 21 unique to panel 1 + 19 unique to panel 2).

After running this process, it is possible to perform any downstream processing relevant for the dataset. The next step could be to perform clustering on the lineage markers and then perform differential abundance testing.

 

Contact