This vignette demonstrates batch correction of a CyTOF dataset in which samples were measured using two different panels. In addition to performing batch correction, we impute the non-overlapping markers, which allows for a much more direct integration of these data.


The data come from a study of CLL patients and healthy donors at the Dana-Farber Cancer Institute (DFCI). Protein expression was quantified using two different, partially overlapping protein panels. The data generated with each panel were processed in seven batches. The data are B-cell depleted.


Pre-processing data

In this dataset, it seems reasonable to start by looking at the two panels.

Now that we have the panels, let us extract the markers and identify the overlap.

 [1] "CD20"   "CD3"    "CD45RA" "CD5"    "CD19"   "CD14"   "CD33"   "CD4"   
 [9] "CD8"    "CD197"  "CD56"   "CD161"  "FoxP3"  "HLADR"  "XCL1"  

We observe a total of 15 overlapping markers. These cover many of the major cell types (e.g., CD3, CD4, and CD8 for T cells, CD56 for NK cells, and CD14 and CD33 for myeloid cell types).
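As a minimal sketch of how such an overlap can be identified, the base R set functions `intersect()` and `setdiff()` are sufficient. The panel-specific marker names below (`MarkerA1`, `MarkerB1`, and so on) are hypothetical placeholders, not the real panel contents, and the variable names are assumptions made for illustration only.

```r
# The 15 overlapping markers listed in the output above.
overlap_reported <- c("CD20", "CD3", "CD45RA", "CD5", "CD19", "CD14", "CD33", "CD4",
                      "CD8", "CD197", "CD56", "CD161", "FoxP3", "HLADR", "XCL1")

# Hypothetical panel-specific markers (placeholders, NOT the real panels).
panel_a_only <- c("MarkerA1", "MarkerA2")
panel_b_only <- c("MarkerB1", "MarkerB2", "MarkerB3")

# Marker vectors for the two panels.
markers_a <- c(overlap_reported, panel_a_only)
markers_b <- c(panel_b_only, overlap_reported)

# Markers measured in both panels (order follows markers_a).
overlap_markers <- intersect(markers_a, markers_b)

# Markers that would need to be imputed in each dataset.
missing_in_b <- setdiff(markers_a, markers_b)
missing_in_a <- setdiff(markers_b, markers_a)

length(overlap_markers)
```

The two `setdiff()` results are the non-overlapping markers, i.e. the ones that would later have to be imputed across panels.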


The workflow presented in this vignette can be visualized with the following schematic. Here, datasets a and b are the datasets for the two panels.


We are now ready to load the CyTOF data. We convert it to a tibble, which is easy to process. We use derandomization and the default cofactor of 5 for the asinh transformation.

Reading 128 files to a flowSet..
Extracting expression data..
Your flowset is now converted into a dataframe.
Transforming data using asinh with a cofactor of 5..
Done!
Reading 112 files to a flowSet..
Extracting expression data..
Your flowset is now converted into a dataframe.
Transforming data using asinh with a cofactor of 5..
Done!
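To illustrate what the transformation step does to the expression values, here is a minimal base R sketch. It assumes that derandomization means rounding raw values up to the nearest integer with `ceiling()` before transforming; the helper `asinh_transform()` and the toy values are hypothetical and are not the package's own function or real data.

```r
# Hypothetical helper: optional derandomization followed by an asinh transform.
# Assumption: derandomization is implemented as ceiling() on the raw values.
asinh_transform <- function(x, cofactor = 5, derand = TRUE) {
  if (derand) x <- ceiling(x)
  asinh(x / cofactor)
}

# Toy raw intensities for two markers (made-up values, not from the study).
raw <- data.frame(CD3 = c(0.3, 4.2, 99.7),
                  CD4 = c(12.5, 0, 250.1))

# Apply the transform to every marker column.
transformed <- as.data.frame(lapply(raw, asinh_transform))
transformed
```

Dividing by the cofactor before applying `asinh()` keeps values near zero approximately linear while compressing large intensities roughly logarithmically, which is why a cofactor of 5 is a common choice for CyTOF data.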