This vignette demonstrates the use of the detect_batch_effect function in cyCombine.

For an introduction to the detect_batch_effect_express function, please refer to the one-panel CyTOF example.

The function will be demonstrated using data from a study of CLL patients and healthy donors at the Dana-Farber Cancer Institute (DFCI). Protein expression was quantified using mass cytometry for 109 samples, 20 of which are from healthy donors. The samples were run in seven batches.

Pre-processing data

We start by loading some packages.

library(cyCombine)
library(tidyverse)

We are now ready to load the CyTOF data. We have set up a panel file in csv format, so the relevant marker information can be extracted from it.

# Directory with raw .fcs files
data_dir <- "dfci2_2"

# Panel and reading data
panel <- read_csv(paste0(data_dir, "/panel2.csv"))
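The code in this vignette relies on two columns of the panel file, Marker and Type. As a minimal sketch, a panel of that shape could look as follows. The rows and the Type values other than "none" are made up for illustration and are not taken from the DFCI panel. Base R is used here so that the snippet runs without the tidyverse.

```r
# Hypothetical panel contents; only the column names Marker and Type
# are taken from this vignette, the rows are invented.
panel_text <- "Marker,Type
CD45 RA,type
HLA-DR,type
TGF_B1,state
DNA1,none"

toy_panel <- read.csv(text = panel_text, stringsAsFactors = FALSE)

# Keep the markers whose Type is not "none" and strip spaces,
# underscores and hyphens from their names.
toy_markers <- gsub("[ _-]", "", toy_panel$Marker[toy_panel$Type != "none"])
print(toy_markers)
```

Stripping these characters gives marker names without separators, which is the form in which they appear in the printed output further down (for example CD45RA and TGFB1).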

We then read the CyTOF dataset and convert it to a tibble, which is easy to process. We use cofactor = 5 (default) and derandomization (default) in this case.

# Extracting the markers
markers <- panel %>%
  filter(Type != "none") %>%
  pull(Marker) %>%
  str_remove_all("[ _-]")

# Preparing the expression data
dfci <- prepare_data(data_dir = data_dir,
                     metadata = paste0(data_dir, "/metadata.csv"),
                     filename_col = "FCS_name",
                     batch_ids = "Batch",
                     condition = "Set",
                     derand = TRUE,
                     markers = markers,
                     down_sample = FALSE)

Reading 109 files to a flowSet..

Extracting expression data..

Your flowset is now converted into a dataframe.

Transforming data using asinh with a cofactor of 5..

Done!
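The log above reports an asinh transformation with a cofactor of 5. A minimal sketch of that step on made-up values is shown below. The derandomization step is written here as rounding up to the next integer, which is an assumption about what derand = TRUE does; please check the cyCombine documentation for the exact behavior.

```r
# Made-up raw intensities for one marker
raw <- c(0.2, 3.7, 12.1, 250.4)
cofactor <- 5

# Assumed derandomization: round each value up to the next integer
derandomized <- ceiling(raw)

# asinh transformation with the cofactor used in this vignette
transformed <- asinh(derandomized / cofactor)
print(round(transformed, 3))
```

The transformation is close to linear near zero and close to logarithmic for large values, so the cofactor controls where that transition happens.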

Checking for batch effects

In some experiments, one may already know that batch effects exist. Maybe a marker was clearly overstained in one batch, or a different antibody for the same protein was used in some batches. However, sometimes it is not known whether batch effects, beyond those accounted for by bead normalization, are present. Such hidden technical variation can, depending on its severity, be detrimental to a biological analysis.

Accordingly, one should always inspect their data for batch effects before proceeding to clustering and other analysis steps. Here, we will see what can be learned from cyCombine’s detect_batch_effect function.

detect_batch_effect(dfci,
                    batch_col = 'batch',
                    out_dir = paste0(data_dir, '/batch_effect_check'), 
                    seed = 434,
                    name = 'DFCI panel 2 data')

Determining new cell type labels using SOM:

Creating SOM grid..

Scaling expression data..


There are 3 markers, which appear to be outliers in a single batch:

CD14, FCeR1a, TGFB1


There are 1 clusters, in which a single cluster is strongly over- or underrepresented.

The cluster percentages for each batch in cluster 10 are:

1 = 0.69 %, 2 = 1.98 %, 3 = 0.69 %, 4 = 0.7 %, 5 = 0.67 %, 6 = 0.92 %, 7 = 0.78 %

The cluster expresses IL1RA, CD56, JAK1, CD11b, CD11c, CD45RA, CD16, TGFBR2, CD8, DR3.

Making UMAP plots for up to 50,000 cells.

Saved UMAP plot for batches and labels here: dfci2_2/batch_effect_check as UMAP_batches_labels.png.

Saved UMAP plot colored by each marker in directory: dfci2_2/batch_effect_check/UMAP_markers.

Done!

The printed output gives some pointers to potential batch effects. It flags three markers (CD14, FCeR1a and TGFB1) that appear to be outliers in a single batch. The function also performs a clustering (SOM) and reports that one of the resulting clusters, cluster 10, is strongly over-represented in batch 2 (1.98 % of cells, compared to 0.67-0.92 % in the other batches). Based on its marker expression, this cluster appears to be a CD16+ NK cell cluster.
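To make the cluster finding concrete, the percentages printed for cluster 10 can be compared directly. The sketch below uses the values from the output above and relates each batch to the median across batches. This comparison is only an illustration and is not necessarily the criterion that detect_batch_effect applies internally.

```r
# Percentage of cells in cluster 10 per batch, copied from the printed output
pct <- c("1" = 0.69, "2" = 1.98, "3" = 0.69, "4" = 0.70,
         "5" = 0.67, "6" = 0.92, "7" = 0.78)

# Fold difference relative to the median batch (illustrative measure only)
fold <- pct / median(pct)
print(round(fold, 2))

# Batch with the largest deviation
names(which.max(fold))
```

Batch 2 stands out clearly from the remaining six batches, which all lie close to one another.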

Additionally, the function produces a number of UMAP plots, which are saved in the directory specified by out_dir. The first is the UMAP_batches_labels plot, which shows a UMAP for up to 50,000 cells across the batches in the dataset. One panel is colored by batch and the other by SOM node (i.e. cluster). While this plot can be tricky to decipher, it is a good starting point for understanding potential batch effects. The function also generates the same UMAP colored by each protein marker in the dataset; these plots are saved in the UMAP_markers subdirectory.
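The cap of 50,000 cells implies that larger datasets are subsampled before the UMAP is computed. As a rough sketch of such a cap, with a made-up cell count and without claiming that this is how detect_batch_effect selects its cells:

```r
set.seed(434)          # same seed value as in the call above, for reproducibility
n_cells <- 120000      # hypothetical number of cells in the dataset
max_cells <- 50000     # cap mentioned in the printed output

# Draw row indices without replacement, never more than are available
keep <- sort(sample.int(n_cells, size = min(n_cells, max_cells)))
length(keep)
```

Because of the min() call, a dataset with fewer cells than the cap is kept in full.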

It is recommended to also call the detect_batch_effect_express function and to compare its output with the results generated here.


Detection of batch effects in cytometry data

Christina Bligaard Pedersen

February 18, 2021
