This vignette demonstrates the use of the detect_batch_effect function of cyCombine.
For an introduction to the detect_batch_effect_express function, please refer to the one-panel CyTOF example.
The function is demonstrated using data from a study of chronic lymphocytic leukemia (CLL) patients and healthy donors at the Dana-Farber Cancer Institute (DFCI). Protein expression was quantified using mass cytometry for 109 samples (20 healthy donors). The samples were run in seven batches.
We start by loading some packages.
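The package-loading chunk is not shown here. Judging from the functions called later in this vignette, cyCombine and the tidyverse are presumably what is needed; the following is a minimal sketch under that assumption, written so that it reports missing packages instead of failing.

```r
# Assumed package set, inferred from the functions used later in this vignette:
#   cyCombine -> prepare_data(), detect_batch_effect()
#   tidyverse -> read_csv(), %>%, filter(), pull(), str_remove_all()
pkgs <- c("cyCombine", "tidyverse")

# Check which of the assumed packages are not installed
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

if (length(missing_pkgs) > 0) {
  message("Install before continuing: ", paste(missing_pkgs, collapse = ", "))
} else {
  library(cyCombine)
  library(tidyverse)
}
```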
We are now ready to load the CyTOF data. We have set up a panel file in CSV format, so the marker information can be read from there.
# Directory with raw .fcs files
data_dir <- "dfci2_2"
# Panel and reading data
panel <- read_csv(paste0(data_dir, "/panel2.csv"))
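To give an idea of what such a panel file might contain, here is a small sketch in base R. The column names Marker and Type are the ones used in this vignette; the Channel column and every value in the table are invented for illustration and do not come from the DFCI panel.

```r
# Toy panel: Marker and Type match the columns used in this vignette,
# while Channel and all values are hypothetical.
toy_panel <- data.frame(
  Channel = c("Y89Di", "Cd112Di", "Nd142Di", "Ir191Di"),
  Marker  = c("CD45", "HLA-DR", "CD45_RA", "DNA1"),
  Type    = c("lineage", "lineage", "state", "none"),
  stringsAsFactors = FALSE
)

# Keep only the channels that carry a marker of interest
kept <- toy_panel[toy_panel$Type != "none", ]

# Strip spaces, underscores, and hyphens from the marker names
# (base R equivalent of str_remove_all("[ _-]"))
toy_markers <- gsub("[ _-]", "", kept$Marker)
print(toy_markers)
```

Cleaning the names this way keeps the marker names free of characters that are awkward in column names.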
We then proceed to read the CyTOF dataset and convert it to a tibble, which is easy to process. In this case we use the defaults: cofactor = 5 and derandomization.
# Extracting the markers
markers <- panel %>%
  filter(Type != "none") %>%
  pull(Marker) %>%
  str_remove_all("[ _-]")
# Preparing the expression data
dfci <- prepare_data(data_dir = data_dir,
                     metadata = paste0(data_dir, "/metadata.csv"),
                     filename_col = "FCS_name",
                     batch_ids = "Batch",
                     condition = "Set",
                     derand = TRUE,
                     markers = markers,
                     down_sample = FALSE)
Reading 109 files to a flowSet..
Extracting expression data..
Your flowset is now converted into a dataframe.
Transforming data using asinh with a cofactor of 5..
Done!
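The log above reports an asinh transformation with a cofactor of 5. As a rough illustration of what that means for individual values, the sketch below applies asinh(x / cofactor) to a few made-up intensities. The helper asinh_transform is hypothetical and is not part of cyCombine; derandomization is modeled here as rounding up with ceiling(), which is an assumption about how it is handled.

```r
# Sketch of an asinh transform with optional derandomization.
# Assumption: derandomization is modeled as rounding up with ceiling().
asinh_transform <- function(x, cofactor = 5, derand = TRUE) {
  if (derand) x <- ceiling(x)
  asinh(x / cofactor)
}

# Made-up raw intensities spanning several orders of magnitude
raw <- c(0, 0.3, 4.2, 50, 1000)
transformed <- asinh_transform(raw)
print(round(transformed, 3))
```

The transform is close to linear near zero and close to logarithmic for large values, which compresses the dynamic range of the measured intensities.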
In some experiments one may already know that batch effects exist. Maybe a marker was clearly overstained in one batch or a different antibody for the same protein was used in some batches. However, sometimes it is not known if batch effects - beyond those accounted for by bead normalization - are present. Such hidden technical variation can, depending on its severity, be detrimental to a biological analysis.
Accordingly, one should always inspect the data for batch effects before proceeding to clustering and other analysis steps. Here, we will see what can be learned from cyCombine’s detect_batch_effect function.
detect_batch_effect(dfci,
                    batch_col = 'batch',
                    out_dir = paste0(data_dir, '/batch_effect_check'),
                    seed = 434,
                    name = 'DFCI panel 2 data')
Determining new cell type labels using SOM:
Creating SOM grid..
Scaling expression data..
There are 3 markers, which appear to be outliers in a single batch:
CD14, FCeR1a, TGFB1
There are 1 clusters, in which a single cluster is strongly over- or underrepresented.
The cluster percentages for each batch in cluster 10 are:
1 = 0.69 %, 2 = 1.98 %, 3 = 0.69 %, 4 = 0.7 %, 5 = 0.67 %, 6 = 0.92 %, 7 = 0.78 %
The cluster expresses IL1RA, CD56, JAK1, CD11b, CD11c, CD45RA, CD16, TGFBR2, CD8, DR3.
Making UMAP plots for up to 50,000 cells.
Saved UMAP plot for batches and labels here: dfci2_2/batch_effect_check as UMAP_batches_labels.png.
Saved UMAP plot colored by each marker in directory: dfci2_2/batch_effect_check/UMAP_markers.
Done!
The printed output gives some pointers to potential batch effects. It flags three markers (CD14, FCeR1a, and TGFB1) that appear to be outliers in a single batch. The function also clusters the cells with a self-organizing map (SOM) and reports that one of the resulting clusters, cluster 10, is strongly over-represented in batch 2 (1.98 % of cells, versus 0.67 to 0.92 % in the other batches). Given that it expresses CD56 and CD16, this cluster appears to be a CD16+ NK cell cluster.
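The per-batch cluster percentages reported above can be reproduced in spirit with a few lines of base R. The sketch below uses simulated labels, and the 2-fold cutoff relative to the median across batches is an arbitrary choice for illustration; it is not claimed to be the criterion that detect_batch_effect applies internally.

```r
set.seed(434)

# Simulated labels: 7 batches and 3 clusters; all numbers are invented.
n_per_batch <- 5000
batch <- rep(1:7, each = n_per_batch)

# Cluster "C" is rare everywhere except batch 2, where it is enriched
p_c <- ifelse(batch == 2, 0.08, 0.02)
u <- runif(length(batch))
cluster <- ifelse(u < p_c, "C", ifelse(u < p_c + 0.5, "A", "B"))

# Percentage of each batch's cells that fall in each cluster (rows sum to 100)
pct <- 100 * prop.table(table(batch, cluster), margin = 1)
print(round(pct, 2))

# Compare each percentage with the median across batches for that cluster
med <- apply(pct, 2, median)
ratio <- sweep(pct, 2, med, "/")

# Flag batch/cluster pairs outside an (arbitrary) 2-fold range
flagged <- which(ratio > 2 | ratio < 0.5, arr.ind = TRUE)
print(flagged)
```

In this simulation, cluster C in batch 2 is the pair designed to stand out, analogous to cluster 10 in batch 2 in the output above.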
Additionally, the function produces a number of UMAP plots, which are saved in the directory specified by out_dir. The first is the UMAP_batches_labels plot, which shows a UMAP for up to 50,000 cells across the batches in the dataset: one panel is colored by batch, the other by SOM node (i.e., cluster). While this can be tricky to decipher, it is a good starting point for understanding potential batch effects. The function also generates the same UMAP colored by each protein marker in the dataset.
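The cap of 50,000 cells keeps the UMAP computation and plotting manageable. A simple way to apply such a cap is uniform random sampling of rows, as sketched below on toy data; whether detect_batch_effect samples in exactly this way is an assumption, and the data frame and its values are invented.

```r
set.seed(434)

# Toy data with invented values: 120,000 cells from 7 batches
max_cells <- 50000
toy <- data.frame(batch = sample(1:7, 120000, replace = TRUE),
                  CD16  = rnorm(120000))

# Sample without replacement, never asking for more rows than exist
keep <- sample(nrow(toy), size = min(nrow(toy), max_cells))
toy_sub <- toy[keep, ]

# Uniform sampling should roughly preserve the batch proportions
print(round(prop.table(table(toy$batch)), 3))
print(round(prop.table(table(toy_sub$batch)), 3))
```

Setting a seed, as done with the seed argument in the call above, makes the selected subset reproducible between runs.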
It is recommended to also call the detect_batch_effect_express function, investigate its output, and relate it to the results generated here.