This vignette will demonstrate the integration of CITE-seq, spectral flow cytometry and CyTOF protein expression measurements using cyCombine.

First, let us present the three datasets we will integrate:

For the CITE-seq data, we will use the ‘10k PBMCs from a Healthy Donor - Gene Expression and Cell Surface Protein’ dataset from https://support.10xgenomics.com/single-cell-gene-expression/datasets. We work on the filtered feature/cell matrix.

For spectral flow cytometry data, we will use the dataset from Park et al. (2020). The data is based on healthy donor PBMCs, and contains four samples, which were frozen and thawed, stained with 40 different antibodies in one panel, and analysed using a 5-laser full spectrum flow cytometer (Cytek Biosciences Aurora). We downloaded these from FlowRepository (ID: FR-FCM-Z2QV) and pre-gated to live single cells in FlowJo version 10 (Tree Star Inc). Singlets and non-debris were identified using forward and side-scatter. Dead cells were excluded using live/dead stains. Data from these gates were then exported in FCS format.

For the CyTOF data, we use the data from a single healthy donor processed at the Human Immune Monitoring Center. The sample was derived from FlowRepository (ID: FR-FCM-ZYAJ) and pre-gated to live intact singlets in FlowJo version 10 (Tree Star Inc).

Pre-processing

Pre-processing the CITE-seq data

We start by loading some packages

library(cyCombine)
library(tidyverse)
library(Seurat)

# Set a seed to use throughout
seed = 840

# Get some nice colors for plotting
color_clusters <- c(RColorBrewer::brewer.pal(12, 'Paired'), RColorBrewer::brewer.pal(8, 'Dark2'))

Then, we load the CITE-seq dataset and preprocess that using Seurat.

# Read CITE-seq data
citeseq_read <- Read10X(data.dir = "filtered_feature_bc_matrix/")

# Now, we alter the rownames of the protein data slightly, so they will be easier to work with
rownames(x = citeseq_read[["Antibody Capture"]]) <- gsub('-', '', gsub(pattern = "_[control_]*TotalSeqB", replacement = "", 
     x = rownames(x = citeseq_read[["Antibody Capture"]])))


# Create Seurat object with both RNA and protein levels
citeseq <- CreateSeuratObject(counts = citeseq_read[["Gene Expression"]], min.cells = 3, min.features = 200, project = "cyCombine")
citeseq[["ADT"]] <- CreateAssayObject(citeseq_read[["Antibody Capture"]][, colnames(x = citeseq)])

# Add mitochondrial percentage for cell filtering
citeseq[["percent.mt"]] <- PercentageFeatureSet(object = citeseq, pattern = "^MT-")

# Filter data to exclude cells with very large feature counts (high number of different genes expressed, potential doublets), no protein expression, and large mitochondrial content
VlnPlot(object = citeseq, features = c("nFeature_RNA", "nCount_RNA", "percent.mt", "nFeature_ADT", "nCount_ADT"), pt.size = 0, ncol = 3)

citeseq <- subset(citeseq, subset = nFeature_RNA < 2800 & nCount_RNA < 10000 & percent.mt < 10 & nFeature_ADT > 0 & nCount_ADT < 20000)

The CITE-seq data contains measurements for 6,949 cells after this initial filtering.

Now that the CITE-seq data is pre-processed, we are ready to include the other two data types: Spectral flow cytometry (SFC) and mass cytrometry (CyTOF). Particularly, we want to subset to only markers present in all three datasets’ protein expression measurements.

These are the following 11 markers: CD3, CD4, CD8(a), CD14, CD16, CD19, CD25, CD45RA, CD56, CD127, and PD1.

A word of caution here - sometimes matches cannot be made directly, due to different naming in different sets. I.e. PD-1 can be spelled as PD1 and CD8 may be referred to as CD8a in some sets. Furthermore, many proteins have different names - e.g. CCR7 = CD197. As a consequence, one has to look at these carefully and not just apply intersect.

Now we are ready to cluster the protein portion of the CITE-seq data on these overlapping protein markers.

# Defining overlapping markers
overlap_markers <- c('CD3', 'CD4', 'CD8a', 'CD14', 'CD16', 'CD19', 'CD25', 'CD45RA', 'CD56', 'CD127', 'PD1')

# We will make the protein data default since we will only work with this part of the data going forward
DefaultAssay(citeseq) <- "ADT"

# Next, we normalize and scale the protein portion of the dataset
# VariableFeatures(citeseq) <- overlap_markers
citeseq <- NormalizeData(citeseq, assay = "ADT", normalization.method = "CLR", margin = 2, verbose = F) %>%
  ScaleData(verbose = F)

# We will now cluster based on the protein levels
citeseq <- RunPCA(citeseq, features = overlap_markers, verbose = F, npcs = 11, approx = F) %>%
  RunUMAP(features = overlap_markers, seed.use = seed, verbose = F) %>%
  FindNeighbors(features = overlap_markers, verbose = F, k.param = 10) %>%
  FindClusters(resolution = 0.2, verbose = F, graph.name = "ADT_snn")

# Visualization
DimPlot(citeseq, reduction = 'umap')

DimPlot(citeseq, reduction = 'umap', split.by = 'seurat_clusters', ncol = 4)

VlnPlot(citeseq, features = c('nFeature_RNA', 'nCount_RNA', 'percent.mt', 'nFeature_ADT', 'nCount_ADT'), pt.size = 0)

Looking at these initial UMAPs, we notice cluster 10 being quite spread-out. Cluster 11 looks slightly odd as it is very small and situated quite far from the large clusters. Also, when we consider the nFeature_RNA and nCount_RNA and _ADT, these two clusters look a bit off. Perhaps these are cell-cell doublets? Let us look at the protein expression for the 11 markers.

# And let's look at the marker distributions per cluster
VlnPlot(citeseq, features = overlap_markers, pt.size = 0)

Based on these plots, cluster 10 is definitely strange. It appears to be bimodal for CD4 and CD8a. It has relatively high CD16-levels, with a hint of CD56, but it also expresses CD3. Based on the weird nCount_RNA/nFeature_RNA stats, we decide to exclude those cells. For cluster 11, we also see a strange pattern: CD56+ CD16+ CD19+ indicates B-NK doublets - and based on the high nFeature_RNA values, these cells are also removed from downstream analysis.

Consequently, we move on with the cells in clusters 0-6 and 8-9. Now, we will label and subset them:

# Labels for the 11 clusters
citeseq_cl_names <- c('0' = 'Myeloid cells', '1' = 'CD4+ T cells (naive)',
                      '2' = 'CD4+ T cells (non-naive)', '3' = 'NK cells',
                      '4' = 'CD8+ T cells (non-naive)', '5' = 'CD8+ T cells (naive)',
                      '6' = 'B cells', '7' = 'Tregs',
                      '8' = 'Dbl. neg. T cells', '9' = 'Myeloid cells',
                      '10' = 'Doublets', '11' = 'Doublets')

citeseq <- RenameIdents(citeseq, citeseq_cl_names)

# Remove doublet clusters
citeseq <- subset(citeseq, idents = 'Doublets', invert = T)

Now there is 6,776 cells left in the CITE-seq dataset and before we move on to the other two datasets, let us extract the protein-part of the CITE-seq data for integration using cyCombine.

# Extract CITE-seq data (normalized)
citeseq_labels <- Idents(citeseq)
citeseq <- citeseq@assays$ADT@data %>%
                              t() %>%
                              as_tibble()

# Add columns to tibble for cyCombine processing
citeseq <- citeseq %>%
  mutate('batch' = 'CITEseq',
         'sample' = 'Sample1',
         'label' = citeseq_labels)


# Because "CD8a" was technically only used for CITE-seq, and we will refer to it as CD8 to be completely correct going forward
overlap_markers[overlap_markers == 'CD8a'] <- 'CD8'
colnames(citeseq)[colnames(citeseq)=='CD8a'] <- 'CD8'

Pre-processing the spectral flow cytometry data

We are now ready to load the spectral flow data. We convert it to a tibble format, which is easy to process. We use cofactor = 6000 for spectral flow data asinh-transformation. This looks reasonable in terms of separating negative and positive peaks in the data, and it was further recommended by Ferrer-Font et al. (2020).

In this example, we will use only a single sample to better match the conditions from the CITE-seq and CyTOF datasets.

# Directory with raw .fcs files
data_dir <- "Park_et_al_2020_Live+/"

# Panel and reading data
sfc_panel <- read_csv(paste0(data_dir, "/panel_Park2020.csv"))

sfc_markers <- sfc_panel %>%
  dplyr::filter(Type != "none") %>%
  pull(Marker) %>%
  str_remove_all("[ _-]")

spectral <- prepare_data(data_dir = data_dir,
                         metadata = paste0(data_dir, "/Spectral samples cohort.xlsx"),
                         filename_col = "FCS_name",
                         batch_ids = "Batch",
                         condition = "Set",
                         sample_ids = "Patient id",
                         cofactor = 6000,
                         derand = FALSE,
                         markers = sfc_markers,
                         down_sample = FALSE)


# Subset to a single sample, since CITE-seq is only from one donor and we want the data to be more directly comparable
spectral <- spectral %>%
  dplyr::filter(sample == "303444")

Now, we a single sample consisting of 582,005 cells. Similarly to the CITE-seq data, we now need some cell labels - based on the overlapping markers only. Let us look at this:

# Subset to overlap
spectral <- spectral %>%
  select(all_of(c(overlap_markers, 'sample', 'batch')))

# Time to label the cells. Here, we will apply a SOM and ConsensusClusterPlus

# Clustering with kohonen 6x6, since we are only interested in relatively high-level labels
set.seed(seed)
som_ <- spectral %>%
  dplyr::select(dplyr::all_of(overlap_markers)) %>%
  as.matrix() %>%
  kohonen::som(grid = kohonen::somgrid(xdim = 6, ydim = 6),
               rlen = 10,
               dist.fcts = "euclidean")


cell_clustering_som <- som_$unit.classif
codes <- som_$codes[[1]]

# Meta-clustering with ConsensusClusterPlus
mc <- ConsensusClusterPlus::ConsensusClusterPlus(t(codes), maxK = 25, reps = 100,
                                                 pItem = 0.9, pFeature = 1, plot = F,
                                                 clusterAlg = "hc", innerLinkage = "average", 
                                                 finalLinkage = "average",
                                                 distance = "euclidean", seed = seed)

# Get cluster ids for each cell - here we choose to look at 20 meta-clusters but we will merge some in the next step
code_clustering1 <- mc[[20]]$consensusClass
spectral$label <- as.factor(code_clustering1[cell_clustering_som])

# Down-sampling for plot
set.seed(seed)
spectral_sliced <- spectral %>%
  dplyr::slice_sample(n = 10000)

# Let us visualize these clusters on a UMAP
umap <- spectral_sliced %>% plot_dimred(name = 'SFC clusters')
spectral_sliced <- cbind.data.frame(umap$data, spectral_sliced)

ggplot(spectral_sliced, aes(x = UMAP1, y = UMAP2)) +
  geom_point(aes(color = label), alpha = 0.3, size = 0.4) +
  theme_bw() + theme(plot.title = element_text(hjust = 0.5)) +
  scale_color_manual(values = color_clusters) +
  guides(colour = guide_legend(override.aes = list(alpha = 1, size = 1)))

# And now let us use violin plots for the assignment of labels for each population
p <- list()
for (m in overlap_markers) {
  p[[m]] <- ggplot(spectral, aes_string(x = 'label', y = m)) +
    geom_violin(aes(fill = label)) +
    scale_fill_manual(values = color_clusters) +
    theme_bw()
}

p2 <- ggpubr::ggarrange(plotlist = p, common.legend = T)

p2

Based on these plots and the UMAP, there are two clusters whose labeling is impossible. Cluster 6 and cluster 11 both do not express any of the overlapping lineage markers, and although cluster 6 is clearly CD45RA+ it is not sufficient to provide a label for these. Analogously to a manual gating, we will exclude the cells in these two clusters in the downstream analysis.

Regarding cluster 10, this was also tricky, but considering both its UMAP location and the intermediate level of CD4, which is comparable to clusters 1, 2, and 5, we are comfortable with labeling these as myeloid cells. The rest of the labels are assigned below.

spectral_cl_names <- c('1' = 'Myeloid cells', '2' = 'Myeloid cells', '3' = 'B cells', '4' = 'NK cells', '5' = 'Myeloid cells',
                       '6' = 'Unlabeled', '7' = 'NK cells', '8' = 'Dbl. neg. T cells', '9' = 'CD8+ T cells (naive)', '10' = 'Myeloid cells',
                       '11' = 'Unlabeled', '12' = 'Dbl. neg. T cells', '13' = 'CD8+ T cells (naive)', '14' = 'CD8+ T cells (non-naive)', '15' = 'Tregs',
                       '16' = 'CD4+ T cells (non-naive)', '17' = 'Dbl. neg. T cells', '18' = 'CD4+ T cells (naive)', '19' = 'CD8+ T cells (naive)', '20' = 'Tregs')

spectral$label <- spectral_cl_names[spectral$label]

# Removing the unlabeled cells
spectral <- spectral %>%
  dplyr::filter(label != 'Unlabeled')

We still have 560,698 cells remaining in the SFC dataset and this portion of the data is now ready for batch correction. We will now look at the CyTOF data.

Pre-processing the CyTOF data

Then it is time to read the CyTOF data. We use a single sample (ctrls-001) from FlowRepository: FR-FCM-ZYAJ. We downloaded the version normalized with MATLAB and pre-gated it to live intact singlets using FlowJo.

### Get HIMC cytof data for one sample
# Panel and reading data
cytof_panel <- read_csv("cytof_panel_HIMC.csv")

cytof_markers <- cytof_panel %>%
  dplyr::filter(Type != "none") %>%
  pull(Marker) %>%
  str_remove_all("[ _-]")

# Read the fcs file and make it a dataframe
cytof <- prepare_data(data_dir = '.',
                      cofactor = 5,
                      derand = T,
                      markers = cytof_markers,
                      down_sample = FALSE,
                      batch_ids  = c('CyTOF'),
                      sample_ids = c('Sample1'))

Now, we a single sample consisting of 174,601 cells. Similarly to the other datasets, we now need to generate some cell labels - based on the overlapping markers only. Let us look at this:

# Subset to overlap
cytof <- cytof %>%
  select(all_of(c(overlap_markers, 'sample', 'batch')))

# Time to label the cells. Here, we will again apply a SOM and ConsensusClusterPlus

# Clustering with kohonen 6x6, since we are only interested in relatively high-level labels
set.seed(seed)
som_ <- cytof %>%
  dplyr::select(dplyr::all_of(overlap_markers)) %>%
  as.matrix() %>%
  kohonen::som(grid = kohonen::somgrid(xdim = 6, ydim = 6),
               rlen = 10,
               dist.fcts = "euclidean")


cell_clustering_som <- som_$unit.classif
codes <- som_$codes[[1]]

# Meta-clustering with ConsensusClusterPlus
mc <- ConsensusClusterPlus::ConsensusClusterPlus(t(codes), maxK = 25, reps = 100,
                                                 pItem = 0.9, pFeature = 1, plot = F,
                                                 clusterAlg = "hc", innerLinkage = "average", 
                                                 finalLinkage = "average",
                                                 distance = "euclidean", seed = seed)

# Get cluster ids for each cell - here we choose to look at 20 meta-clusters but we will merge some in the next step
code_clustering1 <- mc[[20]]$consensusClass
cytof$label <- as.factor(code_clustering1[cell_clustering_som])

# Down-sampling for plot
set.seed(seed)
cytof_sliced <- cytof %>%
  dplyr::slice_sample(n = 10000)

# Let us visualize these clusters on a UMAP
umap <- cytof_sliced %>% plot_dimred(name = 'CyTOF clusters')
cytof_sliced <- cbind.data.frame(umap$data, cytof_sliced)

ggplot(cytof_sliced, aes(x = UMAP1, y = UMAP2)) +
  geom_point(aes(color = label), alpha = 0.3, size = 0.4) +
  theme_bw() + theme(plot.title = element_text(hjust = 0.5)) +
  scale_color_manual(values = color_clusters) +
  guides(colour = guide_legend(override.aes = list(alpha = 1, size = 1)))

# Now let us color the plot by the relatively few overlapping markers - and the cluster labels
p <- list()
for (m in overlap_markers) {

  p[[m]] <- ggplot(cytof_sliced, aes_string(x = colnames(cytof_sliced)[1], y = colnames(cytof_sliced)[2])) +
  geom_point(aes_string(color = m), alpha = 0.3, size = 0.4) +
  theme_bw() + theme(plot.title = element_text(hjust = 0.5)) +
  scale_color_viridis_c()

}

# And now let us use violin plots for the assignment of labels for each population
p <- list()
for (m in overlap_markers) {
  p[[m]] <- ggplot(cytof, aes_string(x = 'label', y = m)) +
    geom_violin(aes(fill = label)) +
    scale_fill_manual(values = color_clusters) +
    theme_bw()
}

p2 <- ggpubr::ggarrange(plotlist = p, common.legend = T)

p2

Based on these plots and the UMAP, there are four clusters whose labeling is impossible. Clusters 8, 9, 15, and 18 are tricky - they could be myeloid, but the patterns are not very clear. The middle two are relatively high in CD56, but the levels of CD4 are a little contradictory. Because of these doubts, all cells of these four clusters are excluded now.

Regarding cluster 3, this was also tricky, but considering its UMAP location, the intermediate level of CD4, which is comparable to clusters 19 and 20, and its expression of CD16, which is expected in non-classical monocytes, we are comfortable with labeling these as myeloid cells. The rest of the labels are assigned below.

cytof_cl_names <- c('1' = 'CD8+ T cells (non-naive)', '2' = 'CD8+ T cells (naive)', '3' = 'Myeloid cells', '4' = 'NK cells', '5' = 'NK cells',
                    '6' = 'Dbl. neg. T cells', '7' = 'CD8+ T cells (non-naive)', '8' = 'Unlabeled', '9' = 'Unlabeled', '10' = 'CD4+ T cells (naive)',
                    '11' = 'CD8+ T cells (naive)', '12' = 'Dbl. neg. T cells', '13' = 'B cells', '14' = 'NK cells', '15' = 'Unlabeled',
                    '16' = 'CD4+ T cells (non-naive)', '17' = 'Tregs', '18' = 'Unlabeled', '19' = 'Myeloid cells', '20' = 'Myeloid cells')

cytof$label <- cytof_cl_names[cytof$label]

# Removing the unlabeled cells
cytof <- cytof %>%
  dplyr::filter(label != 'Unlabeled')

We still have 165,669 cells remaining in the CyTOF dataset and this portion of the data is now ready for batch correction.

Batch correction

Now, we are ready to combine the three datasets on the overlapping columns. As discussed above, there are 11 markers which overlap between the sets. We also choose to downsample to have the same number of cells from each platform.

# Get overlapping column names
overlap_cols <- intersect(colnames(citeseq), intersect(colnames(cytof), colnames(spectral)))

# Make one tibble - here, we downsample to have the same number of cells from each platform
uncorrected <- bind_rows(cytof[,overlap_cols], citeseq[,overlap_cols], spectral[,overlap_cols]) %>%
  group_by(batch) %>%
  slice_sample(n = min(c(nrow(cytof), nrow(citeseq), nrow(spectral)))) %>%
  ungroup() %>%
  mutate(id = 1:(min(c(nrow(cytof), nrow(citeseq), nrow(spectral)))*3))

So now we have 6,776 cells from each platform in our final set. This is perhaps not the best approach for biological downstream analysis - but for the purpose of displaying the output from cyCombine, it makes the results easier to interpret.

Now, batch correction can be performed with cyCombine. Because these are datasets of completely different platforms, we use rank as the normalization method.

# Run batch correction
corrected <- uncorrected %>%
  batch_correct(seed = seed,
                xdim = 8,
                ydim = 8,
                norm_method = 'rank',
                ties.method = 'average')

# Add labels back from uncorrected
corrected$label <- uncorrected$label

Evaluating batch correction

We now evaluate the correction using EMD - first clustering and then evaluation of each marker in each cluster.

# Generate labels for EMD evaluation
labels <- corrected %>%
  create_som(seed = seed,
             xdim = 8,
             ydim = 8,
             markers = get_markers(corrected))

# Add labels
corrected <- corrected %>%
  dplyr::mutate(som = labels)
celltype_col <- "som"

uncorrected <- corrected %>%
  dplyr::select(id, all_of(celltype_col)) %>%
  dplyr::left_join(uncorrected, by = "id")


# Evaluate Earth Movers Distance
emd <- uncorrected %>%
  evaluate_emd(corrected, markers = get_markers(corrected), cell_col = celltype_col)

# Show plots
cowplot::plot_grid(emd$violin, emd$scatterplot)

We also use the MAD score for evaluation.

# MAD
mad <- uncorrected %>%
  cyCombine::evaluate_mad(corrected,
                          markers = get_markers(corrected),
                          cell_col = celltype_col,
                          filter_limit = NULL)

cat('The MAD score is: ', mad$score, '\n')

The MAD score is:  0.07

Now, let us look at some plots to visualize the correction. First, the marker distributions before and after:

plot_density(uncorrected, corrected, ncol = 4, xlims = c(-2, 9))

Let us also see the UMAPs for uncorrected and corrected data colored by batch.

# UMAP plot uncorrected
umap1 <- uncorrected %>%
  plot_dimred(name = "uncorrected", type = "umap")

# UMAP plot corrected
umap2 <- corrected %>%
  plot_dimred(name = "corrected", type = "umap")

# Show plots
cowplot::plot_grid(umap1, umap2)

Now, we also have labels for each of the datasets, which were generated independently of the batch correction. Let us have some visualization with these.

# For uncorrected data, let's make a combined object
uncor_df <- cbind.data.frame(umap1$data[,1:2], uncorrected)

umap3 <- ggplot(uncor_df, aes(x = UMAP1, y = UMAP2)) +
                geom_point(aes(color = label), alpha = 0.3, size = 0.4) +
                theme_bw() + theme(plot.title = element_text(hjust = 0.5)) +
                ggtitle('UMAP - uncorrected') +
                guides(color = guide_legend(override.aes = list(alpha = 1, size = 1)))

# For corrected data, let's make a combined object
cor_df <- cbind.data.frame(umap2$data[,1:2], corrected)

umap4 <- ggplot(cor_df, aes(x = UMAP1, y = UMAP2)) +
                geom_point(aes(color = label), alpha = 0.3, size = 0.4) +
                theme_bw() + theme(plot.title = element_text(hjust = 0.5)) + 
                ggtitle('UMAP - corrected') +
                guides(color = guide_legend(override.aes = list(alpha = 1, size = 1)))


# Show plots
cowplot::plot_grid(umap3, umap4)

# Faceting the UMAPs per technology
umap5 <- umap3 + facet_wrap(~batch)
umap6 <- umap4 + facet_wrap(~batch)


# Show plots
cowplot::plot_grid(umap5, umap6, nrow = 2)

We can also show the UMAPs faceted by technology, where cells are colored by their expression of the 11 overlapping markers:

# Loop over markers to make the plots for both the uncorrected and corrected data
expr_plots <- list()
for (m in get_markers(corrected)) {
  
  # Find range for marker to make plots comparable
  range <- summary(c(uncor_df[,m], cor_df[,m]))[c(1,6)]
  
  expr_plots[[paste0(m, '1')]] <- ggplot(uncor_df, aes(x = UMAP1, y = UMAP2)) +
                geom_point(aes_string(color = m), alpha = 0.3, size = 0.4) +
                theme_bw() + theme(plot.title = element_text(hjust = 0.5)) +
                ggtitle('UMAP - uncorrected') + 
                scale_color_gradientn(m, limits  = range, colours = colorRampPalette(rev(RColorBrewer::brewer.pal(n = 11, name = "Spectral")))(50)) +
                facet_wrap(~batch)
  
  expr_plots[[paste0(m, '2')]] <- ggplot(cor_df, aes(x = UMAP1, y = UMAP2)) +
                geom_point(aes_string(color = m), alpha = 0.3, size = 0.4) +
                theme_bw() + theme(plot.title = element_text(hjust = 0.5)) +
                ggtitle('UMAP - corrected') + 
                scale_color_gradientn(m, limits = range, colours = colorRampPalette(rev(RColorBrewer::brewer.pal(n = 11, name = "Spectral")))(50)) +
                facet_wrap(~batch)
  
}

# Show plots
cowplot::plot_grid(plotlist = expr_plots, ncol = 2)

Contact

Integrating CITE-seq, spectral flow cytometry and mass cytometry data

Christina Bligaard Pedersen