This vignette will demonstrate the integration of CITE-seq, spectral flow cytometry and CyTOF protein expression measurements using cyCombine.


First, let us present the three datasets we will integrate:

For the CITE-seq data, we will use the ‘10k PBMCs from a Healthy Donor - Gene Expression and Cell Surface Protein’ dataset from https://support.10xgenomics.com/single-cell-gene-expression/datasets. We work on the filtered feature/cell matrix.

For spectral flow cytometry data, we will use the dataset from Park et al. (2020). The data is based on healthy donor PBMCs, and contains four samples, which were frozen and thawed, stained with 40 different antibodies in one panel, and analysed using a 5-laser full spectrum flow cytometer (Cytek Biosciences Aurora). We downloaded these from FlowRepository (ID: FR-FCM-Z2QV) and pre-gated to live single cells in FlowJo version 10 (Tree Star Inc). Singlets and non-debris were identified using forward and side-scatter. Dead cells were excluded using live/dead stains. Data from these gates were then exported in FCS format.

For the CyTOF data, we use the data from a single healthy donor processed at the Human Immune Monitoring Center. The sample was derived from FlowRepository (ID: FR-FCM-ZYAJ) and pre-gated to live intact singlets in FlowJo version 10 (Tree Star Inc).


Pre-processing

Pre-processing the CITE-seq data

We start by loading some packages

Then, we load the CITE-seq dataset and preprocess that using Seurat.

The CITE-seq data contains measurements for 6,949 cells after this initial filtering.

Now that the CITE-seq data is pre-processed, we are ready to include the other two data types: Spectral flow cytometry (SFC) and mass cytrometry (CyTOF). Particularly, we want to subset to only markers present in all three datasets’ protein expression measurements.

These are the following 11 markers: CD3, CD4, CD8(a), CD14, CD16, CD19, CD25, CD45RA, CD56, CD127, and PD1.

A word of caution here - sometimes matches cannot be made directly, due to different naming in different sets. I.e. PD-1 can be spelled as PD1 and CD8 may be referred to as CD8a in some sets. Furthermore, many proteins have different names - e.g. CCR7 = CD197. As a consequence, one has to look at these carefully and not just apply intersect.


Now we are ready to cluster the protein portion of the CITE-seq data on these overlapping protein markers.


Looking at these initial UMAPs, we notice cluster 10 being quite spread-out. Cluster 11 looks slightly odd as it is very small and situated quite far from the large clusters. Also, when we consider the nFeature_RNA and nCount_RNA and _ADT, these two clusters look a bit off. Perhaps these are cell-cell doublets? Let us look at the protein expression for the 11 markers.


Based on these plots, cluster 10 is definitely strange. It appears to be bimodal for CD4 and CD8a. It has relatively high CD16-levels, with a hint of CD56, but it also expresses CD3. Based on the weird nCount_RNA/nFeature_RNA stats, we decide to exclude those cells. For cluster 11, we also see a strange pattern: CD56+ CD16+ CD19+ indicates B-NK doublets - and based on the high nFeature_RNA values, these cells are also removed from downstream analysis.


Consequently, we move on with the cells in clusters 0-6 and 8-9. Now, we will label and subset them:


Now there is 6,776 cells left in the CITE-seq dataset and before we move on to the other two datasets, let us extract the protein-part of the CITE-seq data for integration using cyCombine.


Pre-processing the spectral flow cytometry data

We are now ready to load the spectral flow data. We convert it to a tibble format, which is easy to process. We use cofactor = 6000 for spectral flow data asinh-transformation. This looks reasonable in terms of separating negative and positive peaks in the data, and it was further recommended by Ferrer-Font et al. (2020).

In this example, we will use only a single sample to better match the conditions from the CITE-seq and CyTOF datasets.

Now, we a single sample consisting of 582,005 cells. Similarly to the CITE-seq data, we now need some cell labels - based on the overlapping markers only. Let us look at this:

Based on these plots and the UMAP, there are two clusters whose labeling is impossible. Cluster 6 and cluster 11 both do not express any of the overlapping lineage markers, and although cluster 6 is clearly CD45RA+ it is not sufficient to provide a label for these. Analogously to a manual gating, we will exclude the cells in these two clusters in the downstream analysis.


Regarding cluster 10, this was also tricky, but considering both its UMAP location and the intermediate level of CD4, which is comparable to clusters 1, 2, and 5, we are comfortable with labeling these as myeloid cells. The rest of the labels are assigned below.

We still have 560,698 cells remaining in the SFC dataset and this portion of the data is now ready for batch correction. We will now look at the CyTOF data.


Pre-processing the CyTOF data

Then it is time to read the CyTOF data. We use a single sample (ctrls-001) from FlowRepository: FR-FCM-ZYAJ. We downloaded the version normalized with MATLAB and pre-gated it to live intact singlets using FlowJo.

Now, we a single sample consisting of 174,601 cells. Similarly to the other datasets, we now need to generate some cell labels - based on the overlapping markers only. Let us look at this:

Based on these plots and the UMAP, there are four clusters whose labeling is impossible. Clusters 8, 9, 15, and 18 are tricky - they could be myeloid, but the patterns are not very clear. The middle two are relatively high in CD56, but the levels of CD4 are a little contradictory. Because of these doubts, all cells of these four clusters are excluded now.


Regarding cluster 3, this was also tricky, but considering its UMAP location, the intermediate level of CD4, which is comparable to clusters 19 and 20, and its expression of CD16, which is expected in non-classical monocytes, we are comfortable with labeling these as myeloid cells. The rest of the labels are assigned below.

We still have 165,669 cells remaining in the CyTOF dataset and this portion of the data is now ready for batch correction.



Batch correction

Now, we are ready to combine the three datasets on the overlapping columns. As discussed above, there are 11 markers which overlap between the sets. We also choose to downsample to have the same number of cells from each platform.

So now we have 6,776 cells from each platform in our final set. This is perhaps not the best approach for biological downstream analysis - but for the purpose of displaying the output from cyCombine, it makes the results easier to interpret.


Now, batch correction can be performed with cyCombine. Because these are datasets of completely different platforms, we use rank as the normalization method.



Evaluating batch correction

We now evaluate the correction using EMD - first clustering and then evaluation of each marker in each cluster.


We also use the MAD score for evaluation.

The MAD score is:  0.07 


Now, let us look at some plots to visualize the correction. First, the marker distributions before and after:

Let us also see the UMAPs for uncorrected and corrected data colored by batch.

Now, we also have labels for each of the datasets, which were generated independently of the batch correction. Let us have some visualization with these.


We can also show the UMAPs faceted by technology, where cells are colored by their expression of the 11 overlapping markers:

 

Contact