This vignette will introduce the panel merging module of cyCombine, which has two main functions:

  1. Imputation of non-overlapping channels for whole data sets (impute_across_panels)
  2. Imputation of single misstained (or otherwise problematic) channels (salvage_problematic)

These modules rely on the same main workflow, as represented by this figure:



In this vignette, we will look at these functions and also provide a general discussion of panel merging robustness and limitations. Finally, we include benchmarking against existing tools.


Imputation of non-overlapping channels for whole data sets

Mass cytometry and spectral flow cytometry enable quantification of 40+ proteins in parallel on millions of cells. But sometimes that is not enough to deeply characterize the populations of interest. The solution: use two or more panels on the same cells - i.e., split your samples before staining. When this is done, however, one essentially ends up with two different datasets. Usually, these overlap partially - after all, most cytometry panels contain some of the most important cell type markers: CD3, CD4, CD8, CD14, CD19, CD56, etc. But it is still challenging to jointly analyze the data. The panel merging module of cyCombine can help researchers facing this exact challenge. Based on the overlapping markers, cyCombine is able to impute the values for the non-overlapping set, enabling co-analysis with all the markers across the applied panels.


Another scenario, which may be important for larger dataset integration - either across studies or platforms - is data acquired with two different, partially overlapping panels.


Methodology

cyCombine’s cross-dataset merging works by first performing a SOM-based (default 8x8) clustering of the datasets based on all of the overlapping markers. Then, for each cell in one of the datasets, the values for the missing markers are imputed using the values from cells in the other dataset that fall within the same SOM node. The imputations are made by simulating a multi-dimensional kernel density estimate: each cell’s missing values are imputed by randomly drawing a cell from the other dataset and adding Gaussian noise, drawn from a normal distribution with mean 0 and standard deviation corresponding to the bandwidth of each marker.
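
To make this concrete, here is a minimal illustrative sketch of the density-draw principle for a single SOM node. This is not cyCombine's internal code; the function, object, and column names are hypothetical.

# Illustrative sketch: impute missing markers for cells from dataset A that fall
# in one SOM node, using the cells from dataset B assigned to the same node
impute_node <- function(cells_A, cells_B, missing_markers) {
    # Per-marker bandwidth, estimated from the donor cells (dataset B)
    bw <- sapply(cells_B[, missing_markers, drop = FALSE], stats::bw.nrd0)

    # For each cell in A, draw a random donor cell from B and add Gaussian noise
    donor_idx <- sample(nrow(cells_B), nrow(cells_A), replace = TRUE)
    imputed <- cells_B[donor_idx, missing_markers, drop = FALSE]
    for (m in missing_markers) {
        imputed[[m]] <- imputed[[m]] + rnorm(nrow(cells_A), mean = 0, sd = bw[m])
    }
    imputed
}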

However, if there are fewer than 50 cells from the other dataset within a SOM node, the values for the missing channels are set to NA, as imputation would be unreliable.

We refer to the cyCombine vignette with a two-panel CyTOF dataset for a panel merging use-case.

In this vignette, we will instead focus a bit on the landscape of integration and imputation, before proceeding to an introduction to imputation of single misstained channels.


Discussion

In terms of advantages, the approach of cyCombine is designed to account for the underlying biology by using density draws. This is particularly important because cytometry data are not normally distributed.

In some cases, in which the variance in the marker expression is very low, measures like the median provide a good estimate of the values to be imputed. The unfortunate thing is that those cases are often biologically uninteresting, whereas markers with bimodal expression are often the center of attention. Such markers can either be relevant for distinguishing populations and downstream differential abundance analysis, or within a cluster they may provide insight into differential expression between experimental states.

When evaluating cyCombine, it is important to consider the following: because of the use of density draws, it is not possible to use a one-to-one comparison of cells in a training set before and after imputation. This is because you might actually get some ‘wrong’ estimates with our approach - but there should be a corresponding ‘wrong’ estimate making up for it. The goal of our SOM/density-based approach is to preserve the global structure of the dataset rather than the single cell’s assignment. An additional upside of cyCombine is that it does not exclude outlier values, as those should be sampled with the frequency at which they are actually present in the data. We will get back to this in the benchmarks below.


Naturally, this approach also has limitations: since the method relies on the information in the overlapping markers for imputation, this overlap must be of a reasonable size and quality. This is hard to define precisely, but overall, the overlapping set should contain the most important lineage-defining markers for imputation to work properly. Also, we would err on the side of caution when it comes to the number of markers imputed: do not impute 100 markers based on five overlapping ones. How far to push this will depend on the overlap in question and also on the samples. If the samples are highly similar, as in a two-panel analysis of the same samples, it is probably okay to impute, but if they are very different, such as samples from different diseases or tissues, one should be more careful.

Furthermore, we advise against using imputed values for differential expression analysis, as this can be highly sensitive to the imputed values. Instead, panel merging should be viewed as a tool for easing data visualization: you can get one UMAP for multi-panel data, and the population abundances will be directly comparable, sparing researchers the trouble of matching clusters (which is difficult) and studying separate plots because the data exist in two spaces. In other words, be careful when performing analysis based on imputed values, in order to avoid false inference.


Other integration and imputation tools

Previous work by Abdelaal et al. (2019) proposed CyTOFmerge, which provides approaches for 1) designing panels with an optimal overlap for imputation and 2) imputation based on k-Nearest Neighbors (kNN) and medians. However, the authors state that it is best to use their own panel design tool if panel merging is to work optimally.

While the idea of designing multiple panels so they have an ideal information-carrying overlap is very good and should be considered best practice, it is not carried out in many real-life studies, in which the bioinformatician analyzing the data may only have been included after the data was generated on the CyTOF instrument. This often results in a situation where two panels with sub-optimal overlap have been run, and that is the data available for analysis.

As a result, comparing cyCombine and CyTOFmerge directly is not entirely fair, as they are not designed with the same purpose in mind. However, we do provide an example below.


We must also mention CytoBackBone by Pereira et al. (2019), which also uses nearest-neighbor imputation. An interesting notion of acceptable neighbors is applied to avoid spurious imputations based on very distinct cells. This is analogous to cyCombine's choice not to impute when a SOM node contains no or very few cells from one of the datasets. Furthermore, they provide a very interesting test case, in which a healthy donor sample was stained with five combinations of the same set of 35 markers; we will focus on this in the next section. This work also tested the impact of the size of the set of overlapping markers using more than 65,000 different combinations, showing that the larger the overlap, the better the imputation quality.


Alternatively, one could look to the flow cytometry work by Lee et al. (2011), which suggests a clustering-based approach for imputation. However, this model was developed for only two data files and relies on domain knowledge regarding the marker expression of different cell types. Furthermore, it was only tested on lymphocyte data.


For combining and integrating datasets, approaches like QFMatch by Orlova et al. (2018), SIC by Meehan et al. (2019), and MetaCyto by Hu et al. (2018) have been presented. However, these are focused on combining the results of complete analyses with cluster matching and not on allowing truly integrated analysis from start to end.


Benchmark 1 - cyCombine, CytoBackBone, and CyTOFmerge (CytoBackBone dataset)

Along with the CytoBackBone publication, a whole blood dataset from a healthy donor was published (FlowRepository ID: FR-FCM-ZYV2). This donor was stained with five different panels, four of which were subsets of the full panel (#E). We downloaded the five pregated files from the repository and now we will read them and perform panel merging.
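
The reading step is not shown in full here. Below is a minimal sketch using cyCombine's prepare_data(); the directory layout is hypothetical, and we assume the default cofactor of 5 is appropriate for this CyTOF data.

library(cyCombine)

# Read and asinh-transform the five pregated FCS files (hypothetical local
# directory layout with one sub-directory per panel)
panels <- list()
for (p in c("A", "B", "C", "D", "E")) {
    panels[[p]] <- prepare_data(data_dir = file.path("CytoBackBone_data", p),
        cofactor = 5, down_sample = FALSE)
}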

Panel #A and #B together cover all 35 markers included in panel #E, with an overlap of 16 markers. Similarly, panel #C and #D together cover all 35 markers included in panel #E, with an overlap of the same 16 markers. These 16 markers are, as one would recommend, some of the key lineage markers: CD1c, CD3, CD4, CD8a, CD11c, CD14, CD16, CD19, CD32, CD64, CD66, CD86, CD123, CD141, Granzyme B, and HLA-DR. In the original article, the authors conduct an analysis of the importance of each backbone marker, in which markers are ranked based on their relative importance (CD3 and CD66 being the most important). We will not conduct a similar analysis, but instead focus on comparing cyCombine and CytoBackBone.

We will do panel merging with two panels at a time. Let us first focus on panels A and B.
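
The merging call itself is not shown above. A minimal sketch is given below, assuming impute_across_panels() takes the two tibbles, the overlapping channels, and the channels to impute in each dataset; the argument names and the panel-specific marker vectors are our assumptions.

# The 16 shared backbone markers (names as they appear in the data)
overlap_markers <- c("CD1c", "CD3", "CD4", "CD8a", "CD11c", "CD14", "CD16",
    "CD19", "CD32", "CD64", "CD66", "CD86", "CD123", "CD141",
    "GranzymeB", "HLADR")

# only_in_A / only_in_B are hypothetical vectors of the panel-specific markers
merged_AB <- impute_across_panels(dataset1 = panels$A, dataset2 = panels$B,
    overlap_channels = overlap_markers,
    impute_channels1 = only_in_B,
    impute_channels2 = only_in_A)

# The result contains one element per input dataset (cf. the code used further
# below in this vignette); combine them to get the full merged set
ab_combined <- dplyr::bind_rows(merged_AB$dataset1, merged_AB$dataset2)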

And then let us integrate panels C and D.

Now that both sets of panel pairs have been integrated, we can compare them to panel E, which contained the full set of markers, using the Earth Mover’s Distance. First, we will look at the densities of each marker - both for the overlapping channels and the imputed ones.

Overall, this looks quite good. It is also worth noting that the distributions of CD3 are not completely identical, despite the marker being included in all five panels - so identical distributions would be too strict a requirement to impose on the imputed values. Now, we will use the Earth Mover’s Distance (EMD) to look at the imputations more quantitatively.


Because CytoBackBone merging does not maintain the information about the original panel of each cell, we cannot look at per-sample EMDs after correction for CytoBackBone. In order to look at the same metrics for the two tools, we consequently relabel our merged cells from panels A and B with batch = 'A+B' and analogously for panels C and D. We can now calculate the EMDs for all batch-batch comparisons (A+B vs. C+D, A+B vs. E, and C+D vs. E).
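
A sketch of this relabeling and the EMD computation is shown below. It uses compute_emd() as it is also used later in this vignette; the requirement for 'batch' and 'label' columns reflects that later code and is otherwise our assumption.

# Give all cells from a merged panel pair a common batch label, and a single
# global cluster label so per-marker EMDs are computed across all cells
ab_combined$batch <- "A+B"
cd_combined$batch <- "C+D"   # hypothetical object for the merged C+D set
panel_E <- panels$E
panel_E$batch <- "E"

emd_input <- dplyr::bind_rows(ab_combined, cd_combined, panel_E)
emd_input$label <- 1

emds <- compute_emd(emd_input)[[1]]
# emds is a per-marker list of pairwise batch-vs-batch EMD matrices,
# e.g. emds[["CD3"]]["A+B", "E"]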


Now, we compare these results to those of CytoBackBone. In this process, we aim to apply the exact approach presented in the CytoBackBone GitHub and consequently, we load the files again to make sure we are not adding any bias to the analysis.


Now that the files are loaded, we are ready to perform the integration with CytoBackBone. Here, we choose to use the tool's normalization and a threshold of 3; this relatively high threshold means that more cells will be included in the integrated set.

=== CytoBackBone ===
Normalizing profiles
processing marker: CD66
processing marker: HLADR
processing marker: CD3
processing marker: CD64
processing marker: CD8a
processing marker: CD123
processing marker: CD16
processing marker: CD86
processing marker: CD32
processing marker: CD141
processing marker: CD1c
processing marker: CD11c
processing marker: CD14
processing marker: CD4
processing marker: GranzymeB
processing marker: CD19
profile 1 contains 162,741 cells
profile 2 contains 159,468 cells
===
profile 1 has 162,491 cells can be potentialy matched
profile 2 has 159,235 cells can be potentialy matched
===
maximum of number of cells that can be matched by CytoBackBone = 159,235 
===
step #1: 51,951 cells matched (107,284 cells remaining)
step #2: 74,112 cells matched (85,123 cells remaining)
step #3: 87,849 cells matched (71,386 cells remaining)
step #4: 97,961 cells matched (61,274 cells remaining)
step #5: 106,038 cells matched (53,197 cells remaining)
step #6: 112,641 cells matched (46,594 cells remaining)
step #7: 118,161 cells matched (41,074 cells remaining)
step #8: 122,867 cells matched (36,368 cells remaining)
step #9: 126,851 cells matched (32,384 cells remaining)
step #10: 130,064 cells matched (29,171 cells remaining)
step #11: 132,760 cells matched (26,475 cells remaining)
step #12: 135,036 cells matched (24,199 cells remaining)
step #13: 136,919 cells matched (22,316 cells remaining)
step #14: 138,483 cells matched (20,752 cells remaining)
step #15: 139,805 cells matched (19,430 cells remaining)
step #16: 140,934 cells matched (18,301 cells remaining)
step #17: 141,855 cells matched (17,380 cells remaining)
step #18: 142,728 cells matched (16,507 cells remaining)
step #19: 143,547 cells matched (15,688 cells remaining)
step #20: 144,375 cells matched (14,860 cells remaining)
step #21: 145,314 cells matched (13,921 cells remaining)
step #22: 146,303 cells matched (12,932 cells remaining)
step #23: 147,303 cells matched (11,932 cells remaining)
step #24: 148,325 cells matched (10,910 cells remaining)
step #25: 149,276 cells matched (9,959 cells remaining)
step #26: 150,106 cells matched (9,129 cells remaining)
step #27: 150,838 cells matched (8,397 cells remaining)
step #28: 151,531 cells matched (7,704 cells remaining)
step #29: 152,131 cells matched (7,104 cells remaining)
step #30: 152,644 cells matched (6,591 cells remaining)
step #31: 153,062 cells matched (6,173 cells remaining)
step #32: 153,384 cells matched (5,851 cells remaining)
step #33: 153,687 cells matched (5,548 cells remaining)
step #34: 153,943 cells matched (5,292 cells remaining)
step #35: 154,177 cells matched (5,058 cells remaining)
step #36: 154,441 cells matched (4,794 cells remaining)
step #37: 154,724 cells matched (4,511 cells remaining)
step #38: 154,968 cells matched (4,267 cells remaining)
step #39: 155,158 cells matched (4,077 cells remaining)
step #40: 155,270 cells matched (3,965 cells remaining)
step #41: 155,348 cells matched (3,887 cells remaining)
step #42: 155,398 cells matched (3,837 cells remaining)
step #43: 155,430 cells matched (3,805 cells remaining)
step #44: 155,450 cells matched (3,785 cells remaining)
step #45: 155,458 cells matched (3,777 cells remaining)
step #46: 155,459 cells matched (3,776 cells remaining)
====================
=== CytoBackBone ===
Normalizing profiles
processing marker: CD66
processing marker: HLADR
processing marker: CD3
processing marker: CD64
processing marker: CD8a
processing marker: CD123
processing marker: CD16
processing marker: CD86
processing marker: CD32
processing marker: CD141
processing marker: CD1c
processing marker: CD11c
processing marker: CD14
processing marker: CD4
processing marker: GranzymeB
processing marker: CD19
profile 1 contains 141,627 cells
profile 2 contains 163,035 cells
===
profile 1 has 141,497 cells can be potentialy matched
profile 2 has 162,642 cells can be potentialy matched
===
maximum of number of cells that can be matched by CytoBackBone = 141,497 
===
step #1: 47,446 cells matched (94,051 cells remaining)
step #2: 67,670 cells matched (73,827 cells remaining)
step #3: 80,015 cells matched (61,482 cells remaining)
step #4: 88,856 cells matched (52,641 cells remaining)
step #5: 95,807 cells matched (45,690 cells remaining)
step #6: 101,610 cells matched (39,887 cells remaining)
step #7: 106,666 cells matched (34,831 cells remaining)
step #8: 110,999 cells matched (30,498 cells remaining)
step #9: 114,652 cells matched (26,845 cells remaining)
step #10: 117,788 cells matched (23,709 cells remaining)
step #11: 120,422 cells matched (21,075 cells remaining)
step #12: 122,623 cells matched (18,874 cells remaining)
step #13: 124,502 cells matched (16,995 cells remaining)
step #14: 126,101 cells matched (15,396 cells remaining)
step #15: 127,496 cells matched (14,001 cells remaining)
step #16: 128,707 cells matched (12,790 cells remaining)
step #17: 129,745 cells matched (11,752 cells remaining)
step #18: 130,721 cells matched (10,776 cells remaining)
step #19: 131,637 cells matched (9,860 cells remaining)
step #20: 132,447 cells matched (9,050 cells remaining)
step #21: 133,179 cells matched (8,318 cells remaining)
step #22: 133,867 cells matched (7,630 cells remaining)
step #23: 134,545 cells matched (6,952 cells remaining)
step #24: 135,202 cells matched (6,295 cells remaining)
step #25: 135,858 cells matched (5,639 cells remaining)
step #26: 136,477 cells matched (5,020 cells remaining)
step #27: 137,076 cells matched (4,421 cells remaining)
step #28: 137,652 cells matched (3,845 cells remaining)
step #29: 138,144 cells matched (3,353 cells remaining)
step #30: 138,577 cells matched (2,920 cells remaining)
step #31: 138,970 cells matched (2,527 cells remaining)
step #32: 139,323 cells matched (2,174 cells remaining)
step #33: 139,664 cells matched (1,833 cells remaining)
step #34: 139,950 cells matched (1,547 cells remaining)
step #35: 140,169 cells matched (1,328 cells remaining)
step #36: 140,310 cells matched (1,187 cells remaining)
step #37: 140,405 cells matched (1,092 cells remaining)
step #38: 140,478 cells matched (1,019 cells remaining)
step #39: 140,540 cells matched (957 cells remaining)
step #40: 140,585 cells matched (912 cells remaining)
step #41: 140,630 cells matched (867 cells remaining)
step #42: 140,685 cells matched (812 cells remaining)
step #43: 140,746 cells matched (751 cells remaining)
step #44: 140,798 cells matched (699 cells remaining)
step #45: 140,853 cells matched (644 cells remaining)
step #46: 140,920 cells matched (577 cells remaining)
step #47: 140,968 cells matched (529 cells remaining)
step #48: 141,000 cells matched (497 cells remaining)
step #49: 141,017 cells matched (480 cells remaining)
step #50: 141,024 cells matched (473 cells remaining)
step #51: 141,026 cells matched (471 cells remaining)
====================

In the case of panels A + B, we observe that 155,459 cells are part of the integrated set. For C + D, that number is 141,026. Using cyCombine, all cells were integrated, resulting in 322,209 and 304,662 cells in the merged sets, respectively. This can be an important factor in downstream analysis, especially considering the potential loss of statistical power from having fewer cells in your dataset.

We will now look at the density plots for CytoBackBone.

Again, we will use the Earth Mover’s Distance (EMD) to look at the imputations more quantitatively.

The EMD ranges for cyCombine and CytoBackBone are similar - i.e. both tools seem to be most challenged with markers such as CXCR4 and MIP1b.


Now, it is time to test the performance of CyTOFmerge on the dataset.

ℹ SHA-1 hash of file is f43e56dc681374b1c2dabe025981f3c5751b7647

Attaching package: 'flowCore'
The following object is masked from 'package:cyCombine':

    normalize

We will now look at the density plots for CyTOFmerge.

Again, we will use the Earth Mover’s Distance (EMD) to look at the imputations more quantitatively.

The EMD ranges for CyTOFmerge are very similar to those of the other two tools - i.e. all tools seem to be most challenged with markers such as CXCR4 and MIP1b. However, visual inspection of the density plots, e.g. for CCR5, does reveal a somewhat different data distribution for the measured and CyTOFmerge-imputed sets.

A final comparison parameter is the runtime. Here, we do a small comparison of cyCombine, CytoBackBone, and CyTOFmerge on a machine with 40 cores and 100 GB of memory:

Time difference of 23.32506 secs
Time difference of 15.73887 mins
Time difference of 15.89344 mins

While none of these runtimes is terrible, it is worth noting that cyCombine is a lot faster than the two other tools.


In conclusion, the tools have similar performance in terms of the global distributions as seen from both the EMD calculations and the density plots. However, cyCombine integrates all cells and has the shortest runtime.


Benchmark 2 - cyCombine and CyTOFmerge (CyTOFmerge dataset)

In order to compare the performance of cyCombine panel merging more fairly to that of CyTOFmerge, we will demonstrate the imputation from the two tools using the VorteX set (39 markers), which is also used for validation in the CyTOFmerge article. We compare to the workflow presented in the CyTOFmerge vignette (available on their GitHub).

In the CyTOFmerge vignette, the most informative markers are selected using correlation and PCA. The top 11 markers are found to be MHCII, Ly6C, CD11b, B220, IgM, CD44, CD16_32, Sca1, CD43, IgD, and SiglecF, and these are the markers deemed to be suitable as the overlap for imputation.

Using an annotated set of ~500,000 cells, they simulate two panels by splitting the dataset in two parts having equal numbers of cells and sharing the 11 markers above. The remaining 28 markers are split, such that one half of the data has 14, and the other half has the other 14. The values of these 28 markers are then imputed for the opposite halves. Their raw and imputed data can be obtained from FlowRepository ID: FR-FCM-ZYVJ.
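
To make the setup concrete, here is a sketch of how such a split can be constructed and imputed with cyCombine; this mirrors what we do with cyCombine later in this section. The object names vortex and vortex_markers are hypothetical, and the impute_across_panels() arguments follow the assumptions made above.

set.seed(473)

# The 11 overlapping markers selected in the CyTOFmerge vignette
backbone <- c("MHCII", "Ly6C", "CD11b", "B220", "IgM", "CD44",
    "CD16_32", "Sca1", "CD43", "IgD", "SiglecF")

# Split the remaining 28 markers into two sets of 14 and the cells into halves
rest <- setdiff(vortex_markers, backbone)
set1 <- rest[1:14]
set2 <- rest[15:28]
half <- sample(rep(c(1, 2), length.out = nrow(vortex)))

panel1 <- vortex[half == 1, c(backbone, set1)]
panel2 <- vortex[half == 2, c(backbone, set2)]

# Impute set2 for the panel1 cells and set1 for the panel2 cells
imputed <- impute_across_panels(dataset1 = panel1, dataset2 = panel2,
    overlap_channels = backbone,
    impute_channels1 = set2,
    impute_channels2 = set1)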

Here, we load their original data (before splitting and imputation).

uneven number of tokens: 457
The last keyword is dropped.
uneven number of tokens: 457
The last keyword is dropped.

We then load their dataset after imputation.

uneven number of tokens: 457
The last keyword is dropped.
uneven number of tokens: 457
The last keyword is dropped.


In order to compare, we now use cyCombine on the original dataset, split into the same parts as for CyTOFmerge. Finally, imputation is performed.

[1] TRUE
[1] TRUE

  FALSE 
3600702 

  FALSE 
3600702 


Now, let us have a look at some density plots for the original and imputed values for the 28 imputed markers. We will look at them in sets of 14 markers imputed for each half of the data and compare the distributions of cyCombine values, CyTOFmerge values, and the original values.


A brief look at the plots does not reveal dramatic differences between the two tools. However, looking more closely at some markers, including Ly6G and CD45.2, there appears to be a higher degree of similarity between cyCombine’s imputed values and the original distribution.


We also evaluate the results using the EMD and the MAD score to get a quantitative measure.
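
The evaluation calls are not shown here. A minimal sketch is given below; the object names are hypothetical, and the argument order (original data first, imputed data second) is our assumption about cyCombine's evaluate_emd() and evaluate_mad() interfaces.

# Compare original values to the imputed values for one half of the data
# (original_half1 / imputed_half1 are hypothetical tibbles)
emd_eval <- evaluate_emd(original_half1, imputed_half1)
mad_eval <- evaluate_mad(original_half1, imputed_half1)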

Computing MAD for corrected data..
Computing MAD for uncorrected data..
The MAD score is: 0.09
Computing MAD for corrected data..
Computing MAD for uncorrected data..
The MAD score is: 0.11
Computing MAD for corrected data..
Computing MAD for uncorrected data..
The MAD score is: 0
Computing MAD for corrected data..
Computing MAD for uncorrected data..
The MAD score is: 0

Based on the EMDs and MAD scores, cyCombine produces imputed distributions which are more similar to the original data than CyTOFmerge. Additionally, there is no information loss for the VorteX set when using cyCombine. CyTOFmerge loses a bit more information, which is reflected in the narrower distributions for some markers as seen above. Furthermore, while still relatively small, the EMDs compared to the original data are larger than when using cyCombine.

In this specific dataset, one could however argue that the imputations produced by CyTOFmerge will most likely not impact any downstream processing (clustering, visualizations, etc.), since the distributions are still in the same range as the original dataset. However, why not aim to get the distributions as correct as possible?


One key take-away from this analysis, which also extends the point above, is that none of the 28 imputed markers has much variation in the original dataset. The majority of these markers would most likely not drive a clustering, since most values for almost all markers are below 2.5. This also means that these markers are relatively easy to impute by simply taking a median - which is what CyTOFmerge does. Because of CyTOFmerge’s feature selection, it is quite certain that the hardest-to-impute markers (those with all the variance) will not need imputing. This is of course a good thing when it comes to trusting the imputations, but the question remains whether imputations are actually interesting when the markers are unimodal.

Let us have a quick look at the density plots for the 11 ‘overlapping’ markers to visualize their relatively high degree of variance:

So, this result brings us back to the question of when we should impute. As stated above, a reasonable overlap is needed. But the size of this overlap depends on both the overlapping and non-overlapping markers. As an example, it may be reasonable to impute CD5 from an overlap containing CD3. But imputing a B cell marker from a set overlapping T cell-exclusive markers is potentially a bad idea. Ideally, one would take a dataset and test the imputation quality for all different combinations of overlapping markers ranging from 1 to n (where n is the number of markers in the data). However, if a dataset had 30 markers, the combinatorial space for this would have a size of 1,073,741,823. A dataset with 40 markers would have 1.1 × 10^12 options.
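
These counts are the number of non-empty subsets of n markers, 2^n - 1; the output below can be reproduced as follows (a likely reconstruction of the hidden code).

# Number of possible non-empty overlapping-marker subsets for 30 and 40 markers
sum(choose(30, 1:30))  # equal to 2^30 - 1
sum(choose(40, 1:40))  # equal to 2^40 - 1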

[1] 1073741823
[1] 1.099512e+12

Furthermore, there is a caveat with cyCombine when using a dataset that is split in two (like the VorteX example): because the two halves come from the exact same distribution, the tool will almost always perform nearly perfectly - it samples from the other half, which is ‘identical’, except for the added ‘noise’ from the bandwidth. This means that the performance is very good even when using only a very small set of overlapping markers. Of course, the clustering can also be impacted when only a few markers are in the overlap, since so little information is available in those cases.

Furthermore, this also means that since cyCombine almost “copies” information from one batch to another within each cluster, it is assumed that the samples one imputes from are similar to those that are imputed for. This is also why we recommend against differential expression analysis on imputed values.


As a consequence, a proper test of cyCombine makes more sense with different samples that share overlapping markers. Even between two healthy PBMC samples, there is variation in population frequencies, so these constitute a true test for cyCombine. That brings us to benchmark 3.


Benchmark 3 - cyCombine on different healthy donors

Let us consider some of the healthy donor PBMC samples from a set of CyTOF data from Ogishi et al. (2021) (FlowRepository ID: FR-FCM-Z3YK). We will use nine samples - five from batch 1 (HC01-HC05) and four from batch 6 (HC14, 16, 20, 24-33). These samples comprise 3,969,544 cells.

This data will be processed in two ways:

  1. Simple batch correction on all 38 markers.
  2. Removing 13 markers from each batch, batch correcting the overlap (n = 12), and performing imputation on the removed markers.

This will be followed by a mini-analysis of the generated datasets, which should aid in quantifying the degree of information loss both from having a smaller overlap and from performing imputations. It should also serve to demonstrate the robustness of the cyCombine imputation. In part 2, we will remove both a set of relatively easy-to-impute markers and a set of relatively hard-to-impute markers, based on the degree of variance of the markers.

First, we read the data set:


Now, we batch correct the data (two batches, no co-variates):
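
The correction call is not shown. A minimal sketch using cyCombine's batch_correct() is given below; the marker vector og_markers is the one used later in this vignette, while the normalization method and seed are arbitrary assumptions.

# Batch correct the two batches using all 38 markers; no covariate is supplied
ogishi_bc <- batch_correct(ogishi,
    markers = og_markers,
    norm_method = "scale",
    seed = 8265)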


To get an idea about the extent of batch effects and to survey the variance of each marker, we will make density plots for the corrected and uncorrected datasets.

Now that we have a ‘ground truth’ dataset, which is integrated across batches, we can make the imputation tests. First, let us look at the MAD for all the markers.

      CD27        CD4       CCR7       CD38     CD45RA      CD127        CD3 
2.29313112 2.26649472 2.05194380 1.98350880 1.54620747 1.32281328 1.21627332 
       CD8      HLADR      CD141       CD25       CD16       CD45      CXCR3 
1.19992311 0.73464730 0.54118871 0.54068695 0.51329102 0.49945252 0.47524044 
      CD19       CD56       CCR4       CD33       CCR5      CD11b       CCR6 
0.45914629 0.45295446 0.39147824 0.37861335 0.28524831 0.24840835 0.23826441 
      CD14      CD161      gdTCR       CD20       CD1c      CD66b     CX3CR1 
0.20050894 0.20046114 0.17705803 0.17669307 0.17222885 0.14221900 0.10603968 
   CLEC12A      CD123      CRTH2       CD86       CD24        PD1      CD209 
0.07585008 0.04622157 0.03923231 0.03807778 0.03767714 0.02924555 0.01442583 
     CD117      CD11c       CD57 
0.00000000 0.00000000 0.00000000 

We observe that the markers with the most variation are CD27, CD4, CCR7, CD38, CD45RA, CD127, CD3, CD8, and HLA-DR. This is also reflected in the density plots, where the mentioned markers have clear bimodal patterns.

For a hard-to-impute set, we could have an overlap containing: CD57, CD117, CD209, PD1, CD86, CD123, CCR6, CX3CR1, CD20, gdTCR, CD11b, and CD161. For an easy-to-impute set, we could pick an overlap consisting of: CD27, CD4, CCR7, CD38, CD45RA, CD127, CD3, CD8, CD56, CD19, HLA-DR, and CD45.

Since there are no batch effects within the partitions (all samples are from the same batch already), we can proceed to the combining of the two sets. The first step is batch correction on the overlapping markers, and the second step is imputation of the non-overlapping markers.
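
A sketch of how the two scenarios can be set up is shown below. The overlap vectors come directly from the lists above (using the column names as they appear in the MAD output); the pre-split tibbles and removed-marker vectors are hypothetical, and the impute_across_panels() arguments follow the assumptions made earlier.

# The two 12-marker overlaps described above
overlap_hard <- c("CD57", "CD117", "CD209", "PD1", "CD86", "CD123",
    "CCR6", "CX3CR1", "CD20", "gdTCR", "CD11b", "CD161")
overlap_easy <- c("CD27", "CD4", "CCR7", "CD38", "CD45RA", "CD127",
    "CD3", "CD8", "CD56", "CD19", "HLADR", "CD45")

# batch1_easy / batch6_easy are hypothetical tibbles: each batch with its 13
# non-overlapping markers removed and the 12 overlapping markers batch
# corrected; removed_b1 / removed_b6 hold the corresponding removed markers
impute_easy <- impute_across_panels(dataset1 = batch1_easy,
    dataset2 = batch6_easy,
    overlap_channels = overlap_easy,
    impute_channels1 = removed_b1,
    impute_channels2 = removed_b6)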

Now, we have four versions of the data - the original set, the batch corrected set, the ‘easy’ imputed set, and the ‘hard’ imputed set. The next step is to investigate how these datasets look - both in terms of direct similarity to the original set, but also in relation to clustering, visualization and abundance analysis.

# Now we have the three datasets after their respective corrections:
# ogishi_bc, ogishi_easy, and ogishi_hard (the latter two constructed here)
ogishi_easy <- rbind.data.frame(impute_easy$dataset1[, colnames(ogishi_bc)],
    impute_easy$dataset2[, colnames(ogishi_bc)])
ogishi_hard <- rbind.data.frame(impute_hard$dataset1[, colnames(ogishi_bc)],
    impute_hard$dataset2[, colnames(ogishi_bc)])

# Set a label column to all 1's for intra-batch comparisons (globally)
ogishi$label <- ogishi_bc$label <- ogishi_hard$label <- ogishi_easy$label <- 1
ogishi_bc$batch <- paste0(ogishi_bc$batch, "_bc")
ogishi_easy$batch <- paste0(ogishi_easy$batch, "_easy")
ogishi_hard$batch <- paste0(ogishi_hard$batch, "_hard")


# Remove any lines with an NA
ogishi_easy <- tidyr::drop_na(ogishi_easy)
ogishi_hard <- tidyr::drop_na(ogishi_hard)

# Now let us calculate some EMDs for the comparisons within each batch -
# batch 1 original compared to the three 'generated' batch 1 data sets -
# and similar for batch 6
emds <- compute_emd(rbind(ogishi, ogishi_bc, ogishi_easy, ogishi_hard))[[1]]

boxplot(sapply(emds, function(x) x["Batch1", "Batch1_bc"]),
        sapply(emds, function(x) x["Batch1", "Batch1_easy"]),
        sapply(emds, function(x) x["Batch1", "Batch1_hard"]),
        sapply(emds, function(x) x["Batch6", "Batch6_bc"]),
        sapply(emds, function(x) x["Batch6", "Batch6_easy"]),
        sapply(emds, function(x) x["Batch6", "Batch6_hard"]),
        names = paste("Batch", rep(c(1, 6), each = 3), c("bc", "easy", "hard")),
        main = "EMDs when comparing to original data",
        col = RColorBrewer::brewer.pal(3, "Set2"))


The EMDs are actually quite similar for each marker when comparing the original set, within a batch, to each of the three processed sets. However, there is a longer tail for the “hard” imputation set - and also a slightly higher maximum for the “easy” set than for plain batch correction.


We can also look at the UMAPs for the three sets.


#### SOM clustering and dimensionality reduction - using all 38 markers
#### Making a function to run on the three corrected sets
cluster_and_plot <- function(df, grid_size = 6, umap_size = 20000,
    name = "data", seed = 8265) {

    # Generate labels using a SOM - since we are going for more direct
    # clustering, we will use a smaller grid (6x6) -- perhaps we need to
    # perform meta clustering to make it proper analysis-like?
    labels <- df %>%
        create_som(seed = seed, xdim = grid_size, ydim = grid_size,
            markers = og_markers)

    df <- df %>%
        dplyr::mutate(som = as.factor(labels))

    # Down-sampling for plotting
    set.seed(seed)
    df_sliced <- df %>%
        dplyr::slice_sample(n = umap_size)

    # UMAP plot for all markers - batch corrected
    umap <- df_sliced %>%
        cyCombine::plot_dimred(name = name, type = "umap", markers = og_markers,
            return_coord = T)

    cmb_df <- df_sliced %>%
        mutate(UMAP1 = umap$dimred[, 1], UMAP2 = umap$dimred[,
            2])

    # Get a single id from each cluster to indicate the
    # stability
    set.seed(seed)
    ids <- cmb_df %>%
        dplyr::group_by(som) %>%
        dplyr::sample_n(1) %>%
        dplyr::ungroup() %>%
        dplyr::pull(id)

    p <- ggplot(cmb_df, aes(x = UMAP1, y = UMAP2)) + geom_point(aes(color = som),
        alpha = 0.3, size = 0.4) + geom_text(aes(label = ifelse(id %in%
        ids, as.character(som), "")), size = 6, hjust = 0, vjust = 0) +
        theme_bw() + theme(plot.title = element_text(hjust = 0.5)) +
        ggtitle(paste0("UMAP - ", name)) + guides(color = guide_legend(override.aes = list(alpha = 1,
        size = 1), title = "SOM node"))

    p1 <- list()
    for (m in og_markers) {
        p1[[m]] <- ggplot(df, aes_string(x = "som", y = m)) +
            geom_violin(aes(fill = som)) + theme_bw()
    }
    p2 <- ggpubr::ggarrange(plotlist = p1, common.legend = T)

    return(list(umap = p, vln = p2))
}

p_bc <- cluster_and_plot(ogishi_bc, name = "batch corrected")
p_easy <- cluster_and_plot(ogishi_easy, name = "'easy-to-impute' overlap")
p_hard <- cluster_and_plot(ogishi_hard, name = "'hard-to-impute' overlap")

cowplot::plot_grid(p_bc$umap, p_easy$umap, p_hard$umap, ncol = 3)

Simply looking at these UMAPs, the structure of the dataset appears fairly well preserved when comparing batch correction and the ‘easy-to-impute’ imputation. In the ‘hard-to-impute’ UMAP, there is less global separation of the sets of cells, indicating that the data structure has been altered. But it is important to note that these comments are only based on visual inspection in 2D.

In any case, we conclude that as long as the cells in the dataset imputed from and the dataset imputed for are comparable, imputation can work when using only a few overlapping markers. However, if the overlap is not informative for distinguishing cell types (i.e. contains no lineage markers), one may end up with a skewed distribution of clusters, and this can lead to wrong conclusions for the final set.



Imputation of single misstained channels

Imagine that you have run an expensive and time-consuming cytometry experiment. You have 100+ samples, and despite the ability to use barcoding for sample multiplexing, you have run several batches. However, now that you visualize the data, you notice that the expression level for a marker in one of the batches looks very different from the others. It may be strongly over-stained - or maybe there is no signal at all - perhaps the antibody was damaged during the sample preparation.

In any case, you really wish that you could correct this marker in this single batch. This is because you need the marker for clustering - and if it is nonsensical in one batch, it could ruin the whole analysis plan. Enter cyCombine! In particular, the salvage_problematic function, which allows imputation of a single marker in user-specified batches. The function uses the information found in all other markers in the combined dataset to determine the most likely value for the single marker. This vignette will demonstrate how.

But first, we will discuss when using this function is a good idea as opposed to just relying on batch correction.


When to apply imputation

In some experiments with misstained channels, the analyst may be faced with the dilemma of whether to use batch correction or imputation. The answer will depend on the individual case. In general, if we consider a problem similar to the one presented for XCL1 in panel 1 of the DFCI data (below), it becomes a question of whether one believes that the channel is simply over-stained, but with a direct correlation between measurements and actual expression, or whether there is no relation between the two. In the first case, batch correction is the answer, as it represents the most data-preserving approach. However, if the data cannot be trusted, it is better to rely on imputation.

Another case could be that a marker was left out completely - maybe there was no more antibody left, or maybe it was forgotten during staining. In that case, the answer is imputation, since there is no signal in the raw dataset to rely on for batch correction. But if the problem is due to using a different antibody lot with a markedly different signal, or perhaps a different concentration, it is likely that batch correction would suffice. The same holds for cases with higher background in some batches.


Methodology

The misstained channel correction relies on the same principles as the across-panel imputation, but instead of transferring information in one dataset to another, it utilizes the different batches of a single dataset.

First, a SOM (default 8x8) is calculated based on all the non-problematic markers. Then, for cells in the affected batch(es), the simulated multi-dimensional kernel density estimate (KDE) approach described above is applied to the problematic channel, and imputations are performed using probability draws from the KDE, based on cells from the unaffected batches.


Imputation example

This is data from a study of CLL patients and healthy donors at the Dana-Farber Cancer Institute (DFCI). The protein expression was quantified using mass cytometry for 126 samples (20 healthy donors). In this dataset, the XCL1 staining was very strong for batch 1 compared to the other six batches. The signal is so much stronger that imputation seems like a relevant solution.


Pre-processing data

We start by loading some packages.


We are now ready to load the CyTOF data. We have set up a panel file in CSV format, so the correct marker information can be extracted from there.


We then proceed with reading the CyTOF dataset and converting it to a tibble format, which is easy to process. We use cofactor = 5 (the default) in this case.

Reading 126 files to a flowSet..
Warning in cyCombine::convert_flowset(., metadata = metadata, filename_col = filename_col, : The sample JGG0002_Int1.fcs in the metadata was not found in the provided folder and will be ignored.
The sample JGG0002_Int2.fcs in the metadata was not found in the provided folder and will be ignored.
The sample JGG0019_Int1.fcs in the metadata was not found in the provided folder and will be ignored.
The sample JGG0019_Int2.fcs in the metadata was not found in the provided folder and will be ignored.
The sample JGG0047_Int1.fcs in the metadata was not found in the provided folder and will be ignored.
The sample JGG0047_Int2.fcs in the metadata was not found in the provided folder and will be ignored.
The sample JGG0235_Int1.fcs in the metadata was not found in the provided folder and will be ignored.
The sample JGG0235_Int1b.fcs in the metadata was not found in the provided folder and will be ignored.
The sample JGG0235_Int2.fcs in the metadata was not found in the provided folder and will be ignored.
The sample JGG0386_Int1.fcs in the metadata was not found in the provided folder and will be ignored.
The sample JGG0180_1.fcs in the metadata was not found in the provided folder and will be ignored.
The sample JRB0339_2.fcs in the metadata was not found in the provided folder and will be ignored.
The sample KRR0843_2.fcs in the metadata was not found in the provided folder and will be ignored.
The sample TJK0105_1.fcs in the metadata was not found in the provided folder and will be ignored.
The sample TJK0349_1.fcs in the metadata was not found in the provided folder and will be ignored.
The sample TJK0930_1.fcs in the metadata was not found in the provided folder and will be ignored.
Extracting expression data..
Your flowset is now converted into a dataframe.
Transforming data using asinh with a cofactor of 5..
Done!
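
The imputation call itself is not shown. A minimal sketch is given below, assuming salvage_problematic() takes the combined tibble, the batch(es) to correct, and the affected channel; the argument names and the object name are our assumptions.

# Impute XCL1 for all cells in batch 1, using the other six batches as reference
dfci_imputed <- salvage_problematic(df = dfci,
    correct_batches = 1,
    channel = "XCL1")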


Started imputation for XCL1 in batch(es) 1
Creating SOM grid..
Performing density draws.
Caution! Analysis based on imputed values can lead to false inferences. We recommend only using imputed marker expressions for visualization and not for differential expression analysis.


Evaluating performance

Let us look at some plots to visualize the correction - the marker distributions before and after:


Notice that only the expression of XCL1 in the single batch, batch 1, has changed compared to the raw data. This is exactly what we wanted. Furthermore, note how nicely the distribution of the marker now corresponds to those in the other six batches.

 

Contact