This vignette will walk you through the full pipeline of
cyDefine. The adapt reference step is optional but relevant
when you want to adapt a reference of a larger marker panel (e.g., when
using the PBMC reference offered with cyDefine) to a
smaller query panel.
The vignette will cover both how you can load in your own reference and how you can use the built-in PBMC reference.
The vignette will cover data preparation, mapping of marker naming
conventions, reference adaptation, batch correction via
cyCombine, cell type annotation, identification of
unassigned cells, and visualizations.
cyDefine builds on top of cyCombine. Both
packages can be installed from GitHub:
# To ensure Rstudio looks up BioConductor packages run:
setRepositories(ind = 1:2)
# Then install package with
remotes::install_github("biosurf/cyCombine")
remotes::install_github("biosurf/cyDefine")library(cyCombine)
library(cyDefine)
Attaching package: 'cyDefine'
The following object is masked from 'package:cyCombine':
batch_correct
The modules of cyDefine work on data.frames
with cells by rows and markers (along with potential metadata) by
columns.
Importantly, the reference needs a celltype column, and
both reference and query need a column named batch if batch
effects are to be corrected for within the reference and/or query.
Otherwise, only technical variations between the reference and query
will be corrected.
The following sections demonstrate how either the inbuilt or a custom
reference is loaded, and how a query is loaded from FCS files. A
Seurat or SingleCellExperiment object can
easily be converted to a data.frame as well, just remember
to include the relevant metadata as a column.
cyDefine comes with a universal PBMC reference. See
?pbmc_reference for more details. In short, it can be
loaded and optionally stored with the following function:
reference <- get_reference(
"pbmc",
store = FALSE
)In case more references are included, they can be listed with
cyDefine:::get_all_references().
The first step is to convert a directory of FCS files into a
data.frame that cyDefine can process. In that
process, we utilize the convenience functions from
cyCombine to add optional metadata to the
data.frame, such as sample IDs, potential conditions, batch
information, as well as other features. For a custom reference, a
“celltype” column is also required.
Information on the marker panel can either be given as a character
vector of marker names or as a panel file, i.e., a CSV/Excel file with
columns that specify the FCS channel and antigen. See
?prepare_data for panel arguments.
The remaining information can either be extracted directly from the
file names, if present within them, or it can be given in a metadata
CSV/Excel file with columns filename, and
e.g. batch, condition, etc. See
?prepare_data for metadata arguments.
If you want to extract the metadata from the FCS file names, use the
extract_filename_regex and
extract_filename_into arguments. More information can be
found in the function documentation ?prepare_data, but here
is an example:
# Markers of interest
markers <- c('CD16', 'CD26', 'CD8a', 'CD33', 'CD161', 'CCR10', 'CCR7', 'CCR9',
'CD3', 'ICOS', 'CD45RA', 'CD27', 'CXCR3', 'NKG2A', 'CD14', 'CD127',
'CD57', 'HLADR', 'CD62L', 'CD19', 'IgG', 'CD4', 'IgD', 'IgA',
'CD86', 'CCR6', 'CD20', 'CD56', 'TCRgd', 'CD123', 'CD11c', 'CD25',
'CXCR5', 'CD38', 'a4b7', 'PD1', 'CLA')
# Prepare reference tibble from directory of FCS files
reference <- prepare_data(
data_dir = system.file("extdata", package = "cyDefine", mustWork = TRUE),
pattern = "Plate3_Sample1", # Using Sample1 from Plate3 as reference
markers = markers,
clean_colnames = TRUE, # Set to FALSE, if you want marker names exactly like in your panel file or FCS files.
transform = TRUE, # ArcSinh transformation
cofactor = 5, # CyTOF: 5, flow cytometry: 150, spectral: 6000
derand = TRUE, # Derandomization, only for CyTOF
compensate = FALSE, # Compensation for spectral overlap, only for flow
extract_filename_regex = "\\d{2}Feb18_Helios2_(Plate\\d+)_(Sample\\d+)_HIMCctrl_(.+)$",
extract_filename_into = c("batch", "sample", "celltype"))
# Prepare reference tibble from directory of FCS files
query <- prepare_data(
data_dir = system.file("extdata", package = "cyDefine", mustWork = TRUE),
pattern = "Plate4_Sample1", # Using Sample1 from Plate4 as query
markers = markers,
transform = TRUE, # ArcSinh transformation
cofactor = 5, # CyTOF: 5, flow cytometry: 150, spectral: 6000
derand = TRUE, # Derandomization, only for CyTOF
compensate = FALSE, # Compensation for spectral overlap, only for flow
extract_filename_regex = "^\\d{2}Feb18_Helios2_(Plate\\d+)_(Sample\\d+)",
extract_filename_into = c("batch", "sample"))Check that column names of reference and query include all the
expected information (incl. overlapping marker names): If marker names
do not match, see ?map_marker_names.
colnames(reference) [1] "CD16" "CD26" "CD8a" "CD33" "CD161" "CCR10"
[7] "CCR7" "CCR9" "CD3" "ICOS" "CD45RA" "CD27"
[13] "CXCR3" "NKG2A" "CD14" "CD127" "CD57" "HLADR"
[19] "CD62L" "CD19" "IgG" "CD4" "IgD" "IgA"
[25] "CD86" "CCR6" "CD20" "CD56" "TCRgd" "CD123"
[31] "CD11c" "CD25" "CXCR5" "CD38" "a4b7" "PD1"
[37] "CLA" "batch" "sample" "celltype" "id"
colnames(query) [1] "CD16" "CD26" "CD8a" "CD33" "CD161" "CCR10" "CCR7" "CCR9"
[9] "CD3" "ICOS" "CD45RA" "CD27" "CXCR3" "NKG2A" "CD14" "CD127"
[17] "CD57" "HLADR" "CD62L" "CD19" "IgG" "CD4" "IgD" "IgA"
[25] "CD86" "CCR6" "CD20" "CD56" "TCRgd" "CD123" "CD11c" "CD25"
[33] "CXCR5" "CD38" "a4b7" "PD1" "CLA" "batch" "sample" "id"
all(colnames(query) %in% colnames(reference))[1] TRUE
Additionally, check that the cell types of the reference are appropriate as these will be transferred to the query:
unique(reference$celltype) [1] "B-IgDnegCD27neg" "B-IgDnegCD27pos"
[3] "B-IgDposCD27neg" "B-IgDposCD27pos"
[5] "Basophil" "CD4-CM"
[7] "CD4-EM" "CD4-Eff"
[9] "CD4-Naive" "CD4CD8dblneg"
[11] "CD4CD8dblpos" "CD4pos_Treg"
[13] "CD56brightNK" "CD8-CM"
[15] "CD8-EM" "CD8-Eff"
[17] "CD8-Naive" "Monocytes-CD16negCD14neg"
[19] "Monocytes-CD16negCD14pos" "Monocytes-CD16posCD14neg"
[21] "Monocytes-CD16posCD14pos" "NK"
[23] "NKT" "Non-pDC-non-mDC"
[25] "NonNK-nonDC" "Tcells_TCRgd"
[27] "mDCs" "pDCs"
We see that two cell types do not represent canonical populations and are better described by ‘unassigned’. We will label them accordingly:
unassigned_names <- c("Non-pDC-non-mDC", "NonNK-nonDC")
# rename cell types
reference$celltype[reference$celltype %in% unassigned_names] <- "unassigned"If your data is in another format than CSV, you should convert it
appropriately. prepare_data can also convert a
flowSet by using the “flowset” argument instead of
“data_dir”. If needed transform_asinh() can
transform a data.frame in cases where
prepare_data is not needed.
To make cell type identification with cyDefine as simple
as possible, we offer a wrapper of the four main modules:
reference adaptation, batch correction,
celltype classification, and
identification of unassigned cells. Thereby, if you are not
interested in any intermediary information, cyDefine can be
performed by running a single function. However, running the workflow in
steps provides more control over hyperparameters and intermediate
outputs.
Details on the various steps in the workflow is presented
in the next section. Functions to visualize the
results are demonstrated in the last
section.
classified <- cyDefine(
reference,
query,
markers, # markers/features to use
num.threads = 1, # Number of threads for parallelization
mtry = floor(length(markers)/3), # Number of markers to use in random forest classification
num.trees = 500, # Number of trees to build in classification
use.weights = TRUE, # Whether to weigh the celltype assignment by abundance weights
adapt_reference = TRUE, # Whether the reference should be adapted to the markers available in the query
using_pbmc = FALSE, # Whether the PBMC reference is used
batch_correct = TRUE, # Whether to integrate the reference and query using cyCombine (see ?cyCombine::batch_correct for more options)
xdim = 6, ydim = 6, # Clustering dimensions for the integration
identify_unassigned = TRUE, # Whether unassignable cells should be identified
train_on_unassigned = FALSE, # Whether unassigned cells should be identified unsupervised or using unassigned cells in the reference
unassigned_name = "unassigned", # Name of unassigned cells in the reference
seed = 332,
verbose = TRUE
)cyDefine() returns three objects: “query” - the
classified query, “reference” - the adapted and integrated reference,
and “model” - the random forest model trained on the reference and used
for classification.
For more control over parameters and intermediate output, the
cyDefine workflow can be run in steps.
The first step is to adapt the reference to the markers in the query. If these markers are identical, you can safely skip this step.
mtry <- floor(length(markers)/3)
reference <- adapt_reference(
reference = reference,
markers = markers,
num.threads = 4,
mtry = mtry,
min_f1 = .7,
using_pbmc = FALSE,
verbose = TRUE
)# ------- Population merging - Round 1 ------- #
Running classification to identify similar populations
Merging groups of similar populations
Merging:
Monocytes-CD16posCD14neg, CD8-Eff, CD8-Naive, mDCs
unassigned, NK
# ------- Population merging - Round 2 ------- #
Running classification to identify similar populations
Reference adapted!
If your reference and query samples do not originate from the same
experimental batch, technical variance, referred to as batch effects,
can lead to disparate signal intensities of cells of the same phenotype,
between the samples. Therefore, cyDefine offers batch
correction via cyCombine (Pedersen et
al., 2022), which allows for correction of batch effects within as
well as across technologies. (This step is only strictly necessary if
the reference and query stem from different batches/experiments). See
?cyCombine::batch_correct for additional parameters and the cyCombine vignettes
for demonstrations of use cases.
corrected <- cyDefine::batch_correct(
reference,
query,
markers = markers,
covar = NULL,
xdim = 6, ydim = 6,
seed = 332)
reference <- corrected$reference
query <- corrected$queryAn initial projection can gauge if any celltypes in the reference are missing from the query data, deeming them redundant to include in the classification.
# Run initial projection to exclude redundant cell types
reference <- excl_redundant_populations(
reference = reference,
query = query,
markers = markers,
min_cells = 50,
min_pct = 0.05,
num.threads = 4,
mtry = mtry,
seed = 332
)Making initial projection to filter out redundant cell types of the reference
Excluding the following redundant celltypes from the reference:
Basophil
Now, our reference and query data are directly comparable.
cyDefine enables the transfer of celltype labels from your
reference dataset onto your query. The classified celltypes will be
saved in a column named model_prediction. See
?classify_cells for a full description of parameters. The
random forest model trained on the reference is also returning in
classified$model. The model can be reused with the argument
load_model = classified$model.
classified <- classify_cells(
reference = reference,
query = query,
markers = markers,
mtry = mtry,
seed = 332
)Growing trees.. Progress: 81%. Estimated remaining time: 7 seconds.
table(classified$query$model_prediction)
B-IgDnegCD27neg
463
B-IgDnegCD27pos
435
B-IgDposCD27neg
6089
B-IgDposCD27pos
292
CD4-CM
8434
CD4-EM
5076
CD4-Eff
880
CD4-Naive
10134
CD4CD8dblneg
572
CD4CD8dblpos
203
CD4pos_Treg
1824
CD56brightNK
484
CD8-CM
1244
CD8-EM
4353
Monocytes-CD16negCD14neg
928
Monocytes-CD16negCD14pos
10472
Monocytes-CD16posCD14neg / CD8-Eff / CD8-Naive / mDCs
16587
Monocytes-CD16posCD14pos
201
NKT
1763
Tcells_TCRgd
2240
pDCs
178
unassigned / NK
6327
After classifying all query cells to one of the canonical cell types
of your reference, you might want to filter out unassignable cells,
i.e., cells that do not seem to actually belong to the population to
which it was predicted to belong. Unassigned cells can represent
artifacts or cell types that are not defined in the reference, i.e.,
they could potentially represent biological novelties.
cyDefine has two strategies for identifying unassigned
cells, a supervised strategy using unassigned cells of the reference in
model training, and an unsupervised strategy that does not rely on
unassigned cells in the reference. In this example, the supervised
strategy is applied as the query sample is expected to hold similar cell
types as the reference, and the reference have defined “unassigned”
cells. The new cell type labels will be saved in a column named
predicted_celltype.
classified$query <- identify_unassigned(
query = classified$query,
reference = reference,
model = classified$model,
markers = markers,
mtry = mtry,
train_on_unassigned = FALSE,
unassigned_name = "unassigned",
seed = 332
)
table(classified$query$predicted_celltype)
B-IgDnegCD27neg
324
B-IgDnegCD27pos
333
B-IgDposCD27neg
5853
B-IgDposCD27pos
209
CD4-CM
6042
CD4-EM
3848
CD4-Eff
639
CD4-Naive
8719
CD4CD8dblneg
142
CD4CD8dblpos
12
CD4pos_Treg
1055
CD56brightNK
414
CD8-CM
976
CD8-EM
3627
Monocytes-CD16negCD14neg
595
Monocytes-CD16negCD14pos
9834
Monocytes-CD16posCD14neg / CD8-Eff / CD8-Naive / mDCs
14711
Monocytes-CD16posCD14pos
35
NKT
1359
Tcells_TCRgd
2049
pDCs
156
unassigned
12512
unassigned / NK
5735
cyDefine provides functionality for visualizing your
results by UMAP, cell type abundance, and marker expression.
Additionally, you can create a diagram of the merged cell populations
during reference adaptation, if any celltypes were merged.
plot_diagram(reference, fontcolor_nodes = c("unassigned" = "white"))# Define a color per cell type + black for unassigned
celltype_colors <- get_distinct_colors(unique(reference$celltype),
add_unassigned = TRUE)
# UMAP of reference and query
p_umap <- plot_umap(
reference,
classified$query,
markers = markers,
colors = celltype_colors,
query_col = "predicted_celltype",
ref_col = "celltype",
legend_title = "Cell type",
add_centroids = FALSE,
highlight_labels = FALSE,
sample_n = 5000,
return_data = TRUE)
p_umap$plot# Plot expression of a set of markers
plot_markers(
p_umap$query,
markers = c("CD4", "CD8a", "CD19", "CD11c"),
title = "Query marker expression",
ncol = 2)# Barplot of cell type abundances in query
plot_abundance(
predicted_populations = classified$query$predicted_celltype,
colors = celltype_colors)# Compare celltype abundances between query and reference
plot_abundance_comparison(
reference = reference,
query = classified$query,
query_col = "predicted_celltype")# Heatmap of marker expressions per cell type in query
plot_heatmap(
classified$query,
population_col = "predicted_celltype",
markers_to_plot = markers)# Celltype-wise expression correlation between reference and query
plot_expression_correlation(
reference = reference,
query = classified$query,
markers = markers,
query_col = "predicted_celltype")The alluvium plot can illustrate how celltypes are merged or if any missclassifications happen, if compared to true values. Here, we compare before and after unassigned cells are predicted to illustrate how many cells of each type are labeled unassigned.
# Alluvium plot of annotation changes - true vs predicted or model_prediction vs predicted_celltype
plot_alluvium(
classified$query,
predicted = "predicted_celltype",
true = "model_prediction",
color = celltype_colors,
n = 1000,
names = c("Model vs predicted"),
title = "Comparing model with unassigned prediction")# Importance of each marker in the classification
plot_marker_importance(classified$model, top_n = 15)
#### Marker usage
# Importance of each marker in the classification per celltype
plot_marker_usage(classified$model)# Importance of each marker in the classification per celltype
plot_marker_importance_cell(reference, markers = markers)# Importance of each marker in the classification per celltype
plot_top_marker_importance(reference)If you have any issues or questions regarding the use of
cyDefine, please do not hesitate to raise an issue on
GitHub. In this way, others may also benefit from the answers and
discussions.
Contact