Welcome to the cyCombine reference manual! This manual will show how to use the main functions of cyCombine - either using the recommended (all-in-one) workflow or the modular workflow, which is more customizable to fit specific data inputs.


cyCombine works on dataframes/tibbles in R. If you already have data in R or in a readable text format, the conversion to dataframe/tibble should be relatively straightforward. However, it is expected that most users will start their analysis from FCS files.


Alongside a directory of FCS files, a metadata file and a panel file are assumed to be present. These files are helpful in generating the tibble structure that cyCombine can process. That is, besides containing the protein marker expression per cell, the tibble should also contain the batch of origin of each cell, and it is generally easier to work with data that also encompasses the sample IDs and, potentially, the conditions. This information should be contained in a metadata file.

The panel information is also useful for several reasons:

  1. FCS files contain some columns that should not be included in batch correction, such as “Time” and “Event_length”. Furthermore, there may be “empty” channels, which should also be ignored during analysis.
  2. Sometimes, FCS files do not contain the proper protein names, and a panel file can help ensure that the FCS files are read correctly.


The metadata of our example has the following columns:

Filename, batch, condition, Patient_id

In addition, when replicated samples/anchors are included, it may be relevant to also use the column anchor as described below.


And the panel should contain the columns:

Channel, Antigen, Type


By setting Type = ‘none’ for a channel in the panel file, the columns to exclude become easy to identify.
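As an illustration, here is a minimal sketch of how the panel file can be used to extract the markers for the downstream steps; the file name ("panel.csv") and the use of read.csv() are assumptions that should be adjusted to your own setup.

```r
library(dplyr)

# Read the panel file (a CSV is assumed here; adjust to your file format)
panel <- read.csv("panel.csv")

# Markers to include downstream: every channel not marked as Type = 'none'
markers <- panel %>%
  dplyr::filter(Type != "none") %>%
  dplyr::pull(Antigen)
```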



Prepare data

The first step of a cyCombine analysis is to convert the relevant FCS files into a workable tibble. We introduce two approaches for this.
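For the recommended all-in-one approach, prepare_data() reads the FCS files and builds the tibble in a single call. The sketch below is hedged: the directory and file paths are placeholders, and the argument names follow the package documentation but should be checked against ?prepare_data for your installed version.

```r
library(cyCombine)

# Directory containing the FCS files; metadata and panel paths are illustrative
data_dir <- "data/raw"

# All-in-one conversion of FCS files to a tibble
uncorrected <- prepare_data(
  data_dir = data_dir,
  metadata = file.path(data_dir, "metadata.csv"),
  filename_col = "Filename",
  batch_ids = "batch",
  condition = "condition",
  sample_ids = "Patient_id",
  panel = file.path(data_dir, "panel.csv"),
  panel_channel = "Channel",
  panel_antigen = "Antigen",
  panel_type = "Type",
  cofactor = 5          # asinh cofactor; 5 is typical for CyTOF data
)
```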


Modular workflow

If you want more control, you can adjust any step of this approach. Feel free to skip this segment if the all-in-one method above worked fine for you.

This example gives an overview of the input parameters that can be modified.
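Below is a hedged sketch of the modular route using compile_fcs(), convert_flowset(), and transform_asinh(). It shows the main parameters rather than every possible one; paths, column names, and the markers vector (from the panel example above) are placeholders, and the exact argument names should be verified in the respective help pages.

```r
library(cyCombine)
library(dplyr)

data_dir <- "data/raw"

# Step 1: read the FCS files into a flowSet
flowset <- compile_fcs(data_dir = data_dir, pattern = "\\.fcs$")

# Step 2: convert the flowSet to a tibble and attach sample/batch/condition
# information from the metadata file
df <- convert_flowset(
  flowset = flowset,
  metadata = file.path(data_dir, "metadata.csv"),
  filename_col = "Filename",
  sample_ids = "Patient_id",
  batch_ids = "batch",
  condition = "condition"
)

# Step 3: asinh-transform the protein markers
uncorrected <- df %>%
  transform_asinh(markers = markers, cofactor = 5)
```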



Batch correction

Now that the data is converted to a tibble format, it is straightforward to perform batch correction with cyCombine. Again, we demonstrate two different workflows.

Besides the functionality used below, it is also possible to directly inform cyCombine about which samples are replicates. For this, please refer to the relevant section below.
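A hedged sketch of both routes follows: the all-in-one batch_correct() call and the modular normalize() / create_som() / correct_data() sequence. The argument values (SOM grid size, normalization method, seed) are illustrative, and the exact argument names should be checked against the help pages of your installed version.

```r
library(cyCombine)

# Recommended all-in-one workflow
corrected <- batch_correct(
  uncorrected,
  markers = markers,
  covar = "condition",     # optional covariate to preserve during correction
  norm_method = "scale",   # or "rank"
  xdim = 8, ydim = 8,      # SOM grid used to cluster cells before correction
  seed = 473
)

# Modular workflow: normalize, cluster with a SOM, then correct per cluster
normalized <- normalize(uncorrected, markers = markers, norm_method = "scale")
labels <- create_som(normalized, markers = markers, xdim = 8, ydim = 8)
corrected <- correct_data(uncorrected, label = labels, markers = markers, covar = "condition")
```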


Batch correction with replicates (recommended workflow)

There are several ways to inform cyCombine which samples are replicates. One option is to add an extra column (e.g. “anchor”) to your metadata file, which prepare_data() will then carry into your uncorrected data. This information may also be added manually to the uncorrected data, or it can be provided directly to batch_correct() as a vector.

To perform the batch correction, use batch_correct() with the anchor parameter set to the column name (e.g. “anchor”) or the name of the vector containing the information. cyCombine will automatically check whether the covariate and anchor are confounded with each other or with the batch.
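A minimal, hedged example, assuming the uncorrected tibble contains an "anchor" column carried over from the metadata:

```r
# Anchor-aware correction: "anchor" names the column marking replicated samples
corrected <- batch_correct(
  uncorrected,
  markers = markers,
  covar = "condition",
  anchor = "anchor"
)
```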



Evaluate performance

The cyCombine package includes two performance metrics: the Earth Mover’s Distance (EMD) reduction and the Median Absolute Deviation (MAD) score.

The EMD reduction is implemented as the first performance metric; EMDs are computed for both the uncorrected and corrected data, and value pairs where both have an EMD < 2 are removed before calculating the reduction:


\[EMD_{reduction} = \frac{\sum_{i=1}^n {(EMD_{before_i} - EMD_{after_i})}} {\sum_{i=1}^n {EMD_{before_i}}}\]
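A hedged usage sketch: evaluate_emd() compares the two tibbles given a shared clustering column. Here, SOM labels from create_som() serve as that clustering; the cell_col argument and the $reduction element are assumptions based on the package documentation and should be checked against ?evaluate_emd.

```r
library(cyCombine)
library(dplyr)

# Attach a shared clustering to both tibbles (SOM nodes are one option)
labels <- create_som(corrected, markers = markers)
corrected   <- dplyr::mutate(corrected, som = labels)
uncorrected <- dplyr::mutate(uncorrected, som = labels)

# EMD evaluation; pairs where both EMDs are < 2 are filtered before the reduction
emd <- evaluate_emd(uncorrected, corrected, cell_col = "som")
emd$reduction   # the EMD reduction defined above (assumed element name)
```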

The MAD score is implemented as the second performance metric; MADs are computed for both the uncorrected and corrected data per cluster, per marker, and per batch. The MAD score is then calculated as the median of the absolute differences in MAD across all values:


\[MAD_{score} = \mathrm{median}_{i=1}^n (|MAD_{before_i} - MAD_{after_i}|)\]

Because the MAD score quantifies the information ‘loss’, the ideal tool has a small MAD score.
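Correspondingly for the MAD score, reusing the shared clustering column from the EMD example above; again, the cell_col argument and the $score element are assumptions to be verified against ?evaluate_mad.

```r
# MAD evaluation using the same shared clustering column as above
mad <- evaluate_mad(uncorrected, corrected, cell_col = "som")
mad$score   # the MAD score defined above (assumed element name)
```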



 

Contact