Welcome to ‘The cytometrist’s primer to single cell data analysis’. This site was made to aid cytometrists who are interested in learning more about how data from flow and mass cytometry platforms can be analyzed. Hopefully, this will not only enable cytometrists to perform more analyses on their own, but also give them a better understanding of which tools are available to bioinformaticians in the downstream data handling. It is pivotal that data analysis is considered already at the time an experiment is designed, and this resource should therefore provide a solid reference that allows cytometrists to design even better experiments in the future.
We have aimed to explain everything to people who have almost no experience with data science and algorithms, and we are sure that while some readers will benefit from basic explanations, others may want to skip certain sections. This page will provide an overview of all the most important tools, divided into subsections similar to those of the data scientist’s primer.
In the handling of cytometry data, the most common approach is still manual gating using two marker channels at a time. As the number of measurable parameters increases, however, capturing all the relevant data patterns using this traditional approach becomes more and more challenging. If you run an experiment with 12 protein channels, there are 66 unique pairs of markers to consider, but if the number of channels is increased to 40, which is not unlikely with CyTOF experiments, there are 780 unique channel pairs, making it impossible to visualize all pairs in a reasonable way. Fortunately, there is a solution to visualizing high-dimensional data in 2D so we can inspect the data structure. This solution is, appropriately, termed dimensionality reduction.
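The pair counts above follow from the number of ways to pick 2 channels out of n, which is n(n-1)/2. As a minimal sketch in Python (the function name `n_channel_pairs` is our own and is only used for illustration), the numbers can be checked like this:

```python
from math import comb


def n_channel_pairs(n_channels: int) -> int:
    """Number of unique two-channel plots for a panel with n_channels markers."""
    # Choosing 2 channels out of n, ignoring order: n * (n - 1) / 2
    return comb(n_channels, 2)


for n in (12, 40):
    print(f"{n} channels give {n_channel_pairs(n)} unique channel pairs")
```

The count grows roughly with the square of the number of channels, which is why inspecting every biaxial plot quickly becomes impractical.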
Reducing the dimensionality of any dataset will lead to a loss of some information (unless all but two of your channels contain the exact same information or correlate perfectly). However, it is extremely useful for visualizations, and some tools also employ dimensionality reduction before grouping cells into subsets (see Clustering section). The goal of the existing dimensionality reduction methods is naturally to preserve as much information from the high-dimensional space as possible while compressing it into two dimensions. In certain cases this will also highlight the main data patterns.
Within cytometry, three methods are commonly applied: ‘Principal component analysis’, ‘t-distributed Stochastic Neighbor Embedding’, and ‘Uniform manifold approximation and projection’. Each of these is presented below.
Principal component analysis (PCA) is the fastest, most easily interpretable, and most widespread way of projecting high-dimensional data onto a low-dimensional subspace. To understand how it works, let us consider a three-dimensional CyTOF dataset (a dataset with three channels). We wish to find a representation that maintains as much of the variability in the data as possible, and therefore we identify the first “principal component”: the straight line along which the data is most spread out (see orange arrow below). How this is done does not matter too much, but if you want the basics in place, check out this page. The next step is then to find the straight line, perpendicular to the first, along which the data has the most variance (green arrow below). This can in principle go on until you have as many principal components as you have dimensions in your data. What the method essentially does is create a new coordinate system, in which the data set is centered, as seen below.
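For readers who like to see the steps in code, the procedure can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the implementation used by any particular cytometry tool: the three-channel data is simulated and purely made up, the function name `pca` and all variable names are our own, and the principal components are found with a singular value decomposition, which is one common way of doing it.

```python
import numpy as np


def pca(data, n_components=2):
    """Project data (cells x channels) onto its first principal components."""
    # Center each channel so the new coordinate system has its origin
    # in the middle of the data.
    centered = data - data.mean(axis=0)
    # The rows of vt are mutually perpendicular directions, ordered by how
    # much of the variance in the data lies along each of them.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    # Fraction of the total variance captured by each component.
    explained = singular_values**2 / np.sum(singular_values**2)
    # Coordinates of every cell in the new coordinate system.
    scores = centered @ components.T
    return scores, components, explained[:n_components]


# Simulated (made-up) three-channel dataset: one shared signal plus noise.
rng = np.random.default_rng(0)
n_cells = 1000
signal = rng.normal(size=(n_cells, 1))
data = np.hstack([3.0 * signal, 2.0 * signal, 0.5 * signal])
data = data + rng.normal(scale=0.3, size=(n_cells, 3))

scores, components, explained = pca(data, n_components=2)
print("2D coordinates per cell:", scores.shape)
print("fraction of variance per component:", explained)
```

Here `scores` holds the two-dimensional coordinates you would plot, and `explained` indicates how much of the original variability survives the projection, which ties back to the information loss discussed earlier.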