scMaSigPro
is an R package designed for the serial analysis of single-cell
RNA-seq (scRNA-Seq) data along inferred pseudotime. It builds upon the maSigPro
Bioconductor package to identify genes with significant expression changes across different
branching paths in a pseudotime-ordered scRNA-Seq dataset. This vignette illustrates
the basic workflow of scMaSigPro
, providing a step-by-step guide for users using a
small simulated dataset.
scMaSigPro 0.0.4
scMaSigPro
is a polynomial regression-based approach inspired by the maSigPro
Bioconductor package tailored for scRNA-Seq data. It first discretizes single cell
expression along inferred pseudotime while preserving order. Afterwards it applies
the maSigPro model
to pinpoint genes exhibiting significant expression profile differences among
branching paths and pseudotime.
Currently, scMaSigPro
is available on GitHub and can be installed as follows:
# Install Dependencies
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install(version = "3.14")
BiocManager::install(c("SingleCellExperiment", "maSigPro", "MatrixGenerics", "S4Vectors"))
To install scMaSigPro
from GitHub, use the following R code:
# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
install.packages("devtools")
}
# Install scMaSigPro
devtools::install_github("BioBam/scMaSigPro",
ref = "main",
build_vignettes = FALSE,
build_manual = TRUE,
upgrade = "never",
force = TRUE,
quiet = TRUE
)
Here, we demonstrate the basic workflow of scMaSigPro
using a simulated dataset
included in the package, simulated via splatter.
scMaSigPro
package and the dataset# Set Seed for Reproducibility
set.seed(123)
# Load Package
library(scMaSigPro)
# Load example data
data("splat.sim", package = "scMaSigPro")
# Load
suppressPackageStartupMessages(require(ggplot2))
suppressPackageStartupMessages(require(SingleCellExperiment))
# Extract PCA
df <- as.data.frame(reducedDim(splat.sim)[, c(1:2)])
df$Step <- splat.sim@colData$Step
df$Group <- splat.sim@colData$Group
# Plot the data
ggplot(
df,
aes(x = PC1, y = PC2, color = Step, shape = Group)
) +
geom_point() +
theme_minimal()
If you are interested in learning how this data is simulated and what information
it contains, you can try ?splat.sim
in the R console to read about the details.
The UMAP visualization of our dataset reveals a bifurcating trajectory, illustrating how one cell type diverges into two distinct cell types. This bifurcation is driven by the selective expression of genes along the pseudotime. In the UMAP plot, the two branching paths of differentiation are represented by points with distinct shapes, each corresponding to a different cell type.
Furthermore, the points in the UMAP are color-coded based on a simulated pseudotime-like variable, termed Step, from the Splatter package. This color gradient effectively captures the progression of cells along the pseudotime axis, providing insights into the temporal dynamics of cellular differentiation.
scMaSigPro
ObjectscMaSigPro
offers various options to directly convert widely used single-cell
related S4 objects directly to the scmpClass class (now referred to as scmpObject
),
such as SingleCellExperiment
or Monocle3
’s cell_data_set
class. Here, using the
simulated object (of the SingleCellExperiment
class) that we have just loaded into the R environment, we will
demonstrate the direct conversion to scmpObject
object.
In our case, we already have pseudotime and path information stored in the
colData
slot of the splat.sim
. The Step column holds the information for the
simulated steps of each path, which we can treat as a pseudotime variable, and
the Group column has information regarding branching path, i.e., to which
branching path each of the cell belongs. We can view this information by executing
View(as.data.frame(splat.sim@colData)).
## Cell Batch Group ExpLibSize Step sizeFactor
## Cell1 Cell1 Batch1 Path2 56391.37 42 0.8847501
## Cell2 Cell2 Batch1 Path1 54923.49 41 0.7779386
## Cell3 Cell3 Batch1 Path1 52122.28 23 0.7687344
## Cell4 Cell4 Batch1 Path1 59039.15 42 0.9017475
## Cell5 Cell5 Batch1 Path2 64317.37 79 1.0111086
We will use as_scmp()
from scMaSigPro
and direct it to use the columns present in
the dataset as the columns for Pseudotime
and Path
. To do this, we have to enable
the parameter labels_exist
and then pass the existing column names as a named list.
# Helper Function to convert annotated SCE object to scmpObject
scmp_ob <- as_scmp(
object = splat.sim, from = "sce",
align_pseudotime = FALSE,
verbose = TRUE,
additional_params = list(
labels_exist = TRUE,
exist_ptime_col = "Step",
exist_path_col = "Group"
)
)
## Supplied object: SingleCellExperiment object
## Overwritting columns in cell level metadata, 'Group'
## is replaced by 'Path' and 'Step' is replaced by 'Pseudotime'.
Once the object is created, we can type the name of the object in the R console to view various attributes, such as the dimensions (number of cells and number of genes), the available branching paths, and the range of pseudotime.
## Class: ScMaSigProClass
## nCells: 200
## nFeatures: 100
## Pseudotime Range: 1 100
## Branching Paths: Path2, Path1
sc.squeeze()
scMaSigPro
provides a comprehensive function for pseudo-bulking along the
pseudotime continuum, named sc.squeeze()
. This function discretizes the original
pseudotime values into bins using histogram binning methods. The sc.squeeze()
function includes various parameters that can be tailored depending on the
characteristics of the data. Here, we will show the basic usage with the default
parameters:
scmp_ob <- sc.squeeze(scmp_ob)
The console output of scMaSigPro
is dynamic and displays more attributes as the
analysis progresses. To view the results of the binning procedure, we can simply
type the object’s name into the console:
## Class: ScMaSigProClass
## nCells: 200
## nFeatures: 100
## Pseudotime Range: 1 100
## Branching Paths: Path1, Path2
## Binned Pseudotime: 1-8(Range), 4.5(Mean),
## Number of bins-> Path1: 8 Path2: 8
## Average bin Size-> Path1: 12 Path2: 13
Here, we observe that the original pseudotime, which ranged from 1 to 100 with two cells at each step and different paths, is now rescaled to a range of 1 to 8, with each path having an equal number of bins (8+8 in total). We also see the average bin sizes per path, indicating the average number of cells that make up a single bin for each path.
We can also visually inspect the binning process using a tile plot:
plotBinTile(scmp_ob)
The tile plot provides a clear view of the number of cells in each bin and how the sizes of bins compare with those of other paths in the same scmp_binned_pseudotime.
scMaSigPro
, being the successor of maSigPro
, utilizes the same polynomial regression model. Let’s consider a case with two branching paths along the same
pseudotime scale, modeled with a quadratic polynomial. The model can be represented
as follows:
## Loading required package: knitr
\[\begin{align*} Y_{i} &\sim \text{NegativeBinomial}(\mu_{i}, \theta = 10) \\ \log(\mu_{i}) &= \beta_{0} + \beta_{1} \cdot (\text{Path}_{B}\text{vsPath}_{A})_{i} + \beta_{2} \cdot T_{\text{Pseudotime}}^{\text{Binned}_{i}} \\ &\quad + \beta_{3} \cdot (T_{\text{Pseudotime}}^{\text{Binned}_{i}} \cdot \text{Path}_{B_{i}}) + \beta_{4} \cdot (T_{\text{Pseudotime}}^{\text{Binned}_{i}})^2 \\ &\quad + \beta_{5} \cdot ((T_{\text{Pseudotime}}^{\text{Binned}_{i}})^2 \cdot \text{Path}_{B_{i}}) + \text{Offset}_{i} + \omega_{i} \cdot \epsilon_{i}\\ \end{align*}\]
To construct this model, we use sc.set.poly()
to include quadratic terms:
# Polynomial Degree 2
scmp_ob <- sc.set.poly(scmp_ob, poly_degree = 2)
Once the model is stored, we can visualize the corresponding polynomial using the showPoly()
function:
showPoly(scmp_ob)
## [1] "beta0 + beta1*Path2vsPath1 + beta2*scmp_binned_pseudotime + beta3*scmp_binned_pseudotimexPath2 + beta4*scmp_binned_pseudotime2 + beta5*scmp_binned_pseudotime2xPath2"
Similarly, we can fit a cubic polynomial by setting the polynomial degree to 3:
# Polynomial Degree 3
scmp_ob <- sc.set.poly(scmp_ob, poly_degree = 3)
showPoly(scmp_ob)
## [1] "beta0 + beta1*Path2vsPath1 + beta2*scmp_binned_pseudotime + beta3*scmp_binned_pseudotimexPath2 + beta4*scmp_binned_pseudotime2 + beta5*scmp_binned_pseudotime2xPath2 + beta6*scmp_binned_pseudotime3 + beta7*scmp_binned_pseudotime3xPath2"
However, for simplicity, we will explore a polynomial of degree 1. Note that
increasing the polynomial degree enhances scMaSigPro
’s performance in capturing
exponential and nonlinear gene expression patterns:
# Polynomial Degree 1
scmp_ob <- sc.set.poly(scmp_ob, poly_degree = 1)
showPoly(scmp_ob)
## [1] "beta0 + beta1*Path2vsPath1 + beta2*scmp_binned_pseudotime + beta3*scmp_binned_pseudotimexPath2"
In the above model we have:
beta0: Accounts for differences in expression from the start to the end.
_beta1*Path2vsPath1_: Captures differences between the branching paths, assuming path-1 does not change along pseudotime.
_beta2*scmp_binned_pseudotime_: Reflects differences across pseudotime.
_beta3*scmp_binned_pseudotimexPath2_: Represents interaction between pseudotime and path differences.
To identify genes that demonstrate significant changes along the pseudotime contniuum,
we use the sc.p.vector()
. This function, adapted from the original maSigPro
,
includes additional parameters such as the use of offsets
and weights
.
scMaSigPro
expects raw counts as input because it models data using a negative
binomial distribution, a count distribution, so the counts should not be normalized
to continuous values. To account for library size differences, it uses offsets
,
which are the log of size factors, similar to DESeq2.
For those who prefer to supply continuous data normalized using methods like log(1+x), the distribution can be changed during model fitting.
We can execute sc.p.vector()
as follows:
# Detect non-flat profiles
scmp_ob <- sc.p.vector(scmp_ob,
offset = TRUE, p_value = 0.05, verbose = FALSE,
log_offset = TRUE
)
scmp_ob
## Class: ScMaSigProClass
## nCells: 200
## nFeatures: 100
## Pseudotime Range: 1 100
## Branching Paths: Path1, Path2
## Binned Pseudotime: 1-8(Range), 4.5(Mean),
## Number of bins-> Path1: 8 Path2: 8
## Average bin Size-> Path1: 12 Path2: 13
## Polynomial Order: 1
## No. of Significant Profiles: 50
The console output reveals that scMaSigPro
detected 51 genes with non-flat
profiles.
Having identified genes with significant profiles, we can refine their polynomial
models using sc.t.fit()
. This function evaluates each term of the polynomial
model. In our case, it will assess which among “beta0 + beta1*Path2vsPath1 +
beta2*scmp_binned_pseudotime + beta3*scmp_binned_pseudotimexPath2”
significantly contributes to the differences. To execute sc.t.fit()
,
we proceed as follows:
# Model refinement
scmp_ob <- sc.t.fit(scmp_ob, verbose = FALSE)
scmp_ob
## Class: ScMaSigProClass
## nCells: 200
## nFeatures: 100
## Pseudotime Range: 1 100
## Branching Paths: Path1, Path2
## Binned Pseudotime: 1-8(Range), 4.5(Mean),
## Number of bins-> Path1: 8 Path2: 8
## Average bin Size-> Path1: 12 Path2: 13
## Polynomial Order: 1
## No. of Significant Profiles: 50
## No. of Influential Features: 8
With our refined models in hand, we now focus on identifying genes showing
significant differences with pseudotime, among paths, or both. For this purpose,
we use the sc.filter()
function. Our aim is to select models with a
relatively high \(R^2\), indicating simple linear relationships. The vars
parameter in sc.filter()
allows us to extract different sets of
significant genes. Setting vars = 'all'
retrieves all non-flat profiles
identified in sc.p.vector()
with \(R^2>=\) the specified threshold. The option
vars = 'groups
fetches genes per path, resulting in two gene lists that
demonstrate associative significance among paths, helping us identify genes
associated with one path or the other along the pseudotime continuum. The
vars = 'each'
option finds significance for each term in the polynomial.
In our case, we are interested in genes differentially expressed between paths
and over pseudotime continuum, so we will choose vars = 'groups
.
scmp_ob <- sc.filter(
scmpObj = scmp_ob,
rsq = 0.7,
vars = "groups",
intercept = "dummy",
includeInflu = TRUE
)
By setting the vars parameter to “groups”, the function will add genes with \(R^2\) >= 0.7 to the object. To explore the number of genes per group, we will make an upset plot:
plotIntersect(scmp_ob)
Here, we observe that 23 genes belong to both Path2vsPath1 and Path1, indicating that these genes not only change along pseudotime but also exhibit significantly different expression between the two paths. Additionally, there are 10 genes uniquely associated with Path2vsPath1. This implies that Path2 has 10 genes that are significantly differentially expressed over time, using Path1 genes as a reference. Let’s explore a few of these genes:
FigureA <- plotTrend(scmp_ob, "Gene9", logs = TRUE, logType = "log", lines = TRUE)
FigureB <- plotTrend(scmp_ob, "Gene95", logs = TRUE, logType = "log", lines = TRUE)
FigureC <- plotTrend(scmp_ob, "Gene10", logs = TRUE, logType = "log", lines = TRUE)
FigureD <- plotTrend(scmp_ob, "Gene92", logs = TRUE, logType = "log", lines = TRUE)
ggpubr::ggarrange(FigureA, FigureB, FigureC, FigureD,
ncol = 2, nrow = 2,
labels = c("A", "B", "C", "D")
)