`scMaSigPro`

is an R package designed for the serial analysis of single-cell
RNA-seq (scRNA-Seq) data along inferred pseudotime. It builds upon the `maSigPro`

Bioconductor package to identify genes with significant expression changes across different
branching paths in a pseudotime-ordered scRNA-Seq dataset. This vignette illustrates
the basic workflow of `scMaSigPro`

, providing a step-by-step guide for users using a
small simulated dataset.

scMaSigPro 0.0.4

`scMaSigPro`

is a polynomial regression-based approach inspired by the maSigPro
Bioconductor package tailored for scRNA-Seq data. It first discretizes single cell
expression along inferred pseudotime while preserving order. Afterwards it applies
the maSigPro model
to pinpoint genes exhibiting significant expression profile differences among
branching paths and pseudotime.

Currently, `scMaSigPro`

is available on GitHub and can be installed as follows:

```
# Install Dependencies
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install(version = "3.14")
BiocManager::install(c("SingleCellExperiment", "maSigPro", "MatrixGenerics", "S4Vectors"))
```

To install `scMaSigPro`

from GitHub, use the following R code:

```
# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
install.packages("devtools")
}
# Install scMaSigPro
devtools::install_github("BioBam/scMaSigPro",
ref = "main",
build_vignettes = FALSE,
build_manual = TRUE,
upgrade = "never",
force = TRUE,
quiet = TRUE
)
```

Here, we demonstrate the basic workflow of `scMaSigPro`

using a simulated dataset
included in the package, simulated via splatter.

`scMaSigPro`

package and the dataset```
# Set Seed for Reproducibility
set.seed(123)
# Load Package
library(scMaSigPro)
# Load example data
data("splat.sim", package = "scMaSigPro")
```

```
# Load
suppressPackageStartupMessages(require(ggplot2))
suppressPackageStartupMessages(require(SingleCellExperiment))
# Extract PCA
df <- as.data.frame(reducedDim(splat.sim)[, c(1:2)])
df$Step <- splat.sim@colData$Step
df$Group <- splat.sim@colData$Group
# Plot the data
ggplot(
df,
aes(x = PC1, y = PC2, color = Step, shape = Group)
) +
geom_point() +
theme_minimal()
```

If you are interested in learning how this data is simulated and what information
it contains, you can try `?splat.sim`

in the R console to read about the details.

The UMAP visualization of our dataset reveals a bifurcating trajectory, illustrating how one cell type diverges into two distinct cell types. This bifurcation is driven by the selective expression of genes along the pseudotime. In the UMAP plot, the two branching paths of differentiation are represented by points with distinct shapes, each corresponding to a different cell type.

Furthermore, the points in the UMAP are color-coded based on a simulated
pseudotime-like variable, termed *Step*, from the Splatter package. This
color gradient effectively captures the progression of cells along the pseudotime
axis, providing insights into the temporal dynamics of cellular differentiation.

`scMaSigPro`

Object`scMaSigPro`

offers various options to directly convert widely used single-cell
related S4 objects directly to the scmpClass class (now referred to as `scmpObject`

),
such as `SingleCellExperiment`

or `Monocle3`

’s `cell_data_set`

class. Here, using the
simulated object (of the `SingleCellExperiment`

class) that we have just loaded into the R environment, we will
demonstrate the direct conversion to `scmpObject`

object.

In our case, we already have pseudotime and path information stored in the
`colData`

slot of the `splat.sim`

. The *Step* column holds the information for the
simulated steps of each path, which we can treat as a pseudotime variable, and
the *Group* column has information regarding branching path, i.e., to which
branching path each of the cell belongs. We can view this information by executing
`View(as.data.frame(splat.sim@colData)).`

```
## Cell Batch Group ExpLibSize Step sizeFactor
## Cell1 Cell1 Batch1 Path2 56391.37 42 0.8847501
## Cell2 Cell2 Batch1 Path1 54923.49 41 0.7779386
## Cell3 Cell3 Batch1 Path1 52122.28 23 0.7687344
## Cell4 Cell4 Batch1 Path1 59039.15 42 0.9017475
## Cell5 Cell5 Batch1 Path2 64317.37 79 1.0111086
```

We will use `as_scmp()`

from `scMaSigPro`

and direct it to use the columns present in
the dataset as the columns for `Pseudotime`

and `Path`

. To do this, we have to enable
the parameter `labels_exist`

and then pass the existing column names as a named list.

```
# Helper Function to convert annotated SCE object to scmpObject
scmp_ob <- as_scmp(
object = splat.sim, from = "sce",
align_pseudotime = FALSE,
verbose = TRUE,
additional_params = list(
labels_exist = TRUE,
exist_ptime_col = "Step",
exist_path_col = "Group"
)
)
```

`## Supplied object: SingleCellExperiment object`

```
## Overwritting columns in cell level metadata, 'Group'
## is replaced by 'Path' and 'Step' is replaced by 'Pseudotime'.
```

Once the object is created, we can type the name of the object in the R console to view various attributes, such as the dimensions (number of cells and number of genes), the available branching paths, and the range of pseudotime.

```
## Class: ScMaSigProClass
## nCells: 200
## nFeatures: 100
## Pseudotime Range: 1 100
## Branching Paths: Path2, Path1
```

`sc.squeeze()`

`scMaSigPro`

provides a comprehensive function for pseudo-bulking along the
pseudotime continuum, named `sc.squeeze()`

. This function discretizes the original
pseudotime values into bins using histogram binning methods. The `sc.squeeze()`

function includes various parameters that can be tailored depending on the
characteristics of the data. Here, we will show the basic usage with the default
parameters:

`scmp_ob <- sc.squeeze(scmp_ob)`

The console output of `scMaSigPro`

is dynamic and displays more attributes as the
analysis progresses. To view the results of the binning procedure, we can simply
type the object’s name into the console:

```
## Class: ScMaSigProClass
## nCells: 200
## nFeatures: 100
## Pseudotime Range: 1 100
## Branching Paths: Path1, Path2
## Binned Pseudotime: 1-8(Range), 4.5(Mean),
## Number of bins-> Path1: 8 Path2: 8
## Average bin Size-> Path1: 12 Path2: 13
```

Here, we observe that the original pseudotime, which ranged from 1 to 100 with two cells at each step and different paths, is now rescaled to a range of 1 to 8, with each path having an equal number of bins (8+8 in total). We also see the average bin sizes per path, indicating the average number of cells that make up a single bin for each path.

We can also visually inspect the binning process using a tile plot:

`plotBinTile(scmp_ob)`

The tile plot provides a clear view of the number of cells in each bin and how the sizes of bins compare with those of other paths in the same scmp_binned_pseudotime.

`scMaSigPro`

, being the successor of `maSigPro`

, utilizes the same polynomial regression model. Let’s consider a case with two branching paths along the same
pseudotime scale, modeled with a quadratic polynomial. The model can be represented
as follows:

`## Loading required package: knitr`

\[\begin{align*} Y_{i} &\sim \text{NegativeBinomial}(\mu_{i}, \theta = 10) \\ \log(\mu_{i}) &= \beta_{0} + \beta_{1} \cdot (\text{Path}_{B}\text{vsPath}_{A})_{i} + \beta_{2} \cdot T_{\text{Pseudotime}}^{\text{Binned}_{i}} \\ &\quad + \beta_{3} \cdot (T_{\text{Pseudotime}}^{\text{Binned}_{i}} \cdot \text{Path}_{B_{i}}) + \beta_{4} \cdot (T_{\text{Pseudotime}}^{\text{Binned}_{i}})^2 \\ &\quad + \beta_{5} \cdot ((T_{\text{Pseudotime}}^{\text{Binned}_{i}})^2 \cdot \text{Path}_{B_{i}}) + \text{Offset}_{i} + \omega_{i} \cdot \epsilon_{i}\\ \end{align*}\]

To construct this model, we use `sc.set.poly()`

to include quadratic terms:

```
# Polynomial Degree 2
scmp_ob <- sc.set.poly(scmp_ob, poly_degree = 2)
```

Once the model is stored, we can visualize the corresponding polynomial using the `showPoly()`

function:

`showPoly(scmp_ob)`

`## [1] "beta0 + beta1*Path2vsPath1 + beta2*scmp_binned_pseudotime + beta3*scmp_binned_pseudotimexPath2 + beta4*scmp_binned_pseudotime2 + beta5*scmp_binned_pseudotime2xPath2"`

Similarly, we can fit a cubic polynomial by setting the polynomial degree to 3:

```
# Polynomial Degree 3
scmp_ob <- sc.set.poly(scmp_ob, poly_degree = 3)
showPoly(scmp_ob)
```

`## [1] "beta0 + beta1*Path2vsPath1 + beta2*scmp_binned_pseudotime + beta3*scmp_binned_pseudotimexPath2 + beta4*scmp_binned_pseudotime2 + beta5*scmp_binned_pseudotime2xPath2 + beta6*scmp_binned_pseudotime3 + beta7*scmp_binned_pseudotime3xPath2"`

However, for simplicity, we will explore a polynomial of degree 1. Note that
increasing the polynomial degree enhances `scMaSigPro`

’s performance in capturing
exponential and nonlinear gene expression patterns:

```
# Polynomial Degree 1
scmp_ob <- sc.set.poly(scmp_ob, poly_degree = 1)
showPoly(scmp_ob)
```

`## [1] "beta0 + beta1*Path2vsPath1 + beta2*scmp_binned_pseudotime + beta3*scmp_binned_pseudotimexPath2"`

In the above model we have:

*beta0*: Accounts for differences in expression from the start to the end._beta1*Path2vsPath1_: Captures differences between the branching paths, assuming path-1 does not change along pseudotime.

_beta2*scmp_binned_pseudotime_: Reflects differences across pseudotime.

_beta3*scmp_binned_pseudotimexPath2_: Represents interaction between pseudotime and path differences.

To identify genes that demonstrate significant changes along the pseudotime contniuum,
we use the `sc.p.vector()`

. This function, adapted from the original `maSigPro`

,
includes additional parameters such as the use of `offsets`

and `weights`

.

`scMaSigPro`

expects raw counts as input because it models data using a negative
binomial distribution, a count distribution, so the counts should not be normalized
to continuous values. To account for library size differences, it uses `offsets`

,
which are the log of size factors, similar to DESeq2.

For those who prefer to supply continuous data normalized using methods like log(1+x), the distribution can be changed during model fitting.

We can execute `sc.p.vector()`

as follows:

```
# Detect non-flat profiles
scmp_ob <- sc.p.vector(scmp_ob,
offset = TRUE, p_value = 0.05, verbose = FALSE,
log_offset = TRUE
)
scmp_ob
```

```
## Class: ScMaSigProClass
## nCells: 200
## nFeatures: 100
## Pseudotime Range: 1 100
## Branching Paths: Path1, Path2
## Binned Pseudotime: 1-8(Range), 4.5(Mean),
## Number of bins-> Path1: 8 Path2: 8
## Average bin Size-> Path1: 12 Path2: 13
## Polynomial Order: 1
## No. of Significant Profiles: 50
```

The console output reveals that `scMaSigPro`

detected 51 genes with non-flat
profiles.

Having identified genes with significant profiles, we can refine their polynomial
models using `sc.t.fit()`

. This function evaluates each term of the polynomial
model. In our case, it will assess which among “beta0 + beta1*Path2vsPath1 +
beta2*scmp_binned_pseudotime + beta3*scmp_binned_pseudotimexPath2”
significantly contributes to the differences. To execute `sc.t.fit()`

,
we proceed as follows:

```
# Model refinement
scmp_ob <- sc.t.fit(scmp_ob, verbose = FALSE)
scmp_ob
```

```
## Class: ScMaSigProClass
## nCells: 200
## nFeatures: 100
## Pseudotime Range: 1 100
## Branching Paths: Path1, Path2
## Binned Pseudotime: 1-8(Range), 4.5(Mean),
## Number of bins-> Path1: 8 Path2: 8
## Average bin Size-> Path1: 12 Path2: 13
## Polynomial Order: 1
## No. of Significant Profiles: 50
## No. of Influential Features: 8
```

With our refined models in hand, we now focus on identifying genes showing
significant differences with pseudotime, among paths, or both. For this purpose,
we use the `sc.filter()`

function. Our aim is to select models with a
relatively high \(R^2\), indicating simple linear relationships. The `vars`

parameter in `sc.filter()`

allows us to extract different sets of
significant genes. Setting `vars = 'all'`

retrieves all non-flat profiles
identified in `sc.p.vector()`

with \(R^2>=\) the specified threshold. The option
`vars = 'groups`

fetches genes per path, resulting in two gene lists that
demonstrate associative significance among paths, helping us identify genes
associated with one path or the other along the pseudotime continuum. The
`vars = 'each'`

option finds significance for each term in the polynomial.
In our case, we are interested in genes differentially expressed between paths
and over pseudotime continuum, so we will choose `vars = 'groups`

.

```
scmp_ob <- sc.filter(
scmpObj = scmp_ob,
rsq = 0.7,
vars = "groups",
intercept = "dummy",
includeInflu = TRUE
)
```

By setting the vars parameter to “groups”, the function will add genes with \(R^2\) >= 0.7 to the object. To explore the number of genes per group, we will make an upset plot:

`plotIntersect(scmp_ob)`

Here, we observe that 23 genes belong to both Path2vsPath1 and Path1, indicating that these genes not only change along pseudotime but also exhibit significantly different expression between the two paths. Additionally, there are 10 genes uniquely associated with Path2vsPath1. This implies that Path2 has 10 genes that are significantly differentially expressed over time, using Path1 genes as a reference. Let’s explore a few of these genes:

```
FigureA <- plotTrend(scmp_ob, "Gene9", logs = TRUE, logType = "log", lines = TRUE)
FigureB <- plotTrend(scmp_ob, "Gene95", logs = TRUE, logType = "log", lines = TRUE)
FigureC <- plotTrend(scmp_ob, "Gene10", logs = TRUE, logType = "log", lines = TRUE)
FigureD <- plotTrend(scmp_ob, "Gene92", logs = TRUE, logType = "log", lines = TRUE)
ggpubr::ggarrange(FigureA, FigureB, FigureC, FigureD,
ncol = 2, nrow = 2,
labels = c("A", "B", "C", "D")
)
```