scMaSigPro 0.0.4
scMaSigPro
is designed to handle datasets with at least two branches and
requires that cells be assigned exclusively to these branches. This vignette
demonstrates how to evaluate datasets with common root cells using the scMaSigPro
approach. We will utilize an object from Monocle3’s
newCellDataSet
class
in this analysis.
Currently, scMaSigPro
is available on GitHub and can be installed as follows:
# Install Dependencies
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install(version = "3.14")
BiocManager::install(c("SingleCellExperiment", "maSigPro", "MatrixGenerics", "S4Vectors"))
To install scMaSigPro
from GitHub, use the following R code:
# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
install.packages("devtools")
}
# Install scMaSigPro
devtools::install_github("BioBam/scMaSigPro",
ref = "main",
build_vignettes = FALSE,
build_manual = TRUE,
upgrade = "never",
force = TRUE,
quiet = TRUE
)
We will start by loading the necessary libraries.
## For plotting
library(scMaSigPro)
library(ggplot2)
## Install and load 'monocle3-1.3.4'
library(monocle3)
For this vignette, we will use a simulated dataset containing 1,977 cells and 501 genes. The dataset was generated using the Dyntoy package and is sourced from the tradeSeq
repository. This dataset has been simulated to have three branches and is analyzed with Monocle3 (version 1.3.4).
In the figure-A, we can see that the cells share a common root and then diverge
into multiple branches. The cells first diverge into two branches and then into
three branches. Figure B illustrates the elements of the trajectory used as input
in scMaSigPro
for the common root cells setup.
An important prerequisite for using scMaSigPro
is the hard assignment of cells
to branches. This means that each cell must be exclusively assigned to one branch.
There are broadly two ways to assign cells:
In Monocle3, each cell is associated with only one branch, implying a hard assignment,
which is suitable for scMaSigPro
current version. On the other hand, tradeSeq
can
handle both hard and soft assignments. In Slingshot, for instance, cells can be
part of multiple branches, which is a form of soft assignment. One key difference
is that Slingshot assigns different pseudotimes to each branch, whereas Monocle3
uses a universal pseudotime for all branches.
## Read CDS from the server
cds <- readRDS(url("https://www.metapriyansh.com/scMaSigPro/rds_objects/multifurcating_4_cds.RDS"))
We will follow the steps from the tradeSeq
Vignette
to extract the assignment of cells to branches. However, for the sake of simplicity,
we will tweak these steps to avoid using the magrittr and monocle3 packages.
Instead, we will use base R functions and directly access the S4 slots.
1 We extracted the vertex from UMAP and the trajectory was learned on tSNE. This is because our CDS object has the tSNE coordinates stored in the UMAP slot. This was done because monocle3::learn_graph()
does not work with tSNE coordinates. You can read more about this issue on monocle3-issues/242
# Get the closest vertices for every cell
y_to_cells <- as.data.frame(cds@principal_graph_aux$UMAP$pr_graph_cell_proj_closest_vertex)
y_to_cells$cells <- rownames(y_to_cells)
y_to_cells$Y <- y_to_cells$V1
## Root Nodes
root <- cds@principal_graph_aux$UMAP$root_pr_nodes
## Extract MST (PQ Graph)
mst <- cds@principal_graph$UMAP
## All end-points
endpoints <- names(which(igraph::degree(mst) == 1))
## Root is also an endpoint so we remove it
endpoints <- endpoints[!endpoints %in% root]
## Extract
cellAssignments_list <- lapply(endpoints, function(endpoint) {
# We find the path between the endpoint and the root
path <- igraph::shortest_paths(mst, root, endpoint)$vpath[[1]]
path <- as.character(path)
# We find the cells that map along that path
df <- y_to_cells[y_to_cells$Y %in% path, ]
df <- data.frame(weights = as.numeric(colnames(cds@assays@data@listData$counts) %in% df$cells))
colnames(df) <- endpoint
return(df)
})
## Format
cellAssignments <- do.call(what = "cbind", cellAssignments_list)
cellAssignments <- as.matrix(cellAssignments)
# Update columns
rownames(cellAssignments) <- colnames(cds@assays@data@listData$counts)
head(cellAssignments[c(20:30), ])
## Y_1 Y_15 Y_18 Y_52 Y_65
## C20 0 0 0 1 0
## C21 1 1 1 1 1
## C22 0 0 0 1 0
## C23 0 0 0 1 0
## C24 0 0 0 1 1
## C25 1 0 1 0 0
scMaSigPro
objectFor this vignette, we will only consider three paths: the root, Y_18, and Y_15 (Refer Figure-A). We will remove the other paths and any cells that are part of those paths.
# Subset for 3 paths
cellAssignments <- cellAssignments[, c("Y_18", "Y_52", "Y_15"), drop = FALSE]
# Remove any of the cells which is not assigned to any path
cellAssignments <- cellAssignments[rowSums(cellAssignments) != 0, ]
# Create Cell Data
cellData <- data.frame(
cell_id = rownames(cellAssignments),
row.names = rownames(cellAssignments)
)
# Extract counts
counts <- cds@assays@data@listData$counts
# Subset counts
counts <- counts[, rownames(cellAssignments), drop = FALSE]
Another important step is to label the elements (Refer to Figure-B). If a cell belongs to the root, then it is part of all the paths. If a cell is part of a link branch, then it is part of both the “Y_15 and Y_18” branches.
# Create Cell Metadata
cellData[["group"]] <- apply(cellAssignments, 1, FUN = function(x) {
npath <- length(names(x[x == 1]))
if (npath == 3) {
return("root")
} else if (npath == 2) {
return(paste(names(x[x == 1]), collapse = "_links_"))
} else {
return(names(x[x == 1]))
}
})
table(cellData[["group"]])
##
## root Y_15 Y_18 Y_18_links_Y_15 Y_52
## 453 307 676 43 498
# Extract from CDS
ptimes <- cds@principal_graph_aux$UMAP$pseudotime
# Remove cells which are not assigned to any path
ptimes <- ptimes[rownames(cellAssignments)]
# Assign to cellData
cellData[["m3_pseudotime"]] <- ptimes
# Create Object
multi_scmp_ob <- create_scmp(
counts = counts,
cell_data = cellData,
ptime_col = "m3_pseudotime",
path_col = "group",
use_as_bin = FALSE
)
multi_scmp_ob
## Class: ScMaSigProClass
## nCells: 1977
## nFeatures: 501
## Pseudotime Range: 0 62.806
## Branching Paths: root, Y_15, Y_52, Y_18, Y_18_links_Y_15
The first approach to using scMaSigPro
to analyze branches with common cells
is to consider the common cells as a separate group. For the interpretation we
will evaluate whether a particular gene’s expression in the common root cells and
it’s downstream branch is similar.
## Pseudotime based binning
multi_scmp_ob_A <- sc.squeeze(multi_scmp_ob)
## Plot bin information
plotBinTile(multi_scmp_ob_A)
In the above figure, we can see that scMaSigPro
considers each group as a
separate branch. Running scMaSigPro
with this configuration will lead to a
different interpretation of the data. This will help us evalute whether the
expression changes across branches compared to root. If you want to consider
the common root cells and branches together, you can follow the next section.
Additionally, we can see that the common root cells and the resulting branches have the same values of binned pseudotime values. This is because the binning is performed independently over each branch.
# Polynomial Degree 2
multi_scmp_ob_A <- sc.set.poly(multi_scmp_ob_A, poly_degree = 3)
# Detect non-flat profiles
multi_scmp_ob_A <- sc.p.vector(
multi_scmp_ob_A,
verbose = FALSE
)
# Model refinement
multi_scmp_ob_A <- sc.t.fit(
multi_scmp_ob_A,
verbose = FALSE
)
# Apply filter
multi_scmp_ob_A <- sc.filter(
scmpObj = multi_scmp_ob_A,
rsq = 0.55,
vars = "groups",
intercept = "dummy",
includeInflu = TRUE
)
# Plot upset
plotIntersect(multi_scmp_ob_A)
In the upset plot above, we can see that there are 294 genes which change among branches and also change with pseudotime. We will explore them later. First, let’s look at the genes that have similar expression in the “root”, “Y_18_links_Y_15vsroot” (link) and “Y_18” (branch).
scMaSigPro
treated the “root” as the reference for all branches. Let’s extract
the intersection information from the scMaSigPro
object. Starting from scMaSigPro
version 0.0.4
, the plotIntersect(return = TRUE)
function can be used to
extract the intersection information from the object as a dataframe.
shared_genes <- plotIntersect(multi_scmp_ob_A, return = TRUE)
head(shared_genes)
## Y_15vsroot Y_18vsroot Y_18_links_Y_15vsroot Y_52vsroot root
## G3 1 1 1 1 1
## G7 1 1 1 1 1
## G9 1 1 1 1 1
## G10 1 1 1 1 1
## G11 1 1 1 1 1
## G12 1 1 1 1 1
## Similar expression in root, "Y_18_links_Y_15vsroot" and "Y_15"
gene_br_Y_15 <- rownames(shared_genes[shared_genes$Y_52vsroot == 1 &
shared_genes$Y_18vsroot == 1 &
shared_genes$Y_15vsroot == 0 &
shared_genes$root == 1 &
shared_genes$Y_18_links_Y_15vsroot == 0, ]) # "G42"
print(gene_br_Y_15)
## [1] "G42" "G71"
plotTrend(multi_scmp_ob_A, feature_id = "G42", points = F, lines = T)
In the above plot, we can see that the gene G42 has a similar expression in the
“root”, “Y_18_links_Y_15vsroot,” and “Y_15.”. This is a good example of how
scMaSigPro
can be used to identify genes that are showing similar patterns in the common root
cells and how different is it in other branches. The main idea is to set the root
as a reference and then compare the branches to the root.
Next we can explore the 294 genes that we have in the intersection. As it is hard to see them all at once, we will first perform clustering and then look at them in groups.
## Perform Clustering
multi_scmp_ob_A <- scMaSigPro::sc.cluster.trend(multi_scmp_ob_A)
# Plot Clusters
plotTrendCluster(multi_scmp_ob_A, verbose = FALSE, loess_span = 0.8)
Another approach is to reorder the pseudotime bins, meaning the bins representing the common root cells are ordered first, followed by the bins of the branches. This helps compare the gene expression of the common root cells togther with the downstream branches.
binned_data <- cDense(multi_scmp_ob_A)
head(binned_data[, -2])
## scmp_binned_pseudotime scmp_bin group scmp_bin_size scmp_l_bound
## root_bin_1 1 root_bin_1 root 184 0.00
## root_bin_2 2 root_bin_2 root 75 2.22
## root_bin_3 3 root_bin_3 root 17 4.45
## root_bin_4 4 root_bin_4 root 46 6.67
## root_bin_5 5 root_bin_5 root 49 8.90
## root_bin_6 6 root_bin_6 root 32 11.10
## scmp_u_bound
## root_bin_1 2.22
## root_bin_2 4.45
## root_bin_3 6.67
## root_bin_4 8.90
## root_bin_5 11.10
## root_bin_6 13.30
The main idea is to update the values in the binned pseudotime column using an offset. Based on our dataset, we observe that Y_52 is a branch that directly originated from the Root. Therefore, the bins of the Y_52 branch should be aligned right after it. This means that the offset for the Y_52 branch will be the maximum value of the binned pseudotime of the root branch.
## Create a new df
Y_52_binned_data <- binned_data[binned_data$group %in% c("root", "Y_52"), , drop = FALSE]
## Calculate Offset (Max Value of the root)
Y_52_offset <- max(Y_52_binned_data[Y_52_binned_data$group == "root", "scmp_binned_pseudotime"])
## Add offset
Y_52_binned_data[Y_52_binned_data$group == "Y_52", "scmp_binned_pseudotime"] <- Y_52_binned_data[Y_52_binned_data$group == "Y_52", "scmp_binned_pseudotime"] + Y_52_offset
## Update bins and rownames
Y_52_binned_data$group2 <- "Y_52"
rownames(Y_52_binned_data) <- paste(Y_52_binned_data$group2, Y_52_binned_data$scmp_binned_pseudotime, sep = "_bin_")
head(Y_52_binned_data[, -2])
## scmp_binned_pseudotime scmp_bin group scmp_bin_size scmp_l_bound
## Y_52_bin_1 1 root_bin_1 root 184 0.00
## Y_52_bin_2 2 root_bin_2 root 75 2.22
## Y_52_bin_3 3 root_bin_3 root 17 4.45
## Y_52_bin_4 4 root_bin_4 root 46 6.67
## Y_52_bin_5 5 root_bin_5 root 49 8.90
## Y_52_bin_6 6 root_bin_6 root 32 11.10
## scmp_u_bound group2
## Y_52_bin_1 2.22 Y_52
## Y_52_bin_2 4.45 Y_52
## Y_52_bin_3 6.67 Y_52
## Y_52_bin_4 8.90 Y_52
## Y_52_bin_5 11.10 Y_52
## Y_52_bin_6 13.30 Y_52
For the other branches we will add an offset of both “link” and “Y_15|Y_18”.
## Create a new df
Y_18_binned_data <- binned_data[binned_data$group %in% c("root", "Y_18_links_Y_15", "Y_18"), , drop = FALSE]
Y_15_binned_data <- binned_data[binned_data$group %in% c("root", "Y_18_links_Y_15", "Y_15"), , drop = FALSE]
## Calculate Offset (Max Value of the root and Link)
Y_18_root_offset <- max(Y_18_binned_data[Y_18_binned_data$group == "root", "scmp_binned_pseudotime"])
Y_18_link_offset <- max(Y_18_binned_data[Y_18_binned_data$group == "Y_18_links_Y_15", "scmp_binned_pseudotime"])
## Add link offset to branch Y_18
Y_18_binned_data[Y_18_binned_data$group == "Y_18", "scmp_binned_pseudotime"] <- Y_18_binned_data[Y_18_binned_data$group == "Y_18", "scmp_binned_pseudotime"] + Y_18_link_offset
Y_15_binned_data[Y_15_binned_data$group == "Y_15", "scmp_binned_pseudotime"] <- Y_15_binned_data[Y_15_binned_data$group == "Y_15", "scmp_binned_pseudotime"] + Y_18_link_offset
## Add root offset to (link+ branch)
Y_18_binned_data[Y_18_binned_data$group != "root", "scmp_binned_pseudotime"] <- Y_18_binned_data[Y_18_binned_data$group != "root", "scmp_binned_pseudotime"] + Y_18_root_offset
Y_15_binned_data[Y_15_binned_data$group != "root", "scmp_binned_pseudotime"] <- Y_15_binned_data[Y_15_binned_data$group != "root", "scmp_binned_pseudotime"] + Y_18_root_offset
## Update bins and rownames
Y_18_binned_data$group2 <- "Y_18"
rownames(Y_18_binned_data) <- paste(Y_18_binned_data$group2, Y_18_binned_data$scmp_binned_pseudotime, sep = "_bin_")
Y_15_binned_data$group2 <- "Y_15"
rownames(Y_15_binned_data) <- paste(Y_15_binned_data$group2, Y_15_binned_data$scmp_binned_pseudotime, sep = "_bin_")
# Create plot
ggplot() +
geom_tile(
data = Y_18_binned_data,
aes(x = scmp_binned_pseudotime, y = group), fill = "blue", alpha = 0.2
) +
geom_tile(
data = Y_18_binned_data,
aes(x = scmp_binned_pseudotime, y = group2), fill = "red", alpha = 0.2
) +
theme_minimal()
As we can see above, the tiles in red represent the original bins, while the blue represents the new bins “Y_18”. At the end, we also observe the overlap where the original bins of branch Y_18 are shifted to the end.
## Combine all metadata
new_binned_data <- rbind(Y_52_binned_data, Y_18_binned_data, Y_15_binned_data)
## Create a new object
multi_scmp_ob_A_manual <- multi_scmp_ob_A
## Update dense Metadata slot
cDense(multi_scmp_ob_A_manual) <- new_binned_data
## Set the group used for binning
multi_scmp_ob_A_manual@Parameters@path_col <- "group2"
## Perform Aggregation
multi_scmp_ob_A_manual <- pb_counts(
scmpObj = multi_scmp_ob_A_manual,
assay_name = "counts",
)
## Plot
plotBinTile(multi_scmp_ob_A_manual)
From scMaSigPro
version 0.0.4
onwards, a wrapper function was introduced
to assist in ordering the bins. The working of the wrapper function sc.restruct()
is shown below, and it works on fairly simplified datasets like the one demonstrated
in this vignette.
sc.restruct()
## Showcase sc.restruct wrapper
multi_scmp_ob_B <- sc.restruct(multi_scmp_ob_A,
end_node_list = list("Y_15", "Y_18", "Y_52"),
root_node = "root", link_node_list = list("Y_18_links_Y_15"),
verbose = FALSE, link_sep = "_links_", assay_name = "counts", aggregate = "sum"
)
# Show the new path
plotBinTile(multi_scmp_ob_B)
As we see in the plot above, the sc.restruct()
wrapper function performs the
exact transformations internally. For future updates (CRAN version), we will offer
a Shiny-based solution that can effectively handle more complicated datasets.
However, as of scMaSigPro
version 0.0.4
, users are required to either use
sc.restruct()
or manually order the datasets.
# Polynomial Degree 2
multi_scmp_ob_B <- sc.set.poly(multi_scmp_ob_B, poly_degree = 3)
# Detect non-flat profiles
multi_scmp_ob_B <- sc.p.vector(
multi_scmp_ob_B,
verbose = FALSE
)
# Model refinement
multi_scmp_ob_B <- sc.t.fit(
multi_scmp_ob_B,
verbose = FALSE
)
# Apply filter
multi_scmp_ob_B <- sc.filter(
scmpObj = multi_scmp_ob_B,
rsq = 0.5,
vars = "groups",
intercept = "dummy",
includeInflu = TRUE
)
# Plot upset
plotIntersect(multi_scmp_ob_B)
Now, in the above scenario, Y_15 is treated as the reference, and we will look at the genes that are expressed differently compared to this branch. This approach will help us understand how genes are changing across the branches while considering the root cells for each of the downstream branches.
## Perform Clustering
multi_scmp_ob_B <- sc.cluster.trend(multi_scmp_ob_B,
k = 4, cluster_method = "kmeans"
)
## Plot Clusters
plotTrendCluster(multi_scmp_ob_B, verbose = FALSE)
sessionInfo(package = "scMaSigPro")
## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Madrid
## tzcode source: system (glibc)
##
## attached base packages:
## character(0)
##
## other attached packages:
## [1] scMaSigPro_0.0.4
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 gridExtra_2.3
## [3] rlang_1.1.3 magrittr_2.0.3
## [5] RcppAnnoy_0.0.22 matrixStats_1.2.0
## [7] e1071_1.7-14 compiler_4.3.0
## [9] mgcv_1.9-1 venn_1.11
## [11] vctrs_0.6.5 stringr_1.5.1
## [13] pkgconfig_2.0.3 crayon_1.5.2
## [15] fastmap_1.1.1 backports_1.4.1
## [17] XVector_0.42.0 ellipsis_0.3.2
## [19] labeling_0.4.3 utf8_1.2.4
## [21] promises_1.2.1 rmarkdown_2.25
## [23] grDevices_4.3.0 UpSetR_1.4.0
## [25] purrr_1.0.2 xfun_0.42
## [27] zlibbioc_1.48.0 cachem_1.0.8
## [29] graphics_4.3.0 GenomeInfoDb_1.38.6
## [31] jsonlite_1.8.8 highr_0.10
## [33] later_1.3.2 DelayedArray_0.28.0
## [35] terra_1.7-71 broom_1.0.5
## [37] parallel_4.3.0 R6_2.5.1
## [39] bslib_0.6.1 stringi_1.8.3
## [41] car_3.1-2 parallelly_1.37.0
## [43] GenomicRanges_1.54.1 jquerylib_0.1.4
## [45] Rcpp_1.0.12 bookdown_0.37
## [47] assertthat_0.2.1 SummarizedExperiment_1.32.0
## [49] knitr_1.45 IRanges_2.36.0
## [51] splines_4.3.0 httpuv_1.6.14
## [53] Matrix_1.6-5 igraph_1.6.0
## [55] tidyselect_1.2.0 rstudioapi_0.15.0
## [57] abind_1.4-5 yaml_2.3.8
## [59] codetools_0.2-19 admisc_0.34
## [61] lattice_0.22-5 tibble_3.2.1
## [63] plyr_1.8.9 Biobase_2.62.0
## [65] shiny_1.8.0 withr_3.0.0
## [67] evaluate_0.23 base_4.3.0
## [69] proxy_0.4-27 mclust_6.0.1
## [71] pillar_1.9.0 BiocManager_1.30.22
## [73] ggpubr_0.6.0 carData_3.0-5
## [75] MatrixGenerics_1.14.0 stats4_4.3.0
## [77] plotly_4.10.4 generics_0.1.3
## [79] RCurl_1.98-1.14 S4Vectors_0.40.2
## [81] ggplot2_3.5.1 munsell_0.5.0
## [83] scales_1.3.0 BiocStyle_2.30.0
## [85] stats_4.3.0 xtable_1.8-4
## [87] class_7.3-22 glue_1.7.0
## [89] lazyeval_0.2.2 tools_4.3.0
## [91] datasets_4.3.0 data.table_1.15.0
## [93] ggsignif_0.6.4 cowplot_1.1.2
## [95] grid_4.3.0 utils_4.3.0
## [97] tidyr_1.3.1 methods_4.3.0
## [99] colorspace_2.1-0 SingleCellExperiment_1.24.0
## [101] nlme_3.1-164 GenomeInfoDbData_1.2.11
## [103] cli_3.6.2 fansi_1.0.6
## [105] S4Arrays_1.2.0 viridisLite_0.4.2
## [107] dplyr_1.1.4 gtable_0.3.4
## [109] rstatix_0.7.2 sass_0.4.8
## [111] digest_0.6.34 BiocGenerics_0.48.1
## [113] SparseArray_1.2.4 htmlwidgets_1.6.4
## [115] farver_2.1.1 maSigPro_1.74.0
## [117] entropy_1.3.1 htmltools_0.5.7
## [119] lifecycle_1.0.4 httr_1.4.7
## [121] mime_0.12 MASS_7.3-60