Single-cell/pseudobulk Differential Analysis

Purpose and Usage

1. Single-cell Level Differential Analysis (No Biological Replicates)

Supported methods: t-test, wilcoxon, and MAST.

Scenario 1: Differential analysis between specified category 1 and category 2

SDAS DEG -i st.h5ad -o outdir --group_key leiden --de_method wilcoxon  \
--ident1 1 --ident2 2 \
--fdr 0.05 --log2fc 1

Scenario 2: Differential analysis for each category versus all others

SDAS DEG -i st.h5ad -o outdir --group_key leiden --de_method wilcoxon \
 --fdr 0.05 --log2fc 1

Scenario 3: Subset a column in obs before differential analysis

SDAS DEG -i st.h5ad -o outdir --group_key leiden --de_method wilcoxon \
--ident1 1 --ident2 2 \
--fdr 0.05 --log2fc 1 \
--subset_key cell_type --subset_values B

2. Pseudobulk Differential Analysis (With Biological Replicates)

Recommended: DESeq2, edgeR (pseudobulk analysis). You must specify --sample_key, and the number of samples per group must meet the method requirements (DESeq2 ≥ 3, edgeR ≥ 2).
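The sample-count requirement can be checked before launching a pseudobulk run. The sketch below is illustrative and is not part of SDAS: it counts distinct samples per group in an obs-like table and reports groups below the minimum stated above. The function name, the `condition` column, and the sample labels are hypothetical.

```python
from collections import defaultdict

# Minimum number of samples per group, as stated in the text above.
MIN_SAMPLES = {"DESeq2": 3, "edgeR": 2}


def groups_below_minimum(obs_rows, group_key, sample_key, method):
    """Return {group: n_samples} for groups with too few distinct samples."""
    samples = defaultdict(set)
    for row in obs_rows:
        samples[row[group_key]].add(row[sample_key])
    required = MIN_SAMPLES[method]
    return {g: len(s) for g, s in samples.items() if len(s) < required}


# Hypothetical obs table: one dict per cell/bin.
obs_rows = [
    {"condition": "Tumor", "sampleID": "T1"},
    {"condition": "Tumor", "sampleID": "T2"},
    {"condition": "Tumor", "sampleID": "T3"},
    {"condition": "Normal", "sampleID": "N1"},
    {"condition": "Normal", "sampleID": "N2"},
]

print(groups_below_minimum(obs_rows, "condition", "sampleID", "DESeq2"))  # {'Normal': 2}
print(groups_below_minimum(obs_rows, "condition", "sampleID", "edgeR"))   # {}
```

An empty result means every group meets the requirement for the chosen method.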

Scenario 1: Direct differential analysis between two groups of samples

SDAS DEG -i st.h5ad -o outdir --group_key sampleID --de_method DESeq2 \
--ident1 Tumor --ident2 Normal \
--fdr 0.05 --log2fc 1 \
--sample_key sampleID

Scenario 2: Subset before pseudobulk differential analysis

SDAS DEG -i st.h5ad -o outdir --group_key sampleID --de_method DESeq2 \
--ident1 Tumor --ident2 Normal \
--fdr 0.05 --log2fc 1 \
--sample_key sampleID \
--subset_key cell_type --subset_values B

Input Parameter Description

| DEG Parameter | Required | Default Value | Description |
| --- | --- | --- | --- |
| -i / --input | Yes | | Input h5ad file containing the gene expression matrix. |
| -o / --output | Yes | | Output directory. |
| --de_method | Yes | | Differential analysis method to use (e.g., wilcoxon, DESeq2). |
| --group_key | Yes | | Column name in the h5ad obs; it must contain the values given by --ident1 and --ident2. |
| --ident1 | | | Identity class to compute DEGs for. If not set, each category in --group_key is used in turn. |
| --ident2 | | | Second identity class to compare against. If not set, the union of the remaining categories in --group_key is used. |
| --sample_key | | | Sample key in obs (optional). Must be set when --de_method is DESeq2 or edgeR. |
| --subset_key | | | Key in obs used for subsetting (optional). If --subset_values is not set, DEG analysis is run on each value of this key. |
| --subset_values | | | Values of --subset_key to subset on (optional), e.g., cell1,cell2. |
| --layer | | | Layer holding the raw gene expression. If not set, adata.raw.X or adata.X is used. |
| --gene_symbol_key | | real_gene_name | Key holding the gene names. The default, real_gene_name, matches h5ad files produced by SAW. |
| --fdr | | 0.05 | Adjusted p-value (FDR) cutoff for selecting significant DEGs. |
| --log2fc | | 1 | Absolute log fold change cutoff for selecting significant DEGs. |
| --genelist | | | Genes to label in the volcano plot, separated by ','. By default, the top 5 significant up- and down-regulated genes are labeled; set to 0 to label no genes. |
| --add_label | | | CSV file used to add a label to the obs columns. |
| --min_gene | | 1 | Minimum number of genes per spot; spots below this are filtered out. |
| --max_gene | | | Maximum number of genes per spot; spots above this are filtered out. No filtering by default. |
| --min_cell | | 1 | Minimum number of cells in which a gene must be detected; genes below this are filtered out. |
| --volcano_xlim | | | X-axis limits for the volcano plot, e.g., -5 5. |
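The interaction between --group_key, --ident1, and --ident2 described above can be sketched as follows. This is an illustration of the documented defaults, not SDAS source code. The function name is hypothetical, and rejecting --ident2 without --ident1 is a choice made for this sketch because the description does not cover that case.

```python
def plan_comparisons(categories, ident1=None, ident2=None):
    """Return a list of (target, reference_categories) pairs.

    categories: values found in the --group_key column.
    ident1 not set: every category is used as a target in turn.
    ident2 not set: the reference is the union of all other categories.
    """
    cats = list(dict.fromkeys(categories))  # unique values, original order
    if ident1 is None and ident2 is not None:
        # Not covered by the parameter description; rejected in this sketch.
        raise ValueError("ident2 given without ident1")
    targets = cats if ident1 is None else [ident1]
    plans = []
    for target in targets:
        if ident2 is not None:
            reference = [ident2]
        else:
            reference = [c for c in cats if c != target]
        plans.append((target, reference))
    return plans


clusters = ["1", "2", "3", "1", "2"]
print(plan_comparisons(clusters, "1", "2"))  # [('1', ['2'])]
print(plan_comparisons(clusters, "1"))       # [('1', ['2', '3'])]
print(plan_comparisons(clusters))
# [('1', ['2', '3']), ('2', ['1', '3']), ('3', ['1', '2'])]
```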

Output Results Display

| Result File | Description |
| --- | --- |
| de_{method}.{group_key}.{ident1}-vs-{ident2}.raw.csv | Raw output from the differential analysis software |
| de_{method}.{group_key}.{ident1}-vs-{ident2}.all.csv | Extracted results with geneName, log2FC, pvalue, FDR, etc. |
| de_{method}.{group_key}.{ident1}-vs-{ident2}.sig_filtered.csv | Significant DEGs filtered by log2FC and FDR |
| de_{method}.{group_key}.{ident1}-vs-{ident2}.png/pdf | Volcano plot in png or pdf format |

Raw file format example: de_{method}.{group_key}.{ident1}-vs-{ident2}.raw.csv. This file is the original output from the differential analysis software. It may contain the gene name, fold change, p-value, adjusted p-value (FDR), and other details.

| names | scores | logfoldchanges | pvals | pvals_adj |
| --- | --- | --- | --- | --- |
| MTATP6P1 | 16.74336 | 1.3794351 | 1.3877341418899603e-42 | 2.2785333033340524e-39 |
| AGR2 | 13.671169 | 1.7758344 | 1.419568544127444e-32 | 1.1147316293689464e-29 |
| CLDN4 | 13.663365 | 1.9820584 | 1.9626883546881656e-34 | 1.6880054463820458e-31 |
| ... | | | | |

all/sig_filtered file format example: de_{method}.{group_key}.{ident1}-vs-{ident2}.all.csv. This file extracts the gene name, fold change, p-value, and adjusted p-value (FDR) from the original results and gives the columns uniform names. de_{method}.{group_key}.{ident1}-vs-{ident2}.sig_filtered.csv has the same format and lists only the significant DEGs that pass the log2FC and FDR thresholds.

| geneName | log2FC | pvalue | FDR |
| --- | --- | --- | --- |
| MTATP6P1 | 1.3794351 | 1.3877341418899603e-42 | 2.2785333033340524e-39 |
| AGR2 | 1.7758344 | 1.419568544127444e-32 | 1.1147316293689464e-29 |
| CLDN4 | 1.9820584 | 1.9626883546881656e-34 | 1.6880054463820458e-31 |
| ... | | | |
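The relationship between the raw, all, and sig_filtered files can be sketched with the standard library. The column mapping is read off the two example tables in this section. Two things are assumptions: the mapping applies only to raw output with these column names (other methods may differ), and the comparison operators (FDR < cutoff, |log2FC| >= cutoff) are a guess at how SDAS applies the thresholds. The GENE_X row is made up to show a gene that fails the filter.

```python
import csv
import io

# Raw column -> uniform column, as seen in the example tables.
RENAME = {
    "names": "geneName",
    "logfoldchanges": "log2FC",
    "pvals": "pvalue",
    "pvals_adj": "FDR",
}


def to_all_rows(raw_rows):
    """Keep the four shared columns and rename them uniformly."""
    return [{new: row[old] for old, new in RENAME.items()} for row in raw_rows]


def filter_significant(all_rows, fdr=0.05, log2fc=1.0):
    """Keep rows passing both thresholds (operators are assumed)."""
    return [
        r for r in all_rows
        if float(r["FDR"]) < fdr and abs(float(r["log2FC"])) >= log2fc
    ]


# First three rows come from the example above; GENE_X is hypothetical.
raw_text = """names,scores,logfoldchanges,pvals,pvals_adj
MTATP6P1,16.74336,1.3794351,1.3877341418899603e-42,2.2785333033340524e-39
AGR2,13.671169,1.7758344,1.419568544127444e-32,1.1147316293689464e-29
CLDN4,13.663365,1.9820584,1.9626883546881656e-34,1.6880054463820458e-31
GENE_X,0.5,0.2,0.6,0.8
"""

raw_rows = list(csv.DictReader(io.StringIO(raw_text)))
all_rows = to_all_rows(raw_rows)
sig_rows = filter_significant(all_rows, fdr=0.05, log2fc=1.0)
print([r["geneName"] for r in sig_rows])  # ['MTATP6P1', 'AGR2', 'CLDN4']
```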

Volcano plot result example: de_{method}.{group_key}.{ident1}-vs-{ident2}.png/pdf. In the plot, red dots are significant DEGs that meet both the log2FC and FDR thresholds, blue dots meet the FDR threshold but not the log2FC threshold, and green dots meet the log2FC threshold but not the FDR threshold. By default, the top 5 up-regulated and top 5 down-regulated genes are labeled. To label specific genes instead, use the --genelist parameter (e.g., --genelist geneA,geneB,geneC).
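The color rules can be written as a small classifier. This sketch is not the SDAS plotting code: the function name is hypothetical, the comparison operators are assumed, and the label for genes that meet neither threshold is a placeholder because the text does not state their color.

```python
def volcano_category(log2fc_value, fdr_value, fdr=0.05, log2fc=1.0):
    """Map one gene to the color described for the volcano plot."""
    passes_fdr = fdr_value < fdr              # assumed operator
    passes_fc = abs(log2fc_value) >= log2fc   # assumed operator
    if passes_fdr and passes_fc:
        return "red"    # significant DEG
    if passes_fdr:
        return "blue"   # FDR only
    if passes_fc:
        return "green"  # log2FC only
    return "neither"    # color not stated in the text


print(volcano_category(1.98, 1.7e-31))  # red
print(volcano_category(0.4, 1e-6))      # blue
print(volcano_category(-2.5, 0.3))      # green
print(volcano_category(0.1, 0.9))       # neither
```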

Result Interpretation

Gene name uniqueness
- Before differential analysis, gene names are automatically made unique using make_unique. All outputs and plots use the unique gene names.
Cell and gene filtering
- Supports filtering cells and genes using parameters such as --min_gene, --max_gene, and --min_cell. If the h5ad file has already been filtered, these can be omitted.
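To show what making gene names unique means in practice, here is a minimal sketch. It assumes a suffix scheme similar to anndata's var_names_make_unique, where repeated names get "-1", "-2", and so on. The exact scheme SDAS applies is not described here, so treat the suffix format as an assumption.

```python
def make_unique(names, join="-"):
    """Append join + counter to repeated names; first occurrence is kept."""
    taken = set(names)   # names that are already in use
    seen = set()         # names whose first occurrence has been emitted
    counters = {}
    result = []
    for name in names:
        if name not in seen:
            seen.add(name)
            result.append(name)
            continue
        n = counters.get(name, 0) + 1
        candidate = f"{name}{join}{n}"
        while candidate in taken:  # avoid clashing with an existing name
            n += 1
            candidate = f"{name}{join}{n}"
        counters[name] = n
        taken.add(candidate)
        result.append(candidate)
    return result


print(make_unique(["AGR2", "CLDN4", "AGR2", "AGR2"]))
# ['AGR2', 'CLDN4', 'AGR2-1', 'AGR2-2']
```

When looking up a gene in the outputs, check for suffixed variants if the original annotation contained duplicate names.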

Parameter Tuning Suggestions

When the number of bins/cells exceeds 200k, MAST cannot run successfully. In this case, set stricter filtering parameters (--min_gene and --min_cell) to reduce the number of bins/cells before analysis.
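The effect of stricter filtering can be estimated on a small count matrix before a long run. This sketch mirrors the meaning of --min_gene, --max_gene, and --min_cell as described in the parameter table. The function name is hypothetical, and filtering cells first and genes second is an assumed order that may differ from what SDAS does.

```python
def filter_counts(matrix, min_gene=1, max_gene=None, min_cell=1):
    """Return (kept_cell_indices, kept_gene_indices).

    matrix: list of rows (cells/bins), each a list of per-gene counts.
    """
    kept_cells = []
    for i, row in enumerate(matrix):
        n_genes = sum(1 for v in row if v > 0)  # genes detected in this cell
        if n_genes >= min_gene and (max_gene is None or n_genes <= max_gene):
            kept_cells.append(i)
    n_cols = len(matrix[0]) if matrix else 0
    kept_genes = [
        j for j in range(n_cols)
        if sum(1 for i in kept_cells if matrix[i][j] > 0) >= min_cell
    ]
    return kept_cells, kept_genes


# Hypothetical 4 cells x 4 genes count matrix.
counts = [
    [5, 0, 1, 0],
    [0, 0, 0, 0],
    [2, 3, 4, 0],
    [1, 0, 0, 0],
]

print(filter_counts(counts))                          # ([0, 2, 3], [0, 1, 2])
print(filter_counts(counts, min_gene=2, min_cell=2))  # ([0, 2], [0, 2])
```

Raising min_gene from 1 to 2 drops one more cell in this toy matrix; on real data the same comparison shows how far the thresholds must rise to get below the MAST limit.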
