Single-cell/pseudobulk Differential Analysis

Purpose and Usage

1. Single-cell Level Differential Analysis (No Biological Replicates)

Supported methods: t-test, wilcoxon, and MAST.

  • Scenario 1: Differential analysis between specified category 1 and category 2

    SDAS DEG -i st.h5ad -o outdir --group_key leiden --de_method wilcoxon \
    --ident1 1 --ident2 2 \
    --fdr 0.05 --log2fc 1
  • Scenario 2: Differential analysis for each category versus all others

    SDAS DEG -i st.h5ad -o outdir --group_key leiden --de_method wilcoxon \
    --fdr 0.05 --log2fc 1
  • Scenario 3: Subset a column in obs before differential analysis

    SDAS DEG -i st.h5ad -o outdir --group_key leiden --de_method wilcoxon \
    --ident1 1 --ident2 2 \
    --fdr 0.05 --log2fc 1 \
    --subset_key cell_type --subset_values B
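
The way --ident1 and --ident2 select comparisons in Scenarios 1 and 2 can be sketched in a few lines of Python. This is only an illustration of the documented behaviour (no --ident1 means every category is tested in turn; no --ident2 means the reference is the union of the remaining categories). The function name plan_comparisons and the cluster labels are hypothetical and are not part of SDAS.

```python
from typing import List, Optional, Tuple


def plan_comparisons(
    categories: List[str],
    ident1: Optional[str] = None,
    ident2: Optional[str] = None,
) -> List[Tuple[str, List[str]]]:
    """Return (target, reference categories) pairs for one --group_key column."""
    for ident in (ident1, ident2):
        if ident is not None and ident not in categories:
            raise ValueError(f"{ident!r} is not a category of the group key")
    # No --ident1: every category becomes the target in turn.
    targets = [ident1] if ident1 is not None else list(categories)
    plan = []
    for target in targets:
        if ident2 is not None:
            reference = [ident2]
        else:
            # No --ident2: compare against the union of all other categories.
            reference = [c for c in categories if c != target]
        plan.append((target, reference))
    return plan


clusters = ["0", "1", "2", "3"]  # hypothetical leiden cluster labels
print(plan_comparisons(clusters, ident1="1", ident2="2"))  # Scenario 1
print(plan_comparisons(clusters))                          # Scenario 2
```

With four clusters, Scenario 1 yields a single comparison and Scenario 2 yields four, one per cluster.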

2. Pseudobulk Differential Analysis (With Biological Replicates)

Recommended: DESeq2, edgeR (pseudobulk analysis). You must specify --sample_key, and the number of samples per group must meet the method requirements (DESeq2 ≥ 3, edgeR ≥ 2).

  • Scenario 1: Direct differential analysis between two groups of samples

    SDAS DEG -i st.h5ad -o outdir --group_key sampleID --de_method DESeq2 \
    --ident1 Tumor --ident2 Normal \
    --fdr 0.05 --log2fc 1 \
    --sample_key sampleID
  • Scenario 2: Subset before pseudobulk differential analysis

    SDAS DEG -i st.h5ad -o outdir --group_key sampleID --de_method DESeq2 \
    --ident1 Tumor --ident2 Normal \
    --fdr 0.05 --log2fc 1 \
    --sample_key sampleID \
    --subset_key cell_type --subset_values B
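
To make the pseudobulk idea concrete, the sketch below sums raw counts over all cells that share a sample ID and then checks the per-group sample minimums quoted above (DESeq2 ≥ 3, edgeR ≥ 2). It assumes aggregation by summation, which is the common pseudobulk convention; the source does not state how SDAS aggregates internally. The function names, sample IDs, and counts are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Minimum number of samples per group, as quoted in the text above.
MIN_SAMPLES = {"DESeq2": 3, "edgeR": 2}


def pseudobulk(cells: List[Tuple[str, Dict[str, int]]]) -> Dict[str, Dict[str, int]]:
    """Sum raw counts over all cells that share a sample ID (assumed convention)."""
    totals: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for sample, counts in cells:
        for gene, n in counts.items():
            totals[sample][gene] += n
    return {sample: dict(genes) for sample, genes in totals.items()}


def check_replicates(sample_to_group: Dict[str, str], method: str) -> None:
    """Raise ValueError if any group has fewer samples than the method needs."""
    per_group: Dict[str, int] = defaultdict(int)
    for group in sample_to_group.values():
        per_group[group] += 1
    need = MIN_SAMPLES[method]
    short = {g: n for g, n in per_group.items() if n < need}
    if short:
        raise ValueError(f"{method} needs >= {need} samples per group, got {short}")


# Toy data: (sample ID, {gene: raw count}) for each cell.
cells = [
    ("T1", {"AGR2": 3, "CLDN4": 1}),
    ("T1", {"AGR2": 2}),
    ("N1", {"CLDN4": 4}),
]
bulk = pseudobulk(cells)
print(bulk["T1"])

groups = {"T1": "Tumor", "T2": "Tumor", "N1": "Normal", "N2": "Normal"}
check_replicates(groups, "edgeR")  # two samples per group is enough for edgeR
```

With these four samples, check_replicates(groups, "DESeq2") would raise, because each group has only two samples.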

Input Parameter Description

| DEG Parameter | Required | Default Value | Description |
| --- | --- | --- | --- |
| -i / --input | Yes | | Input h5ad file containing the gene expression matrix. |
| -o / --output | Yes | | Output directory. |
| --de_method | Yes | | DEG method to use. |
| --group_key | Yes | | Column name in the h5ad obs; it must contain the ident1 and ident2 values. |
| --ident1 | No | | Identity class to find DEGs for. If NULL, each category in --group_key is used in turn. |
| --ident2 | No | | Second identity class for comparison. If NULL, the union of the remaining categories in --group_key is used. |
| --sample_key | No | | Sample key in obs. Must be set when --de_method is DESeq2 or edgeR. |
| --subset_key | No | | Key in obs used for subsetting. If --subset_values is not set, DEG analysis is run on each value of this key. |
| --subset_values | No | | Values in --subset_key used for subsetting, e.g. cell1,cell2. |
| --layer | No | | Layer holding the raw gene expression. If NULL, adata.raw.X or adata.X is used. |
| --gene_symbol_key | No | real_gene_name | Gene name key. Default: real_gene_name, as used in SAW h5ad files. |
| --fdr | No | 0.05 | Adjusted p-value (FDR) cutoff for selecting significant DEGs. Default: 0.05. |
| --log2fc | No | 1 | Absolute log fold change cutoff for selecting significant DEGs. Default: 1. |
| --genelist | No | 5 | Genes to label in the volcano plot, separated by ','. By default, the top 5 significant up- and down-regulated genes are labeled. Set to 0 to label no genes. |
| --add_label | No | | CSV file used to add a label column to obs. |
| --min_gene | No | 0 | Minimum number of genes per spot for filtering. Default: 1. |
| --max_gene | No | | Maximum number of genes per spot for filtering. By default, no filtering is applied. |
| --min_cell | No | 0 | Minimum number of cells in which a gene must be present for filtering. Default: 1. |
| --volcano_xlim | No | | X-axis limits of the volcano plot, e.g. -5 5. |

Output Results Display

| Result File | Description |
| --- | --- |
| de_{method}.{group_key}.{ident1}-vs-{ident2}.raw.csv | Raw output from the software |
| de_{method}.{group_key}.{ident1}-vs-{ident2}.all.csv | Extracted results with geneName, log2FC, Pvalue, FDR, etc. |
| de_{method}.{group_key}.{ident1}-vs-{ident2}.sig_filtered.csv | Significant DEGs filtered by log2FC and FDR |
| de_{method}.{group_key}.{ident1}-vs-{ident2}.png/pdf | Volcano plot in png or pdf format |

  • Raw file format example: de_{method}.{group_key}.{ident1}-vs-{ident2}.raw.csv

    This file is the original output from the differential analysis software. It may contain the gene name, fold change, Pvalue, adjusted Pvalue (FDR), and other details.

| names | scores | logfoldchanges | pvals | pvals_adj |
| --- | --- | --- | --- | --- |
| MTATP6P1 | 16.74336 | 1.3794351 | 1.3877341418899603e-42 | 2.2785333033340524e-39 |
| AGR2 | 13.671169 | 1.7758344 | 1.419568544127444e-32 | 1.1147316293689464e-29 |
| CLDN4 | 13.663365 | 1.9820584 | 1.9626883546881656e-34 | 1.6880054463820458e-31 |
| ... | ... | ... | ... | ... |

  • all/sig_filtered file format example: de_{method}.{group_key}.{ident1}-vs-{ident2}.all.csv

    This file extracts the gene name, fold change, Pvalue, and adjusted Pvalue (FDR) from the original results and renames the columns uniformly. de_{method}.{group_key}.{ident1}-vs-{ident2}.sig_filtered.csv is the list of significant DEGs filtered by the log2FC and FDR thresholds.

| geneName | log2FC | pvalue | FDR |
| --- | --- | --- | --- |
| MTATP6P1 | 1.3794351 | 1.3877341418899603e-42 | 2.2785333033340524e-39 |
| AGR2 | 1.7758344 | 1.419568544127444e-32 | 1.1147316293689464e-29 |
| CLDN4 | 1.9820584 | 1.9626883546881656e-34 | 1.6880054463820458e-31 |
| ... | ... | ... | ... |
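
The relationship between the all.csv and sig_filtered.csv files can be illustrated with a short sketch that re-applies the --fdr and --log2fc cutoffs to the three example rows shown above. It assumes a comma-separated file with exactly the four columns listed; the real file may carry more columns. Whether SDAS uses strict or inclusive comparisons at the cutoffs is not stated in the source, so the operators below are an assumption.

```python
import csv
import io

# The three example rows from the table above, written out as CSV text.
ALL_CSV = """geneName,log2FC,pvalue,FDR
MTATP6P1,1.3794351,1.3877341418899603e-42,2.2785333033340524e-39
AGR2,1.7758344,1.419568544127444e-32,1.1147316293689464e-29
CLDN4,1.9820584,1.9626883546881656e-34,1.6880054463820458e-31
"""


def significant(rows, fdr=0.05, log2fc=1.0):
    """Keep rows that pass both cutoffs (strict vs. inclusive is assumed)."""
    kept = []
    for row in rows:
        if float(row["FDR"]) < fdr and abs(float(row["log2FC"])) >= log2fc:
            kept.append(row)
    return kept


rows = list(csv.DictReader(io.StringIO(ALL_CSV)))

# With the default cutoffs all three example genes are significant;
# raising the fold-change cutoff to 1.5 drops MTATP6P1.
for row in significant(rows, fdr=0.05, log2fc=1.5):
    print(row["geneName"], row["log2FC"])
```

To run the same check on a real result file, replace io.StringIO(ALL_CSV) with an open file handle.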

  • Volcano plot result example: de_{method}.{group_key}.{ident1}-vs-{ident2}.png/pdf

    In the plot, red dots are significant DEGs that meet both the log2FC and FDR thresholds, blue dots meet the FDR threshold but not the log2FC threshold, and green dots meet the log2FC threshold but not the FDR threshold. By default, the top 5 up- and down-regulated genes are labeled. To label specific genes instead, use the --genelist parameter (e.g., --genelist geneA,geneB,geneC).
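
The colour rule of the volcano plot can be written as a small decision function. This is a sketch of the rule as described here, not SDAS's plotting code. The function name volcano_class is hypothetical, the comparison operators at the cutoffs are assumed, and the label used for points that meet neither threshold is a placeholder because the source does not name that colour.

```python
def volcano_class(log2fc_value, fdr_value, fdr=0.05, log2fc=1.0):
    """Map one gene to the dot colour described for the volcano plot."""
    passes_fdr = fdr_value < fdr              # assumed strict comparison
    passes_fc = abs(log2fc_value) >= log2fc   # assumed inclusive comparison
    if passes_fdr and passes_fc:
        return "red"      # significant DEG: meets both thresholds
    if passes_fdr:
        return "blue"     # meets FDR only
    if passes_fc:
        return "green"    # meets log2FC only
    return "unhighlighted"  # placeholder: colour not stated in the source


print(volcano_class(1.9820584, 1.6880054463820458e-31))  # CLDN4 from the example table
print(volcano_class(0.4, 1e-6))
print(volcano_class(-2.3, 0.4))
```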

Result Interpretation

  1. Gene name uniqueness

    • Before differential analysis, gene names are automatically made unique using make_unique. All outputs and plots use the unique gene names.

  2. Cell and gene filtering

    • Supports filtering cells and genes using parameters such as --min_gene, --max_gene, and --min_cell. If the h5ad file has already been filtered, these can be omitted.
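
For item 1, the effect of making gene names unique can be pictured with the sketch below. It assumes the behaviour follows the AnnData convention, where the first occurrence of a name is kept and later duplicates receive the suffixes -1, -2, and so on; the source only names make_unique and does not spell out the suffix scheme. The duplicated gene list is invented for illustration.

```python
def make_unique(names, join="-"):
    """Keep first occurrences; suffix later duplicates with -1, -2, ... (assumed scheme)."""
    taken = set(names)   # every name already in use, including later ones
    first_seen = set()
    next_suffix = {}
    unique = []
    for name in names:
        if name not in first_seen:
            first_seen.add(name)
            unique.append(name)
            continue
        n = next_suffix.get(name, 1)
        candidate = f"{name}{join}{n}"
        while candidate in taken:  # skip suffixes that collide with existing names
            n += 1
            candidate = f"{name}{join}{n}"
        next_suffix[name] = n + 1
        taken.add(candidate)
        unique.append(candidate)
    return unique


genes = ["AGR2", "CLDN4", "AGR2", "AGR2"]  # hypothetical duplicated gene names
print(make_unique(genes))
```

A practical consequence is that a gene passed to --genelist may need its suffixed form if its name was duplicated in the input.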

Parameter Tuning Suggestions

  1. When the number of bins/cells exceeds 200k, MAST cannot run successfully. In this case, set stricter filtering parameters (--min_gene and --min_cell) to reduce the number of bins/cells before analysis.
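
A rough way to judge how strict the filters need to be is to count how many bins/cells survive a given threshold. The sketch below mimics the meaning of --min_gene, --max_gene, and --min_cell on plain Python dictionaries. The order of operations (cells first, then genes) and the inclusive comparisons are assumptions, and the function name and toy data are hypothetical.

```python
def filter_cells_and_genes(cells, min_gene=0, max_gene=None, min_cell=0):
    """Drop cells by number of detected genes, then genes by number of cells."""
    kept = []
    for counts in cells:
        detected = sum(1 for n in counts.values() if n > 0)
        if detected < min_gene:
            continue
        if max_gene is not None and detected > max_gene:
            continue
        kept.append(counts)
    # Count in how many surviving cells each gene is detected.
    cells_per_gene = {}
    for counts in kept:
        for gene, n in counts.items():
            if n > 0:
                cells_per_gene[gene] = cells_per_gene.get(gene, 0) + 1
    keep_genes = {g for g, c in cells_per_gene.items() if c >= min_cell}
    return [{g: n for g, n in counts.items() if g in keep_genes} for counts in kept]


# Toy data: one {gene: raw count} dictionary per bin/cell.
cells = [
    {"AGR2": 2, "CLDN4": 1, "MTATP6P1": 5},
    {"AGR2": 1},
    {"CLDN4": 3, "MTATP6P1": 1},
]
filtered = filter_cells_and_genes(cells, min_gene=2, min_cell=2)
print(len(filtered), "of", len(cells), "cells kept")
```

Here the second cell is removed for having a single detected gene, and AGR2 is then removed because it remains in only one cell.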
