GSEA Algorithm

Purpose and Usage

  • Scenario 1: Run GSEA between category 1 and category 2, where --ident1 is the treatment group and --ident2 is the control group

    SDAS geneSetEnrichment gsea -i st.h5ad -o outdir \
    --group_key leiden --ident1 1 --ident2 2 --species human
  • Scenario 2: Subset the cells by a column in obs before running GSEA

    SDAS geneSetEnrichment gsea -i st.h5ad -o outdir \
    --group_key leiden --ident1 1 --ident2 2 --species human \
    --subset_key cell_type --subset_values B
  • Scenario 3: Run the analysis against only the databases of interest

    SDAS geneSetEnrichment gsea -i st.h5ad -o outdir \
    --group_key leiden --ident1 1 --ident2 2 \
    --gmt sdas_deg_enrichment/lib/GSEADB/h.all.v2024.1.Hs.symbols.gmt,sdas_deg_enrichment/lib/GSEADB/KEGG_2021_Human.gmt
  • Scenario 4: Plot only pathways of interest. Write the full names of the pathways of interest into a txt file, one per line, and pass this txt file to the analysis process via the --pathways parameter. Note that the specified pathways must be included in the database used.

    SDAS geneSetEnrichment gsea -i st.h5ad -o outdir \
    --group_key leiden --ident1 1 --ident2 2 \
    --pathways ./pathway.txt \
    --gmt sdas_deg_enrichment/lib/GSEADB/h.all.v2024.1.Hs.symbols.gmt,sdas_deg_enrichment/lib/GSEADB/KEGG_2021_Human.gmt
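The pathway file for Scenario 4 is plain text with one full pathway name per line. As a minimal sketch, it could be generated as follows; the helper name write_pathway_file is hypothetical and not part of SDAS, and the two pathway names are taken from the Hallmark example output shown later on this page.

```python
from pathlib import Path

# Full pathway names, as they appear in the database passed via --gmt.
# These two are taken from the Hallmark example output on this page.
pathways = [
    "HALLMARK_MYC_TARGETS_V1",
    "HALLMARK_OXIDATIVE_PHOSPHORYLATION",
]


def write_pathway_file(path, names):
    """Write one pathway name per line, skipping blanks and duplicates.

    Hypothetical helper for illustration; not part of SDAS.
    """
    kept = []
    for name in names:
        name = name.strip()
        if name and name not in kept:
            kept.append(name)
    Path(path).write_text("\n".join(kept) + "\n", encoding="utf-8")
    return kept


written = write_pathway_file("pathway.txt", pathways)
print(written)
```

The resulting pathway.txt can then be passed with --pathways ./pathway.txt, as in the command above.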

Input Parameter Description

| GSEA Parameter | Required | Default Value | Description |
| --- | --- | --- | --- |
| -i / --input | Yes | | Input h5ad file. |
| --group_key | Yes | | Name of the column in the h5ad obs that holds the group labels; it must contain both ident1 and ident2. |
| --ident1 | Yes | | First identity class (the treatment group). |
| --ident2 | Yes | | Second identity class to compare against (the control group). |
| -o / --output | Yes | | The GSEApy output directory. Default: the current working directory. |
| --layer | No | None | Layer that holds the raw gene expression. If None, adata.raw.X or adata.X is used. |
| --gene_symbol_key | No | real_gene_name | Key that holds the gene names. |
| --species | No | human | Built-in GMT database to use: human or mouse. More databases are available at https://amp.pharm.mssm.edu/modEnrichr. |
| --subset_key | No | | Key used for subsetting (optional), e.g. cell_type. |
| --subset_values | No | | Values used for subsetting (optional), e.g. cell1,cell2. |
| --sample_size | No | 0 | Number of cells to sample at random; 0 disables sampling. |
| --gmt | No | | Custom gene set database in GMT format. Separate multiple databases with ",". Default: the built-in database selected by --species. |
| --graph | No | 5 | Number of top pathway plots to produce. |
| --pathways | No | | Text file naming the pathways to plot. Default: the top pathways, as set by --graph. |
| --permutation_type | No | gene_set | Type of permutation reshuffling: phenotype (sample labels) or gene_set (gene labels). |
| -v / --verbose | No | False | Increase output verbosity and print the progress of the job. |
| --permutation_num | No | 1000 | Number of random permutations, used to calculate esnulls. |
| --min_size | No | 15 | Minimum number of input genes present in a gene set. |
| --max_size | No | 500 | Maximum number of input genes present in a gene set. |
| --weight | No | 1 | Weighted score of rank_metrics, used to weight the input genes. Choose from {0, 1, 1.5, 2}. |
| --method | No | signal_to_noise | Method used to calculate the correlation or ranking metric. Choose from signal_to_noise, abs_signal_to_noise, t_test, ratio_of_classes, diff_of_classes, log2_ratio_of_classes. |
| --ascending | No | False | Sorting order of the rank metric. Passing the --ascending flag sets ascending to True. |
| --seed | No | 123 | Random seed. |
| --threads | No | 1 | Number of threads to use. |
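To illustrate what the default --method value and the --ascending flag control, the sketch below ranks genes by the signal-to-noise ratio as it is commonly defined for GSEA: the difference of the two group means divided by the sum of their standard deviations. This is a simplified illustration, not the code SDAS or GSEApy runs; real implementations may add adjustments such as a minimum standard deviation. The gene names and expression values are invented toy data, and both function names are hypothetical.

```python
from statistics import mean, stdev


def signal_to_noise(group1, group2):
    """(mean1 - mean2) / (sd1 + sd2) for one gene's expression values."""
    noise = stdev(group1) + stdev(group2)
    if noise == 0:
        # Assumption for this sketch: a gene with no spread gets a score of 0.
        return 0.0
    return (mean(group1) - mean(group2)) / noise


def rank_genes(expr1, expr2, ascending=False):
    """Score every gene and sort; descending order mirrors the default."""
    scores = {gene: signal_to_noise(expr1[gene], expr2[gene]) for gene in expr1}
    return sorted(scores.items(), key=lambda item: item[1], reverse=not ascending)


# Toy expression values per gene: ident1 (treatment) vs ident2 (control).
expr_ident1 = {"GENE_A": [5.0, 6.0, 7.0], "GENE_B": [1.0, 2.0, 3.0], "GENE_C": [4.0, 4.0, 5.0]}
expr_ident2 = {"GENE_A": [1.0, 2.0, 3.0], "GENE_B": [5.0, 6.0, 7.0], "GENE_C": [4.0, 5.0, 4.0]}

ranking = rank_genes(expr_ident1, expr_ident2)
for gene, score in ranking:
    print(f"{gene}\t{score:+.2f}")
```

Genes that are higher in ident1 land at the top of the ranking and genes that are higher in ident2 land at the bottom, which is why a positive enrichment score points toward --ident1.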

Output Results Display

| GSEA Result File | Description |
| --- | --- |
| GSEA.{database}.csv | Result file in CSV format |
| GSEA.{database}.top5.pdf/png | Plots of the top 5 pathways in PDF and PNG formats |

  • File format example: GSEA.{database}.csv is the GSEA result file. Its columns include Name, Term, ES, NES, NOM p-val, FDR q-val, FWER p-val, Tag %, Gene %, and Lead_genes. Term is the pathway name. ES is the Enrichment Score, which reflects how strongly the members of a gene set cluster at either end of the ranked gene list (e.g., ranked by differential expression): a positive ES means the gene set is enriched at the top of the list (positively correlated with the phenotype), and a negative ES means it is enriched at the bottom (negatively correlated). NES is the Normalized Enrichment Score. NOM p-val is the nominal p-value, FDR q-val is the false discovery rate adjusted p-value, and FWER p-val is the family-wise error rate adjusted p-value. Tag % is the share of the gene set's genes that fall in the core enrichment (leading-edge) region, written as a fraction such as 160/195. Gene % is the percentage of the ranked gene list covered by that region. Lead_genes are the core genes that contribute most to the ES.

| Name | Term | ES | NES | NOM p-val | FDR q-val | FWER p-val | Tag % | Gene % | Lead_genes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gsea | HALLMARK_MYC_TARGETS_V1 | 0.7472938191195556 | 2.39333105644001 | 0.0 | 0.0 | 0.0 | 160/195 | 18.89% | RPL14;HNRNPA2B1;... |
| gsea | HALLMARK_OXIDATIVE_PHOSPHORYLATION | 0.7431758291176868 | 2.376055485647371 | 0.0 | 0.0 | 0.0 | 168/200 | 20.44% | MDH2;COX8A;... |
| gsea | HALLMARK_ALLOGRAFT_REJECTION | 0.744882727767552 | 2.3688992213810462 | 0.0 | 0.0 | 0.0 | 118/194 | 14.03% | ITGB2;HLA-DRA;... |
| gsea | ... | ... | ... | ... | ... | ... | ... | ... | ... |
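A result file with the columns above can be filtered and sorted with a few lines of standard-library Python. The sketch below is only an illustration: it assumes the CSV header matches the column names listed above exactly, it reads an inline copy of the three example rows instead of a real GSEA.{database}.csv, and the 0.05 FDR cutoff is an arbitrary choice rather than an SDAS default. The function name significant_terms is hypothetical.

```python
import csv
import io

# Inline copy of the example rows above; the Lead_genes values are truncated
# in the source and are kept that way here.
SAMPLE_CSV = """Name,Term,ES,NES,NOM p-val,FDR q-val,FWER p-val,Tag %,Gene %,Lead_genes
gsea,HALLMARK_MYC_TARGETS_V1,0.7472938191195556,2.39333105644001,0.0,0.0,0.0,160/195,18.89%,RPL14;HNRNPA2B1;...
gsea,HALLMARK_OXIDATIVE_PHOSPHORYLATION,0.7431758291176868,2.376055485647371,0.0,0.0,0.0,168/200,20.44%,MDH2;COX8A;...
gsea,HALLMARK_ALLOGRAFT_REJECTION,0.744882727767552,2.3688992213810462,0.0,0.0,0.0,118/194,14.03%,ITGB2;HLA-DRA;...
"""


def significant_terms(handle, fdr_cutoff=0.05):
    """Return (Term, NES, Tag %) for rows under the FDR cutoff, largest |NES| first."""
    kept = []
    for row in csv.DictReader(handle):
        if float(row["FDR q-val"]) < fdr_cutoff:
            kept.append((row["Term"], float(row["NES"]), row["Tag %"]))
    return sorted(kept, key=lambda item: abs(item[1]), reverse=True)


top = significant_terms(io.StringIO(SAMPLE_CSV))
for term, nes, tag in top:
    print(f"{term}\tNES={nes:.3f}\tTag={tag}")
```

To read a real result file, replace io.StringIO(SAMPLE_CSV) with an open file handle, for example open(path, newline="").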

  • Top Terms Enrichment Curve Plot: GSEA.{database}.top5.pdf/png (see example below). In the plot, a positive Enrichment Score indicates the term is positively correlated with --ident1, while a negative score indicates negative correlation.
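As background for reading the curve, the running sum behind an enrichment plot can be sketched in a few lines. This is a simplified version of the weighted running-sum statistic used by GSEA, not the SDAS or GSEApy implementation: it omits permutations and normalization, so it yields an ES but no NES or p-values. The function name and the toy ranking are hypothetical; the weight argument plays the role described for --weight.

```python
def running_enrichment(ranked, gene_set, weight=1.0):
    """Return (ES, curve) for a ranked list of (gene, score) pairs.

    ranked must already be sorted by score, highest first.
    """
    # Each gene in the set steps the sum up by |score|**weight;
    # every other gene steps it down by a constant amount.
    hits = [abs(score) ** weight if gene in gene_set else 0.0 for gene, score in ranked]
    hit_total = sum(hits)
    n_miss = sum(1 for gene, _ in ranked if gene not in gene_set)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")

    running = 0.0
    curve = []
    for (gene, _), hit in zip(ranked, hits):
        if gene in gene_set:
            running += hit / hit_total
        else:
            running -= 1.0 / n_miss
        curve.append(running)

    # ES is the point on the curve that deviates most from zero.
    es = max(curve, key=abs)
    return es, curve


# Toy ranking: positive scores are higher in ident1, negative in ident2.
ranked = [("G1", 3.0), ("G2", 2.0), ("G3", 1.0), ("G4", -1.0), ("G5", -2.0), ("G6", -3.0)]

es_top, curve_top = running_enrichment(ranked, {"G1", "G2"})
es_bottom, curve_bottom = running_enrichment(ranked, {"G5", "G6"})
print(round(es_top, 3), round(es_bottom, 3))
```

A gene set concentrated at the top of the ranking pushes the curve up early and gives a positive ES, while a set concentrated at the bottom drags the curve down and gives a negative ES, matching the interpretation of the plot described above.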
