GSEA Algorithm
Purpose and Usage
Scenario 1: Perform GSEA analysis between specified category 1 and category 2, where --ident1 is the treatment and --ident2 is the control
SDAS geneSetEnrichment gsea -i st.h5ad -o outdir \
    --group_key leiden --ident1 1 --ident2 2 --species human
Scenario 2: Subset a column in obs before GSEA analysis
SDAS geneSetEnrichment gsea -i st.h5ad -o outdir \
    --group_key leiden --ident1 1 --ident2 2 --species human \
    --subset_key cell_type --subset_values B
Scenario 3: Analyze only with databases of interest
SDAS geneSetEnrichment gsea -i st.h5ad -o outdir \
    --group_key leiden --ident1 1 --ident2 2 \
    --gmt sdas_deg_enrichment/lib/GSEADB/h.all.v2024.1.Hs.symbols.gmt,sdas_deg_enrichment/lib/GSEADB/KEGG_2021_Human.gmt
Scenario 4: Plot only pathways of interest. Write the full names of the pathways of interest into a txt file, one per line, and pass this file to the analysis via the --pathways parameter. Note that the specified pathways must be included in the database used.
SDAS geneSetEnrichment gsea -i st.h5ad -o outdir \
    --group_key leiden --ident1 1 --ident2 2 \
    --pathways ./pathway.txt \
    --gmt sdas_deg_enrichment/lib/GSEADB/h.all.v2024.1.Hs.symbols.gmt,sdas_deg_enrichment/lib/GSEADB/KEGG_2021_Human.gmt
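Because every name passed through --pathways must exist in the database used, it can help to check the txt file against the GMT file before running the analysis. The following is a minimal sketch, assuming the usual GMT layout (one gene set per line, tab-separated, with the gene set name in the first field). The helper names `read_gmt_terms` and `missing_pathways` are hypothetical and not part of SDAS, and the in-memory data is illustrative only.

```python
import io


def read_gmt_terms(handle):
    """Return the set of gene set names (first tab-separated field) in a GMT file."""
    terms = set()
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        terms.add(line.split("\t", 1)[0])
    return terms


def missing_pathways(pathway_lines, gmt_terms):
    """Return the requested pathway names that are absent from the GMT terms."""
    wanted = [p.strip() for p in pathway_lines if p.strip()]
    return [p for p in wanted if p not in gmt_terms]


# Tiny in-memory stand-ins for a GMT file and a pathway.txt file (illustrative data).
gmt_text = (
    "HALLMARK_MYC_TARGETS_V1\tdescription\tRPL14\tHNRNPA2B1\n"
    "HALLMARK_OXIDATIVE_PHOSPHORYLATION\tdescription\tMDH2\tCOX8A\n"
)
pathway_text = "HALLMARK_MYC_TARGETS_V1\nHALLMARK_TYPO\n"

terms = read_gmt_terms(io.StringIO(gmt_text))
missing = missing_pathways(io.StringIO(pathway_text), terms)
print(missing)  # ['HALLMARK_TYPO']
```

With real files, the two `io.StringIO` objects would be replaced by `open(...)` calls on the GMT file and on pathway.txt. An empty result means every requested pathway is present in that database.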
Input Parameter Description
| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| -i / --input | Yes | | Input h5ad file. |
| --group_key | Yes | | Name of the column in h5ad obs that holds the group labels; it must contain both ident1 and ident2. |
| --ident1 | Yes | | First identity class (the treatment group in the comparison). |
| --ident2 | Yes | | Second identity class for comparison (the control group). |
| -o / --output | Yes | | The GSEApy output directory. Default: the current working directory. |
| --layer | No | None | Layer holding the raw gene expression. If None, adata.raw.X or adata.X is used. |
| --gene_symbol_key | No | real_gene_name | Key used for gene names. |
| --species | No | human | Use the built-in GMT database: human or mouse. More databases are listed at https://amp.pharm.mssm.edu/modEnrichr. |
| --subset_key | No | | Key for subsetting (optional), e.g. cell_type. |
| --subset_values | No | | Values used for subsetting (optional), e.g. cell1,cell2. |
| --sample_size | No | 0 | Number of cells to sample at random; 0 disables sampling. |
| --gmt | No | | Custom gene set database in GMT format. Separate multiple databases with ",". By default, the built-in database selected by --species is used. |
| --graph | No | 5 | Number of top pathway plots to produce. |
| --pathways | No | | Txt file listing the names of the pathways to plot. Default: the top pathways, as set by --graph. |
| --permutation_type | No | gene_set | Type of permutation reshuffling: phenotype (sample labels) or gene_set (gene labels). |
| -v / --verbose | No | False | Increase output verbosity and print the progress of the job. |
| --permutation_num | No | 1000 | Number of random permutations, used for calculating esnulls. |
| --min_size | No | 15 | Minimum number of input genes present in a gene set. |
| --max_size | No | 500 | Maximum number of input genes present in a gene set. |
| --weight | No | 1 | Weighted score of the rank metrics, used for weighting input genes. Choose from {0, 1, 1.5, 2}. |
| --method | No | signal_to_noise | Method used to calculate the ranking metric. Choose from {'signal_to_noise', 'abs_signal_to_noise', 't_test', 'ratio_of_classes', 'diff_of_classes', 'log2_ratio_of_classes'}. |
| --ascending | No | False | Rank metric sorting order. Passing the --ascending flag sets ascending to True. |
| --seed | No | 123 | Random seed. |
| --threads | No | 1 | Number of threads to use. |
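The default --method, signal_to_noise, ranks each gene by how well its expression separates the two groups. The following is a minimal sketch of the textbook definition, (mean1 - mean2) / (std1 + std2). The function name is hypothetical, and using the population standard deviation with no lower bound on the denominator is an assumption; the exact variant SDAS applies is not stated in this document.

```python
from statistics import mean, pstdev


def signal_to_noise(group1, group2):
    """Textbook signal-to-noise ratio for one gene across two groups of cells.

    Assumption: population standard deviation, no flooring of small values.
    """
    denom = pstdev(group1) + pstdev(group2)
    if denom == 0:
        raise ValueError("both groups have zero variance; the ratio is undefined")
    return (mean(group1) - mean(group2)) / denom


# Illustrative expression values for one gene (not real data).
ident1_values = [4.0, 6.0]  # mean 5, population std 1
ident2_values = [1.0, 3.0]  # mean 2, population std 1

s2n = signal_to_noise(ident1_values, ident2_values)
print(s2n)  # 1.5
```

Under this definition a positive value means the gene is higher in the first group, so with descending sorting such genes land at the top of the ranked list.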
Output Results Display
| File | Description |
| --- | --- |
| GSEA.{database}.csv | Result file in csv format |
| GSEA.{database}.top5.pdf/png | Top 5 pathway plots in pdf and png formats |
File format example:
GSEA.{database}.csv is the GSEA analysis result file. It contains Name, Term, ES, NES, NOM p-val, FDR q-val, FWER p-val, Tag %, Gene %, Lead_genes, etc.
Term is the pathway name.
ES is the Enrichment Score, reflecting the degree to which gene set members are enriched in the ranked gene list (e.g., ranked by differential expression). A positive ES means the gene set is enriched at the top of the list (positively correlated with the phenotype); a negative ES means it is enriched at the bottom (negatively correlated).
NES is the Normalized Enrichment Score.
NOM p-val is the nominal p-value.
FDR q-val is the false discovery rate adjusted p-value.
FWER p-val is the family-wise error rate adjusted p-value.
Tag % is the share of the gene set that falls before the running enrichment peak (the core enrichment region), shown as a count out of the gene set size.
Gene % is the percentage of the ranked gene list that lies before the running enrichment peak.
Lead_genes are the core genes contributing most to the ES.
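| Name | Term | ES | NES | NOM p-val | FDR q-val | FWER p-val | Tag % | Gene % | Lead_genes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gsea | HALLMARK_MYC_TARGETS_V1 | 0.7472938191195556 | 2.39333105644001 | 0.0 | 0.0 | 0.0 | 160/195 | 18.89% | RPL14;HNRNPA2B1;... |
| gsea | HALLMARK_OXIDATIVE_PHOSPHORYLATION | 0.7431758291176868 | 2.376055485647371 | 0.0 | 0.0 | 0.0 | 168/200 | 20.44% | MDH2;COX8A;... |
| gsea | HALLMARK_ALLOGRAFT_REJECTION | 0.744882727767552 | 2.3688992213810462 | 0.0 | 0.0 | 0.0 | 118/194 | 14.03% | ITGB2;HLA-DRA;... |
| gsea | ... | ... | ... | ... | ... | ... | ... | ... | ... |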
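A result file with these columns can be filtered with a few lines of standard-library Python. The sketch below assumes the csv header uses exactly the column names listed above (Term, NES, FDR q-val). The helper name `significant_terms`, the 0.05 cutoff, and the last data row are hypothetical and only serve the illustration; the other values are rounded from the example.

```python
import csv
import io


def significant_terms(handle, fdr_cutoff=0.05):
    """Return (Term, NES) pairs with FDR q-val <= cutoff, sorted by NES, highest first."""
    reader = csv.DictReader(handle)
    kept = [row for row in reader if float(row["FDR q-val"]) <= fdr_cutoff]
    kept.sort(key=lambda row: float(row["NES"]), reverse=True)
    return [(row["Term"], float(row["NES"])) for row in kept]


# In-memory stand-in for GSEA.{database}.csv; the last row is made up for illustration.
csv_text = (
    "Name,Term,ES,NES,NOM p-val,FDR q-val,FWER p-val,Tag %,Gene %,Lead_genes\n"
    "gsea,HALLMARK_OXIDATIVE_PHOSPHORYLATION,0.743,2.376,0.0,0.0,0.0,168/200,20.44%,MDH2;COX8A\n"
    "gsea,HALLMARK_MYC_TARGETS_V1,0.747,2.393,0.0,0.0,0.0,160/195,18.89%,RPL14;HNRNPA2B1\n"
    "gsea,EXAMPLE_NOT_SIGNIFICANT,0.210,0.950,0.40,0.60,0.90,10/50,30.00%,GENE1;GENE2\n"
)

top = significant_terms(io.StringIO(csv_text))
for term, nes in top:
    print(term, nes)
```

For a real run, `io.StringIO(csv_text)` would be replaced by `open(path, newline="")` on the result file in the output directory.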
Top Terms Enrichment Curve Plot:
GSEA.{database}.top5.pdf/png (see example below). In the plot, a positive Enrichment Score indicates the term is positively correlated with --ident1, while a negative score indicates negative correlation.
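The sign convention can be illustrated with a sketch of the standard GSEA running-sum statistic: walking down the ranked list, the sum rises at each gene in the set (in proportion to |score| raised to the weight) and falls at each gene outside it, and the ES is the largest deviation from zero. This is a simplified illustration under those assumptions, not the code SDAS or GSEApy runs. The function name and the toy gene list are hypothetical, and the `weight` argument is meant to mirror the role of --weight.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Running-sum enrichment score for one gene set over a ranked gene list.

    ranked_genes and scores are parallel lists, already sorted by score.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_weights = [abs(s) ** weight if h else 0.0 for s, h in zip(scores, hits)]
    total_hit = sum(hit_weights)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")

    running = 0.0
    es = 0.0
    for w, is_hit in zip(hit_weights, hits):
        # Step up for a gene in the set, step down for a gene outside it.
        running += w / total_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(es):
            es = running  # keep the largest deviation from zero, with its sign
    return es


# Toy ranked list (illustrative): high scores stand for genes higher in ident1.
genes = ["A", "B", "C", "D", "E", "F"]
scores = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]

es_top = enrichment_score(genes, scores, {"A", "B"})     # set sits at the top
es_bottom = enrichment_score(genes, scores, {"E", "F"})  # set sits at the bottom
print(es_top > 0, es_bottom < 0)  # True True
```

A gene set concentrated at the top of the list yields a positive ES, and one concentrated at the bottom yields a negative ES, which matches the reading of the curve described above.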
