GN/MCL/Kmeans Clustering
Purpose
Automatically constructs a protein-protein interaction (PPI) network for a gene set from the STRING database, and can output results from up to three clustering algorithms (GN, MCL, and k-means) in a single run.
Running Method
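The tool can run against either the default species database or a custom database. The first two commands below use the default human database; the third supplies custom interaction and alias files: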
SDAS PPI --input gene_300.txt --species human --score_threshold 600 --centers 9 --output results_300
SDAS PPI -i gene_300.txt -o ./result --species human --cluster GN kmeans
SDAS PPI -i gene_300.txt -o ./result --links_db 9606.protein.links.v12.0.txt --aliases_db 9606.protein.aliases.v12.0.txt --cluster GN kmeans
Input Parameter Description
| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| -i/--input | Yes | | Input gene list file (gene symbols, one per line) |
| -o/--output | Yes | | Output folder; created automatically if it does not exist |
| --species | No | human | Species (human/mouse); can be omitted when a custom database is supplied with --links_db and --aliases_db |
| --links_db | No | | Path to a custom protein interaction file |
| --aliases_db | No | | Path to a custom protein alias file |
| --score_threshold | No | 700 | Protein interaction score threshold, 400-900; higher scores mean higher reliability and fewer network nodes |
| --cluster | No | GN | Clustering algorithm (GN/kmeans/mcl); more than one may be given |
| --centers | No | 5 | Number of k-means cluster centers |
| --inflation | No | 2.0 | MCL inflation parameter, 1.5-3.0 |
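The effect of --score_threshold on network size can be illustrated with a minimal sketch. It assumes the threshold is applied as an inclusive filter on the combined score (the tool's exact comparison is not documented here) and uses the three sample rows from the PPI_results.csv example in this document; the function names are hypothetical.

```python
# Sample edges copied from the PPI_results.csv example in this document.
edges = [
    ("MEPIA1", "CDH17", 466),
    ("LGLA3", "CDH17", 561),
    ("PTK2", "CDH17", 482),
]


def filter_edges(edges, score_threshold):
    """Keep edges whose score is >= score_threshold (inclusive bound is an assumption)."""
    return [edge for edge in edges if edge[2] >= score_threshold]


def nodes_of(edges):
    """Collect every gene that still appears in at least one edge."""
    nodes = set()
    for gene_a, gene_b, _score in edges:
        nodes.update((gene_a, gene_b))
    return nodes


# A higher threshold leaves fewer edges and therefore fewer connected genes.
for threshold in (400, 500, 600):
    kept = filter_edges(edges, threshold)
    print(threshold, len(kept), sorted(nodes_of(kept)))
```

Raising the threshold from 400 to 500 drops two of the three sample edges, which is why stricter thresholds produce smaller networks.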
Output Results Display
| File | Description |
| --- | --- |
| PPI_results.csv | Interaction score for each pair of genes; supports Cytoscape import |
| cluster_results.csv | Connectivity and cluster assignment of each input gene |
| network_<cluster>_visualization.png/pdf | Interaction network diagram of all genes; line thickness represents interaction score, node size represents connectivity, color represents cluster |
| network_<cluster>_top_clusters.png/pdf | Network diagrams of the 9 clusters with the most nodes; circular arrangement when node count <50, global arrangement when >50 (separate diagrams for each selected clustering method) |
Protein interaction relationship table:
PPI_results.csv
Each row gives the interaction score for one pair of genes. The file can be imported directly into Cytoscape.

| from_gene | to_gene | combined_score |
| --- | --- | --- |
| MEPIA1 | CDH17 | 466 |
| LGLA3 | CDH17 | 561 |
| PTK2 | CDH17 | 482 |
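As a rough illustration of how this edge table relates to connectivity, the sketch below counts how many partners each gene has. It assumes the file is comma-separated with the header row shown above and that each gene pair is listed only once; the sample text and the function name are illustrative, not part of the tool.

```python
import csv
import io
from collections import Counter

# In-memory stand-in for PPI_results.csv, using the sample rows from this document.
SAMPLE = """from_gene,to_gene,combined_score
MEPIA1,CDH17,466
LGLA3,CDH17,561
PTK2,CDH17,482
"""


def degree_from_edges(handle):
    """Count interaction partners per gene from an edge table."""
    degree = Counter()
    for row in csv.DictReader(handle):
        # Each edge contributes one partner to both of its endpoints.
        degree[row["from_gene"]] += 1
        degree[row["to_gene"]] += 1
    return degree


degree = degree_from_edges(io.StringIO(SAMPLE))
for gene, count in degree.most_common():
    print(gene, count)
```

For a real run, replace `io.StringIO(SAMPLE)` with `open("PPI_results.csv", newline="")`. If the file lists each pair in both directions, the counts would need to be halved.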
Clustering result table:
cluster_results.csv
Each row represents one gene, including its connectivity (the number of genes it interacts with) and its cluster assignment under each clustering algorithm.

| gene | degree | mcl_cluster | kmeans_cluster | betweenness_cluster |
| --- | --- | --- | --- | --- |
| MEPIA1 | 6 | 1 | 2 | 5 |
| LGLA3 | 5 | 1 | 2 | 5 |
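To list the members of each cluster, the table can be grouped by one of its cluster columns. This is a minimal sketch that assumes a comma-separated file with the header shown above; which cluster columns are present presumably depends on the algorithms selected with --cluster, and betweenness_cluster is assumed here to hold the GN assignment. The function name is hypothetical.

```python
import csv
import io
from collections import defaultdict

# In-memory stand-in for cluster_results.csv, using the sample rows from this document.
SAMPLE = """gene,degree,mcl_cluster,kmeans_cluster,betweenness_cluster
MEPIA1,6,1,2,5
LGLA3,5,1,2,5
"""


def members_by_cluster(handle, column):
    """Map each cluster label in `column` to the list of genes assigned to it."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        groups[row[column]].append(row["gene"])
    return dict(groups)


# Labels stay as strings because csv.DictReader does not convert types.
print(members_by_cluster(io.StringIO(SAMPLE), "mcl_cluster"))
print(members_by_cluster(io.StringIO(SAMPLE), "betweenness_cluster"))
```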
Interaction network visualization diagram:
network_<cluster>_visualization.png/pdf
Shows the interaction network of all genes. Line thickness between nodes represents the interaction score, node size represents connectivity, and nodes of the same color belong to the same cluster. A separate diagram is produced for each selected clustering method.


Largest-cluster subnetwork diagram:
network_<cluster>_top_clusters.png/pdf
Shows network diagrams of the 9 clusters with the most nodes. Nodes are arranged in a circle when the node count is <50 and in a global layout when it is >50.
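The selection and layout rule described here can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: the node-count rule is assumed to apply per cluster, a count of exactly 50 (which the description leaves unspecified) is treated as "global", and the gene names, cluster labels, and function names are made up.

```python
from collections import Counter


def top_clusters(assignments, n=9):
    """Return the n largest clusters as (cluster_id, node_count) pairs, largest first."""
    return Counter(assignments.values()).most_common(n)


def choose_layout(node_count, cutoff=50):
    """Pick a layout label from the node count."""
    # Exactly `cutoff` nodes is not covered by the description; treated as "global" here.
    return "circle" if node_count < cutoff else "global"


# Hypothetical assignment: 60 genes in cluster "A" and 10 genes in cluster "B".
assignments = {f"gene{i}": ("A" if i < 60 else "B") for i in range(70)}

for cluster_id, size in top_clusters(assignments):
    print(cluster_id, size, choose_layout(size))
```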


Performance Description
A run takes a few minutes and uses no more than 1 GB of memory.