GN/MCL/Kmeans Clustering

Purpose

Automatically constructs a protein-protein interaction (PPI) network for a gene set from the STRING database and outputs up to three types of clustering results (GN, MCL, and kmeans) in a single run.

Running Method

Run the tool with either the default species database or a custom database. Both modes are supported:

SDAS PPI --input gene_300.txt --species human --score_threshold 600 --centers 9 --output results_300
SDAS PPI -i gene_300.txt -o ./result --species human --cluster GN kmeans
SDAS PPI -i gene_300.txt -o ./result --links_db 9606.protein.links.v12.0.txt --aliases_db 9606.protein.aliases.v12.0.txt --cluster GN kmeans
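
The input file used in the commands above (gene_300.txt) is a plain list of gene symbols, one per line. As a minimal sketch, a list held in Python could be written in that layout as follows. The helper name write_gene_list and the file name gene_list.txt are hypothetical, and dropping blank entries and duplicates is a hygiene assumption, not a documented requirement of SDAS.

```python
import tempfile
from pathlib import Path


def write_gene_list(genes, path):
    """Write gene symbols one per line, skipping blanks and repeated symbols.

    Hypothetical helper: only the "one gene symbol per line" layout comes
    from the documentation; the clean-up steps are an assumption.
    """
    seen = set()
    cleaned = []
    for gene in genes:
        symbol = gene.strip()
        if symbol and symbol not in seen:
            seen.add(symbol)
            cleaned.append(symbol)
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "gene_list.txt"
    kept = write_gene_list(["CDH17", " PTK2 ", "", "CDH17"], target)
    written = target.read_text(encoding="utf-8")
```

The resulting file could then be passed to SDAS PPI through -i/--input.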

Input Parameter Description

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| -i/--input | Yes | - | Input gene list file (gene symbols, one per line) |
| -o/--output | Yes | - | Output folder; created automatically if it does not exist |
| --species | No | human | Species (human or mouse). Can be omitted when a custom database is supplied via --links_db and --aliases_db |
| --links_db | No | - | Path to a custom protein interaction file |
| --aliases_db | No | - | Path to a custom protein alias file |
| --score_threshold | No | 700 | Protein interaction score threshold, chosen between 400 and 900. A higher threshold means more reliable interactions and fewer network nodes |
| --cluster | No | GN | Clustering algorithm (GN, kmeans, or mcl). Several can be given at once, for example --cluster GN kmeans |
| --centers | No | 5 | Number of cluster centers for kmeans |
| --inflation | No | 2.0 | MCL inflation parameter, in the range 1.5-3.0 |
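
To illustrate why a higher --score_threshold leaves fewer network nodes, here is a minimal sketch that filters a few rows in the STRING protein.links layout (space-separated, with a header row and a combined_score column). The protein identifiers and scores are made up for illustration, and keeping edges with a score greater than or equal to the threshold is an assumption; the documentation does not say whether SDAS treats the boundary as inclusive.

```python
import io

# Made-up rows in the STRING "protein.links" layout (illustrative only).
SAMPLE_LINKS = """protein1 protein2 combined_score
9606.ENSP000001 9606.ENSP000002 950
9606.ENSP000001 9606.ENSP000003 720
9606.ENSP000002 9606.ENSP000003 410
9606.ENSP000003 9606.ENSP000004 699
"""


def filter_links(handle, score_threshold=700):
    """Return (protein1, protein2, score) tuples with score >= threshold.

    Assumption: the comparison is inclusive; SDAS may behave differently.
    """
    next(handle)  # skip the header row
    kept = []
    for line in handle:
        parts = line.split()
        if len(parts) != 3:
            continue  # ignore blank or malformed lines
        protein1, protein2, score = parts[0], parts[1], int(parts[2])
        if score >= score_threshold:
            kept.append((protein1, protein2, score))
    return kept


def node_count(edges):
    """Count the distinct proteins that still appear in at least one edge."""
    return len({p for p1, p2, _ in edges for p in (p1, p2)})


# Compare how many edges and nodes survive at three thresholds.
for threshold in (400, 700, 900):
    edges = filter_links(io.StringIO(SAMPLE_LINKS), threshold)
    print(threshold, len(edges), node_count(edges))
```

On this toy input the network shrinks as the threshold rises, which is the trade-off described for --score_threshold.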

Output Results Display

| Result File | Description |
| --- | --- |
| PPI_results.csv | Interaction score between each pair of genes; can be imported into Cytoscape |
| cluster_results.csv | Connectivity and cluster assignment of each input gene |
| network_<cluster>_visualization.png/pdf | Interaction network of all genes. Edge thickness represents the interaction score, node size represents connectivity, and color represents the cluster |
| network_<cluster>_top_clusters.png/pdf | Network diagrams of the 9 clusters with the most nodes. Circular layout when the node count is <50, global layout when it is >50 (separate diagrams for each selected clustering method) |

Protein interaction relationship table: PPI_results.csv
Each row gives the interaction score for one pair of genes. The file can be imported directly into Cytoscape.

| from_gene | to_gene | combined_score |
| --- | --- | --- |
| MEPIA1 | CDH17 | 466 |
| LGLA3 | CDH17 | 561 |
| PTK2 | CDH17 | 482 |

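
As a rough sketch of how the connectivity (degree) values relate to this table, the snippet below counts the distinct interaction partners of each gene from rows in the PPI_results.csv layout. It assumes the file is comma-separated with the header from_gene,to_gene,combined_score, and it uses only the three sample rows shown above, so the counts reflect that excerpt and not a full result file.

```python
import csv
import io
from collections import defaultdict

# The three sample rows from the table above, in an assumed comma-separated layout.
SAMPLE_PPI = """from_gene,to_gene,combined_score
MEPIA1,CDH17,466
LGLA3,CDH17,561
PTK2,CDH17,482
"""


def degrees(handle):
    """Map each gene to the number of distinct genes it interacts with."""
    neighbors = defaultdict(set)
    for row in csv.DictReader(handle):
        a, b = row["from_gene"], row["to_gene"]
        if a == b:
            continue  # ignore self-interactions, if any
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {gene: len(partners) for gene, partners in neighbors.items()}


degree = degrees(io.StringIO(SAMPLE_PPI))
print(degree)
```

In this excerpt CDH17 is the only gene with more than one partner, since all three rows share it.
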
Clustering result table: cluster_results.csv
Each row represents one gene, with its connectivity (the number of genes it interacts with) and its cluster assignment under each clustering algorithm.

| gene | degree | mcl_cluster | kmeans_cluster | betweenness_cluster |
| --- | --- | --- | --- | --- |
| MEPIA1 | 6 | 1 | 2 | 5 |
| LGLA3 | 5 | 1 | 2 | 5 |

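
A minimal sketch of reading this table back and listing the genes in each cluster for one algorithm is shown below. It assumes cluster_results.csv is comma-separated with the column names shown above, and that betweenness_cluster holds the GN result (an inference from the column name, not confirmed by the documentation). The helper name members_by_cluster is hypothetical.

```python
import csv
import io
from collections import defaultdict

# The two sample rows from the table above, in an assumed comma-separated layout.
SAMPLE_CLUSTERS = """gene,degree,mcl_cluster,kmeans_cluster,betweenness_cluster
MEPIA1,6,1,2,5
LGLA3,5,1,2,5
"""


def members_by_cluster(handle, column):
    """Group gene names by the cluster label found in the given column."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        groups[row[column]].append(row["gene"])
    return dict(groups)


mcl_groups = members_by_cluster(io.StringIO(SAMPLE_CLUSTERS), "mcl_cluster")
kmeans_groups = members_by_cluster(io.StringIO(SAMPLE_CLUSTERS), "kmeans_cluster")
print(mcl_groups)
print(kmeans_groups)
```

Cluster labels are kept as strings here because the CSV reader does not convert types; cast them with int() if numeric sorting is needed.
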
Interaction network visualization diagram: network_<cluster>_visualization.png/pdf
Shows the interaction network of all genes. Edge thickness represents the interaction score, node size represents connectivity, and nodes of the same color belong to the same cluster. A separate diagram is produced for each selected clustering method.

Largest cluster subnetwork diagram: network_<cluster>_top_clusters.png/pdf
Shows network diagrams of the 9 clusters with the most nodes, using a circular layout when the node count is <50 and a global layout when it is >50.
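
The selection rule described above can be sketched as follows: rank clusters by node count, keep the largest ones, and choose a layout from the node count. This is an illustration and not the tool's actual code. The function names are hypothetical, the toy assignments are made up, and treating a count of exactly 50 as "global" is an assumption, because the documentation only specifies <50 and >50.

```python
from collections import Counter


def top_clusters(assignments, n=9):
    """Return up to n (cluster_label, node_count) pairs, largest first.

    assignments maps each gene name to its cluster label.
    """
    return Counter(assignments.values()).most_common(n)


def layout_for(node_count, cutoff=50):
    """Pick a layout name from the node count.

    Assumption: exactly `cutoff` nodes falls back to "global"; the
    documentation leaves that boundary case unspecified.
    """
    return "circle" if node_count < cutoff else "global"


# Toy assignments (made-up gene names and labels).
toy = {"A": 1, "B": 1, "C": 2, "D": 1, "E": 3, "F": 2}
largest_two = top_clusters(toy, n=2)
print(largest_two)
print(layout_for(12), layout_for(120))
```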

Performance Description

A run takes a few minutes and uses less than 1 GB of memory.

