GN/MCL/Kmeans Clustering

Purpose

Automatically constructs a protein-protein interaction (PPI) network for a gene set from the STRING database, and can output results for up to three clustering methods (GN, MCL, k-means) in a single run.

Running Method

The tool can run with either the default species database or a custom database. Both invocation styles are shown below.

SDAS PPI --input gene_300.txt --species human --score_threshold 600 --centers 9 --output results_300
SDAS PPI -i gene_300.txt -o ./result --species human --cluster GN kmeans
SDAS PPI -i gene_300.txt -o ./result --links_db 9606.protein.links.v12.0.txt --aliases_db 9606.protein.aliases.v12.0.txt --cluster GN kmeans

Input Parameter Description

| Parameter | Required | Default | Description |
|---|---|---|---|
| -i/--input | Yes | - | Input gene list file (gene symbols, one per line) |
| -o/--output | Yes | - | Output folder; created automatically if it does not exist |
| --species | No | human | Species (human or mouse). Can be omitted when a custom database is supplied via --links_db and --aliases_db |
| --links_db | No | - | Path to a custom protein interaction file |
| --aliases_db | No | - | Path to a custom protein alias file |
| --score_threshold | No | 700 | Protein interaction score threshold, chosen between 400 and 900. A higher threshold keeps only more reliable interactions and yields fewer network nodes |
| --cluster | No | GN | Clustering algorithm (GN, kmeans, mcl). Several can be given at once, e.g. --cluster GN kmeans |
| --centers | No | 5 | Number of cluster centers for kmeans |
| --inflation | No | 2.0 | Inflation parameter for MCL, in the range 1.5-3.0 |
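To make the effect of --score_threshold concrete, here is a minimal Python sketch of the kind of filtering it implies. It assumes the links file uses STRING's space-separated layout with the columns protein1, protein2 and combined_score. The helper name filter_links, the sample rows, and the use of >= rather than > are illustrative assumptions, not SDAS's actual implementation.

```python
import csv
import io

# Made-up rows in the assumed STRING links layout
# (space-separated: protein1 protein2 combined_score).
SAMPLE_LINKS = """protein1 protein2 combined_score
9606.PROT_A 9606.PROT_B 173
9606.PROT_A 9606.PROT_C 754
9606.PROT_B 9606.PROT_D 912
"""

def filter_links(handle, score_threshold=700):
    """Keep interactions whose combined_score meets the threshold.

    Hypothetical helper; whether SDAS compares with >= or > is an
    assumption of this sketch.
    """
    reader = csv.DictReader(handle, delimiter=" ")
    return [
        (row["protein1"], row["protein2"], int(row["combined_score"]))
        for row in reader
        if int(row["combined_score"]) >= score_threshold
    ]

kept = filter_links(io.StringIO(SAMPLE_LINKS), score_threshold=700)
print(len(kept), "of 3 interactions kept")
```

Raising the threshold to 900 in this sample would leave a single interaction, which is why stricter thresholds produce smaller networks.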

Output Results Display

| Result File | Description |
|---|---|
| PPI_results.csv | Interaction score for each pair of genes; can be imported into Cytoscape |
| cluster_results.csv | Connectivity and cluster assignment of each input gene |
| network_<cluster>_visualization.png/pdf | Network diagram of all gene interactions. Line thickness represents interaction score, node size represents connectivity, and color represents cluster |
| network_<cluster>_top_clusters.png/pdf | Network diagrams of the 9 clusters with the most nodes. A circle layout is used when the node count is below 50 and a global layout when it is above 50 (separate diagrams for each selected clustering method) |
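For downstream work outside Cytoscape, the interaction table can be read with a few lines of Python. The sketch below assumes PPI_results.csv is comma-separated with the header from_gene, to_gene, combined_score. The sample rows are taken from this document's PPI_results.csv example, and the helper names load_edges and partners_by_score are hypothetical.

```python
import csv
import io

# Rows from this document's PPI_results.csv example; a real run would
# open the file instead (comma-separated layout is an assumption).
SAMPLE_PPI = """from_gene,to_gene,combined_score
MEPIA1,CDH17,466
LGLA3,CDH17,561
PTK2,CDH17,482
"""

def load_edges(handle):
    """Return a list of (from_gene, to_gene, score) tuples."""
    return [
        (row["from_gene"], row["to_gene"], int(row["combined_score"]))
        for row in csv.DictReader(handle)
    ]

def partners_by_score(edges, gene):
    """List the partners of `gene`, strongest interaction first."""
    partners = []
    for a, b, score in edges:
        if a == gene:
            partners.append((b, score))
        elif b == gene:
            partners.append((a, score))
    return sorted(partners, key=lambda p: p[1], reverse=True)

edges = load_edges(io.StringIO(SAMPLE_PPI))
print(partners_by_score(edges, "CDH17"))
```

To use a real result file, replace io.StringIO(SAMPLE_PPI) with an open file handle on PPI_results.csv.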

  • Protein interaction relationship table (PPI_results.csv): each row gives the interaction score for one pair of genes. The file can be imported directly into Cytoscape.

    | from_gene | to_gene | combined_score |
    |---|---|---|
    | MEPIA1 | CDH17 | 466 |
    | LGLA3 | CDH17 | 561 |
    | PTK2 | CDH17 | 482 |

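  • Clustering result table (cluster_results.csv): each row represents one gene, with its connectivity (the number of genes it interacts with) and its cluster assignment under each selected clustering algorithm. The betweenness_cluster column appears to correspond to GN, since the Girvan-Newman algorithm is based on edge betweenness.

    | gene | degree | mcl_cluster | kmeans_cluster | betweenness_cluster |
    |---|---|---|---|---|
    | MEPIA1 | 6 | 1 | 2 | 5 |
    | LGLA3 | 5 | 1 | 2 | 5 |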

  • Interaction network visualization diagram (network_<cluster>_visualization.png/pdf): shows the interaction network of all genes. Line thickness between nodes represents interaction score, node size represents connectivity, and nodes of the same color belong to the same cluster. A separate diagram is produced for each selected clustering method.

  • Largest-cluster subnetwork diagram (network_<cluster>_top_clusters.png/pdf): shows network diagrams of the 9 clusters with the most nodes. A circle layout is used when the node count is below 50 and a global layout when it is above 50.
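The selection of the largest clusters described above can be sketched from cluster_results.csv alone. This is only an illustration under stated assumptions: the file is assumed to be comma-separated with the columns shown above, the sample genes g1..g6 are made up, the function names are hypothetical, and the behavior at exactly 50 nodes is not specified in the text, so the sketch treats it as the global case.

```python
import csv
import io
from collections import Counter

# Made-up rows in the cluster_results.csv column layout shown above
# (gene names g1..g6 are placeholders, not real output).
SAMPLE = """gene,degree,mcl_cluster,kmeans_cluster,betweenness_cluster
g1,6,1,2,5
g2,5,1,2,5
g3,3,2,1,5
g4,2,1,3,4
g5,1,3,1,4
g6,4,2,2,5
"""

def cluster_sizes(handle, column):
    """Count how many genes fall into each cluster of one method."""
    return Counter(row[column] for row in csv.DictReader(handle))

def top_clusters(sizes, n=9):
    """Return (label, node_count) pairs for the n largest clusters."""
    return sizes.most_common(n)

def layout_for(node_count, cutoff=50):
    # Exactly 50 nodes is not specified in the text; treating it as
    # "global" is an assumption of this sketch.
    return "circle" if node_count < cutoff else "global"

sizes = cluster_sizes(io.StringIO(SAMPLE), "mcl_cluster")
for label, count in top_clusters(sizes, n=2):
    print(label, count, layout_for(count))
```

Passing a different column name (kmeans_cluster or betweenness_cluster) gives the same summary for the other clustering methods.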

Performance Description

A run takes a few minutes and uses no more than 1 GB of memory.
