Cluster Submission Mode

Introduction

SDAS Pipelines Automated Job Submission is a job scheduling system that automatically manages and submits SDAS analysis workflow jobs to a PBS/Torque cluster; other schedulers (SGE, Slurm, LSF) can be supported by adapting the submission script template described below. The system provides:

  • Automatic Dependency Resolution: Intelligent scheduling based on job dependencies

  • Concurrency Control: Limits the number of simultaneously running jobs to avoid resource conflicts

  • Status Monitoring: Real-time monitoring of job execution status

  • Error Handling: Automatic retry of failed jobs

  • Detailed Reporting: Generates complete execution reports and logs

System Requirements

  • Python 3.6+

  • Job scheduling system (supports any of the following):

    • PBS/Torque

    • SGE (Sun Grid Engine)

    • Slurm

    • LSF (IBM Platform Load Sharing Facility)

  • Appropriate queue permissions

  • SDAS software properly configured

Usage Steps

1. Configure pipeline_input.conf File

Before running SDAS Pipeline, you need to configure the pipeline_input.conf file. This file defines:

  • Input Data: h5ad file paths and grouping information

  • Analysis Workflow: Select SDAS modules to run

  • Module Parameters: Specific parameter configurations for each module

  • Dependencies: Input/output relationships between modules

1.1 Basic Configuration Structure

# 1. Software path
SDAS_software = /path/to/SDAS

# 2. Input data configuration
# Single file input
h5ad_files = /path/to/data.h5ad

# Multiple file input (with grouping information)
h5ad_files = S1,group1,A.h5ad;S2,group1,B.h5ad;S3,group2,C.h5ad

# Multiple file input (without grouping information)
h5ad_files = S1,,A.h5ad;S2,,B.h5ad

# 3. Analysis workflow selection
process = coexpress,spatialDomain,cellAnnotation,cellularNeighborhood,CCI,trajectory,DEG,geneSetEnrichment,TF,PPI,spatialRelate
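
For clarity on the multi-file input format above, here is a minimal parsing sketch (an illustration only, not SDAS code): entries are separated by semicolons, and each entry holds a sample name, an optional group label, and a file path.

def parse_h5ad_files(value):
    """Split a multi-file h5ad_files value into sample / group / path records."""
    records = []
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        sample, group, path = (field.strip() for field in entry.split(","))
        records.append({"sample": sample, "group": group or None, "path": path})
    return records

print(parse_h5ad_files("S1,group1,A.h5ad;S2,group1,B.h5ad;S3,group2,C.h5ad"))
# [{'sample': 'S1', 'group': 'group1', 'path': 'A.h5ad'}, ...]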

1.2 Module Parameter Configuration Examples

Configuration Notes:

  • Parameter format: parameter_name = parameter_value

  • Empty values: leaving a parameter value blank means the module's default value is used (illustrated in the sketch after these notes)

  • Comments: Start with # for parameter explanations

  • Path parameters: Use absolute paths to avoid relative path issues
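
The following minimal sketch (an illustration only, not the SDAS parser) shows how lines following these conventions can be read: # comments are stripped, keys and values are split on the first =, and blank values fall back to defaults.

def read_conf(path, defaults):
    """Read `key = value` lines; '#' starts a comment, blank values keep defaults."""
    params = dict(defaults)
    with open(path) as handle:
        for raw in handle:
            line = raw.split("#", 1)[0].strip()   # drop comments and surrounding whitespace
            if not line or "=" not in line:
                continue
            key, value = (part.strip() for part in line.split("=", 1))
            if value:                             # empty value -> keep the default
                params[key] = value
    return params

# Example call (hypothetical defaults, for illustration):
# read_conf("pipeline_input.conf", {"coexpress_method": "hotspot"})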

Spatial Gene Co-expression Analysis (coexpress)

# Basic parameters
coexpress_input_process = basic
coexpress_method = hotspot  # Options: hotspot, nest, hdwgcna
coexpress_bin_size = 100
coexpress_selected_genes = top5000

# Hotspot parameters
hotspot_fdr_cutoff = 0.05
hotspot_model = normal

Cell Type Annotation (cellAnnotation)

# Basic parameters
cellAnnotation_input_process = basic
cellAnnotation_method = rctd  # Options: cell2location, spotlight, rctd, tangram, scimilarity

# RCTD parameters
rctd_reference = /path/to/reference.h5ad
rctd_label_key = annotation
rctd_bin_size = 100
rctd_input_gene_symbol_key = real_gene_name
rctd_ref_gene_symbol_key = _index
rctd_filter_rare_cell = 100
rctd_n_cpus = 8

Spatial Domain Identification (spatialDomain)

# Basic parameters
spatialDomain_input_process = basic
spatialDomain_method = graphst

# GraphST parameters
graphst_tool = mclust
graphst_bin_size = 100
graphst_n_clusters = 10
graphst_n_hvg = 3000
graphst_gpu_id = -1

1.3 Module Dependency Configuration

Dependencies between SDAS modules are specified through the *_input_process parameters: basic means a module has no upstream module dependency, while any other value names the module whose output it consumes (a scheduling-order sketch follows the example below):

# Basic modules (no dependencies)
coexpress_input_process = basic
spatialDomain_input_process = basic
cellAnnotation_input_process = basic

# Modules dependent on others
cellularNeighborhood_input_process = cellAnnotation
CCI_input_process = cellularNeighborhood
trajectory_input_process = cellAnnotation
DEG_input_process = spatialDomain
geneSetEnrichment_input_process = spatialDomain
spatialRelate_input_process = cellAnnotation
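
To make the implied submission order concrete, the sketch below (an illustration of the general idea, not the scheduler's actual implementation) topologically sorts the example dependencies so that every module appears after the module it depends on; it uses graphlib, which requires Python 3.9+, purely for brevity.

from graphlib import TopologicalSorter  # Python 3.9+ (illustration only)

# *_input_process values from the example above: module -> prerequisite,
# where "basic" means no upstream module dependency.
input_process = {
    "coexpress": "basic",
    "spatialDomain": "basic",
    "cellAnnotation": "basic",
    "cellularNeighborhood": "cellAnnotation",
    "CCI": "cellularNeighborhood",
    "trajectory": "cellAnnotation",
    "DEG": "spatialDomain",
    "geneSetEnrichment": "spatialDomain",
    "spatialRelate": "cellAnnotation",
}

graph = {
    module: set() if upstream == "basic" else {upstream}
    for module, upstream in input_process.items()
}

# One valid submission order: every prerequisite precedes its dependents.
print(list(TopologicalSorter(graph).static_order()))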

2. Generate Job Configuration

After configuration, run SDAS Pipeline to generate job configuration files:

python3 SDAS_pipeline.py -c pipeline_input.conf -o ./output

This will generate the all_shell.conf file in the output directory, containing all jobs and their dependencies.

3. Preview Jobs (Dry Run)

Before actually submitting, it's recommended to preview the plan using dry-run mode:

python3 auto_qsub_scheduler.py -c ./output/all_shell.conf -o ./output --dry-run

This will display:

  • All job dependencies

  • Resource requirements (CPU, memory)

  • Qsub scripts to be generated

4. Submit Jobs

After confirmation, submit jobs to the queue:

python3 auto_qsub_scheduler.py -c ./output/all_shell.conf -o ./output

auto_qsub_scheduler.py: Job Scheduling System Configuration

Depending on your cluster environment, you may need to modify the create_qsub_script method in auto_qsub_scheduler.py so that the generated submission script matches your scheduler's syntax. The method is located in the QsubScheduler class:

def create_qsub_script(self, shell_file: str, cpu: int, memory: int) -> str:
    """
    Generate qsub job submission script
    Parameters:
        shell_file: Path to the shell script to execute
        cpu: Number of CPU cores
        memory: Memory requirement (GB)
    Returns:
        Generated qsub script content
    """
    # Modify the script template here based on your job scheduling system
    script = f"""#!/bin/bash
#PBS -q {self.queue}
#PBS -N {os.path.basename(shell_file)}
#PBS -o {shell_file}.log
#PBS -j oe
#PBS -l nodes=1:ppn={cpu}
#PBS -l mem={memory}gb

cd $PBS_O_WORKDIR
bash {shell_file}
"""
    return script

You need to:

  1. Modify the script template based on your job scheduling system (PBS/Torque, SGE, Slurm, or LSF); a Slurm sketch follows this list

  2. Ensure necessary resource configuration parameters are included (CPU, memory, etc.)

  3. Maintain references to the following variables:

    • self.queue: Queue name

    • shell_file: Execution script path

    • cpu: Number of CPU cores

    • memory: Memory requirement
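
As an example, a Slurm-oriented version of the method could look roughly like this (a hedged sketch using standard sbatch directives; partition names, memory syntax, and environment variables should be adjusted to your site, and the submission and status commands elsewhere in the scheduler would likewise need to switch from qsub/qstat to sbatch/squeue):

import os

def create_qsub_script(self, shell_file: str, cpu: int, memory: int) -> str:
    """Slurm variant of the submission script template (sketch; adapt to your site)."""
    job_name = os.path.basename(shell_file)
    script = f"""#!/bin/bash
#SBATCH --partition={self.queue}
#SBATCH --job-name={job_name}
#SBATCH --output={shell_file}.log
#SBATCH --ntasks=1
#SBATCH --cpus-per-task={cpu}
#SBATCH --mem={memory}G

cd "$SLURM_SUBMIT_DIR"
bash {shell_file}
"""
    return script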

Test Data and Configuration Files

SDAS Pipelines provides single-slice and multiple-slice test data with corresponding configuration files for users to quickly get started and test the system.

Directory Structure

SDAS_download/
├── Scripts/
│   └── pipeline_cluster/
│       ├── auto_qsub_scheduler.py      # Automated submission script
│       ├── SDAS_pipeline.py            # Pipeline generation script
│       ├── pipeline_input.single_slice.conf   # Single-slice data configuration example
│       └── pipeline_input.multiple_slice.conf  # Multiple-slice data configuration example
└── Test_data/
    ├── single_slice/     # Single-slice test data
    │   └── sample.h5ad
    └── multiple_slices/  # Multiple-slice test data
        ├── P19_NT_transition.h5ad
        ├── P19_T_transition.h5ad
        ├── P34_NT_transition.h5ad
        ├── P34_T_transition.h5ad
        ├── P33_T_transition.h5ad
        └── P36_T_transition.h5ad

Single-Slice Data Analysis Configuration

pipeline_input.single_slice.conf is designed for the analysis workflow of a single spatial transcriptome slice:

  • Input Data: Single h5ad file

  • Analysis Modules: Includes most SDAS analysis modules

  • Features:

    • Simple data input configuration

    • Complete module parameter examples

    • Suitable for first-time users

Multiple-Slice Data Analysis Configuration

pipeline_input.multiple_slice.conf is designed for the analysis workflow of multiple spatial transcriptome slices:

  • Input Data: Multiple h5ad files with grouping information (e.g., Normal/Tumor)

  • Analysis Modules: Select appropriate modules based on experimental design

  • Features:

    • Demonstrates multi-sample input format

    • Includes inter-group comparison parameter settings

    • Suitable for comparative analysis

Testing Steps

1. Single-Slice Data Testing

Step 1: Prepare Configuration File

# 1. Copy configuration file to working directory
cp Scripts/pipeline_cluster/pipeline_input.single_slice.conf ./

# 2. Modify paths in configuration file
# - SDAS_software path
# - h5ad_files path
# - Reference data path (if needed)

Step 2: Generate Job Configuration

# Run Pipeline to generate job configuration
python3 Scripts/pipeline_cluster/SDAS_pipeline.py -c pipeline_input.single_slice.conf -o ./output_single_slice

Step 3: Preview Jobs (Recommended)

# Use dry-run mode to preview job configuration
python3 Scripts/pipeline_cluster/auto_qsub_scheduler.py -c ./output_single_slice/all_shell.conf -o ./output_single_slice --dry-run

Step 4: Submit Jobs

# Actually submit jobs to queue
python3 Scripts/pipeline_cluster/auto_qsub_scheduler.py -c ./output_single_slice/all_shell.conf -o ./output_single_slice

2. Multiple-Slice Data Testing

Step 1: Prepare Configuration File

# 1. Copy configuration file to working directory
cp Scripts/pipeline_cluster/pipeline_input.multiple_slice.conf ./

# 2. Modify paths in configuration file
# - SDAS_software path
# - h5ad_files path (multiple file paths)
# - Reference data path (if needed)

Step 2: Generate Job Configuration

# Run Pipeline to generate job configuration
python3 Scripts/pipeline_cluster/SDAS_pipeline.py -c pipeline_input.multiple_slice.conf -o ./output_multiple_slice

Step 3: Preview Jobs (Recommended)

# Use dry-run mode to preview job configuration
python3 Scripts/pipeline_cluster/auto_qsub_scheduler.py -c ./output_multiple_slice/all_shell.conf -o ./output_multiple_slice --dry-run

Step 4: Submit Jobs

# Actually submit jobs to queue
python3 Scripts/pipeline_cluster/auto_qsub_scheduler.py -c ./output_multiple_slice/all_shell.conf -o ./output_multiple_slice

3. auto_qsub_scheduler.py Parameter Description

Basic Parameters:

  • -c, --config: Job configuration file path (required)

  • -o, --output: Output directory path (required)

  • --queue: Queue name (default: stereo.q)

  • --max-concurrent: Maximum concurrent jobs (default: 10)

  • --retry-times: Number of retries for failed jobs (default: 3)

  • --wait-time: Status check interval in seconds (default: 30)

  • --dry-run: Preview mode, no actual job submission

Usage Examples:

# Basic usage
python3 auto_qsub_scheduler.py -c ./output/all_shell.conf -o ./output

# Custom queue and concurrency
python3 auto_qsub_scheduler.py -c ./output/all_shell.conf -o ./output --queue my_queue --max-concurrent 5

# Preview mode
python3 auto_qsub_scheduler.py -c ./output/all_shell.conf -o ./output --dry-run

# Fast mode (short check interval)
python3 auto_qsub_scheduler.py -c ./output/all_shell.conf -o ./output --wait-time 10

4. Monitoring and Logs

Real-time Monitoring:

  • The program displays job status updates during execution

  • Press Ctrl+C to safely stop the scheduler

Log Files:

  • Scheduler log: ./output/scheduler.log

  • Job logs: ./output/qsub_info/logs/

  • Job scripts: ./output/qsub_info/shell/

Status Checking:

# View scheduler log
tail -f ./output/scheduler.log

# View specific job logs
tail -f ./output/qsub_info/logs/[job_name].o
tail -f ./output/qsub_info/logs/[job_name].e

# Check job status
qstat
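
If you prefer to script the status check instead of watching qstat interactively, a small helper along these lines can count jobs still in the queue (an illustration that assumes a PBS/Torque qstat whose default output starts with a two-line header; adjust the parsing for your scheduler):

import subprocess

def pending_jobs() -> int:
    """Count jobs still listed by qstat (sketch; header parsing is scheduler-specific)."""
    result = subprocess.run(
        ["qstat"], stdout=subprocess.PIPE, universal_newlines=True, check=False
    )
    lines = [line for line in result.stdout.splitlines() if line.strip()]
    return max(len(lines) - 2, 0)  # drop the two header lines

if __name__ == "__main__":":
    print("{} job(s) still in the queue".format(pending_jobs()))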

5. Troubleshooting

Common Issues:

  1. Job Submission Failure

    • Check if queue name is correct

    • Confirm sufficient queue permissions

    • Check if resource requirements are reasonable

  2. Dependency Relationship Errors

    • Check all_shell.conf file format

    • Confirm dependent job names are correct

  3. Jobs Getting Stuck

    • Check if cluster resources are sufficient

    • View error messages in job logs

    • Consider adjusting --wait-time parameter

  4. Permission Issues

    • Ensure write permissions for output directory

    • Check queue submission permissions

Debug Mode:

# Enable detailed logging
python3 auto_qsub_scheduler.py -c ./output/all_shell.conf -o ./output --dry-run --verbose
