Cluster Submission Mode
Introduction
SDAS Pipelines Automated Job Submission is a job scheduling system that automatically manages and submits SDAS analysis workflow jobs to PBS/Torque clusters (other schedulers are supported by customizing the submission template, as described below). The system provides:
Automatic Dependency Resolution: Intelligent scheduling based on job dependencies
Concurrency Control: Limits the number of simultaneously running jobs to avoid resource conflicts
Status Monitoring: Real-time monitoring of job execution status
Error Handling: Automatic retry of failed jobs
Detailed Reporting: Generates complete execution reports and logs
System Requirements
Python 3.6+
Job scheduling system (supports any of the following):
PBS/Torque
SGE (Sun Grid Engine)
Slurm
LSF (IBM Platform Load Sharing Facility)
Appropriate queue permissions
SDAS software properly configured
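Before continuing, you can quickly confirm that a suitable Python interpreter and a scheduler submit command are available on the submission node. This is a minimal check only; it simply reports which of the well-known submit commands are on your PATH:
# Python 3.6+ is required
python3 --version
# qsub: PBS/Torque or SGE; sbatch: Slurm; bsub: LSF
command -v qsub sbatch bsub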
Usage Steps
1. Configure pipeline_input.conf File
Before running SDAS Pipeline, you need to configure the pipeline_input.conf file. This file defines:
Input Data: h5ad file paths and grouping information
Analysis Workflow: Select SDAS modules to run
Module Parameters: Specific parameter configurations for each module
Dependencies: Input/output relationships between modules
1.1 Basic Configuration Structure
# 1. Software path
SDAS_software = /path/to/SDAS
# 2. Input data configuration
# Single file input
h5ad_files = /path/to/data.h5ad
# Multiple file input (with grouping information)
h5ad_files = S1,group1,A.h5ad;S2,group1,B.h5ad;S3,group2,C.h5ad
# Multiple file input (without grouping information)
h5ad_files = S1,,A.h5ad;S2,,B.h5ad
# 3. Analysis workflow selection
process = coexpress,spatialDomain,cellAnnotation,cellularNeighborhood,CCI,trajectory,DEG,geneSetEnrichment,TF,PPI,spatialRelate
1.2 Module Parameter Configuration Examples
Configuration Notes:
Parameter format: parameter_name = parameter_value
Blank values: leaving a parameter value empty means the default value is used
Comments: lines starting with # explain the parameters
Path parameters: use absolute paths to avoid relative path issues
Spatial Gene Co-expression Analysis (coexpress)
# Basic parameters
coexpress_input_process = basic
coexpress_method = hotspot # Options: hotspot, nest, hdwgcna
coexpress_bin_size = 100
coexpress_selected_genes = top5000
# Hotspot parameters
hotspot_fdr_cutoff = 0.05
hotspot_model = normal
Cell Type Annotation (cellAnnotation)
# Basic parameters
cellAnnotation_input_process = basic
cellAnnotation_method = rctd # Options: cell2location, spotlight, rctd, tangram, scimilarity
# RCTD parameters
rctd_reference = /path/to/reference.h5ad
rctd_label_key = annotation
rctd_bin_size = 100
rctd_input_gene_symbol_key = real_gene_name
rctd_ref_gene_symbol_key = _index
rctd_filter_rare_cell = 100
rctd_n_cpus = 8
Spatial Domain Identification (spatialDomain)
# Basic parameters
spatialDomain_input_process = basic
spatialDomain_method = graphst
# GraphST parameters
graphst_tool = mclust
graphst_bin_size = 100
graphst_n_clusters = 10
graphst_n_hvg = 3000
graphst_gpu_id = -1
1.3 Module Dependency Configuration
Dependencies between SDAS modules are specified through *_input_process parameters:
# Basic modules (no dependencies)
coexpress_input_process = basic
spatialDomain_input_process = basic
cellAnnotation_input_process = basic
# Modules dependent on others
cellularNeighborhood_input_process = cellAnnotation
CCI_input_process = cellularNeighborhood
trajectory_input_process = cellAnnotation
DEG_input_process = spatialDomain
geneSetEnrichment_input_process = spatialDomain
spatialRelate_input_process = cellAnnotation
2. Generate Job Configuration
After configuration, run SDAS Pipeline to generate job configuration files:
python3 SDAS_pipeline.py -c pipeline_input.conf -o ./output
This will generate the all_shell.conf file, which contains all jobs and their dependencies.
3. Preview Jobs (Recommended)
Before actual submission, it's recommended to preview using dry-run mode:
python3 auto_qsub_scheduler.py -c ./output/all_shell.conf -o ./output --dry-run
This will display:
All job dependencies
Resource requirements (CPU, memory)
Qsub scripts to be generated
4. Submit Jobs
After confirmation, submit jobs to the queue:
python3 auto_qsub_scheduler.py -c ./output/all_shell.conf -o ./output
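The scheduler keeps running until all jobs have finished, so for long workflows it is convenient to launch it in the background (or inside tmux/screen) so that it is not tied to your terminal session. A minimal sketch using nohup; the log file name nohup_scheduler.log is just an example:
# Keep the scheduler running after the terminal closes; console output goes to the log file
nohup python3 auto_qsub_scheduler.py -c ./output/all_shell.conf -o ./output > nohup_scheduler.log 2>&1 &
# Follow its progress
tail -f nohup_scheduler.log
If you run the scheduler this way, stop it with kill on its process ID rather than Ctrl+C.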
auto_qsub_scheduler.py Job Scheduling System Configuration
Based on your cluster environment, you need to modify the create_qsub_script method in auto_qsub_scheduler.py to customize the job submission script format. This method is located in the QsubScheduler class:
def create_qsub_script(self, shell_file: str, cpu: int, memory: int) -> str:
    """
    Generate qsub job submission script

    Parameters:
        shell_file: Path to the shell script to execute
        cpu: Number of CPU cores
        memory: Memory requirement (GB)

    Returns:
        Generated qsub script content
    """
    # Modify the script template here based on your job scheduling system
    script = f"""#!/bin/bash
#PBS -q {self.queue}
#PBS -N {os.path.basename(shell_file)}
#PBS -o {shell_file}.log
#PBS -j oe
#PBS -l nodes=1:ppn={cpu}
#PBS -l mem={memory}gb
cd $PBS_O_WORKDIR
bash {shell_file}
"""
    return script
You need to:
Modify the script template based on your job scheduling system (PBS/Torque, SGE, Slurm, or LSF); a Slurm-style sketch is shown below
Ensure the necessary resource configuration parameters are included (CPU, memory, etc.)
Maintain references to the following variables:
self.queue: Queue name
shell_file: Execution script path
cpu: Number of CPU cores
memory: Memory requirement (GB)
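For reference, on a Slurm cluster the generated submission script could look roughly like the following. This is a minimal sketch, not part of SDAS itself: the directives are standard Slurm options, and the concrete values (queue, script path, resources) stand in for self.queue, shell_file, cpu, and memory. Depending on how the scheduler invokes the submit command, qsub may also need to be replaced with sbatch.
#!/bin/bash
#SBATCH -p stereo.q                 # partition/queue   (self.queue)
#SBATCH -J my_job.sh                # job name          (basename of shell_file)
#SBATCH -o /path/to/my_job.sh.log   # combined stdout/stderr log
#SBATCH -c 8                        # CPU cores         (cpu)
#SBATCH --mem=32G                   # memory            (memory, in GB)
cd "$SLURM_SUBMIT_DIR"
bash /path/to/my_job.sh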
Test Data and Configuration Files
SDAS Pipelines provides single-slice and multiple-slice test data with corresponding configuration files for users to quickly get started and test the system.
Directory Structure
SDAS_download/
├── Scripts/
│ └── pipeline_cluster/
│ ├── auto_qsub_scheduler.py # Automated submission script
│ ├── SDAS_pipeline.py # Pipeline generation script
│ ├── pipeline_input.single_slice.conf # Single-slice data configuration example
│ └── pipeline_input.multiple_slice.conf # Multiple-slice data configuration example
└── Test_data/
├── single_slice/ # Single-slice test data
│ └── sample.h5ad
└── multiple_slices/ # Multiple-slice test data
├── P19_NT_transition.h5ad
├── P19_T_transition.h5ad
├── P34_NT_transition.h5ad
├── P34_T_transition.h5ad
├── P33_T_transition.h5ad
└── P36_T_transition.h5ad
Single-Slice Data Analysis Configuration
pipeline_input.single_slice.conf is designed for the single spatial transcriptome slice analysis workflow:
Input Data: Single h5ad file
Analysis Modules: Includes most SDAS analysis modules
Features:
Simple data input configuration
Complete module parameter examples
Suitable for first-time users
Multiple-Slice Data Analysis Configuration
pipeline_input.multiple_slice.conf is designed for the multiple spatial transcriptome slice analysis workflow:
Input Data: Multiple h5ad files with grouping information (e.g., Normal/Tumor)
Analysis Modules: Select appropriate modules based on experimental design
Features:
Demonstrates multi-sample input format
Includes inter-group comparison parameter settings
Suitable for comparative analysis
Testing Steps
1. Single-Slice Data Testing
Step 1: Prepare Configuration File
# 1. Copy configuration file to working directory
cp Scripts/pipeline_cluster/pipeline_input.single_slice.conf ./
# 2. Modify paths in configuration file
# - SDAS_software path
# - h5ad_files path
# - Reference data path (if needed)
Step 2: Generate Job Configuration
# Run Pipeline to generate job configuration
python3 Scripts/pipeline_cluster/SDAS_pipeline.py -c pipeline_input.single_slice.conf -o ./output_single_slice
Step 3: Preview Jobs (Recommended)
# Use dry-run mode to preview job configuration
python3 Scripts/pipeline_cluster/auto_qsub_scheduler.py -c ./output_single_slice/all_shell.conf -o ./output_single_slice --dry-run
Step 4: Submit Jobs
# Actually submit jobs to queue
python3 Scripts/pipeline_cluster/auto_qsub_scheduler.py -c ./output_single_slice/all_shell.conf -o ./output_single_slice
2. Multiple-Slice Data Testing
Step 1: Prepare Configuration File
# 1. Copy configuration file to working directory
cp Scripts/pipeline_cluster/pipeline_input.multiple_slice.conf ./
# 2. Modify paths in configuration file
# - SDAS_software path
# - h5ad_files path (multiple file paths)
# - Reference data path (if needed)
Step 2: Generate Job Configuration
# Run Pipeline to generate job configuration
python3 Scripts/pipeline_cluster/SDAS_pipeline.py -c pipeline_input.multiple_slice.conf -o ./output_multiple_slice
Step 3: Preview Jobs (Recommended)
# Use dry-run mode to preview job configuration
python3 Scripts/pipeline_cluster/auto_qsub_scheduler.py -c ./output_multiple_slice/all_shell.conf -o ./output_multiple_slice --dry-run
Step 4: Submit Jobs
# Actually submit jobs to queue
python3 Scripts/pipeline_cluster/auto_qsub_scheduler.py -c ./output_multiple_slice/all_shell.conf -o ./output_multiple_slice
3. auto_qsub_scheduler.py Parameter Description
Basic Parameters:
-c, --config: Job configuration file path (required)
-o, --output: Output directory path (required)
--queue: Queue name (default: stereo.q)
--max-concurrent: Maximum number of concurrent jobs (default: 10)
--retry-times: Number of retries for failed jobs (default: 3)
--wait-time: Status check interval in seconds (default: 30)
--dry-run: Preview mode, no actual job submission
Usage Examples:
# Basic usage
python3 auto_qsub_scheduler.py -c ./output/all_shell.conf -o ./output
# Custom queue and concurrency
python3 auto_qsub_scheduler.py -c ./output/all_shell.conf -o ./output --queue my_queue --max-concurrent 5
# Preview mode
python3 auto_qsub_scheduler.py -c ./output/all_shell.conf -o ./output --dry-run
# Fast mode (short check interval)
python3 auto_qsub_scheduler.py -c ./output/all_shell.conf -o ./output --wait-time 10
4. Monitoring and Logs
Real-time Monitoring:
The program displays job status updates during execution
Press Ctrl+C to safely stop the scheduler
Log Files:
Scheduler log: ./output/scheduler.log
Job logs: ./output/qsub_info/logs/
Job scripts: ./output/qsub_info/shell/
Status Checking:
# View scheduler log
tail -f ./output/scheduler.log
# View specific job logs
tail -f ./output/qsub_info/logs/[job_name].o
tail -f ./output/qsub_info/logs/[job_name].e
# Check job status
qstat
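The qstat command above applies to PBS/Torque. On the other supported schedulers, the equivalent job status commands are:
# Slurm
squeue -u $USER
# SGE
qstat -u $USER
# LSF
bjobs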
5. Troubleshooting
Common Issues:
Job Submission Failure
Check if queue name is correct
Confirm sufficient queue permissions
Check if resource requirements are reasonable
Dependency Relationship Errors
Check the all_shell.conf file format
Confirm dependent job names are correct
Jobs Getting Stuck
Check if cluster resources are sufficient
View error messages in job logs
Consider adjusting the --wait-time parameter
Permission Issues
Ensure write permissions for output directory
Check queue submission permissions
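A few generic checks that often help when diagnosing these issues (a sketch, assuming a PBS/Torque cluster and the default ./output layout used above):
# List the available queues and their state (PBS/Torque)
qstat -Q
# Confirm the output directory is writable
test -w ./output && echo writable || echo "not writable"
# Inspect the generated submission scripts of a problematic job
ls -l ./output/qsub_info/shell/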
Debug Mode:
# Enable detailed logging
python3 auto_qsub_scheduler.py -c ./output/all_shell.conf -o ./output --dry-run --verbose