Build Single-cell Reference Data

Purpose

Use cell2locationMakeRef to construct the cell2location single-cell reference inf_aver.csv file.

Usage

SDAS cellAnnotation cell2locationMakeRef -o ./ref --reference sc.h5ad --label_key annotation \
--batch_key id \
--nonz_mean_cutoff 1.45 \
--gpu_id 3

Input Parameter Description

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| -o / --output | Yes | | Output folder |
| --reference | Yes | | Single-cell ref h5ad, must contain the raw expression matrix |
| --label_key | Yes | | Name of the column in single-cell ref h5ad.obs indicating cell type |
| --ref_layer | No | | Layer in single-cell ref h5ad storing raw counts |
| --ref_gene_symbol_key | No | _index | Name of the column in single-cell ref h5ad.var indicating gene symbol (_index means using h5ad.var.index) |
| --batch_key | No | | Name of the column in single-cell ref h5ad.obs indicating batch; if not provided, batch is not considered |
| --filter_rare_cell | No | 100 | The minimum cell count for a cell type to be included |
| --check_filter_genes | No | | If this parameter is set, only the result plot of filtered genes (filter_genes.png) will be output |
| --cell_count_cutoff | No | 5 | Parameter controlling gene filtering in cell2location, usually not adjusted |
| --cell_percentage_cutoff2 | No | 0.03 | Parameter controlling gene filtering in cell2location; the larger the value, the fewer genes are selected. It is recommended to keep the number of genes between 8k-16k |
| --nonz_mean_cutoff | No | 1.12 | Parameter controlling gene filtering in cell2location; the larger the value, the fewer genes are selected. It is recommended to keep the number of genes between 8k-16k |
| --max_epochs | No | 250 | Number of epochs for model training |
| --seed | No | 42 | Random seed |
| --gpu_id | No | -1 | ID of the GPU to use. If -1, use CPU. This parameter only specifies the main GPU to use; other GPUs may also be occupied but with very low usage. If you need to strictly specify the GPU, set the environment variable before running, e.g.: export CUDA_VISIBLE_DEVICES=2, then set --gpu_id 0 to use only GPU 2. |
| --n_threads | No | | Number of threads to use in CPU mode, defaults to all CPUs |
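
The interaction of the three gene-filtering cutoffs (--cell_count_cutoff, --cell_percentage_cutoff2, --nonz_mean_cutoff) can be easier to see in code. The following is a minimal sketch of the selection rule as it is commonly described for cell2location's gene filter; that SDAS applies exactly this rule is an assumption, and the function name keep_gene and the gene summaries are hypothetical, for illustration only.

```python
def keep_gene(n_cells_expressing, nonz_mean, n_cells_total,
              cell_count_cutoff=5, cell_percentage_cutoff2=0.03,
              nonz_mean_cutoff=1.12):
    """Return True if a gene passes the (assumed) cell2location-style filter."""
    # Genes detected in a large enough fraction of all cells are always kept.
    if n_cells_expressing > n_cells_total * cell_percentage_cutoff2:
        return True
    # Otherwise the gene must be detected in more than cell_count_cutoff cells
    # AND have a high enough mean count among the cells where it is detected.
    return n_cells_expressing > cell_count_cutoff and nonz_mean > nonz_mean_cutoff


# Hypothetical per-gene summaries:
# (number of cells expressing the gene, mean count in those cells)
genes = {
    "GENE_A": (900, 1.05),  # widely expressed: kept regardless of its mean
    "GENE_B": (40, 1.30),   # rare, kept only while nonz_mean_cutoff < 1.30
    "GENE_C": (40, 1.08),   # rare and lowly expressed: dropped
    "GENE_D": (3, 5.00),    # detected in too few cells: dropped
}
n_cells_total = 10_000

for cutoff in (1.12, 1.45):
    kept = [name for name, (n, m) in genes.items()
            if keep_gene(n, m, n_cells_total, nonz_mean_cutoff=cutoff)]
    print(cutoff, kept)
# 1.12 ['GENE_A', 'GENE_B']
# 1.45 ['GENE_A']
```

Under this rule, raising --nonz_mean_cutoff from the default 1.12 to 1.45 (as in the usage example) removes genes such as GENE_B that are detected in few cells with a modest mean count, which is why larger values select fewer genes.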

Output Results

| Result File | Description |
| --- | --- |
| <reference_name>_filter_genes.png/pdf | Gene filtering result plot by cell2location (<reference_name> is the single-cell ref h5ad file name) |
| <reference_name>_train_history.png/pdf | Training loss curve |
| <reference_name>_inf_aver.csv | Single-cell ref csv constructed by cell2location |

  • Gene filtering result plot by cell2location (<reference_name>_filter_genes.png/pdf): the orange rectangle highlights genes excluded based on the combination of the number of cells expressing the gene (Y-axis) and the average RNA count in cells where the gene was detected (X-axis). It is recommended to keep the number of retained genes between 8k and 16k.

  • Training loss curve (<reference_name>_train_history.png/pdf): the ELBO loss curve during training; the first 20 epochs are omitted from the plot.

  • Single-cell reference CSV constructed by cell2location (<reference_name>_inf_aver.csv): each row is a gene, each column is a cell type, and each value is the cell type feature calculated by cell2location (the estimated expression of each gene in each cell type, obtained with a negative binomial regression model).

| | B_act | B_naive | CD4_CXCL13 | ... |
| --- | --- | --- | --- | --- |
| 7SK | 0.3071783 | 0.22791654 | 0.059129756 | ... |
| A1BG | 0.18173707 | 0.096046284 | 0.0936929 | ... |
| A1BG-AS1 | 0.04608244 | 0.042425267 | 0.08740552 | ... |
| A1CF | 0.00167472 | 0.000960604 | 0.002093679 | ... |
| ... | ... | ... | ... | ... |
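
To sanity-check the reference, the CSV can be loaded with the Python standard library. The sketch below parses an inf_aver-style table and reports, for each gene, the cell type with the highest estimated expression. It assumes the first column holds gene names and the header row lists the cell types; the helper names load_inf_aver and top_cell_type are hypothetical. The inline sample reuses the first three columns of the example above.

```python
import csv
import io


def load_inf_aver(handle):
    """Parse an inf_aver-style CSV into (cell_types, {gene: [values]})."""
    reader = csv.reader(handle)
    header = next(reader)
    cell_types = header[1:]  # assumption: the first column holds gene names
    table = {row[0]: [float(v) for v in row[1:]] for row in reader if row}
    return cell_types, table


def top_cell_type(table, cell_types, gene):
    """Return the cell type with the highest estimated expression for a gene."""
    values = table[gene]
    return cell_types[values.index(max(values))]


# Inline sample taken from the example table (first three cell types only).
sample = io.StringIO(
    ",B_act,B_naive,CD4_CXCL13\n"
    "7SK,0.3071783,0.22791654,0.059129756\n"
    "A1BG,0.18173707,0.096046284,0.0936929\n"
    "A1BG-AS1,0.04608244,0.042425267,0.08740552\n"
    "A1CF,0.00167472,0.000960604,0.002093679\n"
)

cell_types, table = load_inf_aver(sample)
print(len(table), "genes x", len(cell_types), "cell types")
for gene in table:
    print(gene, "is highest in", top_cell_type(table, cell_types, gene))
```

For a real run, replace the inline sample with an open file handle, for example open(path, newline="") where path points to the <reference_name>_inf_aver.csv file in the output folder.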
