OnClass: single-cell annotation based on the Cell Ontology¶
OnClass is a Python package for single-cell cell type annotation. It uses the Cell Ontology to capture similarities between cell types. These similarities enable OnClass to annotate cell types that never appear in the training data.
Introduction¶
A preprint of the OnClass paper is available on bioRxiv. All datasets used in OnClass can be found on figshare. Currently, OnClass supports
- annotating cell types
- integrating different single-cell datasets based on the Cell Ontology
- identifying marker genes
OnClass is a joint work by the Altman lab at Stanford and CZ Biohub.
For questions about the software, please contact Sheng Wang at swang91@uw.edu.
Our web server can be found at: https://onclass.readthedocs.io/.
Install OnClass¶
OnClass can be substantially accelerated by using a GPU (TensorFlow 2.0). However, a GPU is only needed when you want to train your own model; OnClass can also run on a CPU alone.
The OnClass package consists of only three scripts: the OnClass class file, the deep learning model file, and the utility functions file.
You can simply download these three files and put them in the right directory (see how to use run_OnClass_example.py in the Tutorial). You can also install OnClass using pip as follows:
- Only use CPU
pip install OnClass==1.2
pip install tensorflow==2.0
- Use GPU
pip install OnClass==1.2
pip install tensorflow-gpu==2.0
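After installation, a quick check that the packages can be found by Python may save debugging time later. The following is a minimal sketch using only the standard library. The helper name check_installed is hypothetical and not part of OnClass, and the sketch assumes the import names are OnClass and tensorflow, as in the import statements used in this documentation.

```python
from importlib.util import find_spec


def check_installed(module_names):
    """Return {module_name: bool} telling whether each top-level module can be found."""
    return {name: find_spec(name) is not None for name in module_names}


if __name__ == "__main__":
    # Assumed import names; adjust if your environment differs
    for name, found in check_installed(["OnClass", "tensorflow"]).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

This only checks that the modules are discoverable; it does not verify versions or GPU support.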
Dataset and pretrained models¶
If you want to use your own training data, please use OnClass_data_public_minimal.tar.gz. OnClass_data_public_minimal.tar.gz includes the minimal files (ontology files) needed to run OnClass on your own dataset. You need to provide the annotated cells (training data) and then OnClass can classify unannotated cells.
If you don’t have your own training data, you can use the annotated gene expression data (e.g., Tabula Muris Senis, Lemur, HLCA, Allen Brain) used in the OnClass paper, or the pretrained models. See scRNA_data below for how to download the annotated gene expression data. See Pretrained_model below for how to download the pretrained models.
1) scRNA_data¶
Download three parts from link 1, link 2, link 3. Jointly extract the files using
cat OnClass_data_public_scRNA_data.tar.gz.* | tar -xz
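On systems without cat and tar (for example, some Windows setups), the same join-and-extract step can be done from Python. The sketch below is one possible equivalent using only the standard library; the function name extract_split_tarball is hypothetical, and it assumes the parts sort into the correct order by file name, as the shell glob above does.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def extract_split_tarball(part_paths, dest_dir):
    """Concatenate split .tar.gz parts (sorted by name) and extract them into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryFile() as joined:
        # Join the parts in lexicographic order, mirroring the shell glob
        for part in sorted(str(p) for p in part_paths):
            with open(part, "rb") as src:
                shutil.copyfileobj(src, joined)
        joined.seek(0)
        # Read the joined bytes as one gzip-compressed tar archive
        with tarfile.open(fileobj=joined, mode="r:gz") as archive:
            archive.extractall(dest)
    return dest
```

A possible call would be extract_split_tarball(Path(".").glob("OnClass_data_public_scRNA_data.tar.gz.*"), "."). Note that the joined archive is written to a temporary file first, so enough free disk space for a second copy is needed.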
This will give you all the single cell gene expression data used in our paper (see Fig. 2, Extended Data Figs. 1-3, Supplementary Figs. 4-7).
2) Ontology_data¶
These files are in OnClass_data_public_minimal.tar.gz. They include the Cell Ontology and the Allen Brain Ontology. The Cell Ontology provides a text definition for each cell type; cl.ontology.nlp.emb contains the text embedding of each cell type's definition.
3) Pretrained_model¶
Download the 8 pretrained TensorFlow models here. They were trained on the 8 datasets in Fig. 2, Extended Data Figs. 1-3, Supplementary Figs. 4-7.
4) Intermediate_files¶
This folder contains the intermediate files. Data generated by example scripts will be stored here.
We suggest you organize all the downloaded files as the following:
For questions about the datasets, please contact Sheng Wang at swang91@uw.edu.
How to run OnClass¶
To run OnClass, first install OnClass, download the datasets, and then change the file paths in config.py.
We provide run_OnClass_example.py and a Jupyter notebook as examples of how to run OnClass. The script trains an OnClass model on all cells from one Lemur dataset, saves that model to a model file, then uses this model to classify cells from another Lemur dataset.
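A mistyped path in config.py is an easy mistake to make, so a small sanity check before running the example may help. The sketch below assumes config.py defines the directory variables imported later in this tutorial (ontology_data_dir, scrna_data_dir, model_dir). The helper find_missing_dirs and the placeholder paths are hypothetical and not part of OnClass.

```python
from pathlib import Path


def find_missing_dirs(named_dirs):
    """Return the names whose configured path is not an existing directory."""
    return [name for name, path in named_dirs.items() if not Path(path).is_dir()]


if __name__ == "__main__":
    # Placeholder paths for illustration; in practice use the values from your
    # own config.py, e.g. `from config import ontology_data_dir, scrna_data_dir, model_dir`
    configured = {
        "ontology_data_dir": "/path/to/ontology_data",
        "scrna_data_dir": "/path/to/scrna_data",
        "model_dir": "/path/to/models",
    }
    for name in find_missing_dirs(configured):
        print(f"config.py: {name} does not point to an existing directory")
```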
Run your own dataset for cell type annotation¶
You only need to modify lines 9-13 in run_OnClass_example.py: replace train_file and test_file with your training and test files, and train_label and test_label with the Cell Ontology label key in your dataset.
Import OnClass and other libs as:
from anndata import read_h5ad
from scipy import stats, sparse
import numpy as np
import sys
from collections import Counter
from OnClass.OnClassModel import OnClassModel
from utils import read_ontology_file, read_data, run_scanorama_multiply_datasets
from config import ontology_data_dir, scrna_data_dir, model_dir, Run_scanorama_batch_correction, NHIDDEN, MAX_ITER
Read training and test data. Set nlp_mapping = True to use the char-level LSTM that maps uncontrolled vocabulary to controlled vocabulary. If you don’t want to use an h5ad file, you can provide the training and test data to OnClass as numpy arrays. Training and test features (gene expression) should be cell-by-gene 2D arrays. The training labels should be a vector of cell labels.
train_file = scrna_data_dir + '/Lemur/microcebusBernard.h5ad'
test_file = scrna_data_dir + '/Lemur/microcebusAntoine.h5ad'
train_label = 'cell_ontology_id'
test_label = 'cell_ontology_id'
model_path = model_dir + 'example_file_model'
cell_type_nlp_emb_file, cell_type_network_file, cl_obo_file = read_ontology_file('cell ontology', ontology_data_dir)
OnClass_train_obj = OnClassModel(cell_type_nlp_emb_file = cell_type_nlp_emb_file, cell_type_network_file = cell_type_network_file)
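# Read the training data
train_feature, train_genes, train_label, _, _ = read_data(train_file, cell_ontology_ids = OnClass_train_obj.cell_ontology_ids,
    exclude_non_leaf_ontology = False, tissue_key = 'tissue', AnnData_label_key = train_label, filter_key = {},
    nlp_mapping = False, cl_obo_file = cl_obo_file, cell_ontology_file = cell_type_network_file, co2emb = OnClass_train_obj.co2vec_nlp)
# Read the test data; this call mirrors the training call above, since test_feature and test_genes are used in the steps below
test_feature, test_genes, test_label, _, _ = read_data(test_file, cell_ontology_ids = OnClass_train_obj.cell_ontology_ids,
    exclude_non_leaf_ontology = False, tissue_key = 'tissue', AnnData_label_key = test_label, filter_key = {},
    nlp_mapping = False, cl_obo_file = cl_obo_file, cell_ontology_file = cell_type_network_file, co2emb = OnClass_train_obj.co2vec_nlp)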
Embed the cell ontology:
OnClass_train_obj.EmbedCellTypes(train_label)
Batch correction using Scanorama:
if Run_scanorama_batch_correction:
    train_feature, test_feature = run_scanorama_multiply_datasets([train_feature, test_feature], [train_genes, test_genes], scan_dim = 10)[1]
Training:
cor_train_feature, cor_test_feature, cor_train_genes, cor_test_genes = OnClass_train_obj.ProcessTrainFeature(train_feature, train_label, train_genes, test_feature = test_feature, test_genes = test_genes)
OnClass_train_obj.BuildModel(ngene = len(cor_train_genes), nhidden = NHIDDEN)
OnClass_train_obj.Train(cor_train_feature, train_label, save_model = model_path, max_iter = MAX_ITER)
Test:
OnClass_test_obj = OnClassModel(cell_type_nlp_emb_file = cell_type_nlp_emb_file, cell_type_network_file = cell_type_network_file)
cor_test_feature = OnClass_train_obj.ProcessTestFeature(cor_test_feature, cor_test_genes, use_pretrain = model_path, log_transform = False)
OnClass_test_obj.BuildModel(ngene = None, use_pretrain = model_path)
pred_Y_seen, pred_Y_all, pred_label = OnClass_test_obj.Predict(cor_test_feature, test_genes = cor_test_genes, use_normalize=True)
pred_label_str = [OnClass_test_obj.i2co[l] for l in pred_label]
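Once the predicted indices are mapped to Cell Ontology IDs, a quick tally of how many cells were assigned to each ID is often the first thing to look at. The following is a minimal sketch with toy stand-ins for OnClass_test_obj.i2co and pred_label. It assumes i2co behaves like a mapping from integer index to Cell Ontology ID, as the list comprehension above suggests; the helper summarize_predictions is hypothetical.

```python
from collections import Counter


def summarize_predictions(pred_label, i2co):
    """Map predicted indices to Cell Ontology IDs and count cells per predicted ID."""
    pred_label_str = [i2co[l] for l in pred_label]
    # most_common() returns (ID, count) pairs, largest count first
    return Counter(pred_label_str).most_common()


# Toy values for illustration only, not real OnClass output
toy_i2co = {0: "CL:0000084", 1: "CL:0000236"}
toy_pred_label = [0, 1, 0, 0]
for cl_id, n_cells in summarize_predictions(toy_pred_label, toy_i2co):
    print(cl_id, n_cells)
```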
One dataset cross-validation¶
run_one_dataset_cross_validation.py can be used to reproduce Figure 2 in our paper. All data are provided on figshare (please see Dataset and pretrained models).
Cross dataset prediction¶
run_cross_dataset_prediction.py can be used to reproduce Figure 4 in our paper. All data are provided on figshare (please see Dataset and pretrained models).
Marker genes identification¶
Please first run run_generate_pretrained_model.py to generate the intermediate files (lines 53-54) for marker gene prediction.
Train a model using the seen cell types:
OnClass_train_obj.EmbedCellTypes(train_label)
# Define the model path before printing it, so the message shows the actual path
model_path = model_dir + 'OnClass_full_'+dname
print ('generate pretrain model. Save the model to %s...' % model_path)
train_feature, train_genes = OnClass_train_obj.ProcessTrainFeature(train_feature, train_label, train_genes)
OnClass_train_obj.BuildModel(ngene = len(train_genes))
OnClass_train_obj.Train(train_feature, train_label, save_model = model_path)
Use this model to classify cells into all cell types in the Cell Ontology. Here pred_Y_seen is a cell-by-seen-cell-type matrix, and pred_Y_all is a cell-by-all-cell-type matrix.
OnClass_test_obj = OnClassModel(cell_type_nlp_emb_file = cell_type_nlp_emb_file, cell_type_network_file = cell_type_network_file)
OnClass_test_obj.BuildModel(ngene = None, use_pretrain = model_path)
pred_Y_seen, pred_Y_all, pred_label = OnClass_test_obj.Predict(train_feature, test_genes = train_genes, use_normalize=False, use_unseen_distance = -1)
np.save(output_dir+dname + 'pred_Y_seen.released.npy',pred_Y_seen)
np.save(output_dir+dname + 'pred_Y_all.released.npy',pred_Y_all)
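To make the layout of these score matrices concrete, the sketch below ranks the columns of each row and returns the top-scoring cell types per cell. It is an illustration only: it assumes that higher scores indicate stronger predicted membership and that you have a list of cell type IDs in column order (called column_ids here, a hypothetical name). The helper top_cell_types is not part of OnClass and works on nested lists as well as 2D numpy arrays.

```python
def top_cell_types(score_rows, column_ids, k=2):
    """For each cell (row), return the k column IDs with the highest scores."""
    results = []
    for row in score_rows:
        # Column indices sorted by descending score for this cell
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        results.append([column_ids[j] for j in ranked[:k]])
    return results


# Toy 2-cell by 3-cell-type score matrix with placeholder IDs
toy_scores = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]
toy_ids = ["type_A", "type_B", "type_C"]
print(top_cell_types(toy_scores, toy_ids))
```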
Then run run_marker_genes_identification.py for marker gene identification (Figure 5c).
Run run_marker_gene_based_prediction.py for marker gene based prediction (Figure 5d,e,f, Extended Data Figure 7).
References¶
[Hie19] Brian Hie, Bryan Bryson, and Bonnie Berger. Efficient integration of heterogeneous single-cell transcriptomes using Scanorama. Nature Biotechnology 37.6 (2019): 685.
[Wang19] Sheng Wang, Angela Oliveira Pisco, Jim Karkanias, and Russ B. Altman. Unifying single-cell annotations based on the Cell Ontology. bioRxiv.