2. Basic Example

Input Data Requirements

The data matrix should be a data frame with samples as rows and features as columns. Example dataset can be accessed via:

data("UrbanRandomSubset")
head(UrbanRandomSubset[,1:10])

This dataset is a bulk RNAseq dataset that was obtained from blood samples of the 149 African children that were stratified into ones having Atopic Dermatitis (AD) and healthy controls (HC). Additionally, the whole dataset contains the stratification variable by the children’s location (Urban and Rural). This data snippet contains the Urban samples only. The columns represent the genes and samples are in rows. The column treatment contains diagnosis label.

This data subset was used in the PharML 2022 workshop within ECML-PKDD 2022 conference. The whole paper is available with the doi: doi.org/10.1007/978-3-031-23633-4_18

Feature Selection Procedure Basic Usage

As noted before every new session in which you use GeneSelectR should start with the following line, that would allow you to set the correct conda environment:

GeneSelectR::set_reticulate_python()
library(GeneSelectR)
library(dplyr)
# the rest of the code

Prior to running the feature selection function, let’s prepare the data to have a suitable format:

X <- UrbanRandomSubset %>% dplyr::select(-treatment) # get the feature matrix
y <- UrbanRandomSubset['treatment'] # store the data point label in a separate vector

Important

Due to how sklearn handles data, it is critical to convert the label column y into numeric representation. For example, you can do it like this:

y <- as.factor(y)
y <- as.integer(y)

Also, make sure that y is a vector, and not a data frame with one column:

y <- as.vector(y)

To run the feature selection procedure with default settings, you should run the GeneSelectR::GeneSelectR() function:

selection_results <- GeneSelectR::GeneSelectR(X = X, 
                                              y = y,
                                              njobs = -1) # all cores will be used
selection_results

3. Feature Selection

Feature Selection Procedure

Settings Defaults: If not provided, default feature selection methods and hyperparameter grids are established.By default there are four methods implemented to select the features: Univariate feature selection , Logistic regression with L1 penalty, boruta and Random Forest. The default grids for every default feature selection method are as follows:

fs_param_grids <- list(
  "Lasso" = list(
    "feature_selector__estimator__C" = c(0.01, 0.1, 1L, 10L),
    "feature_selector__estimator__solver" = c('liblinear','saga')
  ),
  "Univariate" = list(
    "feature_selector__param" = seq(50L, 200L, by = 50L)
  ),
  "boruta" = list(
    "feature_selector__perc" = seq(80L, 100L, by = 10L),
    'feature_selector__n_estimators' = c(50L, 100L, 250L, 500L)
  ),
  "RandomForest" = list(
    "feature_selector__estimator__n_estimators" = seq(100L, 500L,by = 50L),
    "feature_selector__estimator__max_depth" = c(10L, 20L, 30L),
    "feature_selector__estimator__min_samples_split" = c(2L, 5L, 10L),
    "feature_selector__estimator__min_samples_leaf" = c(1L, 2L, 4L),
    "feature_selector__estimator__bootstrap" = c(TRUE, FALSE)
  )
)

Pipeline Selection: If custom pipelines are provided, they’re used. Otherwise, default pipelines are created using the chosen feature selection methods. Pipelines contain two steps: VarianceThreshold that is set to 0.85 to filter out low variance features and StandardScaler. As per sklearn workflow to prevent data leakage every pipeline is fit on train and test/validate data separately.
Repeated Train-Test Splitting: The data undergoes repeated splitting into training and test sets, with the number of repetitions being defined by n_splits. For each split:

Pipelines are iterated over for fitting.
Each pipeline undergoes a secondary splitting for validation purposes.
Hyperparameter tuning is conducted through grid, randomized or grid search with Bayesian Optimization.
If the perform_test_split parameter is set to TRUE, the test metrics for the split are computed. Otherwise, only cross validation metrics are reported. Attention: due to the nature of biomedical data most of the time there is not so many samples available. As a general recommendation it is better to set perform_test_split to FALSE if number of samples is under 100. This way only cross validation performance will be reported.

Feature Importance Calculation: For each split and each pipeline, feature importance scores are computed. By default model specific, inbuilt feature importance is reported. If calculate_permutation_importance is set to TRUE, permutation feature importance scores are calculated, in addition to inbuilt feature importance scores.

4.1 Note on Inbuilt vs Permutation feature importance: GeneSelectR package offers to ways of selecting features, and it is important to understand the difference in these two approaches. In short, these two methods can be summarized in this manner:

Inbuilt (Model-specific) Feature Importance:
- Nature: Derived directly from the model. For instance, in tree-based models like decision trees or random forests, feature importance is based on the number of times a feature is used to split the data.
- Speed: Generally faster as it’s a by-product of the model training process.
- Bias: Might be biased towards features with more categories or higher cardinality.
Permutation Importance:
- Nature: Calculated by permuting the values of each feature and measuring the decrease in the model’s performance. If shuffling a feature’s values drops the performance significantly, it’s deemed important.
- Speed: Can be computationally intensive as it involves multiple evaluations of the model.
- Bias: Offers a more unbiased approach, especially for features with high cardinality.

A Note on Permutation Importance: Permutation importance provides insights specific to the model it’s computed for. A feature that’s vital for one model might not be for another. Hence, it’s crucial to understand that permutation importance doesn’t indicate the intrinsic predictive value of a feature in isolation. Moreover, it’s vital to have a trustworthy model (with good cross-validation scores) before considering its permutation importance values. Features deemed unimportant in a poorly performing model could be crucial for a well-performing one. It’s always recommended to evaluate a model’s predictive power using held-out data or cross-validation prior to computing permutation importances. For more information please refere to the sklearn official documentation on permutation importance.

Note: Permutation Importance calculation is pretty computationally intensive and may increase the time to finish the analysis run. As a general recommendation it should be used if there large computational resources are avaialble.

Hyperparamter adjustment For the sake of computational time the RandomizedSearchCV is used by default. However, if you want to perform an extensive hyperparameter optimization grid search testing all possible combinations can be performed:

selection_results <- GeneSelectR::GeneSelectR(X = X, 
                                              y = y,
                                              njobs = -1,
                                              search_type = 'grid')

GeneSelectR also supports scikit-optimize method BayesSearchCV, that performs Bayesian Optimization of hyperparameters. Depending on settings and data it could potentially speed up computation:

selection_results <- GeneSelectR::GeneSelectR(X = X, 
                                              y = y,
                                              njobs = -1,
                                              search_type = 'bayesian')

Aggregating Results: Across all the splits, mean inbuilt feature importance values with standard deviation are aggregated for each method. Same is done for the permutation importance if it is enabled. Additionally, rank for every feature at every data split is reported.

Customizing GeneSelectR Workflow with Other sklearn Methods

The GeneSelectR workflow is highly customizable depending on your needs. For example, if you wish to add any other feature selection methods from sklearn, you should pass it as a named list like this:

# sklearn is already imported when the library is loaded 
# define the feature selection submodule and wanted methods with an estimator
feature_selection <- sklearn$feature_selection
select_from_model <- feature_selection$SelectFromModel
RFE <- feature_selection$RFE
rf <- sklearn$ensemble$RandomForestClassifier

# feature selection methods of your choice 
my_methods <- list('RFE' = RFE(estimator = rf(), n_features_to_select = 100L),
                   'SelectFromModel' = select_from_model(estimator = rf()))

The parameters for the methods should be passed like this with a prefix ‘feature_selector__’ for every parameter:

# params for the feature selection methods of your choice 
my_params <- list('RFE' = list('feature_selector__step' = seq(0.1, 0.001, 1, 10)),
                  'SelectFromModel' = list('feature_selector__estimator__n_estimators' = c(50L, 100L, 250L, 500L),
                                           "feature_selector__estimator__max_depth" = c(10L, 20L, 30L),
                                           "feature_selector__estimator__bootstrap" = c(TRUE, FALSE))
                  )

Finally, we can pass it as an arguments to the GeneSelectR() function:

selection_results <- GeneSelectR::GeneSelectR(X = X, 
                                              y = y,
                                              njobs = -1,
                                              feature_selection_methods=my_methods,
                                              fs_param_grids = my_params)

Other than feature selection methods, preprocessing steps as well as other estimators for classifications can be passed. For example, if you want to pass other preprocessing steps you can do it like this:

minmax <- sklearn$preprocessing$MinMaxScaler()
varthr <- sklearn$feature_selection$VarianceThreshold()
preprocessing <- list( 'MinMaxScaler' = minmax, 
                       'VarianceThreshold' = varthr)

selection_results <- GeneSelectR::GeneSelectR(X = X, 
                                              y = y,
                                              njobs = -1,
                                              feature_selection_methods=my_methods,
                                              fs_param_grids = my_params,
                                              preprocessing_steps = preprocessing)

Same can be done with the classifiying estimator. For example if you want to use the XGBoost classifier instead of the default random forest:

# import xgboost 
# NOTE: it should be installed in the working conda env 
xgb <- reticulate::import('xgboost')
xgb.classifier <- xgb$XGBClassifier()

selection_results <- GeneSelectR::GeneSelectR(X = X, 
                                              y = y,
                                              njobs = -1,
                                              classifier = xgb.classifier)

Note: if you supply your own classifier, you have to specify the parameter grid for it with a ‘classifier__’ prefix:

xgb <- reticulate::import('xgboost')
xgb.classifier <- xgb$XGBClassifier()

xgb_param_grid <- list(
  "classifier__learning_rate" = c(0.01, 0.05, 0.1),
  "classifier__n_estimators" = c(100L, 200L, 300L),
  "classifier__max_depth" = c(3L, 5L, 7L),
  "classifier__min_child_weight" = c(1L, 3L, 5L),
  "classifier__gamma" = c(0, 0.1, 0.2),
  "classifier__subsample" = c(0.8, 1.0),
  "classifier__colsample_bytree" = c(0.8, 1.0)
)

selection_results <- GeneSelectR::GeneSelectR(X = X, 
                                              y = y,
                                              njobs = -1,
                                              classifier = xgb.classifier,
                                              classifier_grid = xgb_param_grid)

4. Machine Learning Performance Results Inspection

Structure of the PipelineResults object

The function returns an object of class “PipelineResults”, containing:
- ‘best_pipeline’: A named list containing parameters of the best performer pipeline.

‘cv_results’: Results of cross-validation for each pipeline.
‘inbuilt_feature_importance’: Aggregated inbuilt feature importance scores.
‘test_metrics’: If the perform_split parameter was set to TRUE, a dataframe of test metrics for each pipeline is returned.

-‘cv_mean_score’: A dataframe that summarizes the mean cross-validation scores.

‘permutation_importance’: If permutation importance was calculated, its mean values are returned.

Feature Importance Plots

Finally, you can visualize the most important features for every feature importance method calculation by calling the plotting function:

plot_feature_importance(selection_results, top_n_features = 10)

Alt text

The plot will show you the mean feature importance of top n features of your choice by their importance score in descending order.

ML Performance Metrics

You can plot the feature selection procedure metrics with the following function:

plot_metrics(selection_results)

# or access it as a dataframe 
selection_results$test_metrics 
selection_results$cv_mean_score

Alt text

Overlap Between Gene Lists

Additionally, you can inspect whether genes in your feature selection lists have overlapping features. To do that use the following:

overlap <- calculate_overlap_coefficients(selection_results)
overlap

This will return a dataframe that demonstrates three types of overlap coefficients for inbuilt feature importance and permutation importance (if calculated): Soerensen-Dice, Overlap and Jaccard. The coefficients can also be visualized as overlap heatmaps. To do so do the following:

plot_overlap_heatmaps(overlap)

Alt text

Additionally, if you have any custom lists (e.g. differential gene expression list) you can pass it as an argument like this:

custom_list <- list(custom_list = c('char1','char2','char3','char4','char5'),
                    custom_list2 = c('char1','char2','char3','char4','char5'))
overlap1 <- calculate_overlap_coefficients(selection_results, custom_lists = custom_list)
plot_overlap_heatmaps(overlap1)

Alt text

Upset plot

To get the exact numbers of intersections between the feature lists you can use the upset plot function:

upset_upet(selection_results)

# plot upset with custom lists 
upset_upet(selection_results, custom_lists = custom_list)

Alt text

5.GO Enrichment analysis

Annotation of Selected Genes

For your convenience there is a wrapper function for clusterprofiler(link) GO enrichment implemented, along with a function to get gene annotations. After GeneSelectR has been run, to fetch gene annotations do the following:

# Get the annotations for the genes in the analysis 
# The example dataset contains sequencing from humans 
ah <- AnnotationHub::AnnotationHub()
human_ens <- AnnotationHub::query(ah, c("Homo sapiens", "EnsDb"))
human_ens <- human_ens[['AH98047']]
annotations_ahb <- ensembldb::genes(human_ens, return.type = "data.frame") %>%
 dplyr::select(gene_id,gene_name,entrezid,gene_biotype)

annotations_df <- annotate_gene_lists(pipeline_results = selection_results, 
                                      annotations_ahb = annotations_ahb,
                                      format = 'ENSEMBL')
annotations_df

The format argument can be set to three possible options: (1) ‘ENSEMBL’ (eg. ENSGXXXXXXX), (2) ‘SYMBOL’ (eg. TSPAN) or ENTREZ ID. The resulting AnnotatedGeneLists class object contains following fields:

# Formal class 'AnnotatedGeneLists' [package "GeneSelectR"] with 2 slots
#   ..@ inbuilt    :List of 4
#   .. ..$ Lasso       :Formal class 'GeneList' [package "GeneSelectR"] with 3 slots
#   .. .. .. ..@ SYMBOL  : chr [1:54] "MCOLN3" "AGL" "NBPF26" "EXO1" ...
#   .. .. .. ..@ ENSEMBL : chr [1:54] "ENSG00000055732" "ENSG00000162688" "ENSG00000273136" "ENSG00000174371" ...
#   .. .. .. ..@ ENTREZID: chr [1:54] "55283" "178" "101060684" "9156" ...
#   .. ..$ Univariate  :Formal class 'GeneList' [package "GeneSelectR"] with 3 slots
#   .. .. .. ..@ SYMBOL  : chr [1:74] "CDA" "BMP8B" "MCOLN3" "AGL" ...
#   .. .. .. ..@ ENSEMBL : chr [1:74] "ENSG00000158825" "ENSG00000116985" "ENSG00000055732" "ENSG00000162688" ...
#   .. .. .. ..@ ENTREZID: chr [1:74] "978" "656" "55283" "178" ...
#   .. ..$ RandomForest:Formal class 'GeneList' [package "GeneSelectR"] with 3 slots
#   .. .. .. ..@ SYMBOL  : chr [1:73] "CDA" "BMP8B" "MCOLN3" "NBPF26" ...
#   .. .. .. ..@ ENSEMBL : chr [1:73] "ENSG00000158825" "ENSG00000116985" "ENSG00000055732" "ENSG00000273136" ...
#   .. .. .. ..@ ENTREZID: chr [1:73] "978" "656" "55283" "101060684" ...
#   .. ..$ boruta      :Formal class 'GeneList' [package "GeneSelectR"] with 3 slots
#   .. .. .. ..@ SYMBOL  : chr [1:36] "MCOLN3" "AGL" "NBPF26" "BNIPL" ...
#   .. .. .. ..@ ENSEMBL : chr [1:36] "ENSG00000055732" "ENSG00000162688" "ENSG00000273136" "ENSG00000163141" ...
#   .. .. .. ..@ ENTREZID: chr [1:36] "55283" "178" "101060684" "149428" ...
#   ..@ permutation:List of 4
#   .. ..$ Lasso       :Formal class 'GeneList' [package "GeneSelectR"] with 3 slots
#   .. .. .. ..@ SYMBOL  : chr [1:4] "HBB" "" "ABCC12" "LTF"
#   .. .. .. ..@ ENSEMBL : chr [1:4] "ENSG00000244734" "ENSG00000259529" "ENSG00000140798" "ENSG00000012223"
#   .. .. .. ..@ ENTREZID: chr [1:4] "3043" "NA" "94160" "4057"
#   .. ..$ Univariate  :Formal class 'GeneList' [package "GeneSelectR"] with 3 slots
#   .. .. .. ..@ SYMBOL  : chr [1:19] "KAZN" "RGS16" "IGFN1" "CHI3L1" ...
#   .. .. .. ..@ ENSEMBL : chr [1:19] "ENSG00000189337" "ENSG00000143333" "ENSG00000163395" "ENSG00000133048" ...
#   .. .. .. ..@ ENTREZID: chr [1:19] "23254" "6004" "91156" "1116" ...
#   .. ..$ RandomForest:Formal class 'GeneList' [package "GeneSelectR"] with 3 slots
#   .. .. .. ..@ SYMBOL  : chr [1:16] "ZBED6" "CYP2E1" "HBB" "OTOGL" ...
#   .. .. .. ..@ ENSEMBL : chr [1:16] "ENSG00000257315" "ENSG00000130649" "ENSG00000244734" "ENSG00000165899" ...
#   .. .. .. ..@ ENTREZID: chr [1:16] "100381270" "1571" "3043" "283310" ...
#   .. ..$ boruta      :Formal class 'GeneList' [package "GeneSelectR"] with 3 slots
#   .. .. .. ..@ SYMBOL  : chr [1:8] "CHI3L1" "HBB" "IDH1" "PTPRN" ...
#   .. .. .. ..@ ENSEMBL : chr [1:8] "ENSG00000133048" "ENSG00000244734" "ENSG00000138413" "ENSG00000054356" ...
#   .. .. .. ..@ ENTREZID: chr [1:8] "1116" "3043" "3417" "5798" ...

For example, ENTREZID for Random Forest gene list derived from inbuilt feature importance can be accessed like this:

print(annotated_df@inbuilt$RandomForest@ENTREZID)

You can also add custom lists (eg. background for enrichment, differential gene expression list) to fetch annotations for them:

custom_list <- list(background = c('gene1', 'gene2', 'gene3', 'gene4'), 
                    DEGs = c('gene5','gene6','gene7'))
annotations_df_with_custom <- annotate_gene_lists(pipeline_results = selection_results, 
                                      annotations_ahb = annotations_ahb,
                                      format = 'ensembl_gene_name', 
                                      custom_lists = custom_list)
annotations_df_with_custom

GO Enrichment Analysis

There is a wrapper function to run GO enrichment analysis with the clusterprofiler package. To run a GO enrichment analysis with default settings simply run:

annotated_GO <- GO_enrichment_analysis(annotated_df)

This function can take different arguments depending on your need:

annotated_GO <- GO_enrichment_analysis(annotated_df,
                                       list_type = 'permutation', #run GO enrichment on permutation based selected features 
                                       keyType = 'ENSEMBL', # run analysis with ENSEMBLIDs  
                                       ont = 'BP') # run BP ontology
annotated_GO

The returned object is a list containing clusterprofiler enrichResult for every feature selection list.

Visualization of Parent Term Fractions

You can calculate what is the fraction of target parent GO terms is enriched in your feature selection lists:

annot_child_fractions <- compute_GO_child_term_metrics(GO_data = annotated_GO,
                              GO_terms = c("GO:0002376", "GO:0044419"),
                              plot = FALSE)

This will produce a dataframe containing information on number of children terms of specified terms, number of all the terms and fraction of children terms related to the parent one in the total number of identified GO terms (without p-value filtering) and hypergeometric test p-value:

str(annot_child_fractions)
# 'data.frame': 12 obs. of  6 variables:
#  $ feature_list          : chr  "Lasso" "Univariate" "RandomForest" "boruta" ...
#  $ all_terms_number      : int  5413 5421 5981 4643 1308 1752 5413 5421 5981 4643 ...
#  $ offspring_nodes_number: int  432 449 495 379 196 170 112 110 128 100 ...
#  $ offspring_terms       : chr  "GO:0001767;GO:0001768;GO:0001771;GO:0001773;GO:0001774;GO:0001776;GO:0001779;GO:0001780;GO:0001782;GO:0001787;G"| __truncated__ "GO:0001767;GO:0001768;GO:0001771;GO:0001773;GO:0001774;GO:0001776;GO:0001780;GO:0001782;GO:0001787;GO:0001909;G"| __truncated__ "GO:0001767;GO:0001768;GO:0001771;GO:0001773;GO:0001774;GO:0001776;GO:0001779;GO:0001782;GO:0001787;GO:0001865;G"| __truncated__ "GO:0001767;GO:0001768;GO:0001771;GO:0001773;GO:0001774;GO:0001776;GO:0001779;GO:0001780;GO:0001782;GO:0001865;G"| __truncated__ ...
#  $ fraction              : num  7.98 8.28 8.28 8.16 14.98 ...
#  $ GO_term               : chr  "GO:0002376" "GO:0002376" "GO:0002376" "GO:0002376" ...

If the plot argument is set to TRUE it will produce a following plot:

Alt text

GeneSelectR Tutorial

1. Installation and Requirements

Python requirements

Before you begin

2. Basic Example

Input Data Requirements

Feature Selection Procedure Basic Usage

Important

3. Feature Selection

Feature Selection Procedure

Customizing GeneSelectR Workflow with Other sklearn Methods

4. Machine Learning Performance Results Inspection

Structure of the PipelineResults object

Feature Importance Plots

ML Performance Metrics

Overlap Between Gene Lists

Upset plot

5.GO Enrichment analysis

Annotation of Selected Genes

GO Enrichment Analysis

Visualization of Parent Term Fractions

6. Semantic Similarity Analysis

simplifyEnrichment Wrapper Function