Feature Importance Analysis

The feature_importance_analysis.py script provides comprehensive analysis of feature importance matrices generated by Joint VAE models. It analyzes patterns in how input features contribute to predicting output features, constructs importance-based networks, and validates findings against protein-protein interaction (PPI) databases.

Overview

The feature importance analysis helps answer key questions about the Joint VAE's learned feature relationships:

  • Which input features are most consistently important across different outputs?
  • How do self-feature importance patterns relate to actual performance?
  • What network structures emerge from feature importance relationships?
  • How well do importance-derived networks align with known biological interactions?

Features

Core Analysis Components

  1. Rank Consistency Analysis
     • Analyzes how consistently features rank across different prediction targets
     • Identifies features with stable vs. variable importance patterns
     • Generates clustered heatmaps and distribution plots

  2. Self-Feature Importance Analysis
     • Examines diagonal elements (self-importance) in importance matrices
     • Compares self-importance vs. cross-importance patterns
     • Correlates self-importance with actual imputation performance

  3. Feature Specialization Analysis
     • Determines whether features are specialists (important for few targets) or generalists (important for many targets)
     • Uses the Gini coefficient to measure importance concentration
     • Visualizes specialization vs. overall importance

  4. Network Construction and Analysis
     • Builds networks based on feature importance relationships
     • Offers two thresholding methods: self-importance ratio and absolute importance
     • Performs network topology analysis, including centrality and community detection

  5. PPI Validation
     • Compares importance-derived networks with reference PPI databases
     • Identifies novel connections not in PPI and validates known interactions
     • Performs enrichment analysis to assess biological relevance

  6. Threshold Optimization
     • Analyzes different threshold values for network construction
     • Provides recommendations based on edge count, density, and biological validation
     • Generates precision-recall curves for PPI validation
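
To make the specialization measure concrete, here is a minimal sketch of a Gini coefficient computed over one input feature's importance values across all targets. The function name `gini` and the toy rows are illustrative assumptions; the script's own implementation may differ in details such as normalization.

```python
def gini(values):
    """Gini coefficient of non-negative values.

    0.0 means importance is spread evenly over all targets (generalist);
    values approaching 1.0 mean it is concentrated on few targets (specialist).
    """
    xs = sorted(float(v) for v in values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending-sorted values, with i = 1..n:
    # G = sum_i (2i - n - 1) * x_i / (n * sum(x))
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


# Hypothetical rows of an importance matrix (one input feature, four targets).
generalist = [0.25, 0.25, 0.25, 0.25]
specialist = [0.0, 0.0, 0.0, 1.0]

print(gini(generalist))  # 0.0
print(gini(specialist))  # 0.75
```

With n targets the maximum attainable value is (n - 1) / n, so coefficients are only directly comparable between features scored against the same number of targets.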

Usage

Basic Usage

python scripts/feature_importance_analysis.py \
    --importance_a_to_b importance_a_to_b.csv \
    --platform_a_name "Olink" \
    --platform_b_name "SomaScan" \
    --output_dir importance_analysis_results

With Network Analysis

python scripts/feature_importance_analysis.py \
    --importance_a_to_b importance_a_to_b.csv \
    --importance_b_to_a importance_b_to_a.csv \
    --platform_a_name "Olink" \
    --platform_b_name "SomaScan" \
    --threshold_method self_importance_ratio \
    --threshold_params 10.0 \
    --output_dir importance_analysis_results

With Performance Correlation Analysis

python scripts/feature_importance_analysis.py \
    --importance_a_to_b importance_a_to_b.csv \
    --importance_b_to_a importance_b_to_a.csv \
    --truth_a data/truth_platform_a.csv \
    --truth_b data/truth_platform_b.csv \
    --imp_a_m1 data/imputed_a_method1.csv \
    --imp_b_m1 data/imputed_b_method1.csv \
    --platform_a_name "Olink" \
    --platform_b_name "SomaScan" \
    --output_dir importance_analysis_results
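
The correlation step behind this mode might look roughly like the sketch below. It assumes that per-feature imputation performance is the Pearson correlation between truth and imputed values, which is an assumption rather than something this page states; `pearson` and `self_importance_vs_performance` are hypothetical helper names, not functions from the script.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (at least 3 pairs)."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired values")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined for a constant sequence
    return sxy / math.sqrt(sxx * syy)


def self_importance_vs_performance(self_importance, truth, imputed):
    """Correlate self-importance with per-feature imputation performance.

    self_importance: {feature: diagonal importance value}
    truth, imputed:  {feature: [value per sample]}
    Only features present in all three inputs are used.
    """
    shared = sorted(set(self_importance) & set(truth) & set(imputed))
    scored = [(f, pearson(truth[f], imputed[f])) for f in shared]
    scored = [(f, r) for f, r in scored if not math.isnan(r)]
    features = [f for f, _ in scored]
    performance = [r for _, r in scored]
    importance = [self_importance[f] for f in features]
    return features, pearson(importance, performance)


# Toy data (hypothetical): three features, four samples each.
truth = {"f1": [1, 2, 3, 4], "f2": [1, 2, 3, 4], "f3": [1, 2, 3, 4]}
imputed = {"f1": [1, 2, 3, 4], "f2": [1, 3, 2, 4], "f3": [4, 1, 3, 2]}
self_imp = {"f1": 0.9, "f2": 0.5, "f3": 0.1}

features, r = self_importance_vs_performance(self_imp, truth, imputed)
print(features, r)
```

The helper raises an error when fewer than three features overlap, which mirrors the minimum of 3 overlapping features mentioned under Troubleshooting.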

With PPI Validation

python scripts/feature_importance_analysis.py \
    --importance_a_to_b importance_a_to_b.csv \
    --importance_b_to_a importance_b_to_a.csv \
    --platform_a_name "Olink" \
    --platform_b_name "SomaScan" \
    --ppi_reference data/string_ppi.txt \
    --ppi_symbol1_col protein1 \
    --ppi_symbol2_col protein2 \
    --ppi_confidence_col combined_score \
    --ppi_confidence_threshold 400 \
    --output_dir importance_analysis_results
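
As a rough illustration of what PPI validation computes, the sketch below compares a set of network edges with a set of reference interactions and reports validated edges, novel edges, precision, and recall. Treating edges as undirected pairs is an assumption, and `ppi_overlap` is a hypothetical helper name rather than part of the script.

```python
def undirected(edges):
    """Normalize (a, b) pairs to unordered pairs, dropping self-loops."""
    return {frozenset((a, b)) for a, b in edges if a != b}


def ppi_overlap(network_edges, ppi_edges):
    """Compare importance-derived edges with reference PPI pairs."""
    net = undirected(network_edges)
    ref = undirected(ppi_edges)
    validated = net & ref  # edges also present in the PPI reference
    novel = net - ref      # edges absent from the PPI reference
    return {
        "validated": len(validated),
        "novel": len(novel),
        "precision": len(validated) / len(net) if net else 0.0,
        "recall": len(validated) / len(ref) if ref else 0.0,
    }


# Toy example using the gene names from this page.
network = [("APOE", "LDLR"), ("LDLR", "PCSK9"), ("APOE", "PCSK9")]
ppi = [("LDLR", "APOE"), ("LDLR", "PCSK9")]
print(ppi_overlap(network, ppi))
```

In practice the reference set would usually be restricted to proteins that appear among the measured features before computing recall; otherwise recall is deflated by interactions the network could never contain.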

Complete Analysis with All Features

python scripts/feature_importance_analysis.py \
    --importance_a_to_b importance_a_to_b.csv \
    --importance_b_to_a importance_b_to_a.csv \
    --truth_a data/truth_platform_a.csv \
    --truth_b data/truth_platform_b.csv \
    --imp_a_m1 data/imputed_a_method1.csv \
    --imp_b_m1 data/imputed_b_method1.csv \
    --feature_mapping data/feature_mapping.csv \
    --platform_a_name "Olink" \
    --platform_b_name "SomaScan" \
    --threshold_method absolute_importance \
    --threshold_params 0.005 \
    --ppi_reference data/string_ppi.txt \
    --target_density 0.05 \
    --output_dir importance_analysis_results

Command Line Arguments

Required Arguments

  • --importance_a_to_b: Path to importance matrix CSV (Platform A → Platform B)
  • --platform_a_name: Display name for platform A (e.g., "Olink")
  • --platform_b_name: Display name for platform B (e.g., "SomaScan")

Optional Data Arguments

  • --importance_b_to_a: Path to importance matrix CSV (Platform B → Platform A)
  • --truth_a/--truth_b: Truth data files for performance calculation
  • --imp_a_m1/--imp_a_m2: Imputed data files for platform A (methods 1 and 2)
  • --imp_b_m1/--imp_b_m2: Imputed data files for platform B (methods 1 and 2)
  • --feature_mapping: Feature mapping file (CSV or JSON) for numeric ID → gene name conversion

Network Analysis Arguments

  • --threshold_method: Thresholding method (self_importance_ratio or absolute_importance)
  • --threshold_params: Threshold parameter value (auto-determined if not provided)
  • --network_type: Network type (directed or undirected)
  • --target_density: Target network density for recommendations (default: 0.0366)

PPI Validation Arguments

  • --ppi_reference: Path to PPI reference file (tab-delimited)
  • --ppi_symbol1_col: Column name for first protein symbol (default: "symbol1")
  • --ppi_symbol2_col: Column name for second protein symbol (default: "symbol2")
  • --ppi_confidence_col: Column name for confidence scores (optional)
  • --ppi_confidence_threshold: Minimum confidence threshold (default: 0.0)

Output Arguments

  • --output_dir: Output directory for results (default: "importance_matrix_analysis")

Input File Formats

Importance Matrix Files

CSV format with input features as rows and output features as columns:

Feature,Output1,Output2,Output3,...
Input1,0.123,0.456,0.789,...
Input2,0.234,0.567,0.890,...
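
For checking a file against this layout before running the script, a small standard-library reader such as the following can help. It is only a sketch: `load_importance_matrix` is a hypothetical name, and the script itself may load files differently (for example with pandas).

```python
import csv
import io


def load_importance_matrix(handle):
    """Read an importance matrix: rows are input features, columns are outputs.

    Returns (output_names, matrix), where matrix[input][output] is a float.
    """
    reader = csv.reader(handle)
    header = next(reader)
    outputs = header[1:]  # first column holds the input feature names
    matrix = {}
    for row in reader:
        if not row:
            continue
        if len(row) != len(header):
            raise ValueError(f"row {row[0]!r} has {len(row) - 1} values, "
                             f"expected {len(outputs)}")
        matrix[row[0]] = {out: float(v) for out, v in zip(outputs, row[1:])}
    return outputs, matrix


# The example from above, without the trailing ellipses.
sample = io.StringIO(
    "Feature,Output1,Output2,Output3\n"
    "Input1,0.123,0.456,0.789\n"
    "Input2,0.234,0.567,0.890\n"
)
outputs, matrix = load_importance_matrix(sample)
print(outputs)
print(matrix["Input1"]["Output2"])
```

For a real file, pass an open file object instead of the `io.StringIO` buffer.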

Truth and Imputed Data Files

CSV format with samples as rows and features as columns:

SampleID,Feature1,Feature2,Feature3,...
Sample001,1.23,2.45,0.89,...
Sample002,1.67,2.91,1.12,...

Feature Mapping File (Optional)

CSV format mapping numeric IDs to gene names:

NumericID,GeneName
1,APOE
2,LDLR
3,PCSK9
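
The ID-to-name conversion can be pictured as a simple lookup. The sketch below covers the CSV form only and assumes the column names shown above; `load_feature_mapping` and `rename_features` are hypothetical helpers, and leaving unmapped IDs unchanged is an assumption about the desired behavior.

```python
import csv
import io


def load_feature_mapping(handle):
    """Read a NumericID,GeneName CSV into a {numeric_id: gene_name} dict."""
    reader = csv.DictReader(handle)
    return {row["NumericID"].strip(): row["GeneName"].strip() for row in reader}


def rename_features(names, mapping):
    """Replace numeric IDs with gene names; unmapped IDs are left unchanged."""
    return [mapping.get(str(name), str(name)) for name in names]


mapping = load_feature_mapping(io.StringIO(
    "NumericID,GeneName\n"
    "1,APOE\n"
    "2,LDLR\n"
    "3,PCSK9\n"
))
print(rename_features(["1", "3", "99"], mapping))
```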

PPI Reference File (Optional)

Tab-delimited format with protein interaction data:

protein1    protein2    combined_score
APOE    LDLR    850
LDLR    PCSK9   650
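
A file in this layout can be read and filtered by confidence as sketched below. The parameter names mirror the command line arguments, but `load_ppi` is a hypothetical helper, and keeping rows whose score is greater than or equal to the threshold is an assumption about how the "minimum confidence threshold" is applied.

```python
import csv
import io


def load_ppi(handle, symbol1_col="symbol1", symbol2_col="symbol2",
             confidence_col=None, confidence_threshold=0.0):
    """Read a tab-delimited PPI file into a set of unordered protein pairs."""
    reader = csv.DictReader(handle, delimiter="\t")
    pairs = set()
    for row in reader:
        if confidence_col is not None:
            # Assumed rule: keep rows with score >= threshold.
            if float(row[confidence_col]) < confidence_threshold:
                continue
        a = row[symbol1_col].strip()
        b = row[symbol2_col].strip()
        if a and b and a != b:
            pairs.add(frozenset((a, b)))
    return pairs


# The example from above, with real tab characters between columns.
sample = io.StringIO(
    "protein1\tprotein2\tcombined_score\n"
    "APOE\tLDLR\t850\n"
    "LDLR\tPCSK9\t650\n"
)
pairs = load_ppi(sample, "protein1", "protein2",
                 confidence_col="combined_score", confidence_threshold=700)
print(pairs)
```

A `KeyError` from this reader usually means the column names passed in do not match the file header, which is the most common cause of PPI loading errors.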

Output Structure

The analysis generates comprehensive results organized in subdirectories:

importance_analysis_results/
├── figures/
│   ├── rank_consistency_overview_YYYYMMDD_HHMMSS.pdf/.png
│   ├── rank_distribution_analysis_YYYYMMDD_HHMMSS.pdf/.png
│   ├── importance_matrix_heatmaps_clustered_YYYYMMDD_HHMMSS.pdf/.png
│   ├── overlapping_features_analysis_YYYYMMDD_HHMMSS.pdf/.png
│   ├── feature_specialization_analysis_YYYYMMDD_HHMMSS.pdf/.png
│   ├── self_feature_importance_analysis_YYYYMMDD_HHMMSS.pdf/.png
│   ├── self_importance_vs_performance_correlation_YYYYMMDD_HHMMSS.pdf/.png
│   ├── importance_network_*_YYYYMMDD_HHMMSS.pdf/.png
│   ├── network_topology_analysis_YYYYMMDD_HHMMSS.pdf/.png
│   ├── network_hub_analysis_YYYYMMDD_HHMMSS.pdf/.png
│   ├── ppi_network_comparison_YYYYMMDD_HHMMSS.pdf/.png
│   ├── ppi_validation_networks_YYYYMMDD_HHMMSS.pdf/.png
│   ├── threshold_analysis_YYYYMMDD_HHMMSS.pdf/.png
│   ├── threshold_recommendations_YYYYMMDD_HHMMSS.pdf/.png
│   └── threshold_pr_curves_YYYYMMDD_HHMMSS.pdf/.png
├── data/
│   └── [processed analysis data]
├── networks/
│   ├── Platform_A_to_Platform_B_edges.tsv
│   ├── Platform_A_to_Platform_B_nodes.tsv
│   ├── Platform_A_to_Platform_B_network.graphml
│   └── network_summary.yaml
└── logs/
    └── analysis_summary_YYYYMMDD_HHMMSS.yaml

Network Construction Methods

Thresholding Methods

  1. Self-Importance Ratio Method (self_importance_ratio)
     • Logic: Connect if cross-importance > threshold_params × self-importance
     • Advantage: Adapts to each feature's self-prediction capability
     • Typical Values: 1.0 - 10.0 (higher = more stringent)
     • Use Case: When features have very different self-importance levels

  2. Absolute Importance Method (absolute_importance)
     • Logic: Connect if importance > threshold_params absolute value
     • Advantage: Simple, universal threshold
     • Typical Values: 0.001 - 0.01 (depends on importance scale)
     • Use Case: When importance values are on similar scales
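
The two rules above can be sketched as follows. This is an illustration under stated assumptions, not the script's code: `build_edges` is a hypothetical name, the matrix is a plain nested dict, and the self-importance used for the ratio rule is taken to be that of the target feature (the page does not say whether source or target is meant). Pairs whose target has no diagonal entry are skipped under the ratio rule.

```python
def build_edges(matrix, method, threshold):
    """Return (source, target, importance) edges from an importance matrix.

    matrix: {input_feature: {target_feature: importance}}
    """
    edges = []
    for source, row in matrix.items():
        for target, value in row.items():
            if source == target:
                continue  # the diagonal is self-importance, not an edge
            if method == "absolute_importance":
                keep = value > threshold
            elif method == "self_importance_ratio":
                # Assumption: compare against the target's self-importance.
                self_imp = matrix.get(target, {}).get(target)
                keep = self_imp is not None and value > threshold * self_imp
            else:
                raise ValueError(f"unknown threshold method: {method}")
            if keep:
                edges.append((source, target, value))
    return edges


# Toy 3x3 matrix (hypothetical values).
m = {
    "A": {"A": 0.010, "B": 0.030, "C": 0.001},
    "B": {"A": 0.002, "B": 0.020, "C": 0.004},
    "C": {"A": 0.015, "B": 0.001, "C": 0.002},
}
print(build_edges(m, "absolute_importance", 0.005))
print(build_edges(m, "self_importance_ratio", 1.0))
```

On this toy matrix the absolute rule keeps A→B and C→A, while the ratio rule also keeps B→C: 0.004 is small in absolute terms but is twice C's self-importance, which illustrates how the ratio method adapts to features with low self-importance.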

Automatic Threshold Selection

If no threshold is provided, the script automatically determines optimal values using:

  1. Edge Saturation Method (Primary)
     • Finds threshold where edges ≤ nodes for meaningful network structure
     • Balances network size with interpretability

  2. Target Density Method
     • Aims for user-specified network density (default: 3.66%)
     • Based on biological network density estimates

  3. Edge Count Method
     • Targets 1K-10K edges for computational efficiency
     • Balances analysis depth with performance
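
One way to picture the first two selection rules is a scan over candidate thresholds, as in the sketch below. The function names, the candidate grid, the fixed node count, and the directed-network density formula edges / (n × (n − 1)) are all assumptions made for illustration; the script's actual search procedure is not described on this page.

```python
def edge_count(values, threshold):
    """Number of off-diagonal importance values above the threshold."""
    return sum(1 for v in values if v > threshold)


def edge_saturation_threshold(values, n_nodes, candidates):
    """Smallest candidate threshold at which edges <= nodes, or None."""
    for t in sorted(candidates):
        if edge_count(values, t) <= n_nodes:
            return t
    return None


def target_density_threshold(values, n_nodes, candidates, target_density=0.0366):
    """Candidate threshold whose network density is closest to the target."""
    possible = n_nodes * (n_nodes - 1)  # directed network, no self-loops
    return min(
        candidates,
        key=lambda t: abs(edge_count(values, t) / possible - target_density),
    )


# Off-diagonal values of a toy 3-node matrix (hypothetical).
values = [0.030, 0.001, 0.002, 0.004, 0.015, 0.001]
candidates = [0.0005, 0.001, 0.002, 0.005, 0.01]

print(edge_saturation_threshold(values, n_nodes=3, candidates=candidates))
print(target_density_threshold(values, n_nodes=3, candidates=candidates,
                               target_density=0.5))
```

A target density of 0.5 is used in the example only because the default of 0.0366 cannot be approached with three nodes.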

Troubleshooting

Common Issues

  1. "No overlapping features found"
     • Check that feature names match between importance matrices
     • Verify that matrices have appropriate input/output structure

  2. "NetworkX not available"
     • Install NetworkX: pip install networkx python-louvain
     • Network analysis will be skipped without these packages

  3. Memory issues with large matrices
     • Script automatically subsamples for visualization
     • Consider using absolute importance method for very large matrices

  4. "Insufficient data for correlation analysis"
     • Ensure truth and imputed data files are provided
     • Check that feature names match between files
     • Verify minimum of 3 overlapping features for correlation

  5. PPI file loading errors
     • Verify tab-delimited format with correct column names
     • Check protein name format consistency
     • Ensure confidence column contains numerical values

Performance Optimization

For large datasets:

  • Use the absolute_importance method (generally faster than self_importance_ratio)
  • Provide a specific threshold_params value to skip threshold analysis
  • Limit network size by using higher threshold values
  • Consider running the analysis on feature subsets for initial exploration