Feature Importance Analysis¶
The feature_importance_analysis.py script provides comprehensive analysis of feature importance matrices generated by Joint VAE models. It analyzes patterns in how input features contribute to predicting output features, constructs importance-based networks, and validates findings against protein-protein interaction (PPI) databases.
Overview¶
The feature importance analysis helps answer key questions about the Joint VAE's learned feature relationships:
- Which input features are most consistently important across different outputs?
- How do self-feature importance patterns relate to actual performance?
- What network structures emerge from feature importance relationships?
- How well do importance-derived networks align with known biological interactions?
Features¶
Core Analysis Components¶
- Rank Consistency Analysis
  - Analyzes how consistently features rank across different prediction targets
  - Identifies features with stable vs. variable importance patterns
  - Generates clustered heatmaps and distribution plots
- Self-Feature Importance Analysis
  - Examines diagonal elements (self-importance) in importance matrices
  - Compares self-importance vs. cross-importance patterns
  - Correlates self-importance with actual imputation performance
- Feature Specialization Analysis
  - Determines whether features are specialists (important for few targets) or generalists
  - Uses Gini coefficient to measure importance concentration
  - Visualizes specialization vs. overall importance
- Network Construction and Analysis
  - Builds networks based on feature importance relationships
  - Two thresholding methods: self-importance ratio and absolute importance
  - Comprehensive network topology analysis including centrality and community detection
- PPI Validation
  - Compares importance-derived networks with reference PPI databases
  - Identifies novel connections not in PPI and validates known interactions
  - Performs enrichment analysis to assess biological relevance
- Threshold Optimization
  - Analyzes different threshold values for optimal network construction
  - Provides recommendations based on edge count, density, and biological validation
  - Generates precision-recall curves for PPI validation
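Two of the measures listed above, rank consistency and Gini-based specialization, can be illustrated on a toy matrix. The following is a minimal standard-library sketch and not the script's own implementation; the function names ranks_per_output and gini, the tie-breaking by row order, and the example values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def ranks_per_output(matrix):
    """Rank inputs within each output column (1 = most important).

    matrix[i][j] is the importance of input feature i for output feature j.
    """
    n_in, n_out = len(matrix), len(matrix[0])
    ranks = [[0] * n_out for _ in range(n_in)]
    for j in range(n_out):
        # Sort input indices by importance for output j, largest first.
        order = sorted(range(n_in), key=lambda i: matrix[i][j], reverse=True)
        for rank, i in enumerate(order, start=1):
            ranks[i][j] = rank
    return ranks


def gini(values):
    """Gini coefficient of non-negative values (0 = evenly spread)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Closed form on ascending data: sum_i (2i - n - 1) * x_i / (n * sum(x)).
    return sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1)) / (n * total)


# Hypothetical importance matrix: 3 inputs (rows) x 3 outputs (columns).
matrix = [
    [0.50, 0.60, 0.55],  # always ranked first: a stable feature
    [0.30, 0.05, 0.40],  # rank changes between outputs
    [0.20, 0.35, 0.05],
]
ranks = ranks_per_output(matrix)
for i, row in enumerate(ranks):
    # A low rank standard deviation means a consistent ranking across targets.
    print(i, row, round(mean(row), 2), round(pstdev(row), 2))

# Specialization: importance concentrated on one target vs. spread evenly.
print(gini([0.0, 0.0, 0.0, 1.0]))      # 0.75
print(gini([0.25, 0.25, 0.25, 0.25]))  # 0.0
```

In this sketch a feature with a high Gini value would be read as a specialist and one with a value near zero as a generalist.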
Usage¶
Basic Usage¶
python scripts/feature_importance_analysis.py \
--importance_a_to_b importance_a_to_b.csv \
--platform_a_name "Olink" \
--platform_b_name "SomaScan" \
--output_dir importance_analysis_results
With Network Analysis¶
python scripts/feature_importance_analysis.py \
--importance_a_to_b importance_a_to_b.csv \
--importance_b_to_a importance_b_to_a.csv \
--platform_a_name "Olink" \
--platform_b_name "SomaScan" \
--threshold_method self_importance_ratio \
--threshold_params 10.0 \
--output_dir importance_analysis_results
With Performance Correlation Analysis¶
python scripts/feature_importance_analysis.py \
--importance_a_to_b importance_a_to_b.csv \
--importance_b_to_a importance_b_to_a.csv \
--truth_a data/truth_platform_a.csv \
--truth_b data/truth_platform_b.csv \
--imp_a_m1 data/imputed_a_method1.csv \
--imp_b_m1 data/imputed_b_method1.csv \
--platform_a_name "Olink" \
--platform_b_name "SomaScan" \
--output_dir importance_analysis_results
With PPI Validation¶
python scripts/feature_importance_analysis.py \
--importance_a_to_b importance_a_to_b.csv \
--importance_b_to_a importance_b_to_a.csv \
--platform_a_name "Olink" \
--platform_b_name "SomaScan" \
--ppi_reference data/string_ppi.txt \
--ppi_symbol1_col protein1 \
--ppi_symbol2_col protein2 \
--ppi_confidence_col combined_score \
--ppi_confidence_threshold 400 \
--output_dir importance_analysis_results
Complete Analysis with All Features¶
python scripts/feature_importance_analysis.py \
--importance_a_to_b importance_a_to_b.csv \
--importance_b_to_a importance_b_to_a.csv \
--truth_a data/truth_platform_a.csv \
--truth_b data/truth_platform_b.csv \
--imp_a_m1 data/imputed_a_method1.csv \
--imp_b_m1 data/imputed_b_method1.csv \
--feature_mapping data/feature_mapping.csv \
--platform_a_name "Olink" \
--platform_b_name "SomaScan" \
--threshold_method absolute_importance \
--threshold_params 0.005 \
--ppi_reference data/string_ppi.txt \
--target_density 0.05 \
--output_dir importance_analysis_results
Command Line Arguments¶
Required Arguments¶
- --importance_a_to_b: Path to importance matrix CSV (Platform A → Platform B)
- --platform_a_name: Display name for platform A (e.g., "Olink")
- --platform_b_name: Display name for platform B (e.g., "SomaScan")
Optional Data Arguments¶
- --importance_b_to_a: Path to importance matrix CSV (Platform B → Platform A)
- --truth_a/--truth_b: Truth data files for performance calculation
- --imp_a_m1/--imp_a_m2: Imputed data files for platform A (methods 1 and 2)
- --imp_b_m1/--imp_b_m2: Imputed data files for platform B (methods 1 and 2)
- --feature_mapping: Feature mapping file (CSV or JSON) for numeric ID → gene name conversion
Network Analysis Arguments¶
- --threshold_method: Thresholding method (self_importance_ratio or absolute_importance)
- --threshold_params: Threshold parameter value (auto-determined if not provided)
- --network_type: Network type (directed or undirected)
- --target_density: Target network density for recommendations (default: 0.0366)
PPI Validation Arguments¶
- --ppi_reference: Path to PPI reference file (tab-delimited)
- --ppi_symbol1_col: Column name for first protein symbol (default: "symbol1")
- --ppi_symbol2_col: Column name for second protein symbol (default: "symbol2")
- --ppi_confidence_col: Column name for confidence scores (optional)
- --ppi_confidence_threshold: Minimum confidence threshold (default: 0.0)
Output Arguments¶
- --output_dir: Output directory for results (default: "importance_matrix_analysis")
Input File Formats¶
Importance Matrix Files¶
CSV format with input features as rows and output features as columns:
Feature,Output1,Output2,Output3,...
Input1,0.123,0.456,0.789,...
Input2,0.234,0.567,0.890,...
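To check a matrix file against this layout before running the script, a small reader can help. This is a sketch using only the standard library; the helper name load_importance_matrix is hypothetical, and the script itself may read the file differently.

```python
import csv
import io


def load_importance_matrix(handle):
    """Read a CSV with input features as rows and output features as columns.

    Returns (output_names, {input_name: {output_name: importance}}).
    """
    reader = csv.reader(handle)
    header = next(reader)
    outputs = header[1:]  # the first header cell labels the row-name column
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        matrix[row[0]] = {out: float(v) for out, v in zip(outputs, row[1:])}
    return outputs, matrix


# In-memory stand-in for a file such as importance_a_to_b.csv.
sample = io.StringIO(
    "Feature,Output1,Output2,Output3\n"
    "Input1,0.123,0.456,0.789\n"
    "Input2,0.234,0.567,0.890\n"
)
outputs, matrix = load_importance_matrix(sample)
print(outputs)
print(matrix["Input2"]["Output3"])
```

For a real file, pass an open handle instead, for example one obtained with open(path, newline="").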
Truth and Imputed Data Files¶
CSV format with samples as rows and features as columns:
SampleID,Feature1,Feature2,Feature3,...
Sample001,1.23,2.45,0.89,...
Sample002,1.67,2.91,1.12,...
Feature Mapping File (Optional)¶
CSV format mapping numeric IDs to gene names:
NumericID,GeneName
1,APOE
2,LDLR
3,PCSK9
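How such a mapping might be applied to feature labels can be sketched as follows. The helper names are hypothetical, and leaving unmapped IDs unchanged is an assumption of this example rather than documented script behavior.

```python
import csv
import io


def load_feature_mapping(handle):
    """Read a NumericID,GeneName CSV into a dict keyed by the ID as a string."""
    return {row["NumericID"]: row["GeneName"] for row in csv.DictReader(handle)}


def rename_features(names, mapping):
    """Replace numeric IDs with gene names; unmapped labels are kept as-is."""
    return [mapping.get(str(name), str(name)) for name in names]


mapping_csv = io.StringIO("NumericID,GeneName\n1,APOE\n2,LDLR\n3,PCSK9\n")
mapping = load_feature_mapping(mapping_csv)
print(rename_features(["1", "3", "99"], mapping))
```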
PPI Reference File (Optional)¶
Tab-delimited format with protein interaction data:
protein1 protein2 combined_score
APOE LDLR 850
LDLR PCSK9 650
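Loading such a file and comparing it with importance-derived edges might look like the sketch below. It assumes the confidence filter keeps scores greater than or equal to the threshold and treats interactions as undirected; both are assumptions, as are the function names and the predicted edges.

```python
import csv
import io


def load_ppi_edges(handle, col1="symbol1", col2="symbol2",
                   conf_col=None, min_conf=0.0):
    """Read a tab-delimited PPI file into a set of undirected edges."""
    edges = set()
    for row in csv.DictReader(handle, delimiter="\t"):
        if conf_col is not None and float(row[conf_col]) < min_conf:
            continue  # below the minimum confidence
        # frozenset makes A-B and B-A the same edge (an assumption of this sketch).
        edges.add(frozenset((row[col1], row[col2])))
    return edges


def precision_recall(predicted, reference):
    """Share of predicted edges found in the reference, and of reference edges recovered."""
    hits = predicted & reference
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(reference) if reference else 0.0
    return precision, recall


ppi_text = (
    "protein1\tprotein2\tcombined_score\n"
    "APOE\tLDLR\t850\n"
    "LDLR\tPCSK9\t650\n"
)
reference = load_ppi_edges(io.StringIO(ppi_text), "protein1", "protein2",
                           conf_col="combined_score", min_conf=400)
# Hypothetical importance-derived edges.
predicted = {frozenset(("LDLR", "APOE")), frozenset(("APOE", "PCSK9"))}
print(precision_recall(predicted, reference))  # (0.5, 0.5)
print(predicted - reference)                   # candidate novel connection
```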
Output Structure¶
The analysis generates comprehensive results organized in subdirectories:
importance_analysis_results/
├── figures/
│ ├── rank_consistency_overview_YYYYMMDD_HHMMSS.pdf/.png
│ ├── rank_distribution_analysis_YYYYMMDD_HHMMSS.pdf/.png
│ ├── importance_matrix_heatmaps_clustered_YYYYMMDD_HHMMSS.pdf/.png
│ ├── overlapping_features_analysis_YYYYMMDD_HHMMSS.pdf/.png
│ ├── feature_specialization_analysis_YYYYMMDD_HHMMSS.pdf/.png
│ ├── self_feature_importance_analysis_YYYYMMDD_HHMMSS.pdf/.png
│ ├── self_importance_vs_performance_correlation_YYYYMMDD_HHMMSS.pdf/.png
│ ├── importance_network_*_YYYYMMDD_HHMMSS.pdf/.png
│ ├── network_topology_analysis_YYYYMMDD_HHMMSS.pdf/.png
│ ├── network_hub_analysis_YYYYMMDD_HHMMSS.pdf/.png
│ ├── ppi_network_comparison_YYYYMMDD_HHMMSS.pdf/.png
│ ├── ppi_validation_networks_YYYYMMDD_HHMMSS.pdf/.png
│ ├── threshold_analysis_YYYYMMDD_HHMMSS.pdf/.png
│ ├── threshold_recommendations_YYYYMMDD_HHMMSS.pdf/.png
│ └── threshold_pr_curves_YYYYMMDD_HHMMSS.pdf/.png
├── data/
│ └── [processed analysis data]
├── networks/
│ ├── Platform_A_to_Platform_B_edges.tsv
│ ├── Platform_A_to_Platform_B_nodes.tsv
│ ├── Platform_A_to_Platform_B_network.graphml
│ └── network_summary.yaml
└── logs/
└── analysis_summary_YYYYMMDD_HHMMSS.yaml
Network Construction Methods¶
Thresholding Methods¶
- Self-Importance Ratio Method (self_importance_ratio)
  - Logic: Connect if cross-importance > threshold_params × self-importance
  - Advantage: Adapts to each feature's self-prediction capability
  - Typical Values: 1.0 - 10.0 (higher = more stringent)
  - Use Case: When features have very different self-importance levels
- Absolute Importance Method (absolute_importance)
  - Logic: Connect if importance > threshold_params (an absolute cutoff)
  - Advantage: Simple, universal threshold
  - Typical Values: 0.001 - 0.01 (depends on importance scale)
  - Use Case: When importance values are on similar scales
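The two rules can be sketched on a small nested dictionary. This is an illustration only: the text does not say whose self-importance enters the ratio rule, so the sketch assumes it is the target feature's, and the function names and values are hypothetical.

```python
def edges_self_importance_ratio(imp, ratio):
    """Keep source -> target when cross-importance > ratio * target self-importance.

    imp[source][target] holds importance values. Using the target's
    self-importance is an assumption of this sketch.
    """
    edges = []
    for src, row in imp.items():
        for tgt, value in row.items():
            if src == tgt:
                continue
            self_imp = imp.get(tgt, {}).get(tgt)
            if self_imp is None:
                continue  # no self-importance available for this target
            if value > ratio * self_imp:
                edges.append((src, tgt))
    return edges


def edges_absolute_importance(imp, threshold):
    """Keep source -> target when importance exceeds a fixed cutoff."""
    return [(s, t) for s, row in imp.items()
            for t, v in row.items() if s != t and v > threshold]


# Hypothetical importance values; the diagonal holds self-importance.
imp = {
    "APOE":  {"APOE": 0.010, "LDLR": 0.030, "PCSK9": 0.001},
    "LDLR":  {"APOE": 0.002, "LDLR": 0.020, "PCSK9": 0.008},
    "PCSK9": {"APOE": 0.004, "LDLR": 0.001, "PCSK9": 0.005},
}
print(edges_self_importance_ratio(imp, 1.0))
print(edges_absolute_importance(imp, 0.003))
```

With these values the ratio rule keeps APOE → LDLR and LDLR → PCSK9, while the absolute cutoff of 0.003 also keeps PCSK9 → APOE, because that edge is weak relative to APOE's self-importance but above the fixed cutoff.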
Automatic Threshold Selection¶
If no threshold is provided, the script automatically determines optimal values using:
- Edge Saturation Method (Primary)
  - Finds threshold where edges ≤ nodes for meaningful network structure
  - Balances network size with interpretability
- Target Density Method
  - Aims for user-specified network density (default: 3.66%)
  - Based on biological network density estimates
- Edge Count Method
  - Targets 1K-10K edges for computational efficiency
  - Balances analysis depth with performance
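The target density idea can be illustrated with a simple sweep over candidate thresholds. This sketch assumes a directed network without self-loops, so density is edges / (n × (n − 1)); the function names, candidate thresholds, and values are hypothetical, and the script's actual selection logic may differ.

```python
def sweep_thresholds(imp, thresholds):
    """Return (threshold, edge_count, density) for each candidate threshold."""
    nodes = set(imp) | {t for row in imp.values() for t in row}
    possible = len(nodes) * (len(nodes) - 1)  # directed, no self-loops
    values = [v for s, row in imp.items() for t, v in row.items() if s != t]
    rows = []
    for thr in thresholds:
        n_edges = sum(v > thr for v in values)
        rows.append((thr, n_edges, n_edges / possible if possible else 0.0))
    return rows


def closest_to_density(rows, target_density):
    """Pick the threshold whose resulting density is nearest the target."""
    return min(rows, key=lambda r: abs(r[2] - target_density))[0]


# Hypothetical importance values; the diagonal (self-importance) is ignored here.
imp = {
    "APOE":  {"APOE": 0.010, "LDLR": 0.030, "PCSK9": 0.001},
    "LDLR":  {"APOE": 0.002, "LDLR": 0.020, "PCSK9": 0.008},
    "PCSK9": {"APOE": 0.004, "LDLR": 0.001, "PCSK9": 0.005},
}
rows = sweep_thresholds(imp, [0.001, 0.003, 0.005, 0.01])
for thr, n_edges, density in rows:
    print(thr, n_edges, round(density, 3))
print(closest_to_density(rows, target_density=0.35))
```

On a three-node toy network a 3.66% density is unreachable, which is why the example uses a target of 0.35; real matrices with hundreds of features give the default a meaningful range.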
Troubleshooting¶
Common Issues¶
- "No overlapping features found"
  - Check that feature names match between importance matrices
  - Verify that matrices have appropriate input/output structure
- "NetworkX not available"
  - Install NetworkX: pip install networkx python-louvain
  - Network analysis will be skipped without these packages
- Memory issues with large matrices
  - Script automatically subsamples for visualization
  - Consider using absolute importance method for very large matrices
- "Insufficient data for correlation analysis"
  - Ensure truth and imputed data files are provided
  - Check that feature names match between files
  - Verify minimum of 3 overlapping features for correlation
- PPI file loading errors
  - Verify tab-delimited format with correct column names
  - Check protein name format consistency
  - Ensure confidence column contains numerical values
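For the feature-name mismatches mentioned above, a quick pre-flight comparison can show whether names differ only in case or stray whitespace. This is a diagnostic sketch with a hypothetical helper name and made-up feature lists; it does not reproduce how the script matches names.

```python
def overlap_report(names_a, names_b):
    """Compare two lists of feature names, exactly and after normalization."""
    exact = sorted(set(names_a) & set(names_b))
    # Normalization (strip whitespace, upper-case) is only a diagnostic aid.
    norm_a = {n.strip().upper() for n in names_a}
    norm_b = {n.strip().upper() for n in names_b}
    normalized = sorted(norm_a & norm_b)
    return exact, normalized


rows_a_to_b = ["APOE", "LDLR", "pcsk9 "]  # hypothetical input feature names
cols_a_to_b = ["APOE", "PCSK9", "IL6"]    # hypothetical output feature names
exact, normalized = overlap_report(rows_a_to_b, cols_a_to_b)
print(exact)       # ['APOE']
print(normalized)  # ['APOE', 'PCSK9']
if len(exact) < 3:
    print("fewer than 3 exactly matching features")
```

A normalized overlap that is larger than the exact overlap suggests cleaning the names in the input files rather than changing the analysis settings.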
Performance Optimization¶
For large datasets:
- Use absolute_importance method (generally faster than self_importance_ratio)
- Provide specific threshold_params to skip threshold analysis
- Limit network size by using higher threshold values
- Consider running analysis on feature subsets for initial exploration