Baseline Methods¶
Overview¶
The cpiVAE framework includes implementations of established baseline methods for cross-platform proteomics imputation. These methods serve as benchmarks for evaluating the performance of the cpiVAE model and provide alternative approaches for cross-platform data harmonization.
K-Nearest Neighbors (KNN) Baseline¶
Script: run_knn_comparison.py¶
Implements K-nearest neighbors regression for cross-platform imputation, with a cross-validated grid search over the number of neighbors (k) and the weighting kernel.
Usage¶
python scripts/run_knn_comparison.py --platform_a PLATFORM_A_FILE --platform_b PLATFORM_B_FILE [OPTIONS]
Key Parameters¶
- --platform_a, --platform_b: Training data files for both platforms
- --platform_impute: Test data file for cross-platform imputation
- --impute_target: Target platform (a or b)
- --k_values: List of k values to test (default: [3,5,7,10,15,30,50,100,200])
- --kernel: Weighting function (uniform, distance, gaussian, exponential, tricube)
- --cv_folds: Cross-validation folds for parameter optimization
Algorithm Details¶
- Data Preprocessing: Optional log transformation and standardization
- Parameter Search: Grid search over k values and kernel functions
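- Cross-Validation: Cross-validation with --cv_folds folds for robust parameter selection
- Imputation: Neighbors are found in the source platform, and the imputed values are a weighted average over the k nearest neighbors
- Kernel Functions: Multiple weighting schemes, including Gaussian, exponential, and polynomial (tricube) kernels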
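The steps above can be illustrated with a minimal NumPy sketch. It is not the code in run_knn_comparison.py: the function names (kernel_weights, knn_impute, grid_search), the exact kernel formulas, the scaling of distances by the farthest retained neighbor, and the use of mean squared error as the cross-validation score are all assumptions made for illustration.

```python
import numpy as np

def kernel_weights(dist, kernel="gaussian"):
    """Turn neighbor distances (n_queries, k) into weights that sum to 1 per row."""
    dist = np.asarray(dist, dtype=float)
    # Assumed choice: scale distances by the farthest retained neighbor.
    scale = dist.max(axis=1, keepdims=True) + 1e-12
    u = dist / scale
    if kernel == "uniform":
        w = np.ones_like(dist)
    elif kernel == "distance":
        w = 1.0 / (dist + 1e-12)
    elif kernel == "gaussian":
        w = np.exp(-0.5 * u ** 2)
    elif kernel == "exponential":
        w = np.exp(-u)
    elif kernel == "tricube":
        w = (1.0 - np.clip(u, 0.0, 1.0) ** 3) ** 3 + 1e-12
    else:
        raise ValueError(f"unknown kernel: {kernel}")
    return w / w.sum(axis=1, keepdims=True)

def knn_impute(src_train, tgt_train, src_query, k=5, kernel="gaussian"):
    """Impute target-platform values from the k nearest source-platform neighbors."""
    # Pairwise Euclidean distances between query and training samples.
    d = np.linalg.norm(src_query[:, None, :] - src_train[None, :, :], axis=2)
    idx = np.argsort(d, axis=1)[:, :k]
    w = kernel_weights(np.take_along_axis(d, idx, axis=1), kernel)
    # Weighted average of the neighbors' target-platform profiles.
    return np.einsum("qk,qkp->qp", w, tgt_train[idx])

def grid_search(src, tgt, k_values, kernels, cv_folds=5, seed=0):
    """Pick (k, kernel) with the lowest mean squared error across CV folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(src)), cv_folds)
    best = None
    for k in k_values:
        for kernel in kernels:
            errs = []
            for fold in folds:
                train = np.setdiff1d(np.arange(len(src)), fold)
                pred = knn_impute(src[train], tgt[train], src[fold], k, kernel)
                errs.append(np.mean((pred - tgt[fold]) ** 2))
            score = float(np.mean(errs))
            if best is None or score < best[0]:
                best = (score, k, kernel)
    return {"k": best[1], "kernel": best[2], "cv_mse": best[0]}

# Synthetic paired data: two "platforms" driven by a shared latent signal.
rng = np.random.default_rng(0)
latent = rng.normal(size=(60, 3))
platform_a = latent @ rng.normal(size=(3, 8))
platform_b = latent @ rng.normal(size=(3, 6))

best = grid_search(platform_a[:50], platform_b[:50],
                   k_values=[3, 5, 7],
                   kernels=["uniform", "gaussian", "tricube"])
imputed_b = knn_impute(platform_a[:50], platform_b[:50], platform_a[50:],
                       k=best["k"], kernel=best["kernel"])
```

Here the first 50 samples play the role of the paired training files and the last 10 stand in for the --platform_impute file with --impute_target b. Log transformation and standardization are omitted to keep the sketch short.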
Example¶
python scripts/run_knn_comparison.py \
--platform_a data/olink_overlap_train.csv \
--platform_b data/somascan_overlap_train.csv \
--platform_impute data/olink_overlap_test.csv \
--impute_target b \
--kernel gaussian \
--output_dir outputs_knn
Output Files¶
- best_params.json: Optimal parameters from grid search
- cv_results.csv: Cross-validation performance for all parameter combinations
- {input_file}_cross_imputed_{target}.csv: Imputed data
- imputation_report_best.txt: Performance metrics and summary
Weighted Nearest Neighbors (WNN) Baseline¶
Script: wnn_baseline.py¶
Implements the Weighted Nearest Neighbors algorithm adapted from Hao et al. (2021), originally developed for single-cell multimodal data integration.
Usage¶
python scripts/wnn_baseline.py --platform_a PLATFORM_A_FILE --platform_b PLATFORM_B_FILE [OPTIONS]
Key Parameters¶
- --platform_a, --platform_b: Training data files
- --platform_impute: Test data for imputation
- --impute_target: Target platform for imputation
- --n_neighbors: Number of neighbors for graph construction (default: 20)
- --n_components: PCA components for dimensionality reduction (default: 50)
- --sigma: Bandwidth parameter for weight computation (default: 1.0)
Algorithm Details¶
- Dimensionality Reduction: PCA on both platforms
- Neighbor Graph: Construct k-nearest neighbor graphs
- Weight Computation: Jaccard similarity-based bandwidth estimation
- Graph Integration: Weighted combination of platform-specific graphs
- Imputation: Weighted averaging using integrated neighborhood structure
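The first three steps can be made concrete with a small NumPy sketch: PCA on each platform, a k-nearest neighbor search in each embedding, and the per-sample Jaccard similarity between the two neighbor sets. This is a simplified illustration, not the code in wnn_baseline.py or the full procedure of Hao et al. (2021); the helper names (pca_embed, knn_indices, neighbor_jaccard) are hypothetical, and how the script converts these similarities into bandwidths is not described here, so that step is left out.

```python
import numpy as np

def pca_embed(x, n_components):
    """Project centered data onto its top principal components via SVD."""
    xc = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    return xc @ vt[:n_components].T

def knn_indices(emb, n_neighbors):
    """Indices of each sample's nearest neighbors, excluding the sample itself."""
    d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :n_neighbors]

def neighbor_jaccard(idx_a, idx_b):
    """Per-sample Jaccard similarity between two neighbor sets."""
    out = np.empty(len(idx_a))
    for i, (a, b) in enumerate(zip(idx_a, idx_b)):
        sa, sb = set(a.tolist()), set(b.tolist())
        out[i] = len(sa & sb) / len(sa | sb)
    return out

# Synthetic paired data: both platforms measure the same 40 samples.
rng = np.random.default_rng(0)
latent = rng.normal(size=(40, 3))
platform_a = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(40, 10))
platform_b = latent @ rng.normal(size=(3, 12)) + 0.1 * rng.normal(size=(40, 12))

emb_a = pca_embed(platform_a, n_components=3)
emb_b = pca_embed(platform_b, n_components=3)
idx_a = knn_indices(emb_a, n_neighbors=5)
idx_b = knn_indices(emb_b, n_neighbors=5)

# Values near 1 mean the two platforms agree on a sample's neighborhood.
jaccard = neighbor_jaccard(idx_a, idx_b)
```

The dense pairwise distance matrix is fine for a toy example, but it scales quadratically with the number of samples.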
Key Features¶
- Bandwidth Adaptation: Automatic bandwidth selection based on local neighborhood density
- Graph Integration: Combines the platform-specific neighbor graphs using the computed weights
- Sparse Computation: Efficient handling of large datasets using sparse matrices
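As a rough illustration of graph integration with sparse matrices, the sketch below stores each platform's neighbor graph as a SciPy CSR matrix, mixes the two graphs row by row, and imputes by multiplying the combined graph with a matrix of known values. The function names (knn_graph, combine_graphs), the per-sample mixing weight weight_a, and the row normalization are assumptions for illustration; the script's actual weighting scheme may differ.

```python
import numpy as np
from scipy import sparse

def knn_graph(idx, weights, n_samples):
    """Build a sparse affinity matrix from neighbor indices and their weights."""
    rows = np.repeat(np.arange(idx.shape[0]), idx.shape[1])
    return sparse.csr_matrix(
        (weights.ravel(), (rows, idx.ravel())),
        shape=(idx.shape[0], n_samples),
    )

def combine_graphs(graph_a, graph_b, weight_a):
    """Row-wise convex combination of two graphs, followed by row normalization."""
    wa = sparse.diags(weight_a)
    wb = sparse.diags(1.0 - weight_a)
    combined = (wa @ graph_a + wb @ graph_b).tocsr()
    row_sums = np.asarray(combined.sum(axis=1)).ravel()
    row_sums[row_sums == 0] = 1.0  # avoid division by zero for empty rows
    return sparse.diags(1.0 / row_sums) @ combined

# Toy example: 4 samples, 2 neighbors each, unit affinities.
idx_a = np.array([[1, 2], [0, 2], [3, 1], [2, 0]])
idx_b = np.array([[3, 2], [0, 3], [3, 0], [1, 0]])
affinity = np.ones((4, 2))
graph_a = knn_graph(idx_a, affinity, n_samples=4)
graph_b = knn_graph(idx_b, affinity, n_samples=4)

# Hypothetical per-sample weight placed on platform A's graph.
weight_a = np.array([0.5, 0.8, 0.2, 1.0])
graph = combine_graphs(graph_a, graph_b, weight_a)

# Imputation as a weighted average of the neighbors' known values.
values = np.arange(8.0).reshape(4, 2)
imputed = graph @ values
```

Only the nonzero neighbor entries are stored, so memory grows with the number of samples times --n_neighbors rather than with the square of the sample count.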
Example¶
# WNN imputation from SomaScan to Olink
python scripts/wnn_baseline.py \
--platform_a data/olink_overlap_train.csv \
--platform_b data/somascan_overlap_train.csv \
--platform_impute data/somascan_overlap_test.csv \
--impute_target a \
--output_dir outputs_wnn
Output Files¶
- {input_file}_cross_imputed_{target}.csv: Imputed data
- imputation_metrics.json: Performance statistics
- wnn_parameters.json: Algorithm parameters used
- bandwidth_statistics.txt: Bandwidth computation summary