peach.pp
Preprocessing functions for archetypal analysis.
- peach.pp.load_data(path, use_raw=True, dim_reduction_key='X_PCA', batch_size=128)
Load AnnData for archetypal analysis.
Note: Use scanpy.pp.pca() to compute PCA coordinates after loading.
- Parameters:
- Returns:
Loaded data. Use sc.pp.pca(adata) to add PCA coordinates.
- Return type:
AnnData
- peach.pp.generate_synthetic(n_points=1000, n_dimensions=50, n_archetypes=4, noise=0.1, *, seed=1205, archetype_type='random', scale=20.0, return_torch=True)
Generate synthetic convex data for testing.
- Parameters:
n_points (int, default: 1000) – Number of data points to generate (matches _core parameter)
n_dimensions (int, default: 50) – Number of dimensions/features (matches _core parameter)
n_archetypes (int, default: 4) – Number of archetypes
noise (float, default: 0.1) – Noise level (matches _core parameter)
seed (int, default: 1205) – Random seed for reproducibility
archetype_type (str, default: "random") – Type of archetype generation (‘random’, ‘corners’, ‘sphere’)
scale (float, default: 20.0) – Scale factor for data generation
return_torch (bool, default: True) – Whether to return PyTorch tensors
- Returns:
Synthetic data with ground truth archetypes in .uns
- Return type:
AnnData
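The idea of "synthetic convex data" can be illustrated with a minimal NumPy sketch. It does not reproduce peach's implementation: the function name `sketch_synthetic_convex`, the use of Dirichlet weights, and the way `scale` and `noise` enter are assumptions made for illustration only. Each point is a convex combination of the archetypes plus Gaussian noise.

```python
import numpy as np


def sketch_synthetic_convex(n_points=1000, n_dimensions=50, n_archetypes=4,
                            noise=0.1, seed=1205, scale=20.0):
    """Hypothetical stand-in for convex synthetic data (not peach's own code)."""
    rng = np.random.default_rng(seed)
    # Archetypes: random vertices, stretched by `scale` (assumed role of scale).
    archetypes = rng.standard_normal((n_archetypes, n_dimensions)) * scale
    # Convex weights: every row is non-negative and sums to 1.
    weights = rng.dirichlet(np.ones(n_archetypes), size=n_points)
    # Each point is a weighted mix of archetypes plus additive Gaussian noise.
    X = weights @ archetypes
    X = X + noise * rng.standard_normal((n_points, n_dimensions))
    return X, archetypes, weights


X, archetypes, weights = sketch_synthetic_convex()
print(X.shape, archetypes.shape, weights.shape)
```

Because the weights lie on the simplex, the noiseless points fall inside the convex hull of the archetypes, which is the ground truth an archetypal model is expected to recover.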
- peach.pp.prepare_training(adata, batch_size=128, shuffle=True, pca_key=None, num_workers='auto', pin_memory='auto', persistent_workers='auto', prefetch_factor=2)
Create DataLoader from AnnData for training with HPC optimizations.
- Parameters:
adata (AnnData) – Annotated data object with PCA coordinates
batch_size (int, default: 128) – Batch size for training
shuffle (bool, default: True) – Whether to shuffle data in DataLoader
pca_key (str, default: None) – Key in adata.obsm containing PCA coordinates (auto-detected if None)
num_workers (int or 'auto', default: 'auto') – Number of subprocesses for data loading. ‘auto’ detects optimal value based on environment (0 for Apple Silicon, 6 for HPC, 2 for local)
pin_memory (bool or 'auto', default: 'auto') – Use pinned memory for faster GPU transfer. ‘auto’ sets True if CUDA available
persistent_workers (bool or 'auto', default: 'auto') – Keep workers alive between epochs. ‘auto’ sets True if num_workers > 0
prefetch_factor (int, default: 2) – Number of batches loaded in advance by each worker
- Returns:
PyTorch DataLoader optimized for the execution environment
- Return type:
DataLoader
Examples
>>> # Auto-detect optimal settings
>>> dataloader = peach.pp.prepare_training(adata)
>>> # Force HPC settings
>>> dataloader = peach.pp.prepare_training(adata, num_workers=8, pin_memory=True)
>>> # Minimal settings for debugging
>>> dataloader = peach.pp.prepare_training(adata, num_workers=0)
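The 'auto' rules described in the parameter list can be sketched in plain Python. This is an illustration under stated assumptions, not peach's code: the helper `resolve_loader_settings` is hypothetical, HPC detection via the `SLURM_JOB_ID` environment variable is a guess, and so is the choice to let Apple Silicon take precedence when both conditions hold.

```python
import os
import platform


def resolve_loader_settings(num_workers="auto", pin_memory="auto",
                            persistent_workers="auto", *,
                            cuda_available=False, on_hpc=None,
                            apple_silicon=None):
    """Hypothetical helper mirroring the documented 'auto' rules."""
    if apple_silicon is None:
        apple_silicon = (platform.system() == "Darwin"
                         and platform.machine() == "arm64")
    if on_hpc is None:
        # Assumption: a SLURM job id marks an HPC environment.
        on_hpc = "SLURM_JOB_ID" in os.environ
    if num_workers == "auto":
        # Documented values: 0 for Apple Silicon, 6 for HPC, 2 for local.
        num_workers = 0 if apple_silicon else (6 if on_hpc else 2)
    if pin_memory == "auto":
        # Documented rule: True if CUDA is available.
        pin_memory = bool(cuda_available)
    if persistent_workers == "auto":
        # Documented rule: True if num_workers > 0.
        persistent_workers = num_workers > 0
    return {
        "num_workers": num_workers,
        "pin_memory": pin_memory,
        "persistent_workers": persistent_workers,
    }


# Explicit flags keep the demonstration independent of the host machine.
hpc_gpu = resolve_loader_settings(cuda_available=True, on_hpc=True,
                                  apple_silicon=False)
print(hpc_gpu)
```

Explicit values always win over 'auto', so passing `num_workers=0` for debugging also switches the derived `persistent_workers` to False.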
- peach.pp.prepare_atacseq(adata, *, n_components=50, drop_first=True, log_tf=True, store_key='X_lsi', random_state=42)
TF-IDF + LSI preprocessing for scATAC-seq peak count data.
Computes TF-IDF normalization followed by Latent Semantic Indexing (truncated SVD), the standard dimensionality reduction for chromatin accessibility data. The resulting embeddings can be used directly with pc.tl.train_archetypal(adata, pca_key='X_lsi').
- Parameters:
adata (AnnData) – Annotated data object with peak count matrix in adata.X. Typically a sparse [n_cells, n_peaks] matrix from scATAC-seq.
n_components (int, default: 50) – Number of LSI components to compute. 30-50 is standard for scATAC-seq.
drop_first (bool, default: True) – Drop first SVD component. The first component in scATAC-seq LSI typically captures sequencing depth rather than biological signal.
log_tf (bool, default: True) – Use log(1 + TF) variant of term frequency. Standard in scATAC-seq preprocessing to reduce influence of high-count peaks.
store_key (str, default: "X_lsi") – Key in adata.obsm to store the LSI embeddings.
random_state (int, default: 42) – Random seed for reproducibility of truncated SVD.
- Returns:
Modifies adata in place:
adata.obsm[store_key]: LSI embeddings [n_cells, n_components]
adata.uns['lsi']: dict with ‘variance_ratio’ and ‘components’
- Return type:
None
Examples
>>> import peach as pc
>>> import scanpy as sc
>>> adata = sc.read_h5ad("scatac_peaks.h5ad")
>>> pc.pp.prepare_atacseq(adata, n_components=30)
>>> results = pc.tl.train_archetypal(adata, n_archetypes=5, pca_key="X_lsi")
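For readers unfamiliar with TF-IDF + LSI, the pipeline can be sketched on a dense NumPy array. This is an assumption-laden illustration, not what peach runs internally: the function `sketch_tfidf_lsi` is hypothetical, the IDF formula log(1 + n_cells / peak_total) is one of several common variants, and a full dense SVD stands in for the truncated SVD used on sparse data. In this sketch, dropping the first component leaves n_components - 1 columns; the library may handle that differently.

```python
import numpy as np


def sketch_tfidf_lsi(counts, n_components=50, drop_first=True, log_tf=True):
    """Hypothetical dense TF-IDF + LSI sketch (not peach's implementation)."""
    counts = np.asarray(counts, dtype=float)
    n_cells = counts.shape[0]
    # Term frequency: peak counts divided by each cell's total count.
    cell_totals = np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    tf = counts / cell_totals
    if log_tf:
        tf = np.log1p(tf)  # the log(1 + TF) variant
    # Inverse document frequency (assumed variant): rarer peaks weigh more.
    peak_totals = np.maximum(counts.sum(axis=0), 1.0)
    idf = np.log1p(n_cells / peak_totals)
    tfidf = tf * idf
    # LSI: singular value decomposition of the TF-IDF matrix.
    U, S, _ = np.linalg.svd(tfidf, full_matrices=False)
    k = min(n_components, S.shape[0])
    embeddings = U[:, :k] * S[:k]
    variance_ratio = S[:k] ** 2 / np.sum(S ** 2)
    if drop_first:
        # The first component often tracks sequencing depth.
        embeddings = embeddings[:, 1:]
    return embeddings, variance_ratio


rng = np.random.default_rng(42)
toy_counts = rng.poisson(0.3, size=(200, 500))  # toy [n_cells, n_peaks] matrix
lsi, var_ratio = sketch_tfidf_lsi(toy_counts, n_components=10)
print(lsi.shape, var_ratio.shape)
```

On real scATAC-seq matrices, which are large and sparse, a truncated solver would replace the dense SVD used here.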
- peach.pp.load_pathway_networks(sources=['c5_bp'], *, organism='human', geneset_repo='msigdb', verbose=True, **kwargs)
Load pathway networks from MSigDB or OmniPath.
- Parameters:
sources (List[str], default: ["c5_bp"]) – Pathway sources to load. MSigDB collections: ‘hallmark’, ‘c2_cp’, ‘c2_cgp’, ‘c3_mir’, ‘c5_bp’, ‘c5_cc’, ‘c5_mf’, ‘c8’
organism (str, default: "human") – Organism to load pathways for: ‘human’ or ‘mouse’
geneset_repo (str, default: "msigdb") – Repository to use: ‘msigdb’ (recommended) or ‘omnipath’
verbose (bool, default: True) – Whether to print loading progress
**kwargs – Additional arguments passed to load_pathway_networks
- Returns:
Pathway network with ‘source’, ‘target’, ‘pathway’ columns
- Return type:
pd.DataFrame
- peach.pp.compute_pathway_scores(adata, net=None, use_layer=None, obsm_key='pathway_scores', verbose=True)
Compute pathway activity scores using MSigDB pathways.
- Parameters:
adata (AnnData) – Annotated data object
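net (pd.DataFrame, optional) – Pathway network dataframe, such as the output of peach.pp.load_pathway_networks. If None, a network is loaded automatically; this function has no sources argument, so call load_pathway_networks directly to choose collections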
use_layer (str, optional) – Layer in adata to use for scoring
obsm_key (str, default: "pathway_scores") – Key in adata.obsm to store pathway scores
verbose (bool, default: True) – Whether to print progress
- Return type:
Modules
Basic preprocessing functions for archetypal analysis.