Tutorial 09: scATAC-seq Archetypal Analysis#

This tutorial walks through end-to-end archetypal analysis on scATAC-seq data.

What you’ll learn:

  1. TF-IDF + LSI preprocessing for chromatin accessibility data

  2. Hyperparameter search with LSI embeddings

  3. Training and evaluating the final model

  4. Archetype distances, positions, and assignments

  5. Peak associations and characterization

Key difference from scRNA-seq: scATAC-seq uses TF-IDF + LSI (Truncated SVD) instead of log-normalization + PCA. PEACH handles this seamlessly — just set pca_key='X_lsi' when training.

Requirements:

  • peach (v0.4.0+)

  • scanpy

  • Data: data/ovary_ATAC.h5ad (18K cells x 19K peaks)

WARNING: The current results are not biologically interpretable. Note the lack of model convergence and low archetypal \(R^2\) obtained. This is a demonstration only. Future work will test archetype analysis from raw ATAC peaks to preserve rank structure in LSI dimensional reduction.

[2]:
import scanpy as sc
import peach as pc
import numpy as np
from pathlib import Path

print(f"PEACH version: {pc.__version__}")
PEACH version: 0.4.0

Step 1: Load scATAC-seq Data#

scATAC-seq data is a sparse peak-by-cell matrix where each entry represents the number of Tn5 insertions in a genomic peak for a given cell. These are typically very sparse (95-99% zeros) and high-dimensional.

[10]:
data_path = Path("data/ovary_ATAC.h5ad")
adata = sc.read_h5ad('/Users/honkala/Desktop/PEACH_public/data/ovary_ATAC.h5ad')

print(f"Shape: {adata.n_obs:,} cells x {adata.n_vars:,} peaks")
print(f"Existing obsm keys: {list(adata.obsm.keys())}")
print(f"Existing obs columns: {list(adata.obs.columns[:10])}")

# Check sparsity
import scipy.sparse as sp
if sp.issparse(adata.X):
    density = adata.X.nnz / (adata.n_obs * adata.n_vars)
    print(f"Matrix density: {density:.1%} non-zero")
Shape: 18,315 cells x 19,281 peaks
Existing obsm keys: ['X_harmony', 'X_lsi', 'X_umap']
Existing obs columns: ['mapped_reference_assembly', 'alignment_software', 'donor_id', 'self_reported_ethnicity_ontology_term_id', 'donor_living_at_sample_collection', 'donor_menopausal_status', 'sample_uuid', 'sample_preservation_method', 'tissue_ontology_term_id', 'development_stage_ontology_term_id']
Matrix density: 24.9% non-zero

Step 2: TF-IDF + LSI Preprocessing#

pc.pp.prepare_atacseq() performs:

  1. TF-IDF normalization: Term-frequency * inverse-document-frequency weights each peak by how informative it is across the dataset.

  2. LSI (Latent Semantic Indexing): Truncated SVD dimensionality reduction. The first component is dropped by default because it typically captures sequencing depth rather than biological variation.

The result is stored in adata.obsm['X_lsi'], analogous to X_pca for RNA.

[11]:
# If pre-computed LSI exists, we can use it directly
# To force recomputation: del adata.obsm['X_lsi']

if "X_lsi" not in adata.obsm:
    pc.pp.prepare_atacseq(
        adata,
        n_components=50,   # 30-50 standard for scATAC
        drop_first=True,   # Drop depth component
    )
else:
    print(f"Using pre-computed X_lsi: {adata.obsm['X_lsi'].shape}")

print(f"\nLSI embeddings: {adata.obsm['X_lsi'].shape}")
if 'lsi' in adata.uns:
    var_ratio = adata.uns['lsi']['variance_ratio']
    print(f"Variance explained: {var_ratio.sum()*100:.1f}% total")
    print(f"  Top 5 components: {var_ratio[:5]*100}")
Using pre-computed X_lsi: (18315, 50)

LSI embeddings: (18315, 50)

Step 4: Train Final Model#

Train with the selected number of archetypes. Remember to set pca_key='X_lsi'.

[14]:
n_archetypes = 5  # Adjust based on elbow plot / CV results

results = pc.tl.train_archetypal(
    adata,
    n_archetypes=n_archetypes,
    n_epochs=50,
    hidden_dims=[256, 128, 64],
    pca_key="X_lsi",
    device="cpu",
)

print(f"Final archetype R²: {results.get('final_archetype_r2', 'N/A')}")
[OK] Using specified PCA coordinates: adata.obsm['X_lsi'] (18315, 50)
[STATS] DataLoader created: 18315 cells Ă— 50 PCA components
   Config: batch_size=128, workers=0 (Apple Silicon)
Archetypes parameter registered: True
Archetypes requires_grad: True
Deep_AA (Deep Archetypal Analysis) initialized:
  - Single-stage architecture (like Deep_2)
  - Inflation factor: 1.5
  - Direct archetypal coordinates (no bottleneck)
 Initializing with PCHA + inflation_factor=1.5...

 Consolidated Archetype Initialization
   PCHA: True, Inflation: True (factor: 1.5)
   Test inflation: False
Running PCHA initialization...
  Input shape: (1000, 50)
  Target archetypes: 5
Running PCHA with 5 archetypes...
Data shape for PCHA: (50, 1000)
PCHA Results:
  Archetypes shape: (5, 50)
  Archetype R²: 0.1430
  SSE: 42919.9264
  PCHA archetype R²: 0.1430
  Archetype shape: (5, 50)
[OK] Initialized 5 archetypes using PCHA

 Scalar Archetypal Inflation (factor: 1.50)
   [OK] Inflation complete
      [STATS] Positioning verification:
         Data radius (max): 114.600
         Data radius (mean): 6.289
         Archetype distances: 5.203 to 53.422
         Archetypes outside data: 0/5
         Min archetype separation: 7.583
[OK] Archetype initialization complete (success: True)
[OK] PCHA + inflation initialization successful!
Starting training for 50 epochs...
Device: cpu
Archetypal weight: 0.9, KLD weight: 0.1, Reconstruction weight: 0.0
  (Model configured: arch=0.9, kld=0.1)
Tracking stability: True, Validating constraints: True

Epoch 0 Debug:
z row sums (should be ~1.0): 1.0000 ± 0.0000
z stats: min=0.0042, max=0.8393, mean=0.2000
Batch reconstruction MSE: 1.3985
Archetype stats: min=-5.4128, max=3.3228
Archetype gradients: norm=0.032672, mean=0.001352

Epoch 1/50
Average loss: 0.9056
Archetypal loss: 1.0030
KLD loss: 0.0289
Reconstruction loss: 1.0030
Archetype R²: -0.0154
Archetype transform grad norm: 0.275696
Archetype transform param norm: 5.7963
Constraints satisfied: True
Constraint violation rate: 0.000
Archetype drift (mean): 0.000000

Epoch 2 Debug:
z row sums (should be ~1.0): 1.0000 ± 0.0000
z stats: min=0.0007, max=0.9910, mean=0.2000
Batch reconstruction MSE: 1.5985
Archetype stats: min=-7.0265, max=6.7942
Archetype change since last debug: 25.657192
Archetype gradients: norm=0.043958, mean=0.002055

Epoch 4 Debug:
z row sums (should be ~1.0): 1.0000 ± 0.0000
z stats: min=0.0048, max=0.9139, mean=0.2000
Batch reconstruction MSE: 1.9805
Archetype stats: min=-12.5378, max=12.9206
Archetype change since last debug: 22.271404
Archetype gradients: norm=0.104397, mean=0.004766

Epoch 6/50
Average loss: 0.8371
Archetypal loss: 0.9187
KLD loss: 0.1028
Reconstruction loss: 0.9187
Archetype R²: 0.0753
Archetype transform grad norm: 0.318739
Archetype transform param norm: 7.2826

Epoch 6 Debug:
z row sums (should be ~1.0): 1.0000 ± 0.0000
z stats: min=0.0081, max=0.7607, mean=0.2000
Batch reconstruction MSE: 1.7452
Archetype stats: min=-13.9404, max=14.1674
Archetype change since last debug: 9.889637
Archetype gradients: norm=0.118433, mean=0.005023

Epoch 8 Debug:
z row sums (should be ~1.0): 1.0000 ± 0.0000
z stats: min=0.0264, max=0.7426, mean=0.2000
Batch reconstruction MSE: 1.5404
Archetype stats: min=-14.7360, max=14.6087
Archetype change since last debug: 5.726437
Archetype gradients: norm=0.105707, mean=0.004238

Epoch 10 Debug:
z row sums (should be ~1.0): 1.0000 ± 0.0000
z stats: min=0.0221, max=0.6078, mean=0.2000
Batch reconstruction MSE: 1.3963
Archetype stats: min=-15.6068, max=15.5202
Archetype change since last debug: 4.754302
Archetype gradients: norm=0.088161, mean=0.003758

Epoch 11/50
Average loss: 0.8286
Archetypal loss: 0.9114
KLD loss: 0.0838
Reconstruction loss: 0.9114
Archetype R²: 0.0811
Archetype transform grad norm: 0.320099
Archetype transform param norm: 7.6759
Constraints satisfied: True
Constraint violation rate: 0.000
Archetype drift (mean): 2.023919

Epoch 12 Debug:
z row sums (should be ~1.0): 1.0000 ± 0.0000
z stats: min=0.0272, max=0.6957, mean=0.2000
Batch reconstruction MSE: 1.3871
Archetype stats: min=-16.5924, max=16.6468
Archetype change since last debug: 4.278627
Archetype gradients: norm=0.066130, mean=0.002976

Epoch 14 Debug:
z row sums (should be ~1.0): 1.0000 ± 0.0000
z stats: min=0.0398, max=0.4538, mean=0.2000
Batch reconstruction MSE: 1.2673
Archetype stats: min=-17.2636, max=17.5290
Archetype change since last debug: 3.775030
Archetype gradients: norm=0.056299, mean=0.002323

Epoch 16/50
Average loss: 0.8286
Archetypal loss: 0.9118
KLD loss: 0.0798
Reconstruction loss: 0.9118
Archetype R²: 0.0831
Archetype transform grad norm: 0.333253
Archetype transform param norm: 7.9429

Epoch 16 Debug:
z row sums (should be ~1.0): 1.0000 ± 0.0000
z stats: min=0.0505, max=0.5187, mean=0.2000
Batch reconstruction MSE: 1.4090
Archetype stats: min=-17.7244, max=17.9380
Archetype change since last debug: 3.002614
Archetype gradients: norm=0.100630, mean=0.004259

Epoch 18 Debug:
z row sums (should be ~1.0): 1.0000 ± 0.0000
z stats: min=0.0345, max=0.4531, mean=0.2000
Batch reconstruction MSE: 1.3097
Archetype stats: min=-17.8201, max=18.2691
Archetype change since last debug: 2.077428
Archetype gradients: norm=0.103521, mean=0.003866

Epoch 20 Debug:
z row sums (should be ~1.0): 1.0000 ± 0.0000
z stats: min=0.0531, max=0.4646, mean=0.2000
Batch reconstruction MSE: 1.5179
Archetype stats: min=-18.7622, max=18.7939
Archetype change since last debug: 2.759326
Archetype gradients: norm=0.062463, mean=0.003142

Epoch 21/50
Average loss: 0.8304
Archetypal loss: 0.9138
KLD loss: 0.0804
Reconstruction loss: 0.9138
Archetype R²: 0.0841
Archetype transform grad norm: 0.329047
Archetype transform param norm: 8.0555
Constraints satisfied: True
Constraint violation rate: 0.000
Archetype drift (mean): 1.147955

Epoch 22 Debug:
z row sums (should be ~1.0): 1.0000 ± 0.0000
z stats: min=0.0608, max=0.4262, mean=0.2000
Batch reconstruction MSE: 1.4698
Archetype stats: min=-19.4325, max=19.3537
Archetype change since last debug: 2.749844
Archetype gradients: norm=0.079130, mean=0.003355

Epoch 24 Debug:
z row sums (should be ~1.0): 1.0000 ± 0.0000
z stats: min=0.0533, max=0.4892, mean=0.2000
Batch reconstruction MSE: 1.4084
Archetype stats: min=-18.8239, max=19.2779
Archetype change since last debug: 1.956011
Archetype gradients: norm=0.075708, mean=0.003156

Epoch 26/50
Average loss: 0.8246
Archetypal loss: 0.9077
KLD loss: 0.0766
Reconstruction loss: 0.9077
Archetype R²: 0.0846
Archetype transform grad norm: 0.317719
Archetype transform param norm: 8.1480

Epoch 26 Debug:
z row sums (should be ~1.0): 1.0000 ± 0.0000
z stats: min=0.0682, max=0.4923, mean=0.2000
Batch reconstruction MSE: 1.3608
Archetype stats: min=-18.7218, max=18.9529
Archetype change since last debug: 1.950716
Archetype gradients: norm=0.070516, mean=0.003153

Epoch 28 Debug:
z row sums (should be ~1.0): 1.0000 ± 0.0000
z stats: min=0.0673, max=0.4400, mean=0.2000
Batch reconstruction MSE: 1.3485
Archetype stats: min=-18.6983, max=18.8662
Archetype change since last debug: 2.215880
Archetype gradients: norm=0.076400, mean=0.003763

Epoch 30 Debug:
z row sums (should be ~1.0): 1.0000 ± 0.0000
z stats: min=0.0719, max=0.4162, mean=0.2000
Batch reconstruction MSE: 1.2096
Archetype stats: min=-18.8647, max=19.0529
Archetype change since last debug: 1.789942
Archetype gradients: norm=0.065928, mean=0.003180

Epoch 31/50
Average loss: 0.8240
Archetypal loss: 0.9074
KLD loss: 0.0735
Reconstruction loss: 0.9074
Archetype R²: 0.0857
Archetype transform grad norm: 0.312012
Archetype transform param norm: 8.2404
Constraints satisfied: True
Constraint violation rate: 0.000
Archetype drift (mean): 0.852015

Epoch 32 Debug:
z row sums (should be ~1.0): 1.0000 ± 0.0000
z stats: min=0.0493, max=0.5094, mean=0.2000
Batch reconstruction MSE: 1.5604
Archetype stats: min=-19.2614, max=19.5202
Archetype change since last debug: 1.909075
Archetype gradients: norm=0.094676, mean=0.004720

Epoch 34 Debug:
z row sums (should be ~1.0): 1.0000 ± 0.0000
z stats: min=0.0682, max=0.4532, mean=0.2000
Batch reconstruction MSE: 1.3119
Archetype stats: min=-18.8431, max=19.6002
Archetype change since last debug: 1.879062
Archetype gradients: norm=0.077832, mean=0.003350

Epoch 36/50
Average loss: 0.8260
Archetypal loss: 0.9096
KLD loss: 0.0739
Reconstruction loss: 0.9096
Archetype R²: 0.0852
Archetype transform grad norm: 0.327810
Archetype transform param norm: 8.2964

Epoch 36 Debug:
z row sums (should be ~1.0): 1.0000 ± 0.0000
z stats: min=0.0563, max=0.5247, mean=0.2000
Batch reconstruction MSE: 1.2099
Archetype stats: min=-19.0733, max=19.8226
Archetype change since last debug: 1.836334
Archetype gradients: norm=0.101521, mean=0.003856

Epoch 38 Debug:
z row sums (should be ~1.0): 1.0000 ± 0.0000
z stats: min=0.0600, max=0.4864, mean=0.2000
Batch reconstruction MSE: 1.1851
Archetype stats: min=-19.5609, max=19.9839
Archetype change since last debug: 1.655625
Archetype gradients: norm=0.061999, mean=0.002747

Epoch 40 Debug:
z row sums (should be ~1.0): 1.0000 ± 0.0000
z stats: min=0.0482, max=0.4198, mean=0.2000
Batch reconstruction MSE: 1.1975
Archetype stats: min=-20.3783, max=20.5837
Archetype change since last debug: 2.175887
Archetype gradients: norm=0.060150, mean=0.002963

Epoch 41/50
Average loss: 0.8233
Archetypal loss: 0.9069
KLD loss: 0.0708
Reconstruction loss: 0.9069
Archetype R²: 0.0856
Archetype transform grad norm: 0.296459
Archetype transform param norm: 8.3410
Constraints satisfied: True
Constraint violation rate: 0.000
Archetype drift (mean): 0.756053

Epoch 42 Debug:
z row sums (should be ~1.0): 1.0000 ± 0.0000
z stats: min=0.0570, max=0.4461, mean=0.2000
Batch reconstruction MSE: 1.3043
Archetype stats: min=-20.7393, max=20.8726
Archetype change since last debug: 1.760280
Archetype gradients: norm=0.074333, mean=0.002672

Epoch 44 Debug:
z row sums (should be ~1.0): 1.0000 ± 0.0000
z stats: min=0.0442, max=0.4004, mean=0.2000
Batch reconstruction MSE: 1.2655
Archetype stats: min=-20.7504, max=20.8811
Archetype change since last debug: 1.480756
Archetype gradients: norm=0.125862, mean=0.004937

Epoch 46/50
Average loss: 0.8254
Archetypal loss: 0.9089
KLD loss: 0.0745
Reconstruction loss: 0.9089
Archetype R²: 0.0836
Archetype transform grad norm: 0.316347
Archetype transform param norm: 8.3383

Epoch 46 Debug:
z row sums (should be ~1.0): 1.0000 ± 0.0000
z stats: min=0.0736, max=0.3772, mean=0.2000
Batch reconstruction MSE: 1.3121
Archetype stats: min=-20.4987, max=20.4097
Archetype change since last debug: 2.464167
Archetype gradients: norm=0.066619, mean=0.003248

Epoch 48 Debug:
z row sums (should be ~1.0): 1.0000 ± 0.0000
z stats: min=0.0815, max=0.4578, mean=0.2000
Batch reconstruction MSE: 1.2338
Archetype stats: min=-20.9418, max=20.7715
Archetype change since last debug: 1.910217
Archetype gradients: norm=0.082456, mean=0.003522

Epoch 49 Debug:
z row sums (should be ~1.0): 1.0000 ± 0.0000
z stats: min=0.0341, max=0.5453, mean=0.2000
Batch reconstruction MSE: 2.0755
Archetype stats: min=-21.2075, max=20.9727
Archetype change since last debug: 1.373140
Archetype gradients: norm=0.085136, mean=0.004253

Epoch 50/50
Average loss: 0.8275
Archetypal loss: 0.9114
KLD loss: 0.0725
Reconstruction loss: 0.9114
Archetype R²: 0.0859
Archetype transform grad norm: 0.310094
Archetype transform param norm: 8.3717
Constraints satisfied: True
Constraint violation rate: 0.000
Archetype drift (mean): 0.871960

============================================================
TRAINING COMPLETED
============================================================
   [OK] Stored archetype coordinates in adata.uns['archetype_coordinates']: torch.Size([5, 50])

Final Performance:
  loss: 0.8275 (range: 0.8229 - 0.9056)
  archetypal_loss: 0.9114 (range: 0.9066 - 1.0030)
  kld_loss: 0.0725 (range: 0.0289 - 0.1110)
  reconstruction_loss: 0.9114 (range: 0.9066 - 1.0030)
  archetype_r2: 0.0859 (range: -0.0154 - 0.0865)

Archetype Transform Summary:
  archetype_transform_mean: -0.000536 (range: -0.000585 - 0.000384)
  archetype_transform_std: 0.117237 (range: 0.081172 - 0.117457)
  archetype_transform_norm: 8.371658 (range: 5.796302 - 8.387358)
  archetype_transform_grad_norm: 0.310094 (range: 0.252822 - 0.338460)
  Transform mean change: 0.000302
  [WARNING] Archetype transform might not be learning enough

Final Stability Metrics:
  archetype_drift_mean: 0.871960 (range: 0.000000 - 12.964853)
  archetype_variance_mean: 0.008226 (range: 0.000000 - 1.827807)

Final Constraint Status:
  A matrix sum error: 0.000000
  B matrix sum error: 0.000000
  Constraints satisfied: True
Final archetype R²: 0.08587823228703605
[15]:
# Training metrics
pc.pl.training_metrics(results["history"])

Data type cannot be displayed: application/vnd.plotly.v1+json

Data type cannot be displayed: application/vnd.plotly.v1+json

Step 5: Archetype Distances, Positions & Assignments#

Compute distances from each cell to each archetype in LSI space, then assign cells to their nearest archetype.

[16]:
# Compute distances in LSI space
pc.tl.archetypal_coordinates(adata, pca_key="X_lsi")

print(f"Archetype positions: {adata.uns['archetype_coordinates'].shape}")
print(f"Distance matrix: {adata.obsm['archetype_distances'].shape}")
 Computing archetype distances in PCA space...
   Canonical reference: adata.obs.index (18315 cells)
   Found PCA coordinates: X_lsi (18315, 50)
   Found archetype coordinates: archetype_coordinates (5, 50)
 Computing pairwise distances in PCA space...
[OK] Distance computation complete
   Distance matrix shape: (18315, 5)
[OK] Stored in AnnData:
   adata.obsm['archetype_distances']: (18315, 5) distance matrix
   adata.uns['archetype_positions']: (5, 50) archetype positions
   adata.uns['archetype_distance_info']: distance computation metadata

[STATS] Distance Statistics:
   Nearest archetype distribution:
      Archetype 0: 2795 cells (15.3%), mean distance: 33.5142
      Archetype 1: 451 cells (2.5%), mean distance: 34.2035
      Archetype 2: 5508 cells (30.1%), mean distance: 33.3467
      Archetype 3: 9552 cells (52.2%), mean distance: 33.0764
      Archetype 4: 9 cells (0.0%), mean distance: 34.7673
   Overall statistics:
      Mean nearest distance: 33.2531
      Distance range: [28.884, 121.268]
Archetype positions: (5, 50)
Distance matrix: (18315, 5)
[17]:
# Assign cells to archetypes (top 10% closest per archetype)
pc.tl.assign_archetypes(adata, percentage_per_archetype=0.1)
adata.obs["archetypes"].value_counts()
 AnnData-centric archetype binning...
   Distance matrix: (18315, 5) (from adata.obsm['archetype_distances'])
   Canonical cell reference: adata.obs.index (18315 cells)
   Selecting top 1831 cells (10.0%) per archetype
   INCLUDING central archetype_0 (generalist cells)
   Archetype 0 (central): 1831 cells, centroid distance range: [34.8888, 34.9877], mean: 34.9611
   Archetype 1: 1831 cells, distance range: [30.8723, 33.4236], mean: 33.0499
   Archetype 2: 1831 cells, distance range: [31.7376, 35.0504], mean: 34.4331
   Archetype 3: 1831 cells, distance range: [30.5359, 33.0230], mean: 32.6955
   Archetype 4: 1831 cells, distance range: [28.8842, 32.4825], mean: 31.8148
   Archetype 5: 1831 cells, distance range: [34.3878, 36.4911], mean: 36.1245

[STATS] Assignment Summary:
   Total cells: 18315
   Archetype 0 (central): 1831 cells (10.0%)
   Archetype 1: 1831 cells (10.0%)
   Archetype 2: 1831 cells (10.0%)
   Archetype 3: 1831 cells (10.0%)
   Archetype 4: 1831 cells (10.0%)
   Archetype 5: 1831 cells (10.0%)
   No archetype: 8816 cells (48.1%)
   [WARNING]  Overlapping assignments: 2923 cells

[OK] Stored assignments in adata.obs['archetypes']:
   Categories: ['archetype_0', 'archetype_1', 'archetype_2', 'archetype_3', 'archetype_4', 'archetype_5', 'no_archetype']
   no_archetype: 8816 cells (48.1%)
   archetype_1: 1671 cells (9.1%)
   archetype_2: 1665 cells (9.1%)
   archetype_4: 1643 cells (9.0%)
   archetype_5: 1615 cells (8.8%)
   archetype_3: 1541 cells (8.4%)
   archetype_0: 1364 cells (7.4%)
[17]:
archetypes
no_archetype    8816
archetype_1     1671
archetype_2     1665
archetype_4     1643
archetype_5     1615
archetype_3     1541
archetype_0     1364
Name: count, dtype: int64
[18]:
# Extract barycentric weight matrix (A matrix: rows sum to 1)
weights = pc.tl.extract_archetype_weights(adata, pca_key="X_lsi")
print(f"Weight matrix: {weights.shape}")
print(f"Row sums: min={weights.sum(axis=1).min():.4f}, max={weights.sum(axis=1).max():.4f}")
[STATS] Using model from adata.uns['trained_model']
[STATS] Extracting weights for 18315 cells...
   PCA shape: (18315, 50)
   Device: cpu
   Batch size: 256
   Processed 18315/18315 cells...
[OK] Stored cell weights in adata.obsm['cell_archetype_weights']
   Shape: (18315, 5)
   Range: [0.059, 0.471]
   Mean sum: 1.0000

[STATS] Archetype weight statistics:
   Archetype 0: mean=0.200, std=0.022, max=0.372, dominant in 0 cells
   Archetype 1: mean=0.200, std=0.025, max=0.340, dominant in 0 cells
   Archetype 2: mean=0.200, std=0.023, max=0.471, dominant in 0 cells
   Archetype 3: mean=0.200, std=0.024, max=0.410, dominant in 0 cells
   Archetype 4: mean=0.200, std=0.024, max=0.284, dominant in 0 cells
Weight matrix: (18315, 5)
Row sums: min=1.0000, max=1.0000
[19]:
# Visualize archetype positions
pc.pl.archetype_positions(adata)
[19]:
<Figure size 1500x600 with 3 Axes>
[20]:
pc.pl.archetype_positions_3d(adata)
[20]:
<Figure size 1200x1000 with 2 Axes>
[21]:
# Archetype usage statistics
pc.pl.archetype_statistics(adata)
[STATS] Archetype Statistics
==================================================
Number of archetypes: 5
Embedding dimensions: 50

Distance statistics:
  Mean distance: 55.0217
  Std distance:  2.5060
  Min distance:  51.7099
  Max distance:  61.1572
  Range:         9.4474

Nearest archetypes:  A2 - A4
Farthest archetypes: A2 - A5
[21]:
{'n_archetypes': 5,
 'n_dimensions': 50,
 'mean_distance': 55.02172860349238,
 'std_distance': 2.5060175772077207,
 'min_distance': 51.70987714135885,
 'max_distance': 61.15724285281022,
 'distance_range': 9.447365711451368,
 'distance_matrix': array([[ 0.        , 54.91483488, 53.69872152, 53.00294838, 56.02309706],
        [54.91483488,  0.        , 55.20350812, 51.70987714, 61.15724285],
        [53.69872152, 55.20350812,  0.        , 53.0033917 , 54.70887264],
        [53.00294838, 51.70987714, 53.0033917 ,  0.        , 56.79479174],
        [56.02309706, 61.15724285, 54.70887264, 56.79479174,  0.        ]]),
 'nearest_pair': (1, 3),
 'farthest_pair': (1, 4),
 'hull_volume': None,
 'hull_area': None}

Step 6: Characterization#

Peak Associations#

Test which peaks are differentially accessible per archetype. In scATAC-seq, peaks represent open chromatin regions, so archetype-associated peaks reveal the regulatory programs driving each extreme cell state.

[22]:
gene_results = pc.tl.gene_associations(adata, use_layer=None)

sig = gene_results[gene_results["significant"]]
print(f"Total significant peak-archetype associations: {len(sig)}")
print(f"\nBreakdown by archetype:")
print(sig["archetype"].value_counts())
đź§Ş Testing archetype-gene associations (AnnData-centric)...
   Method: mannwhitneyu
   FDR correction: benjamini_hochberg (global scope)
   Test direction: two-sided
   Bin proportion: 0.1 (closest cells to each archetype)
   Minimum cells per archetype: 10
   Comparison group: all
   [WARNING]  Using adata.X - assuming data is already log-transformed
   [OK] AnnData validation passed:
      Distance matrix: (18315, 5) from adata.obsm['archetype_distances']
      Archetype assignments: 18315 cells from adata.obs['archetypes']
      Assignment categories: ['archetype_0', 'archetype_1', 'archetype_2', 'archetype_3', 'archetype_4', 'archetype_5', 'no_archetype']
   Using adata.X
   Expression data: 18315 cells Ă— 19281 genes
   Sparse matrix: 75.1% zeros
   Original format: csr
    Using AnnData archetype assignments for binning...
      Found 6 archetype categories: ['archetype_0', 'archetype_1', 'archetype_2', 'archetype_3', 'archetype_4', 'archetype_5']
         no_archetype: 8816 cells (48.1%)
         archetype_1: 1671 cells (9.1%)
         archetype_2: 1665 cells (9.1%)
         archetype_4: 1643 cells (9.0%)
         archetype_5: 1615 cells (8.8%)
         archetype_3: 1541 cells (8.4%)
         archetype_0: 1364 cells (7.4%)
      archetype_0: 1364 assigned cells
      archetype_1: 1671 assigned cells
      archetype_2: 1665 assigned cells
      archetype_3: 1541 assigned cells
      archetype_4: 1643 assigned cells
      archetype_5: 1615 assigned cells
   [OK] Assignment-based binning completed (6 valid archetypes)

   Testing archetype_0...
      Archetype cells: 1364, Comparison cells: 16951
       Extracting expression data for 1364 archetype + 16951 other cells...
      [OK] Expression extraction completed - testing 19281 genes...
      Tested 19122 genes

   Testing archetype_1...
      Archetype cells: 1671, Comparison cells: 16644
       Extracting expression data for 1671 archetype + 16644 other cells...
      [OK] Expression extraction completed - testing 19281 genes...
      Tested 19122 genes

   Testing archetype_2...
      Archetype cells: 1665, Comparison cells: 16650
       Extracting expression data for 1665 archetype + 16650 other cells...
      [OK] Expression extraction completed - testing 19281 genes...
      Tested 19122 genes

   Testing archetype_3...
      Archetype cells: 1541, Comparison cells: 16774
       Extracting expression data for 1541 archetype + 16774 other cells...
      [OK] Expression extraction completed - testing 19281 genes...
      Tested 19122 genes

   Testing archetype_4...
      Archetype cells: 1643, Comparison cells: 16672
       Extracting expression data for 1643 archetype + 16672 other cells...
      [OK] Expression extraction completed - testing 19281 genes...
      Tested 19122 genes

   Testing archetype_5...
      Archetype cells: 1615, Comparison cells: 16700
       Extracting expression data for 1615 archetype + 16700 other cells...
      [OK] Expression extraction completed - testing 19281 genes...
      Tested 19122 genes

   Applying FDR correction (benjamini_hochberg, global scope)...
   Total tests performed: 79672
[STATS] FDR Correction Statistics:
   Method: benjamini_hochberg
   Number of tests: 79672
   Raw p-values < 0.05: 65371
   Min p-value: 9.23e-66
   Median p-value: 1.50e-05
   FDR-corrected p-values < 0.05: 64632
   Correction ratio: 64632/65371 = 0.989
   Note: BH assumes independence or positive dependence among tests
    Applied global FDR correction: 64632/79672 significant
[OK] Gene association testing completed!
   Total tests: 79672
   Significant associations: 64632 (81.1%)
   Significant by direction: {'two-sided': 64632}
   Raw significant (p<0.05): 65371
   FDR correction impact: 65371 → 64632

[STATS] Top significant associations per archetype:
   archetype_2:
      ↑ ENSG00000251287: FC=0.26, p=7.35e-61 (higher in archetype) (two-sided)
      ↑ ENSG00000171943: FC=0.19, p=5.16e-60 (higher in archetype) (two-sided)
      ↑ ENSG00000081377: FC=0.26, p=5.16e-60 (higher in archetype) (two-sided)
   archetype_4:
      ↑ ENSG00000154864: FC=0.32, p=2.82e-50 (higher in archetype) (two-sided)
      ↑ ENSG00000196581: FC=0.24, p=2.98e-46 (higher in archetype) (two-sided)
      ↑ ENSG00000064042: FC=0.31, p=4.08e-41 (higher in archetype) (two-sided)
   archetype_1:
      ↑ ENSG00000148848: FC=0.37, p=1.46e-45 (higher in archetype) (two-sided)
      ↑ ENSG00000145685: FC=0.27, p=6.24e-33 (higher in archetype) (two-sided)
      ↑ ENSG00000104490: FC=0.29, p=2.33e-31 (higher in archetype) (two-sided)
   archetype_5:
      ↓ ENSG00000104490: FC=-0.43, p=1.55e-43 (lower in archetype) (two-sided)
      ↓ ENSG00000135074: FC=-0.29, p=3.47e-41 (lower in archetype) (two-sided)
      ↓ ENSG00000169744: FC=-0.32, p=7.66e-40 (lower in archetype) (two-sided)
   archetype_0:
      ↓ ENSG00000172572: FC=-0.32, p=5.27e-43 (lower in archetype) (two-sided)
      ↓ ENSG00000107242: FC=-0.28, p=1.73e-36 (lower in archetype) (two-sided)
      ↓ ENSG00000074964: FC=-0.31, p=3.10e-36 (lower in archetype) (two-sided)
   archetype_3:
      ↑ ENSG00000124788: FC=0.23, p=8.40e-30 (higher in archetype) (two-sided)
      ↓ ENSG00000124440: FC=-0.17, p=5.63e-27 (lower in archetype) (two-sided)
      ↓ ENSG00000109906: FC=-0.28, p=2.47e-26 (lower in archetype) (two-sided)
Total significant peak-archetype associations: 64632

Breakdown by archetype:
archetype
archetype_2    13106
archetype_0    12898
archetype_4    12490
archetype_5    11891
archetype_3     9417
archetype_1     4830
Name: count, dtype: int64
[23]:
# Top differentially accessible peaks per archetype
for arch in sorted(sig["archetype"].unique()):
    top = sig[sig["archetype"] == arch].nlargest(5, "log_fold_change")
    print(f"\n{arch}:")
    for _, row in top.iterrows():
        print(f"  {row['gene']:30s} LFC={row['log_fold_change']:.3f} FDR={row['fdr_pvalue']:.2e}")

archetype_0:
  ENSG00000173068                LFC=0.218 FDR=3.77e-11
  ENSG00000157827                LFC=0.207 FDR=2.91e-18
  ENSG00000196189                LFC=0.173 FDR=3.75e-08
  ENSG00000166250                LFC=0.172 FDR=1.32e-10
  ENSG00000007237                LFC=0.149 FDR=2.31e-08

archetype_1:
  ENSG00000148848                LFC=0.373 FDR=1.46e-45
  ENSG00000173068                LFC=0.288 FDR=1.99e-27
  ENSG00000104490                LFC=0.286 FDR=2.33e-31
  ENSG00000145685                LFC=0.269 FDR=6.24e-33
  ENSG00000184719                LFC=0.243 FDR=7.66e-25

archetype_2:
  ENSG00000109906                LFC=0.365 FDR=6.53e-52
  ENSG00000142611                LFC=0.352 FDR=9.31e-37
  ENSG00000058453                LFC=0.336 FDR=3.38e-43
  ENSG00000173175                LFC=0.331 FDR=9.14e-40
  ENSG00000079432                LFC=0.318 FDR=4.43e-38

archetype_3:
  ENSG00000042832                LFC=0.247 FDR=9.24e-18
  ENSG00000124788                LFC=0.233 FDR=8.40e-30
  ENSG00000187239                LFC=0.208 FDR=2.70e-18
  ENSG00000027075                LFC=0.204 FDR=7.56e-11
  ENSG00000067057                LFC=0.202 FDR=2.92e-14

archetype_4:
  ENSG00000183287                LFC=0.325 FDR=1.54e-29
  ENSG00000154864                LFC=0.317 FDR=2.82e-50
  ENSG00000064042                LFC=0.307 FDR=4.08e-41
  ENSG00000153944                LFC=0.292 FDR=5.60e-29
  ENSG00000173068                LFC=0.283 FDR=1.54e-13

archetype_5:
  ENSG00000105650                LFC=0.192 FDR=3.38e-07
  ENSG00000123358                LFC=0.137 FDR=5.32e-15
  ENSG00000095066                LFC=0.129 FDR=4.88e-16
  ENSG00000153487                LFC=0.128 FDR=1.65e-09
  ENSG00000181222                LFC=0.127 FDR=1.39e-04

Cell Type Associations (if annotations available)#

Test whether specific cell types are enriched or depleted in each archetype.

[24]:
# Check for cell type annotations
ct_cols = [c for c in adata.obs.columns if 'type' in c.lower() or 'cluster' in c.lower()]
print(f"Candidate annotation columns: {ct_cols}")

if ct_cols:
    ct_col = ct_cols[0]
    print(f"\nUsing '{ct_col}' for conditional associations")
    cond_results = pc.tl.conditional_associations(adata, obs_column=ct_col)
    sig_cond = cond_results[cond_results["significant"]]
    print(f"{len(sig_cond)} significant associations")

    for _, row in sig_cond.nlargest(10, "odds_ratio").iterrows():
        direction = "enriched" if row["odds_ratio"] > 1 else "depleted"
        print(f"  {row['archetype']} x {row['condition']}: OR={row['odds_ratio']:.2f} ({direction})")
else:
    print("No cell type annotations found — skip conditional associations")
Candidate annotation columns: ['suspension_type', 'cell_type_ontology_term_id', 'author_cell_type', 'sub_celltype', 'tissue_type', 'cell_type', 'archetypes']

Using 'suspension_type' for conditional associations
đź§Ş Testing archetype-conditional associations...
   Condition column: suspension_type
   Method: hypergeometric (boolean matrix approach)
   Testing 6 archetypes Ă— 1 conditions
   Total cells in analysis: 9499
   Condition distribution:
      nucleus: 9499 cells
   Boolean matrices created:
      Archetype matrix: (9499, 6) (cells Ă— archetypes)
      Condition matrix: (9499, 1) (cells Ă— conditions)
[OK] Conditional association testing completed!
   Total tests: 6
   Significant associations: 0 (0.0%)
   Efficiency: 6/6 tests performed (filtered by min_cells)
0 significant associations

Summary#

AnnData keys created in this tutorial:

Key

Description

adata.obsm['X_lsi']

LSI embeddings (TF-IDF + TruncatedSVD)

adata.uns['lsi']

Variance ratio and component loadings

adata.uns['archetype_coordinates']

Archetype positions in LSI space

adata.obsm['archetype_distances']

Cell-to-archetype distance matrix

adata.obs['archetypes']

Categorical archetype assignments

adata.obsm['cell_archetype_weights']

Barycentric coordinates (A matrix)

Key takeaway: scATAC-seq archetypal analysis is identical to scRNA-seq except for the preprocessing step. Replace log-normalize + PCA with TF-IDF + LSI, then set pca_key='X_lsi' everywhere.

WARNING: The current results are not biologically interpretable. Note the lack of model convergence and low archetypal \(R^2\) obtained. This is a demonstration only. Future work will test archetype analysis from raw ATAC peaks to preserve rank structure in LSI dimensional reduction.