Tutorial 10: Spatial Archetypal Analysis#

This tutorial walks through archetypal analysis on spatial transcriptomics data, from data loading through spatial co-localization testing and interaction boundary detection.

What you’ll learn:

  1. Load and preprocess spatial transcriptomics data

  2. Train archetypes (same as scRNA-seq)

  3. Build spatial neighbor graphs

  4. Test archetype neighborhood enrichment (permutation test)

  5. Compute distance-dependent co-occurrence ratios

  6. Visualize spatial archetype patterns

  7. Measure spatial autocorrelation of archetype weights (Moran’s I)

  8. Detect interaction boundaries between cell types

Key concept: Archetypes are learned from gene expression. The spatial analysis layer asks: do cells assigned to the same archetype cluster together in tissue space? This connects transcriptomic extremes to spatial organization.

Requirements:

  • peach (v0.4.0+)

  • peach[spatial]: pip install peach[spatial] (squidpy >= 1.3.0)

  • scanpy

  • Data: data/melanoma/melanoma_LN.h5ad (Slide-seq v2, 41K cells)

[ ]:
import scanpy as sc
import peach as pc
import numpy as np
from pathlib import Path

print(f"PEACH version: {pc.__version__}")

Step 1: Load Spatial Data#

Spatial transcriptomics datasets have gene expression in adata.X and 2D tissue coordinates in adata.obsm['spatial']. This works with:

  • Slide-seq / Slide-seq v2 (bead-based, continuous coords)

  • MERFISH / seqFISH (imaging-based, continuous coords)

  • Visium (spot-based, hexagonal grid)

Set coord_type='generic' for continuous coordinates or 'grid' for Visium.

[ ]:
data_path = Path("data/melanoma/melanoma_LN.h5ad")
adata = sc.read_h5ad(data_path)

print(f"Shape: {adata.n_obs:,} cells x {adata.n_vars:,} genes")
print(f"obsm keys: {list(adata.obsm.keys())}")
print(f"obs columns: {list(adata.obs.columns)}")
[ ]:
# Inspect spatial coordinates
coords = adata.obsm["spatial"]
print(f"Spatial coordinates: {coords.shape}")
print(f"X range: [{coords[:, 0].min():.1f}, {coords[:, 0].max():.1f}]")
print(f"Y range: [{coords[:, 1].min():.1f}, {coords[:, 1].max():.1f}]")
[ ]:
# Cell type distribution (if available)
if "Cell_Type" in adata.obs.columns:
    print(f"Cell types ({adata.obs['Cell_Type'].nunique()}):")
    print(adata.obs["Cell_Type"].value_counts())

Step 2: Preprocessing#

Standard scRNA-seq preprocessing: normalize, log-transform, PCA. The spatial coordinates are not used during preprocessing or archetype training — they come into play later for spatial enrichment analysis.

Important: The cell below checks whether each step has already been applied to avoid double-normalizing or double-log-transforming, which would silently produce nonsense results.

[ ]:
# Preprocessing with guards against re-processing
import scipy.sparse as sp

X = adata.X
if sp.issparse(X):
    max_val = X.max()
    row_sums = np.asarray(X.sum(axis=1)).ravel()
else:
    max_val = X.max()
    row_sums = X.sum(axis=1)

already_normalized = np.allclose(row_sums[:100], row_sums[0], rtol=0.01)
already_logged = "log1p" in adata.uns or max_val < 20  # raw counts are typically > 1000

if already_logged and already_normalized:
    print(f"Data appears already normalized + log-transformed (max={max_val:.1f})")
elif already_normalized and not already_logged:
    print(f"Data appears normalized but not log-transformed (max={max_val:.1f})")
    sc.pp.log1p(adata)
else:
    print(f"Applying normalization + log1p (max={max_val:.1f})")
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)

if "X_pca" not in adata.obsm:
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)
    sc.pp.pca(adata, n_comps=30)
    print(f"Computed PCA: {adata.obsm['X_pca'].shape}")
else:
    print(f"Using existing PCA: {adata.obsm['X_pca'].shape}")

Step 4: Train Final Model#

[ ]:
n_archetypes = 5  # Adjust based on elbow/CV

results = pc.tl.train_archetypal(
    adata,
    n_archetypes=n_archetypes,
    n_epochs=50,
    hidden_dims=[256, 128, 64],
    device="cpu",
)

print(f"Final R²: {results.get('final_archetype_r2', 'N/A')}")
[ ]:
pc.pl.training_metrics(results["history"])

Step 5: Distances, Positions & Assignments#

[ ]:
pc.tl.archetypal_coordinates(adata)
pc.tl.assign_archetypes(adata, percentage_per_archetype=0.1)

print(f"Archetype positions: {adata.uns['archetype_coordinates'].shape}")
print(f"Distance matrix: {adata.obsm['archetype_distances'].shape}")
print(f"\nAssignment distribution:")
print(adata.obs["archetypes"].value_counts())
[ ]:
# Extract barycentric weights
weights = pc.tl.extract_archetype_weights(adata)
print(f"Weight matrix: {weights.shape}")
[ ]:
pc.pl.archetype_positions(adata)

Step 6: Spatial Neighbor Graph#

Build a k-nearest-neighbor graph in tissue space. This is the foundation for all spatial enrichment statistics.

  • coord_type='generic': For Slide-seq, MERFISH, seqFISH (continuous coordinates)

  • coord_type='grid': For Visium (hexagonal grid)

  • n_neighs: Number of spatial neighbors per cell (10-30 typical)

[ ]:
pc.tl.spatial_neighbors(
    adata,
    n_neighs=30,
    coord_type="generic",  # Slide-seq v2 = continuous coords
)

print(f"Connectivity matrix: {adata.obsp['spatial_connectivities'].shape}")
print(f"Distance matrix: {adata.obsp['spatial_distances'].shape}")

Step 7: Neighborhood Enrichment#

For each pair of archetypes (A, B), this permutation test asks: > Are archetype-A cells found in the spatial neighborhood of archetype-B cells > more often (or less often) than expected by chance?

The result is a z-score matrix:

  • Positive z-score: archetypes co-localize (found near each other)

  • Negative z-score: archetypes segregate (found apart from each other)

  • Near zero: random spatial arrangement

[ ]:
nhood_result = pc.tl.archetype_nhood_enrichment(
    adata,
    n_perms=1000,
)

zscore = nhood_result["zscore"]
print(f"Z-score matrix: {zscore.shape}")
print(f"Range: [{zscore.min():.2f}, {zscore.max():.2f}]")
[ ]:
# Identify strongest spatial relationships
labels = sorted(adata.obs["archetypes"].unique())
n = zscore.shape[0]

pairs = []
for i in range(n):
    for j in range(i + 1, n):
        pairs.append((zscore[i, j], labels[i], labels[j]))

pairs.sort(key=lambda x: abs(x[0]), reverse=True)
print("Strongest spatial relationships (by |z-score|):")
for z, a, b in pairs[:8]:
    status = "CO-LOCALIZE" if z > 0 else "SEGREGATE"
    print(f"  {a} - {b}: z={z:+.2f} ({status})")
[ ]:
# Heatmap of enrichment z-scores
pc.pl.nhood_enrichment(adata)

Step 8: Distance-Dependent Co-occurrence#

While neighborhood enrichment gives one summary statistic per pair, co-occurrence analysis shows how the relationship changes with distance.

The co-occurrence ratio at distance d:

  • > 1: archetype pair found together more than expected at distance d

  • = 1: random arrangement at distance d

  • < 1: avoidance at distance d

This reveals distance-dependent patterns (e.g., archetypes that co-occur at short range but not long range, or vice versa).

[ ]:
cooc_result = pc.tl.archetype_co_occurrence(
    adata,
    interval=50,
)

occ = cooc_result["occ"]
interval = cooc_result["interval"]
print(f"Co-occurrence tensor: {occ.shape}")
print(f"  [n_archetypes, n_archetypes, n_distances]")
print(f"Distance range: [{interval.min():.1f}, {interval.max():.1f}]")
[ ]:
# Co-occurrence ratio vs distance
pc.pl.co_occurrence(adata)

Step 9: Spatial Archetype Map#

Plot cells at their tissue coordinates, colored by archetype assignment. This directly shows where each archetype lives in the tissue.

[ ]:
pc.pl.spatial_archetypes(adata)
[ ]:
# Compare with cell type map (if annotations available)
if "Cell_Type" in adata.obs.columns:
    pc.pl.spatial_archetypes(adata, color_key="Cell_Type", title="Cell Types in Tissue")

Step 10: Spatial Autocorrelation#

Moran’s I per archetype weight tests whether each archetype is spatially smooth (I > 0, nearby cells have similar weights) or spatially random (I ~ 0).

Unlike neighborhood enrichment (which uses discrete archetype labels), this operates on the continuous weight vectors — capturing gradients in archetype mixtures, not just dominant assignments.

[ ]:
autocorr = pc.tl.archetype_spatial_autocorr(adata)
autocorr
[ ]:
pc.pl.spatial_autocorr(adata)

Step 11: Interaction Boundaries#

Where do two cell types run different archetype programs in the same spatial neighborhood? This detects interaction fronts — tissue regions where, e.g., macrophages shift toward inflammatory archetypes while neighboring fibroblasts shift toward fibrotic archetypes.

The function compares the mean archetype weight vector of each cell type’s spatial neighbors per cell, using Jensen-Shannon divergence. High boundary scores mark spatial fronts of divergent archetype activity.

[ ]:
# Detect interaction boundaries between the two most abundant cell types
boundary_result = pc.tl.archetype_interaction_boundaries(
    adata,
    cell_type_col="Cell_Type",
)

print(f"\nBoundary score range: [{adata.obs['boundary_score'].min():.4f}, "
      f"{adata.obs['boundary_score'].max():.4f}]")
[ ]:
# Spatial map of boundary scores
pc.pl.interaction_boundaries(adata)
[ ]:
# Per-archetype cross-correlations between cell types
pc.pl.cross_correlations(adata)

Step 12: Gene Characterization#

Identify which genes drive each archetype and cross-reference with spatial patterns.

[ ]:
gene_results = pc.tl.gene_associations(adata)

sig = gene_results[gene_results["significant"]]
print(f"Significant gene-archetype associations: {len(sig)}")
print(f"\nPer archetype:")
print(sig["archetype"].value_counts())
[ ]:
# Top genes per archetype
for arch in sorted(sig["archetype"].unique()):
    top = sig[sig["archetype"] == arch].nlargest(5, "log_fold_change")
    print(f"\n{arch}:")
    for _, row in top.iterrows():
        print(f"  {row['gene']:20s} LFC={row['log_fold_change']:.3f} FDR={row['fdr_pvalue']:.2e}")
[ ]:
# Cell type conditional associations
if "Cell_Type" in adata.obs.columns:
    cond_results = pc.tl.conditional_associations(adata, obs_column="Cell_Type")
    sig_cond = cond_results[cond_results["significant"]]
    print(f"Significant archetype-cell_type associations: {len(sig_cond)}")
    for _, row in sig_cond.nlargest(10, "odds_ratio").iterrows():
        direction = "enriched" if row["odds_ratio"] > 1 else "depleted"
        print(f"  {row['archetype']} x {row['condition']}: OR={row['odds_ratio']:.2f} ({direction})")

Summary#

AnnData keys created in this tutorial:

Key

Description

adata.obsm['X_pca']

PCA embeddings (standard preprocessing)

adata.uns['archetype_coordinates']

Archetype positions in PCA space

adata.obsm['archetype_distances']

Cell-to-archetype distance matrix

adata.obsm['cell_archetype_weights']

Continuous archetype weight matrix

adata.obs['archetypes']

Categorical archetype assignments

adata.obsp['spatial_connectivities']

Spatial k-NN graph

adata.obsp['spatial_distances']

Spatial distance matrix

adata.uns['archetype_nhood_enrichment']

Z-score matrix (permutation test)

adata.uns['archetype_co_occurrence']

Distance-dependent ratios

adata.uns['archetype_spatial_autocorr']

Moran’s I per archetype weight

adata.obs['boundary_score']

Per-cell interaction boundary score

adata.uns['archetype_interaction_boundaries']

Cross-correlation and boundary results

Interpretation guide:

  • Nhood enrichment z > 2: strong spatial co-localization between archetypes

  • Nhood enrichment z < -2: strong spatial segregation

  • Co-occurrence > 1 at short distance: archetypes are proximal neighbors

  • Moran’s I > 0: archetype weight is spatially smooth (clustered)

  • Boundary score high: cell types running different archetype programs in the same neighborhood

  • Cross-correlation r > 0: archetype co-localizes across cell types; r < 0: anti-correlates

  • Cross-reference with gene associations to understand why archetypes co-localize (e.g., shared signaling pathways, ligand-receptor interactions)