Real-world example - Analysis of scDVP liver zonation dataset#

In this example we showcase how spatialdata can be leveraged in the analysis of real-world DVP data. For this purpose we utilize a small subset of the previously published dataset by [RTS+23] (Rosenberger et al, 2023), specifically the sample m1a. This datasets includes

IF images (as .tiff) of the mouse liver, including the markers Glutamine Synthethase, E-Cadherin, cell membrane marker (WGA), and phalloidin (cytoskeleton)
Cell segmentation shapes of the profiled cells as LMD-compatible .xml file
Proteomics measurements of 77 single shapes as diann protein groups report (.tsv)

We will see how we can use spatialdata to jointly store and visualize these different modalities.

Packages#

To process the different modalities, we import the respective readers from dvpio (see the FAQ) for a full list of available readers). Additionally, we will import some scientific software packages, including spatialdata

import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt

import spatialdata as sd
import spatialdata_plot  # noqa

from dvpio.read.image import read_custom
from dvpio.read.omics import parse_df
from dvpio.read.shapes import read_lmd

Get data#

# %%capture
# %%bash
# downloadpath="./scdvp"
# mkdir -p $downloadpath
# wget -O $downloadpath/scdvp-rosenberger2023.zip "https://datashare.biochem.mpg.de/s/Z0BBq65fFHX3ai3/download"
# unzip -d $downloadpath $downloadpath/scdvp-rosenberger2023.zip
# rm $downloadpath/scdvp-rosenberger2023.zip

image_path = "./scdvp/images"
lmd_path = "./scdvp/shapes/EXP-220815_m1A_plate-2.pylmd.xml"
protein_group_df_path = "./scdvp/omics/PXD038699_proteintable.tsv"

Create spatialdata object#

We will successively add attributes to our spatialdata object. Let’s initialize an empty object:

sdata = sd.SpatialData()

Load images#

First, we will add the tiff images using the respective dvpio function and visualize it

sdata.images["m1a_image"] = read_custom(path=f"{image_path}/S-BIAD596_m1A_*.tif")

INFO     no axes information specified in the object, setting `dims` to: ('c', 'y', 'x')

norm = mpl.colors.Normalize(vmin=0, vmax=4000)
sdata.pl.render_images("m1a_image", norm=norm).pl.show()

INFO     Rasterizing image for faster rendering.

../_images/d9a8b64fbcae01742c68739918e9560896626a54eb0e8cd64c02fdae7be8f6ef.png

Load shapes#

Next, we load the shapes. To do this, we will first need to define the calibration points in the image. As the calibration points are not visible in the image, we pass the values that were used in the original publication (link). Alternatively, you can also interactively select the points with napari_spatialdata, see Tutorial 2

# Calibration points were determined at 0.135545805x resolution in a coordinate system that was centered at the image center
# + we need to scale them to 1x resolution (x1/0.135545805)
# + shift the coordinate system to the upper left corner of the image (+image width/2, image width/2)
url = "https://raw.githubusercontent.com/MannLabs/single-cell-DVP/cf9d588849f84409c2e0ce6e3703cc69a1823503/data/meta/coordinates_meta.txt"
df = pd.read_csv(url, sep="\t")
df["X_scaled"] = df["X"] / df["resolution"]
df["Y_scaled"] = df["Y"] / df["resolution"]
df["X_transformed"] = df["X_scaled"] + sdata.images["m1a_image"].shape[2] / 2
df["Y_transformed"] = df["Y_scaled"] + sdata.images["m1a_image"].shape[1] / 2

calibration_points_image = df.loc[df["Slide"] == "m1A", ["X_transformed", "Y_transformed"]].values

sdata.points["calibration_points_image"] = sd.models.PointsModel.parse(calibration_points_image)

calibration_points_image

array([[ 1836.11676551,  9121.12330304],
       [18291.61238192,  3189.23998138],
       [ 7121.62375704, 25625.42677448]])

Having defined the calibration points, we can now read the actual shapes from the LMD-compatible .xml file and add them to spatialdata.

sdata.shapes["cell_segmentation"] = (
    read_lmd(
        lmd_path,
        calibration_points_image=sdata.points["calibration_points_image"],
        transformation_type="similarity",
    )
    .set_index("name", drop=False)
    .rename_axis(None, axis=0)
)

You should always check/validate whether cell segmentation and image are correctly overlaid. To do so, we select a small subset of the image and take a look at both cell membrane staining (channel 1) and cell segmentation

fig, ax = plt.subplots(1, 1, figsize=(8, 5))

min_coordinate = np.array([6200, 6600])
max_coordinate = min_coordinate + 700

(
    sdata.query.bounding_box(
        axes=("x", "y"),
        min_coordinate=min_coordinate,
        max_coordinate=max_coordinate,
        target_coordinate_system="global",
    )
    .pl.render_images("m1a_image", channel=1)
    .pl.render_shapes(fill_alpha=0.0, outline_alpha=1, outline_color="#efefef")
    .pl.show(coordinate_systems="global", ax=ax)
)

../_images/9af81646877645f060e59474e61c79778277310705fb4675a7a3b3966c164adb.png

The overlay between shapes and cell membrane stain looks reasonable.

Omics data#

Finally, we add the omics data. In this case, we start from the protein.group dataframe as reported by DIANN (source) and use the :func:dvpio.read.omics.parse_df function to parse the data into a spatialdata-compatible anndata.AnnData container.

Note, that the function expects a processed version of the dataframe: Cell-level metadata must be stored in the dataframe row index and will be stored in the anndata.AnnData.obs attribute. Gene-level metadata must be stored in the dataframe columns index and will be stored in the anndata.AnnData.var attribute. All values of the dataframe will be stored in anndata.AnnData.X.

# Read DIANN Report
df = pd.read_csv(protein_group_df_path, sep="\t", index_col="protein").T.reset_index(names="run")

# Parse metadata run into individual metadata columns
metadata_columns = [
    "date",
    "ms",
    "A",
    "B",
    "C",
    "method",
    "sample_id",
    "uniprot_ref",
    "aid",
    "cell_id",
    "D",
    "E",
    "F",
    "G",
]
df[metadata_columns] = df["run"].str.split("_").tolist()

# Create new metadata column `name` that matches the shape names in in the cell segmentation
df["name"] = df["cell_id"].apply(lambda x: f"Shape_{int(x)}")

# Add metadata column that enables spatialdata to match the measurements
# to the cell segmentation shapes (stored in sdata.shapes["cell_segmentation"]) [expects name of shapes attribute]
df["region_key"] = "cell_segmentation"

# Subset to unique cells
# TODO discuss why there are multiple measurements per cell
df = df.groupby("name").first().reset_index()

# Set all metadata columns, including the newly generated ones to the dataframe index
df = df.set_index(["run", "name", "region_key", *sorted(metadata_columns)])

df

																protein	A0A087WQ94	A0A087WR50	A0A087WSP5	A0A0A6YVS2	A0A0A6YW80	A0A0J9YU79	A0A0J9YUI8	A0A0R4IZY2	A0A0R4J083	A0A0R4J0B4	...	P54726	Q61581	Q3V2R3	Q6S9I0	Q9D0J4	Q99N93	Q8CHY6	P11835	P56183	Q6P8J7
run	name	region_key	A	B	C	D	E	F	G	aid	cell_id	date	method	ms	sample_id	uniprot_ref
20220820_TIMS06_FAS_SA_S720_scDVP_m1A_Ref10_AID8_01_S2-A1_1_4970_target4	Shape_1	cell_segmentation	FAS	SA	S720	S2-A1	1	4970	target4	AID8	01	20220820	scDVP	TIMS06	m1A	Ref10	2546.170495	3112.807142	4247.124853	1906.437899	4825.246524	1589.062179	736.197798	6044.616609	6140.588482	1625.809065	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
20220820_TIMS06_FAS_SA_S720_scDVP_m1A_Ref10_AID8_10_S2-A10_1_4979_target4	Shape_10	cell_segmentation	FAS	SA	S720	S2-A10	1	4979	target4	AID8	10	20220820	scDVP	TIMS06	m1A	Ref10	8077.136135	3330.472828	2268.027890	1526.827577	6940.136352	3275.123797	2150.213041	10486.206852	6641.898142	12554.911721	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
20220820_TIMS06_FAS_SA_S720_scDVP_m1A_Ref10_AID8_11_S2-A11_1_4980_target4	Shape_11	cell_segmentation	FAS	SA	S720	S2-A11	1	4980	target4	AID8	11	20220820	scDVP	TIMS06	m1A	Ref10	NaN	4435.863428	5348.012649	1748.762739	10268.052605	1197.900908	2981.050155	12238.028228	8792.139128	1203.145811	...	NaN	NaN	3487.29117	NaN	NaN	NaN	NaN	NaN	NaN	NaN
20220820_TIMS06_FAS_SA_S720_scDVP_m1A_Ref10_AID8_12_S2-B1_1_4981_target8	Shape_12	cell_segmentation	FAS	SA	S720	S2-B1	1	4981	target8	AID8	12	20220820	scDVP	TIMS06	m1A	Ref10	1196.769924	3220.892955	5649.794082	5531.072645	11462.061995	1891.362297	3109.722352	8266.772895	4655.184605	3886.250550	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
20220820_TIMS06_FAS_SA_S720_scDVP_m1A_Ref10_AID8_13_S2-B2_1_4982_target8	Shape_13	cell_segmentation	FAS	SA	S720	S2-B2	1	4982	target8	AID8	13	20220820	scDVP	TIMS06	m1A	Ref10	3896.896268	3892.945494	3787.335743	3654.229206	9770.141567	2570.591653	3336.025993	13528.636038	8340.092785	2791.022612	...	NaN	318.401047	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
20220820_TIMS06_FAS_SA_S720_scDVP_m3B_Ref10_AID8_75_S2-G9_1_4961_target4	Shape_75	cell_segmentation	FAS	SA	S720	S2-G9	1	4961	target4	AID8	75	20220820	scDVP	TIMS06	m3B	Ref10	1853.740651	3846.307917	2964.671425	3770.539387	6331.494696	4519.779519	2492.729865	15927.070502	11184.127594	232.821475	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1078.283601
20220820_TIMS06_FAS_SA_S720_scDVP_m3B_Ref10_AID8_76_S2-G10_1_4962_target4	Shape_76	cell_segmentation	FAS	SA	S720	S2-G10	1	4962	target4	AID8	76	20220820	scDVP	TIMS06	m3B	Ref10	4479.211592	2225.215970	3729.255133	1500.205313	20813.766732	5020.883312	2230.125342	8551.449242	5701.956014	1810.053280	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
20220820_TIMS06_FAS_SA_S720_scDVP_m3B_Ref10_AID8_77_S5-G11_1_4963_target8	Shape_77	cell_segmentation	FAS	SA	S720	S5-G11	1	4963	target8	AID8	77	20220820	scDVP	TIMS06	m3B	Ref10	3994.153689	2295.307398	2192.080927	15405.786639	1908.038186	3279.212194	2958.605514	7308.751211	4870.938330	1780.867609	...	NaN	NaN	NaN	NaN	488.112002	NaN	NaN	NaN	NaN	NaN
20220820_TIMS06_FAS_SA_S720_scDVP_m1A_Ref10_AID8_08_S2-A8_1_4977_target4	Shape_8	cell_segmentation	FAS	SA	S720	S2-A8	1	4977	target4	AID8	08	20220820	scDVP	TIMS06	m1A	Ref10	3880.901353	3354.038617	3288.165942	3010.736507	5709.819890	3183.414270	2314.839162	16373.762603	11374.528812	3555.521124	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
20220820_TIMS06_FAS_SA_S720_scDVP_m1A_Ref10_AID8_09_S2-A9_1_4978_target8	Shape_9	cell_segmentation	FAS	SA	S720	S2-A9	1	4978	target8	AID8	09	20220820	scDVP	TIMS06	m1A	Ref10	3086.943355	3421.335838	2606.051069	5119.567885	9104.939493	2330.939112	2840.631062	11393.620934	5431.817081	2077.488892	...	173.200087	480.249257	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

77 rows × 3738 columns

# Parse dataframe
# Set anndata.AnnData.var index to protein uniprot accessors to enable querying by protein names (required for plotting)
sdata.tables["m1a_table"] = parse_df(
    df, var_index="protein", region="cell_segmentation", region_key="region_key", instance_key="name"
)
# Inspect
sdata.tables["m1a_table"]

AnnData object with n_obs × n_vars = 77 × 3738
    obs: 'run', 'name', 'region_key', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'aid', 'cell_id', 'date', 'method', 'ms', 'sample_id', 'uniprot_ref'
    uns: 'spatialdata_attrs'

After having added the omics layer to the spatialdata object, we can overlay the proteomics measurement with the imaging. In this example, we will focus on E-Cadherin (P09803) for which the original authors also developed an antibody-based staining.

min_coordinate = np.array([4000, 8000])
size = 5000

(
    sdata
    # Subset to relevant area
    .query.bounding_box(
        axes=("x", "y"),
        min_coordinate=min_coordinate,
        max_coordinate=min_coordinate + size,
        target_coordinate_system="global",
    )
)

SpatialData object
├── Images
│     └── 'm1a_image': DataArray[cyx] (4, 5000, 5000)
├── Shapes
│     └── 'cell_segmentation': GeoDataFrame shape: (11, 3) (2D shapes)
└── Tables
      └── 'm1a_table': AnnData (2, 3738)
with coordinate systems:
    ▸ 'global', with elements:
        m1a_image (Images), cell_segmentation (Shapes)
    ▸ 'to_lmd', with elements:
        cell_segmentation (Shapes)

sdata.tables["m1a_table"].obs["expression"] = sdata.tables["m1a_table"][:, "P09803"].X.flatten()

fig, axs = plt.subplots(1, 2, figsize=(16, 5))

min_coordinate = np.array([4000, 7000])
max_coordinate = min_coordinate + 600

(
    sdata
    # Subset to relevant area
    .query.bounding_box(
        axes=("x", "y"),
        min_coordinate=min_coordinate,
        max_coordinate=max_coordinate,
        target_coordinate_system="global",
    )
    .pl.render_images("m1a_image", channel=1)
    .pl.render_shapes(
        element="cell_segmentation", color="P09803", norm=mpl.colors.Normalize(0, 7000), outline_alpha=1, fill_alpha=1
    )
    .pl.show(coordinate_systems="global", ax=axs[0])
)

axs[0].set_title("LC/MS-measurement + $\\alpha$-E-Cadherin")

(
    sdata
    # Subset to relevant area
    .query.bounding_box(
        axes=("x", "y"),
        min_coordinate=min_coordinate,
        max_coordinate=max_coordinate,
        target_coordinate_system="global",
    )
    .pl.render_images("m1a_image", channel=1)
    .pl.render_shapes(
        element="cell_segmentation", color="P09803", outline_alpha=1, outline_color="#cccccc", fill_alpha=0
    )
    .pl.show(coordinate_systems="global", ax=axs[1])
)

axs[1].set_title("Cell segmentation + $\\alpha$-E-Cadherin")

Text(0.5, 1.0, 'Cell segmentation + $\\alpha$-E-Cadherin')

../_images/cd48a36ce82c5190887019a3dd496e283517dc1084bb2499fedf70d3cf5c5069.png

We note that the staining intensity correlates with the measured abundance of E-Cadherin in the cells.

Real-world example - Analysis of scDVP liver zonation dataset

Contents