Copyright © 2024 Crunch Lab Inc. All rights reserved.
36 Manchester Drive, Westfield, New Jersey, 07090, United States
Crunch Lab, The Eric and Wendy Schmidt Center, and The Klarman Cell Observatory invite you to join the Autoimmune Disease ML Challenge to design algorithms to help millions of people.
Autoimmune diseases arise when the immune system mistakenly targets healthy cells. They affect 50M people in the U.S., with cases rising globally, and Inflammatory Bowel Disease (IBD) is one of the most prevalent forms. IBD occurs when the barrier between our gut and the microbes living there breaks down, leading to the activation of the immune system and persistent inflammation. This cycle of flares and remission increases the risk of colorectal cancer up to two-fold. Although modern treatments have improved survival, IBD remains challenging to diagnose and treat due to its complex pathogenic pathways and multifactorial nature.
Pathologists rely on gut tissue images to diagnose and treat IBD, guiding decisions on the most suitable drug treatments and predicting cancer risk. These tissue images, combined with recent advances in genomics, offer a valuable dataset for machine learning models to revolutionize IBD diagnosis and treatment.
This challenge is meant for everyone! We have created a three-lecture crash course that provides background on the biology, technology, and data in the three crunches. You do not need a background in biology or medicine to participate.
The challenge is broken down into three Crunches, ordered by increasing complexity.
Crunchers will build a model to predict the expression of 460 genes in held-out patches of colon tissue using H&E pathology images and Xenium spatial transcriptomics training data. Hematoxylin and Eosin (H&E) images provide insight into cell organization, while Xenium data add information on gene expression and cellular pathways of disease.
In this phase, participants will predict the expression of all protein-coding genes, including those that were not measured in the spatial training data, using single-cell RNA-seq data as support. This Crunch focuses on leveraging cell transcriptional profiles to enhance the predictive model’s ability to infer the expression of unknown genes in spatial contexts.
Participants will rank genes by their ability to distinguish between dysplasia (pre-cancerous) regions and noncancerous tissue in IBD patients, increasing our ability to detect cancer early. The final gene panel will be chosen based on participant performance in Crunch 2 and on peer review of participants' methods taking place after the submission deadline. The gene panel will be experimentally validated in a new colon tissue with dysplasia, and all participants' ranked gene lists will be scored.
The best-performing models will be experimentally tested to validate their ability to predict cancer risk, which could lead to early detection and improved treatment options. The best models will be published in an official publication from the Broad Institute.
For each Crunch, participants must submit predictions in CSV format. Each submission must adhere to the provided log1p-normalization standards.
Outputs will be evaluated using:
Mean Squared Error (Crunch 1). To avoid overfitting, Crunch will score all submitted models on a private dataset during two Checkpoints.
Spearman’s Correlation (Crunch 2)
Accuracy and Diversity Metrics (Crunch 3)
Performance will be evaluated through:
Accuracy in gene expression prediction (Crunch 1 & 2)
Gene panel design for distinguishing between noncancerous and dysplasia regions (Crunch 3)
Diversity of selected gene programs in Crunch 3, with extra emphasis on identifying unique biological pathways
Peer review of methods to select dysplasia gene panel in Crunch 3
Crunchers are encouraged to use publicly available external resources, including gene expression datasets and pre-trained models, as long as they are properly credited.
A list of potential resources and references is provided in the full challenge specifications.
Foundry Institute offers a computing environment with $10 USD of credit, equivalent to around 10 hours of GPU time.
In Crunch 1, you will have the opportunity to evaluate your model’s predictive performance on a validation dataset, before submission of your test dataset predictions.
There will be multiple validation checkpoints:
Checkpoint 1 - November 30th (Eastern Time 17:59)
Checkpoint 2 - December 16th (Eastern Time 17:59)
Checkpoint 3 - December 30th (Eastern Time 17:59)
Checkpoint 4 - January 13th (Eastern Time 17:59)
Checkpoint 5 - January 27th (Eastern Time 17:59)
Continuous Public Leaderboard - January 20th
Last submission - February 9th (Eastern Time 17:59)
In Crunch 1, you will train an algorithm to predict spatial transcriptomics data (gene expression in each cell) from matched H&E images. In other words, you will predict the gene expression (Y) in cells from specific tissue patches based on the H&E images (X) and the surrounding spatial transcriptomics data.
X (Input):
HE_original: The original H&E image in its native pixel coordinates. The alignment from the H&E native coordinate system to the Xenium coordinate system has already been handled on our end. If you prefer to handle the alignment yourself, you can use HE_original and DAPI (provided in crunch1_max), but this may require additional processing.
HE_nuc_original: The nucleus segmentation mask of the H&E image, in the H&E native coordinate system. The cell_id values in this segmentation mask match those of the nucleus-by-gene matrix stored in anucleus.
Y:
anucleus: This file contains the aggregated gene expression data for each nucleus. It is log1p-normalized and stores the gene expression profiles for 460 genes per nucleus. This is the primary target (Y) for your model.
Anucleus – gene expression for each nucleus
Steps to align X and Y:
Step 1: Identify nuclei in the H&E image
Use the nucleus segmentation masks:
H&E nucleus segmentation (HE_nuc_original): This mask identifies the location of nuclei in the original H&E image (i.e. HE_original).
Step 2: Link gene expression to H&E images
For each nucleus in the H&E image, use the anucleus file to get the corresponding gene expression profile (Y) for that nucleus.
The anucleus file provides the gene expression data, where each row corresponds to a nucleus (cell) and each column corresponds to a gene.
The nuclei IDs from the segmentation masks (e.g., from HE_nuc_original) will match the IDs used in the anucleus file.
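The two steps above can be sketched with plain NumPy arrays. This is only a minimal illustration on synthetic data, not the challenge's loader: `mask`, `cell_ids`, and `expression` are hypothetical stand-ins for the HE_nuc_original mask, the nucleus IDs of the anucleus table, and its expression matrix, and the sketch assumes that 0 marks background pixels in the mask.

```python
import numpy as np

# Toy stand-in for HE_nuc_original: 0 is assumed to be background,
# any other value is the cell_id of the nucleus covering that pixel.
mask = np.array([
    [0, 0, 7, 7],
    [0, 3, 7, 7],
    [3, 3, 0, 0],
    [0, 0, 0, 9],
])

# Toy stand-in for anucleus: one row per nucleus, one column per gene.
cell_ids = np.array([3, 7, 9, 12])   # row labels of the table
expression = np.array([
    [0.0, 1.2],
    [0.5, 0.0],
    [2.1, 0.3],
    [0.9, 0.9],
])

# Step 1: nuclei present in this image (drop the assumed background label).
ids_in_image = np.unique(mask)
ids_in_image = ids_in_image[ids_in_image != 0]

# Step 2: look up the expression row of each nucleus by its ID.
row_of = {int(cid): row for row, cid in enumerate(cell_ids)}
rows = [row_of[int(cid)] for cid in ids_in_image if int(cid) in row_of]
Y = expression[rows]   # one expression profile per nucleus found in the image
```

Nucleus 12 exists in the toy table but not in the toy image, so it is simply not selected; in the real data the same lookup would go through the IDs stored with the anucleus table.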
Matched crop H&E image and its corresponding Gene Expression Heatmap
If you open the HE_nuc_original image, e.g. through mask=sdata['HE_nuc_original'][0].to_numpy(), you can directly find the location of a cell with a given cell_id through mask==cell_id.
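As a hedged illustration of the mask==cell_id lookup, the sketch below computes the centroid of one nucleus and crops a square window around it. It runs on a synthetic 2D array; `locate_nucleus`, `crop_patch`, and the patch size are hypothetical helpers chosen for this example and are not part of the challenge code.

```python
import numpy as np

def locate_nucleus(mask, cell_id):
    """Return the (row, col) centroid of the pixels labelled cell_id."""
    rows, cols = np.nonzero(mask == cell_id)
    if rows.size == 0:
        raise ValueError(f"cell_id {cell_id} not found in mask")
    return int(round(float(rows.mean()))), int(round(float(cols.mean())))

def crop_patch(image, center, size):
    """Crop a window of at most size x size around center, clipped to the image."""
    half = size // 2
    r0 = max(center[0] - half, 0)
    c0 = max(center[1] - half, 0)
    r1 = min(r0 + size, image.shape[0])
    c1 = min(c0 + size, image.shape[1])
    return image[r0:r1, c0:c1]

# Synthetic mask with a single 3x3 nucleus labelled 42.
mask = np.zeros((10, 10), dtype=np.int32)
mask[4:7, 5:8] = 42

# Synthetic single-channel image on the same (row, col) grid.
image = np.arange(100).reshape(10, 10)

center = locate_nucleus(mask, 42)
patch = crop_patch(image, center, size=4)
```

The sketch assumes the image is indexed as (row, col); a multi-channel H&E array would need its channel axis handled separately before cropping.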
The datasets are stored in a SpatialData object. Learn more about this format here.
// SpatialData object structure
Images
//
'DAPI': DAPI image (validation and test tissue patches are removed)
'DAPI_nuc': DAPI nucleus segmentation
'HE_nuc_original': H&E nucleus segmentation on original image
'HE_nuc_registered': H&E nucleus segmentation on registered image (registered to DAPI image)
'HE_original': H&E original image
'HE_registered': H&E registered image
'group': Defining train(0)/validation(1)/test(2), No_transcript-train(4) tissue patches
'group_HEspace': Defining train(0)/validation(1)/test(2), No_transcript-train(4)
tissue patches on the H&E image
Points
'transcripts': DataFrame for each transcript (containing x,y,tissue patch,z_location,
feature_name,transcript_id,qv,cell_id columns)
Tables
'anucleus': AnnData contains .X, .layers['counts'], .obsm['spatial']
'cell_id-group': AnnData only contains .obs DataFrame for mapping of cell_id
to region.
with coordinate systems:
'global', with elements:
'DAPI' (Images), 'DAPI_nuc' (Images), 'HE_nuc_original' (Images), 'HE_nuc_registered'
(Images), 'HE_original' (Images), 'HE_registered' (Images), 'group' (Images),
'group_HEspace' (Images), 'transcripts' (Points)
'scale_um_to_px', with elements:
'transcripts' (Points)
In the minimum version of the data provided for crunch1 (in crunch1_min.tar), only HE_original, HE_nuc_original, anucleus and cell_id-group are provided.
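The 'cell_id-group' table already provides the mapping from cell_id to region. Purely to illustrate how a group image relates to a nucleus mask, the sketch below assigns each nucleus the group code found most often under its pixels. It uses synthetic arrays and assumes that both images share the same pixel grid and that 0 is the background label of the nucleus mask; `group_of_nuclei` is a hypothetical helper.

```python
import numpy as np

# Group codes as listed in the object structure above.
GROUP_NAMES = {0: "train", 1: "validation", 2: "test", 4: "no_transcript_train"}

# Synthetic stand-ins for a nucleus mask and a group image on the same grid.
nuc_mask = np.array([
    [1, 1, 0, 2],
    [1, 0, 0, 2],
    [0, 3, 3, 0],
])
group_img = np.array([
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [2, 2, 2, 2],
])

def group_of_nuclei(nuc_mask, group_img):
    """Map each nucleus ID to the most frequent group code under its pixels."""
    result = {}
    for cell_id in np.unique(nuc_mask):
        if cell_id == 0:   # assumed background label
            continue
        codes = group_img[nuc_mask == cell_id]
        values, counts = np.unique(codes, return_counts=True)
        result[int(cell_id)] = GROUP_NAMES[int(values[np.argmax(counts)])]
    return result

split = group_of_nuclei(nuc_mask, group_img)
```

Looping over every nucleus is fine for a toy example but would be slow on a full slide, where the provided 'cell_id-group' table is the more practical source.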
The output consists of four columns:
cell_id: contains the held-out nuclei (both validation and test tissue regions).
gene: the gene among the 460 genes to be predicted.
prediction: the gene expression value, rounded to two decimal places.
sample: the tissue sample among the 8 samples to process.
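A table with these four columns can be assembled from a nucleus-by-gene prediction matrix as sketched below. The gene names, cell IDs, and sample name are placeholders invented for the example, and how the table is ultimately returned is governed by the submission interface, which is not reproduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical predictions for one sample: rows are held-out nuclei,
# columns are genes (3 placeholder genes here instead of 460).
cell_ids = [101, 102]
genes = ["GENE_A", "GENE_B", "GENE_C"]
pred = np.array([[0.1234, 1.5, 0.0],
                 [2.3456, 0.0, 0.678]])

wide = pd.DataFrame(pred, index=cell_ids, columns=genes)
wide.index.name = "cell_id"

# Wide -> long: one row per (cell_id, gene) pair.
submission = wide.reset_index().melt(
    id_vars="cell_id", var_name="gene", value_name="prediction"
)
submission["prediction"] = submission["prediction"].round(2)
submission["sample"] = "SAMPLE_1"   # placeholder sample name
submission = submission[["cell_id", "gene", "prediction", "sample"]]
```

With 460 genes, the long table has 460 rows per held-out nucleus, so memory use grows quickly; building it one sample at a time keeps it manageable.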
Make sure your predictions are log1p-normalized with a scale factor of 100, as in anucleus.X.
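One common reading of "log1p-normalized with a scale factor of 100" is that the counts of each nucleus are rescaled to sum to 100 and then passed through log(1 + x). The sketch below implements that reading on synthetic counts. It is an assumption, not a confirmed description of how anucleus.X was produced, so it is worth checking against anucleus.layers['counts'] before relying on it; `log1p_normalize` is a hypothetical helper.

```python
import numpy as np

def log1p_normalize(counts, scale_factor=100.0):
    """Scale each row (nucleus) to sum to scale_factor, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Leave all-zero rows at zero instead of dividing by zero.
    scaled = np.divide(counts * scale_factor, totals,
                       out=np.zeros_like(counts), where=totals > 0)
    return np.log1p(scaled)

# Synthetic raw counts: 3 nuclei x 3 genes, including one empty nucleus.
counts = np.array([[1, 3, 0],
                   [0, 0, 0],
                   [5, 5, 10]])
X = log1p_normalize(counts)

# np.expm1 inverts the log1p step and recovers the scaled values.
recovered = np.expm1(X)
```

Under this assumption, a model trained directly on anucleus.X already produces values on the expected scale and needs no further transformation.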
The scoring metric is a cell-wise Spearman correlation.
A Mean Squared Error (MSE) metric is also computed, and its value must be below 0.2. Since the baseline is 0.1, a model whose MSE exceeds this threshold is not considered viable and will not be eligible for rewards.
To build a valid submission, your model needs to be coded within the infer function, respecting the crunch code submission interface.
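For local validation, both metrics can be approximated as sketched below. The sketch assumes that "cell-wise" means one Spearman correlation per nucleus, computed across genes and then averaged over nuclei; the official scorer may differ in details such as tie handling or the treatment of constant rows. The function names are hypothetical.

```python
import numpy as np

def average_ranks(x):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    x = np.asarray(x)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]

def cellwise_spearman(y_true, y_pred):
    """Mean over cells of the Spearman correlation computed across genes."""
    scores = []
    for t, p in zip(y_true, y_pred):
        rt, rp = average_ranks(t), average_ranks(p)
        if rt.std() == 0 or rp.std() == 0:
            continue   # correlation is undefined for constant vectors
        scores.append(np.corrcoef(rt, rp)[0, 1])
    return float(np.mean(scores)) if scores else float("nan")

def mse(y_true, y_pred):
    """Mean squared error over all (cell, gene) entries."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# Synthetic example: 2 nuclei x 4 genes.
y_true = np.array([[0.0, 1.0, 2.0, 3.0],
                   [1.0, 1.0, 0.0, 2.0]])
y_pred = np.array([[0.1, 0.9, 2.2, 2.8],
                   [0.8, 1.1, 0.2, 1.9]])

score = cellwise_spearman(y_true, y_pred)
error = mse(y_true, y_pred)
meets_mse_threshold = error < 0.2   # threshold stated in the text above
```

Spearman is computed here as the Pearson correlation of average ranks, which is the standard definition in the presence of ties.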
Due to the large size of the datasets, Crunch provides both a small (aka. default) and a large version.
Depending on your local setup and goals within Crunch, you can choose either one.
By default, the small dataset is downloaded.
To access the larger dataset, specify it explicitly with a different CLI command:
# setup with the large data
crunch setup --size large broad-1 my-model ...
# setup with the small data
crunch setup broad-1 my-model ...
The larger version contains the Xenium transcriptomic data. It allows you to know both the gene expression and the coordinates (x, y, z) of each transcript within the cells.
More details about the gene transcriptomic data are available in the full documentation.
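To give a feel for what transcript-level data makes possible, the sketch below aggregates a small synthetic transcript table into a cell-by-gene count matrix and a mean position per cell. The column names follow the 'transcripts' description above; the values and gene names are invented, and the handling of transcripts that are not assigned to any cell is left out because it is not described here.

```python
import pandas as pd

# Synthetic stand-in for the 'transcripts' table (a subset of its columns).
transcripts = pd.DataFrame({
    "x": [10.2, 10.8, 11.1, 40.0, 41.5],
    "y": [5.0, 5.4, 4.9, 22.3, 21.8],
    "z_location": [1.0, 1.2, 0.8, 2.0, 2.1],
    "feature_name": ["GENE_A", "GENE_A", "GENE_B", "GENE_B", "GENE_B"],
    "cell_id": [1, 1, 1, 2, 2],
})

# Count transcripts per (cell, gene): rows are cells, columns are genes.
counts = pd.crosstab(transcripts["cell_id"], transcripts["feature_name"])

# Mean transcript position per cell, a rough proxy for the cell location.
centers = transcripts.groupby("cell_id")[["x", "y"]].mean()
```

On the real data the table is far larger, so a chunked or out-of-core aggregation may be needed in place of a single in-memory crosstab.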
The large variant is for local use only.
The Cloud Environment will always use the default dataset.
Competition Host
Eric and Wendy Schmidt Center
Prize Pool
Part 1: 12,000 $USDC
Part 2: 12,000 $USDC
Part 3: 26,000 $USDC
Total: 50,000 $USDC