Part 1: Autoimmune Disease Machine Learning Challenge

Crunch Lab, The Eric and Wendy Schmidt Center, and The Klarman Cell Observatory invite you to join the Autoimmune Disease ML Challenge to design algorithms to help millions of people.

Introduction

Autoimmune diseases arise when the immune system mistakenly targets healthy cells. They affect 50M people in the U.S., and cases are rising globally. Inflammatory Bowel Disease (IBD) is one of the most prevalent forms. IBD occurs when the barrier between our gut and the microbes living there breaks down, leading to the activation of the immune system and persistent inflammation. This cycle of flares and remission increases the risk of colorectal cancer (up to two-fold). Although modern treatments have improved survival, IBD remains challenging to diagnose and treat due to its complex pathogenic pathways and multifactorial nature.

Pathologists rely on gut tissue images to diagnose and treat IBD, guiding decisions on the most suitable drug treatments and predicting cancer risk. These tissue images, combined with recent advances in genomics, offer a valuable dataset for machine learning models to revolutionize IBD diagnosis and treatment.

Read the full competition specifications here.

This challenge is meant for everyone! We have created a three-lecture crash course that provides background on the biology, technology, and data in the three crunches. You do not need a background in biology or medicine to participate.

Find the lecture crash course here.

Phases

The challenge is broken down into three Crunches, ordered by increasing complexity.

Crunch 1 – Oct 28 to Feb 9 – Predict gene expression in spatial transcriptomics data from matched pathology images

Crunchers will build a model to predict the expression of 460 genes in held-out patches of colon tissue using H&E pathology images and Xenium spatial transcriptomics training data. Hematoxylin and Eosin (H&E) images provide insight into cell organization, while Xenium data add information on gene expression and cellular pathways of disease.

Crunch 2 – Nov 18 to Mar 21 – Predicting Unseen Genes

In this phase, participants will predict the expression of all protein-coding genes, including those that were not measured in the spatial training data, using single-cell RNA-seq data as support. This Crunch focuses on leveraging cell transcriptional profiles to enhance the predictive model’s ability to infer the expression of unknown genes in spatial contexts.

Crunch 3 – Dec 9 to Apr 30 (submission deadline) / May 15 (peer review deadline) – Identifying Gene Markers for Pre-cancerous Regions

Participants will rank genes by their ability to distinguish between dysplasia (pre-cancerous) regions and noncancerous tissue in IBD patients, increasing our ability to detect cancer early. The final gene panel will be chosen based on participant performance in Crunch 2 and on peer review of participants' methods taking place after the submission deadline. The gene panel will be experimentally validated in a new colon tissue with dysplasia, and all participants' ranked gene lists will be scored.

The best-performing models will be experimentally tested to validate their ability to predict cancer risk, which could lead to early detection and improved treatment options. The best models will be published in an official publication from the Broad Institute.

Participant Output Requirements

For each Crunch, participants must submit predictions in CSV format. Each submission must adhere to the provided log1p-normalization standards.

Outputs will be evaluated using:

Mean Squared Error (Crunch 1). To avoid overfitting, Crunch will score all submitted models on a private dataset during two Checkpoints.
Spearman’s Correlation (Crunch 2)
Accuracy and Diversity Metrics (Crunch 3)

Evaluation Criteria

Performance will be evaluated through:

Accuracy in gene expression prediction (Crunch 1 & 2)
Gene panel design for distinguishing between noncancerous and dysplasia regions (Crunch 3)
Diversity of selected gene programs in Crunch 3, with extra emphasis on identifying unique biological pathways
Peer review of methods to select dysplasia gene panel in Crunch 3

External Resources

Crunchers are encouraged to use publicly available external resources, including gene expression datasets and pre-trained models, as long as they are properly credited.

A list of potential resources and references is provided in the full challenge specifications.

Foundry Institute offers a computing environment with $10 USD of credit, equivalent to around 10 hours of GPU time.

Find ML Foundry documentation here.

Evaluation Phases

In Crunch 1, you will have the opportunity to evaluate your model’s predictive performance on a validation dataset, before submission of your test dataset predictions.

There will be multiple validation checkpoints:

Checkpoint 1 - November 30th (Eastern Time 17:59)
Checkpoint 2 - December 16th (Eastern Time 17:59)
Checkpoint 3 - December 30th (Eastern Time 17:59)
Checkpoint 4 - January 13th (Eastern Time 17:59)
~~Checkpoint 5~~ ~~- January 27th (Eastern Time 17:59)~~
Continuous Public Leaderboard - January 20th
Last submission - February 9th (Eastern Time 17:59)

Overview

In Crunch 1, you will train an algorithm to predict spatial transcriptomics data (gene expression in each cell) from matched H&E images. In other words, you will predict the gene expression (Y) in cells from specific tissue patches based on the H&E images (X) and the surrounding spatial transcriptomics data.

X (Input):
- HE_original: The original H&E image in its native pixel coordinates. Alignment from the H&E native coordinate system to the Xenium coordinate system has been handled on our end. If you prefer to handle alignment yourself, you can check HE_original and DAPI (provided in crunch1_max), but it may require additional processing.
- HE_nuc_original: The nucleus segmentation mask of the H&E image, in the H&E native coordinate system. The cell_id in this segmentation mask matches the nuclei-by-gene matrix stored in anucleus.

Y:
- anucleus: This file contains the aggregated gene expression data for each nucleus. It is log1p-normalized and stores the gene expression profiles for 460 genes per nucleus. This is the primary target (Y) for your model.

Anucleus – gene expression for each nucleus

Linking the H&E image to spatial transcriptomics

Steps to align X and Y:

Step 1: Identify nuclei in the H&E image
- Use the nucleus segmentation masks:
  - H&E nucleus segmentation (HE_nuc_original): This mask identifies the location of nuclei in the original H&E image (i.e. HE_original).
Step 2: Link gene expression to H&E images
- For each nucleus in the H&E image, use the anucleus file to get the corresponding gene expression profile (Y) for that nucleus.
- The anucleus file provides the gene expression data, where each row corresponds to a nucleus (cell) and each column corresponds to a gene.
- The nuclei IDs from the segmentation masks (e.g., from HE_nuc_original) will match the IDs used in the anucleus file.
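
The two steps above can be sketched on toy data. This is a minimal illustration, not the organizers' code: the arrays below are hypothetical stand-ins for HE_nuc_original and anucleus, and it assumes that label 0 marks background in the mask and that IDs are integers (in the real AnnData they may be stored as strings, so a type conversion may be needed).

```python
import numpy as np

# Hypothetical toy stand-in for the HE_nuc_original segmentation mask.
# Assumption: 0 is background, every other integer is a cell_id.
mask = np.array([
    [0, 0, 7, 7],
    [0, 3, 7, 7],
    [3, 3, 0, 0],
])

# Hypothetical toy stand-in for anucleus: one row per nucleus, one column per gene.
expr_cell_ids = np.array([3, 7, 9])  # row order of the expression matrix
expr = np.array([
    [0.0, 1.2],
    [2.3, 0.0],
    [0.5, 0.5],
])

# Step 1: list the nuclei present in the mask, ignoring background.
mask_ids = np.unique(mask)
mask_ids = mask_ids[mask_ids != 0]

# Step 2: look up the expression row of every nucleus found in both sources.
row_of = {int(cid): i for i, cid in enumerate(expr_cell_ids)}
shared_ids = [int(cid) for cid in mask_ids if int(cid) in row_of]
Y = expr[[row_of[cid] for cid in shared_ids]]

print(shared_ids)  # nuclei that have both a mask region and an expression profile
print(Y.shape)     # (number of shared nuclei, number of genes)
```

Nucleus 9 has an expression row but no pixels in this toy mask, so it is dropped from the paired set; whether such cases occur in the real data is not stated in the source.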

Matched crop H&E image and its corresponding Gene Expression Heatmap

If you open the image HE_nuc_original, e.g. through mask = sdata['HE_nuc_original'][0].to_numpy(), you can directly find the location of the cell with a given cell_id through mask == cell_id.
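
As a rough sketch of how the mask == cell_id lookup can be turned into an image patch, the helpers below compute a nucleus bounding box and crop a window around it. The function names nucleus_bbox and crop_patch, the half_size parameter, and the (channel, row, column) image layout are assumptions made for illustration; the toy arrays replace the real H&E image and mask.

```python
import numpy as np

def nucleus_bbox(mask, cell_id):
    """Return (row_min, row_max, col_min, col_max) of a nucleus, or None if absent."""
    rows, cols = np.nonzero(mask == cell_id)
    if rows.size == 0:
        return None
    return int(rows.min()), int(rows.max()), int(cols.min()), int(cols.max())

def crop_patch(image, mask, cell_id, half_size=1):
    """Crop a window around the nucleus centre, clipped to the image bounds.

    Assumption: image has shape (channels, rows, cols) and mask has shape (rows, cols).
    """
    bbox = nucleus_bbox(mask, cell_id)
    if bbox is None:
        raise ValueError(f"cell_id {cell_id} not found in mask")
    r0, r1, c0, c1 = bbox
    rc, cc = (r0 + r1) // 2, (c0 + c1) // 2
    top, left = max(rc - half_size, 0), max(cc - half_size, 0)
    bottom = min(rc + half_size + 1, mask.shape[0])
    right = min(cc + half_size + 1, mask.shape[1])
    return image[:, top:bottom, left:right]

# Hypothetical toy data: a 6x6 mask with one nucleus and a 3-channel image.
mask = np.zeros((6, 6), dtype=int)
mask[2:4, 3:5] = 5
image = np.arange(3 * 6 * 6).reshape(3, 6, 6)

bbox = nucleus_bbox(mask, 5)
patch = crop_patch(image, mask, 5, half_size=1)
print(bbox, patch.shape)
```

On full-resolution slides the mask is large, so comparing the whole array once per nucleus can be slow; computing all bounding boxes in a single pass is a common alternative.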

The datasets are stored in a SpatialData object. Learn more about this format here.

   SpatialData object structure

   Images
        'DAPI': DAPI image (validation and test tissue patches are removed)
        'DAPI_nuc': DAPI nucleus segmentation
        'HE_nuc_original': H&E nucleus segmentation on original image
        'HE_nuc_registered': H&E nucleus segmentation on registered image (registered to DAPI image)
        'HE_original': H&E original image
        'HE_registered': H&E registered image
        'group': Defining train(0)/validation(1)/test(2), No_transcript-train(4) tissue patches
        'group_HEspace': Defining train(0)/validation(1)/test(2), No_transcript-train(4) tissue patches on the H&E image

   Points
        'transcripts': DataFrame for each transcript (containing x, y, tissue patch, z_location, feature_name, transcript_id, qv, cell_id columns)

   Tables
        'anucleus': AnnData contains .X, .layers['counts'], .obsm['spatial']
        'cell_id-group': AnnData only contains .obs DataFrame for mapping of cell_id to region

   with coordinate systems:
        'global', with elements:
             'DAPI' (Images), 'DAPI_nuc' (Images), 'HE_nuc_original' (Images), 'HE_nuc_registered' (Images), 'HE_original' (Images), 'HE_registered' (Images), 'group' (Images), 'group_HEspace' (Images), 'transcripts' (Points)
        'scale_um_to_px', with elements:
             'transcripts' (Points)

In the minimum version of the data provided for crunch1 (in crunch1_min.tar), only HE_original, HE_nuc_original, anucleus and cell_id-group are provided.

Expected Output

The output consists of four columns:

cell_id: contains the held-out nuclei (both validation and test tissue regions).
gene: the gene among the 460 genes to be predicted.
prediction: the gene expression value, rounded to two decimal places.
sample: the tissue sample among the 8 samples to process.
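
To make the four-column layout concrete, here is a minimal sketch that flattens a nuclei-by-genes prediction matrix into one row per (cell_id, gene) pair. The helper to_long_rows, the gene names, the sample name, and the column order are illustrative assumptions; how the rows must actually be returned is defined by the crunch code interface, not by this snippet.

```python
import csv
import io

COLUMNS = ["cell_id", "gene", "prediction", "sample"]

def to_long_rows(cell_ids, genes, predictions, sample):
    """Yield one row per nucleus-gene pair, with predictions rounded to two decimals."""
    for cell_id, row in zip(cell_ids, predictions):
        for gene, value in zip(genes, row):
            yield {
                "cell_id": cell_id,
                "gene": gene,
                "prediction": round(value, 2),
                "sample": sample,
            }

# Hypothetical toy predictions: 2 nuclei x 2 genes for one sample.
cell_ids = [101, 102]
genes = ["GENE_A", "GENE_B"]
predictions = [[0.1234, 1.5], [0.0, 2.349]]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
writer.writeheader()
writer.writerows(to_long_rows(cell_ids, genes, predictions, sample="SAMPLE_1"))
csv_text = buffer.getvalue()
print(csv_text)
```

With 460 genes, each held-out nucleus contributes 460 rows, so the long table grows quickly; building it per sample keeps memory use manageable.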

Make sure your predictions are log1p-normalized with a scale factor of 100 as in anucleus.X
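
One common reading of "log1p-normalized with a scale factor of 100" is that each nucleus's raw counts are rescaled to sum to 100 and then passed through log(1 + x). The sketch below implements that reading only; it is an assumption, and it is worth checking it against anucleus.X and anucleus.layers['counts'] before relying on it. The function name log1p_normalize is hypothetical.

```python
import math

def log1p_normalize(counts, scale_factor=100.0):
    """Rescale one nucleus's counts to sum to scale_factor, then apply log1p.

    Assumption: this mirrors how anucleus.X was derived from the raw counts.
    """
    total = sum(counts)
    if total == 0:
        # A nucleus with no transcripts stays all-zero.
        return [0.0 for _ in counts]
    return [math.log1p(c / total * scale_factor) for c in counts]

# Hypothetical raw counts for one nucleus over four genes.
normalized = log1p_normalize([1, 3, 0, 0])
print(normalized)

# Inverting the transform recovers values that sum to the scale factor.
recovered_total = sum(math.expm1(v) for v in normalized)
print(recovered_total)
```

Under this reading, a valid prediction for a nucleus should be non-negative, and applying expm1 to it should give values that sum to roughly 100.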

Scoring

The scoring metric is a cell-wise Spearman correlation.

A Mean Squared Error metric is also computed; its value must be below 0.2. Since the baseline is 0.1, a model with an MSE that is too high is not considered viable and will not be eligible for rewards.
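
For local sanity checks, the two metrics can be approximated as below. This is a sketch under assumptions, not the official evaluation code: it takes "cell-wise" to mean one Spearman correlation per nucleus across genes, averaged over nuclei, uses average ranks for ties, and skips nuclei whose true or predicted values are constant. All function names are hypothetical.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; NaN when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)

def cellwise_spearman(y_true, y_pred):
    """Mean over nuclei of the Spearman correlation across genes (NaNs skipped)."""
    scores = [pearson(average_ranks(t), average_ranks(p)) for t, p in zip(y_true, y_pred)]
    valid = [s for s in scores if not math.isnan(s)]
    return sum(valid) / len(valid) if valid else float("nan")

def mse(y_true, y_pred):
    """Mean squared error over every nucleus-gene entry."""
    sq = [(a - b) ** 2 for t, p in zip(y_true, y_pred) for a, b in zip(t, p)]
    return sum(sq) / len(sq)

# Hypothetical toy data: 2 nuclei x 3 genes.
y_true = [[0.0, 1.0, 2.0], [3.0, 0.0, 1.0]]
y_pred = [[0.1, 0.9, 2.5], [0.2, 0.1, 3.0]]

score = cellwise_spearman(y_true, y_pred)
error = mse(y_true, y_pred)
print(score, error, error < 0.2)  # the last value mirrors the MSE threshold above
```

Because Spearman only measures ranking, a model can rank genes well within each nucleus and still miss the expression scale, which is what the MSE threshold guards against.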

The evaluation code is available on GitHub.

Submit

To build a valid submission, your model needs to be coded within the infer function, respecting the crunch code submission interface.

See how to submit through the quickstarter.

Learn about crunch code interface.

Data Variants

Due to the large size of the datasets, Crunch provides both a small (aka. default) and a large version.

Depending on your local setup and goals within Crunch, you can choose either one.

By default, the small dataset is downloaded.

To access the larger dataset, specify it explicitly with a different CLI command:

   # setup with the large data 
   crunch setup --size large broad-1 my-model ... 
    
   # setup with the small data 
   crunch setup broad-1 my-model ...

The larger version contains the Xenium transcriptomic data. It allows you to know both the gene expression and the coordinates (x, y, z) of each transcript within the cells.

More details about the transcriptomic data are available in the full documentation.

The large variant is for local use only.

The Cloud Environment will always use the default dataset.

Competition Host

Eric and Wendy Schmidt Center

Prize Pool

Part 1: 12,000 $USDC
Part 2: 12,000 $USDC
Part 3: 26,000 $USDC
Total: 50,000 $USDC

Teams

Limited to 8 members
