Part 2: Autoimmune Disease Machine Learning Challenge

Crunch Lab, The Eric and Wendy Schmidt Center, and The Klarman Cell Observatory invite you to join Part 2 of the Autoimmune Disease ML Challenge and design algorithms that could help millions of people.

Introduction

Autoimmune diseases arise when the immune system mistakenly targets healthy cells. They affect an estimated 50M people in the U.S., and cases are rising globally. Inflammatory Bowel Disease (IBD) is one of the most prevalent forms. IBD occurs when the barrier between our gut and the microbes living there breaks down, leading to the activation of the immune system and persistent inflammation. The resulting cycle of flares and remission increases the risk of colorectal cancer (up to two-fold). Although modern treatments have improved survival, IBD remains challenging to diagnose and treat due to its complex pathogenic pathways and multifactorial nature.

Pathologists rely on gut tissue images to diagnose and treat IBD, guiding decisions on the most suitable drug treatments and predicting cancer risk. These tissue images, combined with recent advances in genomics, offer a valuable dataset for machine learning models to revolutionize IBD diagnosis and treatment.

Read the full competition specifications here.

This challenge is meant for everyone! We have created a three-lecture crash course that provides background on the biology, technology, and data in the three crunches. You do not need a background in biology or medicine to participate.

Find the lecture crash course here.

Phases

The challenge is broken down into three Crunches, ordered by increasing complexity.

Crunch 1 – Oct 28 to Feb 9 – Predict gene expression in spatial transcriptomics data from matched pathology images

Crunchers will build a model to predict the expression of 460 genes in held-out patches of colon tissue using H&E pathology images and Xenium spatial transcriptomics training data. Hematoxylin and Eosin (H&E) images provide insight into cell organization, while Xenium data add information on gene expression and cellular pathways of disease.

Crunch 2 – Nov 18 to Mar 21 – Predicting Unseen Genes

In this phase, participants will predict the expression of all protein-coding genes, including those that were not measured in the spatial training data, using single-cell RNA-seq data as support. This Crunch focuses on leveraging cell transcriptional profiles to enhance the predictive model’s ability to infer the expression of unknown genes in spatial contexts.

Crunch 3 – Dec 9 to Apr 30 (submission deadline) / May 15 (peer review deadline) – Identifying Gene Markers for Pre-cancerous Regions

Participants will rank genes by their ability to distinguish between dysplasia (pre-cancerous) regions and noncancerous tissue in IBD patients, increasing our ability to detect cancer early. The final gene panel will be chosen based on participant performance in Crunch 2 and on peer review of participants' methods taking place after the submission deadline. The gene panel will be experimentally validated in a new colon tissue with dysplasia, and all participants' ranked gene lists will be scored.
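As a rough illustration of what "ranking genes by their ability to distinguish" two kinds of regions can mean, the sketch below orders genes by the absolute difference in mean log1p expression between dysplasia and noncancerous cells. This is a simple stand-in baseline, not the organizers' scoring or a recommended method; the function name, gene names, and toy values are all hypothetical.

```python
import numpy as np

def rank_genes_by_mean_difference(expr, is_dysplasia, gene_names):
    """Order genes from most to least separating, using the absolute
    difference in mean expression between the two groups of cells."""
    expr = np.asarray(expr, dtype=float)
    mask = np.asarray(is_dysplasia, dtype=bool)
    diff = expr[mask].mean(axis=0) - expr[~mask].mean(axis=0)
    order = np.argsort(-np.abs(diff))  # largest absolute difference first
    return [gene_names[i] for i in order]

# Toy data: 4 cells x 3 genes (log1p-normalized values, invented).
expr = np.array([
    [2.0, 0.1, 1.0],   # dysplasia
    [2.2, 0.0, 1.1],   # dysplasia
    [0.1, 0.2, 1.0],   # noncancerous
    [0.0, 0.1, 1.2],   # noncancerous
])
is_dysplasia = [True, True, False, False]
gene_names = ["GENE_A", "GENE_B", "GENE_C"]  # hypothetical names

ranked = rank_genes_by_mean_difference(expr, is_dysplasia, gene_names)
```

A mean difference ignores variance and group size, so a real submission would likely need a more robust statistic and some way to account for the diversity criterion described later.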

The best-performing models will be experimentally tested to validate their ability to predict cancer risk, which could lead to early detection and improved treatment options. The best models will be published in an official publication from the Broad Institute.

Participant Output Requirements

For each Crunch, participants must submit predictions in CSV format. Each submission must adhere to the provided log1p-normalization standards.

Outputs will be evaluated using:

Mean Squared Error (Crunch 1). To avoid overfitting, Crunch will score all submitted models on a private dataset during two Checkpoints.
Spearman’s Correlation (Crunch 2)
Accuracy and Diversity Metrics (Crunch 3)
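For reference, Mean Squared Error over a cells-by-genes matrix can be computed as in the minimal sketch below. The helper name and the toy arrays are illustrative assumptions; the official scoring code may aggregate differently (for example per gene or per region).

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average squared difference over all cells and genes."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.mean((y_true - y_pred) ** 2))

# Toy example: 2 cells x 3 genes of log1p-normalized expression (invented).
truth = np.array([[0.0, 1.0, 2.0], [1.0, 0.0, 3.0]])
pred = np.array([[0.0, 1.5, 2.0], [1.0, 0.0, 2.0]])
mse = mean_squared_error(truth, pred)
```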

Evaluation Criteria

Performance will be evaluated through:

Accuracy in gene expression prediction (Crunch 1 & 2)
Gene panel design for distinguishing between noncancerous and dysplasia regions (Crunch 3)
Diversity of selected gene programs in Crunch 3, with extra emphasis on identifying unique biological pathways
Peer review of methods to select dysplasia gene panel in Crunch 3

External Resources

Crunchers are encouraged to use publicly available external resources, including gene expression datasets and pre-trained models, as long as they are properly credited.

A list of potential resources and references is provided in the full challenge specifications.

Foundry Institute offers a computing environment with a $10 USD credit, equivalent to around 10 hours of GPU time.

Find ML Foundry documentation here.

Overview

In Crunch 2, your task is to predict the expression levels of genes that were not measured in a spatial transcriptomics dataset. You will use both spatial data and single-cell RNA sequencing (scRNA-Seq) data from similar colon tissue samples to make these predictions.
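One simple way to think about this task is nearest-neighbor transfer: match each spatial cell to the most similar single-cell profiles on the genes both datasets share, then borrow the unmeasured genes from those matches. The sketch below illustrates that idea with NumPy on random toy arrays. It is a baseline under stated assumptions, not the organizers' method: it assumes you already have measured or predicted values for the shared genes of every spatial cell, and the function and variable names are hypothetical.

```python
import numpy as np

def knn_impute(spatial_shared, sc_shared, sc_unseen, k=3):
    """Predict unseen genes for spatial cells by averaging the k most
    similar single-cell profiles (cosine similarity on shared genes)."""
    def unit_rows(m):
        norms = np.linalg.norm(m, axis=1, keepdims=True)
        return m / np.clip(norms, 1e-12, None)

    # Cosine similarity between every spatial cell and every single cell.
    sim = unit_rows(spatial_shared) @ unit_rows(sc_shared).T
    # Indices of the k most similar single cells for each spatial cell.
    nearest = np.argsort(-sim, axis=1)[:, :k]
    # Average the unseen-gene expression of those neighbors.
    return sc_unseen[nearest].mean(axis=1)

# Toy data: 50 single cells with 5 shared genes and 4 unseen genes.
rng = np.random.default_rng(0)
sc_shared = rng.random((50, 5))
sc_unseen = rng.random((50, 4))
# Spatial cells that happen to copy the first 10 single-cell profiles.
spatial_shared = sc_shared[:10].copy()

pred_k1 = knn_impute(spatial_shared, sc_shared, sc_unseen, k=1)
pred_k3 = knn_impute(spatial_shared, sc_shared, sc_unseen, k=3)
```

With k=1 each toy spatial cell recovers the unseen genes of the single cell it was copied from, which is a quick sanity check that the matching works before moving to real data.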

X (Input Data)

Spatial Data: The .zarr data provided in Crunch 1.
scRNA-Seq Data: The Crunch2_scRNAseq.h5ad file contains gene expression data for 18,615 protein-coding genes, including the 460 genes in the Spatial Data object.

Y (Targets)

Gene Expression Predictions: The expression levels of 2,000 genes.

Example of expression values for 2,000 genes, filled with random values.

Evaluation Phases

In Crunch 2, you will have the opportunity to evaluate your model’s predictive performance on a validation dataset, before submission of your test dataset predictions.

There will be checkpoints every:

Friday — to get your scores before the weekend
Monday — to see how your weekend work stacks up

The final submission must be submitted by March 21st at 17:59 Eastern Time.

The single-cell transcriptomic datasets

Colon tissue samples similar to those profiled by Xenium spatial transcriptomics.

These datasets contain single-cell gene expression data for 18,615 protein-coding genes, including the 460 genes in the Spatial Data.

Datasets Included

We provide datasets (atlases) from multiple studies to represent all the cell types that are found in the colon tissue.

UC Dataset: An atlas of ulcerative colitis patients, including inflamed, non-inflamed, and healthy colon tissue.
ENS Dataset: An atlas of the enteric nervous system, including glial cells and neurons innervating the colon.
Muscle Dataset: An atlas of the colon muscle layer.

Data Format

Provided as an AnnData object stored in an h5ad file: Crunch2_scRNAseq.h5ad.

Cell Metadata scRNA-Seq.obs

Cell Type: scRNA-Seq.obs["annotation"]
Study: scRNA-Seq.obs["study"]
Individual: scRNA-Seq.obs["individual"]
Disease Status: scRNA-Seq.obs["status"]
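The metadata table of an AnnData object is a pandas DataFrame, so the four fields above can be explored with ordinary pandas calls. The sketch below uses a small hand-made table in place of the real scRNA-Seq.obs; the column names come from the list above, while the cell IDs, cell type labels, and status values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for scRNA-Seq.obs. Column names follow the challenge
# description; every value below is invented.
obs = pd.DataFrame(
    {
        "annotation": ["Enterocyte", "Glia", "Enterocyte", "Smooth muscle"],
        "study": ["UC", "ENS", "UC", "Muscle"],
        "individual": ["P1", "P2", "P3", "P4"],
        "status": ["Inflamed", "Healthy", "Healthy", "Healthy"],
    },
    index=["cell_0", "cell_1", "cell_2", "cell_3"],
)

# How many cells of each type does each study contribute?
cells_per_type_and_study = pd.crosstab(obs["annotation"], obs["study"])

# Select the cells with a given disease status.
healthy_cells = obs.index[obs["status"] == "Healthy"]
```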

Expression Data scRNA-Seq.X: log1p-normalized counts

Original raw counts per cell are divided by the sum of counts per cell, multiplied by 10,000, and then log1p-transformed.
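The normalization described above can be written in a few lines of NumPy. This is a minimal sketch on a dense toy matrix; the function name is hypothetical, and the guard for empty cells is an added assumption rather than something the challenge specifies.

```python
import numpy as np

def log1p_normalize(counts, target_sum=10_000):
    """Scale each cell (row) to target_sum total counts, then apply log1p."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # assumption: leave empty cells as all zeros
    return np.log1p(counts / totals * target_sum)

# Toy raw counts: 2 cells x 3 genes (invented values).
raw = np.array([[1, 3, 0], [0, 0, 5]])
norm = log1p_normalize(raw)

# Undoing the log recovers rows that sum to 10,000.
row_sums = np.expm1(norm).sum(axis=1)
```

If you work with scanpy, `sc.pp.normalize_total(adata, target_sum=1e4)` followed by `sc.pp.log1p(adata)` performs a comparable transformation on an AnnData object.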

To clarify the structure of the CSR matrix, the scRNAseq.X matrix can be displayed in DataFrame format, with one row per stored value.

The columns in the DataFrame are as follows:

Row: the row index corresponding to the observation index, accessible via scRNAseq.obs
Column: the column index corresponding to the gene index, accessible via scRNAseq.var
Value: normalized, log-transformed gene expression counts
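A Row / Column / Value table like the one described above can be produced from any SciPy CSR matrix by converting it to coordinate (COO) format. The sketch below uses a tiny invented matrix in place of scRNAseq.X.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-in for scRNAseq.X: 3 cells x 3 genes, mostly zeros.
dense = np.array([
    [0.0, 1.2, 0.0],
    [0.7, 0.0, 0.0],
    [0.0, 0.0, 2.3],
])
X = sparse.csr_matrix(dense)

# COO format exposes one (row, column, value) triplet per stored entry.
coo = X.tocoo()
triplets = pd.DataFrame(
    {"Row": coo.row, "Column": coo.col, "Value": coo.data}
)
```

Only non-zero entries appear in the table, which is why the sparse representation is much smaller than a dense cells-by-genes array.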

Raw Gene Counts — Available in scRNA-Seq.layers["counts"]

Expected Output

The output must be provided as a DataFrame with the following structure:

Index

Contains the cell_id values corresponding to the validation and test groups expected in the SpatialData (.zarr file provided by the infer function).

Columns

Contains 2,000 genes randomly selected from the 18,615 protein-coding genes in the scRNA-Seq data, including the 20 genes already measured by Xenium spatial transcriptomics but excluded from the Spatial Data object.
You can retrieve this list from the Crunch2_gene_list.csv file included in the competition dataset.

Values

Gene expression predictions for each cell and gene.
Predictions must be log1p-normalized and rounded to 2 decimal places.

Refer to the random-submission.ipynb notebook for an example of how to format your submission.
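The structure described above (cell IDs as the index, genes as columns, values rounded to 2 decimals) can be assembled with pandas as in this sketch. The cell IDs and gene names here are hypothetical placeholders, the index name "cell_id" is an assumption, and the random values only stand in for real model output; the notebook remains the authoritative reference for the submission format.

```python
import numpy as np
import pandas as pd

# Hypothetical placeholders: real cell IDs come from the spatial data and
# real gene names from Crunch2_gene_list.csv.
cell_ids = ["cell_a", "cell_b", "cell_c"]
gene_names = ["GENE1", "GENE2"]

# Stand-in for model output on the log1p scale.
rng = np.random.default_rng(42)
raw_pred = rng.random((len(cell_ids), len(gene_names))) * 3

prediction = pd.DataFrame(
    raw_pred,
    index=pd.Index(cell_ids, name="cell_id"),  # index name is an assumption
    columns=gene_names,
)
# log1p-normalized expression cannot be negative; round to 2 decimals.
prediction = prediction.clip(lower=0).round(2)
```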

Evaluation

Your predictions are evaluated on the 20 held-out genes using Spearman’s rank correlation for cells with non-zero expression. For cells with zero expression, a separate metric applies. Scores combine predictions across global and local regions for a balanced final score.
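To check predictions locally, a Spearman-based score can be approximated as below. This sketch follows one possible reading of the description: compute Spearman's rank correlation per cell across genes, skip cells whose true expression is zero for every gene, and average the rest. The separate metric for zero-expression cells and the global/local combination are not specified here and are not implemented; the function name and toy arrays are hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr

def mean_cellwise_spearman(y_true, y_pred):
    """Average per-cell Spearman correlation across genes, skipping cells
    whose true expression is zero for every gene."""
    scores = []
    for true_row, pred_row in zip(np.asarray(y_true), np.asarray(y_pred)):
        if not np.any(true_row):
            continue  # all-zero cell: handled by a separate metric
        if true_row.max() == true_row.min() or pred_row.max() == pred_row.min():
            scores.append(0.0)  # correlation undefined for constant input
            continue
        rho, _ = spearmanr(true_row, pred_row)
        scores.append(rho)
    return float(np.mean(scores)) if scores else float("nan")

# Toy example: 3 cells x 4 genes (invented values).
y_true = np.array([
    [0.0, 1.0, 2.0, 3.0],
    [0.0, 0.0, 0.0, 0.0],   # zero-expression cell, skipped
    [3.0, 2.0, 1.0, 0.5],
])
y_pred = np.array([
    [0.1, 0.9, 2.5, 2.7],   # same ordering as the truth
    [0.2, 0.0, 0.1, 0.0],
    [0.2, 0.4, 0.6, 0.8],   # reversed ordering
])
score = mean_cellwise_spearman(y_true, y_pred)
```

Because Spearman's correlation depends only on rank order, the first toy cell scores perfectly even though its predicted values are not exact.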

Competition Host

Eric and Wendy Schmidt Center

Prize Pool

Part 1: 12,000 $USDC
Part 2: 12,000 $USDC
Part 3: 26,000 $USDC
Total: 50,000 $USDC

Teams

Limited to 8 members
