Copyright © 2024 Crunch Lab Inc. All rights reserved.
36 Manchester Drive, Westfield, New Jersey, 07090, United States
Crunch Lab, The Eric and Wendy Schmidt Center, and The Klarman Cell Observatory invite you to join Part 2 of the Autoimmune Disease ML Challenge to design algorithms to help millions of people.
Autoimmune diseases arise when the immune system mistakenly targets healthy cells. They affect 50M people in the U.S., and cases are rising globally. Inflammatory Bowel Disease (IBD) is one of the most prevalent forms. IBD occurs when the barrier between our gut and the microbes living there breaks down, leading to the activation of the immune system and persistent inflammation. This cycle of flares and remission increases the risk of colorectal cancer (up to two-fold). Although modern treatments have improved survival, IBD remains challenging to diagnose and treat due to its complex pathogenic pathways and multifactorial nature.
Pathologists rely on gut tissue images to diagnose and treat IBD, guiding decisions on the most suitable drug treatments and predicting cancer risk. These tissue images, combined with recent advances in genomics, offer a valuable dataset for machine learning models to revolutionize IBD diagnosis and treatment.
This challenge is meant for everyone! We have created a three-lecture crash course that provides background on the biology, technology, and data in the three crunches. You do not need a background in biology or medicine to participate.
The challenge is broken down into three Crunches, ordered by increasing complexity.
Crunchers will build a model to predict the expression of 460 genes in held-out patches of colon tissue using H&E pathology images and Xenium spatial transcriptomics training data. Hematoxylin and Eosin (H&E) images provide insight into cell organization, while Xenium data add information on gene expression and cellular pathways of disease.
In this phase, participants will predict the expression of all protein-coding genes, including those that were not measured in the spatial training data, using single-cell RNA-seq data as support. This Crunch focuses on leveraging cell transcriptional profiles to enhance the predictive model’s ability to infer the expression of unknown genes in spatial contexts.
Participants will rank genes by their ability to distinguish between dysplasia (pre-cancerous) regions and noncancerous tissue in IBD patients, increasing our ability to detect cancer early. The final gene panel will be chosen based on participant performance in Crunch 2 and on peer review of participants' methods taking place after the submission deadline. The gene panel will be experimentally validated in a new colon tissue with dysplasia, and all participants' ranked gene lists will be scored.
The best-performing models will be experimentally tested to validate their ability to predict cancer risk, which could lead to early detection and improved treatment options. The best models will be published in an official publication from the Broad Institute.
For each Crunch, participants must submit predictions in CSV format. Each submission must adhere to the provided log1p-normalization standards.
Outputs will be evaluated using:
Mean Squared Error (Crunch 1). To avoid overfitting, Crunch will score all submitted models on a private dataset during two Checkpoints.
Spearman’s Correlation (Crunch 2)
Accuracy and Diversity Metrics (Crunch 3)
Performance will be evaluated through:
Accuracy in gene expression prediction (Crunch 1 & 2)
Gene panel design for distinguishing between noncancerous and dysplasia regions (Crunch 3)
Diversity of selected gene programs in Crunch 3, with extra emphasis on identifying unique biological pathways
Peer review of methods to select dysplasia gene panel in Crunch 3
Crunchers are encouraged to use publicly available external resources, including gene expression datasets and pre-trained models, as long as they are properly credited.
A list of potential resources and references is provided in the full challenge specifications.
Foundry Institute offers a computing environment with a $10 USD credit, equivalent to around 10 hours of GPU time.
In Crunch 2, your task is to predict the expression levels of genes that were not measured in a spatial transcriptomics dataset. You will use both spatial data and single-cell RNA sequencing (scRNA-Seq) data from similar colon tissue samples to make these predictions.
Spatial Data: The .zarr data provided in Crunch 1.
scRNA-Seq Data: The Crunch2_scRNAseq.h5ad file contains gene expression data for 18,615 protein-coding genes, including the 460 genes in the Spatial Data object.
Gene Expression Predictions: The expression levels of 2,000 genes.
Example of 2,000 gene expressions with random values.
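One simple way to approach this task is a nearest-neighbour baseline: match each spatial cell to its closest scRNA-Seq cells using the genes both datasets share, then average the neighbours' expression of the unmeasured genes. The sketch below is only an illustration under that assumption, not the organizers' method. The function name `knn_impute`, its parameters, and the toy arrays are all hypothetical, and the dense distance computation only suits small inputs.

```python
import numpy as np

def knn_impute(spatial_shared, sc_shared, sc_target, k=3):
    """Predict unmeasured genes for spatial cells from their k nearest scRNA-Seq cells.

    spatial_shared: (n_spatial, n_shared) expression of shared genes in spatial cells
    sc_shared:      (n_sc, n_shared)      expression of the same genes in scRNA-Seq cells
    sc_target:      (n_sc, n_target)      expression of the genes to predict
    """
    spatial_shared = np.asarray(spatial_shared, dtype=float)
    sc_shared = np.asarray(sc_shared, dtype=float)
    sc_target = np.asarray(sc_target, dtype=float)

    # Squared Euclidean distance between every spatial cell and every scRNA-Seq cell.
    diff = spatial_shared[:, None, :] - sc_shared[None, :, :]
    dist = (diff ** 2).sum(axis=2)

    # Indices of the k closest scRNA-Seq cells for each spatial cell.
    nearest = np.argsort(dist, axis=1)[:, :k]

    # Average the neighbours' expression of the target genes.
    return sc_target[nearest].mean(axis=1)

# Toy data (invented): 4 scRNA-Seq cells, 2 shared genes, 1 target gene.
sc_shared = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
sc_target = np.array([[1.0], [3.0], [10.0], [12.0]])
spatial_shared = np.array([[0.0, 0.4], [5.0, 5.4]])

pred = knn_impute(spatial_shared, sc_shared, sc_target, k=2)
```

In practice the shared genes would be the 460 genes present in both datasets, and a tree-based or approximate neighbour search would replace the dense distance matrix.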
In Crunch 2, you will have the opportunity to evaluate your model’s predictive performance on a validation dataset, before submission of your test dataset predictions.
There will be checkpoints every:
Friday — to get your scores before the weekend
Monday — to see how your weekend work stacks up
The final submission must be submitted by March 21st at 17:59 Eastern Time.
Colon tissue samples similar to those profiled by Xenium spatial transcriptomics.
These datasets contain single-cell gene expression data for 18,615 protein-coding genes, including the 460 genes in the Spatial Data.
We provide datasets (atlases) from multiple studies to represent all the cell types that are found in the colon tissue.
UC Dataset: An atlas of ulcerative colitis patients, including inflamed, non-inflamed, and healthy colon tissue.
ENS Dataset: An atlas of the enteric nervous system, including glial cells and neurons innervating the colon.
Muscle Dataset: An atlas of the colon muscle layer.
Provided as an AnnData object stored in an h5ad file: Crunch2_scRNAseq.h5ad.
Cell Type: scRNAseq.obs["annotation"]
Study: scRNAseq.obs["study"]
Individual: scRNAseq.obs["individual"]
Disease Status: scRNAseq.obs["status"]
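The metadata fields above behave like columns of a pandas DataFrame, so they can be summarized with ordinary pandas calls. The following is a minimal sketch using a hand-built table as a stand-in for scRNAseq.obs. The column names come from the list above, but every value in the table is an invented placeholder, not real content of the dataset.

```python
import pandas as pd

# Toy stand-in for scRNAseq.obs; all values are invented placeholders.
obs = pd.DataFrame(
    {
        "annotation": ["Enterocyte", "Glia", "Enterocyte", "Smooth muscle"],
        "study": ["UC", "ENS", "UC", "Muscle"],
        "individual": ["P1", "P2", "P3", "P4"],
        "status": ["Inflamed", "Healthy", "Healthy", "Healthy"],
    }
)

# Number of cells per annotated cell type.
cells_per_type = obs["annotation"].value_counts()

# Number of cells for each (study, disease status) combination.
by_study_status = obs.groupby(["study", "status"]).size()
```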
Original raw counts per cell are divided by the sum of counts per cell, multiplied by 10,000, and then log1p-transformed.
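The normalization described above can be sketched in a few lines of NumPy. The function name `normalize_log1p` and the toy count matrix are assumptions made for illustration; the arithmetic follows the description (divide by the per-cell total, multiply by 10,000, apply log1p). The guard for empty cells is an added safety measure, not something the description specifies.

```python
import numpy as np

def normalize_log1p(counts, target_sum=10_000):
    """Scale each cell (row) to `target_sum` total counts, then apply log1p."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Added guard: leave all-zero cells at zero instead of dividing by zero.
    totals[totals == 0] = 1.0
    return np.log1p(counts / totals * target_sum)

# Toy raw counts (invented): 2 cells x 3 genes.
raw = np.array([[1, 3, 0],
                [0, 0, 5]])
normalized = normalize_log1p(raw)
```

Applying `np.expm1` to the result recovers the scaled counts, which sum to 10,000 per non-empty cell.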
This representation displays the scRNAseq.X matrix in DataFrame format to clarify the structure of the CSR matrix.
The columns in the DataFrame are as follows:
Row: the row index corresponding to the observation index, accessible via scRNAseq.obs
Column: the column index corresponding to the gene index, accessible via scRNAseq.var
Value: normalized, log-transformed gene expression counts
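To make the Row / Column / Value layout concrete, here is a small sketch that converts a sparse CSR matrix into that long-format DataFrame. The toy matrix stands in for scRNAseq.X and its values are invented; only the non-zero entries appear in the resulting table.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-in for scRNAseq.X (invented values): 2 cells x 3 genes.
dense = np.array([[0.0, 1.2, 0.0],
                  [0.7, 0.0, 2.3]])
X = sparse.csr_matrix(dense)

# COO format exposes one (row, column, value) triplet per stored entry.
coo = X.tocoo()
long_df = pd.DataFrame({"Row": coo.row, "Column": coo.col, "Value": coo.data})

# The triplets are enough to rebuild the original matrix.
rebuilt = sparse.coo_matrix(
    (long_df["Value"], (long_df["Row"], long_df["Column"])), shape=X.shape
).toarray()
```

Row indices map to entries of scRNAseq.obs and column indices to entries of scRNAseq.var, as described above.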
The output must be provided as a DataFrame with the following structure:
Index
Contains the cell_id values corresponding to the validation and test groups expected in the SpatialData (.zarr file provided by the infer function).
Columns
Contains 2,000 genes randomly selected from the 18,615 protein-coding genes in the scRNA-Seq data, including the 20 genes already measured by Xenium spatial transcriptomics but excluded from the Spatial Data object.
You can retrieve this list from the Crunch2_gene_list.csv file included in the competition dataset.
Values
Gene expression predictions for each cell and gene.
Predictions must be log1p-normalized and rounded to 2 decimal places.
Refer to the random-submission.ipynb notebook for an example of how to format your submission.
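As a rough illustration of the expected shape, the sketch below builds a small prediction DataFrame with random values. It is not a copy of the notebook: the cell IDs and gene names are hypothetical placeholders, and a real submission would take its index from the SpatialData cells and its columns from Crunch2_gene_list.csv.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical placeholders; real values come from the competition files.
cell_ids = [101, 102, 103]
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]

# Random non-negative values, log1p-transformed and rounded to 2 decimal places.
random_values = rng.uniform(0, 100, size=(len(cell_ids), len(genes)))
values = np.round(np.log1p(random_values), 2)

prediction = pd.DataFrame(
    values,
    index=pd.Index(cell_ids, name="cell_id"),
    columns=genes,
)
```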
Your predictions are evaluated on the 20 held-out genes using Spearman’s rank correlation for cells with non-zero expression. For cells with zero expression, a separate metric applies. Scores combine predictions across global and local regions for a balanced final score.
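The exact scoring code is not given here, so the following is only a sketch of the non-zero part of the metric: for each gene, Spearman's rank correlation is computed over the cells whose true expression is non-zero, and the per-gene values are averaged. The function name `spearman_nonzero`, the averaging step, and the toy arrays are assumptions. The separate zero-expression metric and the global/local combination are not reproduced.

```python
import numpy as np
from scipy.stats import spearmanr

def spearman_nonzero(y_true, y_pred):
    """Mean per-gene Spearman correlation over cells with non-zero true expression."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    scores = []
    for g in range(y_true.shape[1]):
        mask = y_true[:, g] > 0
        t, p = y_true[mask, g], y_pred[mask, g]
        # Correlation is undefined with fewer than 2 cells or constant values.
        if mask.sum() < 2 or np.ptp(t) == 0 or np.ptp(p) == 0:
            continue
        rho, _ = spearmanr(t, p)
        scores.append(rho)
    return float(np.mean(scores)) if scores else float("nan")

# Toy data (invented): 4 cells x 2 genes; zeros in y_true are ignored per gene.
y_true = np.array([[1.0, 0.0], [2.0, 3.0], [3.0, 1.0], [0.0, 2.0]])
y_pred = np.array([[0.1, 5.0], [0.2, 0.9], [0.3, 0.1], [9.0, 0.5]])

score = spearman_nonzero(y_true, y_pred)
```

Because Spearman's correlation depends only on ranks, a model is rewarded for ordering cells correctly even when its absolute values are off.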
Competition Host
Eric and Wendy Schmidt Center
Prize Pool
Part 1: 12,000 $USDC
Part 2: 12,000 $USDC
Part 3: 26,000 $USDC
Total: 50,000 $USDC