Copyright © 2024 Crunch Lab Inc. All rights reserved.
36 Manchester Drive, Westfield, New Jersey, 07090, United States
Crunch Lab, The Eric and Wendy Schmidt Center, and The Klarman Cell Observatory invite you to join Part 3 of the Autoimmune Disease ML Challenge to design algorithms to help millions of people.
Autoimmune diseases arise when the immune system mistakenly targets healthy cells. They affect 50M people in the U.S., with cases rising globally, and Inflammatory Bowel Disease (IBD) is one of the most prevalent forms. IBD occurs when the barrier between our gut and the microbes living there breaks down, leading to the activation of the immune system and persistent inflammation. The resulting cycle of flares and remission increases the risk of colorectal cancer (up to two-fold). Although modern treatments have improved survival, IBD remains challenging to diagnose and treat due to its complex pathogenic pathways and multifactorial nature.
Pathologists rely on gut tissue images to diagnose and treat IBD, guiding decisions on the most suitable drug treatments and predicting cancer risk. These tissue images, combined with recent advances in genomics, offer a valuable dataset for machine learning models to revolutionize IBD diagnosis and treatment.
This challenge is meant for everyone! We have created a three-lecture crash course that provides background on the biology, technology, and data used in the three Crunches. You do not need a background in biology or medicine to participate.
The challenge is broken down into three Crunches, ordered by increasing complexity.
Crunch 1: Crunchers will build a model to predict the expression of 460 genes in held-out patches of colon tissue using H&E pathology images and Xenium spatial transcriptomics training data. Hematoxylin and Eosin (H&E) images provide insight into cell organization, while Xenium data add information on gene expression and cellular pathways of disease.
Crunch 2: In this phase, participants will predict the expression of all protein-coding genes, including those that were not measured in the spatial training data, using single-cell RNA-seq data as support. This Crunch focuses on leveraging cell transcriptional profiles to enhance the predictive model’s ability to infer the expression of unknown genes in spatial contexts.
Crunch 3: Participants will rank genes by their ability to distinguish between dysplasia (pre-cancerous) regions and noncancerous tissue in IBD patients, increasing our ability to detect cancer early. The final gene panel will be chosen based on participant performance in Crunch 2 and on peer review of participants' methods taking place after the submission deadline. The gene panel will be experimentally validated in a new colon tissue with dysplasia, and all participants' ranked gene lists will be scored.
The best-performing models will be experimentally tested to validate their ability to predict cancer risk, which could lead to early detection and improved treatment options. The best models will be published in an official publication from Broad Institute.
For each Crunch, participants must submit predictions in CSV format. Each submission must adhere to the provided log1p-normalization standards.
Outputs will be evaluated using:
Mean Squared Error (Crunch 1). To avoid overfitting, Crunch will score all submitted models on a private dataset during two Checkpoints.
Spearman’s Correlation (Crunch 2)
Accuracy and Diversity Metrics (Crunch 3)
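The first two metrics in the list above can be illustrated with a short sketch. It assumes the values being compared are already log1p-normalized and uses made-up toy numbers; how the organizers aggregate scores across genes or cells is not stated here, so the helper names `mse` and `spearman` and the per-gene comparison are illustrative assumptions only.

```python
import numpy as np
import pandas as pd


def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean squared error over every entry of the two arrays."""
    return float(np.mean((y_true - y_pred) ** 2))


def spearman(a: np.ndarray, b: np.ndarray) -> float:
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    ranks_a = pd.Series(a).rank().to_numpy()  # ties receive averaged ranks
    ranks_b = pd.Series(b).rank().to_numpy()
    return float(np.corrcoef(ranks_a, ranks_b)[0, 1])


# Hypothetical toy data: 3 cells x 2 genes, passed through log1p.
truth = np.log1p(np.array([[0.0, 3.0], [1.0, 7.0], [4.0, 0.0]]))
pred = np.log1p(np.array([[0.0, 2.0], [2.0, 9.0], [5.0, 1.0]]))

print("MSE over all entries:", mse(truth, pred))
print("Spearman for the first gene:", spearman(truth[:, 0], pred[:, 0]))
```

Spearman depends only on the ordering of values, so a prediction can be off in magnitude and still score well as long as it preserves the ordering across cells.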
Performance will be evaluated through:
Accuracy in gene expression prediction (Crunch 1 & 2)
Gene panel design for distinguishing between noncancerous and dysplasia regions (Crunch 3)
Diversity of selected gene programs in Crunch 3, with extra emphasis on identifying unique biological pathways
Peer review of the methods used to select the dysplasia gene panel in Crunch 3
Crunchers are encouraged to use publicly available external resources, including gene expression datasets and pre-trained models, as long as they are properly credited.
A list of potential resources and references is provided in the full challenge specifications.
Foundry Institute offers a computing environment with a $10 USD allowance, equivalent to around 10 hours of GPU time.
In Crunch 3, your task is to design a gene panel that best distinguishes dysplasia regions from noncancerous mucosa regions in colon tissue affected by Inflammatory Bowel Disease (IBD). Using provided H&E images annotated by pathologists and single-cell RNA sequencing (scRNA-Seq) data, you will rank 18,615 protein-coding genes based on their ability to discriminate between these disease states.
If you participated in Crunch 1 or Crunch 2, you may leverage your previously developed models to make gene expression predictions on the annotated regions and design your gene panel based on these predictions. If not, you can design your gene panel from scratch using biological insights or other approaches.
Additionally, you are required to:
Provide a justification for how you constructed your gene panel.
Review three submissions from other participants based on their justifications.
First H&E Image: Includes only noncancerous mucosa (already provided in Crunch 1 and Crunch 2).
Second H&E Image: Entire colon tissue section including both dysplasia and noncancerous mucosa regions (UC9_I-crunch3-HE.tif).
Associated Files:
Nucleus Segmentation Masks.
Tissue Region Masks with Annotations (UC9_I-crunch3-HE-dysplasia-ROI.tif):
0: Other tissue regions.
1: Noncancerous mucosa.
2: Dysplasia.
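As a rough illustration of the label encoding above, the sketch below counts pixels per region and locates dysplasia pixels in a tiny stand-in array. The array values are invented; in practice the mask would be read from UC9_I-crunch3-HE-dysplasia-ROI.tif with an image library (for example `tifffile.imread`, which is an assumption about tooling rather than something the challenge prescribes).

```python
import numpy as np

# Label meanings as listed in the challenge description.
REGION_LABELS = {0: "other tissue", 1: "noncancerous mucosa", 2: "dysplasia"}

# Hypothetical 3 x 4 stand-in for the real region mask.
mask = np.array(
    [
        [0, 0, 1, 1],
        [0, 1, 1, 2],
        [0, 1, 2, 2],
    ],
    dtype=np.uint8,
)

# Count how many pixels carry each label.
values, counts = np.unique(mask, return_counts=True)
pixels_per_region = {REGION_LABELS[int(v)]: int(c) for v, c in zip(values, counts)}
print(pixels_per_region)

# Row and column coordinates of every dysplasia pixel.
dysplasia_rows, dysplasia_cols = np.nonzero(mask == 2)
print(list(zip(dysplasia_rows.tolist(), dysplasia_cols.tolist())))
```

The same boolean comparison (`mask == 1` or `mask == 2`) can be used to select the nuclei or image patches that fall inside each annotated region.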
Dataset: Crunch3_scRNAseq.h5ad.
Content: Gene expression data for 18,615 protein-coding genes from colon tissue samples with and without dysplasia.
Cell Metadata (adata.obs):
Cell Type: adata.obs["annotation"].
Individual: adata.obs["individual"].
Disease Status: adata.obs["status"] (Normal, Unaffected tissue, Polyp, Adenocarcinoma).
Dysplasia Status: adata.obs["dysplasia"] (y, n, or ND).
Normalized Counts: adata.X (log1p-normalized).
Raw Counts: Available in adata.layers["counts"]
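One simple way to use these fields is to contrast cells labeled `y` and `n` in the dysplasia column. The sketch below is an assumption-laden illustration, not a recommended method: it replaces the real AnnData object with small pandas stand-ins (`obs` for adata.obs, `expr` for a dense view of adata.X), uses invented gene names and values, and scores genes by the absolute difference in mean log1p expression.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for adata.obs (only the dysplasia column is used).
obs = pd.DataFrame(
    {"dysplasia": ["y", "y", "n", "n", "ND"]},
    index=[f"cell{i}" for i in range(5)],
)

# Hypothetical stand-in for adata.X as a dense cells x genes table.
expr = pd.DataFrame(
    np.log1p([[9, 0, 2], [7, 1, 2], [1, 0, 2], [0, 1, 2], [5, 5, 5]]),
    index=obs.index,
    columns=["GENE_A", "GENE_B", "GENE_C"],
)

# Keep only cells with a definite label; "ND" cells are dropped.
labeled = obs["dysplasia"].isin(["y", "n"])
is_dysplasia = obs.loc[labeled, "dysplasia"] == "y"
x = expr.loc[labeled]

# Per-gene difference in mean log1p expression between the two groups.
mean_diff = x[is_dysplasia].mean() - x[~is_dysplasia].mean()
score = mean_diff.abs().sort_values(ascending=False)
print(score)
```

A mean difference ignores variance, cell-type composition, and differences between individuals, all of which are available in the metadata, so it is best read as a starting point.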
Rank all 18,615 protein-coding genes from 1 (best discriminator) to 18,615 (worst), based on their ability to distinguish between dysplasia and noncancerous mucosa regions.
Including genes associated with different biological functions can enhance your gene panel and will be considered in the evaluation.
Your submission should include the following.
Format: A DataFrame returned by your infer function.
Structure:
Index (Rank): Unique integers from 1 (best discriminator) to 18,615 (worst).
Column (Gene Name): Gene symbols matching those provided in the dataset.
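The structure described above could be assembled and checked roughly as follows. This is a sketch under assumptions: the column label `"Gene Name"`, the index name `"Rank"`, and the helpers `build_ranking` and `check_ranking` are illustrative choices, and the exact labels expected from the infer function should be taken from the official specification.

```python
import pandas as pd


def build_ranking(scores: pd.Series, column: str = "Gene Name") -> pd.DataFrame:
    """Turn per-gene scores (higher = better) into a 1-based ranked table."""
    ordered = scores.sort_values(ascending=False, kind="stable")
    ranking = pd.DataFrame({column: ordered.index.to_numpy()})
    ranking.index = pd.RangeIndex(1, len(ranking) + 1, name="Rank")
    return ranking


def check_ranking(ranking: pd.DataFrame, expected_genes, column: str = "Gene Name") -> None:
    """Raise AssertionError if the table breaks the stated structure."""
    n = len(expected_genes)
    assert list(ranking.index) == list(range(1, n + 1)), "ranks must be 1..N"
    assert ranking[column].is_unique, "each gene may appear only once"
    assert set(ranking[column]) == set(expected_genes), "genes must match the dataset"


# Hypothetical scores for three made-up gene symbols.
toy_scores = pd.Series({"GENE_A": 1.8, "GENE_B": 0.0, "GENE_C": 0.4})
toy_ranking = build_ranking(toy_scores)
check_ranking(toy_ranking, ["GENE_A", "GENE_B", "GENE_C"])
print(toy_ranking)
```

For the real submission, the expected gene list would be the 18,615 gene symbols in the provided dataset.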
File: REPORT.md
Length: Maximum 1 page.
Content:
Method Description: Explain how your method works. (5-10 sentences)
Rationale: Describe the reasoning behind your gene panel design. (5-10 sentences)
Data and Resources Used: Specify the datasets and any other resources utilized. (5-10 sentences)
References: May be included (not counted toward the page limit).
Hard Requirement: A submission will be rejected outright if the file is missing:
If you are submitting via the CLI, just create a REPORT.md at the root of your submission.
If you are submitting via a Notebook, you must write it in a Markdown Cell.
Learn more about Embed Files.
Only non-empty and non-comment lines are considered.
This is an example of what is expected:
# Method Description
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Aliquam eget augue quis metus viverra vehicula sit amet lacinia odio.
# Rationale
Praesent dignissim ipsum vel leo eleifend, eget pulvinar mauris ornare.
Duis efficitur lectus posuere iaculis dictum.
# Data and Resources Used
Donec feugiat eros vel odio gravida venenatis.
Nam et sem sit amet nisi vestibulum semper bibendum et libero.
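Before submitting, a report like the example above can be checked locally for the three required headings. The sketch below is not the platform's validator. It assumes that a "comment line" means a single-line HTML comment (`<!-- ... -->`), which the text above does not define, and the helper names are hypothetical.

```python
REQUIRED_SECTIONS = ("Method Description", "Rationale", "Data and Resources Used")


def considered_lines(text: str) -> list[str]:
    """Return stripped lines, skipping empty lines and assumed comment lines."""
    kept = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Assumption: a comment is a single-line HTML comment.
        if line.startswith("<!--") and line.endswith("-->"):
            continue
        kept.append(line)
    return kept


def missing_sections(text: str) -> list[str]:
    """List required section titles that have no matching heading line."""
    headings = {
        line.lstrip("#").strip()
        for line in considered_lines(text)
        if line.startswith("#")
    }
    return [title for title in REQUIRED_SECTIONS if title not in headings]


report = """# Method Description
Lorem ipsum dolor sit amet.

<!-- draft note -->
# Data and Resources Used
Donec feugiat eros vel odio gravida venenatis.
"""
print(missing_sections(report))
```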
The report is attached to the submission.
To modify the REPORT.md, you must resubmit with the modified content.
Mandatory Participation: To qualify for prizes, you must review three submissions from other participants.
Purpose: The peer review process is crucial for selecting the most promising gene panels for experimental validation. Your evaluations help identify submissions with strong justifications and innovative approaches, contributing to the advancement of dysplasia research.
Evaluation Criteria:
Assign a score on a 1-3 scale
1 - excellent justification
2 - adequate justification
3 - poor justification
Provide a short explanation (200-400 words) covering:
Rationale of design.
Novelty of design.
Compliance with the required format.
We will assess your submissions based on two key criteria:
Classification Accuracy: We'll use your top 50 genes to train a model that distinguishes between dysplasia and noncancerous mucosa. The better your genes help the model correctly identify these regions, the higher your accuracy score will be. This is the main factor in determining your ranking.
Diversity: We'll also consider the variety of biological functions represented in your gene panel. Including genes from different pathways enhances the panel's usefulness and may provide deeper insights into dysplasia. A more diverse panel is favorable and can help differentiate teams with similar accuracy.
Your final ranking will prioritize classification accuracy, with diversity as a supplementary factor to distinguish between submissions with close accuracy scores.
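To see why the choice of top-ranked genes drives the accuracy score, the sketch below trains a classifier on two different small panels and compares held-out accuracy. Everything here is assumed for illustration: the data are synthetic, the panel has 5 genes instead of 50, and a nearest-centroid classifier stands in for the organizers' model, which is not specified. The diversity criterion is not modeled.

```python
import numpy as np


def nearest_centroid_accuracy(x_train, y_train, x_test, y_test) -> float:
    """Fit one centroid per class, then classify test rows by closest centroid."""
    centroids = np.stack([x_train[y_train == c].mean(axis=0) for c in (0, 1)])
    dists = np.linalg.norm(x_test[:, None, :] - centroids[None, :, :], axis=2)
    return float(np.mean(dists.argmin(axis=1) == y_test))


rng = np.random.default_rng(0)
n_regions, n_genes, top_k = 200, 100, 5

# Synthetic labels: 0 = noncancerous mucosa, 1 = dysplasia.
labels = np.repeat([0, 1], n_regions // 2)
expr = rng.normal(size=(n_regions, n_genes))
expr[labels == 1, :top_k] += 3.0  # only the first top_k genes carry signal

# Random split into training and held-out regions.
order = rng.permutation(n_regions)
train, test = order[:100], order[100:]


def panel_accuracy(gene_idx) -> float:
    """Held-out accuracy when only the given gene columns are used."""
    cols = np.asarray(gene_idx)
    return nearest_centroid_accuracy(
        expr[train][:, cols], labels[train], expr[test][:, cols], labels[test]
    )


good_panel = panel_accuracy(range(top_k))           # informative genes
poor_panel = panel_accuracy(range(50, 50 + top_k))  # pure-noise genes
print("informative panel:", good_panel, "noise panel:", poor_panel)
```

In this toy setting the informative panel should classify held-out regions far better than the noise panel, which mirrors how genes placed in the top ranks determine the accuracy component of the score.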
To validate the most promising gene panels, we will select up to 500 genes for experimental evaluation. This selection will occur via two routes:
Route 1: the top performers from Crunch 2 who also participate in Crunch 3 will have up to 50 of their highest-ranked genes included.
Route 2: the top performers from Crunch 3, determined by peer review and expert evaluations, will also have up to 50 of their top genes included.
We will order a Xenium gene panel comprising these selected genes, reserving a small number of additional genes to identify important cell types in the colon. This panel will be used to perform spatial transcriptomics measurements on a new colon tissue section diagnosed with dysplasia, enabling experimental validation of your gene panels.
Competition Host
Eric and Wendy Schmidt Center
Prize Pool
Part 1: $12,000
Part 2: $12,000
Part 3: $26,000
Total: $50,000