ADIA Lab Causal Discovery Challenge

Help solve one of the most critical challenges of modern data science and go beyond mere correlation.

Discovering the causal structure that governs the relationships among variables from their observations is a challenging and valuable problem in many application domains, such as healthcare, economics, social sciences, environmental science, and education. In this competition, the basic building block you are given is a dataset of observations of a set of variables, and your task is to discover the causal directed acyclic graph (DAG) that defines the causal relationships between them.

From a pandas DataFrame to a causal directed acyclic graph (DAG)

Description

The task of this competition is causal discovery: your goal is to find the causal graph (DAG) for each dataset you will be given. To help you in this endeavor, we provide a large number of example datasets together with their corresponding causal DAGs — as the training set — so that you can calibrate your unsupervised discovery methods, or train your prediction models if you prefer a supervised approach. Your causal discovery algorithm has to be designed to take as input a dataset and to output the causal DAG.

All causal graphs in this competition share a specific structure: they have at least two special nodes, X and Y, which are the treatment and the outcome variables, respectively. The treatment variable X has a causal effect on the outcome variable Y. Every other variable/node may or may not influence X and Y and may interfere with the relationship X→Y: it may act as a confounder, a collider, or a mediator on X→Y, be a cause or a consequence of X or of Y, or have no influence at all.
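As a quick sanity check on the structure described above, the following minimal sketch tests whether a candidate graph is acyclic and contains the edge X→Y. The function name `is_valid_graph` and the representation of a graph as a set of (source, target) tuples are assumptions made for illustration; they are not part of the competition tooling.

```python
from collections import deque


def is_valid_graph(nodes, edges):
    """Return True if `edges` form a DAG over `nodes` that contains X -> Y.

    `nodes` is a list of variable names and `edges` a set of
    (source, target) tuples. Both are illustrative conventions.
    """
    if ("X", "Y") not in edges:
        return False

    # Kahn's algorithm: repeatedly remove nodes that have no incoming edge.
    indegree = {node: 0 for node in nodes}
    children = {node: [] for node in nodes}
    for source, target in edges:
        children[source].append(target)
        indegree[target] += 1

    queue = deque(node for node in nodes if indegree[node] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)

    # Every node is removed only if the graph has no cycle.
    return visited == len(nodes)
```

A graph with a cycle such as X→Y→1→X is rejected because its nodes never reach an in-degree of zero.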

The goal of the competition is to estimate the causal graph behind each dataset. The scores are based on accurately identifying the role of all nodes on X→Y.

Both unsupervised and supervised approaches are warmly welcome.

Evaluation

In all datasets, there are two special variables, X and Y, which are the treatment and the outcome. We always assume that there is a causal link from X to Y: X→Y. For each predicted graph, the evaluation metric covers all nodes but considers only the edges (or lack thereof) between each node and X and Y. In other words, the metric assesses the impact of wrongly specified edges involving X and Y.

Each node K (with the exclusion of X and Y) can be in one of these 8 categories:

Confounder: K→X, K→Y, X→Y
Collider: X→K, Y→K, X→Y
Mediator: X→K, K→Y, X→Y
Independent: X→Y (no links to X or Y)
Cause of X: K→X→Y
Consequence of X: X→K, X→Y
Cause of Y: K→Y, X→Y
Consequence of Y: X→Y→K
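The eight categories above can be read off from the four possible edges between K and the pair (X, Y). The sketch below is one possible way to do that mapping; the function name `classify_node` and the edge-set representation are assumptions, and the organizers' actual conversion code may differ. Edge patterns outside the eight listed ones (for example K→X together with Y→K, which would close a cycle through X→Y) raise an error here, which is also an assumption.

```python
def classify_node(edges, k):
    """Map node `k` to one of the eight categories, given a set of
    (source, target) edge tuples. Illustrative helper only."""
    pattern = (
        (k, "X") in edges,  # K -> X
        ("X", k) in edges,  # X -> K
        (k, "Y") in edges,  # K -> Y
        ("Y", k) in edges,  # Y -> K
    )
    categories = {
        (True, False, True, False): "Confounder",
        (False, True, False, True): "Collider",
        (False, True, True, False): "Mediator",
        (False, False, False, False): "Independent",
        (True, False, False, False): "Cause of X",
        (False, True, False, False): "Consequence of X",
        (False, False, True, False): "Cause of Y",
        (False, False, False, True): "Consequence of Y",
    }
    if pattern not in categories:
        raise ValueError(f"edges around node {k!r} match none of the 8 categories")
    return categories[pattern]


# Example graph: 1 is a confounder, 2 is a mediator, 3 is unconnected.
example_edges = {("X", "Y"), ("1", "X"), ("1", "Y"), ("X", "2"), ("2", "Y")}
roles = {k: classify_node(example_edges, k) for k in ["1", "2", "3"]}
```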

Each node in your predicted graph will be tested against its true class and the final scoring metric across all datasets is the multiclass balanced accuracy.
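To illustrate the metric, here is a minimal sketch that assumes the usual definition of multiclass balanced accuracy: the unweighted mean of per-class recall, computed over the classes that occur in the true labels. The function name `balanced_accuracy` is hypothetical, and the official scoring code may handle edge cases differently.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in `y_true`."""
    total = defaultdict(int)    # number of true examples per class
    correct = defaultdict(int)  # correctly predicted examples per class
    for truth, prediction in zip(y_true, y_pred):
        total[truth] += 1
        if truth == prediction:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


y_true = ["Confounder", "Confounder", "Confounder", "Mediator"]
y_pred = ["Confounder", "Confounder", "Mediator", "Confounder"]
score = balanced_accuracy(y_true, y_pred)
```

In this example plain accuracy is 2/4, but the balanced score is lower because the single Mediator node is missed and each class counts equally: recall is 2/3 for Confounder and 0 for Mediator, giving 1/3.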

Participants should submit predicted DAGs for all datasets, and we will transform the predicted DAGs to the corresponding classes for scoring.

Prediction File

For each example_id in the test set, which has the form <dataset_id>_<source_variable>_<target_variable>, you must predict a binary value (0 or 1) representing the absence or presence of a causal link from <source_variable> to <target_variable>. The file should contain a header and have the following format:

   example_id, prediction 
   00000_0_0, 0 
   00000_0_1, 0 
   00000_0_X, 1 
   00000_0_4, 0 
   etc.

For example, the row 01234_X_1, 1 means that, for the test dataset 01234, the participant predicts a causal link from X to variable 1: X→1.
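One way to produce rows in this format is to flatten a predicted adjacency matrix, as in the sketch below. It assumes the prediction is held in a pandas DataFrame whose index holds source variables and whose columns hold target variables. It also assumes that every ordered pair gets a row, including self-pairs such as 00000_0_0, as the sample above suggests. The helper name `adjacency_to_rows` is hypothetical.

```python
import pandas as pd


def adjacency_to_rows(dataset_id, adjacency):
    """Flatten an adjacency DataFrame (rows = sources, columns = targets)
    into one (example_id, prediction) row per ordered pair of variables."""
    rows = []
    for source in adjacency.index:
        for target in adjacency.columns:
            rows.append(
                {
                    "example_id": f"{dataset_id}_{source}_{target}",
                    "prediction": int(adjacency.loc[source, target]),
                }
            )
    return pd.DataFrame(rows)


# Hypothetical prediction for dataset 01234: 1 -> X, 1 -> Y, X -> Y.
names = ["1", "X", "Y"]
adjacency = pd.DataFrame(
    [[0, 1, 1], [0, 0, 1], [0, 0, 0]], index=names, columns=names
)
submission = adjacency_to_rows("01234", adjacency)
csv_text = submission.to_csv(index=False)  # header: example_id,prediction
```

Rows for several datasets can be combined with `pd.concat` before writing a single file.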

Dataset Description

The whole dataset of the competition, training and test sets combined, comprises 47,000 individual datasets, each with 1,000 observations of between 3 and 10 variables. For the training datasets, the corresponding causal graphs are available. Each causal graph is provided via its adjacency matrix, so if the dataset has 8 variables, the adjacency matrix is an 8x8 matrix (which becomes 9x9 in the corresponding CSV file because the variable names are indicated for each row and column), where a value of 1 at position (i, j) means that variable i causes variable j, and a value of 0 means it does not.
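As an illustration of this layout, the sketch below parses a small adjacency matrix with pandas and lists its edges. The CSV content is made up, and the exact file layout (variable names in the first row and first column) is assumed from the description above rather than taken from an actual competition file.

```python
import io

import pandas as pd

# Made-up 3-variable example: 1 -> X, 1 -> Y, X -> Y.
csv_text = """,X,Y,1
X,0,1,0
Y,0,0,0
1,1,1,0
"""

# The first CSV column holds the row labels, so use it as the index.
adjacency = pd.read_csv(io.StringIO(csv_text), index_col=0)
adjacency.index = adjacency.index.astype(str)
adjacency.columns = adjacency.columns.astype(str)

# A value of 1 at (i, j) means that variable i causes variable j.
edges = [
    (source, target)
    for source in adjacency.index
    for target in adjacency.columns
    if adjacency.loc[source, target] == 1
]
```

Casting the labels to strings keeps numeric variable names such as 1 consistent with X and Y.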

Prize

Winners’ rank	Prize value
1st place	$40,000 USD
2nd place	$20,000 USD
3rd place	$10,000 USD
4th place	$5,000 USD
5th place	$5,000 USD
6th place	$5,000 USD
7th place	$5,000 USD
8th place	$3,500 USD
9th place	$3,500 USD
10th place	$3,000 USD

Competition Host: ADIA Lab

Prize Pool: 100,000 $USDC

Teams: Limited to 5 members
