Patient Disease Stage Prediction in Online Health Forum


Online health forums form an important community where patients communicate and provide informational and emotional support to one another. For example, the Breast Cancer Forum has over 10,000 active users, both peer patients and professionals, who discuss cancer.

Knowing the disease status of a (typically anonymous) patient is important for providing precise informational and emotional support. A useful feature of this type of social media is that it often also contains users' longitudinal health data, such as the time of their surgeries, size of tumor, details of chemical radiation, etc. A typical user could list the following longitudinal information:

However, such critical information is not available for all users. Therefore, an important goal is to infer users' health information from their behavior in the forum. One important behavior is the 'traversal among different subforums', where each subforum is dedicated to a health topic, such as diagnosis, chemical radiation, physical therapy, etc. For example, a user who first spends several months in Subforum 'Cancer Diagnose Issues' and then moves to Subforum 'Surgery' has likely been diagnosed with cancer and may be considering surgery as the treatment.

In short, our problem is to use a user's traversal graph to predict the corresponding health status sequence. This can be formulated as a temporal graph-to-sequence mapping problem. The data are provided below.
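
As a rough illustration of what a user's traversal graph could look like, the sketch below turns a visit log into one weighted edge list per month, where an edge (A, B) counts how often a visit to subforum A was followed by a visit to subforum B. The visit log, the monthly granularity, and the function name traversal_snapshots are all assumptions made for illustration; they are not the format of the released data (see readme.txt for that) and not necessarily how the paper constructs its graphs.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical visit log for one user: (date of post, subforum name).
# The subforum names come from the example in the text; the dates are made up.
visits = [
    (date(2018, 1, 5), "Cancer Diagnose Issues"),
    (date(2018, 1, 20), "Cancer Diagnose Issues"),
    (date(2018, 2, 11), "Cancer Diagnose Issues"),
    (date(2018, 2, 25), "Surgery"),
    (date(2018, 3, 3), "Surgery"),
]


def traversal_snapshots(visits):
    """Build one weighted transition graph per month from a visit log.

    Returns a dict mapping (year, month) to a Counter of (source, destination)
    subforum pairs. Assumption: a transition is assigned to the month of the
    destination visit.
    """
    ordered = sorted(visits)  # chronological order
    snapshots = defaultdict(Counter)
    for (_, src), (when, dst) in zip(ordered, ordered[1:]):
        snapshots[(when.year, when.month)][(src, dst)] += 1
    return dict(sorted(snapshots.items()))


snapshots = traversal_snapshots(visits)
for month, edges in snapshots.items():
    print(month, dict(edges))
```

Under these assumptions, the sequence of monthly graphs would be the model input, and the user's health status sequence would be the prediction target.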

Processed Data

Download link: [dyngraph2seq_data]

Data format: See the readme.txt for details.

Data Source

All the data were crawled and processed from the Breast Cancer Forum. Additional details can be found in our paper: [ICDM 2019].


To use these datasets, please cite the paper:

Yuyang Gao, Lingfei Wu, Houman Homayoun, and Liang Zhao. DynGraph2Seq: Dynamic-Graph-to-Sequence Interpretable Learning for Health Stage Prediction in Online Health Forums. The 19th International Conference on Data Mining (ICDM 2019), short paper, (acceptance rate: 18.05%), Beijing, China, to appear.



NSF (sole-PI): III: CAREER: Spatial Network Deep Generative Modeling, Transformation, and Interpretation, $549,656, 2020-2025, National Science Foundation.