Development of a Multi-Task Learning Framework for Simultaneous Prediction of Protein Secondary Structure, Solvent Accessibility, and Disorder Regions

Keywords: multi-task learning, protein structure prediction, secondary structure, solvent accessibility, intrinsic disorder

March 14, 2026

Accurate prediction of protein structural properties is important for protein function annotation and the design of protein therapeutics. Available protein databases, however, carry fragmented labels: no existing dataset provides labels for secondary structure, solvent accessibility, and disorder regions simultaneously. This lack of comprehensively labeled data stems from the intrinsic limitations of experimental methods and the purpose-oriented design of individual databases, and it makes it difficult to build models that accurately predict all of these properties. In this paper, we present a multi-task learning framework that leverages partially labeled data from three different yet complementary datasets: CB513 (labeled for secondary structure and solvent accessibility), DisProt (labeled for disorder), and PISCES (providing additional sequences). The model uses a shared bidirectional LSTM encoder followed by task-specific attention modules, with uncertainty-weighted loss balancing, to predict all three properties jointly. We trained the framework on 6,056 proteins with fragmented annotations and obtained a Q3 accuracy of 75.6% on secondary structure, an accuracy of 99.99% on solvent accessibility, and an accuracy of 59.7% (46.9% F1-score) on disorder. The multi-task model significantly outperformed our single-task baselines by 5.6% on disorder F1-score, suggesting that shared representations learned from weak, fragmented signals on each task can lead to better accuracy on all tasks.
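To make the idea of training on fragmented labels concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that residues without an annotation are simply skipped when computing each task's loss, and that the task losses are combined using one common formulation of uncertainty weighting, sum over tasks of exp(-s_t) * L_t + s_t, where s_t is a learnable log-variance per task. The abstract does not state the exact form used, and all function names and toy values below are hypothetical.

```python
import math


def masked_cross_entropy(probs, labels):
    """Mean cross-entropy over residues that carry a label.

    probs  : list of per-residue class-probability lists
    labels : list of class indices, or None where the residue is unlabeled
    Returns (loss, n_labeled); loss is 0.0 when nothing is labeled.
    """
    total, count = 0.0, 0
    for p, y in zip(probs, labels):
        if y is None:  # no annotation for this residue: skip it
            continue
        total += -math.log(max(p[y], 1e-12))  # clamp to avoid log(0)
        count += 1
    return (total / count if count else 0.0), count


def uncertainty_weighted_loss(task_losses, log_vars):
    """Combine task losses as sum_t exp(-s_t) * L_t + s_t (assumed form).

    task_losses : dict task -> (loss, n_labeled)
    log_vars    : dict task -> s_t, the per-task log-variance
    Tasks with no labeled residues in the batch are left out entirely.
    """
    total = 0.0
    for task, (loss, n_labeled) in task_losses.items():
        if n_labeled == 0:
            continue
        s = log_vars[task]
        total += math.exp(-s) * loss + s
    return total


# Toy example: a 3-residue protein with secondary-structure labels on two
# residues and no disorder annotation at all (hypothetical values).
probs_ss = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
labels_ss = [0, 1, None]
probs_dis = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
labels_dis = [None, None, None]

losses = {
    "secondary_structure": masked_cross_entropy(probs_ss, labels_ss),
    "disorder": masked_cross_entropy(probs_dis, labels_dis),
}
log_vars = {"secondary_structure": 0.0, "disorder": 0.0}
total = uncertainty_weighted_loss(losses, log_vars)
print(f"combined loss: {total:.4f}")
```

In a real training loop the log-variances would be trainable parameters updated together with the encoder weights. A larger s_t down-weights a noisy task, while the additive s_t term keeps the model from inflating every variance to drive the loss to zero.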