Explainable Graph Neural Network for Computational Chemistry

Annual Review 2025 · Cardiff University

Amir Masoud Nourollah


Overview


  1. The Current State of MLIPs
    Neural potentials, the black-box problem, and systemic extrapolation failure

  2. The Data-Evaluation Paradox
    Why standard benchmarks cannot measure generalisation

  3. GMD-26: A Framework for Systematic Evaluation
    Our benchmark, four tasks, and key results

  4. The DeepSet-MoE
    Proposed architecture for interpretable generalisation

  5. Conclusion & Future Outlook
    Key findings and next steps

Neural Network Architectures in Chemistry

Neural networks in chemistry can be categorised by their resolution and computational cost:

  • Quantum Chemistry Methods (e.g., PauliNet)
  • Semi-Empirical Methods (e.g., NN-xTB)
  • Coarse-Grained Methods (e.g., CGNet)
  • Force Field Methods (e.g., SchNet)


Force field methods, or Machine-Learned Interatomic Potentials (MLIPs), are currently the most widely adopted approach. They offer an optimal trade-off between computational complexity and predictive accuracy.

Impact of MLIPs in Computational Chemistry

Bridging the Gap

  • MLIPs maintain ab initio accuracy while matching the efficiency of classical force fields.
  • Scaling Advantage: DFT scales cubically (\(O(N^3)\)), whereas MLIPs scale linearly (\(O(N)\)), preventing cost explosion for large systems.
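The scaling argument can be made concrete with a toy cost model. The prefactors below are hypothetical and only the exponents matter; the point is that the DFT/MLIP cost ratio grows quadratically with system size:

```python
# Toy cost model for the scaling argument: DFT is O(N^3), MLIPs are O(N).
# The prefactors are hypothetical; only the exponents matter here.

def dft_cost(n_atoms, prefactor=1e-3):
    """Hypothetical DFT cost (arbitrary units): cubic in system size."""
    return prefactor * n_atoms ** 3

def mlip_cost(n_atoms, prefactor=1e-3):
    """Hypothetical MLIP cost (arbitrary units): linear in system size."""
    return prefactor * n_atoms

for n in (100, 1_000, 10_000):
    ratio = dft_cost(n) / mlip_cost(n)      # grows as N^2
    print(f"N={n:>6}: DFT costs {ratio:.0e}x the MLIP equivalent")
```

At 10,000 atoms the ratio reaches \(10^8\), which is why large systems are intractable for DFT but routine for MLIPs.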

Computational Breakthroughs

  1. Drastic Acceleration: Achieving speedups of \(10^5\times\) compared to standard DFT.
  2. Overcoming Intractability: Simulations requiring decades of DFT compute time can be completed in days.
  3. Resource Efficiency: High-fidelity MD is now feasible on consumer GPUs rather than massive HPC clusters.

The Ambiguity of Learned Representations

The shift from fixed descriptors to learnable representations has traded transparency for accuracy, creating a fundamental “Black Box” problem:

1. The Latent Space “Black Box”

  • Unlike descriptor-based models (using clear physical quantities), end-to-end architectures map atoms to abstract latent vectors.
  • These high-dimensional representations are opaque; it is effectively impossible to disentangle which physical features the model has learned.

2. The Mapping Problem

  • Even in large-scale Foundation Models, the mapping between data and representation remains hidden.
  • It is unclear if the model has constructed a continuous, valid Chemical Space, or merely memorised disjointed training manifolds.

The Consequence: Systemic Extrapolation Failure

Because the learned representation is opaque, we cannot define its validity domain. This leads to critical failures that standard metrics fail to detect:

1. The Extrapolation Catastrophe

  • Undefined Boundaries: Models cannot signal when they leave the training domain. They “hallucinate” interactions for unseen chemical species.
  • Physics Violation: The opaque latent space often maps valid geometric distortions to unphysical energy surfaces, leading to unstable dynamics in OOD settings.

2. The Evaluation Blind Spot

  • The Interpolation Trap: Standard benchmarks (e.g., MD17, QM9) rely on ID splits. This rewards models for memorising the local manifold rather than learning transferable principles.
  • False Security: High accuracy on these datasets masks the model’s inability to generalise compositionally to larger or more complex systems.
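The difference between the two evaluation regimes can be sketched in a few lines. The molecule records and the `n_fragments` field below are hypothetical stand-ins, not a real dataset schema:

```python
# Toy illustration of random (in-distribution) vs compositional splits.
# The molecule records and "n_fragments" field are hypothetical stand-ins.

import random

molecules = [{"name": f"mol{i}", "n_fragments": 1 + (i % 2)} for i in range(10)]

# Random split: train and test sample the same manifold, so high test
# accuracy can reflect memorisation rather than physical learning.
random.seed(0)
shuffled = random.sample(molecules, len(molecules))
rand_train, rand_test = shuffled[:7], shuffled[7:]

# Compositional split (cf. the Duplication task below): train only on
# single-fragment molecules, test only on duplicated-fragment ones.
comp_train = [m for m in molecules if m["n_fragments"] == 1]
comp_test = [m for m in molecules if m["n_fragments"] == 2]
```

The compositional split guarantees the test set lies off the training manifold along a known, auditable axis; a random split guarantees nothing.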

Landscape of Established Datasets

Existing benchmarks were designed for accuracy — not for probing compositional generalisation.

The field has evolved from small, specific datasets to massive, heterogeneous aggregates:

1. Single-System MD
MD17 · rMD17

Trajectories of single small molecules (Ethanol, Aspirin).

  • Focus: Learning the force field of a specific molecule.
  • Limit: Models overfit to “that aspirin” and fail to transfer.

2. Static Equilibrium
QM9 · OE62

Properties for thousands of stable organic molecules.

  • Focus: Predicting properties of relaxed geometries.
  • Limit: Lacks the distorted geometries needed for reaction dynamics.

3. Massive Aggregates
OMat24 · MPtrj

Huge, heterogeneous collections mixing equilibrium and non-equilibrium data.

  • Focus: Training general-purpose Foundation Models.
  • Limit: Unstructured coverage; difficult to audit for systematic OOD gaps.

Gap: Despite the scale of Category 3, we still lack systematic benchmarks designed to specifically stress-test compositional extrapolation.

The Data-Evaluation Paradox

While datasets have scaled massively to create “Universal” models, evaluation protocols remain fundamentally flawed:

1. The Scale of Data

Modern efforts aim for universal coverage using massive, heterogeneous aggregates:

  • Organic: ANI-1x (20M+ conformations).
  • Materials: MPtrj (Inorganic trajectories).
  • Surfaces: Open Catalyst (Millions of relaxations).
→ Assumption: “More data automatically yields better physical understanding.”

2. The Evaluation Deficit

A. The In-Distribution Trap:

  • Random splits test memorisation of the local manifold, not physical learning.

B. The “Ineffective” OOD Solution:

  • Existing benchmarks use arbitrary splits rather than systematic variations.
  • Opacity: They cannot quantify where in chemical space the model fails.

The Critical Gap: We lack a metric to measure how far a model can extrapolate. We define OOD loosely (“unseen”), rather than mathematically (“distance from training manifold”).
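One way to replace the loose definition with a mathematical one is nearest-neighbour distance in a descriptor space. The random descriptors below are stand-ins for any real featurisation (SOAP vectors, learned embeddings, etc.):

```python
# Sketch of a mathematical OOD measure: distance from the training manifold,
# approximated by nearest-neighbour distance in descriptor space. The random
# descriptors are stand-ins for any real featurisation.

import numpy as np

def manifold_distance(query, train):
    """Euclidean distance from each query point to its nearest training point."""
    diffs = query[:, None, :] - train[None, :, :]   # (n_query, n_train, dim)
    return np.linalg.norm(diffs, axis=-1).min(axis=1)

rng = np.random.default_rng(0)
train_desc = rng.normal(0.0, 1.0, size=(100, 8))    # training-set descriptors
ood_desc = rng.normal(3.0, 1.0, size=(5, 8))        # shifted, "unseen" region

d_id = manifold_distance(train_desc[:5] + 0.01, train_desc)   # near-training
d_ood = manifold_distance(ood_desc, train_desc)

print(d_ood.mean() > d_id.mean())   # True: the distribution shift is quantified
```

With such a metric, "OOD" stops being a binary label and becomes a measurable axis along which error growth can be plotted.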

Introducing GMD-26: A Framework for Generalisation

We introduce GMD-26 not merely as a dataset, but as a flexible Benchmarking Framework designed to rigorously quantify generalisation.

1. The Protocol (Domain-Agnostic)

A systematic structure adaptable to diverse fields (e.g., Polymers, Proteins):

  • 4 Systematic Tasks: Defined by specific compositional shifts rather than random splits.
  • 2 Subtasks: Targeted probes for specific failure modes.
  • Philosophy: Testing physical principles (transferable) vs. memorisation.

2. Reference Implementation

Validation via Small Organic Molecules:

  • Tractable Systems: Small molecules provide interpretable, clear failure modes.
  • Multi-Trajectory Challenge: Unlike MD17, we use multiple diverse trajectories. This forces models to learn the potential surface, not just a single path.
  • Fidelity Status: Currently generated at the GFN2-xTB level; an active upgrade to DFT fidelity is underway to support Foundation Models.

Goal: To establish a universal standard for auditing model robustness across any chemical domain.

GMD-26: The Four Systematic Generalisation Tasks

Visual schematic of the four GMD-26 tasks.

Results: The “Duplication” Challenge

Task Definition: Training on molecules with a single fragment A (e.g., Mono-acid) and testing on molecules with two of the same fragment A (e.g., Di-acid).


Model          Forces MAE         Cosine Sim.        F Mag. MAE         Energy MAE (eV)
               ID       OOD       ID       OOD       ID       OOD       ID       OOD
SchNet         0.0292   0.1143    0.9979   0.9691    0.0307   0.1458    0.0295   29.32
PaiNN          0.0020   0.0600    0.9999   0.9768    0.0022   0.0713    0.0181   30.72
GemNet         0.0009   0.0224    0.9999   0.9978    0.0010   0.0223    0.0129   12.27
EquiFormerV2   0.0018   0.0538    0.9999   0.9746    0.0019   0.0603    0.0984   133.77
DimeNet++      0.0018   0.2484    0.9999   0.9406    0.0019   0.3863    0.0128   34.86
MACE           0.0028   0.0171    0.9999   0.9978    0.0030   0.0201    0.0018   40.97
NequIP         0.0256   0.0460    0.9984   0.9922    0.0269   0.0501    0.0198   34.59

Detailed performance metrics on the Fragment Duplication task. The OOD energy MAE column (rightmost) shows catastrophic failure for every model, with errors three to four orders of magnitude above their ID counterparts.

Results: The “Chain Extension” Challenge

Detailed comparison of predicted vs. actual values. Note the divergence in the OOD regions (right side) compared to the tight correlation in ID regions (left side).

Proposed Architecture: The DeepSet-MoE

We implement the DeepSet framework (\(E = \rho\left(\sum_i \phi(x_i)\right)\)), modifying the pipeline to expose interpretable structure and to challenge standard assumptions:
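The DeepSet form can be sketched in a few lines of NumPy. The weights below are random, untrained stand-ins; the point is that sum pooling makes the predicted energy invariant to atom ordering:

```python
# Minimal NumPy sketch of the DeepSet form E = rho(sum_i phi(x_i)). Weights
# are random, untrained stand-ins; sum pooling gives permutation invariance.

import numpy as np

rng = np.random.default_rng(0)
W_phi = rng.normal(size=(16, 32))       # stand-in per-atom encoder weights
w_rho = rng.normal(size=(32,))          # stand-in decoder weights

def phi(atom_features):
    """Per-atom encoder: features -> latent vector."""
    return np.tanh(atom_features @ W_phi)

def energy(atoms):
    """E = rho(sum_i phi(x_i)), with a linear rho for illustration."""
    pooled = phi(atoms).sum(axis=0)     # permutation-invariant aggregation
    return float(pooled @ w_rho)

atoms = rng.normal(size=(5, 16))        # 5 atoms, 16 hypothetical features
permuted = atoms[rng.permutation(5)]
print(np.isclose(energy(atoms), energy(permuted)))  # True
```

The three modifications below target exactly these three stages: \(\phi\), \(\sum\), and \(\rho\).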

1. Encoder (\(\phi\))

Architecture: Mixture of Experts (MoE) replacing the monolithic network.

Refined Routing Metrics:

  • Metric A (Spatial): Experts activated by pairwise distances (replacing the previous dual-cutoff).
  • Metric B (Identity): Experts assigned by central atom type. One expert processes all hierarchies.
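Metric B routing can be sketched as a lookup from atom type to expert. The experts here are hypothetical random linear maps, one per element; real experts would be learned sub-networks:

```python
# Sketch of Metric B routing: each atom is dispatched to the expert for its
# element. Experts are hypothetical random linear maps, one per atom type.

import numpy as np

rng = np.random.default_rng(1)
experts = {z: rng.normal(size=(8, 8)) for z in (1, 6, 7, 8)}  # H, C, N, O

def route(atom_types, features):
    """Process each atom's features with the expert for its atom type."""
    out = np.empty_like(features)
    for i, z in enumerate(atom_types):
        out[i] = features[i] @ experts[z]
    return out

atom_types = [6, 1, 1, 1, 8, 1]         # e.g. methanol: C, four H, O
features = rng.normal(size=(6, 8))      # hypothetical per-atom features
routed = route(atom_types, features)    # shape (6, 8)
```

Because the routing key is a physical quantity (element identity), each expert's parameters can be inspected in isolation, which is the interpretability gain over a monolithic encoder.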

2. Aggregation (\(\sum\))

Role: Pooling atomic states into a global representation.

Current State:

  • Standard Summation.

The Aggregation Hypothesis:

  • Hypothesis: Rigid summation obscures distinct atomic contributions.
  • Experiment: Removing the Aggregation Head entirely.

3. Decoder (\(\rho\))

Role: Mapping latent state to Energy/Force.

Inference:

  • Currently Identity (Direct).
  • Goal: Eliminate ambiguity by inferring directly from Encoder output.

Summary: The Generalisation Crisis

The Diagnosis

We challenged the generalisation capabilities of established SOTA architectures by introducing GMD-26, a benchmark designed to test physical robustness rather than memorisation.


Key Findings:

  • Universal Failure: Every evaluated architecture — from simple invariant models (SchNet) to equivariant transformers (EquiFormerV2) and higher body-order models (MACE) — failed to generalise compositionally.
  • The Magnitude of Error: OOD errors were not merely higher; they were catastrophic (often orders of magnitude larger than ID errors), proving that excellent performance on standard benchmarks masks a critical lack of physical understanding.

Future Outlook: Toward Interpretable Generalisation

To bridge this gap, we are moving away from monolithic “Black Box” networks toward a modular, chemically aligned architecture:

1. The DeepSet-MoE Implementation

  • Specialised Encoding: Replacing the single encoder with a Mixture of Experts routed by optimised physical metrics.
  • Goal: To ensure distinct chemical environments are processed and learned by distinct, specialised parameters.

2. The Aggregation Hypothesis

  • Current Bottleneck: We suspect the rigid summation (\(\sum\)) obscures the distinct share of atomic contributions.
  • Next Step: We are currently investigating alternative aggregation paradigms to identify a strategy that preserves distinct atomic contributions within the global prediction.


Final Conclusion: We aim to establish DeepSet-MoE as an explainable architecture where OOD performance remains close to ID levels, prioritising this generalisation stability over chasing the absolute lowest ID errors seen in SOTA models.

Thank You



Amir Masoud Nourollah

PhD Student · Cardiff NLP & Knowledge Representation and Reasoning Groups


Supervisors: Prof. Stefano Leoni · Prof. Steven Schockaert


📧 NourollahA@cardiff.ac.uk
🌐 nourollah.me



Preprint available at arxiv.org/abs/2605.08988
