Determining slow nodes on a cluster


I have the following problem I need help solving.

I am running jobs on a cluster, and the run time varies from job to job. I suspect that some of the nodes are not performing optimally.

The cluster has 60 nodes, and each job uses 12 of them. For each job I know:

  • The nodes that were used
  • How long the job took (specifically, how long inverting a matrix takes), which is my figure of merit

I have run 20 jobs so far. The sets of nodes used overlap heavily, with some variation from job to job. I would like a way to assign a probability that a given node (say, nodeA) is one that is not running optimally.

How should I go about doing this?
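
To make the setup concrete, here is a minimal sketch of the simplest comparison I can think of. For each node, it takes the mean time of the jobs that used it and subtracts the mean time of the jobs that did not. The function name `node_slowdown_scores` and the toy data (6 nodes, 6 jobs, node 2 adding 5 seconds) are made up for illustration and are not my real measurements. The result is a ranking score, not a probability, so it only shows the shape of the data and does not answer the question.

```python
import numpy as np

def node_slowdown_scores(jobs, times, n_nodes):
    """Score each node by (mean time of jobs using it) - (mean time of jobs not using it).

    jobs  : list of lists of node indices, one list per job
    times : measured time for each job (the figure of merit)
    Nodes that appear in every job or in no job get NaN, since no comparison is possible.
    """
    times = np.asarray(times, dtype=float)
    scores = np.full(n_nodes, np.nan)
    for node in range(n_nodes):
        used = np.array([node in job for job in jobs])
        if used.any() and not used.all():
            scores[node] = times[used].mean() - times[~used].mean()
    return scores

# Hypothetical toy data: base time 10 s, plus 5 s whenever node 2 is in the job.
jobs = [[0, 1, 2], [3, 4, 5], [0, 3, 4], [1, 2, 5], [0, 2, 4], [1, 3, 5]]
times = [15.0, 10.0, 10.0, 15.0, 15.0, 10.0]

scores = node_slowdown_scores(jobs, times, n_nodes=6)
suspect = int(np.nanargmax(scores))  # index of the node with the largest positive score
print(scores, suspect)
```

In the real case this would be called with the 20 recorded jobs and `n_nodes=60`. Nodes that always run together get identical scores and cannot be told apart this way, which is part of why I am asking for a more principled, probabilistic approach.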