I have the following problem I need help solving.
I am running jobs on a cluster, and the run time varies from job to job. I suspect that some of the nodes may not be running optimally.
This cluster has 60 nodes, and each job uses 12 of them. For each job I know:
- The nodes that were used
- How long the job took (specifically, how long inverting a matrix takes), which is my figure of merit
I have run 20 jobs so far. The node sets overlap heavily between jobs, with some variation. I would like to find a way to assign a probability that a given node (say nodeA) is the one that is not running optimally.
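To make the setup concrete, here is a minimal sketch of how I picture the data, together with the naive baseline I could compute myself: for each node, the mean time of the jobs that used it minus the mean time of the jobs that did not. Everything here is an assumption for illustration: the job assignments and timings are synthetic, and the names `node_scores`, `bad_node`, and the 30-second slowdown are hypothetical. Note that this only gives a ranking, not the probability I am after.

```python
import numpy as np


def node_scores(X, times):
    """Mean time of jobs using a node minus mean time of jobs not using it.

    X is a (jobs x nodes) boolean membership matrix, times has one entry per job.
    Nodes used by no job or by every job get NaN, since no comparison is possible.
    """
    X = np.asarray(X, dtype=bool)
    times = np.asarray(times, dtype=float)
    scores = np.full(X.shape[1], np.nan)
    for n in range(X.shape[1]):
        used = X[:, n]
        if used.any() and not used.all():
            scores[n] = times[used].mean() - times[~used].mean()
    return scores


# Synthetic stand-in for my real data (all values below are made up).
rng = np.random.default_rng(0)
n_nodes, n_jobs, nodes_per_job = 60, 20, 12
bad_node = 17  # hypothetical slow node, used only to generate fake timings

X = np.zeros((n_jobs, n_nodes), dtype=bool)
for j in range(n_jobs):
    X[j, rng.choice(n_nodes, size=nodes_per_job, replace=False)] = True

# Assumed timing model: a base time, a fixed penalty if the slow node is used, plus noise.
times = 100.0 + 30.0 * X[:, bad_node] + rng.normal(0.0, 2.0, size=n_jobs)

scores = node_scores(X, times)
# Rank nodes from most to least suspicious; nodes with NaN scores go last.
ranking = np.argsort(-np.nan_to_num(scores, nan=-np.inf))
print("Most suspicious nodes:", ranking[:5])
```

With only 20 jobs and 60 nodes, nodes that happen to share jobs with a slow node also pick up a high score, which is why a raw difference of means does not feel sufficient to me.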
How should I go about doing this?