So I'm building (undirected) graphs from a dataset... I can walk the path of the graph and grab all the nodes, no problem.
I've now got to incorporate "manual overrides" whereby nodes can be forcibly added to the graph, or removed into its own subgraph. (edit: note that "manual break" edges are indicative that its nodes should be in separate subgraphs, so if the dataset includes "manual break 3-4" that implies that nodes 3 and 4 should end up in entirely separate subgraphs; no connecting paths)
Challenge I'm having is when the calculated/walked datapoints interfere with manual overrides, and where/how to break the graph into subgraphs.
Goals during subgraphing:
break the least number of calculated edges as necessary to observe the manual overrides
while maintaining the fewest subgraphs as possible (and if the choices is otherwise equal, I'll figure out any remaining ranking criteria)... which is to say that if there is a choice to break a graph into two vs three subgraphs, two is fewer than three.
This example should provide a reasonable demonstration of the issue:

Visually, I can see that the edge between nodes 1 and 4 can be removed, at which point I'm left with two subgraphs, 1-2-3-7 and 4-5-6.
But I can't for the life of me, figure out a set of rules for reaching that determination.
My dataset includes:
| node | node | type |
|---|---|---|
| 1 | 2 | Manual Combine |
| 1 | 4 | Calculated |
| 1 | 7 | Calculated |
| 2 | 3 | Calculated |
| 3 | 4 | Manual Break |
| 3 | 6 | Manual Break |
| 4 | 5 | Calculated |
| 4 | 7 | Manual Break |
| 5 | 6 | Calculated |
| 6 | 7 | Manual Break |
(technically it'll also have the inverse, but the graph is not directional so shouldn't be significant to this post)
Edits:
so far we've got two working approaches
comprehensive approach : step 1, build calculated graph; superimpose manual overrides. step 2, in a loop, calculate shortest path for all manual break overrides (3-4, 3-6, 4-7, 6-7) and identify the most frequent used pathway (1-4 by identifying 3-2-1-4, 3-2-1-4-5-6, 4-1-7, and 6-5-4-1-7, then breaking into each pathway, calculating the max count; ensure that 1-4 and 4-1 are considered equal since this is nondirected graph). Thought here is to find any paths that allow known non-match nodes to exist in the same graph, look for the most used pathway, and break it until there are no paths between known non-matching nodes.
find any single node that connects nonmatching nodes (so in this example node 1 would be a candidate for connecting to both 4 and 7 which cannot share the same subgraph); use a ranking to remove all but one. This is easier to rationalize, since it's known that 1 can't coexist, but doesn't do anything for longer paths.
thought at the moment is to take a quick pass using the second approach, since it's very fast (closer to O(N) performance), then the first approach to find any other corrections.
in theory they both hit the goals : break least number of connections - in approach 2 it's easy because we know only one can survive; for approach 1 it seems likely since it's focusing on the most heavily used pathway... in terms of minimizing subgraphs, I have no way to know/prove whether either approach will maximize the goal or not.
here's an example (unsure if i'm helping or not at this point)

let's say that i'm allowing calculated lookups to find "related animals", which I want to use for making broader classifications; in this case Canine vs Feline.
The calculation determined the black edges (maybe based on animals with similar colors/shapes/sizes/appendages/etc), and possibly some red edges... the red edges, being "I manually said they're different" (because at some point, a person looked at this classification, and determined that cats and foxes are too different, and defined the Manual Break dataset), need to be preserved after subgraphing (example, fox and dog are close enough so sure, but fox and cat are not the same).
so...
G` = (C + MC) - MB
but also ensuring that G` or subgraphs (G``?) do not contain both nodes in the set of MB edges
Your greedy approach will not give optimal solutions in the general case. For example, take the star graph $S_4$ ($1$ is the internal node, 2,3,4,5 are the leaves). If we add $13,14,15,23,24,25$ as Manual Break edges, then the most frequent used pathway is the edge $12$, but the optimum is to remove $13,14,15$. If you are worried that Manual Combine edges match Calculated Break edges, you can replace edges by paths of length 2.
We have 3 sets of edges, $MB$, $MC$ and $C$ (Manual Break, Manual Combine and Calculated). Let $G′=G−MB$ be the graph without Manual Break edges.
If we consider the minimization of the broken Calculated edges, the problem is equivalent to the minimal multicut in $G'$ (we want to find a cut which separate each extremities of each Manual Combine edge). As it is NP-hard, there is probably no polynomial algorithm. See Marie-Christine, Costa & Létocart, Lucas & Roupin, Frédéric. (2005). Minimal multicut and maximal integer multiflow: A survey. European Journal of Operational Research. 162. 55-69. 10.1016/j.ejor.2003.10.037.
If we consider the minimization of the number of classes (connected components), it is also NP-Hard, by reducing graph coloring to it. If we have all edges calculated, then the minimal number of classes is the chromatic index of the graph composed only of the Manual Break edges. If not all edges are calculated, this is no longer true. For example, the graph composed of Calculated edges $12$, $13$, $14$, $15$ and of Manual Break edges $23$, $45$ have at least 2 classes, but the chromatic index of the graph on Manual Break edges is 3.
The minimization of the broken Calculated edges does not imply the minimization of the number of connected components. For example, if we take the graph where the Manual Break edges are $12$ and $34$, and the Calculated edges are $13$, $24$ and three disjoint paths from $1$ to $4$. Then to minimize the number of broken Calculated edges, we need to remove $13$ and $24$, which leaves 3 connected components. If we cut three edges to disconnect the three disjoint paths, then we get only 2 connected components.