Suppose you have many models trained to classify a sequence of logical characters as 'satisfiable' or 'unsatisfiable' (e.g., the sequence "A and not A" is unsatisfiable). Suppose all of these models have a Transformer architecture, so they can handle inputs of variable length without recurrent connections. You are interested in whether a given model can classify examples longer than the ones it was trained on; call a model that can do this a "generalizer".
Now suppose you have 1000 of these models, each already labeled as a generalizer or not, and you want to train another model to perform this classification on a new model outside the training set. One way to do this is with a model that looks like this, where the training examples are entire models labeled "generalizer" or "not generalizer":
The input to this classifier is a random Gaussian vector, and it has several hidden layers that eventually connect to and from the embedded Transformer model. Essentially, it injects noise into the Transformer and tries to diagnose, from the response, whether the embedded model is a generalizer or not.
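To make the architecture concrete, here is a minimal PyTorch sketch of such a wrapper. All class names, layer sizes, and the way noise is fed into the embedded model are my own illustrative assumptions, not something from the post; a real probe would have to match the embedded Transformer's actual input interface (token embeddings, positional encodings, etc.).

```python
import torch
import torch.nn as nn

class GeneralizerProbe(nn.Module):
    """Hypothetical sketch: wraps one embedded Transformer and tries to
    classify it as 'generalizer' / 'not generalizer' from its response
    to an injected Gaussian noise vector."""

    def __init__(self, embedded_model: nn.Module,
                 d_model: int = 64, noise_dim: int = 32):
        super().__init__()
        self.embedded = embedded_model          # the model under test
        self.pre = nn.Sequential(               # hidden layers feeding noise in
            nn.Linear(noise_dim, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model),
        )
        self.post = nn.Sequential(              # hidden layers reading the response
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 1),              # single logit: generalizer or not
        )

    def forward(self, noise: torch.Tensor) -> torch.Tensor:
        # Treat the projected noise as a length-1 "sequence" of embeddings.
        h = self.pre(noise).unsqueeze(1)        # (batch, 1, d_model)
        h = self.embedded(h)                    # pass through the embedded Transformer
        return self.post(h.squeeze(1))          # (batch, 1) logit
```

The probe could then be instantiated once per labeled model and trained on the 1000 (model, label) pairs.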
To train this model, you want to modify the weights and biases of the generalizer-classifier, but not those of the embedded model whose generalizability you're trying to classify. So you want to perform backpropagation through the entire network, but then modify the weights and biases only on the hidden layers that sandwich the embedded Transformer model. Would gradient descent still work in this case? Are there examples where something like this has been done?
