When you train on interleaved text-image data with the CM3 objective, the model can handle any combination of text and images on both the input and output sides.
Thanks to multitask finetuning, we are able to bake the capabilities of InstructPix2Pix, ControlNet, OpenFlamingo, and more into a single model.
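For readers unfamiliar with CM3-style training, here is a rough sketch of the data transform as it is usually described: a span of the interleaved token sequence is cut out, a sentinel is left in its place, and the span is appended at the end after the same sentinel, so a left-to-right model learns to infill with context from both sides. The token strings, the `<mask:0>` sentinel name, and the helper functions are hypothetical placeholders for illustration, not the tokenizer or code used for this model.

```python
import random

MASK = "<mask:0>"  # hypothetical sentinel token name, chosen for illustration


def cm3_transform(tokens, span_start, span_end):
    """Move tokens[span_start:span_end] to the end, leaving a sentinel behind.

    The result is still trained with plain next-token prediction, but the
    moved span is predicted after the model has seen text on both sides of it.
    """
    if not (0 <= span_start < span_end <= len(tokens)):
        raise ValueError("span must be a non-empty range inside the sequence")
    span = tokens[span_start:span_end]
    return tokens[:span_start] + [MASK] + tokens[span_end:] + [MASK] + span


def random_cm3_transform(tokens, rng):
    """Pick a random non-empty span and apply the transform (assumed sampling scheme)."""
    start = rng.randrange(len(tokens))
    end = rng.randrange(start + 1, len(tokens) + 1)
    return cm3_transform(tokens, start, end)


# Toy interleaved document: text tokens plus placeholder image tokens.
doc = ["a", "photo", "of", "<img>", "i1", "i2", "</img>", "on", "a", "beach"]

# Mask out the image tokens so the model must generate them from the text around them.
masked = cm3_transform(doc, 4, 6)
print(masked)
```

Because the masked span can cover text tokens, image tokens, or both, the same objective covers text-to-image, image-to-text, and infilling cases without a separate loss for each.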