Training chat models is not a clean industrial process. Different training runs, even ones using the same datasets, can produce models that differ noticeably in personality, writing style, refusal behavior, evaluation performance, and even political bias.