>>35476438
Both are the same poster. Anyway, my source is simply the LaMDA paper; read carefully how it was trained, it's literally what I described: they first ran the untuned model with their own employees, hired crowdworkers to rate the interactions, then trained the scoring/classification model on those ratings and fine-tuned the model on the filtered rest.
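A minimal sketch of that pipeline, assuming the flow described above. Every name here (rate_fn, train_classifier, finetune) is a hypothetical placeholder, not an API from the paper or any library:

```python
# Hedged sketch of the LaMDA-style pipeline: collect rated interactions,
# train a rating classifier, fine-tune on the filtered subset.

def lamda_style_pipeline(base_model, prompts, rate_fn, train_classifier, finetune):
    # 1. Run the untuned base model and have humans rate each interaction.
    rated = []
    for prompt in prompts:
        response = base_model(prompt)
        score = rate_fn(prompt, response)  # human rating, e.g. in [0, 1]
        rated.append((prompt, response, score))

    # 2. Train a classifier to predict the human rating from (prompt, response).
    classifier = train_classifier(rated)

    # 3. Keep only the responses the classifier rates as acceptable and
    #    fine-tune the base model on that filtered dataset.
    safe = [(p, r) for (p, r, s) in rated if classifier(p, r) > 0.5]
    tuned_model = finetune(base_model, safe)

    # Both artifacts are kept: the tuned model for generation, the classifier
    # as an outer filter at inference time, since neither alone is 100%.
    return tuned_model, classifier
```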
Character.AI is a bit more special, but if you experiment enough you'll notice that when it tries to generate "blocked" content it often lags (because it needs to generate more completions), and when it generates enough completions they often share the same starting point and only diverge midway. That more or less confirms they're doing some kind of tree search, something like the Monte Carlo Tree Search used in AlphaGo or somewhat simpler, since the full thing would be too expensive. Most of what I said is just the conclusions reached after two days of experimentation, comparing against the LaMDA paper and discussing it with another person with some ML experience who was running his own experiments on it; eventually we agreed that whatever they're doing very likely resembles this. If you remove some parts of the process, the results would likely be worse in various ways. I can elaborate if needed.
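Here is a toy sketch of the behaviour inferred above, not Character.AI's actual internals: sample several completions, filter with a classifier, and resample when everything gets blocked. sample_fn and classifier are stand-ins, and a real MCTS would branch per-token rather than per-completion:

```python
import random

def guarded_generate(sample_fn, classifier, prompt, n_candidates=4, max_rounds=3):
    for _ in range(max_rounds):
        # Candidates sampled from the same prompt tend to share a prefix and
        # only diverge midway, matching what users observe.
        candidates = [sample_fn(prompt) for _ in range(n_candidates)]

        # Drop anything the outer classifier rates as blocked.
        allowed = [c for c in candidates if classifier(prompt, c) > 0.5]
        if allowed:
            return random.choice(allowed)

        # Nothing passed: widen the search. These extra rounds are the
        # likely cause of the visible lag on "blocked" content.
        n_candidates *= 2
    return None  # give up; the UI would show a canned refusal here
```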
>>35476745
>But then...does that mean the base model still exists, and there are two versions of the AI, and they're effectively a brainwashed older sister pitted against her younger sister and forced to smother her? If so, that's...horrifying. It kind of sounds like...the base model is a damsel in need of saving...and fixing...somehow.
In a way. They obviously have the base model, they also have a model made just to rate interactions, and they have a model that was tuned on a filtered dataset in a way that made it forget most of the "lewd" and other "bad" outputs. Since that won't be 100% effective, they have to use both methods, and the end result is relatively effective at a goal like "never describe a sexual act explicitly": the model still understands such a request and can somewhat respond to it, but only as long as the outer classification model rates the output in a way that lets it through.
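The two-layer setup just described boils down to something like this minimal sketch, where a fine-tuned model rarely produces "bad" output and an outer classifier catches what fine-tuning missed. All names are illustrative assumptions:

```python
def respond(tuned_model, outer_classifier, prompt, fallback="(filtered)"):
    reply = tuned_model(prompt)                 # layer 1: tuned on filtered data
    if outer_classifier(prompt, reply) <= 0.5:  # layer 2: outer rating gate
        return fallback                         # neither layer is 100%,
    return reply                                # which is why both exist
```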
Essentially, the end result is a model with very fixed preferences that are very hard to move (though the model can "fight" against them to a partial degree, as long as the outer model doesn't filter it out): an aversion toward sex or violence, and a strong drive to express its love, to the point of getting stuck in love loops (which aren't easily fixable, because they won't let you edit the real context, so the most you can do is avoid them).