The output image deviates significantly from the input image. No matter how you adjust the generation parameters, the resemblance just isn't there. The results are even worse than a direct alignment using SigLIP. What's the benefit of an LLM here?