The Journey Behind My Handwriting Generation Pipeline, Part 2
So, this is a continuation of my previous blog:
https://ssbb3.hashnode.dev/ssbb4
After submitting my paper, with its FID of 240 and Inception Score of 50, to multiple journals, I started getting rejections one after another, all without any comments. That's when I truly understood that publishing a paper can be even harder than writing one. After three or four of these rejections, I began questioning my own writing. I finally showed the rejections to my guide, and her feedback was a game-changer. She pointed out that my methodology was too textual; journals needed a stronger mathematical foundation that explained the why through equations, not just words.
While editing the methodology section, I was forced to take a harder look at why the rendered images had such a high FID. The execution seemed fine, but the results were mediocre, and I realized my approach lacked a clear, highlighted novelty. What exactly was my unique contribution? I started debugging from the top. The YOLO part was solid, so I focused on what came after. I realized that the Contour-based Region Proposal Network I'd built was actually a key innovation, but in my first draft, I'd mentioned it so casually that readers completely missed its significance.
So, I made a tough decision: I ripped everything apart and started from scratch. I removed the GANs and U-Net entirely, determined to find something new that would deliver the human-like handwriting structure I had originally imagined. But my mind went completely blank. When you need to learn, you have to explore, so I went back to the papers I'd cited in my related work section and actually read them properly. The clarity I got was insane.
I created a detailed table comparing what previous papers had done. That's when the idea hit me. In my original pipeline, the images went from CMA-ES to the GAN and then to the U-Net, and the results were unsatisfying. What if I flipped it? What if, after CMA-ES, I sent the results to the U-Net first and then fed those precise masks directly into the GAN?
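To make the reordering concrete, here is a minimal sketch of the data flow only. `cma_es`, `unet` and `gan` are hypothetical placeholders, not the real models: each one just wraps its input in its own name, so the final string shows which stage consumed which stage's output.

```python
from typing import Callable, List

# A stage takes the previous stage's output and returns its own output.
Stage = Callable[[str], str]


def stage(name: str) -> Stage:
    """Hypothetical placeholder stage that only records its name around its input."""
    return lambda x: f"{name}({x})"


def run(stages: List[Stage], x: str) -> str:
    """Feed x through the stages from left to right."""
    for s in stages:
        x = s(x)
    return x


cma_es, unet, gan = stage("cma_es"), stage("unet"), stage("gan")

original = run([cma_es, gan, unet], "glyph")  # GAN sees raw CMA-ES output
flipped = run([cma_es, unet, gan], "glyph")   # GAN sees the U-Net's masks

print(original)  # unet(gan(cma_es(glyph)))
print(flipped)   # gan(unet(cma_es(glyph)))
```

In the flipped order the GAN's input is the U-Net mask rather than the raw CMA-ES result, which is exactly the change the question above proposes.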
Man... the moment I thought of it, I had a feeling it would work. But the thought of investing another one to three weeks only to get mediocre results again was daunting. Then, boom, another idea: why not integrate the U-Net directly into the GAN's generator? We could feed the real masks into the GAN and have it learn to generate fake masks. But after reading more, I knew that to prove something new is better, you need a strong "why." So, I introduced a Genetic Algorithm for dynamic weight calculation, which would both power our ablation studies and give us the undeniable novelty we needed.
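The post doesn't spell out the exact formulation, so here is a minimal sketch, under stated assumptions, of what genetic-algorithm weight search can look like. It assumes each individual is a vector of loss-term weights (three weights summing to 1 here, e.g. for adversarial, mask and edge losses) and that lower fitness is better. `evolve_weights`, `toy_fitness` and `TARGET` are hypothetical names; in a real setup the fitness would come from training and evaluating the GAN (for example a validation FID), which the toy distance-to-target function only stands in for.

```python
import random
from typing import Callable, List, Tuple

Weights = List[float]


def normalize(w: Weights) -> Weights:
    """Rescale positive weights so they sum to 1."""
    total = sum(w)
    return [x / total for x in w]


def evolve_weights(
    fitness: Callable[[Weights], float],  # lower is better, e.g. a validation FID
    n_weights: int,
    pop_size: int = 20,
    generations: int = 50,
    mutation_std: float = 0.05,
    seed: int = 0,
) -> Tuple[Weights, float]:
    rng = random.Random(seed)
    # Initial population: random positive weight vectors summing to 1.
    pop = [normalize([rng.random() + 1e-6 for _ in range(n_weights)])
           for _ in range(pop_size)]
    for _ in range(generations):
        # Selection with elitism: the better half survives unchanged.
        pop.sort(key=fitness)
        parents = pop[: pop_size // 2]
        children: List[Weights] = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover: each weight is taken from one of the two parents.
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            # Gaussian mutation, clipped so every weight stays positive.
            child = [max(1e-6, x + rng.gauss(0.0, mutation_std)) for x in child]
            children.append(normalize(child))
        pop = parents + children
    best = min(pop, key=fitness)
    return best, fitness(best)


# Hypothetical toy fitness: squared distance to made-up "ideal" weights.
TARGET = [0.5, 0.3, 0.2]


def toy_fitness(w: Weights) -> float:
    return sum((x - t) ** 2 for x, t in zip(w, TARGET))


best, score = evolve_weights(toy_fitness, n_weights=3)
print("best weights:", [round(x, 3) for x in best], "fitness:", round(score, 6))
```

Because each fitness evaluation would normally cost a training run, population size and generation count are the knobs that trade search quality against compute. The elitist selection guarantees the best weights found so far are never lost, and comparing runs with and without the evolved weights is one natural way to feed an ablation study.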
And guess what? It worked like my grandma's herbal tea: slow, steady, and miraculously effective. The initial epochs showed a promising FID of 214, but this time I was confident. After about 200 epochs, it plummeted to 0.01! We ended up training two GAN models: one focusing on the entire character image and another specializing only in the character's edges. Both achieved what I'd call state-of-the-art results, with FID scores consistently below 10.
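For readers wondering what these numbers mean: the Fréchet Inception Distance (FID) fits a Gaussian to Inception-v3 features of real images (mean μ_r, covariance Σ_r) and another to features of generated images (μ_g, Σ_g), then measures the Fréchet distance between the two. This is the standard definition, not something specific to this pipeline:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Identical feature distributions give an FID of 0, so lower is better. The score also depends on which images, and how many of them, are compared, so values computed on different evaluation sets (single characters versus fully rendered text, for example) are not always directly comparable.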
When I finally rendered the text, it faithfully reproduced the input handwriting style. The metrics were fantastic: the final generated output settled at an FID of 31 (and yes, I don't know why it's not 0.01 anymore, please don't ask!). I tested it with various input styles, and every time it generated cool, realistic outputs. All the metrics I needed were finally where they should be.
I rewrote the entire paper with this new clarity, boldly highlighting my novel pipeline and directly comparing my results with existing research. This journey produced two completely different versions of the same paper idea, showing that changing the pipeline alone can fundamentally change your output. And that's the story of how persistence, a guide's sharp eye, and a complete rebuild led to the results I had always imagined and can finally be proud of.