Machine Learning Basics

The Journey Behind My Handwriting Generation Pipeline-2

Manikanta SSB — Sat, 10 Jan 2026 19:17:34 GMT

So, as a continuation of my previous blog...

https://ssbb3.hashnode.dev/ssbb4

After submitting my document with its 240 FID and 50 Inception score to multiple journals, I started getting rejections—one after another, all without any comments. That's when I truly understood that publishing a paper can be even harder than writing one. After three or four of these rejections, I began questioning my own writing. I finally showed the reviews to my guide, and her feedback was a game-changer. She pointed out that my methodology was too textual; journals needed a stronger mathematical foundation, explaining the why through equations, not just words.

While editing the methodology section, I was forced to take a harder look at why the rendered images had such a high FID. The execution seemed fine, but the results were mediocre, and I realized my approach lacked a clear, highlighted novelty. What exactly was my unique contribution? I started debugging from the top. The YOLO part was solid, so I focused on what came after. I realized that the Contour-based Region Proposal Network I'd built was actually a key innovation, but in my first draft, I'd mentioned it so casually that readers completely missed its significance.

So, I made a tough decision: I ripped everything apart and started from scratch. I removed the GANs and U-Net entirely, determined to find something new that would deliver the human-like handwriting structure I had originally imagined. But my mind went completely blank. When you need to learn, you have to explore, so I went back to the papers I'd cited in my related work section and actually read them properly. The clarity I got was insane.

I created a detailed table comparing what previous papers had done. That's when the idea hit me. In my original pipeline, I was sending images to the GAN after CMA-ES, and then to the U-Net, which gave us unsatisfying results. What if I flipped it? What if, after CMA-ES, I sent the results to the U-Net first, and then fed those precise masks directly into the GAN?

Man... the moment I thought of it, I had a feeling it would work. But the thought of investing another 1-3 weeks only to get mediocre results again was daunting. Then, another boom: why not integrate the U-Net directly into the GAN's generator? We could feed the real masks into the GAN and have it learn to generate fake masks. But after reading more, I knew that to prove something new is better, you need a strong "why." So, I introduced a Genetic Algorithm for dynamic weight calculation, which would both power our ablation studies and give us the undeniable novelty we needed.

And guess what? It worked like my grandma's herbal tea: slow, steady, and miraculously effective. The initial epochs showed a promising 214 FID, but this time, I was confident. After about 200 epochs, it plummeted to 0.01! We ended up training two GAN models: one focusing on the entire character image and another specializing only on the character's edges. Both achieved what I'd call state-of-the-art results, with FID scores consistently below 10.

When I finally rendered the text, it was a mirror image of the input handwriting style. The metrics were fantastic—the final generated output settled at an FID of 31 (and yes, I don't know why it's not 0.01 anymore, please don't ask!). I tested it with various input styles, and every time, it generated cool, realistic outputs. All the metrics I needed were finally where they should be.

I rewrote the entire document with this new clarity, boldly highlighting my novel pipeline and directly comparing my results with existing research. This journey created two completely different versions of the same paper idea, proving that just changing the pipeline can fundamentally change your output. And that's the story of how persistence, a guide's sharp eye, and a complete rebuild led to the results I was always proud to imagine.

The Journey Behind My Handwriting Generation Pipeline

Manikanta SSB — Sun, 19 Oct 2025 08:24:43 GMT

Warning: This post is long, messy, and built on curiosity. But it’s real: this is how I built my first handwriting generation pipeline from scratch.

Hey guys, it's Manikanta. I'm genuinely proud to share my recent publication on Synthetic Handwritten Generation, a project I can truly call my own from start to finish.

This whole idea was born from a universal student dream: escaping the drudgery of writing hectic assignments. I was traveling home on the university bus one day when a friend complained, "I'm done with writing these assignments." We laughed, but it sparked a question in my mind: why hasn't anyone built an app that takes a sample of your handwriting and generates text in that same natural style?

Driven by curiosity, I dove into the research on ScienceDirect, looking at papers on OCR (Optical Character Recognition). Most focused on extracting and identifying text, but I needed to generate it. My past experience with GANs, which create similar-looking images, and Auto-Encoders, which extract outlines, felt like pieces of a puzzle. I immediately texted the idea to myself so I wouldn't forget.

My first attempt was straightforward: I wrote out A-Z, a-z, and 0-9 on a paper and tried to extract each character individually. It was a complete failure. Tools like Tesseract-OCR and Easy-OCR kept grabbing entire rows or nothing at all. Frustrated, I had to step away from the problem for a while.

The breakthrough came unexpectedly. My professor asked for some old code, and while reviewing my files, two concepts suddenly connected in my mind: Edge Detection and YOLO. Edge Detection highlights the outlines of an image, and YOLO places bounding boxes around objects to identify them. My interest took a complete U-turn.

I immediately applied edge detection to my character sheet. Boom! The edges of every character were perfectly highlighted. The next logical step was to draw bounding boxes around those edges. Another boom! The technique successfully isolated all 62 characters. I had my foundation.
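The "edges first, boxes second" idea can be sketched roughly like this. It is only an illustration: it assumes the edge detector has already produced a binary grid (1 = edge pixel), and the `bounding_boxes` function and toy grid are made up for the example. The real pipeline presumably used an image library rather than plain Python lists.

```python
from collections import deque

def bounding_boxes(edges):
    """Return one (top, left, bottom, right) box per connected group of edge pixels."""
    rows, cols = len(edges), len(edges[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if edges[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill over 8-connected neighbours,
            # tracking the extreme coordinates of the component.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and edges[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

# Toy "edge map" containing two separate shapes (hypothetical data).
edge_map = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 1],
]
boxes = bounding_boxes(edge_map)
print(boxes)
```

Each box is one candidate character region. On a real sheet, touching strokes or broken outlines would need extra handling that this sketch ignores.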

Now, the real architectural work began on the pipeline. My initial plan was: Edge Detection + Bounding Boxes -> CNN -> U-Net -> GANs -> Rendering. But I quickly realized that handling a multi-class detection problem was better suited for YOLO, so I changed it to: Edge Detection + Bounding Boxes -> YOLO -> U-Net -> GANs -> Rendering.

Then I overthought the U-Net's role. U-Net is for segmentation, focusing on the core object and removing the background. I worried that segmenting a character before the GAN would leave me with only an outer edge, making generation too hard. So, for the third time, I reshuffled the pipeline to its final form: Edge Detection + Bounding Boxes -> YOLO -> GANs -> U-Net -> Rendering.

With the blueprint set, I started collecting data. I gathered 93 handwritten samples from my friends—the best I could manage. Annotating all 93 images in Roboflow was the most grueling part; it took 3-4 days, and I owe a huge thanks to my sister for her help. I trained a YOLOv8s model, but the results were disastrous, with a Mean Average Precision (mAP) barely above zero. A newborn baby could have done better!

I used Roboflow's augmentation feature to expand my dataset to 243 images, but the mAP only climbed to 30-40%. Staring at the confusing confusion matrix, I had a new idea: why not extract every single character using the labels I already had and create individual class folders? It worked. I resized the 27,000 resulting images to 64x64 and reformatted them for YOLO, effectively treating each full image as a bounding box. This time, training was a success, finishing with a solid mAP50-95 of 0.75. The first major task was complete.
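To make the label trick concrete, here is a minimal sketch. It assumes the usual YOLO text label layout (`class x_center y_center width height`, all normalised to 0-1). The helper names `yolo_to_pixel_box` and `full_image_label` are hypothetical and not taken from my actual scripts.

```python
def yolo_to_pixel_box(label, img_w, img_h):
    """Convert a 'class cx cy w h' label (normalised) into pixel corners.

    Returns (class_id, left, top, right, bottom), which is what you need
    to crop one character out of the annotated sheet.
    """
    parts = label.split()
    class_id = int(parts[0])
    cx, cy, w, h = (float(v) for v in parts[1:5])
    left = round((cx - w / 2) * img_w)
    top = round((cy - h / 2) * img_h)
    right = round((cx + w / 2) * img_w)
    bottom = round((cy + h / 2) * img_h)
    return class_id, left, top, right, bottom

def full_image_label(class_id):
    """Label for a cropped character image: one box covering the whole image."""
    return f"{class_id} 0.5 0.5 1.0 1.0"

# Crop coordinates for one annotation on a 640x480 sheet (made-up numbers).
print(yolo_to_pixel_box("7 0.5 0.5 0.25 0.5", 640, 480))

# Label written next to a 64x64 crop of class 3.
print(full_image_label(3))
```

A box centred at (0.5, 0.5) with width and height 1.0 is what "treating each full image as a bounding box" means in label form.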

From here, the path felt smoother. The next step was the GAN, to generate new characters that looked natural. With nine types of GANs to choose from, I fell into a dilemma. I randomly started with StyleGAN, famous for generating realistic human faces. Unsurprisingly, it started generating faces instead of characters! My hope began to fade.

Further research led me to Conditional GANs, which use class labels to guide the generation. Perfect! When YOLO detected a character, it would be sorted into a class-labeled folder to train the GAN. But a new problem emerged: similar-looking characters like (0, o, O), (5, s, S), and (2, Z) were being grouped together, reducing our 62 classes to just 30-40 unique folders. Instead of fighting it, I decided to embrace these 19 clusters, allowing the model to substitute a detected 's' for a '5', for instance.
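The look-alike merging can be pictured as a simple lookup table. This sketch only uses the three groups named above; the names `LOOKALIKE_GROUPS` and `cluster_label` are invented for illustration, and the real grouping came out of the detector's behaviour rather than a hand-written list.

```python
import string

# Only the look-alike groups mentioned in the post (illustrative, not exhaustive).
LOOKALIKE_GROUPS = [("0", "o", "O"), ("5", "s", "S"), ("2", "Z")]

def build_cluster_map(groups):
    """Map every character in a group to the group's first member."""
    mapping = {}
    for group in groups:
        representative = group[0]
        for ch in group:
            mapping[ch] = representative
    return mapping

CLUSTER_OF = build_cluster_map(LOOKALIKE_GROUPS)

def cluster_label(ch):
    """Return the shared folder label for a character (itself if it has no look-alikes)."""
    return CLUSTER_OF.get(ch, ch)

all_chars = string.ascii_letters + string.digits  # the 62 characters
folders = {cluster_label(ch) for ch in all_chars}
print(cluster_label("s"), cluster_label("S"), cluster_label("a"))
print(len(folders))
```

With a table like this, a detected 's' and a detected '5' both land in the same class folder, so the conditional GAN sees them as one label.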

The final hurdle was achieving a natural, flowing handwriting style. Simply rotating characters created ugly black padding. The solution came from an Evolutionary Algorithm called CMA-ES (Covariance Matrix Adaptation Evolution Strategy), which intelligently shifts pixels within the matrix to create variation without adding noise. It worked beautifully.

The U-Net phase was surprisingly straightforward after all that, and soon I was rendering the final text. To be completely honest, I wasn't impressed with the final output. The metrics told the story: my FID score (which measures how far generated images are from real ones, so lower is better) was 240, indicating they were "super fake" compared to the ideal score of under 20. My other scores, like Inception Score and DiceID, were around 50, far from the target of 80. The rendered characters just weren't as clear as I'd hoped.
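For reference, this is the standard definition of FID (Fréchet Inception Distance), not something specific to my pipeline. The symbols are the usual ones: \mu_r, \Sigma_r are the mean and covariance of Inception-network features of the real images, and \mu_g, \Sigma_g are the same statistics for the generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Both terms are zero when the two feature distributions match, which is why a lower score means the generated characters are statistically closer to the real ones.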

Despite the imperfect results, I've documented the entire journey and submitted it to journals. This is the story of how I built my first complete pipeline idea, with all its twists, turns, and late-night breakthroughs.

I know research is all about inventing and trying something new, but here, everything I used has been applied in many other applications, just not in the way I used it. This is only the beginning of my research journey. After this, the topic gained more add-ons and eventually transformed into a quality paper. Stay tuned for more project workings; I’ll update you on what happens next. Have a good day. ✌️

Machine Learning Basics

Manikanta SSB — Mon, 07 Apr 2025 14:33:58 GMT

Hey, I am Manikanta. Today, we will discuss machine learning basics. So, what is machine learning? It is a field of AI that allows computers to learn from data. This is the basic definition that we can get via ChatGPT or whatever AI model we are comfortable with. But what do we gain by giving data to a computer and making it learn from that data? That is what we will discuss now.

We train an AI algorithm, known as a "model," by giving it data so that it can find patterns in that data. This helps us predict future outcomes from existing data, classify data, and create something relevant. What is the use of training these models? Yes, even a person with basic knowledge can do those jobs, but not with the speed and scalability these models offer.

Today, we will learn a little about how these models work. There are three types of learning in machine learning. We will get to know them in detail in the coming days, but I will give you a summary of each here.

Supervised Learning: Whenever we are trying to do something, we know what we are doing it for. In technical terms, we have the features and the target. In other words, we know both the input and the output. What are these features and targets? Do not worry; I will explain with an example. Let us say a team wants to win a particular cricket match. What are the features? A good batting lineup, good fielding, a good bowling attack, and perfect strategies. If we have all these features, our target will be a win. If any feature is lacking, the target may change. If we have data from previous matches with these features and targets, we call that a dataset.

What is supervised learning used for?

Classification and regression. Ooh, unfamiliar terms again.

Classification is about deciding which class a new sample belongs to, for example predicting whether a student will pass or fail an exam.
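Here is a tiny pass/fail classifier to make that concrete. It is a minimal sketch using a one-level decision tree (a "decision stump") on made-up data. The single feature, hours studied, and the function names `fit_stump` and `predict` are my own assumptions for the example.

```python
def fit_stump(hours, passed):
    """Pick the threshold on `hours` that misclassifies the fewest training samples."""
    best_threshold, best_errors = None, len(hours) + 1
    for threshold in sorted(set(hours)):
        # Rule being tested: predict "pass" when hours >= threshold.
        errors = sum((h >= threshold) != p for h, p in zip(hours, passed))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def predict(threshold, hours):
    """Classify a new student: True means 'pass', False means 'fail'."""
    return hours >= threshold

# Made-up dataset: the feature is hours studied, the target is pass/fail.
hours_studied = [1, 2, 3, 4, 5, 6, 7, 8]
passed_exam = [False, False, False, False, True, True, True, True]

threshold = fit_stump(hours_studied, passed_exam)
print(threshold)
print(predict(threshold, 6.5), predict(threshold, 2))
```

The model "learns" one number from the labelled examples and then uses it to classify students it has never seen.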

Regression is about predicting continuous values based on the input, such as forecasting the temperature for the coming days or predicting house prices.
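A small house-price example shows what "predicting a continuous value" looks like. This is a sketch of simple linear regression fitted with the ordinary least-squares formulas. The numbers are invented, and `fit_line` is just a name chosen for this post.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up dataset: area in hundreds of square feet -> price in lakhs.
areas = [5, 10, 15, 20]
prices = [35, 60, 85, 110]

slope, intercept = fit_line(areas, prices)
predicted_price = slope * 12 + intercept  # price estimate for an unseen area of 12
print(slope, intercept, predicted_price)
```

Unlike the pass/fail case, the output here is a number on a continuous scale, not a class.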

Algorithms based on supervised learning:

Decision trees, random forests, linear regression, and logistic regression.

Unsupervised Learning:

Unsupervised learning trains on unlabeled data. Here, we work only with features, not targets. Allow me to explain with an example.

Let us say I have collected data from a group of people about their interests, hobbies, and goals. Now I can group similar people together using clustering algorithms. Similarly, we also have dimensionality reduction, which reduces the number of features so the model trains only on the important ones, cutting computation time.
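The grouping idea can be sketched with a bare-bones k-means. The data is made up: each person is described by two features, hours of sport and hours of reading per week. The starting centroids are picked by hand to keep the example deterministic, and the `kmeans` function is a simplified illustration rather than a library implementation.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its old centroid).
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Made-up data: (hours of sport per week, hours of reading per week).
people = [(1, 9), (2, 8), (1, 8), (9, 1), (8, 2), (9, 2)]
start = [people[0], people[3]]  # hand-picked initial centroids (assumption)

centroids, clusters = kmeans(people, start)
print(clusters)
```

Notice that no targets were given: the algorithm found the "readers" and the "sporty" group purely from the features.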

Algorithms based on unsupervised learning:

K-means clustering, hierarchical clustering, and principal component analysis.

Reinforcement Learning:

This one is interesting. In reinforcement learning, the model itself is the agent: it takes actions in an environment, and the environment responds with feedback. When an action works out well, the agent receives a reward; otherwise, it receives a penalty, and over time it adjusts its behavior to collect more reward.

For now, think of training a dog. The trainer gives the dog a treat when it performs the trick properly; otherwise, the trainer corrects it. Through this process, the dog learns to do what the trainer asks in order to earn treats. In the same way, the agent (the dog) learns from the rewards and penalties handed out by its environment (the trainer).

Algorithms based on reinforcement learning:

Q-learning, policy gradient methods.
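To give a feel for Q-learning, here is a minimal sketch of its update rule applied to a dog-and-treat situation. The states, actions, reward values, and the settings `alpha=0.5` and `gamma=0.9` are all made up for illustration.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step.

    Moves Q(state, action) part of the way (alpha) toward
    reward + gamma * (best Q-value available from next_state).
    """
    best_next = max(q_table[next_state].values())
    target = reward + gamma * best_next
    q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table[state][action]

# Hypothetical setup: the dog hears "sit" and can either sit or bark.
q_table = {
    "sit_command": {"sit": 0.0, "bark": 0.0},
    "done": {"rest": 0.0},
}

for _ in range(3):
    # Treat (+1) for sitting, correction (-1) for barking.
    q_update(q_table, "sit_command", "sit", reward=1.0, next_state="done")
    q_update(q_table, "sit_command", "bark", reward=-1.0, next_state="done")

best_action = max(q_table["sit_command"], key=q_table["sit_command"].get)
print(q_table["sit_command"], best_action)
```

After a few rounds of treats and corrections, the value stored for "sit" is higher than the one for "bark", so the agent prefers sitting.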

I tried to mention everything that can help everyone understand these basics well. In the coming blogs, I will explain each of these learning types with real-world examples and code.