Lacking Data for Machine Learning Models? Generate it Automatically

By Andrea Mycroft, Machine Learning Engineer

As a tech company, we hear a lot about machine learning and how it can facilitate real-world solutions that reduce cost and improve efficiency. Machine learning allows us to solve unique problems using artificial intelligence, but it comes with its own challenges, especially when a client does not have enough data available to train a model. In this article, you will learn how one of our Machine Learning Engineers took on a client challenge, led a team to develop an automatic data generator that creates usable training data, and delivered an effective solution.

What is your title, and what do you do for Octo? 

I am a Machine Learning Engineer. I work mostly on internal research and development, artificial intelligence, and machine learning projects for the Chief Technology Officer (CTO). I also work on client projects as needed.

What is one of the most interesting projects you’ve worked on at Octo?

One of our customers wants to automate a label-checking process using computer vision and machine learning. Companies submit alcohol labels to this customer for review and approval. Currently, this process is manual, time-consuming, and expensive: an analyst must inspect each label to make sure it meets agency requirements. Labels must conform to legibility criteria, cannot contain certain phrases, must contain other phrases, and must comply with regulations regarding the use of icons and logos.

What challenges did you encounter? 

The lack of data made highly accurate modeling a challenge. Good machine learning models require a lot of training data, and almost none was available. To address this deficiency, we built an automatic data generator that produces thousands of randomized image examples in a short time. It uses a web scraper to download the desired background images (alcohol labels) and object images (icons) based on search terms. The tool then randomly combines the backgrounds and objects to create a data set that is ready to be used in machine learning. The data generator also improves consistency in labeling and annotating data, reducing cost and increasing efficiency.
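
To make the idea of randomly combining backgrounds and objects concrete, here is a minimal sketch in Python with NumPy. It is not our production generator: the function names (composite, generate_dataset), the boolean object mask, and the bounding-box output format are assumptions made for illustration, and the web-scraping step is left out entirely. It pastes an object onto a background at a random position and records where it landed, so every generated image comes with its own annotation.

```python
import numpy as np


def composite(background, obj, mask, rng):
    """Paste `obj` onto a copy of `background` at a random position.

    background: H x W x 3 uint8 array (for example, a label image)
    obj:        h x w x 3 uint8 array (for example, an icon)
    mask:       h x w boolean array, True where the pixel belongs to the object
    Returns the new image and its bounding box (x_min, y_min, x_max, y_max).
    """
    bg_h, bg_w = background.shape[:2]
    obj_h, obj_w = obj.shape[:2]
    if obj_h > bg_h or obj_w > bg_w:
        raise ValueError("object must fit inside the background")

    # Pick a top-left corner that keeps the object fully inside the image.
    y = int(rng.integers(0, bg_h - obj_h + 1))
    x = int(rng.integers(0, bg_w - obj_w + 1))

    out = background.copy()
    region = out[y:y + obj_h, x:x + obj_w]  # a view into `out`
    region[mask] = obj[mask]                # copy only the object pixels
    return out, (x, y, x + obj_w, y + obj_h)


def generate_dataset(backgrounds, objects, n_samples, seed=0):
    """Build `n_samples` annotated images from random background/object pairs.

    objects is a list of (label, image, mask) tuples; this layout is hypothetical.
    """
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_samples):
        bg = backgrounds[int(rng.integers(len(backgrounds)))]
        label, obj, mask = objects[int(rng.integers(len(objects)))]
        image, bbox = composite(bg, obj, mask, rng)
        samples.append({"image": image, "label": label, "bbox": bbox})
    return samples
```

Because the generator chooses the position itself, the bounding box never has to be drawn by hand, which is one way a tool like this can keep annotations consistent.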

What makes this tool unique?

This tool goes beyond cutting and pasting images. It removes backgrounds from object images to create data that is very similar to real data. We automated foreground detection and background removal in this process, reducing the resources used to perform these actions. Our automated data generator proves that image verification can be automated, even when sufficient data is not readily available for training.
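
As a rough illustration of what foreground detection and background removal can look like, the sketch below uses a deliberately simple heuristic: it assumes the object image has a nearly uniform background, estimates that background color from the image border, and treats every pixel that differs from it as foreground. This is a stand-in technique chosen for brevity, not a description of the method our tool uses, and the names foreground_mask, crop_to_foreground, and the tolerance parameter are hypothetical.

```python
import numpy as np


def foreground_mask(image, tolerance=30.0):
    """Return a boolean mask that is True where a pixel looks like foreground.

    Assumes a roughly uniform background that touches the image border.
    """
    img = image.astype(np.float32)
    # Estimate the background color as the median of all border pixels.
    border = np.concatenate([img[0], img[-1], img[:, 0], img[:, -1]])
    bg_color = np.median(border, axis=0)
    # Foreground = pixels whose color is far from the background color.
    distance = np.linalg.norm(img - bg_color, axis=2)
    return distance > tolerance


def crop_to_foreground(image, mask):
    """Crop the image and mask to the tight bounding box of the foreground."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("no foreground pixels found")
    y0, y1 = int(ys.min()), int(ys.max()) + 1
    x0, x1 = int(xs.min()), int(xs.max()) + 1
    return image[y0:y1, x0:x1], mask[y0:y1, x0:x1]
```

The cropped image and mask are the kind of inputs a compositing step needs: only pixels where the mask is True get pasted onto a label, so the object's original background does not come along with it. Real icons with textured or gradient backgrounds would need a more robust segmentation method than this threshold.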

On a high level, what technologies were used on this project?

Our team used optical character recognition (OCR) with Tesseract, object detection with custom-trained machine learning models (PyTorch), and typical Python packages, such as OpenCV, NumPy, and pandas.
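
To show how OCR output could feed a check for required and prohibited phrases, here is a small sketch. The phrase lists in the usage example are invented for illustration and are not the agency's actual rules, and the function names are hypothetical. The OCR helper assumes the pytesseract wrapper and Pillow are installed alongside Tesseract; the phrase check itself needs only the standard library.

```python
import re


def ocr_text(image_path):
    """Read the text on a label image with Tesseract.

    Assumes the pytesseract and Pillow packages are available; they are
    imported here so the phrase check below works without them.
    """
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_path))


def normalize(text):
    """Lowercase and collapse whitespace so OCR line breaks do not split phrases."""
    return re.sub(r"\s+", " ", text).strip().lower()


def check_label_text(text, required, prohibited):
    """Report which required phrases are missing and which prohibited ones appear."""
    clean = normalize(text)
    missing = [p for p in required if normalize(p) not in clean]
    found = [p for p in prohibited if normalize(p) in clean]
    return {
        "missing_required": missing,
        "found_prohibited": found,
        "passed": not missing and not found,
    }


# Usage with made-up text and made-up phrase lists:
report = check_label_text(
    "GOVERNMENT\nWARNING: drink responsibly. This drink cures colds.",
    required=["Government Warning", "Alc. by vol."],
    prohibited=["cures"],
)
print(report)
```

Exact substring matching is a simplification. OCR output often contains misread characters, so a real checker would likely need fuzzy matching or a review step for borderline cases.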

How might other agencies use this solution? What are some potential use cases? 

The ability to automatically generate a dataset for object detection when data is limited is extremely useful. This solution demonstrates that we can automate image verification using computer vision, and it can be extended to other agencies that need labels or images verified against their own guidelines.
