Classifying Leaves using extracted features from Image Processing

Rafael Madrigal
Sep 14, 2022


At the beginning of this article series, we discussed how Image Processing can be used to complement Machine Learning algorithms. The articles before this covered different techniques for enhancing, manipulating, and preprocessing our images for machine learning tasks. Now, we put that knowledge to use and attempt to classify leaves using a combination of image processing and Machine Learning techniques.

First, Let’s Load the Images

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from skimage.io import imshow, imread
import glob
from tqdm.notebook import tqdm
from skimage.color import rgb2gray
from matplotlib.patches import Rectangle
import warnings

warnings.simplefilter('ignore')

leaves = glob.glob('p*.jpg')
fig, axes = plt.subplots(3, 9, figsize=(27, 9))
plants = {'plantA': [], 'plantB': [],
          'plantC': [], 'plantD': [],
          'plantE': []}
for ax, leaf in tqdm(zip(axes.flatten(), leaves)):
    im = imread(leaf)
    # filenames look like plantA_1.jpg, so the prefix is the plant type
    plant_type = leaf.split('.')[0].split('_')[0]
    # store as floats scaled to [0, 1]
    plants[plant_type].append(im * 1.0 / 255)
    ax.imshow(im)
    ax.set_title(plant_type)
    ax.axis(False)

We have to extract each leaf from each of the photos and compute features that can be used to classify them. But first, let’s attempt some white balancing. Here we used the ground-truth algorithm. We also performed a simple binarization/thresholding to convert each image into a binary image.

from skimage import img_as_float
from skimage.filters import sobel
from skimage.morphology import area_closing

plants_wb = {'plantA': [], 'plantB': [],
             'plantC': [], 'plantD': [],
             'plantE': []}
plants_bw = {'plantA': [], 'plantB': [],
             'plantC': [], 'plantD': [],
             'plantE': []}
for plant_label in list(plants.keys()):
    print('Processing ', plant_label)
    for im in tqdm(plants[plant_label]):
        # apply_ground_truth and get_patches are the white-balancing
        # helpers introduced earlier in this series
        im_wb = apply_ground_truth(im,
                                   get_patches(im, x=800, y=600, s=20,
                                               plot=True),
                                   method='max',
                                   plot=True)
        # sharpen edges by adding Sobel responses back to the image
        im_wb = img_as_float(sobel(sobel(im_wb) + im_wb) + im_wb).clip(0, 1)
        # threshold, then close small holes inside each leaf
        im_bw = area_closing(area_closing(rgb2gray(im_wb) < 0.525))

        plants_wb[plant_label].append(im_wb)
        plants_bw[plant_label].append(im_bw)
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(6, 3))
        ax1.imshow(im_bw)
        ax1.set_title('Binarized Photo')
        ax2.imshow(im_wb)
        ax2.set_title('After Applying Edge Sharpening');
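The `apply_ground_truth` helper above comes from an earlier article in the series. For readers joining here, a minimal sketch of ground-truth white balancing, assuming a patch in the image that is known to be white in the real scene (the function name and toy data below are mine, not the series' exact implementation):

```python
import numpy as np

def ground_truth_white_balance(im, x, y, s=20, method='max'):
    """White-balance an RGB float image using an s-by-s patch at
    (column x, row y) that is assumed to be white in the real scene."""
    patch = im[y:y + s, x:x + s]
    if method == 'max':
        scale = patch.max(axis=(0, 1))   # per-channel maximum
    else:
        scale = patch.mean(axis=(0, 1))  # per-channel mean
    return np.clip(im / scale, 0, 1)

# toy image with a blue-tinted "white" card at (10, 10)
rng = np.random.default_rng(0)
im = rng.uniform(0, 0.5, (100, 100, 3))
im[10:30, 10:30] = [0.6, 0.7, 0.9]
balanced = ground_truth_white_balance(im, x=10, y=10, s=20)
print(balanced[10:30, 10:30].max(axis=(0, 1)))  # -> [1. 1. 1.]
```

Dividing each channel by the patch's per-channel statistic maps the tinted card back to pure white, removing the color cast from the whole image.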

Well, that looks helpful! We were able to cleanly cut out the leaves using a single threshold parameter. We can now extract the relevant features using region props. For this case, I identified the properties I think are relevant to the problem and measured them for every detected region.

from skimage.measure import label, regionprops, regionprops_table

reg_props = ['area', 'bbox', 'convex_area', 'convex_image',
             'eccentricity', 'equivalent_diameter', 'euler_number',
             'local_centroid', 'major_axis_length', 'minor_axis_length',
             'perimeter']

def extract_features(plant_bw, plant_wb, reg_props=reg_props):
    features = pd.DataFrame()
    for plant_label in list(plant_bw.keys()):
        for im, im_wb in tqdm(zip(plant_bw[plant_label],
                                  plant_wb[plant_label]),
                              total=len(plant_bw[plant_label])):
            im_wb = rgb2gray(im_wb)
            labelled_im = label(im)
            im_features = pd.DataFrame(
                {**regionprops_table(labelled_im, properties=reg_props),
                 **{'plant_label': plant_label.replace('plant', '')}})

            # recover the original grayscale pixels inside each region's
            # convex hull so we can later compute intensity features
            convex_orig = []
            for i, row in im_features.iterrows():
                try:
                    convex_orig.append(np.where(
                        row.convex_image,
                        im_wb[row['bbox-0']:row['bbox-2'],
                              row['bbox-1']:row['bbox-3']],
                        0))
                except Exception as e:
                    print(e)
                    convex_orig.append(np.nan)
            im_features['convex_orig'] = convex_orig
            features = pd.concat([features, im_features])
    return features.reset_index(drop=True)

features = extract_features(plants_bw, plants_wb)
features['mean_intensity'] = features.convex_orig.map(np.mean)
features['max_intensity'] = features.convex_orig.map(np.max)

However, given that the binarization also captured some unwanted specks and vertical/horizontal lines, I created a plot of area against perimeter and eliminated the obvious outliers. I used perimeter and area as filtering properties because leaves of the same plant should have roughly the same area and/or perimeter, so huge deviations from the mean probably indicate regions that were counted but shouldn’t have been. Below are violin plots showing the distribution of area and perimeter per plant type.
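The filtering code itself isn't shown in the article; one simple way to implement the outlier removal described above is a Tukey-fence (IQR) filter applied per plant type. The function name and the 1.5×IQR fences below are my assumptions, not necessarily the exact method used:

```python
import numpy as np
import pandas as pd

def drop_outliers(df, cols=('area', 'perimeter'), k=1.5):
    """Keep only rows inside the Tukey fences (Q1 - k*IQR, Q3 + k*IQR)
    for every column in `cols`, computed per plant_label so each
    species keeps its own scale."""
    def inliers(group):
        mask = pd.Series(True, index=group.index)
        for col in cols:
            q1, q3 = group[col].quantile([0.25, 0.75])
            iqr = q3 - q1
            mask &= group[col].between(q1 - k * iqr, q3 + k * iqr)
        return group[mask]
    return (df.groupby('plant_label', group_keys=False)
              .apply(inliers)
              .reset_index(drop=True))

# toy table: five leaf-sized regions plus one spurious component
toy = pd.DataFrame({'plant_label': ['A'] * 6,
                    'area': [10, 11, 12, 11, 10, 500],
                    'perimeter': [5, 6, 5, 6, 5, 90]})
clean = drop_outliers(toy)
print(len(clean))  # -> 5 (the 500-area region is dropped)
```

Filtering within each plant type matters: a perfectly normal plantE leaf could look like an outlier against plantA's size distribution.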

Overall, this gave us a total of 250 leaves out of an initial 436 detected connected components. What I did is just one simple way to address the problem with connected components: even if we capture specks as connected components, we can easily remove them by performing an outlier analysis. For the classification task, our proportional chance criterion (PCC) is 20%, and our 1.25 PCC is 25%.
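The PCC figures can be reproduced from the class proportions alone. A quick sketch, assuming the 250 leaves are split evenly across the five plant types:

```python
import numpy as np

def pcc(labels):
    """Proportional chance criterion: the accuracy a classifier would
    achieve by chance, given the class proportions (sum of p_i**2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p ** 2)

# 250 leaves, 50 per plant type (balanced classes assumed)
labels = np.repeat(['A', 'B', 'C', 'D', 'E'], 50)
print(pcc(labels))         # -> 0.2  (20%)
print(1.25 * pcc(labels))  # -> 0.25 (25%)
```

The 1.25×PCC benchmark is the conventional bar a model should clear to be considered meaningfully better than chance.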

After cleaning the data and generating the features table, we performed some correlation analysis and removed features that are highly correlated with each other. This left us with 8 features out of the original 12: convex_area, eccentricity, euler_number, major_axis_length, minor_axis_length, perimeter, mean_intensity, and max_intensity.
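One common way to implement this pruning (not necessarily the exact procedure used here) is to scan the upper triangle of the correlation matrix and drop one feature from each highly correlated pair; the threshold of 0.9 below is my assumption:

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Greedily drop one feature from each pair whose absolute
    Pearson correlation exceeds `threshold`."""
    corr = df.corr().abs()
    # keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns
               if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# toy table: b is an exact multiple of a, c is independent noise
rng = np.random.default_rng(1)
a = rng.normal(size=100)
df = pd.DataFrame({'a': a, 'b': 2 * a, 'c': rng.normal(size=100)})
kept, dropped = drop_correlated(df)
print(dropped)  # -> ['b']
```

Masking the lower triangle avoids dropping both members of a pair: `a` stays while its duplicate `b` goes.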

We normalized the features and used several machine learning models to classify the images. In the end, we found that an L1-regularized Logistic Regression beat all the other machine learning methods for this specific use case, with a mean test accuracy of 92.8%. Running it on our data, we came up with an F1 score of 93.7% and a precision of 94.3%.
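A minimal sketch of this setup with scikit-learn, using synthetic stand-in data of the same shape (250 samples, 8 features, 5 classes) since the real feature table isn't reproduced here; the hyperparameters are illustrative, not the tuned values behind the scores above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# stand-in data shaped like our problem (NOT the real leaf features)
X, y = make_classification(n_samples=250, n_features=8, n_informative=6,
                           n_classes=5, random_state=42)

model = make_pipeline(
    StandardScaler(),                       # normalize the features
    LogisticRegression(penalty='l1',        # L1 regularization
                       solver='liblinear',  # solver that supports L1
                       C=1.0))

scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

Putting the scaler inside the pipeline ensures normalization statistics are fit only on each training fold, avoiding leakage into the test folds.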

As expected, the model had a hard time distinguishing plants C and D. From the images provided, plant C and plant D leaves are very similar in size, and since we rely on geometric properties, they will most likely have close measurements. This could be better addressed by identifying edges and by using the color information of the images, then finding a way to factor these into the algorithm.
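As a starting point for the color-based fix suggested above, the per-channel mean inside the leaf mask is one simple color feature that geometric properties cannot capture. The helper name and toy data here are mine:

```python
import numpy as np

def mean_rgb_features(im_rgb, mask):
    """Mean intensity of each color channel inside a boolean leaf
    mask -- a simple color descriptor per region."""
    return [im_rgb[..., c][mask].mean() for c in range(3)]

# toy example: a greenish "leaf" on a white background
im = np.ones((50, 50, 3))
mask = np.zeros((50, 50), dtype=bool)
mask[10:40, 10:40] = True
im[mask] = [0.2, 0.7, 0.3]  # greenish leaf pixels

print(mean_rgb_features(im, mask))  # ~ [0.2, 0.7, 0.3]
```

These three values could simply be appended to the existing feature table, the same way mean_intensity and max_intensity were derived from `convex_orig`.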

Wrapping Up

In this simple use case, we were able to present a full ML pipeline using image processing. There are a lot of opportunities to explore when integrating Image processing with Machine Learning and learning the basics through PIL, OpenCV, or scikit-image paves the way for us to understand most Computer Vision tasks.
