Example Paper

Paper

Python

Dataviz

Published

March 8, 2024

Code

import io
from sklearn import get_config
from config import save_system
from encryption import RandomState
import tomli
from contextlib import redirect_stdout

# remove stupid advert
f = io.StringIO()
with redirect_stdout(f):
  import ydata_profiling as yd

from pycaret.datasets import get_data
from pycaret.classification.functional import setup, get_config, compare_models, create_model, pull, finalize_model, save_model, plot_model
import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams['font.family'] = 'DeJavu Serif'
plt.rcParams['font.serif'] = ['Times New Roman']
from yellowbrick.features import rank2d
import yellowbrick as yb
import dtreeviz

Code

save_system()

with open("../../pyproject.toml", mode="rb") as fp:
  config = tomli.load(fp)

PROJECT_NAME = config["tool"]["poetry"]["name"]
RANDOM_STATE = RandomState(PROJECT_NAME).state

Classifying Irises by Field Measurements

Classification is a fundamental machine learning technique used to predict the group membership of data instances based on input features. The Iris dataset, a well-known benchmark in machine learning, presents a classification challenge involving three species of the Iris plant based on sepal and petal measurements. This paper explores the application of Support Vector Machines (SVM) combined with Principal Component Analysis (PCA) for dimensionality reduction to enhance classification accuracy and interpretability. Performance metrics, including precision, recall, F1 score, and confusion matrices, are analyzed to evaluate model effectiveness. The study demonstrates that while SVM achieves high classification accuracy, Decision Trees provide greater interpretability, which is valuable for practical applications in biological classification.

Introduction

Bioinformatics is an emerging and rapidly evolving field focused on extracting meaningful insights from biological data using computational techniques. Fundamental challenges in bioinformatics, such as protein structure prediction, sequence alignment, and phylogenetic inference, are often computationally complex and classified as NP-hard problems. Machine learning (ML) methods, including Artificial Neural Networks (ANN), Fuzzy Logic, Genetic Algorithms, and Support Vector Machines (SVM), have shown promise in addressing these challenges.

This study focuses on classifying Iris species using measurements of sepal length, sepal width, petal length, and petal width. The Iris dataset, introduced by R.A. Fisher in 1936, has become a standard benchmark for classification models. The goal is to evaluate the effectiveness of SVM for classification and explore the impact of dimensionality reduction using PCA on model performance. Decision Tree models are also assessed for comparison, balancing accuracy with interpretability.

Data

The Iris dataset, obtained from the UCI Machine Learning Repository, consists of 150 instances representing three species of Iris plants:

Iris Setosa
Iris Versicolor
Iris Virginica

Each species is represented by 50 samples. The dataset includes four numeric attributes:

Sepal length (cm)
Sepal width (cm)
Petal length (cm)
Petal width (cm)

Code

data = get_data("iris")
decoder = ['setosa', 'versicolor', 'virginica']

Iris Measures
	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	Iris-setosa
1	4.9	3.0	1.4	0.2	Iris-setosa
2	4.7	3.2	1.3	0.2	Iris-setosa
3	4.6	3.1	1.5	0.2	Iris-setosa
4	5.0	3.6	1.4	0.2	Iris-setosa

The fifth attribute is the class label indicating the species. The dataset is complete, with no missing values or inconsistencies reported. While the Setosa species is separable from the other two species, Versicolor and Virginica exhibit significant overlap, making classification the challenge of the study. A copy of the exploratory data report can be found here EDA Report

Code

report = data.profile_report(progress_bar=False, title="Iris EDA Report")
report.to_file("../../assets/Iris_EDA.html", silent = True)

Methods

Principal Component Analysis (PCA) is a widely used dimensionality reduction technique that can project data onto a lower-dimensional subspace while retaining the maximum variance. PCA helps reduce the computational complexity of the classification task, but more importantly in this instance, improves visualization.

In this study, PCA reduces the four-dimensional Iris data into two principal components:

First Principal Component: Accounts for the highest variance in the data.
Second Principal Component: Captures the next highest variance.

Code

env_pca = setup(data, target="species", train_size = .8, session_id = RANDOM_STATE, pca=True, pca_components = 2, verbose = False, normalize = True)

Code

env_pca.X_transformed.head()

Reduced Dimension Measurements
	pca0	pca1
83	-1.069254	-0.680773
19	2.308932	1.167804
84	-0.230247	-0.329121
52	-1.329566	0.578077
81	0.044678	-1.574244

By plotting the data along these two principal components, a clear separation between the Setosa species and the other two species becomes visible. However, Versicolor and Virginica still exhibit considerable overlap.

Code

pca = env_pca.pipeline.named_steps["pca"].transformer
plt.figure()
sns.scatterplot(env_pca.X_transformed, x="pca0", y="pca1", hue=data["species"]).set(title="IRIS Species Scatter Plot \n Explained Variance: %.3f" % pca.explained_variance_ratio_.sum())

Feature Importance

While PCA is great for dimensionality reduction and visualization, we do lose some interpretability in how the scientific measurements are correlated to the target species. For this, we look at feature rank and feature importance charts to determine the measurements most affecting the separation between classes.

Code

env_norm = setup(data, target="species", train_size = .8, session_id = RANDOM_STATE, verbose = False, normalize=True)

Code

fig = rank2d(get_config("X_train"), get_config("y_train"))
fig = yb.target.feature_correlation.feature_correlation(env_norm.X_train, env_norm.y_train, method='mutual_info-classification')

Standard practice is to perform train test split on the transformed data to evaluate the model performance. We will use the convention that splits the data in 80% training/validation and 20% test set. Because the sample size is relatively small, we will use cross validation to prevent overfitting. The advantages of this technique are the ability to limit overfitting on a relatively small dataset.

Code

fig = yb.target.class_balance(env_norm.y_train, env_norm.y_test, labels = decoder)

Analysis

The models were evaluated using an 80/20 train-test split with five-fold cross-validation to mitigate overfitting. Evaluation metrics included:

Precision – The proportion of correct positive predictions.
Recall – The proportion of actual positives correctly identified.
F1 Score – The harmonic mean of precision and recall, providing a balanced measure of model accuracy.
Confusion Matrix - Illustrates misclassification patterns across species.

Classification by Support Vector Machine

Support Vector Machines (SVM) are effective for high-dimensional classification problems. SVM constructs a hyperplane in a multi-dimensional space that maximally separates the classes. In cases where the data is not linearly separable, SVM employs a kernel trick to map the data into a higher-dimensional space where linear separation becomes possible.

Code

svm_model = create_model("svm", tol=1e-3, alpha=.0012, verbose=False) # parameters can be found with tune_model
pull()

Cross Validation Results
	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC
Fold
0	0.7500	0.0	0.7500	0.7556	0.7460	0.6250	0.6316
1	0.9167	0.0	0.9167	0.9333	0.9153	0.8750	0.8843
2	0.9167	0.0	0.9167	0.9333	0.9153	0.8750	0.8843
3	0.9167	0.0	0.9167	0.9333	0.9153	0.8750	0.8843
4	1.0000	0.0	1.0000	1.0000	1.0000	1.0000	1.0000
5	0.8333	0.0	0.8333	0.8333	0.8333	0.7500	0.7500
6	1.0000	0.0	1.0000	1.0000	1.0000	1.0000	1.0000
7	1.0000	0.0	1.0000	1.0000	1.0000	1.0000	1.0000
8	0.9167	0.0	0.9167	0.9333	0.9153	0.8750	0.8843
9	0.8333	0.0	0.8333	0.8889	0.8222	0.7500	0.7833
Mean	0.9083	0.0	0.9083	0.9211	0.9063	0.8625	0.8702
Std	0.0786	0.0	0.0786	0.0744	0.0805	0.1179	0.1141

Comparison to Decision Tree

A Decision Tree classifier was used as a benchmark to compare with the SVM model. Decision Trees provide an interpretable model by recursively splitting the data based on the most informative features. The tree structure allows for straightforward interpretation of how classifications are made.

Code

tree_model = create_model("dt", max_depth=3, verbose = False)
winning_model = compare_models(include=[svm_model, tree_model], verbose = False)
pull()

	Model	Accuracy	AUC	Recall	Prec.	F1	Kappa	MCC	TT (Sec)
1	Decision Tree Classifier	0.9500	0.9677	0.9500	0.9600	0.9492	0.9250	0.9306	0.005
0	SVM - Linear Kernel	0.9083	0.0000	0.9083	0.9211	0.9063	0.8625	0.8702	0.008

Code

fig = plot_model(svm_model, plot = "confusion_matrix", plot_kwargs = {"classes": decoder})
fig = plot_model(tree_model, plot = "confusion_matrix", plot_kwargs = {"classes": decoder})
fig = plot_model(svm_model, plot = "class_report", plot_kwargs = {"classes": decoder})
fig = plot_model(tree_model, plot = "class_report", plot_kwargs = {"classes": decoder})

The Decision Tree model achieved comparable performance, almost identical to the Support Vector Machine. This is a common phenomenon in machine learning applications and often greatly ignored by auto=ml packages. Decision Trees, provide a more transparent decision-making process, which is valuable for field applications where understanding the classification process is essential.

Code

message = save_model(svm_model, "models/svm-model", verbose = False)
message = save_model(tree_model, "models/tree-model", verbose = False)

Model Interpretability

A simple decision tree can be printed out as a flowchart or a series of branching yes/no questions, making it easy to use in field work where quick decisions are necessary. Each node in the tree represents a question based on a specific feature, such as “Is petal length greater than 2.5 cm?” The branches lead to subsequent questions or to a classification decision, such as identifying the plant as Setosa or Versicolor. This format allows field researchers to visually trace the decision path step-by-step, even without access to computational tools. The straightforward structure of a decision tree makes it easy to understand and follow, enabling non-experts to accurately classify samples based on observable characteristics. The transparency and simplicity of the printed decision tree make it particularly valuable for practical applications in biological fieldwork.

Code

# Build a model on the final data for deployment.
env_prod = setup(data, target="species", train_size = .8, session_id = RANDOM_STATE, verbose = False) # trees don't need normalization
final_tree = finalize_model(tree_model)
message = save_model(final_tree, "models/production-model", verbose = False)

Code

# Visualize predictions
sample = env_prod.X_transformed.iloc[120] 
final_tree_model = final_tree.named_steps['actual_estimator']
viz_model =  dtreeviz.model(final_tree_model, X_train=env_prod.X_transformed, y_train = env_prod.y_transformed, feature_names = list(env_prod.X_transformed), target_name = "species name", class_names = decoder)
viz_model.view(x = sample, fontname="DejaVu Sans")

In-case displaying the entire tree is not desirable especially in regulatory environments, we can instead display the list of features and their importance related to the decision.

Code

viz_model.instance_feature_importance(sample, fontname="DejaVu Sans")

Conclusion

Discussion

Misclassification between Versicolor and Virginica reflects the biological similarities in their sepal and petal measurements. Future research could explore additional botanical features or incorporate other ML techniques, such as ensemble methods or deep learning, to improve discrimination between these species.

SVM remains a powerful classification tool for complex, high-dimensional datasets. However, the increased interpretability of Decision Trees suggests that they may be more suitable for practical, real-world applications in biological research.

Conclusion

This study demonstrates the effectiveness of SVM in classifying Iris species based on sepal and petal measurements. PCA successfully reduced the data dimensionality, improving visualization. SVM achieves high accuracy, however, the interpretability of Decision Trees makes them a valuable alternative for practical applications where model transparency is critical.

Future Work

Future work could explore:

Incorporating additional morphological features to improve classification between Versicolor and Virginica.
Testing ensemble methods such as Random Forest or Gradient Boosted Trees for enhanced accuracy.

Appendix

The entire website, including this example project is located at https://github.com/joshuacharleshyatt/personal