Kaggle Kernels for Classification Tasks

The following Kaggle kernels show how to patch scikit-learn with Intel® Extension for Scikit-learn* for various classification tasks. These kernels usually include a performance comparison between stock scikit-learn and scikit-learn patched with Intel® Extension for Scikit-learn*.
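As a rough illustration of the patching and timing pattern these kernels follow, here is a minimal sketch. It assumes the `sklearnex` package exposes `patch_sklearn()` and falls back to stock scikit-learn when the package is not installed. The helper names `patch_if_available` and `time_svc`, and the synthetic data, are hypothetical and not taken from any of the kernels.

```python
import importlib.util
import time


def patch_if_available() -> bool:
    """Patch scikit-learn when sklearnex is installed.

    Returns True if the patch was applied, False when falling back to
    stock scikit-learn.
    """
    if importlib.util.find_spec("sklearnex") is None:
        return False
    from sklearnex import patch_sklearn

    patch_sklearn()  # call this before importing the estimators
    return True


def time_svc(n_samples: int = 2000, seed: int = 0):
    """Fit an SVC on synthetic data; return (elapsed seconds, test accuracy)."""
    # Imported here, after any patching, so patched classes are picked up.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=n_samples, n_features=20, random_state=seed)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)

    start = time.perf_counter()
    model = SVC().fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    return time.perf_counter() - start, accuracy
```

Calling `patch_if_available()` first and then `time_svc()` gives one timing. Running the same script without the patch gives the other side of the comparison.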

TPS stands for Tabular Playground Series, which is a series of beginner-friendly Kaggle competitions.

Binary Classification

Kernel

Goal

Content

Logistic Regression for Binary Classification

Data: [TPS Nov 2021] Synthetic spam emails data

Identify spam emails via features extracted from the email

  • data preprocessing (normalization)

  • search for optimal parameters using Optuna

  • training and prediction using scikit-learn-intelex

  • performance comparison to scikit-learn

Feature Importance in Random Forest for Binary Classification

Data: [TPS Nov 2021] Synthetic spam emails data

Identify spam emails via features extracted from the email

  • reducing DataFrame memory usage

  • computing feature importance with ELI5 and the default scikit-learn permutation importance

  • training using scikit-learn-intelex

  • performance comparison to scikit-learn
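For the scikit-learn side of the feature importance step, a minimal sketch using `sklearn.inspection.permutation_importance` could look like this. The synthetic data and the forest settings are assumptions, and the ELI5 variant used in the kernel is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the competition data (assumption).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=3, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

# Shuffle each feature on held-out data and measure the drop in score.
result = permutation_importance(forest, X_val, y_val, n_repeats=5, random_state=0)

# Feature indices ordered from most to least important.
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    mean = result.importances_mean[idx]
    std = result.importances_std[idx]
    print(f"feature {idx}: {mean:.4f} +/- {std:.4f}")
```

Computing the importances on a validation split rather than the training data shows which features the model relies on to generalize, not just which ones it memorized.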

Random Forest for Binary Classification

Data: [TPS Apr 2021] Synthetic data based on Titanic dataset

Predict whether a passenger survives

  • data preprocessing

  • feature construction

  • search for optimal parameters using Optuna

  • training and prediction using scikit-learn-intelex

  • performance comparison to scikit-learn

Support Vector Classification (SVC) for Binary Classification

Data: [TPS Apr 2021] Synthetic data based on Titanic dataset

Predict whether a passenger survives

  • data preprocessing

  • training and prediction using scikit-learn-intelex

  • performance comparison to scikit-learn

Support Vector Classification (SVC) with Feature Preprocessing for Binary Classification

Data: [TPS Apr 2021] Synthetic data based on Titanic dataset

Predict whether a passenger survives

  • data preprocessing

  • feature engineering

  • training and prediction using scikit-learn-intelex

  • performance comparison to scikit-learn

MultiClass Classification

Kernel

Goal

Content

Logistic Regression for MultiClass Classification with Quantile Transformer

Data: [TPS Jun 2021] Synthetic eCommerce data

Predict the category of an eCommerce product

  • data preprocessing with Quantile Transformer

  • training and prediction using scikit-learn-intelex

  • search for optimal parameters using Optuna

  • performance comparison to scikit-learn

Support Vector Classification (SVC) for MultiClass Classification

Data: [TPS May 2021] Synthetic eCommerce data

Predict the category of an eCommerce product

  • data preprocessing

  • training and prediction using scikit-learn-intelex

Stacking Classifier with Logistic Regression, kNN, Random Forest, and Quantile Transformer

Data: [TPS Jun 2021] Synthetic eCommerce data

Predict the category of an eCommerce product

  • data preprocessing: one-hot encoding, dimensionality reduction with PCA, normalization

  • creating a stacking classifier with logistic regression, kNN, and random forest, and a pipeline of Quantile Transformer and another logistic regression as a final estimator

  • searching for optimal parameters for the stacking classifier

  • training and prediction using scikit-learn-intelex

  • performance comparison to scikit-learn
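The stacking arrangement described above can be sketched with scikit-learn's `StackingClassifier`. This is an illustrative assumption rather than the kernel's code: the synthetic data, the hyperparameters, and the omission of the one-hot encoding, PCA, and parameter-search steps are all simplifications.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer

# Synthetic four-class stand-in for the competition data (assumption).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base estimators whose cross-validated predictions feed the final estimator.
base_estimators = [
    ("logreg", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
]

# Final estimator: Quantile Transformer followed by another logistic regression.
final_estimator = make_pipeline(
    QuantileTransformer(n_quantiles=100, output_distribution="normal", random_state=0),
    LogisticRegression(max_iter=1000),
)

stack = StackingClassifier(
    estimators=base_estimators, final_estimator=final_estimator, cv=3
)
stack.fit(X_train, y_train)

accuracy = stack.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```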

Support Vector Classification (SVC) for MultiClass Classification

Data: [TPS Dec 2021] Synthetic Forest Cover Type data

Predict the forest cover type

  • data preprocessing

  • training and prediction using scikit-learn-intelex

  • performance comparison to scikit-learn

Feature Importance in Random Forest for MultiClass Classification

Data: [TPS Dec 2021] Synthetic Forest Cover Type data

Predict the forest cover type

  • reducing DataFrame memory usage

  • computing feature importance with ELI5

  • training and prediction using scikit-learn-intelex

  • performance comparison to scikit-learn
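One common way to reduce DataFrame memory usage is to downcast numeric columns with pandas. The sketch below shows that approach; it is an assumption about the general technique, not the kernel's exact code. The helper `reduce_memory` and the column names are hypothetical.

```python
import numpy as np
import pandas as pd


def reduce_memory(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric columns downcast to smaller dtypes."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float32 is the smallest target, so some precision can be lost.
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Hypothetical example frame: small integers and floats stored as 64-bit.
df = pd.DataFrame(
    {
        "n_items": np.arange(1000, dtype=np.int64) % 100,
        "ratio": np.linspace(0.0, 1.0, 1000),
    }
)
small = reduce_memory(df)

before = df.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
print(small.dtypes)
```

Integer downcasting keeps values exact, while float downcasting trades precision for space, so it is worth checking that model quality does not change afterwards.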

k-Nearest Neighbors (kNN) for MultiClass Classification

Data: [TPS Feb 2022] Bacteria DNA

Predict bacteria species based on repeated lossy measurements of DNA snippets

  • data preprocessing

  • training and prediction using scikit-learn-intelex

  • performance comparison to scikit-learn

Classification Tasks in Computer Vision

Kernel

Goal

Content

Support Vector Classification (SVC) for MultiClass Classification (CV task)

Data: Digit Recognizer (MNIST)

Recognize hand-written digits

  • data preprocessing

  • training and prediction using scikit-learn-intelex

  • performance comparison to scikit-learn

k-Nearest Neighbors (kNN) for MultiClass Classification (CV task)

Data: Digit Recognizer (MNIST)

Recognize hand-written digits

  • training and prediction using scikit-learn-intelex

  • performance comparison to scikit-learn

Classification Tasks in Natural Language Processing

Kernel

Goal

Content

Support Vector Classification (SVC) for a Binary Classification (NLP task)

Data: Natural Language Processing with Disaster Tweets

Predict which tweets are about real disasters and which ones are not

  • data preprocessing

  • TF-IDF calculation

  • search for optimal parameters using Optuna

  • training and prediction using scikit-learn-intelex

  • performance comparison to scikit-learn

One-vs-Rest Support Vector Machine (SVM) with Text Data for MultiClass Classification

Data: What’s Cooking

Use recipe ingredients to predict the cuisine

  • feature extraction using TfidfVectorizer

  • training and prediction using scikit-learn-intelex

  • performance comparison to scikit-learn
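A one-vs-rest SVM on TF-IDF features can be sketched as a small scikit-learn pipeline. The tiny corpus below is made up for illustration and is not the What's Cooking data, and the linear kernel is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up ingredient lists and cuisine labels (assumption, for illustration).
recipes = [
    "soy sauce ginger rice scallion",
    "rice vinegar soy sauce sesame oil",
    "tomato basil mozzarella olive oil",
    "pasta tomato garlic parmesan",
    "tortilla beans cilantro lime",
    "avocado lime cilantro jalapeno",
]
cuisines = ["chinese", "chinese", "italian", "italian", "mexican", "mexican"]

# TF-IDF turns each ingredient list into a sparse vector; the wrapper then
# trains one binary SVC per cuisine.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(SVC(kernel="linear")),
)
model.fit(recipes, cuisines)

prediction = model.predict(["garlic tomato pasta basil"])
print(prediction[0])
```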

Support Vector Classification (SVC) for Binary Classification with Sparse Data (NLP task)

Data: Stack Overflow questions

Predict the binary quality rating for Stack Overflow questions

  • data preprocessing

  • TF-IDF calculation

  • search for optimal parameters using Optuna

  • training and prediction using scikit-learn-intelex

  • performance comparison to scikit-learn