Auto-Sklearn is a library for automatic machine learning developed by the AutoML group from Freiburg and Hannover, Germany.
Generate tabular classification predictions using auto-sklearn 2
Main components of the library:
| # Import libraries
import autosklearn.experimental.askl2
import autosklearn.metrics
# Configure model parameters
classifier = autosklearn.experimental.askl2.AutoSklearn2Classifier(
    time_left_for_this_task = 600, # in seconds
    metric = autosklearn.metrics.roc_auc)
# Fit model
classifier.fit(X_train, y_train)
# Predict on test data
y_pred = classifier.predict(X_test)
|
To find out more about other auto-sklearn applications, visit the examples webpage.
This notebook is available on GitHub and can also be downloaded here.
Note: to install auto-sklearn on macOS, use the commands below. Additional details can be found in this comment.
| brew install swig
brew link swig
pip install -U auto-sklearn
|
Download the dataset from Kaggle
The dataset being used is from the Kaggle Titanic competition.
| import fastkaggle
print("fastkaggle version: ", fastkaggle.__version__)
|
| fastkaggle version: 0.0.7
|
| comp = 'titanic' # competition name
path = fastkaggle.setup_comp(comp,
                             install = 'fastai "timm >= 0.6.2.dev0"')
|
| # Import basic dependencies such as np, pd
import fastai
from fastai.imports import *
print("fastai version: ", fastai.__version__)
|
| gender_submission.csv test.csv train.csv
|
Process and clean the data
Additional transformation and normalization are handled by auto-sklearn 2.
| df = pd.read_csv(path/'train.csv', index_col = 'PassengerId')
df.head()
|
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Define the data features (X_train) and the label (y_train).
| y_train = df['Survived']
X_train = df.drop(['Survived', 'Name'], axis = 1)
|
Since auto-sklearn 2 does not accept string columns, it is necessary to convert them into categorical columns.
| # Create a function that finds categorical columns and labels them as such
## Import dependency
from fastai.tabular.all import *
def to_cat(df):
    '''
    Convert string-type columns of a dataframe into categorical columns (in place)
    '''
    # Identify string/categorical columns in the dataframe
    _, cat = cont_cat_split(df, max_card = 1)
    # Convert each of them to the categorical dtype
    for col in cat:
        df[col] = pd.Categorical(df[col])
# Apply the conversion to the training features and preview the result
# (this call is inferred from the output shown below)
to_cat(X_train)
X_train.head()
|
| PassengerId | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 3 | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
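For readers who do not have fastai installed, roughly the same conversion can be sketched with plain pandas. This is an alternative under stated assumptions, not the method used above: the helper name `to_cat_plain` and the toy dataframe are hypothetical, and the helper treats every non-numeric column as categorical instead of relying on `cont_cat_split`. Unlike `to_cat`, it returns a copy and leaves its input untouched.

```python
import pandas as pd

def to_cat_plain(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which every non-numeric column is categorical."""
    out = frame.copy()
    for col in out.columns:
        # Numeric columns (ints, floats) are left alone
        if not pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].astype("category")
    return out

# Hypothetical toy data that mimics a few Titanic columns
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Embarked": ["S", "C", None],
})
converted = to_cat_plain(toy)
print(converted.dtypes)
```

Missing values such as the empty Embarked entry stay missing after the conversion, so imputation is still left to the downstream pipeline.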
Data exploration
The package pandas_profiling provides quick and valuable insights into the data.
| import pandas_profiling
print("pandas_profiling version: ", pandas_profiling.__version__)
|
| pandas_profiling version: 3.2.0
|
| X_train.profile_report(progress_bar = False).to_notebook_iframe()
# Use .to_notebook_iframe() for HTML format or .to_widgets() for built in widget view
|
| import autosklearn
print("autosklearn version: ", autosklearn.__version__)
|
| autosklearn version: 0.14.7
|
| import autosklearn.experimental.askl2
import autosklearn.metrics
# Configure model parameters
cls = autosklearn.experimental.askl2.AutoSklearn2Classifier(
    seed = 42,
    time_left_for_this_task = 600, # in seconds
    metric = autosklearn.metrics.roc_auc,
    memory_limit = None,
    n_jobs = -1) # Use all CPUs available
|
| %%capture
# Train the model
cls.fit(X_train, y_train)
|
| [WARNING] [2022-08-25 04:49:26,365:Client-AutoML(42):870bbeda-2420-11ed-af95-acde48001122] Time limit for a single run is higher than total time limit. Capping the limit for a single run to the total time given to SMAC (599.729022)
[WARNING] [2022-08-25 04:49:26,365:Client-AutoML(42):870bbeda-2420-11ed-af95-acde48001122] Capping the per_run_time_limit to 299.0 to have time for a least 2 models in each process.
|
| print(cls.sprint_statistics())
|
| auto-sklearn results:
Dataset name: 870bbeda-2420-11ed-af95-acde48001122
Metric: roc_auc
Best validation score: 0.884487
Number of target algorithm runs: 232
Number of successful target algorithm runs: 232
Number of crashed target algorithm runs: 0
Number of target algorithms that exceeded the time limit: 0
Number of target algorithms that exceeded the memory limit: 0
|
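The validation score reported above is a ROC AUC value. One way to read it is as the probability that a randomly chosen survivor receives a higher predicted score than a randomly chosen non-survivor. The snippet below is a minimal pure-Python sketch of that definition; the function name `roc_auc_pairwise` and the toy labels and scores are hypothetical, and in practice a library routine such as scikit-learn's `roc_auc_score` would be the sensible choice.

```python
def roc_auc_pairwise(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_pairwise(y_true, scores)
print(auc)  # 3 of the 4 pairs are ordered correctly -> 0.75
```

Because the metric depends on the ranking of scores, it is normally computed from predicted probabilities rather than from hard 0/1 labels.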
To save the progress thus far, we can use fastai’s function save_pickle to store the trained model.
| save_pickle('cls.pkl', cls)
|
If the pickled model needs to be accessed later, run the following:
| cls = load_pickle('cls.pkl')
|
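If fastai is not available, a comparable save and load round trip can be sketched with the standard library's pickle module. The helper names `save_model` and `load_model`, the dictionary stand-in, and the temporary directory are all hypothetical; a fitted classifier would take the stand-in's place, assuming it pickles cleanly.

```python
import pickle
import tempfile
from pathlib import Path

def save_model(path, model):
    """Serialize any picklable object to disk."""
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path):
    """Load an object previously written by save_model."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Stand-in object used only for illustration
stand_in = {"name": "demo-model", "weights": [0.02, 0.02, 0.92, 0.04]}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "cls.pkl"
    save_model(target, stand_in)
    restored = load_model(target)

print(restored == stand_in)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.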
Model insights
The contents of the model ensemble can be viewed below.
| print(cls.leaderboard())
|
| rank ensemble_weight type cost duration
model_id
115 1 0.02 passive_aggressive 0.117159 3.054357
164 2 0.02 passive_aggressive 0.119444 2.595886
175 3 0.92 sgd 0.121888 2.583323
159 4 0.04 sgd 0.124069 2.279986
|
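Two details of the leaderboard are easy to check by hand. The sketch below assumes that cost is reported as 1 minus the metric (so a lower cost means a higher ROC AUC) and that the ensemble prediction is a weighted average of each member's predicted probability. The leaderboard rows are copied from the output above, while the per-model probabilities are invented purely for illustration.

```python
import math

# (model_id, ensemble_weight, type, cost) copied from the leaderboard above
leaderboard = [
    (115, 0.02, "passive_aggressive", 0.117159),
    (164, 0.02, "passive_aggressive", 0.119444),
    (175, 0.92, "sgd", 0.121888),
    (159, 0.04, "sgd", 0.124069),
]

# The ensemble weights should add up to 1
total_weight = sum(weight for _, weight, _, _ in leaderboard)

# Assumption: cost = 1 - ROC AUC, so the metric can be recovered per model
aucs = {model_id: 1.0 - cost for model_id, _, _, cost in leaderboard}

# Hypothetical P(Survived = 1) from each member for a single passenger
member_probs = {115: 0.70, 164: 0.65, 175: 0.40, 159: 0.45}
ensemble_prob = sum(weight * member_probs[model_id]
                    for model_id, weight, _, _ in leaderboard)

print(math.isclose(total_weight, 1.0))
print(round(aucs[115], 6))
print(round(ensemble_prob, 3))
```

With a weight of 0.92, the sgd model with id 175 dominates the final prediction even though it is ranked third by individual cost.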
| import PipelineProfiler
%pip show pipelineprofiler
|
| Name: pipelineprofiler
Version: 0.1.18
Summary: Pipeline Profiler tool. Enables the exploration of D3M pipelines in Jupyter Notebooks
Home-page: https://github.com/VIDA-NYU/PipelineVis
Author: Jorge Piazentin Ono, Sonia Castelo, Roque Lopez, Enrico Bertini, Juliana Freire, Claudio Silva
Author-email: jorgehpo@nyu.edu
License: UNKNOWN
Location: /Users/tompham/Library/Python/3.10/lib/python/site-packages
Requires: networkx, notebook, numpy, python-dateutil, scikit-learn, scipy
Required-by:
Note: you may need to restart the kernel to use updated packages.
|
| profiler_data = PipelineProfiler.import_autosklearn(cls)
PipelineProfiler.plot_pipeline_matrix(profiler_data)
|
Use the trained model to make predictions
Process the test data in the same way as the training features (X_train).
| df_test = pd.read_csv(path/'test.csv', index_col = 'PassengerId')
df_test = df_test.drop('Name', axis = 1)
to_cat(df_test)
df_test.head()
|
| PassengerId | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 892 | 3 | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 893 | 3 | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
| 894 | 2 | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 895 | 3 | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 896 | 3 | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
| # Make prediction
prediction = cls.predict(df_test)
|
| # Convert the prediction from an ndarray to a dataframe
subm = pd.DataFrame(prediction,
                    index = df_test.index,
                    columns = ['Survived'])
|
Save the prediction as a .csv file.
| subm.to_csv('subm.csv')
# View the first few rows
!head subm.csv
|
| PassengerId,Survived
892,0
893,1
894,0
895,0
896,1
897,0
898,1
899,0
900,1
|
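Before submitting, it can be worth checking that the file has the expected shape. The snippet below is a minimal sketch using the standard library's csv module. The function name `check_submission` is hypothetical, the expected header is taken from the output above, and the rule that every label must be 0 or 1 is an assumption about what the competition accepts.

```python
import csv
import io

def check_submission(text, expected_header=("PassengerId", "Survived")):
    """Validate submission CSV text and return the number of data rows."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or tuple(rows[0]) != tuple(expected_header):
        raise ValueError("unexpected or missing header row")
    seen_ids = set()
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != 2:
            raise ValueError(f"line {line_no}: expected 2 fields, got {len(row)}")
        passenger_id, label = row
        if not passenger_id.isdigit():
            raise ValueError(f"line {line_no}: non-numeric PassengerId")
        if label not in {"0", "1"}:
            raise ValueError(f"line {line_no}: label must be 0 or 1")
        if passenger_id in seen_ids:
            raise ValueError(f"line {line_no}: duplicate PassengerId")
        seen_ids.add(passenger_id)
    return len(rows) - 1

# First rows of the file shown above
sample = "PassengerId,Survived\n892,0\n893,1\n894,0\n"
n_rows = check_submission(sample)
print(n_rows)
```

For the real file, the text could be read with `Path('subm.csv').read_text()` and passed to the same function.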
Submit the prediction directly to the Kaggle competition. The scores can be viewed on this webpage.
| # Submit to competition
from kaggle import api
api.competition_submit_cli('subm.csv', # file name
                           'auto-sklearn 2 - 10m', # version description
                           comp) # competition name
|
| 100%|██████████| 2.77k/2.77k [00:00<00:00, 5.35kB/s]
Successfully submitted to Titanic - Machine Learning from Disaster
|
This submission has an accuracy score of 79.186%, which places it in the top 6% of all submissions. Note: numerous entries at the top of the leaderboard report 100% accuracy, which comes from cheating.
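As a quick sanity check on that score, the accuracy can be turned back into a count of correct predictions. The sketch assumes the score is computed over all 418 rows of the Titanic test set; the post does not state how many rows are scored, so treat that number as an assumption.

```python
n_rows = 418          # assumption: every row of the test set is scored
reported = 0.79186    # accuracy reported for this submission

# Nearest whole number of correct predictions
correct = round(reported * n_rows)
recomputed = correct / n_rows

print(correct, n_rows - correct)
print(abs(recomputed - reported) < 1e-5)
```

Under that assumption, the score corresponds to 331 correct predictions and 87 wrong ones.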