Installation and Setup
1. Install Python
Steps:
- Download and install Python from python.org
- Verify the installation by running this command in your terminal:
python --version
2. Create a Virtual Environment
Steps:
- Install virtualenv:
pip install virtualenv
- Create a new environment:
virtualenv sklearn_env
- Activate the environment:
- Windows:
sklearn_env\Scripts\activate
- macOS/Linux:
source sklearn_env/bin/activate
3. Install scikit-learn
Steps:
- Install scikit-learn:
pip install scikit-learn
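- Install commonly used companion libraries (NumPy is installed automatically with scikit-learn; pandas and matplotlib are optional):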
pip install numpy pandas matplotlib
- Verify installation:
import sklearn
print(f"scikit-learn version: {sklearn.__version__}")
Data Preprocessing
Handling Data
scikit-learn provides transformers for common preprocessing tasks. The example below imputes missing values, scales numeric columns, and one-hot encodes categorical columns:
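from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# numeric_features and categorical_features are lists of column names
# that you define for your own dataset
# Create preprocessing steps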
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='mean')),
('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Combine transformers
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
])
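To make the snippet above concrete, here is a minimal runnable sketch of the same preprocessor applied to a small table. The DataFrame, its column names (`age`, `income`, `city`), and its values are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: every column has one missing value.
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0],
    "income": [40000.0, np.nan, 58000.0, 72000.0],
    "city": ["Paris", "Lyon", np.nan, "Paris"],
})

numeric_features = ["age", "income"]
categorical_features = ["city"]

# Numeric columns: fill gaps with the column mean, then standardize.
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])

# Two scaled numeric columns plus one column per city category.
X = preprocessor.fit_transform(df)
print(X.shape)
```

With this toy data the result has four columns: two scaled numeric features and one indicator column for each of the two cities.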
Key Features
- Feature Scaling: Standardize or normalize features for better model performance
- Encoding: Convert categorical variables to numerical representations
- Imputation: Handle missing values in your dataset
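Each of these features can also be tried in isolation. The sketch below uses tiny hand-made arrays (the values are illustrative, not from a real dataset) to show what imputation, scaling, and encoding each do to their input.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Imputation: the missing entry is replaced by the column mean, (1 + 3) / 2 = 2.
X_missing = np.array([[1.0], [np.nan], [3.0]])
X_imputed = SimpleImputer(strategy="mean").fit_transform(X_missing)

# Scaling: the column is shifted and rescaled to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_imputed)

# Encoding: each category becomes its own 0/1 column (sorted: green, red).
encoder = OneHotEncoder(handle_unknown="ignore")
colors = np.array([["red"], ["green"], ["red"]])
encoded = encoder.fit_transform(colors).toarray()

# With handle_unknown="ignore", a category not seen during fit
# is encoded as a row of zeros instead of raising an error.
unknown = encoder.transform(np.array([["blue"]])).toarray()

print(X_imputed.ravel())
print(X_scaled.ravel())
print(encoded)
print(unknown)
```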
Model Training
Building Models
scikit-learn estimators share the same fit/predict interface, so training different model types looks nearly identical:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC
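# X_train and y_train are your own training features and targets,
# for example from sklearn.model_selection.train_test_split
# Classification example (y_train holds class labels)
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
# Regression example (y_train holds continuous values)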
reg = LinearRegression()
reg.fit(X_train, y_train)
# Support Vector Machine example
svm = SVC(kernel='rbf')
svm.fit(X_train, y_train)
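The snippet above assumes `X_train` and `y_train` already exist. A minimal end-to-end sketch follows, assuming synthetic data from `make_classification` and `make_regression` stands in for a real dataset; the sample counts and `random_state` values are arbitrary choices.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Classification: synthetic features with binary class labels.
X_cls, y_cls = make_classification(n_samples=200, n_features=8, random_state=0)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_cls, y_cls, test_size=0.25, random_state=0
)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(Xc_train, yc_train)
clf_accuracy = clf.score(Xc_test, yc_test)  # mean accuracy on held-out data

# Regression: synthetic features with a continuous target.
X_reg, y_reg = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, test_size=0.25, random_state=0
)
reg = LinearRegression()
reg.fit(Xr_train, yr_train)
reg_r2 = reg.score(Xr_test, yr_test)  # R^2 on held-out data

print(f"Classifier accuracy: {clf_accuracy:.3f}")
print(f"Regressor R2: {reg_r2:.3f}")
```

Keeping separate variables for the classification and regression splits avoids accidentally fitting a regressor on class labels.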
Model Types
- Supervised Learning: Classification and regression models
- Unsupervised Learning: Clustering and dimensionality reduction
- Model Selection: Cross-validation and hyperparameter tuning
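As one illustration of the model selection entry, the sketch below tunes an SVM with `GridSearchCV`. The iris dataset and the candidate values in `param_grid` are assumptions made for the example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hypothetical search space: 3 values of C times 2 values of gamma.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}

# Every combination is scored with 5-fold cross-validation.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` holds a model refit on all of the data with the best parameter combination.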
Model Evaluation
Assessing Performance
Evaluate your models using various metrics:
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import mean_squared_error, r2_score
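# Classification metrics (precision_score and recall_score assume binary labels
# by default; pass average='macro' or similar for multiclass targets)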
y_pred = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Precision: {precision_score(y_test, y_pred)}")
print(f"Recall: {recall_score(y_test, y_pred)}")
# Regression metrics (here y_test must hold the continuous targets used for reg)
y_pred = reg.predict(X_test)
print(f"MSE: {mean_squared_error(y_test, y_pred)}")
print(f"R2 Score: {r2_score(y_test, y_pred)}")
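To see what these metrics compute, here is a small sketch on hand-made label vectors, so the numbers can be checked by hand. All arrays are illustrative. RMSE is taken as the square root of MSE with NumPy, which is one way to obtain it.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
)

# Binary classification: 3 true positives, 1 false positive, 1 false negative.
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [1, 1, 0, 0, 1, 1]
accuracy = accuracy_score(y_true, y_pred)    # 4 correct out of 6
precision = precision_score(y_true, y_pred)  # 3 / (3 + 1)
recall = recall_score(y_true, y_pred)        # 3 / (3 + 1)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

# Multiclass targets need an explicit averaging strategy.
y_true_mc = [0, 1, 2, 2]
y_pred_mc = [0, 1, 2, 1]
macro_precision = precision_score(y_true_mc, y_pred_mc, average="macro")

# Regression: errors are 1, 0, -2, -1, so the squared errors sum to 6.
y_true_reg = [3.0, 5.0, 2.0, 7.0]
y_pred_reg = [2.0, 5.0, 4.0, 8.0]
mse = mean_squared_error(y_true_reg, y_pred_reg)  # 6 / 4
rmse = np.sqrt(mse)
r2 = r2_score(y_true_reg, y_pred_reg)

print(accuracy, precision, recall, f1, macro_precision)
print(mse, rmse, r2)
```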
Evaluation Metrics
- Classification Metrics: Accuracy, precision, recall, F1-score
- Regression Metrics: MSE, RMSE, R2 score
- Cross-Validation: K-fold, stratified, time series splits
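The three cross-validation strategies differ in how they assign samples to folds. The sketch below, using a hypothetical six-sample dataset, collects the indices each splitter produces so the differences can be inspected.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, TimeSeriesSplit

# Hypothetical data: 6 samples, 2 features, two balanced classes.
X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 0, 1, 1, 1])

# K-fold: consecutive blocks of samples take turns as the test set.
kfold_tests = [test.tolist() for _, test in KFold(n_splits=3).split(X)]

# Stratified K-fold: each test fold keeps the class proportions of y.
strat_test_labels = [
    sorted(y[test].tolist())
    for _, test in StratifiedKFold(n_splits=3).split(X, y)
]

# Time series split: training indices always come before test indices,
# so the model is never trained on samples from the future.
ts_splits = [
    (train.tolist(), test.tolist())
    for train, test in TimeSeriesSplit(n_splits=3).split(X)
]

print(kfold_tests)
print(strat_test_labels)
print(ts_splits)
```

Any of these splitter objects can be passed as the `cv` argument of `cross_val_score` or `GridSearchCV`.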
Model Deployment
Saving and Loading Models
Persist a trained pipeline with joblib so it can be reloaded later for inference:
import joblib
from sklearn.pipeline import Pipeline
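# Create a complete pipeline and fit it, so the preprocessor is trained
# together with the classifier before saving
pipeline = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', clf)
])
pipeline.fit(X_train, y_train)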
# Save the model
joblib.dump(pipeline, 'model.joblib')
# Load the model
loaded_model = joblib.load('model.joblib')
# Make predictions (new_data must have the same columns as the training data)
predictions = loaded_model.predict(new_data)
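A save/load round trip can be checked end to end. The sketch below assumes the iris dataset and a simple scaler plus logistic regression pipeline as stand-ins for a real model, and writes to a temporary directory rather than a fixed path.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Fit the whole pipeline before saving so every step is trained.
pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)

# Save to a temporary location and load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(pipeline, model_path)
    loaded_model = joblib.load(model_path)

# The reloaded pipeline should reproduce the original predictions.
same_predictions = np.array_equal(pipeline.predict(X), loaded_model.predict(X))
print(same_predictions)
```

joblib files are pickle-based, so only load model files from sources you trust, and prefer loading with the same scikit-learn version that saved them.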
Deployment Options
- Model Persistence: Save and load models using joblib
- API Deployment: Deploy models as REST APIs
- Cloud Deployment: Deploy models on cloud platforms
Best Practices
Development Guidelines
Follow these best practices for efficient scikit-learn development:
- Always preprocess your data before training, fitting the preprocessing steps on the training data only to avoid data leakage
- Use pipelines for reproducible workflows
- Perform cross-validation for reliable results
- Regularize models to reduce overfitting
- Document your preprocessing steps
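Several of these guidelines can be combined in a few lines. The sketch below is one possible arrangement, assuming the breast cancer dataset bundled with scikit-learn and a logistic regression whose `C` parameter controls regularization strength (smaller values mean stronger regularization).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Putting the scaler inside the pipeline means it is re-fit on the
# training part of every fold, so no test-fold statistics leak in.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=1.0, max_iter=1000),  # L2-regularized by default
)

# 5-fold cross-validation gives a more reliable estimate than one split.
scores = cross_val_score(model, X, y, cv=5)
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```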
Next Steps
Now that you understand the basics of scikit-learn, you can:
- Explore advanced model architectures
- Learn about feature engineering
- Implement custom transformers
- Deploy models in production