{
"cells": [
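{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Machine Learning\n",
"\n",
"Machine learning adjusts a model's parameters so that the model represents the given data well; this adjustment is the \"learning\". The trained model can then be used to make predictions on new data.\n",
"\n",
"## Types of machine learning\n",
"Machine learning falls into three broad categories. Of these, this notebook works through supervised learning (classification and regression).\n",
"\n",
"1. Supervised learning\n",
"   - When the features of the data come with correct answers (labels, ground truth), the task is called supervised learning. The model uses the correct answers as a teacher and learns to predict the label from the data.\n",
"     - Classification: the label is discrete (e.g. predicting dog or cat)\n",
"     - Regression: the label is continuous (e.g. predicting tomorrow's stock price)\n",
"\n",
"2. Unsupervised learning\n",
"   - Using only unlabeled data, a model is built that captures the structure, characteristics, and patterns of the data well.\n",
"     - Clustering\n",
"     - Dimensionality reduction such as PCA\n",
"\n",
"3. Reinforcement learning\n",
"   - There are no labels, but a reward is given. The model learns to maximize the reward.\n"
]
},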
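{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data for machine learning\n",
"\n",
"- Training data\n",
"- Test data (testing data; this notebook also calls it validation data)\n",
"\n",
"In machine learning, a model learns the characteristics of one dataset and is then tested on a different one. A common way to evaluate an algorithm is to split a dataset into two parts. The model learns from one part, the training data, and what it has learned is tested on the other part, the test data. Strictly speaking, validation data is a separate held-out set used for tuning a model, but this notebook uses a single held-out set for evaluation.\n",
"\n",
"\n",
"Here we try out machine learning using Python's `scikit-learn`."
]
},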
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Supervised learning"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Classification"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"First, load a sample dataset.\n",
"The data gives the sepal length and width and the petal length and width of three species of iris as features.\n",
"From these features, we train a model to predict which of the three species a flower belongs to.\n",
"\n",
"In the data, the labels mean\n",
"- 0: `setosa`\n",
"- 1: `versicolor`\n",
"- 2: `virginica`\n",
"\n",
"**iris**  \n",
"iris is one of the best-known test datasets.\n",
"It records the petal length and width and the sepal length and width of iris flowers.\n",
"Note: the sepals of an iris are the parts that hang outward and have yellow streaks.\n",
"\n",
"Species (3)  \n",
"* setosa\n",
"* versicolor\n",
"* virginica\n",
"\n",
"Columns (as named in the DataFrame built below)  \n",
"* `sepal length (cm)`: sepal length [cm]\n",
"* `sepal width (cm)`: sepal width [cm]\n",
"* `petal length (cm)`: petal length [cm]\n",
"* `petal width (cm)`: petal width [cm]\n",
"* `species`: species, encoded as 0, 1, or 2\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from sklearn import datasets\n",
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.linear_model import LogisticRegression, LinearRegression\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, mean_squared_error\n",
"from sklearn import svm\n",
"from matplotlib import pyplot as plt\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>sepal length (cm)</th><th>sepal width (cm)</th><th>petal length (cm)</th><th>petal width (cm)</th><th>species</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>5.1</td><td>3.5</td><td>1.4</td><td>0.2</td><td>0</td></tr>\n",
"    <tr><th>1</th><td>4.9</td><td>3.0</td><td>1.4</td><td>0.2</td><td>0</td></tr>\n",
"    <tr><th>2</th><td>4.7</td><td>3.2</td><td>1.3</td><td>0.2</td><td>0</td></tr>\n",
"    <tr><th>3</th><td>4.6</td><td>3.1</td><td>1.5</td><td>0.2</td><td>0</td></tr>\n",
"    <tr><th>4</th><td>5.0</td><td>3.6</td><td>1.4</td><td>0.2</td><td>0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n",
"0 5.1 3.5 1.4 0.2 \n",
"1 4.9 3.0 1.4 0.2 \n",
"2 4.7 3.2 1.3 0.2 \n",
"3 4.6 3.1 1.5 0.2 \n",
"4 5.0 3.6 1.4 0.2 \n",
"\n",
" species \n",
"0 0 \n",
"1 0 \n",
"2 0 \n",
"3 0 \n",
"4 0 "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = datasets.load_iris()\n",
"# Build a DataFrame of the four features and append the class label as `species`\n",
"df = pd.DataFrame(data.data, columns=data.feature_names)\n",
"df['species'] = data.target\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>sepal length (cm)</th><th>sepal width (cm)</th><th>petal length (cm)</th><th>petal width (cm)</th><th>species</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>count</th><td>150.000000</td><td>150.000000</td><td>150.000000</td><td>150.000000</td><td>150.000000</td></tr>\n",
"    <tr><th>mean</th><td>5.843333</td><td>3.057333</td><td>3.758000</td><td>1.199333</td><td>1.000000</td></tr>\n",
"    <tr><th>std</th><td>0.828066</td><td>0.435866</td><td>1.765298</td><td>0.762238</td><td>0.819232</td></tr>\n",
"    <tr><th>min</th><td>4.300000</td><td>2.000000</td><td>1.000000</td><td>0.100000</td><td>0.000000</td></tr>\n",
"    <tr><th>25%</th><td>5.100000</td><td>2.800000</td><td>1.600000</td><td>0.300000</td><td>0.000000</td></tr>\n",
"    <tr><th>50%</th><td>5.800000</td><td>3.000000</td><td>4.350000</td><td>1.300000</td><td>1.000000</td></tr>\n",
"    <tr><th>75%</th><td>6.400000</td><td>3.300000</td><td>5.100000</td><td>1.800000</td><td>2.000000</td></tr>\n",
"    <tr><th>max</th><td>7.900000</td><td>4.400000</td><td>6.900000</td><td>2.500000</td><td>2.000000</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" sepal length (cm) sepal width (cm) petal length (cm) \\\n",
"count 150.000000 150.000000 150.000000 \n",
"mean 5.843333 3.057333 3.758000 \n",
"std 0.828066 0.435866 1.765298 \n",
"min 4.300000 2.000000 1.000000 \n",
"25% 5.100000 2.800000 1.600000 \n",
"50% 5.800000 3.000000 4.350000 \n",
"75% 6.400000 3.300000 5.100000 \n",
"max 7.900000 4.400000 6.900000 \n",
"\n",
" petal width (cm) species \n",
"count 150.000000 150.000000 \n",
"mean 1.199333 1.000000 \n",
"std 0.762238 0.819232 \n",
"min 0.100000 0.000000 \n",
"25% 0.300000 0.000000 \n",
"50% 1.300000 1.000000 \n",
"75% 1.800000 2.000000 \n",
"max 2.500000 2.000000 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"X = df.iloc[:, :4]\n",
"y = df['species']"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((150, 4), (150,))"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X.shape, y.shape"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Split into training data and test data (30% held out, stratified by class)\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)"
]
},
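{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The `stratify=y` argument asks `train_test_split` to keep the class proportions the same in both parts. The cell below is a minimal, self-contained sketch that checks this by counting the samples per class in each part. The `_demo` variable names are introduced here for illustration only, so that the variables used in the rest of the notebook are left untouched."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn import datasets\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Reload iris under illustrative names (assumption: not part of the original notebook)\n",
"iris_demo = datasets.load_iris()\n",
"X_demo, y_demo = iris_demo.data, iris_demo.target\n",
"\n",
"# Same split settings as above: 30% held out, stratified by class\n",
"_, _, y_train_demo, y_test_demo = train_test_split(\n",
"    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo)\n",
"\n",
"# Number of samples of class 0, 1, 2 in each part\n",
"print('train:', np.bincount(y_train_demo))\n",
"print('test: ', np.bincount(y_test_demo))"
]
},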
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.9777777777777777"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Logistic regression (since scikit-learn 0.22 the default settings fit a multinomial model rather than one-vs-rest)\n",
"model = LogisticRegression()\n",
"model.fit(X_train, y_train) # fit the model to the training data\n",
"y_predicted = model.predict(X_test) # predict labels for the test data\n",
"accuracy_score(y_test, y_predicted) # evaluate the prediction accuracy"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[15 0 0]\n",
" [ 0 15 0]\n",
" [ 0 1 14]]\n"
]
}
],
"source": [
"\n",
"print(confusion_matrix(y_test, y_predicted))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" 0 1.00 1.00 1.00 15\n",
" 1 0.94 1.00 0.97 15\n",
" 2 1.00 0.93 0.97 15\n",
"\n",
" accuracy 0.98 45\n",
" macro avg 0.98 0.98 0.98 45\n",
"weighted avg 0.98 0.98 0.98 45\n",
"\n"
]
}
],
"source": [
"\n",
"print(classification_report(y_test, y_predicted))"
]
},
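{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"In the confusion matrix, each row is a true class and each column is a predicted class. The precision and recall shown by `classification_report` can be derived from it by hand. The cell below is a minimal sketch of that arithmetic; the array is copied from the confusion matrix printed above, and the name `cm_demo` is illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Confusion matrix copied from the output above (rows: true class, columns: predicted class)\n",
"cm_demo = np.array([[15, 0, 0],\n",
"                    [0, 15, 0],\n",
"                    [0, 1, 14]])\n",
"\n",
"true_positives = np.diag(cm_demo)                 # correctly predicted samples per class\n",
"precision = true_positives / cm_demo.sum(axis=0)  # column sums = samples predicted as each class\n",
"recall = true_positives / cm_demo.sum(axis=1)     # row sums = samples that truly belong to each class\n",
"accuracy = true_positives.sum() / cm_demo.sum()   # correct predictions over all samples\n",
"\n",
"print('precision:', precision.round(2))\n",
"print('recall:   ', recall.round(2))\n",
"print('accuracy: ', round(accuracy, 4))"
]
},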
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.9619047619047619\n",
"0.9777777777777777\n"
]
}
],
"source": [
"# Support Vector Machine (SVM)\n",
"clf = svm.SVC()\n",
"clf.fit(X_train, y_train) # fit the model to the training data\n",
"y_predicted_clf = clf.predict(X_test) # predict labels for the test data\n",
"print(accuracy_score(y_test, y_predicted_clf))\n",
"print(confusion_matrix(y_test, y_predicted_clf))\n",
"print(classification_report(y_test, y_predicted_clf))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Regression"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the iris dataset, we build a regression model that predicts `petal width (cm)` from a single feature, `petal length (cm)`."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"X = df[['petal length (cm)']]\n",
"y = df[['petal width (cm)']]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Draw a scatter plot to check the correlation between the two variables."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize = (4,4))\n",
"plt.scatter(X,y, color = 'orange', alpha = .5)\n",
"plt.title('Scatter plot')\n",
"plt.xlabel('petal length')\n",
"plt.ylabel('petal width')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"# Split into training data and test data (30% held out)\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.9217091797032136\n",
"0.039618634662187104\n"
]
}
],
"source": [
"lr = LinearRegression()\n",
"lr.fit(X_train, y_train)\n",
"print(lr.score(X_test, y_test)) # R2\n",
"y_predicted = lr.predict(X_test)\n",
"print(mean_squared_error(y_test, y_predicted))"
]
},
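{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"`LinearRegression.score` returns the coefficient of determination R2, defined as 1 minus the residual sum of squares divided by the total sum of squares. The cell below is a minimal, self-contained sketch that refits the same kind of model, prints the fitted slope and intercept, and computes R2 and the mean squared error by hand so they can be compared with the library values. The `_demo` names and the short names `X_tr`, `X_te`, `y_tr`, `y_te` are introduced here for illustration only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn import datasets\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"iris_demo = datasets.load_iris()\n",
"X_demo = iris_demo.data[:, [2]]  # petal length (cm), kept 2-D as scikit-learn expects\n",
"y_demo = iris_demo.data[:, 3]    # petal width (cm)\n",
"\n",
"X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)\n",
"lr_demo = LinearRegression().fit(X_tr, y_tr)\n",
"print('slope:', lr_demo.coef_[0], 'intercept:', lr_demo.intercept_)\n",
"\n",
"pred = lr_demo.predict(X_te)\n",
"ss_res = np.sum((y_te - pred) ** 2)         # residual sum of squares\n",
"ss_tot = np.sum((y_te - y_te.mean()) ** 2)  # total sum of squares\n",
"r2_manual = 1 - ss_res / ss_tot\n",
"mse_manual = ss_res / len(y_te)\n",
"\n",
"print('R2 by hand:', r2_manual, ' R2 from score():', lr_demo.score(X_te, y_te))\n",
"print('MSE by hand:', mse_manual)"
]
},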
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, plot the fitted regression line to check how well it represents the actual `petal length (cm)` and `petal width (cm)` data."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.scatter(X,y)\n",
"plt.plot(X, lr.predict(X), color = 'red') # plot the regression line\n",
"plt.xlabel('petal length')\n",
"plt.ylabel('petal width')\n",
"plt.show()"
]
},
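{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"A residual plot is another useful way to check how well a model represents the data.\n",
"\n",
"If the residuals scatter randomly around 0, the model represents the data well. If they show some other pattern instead, this suggests that there is information the model does not explain."
]
},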
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.scatter(y_predicted, y_predicted - y_test, color = 'blue', alpha = 0.3) # plot the residuals (predicted minus actual)\n",
"plt.hlines(y = 0, xmin = y_predicted.min(), xmax = y_predicted.max(), color = 'black') # horizontal reference line at residual 0\n",
"plt.title('Residual Plot')\n",
"plt.xlabel('Predicted Values')\n",
"plt.ylabel('Residuals')\n",
"plt.grid()\n",
"plt.show()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.16"
}
},
"nbformat": 4,
"nbformat_minor": 4
}