Machines Do Not Lie

Python Scikit Learn Iris Flower Analysis, Classification

KillinTime 2021. 8. 12. 20:18

This post uses the Iris flower dataset, one of the datasets bundled with the Scikit Learn library.

from sklearn.datasets import load_iris
iris_dataset = load_iris()
print(type(iris_dataset))
# sklearn.utils.Bunch
print(iris_dataset.keys())
# dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
print(iris_dataset['DESCR'][:193])
'''
.. _iris_dataset:
Iris plants dataset
--------------------
**Data Set Characteristics:**
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, pre
'''

 

 

We can check what the target and the features are.

print('Target names: ', iris_dataset['target_names'])
# Target names: ['setosa' 'versicolor' 'virginica']
print('Features names: \n', iris_dataset['feature_names'])
'''
Features names:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
'''

 

Target, Feature Data, Shape

print(iris_dataset['data'])
'''
array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
...
'''
print('target:\n', iris_dataset['target'])
'''
target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
'''
print(iris_dataset['data'].shape, iris_dataset['target'].shape)
# (150, 4) (150,)
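
The integer labels in `target` index into `target_names`, so the two arrays can be combined to see how many samples each species has. The following is a minimal sketch of that lookup; the variable names `counts` and `species` are illustrative and not part of the original post.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Count how many samples carry each integer label (0, 1, 2).
counts = np.bincount(iris_dataset['target'])

# Use the integer labels as indices to get a species name per sample.
species = iris_dataset['target_names'][iris_dataset['target']]

for name, count in zip(iris_dataset['target_names'], counts):
    print(name, count)
# setosa 50
# versicolor 50
# virginica 50
```

This matches the "50 in each of three classes" line in `DESCR`. It also shows that the samples are stored sorted by class, which is why shuffling matters when splitting.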

 

Split Data

from sklearn.model_selection import train_test_split
# test_size defaults to 0.25. shuffle defaults to True; set it to False to keep the original order
# Passing stratify=iris_dataset['target'] keeps the class ratio (approximately) the same in the train and test sets
# Changing random_state shuffles the data in a different order
x_train, x_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], test_size=0.25, random_state=0)
print('x_train shape: ', x_train.shape)
print('y_train shape: ', y_train.shape)
'''
x_train shape: (112, 4)
y_train shape: (112,)
'''
print('x_test shape: ', x_test.shape)
print('y_test shape: ', y_test.shape)
'''
x_test shape: (38, 4)
y_test shape: (38,)
'''
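
To see what the `stratify` option changes, the class counts of the test set can be compared with and without it. This is a small sketch under the same split settings as above; `y_test_plain` and `y_test_strat` are names introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X, y = iris_dataset['data'], iris_dataset['target']

# Plain shuffled split: class counts in the test set are left to chance.
_, _, _, y_test_plain = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Stratified split: each class keeps roughly its original share.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

print('without stratify:', np.bincount(y_test_plain, minlength=3))
print('with stratify:   ', np.bincount(y_test_strat, minlength=3))
```

Because the full dataset has 50 samples per class, the stratified test set of 38 samples should hold 12 or 13 samples of each class, while the plain split can be noticeably less balanced.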

 

Model Selection

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
logistic = LogisticRegression(max_iter=300)

 

Training, Prediction

# training
knn.fit(x_train, y_train)
logistic.fit(x_train, y_train)
# prediction
logistic_y_pred = logistic.predict(x_test)
knn_y_pred = knn.predict(x_test)
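
With `n_neighbors=1`, the classifier assigns each test sample the label of its single closest training sample. The sketch below reproduces that idea directly in NumPy, assuming plain Euclidean distance (the default metric of `KNeighborsClassifier`); it is an illustration of the idea, not how the library implements it internally, and `manual_pred` is a name introduced here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], test_size=0.25, random_state=0)

# Euclidean distance from every test sample to every training sample.
diffs = x_test[:, np.newaxis, :] - x_train[np.newaxis, :, :]  # (n_test, n_train, 4)
dists = np.sqrt((diffs ** 2).sum(axis=2))                      # (n_test, n_train)

# Index of the closest training sample for each test sample, then its label.
nearest_idx = dists.argmin(axis=1)
manual_pred = y_train[nearest_idx]

# Compare against the library classifier.
knn = KNeighborsClassifier(n_neighbors=1).fit(x_train, y_train)
agreement = np.mean(manual_pred == knn.predict(x_test))
print('agreement with KNeighborsClassifier:', agreement)
```

The two sets of predictions should agree on all or nearly all test samples; any difference would come from ties between equally distant training samples.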

 

Score

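import numpy as np
# Accuracy: the fraction of test samples whose prediction matches the true label
print('Test set score(logistic): ', np.mean(logistic_y_pred == y_test))
print('Test set score(knn): ', np.mean(knn_y_pred == y_test))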
'''
Test set score(logistic): 0.9736842105263158
Test set score(knn): 0.9736842105263158
# When stratify=iris_dataset['target'] is set in the split
Test set score(logistic): 1.0
Test set score(knn): 0.9736842105263158
'''
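
A score of 0.9736842105263158 on 38 test samples corresponds to 37 correct predictions (37 / 38). The same accuracy can also be obtained with scikit-learn's own helpers instead of `np.mean`. The following sketch, which refits the logistic regression model so that it runs on its own, compares the three approaches; the `acc_*` names are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], test_size=0.25, random_state=0)

logistic = LogisticRegression(max_iter=300).fit(x_train, y_train)
y_pred = logistic.predict(x_test)

# Three equivalent ways to compute accuracy on the test set.
acc_manual = np.mean(y_pred == y_test)        # fraction of matching labels
acc_metric = accuracy_score(y_test, y_pred)   # metrics helper
acc_score = logistic.score(x_test, y_test)    # predicts and scores in one call

n_correct = int(np.sum(y_pred == y_test))
print(f'{n_correct} of {len(y_test)} correct')
print(acc_manual, acc_metric, acc_score)
```

Since the test set is small, a single misclassified sample moves the score by about 2.6 percentage points, so the difference between 0.974 and 1.0 above comes down to one sample.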