기계는 거짓말하지 않는다

Python Scikit Learn Iris Flower Analysis, Classification

KillinTime 2021. 8. 12. 20:18

This post uses the Iris flower data, one of the datasets bundled with the Scikit Learn library.

from sklearn.datasets import load_iris

iris_dataset = load_iris()
print(type(iris_dataset))
# sklearn.utils.Bunch

print(iris_dataset.keys())
# dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])

print(iris_dataset['DESCR'][:193])
'''
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, pre
'''
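
The Bunch object shown above behaves like a dictionary whose keys can also be read as attributes. The following is a minimal sketch, assuming scikit-learn is installed; the names `same_data`, `X`, and `y` are chosen here for illustration and do not appear in the original post.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Dictionary-style and attribute-style access return the same array object
same_data = iris_dataset['data'] is iris_dataset.data

# return_X_y=True skips the Bunch and returns (data, target) directly
X, y = load_iris(return_X_y=True)

print(same_data, X.shape, y.shape)
```

Either access style works for the rest of the post; the dictionary style is kept below to match the original snippets.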

 

 

You can check what the target classes and the features are.

print('Target names: ', iris_dataset['target_names'])
# Target names:  ['setosa' 'versicolor' 'virginica']
print('Features names: \n', iris_dataset['feature_names'])
'''
Features names: 
 ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
'''

 

Target, Feature Data, Shape

print(iris_dataset['data'])
'''
array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       ...
'''

print('target:\n', iris_dataset['target'])
'''
target:
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
'''

print(iris_dataset['data'].shape, iris_dataset['target'].shape)
# (150, 4) (150,)
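
The target array above is sorted by class, with 50 samples per class according to the dataset description. A short sketch that counts the samples per class, assuming NumPy is available; `counts` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# np.bincount counts how often each integer label (0, 1, 2) occurs
counts = np.bincount(iris_dataset['target'])

for name, count in zip(iris_dataset['target_names'], counts):
    print(name, count)
```

Because the labels are stored in class order, taking the last 25% of the rows without shuffling would produce a test set containing only one class, which is why the split in the next step shuffles the data.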

 

Split Data

from sklearn.model_selection import train_test_split

# test_size defaults to 0.25. shuffle defaults to True; set it to False to keep sequential data in its original order
# Passing stratify=iris_dataset['target'] keeps the class ratio the same in the train and test datasets
# Changing random_state shuffles the data in a different order
x_train, x_test, y_train, y_test = \
    train_test_split(iris_dataset['data'], iris_dataset['target'], test_size=0.25, random_state=0)
    
print('x_train shape: ', x_train.shape)
print('y_train shape: ', y_train.shape)
'''
x_train shape:  (112, 4)
y_train shape:  (112,)
'''

print('x_test shape: ', x_test.shape)
print('y_test shape: ', y_test.shape)
'''
x_test shape:  (38, 4)
y_test shape:  (38,)
'''
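
To see what the stratify option mentioned in the comments does, the class counts of the test set can be compared with and without it. This is a sketch under the same settings as above (test_size=0.25, random_state=0); the variable names are illustrative, and the exact counts of the non-stratified split are left for the reader to inspect.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X, y = iris_dataset['data'], iris_dataset['target']

# Plain shuffled split: class proportions in the test set may drift
_, _, _, y_test_plain = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Stratified split: class proportions follow the full dataset
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

plain_counts = np.bincount(y_test_plain, minlength=3)
strat_counts = np.bincount(y_test_strat, minlength=3)

print('without stratify:', plain_counts)
print('with stratify:   ', strat_counts)
```

With three equally sized classes, the stratified test set should hold nearly the same number of samples from each class, while the plain split depends on how the shuffle happened to fall.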

 

Model Selection

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)
logistic = LogisticRegression(max_iter=300)

 

Training, Prediction

# training
knn.fit(x_train, y_train)
logistic.fit(x_train, y_train)

# prediction
logistic_y_pred = logistic.predict(x_test)
knn_y_pred = knn.predict(x_test)
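
The predict method returns integer class labels, which can be mapped back to species names through target_names. The sketch below classifies a single measurement; `x_new` is a hypothetical sample made up for illustration, not a row taken from the post.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'],
    test_size=0.25, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(x_train, y_train)

# Hypothetical flower: sepal length, sepal width, petal length, petal width (cm)
# predict expects a 2D array of shape (n_samples, n_features)
x_new = np.array([[5.0, 2.9, 1.0, 0.2]])

pred = knn.predict(x_new)
pred_name = iris_dataset['target_names'][pred[0]]

print('Prediction:', pred)
print('Predicted target name:', pred_name)
```

Note that the input must be two-dimensional even for a single sample, which is why `x_new` is wrapped in an extra pair of brackets.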

 

Score

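import numpy as np

# Accuracy is the fraction of test samples whose prediction matches the true label
print('Test set score(logistic): ', np.mean(logistic_y_pred == y_test))
print('Test set score(knn): ', np.mean(knn_y_pred == y_test))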

'''
Test set score(logistic):  0.9736842105263158
Test set score(knn):  0.9736842105263158

# When stratify=iris_dataset['target'] is set in the split
Test set score(logistic):  1.0
Test set score(knn):  0.9736842105263158
'''
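
The np.mean comparison used for the score is plain accuracy, so it should agree with the estimator's own score method and with accuracy_score from sklearn.metrics. A minimal sketch that checks this for the KNN model, reusing the same split settings as above; `manual`, `via_score`, and `via_metric` are names introduced here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'],
    test_size=0.25, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(x_train, y_train)
knn_y_pred = knn.predict(x_test)

# Three ways to compute the same accuracy
manual = np.mean(knn_y_pred == y_test)
via_score = knn.score(x_test, y_test)
via_metric = accuracy_score(y_test, knn_y_pred)

print(manual, via_score, via_metric)
```

For classifiers, score(x_test, y_test) predicts internally and returns mean accuracy, so it saves the separate predict call when only the score is needed.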