kharuka2016's Blog

A blog for jotting down everyday happenings

[Correction] Machine Learning: Linear Regression, Part 1

入門 Python 3

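Table of Contents

Correction

Step 2.7. Convert to a DataFrame.

Prerequisites

OS: Windows 10 64-bit, version 1607

Anaconda 4.4.0 (Python 3.6 version, 64-bit) installed

NumPy installed

seaborn installed

scikit-learn (machine learning library) installed

Procedure Overview

  1. Launch Jupyter Notebook from the Command Prompt.

  2. Linear regression, part 1

Procedure

1. Launch Jupyter Notebook from the Command Prompt.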

ipython notebook

2. Linear regression, part 1

2.1. Import the libraries.
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline
2.2. Import load_boston.
from sklearn.datasets import load_boston
2.3. Store the return value of load_boston in boston.
boston=load_boston()
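2.4. Display the description of the Boston house-price dataset.

Incidentally, visiting the site (http://archive.ics.uci.edu/ml/datasets/Housing) shows the message below. It looks like the Housing page no longer exists. At this point I already had a bad feeling...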

I’m sorry, the dataset “Housing” does not appear to exist.

print(boston.DESCR)

Out:

Boston House Prices dataset
===========================

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)
2.5. Draw a histogram of the prices.
plt.hist(boston.target,bins=50)
plt.xlabel('Price($1,000)')
plt.ylabel('Number of houses')
2.6. Plot the relationship between the number of rooms and the house price.
plt.scatter(boston.data[:,5],boston.target)
plt.xlabel('Number of rooms')
plt.ylabel('Price($1,000)')
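2.7. Convert to a DataFrame.

I had misspelled boston_df.columns as boston_df.colmuns. The correct spelling is boston_df.columns. I deeply regret the mistake. It also taught me something about Python: assigning to a misspelled attribute name raises no error, and the data is simply stored in a new attribute with that name.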

boston_df=DataFrame(boston.data)
boston_df.columns=boston.feature_names
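
The behavior described above can be reproduced without pandas. The following is only a minimal sketch: the class Table is a hypothetical stand-in (it is not part of pandas), and its columns attribute plays the role of boston_df.columns. It shows that assigning to a misspelled attribute name raises no error and leaves the real attribute untouched.

```python
class Table:
    """Hypothetical stand-in for a DataFrame; it only holds column labels."""

    def __init__(self, n_columns):
        # Default labels are integers, like those of DataFrame(boston.data)
        self.columns = list(range(n_columns))


table = Table(3)

# Typo: "colmuns" instead of "columns". Python raises no error here;
# it simply creates a brand-new attribute on the object.
table.colmuns = ["CRIM", "ZN", "INDUS"]

labels_after_typo = list(table.columns)
print(labels_after_typo)  # still the default integer labels
print(table.colmuns)      # the names ended up in the misspelled attribute

# Correct spelling: the labels are actually replaced.
table.columns = ["CRIM", "ZN", "INDUS"]
print(table.columns)
```

This is why the typo went unnoticed until a later step looked up a column by name.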
2.8. Display the first five rows of the DataFrame.
boston_df.head()

Out:

         0   1     2  3      4      5     6       7  8    9    10      11    12
0  0.00632  18  2.31  0  0.538  6.575  65.2    4.09  1  296  15.3   396.9  4.98
1  0.02731   0  7.07  0  0.469  6.421  78.9  4.9671  2  242  17.8   396.9  9.14
2  0.02729   0  7.07  0  0.469  7.185  61.1  4.9671  2  242  17.8  392.83  4.03
3  0.03237   0  2.18  0  0.458  6.998  45.8  6.0622  3  222  18.7  394.63  2.94
4  0.06905   0  2.18  0  0.458  7.147  54.2  6.0622  3  222  18.7   396.9  5.33
2.9. Add a Price column.
boston_df['Price']=boston.target
2.10. Display the first five rows of the DataFrame.
boston_df.head()

Out:

         0   1     2  3      4      5     6       7  8    9    10      11    12  Price
0  0.00632  18  2.31  0  0.538  6.575  65.2    4.09  1  296  15.3   396.9  4.98     24
1  0.02731   0  7.07  0  0.469  6.421  78.9  4.9671  2  242  17.8   396.9  9.14   21.6
2  0.02729   0  7.07  0  0.469  7.185  61.1  4.9671  2  242  17.8  392.83  4.03   34.7
3  0.03237   0  2.18  0  0.458  6.998  45.8  6.0622  3  222  18.7  394.63  2.94   33.4
4  0.06905   0  2.18  0  0.458  7.147  54.2  6.0622  3  222  18.7   396.9  5.33   36.2
2.11. I got the error "['RM'] not in index", meaning there is no column labeled RM. Sure enough, the column labels shown in step 2.10 are still numbers.
sns.lmplot('RM','Price',data=boston_df)

Out:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-103-50dcc3ed7a37> in <module>()
----> 1 sns.lmplot('RM','Price',data=boston_df)

C:\Users\XXX\AppData\Local\Continuum\Anaconda3\lib\site-packages\seaborn\regression.py in lmplot(x, y, data, hue, col, row, palette, col_wrap, size, aspect, markers, sharex, sharey, hue_order, col_order, row_order, legend, legend_out, x_estimator, x_bins, x_ci, scatter, fit_reg, ci, n_boot, units, order, logistic, lowess, robust, logx, x_partial, y_partial, truncate, x_jitter, y_jitter, scatter_kws, line_kws)
    550     need_cols = [x, y, hue, col, row, units, x_partial, y_partial]
    551     cols = np.unique([a for a in need_cols if a is not None]).tolist()
--> 552     data = data[cols]
    553 
    554     # Initialize the grid

C:\Users\XXX\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   2054         if isinstance(key, (Series, np.ndarray, Index, list)):
   2055             # either boolean or fancy integer index
-> 2056             return self._getitem_array(key)
   2057         elif isinstance(key, DataFrame):
   2058             return self._getitem_frame(key)

C:\Users\XXX\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\frame.py in _getitem_array(self, key)
   2098             return self.take(indexer, axis=0, convert=False)
   2099         else:
-> 2100             indexer = self.loc._convert_to_indexer(key, axis=1)
   2101             return self.take(indexer, axis=1, convert=True)
   2102 

C:\Users\XXX\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\indexing.py in _convert_to_indexer(self, obj, axis, is_setter)
   1229                 mask = check == -1
   1230                 if mask.any():
-> 1231                     raise KeyError('%s not in index' % objarr[mask])
   1232 
   1233                 return _values_from_object(indexer)

KeyError: "['RM'] not in index"
2.12. Maybe it would work if I replaced RM with 5, the column number that corresponds to RM? It did not...
sns.lmplot(5,'Price',data=boston_df)

Out:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-109-3a443a37aee5> in <module>()
----> 1 sns.lmplot(5,'Price',data=boston_df)

C:\Users\XXX\AppData\Local\Continuum\Anaconda3\lib\site-packages\seaborn\regression.py in lmplot(x, y, data, hue, col, row, palette, col_wrap, size, aspect, markers, sharex, sharey, hue_order, col_order, row_order, legend, legend_out, x_estimator, x_bins, x_ci, scatter, fit_reg, ci, n_boot, units, order, logistic, lowess, robust, logx, x_partial, y_partial, truncate, x_jitter, y_jitter, scatter_kws, line_kws)
    550     need_cols = [x, y, hue, col, row, units, x_partial, y_partial]
    551     cols = np.unique([a for a in need_cols if a is not None]).tolist()
--> 552     data = data[cols]
    553 
    554     # Initialize the grid

C:\Users\XXX\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   2054         if isinstance(key, (Series, np.ndarray, Index, list)):
   2055             # either boolean or fancy integer index
-> 2056             return self._getitem_array(key)
   2057         elif isinstance(key, DataFrame):
   2058             return self._getitem_frame(key)

C:\Users\XXX\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\frame.py in _getitem_array(self, key)
   2098             return self.take(indexer, axis=0, convert=False)
   2099         else:
-> 2100             indexer = self.loc._convert_to_indexer(key, axis=1)
   2101             return self.take(indexer, axis=1, convert=True)
   2102 

C:\Users\XXX\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\indexing.py in _convert_to_indexer(self, obj, axis, is_setter)
   1229                 mask = check == -1
   1230                 if mask.any():
-> 1231                     raise KeyError('%s not in index' % objarr[mask])
   1232 
   1233                 return _values_from_object(indexer)

KeyError: "['5'] not in index"
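
The traceback hints at why the error mentions '5' as a string even though the integer 5 was passed: in this seaborn version, lmplot gathers the requested column names with np.unique, and NumPy turns a list that mixes an integer and a string into an array of strings. Below is a minimal sketch of that conversion, assuming only that NumPy is installed. The names need_cols and cols are copied from the traceback; labels is made up here to mimic the DataFrame's column labels.

```python
import numpy as np

# Same shape as the list built inside lmplot in the traceback:
# the x column (integer 5), the y column ('Price'), and unused options.
need_cols = [5, 'Price', None, None]

# Mixing int and str makes NumPy build a string array, so 5 becomes '5'.
cols = np.unique([a for a in need_cols if a is not None]).tolist()
print(cols)

# The DataFrame's column labels are the integers 0..12 plus 'Price',
# so the string '5' does not match the integer label 5.
labels = list(range(13)) + ['Price']
print('5' in labels)  # False
print(5 in labels)    # True
```

So the integer label itself was valid; it just never reached pandas as an integer.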
2.13. Rename the columns by brute force.
boston_df.rename(columns={0:"CRIM",1:"ZN",2:"INDUS",3:"CHAS",4:"NOX",5:"RM",6:"AGE",7:"DIS",8:"RAD",9:"TAX",10:"PTRATIO",11:"B",12:"LSTAT"},inplace=True)
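
The same mapping can be built without typing every pair by hand. This is a sketch, assuming the 13 feature names in the order listed in the dataset description above. The list feature_names is typed out here for illustration (in the notebook, boston.feature_names should hold the same names), and mapping is a name introduced for this example.

```python
# Feature names in the order given by the dataset description.
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

# enumerate pairs each name with its position: {0: "CRIM", 1: "ZN", ...}
mapping = dict(enumerate(feature_names))
print(mapping[5])  # RM

# In the notebook, this dictionary could then be passed to rename:
# boston_df.rename(columns=mapping, inplace=True)
```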
2.14. It worked, somehow. But is this really the right way to do it?
boston_df.head()

Out:

      CRIM  ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD  TAX  PTRATIO       B  LSTAT  Price
0  0.00632  18   2.31     0  0.538  6.575  65.2    4.09    1  296     15.3   396.9   4.98     24
1  0.02731   0   7.07     0  0.469  6.421  78.9  4.9671    2  242     17.8   396.9   9.14   21.6
2  0.02729   0   7.07     0  0.469  7.185  61.1  4.9671    2  242     17.8  392.83   4.03   34.7
3  0.03237   0   2.18     0  0.458  6.998  45.8  6.0622    3  222     18.7  394.63   2.94   33.4
4  0.06905   0   2.18     0  0.458  7.147  54.2  6.0622    3  222     18.7   396.9   5.33   36.2
2.15. Draw the regression line.
sns.lmplot(x='RM',y='Price',data=boston_df)
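
For reference, the straight line that lmplot overlays by default is an ordinary least-squares fit of Price on RM. The sketch below computes such a fit by hand, using only the five rows shown by boston_df.head() above, so the numbers it prints describe those five rows and not the full 506-row dataset. The helper name fit_line is made up for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of cross deviations / sum of squared x deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# RM and Price values from the first five rows shown earlier.
rm = [6.575, 6.421, 7.185, 6.998, 7.147]
price = [24.0, 21.6, 34.7, 33.4, 36.2]

slope, intercept = fit_line(rm, price)
print("slope:", slope, "intercept:", intercept)

# Predicted price (in $1,000) for a hypothetical 7-room dwelling.
print("prediction for RM=7:", slope * 7 + intercept)
```

A positive slope here matches what the scatter plot in step 2.6 suggests: more rooms tend to go with a higher price.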

Pythonではじめる機械学習 ―scikit-learnで学ぶ特徴量エンジニアリングと機械学習の基礎

References:

Udemy: 実践Pythonデータサイエンス (Practical Python Data Science)

www.udemy.com

線形回帰 (Linear regression) - Wikipedia

Pythonスタートブック
