rd-2. Histograms, Scatter Plots, Line Graphs, and Summary Statistics

Kunihiko Kaneko (金子邦彦)

Data Science Exercises (using the R system)

https://www.kkaneko.jp/de/rd/index.html

2-1 Installing Additional Packages

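Package setup (1/2)

• Install the required packages with the following steps.

• An internet connection is required to install packages.

• Run install.packages("ggplot2")

• Run install.packages("dplyr")

Package setup (2/2)

• Run install.packages("tidyr")

• Run install.packages("magrittr")

• Run install.packages("KernSmooth")

If a confirmation prompt appears during installation, answer Yes.

※ In "KernSmooth", the "K" and the "S" are uppercase.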
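Running install.packages() again for a package that is already present reinstalls it. The sketch below shows one possible way to install only what is missing. The helper names missing_packages and install_if_missing are hypothetical and are not part of the course material.

```r
# Return the names in `pkgs` that cannot be loaded (i.e. are not installed).
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install only the missing packages; returns their names invisibly.
install_if_missing <- function(pkgs) {
  todo <- missing_packages(pkgs)
  if (length(todo) > 0) {
    install.packages(todo)  # needs an internet connection
  }
  invisible(todo)
}
```

A typical call would be install_if_missing(c("ggplot2", "dplyr", "tidyr", "magrittr", "KernSmooth")).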

2-2 Constructors for R Objects

Constructor example

Example table. Columns: 年次 (year), 出生数 (births), 死亡数 (deaths).

年次    出生数   死亡数
1985    1432     752
1990    1222     820
1995    1187     922
2000    1191     962
2005    1063     1084
2010    1071     1197

Constructor that generates the table above:

x1 <- data.frame( 年次=c(1985, 1990, 1995, 2000, 2005, 2010),
                  出生数=c(1432, 1222, 1187, 1191, 1063, 1071),
                  死亡数=c(752, 820, 922, 962, 1084, 1197) )

(Screenshot: the constructor being run in the R console)
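After constructing a data frame it is usually worth checking its shape and column types. The following is a minimal sketch of such a check. It uses ASCII column names (year, births, deaths) instead of the Japanese names from the slide; that renaming is an assumption made here so the example runs regardless of locale.

```r
# Same values as the table above; the ASCII column names are an assumption
# of this sketch, not the names used in the slides.
x1 <- data.frame(
  year   = c(1985, 1990, 1995, 2000, 2005, 2010),
  births = c(1432, 1222, 1187, 1191, 1063, 1071),
  deaths = c(752, 820, 922, 962, 1084, 1197)
)

str(x1)     # column types and a preview of the values
dim(x1)     # number of rows and columns
names(x1)   # column names

# Derived column: births minus deaths
x1$change <- x1$births - x1$deaths
x1[x1$change < 0, ]   # rows (years) in which deaths exceeded births
```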

2-3 The iris Dataset

The genus Iris

• Perennial plants

• 150 species worldwide, 9 in Japan

• Each flower has 6 tepals

• Outer tepals (sepals): 3, large and drooping

• Inner tepals (petals): 3, upright

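The Iris dataset

• A dataset of the measured widths and lengths of the outer tepals (sepals) and inner tepals (petals) of three iris species:

Iris setosa

Iris versicolor

Iris virginica

• Number of records: 50 × 3

• Creator: Ronald Fisher

• Year created: 1936

The Iris dataset is built into the R system.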
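Because the dataset ships with R, the facts listed above can be checked directly in the console. A short sketch using only base R functions:

```r
data(iris)            # optional: iris is available without this call
str(iris)             # 150 observations of 5 variables
head(iris, 3)         # first three rows
table(iris$Species)   # number of rows per species
levels(iris$Species)  # the three species names
```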

2-4 Histogram Examples

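Histograms of the four iris attributes

Histogram of each attribute

Attributes: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width

Overlaying multiple histograms

library(dplyr)
library(tidyr)
library(magrittr)
library(KernSmooth)
library(ggplot2)

# as_tibble() replaces tbl_df(), which is deprecated in current dplyr
d2 <- as_tibble( iris )

# Reshape the four attributes to long form (key, value), then overlay the histograms.
# The bin width is computed from Sepal.Length and applied to all four attributes.
d2 %>% select( Sepal.Length, Sepal.Width, Petal.Length, Petal.Width ) %>%
  gather() %>%
  ggplot( aes(x=value, fill=key) ) +
  geom_histogram( binwidth=dpih( use_series(d2, Sepal.Length) ),
                  alpha=0.5, position="identity" ) +
  theme_bw()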
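The gather() call above turns the four attribute columns into two columns, key and value. In recent versions of tidyr, pivot_longer() is the recommended function for this reshaping step. The sketch below shows what an equivalent call might look like; the names key and value are chosen to match gather()'s defaults.

```r
library(tidyr)

# Long form: one row per (attribute, measurement) pair.
# "key" and "value" match the column names gather() produces by default.
long <- pivot_longer(iris[, 1:4], cols = everything(),
                     names_to = "key", values_to = "value")

head(long)
table(long$key)   # 150 measurements per attribute
```

The resulting table can be passed to ggplot( aes(x=value, fill=key) ) in the same way as the output of gather().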

Adjusting the histogram bin width

Bin width = 0.1:

library(ggplot2)
ggplot(iris, aes(x=Sepal.Length)) +
  geom_histogram(binwidth=0.1) +
  theme_bw()

Bin width chosen with the dpih function:

library(KernSmooth)
library(ggplot2)
ggplot(iris, aes(x=Sepal.Length)) +
  geom_histogram( binwidth=dpih( iris$Sepal.Length ) ) +
  theme_bw()
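dpih() returns a single number, the suggested bin width, so it can be inspected before plotting. The sketch below prints that width and the approximate number of bins it implies, and compares it with two bin-count rules that come with base R. The variable names bw and n_bins are illustrative only.

```r
library(KernSmooth)

x <- iris$Sepal.Length

bw <- dpih(x)                           # bin width suggested by dpih
n_bins <- ceiling(diff(range(x)) / bw)  # approximate number of bins it implies

bw
n_bins

# For comparison: bin counts from rules built into base R (grDevices)
nclass.Sturges(x)
nclass.FD(x)
```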

2-5 Scatter Plots and Line Graphs

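Variations of scatter plots and line graphs

Trends in births and deaths

年次 (year)    出生数 (births, thousands)    死亡数 (deaths, thousands)
1985           1432                          752
1990           1222                          820
1995           1187                          922
2000           1191                          962
2005           1063                          1084
2010           1071                          1197

Source: Ministry of Internal Affairs and Communications, Japan Statistical Yearbook, No. 63, 2014 (総務省「第63回日本統計年鑑平成26年」)

Variations shown on the following slides: scatter plot; scatter plot + line; scatter plot + linear fit.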

Scatter plot

x1 <- data.frame( 年次=c(1985, 1990, 1995, 2000, 2005, 2010),
                  出生数=c(1432, 1222, 1187, 1191, 1063, 1071),
                  死亡数=c(752, 820, 922, 962, 1084, 1197) )
library(ggplot2)
ggplot(x1, aes(x=年次)) +                                  # x axis (field name): 年次 (year)
  geom_point( aes(y=出生数, colour="出生数"), size=3 ) +    # y axis (field name): 出生数 (births); size = point size (numeric)
  geom_point( aes(y=死亡数, colour="死亡数"), size=3 ) +    # y axis (field name): 死亡数 (deaths)
  labs(x="年次", y="出生数, 死亡数") +                      # axis labels (strings)
  theme_bw()
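The scatter plot above adds one geom_point() call per column. A possible alternative, sketched here, reshapes the table to long form first so that a single geom_point() call draws both series and the legend follows from the data. The ASCII column names and the names series and count are assumptions made for this sketch.

```r
library(tidyr)
library(ggplot2)

# Same values as x1 in the slides, with ASCII column names (an assumption)
x1 <- data.frame(
  year   = c(1985, 1990, 1995, 2000, 2005, 2010),
  births = c(1432, 1222, 1187, 1191, 1063, 1071),
  deaths = c(752, 820, 922, 962, 1084, 1197)
)

# One row per (year, series) pair
long <- pivot_longer(x1, cols = c(births, deaths),
                     names_to = "series", values_to = "count")

p <- ggplot(long, aes(x = year, y = count, colour = series)) +
  geom_point(size = 3) +
  labs(x = "year", y = "births, deaths (thousands)") +
  theme_bw()
print(p)
```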

Scatter plot + line

x1 <- data.frame( 年次=c(1985, 1990, 1995, 2000, 2005, 2010),
                  出生数=c(1432, 1222, 1187, 1191, 1063, 1071),
                  死亡数=c(752, 820, 922, 962, 1084, 1197) )
library(ggplot2)
ggplot(x1, aes(x=年次)) +                                  # x axis (field name): 年次 (year)
  geom_point( aes(y=出生数, colour="出生数"), size=6 ) +    # y axis (field name): 出生数 (births); size = point size (numeric)
  geom_point( aes(y=死亡数, colour="死亡数"), size=6 ) +    # y axis (field name): 死亡数 (deaths)
  # geom_line connects the points; in ggplot2 3.4.0 and later, linewidth replaces size for lines
  geom_line( aes(y=出生数, colour="出生数"), linewidth=2 ) +
  geom_line( aes(y=死亡数, colour="死亡数"), linewidth=2 ) +
  labs(x="年次", y="出生数, 死亡数") +                      # axis labels (strings)
  theme_bw()

Scatter plot + linear fit

x1 <- data.frame( 年次=c(1985, 1990, 1995, 2000, 2005, 2010),
                  出生数=c(1432, 1222, 1187, 1191, 1063, 1071),
                  死亡数=c(752, 820, 922, 962, 1084, 1197) )
library(ggplot2)
ggplot(x1, aes(x=年次)) +                                  # x axis (field name): 年次 (year)
  geom_point( aes(y=出生数, colour="出生数"), size=6 ) +    # y axis (field name): 出生数 (births); size = point size (numeric)
  geom_point( aes(y=死亡数, colour="死亡数"), size=6 ) +    # y axis (field name): 死亡数 (deaths)
  # method="lm" draws a fitted straight line; se=FALSE hides the confidence band.
  # In ggplot2 3.4.0 and later, linewidth replaces size for lines.
  stat_smooth( method="lm", se=FALSE, aes(y=出生数, colour="出生数"),
               linewidth=2 ) +
  stat_smooth( method="lm", se=FALSE, aes(y=死亡数, colour="死亡数"),
               linewidth=2 ) +
  labs(x="年次", y="出生数, 死亡数") +                      # axis labels (strings)
  theme_bw()
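With method="lm", stat_smooth() draws a straight line fitted by linear regression of y on x. To see the numbers behind those lines, the same kind of fit can be computed directly with lm(). This is a minimal sketch with ASCII column names (an assumption of the example); it is meant to illustrate the idea rather than reproduce ggplot2's internals.

```r
# Same values as x1 in the slides, with ASCII column names (an assumption)
x1 <- data.frame(
  year   = c(1985, 1990, 1995, 2000, 2005, 2010),
  births = c(1432, 1222, 1187, 1191, 1063, 1071),
  deaths = c(752, 820, 922, 962, 1084, 1197)
)

fit_births <- lm(births ~ year, data = x1)
fit_deaths <- lm(deaths ~ year, data = x1)

coef(fit_births)   # intercept and slope (change per year, in thousands)
coef(fit_deaths)

# Fitted values at the observed years: points on the straight line
predict(fit_births)
```

A negative slope for births and a positive slope for deaths correspond to the falling and rising lines in the plot.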

2-6 Saving a Graph to a File

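Creating a png file

x1 <- data.frame( 年次=c(1985, 1990, 1995, 2000, 2005, 2010),
                  出生数=c(1432, 1222, 1187, 1191, 1063, 1071),
                  死亡数=c(752, 820, 922, 962, 1084, 1197) )
library(ggplot2)
png("f:/1.png")     # open a PNG device; drawing now goes to this file
p <- ggplot(x1, aes(x=年次)) +
  geom_point( aes(y=出生数, colour="出生数"), size=3 ) +
  labs(x="年次", y="出生数") +
  theme_bw()
print(p)            # explicit print() so the plot is also drawn when the code is run via source()
dev.off()           # close the device and finish writing the file

The plot is saved to the file f:/1.png.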
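ggplot2 also provides ggsave(), which opens and closes the graphics device itself and lets the output size be set explicitly. The sketch below writes to a temporary file so that it runs on any machine; the width, height, and dpi values and the ASCII column names are arbitrary choices for illustration.

```r
library(ggplot2)

# Same birth counts as in the slides, with ASCII column names (an assumption)
x1 <- data.frame(
  year   = c(1985, 1990, 1995, 2000, 2005, 2010),
  births = c(1432, 1222, 1187, 1191, 1063, 1071)
)

p <- ggplot(x1, aes(x = year, y = births)) +
  geom_point(size = 3) +
  theme_bw()

# tempfile() is used only so the sketch runs anywhere; replace it with your own path
out <- tempfile(fileext = ".png")
ggsave(out, plot = p, width = 6, height = 4, dpi = 150)  # width and height in inches

file.exists(out)
```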

2-7 Summary Statistics, Frequencies, and Histograms

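What we do here

Frequency of each field (counting)

    Count the rows for each distinct value

Summary statistics of each field

    Mean (mean), standard deviation (sd), variance (var),
    median (median), quartiles (quantile),
    maximum (max), minimum (min)

(Figure: original data and an example of its summary statistics)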

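Data used in this section

Grade data (constructor):

d1 <- data.frame(
  科目=c("国語", "国語", "算数", "算数", "理科"),   # subject: Japanese, Japanese, arithmetic, arithmetic, science
  受講者=c("A", "B", "A", "B", "A"),               # student
  得点=c(90, 80, 95, 90, 80) )                     # score

科目 (subject)    受講者 (student)    得点 (score)
国語              A                   90
国語              B                   80
算数              A                   95
算数              B                   90
理科              A                   80

iris dataset

※ The iris dataset is built into the R system.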

Summary statistics (using summary) ①

d1 <- data.frame(
  科目=c("国語", "国語", "算数", "算数", "理科"),
  受講者=c("A", "B", "A", "B", "A"),
  得点=c(90, 80, 95, 90, 80) )
summary(d1)

◆ For numeric attributes, summary reports the minimum, maximum, mean, median, and quartiles.

Input: the grade data d1.
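Each quantity that summary() reports for a numeric column can also be computed on its own, along with the variance and standard deviation, which summary() does not show. The sketch below does this for the five scores in the grade data; the vector name score is chosen here for illustration.

```r
score <- c(90, 80, 95, 90, 80)   # same values as the 得点 column of d1

mean(score)      # 87
median(score)    # 90
var(score)       # 45 (sample variance: divides by n - 1)
sd(score)        # square root of the variance
min(score)       # 80
max(score)       # 95
quantile(score)  # 0%, 25%, 50%, 75%, 100%: 80, 80, 90, 90, 95

summary(score)   # minimum, quartiles, median, mean, maximum in one call
```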

Plotting frequencies ①

d1 <- data.frame(
  科目=c("国語", "国語", "算数", "算数", "理科"),
  受講者=c("A", "B", "A", "B", "A"),
  得点=c(90, 80, 95, 90, 80) )
library(ggplot2)
ggplot(d1, aes( x=科目, fill=科目 )) +   # d1: variable name of the table to aggregate; 科目 (subject): field to count
  geom_bar(stat="count") +              # bar height = number of rows for each value
  labs(x="科目", y="総数") +             # axis labels (strings): subject, total count
  theme_bw()
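The bar heights drawn by geom_bar(stat="count") are the number of rows for each distinct value. The same counts can be obtained as numbers with table(). In this sketch, English labels stand in for the subject names 国語, 算数, and 理科; that substitution is an assumption made to keep the example locale-independent.

```r
# English labels stand in for 国語, 算数, 理科 (an assumption of this sketch)
subject <- c("Japanese", "Japanese", "Math", "Math", "Science")

counts <- table(subject)   # number of rows per subject
counts
as.data.frame(counts)      # the same counts as a two-column table

prop.table(counts)         # relative frequencies
```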

Plotting frequencies ②

d1 <- data.frame(
  科目=c("国語", "国語", "算数", "算数", "理科"),
  受講者=c("A", "B", "A", "B", "A"),
  得点=c(90, 80, 95, 90, 80) )
library(ggplot2)
ggplot(d1, aes( x=得点 )) +             # d1: variable name of the table to aggregate; 得点 (score): field to count
  geom_bar(stat="count") +              # bar height = number of rows for each score value
  labs(x="得点", y="総数") +             # axis labels (strings): score, total count
  theme_bw()
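For a numeric field, counting distinct values and counting values within intervals give different pictures; the latter is what a histogram shows. The sketch below illustrates both with base R. The break points 70, 80, 90, 100 are arbitrary and chosen only for illustration.

```r
score <- c(90, 80, 95, 90, 80)   # same values as the 得点 column of d1

table(score)   # one count per distinct value: 80, 90, 95

# Grouping into intervals gives histogram-style counts.
bins <- cut(score, breaks = c(70, 80, 90, 100))
table(bins)    # intervals are right-closed: (70,80], (80,90], (90,100]
```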

Summary statistics (using summary) ②

summary(iris)

Applied to the iris dataset.
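summary(iris) describes each column over all 150 rows. Since the dataset contains three species, a per-species summary is often more informative. The following sketch uses base R's aggregate() to compute the mean sepal length within each species; the variable name by_species is illustrative.

```r
summary(iris$Sepal.Length)   # one numeric column
summary(iris$Species)        # a factor: counts per level

# Mean of Sepal.Length within each species
by_species <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
by_species
```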