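June 2017 – かものはしの分析ブログ

Running Doc2vec on Western Pop Lyrics Data

Introduction

At work I sometimes need to compute similarity between articles, and at the moment I do this by building TF-IDF vectors and computing cosine similarity. I would like to move on to other methods, so in this post I try Doc2vec as an alternative way to compute similarity.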
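
As a point of reference for the baseline mentioned above, here is a minimal sketch of TF-IDF plus cosine similarity using only the standard library. The function names, the smoothed idf formula, and the three sample documents are assumptions made for illustration; this is not the code used at work.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per whitespace-tokenized document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf (an assumption) so common terms keep a small positive weight.
        vectors.append({
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Made-up sample documents, already lowercased and stripped of stop words.
docs = ["love love baby tonight", "baby love tonight party", "money city ride"]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # documents that share several words
sim_02 = cosine(vecs[0], vecs[2])  # documents that share no words
```

With this approach two documents that share no words always score zero, which is one reason to look at methods such as Doc2vec that learn dense vectors.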

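Data

I use the Western pop lyrics data I have been collecting for a while: lyrics for 961 songs that appeared in the Billboard rankings. Because the text is English, no morphological analysis is needed as it would be for Japanese, but I use a corpus that has already gone through preprocessing such as stop-word removal. For stop-word removal with R's tm package, see the post "Billboard100位以内の楽曲の歌詞情報にLDAを適用してみた" (applying LDA to the lyrics of songs in the Billboard top 100). An application of Doc2Vec to Japanese text is covered in the references.

Similarity computation

I use TaggedLineDocument to build an object that doc2vec can handle. The file passed to TaggedLineDocument is usually a txt file, and it has to satisfy conditions such as "one document per line" and "words separated by spaces". After that, all that is left is to run doc2vec. Tuning the parameters in detail is homework for another day.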

# Load libraries
from gensim import models
from gensim.models import doc2vec
import pandas as pd

# Load the data and write the lyrics column out as a txt file
billboard_list = pd.read_csv("lyrics_data_preprocessed.csv", delimiter=",", encoding='utf-8')
billboard_list.columns = ['title', 'artist', 'lyrics']
billboard_songs = billboard_list['lyrics']
billboard_songs.to_csv("billboard_data.txt", header=False, index=False)

# Convert to an object doc2vec can handle, treating each line as one document
sentences = doc2vec.TaggedLineDocument("billboard_data.txt")

# Run doc2vec (older gensim API; gensim 4+ renames `size` to `vector_size`)
model = models.Doc2Vec(sentences, dm=0, size=300, window=15, alpha=.025,
        min_alpha=.025, min_count=1, sample=1e-6)

# Start training, lowering the learning rate by hand after each epoch
print('\nStart Training')
for epoch in range(20):
    print('Epoch: {}'.format(epoch + 1))
    # gensim 2.0 and later require total_examples and epochs to be given explicitly
    model.train(sentences, total_examples=model.corpus_count, epochs=1)
    model.alpha -= (0.025 - 0.0001) / 19
    model.min_alpha = model.alpha
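
The input format described above (one document per line, tokens separated by spaces) can be sketched without gensim as follows. The helper names and the two sample lyrics are hypothetical. Writing the file directly is shown only as one option, since pandas' to_csv may wrap a field in quotes when it contains a comma or a newline.

```python
import os
import tempfile


def write_line_corpus(texts, path):
    """Write one document per line, with tokens separated by single spaces."""
    with open(path, "w", encoding="utf-8") as fout:
        for text in texts:
            # Collapse newlines, tabs and repeated spaces so that a document
            # never spills over onto a second line.
            fout.write(" ".join(text.split()) + "\n")


def read_line_corpus(path):
    """Yield (tokens, line_number) pairs: the line number acts as the document tag."""
    with open(path, encoding="utf-8") as fin:
        for line_no, line in enumerate(fin):
            yield line.split(), line_no


# Made-up sample lyrics containing a newline, a tab and a double space.
lyrics = ["just gonna stand\nwatch burn", "came  dance dance\tdance"]
corpus_path = os.path.join(tempfile.mkdtemp(), "billboard_data.txt")
write_line_corpus(lyrics, corpus_path)
documents = list(read_line_corpus(corpus_path))
```

Because the tag is simply the line number, the index used to look up a song later has to match the row order of the original data.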

Computing similarity

Let's pull out the songs most similar to a few songs I am curious about.

# Save the trained model and load it back
model.save('doc2vec.model')
model = models.Doc2Vec.load('doc2vec.model')

# Look up the index of the song of interest
billboard_list[billboard_list['title'].str.contains("Radioactive")]

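It turns out that a group called kings of leon also has a song titled Radioactive, but I am interested in the imagine dragons song, so I use index 409 and extract the songs whose lyrics are closest.

# Extract the song whose lyrics are closest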
mostsimilarlyrics = model.docvecs.most_similar(409)
billboard_list['title'][mostsimilarlyrics[0][0]]

'Made In America'

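toby keith's Made In America comes out as the closest song. The similarity is about 35%, but the two songs share hardly any words, so I am not convinced that they are really close.

Next, I look for songs close to lady gaga's Born This Way. A very nostalgic song, Dancing Queen, is selected, and the similarity is 49%, higher than before. The two share words such as queen, girl, can, and right, so this result seems closer than the previous one.

I was not sure the model had been estimated correctly, so I also looked at the nearest song to Just The Way You Are, whose top similarity was very high at 90%.

On checking, it turned out to be a cover of the same song. So lyrics that really are close do get recognized as close. Deciding where to put the threshold for "close" is a difficult call.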
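
To make the threshold question concrete, here is a minimal sketch of ranking documents by cosine similarity between their vectors and then keeping only those above a cutoff. The toy three-dimensional vectors, the function names, and the 0.9 cutoff are all assumptions for illustration; the real document vectors have 300 dimensions and come from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two dense vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(doc_vectors, query_index, topn=3):
    """Rank every other document by cosine similarity to the query document."""
    query = doc_vectors[query_index]
    scored = [
        (i, cosine_similarity(query, vec))
        for i, vec in enumerate(doc_vectors)
        if i != query_index
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


def filter_by_threshold(ranked, threshold):
    """Keep only the (index, similarity) pairs at or above the threshold."""
    return [(i, sim) for i, sim in ranked if sim >= threshold]


# Toy vectors standing in for learned document vectors.
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
]
ranked = most_similar(toy_vectors, 0)
confident = filter_by_threshold(ranked, 0.9)
```

A fixed cutoff like this is only a starting point; the 35%, 49% and 90% cases above suggest that a sensible value has to be chosen by inspecting actual pairs.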

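References

models.doc2vec – Deep learning with paragraph2vec
Doc2Vecの仕組みとgensimを使った文書類似度算出チュートリアル (a tutorial on how Doc2Vec works and on computing document similarity with gensim)
Pythonによるデータ分析入門 ―NumPy、pandasを使ったデータ処理 (an introduction to data analysis with Python: data processing with NumPy and pandas)

Applying LDA to the Lyrics of Songs in the Billboard Top 100 (Billboard100位以内の楽曲の歌詞情報にLDAを適用してみた)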

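Contents

・Introduction
・Data collection
・Analysis in R
・LDA results
・References

Introduction

In the previous post I collected Billboard's weekly ranking data by web scraping and looked at how chart positions move over what looks like a song's consumption cycle (see 某洋楽ヒットチャートの週次ランキングデータをBeautiful Soupで集めてみた, the post on collecting weekly chart data with Beautiful Soup). This time I join lyrics to the ranking data and run LDA to see which kinds of words tend to appear in songs that reach the Billboard Top 10.

Data collection

Unfortunately, the Billboard site does not carry lyrics. So I scrape a site that publishes Western pop lyrics with Python (3.x) and, with a fair amount of effort on name matching, join the ranking data to the lyrics data.

Fortunately the site's URLs follow a regular pattern, so I generate URLs from artist names, scrape those URLs to collect each artist's song list, and keep only the songs that appeared in the Billboard rankings.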
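
The two steps described above, building a URL from an artist name and keeping only the songs that are in the chart, can be sketched as follows. The URL pattern (first letter, then the artist name with spaces turned into "+") is only inferred from the URLs that appear in the results table later in this post, and the host name, helper names, and hrefs are hypothetical placeholders. Artist names with punctuation would likely need extra handling.

```python
import re
from urllib.parse import quote_plus

BASE_URL = "https://lyrics.example.com"  # hypothetical placeholder host


def artist_page_url(artist_name, base_url=BASE_URL):
    """Build <base>/<first letter>/<artist slug>/ from an artist name (assumed pattern)."""
    slug = quote_plus(artist_name.strip().lower())
    if not slug:
        raise ValueError("artist name is empty")
    return "{}/{}/{}/".format(base_url, slug[0], slug)


def normalize_title(title):
    """Lowercase and drop punctuation so chart and scraped titles can be compared."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


chart_titles = ["Love The Way You Lie", "Dynamite"]  # titles from the ranking data
scraped_songs = [  # (title, href) pairs as scraped; the hrefs are made up
    ("Love the Way You Lie", "/song-a"),
    ("Dynamite!", "/song-b"),
    ("Some Other Song", "/song-c"),
]
wanted = {normalize_title(title) for title in chart_titles}
matched = [(title, href) for title, href in scraped_songs
           if normalize_title(title) in wanted]
```

Exact matching on normalized titles is the simplest form of name matching; in practice some titles will still need manual fixes.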

# Fetch the list of songs for each artist

import csv
import time

import requests
from bs4 import BeautifulSoup

# Lists that collect the results
data01 = []  # URL
data02 = []  # song title
data03 = []  # link

with open('artist_url_list.csv', 'r') as f:
    for row in csv.reader(f):
        for url in row:
            time.sleep(10.0)  # wait 10 seconds between requests
            try:
                r = requests.get(url)
                r.raise_for_status()  # turn HTTP error statuses into exceptions
            except requests.exceptions.HTTPError as e:
                print(e.response.status_code)
                continue
            except requests.exceptions.RequestException:
                print("RequestError")
                continue

            soup = BeautifulSoup(r.content, 'html.parser')
            for body in soup.find_all("td", {'class': 'colfirst'}):
                for link in body.find_all("a"):
                    data01.append(url)
                    data02.append(link.get_text())
                    data03.append(link.get("href"))

            # Rewrite the CSV after every page so partial results survive a crash
            with open('artistpage_result.csv', 'wt', errors='backslashreplace') as fout:
                writecsv = csv.writer(fout, lineterminator='\n')
                writecsv.writerows(zip(data01, data02, data03))

Once the songs are narrowed down to those in the rankings, I fetch the lyrics detail pages.

# Extract only the lyrics portion of each lyrics detail page

import csv
import time

import requests
from bs4 import BeautifulSoup

# Lists that collect the results
data01 = []  # URL
data02 = []  # lyrics

with open('song_detail_url.csv', 'r') as f:
    for row in csv.reader(f):
        for url in row:
            time.sleep(10.0)  # wait 10 seconds between requests
            try:
                r = requests.get(url)
                r.raise_for_status()  # turn HTTP error statuses into exceptions
            except requests.exceptions.HTTPError as e:
                print(e.response.status_code)
                continue
            except requests.exceptions.RequestException:
                print("RequestError")
                continue

            soup = BeautifulSoup(r.content, 'html.parser')
            for body in soup.find_all("div", {'id': 'content_h'}):
                data01.append(url)
                data02.append(body.get_text())

            # Rewrite the CSV after every page so partial results survive a crash
            with open('lyrics_result.csv', 'wt', errors='backslashreplace') as fout:
                writecsv = csv.writer(fout, lineterminator='\n')
                writecsv.writerows(zip(data01, data02))

I managed to get the lyrics: roughly 947 songs.

	url	lyrics
0	http://www.lyricsfreak.com/e/eminem/love+the+w...	Just gonna stand there and watch me burnBut th...
1	http://www.lyricsfreak.com/t/taio+cruz/dynamit...	I came to dance-dance-dance-dance (Yeah)I hate...
2	http://www.lyricsfreak.com/t/taylor+swift/mine...	Oh, oh, ohOh, oh, ohYou were in college, worki...
3	http://www.lyricsfreak.com/e/enrique+iglesias/...	One life, one loveEnrique Iglesias, PitbullY'a...
4	http://www.lyricsfreak.com/b/bob/airplanes_208...	Can we pretend that airplanesIn the night sky ...
5	http://www.lyricsfreak.com/m/mike+posner/coole...	If I could write you a song,And make you fall ...
6	http://www.lyricsfreak.com/j/jason+derulo/ridi...	Yea yeah, yeah, yeah, yeah,I'm feeling like a ...
7	http://www.lyricsfreak.com/t/travie+mccoy/bill...	I wanna be a billionaire so freakin' badBuy al...
8	http://www.lyricsfreak.com/d/drake/find+your+l...	I'm more than just an option (hey, hey, hey) R...
9	http://www.lyricsfreak.com/u/usher/omg_2087748...	Oh my goshBaby let meDid it again, so Imma let...
10	http://www.lyricsfreak.com/b/bob/magic_2087969...	I got the magic in meEvery time I touch that t...
11	http://www.lyricsfreak.com/n/nicki+minaj/your+...	[Chorus]Shawty I'm a only tell you this once, ...
12	http://www.lyricsfreak.com/m/maroon+5/misery_2...	Oh yeahOh yeahSo scared of breaking itThat you...
13	http://www.lyricsfreak.com/t/train/hey+soul+si...	Hey, hey, heyYour lipstick stains on the front...
14	http://www.lyricsfreak.com/b/bruno+mars/just+t...	Oh her eyes, her eyesMake the stars look like ...
15	http://www.lyricsfreak.com/l/lady+gaga/alejand...	I know that we are young,And I know that you m...
16	http://www.lyricsfreak.com/l/la+roux/bulletpro...	Been there, done that, messed aroundI'm having...
17	http://www.lyricsfreak.com/f/flo+rida/club+can...	You know I know howTo make 'em stop and stare ...
18	http://www.lyricsfreak.com/s/shontelle/impossi...	I remember years agoSomeone told me I should t...
19	http://www.lyricsfreak.com/p/paramore/the+only...	When I was youngerI saw my daddy cryAnd curse ...
20	http://www.lyricsfreak.com/u/usher/there+goes+...	Yeah, Right,Usher baby, OKYeah man, rightThere...

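Analysis in R

From here I do the text mining in R. First I remove stop words with the tm package; concretely, words such as "the" and "you" are excluded.

# Load the lyrics data
lyrics_dataset <- read.csv(file = "lyrics_result.csv", as.is = TRUE, header = FALSE)
colnames(lyrics_dataset) <- c("link", "lyrics")

library(tm)

# NOTE: document_dataset is not defined in the code shown in this post. Judging from
# the link.y column used later, it is presumably lyrics_dataset joined to the ranking data.

# Convert the lyrics to lowercase
document_dataset$lyrics <- tolower(document_dataset$lyrics)

# Remove stop words
stopwords_regex <- paste(stopwords('en'), collapse = '\\b|\\b')
stopwords_regex <- paste0('\\b', stopwords_regex, '\\b')
document_dataset$lyrics <- stringr::str_replace_all(document_dataset$lyrics, stopwords_regex, '')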
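
For readers more at home in Python, the same word-boundary approach to stop-word removal can be sketched like this. The seven-word stop list is a made-up stand-in for the much longer list returned by tm's stopwords('en'), and the function name is hypothetical.

```python
import re

# Tiny illustrative stop-word list (an assumption, not tm's actual list).
STOP_WORDS = ["the", "you", "i", "and", "a", "to", "me"]

# \b on both sides keeps "the" from matching inside a longer word such as "there".
STOP_WORD_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in STOP_WORDS) + r")\b"
)


def remove_stop_words(text):
    """Lowercase the text, delete stop words, and collapse leftover whitespace."""
    stripped = STOP_WORD_PATTERN.sub("", text.lower())
    return " ".join(stripped.split())


cleaned = remove_stop_words("Just gonna stand there and watch me burn")
# A curly apostrophe counts as a word boundary, so only the "i" is removed here.
fragment = remove_stop_words("I’m young")
```

The second call leaves a fragment beginning with an apostrophe behind. Something similar may explain why tokens such as ’m and ’re show up together as a topic in the LDA results, although that has not been verified against the actual data.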

Next, I process the text data as follows so that it can be handled by the topicmodels package, which can run LDA.

# Build the document-term matrix

corpus <- Corpus(VectorSource(document_dataset$lyrics))
inspect(corpus)
dtm <- DocumentTermMatrix(corpus)
findFreqTerms(dtm)
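
As an illustration of what a document-term matrix holds, here is a minimal Python sketch that counts how often each vocabulary term occurs in each document and then lists the terms above a frequency cutoff. The function names and sample documents are hypothetical, and tm applies its own defaults (tokenization, minimum word length and so on) that this sketch does not reproduce.

```python
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    tokenized = [doc.split() for doc in docs]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    column = {term: j for j, term in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in tokenized]
    for i, tokens in enumerate(tokenized):
        for term, count in Counter(tokens).items():
            matrix[i][column[term]] = count
    return vocabulary, matrix


def frequent_terms(vocabulary, matrix, min_count):
    """List the terms whose total count over all documents is at least min_count."""
    totals = [sum(row[j] for row in matrix) for j in range(len(vocabulary))]
    return [term for term, total in zip(vocabulary, totals) if total >= min_count]


# Made-up documents after lowercasing and stop-word removal.
docs = ["love love baby", "baby tonight", "party tonight tonight"]
vocab, dtm = document_term_matrix(docs)
common = frequent_terms(vocab, dtm, min_count=2)
```

Each row of the matrix is one song and each column one word, which is the shape an LDA implementation expects as input.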

All that remains is to run LDA with the code below. The number of topics is set ad hoc to 20. Apologies to any researchers for the sloppiness.

# Run LDA

library(topicmodels)

nbo_topics <- 20
lda <- LDA(dtm,control=list(verbose=1), k = nbo_topics,method = "Gibbs")

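LDA results

First, let's look at the top 10 words of each estimated topic. Topic 1 looks like love songs. Topic 17 contains party-crowd words, and Topic 18 contains slang.

# Check the top 10 words of each topic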
terms_each_topics <- data.frame(terms(lda,10))

> terms_each_topics
   Topic.1 Topic.2 Topic.3 Topic.4 Topic.5 Topic.6 Topic.7 Topic.8 Topic.9 Topic.10 Topic.11   Topic.12 Topic.13 Topic.14 Topic.15 Topic.16 Topic.17 Topic.18 Topic.19 Topic.20
1     love    know    feel  better   wanna     hey     one  chorus    back     yeah    gonna      never     like      got      ’m     keep     stop      ain      let      get
2     like     now   heart   world    want    said     ooh     way    time     baby      low       will      new     like   don’t     even     just     like      can     good
3     make    just    life    whoa    take     old    call   verse    know     girl     hear      still     city     back     ’re  getting    hands     shit      say    night
4    touch    need    away     run    rock   every   cause     can    come   little     mean       ever     high    right    young      one      put     fuck      believe      ain
5     know   think    just   light     see     woo  gettin     pre    long     like    sound       eyes     bout     wish      ’s     give    party      got      fly     kind
6   nobody   cause   break  things    kiss    left    born    tell    like     just    shaky     always     ride     hold     ’ll     lose    crazy     hook      fall   really
7     baby    much    real    find    come    told     day     got    best    right     just everything     yeah     boom      que   please     live      gon      made    sleep
8    cause    give  enough    show    make nothing   makes    like    home      get     solo      leave      fun     know   ain’t   youand   lights    nigga      first champion
9     name    mind   every     see    body   daddy    came    used     til      can  tonight       hope     know      one     para     just     play   niggas      words    catch
10  loving  really    find waiting tonight   sweet   stand    made alright     look   wicked       stay      get     come   can’t  without      see    money      lonely      til

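This is hard to read, so I join each topic's words into a single row and assign an id to each topic.

# Join the top 10 words of each topic and assign an id to each topic
library(dplyr)  # provides %>% and mutate()
topic_keywords <- data.frame(topic_keywords_10=apply(t(terms_each_topics),1,paste,collapse=","))
topic_keywords <- topic_keywords %>% mutate(topic_id=1:n())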

> topic_keywords
                                              topic_keywords_10 topic_id
1       love,like,make,touch,know,nobody,baby,cause,name,loving        1
2          know,now,just,need,think,cause,much,give,mind,really        2
3        feel,heart,life,away,just,break,real,enough,every,find        3
4      better,world,whoa,run,light,things,find,show,see,waiting        4
5          wanna,want,take,rock,see,kiss,come,make,body,tonight        5
6          hey,said,old,every,woo,left,told,nothing,daddy,sweet        6
7           one,ooh,call,cause,gettin,born,day,makes,came,stand        7
8              chorus,way,verse,can,pre,tell,got,like,used,made        8
9           back,time,know,come,long,like,best,home,til,alright        9
10           yeah,baby,girl,little,like,just,right,get,can,look       10
11     gonna,low,hear,mean,sound,shaky,just,solo,tonight,wicked       11
12 never,will,still,ever,eyes,always,everything,leave,hope,stay       12
13               like,new,city,high,bout,ride,yeah,fun,know,get       13
14             got,like,back,right,wish,hold,boom,know,one,come       14
15        ’m,don’t,’re,young,’s,’ll,que,ain’t,para,can’t       15
16   keep,even,getting,one,give,lose,please,youand,just,without       16
17         stop,just,hands,put,party,crazy,live,lights,play,see       17
18           ain,like,shit,fuck,got,hook,gon,nigga,niggas,money       18
19         let,can,say,believe,fly,fall,made,first,words,lonely       19
20      get,good,night,ain,kind,really,sleep,champion,catch,til       20

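Finally, I prepare data on whether each song ever reached the Billboard Top 10, join it to the lyrics, and aggregate by the topic that has the highest assigned probability for each song.

# NOTE: merge_dataset (the ranking data joined to the lyrics links) is not defined
# in the code shown in this post.

# Create a dummy for whether the song has ever been in the top 10
top_10_songs <- merge_dataset %>% group_by(link.y) %>% summarise(top_10=sum(top_10))
top_10_songs <- top_10_songs %>% mutate(top_10_dummy=ifelse(top_10>0,1,0))

# Extract the highest-probability topic for each document and attach it to the lyrics data
topics_each_document <- data.frame(topic_id=topics(lda,1))
topics_each_document <- cbind(link.y=document_dataset$link.y,topics_each_document)

# Join the top-10 dummy, then the top-10-word label of each topic
topics_each_document <- topics_each_document %>% left_join(top_10_songs,by="link.y")
topics_each_document <- topics_each_document %>% left_join(topic_keywords,by="topic_id")

# Compute the share of Billboard Top 10 songs for each topic (kept for the plot later)
topics_and_top10 <- topics_each_document %>% group_by(topic_keywords_10) %>% summarise(top_10_dummy=mean(top_10_dummy),count=n())
topics_and_top10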

# A tibble: 20 × 3
                                              topic_keywords_10 `mean(top_10_dummy)` count
                                                         <fctr>                <dbl> <int>
1         ’m,don’t,’re,young,’s,’ll,que,ain’t,para,can’t           0.12195122    41
2            ain,like,shit,fuck,got,hook,gon,nigga,niggas,money           0.14705882    68
3           back,time,know,come,long,like,best,home,til,alright           0.21666667    60
4      better,world,whoa,run,light,things,find,show,see,waiting           0.25000000    40
5              chorus,way,verse,can,pre,tell,got,like,used,made           0.14814815    54
6        feel,heart,life,away,just,break,real,enough,every,find           0.10416667    48
7       get,good,night,ain,kind,really,sleep,champion,catch,til           0.22857143    35
8      gonna,low,hear,mean,sound,shaky,just,solo,tonight,wicked           0.22857143    35
9              got,like,back,right,wish,hold,boom,know,one,come           0.13888889    36
10         hey,said,old,every,woo,left,told,nothing,daddy,sweet           0.09756098    41
11   keep,even,getting,one,give,lose,please,youand,just,without           0.18918919    37
12         know,now,just,need,think,cause,much,give,mind,really           0.21333333    75
13         let,can,say,believe,fly,fall,made,first,words,lonely           0.15217391    46
14               like,new,city,high,bout,ride,yeah,fun,know,get           0.14705882    34
15      love,like,make,touch,know,nobody,baby,cause,name,loving           0.17021277    47
16 never,will,still,ever,eyes,always,everything,leave,hope,stay           0.22222222    72
17          one,ooh,call,cause,gettin,born,day,makes,came,stand           0.28205128    39
18         stop,just,hands,put,party,crazy,live,lights,play,see           0.25000000    48
19         wanna,want,take,rock,see,kiss,come,make,body,tonight           0.10810811    37
20           yeah,baby,girl,little,like,just,right,get,can,look           0.14814815    54
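
The aggregation behind the table above, assigning each song to its most probable topic and then averaging the Top 10 dummy per topic, can be sketched in plain Python as follows. The function name, the topic probabilities, and the flags are made-up illustration data, not values from the fitted model.

```python
from collections import defaultdict


def top10_rate_by_topic(doc_topic_probs, top10_flags):
    """Return {topic_id: (share of Top 10 songs, number of songs)}.

    doc_topic_probs holds one list of topic probabilities per song;
    top10_flags holds 1 if the song ever reached the Top 10, otherwise 0.
    """
    stats = defaultdict(lambda: [0, 0])  # topic_id -> [Top 10 count, song count]
    for probs, flag in zip(doc_topic_probs, top10_flags):
        # Most probable topic, numbered from 1 to match the R output.
        topic_id = max(range(len(probs)), key=probs.__getitem__) + 1
        stats[topic_id][0] += flag
        stats[topic_id][1] += 1
    return {topic: (hits / n, n) for topic, (hits, n) in sorted(stats.items())}


# Four made-up songs and three topics.
probs = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
]
flags = [1, 0, 0, 1]
rates = top10_rate_by_topic(probs, flags)
```

Keeping the song count next to each rate matters here: with only 34 to 75 songs per topic, differences of a few percentage points are easily within noise.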

# Draw a bar chart of the Top 10 rate per topic, sorted from highest to lowest

library(ggplot2)
ggplot(topics_and_top10, aes(x=reorder(topic_keywords_10,top_10_dummy), y=top_10_dummy)) +
  geom_bar(stat='identity') + coord_flip() + xlab("topics")

Top 3 topics with the highest Billboard Top 10 rate
"one,ooh,call,cause,gettin,born,day,makes,came,stand"
"better,world,whoa,run,light,things,find,show,see,waiting" ... an upbeat mood?
"stop,just,hands,put,party,crazy,live,lights,play,see" ... party-crowd feel

Top 3 topics with the lowest Billboard Top 10 rate
"wanna,want,take,rock,see,kiss,come,make,body,tonight" ... desire-themed?
"feel,heart,life,away,just,break,real,enough,every,find" ... soothing?
"hey,said,old,every,woo,left,told,nothing,daddy,sweet"

I do not listen to much Western pop, so it is frustrating that I cannot really interpret the topics I obtained. Still, songs whose lyrics contain slang do not seem to have a particularly low Top 10 rate. I would like to listen to more Western pop, refine the preprocessing, and try again.

References

トピックモデルによる統計的潜在意味解析 (自然言語処理シリーズ)

Pythonクローリング&スクレイピング -データ収集・解析のための実践開発ガイド-

モダンなRによるテキスト解析

topicmodels: An R Package for Fitting Topic Models

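Collecting Weekly Ranking Data from a Western Pop Hit Chart with Beautiful Soup (某洋楽ヒットチャートの週次ランキングデータをBeautiful Soupで集めてみた)

Introduction

An acquaintance told me they were out of touch with what is popular in Western pop and wanted to catch up, which led me to collect a large amount of weekly ranking and Top 100 data from a certain Western pop hit chart. I have not done any deep analysis this time; I simply aggregate and visualize the data in R.

Data collection

The weekly ranking pages I scrape list this week's position, last week's position, the artist name, the song title, a link to the detail page and so on, and they are updated every Saturday. The site itself has no navigation to past weeks, but the URL parameters follow a pattern, so the data can be collected without much trouble. This time I collect about seven years of data, from August 2010 to June 2017.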
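
Since the chart is updated every Saturday, the list of weekly dates to request can be generated rather than typed by hand. The sketch below assumes the first chart date is 2010/08/21 (the first date that appears in the collected data) and uses the end of June 2017 as a hypothetical cutoff. How the date is embedded in the URL is not disclosed in this post, so only the date strings are produced.

```python
from datetime import date, timedelta


def weekly_dates(start, end):
    """Yield dates one week apart, from start up to and including end."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=7)


# 2010-08-21 is a Saturday; the end date is an assumed cutoff.
chart_dates = list(weekly_dates(date(2010, 8, 21), date(2017, 6, 30)))
date_params = [d.strftime("%Y/%m/%d") for d in chart_dates]
```

Each string in date_params could then be substituted into whatever URL template the site uses and written out as the URL list.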

I read the list of URLs from a CSV file and extract the data with BeautifulSoup by specifying tags.

# Load libraries
import csv
import time

import requests
from bs4 import BeautifulSoup

# Lists that collect the results
data01 = []  # URL
data02 = []  # ranking
data03 = []  # title & artist
data04 = []  # link
data05 = []  # artist

# Read the list of URLs
with open('url_list.csv', 'r') as f:
    for row in csv.reader(f):
        for url in row:
            time.sleep(10.0)  # wait 10 seconds between requests
            try:
                r = requests.get(url)
                r.raise_for_status()  # turn HTTP error statuses into exceptions
            except requests.exceptions.HTTPError as e:
                print(e.response.status_code)
                continue

            soup = BeautifulSoup(r.content, 'html.parser')
            for body in soup.find_all("tbody"):
                for detail in body.find_all("tr"):
                    for ranking in detail.find_all("td", {'class': 'rank_td'}):
                        for content in detail.find_all("div", {'class': 'name_detail'}):
                            for artist_name in content.find_all("span"):
                                for link in artist_name.find_all("a"):
                                    data01.append(url)
                                    data02.append(ranking.get_text())
                                    data03.append(content.get_text())
                                    data04.append(link.get("href"))
                                    data05.append(artist_name.get_text())

            # Rewrite the CSV after every page so partial results survive a crash
            with open('result.csv', 'wt', errors='backslashreplace') as fout:
                writecsv = csv.writer(fout, lineterminator='\n')
                writecsv.writerows(zip(data01, data02, data03, data04, data05))

After fetching the data, I tidy it up quickly with pandas' str.replace. At this point this week's position and last week's position are still stuck together in a single column.

# Tidy the scraped data with pandas
import pandas as pd

# Load the data
data_list = pd.read_csv("result.csv", header=None, delimiter=",", encoding='utf-8')
data_list.columns = ['url', 'ranking', 'title&artist', 'link', 'artist']

# Replace strings ('before' and 'after' are placeholders)
data_list['url'] = data_list['url'].str.replace('before', 'after')
data_list['title&artist'] = data_list['title&artist'].str.replace('before', 'after')
data_list['ranking'] = data_list['ranking'].str.replace('before', 'after')

# Save as CSV
data_list.to_csv("dataset.csv", index=False)

From here I took the lazy route and did the rest of the cleanup in R to build the weekly ranking data.

# Split the fused ranking string and add the two ranks to the data table.
library(dplyr)

dataset <- strsplit(ranking_dataset$ranking, "from")
rankings <- data.frame(current_ranking=as.integer(),
                       previous_ranking=as.integer())
for( i in 1:nrow(ranking_dataset)){
  rankings[i,1] <- as.integer(dataset[[i]][1])
  rankings[i,2] <- as.integer(dataset[[i]][2])
}

ranking_dataset <- cbind(ranking_dataset,rankings)

ranking_dataset <- ranking_dataset %>% select(url,current_ranking,previous_ranking,title.artist,artist,link)
colnames(ranking_dataset) <- c("date","current_ranking","previous_ranking","title",
                               "artist","link")

Checking the data

The data structure looks like this.

# Check the data structure
> str(ranking_dataset)
'data.frame':	22254 obs. of  6 variables:
 $ date            : chr  "2010/08/21" "2010/08/21" "2010/08/21" "2010/08/21" ...
 $ current_ranking : int  1 2 3 4 5 6 7 8 10 11 ...
 $ previous_ranking: int  1 3 NA 2 5 4 6 9 11 8 ...
 $ title           : chr  "Love The Way You Lie" "Dynamite" "Mine" "California Gurls" ...
 $ artist          : chr  "Eminem Featuring Rihanna" "Taio Cruz" "Taylor Swift" "Katy Perry Featuring Snoop Dogg" ...
 $ link            : chr  "/artists/detail/306903" "/artists/detail/464754" "/artists/detail/319433" "/artists/detail/450934" ...

First, a quick look at which songs and artists appear in the rankings. I have barely heard of most of these artists or songs, though.

# Top 20 songs by number of chart appearances
> ranking_dataset %>% count(title) %>% arrange(desc(n)) %>% print(n=20)
# A tibble: 1,785 × 2
                          title     n
                          <chr> <int>
1                   Radioactive    91
2                          Sail    78
3             Party Rock Anthem    68
4                Counting Stars    67
5                       Animals    66
6           Rolling In The Deep    65
7                        Ho Hey    62
8                         Sorry    62
9  Somebody That I Used To Know    61
10                       Demons    60
11                    All Of Me    58
12                       Lights    57
13 Dark Horse Featuring Juicy J    56
14                  Some Nights    56
15                 Stay With Me    55
16                        Hello    54
17      Can't Stop The Feeling!    52
18     Cheap Thrills Featuring     52
19            Don't Let Me Down    52
20                      Pompeii    52
# ... with 1,765 more rows

# Top 20 artists by number of chart appearances
> ranking_dataset %>% count(artist) %>% arrange(desc(n)) %>% print(n=20)
# A tibble: 876 × 2
             artist     n
              <chr> <int>
1             Drake   350
2        Bruno Mars   325
3           Rihanna   301
4      Taylor Swift   299
5             Adele   294
6   Imagine Dragons   258
7     One Direction   254
8        Katy Perry   237
9     Justin Bieber   204
10      Chris Brown   195
11 Carrie Underwood   192
12  Lady Antebellum   184
13      Nicki Minaj   183
14      OneRepublic   183
15   Ellie Goulding   182
16         Maroon 5   182
17       The Weeknd   182
18       Ed Sheeran   169
19    Blake Shelton   167
20          Beyonce   152
# ... with 856 more rows
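
For reference, the same "count, then sort descending" step can be sketched in Python with the standard library. This is an illustration only: the rows below are made up for the example (the real table has 22,254 rows), and `top_counts` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical (title, artist) rows standing in for the real ranking table.
rows = [
    ("Radioactive", "Imagine Dragons"),
    ("Demons", "Imagine Dragons"),
    ("Radioactive", "Imagine Dragons"),
    ("Sail", "AWOLNATION"),
]


def top_counts(values, n=20):
    """Count occurrences and return the n most frequent as (value, count) pairs."""
    return Counter(values).most_common(n)


title_counts = top_counts(title for title, _ in rows)
artist_counts = top_counts(artist for _, artist in rows)

print(title_counts)
print(artist_counts)
```

`Counter.most_common` returns the pairs sorted by count in descending order, which corresponds to `count() %>% arrange(desc(n))` in the R output above.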

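Next, a histogram of how many times each song appeared in the top 100 between August 2010 and June 2017. I expected something like a power-law distribution, but there is a curious bump at around 20 appearances.

# Histogram of the number of chart appearances per title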
title_ranking <- ranking_dataset %>% count(title) %>% arrange(desc(n))
ggplot(data = title_ranking,aes(x = n)) + geom_histogram() + xlab("the number of Appearance of Title")

# Summary of chart appearances per song
> summary(title_ranking$n)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    1.00    9.00   12.47   20.00   91.00

The median is 9 weeks, so songs stay in the top 100 for longer than I expected.
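
To make explicit how such a summary is computed, here is a small Python sketch. The counts are hypothetical and `five_number_summary` is an illustrative name. As far as I know, R's `summary()` uses the default quantile definition (type 7), which corresponds to `method="inclusive"` in Python's `statistics.quantiles`.

```python
import statistics


def five_number_summary(counts):
    """Min, quartiles, mean and max, using linear interpolation between order statistics."""
    q1, median, q3 = statistics.quantiles(counts, n=4, method="inclusive")
    return {
        "min": min(counts),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(counts),
        "q3": q3,
        "max": max(counts),
    }


# Hypothetical appearance counts for five songs.
summary = five_number_summary([1, 1, 9, 20, 91])
print(summary)
```

A mean well above the median, as in the real data (12.47 versus 9), is what a long right tail produces.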

Next, a histogram of top-100 appearances per artist. This one does look like a power-law distribution.

# Histogram of the number of chart appearances per artist
artist_ranking <- ranking_dataset %>% count(artist) %>% arrange(desc(n))
ggplot(data = artist_ranking,aes(x = n)) + geom_histogram() + xlab("the number of Appearance of Artist")

What the top-10 songs look like

Restricting the data to songs that reached the top 10, I draw the histogram again.

# Histogram restricted to top-10 entries
top_10 <- ranking_dataset %>% filter(current_ranking<=10) %>% count(title) %>% arrange(desc(n))
ggplot(data = top_10,aes(x = n)) + geom_histogram() + xlab("the number of Appearance of Title in Top10")

# Summary of the number of weeks spent in the top 10, for songs that reached the top 10
> summary(top_10$n)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   3.000  10.000   9.779  15.000  32.000

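Once a song reaches the top 10, it tends to stay there for close to 10 weeks (median 10, mean 9.8). I had assumed the top spots would turn over quickly, but perhaps popularity breeds popularity. I seem to recall someone from a certain e-commerce site saying that nama caramel sold because it was already selling.

Ranking trajectories

For the 10 songs with the most top-100 appearances, I plot the rankings as a time series.

# Time-series plot for the 10 songs with the most top-100 appearances
title_ranking_top10 <- title_ranking[1:10,]
title_ranking_top10 <- ranking_dataset %>% filter(title %in% title_ranking_top10$title)
title_ranking_top10$date <- as.Date(title_ranking_top10$date)
ggplot(data = title_ranking_top10,aes(x = date,
                                      y = current_ranking,
                                      colour=title)) + geom_line() +
  ylab("current ranking") +
  ggtitle("Time Series Plot of Top10")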

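Some songs debut near the top and only fall from there, while others climb the chart gradually. There seems to be something like a consumption cycle for songs.

What's next

Having got hold of such interesting data, it would be nice to follow the links as well and look at which characteristics of songs and artists might lead to popularity. Also, I think I'll listen to a bit more Western pop; my Walkman only has classical music and jazz on it.

References

Pythonクローリング&スクレイピング -データ収集・解析のための実践開発ガイド-

Trying covariance structure analysis on review-site data for Ramen Jiro

Data

The data were scraped from a certain restaurant review site and cover Ramen Jiro shops. Each shop has a real-valued score from 1 to 5 on the rating items "food/taste", "service", "atmosphere", "CP" (cost performance), and "alcohol/drinks". Of the 40 Ramen Jiro shops, the data cover the 37 with no missing values.

Descriptive statistics and histograms

First, let's look at the descriptive statistics and histograms.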

> describe(jirou_dataset[,-1])
                 vars  n mean   sd median trimmed  mad  min  max range  skew kurtosis   se
taste               1 37 3.58 0.09   3.59    3.59 0.01 3.10 3.73  0.63 -3.77    17.97 0.02
service             2 37 3.51 0.11   3.55    3.53 0.04 3.10 3.59  0.49 -2.64     6.41 0.02
atmosphere          3 37 3.47 0.16   3.53    3.50 0.03 3.00 3.59  0.59 -2.11     3.06 0.03
cost_performance    4 37 3.61 0.12   3.59    3.61 0.01 3.07 3.95  0.88 -1.46     9.16 0.02
drink               5 37 3.08 0.09   3.07    3.07 0.10 2.99 3.33  0.34  1.00     0.39 0.01

# Draw a histogram for each rating item
hist_jirou_dataset <- melt(jirou_dataset)
g <- ggplot(data = hist_jirou_dataset,
            aes(x = value,
                y = ..density..)) +
  geom_histogram(alpha = 0.5,position = "identity") +
  geom_density(alpha = 0)
g + facet_wrap(~variable,nrow=5)

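Well, Jiro is not a place for drinking, so the low drink score makes sense. Still, I'm surprised that everything sits in the 3-point range; I went once long ago, and the atmosphere was certainly not great.

Covariance structure analysis

The hypothesis I want to test is whether quality as a ramen shop leads to "Jiro love".

Rating items likely to reflect quality as a ramen shop
"food/taste"
"CP"
"alcohol/drinks"

Rating items likely to reflect Jiro love
"food/taste"
"service"
"atmosphere"

In addition, food/taste and CP seem likely to be related, so I account for that in the paths as well.

The figure below sketches the hypothesis.

For the finer points of specifying the paths, see the reference.
Now let's build the model and run it.

# Specify the model for the lavaan package.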
library(lavaan)

model_jirou <- 'f1 =~ cost_performance + drink + taste
                f2 =~ taste + atmosphere + service
                f2 ~ f1
                cost_performance ~~ taste'

fit <- sem(model = model_jirou,
           data = jirou_dataset[,-1],
           estimator="MLM")

summary(fit)

Here are the estimation results.

> summary(fit)
lavaan (0.5-23.1097) converged normally after 100 iterations

  Number of observations                            37

  Estimator                                         ML      Robust
  Minimum Function Test Statistic                2.057       1.965
  Degrees of freedom                                 2           2
  P-value (Chi-square)                           0.358       0.374
  Scaling correction factor                                  1.047
    for the Satorra-Bentler correction

Parameter Estimates:

  Information                                 Expected
  Standard Errors                           Robust.sem

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)
  f1 =~                                               
    cost_performnc    1.000                           
    drink            -0.153    0.134   -1.139    0.255
    taste             0.381    0.194    1.958    0.050
  f2 =~                                               
    taste             1.000                           
    atmosphere        4.008    1.972    2.032    0.042
    service           3.147    1.423    2.211    0.027

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  f2 ~                                                
    f1                0.095    0.130    0.731    0.465

Covariances:
                      Estimate  Std.Err  z-value  P(>|z|)
 .cost_performance ~~                                    
   .taste               -0.001    0.007   -0.115    0.908

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)
   .cost_performnc    3.606    0.020  179.760    0.000
   .drink             3.081    0.014  214.293    0.000
   .taste             3.582    0.015  237.571    0.000
   .atmosphere        3.471    0.027  129.316    0.000
   .service           3.509    0.019  187.760    0.000
    f1                0.000                           
   .f2                0.000                           

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .cost_performnc   -0.007    0.017   -0.390    0.696
   .drink             0.007    0.002    3.170    0.002
   .taste             0.002    0.003    0.712    0.477
   .atmosphere        0.005    0.004    1.273    0.203
   .service          -0.000    0.001   -0.329    0.742
    f1                0.021    0.017    1.242    0.214
   .f2                0.001    0.001    1.282    0.200

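The output says "lavaan (0.5-23.1097) converged normally after 100 iterations", so the optimizer did converge. Note, however, that the residual variances of cost_performance and service are estimated as slightly negative, so the solution deserves some caution. With only 2 degrees of freedom and just 37 observations, this is a fairly marginal estimation. Still, the number of observations does at least exceed the number of parameters to estimate. It might be worth collecting data for Jiro-inspired shops as well.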
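
The 2 degrees of freedom can be reproduced by hand: count the observed means, variances and covariances, then subtract the number of freely estimated parameters. The sketch below does this in Python. The function name is hypothetical, and the parameter counts are read off the summary output above (every row that has a standard error is a free parameter).

```python
def sem_degrees_of_freedom(n_observed, n_free_params, meanstructure=True):
    """Observed moments minus free parameters."""
    moments = n_observed * (n_observed + 1) // 2  # variances and covariances
    if meanstructure:
        moments += n_observed  # observed means
    return moments - n_free_params


free_params = {
    "loadings": 4,              # drink, taste on f1; atmosphere, service on f2
    "regressions": 1,           # f2 ~ f1
    "residual_covariances": 1,  # cost_performance ~~ taste
    "intercepts": 5,            # one per observed variable
    "variances": 7,             # 5 residual variances + f1 + f2
}

df = sem_degrees_of_freedom(5, sum(free_params.values()))
print(df)  # 2
```

The first loading of each factor is fixed to 1 and the latent intercepts are fixed to 0, so they do not count as free parameters.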

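Model evaluation

I evaluate the model in terms of goodness of fit and parameter estimates.

Goodness of fit
・The CFI (Comparative Fit Index) is 1, so the fit looks good.
・The TLI (Tucker-Lewis Index), another fit index, is 0.998, so the fit looks good.
・The RMSEA, for which 0.05 or below is considered a good fit, is 0.028, so the fit looks good.
・The SRMR, for which values closer to 0 are better, is 0.024.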
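
As a sanity check, the RMSEA can be recomputed from the chi-square statistic, the degrees of freedom and the sample size reported in the summary. The sketch below uses one common definition; some texts divide by N - 1 instead of N, and `rmsea` here is just an illustrative helper, not a lavaan function.

```python
import math


def rmsea(chi_square, df, n_obs):
    """Root mean square error of approximation, truncated at zero."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * n_obs))


# ML test statistic 2.057 with 2 degrees of freedom and 37 observations.
print(round(rmsea(2.057, 2, 37), 3))  # 0.028
```

With the robust statistic (1.965), which is below the degrees of freedom, the same formula gives 0, so the reported 0.028 appears to be based on the ML statistic.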

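Parameter estimates

f2 ~ f1 represents the hypothesised relationship between quality as a ramen shop and Jiro love, but its p-value is 0.47, which is nowhere near significant. The data do not support the hypothesis that quality as a ramen shop leads to Jiro love.

As the reference suggests, I compute the standardized estimates to make the coefficients easier to interpret.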

> standardizedsolution(fit)
                lhs op              rhs est.std     se      z pvalue
1                f1 =~ cost_performance   1.208  0.524  2.303  0.021
2                f1 =~            drink  -0.257  0.178 -1.448  0.148
3                f1 =~            taste   0.612  0.325  1.883  0.060
4                f2 =~            taste   0.400  0.120  3.327  0.001
5                f2 =~       atmosphere   0.901  0.066 13.723  0.000
6                f2 =~          service   1.016  0.049 20.694  0.000
7                f2  ~               f1   0.382  0.338  1.132  0.258
8  cost_performance ~~            taste  -0.201  1.651 -0.122  0.903
9  cost_performance ~~ cost_performance  -0.459  1.267 -0.362  0.717
10            drink ~~            drink   0.934  0.091 10.226  0.000
11            taste ~~            taste   0.278  0.399  0.698  0.485
12       atmosphere ~~       atmosphere   0.187  0.118  1.582  0.114
13          service ~~          service  -0.033  0.100 -0.333  0.740
14               f1 ~~               f1   1.000  0.000     NA     NA
15               f2 ~~               f2   0.854  0.258  3.309  0.001
16 cost_performance ~1                   29.960  8.666  3.457  0.001
17            drink ~1                   35.715  4.676  7.638  0.000
18            taste ~1                   39.595 15.318  2.585  0.010
19       atmosphere ~1                   21.553  4.333  4.974  0.000
20          service ~1                   31.293  7.949  3.937  0.000
21               f1 ~1                    0.000  0.000     NA     NA
22               f2 ~1                    0.000  0.000     NA     NA

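According to this, cost performance is the item most strongly tied to f1 (quality as a ramen shop). That would mean cost performance outweighs taste, which is interesting in its own right. For f2 (Jiro love), on the other hand, atmosphere and service matter more than taste.

I use the semPlot package to draw the path diagram.

# Draw the path diagram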
library(semPlot)

semPaths(fit,whatLabels = "stand",optimizeLatRes = T)

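The variable names are abbreviated to three characters. I couldn't quickly find how to fix that, so I'm posting the diagram as is.

This is the very basics, but now that I understand the overall workflow, I'd like to keep trying covariance structure analysis.

References

共分散構造分析 R編―構造方程式モデリング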

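Looking back on my undergraduate research with RStan

Research overview

I found the data from an experiment I ran in experimental economics as an undergraduate on a USB stick, so I'd like to look back and analyze it.

Research objective
To determine, with respect to peer effects, whether it is better for one's competitors to be stronger or weaker than oneself.

Experimental method
・Participants solve two sheets of 100-square calculation drills in one minute. (The very fast ones were handed a third sheet.)
・After the experiment starts, depending on the class, I announce partway through: "The average class has reached square ○○!" The idea is that the announcement tells participants the level of their competitors and makes them feel either rushed or relaxed.
The control group received no announcement. There were four announcement patterns: average (18 seconds), above average (15 seconds), far above average (12 seconds), and below average (20 seconds).
・Incorrect answers earn no points.

Participants
217 first- and second-year students taking a required course in the economics faculty of a certain national university. (I negotiated with the professor to get the first 5 minutes of class for the experiment.)
Since there are no internal-track or sports-recommendation admissions, I believe the group is reasonably homogeneous in calculation ability.

Method of verification
Use regression analysis and similar tools to judge whether the score level on the 100-square drill differs by announcement.

Data visualization

The experimental conditions are abbreviated as follows.
Below-average announcement (20 seconds): slow_20
Above-average announcement (15 seconds): fast_15
Far-above-average announcement (12 seconds): fastest_12
Average announcement (18 seconds): average_18
Control group: baseline

Checking the data structure: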

> str(dataset)
'data.frame':	217 obs. of  4 variables:
 $ categories    : chr  "baseline" "baseline" "baseline" "baseline" ...
 $ points        : int  72 79 81 81 98 99 100 101 102 104 ...
 $ errors        : int  0 0 0 0 2 1 0 0 1 0 ...
 $ genuine_points: int  72 79 81 81 96 98 100 101 101 104 ...

I compute the mean, median, standard deviation, and sample size for each condition.

> dataset %>% group_by(categories) %>% summarise(average=mean(genuine_points),
+                                                median=median(genuine_points),
+                                                stdev=sd(genuine_points),
+                                                sample=n())
# A tibble: 5 × 5
  categories  average median    stdev sample
       <chr>    <dbl>  <dbl>    <dbl>  <int>
1 average_18 120.8605    117 28.08455     43
2   baseline 121.0000    121 22.19109     46
3    fast_15 123.0357    126 24.32355     56
4 fastest_12 123.6774    125 24.09756     31
5    slow_20 126.3902    126 23.20547     41

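Looking at the medians, there appear to be slight differences in score relative to baseline.

I check the histogram and estimated probability density of the scores by experimental condition.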

library(ggplot2)

g <- ggplot(data = dataset,
            aes(x = genuine_points,
                y = ..density..)) +
            geom_histogram(alpha = 0.5,position = "identity") +
            geom_density(alpha = 0)
g + facet_wrap(~categories,nrow=5)

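It bothers me that baseline looks possibly multimodal. average_18 looks low.

Multiple regression with RStan

The code is based on the code in 『StanとRでベイズ統計モデリング』. It is a linear regression model with a normal likelihood. No priors are written out explicitly, so Stan's default flat priors apply.
The response variable is the score, and the only explanatory variables are the dummy variables for the experimental condition.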

library(rstan)
library(dummies)

dummies <- dummy.data.frame(dataset, sep = "_", names = c("categories"))

analytical_dataset <- dummies %>% select(categories_average_18,
                                         categories_fast_15,
                                         categories_fastest_12,
                                         categories_slow_20,
                                         genuine_points)


data <- list(N=nrow(analytical_dataset),
             average_18=analytical_dataset$categories_average_18,
             fast_15=analytical_dataset$categories_fast_15,
             fastest_12=analytical_dataset$categories_fastest_12,
             slow_20=analytical_dataset$categories_slow_20,
             genuine_points=analytical_dataset$genuine_points)


stan_code <- "
data{
int N; //the number of student
int<lower=0> genuine_points[N];
real<lower=0, upper=1> average_18[N];
real<lower=0, upper=1> fast_15[N];
real<lower=0, upper=1> fastest_12[N];
real<lower=0, upper=1> slow_20[N];
}

parameters{
real b1;
real b2;
real b3;
real b4;
real b5;
real<lower=0> sigma;
}

transformed parameters{
real mu[N];
for(n in 1:N)
mu[n] = b1 + b2*average_18[n] + b3*fast_15[n] + b4*fastest_12[n] + b5*slow_20[n];
}

model{
for(n in 1:N)
genuine_points[n] ~ normal(mu[n], sigma);
}
"

fit <- stan(model_code =stan_code, data=data, seed=1234)
fit.summary <-data.frame(summary(fit)$summary)
head(fit.summary,6)
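
The dummy coding and the linear predictor `mu` in the Stan model above can be sketched in plain Python as follows. This is an illustration rather than part of the analysis: the helper names and the example coefficients are made up, and baseline is treated as the reference category, as in the model.

```python
# The four announcement conditions; baseline is the reference and gets all zeros.
CATEGORIES = ["average_18", "fast_15", "fastest_12", "slow_20"]


def dummy_code(category):
    """Return 0/1 indicators for one participant's experimental condition."""
    return {name: int(category == name) for name in CATEGORIES}


def linear_predictor(category, b):
    """mu = b1 + b2*average_18 + b3*fast_15 + b4*fastest_12 + b5*slow_20."""
    x = dummy_code(category)
    return b[0] + sum(coef * x[name] for coef, name in zip(b[1:], CATEGORIES))


# Hypothetical coefficients [b1, b2, b3, b4, b5], for illustration only.
example_b = [100.0, 1.0, 2.0, 3.0, 4.0]
print(dummy_code("fast_15"))
print(linear_predictor("baseline", example_b))
print(linear_predictor("slow_20", example_b))
```

Because baseline is the reference, b1 is the expected score without an announcement and b2 to b5 are shifts relative to it.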

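Results

I check the MCMC sampling with traceplot(fit).

The chains appear to have converged.

The estimates are below. Unfortunately, every coefficient's 95% credible interval includes zero, so we cannot say the announcements had an effect. The slow_20 coefficient comes close, though. At most we can say that it may have more of an announcement effect than the other conditions.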

> head(fit.summary,6)
             mean   se_mean       sd      X2.5.       X25.        X50.       X75.     X97.5.    n_eff      Rhat
b1    121.0253094 0.1034836 3.460792 114.214104 118.793988 121.0140383 123.213054 127.902759 1118.429 0.9998152
b2     -0.1195318 0.1326887 4.957153 -10.009827  -3.372795  -0.1304577   3.117195   9.694184 1395.715 1.0004034
b3      1.9440496 0.1248433 4.661571  -7.494857  -1.043060   1.9756929   5.007503  10.827640 1394.228 1.0000830
b4      2.5460694 0.1436941 5.447930  -8.026616  -1.128587   2.4997953   6.220891  13.044467 1437.426 1.0008688
b5      5.3029965 0.1351662 4.988323  -4.479865   2.004280   5.3108476   8.543292  15.458379 1361.988 1.0004238
sigma  24.6158717 0.0211252 1.150481  22.501612  23.811858  24.5900106  25.368057  26.970197 2965.905 1.0017414
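
Since the model contains only dummy variables and no informative priors, the coefficients should be close to simple differences in group means. The sketch below checks this against the group means reported in the summary table earlier in this post; `differences_from_reference` is a hypothetical helper name.

```python
# Group means copied from the group-wise summary earlier in this post.
group_means = {
    "average_18": 120.8605,
    "baseline": 121.0000,
    "fast_15": 123.0357,
    "fastest_12": 123.6774,
    "slow_20": 126.3902,
}


def differences_from_reference(means, reference="baseline"):
    """Difference between each group mean and the reference group mean."""
    ref = means[reference]
    return {
        name: round(value - ref, 2)
        for name, value in means.items()
        if name != reference
    }


diffs = differences_from_reference(group_means)
print(diffs)
# {'average_18': -0.14, 'fast_15': 2.04, 'fastest_12': 2.68, 'slow_20': 5.39}
```

These are close to the posterior means of b2 to b5 (-0.12, 1.94, 2.55, 5.30), and the baseline mean of 121.0 matches b1.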

Let's look at the distributions too. We cannot say that the announcement effects are definitely above zero.

library (reshape)
library(dplyr)
library(ggplot2)
post   <- extract (fit, permuted = F)
m.post <- melt (post)
m.post <- m.post %>% filter(parameters %in% c("b1","b2","b3","b4","b5"))
graph  <- ggplot (m.post, aes(x = value))
graph  <- graph + geom_density () + facet_grid(. ~ parameters, scales = "free") + theme_bw() 
plot (graph)

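In the end the conclusion is no different from my undergraduate report, but I'm glad I could approach the results in terms of the probability that a coefficient is greater than zero.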
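
The probability that a coefficient is greater than zero is simply the share of posterior draws above zero. The sketch below shows the calculation in Python. The draws are synthetic, generated from a normal distribution that roughly mimics the reported mean and standard deviation of b5; they are not the actual MCMC samples, and `prob_positive` is an illustrative name.

```python
import random


def prob_positive(draws):
    """Fraction of posterior draws that are strictly greater than zero."""
    draws = list(draws)
    return sum(d > 0 for d in draws) / len(draws)


# Synthetic stand-in for the posterior draws of b5 (NOT the real samples).
rng = random.Random(1234)
synthetic_draws = [rng.gauss(5.3, 5.0) for _ in range(4000)]

print(prob_positive(synthetic_draws))
```

With the real fit, the draws would come from the extracted posterior samples of each coefficient instead.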

References

StanとRでベイズ統計モデリング (Wonderful R)