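Chapter 5 of Bayesian Statistics and Marketing: A Hierarchical Bayes Model of Household Heterogeneity

Introduction

For the Golden Week holiday I took a book back to my parents' home: "Bayesian Statistics and Marketing", published in 2005. I bought it in graduate school and had only read a little of it.

As the title says, the book approaches marketing analyses with Bayesian statistics. It also introduces bayesm, an R package written for the book, so you can practice in R as well as study the theory. Runnable functions are provided for every analysis example from Chapter 1 through Chapter 7. (The package is fairly large; its documentation on CRAN runs to about 120 pages.)

Among Japanese-language books there is 「ベイズモデリングによるマーケティング分析」 (Marketing Analysis with Bayesian Modeling) by Professor Terui of Tohoku University, which also introduces Bayesian Statistics and Marketing and the bayesm package on page 82.

This post walks through the analysis in Chapter 5: a brand choice model that accounts for household heterogeneity using a hierarchical Bayes model. In addition, someone on GitHub has been reproducing the book's examples in Stan, so I also introduce that code.

Bayesian statistics is more popular than ever, but books on marketing and Bayes still seem scarce, so I hope this is of some help to your research.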

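Objective

Using margarine purchase data, I want to show how the response to margarine price differs by brand and by household.

Data

The margarine dataset in the bayesm package. Load it with data(margarine); details are in the documentation on CRAN.

Household Panel Data on Margarine Purchases contains purchase data for 516 households along with demographic information for each household. The data come from a 1991 paper, so they are quite old.

The purchase data contain prices (US dollars) and the ID of the chosen brand (10 alternatives).
The demographic information contains variables such as family size, education, job type, and retirement status, mostly coded as dummy variables.

This example restricts the analysis to households with at least five purchases,
which leaves a dataset of 313 households and 3,405 purchase records.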

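Model

Each household is assumed to respond differently to margarine price, so each brand's price coefficient has as many values as there are households.
The response to price is also determined by household attributes.
Under these assumptions, inference uses the following setup.
A multinomial logistic model for the choice among six brands (a categorical distribution combined with the softmax function).
First level: the explanatory variables are the prices of each brand, and each price multiplied by its response coefficient is the input to the multinomial logistic model.
Second level: each household's price-response coefficients are the sum of a household-specific term and the attribute data multiplied by per-attribute coefficients.
The household-specific term follows a normal distribution with mean 0 and covariance V_beta.
The attribute coefficients follow a normal distribution with mean vec(delta_bar) and covariance given by the Kronecker product of V_beta and A^(-1).
The covariance V_beta follows an inverse-Wishart distribution with υ degrees of freedom and scale matrix V.
A = 0.01, υ = 6 + 3 = 9, V = υI (I is the identity matrix)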

$\ell(\beta_i \mid y_i, X_i) \quad [\text{Multinomial Logit}]$

$B = Z \varDelta + U, \quad u_i \sim N(0, V_\beta)$

$\operatorname{vec}(\varDelta) \mid V_\beta \sim N(\operatorname{vec}(\bar{\varDelta}), V_\beta \otimes A^{-1})$

$V_\beta \sim IW(\upsilon, V)$
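To make the two levels concrete, here is a minimal sketch in plain Python of how one household's coefficients and choice probabilities could be computed under this setup. It mirrors the elementwise product of coefficients and prices used in the Stan code in this post. All numbers (delta, z_i, u_i, prices) are hypothetical values chosen for illustration, not estimates from the margarine data, and three brands are used instead of six to keep it short.

```python
import math


def household_beta(delta, z_i, u_i):
    """Second level: beta_i = Delta' z_i + u_i, where delta is p_z x p_x."""
    p_x = len(u_i)
    return [
        sum(delta[j][k] * z_i[j] for j in range(len(z_i))) + u_i[k]
        for k in range(p_x)
    ]


def softmax(utilities):
    """Numerically stable softmax over a list of utilities."""
    m = max(utilities)
    exps = [math.exp(u - m) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]


def choice_probabilities(beta_i, prices):
    """First level: utility of brand k is beta_i[k] * prices[k]."""
    utilities = [b * p for b, p in zip(beta_i, prices)]
    return softmax(utilities)


# Hypothetical inputs: an intercept plus one attribute (p_z = 2), three brands (p_x = 3).
delta = [[-2.0, -1.0, -3.0],   # intercept row
         [0.5, 0.0, -0.5]]     # row for one household attribute
z_i = [1.0, 1.0]               # intercept and attribute value for household i
u_i = [0.1, -0.2, 0.0]         # household-specific deviation, one draw from N(0, V_beta)
prices = [0.5, 0.6, 1.0]       # prices of the three brands on one purchase occasion

beta_i = household_beta(delta, z_i, u_i)
probs = choice_probabilities(beta_i, prices)
print(beta_i)
print(probs)
```

The probabilities sum to one, and the brand with the least negative product of coefficient and price gets the highest probability.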

Code

kefits has rewritten several of the book's bayesm examples in Stan, so I will learn from that code.
https://github.com/kefits/Bayesian-Statistics-and-Marketing

The Stan code is below. Save it as Hierarchical_MNL.stan.

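data{
  int<lower=0> N_x; // number of purchase records
  int<lower=0> N_z; // number of households
  int<lower=0> p_x; // number of explanatory variables (brand prices) per purchase record
  int<lower=0> p_z; // number of household attribute columns (including the intercept)
  
  int y[N_x]; // chosen brand
  matrix[N_x, p_x] X; // explanatory variables (brand prices)
  matrix[N_z, p_z] Z; // household attribute data
  int<lower=0> hhid[N_x];  // household index (1..N_z) for each purchase record
}

transformed data{
  real nu;
  matrix[p_x, p_x] I; // p_x by p_x identity matrix
  
  nu = p_x + 3; // degrees of freedom: number of explanatory variables plus 3
  I = diag_matrix(rep_vector(1, p_x)); // diagonal matrix with p_x ones on the diagonal
}

parameters{
  vector[p_x] beta_ast[N_z]; // household-specific term (u_i), one vector per household
  matrix[p_z, p_x] Delta; // coefficients: (attribute columns) x (purchase-data explanatory variables)
  cov_matrix[p_x] V_b; // covariance matrix V_beta
}

transformed parameters{
  vector[p_x] beta[N_z]; // coefficient vector for each household
  matrix[p_x, p_x] L_b; // Cholesky factor of V_b (covariance of the household-specific terms)
  matrix[p_x, p_x] L_d; // Cholesky factor of V_b / A (covariance of each row of Delta)
  
  L_b = cholesky_decompose(V_b); // Cholesky factor of the covariance matrix
  L_d = cholesky_decompose(100*V_b); // Cholesky factor of V_b divided by A = 0.01
  for(i in 1:N_z){
    beta[i] = beta_ast[i] + Delta' * Z[i]'; // coefficients = household-specific term + attribute effects
  }
}

model{
  for(i in 1:N_x){
    y[i] ~ categorical(softmax(beta[hhid[i]] .* to_vector(X[i]))); // multinomial logit: categorical distribution with softmax
  }
  for(i in 1:p_z){
    Delta[i] ~ multi_normal_cholesky(rep_vector(0, p_x), L_d); // multivariate normal parameterized by the Cholesky factor L_d (often used to speed up estimation)
  }
  beta_ast ~ multi_normal_cholesky(rep_vector(0, p_x), L_b); // multivariate normal parameterized by the Cholesky factor L_b
  V_b ~ inv_wishart(nu, nu*I); // inverse-Wishart, the conjugate prior for a normal covariance matrix
}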

The R code below runs the Stan model.

library(bayesm)
library(dplyr)
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

data("margarine")

# Keep purchases of products 1,2,3,4,5,7, count them per household ID, and keep households with 5 or more.
hhid_selected <- margarine$choicePrice %>% 
                  filter(choice %in% c(1,2,3,4,5,7)) %>% 
                  group_by(hhid) %>% 
                  summarise(purc_cnt = n()) %>% 
                  filter(purc_cnt >= 5)

# Keep only the products used here and the households selected above.
choicePrice.selected <- margarine$choicePrice %>% 
                          filter(choice %in% c(1,2,3,4,5,7) & hhid %in% hhid_selected$hhid)
# Recode 7 as 6 so that the choices run from 1 to 6.
choicePrice.selected$choice[choicePrice.selected$choice == 7] <- 6

# Demographic data for the selected households
demos.selected <- margarine$demos %>% filter(hhid %in% hhid_selected$hhid)

# Number of purchase records
N <- nrow(choicePrice.selected)

# Number of alternatives (not actually used below)
p <- n_distinct(choicePrice.selected$choice)

# Response variable
y <- choicePrice.selected$choice

# Explanatory variables: price columns of the six selected products
X <- choicePrice.selected %>% select(3,4,5,6,7,9)

# Drop the household ID from the demographic data
Z <- demos.selected %>% 
        select(-hhid)

# Add an intercept as the first column
Z <- data.frame(intercept = rep(1, nrow(Z))) %>% 
        bind_cols(Z)

# Take the household IDs from the demographic data and index them from 1 to the number of rows.
hhid_index <- demos.selected %>%
                select(hhid) %>% 
                mutate(ind = seq(1,nrow(demos.selected)))

# Take the household IDs of the purchase data and join them with that index.
hhid_x <- choicePrice.selected %>% 
            select(hhid) %>% 
            left_join(hhid_index, by = "hhid")

# Data list passed to Stan (X and Z converted to matrices to match the Stan declarations)
d.dat <- list(N_x=nrow(X), N_z=nrow(Z), 
              p_x=ncol(X), p_z=ncol(Z),
              y=y, X=as.matrix(X), Z=as.matrix(Z),
              hhid = hhid_x$ind)

# Estimation
d.fit <- stan("../Chapter5/Hierarchical_MNL.stan",
              data = d.dat,
              iter = 500,
              chains = 4)

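Results

It took about 40 minutes on a MacBook Pro with a Core i5 and 8 GB of memory.

traceplot(d.fit) draws the results of the four chains as shown below, and they appear to have converged.

As the summary function shows, the fit produces a summary with as many as 3,913 rows of parameters.

Per-household price coefficients for each brand, 313 households (beta: 1,878)
Per-household latent coefficients for each brand, 313 households (beta_ast: 1,878)
Covariance matrix of the six brand coefficients (V_b: 36)
Cholesky factor of the covariance matrix of the six brand coefficients (L_b: 36)
Coefficients of the eight attribute columns for the six brands (Delta: 48)
Cholesky factor of the covariance matrix of the attribute coefficients (L_d: 36)
lp__ (the log posterior density, up to an additive constant) (1)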
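The row count can be checked with a little arithmetic. The sketch below assumes the list above corresponds to the quantities declared in the Stan code in this post (beta, beta_ast, V_b, L_b, L_d, Delta, plus lp__), with 313 households, 6 price coefficients, and 8 attribute columns including the intercept; that mapping is my reading of the Stan program, not something the summary output states.

```python
n_households = 313
n_brands = 6       # p_x: price coefficients per household
n_attr_cols = 8    # p_z: intercept plus household attributes

# Number of scalar entries per quantity in the Stan program.
counts = {
    "beta": n_households * n_brands,
    "beta_ast": n_households * n_brands,
    "V_b": n_brands * n_brands,
    "L_b": n_brands * n_brands,
    "L_d": n_brands * n_brands,
    "Delta": n_attr_cols * n_brands,
    "lp__": 1,
}

total = sum(counts.values())
print(counts)
print(total)
```

The total agrees with the 3,913 rows reported by summary.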

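Looking at the distribution of the price coefficients for household 64, the coefficients for the fourth and fifth brands are smaller than those for the other brands.

Next, pooling the household-level coefficients and computing the correlations between them suggests that some pairs of brands are positively correlated and others negatively correlated.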

# Trace plots
traceplot(d.fit)

# Summary of the coefficients
summary_table <- summary(d.fit)$summary

draws <- rstan::extract(d.fit)
beta <- as.data.frame(draws$beta)
Delta <- as.data.frame(draws$Delta)
V_b <- as.data.frame(draws$V_b)

hhid_info <- inner_join(hhid_index, hhid_selected, by = "hhid")

# For each brand, reshape its 1000 x 313 block of draws into a single column of 313,000 values.
for (i in 1:6) {
  nam <- paste("beta", i, sep = "")
  assign(nam, beta[,(1+313*(i-1)):(313*(i))] %>% tidyr::gather(key, value))
}

beta_matrix <- beta1 %>% bind_cols(beta2,beta3,beta4,beta5,beta6)
beta_matrix <- beta_matrix %>% select(-starts_with("key"))

# Correlation coefficients
cor(beta_matrix)

            value     value1     value2     value3     value4    value5
value   1.0000000  0.5902734  0.4864998 -0.1798877 -0.4781558 0.3188025
value1  0.5902734  1.0000000  0.6134343 -0.2484286 -0.4441728 0.2002954
value2  0.4864998  0.6134343  1.0000000  0.1336322 -0.4512474 0.3663121
value3 -0.1798877 -0.2484286  0.1336322  1.0000000  0.6149186 0.2671819
value4 -0.4781558 -0.4441728 -0.4512474  0.6149186  1.0000000 0.1591574
value5  0.3188025  0.2002954  0.3663121  0.2671819  0.1591574 1.0000000

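Finally, I plot the posterior distribution of the price-response coefficients for each brand, aggregated over households.

~~There is no multimodality, and the coefficients appear to follow a normal distribution. The fifth coefficient looks smaller than those of the other brands.~~

That was wrong. I noticed a week later that I should have plotted the posterior mean of each household's coefficient for each brand.
The correct plot is this one.

I assumed a normal distribution as the prior, but the coefficients do not follow a normal distribution.
A symmetric distribution such as the normal therefore does not seem appropriate as the prior.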
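The correction amounts to averaging over the MCMC draws within each household before plotting, rather than pooling every draw. A minimal sketch of that step in plain Python, assuming the draws are arranged as draws[s][h][k] (draw s, household h, brand k); the toy numbers are hypothetical and the function name is mine.

```python
def posterior_means(draws):
    """Average over draws: draws[s][h][k] -> means[h][k]."""
    n_draws = len(draws)
    n_households = len(draws[0])
    n_brands = len(draws[0][0])
    return [
        [sum(draws[s][h][k] for s in range(n_draws)) / n_draws
         for k in range(n_brands)]
        for h in range(n_households)
    ]


# Toy example: 3 draws, 2 households, 2 brands (hypothetical values).
draws = [
    [[-1.0, -2.0], [0.0, -4.0]],
    [[-2.0, -2.0], [1.0, -5.0]],
    [[-3.0, -2.0], [2.0, -6.0]],
]

means = posterior_means(draws)
print(means)  # one row per household, one column per brand
```

The histogram for brand k is then drawn from column k of means, which holds one value per household.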

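Conclusion

Even though it is a 2005 book, I found it still very useful. I have not covered all of it yet, and I will keep studying.
The book has about five case studies, and rewriting them in Stan would probably be a good way to build skill.

For a data analyst working in a marketing department, the combination of marketing and Bayes is highly motivating, so I want to keep looking for literature like this.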

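References

Bayesian Statistics and Marketing (Wiley Series in Probability and Statistics)
Support site for Bayesian Statistics and Marketing
ベイズモデリングによるマーケティング分析 (Marketing Analysis with Bayesian Modeling)
StanとRでベイズ統計モデリング (Wonderful R) (Bayesian Statistical Modeling with Stan and R)
RStanのおさらいをしながら読む岩波DS 1 (Reading Iwanami DS 1 while reviewing RStan), Shinya Uryu
Stanのlp__とは何なのか (What is lp__ in Stan?), うなどん
‘LP__’ IN STAN OUTPUT
Package ‘bayesm’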

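Applying LDA to the 蒙古タンメン中本 (Mouko Tanmen Nakamoto) Corpus and Searching for the Number of Topics

Motivation

The previous post showed a feature-engineering example that applied Word2Vec to review data for 蒙古タンメン中本 collected by web scraping.
Since the data are interesting, this time I apply LDA to see what topics they contain. In addition, my earlier LDA posts never used the evaluation metrics perplexity and coherence, so I also examine how many topics there should be. The hierarchical Dirichlet process I covered before does not require fixing the number of topics in advance, but I do not use it here.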

Environment

・MacBook Pro
・Python 3.5
・R version 3.4.4

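LDA with Gensim

I again use Python's Gensim library.

Perplexity
- Computed on test data.
- The exponential of the negative average per-word log-likelihood; lower is better.
  - A model with low perplexity is regarded as a good probabilistic model that predicts accurately. It is a measure of generalization ability.
  - In textbook examples, perplexity tends to decrease as the number of topics increases.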
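As a small worked example of the definition above, the sketch below computes perplexity from per-word log-likelihoods in plain Python. It also shows the conversion from the per-word bound that Gensim's log_perplexity returns, which Gensim reports as perplexity = 2^(-bound). The function names and the toy input are my own and are only for illustration.

```python
import math


def perplexity(word_log_probs):
    """exp of the negative mean per-word natural-log likelihood."""
    return math.exp(-sum(word_log_probs) / len(word_log_probs))


def perplexity_from_gensim_bound(bound):
    """Convert the per-word bound returned by log_perplexity to a perplexity."""
    return 2.0 ** (-bound)


# Toy check: if every held-out word has probability 1/8,
# the model is as uncertain as a fair 8-sided die.
uniform_log_probs = [math.log(1.0 / 8.0)] * 20
ppl = perplexity(uniform_log_probs)
print(ppl)

# A bound of -3 bits per word corresponds to the same perplexity.
print(perplexity_from_gensim_bound(-3.0))
```

Because the bound is a log-scale quantity with the sign flipped, a higher bound means a lower perplexity.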

The following code computes it with Gensim.

import pandas as pd
import gensim
from gensim import corpora, models

import matplotlib.pyplot as plt
from IPython.display import display, clear_output

nakamoto_corpus = pd.read_pickle("nakamoto_corpus.pickle")

output_list = pd.DataFrame(columns=["y", "x"])
plt.ion()
fig = plt.figure()
axe = fig.add_subplot(111)

for iterations in range(2, 100):
    clear_output(wait=True)

    # Hold out 10% of the reviews as test data, sampled without replacement
    nakamoto_test = nakamoto_corpus.sample(frac=0.1, replace=False)
    nakamoto_train = nakamoto_corpus[~nakamoto_corpus.index.isin(nakamoto_test.index)]

    # Split the space-separated tokens of each review
    texts = [line.split() for line in nakamoto_train.text_wakati]
    texts_test = [line.split() for line in nakamoto_test.text_wakati]

    # Build the dictionary from the training data only
    dictionary = corpora.Dictionary(texts)
    dictionary.filter_extremes(no_below=20, no_above=0.3)

    # Build the bag-of-words corpora
    corpus = [dictionary.doc2bow(text) for text in texts]
    corpus_test = [dictionary.doc2bow(text) for text in texts_test]

    # Fit LDA (the number of topics runs from 3 to 100)
    topic_N = 1 + iterations
    lda = gensim.models.ldamodel.LdaModel(
        corpus=corpus,
        alpha='auto',
        num_topics=topic_N,
        id2word=dictionary
    )

    # log_perplexity returns the per-word likelihood bound on the test data;
    # the perplexity itself is 2 ** (-bound), so a higher bound is better.
    d = {'y': lda.log_perplexity(chunk=corpus_test), 'x': topic_N}
    df = pd.DataFrame(data=d, index=[0])
    # DataFrame.append was removed in pandas 2.0, so use concat
    output_list = pd.concat([output_list, df], ignore_index=True)

    # Redraw the chart with the results so far
    axe.plot(output_list.x, output_list.y)
    fig.set_size_inches(8, 8)
    display(fig)
    axe.cla()

    for i in range(topic_N):
        print('TOPIC:', i, '__', lda.print_topic(i))

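Computed on the Nakamoto corpus, perplexity changed with the number of topics as follows.

According to the post "Ldaのモデル選択におけるperplexityの評価" (Evaluating perplexity for LDA model selection),
"the approach of 'comparing several numbers of topics and choosing the one with the lowest perplexity' may be of no use at all for selecting a model that is useful to humans."

『トピックモデルによる統計的潜在意味解析』 states that "when the topics are used as features for a classification problem, the number can be decided by the evaluation method of that classification problem", so depending on the goal there may be no need to insist on perplexity.

In this case, perplexity alone does not settle the choice.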

Coherence
- The average similarity between the words within each topic.
- If coherence is high across all topics, the model is regarded as a good one.
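To give a feel for what a coherence score measures, here is a simplified sketch of the UMass-style score for a single topic, based on document co-occurrence counts of the topic's top words. It follows the textbook formula (sum over ordered word pairs of log((D(w_m, w_l) + 1) / D(w_l))) and is not Gensim's exact implementation, which handles smoothing and aggregation in its own way. The tokens and documents are made up for illustration.

```python
import math


def umass_coherence(top_words, documents):
    """Simplified UMass coherence for one topic.

    top_words: topic words ordered from most to least probable.
    documents: list of token lists. Every top word is assumed to
    occur in at least one document (otherwise D(w_l) would be zero).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all the given words
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co = doc_freq(top_words[m], top_words[l])
            score += math.log((co + 1) / doc_freq(top_words[l]))
    return score


# Hypothetical tokenized reviews
documents = [
    ["spicy", "miso", "noodle"],
    ["spicy", "miso"],
    ["spicy", "queue"],
    ["queue", "seat"],
]

coherent = umass_coherence(["spicy", "miso"], documents)    # words that co-occur
incoherent = umass_coherence(["spicy", "seat"], documents)  # words that never co-occur
print(coherent, incoherent)
```

Words that tend to appear in the same documents give a score near zero, while words that never co-occur push the score further below zero.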

The following code computes coherence.

output_list = pd.DataFrame(columns=["y", "x"])
plt.ion()
fig = plt.figure()
axe = fig.add_subplot(111)

for iterations in range(2, 100):
    clear_output(wait=True)

    # Hold out 10% of the reviews as test data, sampled without replacement
    nakamoto_test = nakamoto_corpus.sample(frac=0.1, replace=False)
    nakamoto_train = nakamoto_corpus[~nakamoto_corpus.index.isin(nakamoto_test.index)]

    # Split the space-separated tokens of each review
    texts = [line.split() for line in nakamoto_train.text_wakati]
    texts_test = [line.split() for line in nakamoto_test.text_wakati]

    # Build the dictionary from the training data only
    dictionary = corpora.Dictionary(texts)
    dictionary.filter_extremes(no_below=20, no_above=0.3)

    # Build the bag-of-words corpora
    corpus = [dictionary.doc2bow(text) for text in texts]
    corpus_test = [dictionary.doc2bow(text) for text in texts_test]

    # Fit LDA (the number of topics runs from 3 to 100)
    topic_N = 1 + iterations
    lda = gensim.models.ldamodel.LdaModel(
        corpus=corpus,
        alpha='auto',
        num_topics=topic_N,
        id2word=dictionary
    )

    # UMass coherence of the fitted model, computed on the training corpus
    cm = models.coherencemodel.CoherenceModel(model=lda, corpus=corpus, coherence='u_mass')
    d = {'y': cm.get_coherence(), 'x': topic_N}
    df = pd.DataFrame(data=d, index=[0])
    # DataFrame.append was removed in pandas 2.0, so use concat
    output_list = pd.concat([output_list, df], ignore_index=True)

    # Redraw the chart with the results so far
    axe.plot(output_list.x, output_list.y)
    fig.set_size_inches(8, 8)
    display(fig)
    axe.cla()

    for i in range(topic_N):
        print('TOPIC:', i, '__', lda.print_topic(i))

In the actual estimates, coherence tends to fall once the number of topics exceeds about 20,
so it may be better not to pursue more topics than that.

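Trying It in R Too

Looking for a good way to choose the number of topics in R, I found a package called ldatuning. It explores a suitable number of topics using methods from several papers (Griffiths2004, CaoJuan2009, Arun2010, Deveaud2014). In the example on the blog that introduces the package, the optimal number of topics falls in the range of 90 to 140. See the following for details.
Select number of topics for LDA model

I ran the code below. Part of it is borrowed from the blog 驚異のアニヲタ社会復帰への道. The lexicalize function in the lda package can be used to create the input data for ldatuning.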

library(ldatuning)
library(topicmodels)
library(tidyverse)

nakamoto_data <- read_csv(file = "nakamoto_dataset.csv")

# TF-IDF-like score per vocabulary term, computed from an lda::lexicalize corpus
TFIDF <- function(corpus, progress=FALSE){ 
  res <- matrix(0, nrow=length(corpus$vocab), ncol=4)
  dimnames(res) <- list(corpus$vocab, c("documents", "count", "freq", "score"))
  res[, "documents"] <- length(corpus$documents)
  # Word IDs (0-based) that appear in each document
  wordset <- lapply(corpus$documents, function(x) x[1,])
  allfreq <- matrix(unlist(corpus$documents), nrow=2)
  # Total number of occurrences of each word
  wordfreq <- tapply(allfreq[2,], allfreq[1,], sum) 
  # Create the progress bar once, before the loop
  if(progress){ 
    pb <- txtProgressBar(min=1, max=length(corpus$vocab), style=3)
  }
  for(v in seq(corpus$vocab)){ 
    # Number of documents that contain word v
    count_docs <- sum(sapply(lapply(wordset, "==", v-1), any))
    res[v, "freq"] <- count_docs
    if(progress){ 
      setTxtProgressBar(pb, v)
    }
  }
  if(progress){ 
    close(pb)
  }
  res[, "count"] <- wordfreq
  res[, "score"] <- log(res[, "count"]) * log(res[, "documents"]/res[, "freq"])
  return(as.data.frame(res))
}

lex1 <- lda::lexicalize(nakamoto_data$text_wakati_remove_freq)

# Keep only the terms whose score is positive
s0 <- TFIDF(lex1, TRUE)
term1 <- rownames(s0)[s0$score > 0]
lex2 <- list(documents=lda::lexicalize(nakamoto_data$text_wakati_remove_freq, vocab=term1), vocab=term1)
dtm2 <- ldaformat2dtm(lex2$documents, lex2$vocab)

result <- FindTopicsNumber(
  dtm2,
  topics = seq(from = 2, to = 100, by = 1),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method = "Gibbs",
  control = list(seed = 77),
  mc.cores = 2L,
  verbose = TRUE
)

FindTopicsNumber_plot(values = result)

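Judging from this, the number of topics seems to settle at around 60 to 70.

Exporting the Topics

Based on the R results, I estimate a model with about 60 topics, assign each document the topic with the largest allocation, and look at the review ratings by topic.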

The code below estimates the model with the topicmodels package, because it makes the review ratings by topic easy to compute.

nbo_topics <- 60
lda_estimate <- topicmodels::LDA(dtm2,control=list(verbose=1, alpha = 0.1),
                                 k = nbo_topics,
                                 method = "Gibbs")

# Check the top 5 words of each topic
terms_each_topics <- data.frame(terms(lda_estimate,5))
topic_keywords <- data.frame(topic_keywords_5 = apply(t(terms_each_topics),1,paste,collapse=","))
topic_keywords <- topic_keywords %>% mutate(topic_id=1:n())

# Take the most probable topic of each document and attach it to the review data
topics_each_document <- data.frame(topic_id=topics(lda_estimate,1))
topics_each_document <- nakamoto_data %>% bind_cols(topics_each_document)

# Compute the review rating for each topic
topic_rating_summary <- topics_each_document %>% 
                            group_by(topic_id) %>% 
                            summarise(average_rating = mean(rating),
                                      count=n()) %>% 
                            left_join(topic_keywords, by = "topic_id") %>% 
                            arrange(desc(average_rating))
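The two steps in the R code above (take each document's most probable topic, then average the ratings within each topic) can be sketched in plain Python as follows. The document-topic probabilities and ratings are hypothetical, the function names are mine, and topic IDs here are 0-based whereas the R code uses 1-based IDs.

```python
from collections import defaultdict


def dominant_topic(topic_probs):
    """Index of the topic with the highest probability for one document."""
    return max(range(len(topic_probs)), key=lambda t: topic_probs[t])


def mean_rating_by_topic(doc_topic_probs, ratings):
    """Average rating of the documents assigned to each dominant topic."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for probs, rating in zip(doc_topic_probs, ratings):
        topic = dominant_topic(probs)
        sums[topic] += rating
        counts[topic] += 1
    return {topic: sums[topic] / counts[topic] for topic in sums}


# Hypothetical document-topic probabilities (4 reviews, 3 topics) and ratings
doc_topic_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]
ratings = [4.0, 3.0, 5.0, 2.0]

summary = mean_rating_by_topic(doc_topic_probs, ratings)
for topic, avg in sorted(summary.items(), key=lambda kv: kv[1], reverse=True):
    print(topic, avg)
```

Assigning a single topic per document discards the rest of the topic mixture, so topics that are rarely dominant end up with few reviews behind their average.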

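The topics with the highest review ratings look like this.

The topics with the lowest review ratings look like this.

I went to Nakamoto a few times in my second and third years as a working adult, to relieve the stress of launching a new media site, and I think the redness of 北極 (Hokkyoku) is abnormal. It may be natural that people who are comfortable eating 北極, adding toppings, or even doubling the spiciness also give high ratings.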

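References

トピックモデル (機械学習プロフェッショナルシリーズ) (Topic Models, Machine Learning Professional Series)
トピックモデルによる統計的潜在意味解析 (自然言語処理シリーズ) (Statistical Latent Semantic Analysis with Topic Models, Natural Language Processing Series)
models.ldamodel – Latent Dirichlet Allocation
Ldaのモデル選択におけるperplexityの評価 (Evaluating perplexity for LDA model selection)
pythonでgensimを使ってトピックモデル(LDA)を行う (Topic modeling (LDA) with gensim in Python)
gensim0.8.6のチュートリアルをやってみた【コーパスとベクトル空間】 (Working through the gensim 0.8.6 tutorial: corpora and vector spaces)
LDA 実装の比較 (Comparison of LDA implementations)
Jupyter notebookにMatplotlibでリアルタイムにチャートを書く (Drawing charts in real time with Matplotlib in a Jupyter notebook)
Inferring the number of topics for gensim’s LDA – perplexity, CM, AIC, and BIC
Select number of topics for LDA model
47の心得シリーズをトピックモデルで分類する。 (Classifying the "47 tips" series with a topic model), 驚異のアニヲタ社会復帰への道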