Machine Learning – Page 4 – かものはしの分析ブログ

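Installing TensorFlow on OS X Yosemite and Running a Simple Classification Model

I am late to the party, but I decided to try installing TensorFlow.
The environment is OS X Yosemite, version 10.10.5.

First, as described on the linked site, run the following command in the terminal (the equivalent of the Command Prompt on Windows).
This assumes that Python 2.7 and pip are already installed.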

sudo pip install --upgrade https://storage.googleapis.com/tensorflow/mac/tensorflow-0.7.1-cp27-none-any.whl

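Running this command installs TensorFlow.

Next, create a working directory for the image data, together with an empty input_data.py opened in vi.
Again in the terminal, run the following commands.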

mkdir ~/tensorflow
cd ~/tensorflow
touch input_data.py
vi input_data.py

Next, copy the code that fetches the MNIST images from the linked GitHub repository.

# Copyright 2015 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

"""Functions for downloading and reading MNIST data."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import gzip
import os

import numpy
from six.moves import urllib
from six.moves import xrange  # pylint: disable=redefined-builtin
import tensorflow as tf

SOURCE_URL = 'http://yann.lecun.com/exdb/mnist/'


def maybe_download(filename, work_directory):
  """Download the data from Yann's website, unless it's already here."""
  if not tf.gfile.Exists(work_directory):
    tf.gfile.MakeDirs(work_directory)
  filepath = os.path.join(work_directory, filename)
  if not tf.gfile.Exists(filepath):
    filepath, _ = urllib.request.urlretrieve(SOURCE_URL + filename, filepath)
    with tf.gfile.GFile(filepath) as f:
      size = f.Size()
    print('Successfully downloaded', filename, size, 'bytes.')
  return filepath


def _read32(bytestream):
  dt = numpy.dtype(numpy.uint32).newbyteorder('>')
  return numpy.frombuffer(bytestream.read(4), dtype=dt)[0]


def extract_images(filename):
  """Extract the images into a 4D uint8 numpy array [index, y, x, depth]."""
  print('Extracting', filename)
  with tf.gfile.Open(filename, 'rb') as f, gzip.GzipFile(fileobj=f) as bytestream:
    magic = _read32(bytestream)
    if magic != 2051:
      raise ValueError(
          'Invalid magic number %d in MNIST image file: %s' %
          (magic, filename))
    num_images = _read32(bytestream)
    rows = _read32(bytestream)
    cols = _read32(bytestream)
    buf = bytestream.read(rows * cols * num_images)
    data = numpy.frombuffer(buf, dtype=numpy.uint8)
    data = data.reshape(num_images, rows, cols, 1)
    return data


def dense_to_one_hot(labels_dense, num_classes=10):
  """Convert class labels from scalars to one-hot vectors."""
  num_labels = labels_dense.shape[0]
  index_offset = numpy.arange(num_labels) * num_classes
  labels_one_hot = numpy.zeros((num_labels, num_classes))
  labels_one_hot.flat[index_offset + labels_dense.ravel()] = 1
  return labels_one_hot


def extract_labels(filename, one_hot=False):
  """Extract the labels into a 1D uint8 numpy array [index]."""
  print('Extracting', filename)
  with tf.gfile.Open(filename, 'rb') as f, gzip.GzipFile(fileobj=f) as bytestream:
    magic = _read32(bytestream)
    if magic != 2049:
      raise ValueError(
          'Invalid magic number %d in MNIST label file: %s' %
          (magic, filename))
    num_items = _read32(bytestream)
    buf = bytestream.read(num_items)
    labels = numpy.frombuffer(buf, dtype=numpy.uint8)
    if one_hot:
      return dense_to_one_hot(labels)
    return labels


class DataSet(object):

  def __init__(self, images, labels, fake_data=False, one_hot=False,
               dtype=tf.float32):
    """Construct a DataSet.
    one_hot arg is used only if fake_data is true.  `dtype` can be either
    `uint8` to leave the input as `[0, 255]`, or `float32` to rescale into
    `[0, 1]`.
    """
    dtype = tf.as_dtype(dtype).base_dtype
    if dtype not in (tf.uint8, tf.float32):
      raise TypeError('Invalid image dtype %r, expected uint8 or float32' %
                      dtype)
    if fake_data:
      self._num_examples = 10000
      self.one_hot = one_hot
    else:
      assert images.shape[0] == labels.shape[0], (
          'images.shape: %s labels.shape: %s' % (images.shape,
                                                 labels.shape))
      self._num_examples = images.shape[0]

      # Convert shape from [num examples, rows, columns, depth]
      # to [num examples, rows*columns] (assuming depth == 1)
      assert images.shape[3] == 1
      images = images.reshape(images.shape[0],
                              images.shape[1] * images.shape[2])
      if dtype == tf.float32:
        # Convert from [0, 255] -> [0.0, 1.0].
        images = images.astype(numpy.float32)
        images = numpy.multiply(images, 1.0 / 255.0)
    self._images = images
    self._labels = labels
    self._epochs_completed = 0
    self._index_in_epoch = 0

  @property
  def images(self):
    return self._images

  @property
  def labels(self):
    return self._labels

  @property
  def num_examples(self):
    return self._num_examples

  @property
  def epochs_completed(self):
    return self._epochs_completed

  def next_batch(self, batch_size, fake_data=False):
    """Return the next `batch_size` examples from this data set."""
    if fake_data:
      fake_image = [1] * 784
      if self.one_hot:
        fake_label = [1] + [0] * 9
      else:
        fake_label = 0
      return [fake_image for _ in xrange(batch_size)], [
          fake_label for _ in xrange(batch_size)]
    start = self._index_in_epoch
    self._index_in_epoch += batch_size
    if self._index_in_epoch > self._num_examples:
      # Finished epoch
      self._epochs_completed += 1
      # Shuffle the data
      perm = numpy.arange(self._num_examples)
      numpy.random.shuffle(perm)
      self._images = self._images[perm]
      self._labels = self._labels[perm]
      # Start next epoch
      start = 0
      self._index_in_epoch = batch_size
      assert batch_size <= self._num_examples
    end = self._index_in_epoch
    return self._images[start:end], self._labels[start:end]


def read_data_sets(train_dir, fake_data=False, one_hot=False, dtype=tf.float32):
  class DataSets(object):
    pass
  data_sets = DataSets()

  if fake_data:
    def fake():
      return DataSet([], [], fake_data=True, one_hot=one_hot, dtype=dtype)
    data_sets.train = fake()
    data_sets.validation = fake()
    data_sets.test = fake()
    return data_sets

  TRAIN_IMAGES = 'train-images-idx3-ubyte.gz'
  TRAIN_LABELS = 'train-labels-idx1-ubyte.gz'
  TEST_IMAGES = 't10k-images-idx3-ubyte.gz'
  TEST_LABELS = 't10k-labels-idx1-ubyte.gz'
  VALIDATION_SIZE = 5000

  local_file = maybe_download(TRAIN_IMAGES, train_dir)
  train_images = extract_images(local_file)

  local_file = maybe_download(TRAIN_LABELS, train_dir)
  train_labels = extract_labels(local_file, one_hot=one_hot)

  local_file = maybe_download(TEST_IMAGES, train_dir)
  test_images = extract_images(local_file)

  local_file = maybe_download(TEST_LABELS, train_dir)
  test_labels = extract_labels(local_file, one_hot=one_hot)

  validation_images = train_images[:VALIDATION_SIZE]
  validation_labels = train_labels[:VALIDATION_SIZE]
  train_images = train_images[VALIDATION_SIZE:]
  train_labels = train_labels[VALIDATION_SIZE:]

  data_sets.train = DataSet(train_images, train_labels, dtype=dtype)
  data_sets.validation = DataSet(validation_images, validation_labels,
                                 dtype=dtype)
  data_sets.test = DataSet(test_images, test_labels, dtype=dtype)

  return data_sets

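Paste the copied code into the vim session that is open in the terminal: press
i
to enter insert mode and paste, press Esc to return to command mode, then type
:wq
to save the file and close it.
See a list of vim commands for details.

Then start python in the terminal and run the following code.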

import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

You should now see that a folder has been created and that the data has been stored in it.
The data is roughly 11 MB in total. The original images are handwritten digits, each 28×28 pixels.
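
To make the data layout concrete, here is a minimal sketch of the two transformations that input_data.py applies: flattening each 28×28 image into a 784-element vector scaled to [0, 1], and turning integer labels into one-hot vectors. The random images and the labels 3 and 7 are hypothetical stand-ins so that the snippet runs without downloading anything.

```python
import numpy as np


def dense_to_one_hot(labels, num_classes=10):
    """Turn integer class labels into one-hot rows."""
    one_hot = np.zeros((labels.shape[0], num_classes))
    one_hot[np.arange(labels.shape[0]), labels] = 1
    return one_hot


# Hypothetical stand-in for two 28x28 grayscale images with values 0-255.
images = np.random.randint(0, 256, size=(2, 28, 28, 1)).astype(np.uint8)
labels = np.array([3, 7])

# [num examples, rows, cols, depth] -> [num examples, rows*cols], scaled to [0, 1].
flat = images.reshape(images.shape[0], 28 * 28).astype(np.float32) / 255.0
one_hot = dense_to_one_hot(labels)

print(flat.shape)     # one row of 784 values per image
print(one_hot)        # a single 1 per row, at the index of the label
```

This is why the model later in the post uses placeholders with 784 input columns and 10 output columns.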

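Now, to run a machine-learning classifier on this data,
I am borrowing, in its entirety, the code presented in the article "TensorFlow 畳み込みニューラルネットワークで手書き認識率99.2%の分類器を構築" (building a classifier with a 99.2% handwriting recognition rate using a convolutional neural network in TensorFlow). I use its beginner-level logistic regression model. (I changed the number of training iterations from 1,000 to 50,000.)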

# -*- coding: utf-8 -*-
from __future__ import absolute_import, unicode_literals
import input_data
import tensorflow as tf
# Load the MNIST data
print "****MNISTデータ読み込み****"
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

print "****Start Tutorial****"
# Softmax (multinomial logistic) regression: y = softmax(xW + b)
x = tf.placeholder("float", [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)
# y_ holds the correct one-hot labels
y_ = tf.placeholder("float", [None, 10])
cross_entropy = -tf.reduce_sum(y_ * tf.log(y))

# In this case, we ask TensorFlow to minimize cross_entropy
# using the gradient descent algorithm with a learning rate of 0.01.
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)

# Initialize the variables and the session
print "****init****"
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)

# Train for 50,000 steps on mini-batches of 100 images
print "****50000回学習と結果表示****"
for i in range(50000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

# Show the result: accuracy on the test set
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
print sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels})

Running this produced the following output:

****MNISTデータ読み込み****
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
****Start Tutorial****
****init****
****50000回学習と結果表示****
0.914

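Even after raising the number of training iterations to 50,000, the accuracy of 0.914 is hardly better than the 0.91 that the beginner model in the Qiita article reached with 1,000 iterations.
The expert model introduced in the same Qiita article reaches 0.992, so I would like to try that version as well.
I cannot make improvements of that kind without studying more, so I plan to keep learning.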
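
For reference, the arithmetic behind the logistic regression model in this post can be sketched in plain NumPy. This is not the TensorFlow code and does not use MNIST: the three Gaussian blobs, the function names, and the step count are assumptions made only to keep the example small and self-contained. It keeps the same ingredients as the model above (zero-initialized W and b, softmax output, summed cross-entropy, gradient descent with a learning rate of 0.01 on mini-batches of 100).

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def train_softmax_regression(x, y_onehot, lr=0.01, steps=1000, batch_size=100, seed=0):
    rng = np.random.RandomState(seed)
    n, d = x.shape
    W = np.zeros((d, y_onehot.shape[1]))
    b = np.zeros(y_onehot.shape[1])
    for _ in range(steps):
        idx = rng.randint(0, n, size=batch_size)   # random mini-batch
        xb, yb = x[idx], y_onehot[idx]
        p = softmax(xb.dot(W) + b)
        # Gradient of -sum(y * log(p)) with respect to the logits is (p - y).
        grad = p - yb
        W -= lr * xb.T.dot(grad)
        b -= lr * grad.sum(axis=0)
    return W, b


def accuracy(x, y_onehot, W, b):
    pred = np.argmax(x.dot(W) + b, axis=1)
    return np.mean(pred == np.argmax(y_onehot, axis=1))


# Hypothetical toy data: three well-separated 2-D Gaussian blobs.
rng = np.random.RandomState(0)
centers = np.array([[0.0, 4.0], [4.0, -4.0], [-4.0, -4.0]])
labels = rng.randint(0, 3, size=600)
x = centers[labels] + rng.randn(600, 2)
y = np.eye(3)[labels]

W, b = train_softmax_regression(x, y)
acc = accuracy(x, y, W, b)
print(acc)
```

Because the model is a single linear layer, more training steps only help until the linear decision boundaries settle, which is consistent with the small gain I saw going from 1,000 to 50,000 iterations.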

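LDA (Latent Dirichlet Allocation) Summary: From an Overview of the Method to a Trial Run

[Contents]
・What is a topic model?
・History of topic models
・What topic models can do
・Background knowledge needed to understand topic models
・Estimation methods for topic models
・Running a topic model (R)
・Evaluating topic models
・Correlated Topic Models (CTM)
・PAM: Pachinko Allocation Model
・Relational Topic Models (RTM)
・References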

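What is a topic model?

・A model that estimates topics from text by assuming that each document contains several latent topics (subject, field, writing style, author, and so on), with a topic distribution for each document and a word distribution for each topic. It is a Bayesian model that can uncover latent topics hidden not only in documents but in many kinds of discrete data, and it is considered effective for discrete data across a wide range of domains.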

History of topic models

1999: pLSA (probabilistic Latent Semantic Analysis)
2003: LDA (Latent Dirichlet Allocation)
2004 onward: extended models
2007 onward: speed-ups for large-scale data

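What topic models can do

・Extract the topics being discussed in a large collection of documents without human intervention.
・Beyond document data, they are also applied to image processing, recommender systems, social network analysis, bioinformatics, music information processing, and more.
・By modeling the data with a probabilistic generative process, they can remove noise and extract the essential information.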

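Background knowledge needed to understand topic models

・Probability
  -Probability distributions
    -Bernoulli distribution
    -Categorical distribution
    -Beta and gamma distributions
    -Dirichlet distribution
・Method of Lagrange multipliers
・Unigrams
  -BOW (bag of words)
  -Unigram model
・Mixture of unigrams model
・Mixture models
・EM algorithm
・Maximum likelihood estimation
・Bayes' theorem
  -Bayesian inference
    -Bayesian predictive distribution
    -Hyperparameter estimation
    -Model selection
    -Variational Bayesian inference
    -Gibbs sampling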

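Estimation methods for topic models

The following three estimation methods have been proposed.
・Maximum likelihood estimation
・Variational Bayesian inference
・Gibbs sampling

Estimation by Gibbs sampling

Integrate out some of the parameters and estimate the posterior distribution of the topic assignments
↓
Derive the joint distribution of the documents and the topic assignments
↓
Sample a topic for each word
↓
Estimate the topic distributions and word distributions from the sampled topics
↓
Estimate the hyperparameters α and β that maximize the marginal joint likelihood

Note: Gibbs sampling for LDA is considered computationally efficient because LDA is built on conjugate priors.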
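
The Gibbs sampling steps above can be sketched roughly as follows. This is a minimal illustration rather than a reference implementation: the function name, the toy corpus, and the fixed symmetric values of alpha and beta are my own assumptions, and the last step (estimating the hyperparameters α and β) is left out for brevity.

```python
import numpy as np


def lda_collapsed_gibbs(docs, n_topics, n_vocab, alpha=0.1, beta=0.01,
                        n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA with fixed symmetric alpha and beta.

    docs is a list of documents, each a list of integer word ids.
    Returns (theta, phi): per-document topic and per-topic word distributions.
    """
    rng = np.random.RandomState(seed)
    n_dk = np.zeros((len(docs), n_topics))  # tokens in document d assigned to topic k
    n_kv = np.zeros((n_topics, n_vocab))    # times word v is assigned to topic k
    n_k = np.zeros(n_topics)                # total tokens assigned to topic k

    # Start from a random topic assignment for every token.
    z = [rng.randint(0, n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for v, k in zip(doc, z[d]):
            n_dk[d, k] += 1
            n_kv[k, v] += 1
            n_k[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, v in enumerate(doc):
                k = z[d][i]
                # Take the current token out of the counts.
                n_dk[d, k] -= 1
                n_kv[k, v] -= 1
                n_k[k] -= 1
                # Conditional of this token's topic given all other assignments
                # (the topic and word distributions are integrated out).
                p = (n_dk[d] + alpha) * (n_kv[:, v] + beta) / (n_k + beta * n_vocab)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                n_dk[d, k] += 1
                n_kv[k, v] += 1
                n_k[k] += 1

    # Point estimates of the distributions from the final sample.
    theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
    phi = (n_kv + beta) / (n_kv + beta).sum(axis=1, keepdims=True)
    return theta, phi


# Hypothetical toy corpus: word ids 0-2 and 3-5 never share a document.
docs = [[0, 1, 2, 0, 1, 2, 0, 1],
        [0, 0, 1, 2, 2, 1, 0, 2],
        [3, 4, 5, 3, 4, 5, 5, 3],
        [4, 4, 3, 5, 5, 3, 4, 5]]
theta, phi = lda_collapsed_gibbs(docs, n_topics=2, n_vocab=6)
print(np.round(theta, 2))
print(np.round(phi, 2))
```

The conjugacy mentioned in the note is what makes the conditional inside the loop a simple ratio of counts.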

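Reportedly, there is research showing that if the hyperparameters are learned appropriately from the data, MAP estimation, variational Bayesian inference, and Gibbs sampling do not differ much in performance. (With MAP estimation, however, you have to run cross-validation to find the hyperparameters that maximize the likelihood.)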

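Running a topic model (R)

The following packages can be used, although they do not yet support newer methods.
lda (PDF on CRAN)
topicmodels (PDF on CRAN)

Below is code that runs the topicmodels package. If you have data in BOW format, you can run it right away.
You do, however, need to tune the hyperparameters while computing measures such as perplexity.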

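library(topicmodels)
k <- 10          # number of topics
alpha <- 50 / k  # assumed starting value (the package default); tune as needed
delta <- 0.1     # assumed starting value (the package default); tune as needed
# bagofwords is assumed to be a document-term matrix of word counts
LDA_estimate <- LDA(bagofwords, k, method = "Gibbs",
                    control = list(alpha = alpha, delta = delta, verbose = 1,
                                   iter = 10000, burnin = 1000))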

If I have time, I would like to read the source code so that I can write it myself.
Incidentally, HDP-LDA appears to be available in gensim for Python.

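Evaluating topic models

Perplexity

-A measure of the performance of a probabilistic model, computed on test data.
-It can be computed from the negative log-likelihood, as the exponential of the negative log-likelihood per word.
-A low perplexity indicates a good probabilistic model that predicts the test data accurately.
-A model in which every vocabulary word appears with uniform probability has a perplexity equal to the vocabulary size V, which can be regarded as poor performance.
-The number of topics and the parameters need to be set so that the perplexity becomes small.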
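
As a small check of the points above, the sketch below computes perplexity from a log-likelihood and confirms that a uniform model over V words gives a perplexity of about V. The function name and the numbers (V = 1000, 5000 test tokens) are hypothetical values chosen for illustration.

```python
import math


def perplexity(log_likelihood, n_tokens):
    """Exponential of the negative log-likelihood per token."""
    return math.exp(-log_likelihood / n_tokens)


V = 1000          # vocabulary size (hypothetical)
n_tokens = 5000   # number of tokens in the test data (hypothetical)

# Under a uniform model every token has probability 1/V.
uniform_log_likelihood = n_tokens * math.log(1.0 / V)
ppl = perplexity(uniform_log_likelihood, n_tokens)
print(ppl)  # approximately V, up to floating-point error
```

Any model worth using should therefore come in well below V on held-out data.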

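Correlated Topic Models (CTM)

Topic models rest on the assumption that there is no correlation between the topics k, but are topics really uncorrelated? It is quite likely that topics which are in fact correlated are being forced apart into uncorrelated ones, so a model that takes correlation into account is needed. CTM therefore uses a multivariate normal distribution when determining the topic proportions, so that the topics can be correlated. The price is that the prior is no longer conjugate, which makes conventional Gibbs sampling inefficient, so variational Bayes has to be used instead.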
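
The way CTM draws topic proportions can be sketched as below: sample a vector from a multivariate normal and push it through a softmax so that it lies on the simplex. The mean, the covariance matrix (topics 0 and 1 positively correlated, topic 2 independent), and the function name are assumptions for illustration only, and the inference step is not shown.

```python
import numpy as np


def sample_topic_proportions(mu, sigma, n_docs, rng):
    """Draw eta ~ N(mu, sigma) and map each draw onto the simplex with softmax."""
    eta = rng.multivariate_normal(mu, sigma, size=n_docs)
    shifted = eta - eta.max(axis=1, keepdims=True)  # numerical stability
    theta = np.exp(shifted)
    theta /= theta.sum(axis=1, keepdims=True)
    return eta, theta


rng = np.random.RandomState(0)
mu = np.zeros(3)
# Hypothetical covariance: topics 0 and 1 tend to appear together.
sigma = np.array([[1.0, 0.8, 0.0],
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])

eta, theta = sample_topic_proportions(mu, sigma, 5000, rng)
corr = np.corrcoef(eta, rowvar=False)
print(np.round(corr, 2))   # off-diagonal entry for topics 0 and 1 is clearly positive
print(theta[:3])           # each row is a topic distribution for one document
```

A Dirichlet prior has no parameter that plays the role of the off-diagonal entries of sigma, which is the gap CTM fills.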

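PAM: Pachinko Allocation Model

In CTM all topics sit at the same level, so it cannot express a hierarchy among topics.
It is easy to imagine that topics have a hierarchical structure, just as sedans and trucks fall under automobiles, or wine and beer under alcoholic drinks. PAM is a model that expresses relationships and correlations among topics in a general way: it is built on a hierarchy of topics and generates words the way a pachinko ball falls. The falling pachinko ball suggests a directed acyclic graph whose edges point in one direction only. The distributions involved are conjugate, so Gibbs sampling can be used.
As of 2016/04/24, a PAM implementation does not yet appear to have been developed for Gensim.
Pachinko Allocation Model
Incidentally, the paper is here:
Pachinko Allocation: DAG-Structured Mixture Models of Topic Correlations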
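
To picture the falling pachinko ball, here is a rough sketch of how one document could be generated in a four-level hierarchy (root, super-topics, sub-topics, words). The sizes (2 super-topics, 4 sub-topics, 6 words), the Dirichlet parameters, and the function name are all hypothetical, and this shows only the generative side, not the Gibbs sampler.

```python
import numpy as np


def generate_document(n_words, alpha_root, alpha_super, phi, rng):
    """Generate (super-topic, sub-topic, word) triples for one document."""
    # Per-document distributions at the root and at each super-topic.
    theta_root = rng.dirichlet(alpha_root)
    theta_super = np.array([rng.dirichlet(a) for a in alpha_super])
    path = []
    for _ in range(n_words):
        # The "ball" drops from the root to a super-topic, then to a
        # sub-topic, and finally lands on a word.
        s = rng.choice(len(theta_root), p=theta_root)
        k = rng.choice(theta_super.shape[1], p=theta_super[s])
        w = rng.choice(phi.shape[1], p=phi[k])
        path.append((s, k, w))
    return path


rng = np.random.RandomState(0)
alpha_root = np.array([1.0, 1.0])
# Hypothetical: super-topic 0 prefers sub-topics 0-1, super-topic 1 prefers 2-3.
alpha_super = np.array([[5.0, 5.0, 0.5, 0.5],
                        [0.5, 0.5, 5.0, 5.0]])
# Word distribution of each sub-topic, shared across documents.
phi = rng.dirichlet(np.ones(6) * 0.5, size=4)

doc = generate_document(20, alpha_root, alpha_super, phi, rng)
print(doc[:5])
```

Sub-topics that share a super-topic tend to appear in the same documents, which is how the hierarchy expresses correlation between topics.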

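Relational Topic Models (RTM)

A method that builds a probabilistic model not only of the content of the documents but also, at the same time, of the process that generates the links between documents. It can be applied to papers and patent data. It might also be usable for customer segmentation and product recommendation based on past purchasing behavior.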

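References

『トピックモデル (機械学習プロフェッショナルシリーズ)』 (Topic Models, Machine Learning Professional Series)
『トピックモデルによる統計的潜在意味解析 (自然言語処理シリーズ)』 (Statistical Latent Semantic Analysis with Topic Models, Natural Language Processing Series)
Tokyo Webmining 46th 『トピックモデルことはじめ』 (Getting Started with Topic Models)
machine_learning_python/topic.md

統計数理研究所 H24年度公開講座「確率的トピックモデル」サポートページ (support page for the Institute of Statistical Mathematics FY2012 open lecture "Probabilistic Topic Models")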

Support Vector Machines (SVM): Summary and References

In this post I list the references I found while studying support vector machines (SVM).

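Blogs

R言語でSVM(Support Vector Machine)による分類学習 (Classification with SVM in R)
A blog post that introduces support vector machines using R, the free statistical analysis software.

SVMの定番入門書「サポートベクターマシン入門(赤本)」の読み方 (How to read the standard SVM introductory text, the "red book")
Introduces literature that is useful when studying support vector machines.

ところでサポートベクターマシンって何なの？ (So what is a support vector machine, anyway?)
Includes program code and worked examples. The code appears to be Java, though.

SVM を使うと，なにが嬉しいの？ (What do you gain by using SVM?)
Explains the motivation behind the support vector machine as a method.

The defining feature of SVM is "margin maximization"

SVM has a clear criterion for deciding where the decision boundary goes: it takes the training points that lie closest to the other class as the reference and places the boundary where the Euclidean distance to those points is largest. Having such a clear criterion at all is said to be unmatched among nonparametric methods, and it is regarded as the greatest strength of SVM.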
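
The margin criterion can be checked by hand on a tiny example. The sketch below does not train an SVM; it only measures, for two hypothetical candidate boundaries w·x + b = 0, the distance from the boundary to the closest training point, which is the quantity SVM maximizes. The points, labels, and candidates are made up for illustration.

```python
import math


def geometric_margin(w, b, points, labels):
    """Smallest signed distance from the boundary w.x + b = 0 to any point."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )


# Hypothetical two-class data in 2-D (labels are +1 / -1).
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

# Two candidate boundaries that both separate the classes.
candidates = {
    "A": ((1.0, 1.0), -2.0),   # the line x1 + x2 = 2
    "B": ((1.0, 0.0), -1.0),   # the line x1 = 1
}

margins = {name: geometric_margin(w, b, points, labels)
           for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(margins)
print(best)
```

Both lines classify every point correctly, but A keeps the closest points farther away, so it is the one the margin criterion prefers.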

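Ｒとカーネル法・サポートベクターマシン (R, kernel methods, and support vector machines)
Contains worked examples of support vector machines in R. It is hosted on a Doshisha University page, so it may be a bit more trustworthy.

SVMを使いこなす！チェックポイント8つ (Mastering SVM: eight checkpoints)
・Scaling (adjusting the features)
・Categorical features (creating dummy variables)
・Kernel function
・Parameters
・Cross-validation
・Imbalanced data (increase the cost parameter C, equalize the number of samples, under- or oversampling)
・Multiclass classification
・Ensemble learning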
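
The first two checkpoints are easy to show in code. Below is a minimal sketch of min-max scaling and of dummy-variable (one-hot) encoding; the helper names and the sample columns are hypothetical, and in practice a library routine would normally be used instead.

```python
def min_max_scale(column):
    """Rescale a numeric column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def one_hot(column):
    """Encode a categorical column as dummy variables (sorted category order)."""
    categories = sorted(set(column))
    return [[1 if v == c else 0 for c in categories] for v in column]


ages = [10, 20, 30]                 # hypothetical numeric feature
colors = ["red", "blue", "red"]     # hypothetical categorical feature

print(min_max_scale(ages))   # [0.0, 0.5, 1.0]
print(one_hot(colors))       # [[0, 1], [1, 0], [0, 1]]  (columns: blue, red)
```

Scaling matters for SVM because kernels such as the RBF kernel depend on distances, so a feature with a large range would otherwise dominate.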

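Papers and reports

サポートベクターマシン入門 (Introduction to Support Vector Machines)
A report from the National Institute of Advanced Industrial Science and Technology (AIST). It is written very clearly.

痛快!サポートベクトルマシン : 古くて新しいパターン認識手法 (Support vector machines: an old yet new pattern recognition method)
It is fairly short and somewhat dated, but well suited to quickly grasping the overall flow.

サポートベクターマシンによる倒産予測 (Bankruptcy prediction with support vector machines)
This is a graduation thesis.

企業格付判別のための SVM 手法の提案および逐次ロジットモデルとの比較による有効性検証 (Proposal of an SVM method for corporate credit rating classification and validation of its effectiveness by comparison with a sequential logit model)
http://www.orsj.or.jp/~archive/pdf/j_mag/Vol.57_J_092.pdf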

Slides

SVMについて (About SVM)
It only covers two-class classification, but I think it is a neatly organized set of slides.

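Notes

SVM
Advantages
  ・Classification accuracy stays good even as the feature dimensionality grows
  ・Few parameters need to be optimized
  ・The parameters are easy to compute

Disadvantages
  ・The computational cost becomes enormous as the training data grows
    (the effect of the "curse of dimensionality" is pronounced)
  ・Basically applicable only to two-class classification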
