OP: Nicolle

[Case Study] Naïve Bayes using Apache Mahout [Share]


Nicolle (student verified) posted on 2015-6-22 01:43:01

Naïve Bayes using Apache Mahout


We will use the 20 newsgroups dataset for this exercise. It is a standard benchmark commonly used in machine learning research, compiled from several months of postings to 20 Usenet newsgroups in the early 1990s. The dataset consists of messages, one per file. Each file begins with header lines that record details such as who sent the message, how long it is, what software was used, and the subject; a blank line follows, after which the message body appears as unformatted text.
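Each message can therefore be split at that first blank line. A minimal sketch of reading just the body (the sample filename is hypothetical; real files sit under the directories created in the steps below):

```shell
# Hypothetical sample path; any message file from the dataset works.
msg=/tmp/20newsdataall/51121

# Print only the body: set a flag at the first blank line and print
# every line after it, skipping the header block entirely.
awk 'body { print } /^$/ && !body { body = 1 }' "$msg"
```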

  • Download the 20news-bydate.tar.gz dataset from http://qwone.com/~jason/20Newsgroups/. The following steps build the Naïve Bayes classifier using Mahout.
  • Create a 20newsdata directory and unpack the data there:

    mkdir /tmp/20newsdata
    cd /tmp/20newsdata
    tar -xzvf /tmp/20news-bydate.tar.gz
  • You will see two folders under 20newsdata: 20news-bydate-test and 20news-bydate-train. Now create another directory called 20newsdataall and merge the training and test data of the 20 newsgroups.
  • Return to the home directory and execute the following:

    mkdir /tmp/20newsdataall
    cp -R /tmp/20newsdata/*/* /tmp/20newsdataall
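A quick sanity check after the merge: the 20news-bydate release contains 11,314 training and 7,532 test messages, so the merged directory should hold 18,846 files in total (if I recall the release counts correctly):

```shell
# Count the merged message files; expect 18846 for the full
# 20news-bydate train+test merge.
find /tmp/20newsdataall -type f | wc -l
```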
  • Create a directory in Hadoop and copy the data into HDFS:

    hadoop fs -mkdir /user/hue/20newsdata
    hadoop fs -put /tmp/20newsdataall /user/hue/20newsdata
  • Convert the raw data into sequence files. The seqdirectory command generates sequence files from a directory. A sequence file is a flat file consisting of binary key/value pairs, the format Hadoop processes natively; we convert the messages so that they can be processed in Hadoop:

    bin/mahout seqdirectory -i /user/hue/20newsdata/20newsdataall -o /user/hue/20newsdataseq-out

  • Convert the sequence files into sparse vectors (-lnorm applies log normalization, -nv emits named vectors, and -wt tfidf selects TF-IDF weighting):

    bin/mahout seq2sparse -i /user/hue/20newsdataseq-out/part-m-00000 -o /user/hue/20newsdatavec -lnorm -nv -wt tfidf
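The tfidf weighting selected above scores each term by its frequency in a document times the log of the inverse of its document frequency. A toy illustration of the idea (not Mahout's exact formula, which adds smoothing and normalization):

```shell
# Three tiny "documents"; score the term "cat" in document 1.
d1="cat sat cat"; d2="dog sat"; d3="bird flew"

# tf: occurrences of "cat" in document 1
tf=$(printf '%s\n' $d1 | grep -c '^cat$')

# df: number of documents containing "cat" (N = 3 documents total)
df=0
for d in "$d1" "$d2" "$d3"; do
  case " $d " in *" cat "*) df=$((df+1));; esac
done

# tf-idf = tf * ln(N / df)  ->  2 * ln(3) ~= 2.1972
awk -v tf="$tf" -v df="$df" -v n=3 'BEGIN { printf "%.4f\n", tf * log(n/df) }'
```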


  • Split the vectors into training and test sets (--randomSelectionPct 40 holds out a random 40% of the data for testing):

    bin/mahout split -i /user/hue/20newsdatavec/tfidf-vectors --trainingOutput /user/hue/20newsdatatrain --testOutput /user/hue/20newsdatatest --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
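The same random-split idea can be sketched on a plain file list (illustrative only; mahout split operates on the sequence-file vectors, and the /tmp paths here are made up for the sketch):

```shell
# Illustrative 40/60 random split of a list of document names.
printf '%s\n' doc1 doc2 doc3 doc4 doc5 doc6 doc7 doc8 doc9 doc10 \
  | shuf > /tmp/all.txt

total=$(wc -l < /tmp/all.txt)
ntest=$(( total * 40 / 100 ))                 # 40% held out for testing
head -n "$ntest" /tmp/all.txt > /tmp/test.txt
tail -n +"$(( ntest + 1 ))" /tmp/all.txt > /tmp/train.txt
```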
  • Train the model:

    bin/mahout trainnb -i /user/hue/20newsdatatrain -el -o /user/hue/model -li /user/hue/labelindex -ow -c
  • Test the model using the following command:

    bin/mahout testnb -i /user/hue/20newsdatatest -m /user/hue/model/ -l /user/hue/labelindex -ow -o /user/hue/results
  • We get the result of our Naïve Bayes classifier for the 20 newsgroups: testnb prints summary statistics and a confusion matrix over the 20 categories.

Reference

Learning Apache Mahout Classification by Ashish Gupta
