
[Case Study] Random Forest Using Apache Mahout [Share]


Posted by Nicolle (student verified) on 2015-6-22 02:07:26

Random Forest Using Apache Mahout


Mahout includes an implementation of the random forest algorithm. It is easy to understand and use, so let's get started.

Dataset

We will use the NSL-KDD dataset. Since 1999, KDD'99 has been the most widely used dataset for the evaluation of anomaly detection methods. This dataset was prepared by S. J. Stolfo and is built from the data captured in the DARPA'98 IDS evaluation program.

Note

In KDDTrain+_20Percent.ARFF and KDDTest+.ARFF, remove the first 44 lines (that is, all lines starting with @attribute). If this is not done, we will not be able to generate a descriptor file.
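A quick way to strip those header lines while the files are still on the local filesystem is sed's address-range delete. This is only a sketch: the /tmp paths follow the upload commands later in this post, and the line count comes from the note above.

```shell
# Sketch: delete the 44 ARFF header lines in place before uploading.
# The /tmp paths are assumptions matching the hadoop fs -put commands below.
sed -i '1,44d' /tmp/KDDTrain+_20Percent.arff
sed -i '1,44d' /tmp/KDDTest+.arff
```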




Steps

The steps to implement the Random forest algorithm in Apache Mahout are as follows:

  • Transfer the training and test datasets to HDFS using the following commands:

    hadoop fs -mkdir /user/hue/KDDTrain
    hadoop fs -mkdir /user/hue/KDDTest
    hadoop fs -put /tmp/KDDTrain+_20Percent.arff /user/hue/KDDTrain
    hadoop fs -put /tmp/KDDTest+.arff /user/hue/KDDTest

  • Generate the descriptor file. Before you build a random forest model from the training data in KDDTrain+_20Percent.arff, a descriptor file is required: every attribute in the training dataset needs to be labeled so that the algorithm can tell which attributes are numerical and which are categorical. Use the following command to generate the descriptor file:
    hadoop jar $MAHOUT_HOME/core/target/mahout-core-xyz-job.jar \
      org.apache.mahout.classifier.df.tools.Describe \
      -p /user/hue/KDDTrain/KDDTrain+_20Percent.arff \
      -f /user/hue/KDDTrain/KDDTrain+.info \
      -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L
    Jar: the Mahout core jar (xyz stands for the version). If you installed Mahout directly, it can be found under the /usr/lib/mahout folder. The main class Describe is used here and takes three parameters:

    • -p: path of the data to be described.

    • -f: location for the generated descriptor file.

    • -d: description of the attributes in the data. N 3 C 2 N C 4 N C 8 N 2 C 19 N L means the dataset starts with one numeric attribute (N), followed by three categorical attributes (3 C), and so on; the final L marks the label.

    The output of the previous command is shown in a screenshot in the original post.
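To sanity-check the descriptor string, you can expand its run-length notation and count the columns: NSL-KDD has 41 attributes plus the label, so the expansion should yield 42 entries. This helper loop is my own sketch, not part of Mahout:

```shell
# Expand Mahout's run-length descriptor notation: a number repeats the
# type token that follows it, e.g. "3 C" means three categorical attributes.
spec="N 3 C 2 N C 4 N C 8 N 2 C 19 N L"
count=1
for tok in $spec; do
  case $tok in
    [0-9]*) count=$tok ;;
    *) for _ in $(seq "$count"); do echo "$tok"; done; count=1 ;;
  esac
done | wc -l   # 42 lines: 41 attributes + 1 label
```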

  • Build the random forest using the following command:

    hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-xyz-job.jar \
      org.apache.mahout.classifier.df.mapreduce.BuildForest \
      -Dmapred.max.split.size=1874231 \
      -d /user/hue/KDDTrain/KDDTrain+_20Percent.arff \
      -ds /user/hue/KDDTrain/KDDTrain+.info \
      -sl 5 -p -t 100 -o /user/hue/nsl-forest
    Jar: the Mahout examples jar (xyz stands for the version). If you installed Mahout directly, it can be found under the /usr/lib/mahout folder. The main class BuildForest builds the forest; its arguments are as follows:

    • -Dmapred.max.split.size: tells Hadoop the maximum size of each partition.

    • -d: path to the data.

    • -ds: location of the descriptor file.

    • -sl: number of attributes to select randomly at each tree node. Here, each tree is built by choosing among five randomly selected attributes per node.

    • -p: use the partial-data implementation.

    • -t: number of trees to grow. Here, the command builds 100 trees using the partial implementation.

    • -o: output path that will contain the decision forest.
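With the partial implementation, Hadoop starts one mapper per input partition and each mapper grows an equal share of the trees, so the split size above effectively sets the degree of parallelism. A hedged sketch for deriving the value from the file size (the helper below is my own, and the path and mapper count are assumptions):

```shell
# Sketch: choose mapred.max.split.size so the file splits into N partitions.
# FILE and N are assumptions; stat -c%s (GNU coreutils) prints size in bytes.
FILE=/tmp/KDDTrain+_20Percent.arff
N=2
SIZE=$(stat -c%s "$FILE")
echo $(( SIZE / N + 1 ))   # pass this value via -Dmapred.max.split.size
```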



    In the end, the result of the process is shown in a screenshot in the original post.

  • Use this model to classify the new dataset:
    hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-xyz-job.jar \
      org.apache.mahout.classifier.df.mapreduce.TestForest \
      -i /user/hue/KDDTest/KDDTest+.arff \
      -ds /user/hue/KDDTrain/KDDTrain+.info \
      -m /user/hue/nsl-forest -a -mr \
      -o /user/hue/predictions
    Jar: the Mahout examples jar (xyz stands for the version). If you installed Mahout directly, it can be found under the /usr/lib/mahout folder. The class TestForest tests the forest and takes the following parameters:

    • -i: path of the test data.

    • -ds: location of the descriptor file.

    • -m: location of the forest generated by the previous command.

    • -a: run the analyzer to compute the confusion matrix.

    • -mr: tell Hadoop to distribute the classification.

    • -o: location to store the predictions in.



    The job produces a confusion matrix (shown as a screenshot in the original post).




So, from the confusion matrix, it is clear that 9,396 normal instances were correctly classified, while 315 normal instances were incorrectly classified as anomalies. The accuracy is 77.7635 percent (correctly classified instances divided by all classified instances). The output file in the predictions folder contains a list of 0s and 1s, where 0 denotes a normal record and 1 denotes an anomaly.
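Once the predictions are copied out of HDFS, the 0/1 labels can be tallied in one line. This is a sketch: the local filename predictions.txt, and the one-label-per-line format, are assumptions about how the output file was exported.

```shell
# Sketch: tally exported predictions (0 = normal, 1 = anomaly).
# predictions.txt with one 0/1 label per line is an assumption
# about the exported HDFS output.
sort predictions.txt | uniq -c
```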


Reference

Learning Apache Mahout Classification by Ashish Gupta


