Thread starter: Nicolle

Apache Mahout Essentials [Share]


Nicolle (student verified) posted on 2015-8-27 08:00:36
Book Description

Apache Mahout is a scalable machine learning library with algorithms for clustering, classification, and recommendations. It lets users analyze patterns in large, diverse, and complex datasets quickly and at scale.

This book is an all-inclusive guide to analyzing large and complex datasets with Apache Mahout. It explains complicated but highly effective machine learning algorithms in simple terms, grounded in real-world practical examples.

Starting from the fundamental concepts of machine learning and Apache Mahout, this book guides you through Mahout's implementations of machine learning techniques including classification, clustering, and recommendations. Along the way, each technique is illustrated with real-world applications, a range of popular algorithms and their implementations, code examples, evaluation strategies, and best practices. Finally, you will learn data visualization techniques for Apache Mahout to bring your data to life.

Book Details

Publisher:    Packt Publishing
By:           Jayani Withanawasam
ISBN:         978-1-78355-499-7
Year:         2015
Pages:        164
Language:     English
File size:    8 MB
File format:  PDF



Nicolle (student verified) posted on 2016-4-16 10:15:50

K-Means Clustering using Apache Mahout

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.conversion.InputDriver;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.clustering.kmeans.RandomSeedGenerator;
import org.apache.mahout.common.HadoopUtil;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.utils.clustering.ClusterDumper;

public class KMeansClusteringExample {

  private static final String DIRECTORY_CONTAINING_CONVERTED_INPUT = "Kmeansdata";

  public static void main(String[] args) throws Exception {
    // Path to the output folder
    Path output = new Path("Kmeansoutput");

    // Hadoop configuration details; delete any previous output
    Configuration conf = new Configuration();
    HadoopUtil.delete(conf, output);

    // Cluster the data in "KmeansTest" into k = 2 clusters,
    // with convergence delta 0.5 and at most 10 iterations
    run(conf, new Path("KmeansTest"), output, new EuclideanDistanceMeasure(), 2, 0.5, 10);
  }

  public static void run(Configuration conf, Path input, Path output, DistanceMeasure measure,
      int k, double convergenceDelta, int maxIterations) throws Exception {

    // Input must be given in sequence file format; convert the text input first
    Path directoryContainingConvertedInput = new Path(output, DIRECTORY_CONTAINING_CONVERTED_INPUT);
    InputDriver.runJob(input, directoryContainingConvertedInput,
        "org.apache.mahout.math.RandomAccessSparseVector");

    // Pick the initial cluster centers randomly
    Path clusters = new Path(output, "random-seeds");
    clusters = RandomSeedGenerator.buildRandom(conf, directoryContainingConvertedInput, clusters, k, measure);

    // Run K-Means with the given k
    KMeansDriver.run(conf, directoryContainingConvertedInput, clusters, output, convergenceDelta,
        maxIterations, true, 0.0, false);

    // Run ClusterDumper to display the result
    Path outGlob = new Path(output, "clusters-*-final");
    Path clusteredPoints = new Path(output, "clusteredPoints");

    ClusterDumper clusterDumper = new ClusterDumper(outGlob, clusteredPoints);
    clusterDumper.printClusters(null);
  }
}
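A note on the input format: InputDriver.runJob above expects plain text files in which each line is one vector of whitespace-separated numeric values. A minimal sketch of a KmeansTest input file under that assumption (the 2-D points below are illustrative, not from the book):

1.0 1.0
1.5 2.0
8.0 8.0
9.0 8.5

With k = 2 and Euclidean distance, K-Means should separate the first two points from the last two.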
Nicolle (student verified) posted on 2016-4-16 10:20:16

Text clustering with the K-Means algorithm

The following example demonstrates text document clustering with the K-Means algorithm from the command line. The input text documents contain Wikipedia content. The files are copied to HDFS and processed with Hadoop MapReduce; you can try the commands in local mode as well. Perform the following steps:

1. Copy the files from the local filesystem to HDFS:
hdfs dfs -put kmeans/input/ kmeans/input

2. List the copied files to verify the upload:
hdfs dfs -ls kmeans/input

3. Convert the input to sequence files:
mahout seqdirectory -i kmeans/input/ -o kmeans/sequencefiles

4. Display the generated sequence files:
hdfs dfs -text kmeans/sequencefiles/part-m-00000

5. Generate TF-IDF vectors from the sequence files:
mahout seq2sparse -i kmeans/sequencefiles -o kmeans/sparse

6. Execute the K-Means algorithm with cosine distance, at most 10 iterations, and k = 2:
mahout kmeans -i kmeans/sparse/tfidf-vectors/ -c kmeans/cl -o kmeans/out -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 2 -cl

The output directory, kmeans/out in this case, will contain the intermediate cluster directories and the final clusters.

7. Display the results of the K-Means clustering:
mahout clusterdump -dt sequencefile -d kmeans/sparse/dictionary.file-0 -i kmeans/out/clusters-1-final
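For orientation, clusterdump prints one block per cluster: a cluster ID with the number of points n, the centroid c, and the radius r, followed by the top weighted terms when a dictionary is supplied via -d. Schematically (the IDs, terms, and weights depend entirely on your data; this layout is an assumption about ClusterDumper's usual output, not output reproduced from the book):

VL-123{n=..., c=[...], r=[...]}
    Top Terms:
        term => weight
        term => weight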
Nicolle (student verified) posted on 2016-4-16 10:27:42

Text classification using Naïve Bayes – the Spark implementation

We described how to set up Spark in detail in the previous section (on linear regression). Here are the steps you need to perform:

1. Set the SPARK_HOME path.
2. Start the Spark server.
3. Train on the dataset and generate the model:
mahout spark-trainnb -i 20news-train-vectors -el -o model -li labelindex -ow
4. Test and evaluate the model:
mahout spark-testnb -i 20news-train-vectors -m model -l labelindex -ow -o 20news-testing

Apache Mahout also contains an example script that runs all of these steps in one go. It covers several classification algorithms, such as CNaiveBayes, Naïve Bayes, and SGD. To execute the script, use the following command:

$MAHOUT_HOME/examples/bin/classify-20newsgroups.sh
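If you later want to score documents from Java rather than through the CLI, Mahout's Naïve Bayes classes can materialize a trained model from HDFS. A minimal sketch, assuming a model produced by the MapReduce trainnb job (whether a spark-trainnb model is readable the same way is not guaranteed; the paths and vector values are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class NaiveBayesScoringSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Load a model directory written by trainnb ("model" is a placeholder path)
    NaiveBayesModel model = NaiveBayesModel.materialize(new Path("model"), conf);
    StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);

    // One TF-IDF document vector; the cardinality must match the
    // training dictionary size (10000 is a placeholder)
    Vector instance = new RandomAccessSparseVector(10000);
    instance.setQuick(42, 1.0); // placeholder feature weight

    // classifyFull returns one score per label; the highest score wins
    Vector scores = classifier.classifyFull(instance);
    System.out.println("Predicted label index: " + scores.maxValueIndex());
  }
}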
Nicolle (student verified) posted on 2016-4-16 10:32:33

Movie Recommendations using Apache Mahout

The Java code example for user-based recommendations is given as follows (a sketch of the movie.csv layout appears after the example):

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

// Load the ratings, compare users with Pearson correlation,
// and build a user-based recommender over the 2 nearest neighbors
DataModel model = new FileDataModel(new File("movie.csv"));
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

// Top two recommendations for user 3
List<RecommendedItem> recommendations = recommender.recommend(3, 2);
for (RecommendedItem recommendation : recommendations) {
  System.out.println(recommendation);
}

In this example, we want the top two recommendations for user 3 (Nimal). The users rated the following movies, with ratings given in parentheses where available:

User 3 > Item 1 (5), Item 4 (8), Item 5 (9), Item 7 (10)
User 4 > Item 1, Item 4, Item 6 (8), Item 3 (6)
User 5 > Item 1, Item 2, Item 3 (4), Item 4, Item 5, Item 6 (8)

The code example produces this result:

RecommendedItem[item:6, value:8.0]
RecommendedItem[item:3, value:5.181073]
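FileDataModel expects comma-separated lines in the form userID,itemID,preference (an optional timestamp may follow). A sketch of the movie.csv rows implied by user 3's ratings above; the remaining rows of the file are not shown in the post:

3,1,5.0
3,4,8.0
3,5,9.0
3,7,10.0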
Nicolle (student verified) posted on 2016-4-16 10:35:23

Item-based recommenders with Spark

An item-based recommender can also be executed on top of Spark. The steps to set up the Spark server are given in detail in Chapter 3, Regression and Classification.

1. Start the Spark servers:
$SPARK_HOME/sbin/start-all.sh
2. Prepare the input data (only the user ID and item ID, no preference values); a sample is sketched after these steps.
3. Copy the input data to HDFS.
4. Execute the following mahout command to generate recommendations for each item:
mahout spark-itemsimilarity --input inputfile --output outputdirectory

movie.csv can be used as inputfile, or you can supply your own data file.

The generated indicator matrix can be found in the outputdirectory/indicator-matrix/ directory. For each item, it lists the similar items along with a similarity value, in the following format:

itemIDx  itemIDy:valuey itemIDz:valuez
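spark-itemsimilarity reads delimited text rows containing a user ID and an item ID per interaction. A minimal sketch of such an input file, assuming the default comma delimiter (the IDs below are illustrative placeholders, not data from the book):

1,101
1,102
2,101
2,103
3,102

Each row records one user-item interaction; the job then computes, for every item, the other items that co-occur with it unusually often.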