Original poster: Nicolle

[Case Study]Logistic Regression Using Apache Mahout [分享]


Nicolle (verified student), posted on 2015-6-22 01:41:25

Logistic Regression Using Mahout


Mahout provides an implementation of logistic regression trained with stochastic gradient descent (SGD). It is easy to understand and use, so let's get started.
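As a refresher on what the trainer does: SGD visits one example at a time and nudges the weights to reduce the logistic (log) loss. The following is a minimal, self-contained sketch of one update step in plain Java; it illustrates the idea only and is not Mahout's actual implementation.

```java
public class SgdSketch {

    // Logistic (sigmoid) function: maps a raw score to a probability in (0, 1)
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // One SGD update: move the weights along the gradient of the log-likelihood
    // for a single example x with label y (0 or 1), scaled by the learning rate
    static void sgdStep(double[] w, double[] x, int y, double rate) {
        double z = 0.0;
        for (int i = 0; i < w.length; i++) {
            z += w[i] * x[i];
        }
        double error = y - sigmoid(z);
        for (int i = 0; i < w.length; i++) {
            w[i] += rate * error * x[i];
        }
    }

    public static void main(String[] args) {
        double[] w = {0.0, 0.0};
        double[] x = {1.0, 2.0};
        // Repeated updates on one positive example drive its predicted probability up
        for (int pass = 0; pass < 100; pass++) {
            sgdStep(w, x, 1, 0.1);
        }
        System.out.println(sigmoid(w[0] * x[0] + w[1] * x[1]));
    }
}
```

Mahout's --rate flag corresponds to the rate argument here, and --passes to how many times the loop runs over the full dataset.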


Dataset

We will use the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, a breast cancer tumor dataset that has been available since 1995. It contains 569 instances of breast tumor cases, each with 30 features used to predict the diagnosis, which is either benign or malignant.

  • Make the target class numeric. In this case, the second field, Diagnosis, is the target variable. We will change malignant (M) to 0 and benign (B) to 1. Use the following code snippet to make the change.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public void convertTargetToInteger() throws IOException {
  // Read the original data
  BufferedReader br = new BufferedReader(new FileReader("wdbc.csv"));
  String line = null;
  // Create the file to save the converted data
  File wdbcData = new File("<Your Destination location for file.>");
  FileWriter fw = new FileWriter(wdbcData);
  // Add a header to the new file
  fw.write("ID_Number,Diagnosis,Radius,Texture,Perimeter,Area,Smoothness,"
      + "Compactness,Concavity,ConcavePoints,Symmetry,Fractal_Dimension,"
      + "RadiusStdError,TextureStdError,PerimeterStdError,AreaStdError,"
      + "SmoothnessStdError,CompactnessStdError,ConcavityStdError,ConcavePointStdError,"
      + "Symmetrystderror,FractalDimensionStderror,WorstRadius,worsttexture,"
      + "worstperimeter,worstarea,worstsmoothness,worstcompactness,worstconcavity,"
      + "worstconcavepoints,worstsymmentry,worstfractaldimensions\n");
  // Read line by line; the second field (parts[1]) holds the diagnosis,
  // which we map to a numeric value: M (malignant) -> 0, B (benign) -> 1
  while ((line = br.readLine()) != null) {
    String[] parts = line.split(",");
    if (parts[1].equals("M")) {
      parts[1] = "0";
      fw.write(String.join(",", parts) + "\n");
    } else if (parts[1].equals("B")) {
      parts[1] = "1";
      fw.write(String.join(",", parts) + "\n");
    }
  }
  fw.close();
  br.close();
}
  • Shuffle the dataset so the examples are well mixed, then split it into training and test sets. This can be done with the following code snippet:
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import com.google.common.base.Charsets;
import com.google.common.io.Resources;

public void dataPreparation() throws Exception {
  // Read the dataset created by the earlier convertTargetToInteger method,
  // using the Google Guava APIs
  List<String> result = Resources.readLines(Resources.getResource("wdbc.csv"), Charsets.UTF_8);
  // Drop the header before shuffling; otherwise it could end up in the middle of the dataset
  List<String> raw = result.subList(1, 570);
  Random random = new Random();
  // Shuffle the dataset
  Collections.shuffle(raw, random);
  // Split the dataset into training and test examples
  List<String> train = raw.subList(0, 470);
  List<String> test = raw.subList(470, 569);
  File trainingData = new File("<your Location>/wdbcTrain.csv");
  File testData = new File("<your Location>/wdbcTest.csv");
  writeCSV(train, trainingData);
  writeCSV(test, testData);
}

// Write the list to the desired file location, prefixed with the CSV header
public void writeCSV(List<String> list, File file) throws IOException {
  FileWriter fw = new FileWriter(file);
  fw.write("ID_Number,Diagnosis,Radius,Texture,Perimeter,Area,Smoothness,"
      + "Compactness,Concavity,ConcavePoints,Symmetry,Fractal_Dimension,"
      + "RadiusStdError,TextureStdError,PerimeterStdError,AreaStdError,"
      + "SmoothnessStdError,CompactnessStdError,ConcavityStdError,ConcavePointStdError,"
      + "Symmetrystderror,FractalDimensionStderror,WorstRadius,worsttexture,"
      + "worstperimeter,worstarea,worstsmoothness,worstcompactness,worstconcavity,"
      + "worstconcavepoints,worstsymmentry,worstfractaldimensions\n");
  for (int i = 0; i < list.size(); i++) {
    fw.write(list.get(i) + "\n");
  }
  fw.close();
}
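The shuffle-then-split pattern above can be exercised on a toy list. This sketch is illustrative only (the method name and fixed seed are my own, not from the post); it copies the sublists because subList returns a live view of the backing list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SplitSketch {

    // Shuffle a copy of the rows, then cut at trainSize: element 0 of the
    // result is the training split, element 1 the test split
    static List<List<String>> shuffleAndSplit(List<String> rows, int trainSize, long seed) {
        List<String> copy = new ArrayList<>(rows);
        Collections.shuffle(copy, new Random(seed));
        List<List<String>> splits = new ArrayList<>();
        splits.add(new ArrayList<>(copy.subList(0, trainSize)));
        splits.add(new ArrayList<>(copy.subList(trainSize, copy.size())));
        return splits;
    }
}
```

With 569 data rows and trainSize = 470, this yields the same 470/99 split as the code above.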
  • Train the model. Here Diagnosis is the target variable with 2 categories, all 30 predictors are numeric, and SGD makes 90 passes over the training data with a learning rate of 300:
  mahout trainlogistic --input /tmp/wdbcTrain.csv --output /tmp/model --target Diagnosis --categories 2 --predictors Radius Texture Perimeter Area Smoothness Compactness Concavity ConcavePoints Symmetry Fractal_Dimension RadiusStdError TextureStdError PerimeterStdError AreaStdError SmoothnessStdError CompactnessStdError ConcavityStdError ConcavePointStdError Symmetrystderror FractalDimensionStderror WorstRadius worsttexture worstperimeter worstarea worstsmoothness worstcompactness worstconcavity worstconcavepoints worstsymmentry worstfractaldimensions --types numeric --features 30 --passes 90 --rate 300
  • Check the AUC and the confusion matrix on the training data. The command is as follows:
  mahout runlogistic --input /tmp/wdbcTrain.csv --model /tmp/model --auc --confusion
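For reference, the AUC that runlogistic reports is the area under the ROC curve, which equals the probability that a randomly chosen positive example scores above a randomly chosen negative one. A small sketch of that rank-based computation (my own illustration, not Mahout's code):

```java
public class AucSketch {

    // AUC as the Mann-Whitney statistic: the fraction of (positive, negative)
    // score pairs where the positive example wins; ties count as half a win
    static double auc(double[] posScores, double[] negScores) {
        double wins = 0.0;
        for (double p : posScores) {
            for (double n : negScores) {
                if (p > n) {
                    wins += 1.0;
                } else if (p == n) {
                    wins += 0.5;
                }
            }
        }
        return wins / (posScores.length * (double) negScores.length);
    }
}
```

An AUC of 1.0 means the model ranks every positive above every negative; 0.5 is no better than chance.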
  • Run the same evaluation on the test data as well:
  mahout runlogistic --input /tmp/wdbcTest.csv --model /tmp/model --auc --confusion

  • The model performs almost as well on the test data: it classified 34 of the 40 malignant tumors correctly.
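The 34-out-of-40 figure is the sensitivity (true-positive rate) for the malignant class. A quick sketch of the standard confusion-matrix metrics; the 34/40 counts come from the run above, while any other numbers are purely illustrative:

```java
public class ConfusionMetrics {

    // Sensitivity (recall): correctly detected positives / all actual positives
    static double sensitivity(int truePositives, int falseNegatives) {
        return truePositives / (double) (truePositives + falseNegatives);
    }

    // Specificity: correctly detected negatives / all actual negatives
    static double specificity(int trueNegatives, int falsePositives) {
        return trueNegatives / (double) (trueNegatives + falsePositives);
    }

    public static void main(String[] args) {
        // 34 of 40 malignant tumors classified correctly -> 6 false negatives
        System.out.println(sensitivity(34, 6)); // prints 0.85
    }
}
```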


Reference

Learning Apache Mahout Classification, by Ashish Gupta
