用PHP編寫Hadoop的MapReduce程序

s19930811 ? 2015-04-13 22:52 ? Linux干貨, 大數據運維, 系統運維

Hadoop流

雖然Hadoop是用java寫的，但是Hadoop提供了Hadoop流，Hadoop流提供一個API, 允許用戶使用任何語言編寫map函數和reduce函數.
Hadoop流動關鍵是，它使用UNIX標準流作為程序與Hadoop之間的接口。因此，任何程序只要可以從標準輸入流中讀取數據，并且可以把數據寫入標準輸出流中，那么就可以通過Hadoop流使用任何語言編寫MapReduce程序的map函數和reduce函數。
例如：bin/hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar -mapper /usr/local/hadoop/mapper.php -reducer /usr/local/hadoop/reducer.php -input test/* -output out4
Hadoop流引入的包：hadoop-streaming-0.20.203.0.jar,Hadoop根目錄下是沒有hadoop-streaming.jar的，因為streaming是一個contrib，所以要去contrib下面找，以hadoop-0.20.2為例，它在這里：
-input：指明輸入hdfs文件的路徑
-output：指明輸出hdfs文件的路徑
-mapper：指明map函數
-reducer：指明reduce函數

mapper函數

mapper.php文件，寫入如下代碼：

#!/usr/local/php/bin/php  
<?php  
$word2count = array();  
// input comes from STDIN (standard input)  
// You can this code :$stdin = fopen(“php://stdin”, “r”);  
while (($line = fgets(STDIN)) !== false) {  
    // remove leading and trailing whitespace and lowercase  
    $line = strtolower(trim($line));  
    // split the line into words while removing any empty string  
    $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);  
    // increase counters  
    foreach ($words as $word) {  
        $word2count[$word] += 1;  
    }  
}  
// write the results to STDOUT (standard output)  
// what we output here will be the input for the  
// Reduce step, i.e. the input for reducer.py  
foreach ($word2count as $word => $count) {  
    // tab-delimited  
    echo $word, chr(9), $count, PHP_EOL;  
}  
?>

這段代碼的大致意思是：把輸入的每行文本中的單詞找出來，并以”

hello 1
world 1″

這樣的形式輸出出來。

和之前寫的PHP基本沒有什么不同，對吧，可能稍微讓你感到陌生有兩個地方：

PHP作為可執行程序

第一行的
#!/usr/local/php/bin/php
告訴linux，要用#!/usr/local/php/bin/php這個程序作為以下代碼的解釋器。寫過linux shell的人應該很熟悉這種寫法了，每個shell腳本的第一行都是這樣: #!/bin/bash, #!/usr/bin/python

有了這一行，保存好這個文件以后，就可以像這樣直接把mapper.php當作cat, grep一樣的命令執行了：./mapper.php

使用stdin接收輸入

PHP支持多種參數傳入的方法，大家最熟悉的應該是從$_GET, $_POST超全局變量里面取通過Web傳遞的參數，次之是從$_SERVER['argv']里取通過命令行傳入的參數，這里，采用的是標準輸入stdin

它的使用效果是：

在linux控制臺輸入 ./mapper.php

mapper.php運行，控制臺進入等候用戶鍵盤輸入狀態

用戶通過鍵盤輸入文本

用戶按下Ctrl + D終止輸入，mapper.php開始執行真正的業務邏輯，并將執行結果輸出

那么stdout在哪呢？print本身已經就是stdout啦，跟我們以前寫web程序和CLI腳本沒有任何不同。

reducer函數

創建reducer.php文件，寫入如下代碼：

#!/usr/local/php/bin/php  
<?php  
$word2count = array();  
// input comes from STDIN  
while (($line = fgets(STDIN)) !== false) {  
    // remove leading and trailing whitespace  
    $line = trim($line);  
    // parse the input we got from mapper.php  
    list($word, $count) = explode(chr(9), $line);  
    // convert count (currently a string) to int  
    $count = intval($count);  
    // sum counts  
    if ($count > 0) $word2count[$word] += $count;  
}  
// sort the words lexigraphically  
//  
// this set is NOT required, we just do it so that our  
// final output will look more like the official Hadoop  
// word count examples  
ksort($word2count);  
// write the results to STDOUT (standard output)  
foreach ($word2count as $word => $count) {  
    echo $word, chr(9), $count, PHP_EOL;  
}  
?>

這段代碼的大意是統計每個單詞出現了多少次數，并以”

hello 2

world 1″

這樣的形式輸出

用Hadoop來運行

把文件放入 Hadoop 的 DFS 中：
bin/hadoop dfs -put test.log test
執行 php 程序處理這些文本(以Streaming方式執行PHP mapreduce程序:):

bin/hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar -mapper /usr/local/hadoop/mapper.php -reducer /usr/local/hadoop/reducer.php -input test/* -output out

注意：

1) input和output目錄是在hdfs上的路徑

2) mapper和reducer是在本地機器的路徑，一定要寫絕對路徑，不要寫相對路徑，以免到時候hadoop報錯說找不到mapreduce程序

3 ) mapper.php 和 reducer.php 必須復制到所有 DataNode 服務器上的相同路徑下, 所有的服務器都已經安裝php.且安裝路徑一樣.

查看結果

bin/hadoop d fs -cat /tmp/out/part-00000

轉自：http://blog.csdn.net/hguisu/article/details/7263746

原創文章，作者：s19930811，如若轉載，請注明出處：http://www.www58058.com/3084

hadoop mapreduce php WhiteSpace 全局變量

贊 (0)

0

Hadoop Hive與Hbase整合+thrift

上一篇 2015-04-13 22:52

谷歌三大核心技術（二）Google MapReduce中文版

下一篇 2015-04-13 22:52

馬哥教育網絡20期第七周課程練習

1、創建一個10G分區，并格式為ext4文件系統； (1) 要求其block大小為2048, 預留空間百分比為2, 卷標為MYDATA, 默認掛載屬性包含acl； fdisk /dev/sdb ; mke2fs -t ext4 -b 2048 -L MYDATA -m 2 –O acl /dev/sdb1 (2) 掛載至/data/mydata目錄，要求掛載…

Linux干貨 2016-08-15
馬哥教育網絡班21期-第九周課程練習

"1、寫一個腳本，判斷當前系統上所有用戶的shell是否為可登錄shell（即用戶的shell不是/sbin/nologin）；分別這兩類用戶的個數；通過字符串比較來實現； #!/bin/bash # declare -i login_user=0 declare -i nologin_user=0 whil…

Linux干貨 2016-09-15
Linux干貨

corosync v2+pacemaker實現mariadb的高可用

高可用mariadb拓撲圖一、設計前提 1、時間同步 # ntpdate 172.16.0.1 或者 # chronyc sources 2、所有的主機對應的IP地址解析可以正常工作，主機名要與命令#uname -n 所得的結果一致因此，/etc/hosts中的內容為以下內容 ????????172.16.23.10?node1.rj.com?node…

2017-11-02
Centos 7 快速進入圖形界面

Centos 7 快速進入圖形界面.pdf

系統運維 2016-04-05
用戶、組及其管理

用戶和組管理 Linux是一個多用戶、多任務的操作系統。多用戶、多任務就是可以在系統上建立多個用戶，多個用戶可以在同一時間內登錄同一臺主機的系統執行不同的任務，而互不影響。例如某臺linux服務器上有4個用戶，分別是root、www、ftp和mysql，在同一時間內root用戶可能在管理維護系統，www用戶可能在修改自己的程序和操作…

Linux干貨 2016-08-04
用戶和組管理類命令詳解

用戶和組管理類命令詳解組管理 groupadd 功能描述：創建一個新組命令格式： groupadd [選項] GROUP 選項： -g GID 表示指定GID，默認情況下使用的是最小的未使用過的GID -r 表示創建一個系統組 groupmod 功能描述：修改組屬性命令格式：groupmod [選項] GROUP 選項： -g GID 表示修改GID …

Linux干貨 2017-07-16

欧美性久久久久