redis的異常行為是一件令人頭疼的問題。redis提供的錯誤日志只提供了一些server的運行狀態信息,而沒有server在故障現場的操作日志,比如執行了什么命令,命令操作對象的數據結構信息。如果redis也有mysql的slow-log,那么很多性能和故障問題也很容易處理。
1、redis rdb
在redis2.8版本以前,redis主從數據復制在生產上存在一個嚴重的問題。當主從連接斷開時,redis slave會請請求master重傳數據,進行全量的rdb重傳。對于大數據集和高并發的redis實例,會引起redis阻塞相當長的時間,甚至導致redis不能處理任何客戶端的請求。(redis latency)。
1)redis開啟bgsave,調用fork創建一個子進程,然后對數據集進行copy-on-write。fork()產生的子進程需要復制父進程的地址空間(頁表)。比如在頁大小為4KB的linux系統上,24GB的redis數據集,需要一個24GB/4KB*8=48MB的頁表。子進程復制頁表(read-only狀態)的時候,redis主進程(父進程)不能處理寫入請求。bgsave和aofrewrite的實現過程可以查看rdb.c和aof.c文件.
2)如果數據集較大,系統內存不是很充足,rdb會導致swap,redis性能非常低下。
2、2.8版本出現復制中斷現象
而2.8版本實現了增量復制,利用緩沖區實現了斷點續傳,避免出現rdb重傳。雖然psync不成功的情況,也會進行rdb重傳。
下文要講述的問題與rdb/aof并無多大關系,而下面的日志是在redis2.8.8上發生的復制中斷的問題。
具體日志信息如下:
error log of slave
[181852] 12 Mar 20:31:34.695 # MASTER timeout: no data nor PING received... [181852] 12 Mar 20:31:34.696 # Connection with master lost. [181852] 12 Mar 20:31:34.696 * Caching the disconnected master state. [181852] 12 Mar 20:31:34.697 * Connecting to MASTER 192.168.0.197:6384 [181852] 12 Mar 20:31:34.697 * MASTER <-> SLAVE sync started [181852] 12 Mar 20:31:37.697 * Non blocking connect for SYNC fired the event. [181852] 12 Mar 20:31:37.698 * Master replied to PING, replication can continue... [181852] 12 Mar 20:31:37.698 * Trying a partial resynchronization (request 737d5014a4916a0fcabfcbeef337fdc0d552a44a:64363916219). [181852] 12 Mar 20:31:37.699 * Successful partial resynchronization with master. [181852] 12 Mar 20:31:37.700 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.
errorerror log of master
[189366] 12 Mar 20:31:35.788 # Connection with slave 192.168.0.174:6384 lost. [189366] 12 Mar 20:31:37.698 * Slave asks for synchronization [189366] 12 Mar 20:31:37.699 * Partial resynchronization request accepted. Sending 135901 bytes of backlog starting from offset 64363916219.
3、客戶端異常:
在應用客戶端發生了什么呢?開發告訴我,應用程序端(PHP)出現大量的“Redis server went away”。
4、master-slave的基本行為
1)首先,slave(第34s)日志顯示沒有收到master的數據或者ping信息,slave重新連接master。
2)接著,master(第35s)日志提示從庫斷開了連接。
3)在第37秒,主從的日志均顯示復制連接重現建立成功,并開始psync增量復制。
5、尋找問題的蛛絲馬跡
1)機器上的其它的主從復制都很正常,沒有復制斷開的狀況,排除了網絡故障原因。
2)master機器上的流量突然上升,而從庫的網卡流量比較正常。這說明master突然有數據大量讀取。
3)master自身堵住了,沒有rdb和aofrewr。那么只有redis的慢查詢了,通過slowlog get 10。果不其然,存在大量的keys*操作,導致了復制中斷。
“Redis server went away”的原因也也很明顯,大量的慢查詢導致redis server堵塞,不能立即響應redis客戶端的請求而出現連接超時。
附加話題:redis復制超時
https://github.com/antirez/redis/issues/918
Since the master -> slave channel is one-way in Redis, currently replication timeout is only up to the slave. Basically this is what happens: The master normally sends the replication stream to the slave. The master also PINGs the slave from time to time, to make sure that even when no new data is available to be transmitted to the slave, the slave still get some news from the master. Because of this, there is always data going from master to slave, so the slave will be able to detect when the connection is down and reconnect if possible. On the master side, if we don't get an error from the socket, we keep sending data. The output buffer can get too large, but the limits in Redis 2.6 will detect that and close the connection with the slave.
原創文章,作者:s19930811,如若轉載,請注明出處:http://www.www58058.com/2474