简介

从Postgresql 9.0开始，基于预写日志(Write Ahead Log,WAL)的可读复制(PITR)更是成为了官方提供的异步主从复制(Master-Slave Replication)解决方案，该方案拥有如下优点：
a.使用预写日志记录数据库的改动，不额外增加服务端的其他负荷。
b.当主服务器失败（如断电、系统崩溃、灾难）时，不会丢失任何数据。
c.支持基于流和基于档案的两种日志传输方案。
d.备用服务器可作为负载均衡节点提供读请求。
e.支持多个或多级备用服务器。

实现原理

主服务器在接受到每个事务请求时，将数据改动用预写日志(WAL)记录。具体而言，事务采用两段提交(Two Phase Commit)，即先将改动写入预写日志，然后再实际改动数据库。这样可以保证预写日志的时间戳永远不落后于数据库，即便是正在写入时服务器突然崩溃，重启以后也可以依据预写日志将数据恢复，因为预写日志保留了比数据库记录中更新的版本。Postgresql的异步复制解决方案正是利用了预写日志，将预写日志从主服务器(Master Sever)传输到备用服务器(Standby Server)，然后在备用服务器上回放(Replay)出预写日志中记录改动，从而实现主从复制。Postgresql使用了两种方式传输预写日志：存档式(archive)和流式(streaming)。
存档式复制的原理是主服务器将预写日志主动拷贝到一个安全的位置（可以直接到备用服务器，也可以是第三台服务器），同时备用服务器定期扫描这个位置，并将预写日志拷贝到备用服务器端然后再回放。这样即使主服务器崩溃了，备用服务器也可以从这个安全的位置获取到一份完整的记录，以确保任何数据不会丢失。而流式复制则简化了这一个步骤，由主服务器直接通过TCP协议向备用服务器传输日志，避免了两次复制的开销，有利于减小备用服务器和主服务器直接的数据延时。但当主服务器崩溃时，未被传输到备用服务器的日志则会丢失，造成数据损失。Postgresql 支持存档式和流式两种模式的混合，当两种模式都开启时，备用服务器会定期检查是否有存档已经到达指定的位置，并回放日志。一旦检测到指定的位置没有新的日志，则会切换到流式模式试图直接从网络传输日志，接着再检查存档，不断重复这一循环。

环境说明

Hostname	IP	系统	数据库
master	10.10.10.60	Centos6.5	postgresql 9.2
slave	10.10.10.61	Centos6.5	postgresql 9.2

需求：master和slave作为主从流复制，当master宕机后，slave切换为新主继续服务；然而当master恢复故障后，作为从重新加入主从流复制。

master配置

主的基本配置请参考http://www.jb51.cc/article/p-wxaybmrj-bau.html，以下是需更改添加的地方。
1.创建复制用户

postgres=# create user rep replication password 'reptest';
CREATE ROLE
postgres=# \du
 List of roles
 Role name | Attributes | Member of -----------+------------------------------------------------+-----------
 postgres | Superuser,Create role,Create DB,Replication | {}
 rep | Replication | {}

2.设置pg_hba.conf

#允许slave访问
vim pg_hba.conf
host   replication      rep          10.10.10.61/8           md5

3.设置主库postgresql.conf

wal_level = hot_standby # write ahead log，流复制时为hot_standby
hot_standby = on
max_wal_senders = 2 # 流复制的最大连接数
wal_keep_segments = 16 # 流复制保留的最大xlog数

其中为了在备机服务器上开启“只读”查询，wal_level必须设置成“hot_standby”. 但是，如果你一直不会在stand-by模式下连接到备机，你可以选择“archive”，archive模式只会将主上的wal日志记录备份到备用服务器上。

4.重启数据库

pg_ctl restart -D /data/pgsql/data
或者
pg_ctl stop -D /data/pgsql/data -m fast
pg_ctl start -D /data/pgsql/data -m fast

slave配置

1.创建数据目录

mkdir -p /data/pgsql/data
chown -R postgres.postgres data
chmod 700 data

2.使用pg_backendup生成备库

-bash-4.2$ pg_basebackup -D /data/pgsql/data -Fp -Xs -v -P -h 10.10.10.60 -p 5432 -U rep
Password: 
transaction log start point: 0/2000020
pg_basebackup: starting background WAL receiver
20154/20154 kB (100%),1/1 tablespace                                    
transaction log end point: 0/20000E0
pg_basebackup: waiting for background process to finish streaming...
pg_basebackup: base backup completed
-bash-4.2$ ls /data/pgsql/data/
backup_label  global   pg_hba.conf    pg_multixact  pg_serial     pg_stat_tmp  pg_tblspc    PG_VERSION  postgresql.conf
base          pg_clog  pg_ident.conf  pg_notify     pg_snapshots  pg_subtrans  pg_twophase  pg_xlog

此时表空间目录和$PGDATA目录已经复制过来了。
注意：从服务器无需初始化，因为所有的配置文件及目录都是从主上复制过来的。

3.配置recovery.conf

-bash-4.2$ rpm -ql postgresql-server |grep recovery /usr/share/pgsql/recovery.conf.sample
-bash-4.2$ cp /usr/share/pgsql/recovery.conf.sample /data/pgsql/data/recovery.conf #添加如下几行
vim recovery.conf
standby_mode = on
trigger_file = '/data/pgsql/pg.trigger'
primary_conninfo = 'host=10.10.10.60 port=5432 user=rep password=reptest'

其中:
standby_mode = on指明是否开启服务器作为一个备机,在流复制里，这个参数必须要开启。
primary_conninfo 指明用于备服务器连接到主服务器的连接字符串
trigger_file 指定一个触发文件让备服务器感觉到它的时候就会停止流复制（即：故障转移）,不要创建这个文件。当你想主从切换的时候才需要创建它。
注意：当从切换为主完成后，recovery.conf会自动变为recovery.done
4.启动pgsql

pg_ctl start -D /data/pgsql/data

测试流复制

1.查看主库

-bash-4.2$ ps -ef |grep postgres
postgres  3585 23019  0 10:37 pts/0    00:00:00 psql
postgres  3730     1  0 11:23 pts/0    00:00:00 /usr/bin/postgres -D /data/pgsql/data
postgres  3731  3730  0 11:23 ?        00:00:00 postgres: logger process   
postgres  3733  3730  0 11:23 ?        00:00:00 postgres: checkpointer process   
postgres  3734  3730  0 11:23 ?        00:00:00 postgres: writer process   
postgres  3735  3730  0 11:23 ?        00:00:00 postgres: wal writer process   
postgres  3736  3730  0 11:23 ?        00:00:00 postgres: autovacuum launcher process  
postgres  3737  3730  0 11:23 ?        00:00:00 postgres: stats collector process   
postgres  4066  3730  0 13:40 ?        00:00:00 postgres: wal sender process rep 192.168.3.141(47146) streaming 0/3000A38
postgres  4069 23019  0 13:42 pts/0    00:00:00 ps -ef
postgres  4070 23019  0 13:42 pts/0    00:00:00 grep --color=auto postgres
root     23018 21245  0 Apr01 pts/0    00:00:00 su - postgres
postgres 23019 23018  0 Apr01 pts/0    00:00:00 -bash

确认主库中进程有“postgres: wal sender process”
2.查看从库

-bash-4.2$ ps -ef |grep postgres
root       694 19367  0 11:27 pts/0    00:00:00 psql -h 192.168.3.139 -U postgres -W
root       724 19367  0 11:57 pts/0    00:00:00 su - postgres
postgres   725   724  0 11:57 pts/0    00:00:00 -bash
postgres  1069     1  0 14:13 pts/0    00:00:00 /usr/bin/postgres -D /data/pgsql/data
postgres  1070  1069  0 14:13 ?        00:00:00 postgres: logger process   
postgres  1071  1069  0 14:13 ?        00:00:00 postgres: startup process   recovering 000000010000000000000003
postgres  1072  1069  0 14:13 ?        00:00:00 postgres: checkpointer process   
postgres  1073  1069  0 14:13 ?        00:00:00 postgres: writer process   
postgres  1074  1069  0 14:13 ?        00:00:00 postgres: stats collector process   
postgres  1075  1069  0 14:13 ?        00:00:00 postgres: wal receiver process   streaming 0/3000AD0
postgres  1078   725  0 14:15 pts/0    00:00:00 ps -ef
postgres  1079   725  0 14:15 pts/0    00:00:00 grep --color=auto postgre

确定备库进程中有”postgres: wal receiver process”

3.执行如下命令查看快照，它返回主库记录点、备库记录点；主库每增加一条写入，记录点的值就会加1。

postgres=# select txid_current_snapshot();
 txid_current_snapshot -----------------------
 1894:1894:
(1 row)
postgres=# create database testdb;
CREATE DATABASE
postgres=# select txid_current_snapshot();
 txid_current_snapshot -----------------------
 1895:1895:
(1 row)

4.查看主备同步状态

postgres=# select * from pg_stat_replication;
 pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | state | sent_location | write_location | flush_location | replay_locat
ion | sync_priority | sync_state ------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+-----------+---------------+----------------+----------------+-------------
----+---------------+------------
 4066 | 16385 | rep | walreceiver | 192.168.3.141 | | 47146 | 2016-04-08 13:40:53.375562+08 | streaming | 0/3000B68 | 0/3000B68 | 0/3000B68 | 0/3000B68 
 | 0 | async
(1 row)

字段state显示的同步状态有：startup（连接中）、catchup（同步中）、streaming（同步）；字段sync_state显示的模式有：async（异步）、sync（同步）、potential（虽然现在是异步模式，但是有可能升级到同步模式）

5.主库添加数据

postgres=# create table t1(id int4,create_time timestamp(0) without time zone);
CREATE TABLE postgres=# insert into t1 values(1,now());
INSERT 0 1 postgres=# select * from t1;
 id |     create_time     
----+---------------------
  1 | 2016-04-08 13:53:50
(1 row)

6.在从库查看数据

postgres=# c testdb
postgres=# select * from t1;
 id | create_time ----+---------------------
 1 | 2016-04-08 13:53:50
(1 row)

主从库数据同步，则流复制成功。

7.从库插入数据

postgres=# insert into t1 values(2,now());
ERROR:  cannot execute INSERT in a read-only transaction

备库上执行insert语句会报错，因为备库是只读的。

主从切换

1.模拟主机故障

-bash-4.2$ pg_ctl stop -D /data/pgsql/data -m fast
waiting for server to shut down.... done
server stopped

2.提升备库状态

-bash-4.2$ pg_ctl promote -D /data/pgsql/data 
server promoting

3.查看状态

postgres=# select pg_is_in_recovery();
 pg_is_in_recovery 
-------------------
 f
(1 row)

主库返回f,备库返回t

原master恢复为从

我们以新master和新slave称呼
1.模拟新的主上新增数据

#新master
postgres=# insert into t1 values(2,now());
postgres=# insert into t1 values(3,now());

2.查看两台数据库服务器的pg_xlog;

#新master
-bash-4.2$ ll total 81928
-rw-------. 1 postgres postgres 16777216 Apr 8 12:45 000000010000000000000002 -rw-------. 1 postgres postgres 16777216 Apr 8 16:09 000000010000000000000003 -rw-------. 1 postgres postgres 16777216 Apr 11 09:22 000000020000000000000003 -rw-------. 1 postgres postgres 16777216 Apr 11 10:24 000000020000000000000004 -rw-------. 1 postgres postgres 16777216 Apr 12 09:16 000000020000000000000005 -rw-------. 1 postgres postgres 56 Apr 8 16:10 00000002.history drwx------. 2 postgres postgres     4096 Apr  8 12:45 archive_status
#新slave
-bash-4.2$ ll total 49156
-rw-------. 1 postgres postgres 16777216 Apr 8 12:12 000000010000000000000001 -rw-------. 1 postgres postgres 16777216 Apr 8 12:12 000000010000000000000002 -rw-------. 1 postgres postgres 16777216 Apr 8 15:37 000000010000000000000003 drwx------. 2 postgres postgres     4096 Apr  1 09:22 archive_status

区别是新主上多出来的日志就是我们新插入数据后生成的日志
3.配置新master的pg_hba.conf

host   replication      rep          10.10.10.60/8           md5

4.查看主从pg_controldata输出

#新master
-bash-4.2$ pg_controldata /data/pgsql/data
pg_control version number:            922
Catalog version number:               201204301
Database system identifier:           6268391932590681930
Database cluster state:               in production
pg_control last modified:             Mon 11 Apr 2016 04:36:25 PM CST
Latest checkpoint location:           0/5002E68
Prior checkpoint location:            0/5002DD0
Latest checkpoint's REDO location:    0/5002E30
Latest checkpoint's TimeLineID:       2
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0/1899
Latest checkpoint's NextOID:          16390
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        1879
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  1899
Time of latest checkpoint:            Mon 11 Apr 2016 04:36:25 PM CST
Minimum recovery ending location:     0/0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
Current wal_level setting:            hot_standby
Current max_connections setting:      100
Current max_prepared_xacts setting:   0
Current max_locks_per_xact setting:   64
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
#新slave
-bash-4.2$ pg_controldata /data/pgsql/data
pg_control version number:            922
Catalog version number:               201204301
Database system identifier:           6268391932590681930
Database cluster state:               shut down in recovery
pg_control last modified:             Mon 11 Apr 2016 04:06:39 PM CST
Latest checkpoint location:           0/3014EB8
Prior checkpoint location:            0/3014EB8
Latest checkpoint's REDO location:    0/3014EB8
Latest checkpoint's TimeLineID:       1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0/1897
Latest checkpoint's NextOID:          16390
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        1879
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  0
Time of latest checkpoint:            Fri 08 Apr 2016 03:37:07 PM CST
Minimum recovery ending location:     0/3014F18
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
Current wal_level setting:            hot_standby
Current max_connections setting:      100
Current max_prepared_xacts setting:   0
Current max_locks_per_xact setting:   64
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value

从输出结果我们得出Latest checkpoint’s TimeLineID 新主为2，因为我们由从切换为新master并且新插入了数据，而新slave仍为1

5.配置新slave的recovery.conf

-bash-4.2$ rpm -ql postgresql-server |grep recovery
/usr/share/pgsql/recovery.conf.sample
-bash-4.2$ cp /usr/share/pgsql/recovery.conf.sample /data/pgsql/data/recovery.conf
添加如下几行
standby_mode = on
trigger_file = '/data/pgsql/pg.trigger'
primary_conninfo = 'host=10.10.10.61 port=5432 user=rep password=rep'
#新增如下一行
recovery_target_timeline = 'latest'

6.启动新从

pg_ctl start -D /data/pgsql/data

7.查看日志pg_log
FATAL: timeline 2 of the primary does not match recovery target timeline 1
原因：由于新主从的timelineID不一致造成
8.我们将主从pg_xlog下的多出的文件copy到从上，注意后缀为history的文件也要复制，archive_status归档日志可以不用复制

scp 00000002* test@10.10.10.60:/data/pgsql/data/pg_xlog

9.重启新slave数据库

pg_ctl restart -D /data/pgsql/data

10.查看新slave数据是否同步

-bash-4.2$ psql
psql (9.2.15)
Type "help" for help.

postgres=# select * from t1;
 id | create_time ----+---------------------
 1 | 2016-04-08 13:53:50
 2 | 2016-04-11 10:27:23
 3 | 2016-04-11 10:33:17
(3 rows)

postgres=# select pg_is_in_recovery(); pg_is_in_recovery 
------------------- t (1 row)

由上看出新slave和新master同步，并且状态正常，切换成功。
另外，你可以通过比较主服务器上的当前的WAL写位置与备服务器上的最新“接收/重做”的WAL位置来计算复制的延迟。它们各自可以通过在主服务器端使用pg_current_xlog_location函数来获取，在备服务器上通过pg_last_xlog_receive_location或者pg_last_xlog_replay_location来获取。

#新master
postgres=# select pg_current_xlog_location(); pg_current_xlog_location 
-------------------------- 0/50227C8 (1 row) #新slave postgres=# select pg_last_xlog_receive_location(); pg_last_xlog_receive_location -------------------------------
 0/50227C8
(1 row)

postgres=# select pg_last_xlog_replay_location();
 pg_last_xlog_replay_location ------------------------------
 0/50227C8
(1 row)

postgresql流复制

简介