
Redis Study Notes


The official Redis website is https://redis.io/, and there is also a Chinese-language site at http://www.redis.cn/.
The current stable Redis release is 3.2 (3.2.9 to be exact); the newest version is 4.0.

In this article you will find:
1. Redis basics: the data types, installation and configuration, the main configuration parameters, and so on.
2. Redis master-slave replication, and the high-availability architecture with automatic failover (Sentinel).
3. The clustered high-availability architecture, i.e. Redis Cluster (automatic master/slave failover plus data sharding).
4. Monitoring Redis.
5. Running Redis in Docker.
6. New features in Redis 4.0.

I. Redis Basics

1. Redis is an in-memory database that stores data as key-value pairs. Redis runs as a single process with a single thread, so it only ever uses one CPU core. Keep this in mind when monitoring: on a multi-core host the average CPU utilization may look low while the one core the redis process runs on is already saturated.
The main command-line tool is redis-cli, which provides an interactive shell (similar to sqlplus) for working with the data.

Redis has five main data types (others such as bitmap and hyperloglog exist but are not covered here):
1.1 string:
• set inserts or updates a value (note 1: a key is stored only once, setting it again overwrites the old value; note 2: strings are unordered and have no left/right end)
• get reads a value
• del deletes a key

127.0.0.1:6379> set name oracleblog
OK
127.0.0.1:6379> get name
"oracleblog"
127.0.0.1:6379> set name oracleblog
OK
127.0.0.1:6379> get name
"oracleblog"
127.0.0.1:6379>
#Even though SET was run twice, only one value is stored

Use case: general key-value storage. Note: a single value can hold a string of up to 512 MB.
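As a small illustration (a sketch, assuming a local instance on the default port; the key names page:title and page:views are made up for the example), strings also back atomic counters via INCR/INCRBY:

redis-cli set page:title "OracleBlog"
redis-cli get page:title
# strings double as atomic counters
redis-cli set page:views 0
redis-cli incr page:views          # returns (integer) 1
redis-cli incrby page:views 10     # returns (integer) 11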

1.2 list
• lpush/rpush pushes a value onto the left/right end (note: a list can store duplicate strings)
• lrange returns the values in a given range of the list
• lindex returns the value at a given index
• lpop pops a value from the left end (note: after a pop the value is no longer in the list)

127.0.0.1:6379> lpush lname oracle
(integer) 1
127.0.0.1:6379> lpush lname mysql
(integer) 2
127.0.0.1:6379> lpush lname oracle
(integer) 3
127.0.0.1:6379> lpush lname mssql
(integer) 4
127.0.0.1:6379> lrange lname 0 100
1) "mssql"
2) "oracle"
3) "mysql"
4) "oracle"
127.0.0.1:6379>
127.0.0.1:6379> lindex lname 2
"mysql"
127.0.0.1:6379>

Note: a list can hold at most 2^32 - 1 elements.

1.3 set
• sadd inserts members (a set uses a hash table to guarantee every stored string is unique; members are unordered, no left/right end)
• smembers lists all members
• sismember tests whether a value is a member
• srem removes a member

127.0.0.1:6379> sadd sname s-mysql
(integer) 1
127.0.0.1:6379> sadd sname s-oracle
(integer) 1
127.0.0.1:6379> sadd sname s-mssql
(integer) 1
127.0.0.1:6379> sadd sname s-oracle
(integer) 0
127.0.0.1:6379> sadd sname s-mssql
(integer) 0
127.0.0.1:6379> sadd sname s-redis
(integer) 1
127.0.0.1:6379>
127.0.0.1:6379> sadd sname s-mango s-postgres
(integer) 2
127.0.0.1:6379>
127.0.0.1:6379> smembers sname
1) "s-redis"
2) "s-mssql"
3) "s-oracle"
4) "s-mango"
5) "s-postgres"
6) "s-mysql"
127.0.0.1:6379>

Use cases: friends you and I have in common, shared interests, and so on.
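For example, a minimal sketch of the "common friends" use case (the keys friends:alice and friends:bob are made up, local instance assumed): each user's friends go into a set, and the intersection gives the friends in common.

redis-cli sadd friends:alice tom jerry lucy
redis-cli sadd friends:bob tom lucy jack
# common friends = set intersection (order of the reply may vary)
redis-cli sinter friends:alice friends:bob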

1.4 hash
• hset sets a field
• hget reads a given field
• hgetall returns every field and value in the hash
• hdel deletes a field from the hash if it exists

127.0.0.1:6379> hset hname passwd1 dji123
(integer) 1
127.0.0.1:6379> hset hname passwd1 dji123
(integer) 0
127.0.0.1:6379> hset hname passwd1 dji124
(integer) 0
127.0.0.1:6379> hgetall hname
1) "passwd1"
2) "dji124"
127.0.0.1:6379>
#Note: the later value replaced the earlier one.
127.0.0.1:6379> hset hname passwd2 dji222 passwd3 dji333 passwd4 dji444
(error) ERR wrong number of arguments for 'hset' command
127.0.0.1:6379>
127.0.0.1:6379> hmset hname passwd2 dji222 passwd3 dji333 passwd4 dji444
OK
127.0.0.1:6379> hgetall hname
1) "passwd1"
2) "dji124"
3) "passwd2"
4) "dji222"
5) "passwd3"
6) "dji333"
7) "passwd4"
8) "dji444"
127.0.0.1:6379>
#Note: to set several hash fields in one command you need hmset (on Redis 3.2; HSET only accepts multiple field/value pairs from 4.0 onwards)

1.5 zset (sorted set)
• zadd
• zrange
• zrangebyscore
• zrem

127.0.0.1:6379> zadd zname 1 oracle
(integer) 1
127.0.0.1:6379> zadd zname 2 mysql
(integer) 1
127.0.0.1:6379> zadd zname 3 mssql
(integer) 1
127.0.0.1:6379> zadd zname 3 redis
(integer) 1
127.0.0.1:6379> zrangebyscore zname 0 1000
1) "oracle"
2) "mysql"
3) "mssql"
4) "redis"
127.0.0.1:6379>
127.0.0.1:6379>
127.0.0.1:6379>
127.0.0.1:6379> zrange zname 0 1000
1) "oracle"
2) "mysql"
3) "mssql"
4) "redis"
127.0.0.1:6379>

Use cases: leaderboards, voting, and so on.
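A minimal leaderboard/voting sketch (assuming a local instance; the key name votes is made up): ZINCRBY creates a member on first use and bumps its score afterwards, and ZREVRANGE ... WITHSCORES reads the ranking from highest to lowest.

# one command per vote; the member is created on first use
redis-cli zincrby votes 1 oracle
redis-cli zincrby votes 1 mysql
redis-cli zincrby votes 1 oracle
# top 3 entries, highest score first
redis-cli zrevrange votes 0 2 WITHSCORES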

2. Persistence
2.1 RDB, similar to a snapshot.
When certain conditions are met, redis forks a child process and, relying on copy-on-write, writes a copy of everything in memory out to disk.
Process:
Walk every DB, walk each db's dict, and fetch every dictEntry.
For each key, look up its expire; if it has expired, skip it.
Write the key, value, expire time, etc. to the file.
Compute the checksum, then swap the new rdb file in for the old one.

Conditions that trigger it:
1) The configured automatic snapshot rules are matched
2) The user runs SAVE or BGSAVE
3) FLUSHALL is executed
4) Replication is started
Drawback: if the redis process exits, all changes since the last snapshot are lost.

Related parameters:
save 60 100
stop-writes-on-bgsave-error no
rdbcompression yes
dbfilename dump.rdb

Notes on bgsave:
If redis runs inside a virtual machine, the bgsave can take noticeably longer.
For every 1 GB of memory the redis process uses, forking the bgsave child process takes an extra 10~20 ms.
Difference between SAVE and BGSAVE: SAVE blocks until the snapshot is finished, while BGSAVE does the work in a child process.
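A quick way to see the difference (a sketch, assuming a local instance): trigger a BGSAVE and watch it complete in the background while the client returns immediately.

# timestamp of the last successful save
redis-cli lastsave
# snapshot in a forked child; the command itself returns at once
redis-cli bgsave
# 1 while the child is still dumping, 0 when it has finished
redis-cli info persistence | grep rdb_bgsave_in_progress
# LASTSAVE moves forward once the background save completes
redis-cli lastsave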

Parsing an RDB file:
Take a db0 that contains only set msg "hello" as an example; the sketch below shows how to create and peek at the resulting file.
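A minimal sketch (assuming the data dir /var/redis/6379 from the install section and that xxd is available): the file starts with the literal magic "REDIS" plus a four-digit RDB version, followed by the encoded keys and an EOF opcode with a CRC64 checksum at the end.

redis-cli -n 0 set msg "hello"
redis-cli save
# the first bytes read "REDIS" plus the RDB version (0007 on a 3.2 server)
xxd /var/redis/6379/dump.rdb | head -5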

2.2 AOF, similar to an archive log: write commands are appended to a file.
Note: every write goes through flushAppendOnlyFile to flush the AOF; how often the data is fsync'ed depends on appendfsync (with "always" each write is fsync'ed and blocks the foreground thread, with "everysec" the fsync happens roughly once per second).
Note: putting the AOF on SSD noticeably improves its performance.

Related parameters:
appendonly yes
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
dir ~/

Parsing an AOF file:

[root@redis01 6399]# cat  appendonly.aof
*2         #2 arguments follow
$6         #the first argument is 6 bytes long
SELECT     #the first argument is SELECT
$1         #the second argument is 1 byte long
0          #the second argument is 0
*3         #3 arguments follow
$3         #the first argument is 3 bytes long
SET        #the first argument is SET
$4         #the second argument is 4 bytes long
col2       #the second argument is col2
$2         #the third argument is 2 bytes long
v2         #the third argument is v2

Which is equivalent to:
select 0 ##switch to db0
set col2 v2 ##insert the key-value pair col2-v2

AOF rewrite (BGREWRITEAOF):
Purpose: shrink the AOF file.
Triggered when:
1. The bgrewriteaof command is issued
2. The AOF file has grown by more than a configured percentage and its actual size exceeds the configured minimum (the two parameters below)

# Specify a percentage of zero in order to disable the automatic AOF
# rewrite feature.

auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb

Note: during a rewrite a single command carries at most 64 items; a collection with more than 64 entries is split across several commands (AOF_REWRITE_ITEMS_PER_CMD below).

/* Static server configuration */
……
#define LOG_MAX_LEN    1024 /* Default maximum length of syslog messages */
#define AOF_REWRITE_PERC  100
#define AOF_REWRITE_MIN_SIZE (64*1024*1024)
#define AOF_REWRITE_ITEMS_PER_CMD 64
#define CONFIG_DEFAULT_SLOWLOG_LOG_SLOWER_THAN 10000
#define CONFIG_DEFAULT_SLOWLOG_MAX_LEN 128
……

2.3 Data durability
If losing a few minutes of data is acceptable, RDB is enough; if you need continuous logging, use AOF. From a performance point of view, since AOF writes continuously, one option is to enable AOF only on a slave and keep just RDB on the master.

Note: when the redis server crashes and is restarted, it loads data with the following priority:
If only AOF is configured, the AOF is loaded at startup.
If both AOF and RDB are configured, only the AOF is loaded at startup.
If only RDB is configured, the RDB dump file is loaded at startup.

Note: on Linux 6 (CentOS 6, RedHat 6, OEL 6) you can restart redis with /etc/init.d/redis-server restart, but this script does not SAVE before restarting, so without AOF you lose everything written since the last save.
The correct way is to connect with redis-cli and issue SHUTDOWN (which saves by default) or SHUTDOWN SAVE. Do not use SHUTDOWN NOSAVE.

If you want to turn on AOF on a running instance, a good way is (see the sketch after these steps):
a. Dynamically run CONFIG SET appendonly yes; this generates appendonly.aof containing the data as it was before the change as well as everything written afterwards.
b. Set appendonly yes in redis.conf.
c. Restart redis during the next maintenance window.
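A sketch of step a (assuming the password from the configuration earlier; if CONFIG has been renamed as in the hardening step, substitute the renamed command):

# turn AOF on without a restart; redis writes the current dataset
# into appendonly.aof and appends subsequent commands to it
redis-cli -a oracleblog config set appendonly yes
redis-cli -a oracleblog config get appendonly
# wait for the initial rewrite to finish before relying on the file
redis-cli -a oracleblog info persistence | grep aof_rewrite_in_progress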

If a power failure corrupts the AOF or RDB file, the repair tools are:
redis-check-aof checks and repairs an AOF (it truncates every command from the first broken one onwards, that command included)
redis-check-rdb (named redis-check-dump in older releases) checks and repairs an RDB file

3. Key expiration (expire).
We can set a key to expire:

127.0.0.1:6379> get testexpire
(nil)
127.0.0.1:6379> set testexpire value1
OK
127.0.0.1:6379> get testexpire
"value1"
127.0.0.1:6379>
127.0.0.1:6379>
127.0.0.1:6379> expire testexpire 60
(integer) 1
127.0.0.1:6379>
127.0.0.1:6379> ttl testexpire
(integer) 56
127.0.0.1:6379>
127.0.0.1:6379> set testexpire value2
OK
127.0.0.1:6379> get testexpire
"value2"
127.0.0.1:6379>
127.0.0.1:6379> ttl testexpire
(integer) -1
127.0.0.1:6379> ttl testexpire
(integer) -1
127.0.0.1:6379> get testexpire
"value2"
127.0.0.1:6379> ttl testexpire
(integer) -1
127.0.0.1:6379> ttl testexpire
(integer) -1
127.0.0.1:6379>
127.0.0.1:6379>
127.0.0.1:6379> expire testexpire 10
(integer) 1
127.0.0.1:6379> ttl testexpire
(integer) 8
127.0.0.1:6379> ttl testexpire
(integer) 5
127.0.0.1:6379> ttl testexpire
(integer) 1
127.0.0.1:6379> ttl testexpire
(integer) -2
127.0.0.1:6379> get testexpire
(nil)
127.0.0.1:6379>

In the test above, TTL key returns -1 when the key will never expire and -2 when it has already expired; a positive number is the remaining time. Also note that once the key is overwritten with a new SET, the expire previously attached to it is cleared.

Note 1. Keys that have an expire set are called volatile (the term shows up again later when we discuss maxmemory-policy). Expired keys are deleted in two ways: passively and actively.

The passive way: when a client tries to access a key, it is found to be expired and deleted on the spot.

Of course this is not enough, because some expired keys are never accessed again.
So, 10 times per second, Redis:
1. samples 20 random keys from the set of keys that have an expire;
2. deletes all of the sampled keys that have expired;
3. if more than 25% of the sample had expired, repeats from step 1.

Note 2. Limitation of expire: it applies only to a whole key, not to part of a key's value. In other words, you can expire the "column" (key) but not a "row" inside it.

Note 3. How RDB handles expired keys: they have no effect on the RDB.

When persisting from memory to the RDB file:
each key is checked before it is persisted, and expired keys are not written to the RDB file.
When restoring from the RDB file into memory:
each key is checked before it is loaded, and expired keys are not imported (on a master).

Note 4. How AOF handles expired keys: they have no effect on the AOF.

When persisting from memory to the AOF file:
if a key has expired but has not yet been deleted, nothing special happens (the key does not enter the AOF, because no write command has been issued for it);
when the expired key is actually deleted, a DEL command is appended to the AOF (so that replaying the AOF later also removes it).
AOF rewrite:
during a rewrite each key is checked first, and keys that have already expired are not written into the new AOF.

If keys never expire, then once maxmemory is reached every command that would grow memory returns an error. On 64-bit systems the default maxmemory of 0 means no limit on Redis memory use; on 32-bit systems there is an implicit limit of 3 GB. So on a 64-bit system the default is a dangerous value.

When memory usage reaches maxmemory, memory is reclaimed according to the configured maxmemory-policy (a configuration sketch follows the list below).
maxmemory-policy can be set to:
1. noeviction: return an error when the memory limit has been reached and the client tries to execute a command that could result in more memory being used (most write commands, with DEL and a few other exceptions).
2. allkeys-lru: evict the least recently used (LRU) keys to make room for newly added data.
3. volatile-lru: evict the least recently used (LRU) keys, but only among keys that have an expire set, to make room for newly added data.
4. allkeys-random: evict random keys to make room for newly added data.
5. volatile-random: evict random keys, but only among keys that have an expire set, to make room for newly added data.
6. volatile-ttl: evict keys with an expire set, preferring those with the shortest time to live (TTL), to make room for newly added data.
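A small configuration sketch (assuming a local instance and the roughly-800 MB budget used later in the install section; the same values can of course go into redis.conf):

# cap memory at ~800 MB and evict with approximated LRU across all keys
redis-cli config set maxmemory 838860800
redis-cli config set maxmemory-policy allkeys-lru
redis-cli config get maxmemory-policy
# evicted_keys in INFO stats counts how many keys have been evicted so far
redis-cli info stats | grep evicted_keys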

Note: Redis uses an approximated LRU algorithm; the sample size is set with maxmemory-samples (e.g. maxmemory-samples 5). For the same maxmemory-samples setting, the approximated LRU in newer Redis versions is much better than in older ones.

4. Installing Redis.
4.1 Host configuration (note: for Linux 7):
4.1.1. Use at least ext4 as the filesystem; xfs is even better.

4.1.2. Disable NUMA, and disable the atime option on the filesystem/partition that hosts redis.

4.1.3. For non-SSD storage set the I/O scheduler to deadline; for SSD use noop.

4.1.4. Tune the kernel.
4.1.4.1 Check the tuned profile the OS is currently using:
cat /etc/tuned/active_profile
virtual-guest

4.1.4.2. Create a directory for a redis-specific tuned profile:
mkdir /etc/tuned/for_redis

4.1.4.3. Copy the system's current default tuned profile into for_redis:
cp /usr/lib/tuned/virtual-guest/tuned.conf /etc/tuned/for_redis/

4.1.4.4. Edit /etc/tuned/for_redis/tuned.conf:
[main]
include=throughput-performance

[vm]
transparent_hugepages=never

[sysctl]
vm.dirty_ratio = 30
vm.swappiness = 30
vm.overcommit_memory = 1

net.core.somaxconn = 65535

4.1.4.5. Activate the for_redis tuned profile:
tuned-adm profile for_redis

4.1.4.6. Reboot the host.

4.2 Download and unpack redis:
mkdir /root/redis_install
cd /root/redis_install
wget http://download.redis.io/releases/redis-3.2.9.tar.gz
tar -zxvf /root/redis_install/redis-3.2.9.tar.gz
cd /root/redis_install/redis-3.2.9
make
make test
Note: if make test fails with "You need tcl 8.5 or newer in order to run the Redis test", run yum install tcl first. When make test succeeds it reports that all tests passed, and you can continue:
make install
mkdir -p /etc/redis
mkdir -p /var/redis
mkdir -p /var/redis/6379
cp /root/redis_install/redis-3.2.9/redis.conf /etc/redis/redis_6379.conf
Edit redis_6379.conf:

daemonize yes
pidfile /var/run/redis_6379.pid
logfile /var/log/redis_6379.log
dir /var/redis/6379
## Comment out the IP binding so that clients on other hosts can also connect to redis
# bind 127.0.0.1
## Set a password for remote access. This setup allows remote access, so bind 127.0.0.1 and protected-mode yes have been removed
requirepass "oracleblog"
## Rename dangerous commands so that the originals become errors
rename-command FLUSHDB "FLUSHDB_ORACLE_MASK"
rename-command FLUSHALL "FLUSHALL_ORACLE_MASK"
rename-command CONFIG "CONFIG_ORACLE_MASK"
##If the save values differ from the following, change them
save 900 1
save 300 10
save 60 10000
appendonly yes
##Note: tcp-backlog must be smaller than the somaxconn value set in the OS
tcp-backlog 511
##Set maxmemory to at most 40% of RAM (40% for redis, 40% for bgsave, 20% for the OS)
maxmemory 838860800
maxmemory-policy allkeys-lru
maxmemory-samples 5

Start:   redis-server /etc/redis/redis_6379.conf
Connect: redis-cli -a oracleblog -h 192.168.56.108 -p 6379 (note: oracleblog is the password set with requirepass)
Stop:    192.168.56.108:6379> shutdown save

5. A single redis instance contains 16 databases by default. You can switch between them with SELECT, and MOVE transfers a key to another database.

127.0.0.1:6399> keys *
1) "col6"
2) "col2"
3) "col1"
4) "col5"
5) "col3"
6) "col4"
127.0.0.1:6399>
127.0.0.1:6399> select 1
OK
127.0.0.1:6399[1]> keys *
(empty list or set)
127.0.0.1:6399[1]>
127.0.0.1:6399[1]> set db1_col1 v1;
OK
127.0.0.1:6399[1]>
127.0.0.1:6399[1]>
127.0.0.1:6399[1]> keys *
1) "db1_col1"
127.0.0.1:6399[1]>
127.0.0.1:6399[1]>
127.0.0.1:6399[1]> select 0
OK
127.0.0.1:6399> move col2 1
(integer) 1
127.0.0.1:6399>
127.0.0.1:6399> select 1
OK
127.0.0.1:6399[1]> keys *
1) "col2"
2) "db1_col1"
127.0.0.1:6399[1]>

II. Redis Replication and Sentinel

1. Master-slave replication:
Let's set up a redis deployment with one master and two slaves, on three hosts and three ports (a cascading chain).
Master: 192.168.56.108 port 6379 --> Slave 1: 192.168.56.109 port 6380 --> Slave 2: 192.168.56.110 port 6381

Configuration on each host:

On the master (192.168.56.108), /etc/redis/redis_6379.conf:
no changes needed.

On slave 1 (192.168.56.109, port 6380), while it is running:
slaveof 192.168.56.108 6379
CONFIG_ORACLE_MASK set masterauth oracleblog
and also edit /etc/redis/redis_6380.conf:
slaveof 192.168.56.108 6379
masterauth oracleblog

On slave 2 (192.168.56.110, port 6381), while it is running:
slaveof 192.168.56.109 6380
CONFIG_ORACLE_MASK set masterauth oracleblog
and also edit /etc/redis/redis_6381.conf:
slaveof 192.168.56.109 6380
masterauth oracleblog

Note 1: when replication starts, the slave discards its old data (if any).
Note 2: in practice it is best to let the redis master use only 50%~60% of memory, leaving 30%~45% for bgsave.
Note 3: redis does not support master-master replication.
Note 4: redis does support cascading (chained) replication.
Note 5: in the INFO output, check whether aof_pending_bio_fsync is 0; if it is 0, master-slave sync is healthy. aof_pending_bio_fsync means the number of fsync jobs pending in the background I/O queue. (A quick replication health check is sketched below.)
Note 6: according to tests published online, enabling replication lowers TPS compared with a standalone instance; with 100 concurrent clients the drop is roughly 30%, and response time rises from about 0.8 ms to 1.2 ms.
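A quick health check for the setup above (a sketch using the hosts, ports and password from this section):

# on the master: connected_slaves and each slave's offset
redis-cli -h 192.168.56.108 -p 6379 -a oracleblog info replication
# on a slave: master_link_status should be "up" and the lag small
redis-cli -h 192.168.56.109 -p 6380 -a oracleblog info replication | \
  egrep 'role|master_link_status|master_last_io_seconds_ago'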

How to replace a failed master (a command-level sketch follows the diagram):

(1) A -----------> B
(2) A(crash)     B
(3) A(crash)     B (run SAVE to generate an rdb)
(4) A(crash)     B -----(copy the rdb to host C)-----  C
(5) A(crash)     B-----(slaveof C port)------> C
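Roughly, in commands (a sketch: 192.168.56.111 stands in for the new host C and the paths are illustrative):

# on B, the surviving slave: make sure the dataset is on disk
redis-cli -h 192.168.56.109 -p 6380 -a oracleblog save
# copy the snapshot to the new host C
scp /var/redis/6380/dump.rdb 192.168.56.111:/var/redis/6379/
# start redis on C with that dump.rdb in its data dir, then repoint B at C
redis-cli -h 192.168.56.109 -p 6380 -a oracleblog slaveof 192.168.56.111 6379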

2. High availability with Sentinel:
Sentinel is a special mode of the redis executable; it can be started in either of two ways:
redis-sentinel /path/to/sentinel.conf

redis-server /path/to/sentinel.conf --sentinel

How Sentinel works:

1. The Sentinel cluster discovers the master from its configuration file and starts monitoring it. By sending INFO to the master it learns about all of the slaves underneath it.
2. Over its command connections, each Sentinel sends a hello message to the monitored masters and slaves once per second; the message contains the Sentinel's own IP, port, id, etc., announcing its existence to the other Sentinels.
3. Over its subscription connections, each Sentinel receives the hello messages sent by other Sentinels and thereby discovers the other Sentinels watching the same master. The Sentinels then create command connections to one another for communication; since the monitored masters and slaves already act as the medium for exchanging hello messages, Sentinels do not create subscription connections to each other.
4. Sentinels use PING to check the state of each instance; if there is no reply (or an invalid reply) within the configured time (down-after-milliseconds), the instance is judged to be down.
5. When a failover is triggered it does not start immediately: the Sentinel that wants to perform it must first be authorized by a majority of Sentinels, i.e. it must obtain the votes of at least quorum Sentinels, after which the master enters the ODOWN state. For example, with 5 Sentinels and quorum set to 2, the failover proceeds once 2 Sentinels consider the master dead.
6. The Sentinel sends SLAVEOF NO ONE to the slave chosen to become the new master. Slaves are ranked first by priority (lower values rank higher); if priorities are equal, by replication offset (the slave that has received more data from the master ranks higher); if both are equal, the one with the smaller run ID is chosen.
7. The authorized Sentinel obtains a new configuration version (config-epoch) for the failed-over master; when the failover completes, this version is attached to the new configuration and broadcast to the other Sentinels, which then update their view of that master.

● Subjectively Down (SDOWN) is the down state a single Sentinel instance assigns to a server on its own.
● Objectively Down (ODOWN) is the down state reached when several Sentinels all judge the same server as SDOWN and confirm it with each other via the SENTINEL is-master-down-by-addr command.

Note: the ODOWN condition applies only to masters.
Steps 1 to 3 form the auto-discovery mechanism:
every 10 seconds, send INFO to each monitored master and read its current state from the reply;
every second, send PING to every redis server, Sentinels included, and use the reply to decide whether it is alive;
every 2 seconds, publish the current Sentinel/master information to every monitored master and slave.
Step 4 is the detection mechanism, steps 5 and 6 are the failover mechanism, and step 7 is the configuration-update mechanism.

#Sentinel listens on port 26379; this is the default Sentinel port and can be changed.
port 26379
#Monitor a master named mymaster (an arbitrary name) at 192.168.56.108:6379; the trailing 2 is the quorum: how many Sentinels in the cluster must consider the master dead before it is truly treated as unavailable.
sentinel monitor mymaster 192.168.56.108 6379 2
#Sentinel PINGs the master to check that it is alive. If the master does not answer PONG (or answers with an error) within a certain window, this Sentinel subjectively (unilaterally) considers the master unavailable (subjectively down, SDOWN). down-after-milliseconds defines that window in milliseconds; the default is 30 seconds.
sentinel down-after-milliseconds mymaster 60000
#failover-timeout is the failover timeout in milliseconds; among other things it bounds how long a failover may take before it is considered failed and how long Sentinel waits before retrying a failover against the same master.
sentinel failover-timeout mymaster 180000
#During a failover, this option limits how many slaves may resynchronize against the new master at the same time. The smaller the number, the longer the whole failover takes; the larger it is, the more slaves become unavailable to serve requests while they resync. Setting it to 1 guarantees that only one slave at a time is unable to serve commands.
sentinel parallel-syncs mymaster 1
sentinel auth-pass mymaster oracleblog
## Following parameters added by Jimmy
daemonize yes
logfile /var/log/redis_sentinel_6379.log
pidfile /var/run/redis_sentinel_6379.pid
# bind 127.0.0.1  -- note: do not set bind, because the Sentinels must talk to each other to reach quorum
dbfilename dump_sentinel.rdb
#Note: every Sentinel must have a different myid, otherwise they ignore each other's existence
sentinel myid be167fc5c77a14ef53996d367e237d3cc33a53b6
#Note: the rename-command entries must be reverted, because Sentinel uses these commands; otherwise it can detect a node failure but cannot perform the switch.
#rename-command FLUSHDB "FLUSHDB_ORACLE_MASK"
#rename-command FLUSHALL "FLUSHALL_ORACLE_MASK"
#rename-command CONFIG "CONFIG_ORACLE_MASK"

Note: running a single Sentinel is not recommended, because:
1: with several Sentinels, the master/slave switch can still happen even if some Sentinel processes die;
2: with only one Sentinel, a crash of that process (or a network problem on its host) means no failover can ever happen (a single point of failure);
3: with several Sentinels, redis clients can connect to any one of them to obtain information about the redis deployment.

Implementation steps:
Again we configure one master and two slaves on three hosts and three ports.

Master: 192.168.56.108 port 6379 --> Slave 1: 192.168.56.109 port 6380
       |
       |
       L--> Slave 2: 192.168.56.110 port 6381

1. cp /root/redis_install/redis-3.2.9/sentinel.conf /etc/redis/sentinel_6379.conf
2. Apply the configuration changes described above
3. redis-sentinel /etc/redis/sentinel_6379.conf
4. Repeat the steps above on the other nodes
5. On the master run redis-cli -p 26379 ping, on slave 1 run redis-cli -p 26380 ping, and on slave 2 run redis-cli -p 26381 ping; each should return PONG
6. On slave 1 run redis-cli -p 26380 sentinel masters. In the output below, note the num-slaves and num-other-sentinels fields: this Sentinel sees 2 slaves and is aware of 2 other Sentinels:

127.0.0.1:26379> sentinel masters
1)  1) "name"
    2) "mymaster"
    3) "ip"
    4) "192.168.56.108"
    5) "port"
    6) "6379"
    7) "runid"
    8) "357e8a88b9ae5aeb325f16239aac20ec965ae167"
    9) "flags"
   10) "master"
   11) "link-pending-commands"
   12) "0"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "0"
   17) "last-ok-ping-reply"
   18) "970"
   19) "last-ping-reply"
   20) "970"
   21) "down-after-milliseconds"
   22) "30000"
   23) "info-refresh"
   24) "9971"
   25) "role-reported"
   26) "master"
   27) "role-reported-time"
   28) "713185"
   29) "config-epoch"
   30) "0"
   31) "num-slaves"
   32) "2"
   33) "num-other-sentinels"
   34) "2"
   35) "quorum"
   36) "2"
   37) "failover-timeout"
   38) "180000"
   39) "parallel-syncs"
   40) "1"
127.0.0.1:26379>

7. On slave 1 run sentinel slaves mymaster:

[root@redis02 ~]# redis-cli -p 26380 sentinel slaves mymaster
1)  1) "name"
    2) "192.168.56.110:6381"
    3) "ip"
    4) "192.168.56.110"
    5) "port"
    6) "6381"
    7) "runid"
    8) "aaa02dfaf12a59bba2b84c0b2b1bc6fb13ae76e2"
    9) "flags"
   10) "slave"
   11) "link-pending-commands"
   12) "0"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "0"
   17) "last-ok-ping-reply"
   18) "373"
   19) "last-ping-reply"
   20) "373"
   21) "down-after-milliseconds"
   22) "30000"
   23) "info-refresh"
   24) "8707"
   25) "role-reported"
   26) "slave"
   27) "role-reported-time"
   28) "500482"
   29) "master-link-down-time"
   30) "0"
   31) "master-link-status"
   32) "ok"
   33) "master-host"
   34) "192.168.56.108"
   35) "master-port"
   36) "6379"
   37) "slave-priority"
   38) "100"
   39) "slave-repl-offset"
   40) "274239"
2)  1) "name"
    2) "192.168.56.109:6380"
    3) "ip"
    4) "192.168.56.109"
    5) "port"
    6) "6380"
    7) "runid"
    8) "860e99210956e311ad522e7f5c582e7ba50d09bd"
    9) "flags"
   10) "slave"
   11) "link-pending-commands"
   12) "0"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "0"
   17) "last-ok-ping-reply"
   18) "373"
   19) "last-ping-reply"
   20) "373"
   21) "down-after-milliseconds"
   22) "30000"
   23) "info-refresh"
   24) "8707"
   25) "role-reported"
   26) "slave"
   27) "role-reported-time"
   28) "500482"
   29) "master-link-down-time"
   30) "0"
   31) "master-link-status"
   32) "ok"
   33) "master-host"
   34) "192.168.56.108"
   35) "master-port"
   36) "6379"
   37) "slave-priority"
   38) "100"
   39) "slave-repl-offset"
   40) "274239"
[root@redis02 ~]#

8. On slave 1 run sentinel get-master-addr-by-name:

[root@redis02 ~]# redis-cli -p 26380 sentinel get-master-addr-by-name mymaster
1) "192.168.56.108"
2) "6379"
[root@redis02 ~]#

9. Kill the redis master process, or run the sentinel failover command:

In the log on node 1 you can see:

2824:X 27 Jun 05:58:05.551 # +new-epoch 17
2824:X 27 Jun 05:58:05.551 # +config-update-from sentinel be167fc5c77a14ef53996d367e237d3cc33a53b6 192.168.56.109 26380 @ mymaster 192.168.56.108 6379
2824:X 27 Jun 05:58:05.551 # +switch-master mymaster 192.168.56.108 6379 192.168.56.109 6380
2824:X 27 Jun 05:58:05.551 * +slave slave 192.168.56.110:6381 192.168.56.110 6381 @ mymaster 192.168.56.109 6380
2824:X 27 Jun 05:58:05.551 * +slave slave 192.168.56.108:6379 192.168.56.108 6379 @ mymaster 192.168.56.109 6380
2824:X 27 Jun 05:58:15.594 * +convert-to-slave slave 192.168.56.108:6379 192.168.56.108 6379 @ mymaster 192.168.56.109 6380

In the log on node 2 you can see:

3047:X 27 Jun 05:58:05.272 # Executing user requested FAILOVER of 'mymaster'
3047:X 27 Jun 05:58:05.273 # +new-epoch 17
3047:X 27 Jun 05:58:05.273 # +try-failover master mymaster 192.168.56.108 6379
3047:X 27 Jun 05:58:05.285 # +vote-for-leader be167fc5c77a14ef53996d367e237d3cc33a53b6 17
3047:X 27 Jun 05:58:05.285 # +elected-leader master mymaster 192.168.56.108 6379
3047:X 27 Jun 05:58:05.285 # +failover-state-select-slave master mymaster 192.168.56.108 6379
3047:X 27 Jun 05:58:05.362 # +selected-slave slave 192.168.56.109:6380 192.168.56.109 6380 @ mymaster 192.168.56.108 6379
3047:X 27 Jun 05:58:05.362 * +failover-state-send-slaveof-noone slave 192.168.56.109:6380 192.168.56.109 6380 @ mymaster 192.168.56.108 6379
3047:X 27 Jun 05:58:05.424 * +failover-state-wait-promotion slave 192.168.56.109:6380 192.168.56.109 6380 @ mymaster 192.168.56.108 6379
3047:X 27 Jun 05:58:06.316 # +promoted-slave slave 192.168.56.109:6380 192.168.56.109 6380 @ mymaster 192.168.56.108 6379
3047:X 27 Jun 05:58:06.316 # +failover-state-reconf-slaves master mymaster 192.168.56.108 6379
3047:X 27 Jun 05:58:06.363 * +slave-reconf-sent slave 192.168.56.110:6381 192.168.56.110 6381 @ mymaster 192.168.56.108 6379
3047:X 27 Jun 05:58:07.026 * +slave-reconf-inprog slave 192.168.56.110:6381 192.168.56.110 6381 @ mymaster 192.168.56.108 6379
3047:X 27 Jun 05:58:08.050 * +slave-reconf-done slave 192.168.56.110:6381 192.168.56.110 6381 @ mymaster 192.168.56.108 6379
3047:X 27 Jun 05:58:08.117 # +failover-end master mymaster 192.168.56.108 6379
3047:X 27 Jun 05:58:08.117 # +switch-master mymaster 192.168.56.108 6379 192.168.56.109 6380
3047:X 27 Jun 05:58:08.117 * +slave slave 192.168.56.110:6381 192.168.56.110 6381 @ mymaster 192.168.56.109 6380
3047:X 27 Jun 05:58:08.117 * +slave slave 192.168.56.108:6379 192.168.56.108 6379 @ mymaster 192.168.56.109 6380

In the log on node 3 you can see:

3018:X 27 Jun 05:58:06.395 # +new-epoch 17
3018:X 27 Jun 05:58:06.395 # +config-update-from sentinel be167fc5c77a14ef53996d367e237d3cc33a53b6 192.168.56.109 26380 @ mymaster 192.168.56.108 6379
3018:X 27 Jun 05:58:06.395 # +switch-master mymaster 192.168.56.108 6379 192.168.56.109 6380
3018:X 27 Jun 05:58:06.396 * +slave slave 192.168.56.110:6381 192.168.56.110 6381 @ mymaster 192.168.56.109 6380
3018:X 27 Jun 05:58:06.396 * +slave slave 192.168.56.108:6379 192.168.56.108 6379 @ mymaster 192.168.56.109 6380

III. Redis Sharding and the Cluster High-Availability Architecture

Redis high availability plus sharding is provided by Redis Cluster. In general, when a single instance or a master-slave pair can no longer keep up with the workload (for example the single CPU core is saturated, or memory usage is too high), we split redis into multiple shards.

1. Redis sharding approaches:
1.1 Native redis sharding:


1.2 Proxy-based sharding:


1.3 Application-side sharding:

What we discuss here is native redis sharding, i.e. Redis Cluster. It supports a multi-master, multi-slave layout; the slaves exist so that a master can be failed over when it goes down, and this failover is handled inside the cluster itself, with no extra Sentinel needed. Note: a redis cluster needs at least six nodes, but not necessarily six separate servers; if you have fewer servers you can run several nodes per server, for example three nodes on each of two servers.

Also, the official redis-trib.rb tool that ships with Redis Cluster does not support passwords, so do not set a password until the cluster has been created.

Redis Cluster does not guarantee strong consistency. The first reason writes can be lost is that replication between master and slave nodes is asynchronous.

A minimal cluster needs at least 3 master nodes; the recommended minimum deployment is 6 nodes: 3 masters and 3 slaves.

2. Installing redis cluster:
The cluster management tool is redis-trib.
2.1. redis-trib needs a ruby runtime:
yum -y install ruby

2.2. Next install rubygems, which is used to find, install, upgrade and remove ruby packages:
yum -y install rubygems

2.3. Then install the ruby redis client through gem:
gem install redis

This step may fail, usually because the official gem repository is unreachable from China; in that case switch the gem source to a domestic mirror such as the Taobao RubyGems mirror.
The commands to switch sources are:

# gem sources -l
# gem sources --remove http://rubygems.org/
# gem sources -a http://ruby.taobao.org/
# gem sources -l

# gem install redis

2.4. Edit redis.conf on each host and add the cluster options:

Cluster configuration:
cluster-enabled <yes/no>: if set to "yes", cluster support is enabled and this redis instance becomes a cluster node; otherwise it is an ordinary standalone instance.
cluster-config-file <filename>: note that despite its name this "cluster configuration file" must not be edited by hand. It is maintained automatically by the cluster node and records which nodes exist, their state, and some persistent parameters, so that the state can be restored after a restart. It is typically rewritten whenever the node receives cluster messages.
cluster-node-timeout <milliseconds>: the maximum time a node may be unreachable before it is considered failed. If a master is unreachable for longer than this, its slaves start a failover and one of them is promoted to master. Also note that any node that cannot reach the majority of masters within this time stops accepting requests.
cluster-slave-validity-factor <factor>: if set to 0, a slave always tries to fail over no matter how long it has been disconnected from its master. If set to a positive number, cluster-node-timeout multiplied by this factor is the maximum time a slave's data is considered valid after losing contact with its master; past that, the slave will not attempt a failover. For example, with cluster-node-timeout=5 and cluster-slave-validity-factor=10, a slave that has been disconnected from its master for more than 50 seconds cannot become a master. Note that with a non-zero value it is possible that a failed master ends up with no slave eligible to take over, in which case the cluster stops working until the original master rejoins.
cluster-migration-barrier <count>: the minimum number of slaves a master must keep; only when a master has more than this many slaves will one of them migrate to a master that has none. See the section on replica migration later in this guide for more details.
cluster-require-full-coverage <yes/no>: if set to "yes" (the default), the whole cluster stops accepting operations when the slots of some keys are not served by any reachable node; if set to "no", the cluster keeps serving reads for the keys on the nodes that are still reachable.

cluster-enabled yes
##Note: this file name is different on each instance
cluster-config-file nodes-6379.conf
##A node is considered down after a 5000 ms timeout. In very large clusters (close to 1000 redis instances) inter-node traffic uses a lot of bandwidth; raising cluster-node-timeout reduces it.
cluster-node-timeout 5000
##Allow connections from any network
#bind 127.0.0.1
##Disable protected mode
protected-mode no
##Remove the password:
#masterauth oracleblog
#requirepass "oracleblog"

2.5. Start six instances on the three hosts:
192.168.56.108 : redis_6379.conf + redis_6389.conf
192.168.56.109 : redis_6380.conf + redis_6390.conf
192.168.56.110 : redis_6381.conf + redis_6391.conf

2.6. Create the cluster:

cp -p /root/redis_install/redis-3.2.9/src/redis-trib.rb /usr/local/bin/
[root@redis01 6379]# cd /usr/local/bin/
[root@redis01 bin]#
[root@redis01 bin]# ./redis-trib.rb create --replicas 1 192.168.56.108:6379 192.168.56.109:6380 192.168.56.110:6381 192.168.56.108:6389 192.168.56.109:6390 192.168.56.110:6391
>>> Creating cluster
>>> Performing hash slots allocation on 6 nodes...
Using 3 masters:
192.168.56.108:6379
192.168.56.109:6380
192.168.56.110:6381
Adding replica 192.168.56.109:6390 to 192.168.56.108:6379
Adding replica 192.168.56.108:6389 to 192.168.56.109:6380
Adding replica 192.168.56.110:6391 to 192.168.56.110:6381
M: 50df897b5bad63a525a8d46998b30d47698d9cd9 192.168.56.108:6379
   slots:0-5460 (5461 slots) master
M: 00974c9c1acede227f1ef25fd56460a1a19818a0 192.168.56.109:6380
   slots:5461-10922 (5462 slots) master
M: 632f31d57d6fcbf48e277ed8cd34299188d2c675 192.168.56.110:6381
   slots:10923-16383 (5461 slots) master
S: e73dc8caf474076fdbeb4da346333bc8410c8486 192.168.56.108:6389
   replicates 00974c9c1acede227f1ef25fd56460a1a19818a0
S: 9e112eed7f8a5830e907d97792c50a2171d9f13b 192.168.56.109:6390
   replicates 50df897b5bad63a525a8d46998b30d47698d9cd9
S: 3005c5adea38cc21cf47fb86cbe1e8cfd1cbfce7 192.168.56.110:6391
   replicates 632f31d57d6fcbf48e277ed8cd34299188d2c675
Can I set the above configuration? (type 'yes' to accept): yes
>>> Nodes configuration updated
>>> Assign a different config epoch to each node
>>> Sending CLUSTER MEET messages to join the cluster
Waiting for the cluster to join...
>>> Performing Cluster Check (using node 192.168.56.108:6379)
M: 50df897b5bad63a525a8d46998b30d47698d9cd9 192.168.56.108:6379
   slots:0-5460 (5461 slots) master
   1 additional replica(s)
S: 9e112eed7f8a5830e907d97792c50a2171d9f13b 192.168.56.109:6390
   slots: (0 slots) slave
   replicates 50df897b5bad63a525a8d46998b30d47698d9cd9
M: 00974c9c1acede227f1ef25fd56460a1a19818a0 192.168.56.109:6380
   slots:5461-10922 (5462 slots) master
   1 additional replica(s)
S: e73dc8caf474076fdbeb4da346333bc8410c8486 192.168.56.108:6389
   slots: (0 slots) slave
   replicates 00974c9c1acede227f1ef25fd56460a1a19818a0
S: 3005c5adea38cc21cf47fb86cbe1e8cfd1cbfce7 192.168.56.110:6391
   slots: (0 slots) slave
   replicates 632f31d57d6fcbf48e277ed8cd34299188d2c675
M: 632f31d57d6fcbf48e277ed8cd34299188d2c675 192.168.56.110:6381
   slots:10923-16383 (5461 slots) master
   1 additional replica(s)
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
[root@redis01 bin]#

--replicas specifies how many slaves each master gets: --replicas 1 means one slave is automatically assigned to every master. With the 6 nodes above, the tool generates 3 masters and 3 slaves according to its placement rules.

Note: if you hit the error "ERR Slot xxxx is already busy (Redis::CommandError)", fix it as follows:

1. Delete all nodes-xxxx.conf files
2. redis-cli -p xxxx flushall
3. redis-cli -p xxxx cluster reset soft

Let's insert some data:
1. First check which nodes are masters:

[root@redis01 bin]# ./redis-trib.rb check 192.168.56.108:6379
>>> Performing Cluster Check (using node 192.168.56.108:6379)
S: 50df897b5bad63a525a8d46998b30d47698d9cd9 192.168.56.108:6379
   slots: (0 slots) slave
   replicates 9e112eed7f8a5830e907d97792c50a2171d9f13b
M: 632f31d57d6fcbf48e277ed8cd34299188d2c675 192.168.56.110:6381
   slots:10923-16383 (5461 slots) master
   1 additional replica(s)
M: 00974c9c1acede227f1ef25fd56460a1a19818a0 192.168.56.109:6380
   slots:5461-10922 (5462 slots) master
   1 additional replica(s)
S: 3005c5adea38cc21cf47fb86cbe1e8cfd1cbfce7 192.168.56.110:6391
   slots: (0 slots) slave
   replicates 632f31d57d6fcbf48e277ed8cd34299188d2c675
M: 9e112eed7f8a5830e907d97792c50a2171d9f13b 192.168.56.109:6390
   slots:0-5460 (5461 slots) master
   1 additional replica(s)
S: e73dc8caf474076fdbeb4da346333bc8410c8486 192.168.56.108:6389
   slots: (0 slots) slave
   replicates 00974c9c1acede227f1ef25fd56460a1a19818a0
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
[root@redis01 bin]#

Or the following command works as well:

[root@redis01 bin]# redis-cli -p 6389 cluster nodes
9e112eed7f8a5830e907d97792c50a2171d9f13b 192.168.56.109:6390 master - 0 1498756230880 7 connected 0-5460
00974c9c1acede227f1ef25fd56460a1a19818a0 192.168.56.109:6380 master - 0 1498756231884 2 connected 5461-10922
50df897b5bad63a525a8d46998b30d47698d9cd9 192.168.56.108:6379 slave 9e112eed7f8a5830e907d97792c50a2171d9f13b 0 1498756231381 7 connected
632f31d57d6fcbf48e277ed8cd34299188d2c675 192.168.56.110:6381 master - 0 1498756230377 3 connected 10923-16383
e73dc8caf474076fdbeb4da346333bc8410c8486 192.168.56.108:6389 myself,slave 00974c9c1acede227f1ef25fd56460a1a19818a0 0 0 4 connected
3005c5adea38cc21cf47fb86cbe1e8cfd1cbfce7 192.168.56.110:6391 slave 632f31d57d6fcbf48e277ed8cd34299188d2c675 0 1498756231381 6 connected
[root@redis01 bin]#

2. Log in to 192.168.56.109:6390 and run some commands:

[root@redis01 bin]# redis-cli -h 192.168.56.109 -p 6390 -c
192.168.56.109:6390> keys *
(empty list or set)
192.168.56.109:6390> set name myname1
-> Redirected to slot [5798] located at 192.168.56.109:6380
OK
192.168.56.109:6380>

Note that redis-cli must be started with the -c flag here, otherwise you get an error:

192.168.56.108:6379> set name myname1
(error) MOVED 5798 192.168.56.109:6380
192.168.56.108:6379>

3. Let's try adding nodes:
3.1. First start 2 more redis instances, with the same parameters as the instances already running.
3.2. Add one instance to the cluster; note that it joins as a master node.

[root@redis01 bin]# pwd
/usr/local/bin
[root@redis01 bin]# ./redis-trib.rb add-node 192.168.56.108:6399 192.168.56.108:6379
>>> Adding node 192.168.56.108:6399 to cluster 192.168.56.108:6379
>>> Performing Cluster Check (using node 192.168.56.108:6379)
S: 50df897b5bad63a525a8d46998b30d47698d9cd9 192.168.56.108:6379
   slots: (0 slots) slave
   replicates 9e112eed7f8a5830e907d97792c50a2171d9f13b
S: e73dc8caf474076fdbeb4da346333bc8410c8486 192.168.56.108:6389
   slots: (0 slots) slave
   replicates 00974c9c1acede227f1ef25fd56460a1a19818a0
M: 00974c9c1acede227f1ef25fd56460a1a19818a0 192.168.56.109:6380
   slots:5461-10922 (5462 slots) master
   1 additional replica(s)
S: 3005c5adea38cc21cf47fb86cbe1e8cfd1cbfce7 192.168.56.110:6391
   slots: (0 slots) slave
   replicates 632f31d57d6fcbf48e277ed8cd34299188d2c675
M: 632f31d57d6fcbf48e277ed8cd34299188d2c675 192.168.56.110:6381
   slots:10923-16383 (5461 slots) master
   1 additional replica(s)
M: 9e112eed7f8a5830e907d97792c50a2171d9f13b 192.168.56.109:6390
   slots:0-5460 (5461 slots) master
   1 additional replica(s)
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
>>> Send CLUSTER MEET to node 192.168.56.108:6399 to make it join the cluster.
[OK] New node added correctly.
[root@redis01 bin]#

[root@redis01 bin]#   redis-cli -p 6389 cluster nodes |grep 6399
df5bf5d030453acddd4db106fda76a1d1687a22f 192.168.56.108:6399 master - 0 1498814734265 0 connected
[root@redis01 bin]#

Note the id df5bf5d030453acddd4db106fda76a1d1687a22f; we will use it in a moment.

Add the slave node; here we pass the master id we just obtained:

[root@redis01 bin]# ./redis-trib.rb add-node --slave --master-id  df5bf5d030453acddd4db106fda76a1d1687a22f 192.168.56.109:6370 192.168.56.108:6399
>>> Adding node 192.168.56.109:6370 to cluster 192.168.56.108:6399
>>> Performing Cluster Check (using node 192.168.56.108:6399)
M: df5bf5d030453acddd4db106fda76a1d1687a22f 192.168.56.108:6399
   slots: (0 slots) master
   0 additional replica(s)
M: 632f31d57d6fcbf48e277ed8cd34299188d2c675 192.168.56.110:6381
   slots:10923-16383 (5461 slots) master
   1 additional replica(s)
S: 50df897b5bad63a525a8d46998b30d47698d9cd9 192.168.56.108:6379
   slots: (0 slots) slave
   replicates 9e112eed7f8a5830e907d97792c50a2171d9f13b
S: e73dc8caf474076fdbeb4da346333bc8410c8486 192.168.56.108:6389
   slots: (0 slots) slave
   replicates 00974c9c1acede227f1ef25fd56460a1a19818a0
S: 3005c5adea38cc21cf47fb86cbe1e8cfd1cbfce7 192.168.56.110:6391
   slots: (0 slots) slave
   replicates 632f31d57d6fcbf48e277ed8cd34299188d2c675
M: 9e112eed7f8a5830e907d97792c50a2171d9f13b 192.168.56.109:6390
   slots:0-5460 (5461 slots) master
   1 additional replica(s)
M: 00974c9c1acede227f1ef25fd56460a1a19818a0 192.168.56.109:6380
   slots:5461-10922 (5462 slots) master
   1 additional replica(s)
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
>>> Send CLUSTER MEET to node 192.168.56.109:6370 to make it join the cluster.
Waiting for the cluster to join...
>>> Configure node as replica of 192.168.56.108:6399.
[OK] New node added correctly.
[root@redis01 bin]#

4. Adding nodes does not redistribute any data by itself; we need to reshard.

The reshard command:
Interactive:
./redis-trib.rb reshard [host]:[port]
Non-interactive (an example sketch follows):
./redis-trib.rb reshard --from [node-id] --to [node-id] --slots [number of slots] --yes [host]:[port]
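For example, the interactive session below could equally be run in one shot (a sketch; the two node ids are placeholders you take from the cluster nodes output):

# move 4096 slots from <src-id> to <dst-id> without any prompts
./redis-trib.rb reshard --from <src-id> --to <dst-id> \
    --slots 4096 --yes 192.168.56.108:6379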

Note the masters below that hold 0 slots:

[root@redis01 bin]# ./redis-trib.rb check 192.168.56.108:6379
>>> Performing Cluster Check (using node 192.168.56.108:6379)
S: 50df897b5bad63a525a8d46998b30d47698d9cd9 192.168.56.108:6379
   slots: (0 slots) slave
   replicates 9e112eed7f8a5830e907d97792c50a2171d9f13b
M: df5bf5d030453acddd4db106fda76a1d1687a22f 192.168.56.108:6399
   slots: (0 slots) master
   1 additional replica(s)
S: e73dc8caf474076fdbeb4da346333bc8410c8486 192.168.56.108:6389
   slots: (0 slots) slave
   replicates 00974c9c1acede227f1ef25fd56460a1a19818a0
M: 00974c9c1acede227f1ef25fd56460a1a19818a0 192.168.56.109:6380
   slots:5461-10922 (5462 slots) master
   1 additional replica(s)
S: 3005c5adea38cc21cf47fb86cbe1e8cfd1cbfce7 192.168.56.110:6391
   slots: (0 slots) slave
   replicates 632f31d57d6fcbf48e277ed8cd34299188d2c675
M: 632f31d57d6fcbf48e277ed8cd34299188d2c675 192.168.56.110:6381
   slots:10923-16383 (5461 slots) master
   1 additional replica(s)
M: 9e112eed7f8a5830e907d97792c50a2171d9f13b 192.168.56.109:6390
   slots:0-5460 (5461 slots) master
   1 additional replica(s)
S: a724660e17bf5dbfd7266f33ed37d5eb952dd3d0 192.168.56.109:6370
   slots: (0 slots) slave
   replicates df5bf5d030453acddd4db106fda76a1d1687a22f
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
[root@redis01 bin]#

[root@redis01 bin]# ./redis-trib.rb reshard 192.168.56.108:6399
>>> Performing Cluster Check (using node 192.168.56.108:6399)
M: df5bf5d030453acddd4db106fda76a1d1687a22f 192.168.56.108:6399
   slots: (0 slots) master
   1 additional replica(s)
M: 632f31d57d6fcbf48e277ed8cd34299188d2c675 192.168.56.110:6381
   slots:10923-16383 (5461 slots) master
   1 additional replica(s)
S: 50df897b5bad63a525a8d46998b30d47698d9cd9 192.168.56.108:6379
   slots: (0 slots) slave
   replicates 9e112eed7f8a5830e907d97792c50a2171d9f13b
S: e73dc8caf474076fdbeb4da346333bc8410c8486 192.168.56.108:6389
   slots: (0 slots) slave
   replicates 00974c9c1acede227f1ef25fd56460a1a19818a0
S: 3005c5adea38cc21cf47fb86cbe1e8cfd1cbfce7 192.168.56.110:6391
   slots: (0 slots) slave
   replicates 632f31d57d6fcbf48e277ed8cd34299188d2c675
M: 9e112eed7f8a5830e907d97792c50a2171d9f13b 192.168.56.109:6390
   slots:0-5460 (5461 slots) master
   1 additional replica(s)
M: 00974c9c1acede227f1ef25fd56460a1a19818a0 192.168.56.109:6380
   slots:5461-10922 (5462 slots) master
   1 additional replica(s)
S: a724660e17bf5dbfd7266f33ed37d5eb952dd3d0 192.168.56.109:6370
   slots: (0 slots) slave
   replicates df5bf5d030453acddd4db106fda76a1d1687a22f
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
How many slots do you want to move (from 1 to 16384)? 4096
What is the receiving node ID? df5bf5d030453acddd4db106fda76a1d1687a22f
Please enter all the source node IDs.
  Type 'all' to use all the nodes as source nodes for the hash slots.
  Type 'done' once you entered all the source nodes IDs.
Source node #1:all

Ready to move 4096 slots.
  Source nodes:
    M: 632f31d57d6fcbf48e277ed8cd34299188d2c675 192.168.56.110:6381
   slots:10923-16383 (5461 slots) master
   1 additional replica(s)
    M: 9e112eed7f8a5830e907d97792c50a2171d9f13b 192.168.56.109:6390
   slots:0-5460 (5461 slots) master
   1 additional replica(s)
    M: 00974c9c1acede227f1ef25fd56460a1a19818a0 192.168.56.109:6380
   slots:5461-10922 (5462 slots) master
   1 additional replica(s)
  Destination node:
    M: df5bf5d030453acddd4db106fda76a1d1687a22f 192.168.56.108:6399
   slots: (0 slots) master
   1 additional replica(s)
  Resharding plan:
    Moving slot 5461 from 00974c9c1acede227f1ef25fd56460a1a19818a0
    Moving slot 5462 from 00974c9c1acede227f1ef25fd56460a1a19818a0
    Moving slot 5463 from 00974c9c1acede227f1ef25fd56460a1a19818a0
    Moving slot 5464 from 00974c9c1acede227f1ef25fd56460a1a19818a0
    Moving slot 5465 from 00974c9c1acede227f1ef25fd56460a1a19818a0
    Moving slot 5466 from 00974c9c1acede227f1ef25fd56460a1a19818a0
    ……
    Moving slot 12285 from 192.168.56.110:6381 to 192.168.56.108:6399:
    Moving slot 12286 from 192.168.56.110:6381 to 192.168.56.108:6399:
    Moving slot 12287 from 192.168.56.110:6381 to 192.168.56.108:6399:
[root@redis01 bin]#

Check again after the reshard: each master now holds roughly 4096 slots (there are 16384 slots in total and 4 masters, so an even split gives 4096 slots each).

[root@redis01 bin]# ./redis-trib.rb check 192.168.56.108:6379
>>> Performing Cluster Check (using node 192.168.56.108:6379)
S: 50df897b5bad63a525a8d46998b30d47698d9cd9 192.168.56.108:6379
   slots: (0 slots) slave
   replicates 9e112eed7f8a5830e907d97792c50a2171d9f13b
M: df5bf5d030453acddd4db106fda76a1d1687a22f 192.168.56.108:6399
   slots:0-1364,5461-6826,10923-12287 (4096 slots) master
   1 additional replica(s)
S: e73dc8caf474076fdbeb4da346333bc8410c8486 192.168.56.108:6389
   slots: (0 slots) slave
   replicates 00974c9c1acede227f1ef25fd56460a1a19818a0
M: 00974c9c1acede227f1ef25fd56460a1a19818a0 192.168.56.109:6380
   slots:6827-10922 (4096 slots) master
   1 additional replica(s)
S: 3005c5adea38cc21cf47fb86cbe1e8cfd1cbfce7 192.168.56.110:6391
   slots: (0 slots) slave
   replicates 632f31d57d6fcbf48e277ed8cd34299188d2c675
M: 632f31d57d6fcbf48e277ed8cd34299188d2c675 192.168.56.110:6381
   slots:12288-16383 (4096 slots) master
   1 additional replica(s)
M: 9e112eed7f8a5830e907d97792c50a2171d9f13b 192.168.56.109:6390
   slots:1365-5460 (4096 slots) master
   1 additional replica(s)
S: a724660e17bf5dbfd7266f33ed37d5eb952dd3d0 192.168.56.109:6370
   slots: (0 slots) slave
   replicates df5bf5d030453acddd4db106fda76a1d1687a22f
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
[root@redis01 bin]#

5. Removing nodes works in the reverse order: remove the slave node, reshard the data away to the nodes that will remain, then remove the master node.
Remove the slave node:

[root@redis01 bin]# ./redis-trib.rb del-node 192.168.56.109:6370 'a724660e17bf5dbfd7266f33ed37d5eb952dd3d0'
>>> Removing node a724660e17bf5dbfd7266f33ed37d5eb952dd3d0 from cluster 192.168.56.109:6370
>>> Sending CLUSTER FORGET messages to the cluster...
>>> SHUTDOWN the node.
[root@redis01 bin]#
[root@redis01 bin]#

Reshard the data away from the node:

[root@redis01 bin]# ./redis-trib.rb check 192.168.56.108:6379
>>> Performing Cluster Check (using node 192.168.56.108:6379)
S: 50df897b5bad63a525a8d46998b30d47698d9cd9 192.168.56.108:6379
   slots: (0 slots) slave
   replicates 9e112eed7f8a5830e907d97792c50a2171d9f13b
M: df5bf5d030453acddd4db106fda76a1d1687a22f 192.168.56.108:6399
   slots:0-1364,5461-6826,10923-12287 (4096 slots) master
   0 additional replica(s)
S: e73dc8caf474076fdbeb4da346333bc8410c8486 192.168.56.108:6389
   slots: (0 slots) slave
   replicates 00974c9c1acede227f1ef25fd56460a1a19818a0
M: 00974c9c1acede227f1ef25fd56460a1a19818a0 192.168.56.109:6380
   slots:6827-10922 (4096 slots) master
   1 additional replica(s)
S: 3005c5adea38cc21cf47fb86cbe1e8cfd1cbfce7 192.168.56.110:6391
   slots: (0 slots) slave
   replicates 632f31d57d6fcbf48e277ed8cd34299188d2c675
M: 632f31d57d6fcbf48e277ed8cd34299188d2c675 192.168.56.110:6381
   slots:12288-16383 (4096 slots) master
   1 additional replica(s)
M: 9e112eed7f8a5830e907d97792c50a2171d9f13b 192.168.56.109:6390
   slots:1365-5460 (4096 slots) master
   1 additional replica(s)
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
[root@redis01 bin]#
[root@redis01 bin]#
[root@redis01 bin]# ./redis-trib.rb reshard 192.168.56.108:6399
>>> Performing Cluster Check (using node 192.168.56.108:6399)
M: df5bf5d030453acddd4db106fda76a1d1687a22f 192.168.56.108:6399
   slots:0-1364,5461-6826,10923-12287 (4096 slots) master
   0 additional replica(s)
M: 632f31d57d6fcbf48e277ed8cd34299188d2c675 192.168.56.110:6381
   slots:12288-16383 (4096 slots) master
   1 additional replica(s)
S: 50df897b5bad63a525a8d46998b30d47698d9cd9 192.168.56.108:6379
   slots: (0 slots) slave
   replicates 9e112eed7f8a5830e907d97792c50a2171d9f13b
S: e73dc8caf474076fdbeb4da346333bc8410c8486 192.168.56.108:6389
   slots: (0 slots) slave
   replicates 00974c9c1acede227f1ef25fd56460a1a19818a0
S: 3005c5adea38cc21cf47fb86cbe1e8cfd1cbfce7 192.168.56.110:6391
   slots: (0 slots) slave
   replicates 632f31d57d6fcbf48e277ed8cd34299188d2c675
M: 9e112eed7f8a5830e907d97792c50a2171d9f13b 192.168.56.109:6390
   slots:1365-5460 (4096 slots) master
   1 additional replica(s)
M: 00974c9c1acede227f1ef25fd56460a1a19818a0 192.168.56.109:6380
   slots:6827-10922 (4096 slots) master
   1 additional replica(s)
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
How many slots do you want to move (from 1 to 16384)? 4096
What is the receiving node ID? 9e112eed7f8a5830e907d97792c50a2171d9f13b //the master id of the receiving node
Please enter all the source node IDs.
  Type 'all' to use all the nodes as source nodes for the hash slots.
  Type 'done' once you entered all the source nodes IDs.
Source node #1:df5bf5d030453acddd4db106fda76a1d1687a22f //the master id of the node being removed
Source node #2:done

Check that the node now holds 0 slots:

[root@redis01 bin]# ./redis-trib.rb check 192.168.56.108:6379
>>> Performing Cluster Check (using node 192.168.56.108:6379)
S: 50df897b5bad63a525a8d46998b30d47698d9cd9 192.168.56.108:6379
   slots: (0 slots) slave
   replicates 9e112eed7f8a5830e907d97792c50a2171d9f13b
M: df5bf5d030453acddd4db106fda76a1d1687a22f 192.168.56.108:6399
   slots: (0 slots) master
   0 additional replica(s)
S: e73dc8caf474076fdbeb4da346333bc8410c8486 192.168.56.108:6389
   slots: (0 slots) slave
   replicates 00974c9c1acede227f1ef25fd56460a1a19818a0
M: 00974c9c1acede227f1ef25fd56460a1a19818a0 192.168.56.109:6380
   slots:8192-10922 (2731 slots) master
   1 additional replica(s)
S: 3005c5adea38cc21cf47fb86cbe1e8cfd1cbfce7 192.168.56.110:6391
   slots: (0 slots) slave
   replicates 632f31d57d6fcbf48e277ed8cd34299188d2c675
M: 632f31d57d6fcbf48e277ed8cd34299188d2c675 192.168.56.110:6381
   slots:13654-16383 (2730 slots) master
   1 additional replica(s)
M: 9e112eed7f8a5830e907d97792c50a2171d9f13b 192.168.56.109:6390
   slots:0-8191,10923-13653 (10923 slots) master
   1 additional replica(s)
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
[root@redis01 bin]#


Remove the master node now that it is confirmed to hold no slots:

[root@redis01 bin]#
[root@redis01 bin]# ./redis-trib.rb del-node 192.168.56.108:6399 'df5bf5d030453acddd4db106fda76a1d1687a22f'
>>> Removing node df5bf5d030453acddd4db106fda76a1d1687a22f from cluster 192.168.56.108:6399
>>> Sending CLUSTER FORGET messages to the cluster...
>>> SHUTDOWN the node.
[root@redis01 bin]#

IV. Redis Monitoring:

Redis monitoring is mostly based on the output of the INFO command.

1. Memory usage
If Redis uses more memory than the physically available RAM, the redis process is very likely to be killed by the OOM killer. To guard against this, monitor used_memory and used_memory_peak from INFO, set a threshold on memory use, and wire it to an alert (a minimal sketch follows the INFO output below). Of course, alerting is only a means: what matters is planning in advance what to do when memory use grows too large, whether that is purging cold data that is no longer needed or moving Redis to a bigger machine.

# Memory
used_memory:822504
used_memory_human:803.23K
used_memory_rss:3960832
used_memory_rss_human:3.78M
used_memory_peak:822504
used_memory_peak_human:803.23K
total_system_memory:16803835904
total_system_memory_human:15.65G
used_memory_lua:37888
used_memory_lua_human:37.00K
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
mem_fragmentation_ratio:4.82
mem_allocator:jemalloc-4.0.3
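A minimal alerting sketch (assumptions: the host/port/password from the earlier sections, a threshold matching the roughly-800 MB maxmemory plan, and mail plus the address dba@example.com standing in for whatever alerting channel you actually use):

#!/bin/bash
# alert when used_memory exceeds a fixed threshold, in bytes
THRESHOLD=$((800*1024*1024))
USED=$(redis-cli -h 192.168.56.108 -p 6379 -a oracleblog info memory \
        | awk -F: '/^used_memory:/{print $2}' | tr -d '\r')
if [ "$USED" -gt "$THRESHOLD" ]; then
    echo "redis used_memory=$USED exceeds $THRESHOLD" \
        | mail -s "redis memory alert" dba@example.com
fi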

2. Persistence
If Redis crashes because of a hardware problem or a bug in Redis itself, the dumped rdb file may be your only lifeline, so monitoring the dump files matters too. Monitor rdb_last_save_time to know when the last dump happened, and rdb_changes_since_last_save to know how much data you would lose if a failure happened right now.

# Persistence
loading:0
rdb_changes_since_last_save:35
rdb_bgsave_in_progress:0
rdb_last_save_time:1498833577
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_current_size:1621
aof_base_size:1621
aof_pending_rewrite:0
aof_buffer_length:0
aof_rewrite_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:0

3. Replication
If you run a master-slave setup, monitor whether replication is healthy; the key field in INFO is master_link_status. If it is up, replication is fine; if it is down, look at the other diagnostic fields in the output.

# Replication
role:slave
master_host:192.168.56.109
master_port:6380
master_link_status:up
master_last_io_seconds_ago:3
master_sync_in_progress:0
slave_repl_offset:3011
slave_priority:100
slave_read_only:1
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

4. Fork performance
When Redis persists data to disk it performs a fork, using copy-on-write to obtain a memory snapshot at minimal cost. Although the memory pages are copy-on-write, the page tables must be duplicated at the instant of the fork, so the fork briefly stalls the main thread (all reads and writes stop); the length of the stall depends on how much memory Redis is currently using. For a GB-scale instance the fork typically takes milliseconds. Monitor latest_fork_usec in INFO to see how long the most recent fork stalled the server.

# Stats
total_connections_received:1
total_commands_processed:16
instantaneous_ops_per_sec:0
total_net_input_bytes:477
total_net_output_bytes:6000613
instantaneous_input_kbps:0.00
instantaneous_output_kbps:0.00
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
evicted_keys:0
keyspace_hits:0
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:206
migrate_cached_sockets:0

5. Configuration consistency
Redis lets you change configuration at runtime with CONFIG SET, which is convenient but creates a problem: changes made this way are not written back to the configuration file. If you later restart Redis for some reason, the CONFIG SET changes are lost. So whenever you change something with CONFIG SET, also change redis.conf accordingly. To guard against human error it is best to monitor this: use CONFIG GET to read the current runtime values, compare them with the values in redis.conf, and raise an alert if they differ (a sketch follows).
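A per-parameter comparison sketch (assumptions: the conf path from the install section, the password oracleblog, that CONFIG has not been renamed on this instance, and a hand-picked list of parameters to check; extend the loop to whatever you care about):

#!/bin/bash
# compare the runtime value of selected parameters with redis.conf
CONF=/etc/redis/redis_6379.conf
for PARAM in maxmemory-policy appendonly tcp-backlog; do
    RUNTIME=$(redis-cli -a oracleblog config get "$PARAM" | sed -n 2p)
    ONDISK=$(awk -v p="$PARAM" '$1==p {print $2}' "$CONF")
    if [ "$RUNTIME" != "$ONDISK" ]; then
        echo "config drift on $PARAM: runtime=$RUNTIME file=$ONDISK"
    fi
done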

6. Monitoring tools
-Sentinel
Sentinel ships with Redis. It monitors master-slave replication and performs automatic failover when the master dies. During the failover it can also run a user-defined script, in which we can implement alerting and notification.

-Redis Live
Redis Live is a more general Redis monitoring solution. It periodically runs the MONITOR command against Redis to capture the commands currently being executed, aggregates them, and renders visual reports on a web page.

7. Data distribution
Understanding how data is distributed inside Redis is hard, for example finding out which kind of keys uses the most memory. The tools below can help you analyze the dataset.

-Redis-sampler
Redis-sampler is a tool written by the author of Redis. By sampling it gives you a rough picture of the data types currently stored in Redis and how the data is distributed.

-Redis-audit
Redis-audit is a script that tells you how much memory each class of keys uses. It reports how often a class of keys is accessed, how many of them have an expire set, and how much memory they occupy, which makes it easy to spot keys that are rarely or never used.

-Redis-rdb-tools
Redis-rdb-tools is similar to Redis-audit, except that it produces its statistics by analyzing the rdb file (a usage sketch follows).
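A usage sketch (assuming the tool is installed from PyPI as rdbtools and using the dump path from the install section):

pip install rdbtools python-lzf
# per-key memory report: database, type, key, estimated bytes, ...
rdb -c memory /var/redis/6379/dump.rdb > /tmp/redis_memory.csv
head -5 /tmp/redis_memory.csv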

V. Running Redis in Docker:

1. Installing redis in Docker
First search for the available redis images:

LoveHousedeiMac:~ lovehouse$ docker search redis
NAME                      DESCRIPTION                                     STARS     OFFICIAL   AUTOMATED
redis                     Redis is an open source key-value store th...   3866      [OK]
sameersbn/redis                                                           54                   [OK]
bitnami/redis             Bitnami Redis Docker Image                      50                   [OK]
torusware/speedus-redis   Always updated official Redis docker image...   32                   [OK]
webhippie/redis           Docker images for redis                         7                    [OK]
anapsix/redis             11MB Redis server image over AlpineLinux        6                    [OK]
williamyeh/redis          Redis image for Docker                          3                    [OK]
clue/redis-benchmark      A minimal docker image to ease running the...   3                    [OK]
unblibraries/redis        Leverages phusion/baseimage to deploy a ba...   2                    [OK]
abzcoding/tomcat-redis    a tomcat container with redis as session m...   2                    [OK]
miko2u/redis              Redis                                           1                    [OK]
greytip/redis             redis 3.0.3                                     1                    [OK]
frodenas/redis            A Docker Image for Redis                        1                    [OK]
xataz/redis               Light redis image                               1                    [OK]
nanobox/redis             Redis service for nanobox.io                    0                    [OK]
maestrano/redis           Redis is an open source key-value store th...   0                    [OK]
cloudposse/redis          Standalone redis service                        0                    [OK]
watsco/redis              Watsco redis base                               0                    [OK]
appelgriebsch/redis       Configurable redis container based on Alpi...   0                    [OK]
maxird/redis              Redis                                           0                    [OK]
trelllis/redis            Redis Primary                                   0                    [OK]
drupaldocker/redis        Redis for Drupal                                0                    [OK]
yfix/redis                Yfix docker redis                               0                    [OK]
higebu/redis-commander    Redis Commander Docker image. https://gith...   0                    [OK]
continuouspipe/redis      Redis                                           0                    [OK]
LoveHousedeiMac:~ lovehouse$

Pull the image:

LoveHousedeiMac:~ lovehouse$ docker pull redis:latest
latest: Pulling from library/redis
23e3d0773492: Pull complete
bc8f870e2eab: Pull complete
9fb63685a3db: Pull complete
7d5f2d3e9188: Pull complete
4b386c0238f4: Pull complete
33c08d492082: Pull complete
Digest: sha256:6022356f9d729c858000fc10fc1b09d1624ba099227a0c5d314f7461c2fe6020
Status: Downloaded newer image for redis:latest
LoveHousedeiMac:~ lovehouse$

When pulling, use a network that can reliably reach Docker Hub (for example one that gets around the GFW); otherwise you will keep seeing errors like:

error pulling image configuration: Get https://dseasb33srnrn.cloudfront.net/registry-v2/docker/registry/v2/blobs/sha256/83/83744227b191fbc32e3bcb293c1b90ecdb86b3636d02b1a0db009effb3a5b8de/data?Expires=1497887535&Signature=aVLHPVuNv4zjReHDu8ZLum23CgZrSJkmU1~WZzy1mOQdcYu1gVvepxZeV4j44DCCfvM56VCewGzl7FFdNxev4Mtm~KpmKJjHFNQtavJNmu1nqx4MEhdjJNKWX8KNeFuL-euTU7hCwVzrzUs8OIeGO3RKhiva7w0KIFc7ql-xHC8_&Key-Pair-Id=APKAJECH5M7VWIS5YZ6Q: net/http: TLS handshake timeout

Install and start redis with appendonly set to yes. Note that we map the container's /data directory to the local directory /Users/[username]/redisdata for persistence:

LoveHousedeiMac:~ lovehouse$ docker run -p 6379:6379 -v /Users/lovehouse/redisdata:/data  -d redis:latest redis-server --appendonly yes
1fa497550b7e232eee63e050ff5e0f12c530aee992c158138af75b9442c7403f
LoveHousedeiMac:~ lovehouse$
LoveHousedeiMac:~ lovehouse$ docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                    NAMES
1fa497550b7e        redis:latest        "docker-entrypoint..."   5 seconds ago       Up 6 seconds        0.0.0.0:6379->6379/tcp   pensive_sinoussi
LoveHousedeiMac:~ lovehouse$
LoveHousedeiMac:~ lovehouse$ docker ps -a
CONTAINER ID        IMAGE                         COMMAND                  CREATED             STATUS                        PORTS                    NAMES
1fa497550b7e        redis:latest                  "docker-entrypoint..."   15 seconds ago      Up 16 seconds                 0.0.0.0:6379->6379/tcp   pensive_sinoussi
c9f09116cc83        oracle/database:12.2.0.1-ee   "/bin/sh -c 'exec ..."   4 weeks ago         Exited (137) 42 minutes ago                            oracle
LoveHousedeiMac:~ lovehouse$

Attach to the redis container and run redis-cli:

LoveHousedeiMac:~ lovehouse$ docker exec -it 1fa497550b7e /bin/bash
root@1fa497550b7e:/data#
root@1fa497550b7e:/data#
root@1fa497550b7e:/data# redis-cli
127.0.0.1:6379>
127.0.0.1:6379>
127.0.0.1:6379> info
# Server
redis_version:3.2.9
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:d837dd4aae3a6933
redis_mode:standalone
os:Linux 4.9.27-moby x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.9.2
process_id:1
run_id:a490709296a4a15606af8650b4ae2eb922de81ff
tcp_port:6379
uptime_in_seconds:135
uptime_in_days:0
hz:10
lru_clock:4716607
executable:/data/redis-server
config_file:

# Clients
connected_clients:1
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0

# Memory
used_memory:822232
used_memory_human:802.96K
used_memory_rss:4005888
used_memory_rss_human:3.82M
used_memory_peak:822232
used_memory_peak_human:802.96K
total_system_memory:16803835904
total_system_memory_human:15.65G
used_memory_lua:37888
used_memory_lua_human:37.00K
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
mem_fragmentation_ratio:4.87
mem_allocator:jemalloc-4.0.3

# Persistence
loading:0
rdb_changes_since_last_save:0
rdb_bgsave_in_progress:0
rdb_last_save_time:1497888696
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_current_size:0
aof_base_size:0
aof_pending_rewrite:0
aof_buffer_length:0
aof_rewrite_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:0

# Stats
total_connections_received:1
total_commands_processed:1
instantaneous_ops_per_sec:0
total_net_input_bytes:31
total_net_output_bytes:6005118
instantaneous_input_kbps:0.00
instantaneous_output_kbps:0.00
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
evicted_keys:0
keyspace_hits:0
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0

# Replication
role:master
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU
used_cpu_sys:0.09
used_cpu_user:0.03
used_cpu_sys_children:0.00
used_cpu_user_children:0.00

# Cluster
cluster_enabled:0

# Keyspace
127.0.0.1:6379>

Alternatively, running docker run -it redis:latest redis-cli also works:

LoveHousedeiMac:~ lovehouse$ docker run -it redis:latest redis-cli -h 192.168.1.207
192.168.1.207:6379> info
# Server
redis_version:3.2.9
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:d837dd4aae3a6933
redis_mode:standalone
os:Linux 4.9.27-moby x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.9.2
process_id:1
run_id:a490709296a4a15606af8650b4ae2eb922de81ff
tcp_port:6379
uptime_in_seconds:262
uptime_in_days:0
hz:10
lru_clock:4716734
executable:/data/redis-server
config_file:

# Clients
connected_clients:1
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0

# Memory
used_memory:822232
used_memory_human:802.96K
used_memory_rss:4005888
used_memory_rss_human:3.82M
used_memory_peak:822232
used_memory_peak_human:802.96K
total_system_memory:16803835904
total_system_memory_human:15.65G
used_memory_lua:37888
used_memory_lua_human:37.00K
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
mem_fragmentation_ratio:4.87
mem_allocator:jemalloc-4.0.3

# Persistence
loading:0
rdb_changes_since_last_save:0
rdb_bgsave_in_progress:0
rdb_last_save_time:1497888696
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_current_size:0
aof_base_size:0
aof_pending_rewrite:0
aof_buffer_length:0
aof_rewrite_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:0

# Stats
total_connections_received:2
total_commands_processed:5
instantaneous_ops_per_sec:0
total_net_input_bytes:122
total_net_output_bytes:11980083
instantaneous_input_kbps:0.00
instantaneous_output_kbps:0.00
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
evicted_keys:0
keyspace_hits:0
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0

# Replication
role:master
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU
used_cpu_sys:0.16
used_cpu_user:0.05
used_cpu_sys_children:0.00
used_cpu_user_children:0.00

# Cluster
cluster_enabled:0

# Keyspace
192.168.1.207:6379>

2. Backing up, migrating and cloning the docker image:
2.1 Check the existing containers:

LoveHousedeiMac:~ lovehouse$ docker ps -a
CONTAINER ID        IMAGE                         COMMAND                  CREATED             STATUS                     PORTS               NAMES
f98eeeebda7e        redis:latest                  "docker-entrypoint..."   2 weeks ago         Exited (137) 2 weeks ago                       quizzical_torvalds
1a6e061b7233        redis:latest                  "docker-entrypoint..."   2 weeks ago         Exited (0) 2 weeks ago                         hungry_spence
1fa497550b7e        redis:latest                  "docker-entrypoint..."   2 weeks ago         Exited (0) 3 days ago                          pensive_sinoussi
c9f09116cc83        oracle/database:12.2.0.1-ee   "/bin/sh -c 'exec ..."   6 weeks ago         Exited (137) 10 days ago                       oracle
LoveHousedeiMac:~ lovehouse$

2.2 停下container,并将container commit成image:

LoveHousedeiMac:~ lovehouse$ docker stop pensive_sinoussi
pensive_sinoussi
LoveHousedeiMac:~ lovehouse$
LoveHousedeiMac:~ lovehouse$
LoveHousedeiMac:~ lovehouse$ docker commit -p 1fa497550b7e container-backup
sha256:b5dfe58c6528f02c7652f3261e1e60ea45c52aadf3f004d0dbd01acb0236f884
LoveHousedeiMac:~ lovehouse$
LoveHousedeiMac:~ lovehouse$

2.3 检查一下images是否已经建立

LoveHousedeiMac:~ lovehouse$ docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
container-backup    latest              b5dfe58c6528        9 seconds ago       98.9MB
redis               latest              83744227b191        3 weeks ago         98.9MB
oracle/database     12.2.0.1-ee         4f9df5f46a19        6 weeks ago         14.8GB
oraclelinux         7-slim              442ebf722584        2 months ago        114MB
LoveHousedeiMac:~ lovehouse$

2.4 将container-backup 这个image做成tar文件:

LoveHousedeiMac:idocker lovehouse$ docker save -o ./container-backup.tar container-backup
LoveHousedeiMac:idocker lovehouse$
LoveHousedeiMac:idocker lovehouse$ ls -l
total 200792
-rw-------   1 lovehouse  staff  102801920  7  5 00:34 container-backup.tar
drwxr-xr-x@ 19 lovehouse  staff        646  5 20 20:04 docker-images-master
drwxr-xr-x@  7 lovehouse  staff        238  6  2  2016 docker-redis-cluster-master
LoveHousedeiMac:idocker lovehouse$

2.5 将备份准备为可用的镜像(本机上镜像已经存在;如果是在另一台机器,可先用 docker load 导入 tar 文件,见本步骤末尾的示意),并复制一份数据目录,后续将其运行为redis_2:

LoveHousedeiMac:~ lovehouse$ cp -pR redisdata redisdata_2
LoveHousedeiMac:~ lovehouse$ cd idocker
LoveHousedeiMac:idocker lovehouse$ ls
container-backup.tar		docker-images-master		docker-redis-cluster-master
LoveHousedeiMac:idocker lovehouse$
LoveHousedeiMac:idocker lovehouse$
LoveHousedeiMac:idocker lovehouse$ docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
container-backup    latest              b5dfe58c6528        8 minutes ago       98.9MB
redis               latest              83744227b191        3 weeks ago         98.9MB
oracle/database     12.2.0.1-ee         4f9df5f46a19        6 weeks ago         14.8GB
oraclelinux         7-slim              442ebf722584        2 months ago        114MB
LoveHousedeiMac:idocker lovehouse$
LoveHousedeiMac:idocker lovehouse$
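如果是要迁移到另一台机器,可以把tar文件拷贝过去后用docker load导入,下面是一个简单示意(路径按实际情况调整):

# 在目标机器上导入镜像
docker load -i ./container-backup.tar
# 确认container-backup镜像已经出现在镜像列表中
docker images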

2.6 docker run创建第二个redis,注意这里第二个redis的端口映射为26379,不修改的话,会和第一个redis的端口冲突。

LoveHousedeiMac:idocker lovehouse$ docker run --name redis_2 -p 26379:6379 -v /Users/lovehouse/redis_2:/data container-backup:latest
                _._
           _.-``__ ''-._
      _.-``    `.  `_.  ''-._           Redis 3.2.9 (00000000/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._
 (    '      ,       .-`  | `,    )     Running in standalone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
 |    `-._   `._    /     _.-'    |     PID: 1
  `-._    `-._  `-./  _.-'    _.-'
 |`-._`-._    `-.__.-'    _.-'_.-'|
 |    `-._`-._        _.-'_.-'    |           http://redis.io
  `-._    `-._`-.__.-'_.-'    _.-'
 |`-._`-._    `-.__.-'    _.-'_.-'|
 |    `-._`-._        _.-'_.-'    |
  `-._    `-._`-.__.-'_.-'    _.-'
      `-._    `-.__.-'    _.-'
          `-._        _.-'
              `-.__.-'

1:M 04 Jul 16:43:54.942 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
1:M 04 Jul 16:43:54.942 # Server started, Redis version 3.2.9
1:M 04 Jul 16:43:54.942 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
1:M 04 Jul 16:43:54.943 * The server is now ready to accept connections on port 6379

2.7 启动第二个redis

LoveHousedeiMac:idocker lovehouse$ docker start redis_2
redis_2
LoveHousedeiMac:idocker lovehouse$

2.8 检查2个redis已经部署好了。

LoveHousedeiMac:idocker lovehouse$ docker ps
CONTAINER ID        IMAGE                     COMMAND                  CREATED              STATUS              PORTS                     NAMES
240382527b36        container-backup:latest   "docker-entrypoint..."   About a minute ago   Up 17 seconds       0.0.0.0:26379->6379/tcp   redis_2
1fa497550b7e        redis:latest              "docker-entrypoint..."   2 weeks ago          Up 3 minutes        0.0.0.0:6379->6379/tcp    pensive_sinoussi
LoveHousedeiMac:idocker lovehouse$
LoveHousedeiMac:idocker lovehouse$ docker run -it redis:latest redis-cli -h 192.168.1.207 -p 26379 info server
# Server
redis_version:3.2.9
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:d837dd4aae3a6933
redis_mode:standalone
os:Linux 4.9.31-moby x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.9.2
process_id:1
run_id:a326152e6689deb5bcf507354e92c53e13bbeaeb
tcp_port:6379
uptime_in_seconds:300
uptime_in_days:0
hz:10
lru_clock:6014761
executable:/data/redis-server
config_file:
LoveHousedeiMac:idocker lovehouse$
LoveHousedeiMac:idocker lovehouse$ docker run -it redis:latest redis-cli -h 192.168.1.207 -p 6379 info server
# Server
redis_version:3.2.9
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:d837dd4aae3a6933
redis_mode:standalone
os:Linux 4.9.31-moby x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.9.2
process_id:1
run_id:dd00a257e902c91c5316ae3aafb1d67f6e30270b
tcp_port:6379
uptime_in_seconds:478
uptime_in_days:0
hz:10
lru_clock:6014771
executable:/data/redis-server
config_file:
LoveHousedeiMac:idocker lovehouse$

六、Redis 4.0的新特性

1. Module的支持。
module可以在不改变redis主分支源代码的基础上,通过高层抽象的API挂载外部模块,来提供更多的功能。我的理解,这类似PostgreSQL的hook机制。
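下面是加载模块的一个示意(模块路径为假设,仅说明用法):

# redis.conf 中静态加载
loadmodule /path/to/mymodule.so

# 或者在运行时动态加载/查看
127.0.0.1:6379> module load /path/to/mymodule.so
127.0.0.1:6379> module list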

2. PSYNC v2
PSYNC(Partial Resynchronization,部分重同步/增量同步)得到改进。
之前是从库发送 psync 命令到主库,主库判断是否满足增量同步的条件,满足就返回 +CONTINUE 进行增量同步,否则返回 +FULLRESYNC runid offset 进行全量同步。

虽然psync 可以解决短时间主从同步断掉重连的问题,但以下几个场景仍然需要全量同步:
a). 主库/从库重启过。因为 runid 在重启后会变化,所以当前机制无法做增量同步。
b). 从库提升为主库。其他从库切到新主库时全部要做全量同步,因为新主库的 runid 跟老主库是不一样的。

psync v2增加了一个replid2,用来记录是从哪个master做的同步,这个replid2继承自master的replid。如果新旧主库曾经属于同一个复制链(多级级联也可以),那么新主库的replid2就是之前主库的replid;只要满足这一点,且新主库的同步进度不落后于该从库,就允许增量同步。
因此上述的第二点,从库提升为主库之后,还是可以使用增量同步。
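可以通过 info replication 观察这两个id(以下输出为示意,尖括号内为占位说明):

127.0.0.1:6379> info replication
# Replication
role:master
master_replid:<当前实例的复制id>
master_replid2:<切换前所属主库的复制id,没有则为全0>
master_repl_offset:0
second_repl_offset:-1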

3. 缓存回收策略改进。
增加了LFU(Least Frequently Used,最不经常使用)缓存回收策略,优先清理访问频率最低的缓存数据。
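要启用LFU,可以把maxmemory-policy设置为lfu相关的策略,下面是一个示意性的配置片段(maxmemory大小为假设):

# redis.conf(示意)
maxmemory 2gb
maxmemory-policy allkeys-lfu
# LFU 的调节参数(可选)
lfu-log-factor 10
lfu-decay-time 1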

4. 非阻塞性DEL和FLUSHALL/FLUSHDB
在 Redis 4.0 之前, 用户在使用 DEL 命令删除体积较大的键, 又或者在使用 FLUSHDB 和 FLUSHALL 删除包含大量键的数据库时, 都可能会造成服务器阻塞。
redis 4.0提供了unlink命令来替代del命令。unlink会把真正的内存释放放到后台线程异步完成,删除大key时不会阻塞主线程。(注:为了保持向后兼容,del命令仍然保留)
同时,redis 4.0还为flushdb和flushall提供了async选项,用于异步地清空包含大量key的数据库。

redis 4.0还提供了一个交换db的命令swapdb,如swapdb 0 1,就可以将db0和db1交换。原来在db0中的key,全部去了db1。
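下面用redis-cli简单示意一下这几个命令(bigkey为假设的键名):

127.0.0.1:6379> unlink bigkey
(integer) 1
127.0.0.1:6379> flushdb async
OK
127.0.0.1:6379> swapdb 0 1
OK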

5.支持mixed RDB-AOF的持久化模式。
开启该模式后,Redis 可以同时兼有 RDB 持久化和 AOF 持久化的优点:既能够快速地生成重写文件,也能够在出现问题时快速地载入数据。
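混合持久化需要显式打开(据我了解,redis 4.0中aof-use-rdb-preamble默认为no),示意如下:

# redis.conf(示意)
aof-use-rdb-preamble yes

# 也可以在线修改
127.0.0.1:6379> config set aof-use-rdb-preamble yes
OK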

开启混合存储模式后 aof 文件加载的流程如下:
a). aof 文件开头是 rdb 的格式, 先加载 rdb 内容再加载剩余的 aof
b). aof 文件开头不是 rdb 的格式,直接以 aof 格式加载整个文件
判断 aof 文件的前面部分是否为 rdb 格式,只需要判断前 5 个字符是否是 REDIS:rdb 文件固定以 REDIS 开头,而纯 aof 文件开头一定不会是 REDIS(aof 文件是以 * 开头的命令文本)。

6. 增加了内存检查命令,memory。如memory stats,memory usage,memory purge
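用法示意如下(somekey为假设的键名,返回的字节数仅为示意):

127.0.0.1:6379> memory usage somekey
(integer) 59
127.0.0.1:6379> memory purge
OK
127.0.0.1:6379> memory stats
(输出较长,包含peak.allocated等统计项,此处略)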

7.增加了对NAT的支持。(主要是为了解决redis cluster在docker上的问题)。
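具体来说是新增了几个cluster-announce相关的配置项,让容器里的实例可以对外宣告宿主机的地址和端口,示意如下(IP和端口为假设):

# redis.conf(示意)
cluster-announce-ip 10.1.1.5
cluster-announce-port 6379
cluster-announce-bus-port 16379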

更多信息,可见:Redis 4.0 release notes 以及 The first release candidate of Redis 4.0 is out


Real-time materialized view,面向开发者的12.2新特性


先来谈谈为什么要有这个real time mv。

在12.2之前,如果你想通过query rewrite获得实时的数据,那么你必须使用on commit的刷新方式刷新物化视图。但是on commit的刷新方式有众多限制,比如对sql复杂度的限制、频繁刷新对系统造成的压力等等。所以,我们不得不采用on demand的方式来刷新(不管是全量刷新还是增量刷新)。而使用on demand刷新时,必须有个job来定时地刷;在一次job运行之后、下一次job到来之前,如果基表有数据变化,那么此时物化视图里的数据肯定不是最新的。

real time mv就是为了解决这个问题而生的。它既可以帮你获取实时的数据,又不用频繁地刷新mv。

我们来看一下这是怎么实现的。

传统mv的创建方式:

SQL> create table t1 (x not null primary key, y not null) as
  2    select rownum x, mod(rownum, 10) y from dual connect by level <= 1000000;

Table created.

SQL> create materialized view log on t1 with rowid (x, y) including new values;

Materialized view log created.

SQL>
SQL> create materialized view mv_old
  2  refresh fast on demand
  3  enable query rewrite
  4  as
  5    select y , count(*) c1
  6    from t1
  7    group by y;

Materialized view created.

SQL>
SQL>

Real time mv的创建方式:
注意在create mv时的关键字:enable on query computation

SQL> create table t2 (x not null primary key, y not null) as
  2    select rownum x, mod(rownum, 10) y from dual connect by level <= 1000000;

Table created.

SQL> create materialized view log on t2 with rowid (x, y) including new values;

Materialized view log created.

SQL>
SQL> create materialized view mv_new
  2  refresh fast on demand
  3  enable on query computation
  4  enable query rewrite
  5  as
  6    select y , count(*) c1
  7    from t2
  8    group by y;

Materialized view created.

SQL>
SQL>

我们来比较一下传统mv和real time mv的差别:
相关参数:

SQL> show parameter rewr

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
query_rewrite_enabled                string      TRUE
query_rewrite_integrity              string      enforced
SQL>
SQL>
SQL>
SQL>
SQL> set autotrace on explain stat
SQL>

初始状态:
传统mv:

SQL> select  y as y_new_parse1, count(*) from t1
  2  group by y;

Y_NEW_PARSE1   COUNT(*)
------------ ----------
           1     100000
           6     100000
           2     100000
           4     100000
           5     100000
           8     100000
           3     100000
           7     100000
           9     100000
           0     100000

10 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 2738786661

---------------------------------------------------------------------------------------
| Id  | Operation                    | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |        |    10 |    60 |     3   (0)| 00:00:01 |
|   1 |  MAT_VIEW REWRITE ACCESS FULL| MV_OLD |    10 |    60 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------------------


Statistics
----------------------------------------------------------
       1029  recursive calls
          2  db block gets
       1587  consistent gets
         76  physical reads
          0  redo size
        739  bytes sent via SQL*Net to client
        608  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
         86  sorts (memory)
          0  sorts (disk)
         10  rows processed

SQL>
SQL>

Real time mv:

SQL> select  y as y_new_parse1, count(*) from t2
  2  group by y;

Y_NEW_PARSE1   COUNT(*)
------------ ----------
           1     100000
           6     100000
           2     100000
           4     100000
           5     100000
           8     100000
           3     100000
           7     100000
           9     100000
           0     100000

10 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 496717744

---------------------------------------------------------------------------------------
| Id  | Operation                    | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |        |    10 |    60 |     3   (0)| 00:00:01 |
|   1 |  MAT_VIEW REWRITE ACCESS FULL| MV_NEW |    10 |    60 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------------------


Statistics
----------------------------------------------------------
        170  recursive calls
         13  db block gets
        248  consistent gets
          7  physical reads
       2008  redo size
        739  bytes sent via SQL*Net to client
        608  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
         21  sorts (memory)
          0  sorts (disk)
         10  rows processed

SQL>

看到此时2个物化视图,数据都是最新的,staleness显示是fresh:

SQL> select mview_name,staleness,on_query_computation from user_mviews;

MVIEW_NAME                               STALENESS           O
---------------------------------------- ------------------- -
MV_OLD                                   FRESH               N
MV_NEW                                   FRESH               Y

SQL>

物化视图日志里面也没有记录

SQL> select count(*) from MLOG$_T1;

  COUNT(*)
----------
         0

SQL> select count(*) from MLOG$_T2;

  COUNT(*)
----------
         0

SQL>

我们对基表insert数据:

SQL> insert into t1
  2  select 1000000+rownum, mod(rownum, 3) from dual connect by level <= 999;

999 rows created.

SQL>
SQL> insert into t2
  2  select 1000000+rownum, mod(rownum, 3) from dual connect by level <= 999;

999 rows created.

SQL> commit;

Commit complete.

SQL>

可以看到2个物化视图的staleness已经变成NEEDS_COMPILE,且物化视图日志表里面也有了日志记录:

SQL> select mview_name,staleness,on_query_computation from user_mviews;

MVIEW_NAME                               STALENESS           O
---------------------------------------- ------------------- -
MV_OLD                                   NEEDS_COMPILE       N
MV_NEW                                   NEEDS_COMPILE       Y

SQL>
SQL> select count(*) from MLOG$_T1;

  COUNT(*)
----------
       999

SQL> select count(*) from MLOG$_T2;

  COUNT(*)
----------
       999

SQL>

我们来见证一下奇迹的时刻。先重复上面第一个查询,可以看到,由于数据已经stale,且没有set query_rewrite_integrity=stale_tolerated,传统mv没有进行query rewrite:

SQL> select  y as y_new_parse1, count(*) from t1
  2  group by y;

Y_NEW_PARSE1   COUNT(*)
------------ ----------
           1     100333
           6     100000
           2     100333
           4     100000
           5     100000
           8     100000
           3     100000
           7     100000
           9     100000
           0     100333

10 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 136660032

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    10 |    30 |   515   (4)| 00:00:01 |
|   1 |  HASH GROUP BY     |      |    10 |    30 |   515   (4)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| T1   |  1000K|  2929K|   498   (1)| 00:00:01 |
---------------------------------------------------------------------------


Statistics
----------------------------------------------------------
       1975  recursive calls
         30  db block gets
       4167  consistent gets
       1786  physical reads
       5440  redo size
        754  bytes sent via SQL*Net to client
        608  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
        131  sorts (memory)
          0  sorts (disk)
         10  rows processed

SQL>
SQL>

我们看到,real time mv,进行了query rewrite,且查到的数据是最新实时数据!

SQL> select  y as y_new_parse1, count(*) from t2
  2  group by y;

Y_NEW_PARSE1   COUNT(*)
------------ ----------
           6     100000
           4     100000
           5     100000
           8     100000
           3     100000
           7     100000
           9     100000
           1     100333
           2     100333
           0     100333

10 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 542978159

------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                |    12 |   312 |    22  (14)| 00:00:01 |
|   1 |  VIEW                               |                |    12 |   312 |    22  (14)| 00:00:01 |
|   2 |   UNION-ALL                         |                |       |       |            |          |
|*  3 |    VIEW                             | VW_FOJ_0       |    10 |   290 |     9  (12)| 00:00:01 |
|*  4 |     HASH JOIN FULL OUTER            |                |    10 |   240 |     9  (12)| 00:00:01 |
|   5 |      VIEW                           |                |     1 |     7 |     6  (17)| 00:00:01 |
|   6 |       HASH GROUP BY                 |                |     1 |    22 |     6  (17)| 00:00:01 |
|*  7 |        TABLE ACCESS FULL            | MLOG$_T2       |   999 | 21978 |     5   (0)| 00:00:01 |
|   8 |      VIEW                           |                |    10 |   170 |     3   (0)| 00:00:01 |
|   9 |       MAT_VIEW ACCESS FULL          | MV_NEW         |    10 |    60 |     3   (0)| 00:00:01 |
|  10 |    VIEW                             |                |     2 |    52 |    13  (16)| 00:00:01 |
|  11 |     UNION-ALL                       |                |       |       |            |          |
|* 12 |      FILTER                         |                |       |       |            |          |
|  13 |       NESTED LOOPS OUTER            |                |     1 |    32 |     6  (17)| 00:00:01 |
|  14 |        VIEW                         |                |     1 |    26 |     6  (17)| 00:00:01 |
|* 15 |         FILTER                      |                |       |       |            |          |
|  16 |          HASH GROUP BY              |                |     1 |    22 |     6  (17)| 00:00:01 |
|* 17 |           TABLE ACCESS FULL         | MLOG$_T2       |   999 | 21978 |     5   (0)| 00:00:01 |
|* 18 |        INDEX UNIQUE SCAN            | I_SNAP$_MV_NEW |     1 |     6 |     0   (0)| 00:00:01 |
|  19 |      NESTED LOOPS                   |                |     1 |    35 |     7  (15)| 00:00:01 |
|  20 |       VIEW                          |                |     1 |    29 |     6  (17)| 00:00:01 |
|  21 |        HASH GROUP BY                |                |     1 |    22 |     6  (17)| 00:00:01 |
|* 22 |         TABLE ACCESS FULL           | MLOG$_T2       |   999 | 21978 |     5   (0)| 00:00:01 |
|* 23 |       MAT_VIEW ACCESS BY INDEX ROWID| MV_NEW         |     1 |     6 |     1   (0)| 00:00:01 |
|* 24 |        INDEX UNIQUE SCAN            | I_SNAP$_MV_NEW |     1 |       |     0   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("AV$0"."OJ_MARK" IS NULL)
   4 - access(SYS_OP_MAP_NONNULL("SNA$0"."Y")=SYS_OP_MAP_NONNULL("AV$0"."GB0"))
   7 - filter("MAS$"."SNAPTIME$$">TO_DATE(' 2017-07-12 14:35:01', 'syyyy-mm-dd hh24:mi:ss'))
  12 - filter(CASE  WHEN ROWID IS NOT NULL THEN 1 ELSE NULL END  IS NULL)
  15 - filter(SUM(1)>0)
  17 - filter("MAS$"."SNAPTIME$$">TO_DATE(' 2017-07-12 14:35:01', 'syyyy-mm-dd hh24:mi:ss'))
  18 - access("MV_NEW"."SYS_NC00003$"(+)=SYS_OP_MAP_NONNULL("AV$0"."GB0"))
  22 - filter("MAS$"."SNAPTIME$$">TO_DATE(' 2017-07-12 14:35:01', 'syyyy-mm-dd hh24:mi:ss'))
  23 - filter("MV_NEW"."C1"+"AV$0"."D0">0)
  24 - access(SYS_OP_MAP_NONNULL("Y")=SYS_OP_MAP_NONNULL("AV$0"."GB0"))

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)
   - this is an adaptive plan


Statistics
----------------------------------------------------------
        906  recursive calls
         64  db block gets
       1232  consistent gets
         14  physical reads
      10548  redo size
        744  bytes sent via SQL*Net to client
        608  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
         64  sorts (memory)
          0  sorts (disk)
         10  rows processed

SQL>

我们看到,在查t2的时候,优化器会根据成本决定是否使用query rewrite。
在我们这个例子中,CBO选择了使用query rewrite。可以看到改写到物化视图之后,取到的并不是已经过期的物化视图的值,而是最新的值。结合执行计划可以看到,是把stale的物化视图与物化视图日志做了union all和hash join full outer,从而得到了最新的结果。

可以看到,使用的是物化视图日志中 "MAS$"."SNAPTIME$$" > TO_DATE(' 2017-07-12 14:35:01', 'syyyy-mm-dd hh24:mi:ss') 之后的记录。

对比直接从table取值,到利用real time物化视图取值,consistent get从4167变成了1232。

注意我们的mv log还是没有被刷新的。还是需要去定期的job刷新:

SQL> select count(*) from MLOG$_T1;

  COUNT(*)
----------
       999

SQL> select count(*) from MLOG$_T2;

  COUNT(*)
----------
       999

SQL>
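比如手工做一次fast refresh之后,MLOG里的记录才会被清掉,staleness也会恢复为FRESH,下面是一个示意(输出为示意):

SQL> exec dbms_mview.refresh('MV_NEW', method => 'F');

PL/SQL procedure successfully completed.

SQL> select count(*) from MLOG$_T2;

  COUNT(*)
----------
         0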

另外再提一下,有个/*+ fresh_mv */的hint,可以直接查询real time mv的实时结果:

SQL> select * from mv_new;

         Y         C1
---------- ----------
         1     100000
         6     100000
         2     100000
         4     100000
         5     100000
         8     100000
         3     100000
         7     100000
         9     100000
         0     100000

10 rows selected.

SQL>
SQL> select /*+ fresh_mv */* from mv_new;

         Y         C1
---------- ----------
         6     100000
         4     100000
         5     100000
         8     100000
         3     100000
         7     100000
         9     100000
         1     100333
         2     100333
         0     100333

10 rows selected.

综上,Real time mv利用原来已经stale的物化视图,结合mv log,通过计算后帮你获取实时的数据。你既能获得实时数据,又不必那么频繁地刷新mv。

参考:
https://blogs.oracle.com/sql/12-things-developers-will-love-about-oracle-database-12c-release-2#real-time-mv
https://blog.dbi-services.com/12cr2-real-time-materialized-view-on-query-computation/
https://uhesse.com/2017/01/05/real-time-materialized-views-in-oracle-12c/
https://docs.oracle.com/database/122/SQLRF/CREATE-MATERIALIZED-VIEW.htm#SQLRF01302

ASM添加磁盘最佳实践


当FRA区或者DATA区磁盘空间不够的时候,我们需要为ASM添加磁盘。
添加磁盘的high level的步骤为:

1. SA分配共享磁盘,要求在多个节点都能看到这些磁盘。
2. 将共享磁盘分区,将分区后的磁盘,创建成asmdisk
3. 将asmdisk加入到asm的diskgroup中

下面是具体的实施步骤:
(一). SA分配共享磁盘,要求在多个节点都能看到这些磁盘。
1. 在SA未加磁盘之前,记下/dev/sd*的磁盘名称已经用到了哪个字母,以便识别后续新增的磁盘(下一个字母即为新加的盘)。
对于已经加入asm的磁盘,要确认它们对应哪个物理磁盘,可以先用oracleasm listdisks列出已创建的asm磁盘,再用oracleasm querydisk -p 查看对应的物理路径。

2. SA加盘之后,需要在多个节点都能看到这些盘,通过ls -l /dev/sd*应该可以看到新增之后的磁盘。
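上面两步涉及的命令,简单示意如下(FRA01为举例的已有asm磁盘名):

# 加盘前:记录已有asm磁盘及其对应的物理路径
oracleasm listdisks
oracleasm querydisk -p FRA01

# 加盘前后:在各个节点对比盘符,确认新增的/dev/sd*设备
ls -l /dev/sd*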

(二). 将共享磁盘分区,将分区后的磁盘,创建成asmdisk
1. 在一个节点上,用fdisk命令,将新建的共享磁盘分区:
fdisk /dev/sdn
Command (m for help): n
Partition type:
p primary (0 primary, 0 extended, 4 free)
e extended
Select (default p):
Using default response p
Partition number (1-4, default 1):
First sector (32768-25165823, default 32768):
Using default value 32768
Last sector, +sectors or +size{K,M,G} (32768-25165823, default 25165823): +50G
Partition 1 of type Linux and of size 10 GiB is set
Command (m for help): w
注意,这里分区的大小,必须严格遵守和已经存在盘一致的大小,不然同一个diskgroup中不同大小的盘,会导致rebalance不平衡,引起性能问题。

2. 在一个节点完成分区后,在其他节点观察是否也能看到分区。查看是否存在/dev/sd*1,如果没有,可以用fdisk打开对应磁盘(如fdisk /dev/sdn),输入p(print)查看分区表后退出,即可看到分区后的磁盘sdn1。

3. 在一个节点上,创建对应的asm磁盘:
oracleasm createdisk FRA05 /dev/sdn1
oracleasm createdisk FRA06 /dev/sdo1
oracleasm createdisk FRA07 /dev/sdp1
oracleasm createdisk FRA08 /dev/sdq1

4. 在多个节点上oracleasm listdisks,查看是否创建了新的asmdisk(对比第(一)步的第1点),如果没有看到,用oracleasm scandisks一次之后,再次oracleasm listdisks。如果还是没有看到,说明之前的创建步骤有问题。停止后续操作,检查分析之前步骤的执行情况

5. 用 sqlplus / as sysasm 登录,执行:
select GROUP_NUMBER,DISK_NUMBER,MOUNT_STATUS,HEADER_STATUS,MODE_STATUS,name,path from v$asm_disk;
观察上述新加的asm磁盘的HEADER_STATUS状态,应该是PROVISIONED

(三)将asmdisk加入到asm的diskgroup中
1. 先在一个节点创建一个test diskgroup,注意使用的是path name,而不是name。
CREATE DISKGROUP TEST EXTERNAL REDUNDANCY DISK 'ORCL:FRA05','ORCL:FRA06';
ALTER DISKGROUP TEST ADD DISK 'ORCL:FRA07','ORCL:FRA08';

2. 在其他节点进行mount该diskgroup,注意mount之前,状态应该是DISMOUNTED
SELECT STATE, NAME FROM V$ASM_DISKGROUP where name='TEST';

3.mount磁盘,看是否有报错:
ALTER DISKGROUP TEST MOUNT;

4. mount之后,状态应该是MOUNTED:
SELECT STATE, NAME FROM V$ASM_DISKGROUP where name='TEST';

5. 确认上述操作步骤都没有失败后,删除测试用的test diskgroup:先在别的节点dismount该磁盘组
alter diskgroup test dismount;

6. 在第一个节点drop diskgroup:
DROP DISKGROUP TEST;

7. 在第一个节点添加磁盘:
ALTER DISKGROUP FRA ADD DISK 'ORCL:FRA05','ORCL:FRA06','ORCL:FRA07','ORCL:FRA08';

8. 根据情况,调整rebalance power,(注:白天业务高峰期,禁止使用超过3的power)
alter diskgroup fra rebalance power 8;

9. 观察asm rebalance的情况,直到v$asm_operation返回0行记录,才算变更完成。
select * from v$asm_operation;

利用Oracle存储过程发送邮件


/**配置ACL***/
begin
  dbms_network_acl_admin.create_acl (
    acl         => 'smtp_permissions.xml', -- or any other name
    description => 'SMTP Access',
    principal   => 'DBMGR', -- the user name trying to access the network resource
    is_grant    => TRUE,
    privilege   => 'connect',
    start_date  => null,
    end_date    => null
  );
end;
/
commit;

begin
  dbms_network_acl_admin.add_privilege(
    acl       => 'smtp_permissions.xml',
    principal => 'DBMGR',
    is_grant  => true,
    privilege => 'connect');
end;
/
commit;

begin
  dbms_network_acl_admin.assign_acl (
    acl        => 'smtp_permissions.xml',
    host       => '10.10.8.1', /* can be computer name or IP, wildcards are accepted as well, for example '*.us.oracle.com' */
    lower_port => 25,
    upper_port => null
  );
end;
/
commit;

/**创建发送邮件的存储过程***/
CREATE OR REPLACE PROCEDURE send_mail(
  p_recipient VARCHAR2, -- 邮件接收人
  p_subject   VARCHAR2, -- 邮件标题
  p_message   VARCHAR2  -- 邮件正文
)
IS
  -- 下面四个变量请根据实际邮件服务器进行赋值
  v_mailhost VARCHAR2(30) := '10.10.8.1';          -- SMTP服务器地址
  v_user     VARCHAR2(30) := 'mymailuser';         -- 登录SMTP服务器的用户名
  v_pass     VARCHAR2(20) := 'mypasswd';           -- 登录SMTP服务器的密码
  v_sender   VARCHAR2(50) := 'mymailuser@dji.com'; -- 发送者邮箱,一般与 v_user 对应
  v_conn     UTL_SMTP.connection;                  -- 到邮件服务器的连接
  v_msg      VARCHAR2(4000);                       -- 邮件内容
BEGIN
  v_conn := UTL_SMTP.open_connection(v_mailhost, 25);
  UTL_SMTP.ehlo(v_conn, v_mailhost); -- 要用 ehlo() 而不是 helo() 函数
  -- 否则会报:ORA-29279: SMTP 永久性错误: 503 5.5.2 Send hello first.
  UTL_SMTP.command(v_conn, 'AUTH LOGIN'); -- smtp服务器登录校验
  UTL_SMTP.command(v_conn, UTL_RAW.cast_to_varchar2(UTL_ENCODE.base64_encode(UTL_RAW.cast_to_raw(v_user))));
  UTL_SMTP.command(v_conn, UTL_RAW.cast_to_varchar2(UTL_ENCODE.base64_encode(UTL_RAW.cast_to_raw(v_pass))));
  UTL_SMTP.mail(v_conn, '<'||v_sender||'>');    -- 设置发件人
  UTL_SMTP.rcpt(v_conn, '<'||p_recipient||'>'); -- 设置收件人
  -- 拼接要发送的邮件内容,注意报头信息和邮件正文之间要空一行
  v_msg := 'Date:'|| TO_CHAR(SYSDATE, 'yyyy mm dd hh24:mi:ss')
        || UTL_TCP.CRLF || 'From: '    || v_sender
        || UTL_TCP.CRLF || 'To: '      || p_recipient
        || UTL_TCP.CRLF || 'Subject: ' || p_subject
        || UTL_TCP.CRLF || UTL_TCP.CRLF -- 到这里为止是报头信息
        || p_message;                   -- 这个是邮件正文
  UTL_SMTP.open_data(v_conn); -- 打开流
  UTL_SMTP.write_raw_data(v_conn, UTL_RAW.cast_to_raw(v_msg)); -- 用write_raw_data,标题和正文才能用中文
  UTL_SMTP.close_data(v_conn); -- 关闭流
  UTL_SMTP.quit(v_conn);       -- 关闭连接
EXCEPTION
  WHEN OTHERS THEN
    DBMS_OUTPUT.put_line(DBMS_UTILITY.format_error_stack);
    DBMS_OUTPUT.put_line(DBMS_UTILITY.format_call_stack);
END send_mail;
/

/**发送邮件**/
begin
  send_mail('xxxxx@dji.com', 'Tablespace XX is full.', 'Tablespace XXX is NN full, please add more space.');
end;
/

是的,大疆DBA团队需要你的加入


是的,我们在招人。

大疆DBA团队扩建了,目前有6个headcount,欢迎各路豪杰的加入。

岗位职责:
1. 负责内网和云上(aws和阿里云)数据库的故障响应。
2. 负责公司数据库安装,部署,SQL优化,数据库故障的根因分析;
3.负责数据库自动化运维的开发,推进数据库的自动化建设;
4. 根据项目的不同需求,制定数据同步方案、高可用方案、备份方案、安全方案等。

任职要求:
1. 本科以上学历,2年以上相关工作经验;
2. 精通oracle、sql server、mysql、postgresql、redis、mongodb中的至少2种数据库。包括安装、备份恢复、问题诊断,高可用架构、容灾架构、性能优化和代码优化。熟悉公有云(AWS或者阿里云)数据库的问题诊断方式;
3. 熟悉数据原理,熟悉上述数据库之间的区别,包括上述数据库的事务隔离机制、锁机制、版本控制机制等。熟悉上述数据库常用监控指标、范围、影响;
4. 熟悉linux性能优化,熟悉shell/python任意一种脚本语言;
5. 了解数据库中间件,如mycat、atlas、codis等;
6. 了解存储、虚拟化和网络相关知识。

简历请投:jimmy.he[at]dji.com。

其他运维工程师同期也在火热招聘中:

pg的跨库查询


mysql和mssql的跨库查询,基本只需要dbname.schema.table_name就可以实现,而pg的跨库查询,和oracle一样,需要通过类似dblink的方式来实现。pg在9.3之前建议使用dblink,9.3及之后建议使用postgres_fdw(foreign-data wrapper)。
我们假设有个库mydb001,里面有2个用户mydb001_rw和mydb001_r,分别是读写用户和只读用户。有另外一个库dbprd2,里面也是有2个用户dbprd2_rw和dbprd2_r。
我们需要在mydb001库中利用mydb001_rw用户,去只读的查询dbprd2库的tb_orad_mutex表。

一、需要以superuser安装extension(注,如果你需要每个database都使用,那么每个database都要装一次这个extension,或者你也可以一开始就在template1中安装,那么后续新建的database也都会包含了这个extension):

psql -U dbmgr -d mydb001
--drop extension postgres_fdw;
create extension postgres_fdw;

mydb001=> \dx
                               List of installed extensions
     Name     | Version |   Schema   |                    Description
--------------+---------+------------+----------------------------------------------------
 plpgsql      | 1.0     | pg_catalog | PL/pgSQL procedural language
 postgres_fdw | 1.0     | public     | foreign-data wrapper for remote PostgreSQL servers
 uuid-ossp    | 1.1     | public     | generate universally unique identifiers (UUIDs)
(3 rows)

mydb001=>

二、还是以superuser用户,创建remote server,用于连接远程数据库。

--drop server remote_db;
create server remote_db foreign data wrapper postgres_fdw options(host '127.0.0.1',port '5432',dbname 'dbprd2');
mydb001=> \des
         List of foreign servers
   Name    | Owner | Foreign-data wrapper
-----------+-------+----------------------
 remote_db | dbmgr | postgres_fdw
(1 row)

mydb001=>
GRANT USAGE ON FOREIGN SERVER remote_db TO mydb001_rw;
GRANT USAGE ON FOREIGN SERVER remote_db TO mydb001_r;
\q

注意此时需要修改pg_hba.conf文件以允许连接(修改后需reload配置才能生效)。

# TYPE DATABASE  USER   ADDRESS     METHOD
……
host all all 127.0.0.1/32 md5

三、以应用用户连接,创建user mapping:

psql -U mydb001_rw -d mydb001
--drop user mapping for mydb001_rw server remote_db;
create user mapping for mydb001_rw server remote_db options(user 'dbprd2_r',password 'WTDw2#@e');

四、应用用户下创建 FOREIGN TABLE:

--drop FOREIGN TABLE  db_dbprd2_tb_orad_mutex;
CREATE FOREIGN TABLE
db_dbprd2_tb_orad_mutex(appid integer,appkey character varying(40),appindex character varying(40) ,status integer)
server remote_db
options (schema_name 'dbprd2_rw',table_name 'tb_orad_mutex');
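另外提一句,如果不想手工逐列定义外部表,在PostgreSQL 9.5及之后的版本也可以用IMPORT FOREIGN SCHEMA批量导入远端表的定义,下面是一个示意:

-- 以应用用户执行,把远端dbprd2_rw schema下的指定表导入到本地public schema
IMPORT FOREIGN SCHEMA dbprd2_rw LIMIT TO (tb_orad_mutex)
    FROM SERVER remote_db INTO public;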

五、测试查询,以及尝试是否能更新(注,如果mapping user的时候,用的是读写用户,那么也是可以更新的)

-- mydb001_rw用户查询dbprd2数据库的表。
-bash-4.2$ psql -U mydb001_rw -d mydb001
psql (9.6.2)
Type "help" for help.

mydb001=> select * from db_dbprd2_tb_orad_mutex limit 2;
 appid  |      appkey      |    appindex    | status
--------+------------------+----------------+--------
 123456 | AAAAAAAAAAAAAAAA | lm             |    999
 654321 | BBBBBBBBBBBBBBB  | abcdefghijklm  |    999
(2 rows)


--由于之前的user mapping是通过只读用户连接,所以更新操作会报错:
mydb001=> begin;
BEGIN
mydb001=> update db_dbprd2_tb_orad_mutex set appindex='zxsaqwerre' where appid='654321' and appkey='BBBBBBBBBBBBBBB';
ERROR:  permission denied for relation tb_orad_mutex
CONTEXT:  Remote SQL command: UPDATE dbprd2_rw.tb_orad_mutex SET appindex = 'zxsaqwerre'::character varying(40) WHERE ((appid = 654321)) AND ((appkey = 'BBBBBBBBBBBBBBB'::text))
mydb001=> rollback;
ROLLBACK
mydb001=>
mydb001=>

如何找到postgres中疯狂增长的wal日志的语句


很久以前,我写过一个文章,《如何查找疯狂增长arch的进程》,讲述在oracle数据库中如何查找导致当前疯狂增长arch的session。今天,我们在postgresql数据库中也遇到了类似的问题。

在一段时间内,wal日志疯狂地增长,大约每分钟产生1G;而xlog被疯狂地cp去归档,导致xlog还来不及通过流复制传到从库,就已经被切换并移到了归档目录,进而导致主从复制断开。

和开发一起诊断了这个问题之后,发现是一个update语句更新了大量的记录,每次更新1000多万记录中的200多万,这个表的14个字段中有10个字段有索引,且更新是非HOT update。这个语句每小时跑一次,每次跑的时候,有12个类似的语句同时在跑。开发修改语句、增加了where的过滤条件后,每次更新的数据量从200多万减少到了几行,从而解决了这个问题。

事后,我一直在想,如果没有开发人员,我们dba是否也可以从数据库本身的信息中发现问题、找到语句?在一次偶然的机会中,和平安科技的梁海安聊天时得到了答案。

在oracle中导致归档过多的,是过于频繁的dml语句。在pg中也是一样。只是在oracle中有v$sesstat中可以看到redo size的信息,而在pg的pg_stat_activity中只有session的信息,并没有语句的wal信息。但是由于wal的产生也是因为过多的dml引起的,我们可以从pg_catalog.pg_stat_all_tables中去找变动频繁的tuple(n_tup_ins,n_tup_upd,n_tup_del,主要是update),从而发现导致dml过多的语句。

解决方法如下:

1. 开启pg的dml审计。在postgresql.conf中设置log_statement = 'mod'
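如果不想手工改postgresql.conf,也可以在线修改并reload,示意如下:

-- 以超级用户执行,log_statement修改后reload即可生效,无需重启
ALTER SYSTEM SET log_statement = 'mod';
SELECT pg_reload_conf();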

2. 截取一个时间的pg_catalog.pg_stat_all_tables:

create table orasup1 as
select date_trunc('second',now()) as sample_time,schemaname,relname,n_tup_ins,n_tup_upd,n_tup_del,n_tup_hot_upd from pg_catalog.pg_stat_all_tables;

3. 截取另外一个时间的pg_catalog.pg_stat_all_tables:

create table orasup2 as
select date_trunc('second',now()) as sample_time,schemaname,relname,n_tup_ins,n_tup_upd,n_tup_del,n_tup_hot_upd from pg_catalog.pg_stat_all_tables;

4. 检查在单位时间内,那个对象的dml最多:

select t2.schemaname,t2.relname,
 (t2.n_tup_ins-t1.n_tup_ins) as delta_ins,
 (t2.n_tup_upd-t1.n_tup_upd) as delta_upd,
 (t2.n_tup_del-t1.n_tup_del) as delta_del,
(t2.n_tup_ins+t2.n_tup_upd+t2.n_tup_del-t1.n_tup_ins-t1.n_tup_upd-t1.n_tup_del) as del_dml,
(EXTRACT (EPOCH FROM  t2.sample_time::timestamp )::float-EXTRACT (EPOCH FROM  t1.sample_time::timestamp )::float) as delta_second,
round(cast((t2.n_tup_ins+t2.n_tup_upd+t2.n_tup_del-t1.n_tup_ins-t1.n_tup_upd-t1.n_tup_del)/(EXTRACT (EPOCH FROM  t2.sample_time::timestamp )::float-EXTRACT (EPOCH FROM  t1.sample_time::timestamp )::float)as numeric),2) as delta_dml_per_sec
from  orasup2 t2, orasup1 t1
where t2.schemaname=t1.schemaname and t2.relname=t1.relname
order by delta_dml_per_sec desc limit 10;
platform=#

5. 此时我们已经得到了dml最多的对象,结合第1步的审计,就可以找到对应的语句了。

6. 清理战场:drop table orasup1; drop table orasup2; 并且恢复审计粒度为log_statement = 'ddl'

远程数据库的表超过20个索引的影响


昨天同事参加了一个研讨会,有提到一个案例。一个通过dblink查询远端数据库,原来查询很快,但是远端数据库增加了一个索引之后,查询一下子变慢了。

经过分析发现,那个通过dblink的查询语句在查询远端数据库时,本来是走索引的;但远端数据库添加索引之后,如果索引的个数超过20个,就会忽略第一个建立的索引。这个查询恰好用到了第一个建立的索引,该索引被忽略之后,就只能走Full Table Scan了。

听了这个案例,我查了一下,在oracle官方文档中,关于Managing a Distributed Database有一段话:

Several performance restrictions relate to access of remote objects:

Remote views do not have statistical data.
Queries on partitioned tables may not be optimized.
No more than 20 indexes are considered for a remote table.
No more than 20 columns are used for a composite index.

文档中说到,对于远程表,优化器最多只会考虑20个索引。这段话从oracle 9i的文档起就已经存在,一直到12.2还在。

那么,超过20个索引时,是新建的索引被忽略了,还是老的索引被忽略了?如何让被忽略的索引重新被oracle考虑?我们来测试一下。
(本文基于12.1.0.2的远程库和12.2.0.1的本地库进行测试,如果对测试过程没兴趣的,可以直接拉到文末看“综上”部分)

(一)初始化测试表:

--创建远程表:
DROP TABLE t_remote;

CREATE TABLE t_remote (
col01 NUMBER,
col02 NUMBER,
col03 VARCHAR2(50),
col04 NUMBER,
col05 NUMBER,
col06 VARCHAR2(50),
col07 NUMBER,
col08 NUMBER,
col09 VARCHAR2(50),
col10 NUMBER,
col11 NUMBER,
col12 VARCHAR2(50),
col13 NUMBER,
col14 NUMBER,
col15 VARCHAR2(50),
col16 NUMBER,
col17 NUMBER,
col18 VARCHAR2(50),
col19 NUMBER,
col20 NUMBER,
col21 VARCHAR2(50),
col22 NUMBER,
col23 NUMBER,
col24 VARCHAR2(50),
col25 NUMBER,
col26 NUMBER,
col27 VARCHAR2(50)
);


alter table t_remote modify (col01 not null);

INSERT INTO t_remote
SELECT
rownum, rownum, rpad('*',50,'*'),
rownum, rownum, rpad('*',50,'*'),
rownum, rownum, rpad('*',50,'*'),
rownum, rownum, rpad('*',50,'*'),
rownum, rownum, rpad('*',50,'*'),
rownum, rownum, rpad('*',50,'*'),
rownum, rownum, rpad('*',50,'*'),
rownum, rownum, rpad('*',50,'*'),
rownum, rownum, rpad('*',50,'*')
FROM dual
CONNECT BY level <= 10000;

commit;


create unique index t_remote_i01_pk on t_remote (col01);
alter table t_remote add (constraint t_remote_i01_pk primary key (col01) using index t_remote_i01_pk);

create index t_remote_i02 on t_remote (col02);
create index t_remote_i03 on t_remote (col03);
create index t_remote_i04 on t_remote (col04);
create index t_remote_i05 on t_remote (col05);
create index t_remote_i06 on t_remote (col06);
create index t_remote_i07 on t_remote (col07);
create index t_remote_i08 on t_remote (col08);
create index t_remote_i09 on t_remote (col09);
create index t_remote_i10 on t_remote (col10);
create index t_remote_i11 on t_remote (col11);
create index t_remote_i12 on t_remote (col12);
create index t_remote_i13 on t_remote (col13);
create index t_remote_i14 on t_remote (col14);
create index t_remote_i15 on t_remote (col15);
create index t_remote_i16 on t_remote (col16);
create index t_remote_i17 on t_remote (col17);
create index t_remote_i18 on t_remote (col18);
create index t_remote_i19 on t_remote (col19);
create index t_remote_i20 on t_remote (col20);

exec dbms_stats.gather_table_stats(user,'T_REMOTE');

--创建本地表:
drop table t_local;

CREATE TABLE t_local (
col01 NUMBER,
col02 NUMBER,
col03 VARCHAR2(50),
col04 NUMBER,
col05 NUMBER,
col06 VARCHAR2(50)
);

INSERT INTO t_local
SELECT
rownum, rownum, rpad('*',50,'*'),
rownum, rownum, rpad('*',50,'*')
FROM dual
CONNECT BY level <= 50;

COMMIT;

create index t_local_i01 on t_local (col01);
create index t_local_i02 on t_local (col02);
create index t_local_i03 on t_local (col03);
create index t_local_i04 on t_local (col04);
create index t_local_i05 on t_local (col05);
create index t_local_i06 on t_local (col06);

exec dbms_stats.gather_table_stats(user,'t_local');


create database link dblink_remote CONNECT TO test IDENTIFIED BY test USING 'ora121';


SQL> select host_name from v$instance@dblink_remote;

HOST_NAME
----------------------------------------------------------------
testdb2

SQL> select host_name from v$instance;

HOST_NAME
----------------------------------------------------------------
testdb10

SQL>

可以看到,远程表有27个字段,目前还只是在前20个字段建立了索引,且第一个字段是主键。本地表,有6个字段,6个字段都建索引。

(二)第一轮测试,远程表上有20个索引。
测试场景1:
在远程表20索引的情况下,本地表和远程表关联,用本地表的第一个字段关联远程表的第一个字段:

select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25
from t_local l, t_remote@dblink_remote r
where l.col01=r.col01
;

select * from table( dbms_xplan.display_cursor(null, null, 'typical LAST') );

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  04schqc3d9rgm, child number 0
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col01=r.col01

Plan hash value: 631452043

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |    53 (100)|          |        |      |
|   1 |  NESTED LOOPS      |          |    50 |  6300 |    53   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE |     1 |    66 |     1   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL01","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL01"
       (accessing 'DBLINK_REMOTE' )



23 rows selected.

SQL>
-- 我们这里注意一下,WHERE :1="COL01"的存在,正是因为这个条件,所以在远程是走了主键而不是全表扫。我们把这个语句带入到远程执行。

远程:
SQL> explain plan for
  2  SELECT "COL01","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL01";

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
Plan hash value: 829680338

-----------------------------------------------------------------------------------------------
| Id  | Operation                   | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                 |     1 |    63 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T_REMOTE        |     1 |    63 |     2   (0)| 00:00:01 |
|*  2 |   INDEX UNIQUE SCAN         | T_REMOTE_I01_PK |     1 |       |     1   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
---------------------------------------------------

   2 - access("COL01"=TO_NUMBER(:1))

14 rows selected.

我们可以看到,对于远程表的执行计划,这是走主键的。

测试场景2:
在远程表20索引的情况下,本地表和远程表关联,用本地表的第一个字段关联远程表的第20个字段:

select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25
from t_local l, t_remote@dblink_remote r
where l.col01=r.col20
;

select * from table( dbms_xplan.display_cursor(null, null, 'typical LAST') );
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  5rwtbwcnv0tsm, child number 0
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col01=r.col20

Plan hash value: 631452043

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   103 (100)|          |        |      |
|   1 |  NESTED LOOPS      |          |    50 |  6300 |   103   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE |     1 |    66 |     2   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL20","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL20"
       (accessing 'DBLINK_REMOTE' )



23 rows selected.

SQL>

远程:
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
Plan hash value: 3993494813

----------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |              |     1 |    63 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T_REMOTE     |     1 |    63 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | T_REMOTE_I20 |     1 |       |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
---------------------------------------------------

   2 - access("COL20"=TO_NUMBER(:1))

14 rows selected.

SQL>

我们可以看到,对于远程表的执行计划,这是走索引范围扫描的。

测试场景3:
在远程表20索引的情况下,本地表和远程表关联,用本地表的第2个字段关联远程表的第2个字段:

select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25
from t_local l, t_remote@dblink_remote r
where l.col02=r.col02
;

select * from table( dbms_xplan.display_cursor(null, null, 'typical LAST') );
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  81ctrx5huhfvq, child number 0
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col02=r.col02

Plan hash value: 631452043

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   103 (100)|          |        |      |
|   1 |  NESTED LOOPS      |          |    50 |  6300 |   103   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE |     1 |    66 |     2   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL02","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL02"
       (accessing 'DBLINK_REMOTE' )



23 rows selected.

SQL>

远程:
SQL> explain plan for
  2  SELECT "COL02","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL02";

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
Plan hash value: 2505594687

----------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |              |     1 |    63 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T_REMOTE     |     1 |    63 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | T_REMOTE_I02 |     1 |       |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
---------------------------------------------------

   2 - access("COL02"=TO_NUMBER(:1))

14 rows selected.

SQL>

我们可以看到,对于远程表的执行计划,这是走索引范围扫描的。

测试场景4:
在远程表20索引的情况下,本地表和远程表关联,用本地表的第2个字段关联远程表的第20个字段:

select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25
from t_local l, t_remote@dblink_remote r
where l.col02=r.col20
;

select * from table( dbms_xplan.display_cursor(null, null, 'typical LAST') );
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  407pxjh9mgbry, child number 0
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col02=r.col20

Plan hash value: 631452043

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   103 (100)|          |        |      |
|   1 |  NESTED LOOPS      |          |    50 |  6300 |   103   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE |     1 |    66 |     2   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL20","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL20"
       (accessing 'DBLINK_REMOTE' )



23 rows selected.

SQL>

远程:
SQL> explain plan for
  2  SELECT "COL20","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL20";

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
Plan hash value: 3993494813

----------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |              |     1 |    63 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T_REMOTE     |     1 |    63 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | T_REMOTE_I20 |     1 |       |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
---------------------------------------------------

   2 - access("COL20"=TO_NUMBER(:1))

14 rows selected.

SQL>

我们可以看到,对于远程表的执行计划,这是走索引范围扫描的。

(三)建立第21个索引:

create index t_remote_i21 on t_remote (col21);
exec dbms_stats.gather_table_stats(user,'T_REMOTE');

(四)远程表上现在有21个索引,重复上面4个测试:

测试场景1:

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  04schqc3d9rgm, child number 1
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col01=r.col01

Plan hash value: 830255788

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   156 (100)|          |        |      |
|*  1 |  HASH JOIN         |          |    50 |  6300 |   156   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE | 10000 |   644K|   153   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("L"."COL01"="R"."COL01")

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL01","COL25","COL26","COL27" FROM "T_REMOTE" "R" (accessing
       'DBLINK_REMOTE' )



28 rows selected.

SQL>

--我们看到,这里已经没有了之前的 WHERE :1="COL01",即使不带入到远程看执行计划,我们也可以猜到它是全表扫。

远程:
SQL> explain plan for
  2  SELECT "COL01","COL25","COL26","COL27" FROM "T_REMOTE" "R";

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
Plan hash value: 4187688566

------------------------------------------------------------------------------
| Id  | Operation         | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          | 10000 |   615K|   238   (0)| 00:00:01 |
|   1 |  TABLE ACCESS FULL| T_REMOTE | 10000 |   615K|   238   (0)| 00:00:01 |
------------------------------------------------------------------------------

8 rows selected.

SQL>

我们可以看到,对于远程表的执行计划,如果关联条件是远程表的第一个字段,第一个字段上的索引是被忽略的,执行计划是选择全表扫描的。

测试场景2:

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  5rwtbwcnv0tsm, child number 1
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col01=r.col20

Plan hash value: 631452043

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   103 (100)|          |        |      |
|   1 |  NESTED LOOPS      |          |    50 |  6300 |   103   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE |     1 |    66 |     2   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL20","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL20"
       (accessing 'DBLINK_REMOTE' )



23 rows selected.

SQL>

远程:
SQL> explain plan for
  2  SELECT "COL20","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL20";

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
Plan hash value: 3993494813

----------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |              |     1 |    63 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T_REMOTE     |     1 |    63 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | T_REMOTE_I20 |     1 |       |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
---------------------------------------------------

   2 - access("COL20"=TO_NUMBER(:1))

14 rows selected.

SQL>

我们可以看到,对于远程表的执行计划,如果关联条件是远程表的第20个字段,这第20个字段上的索引是没有被忽略的,执行计划是走索引。

测试场景3:

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  81ctrx5huhfvq, child number 1
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col02=r.col02

Plan hash value: 631452043

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   103 (100)|          |        |      |
|   1 |  NESTED LOOPS      |          |    50 |  6300 |   103   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE |     1 |    66 |     2   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL02","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL02"
       (accessing 'DBLINK_REMOTE' )



23 rows selected.

SQL>

远程:
SQL> explain plan for
  2  SELECT "COL02","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL02";

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
Plan hash value: 2505594687

----------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |              |     1 |    63 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T_REMOTE     |     1 |    63 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | T_REMOTE_I02 |     1 |       |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
---------------------------------------------------

   2 - access("COL02"=TO_NUMBER(:1))

14 rows selected.

SQL>

我们可以看到,对于远程表的执行计划,如果关联条件是远程表的第2个字段,这第2个字段上的索引是没有被忽略的,执行计划是走索引。

测试场景4:

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  407pxjh9mgbry, child number 1
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col02=r.col20

Plan hash value: 631452043

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   103 (100)|          |        |      |
|   1 |  NESTED LOOPS      |          |    50 |  6300 |   103   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE |     1 |    66 |     2   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL20","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL20"
       (accessing 'DBLINK_REMOTE' )



23 rows selected.

SQL>

远程:
SQL> explain plan for
  2  SELECT "COL20","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL20";

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
Plan hash value: 3993494813

----------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |              |     1 |    63 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T_REMOTE     |     1 |    63 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | T_REMOTE_I20 |     1 |       |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
---------------------------------------------------

   2 - access("COL20"=TO_NUMBER(:1))

14 rows selected.

SQL>

我们可以看到,对于远程表的执行计划,如果关联条件是远程表的第20个字段,这第20个字段上的索引是没有被忽略的,执行计划是走索引。

So far we can summarize: once the 21st index has been created on the remote table, when the local table is joined to the remote table over the dblink, a join condition on the column of the earliest-created remote index causes that index to be ignored, and a full table scan is used instead; a join condition on the column of the second-created index is not affected.
It looks as if the window of recognized indexes is 20: as soon as the 21st index is created, the 1st one is no longer seen.


(五)建立第22个索引,我们在来看看上述猜测是否符合。

create index t_remote_i22 on t_remote (col22);
exec dbms_stats.gather_table_stats(user,'T_REMOTE');

(六),目前远程表有22个索引,重复上面4个测试:

测试场景1:

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  04schqc3d9rgm, child number 2
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col01=r.col01

Plan hash value: 830255788

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   156 (100)|          |        |      |
|*  1 |  HASH JOIN         |          |    50 |  6300 |   156   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE | 10000 |   644K|   153   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("L"."COL01"="R"."COL01")

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL01","COL25","COL26","COL27" FROM "T_REMOTE" "R" (accessing
       'DBLINK_REMOTE' )



28 rows selected.

SQL>

测试场景2:

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  5rwtbwcnv0tsm, child number 2
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col01=r.col20

Plan hash value: 631452043

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   103 (100)|          |        |      |
|   1 |  NESTED LOOPS      |          |    50 |  6300 |   103   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE |     1 |    66 |     2   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL20","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL20"
       (accessing 'DBLINK_REMOTE' )



23 rows selected.

SQL>

测试场景3:

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  81ctrx5huhfvq, child number 2
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col02=r.col02

Plan hash value: 830255788

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   156 (100)|          |        |      |
|*  1 |  HASH JOIN         |          |    50 |  6300 |   156   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE | 10000 |   644K|   153   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("L"."COL02"="R"."COL02")

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL02","COL25","COL26","COL27" FROM "T_REMOTE" "R" (accessing
       'DBLINK_REMOTE' )



28 rows selected.

SQL>

测试场景4:

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  407pxjh9mgbry, child number 2
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col02=r.col20

Plan hash value: 631452043

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   103 (100)|          |        |      |
|   1 |  NESTED LOOPS      |          |    50 |  6300 |   103   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE |     1 |    66 |     2   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL20","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL20"
       (accessing 'DBLINK_REMOTE' )



23 rows selected.

SQL>


The tests above confirm our guess: for a remote table accessed through a dblink join, Oracle is only aware of the 20 most recently created indexes. The recognition window is 20 indexes wide; as soon as a new index is created, the oldest one is ignored.
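Based on this, a quick way to see which index would fall out of the window is to list the remote table's indexes by creation time. A minimal sketch, to be run on the remote database, assuming the indexes all follow the T_REMOTE_I naming prefix used in this test:

-- List T_REMOTE's indexes by creation time (sketch; the name pattern is an assumption)
-- The index at the top (earliest created) is the first one to drop out of the 20-index window
select object_name, created
  from user_objects
 where object_type = 'INDEX'
   and object_name like 'T_REMOTE_I%'
 order by created;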

(七)我们尝试rebuild索引,看看有没有效果:
rebuild第2个索引

alter index t_remote_i02 rebuild;
exec dbms_stats.gather_table_stats(user,'T_REMOTE');


(八)在第2个索引rebuild之后,重复上面4个测试:

--测试场景1:
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  04schqc3d9rgm, child number 0
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col01=r.col01

Plan hash value: 830255788

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   156 (100)|          |        |      |
|*  1 |  HASH JOIN         |          |    50 |  6300 |   156   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE | 10000 |   644K|   153   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("L"."COL01"="R"."COL01")

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL01","COL25","COL26","COL27" FROM "T_REMOTE" "R" (accessing
       'DBLINK_REMOTE' )



28 rows selected.

SQL>

--测试场景2:
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  5rwtbwcnv0tsm, child number 0
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col01=r.col20

Plan hash value: 631452043

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   103 (100)|          |        |      |
|   1 |  NESTED LOOPS      |          |    50 |  6300 |   103   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE |     1 |    66 |     2   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL20","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL20"
       (accessing 'DBLINK_REMOTE' )



23 rows selected.

SQL>


--测试场景3:
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  81ctrx5huhfvq, child number 0
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col02=r.col02

Plan hash value: 830255788

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   156 (100)|          |        |      |
|*  1 |  HASH JOIN         |          |    50 |  6300 |   156   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE | 10000 |   644K|   153   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("L"."COL02"="R"."COL02")

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL02","COL25","COL26","COL27" FROM "T_REMOTE" "R" (accessing
       'DBLINK_REMOTE' )



28 rows selected.

SQL>


--测试场景4:
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  407pxjh9mgbry, child number 0
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col02=r.col20

Plan hash value: 631452043

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   103 (100)|          |        |      |
|   1 |  NESTED LOOPS      |          |    50 |  6300 |   103   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE |     1 |    66 |     2   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL20","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL20"
       (accessing 'DBLINK_REMOTE' )



23 rows selected.

SQL>

所以我们看到,索引rebuild,是不能起到重新“唤醒”索引的作用。


(九)我们尝试 drop and recreate 第2个索引。

drop index t_remote_i02;
create index t_remote_i02 on t_remote (col02);

exec dbms_stats.gather_table_stats(user,'T_REMOTE');


(十)重复上面的测试3和测试4:

测试3:
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  81ctrx5huhfvq, child number 1
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col02=r.col02

Plan hash value: 631452043

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   103 (100)|          |        |      |
|   1 |  NESTED LOOPS      |          |    50 |  6300 |   103   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE |     1 |    66 |     2   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL02","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL02"
       (accessing 'DBLINK_REMOTE' )



23 rows selected.

SQL>

测试4:
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  407pxjh9mgbry, child number 1
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col02=r.col20

Plan hash value: 631452043

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   103 (100)|          |        |      |
|   1 |  NESTED LOOPS      |          |    50 |  6300 |   103   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE |     1 |    66 |     2   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL20","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL20"
       (accessing 'DBLINK_REMOTE' )



23 rows selected.

SQL>

At this point we can actually predict that the index on col03 of the remote table will no longer be used. Let's verify that:
测试5:
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  bhkczcfrhvsuw, child number 0
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col03=r.col03

Plan hash value: 830255788

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   157 (100)|          |        |      |
|*  1 |  HASH JOIN         |          |   500K|    89M|   157   (1)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  5400 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE | 10000 |   781K|   153   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("L"."COL03"="R"."COL03")

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL03","COL25","COL26","COL27" FROM "T_REMOTE" "R" (accessing
       'DBLINK_REMOTE' )



28 rows selected.

SQL>

We can see that dropping and recreating the index does "wake up" the 2nd index. This also confirms that the 20-index recognition window slides according to the creation time of the indexes.

In summary:

1. When joining a local table to a remote table over a dblink, if the remote table has 20 or fewer indexes, nothing is affected.
2. If the number of indexes on the remote table grows to 21 or more, Oracle ignores the earliest-created index when executing the remote operation; the window slides by 20, so the most recently created indexes are still recognized. If the join condition uses the column of that earliest-created index, the ignored index results in a full table scan.
3. To "wake up" a forgotten index, rebuilding it does not help; it has to be dropped and recreated.
4. When the local table is small, the remote table is large, the remote table has more than 20 indexes, and the join column belongs to the earliest-created index, consider the DRIVING_SITE hint to push the local table's rows to the remote site, where the join can then see that index (see the example at the end of this post). Whether to use the hint is a trade-off between the cost of shipping the full local table to the remote site and the cost of a full scan of the remote table.

附:在22个索引的情况下,尝试采用DRIVING_SITE的hint:

SQL> select  l.col06,l.col05,l.col04,r.col27, r.col26,r.col25
  2  from t_local l, t_remote@dblink_remote r
  3  where l.col02=r.col02
  4  ;

50 rows selected.

Elapsed: 00:00:00.03

Execution Plan
----------------------------------------------------------
Plan hash value: 830255788

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |    50 |  6300 |   156   (0)| 00:00:01 |        |      |
|*  1 |  HASH JOIN         |          |    50 |  6300 |   156   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE | 10000 |   644K|   153   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("L"."COL02"="R"."COL02")

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL02","COL25","COL26","COL27" FROM "T_REMOTE" "R" (accessing
       'DBLINK_REMOTE' )



Statistics
----------------------------------------------------------
        151  recursive calls
          0  db block gets
        246  consistent gets
         26  physical reads
          0  redo size
       2539  bytes sent via SQL*Net to client
        641  bytes received via SQL*Net from client
          5  SQL*Net roundtrips to/from client
         10  sorts (memory)
          0  sorts (disk)
         50  rows processed

SQL>

--We can see that the remote table is accessed with a full table scan.

SQL> select /*+DRIVING_SITE(r)*/ l.col06,l.col05,l.col04,r.col27, r.col26,r.col25
  2  from t_local l, t_remote@dblink_remote r
  3  where l.col02=r.col02
  4  ;

50 rows selected.

Elapsed: 00:00:00.03

Execution Plan
----------------------------------------------------------
Plan hash value: 1716516160

-------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name         | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT REMOTE      |              |    50 |  6450 |   103   (0)| 00:00:01 |        |      |
|   1 |  NESTED LOOPS                |              |    50 |  6450 |   103   (0)| 00:00:01 |        |      |
|   2 |   NESTED LOOPS               |              |    50 |  6450 |   103   (0)| 00:00:01 |        |      |
|   3 |    REMOTE                    | T_LOCAL      |    50 |  3300 |     3   (0)| 00:00:01 |      ! | R->S |
|*  4 |    INDEX RANGE SCAN          | T_REMOTE_I02 |     1 |       |     1   (0)| 00:00:01 | ORA12C |      |
|   5 |   TABLE ACCESS BY INDEX ROWID| T_REMOTE     |     1 |    63 |     2   (0)| 00:00:01 | ORA12C |      |
-------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access("A2"."COL02"="A1"."COL02")

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL02","COL04","COL05","COL06" FROM "T_LOCAL" "A2" (accessing '!' )


Note
-----
   - fully remote statement
   - this is an adaptive plan


Statistics
----------------------------------------------------------
        137  recursive calls
          0  db block gets
        213  consistent gets
         25  physical reads
          0  redo size
       2940  bytes sent via SQL*Net to client
        641  bytes received via SQL*Net from client
          5  SQL*Net roundtrips to/from client
         10  sorts (memory)
          0  sorts (disk)
         50  rows processed

SQL>

--可以看到本地表是走全表扫,但是远程表使用了第2个字段的索引。



pg数据库授权表给只读用户之后,权限慢慢消失


越来越多的互联网企业在使用postgresql数据库,我们也不例外。

Yesterday a developer asked me to create a read-only user abc_tmp_test and grant 32 tables owned by mkl_rw to that read-only user. OK, a simple and easy request, finished quickly.

但是今天开发来和我说,昨天授权的几个表中,有部分表还是没有权限去读取,让我帮忙看看。一开始,我以为是昨天遗漏了,先道了一个歉,再次进行了授权,授权完成之后,检查了32个表,都能被只读用户查询,于是放心的告诉开发,昨天的所有表都已经授权好了,我也检查过一次了。这次肯定不会漏了。

万万没想到,半小时后,开发来和我说,不行,还是有其中几个表没有权限。我之前的连接还没断开,再次跑了一遍之前的检查语句,确实没有权限了。卧槽?这是咋回事?数据库中有雷锋了?

我再次授权了一次,并且检查了information_schema.table_privileges,确认了再次授权后,是新增了32行记录。这次我没有先通知开发,说已经授权完成了,而是过了一会,我再次去查,变成了28行,又过了一会,变成了16行!

In other words, after I granted SELECT on the 32 tables to the read-only user, the privileges on some of those tables would gradually disappear over time! And there was no "first granted, first lost" pattern to which tables lost their privileges, but what remained in the end was always the same 16 tables. I started to question my life choices...

难道是pg中授权的表的数量有限?不能超过16个?也没查到相关的参数啊。

难道是那16个表有什么特殊设置?从建表语句中也没看到啊。

难道授权之后需要checkpoint刷盘?测试了checkpoint还是一样丢权限。

难道真的有雷锋出现啊。还说什么pg和oracle一样牛,一样稳定,连基本的授权都会丢。

正在逐个检查参数之际,同事通过检查log,发现了drop table的语句……

原来如此,这个案例,可以用下面的测试过程模拟出来了:

-bash-4.2$ psql -d mybidbaa -U mkl_rw
psql (9.6.2)
Type "help" for help.

mybidbaa=> --创建表bbbdba_test,并且授权给abc_tmp_test用户
mybidbaa=> create table bbbdba_test as select now();
SELECT 1
mybidbaa=>
mybidbaa=>
mybidbaa=> grant select on mkl_rw.bbbdba_test to abc_tmp_test;
GRANT
mybidbaa=>
mybidbaa=>
mybidbaa=>
mybidbaa=>
mybidbaa=> \q
-bash-4.2$ psql -d mybidbaa -U abc_tmp_test
psql (9.6.2)
Type "help" for help.

mybidbaa=>
mybidbaa=> --用abc_tmp_test登录,可以查到bbbdba_test表。
mybidbaa=> select * from mkl_rw.bbbdba_test;
              now
-------------------------------
 2017-11-16 16:08:14.123217+08
(1 row)

mybidbaa=> \q
-bash-4.2$
-bash-4.2$
-bash-4.2$ psql -d mybidbaa -U mkl_rw
psql (9.6.2)
Type "help" for help.

mybidbaa=> --删除表bbbdba_test,然后重建表bbbdba_test。
mybidbaa=> drop table mkl_rw.bbbdba_test;
DROP TABLE
mybidbaa=>
mybidbaa=>
mybidbaa=> create table bbbdba_test as select now();
SELECT 1
mybidbaa=> \q
-bash-4.2$
-bash-4.2$
-bash-4.2$
-bash-4.2$ psql -d mybidbaa -U abc_tmp_test
psql (9.6.2)
Type "help" for help.

mybidbaa=> --可以看到,权限丢了!!
mybidbaa=> select * from mkl_rw.bbbdba_test;
ERROR:  permission denied for relation bbbdba_test
mybidbaa=> \q
-bash-4.2$
-bash-4.2$
-bash-4.2$
-bash-4.2$

是的,如果table被drop了之后,再次重建,此时原本授权给只读用户的权限,也会消失。

向开发确认,是否有drop之后重建表的操作,开发确认,有段程序确实会定期的逐个drop表后重建表!!

为什么要进行drop表之后重建表的操作?开发说是通过调用框架清理数据,框架就是这么干的。

ok,明白了目的是为了清理数据,而不涉及到表结构的修改,那么其实用truncate来清理就可以了。如下测试,权限不会丢。

-bash-4.2$
-bash-4.2$
-bash-4.2$ psql -d mybidbaa -U mkl_rw
psql (9.6.2)
Type "help" for help.

mybidbaa=> grant select on mkl_rw.bbbdba_test to abc_tmp_test;
GRANT
mybidbaa=> truncate table mkl_rw.bbbdba_test;
TRUNCATE TABLE
mybidbaa=> \q
-bash-4.2$ psql -d mybidbaa -U abc_tmp_test
psql (9.6.2)
Type "help" for help.

mybidbaa=> select * from mkl_rw.bbbdba_test;
 now
-----
(0 rows)

mybidbaa=> \q
-bash-4.2$

最终,开发修改了代码,再次授权那32张表之后,权限不再慢慢消失了。
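As a double check after re-granting, the remaining privileges can be listed from information_schema.table_privileges, as mentioned earlier. A minimal sketch, reusing the schema and user names from this post:

-- Tables in schema mkl_rw on which the read-only user still holds SELECT (sketch)
select table_name
  from information_schema.table_privileges
 where grantee = 'abc_tmp_test'
   and table_schema = 'mkl_rw'
   and privilege_type = 'SELECT'
 order by table_name;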

这个故事中可以学到的教训有二:
1. 大千世界无奇不有,数据库中没有雷锋,而是有各种万万没想到的逻辑。
2. 幸亏我们在建库的时候,建库标准要求设置了log_statement=ddl, 才能在log中发现线索。(其实我们oracle和pg的建库标准,都设置了记录ddl)
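For reference, a sketch of how log_statement can be checked and changed (ALTER SYSTEM is available from PostgreSQL 9.4 onwards; a configuration reload is enough, no restart needed):

-- Show the current value
show log_statement;

-- Record all DDL in the server log, then reload the configuration
alter system set log_statement = 'ddl';
select pg_reload_conf();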

官方推荐的MySQL参数设置值


Today, while looking for a MySQL patch, I found a very good article on metalink. It is Oracle's officially recommended best practice for MySQL parameter settings in an OLTP environment.

下面的参数设置,对系统的性能会很有帮助。但是建议大家还是结合实际情况使用。

APPLIES TO:

MySQL Server – Version 5.6 and later
Information in this document applies to any platform.

PURPOSE

Strongly recommended initial settings for MySQL Server when used for OLTP or benchmarking.

SCOPE

For DBAs having OLTP-like workloads or doing benchmarking.

DETAILS

We recommend that when using MySQL Server 5.6 you include the following settings in your my.cnf or my.ini file if you have a transaction processing or benchmark workload. They are also a good starting point for other workloads. These settings replace some of our flexible server defaults for smaller configurations with values that are better for higher load servers. These are starting points. For most settings the optimal value depends on the specific workload and you should ideally test to find out what settings are best for your situation. The suggestions are also likely to be suitable for 5.7 but 5.7-specific notes and recommendations are a work in progress.

If a support engineer advises you to change a setting, accept that advice because it will have been given after considering the data they have collected about your specific situation.

 

Changes to make in all cases

These improve on the defaults to improve performance in some cases, reducing your chance of encountering trouble.

innodb_stats_persistent = 1         # Also use ANALYZE TABLE for all tables periodically
innodb_read_io_threads = 16       # Check pending read requests in SHOW ENGINE INNODB STATUS to see if more might be useful, if seldom more than 64 * innodb_read_io_threads, little need for more.
innodb_write_io_threads = 4
table_open_cache_instances = 16 # 5.7.8 onwards defaults to 16

metadata_locks_hash_instances = 256 # better hash from 5.6.15,5.7.3. Irrelevant and deprecated from 5.7.4 due to change in metadata locking

 

 

Main settings to review

Also make these additions and adjust as described to find reasonably appropriate values:

innodb_buffer_pool_size: the single most important performance setting for most workloads, see the memory section later for more on this and InnoDB log files. Consider increasing innodb_buffer_pool_instances (5.6 manual) from the 8 default to buffer pool size / 2GB (so 32 for 64g pool) if concurrency is high, some old benchmark results to illustrate why..

innodb_stats_persistent_sample_pages: a value in the 100 to 1000 range will produce better statistics and is likely to produce better query optimising for non-trivial queries. The time taken by ANALYZE TABLE is proportional to this and this many dives will be done for each index, so use some care about setting it to very large values.

innodb_flush_neighbors = 0 if you have SSD storage. Do not change from server default of 1 if you are using spinning disks. Use 0 if both.

innodb_page_size: consider 4k for SSD because this better matches the internal sector size on older disks but be aware that some might use the newer 16k sector size, if so, use that. Check your drive vendor for what it uses.

innodb_io_capacity: for a few spinning disks and lower end SSD the default is OK, but 100 is probably better for a single spinning disk. For higher end and bus-attached flash consider 1000. Use smaller values for systems with low write loads, larger with high. Use the smallest value needed for flushing and purging to keep up unless you see more modified/dirty pages than you want in the InnoDB buffer pool. Do not use extreme values like 20000 or more unless you have proved that lower values are not sufficient for your workload. It regulates flushing rates and related disk i/o. You can seriously harm performance by setting this or innodb_io_capacity_max too high and wasting disk i/o operations with premature flushing.

innodb_io_capacity_max: for a few spinning disks and lower end SSD the default is OK but 200-400 is probably better for a single spinning disk. For higher end and bus-attached flash consider 2500. Use smaller values for systems with low write loads, larger with high. Use the smallest value needed for flushing and purging to keep up. Twice innodb_io_capacity will often be a good choice and this can never be lower than innodb_io_capacity.
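For illustration, both of these variables are dynamic, so candidate values can be tried at runtime before being written into my.cnf. A sketch using the "higher end SSD" figures from the note (tune for your own storage):

-- Example starting values for fast SSD storage, per the guidance above
SET GLOBAL innodb_io_capacity     = 1000;
SET GLOBAL innodb_io_capacity_max = 2500;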

innodb_log_file_size = 2000M is a good starting point. Avoid making it too small, that will cause excessive adaptive flushing of modified pages. More guidance here.

innodb_lru_scan_depth: Reduce if possible. Uses disk i/o and can be a CPU and disk contention source. This multiplied by innodb_buffer_pool_instances sets how much work the page cleaner thread does each second, attempting to make that many pages free.Increase or decrease this to keep the result of multiplying the two about the same whenever you change innodb_buffer_pool_instances, unless you are deliberately trying to tune the LRU scan depth. Adjust up or down so that there are almost never no free pages but do not set it much larger than needed because the scans have a significant performance cost. A smaller value than the default is probably suitable for most workloads, give 100 a try instead of the default if you just want a lower starting point for your tuning, then adjust upwards to keep some free pages most of the time. Increase innodb_page_cleaners to lower of CPU count or buffer pools if it cannot keep up, there are limits to how much writing one thread can get done; 4 is a useful change for 5.6, 5.7 default is already 4. The replication SQL thread(s) can be seriously delayed if there are not usually free pages, since they have to wait for one to be made free. Error log Messages like “Log Messages: page_cleaner: 1000ms intended loop took 8120ms. The settings might not be optimal.” usually indicate that you have the page cleaner told to do more work than is possible in one second, so reduce the scan depth, or that there is disk contention to fix. Page cleaner has high thread priority in 5.7, particularly important not to tell it to do too much work, helps it to keep up. Document 2014477.1 has details of related settings and measurements that can help to tune this.

innodb_checksum_algorithm=strict_crc32 if a new installation, else crc32 for backwards compatibility. 5.5 and earlier cannot read tablespaces created with crc32. Crc32 is faster and particularly desirable for those using very fast storage systems like bus-attached flash such as Fusion-IO with high write rates.

innodb_log_compressed_pages = 0 if using compression. This avoids saving two copies of changes to the InnoDB log, one compressed, one not, so reduces InnoDB log writing amounts. Particularly significant if the log files are on SSD or bus-attached flash, something that should often be avoided if practical though it can help with commit rates if you do not have a write caching disk controller, at the cost of probably quite shortened SSD lifetime.

binlog_row_image = minimal assuming all tables have primary key, unsafe if not, it would prevent applying the binary logs or replication from working. Saves binary log space. Particularly significant if the binary logs are on SSD or flash, something that should often be avoided.

table_definition_cache: Set to the typical number of actively used tables within MySQL. Use SHOW GLOBAL STATUS and verify that Opened_table_definitions is not increasing by more than a few per minute. Increase until that is true or the value becomes 30000 or more, if that happens, evaluate needs and possibly increase further. Critical: see Performance Schema memory notes. Do not set to values that are much larger than required or you will greatly increase the RAM needs of PS in 5.6, much less of an issue in 5.7. Note that in 5.6 801 can cause four times the PS RAM usage of 800 by switching to large server calculation rules and 400 can be about half that of 401 if no other setting causes large rules.

table_open_cache: set no smaller than table_definition_cache, usually twice that is a good starting value. Use SHOW GLOBAL STATUS and verify that Opened_tables is not increasing by more than a few per minute. Increase until that is true or the value becomes 30000 or more, if that happens, evaluate needs and possibly increase further. Critical: see Performance Schema memory notes. Do not set to values that are much larger than required in 5.6, much less of an issue in 5.7, or you will greatly increase the RAM needs of PS. Note that in 5.6 4001 can cause four times the PS RAM usage of 4000 by switching to large server calculation rules and 2000 can be about half that of 2001 if no other setting causes large rules.
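A quick way to watch the two counters mentioned above, once the server has been under load for a while (a sketch):

-- If these keep growing by more than a few per minute, increase
-- table_definition_cache / table_open_cache (mind the Performance Schema memory notes)
SHOW GLOBAL STATUS LIKE 'Opened_table_definitions';
SHOW GLOBAL STATUS LIKE 'Opened_tables';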

max_connections: This is also used for autosizing Performance Schema. Do not set it to values that are far higher than really required or you will greatly increase the memory usage of PS. If you must have a large value here because you are using a connection cache, consider using a thread cache as well to reduce the number of connections to the MySQL server. Critical: see Performance Schema memory notes. Do not set to values that are much larger than required or you will greatly increase the RAM needs of PS. Note that 303 can cause four times the PS RAM usage of 302 by switching to large server calculation rules and 151 can be about half that of 302 if no other setting causes large rules.

open_files_limit: This is also used for autosizing Performance Schema. Do not set it to values that are far higher than really required in 5.6, less of an issue in 5.7.

sort_buffer_size = 32k is likely to be faster for OLTP, change to that from the server default of 256k. Use SHOW GLOBAL STATUS to check Sort_merge_passes. If the count is 0 or increasing by up to 10-20 per second you can decrease this and probably get a performance increase. If the count is increasing by less than 100 per second that is also probably good and smaller sort_buffer_size may be better. Use care with large sizes, setting this to 2M can reduce throughput for some workloads by 30% or more. If you see high values for Sort_merge_passes, identify the queries that are performing the sorts and either improve indexing or set the session value of sort_buffer_size to a larger value just for those queries.
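A sketch of the check described above, plus raising the buffer only in the session that really needs it (the 8M figure is just a hypothetical per-query value):

-- How often sorts have needed multiple merge passes since startup
SHOW GLOBAL STATUS LIKE 'Sort_merge_passes';

-- Keep the global default small; enlarge only in the session running the known big sort
SET SESSION sort_buffer_size = 8 * 1024 * 1024;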

innodb_adaptive_hash_index (5.6 manual) Try both 0 and 1, 0 may show improvement if you do a lot of index scans, particularly in very heavy read-only or read-mostly workloads. Some people prefer always 0 but that misses some workloads where 1 helps. There’s an improvement in concurrency from 5.7.8 to use multiple partitions and the option innodb_adaptive_hash_index_parts was added, this may change the best setting from 0 to 1 for some workloads, at the cost of slower DBT3 benchmark result with a single thread only. More work planned for 5.8.

innodb_doublewrite (5.6 manual) consider 0/off instead of the default 1/on if you can afford the data protection loss for high write load workloads. This has gradually changed from neutral to positive in 5.5 to more negative for performance in 5.6 and now 5.7.


Where there is a recommendation to check SHOW GLOBAL STATUS output you should do that after the server has been running for some time under load and has stabilised. Many values take some time to reach their steady state levels or rates.

 

 

SSD-specific settings

Ensure that trim support is enabled in your operating system, it usually is.

Set innodb_page_size=4k unless you want a larger size to try to increase compression efficiency or have an SSD with 16k sectors. Use innodb_flush_neighbors=0 .

 

Memory usage and InnoDB buffer pool

For the common case where InnoDB is storing most data, setting innodb_buffer_pool_size to a suitably large value is the key to good performance. Expect to use most of the RAM in the server for this, likely at least 50% on a dedicated database server.

The Performance Schema can be a far more substantial user of RAM than in previous versions, particularly in 5.6, less of an issue in 5.7. You should check the amount of RAM allocated for it using SHOW ENGINE PERFORMANCE_SCHEMA STATUS . Any increase of max_connections, open_files_limit, table_open_cache or table_definition_cache above the defaults causes PS to switch to allocating more RAM to allow faster or more extensive monitoring. For this reason in 5.6 in particular you should use great care not to set those values larger than required or should adjust PS memory allocation settings directly. You may need to make PS settings directly to lower values if you have tens of thousands of infrequently accessed tables. Or you can set this to a lower value in my.cnf and change to a higher value in the server init file. It is vital to consider the PS memory allocations in the RAM budget of the server. See On configuring the Performance Schema for more details on how to get started with tuning it. If all of max_connections, table_definition_cache and table_open_cache are the same as or lower than their 151, 400 and 2000 defaults small sizing rules will be used. If all are no more than twice the defaults medium will be used at about twice the small memory consumption (eg. 98 megabytes instead of 52 megabytes). If any is more than twice the default, large rules will be used and the memory usage can be about eight times the small consumption (eg. 400 megabytes). For this reason, avoid going just over the 302, 800 and 4000 values for these settings if PS is being used, or use direct settings for PS sizes. The size examples are with little data and all other settings default, production servers may see significantly larger allocations. From 5.7 the PS uses more dynamic allocations on demand so these settings are less likely to be troublesome and memory usage will vary more with demand than startup settings.
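To see how much memory the Performance Schema has actually allocated with the current settings, the note's suggestion can be run directly; the performance_schema.memory row at the end of the output gives the total:

-- The last row, performance_schema.memory, sums all Performance Schema memory
SHOW ENGINE PERFORMANCE_SCHEMA STATUS;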

Very frequent and unnecessarily large memory allocations are costly and per-connection allocations can be more costly and also can greatly increase the RAM usage of the server. Please take particular care to avoid over-large settings for: read_buffer_size, read_rnd_buffer_size, join_buffer_size, sort_buffer_size, binlog_cache_size and net_buffer_length. For OLTP work the defaults or smaller values are likely to be best. Bigger is not usually better for these workloads. Use caution with larger values, increasing sort_buffer_size from the default 256k to 4M was enough to cut OLTP performance by about 30% in 5.6. If you need bigger values for some of these, do it only in the session running the query that needs something different.

The operating system is likely to cache the total size of log files configured with innodb_log_file_size. Be sure to allow for this in your memory budget.

Thread_stack is also a session setting but it is set to the minimum safe value for using stored procedures, do not reduce it if using those. A maximum reduction of 32k might work for other workloads but remember that the server will crash effectively randomly if you get it wrong. It’s not worth touching unless you are both desperate and an expert. We increase this as and only when our tests show that the stack size is too small for safe operation. There is no need for you to increase it. Best not to touch this setting.

 

Operating systems

CPU affinity: if you are limiting the number of CPU cores, use CPU affinity to use the smallest possible number of physical CPUs to get that core count, to reduce CPU to CPU hardware consistency overhead. On Linux use commands like taskset -c 1-4 <pid of mysqld> or in windows START /AFFINITY or the Task Manager affinity control options.

 

Linux

Memory allocator: we ship built to use libc which is OK up to about 8-16 concurrent threads. From there switch to using TCMalloc using the mysqld_safe --malloc-lib option or LD_PRELOAD or experiment with the similar and possibly slightly faster jemalloc, which might do better with memory fragmentation, though we greatly reduced potential fragmentation in MySQL 5.6. TCMalloc 1.4 was shipped with many MySQL versions until 5.6.31 and 5.7.13. At the time of writing TCMalloc 2.5 is the latest version so you may want to experiment with that and jemalloc to see which works best for your workload and system.

IO scheduler: use noop or deadline. In rare cases CFQ can work better, perhaps on SAN systems, but usually it is significantly slower. echo noop > /sys/block/{DEVICE-NAME}/queue/scheduler .

nice: using nice -10 in mysqld_safe can make a small performance difference on dedicated servers, sometimes larger on highly contended servers. nice -20 can be used but you may find it hard to connect interactively if mysqld is overloaded and -10 is usually sufficient. If you really want -20, use -19 so you can still set the client mysql to -20 to get in and kill a rogue query.

Use cat /proc/$(pgrep -n mysqld)/limits to check the ulimit values for a running process. You may need ulimit -n to set the maximum open files per process and ulimit -u for a user. The MySQL open_files_limit setting should set this but verify and adjust directly if needed.

It is often suggested to use "numactl --interleave all" to prevent heavy swapping when a single large InnoDB buffer pool is all allocated on one CPU. Two alternatives exist, using multiple InnoDB buffer pools to try to prevent the allocations all going on one CPU is primary. In addition, check using SHOW VARIABLES whether your version has been built with support for the setting innodb_numa_interleave . If the setting is present, turn it on, setting to 1. It changes to interleaved mode (MPOL_INTERLEAVE) before allocating the buffer pool(s) then back to standard (MPOL_DEFAULT) after. The setting is present on builds compiled on a NUMA system from 5.6.27 onwards.

Set vm.swappiness=1 in /etc/sysctl.conf . It is often suggested to use 0 to swap only an out of memory situation but 1 will allow minimal swapping before that and is probably sufficient. Use whatever works for you but please use caution with 0. Higher values can have a tendency to try to swap out the InnoDB buffer pool to increase the OS disk cache size, a really bad idea for a dedicated database server that is doing its own write caching. If using a NUMA system, get NUMA settings in place before blaming swapping, on swappiness. It is known that a large single buffer pool can trigger very high swapping levels if NUMA settings aren’t right, the fix is to adjust the NUMA settings, not swappiness.

Do not set the setting in this paragraph by default. You must test to see if it is worth doing, getting it wrong can harm performance. Default IO queue size is 128, higher or lower can be useful, you might try experimenting with echo 1000 > /sys/block/[DEVICE]/queue/nr_requests . Not likely to be useful for single spinning disk systems, more likely on RAID setups.

Do not set the setting in this paragraph by default. You must test to see if it is worth doing, getting it wrong can harm performance. The VM subsystem dirty ratios can be adjusted from the defaults of 10 and 20. To set a temporary value for testing maybe use echo 5 > /proc/sys/vm/dirty_background_ratio and echo 60 > /proc/sys/vm/dirty_ratio . After proving what works best you can add these parameters to the /etc/sysctl.conf : vm.dirty_background_ratio = 5 vm.dirty_ratio = 60 .  Please do follow the instruction to test, it is vital not to just change this and 5 and 60 are just examples.

Tools to monitor various parts of a linux system.

 

Linux Filesystems

We recommend that you use ext4 mounted with (rw,noatime,nodiratime,nobarrier,data=ordered) unless ultimate speed is required, because ext4 is somewhat easier to work with. If you do not have a battery backed up write caching disk controller you can probably improve your write performance by as much as 50% by using the ext4 option data=journal and then the MySQL option skip-innodb_doublewrite. The ext4 option provides the protection against torn pages that the doublewrite buffer provides but with less overhead. The benefit with a write caching controller is likely to be minimal.

XFS is likely to be faster than ext4, perhaps for fsync speed, but it is more difficult to work with. Use mount options (rw,noatime,nodiratime,nobarrier,logbufs=8,logbsize=32k).

ext3 isn’t too bad but ext4 is better. Avoid ext2, it has significant limits. Best to avoid these two.

NFS in homebrew setups has more reliability problems than NFS in professional SAN or other storage systems which works well but may be slower than directly attached SSD or bus-attached SSD. It’s a balance of features and performance, with SAN performance possibly being boosted by large caches and drive arrays. Most common issue is locked InnoDB log files after a power outage, time or switching log files solves this. Incidence of problems has declined over the last ten years and as of 2016 is now low. If possible use NFSv4 or later protocol for its improved locking handling. If concerned about out of order application of changes, not a problem normally observed in practice, consider using TCP and hard,intr mount option.

 

Solaris

Use LD_PRELOAD for one of the multi-threaded oriented mallocs, either mtmalloc or umem.

Use UFS/forcedirectio

Use ZFS.

 

Windows

To support more connections or connection rates higher than about 32 per second you may need to set MaxUserPort higher and TcpTimedWaitDelay for TCP/IP, particularly for Windows Server 2003 and earlier. The defaults are likely to be no more than 4000 ports and TIME_WAIT of 120 seconds. See Settings that can be Modified to Improve Network Performance. Settings of 32768 ports and between 30 and 5 seconds timeout are likely to be appropriate for server usage. The symptom of incorrect settings is likely to be a sudden failure to connect after the port limit is reached, resuming at a slow rate as the timeout slowly frees ports.

 

Hardware

Battery-backed write-caching disk controllers are useful for all spinning disk setups and also for SSD. SSD alone is a cheaper way to get faster transaction commits than spinning disks, for lower load systems. Do not trust that the controller disables the hard drive write buffers, test with real power outages. You will probably lose data even with a battery if the controller has not disabled the hard drive write buffers.

It is best to split files across disk types in these general groups:

SSD: data, InnoDB undo logs, maybe temporary tables if not using tmpfs or other RAM-based storage for them.

Spinning disks: Binary logs, InnoDB redo logs, bulk data. Also, large SATA drives are cheap and useful for working and archival space as well as the biggest of bulk data sets.

Bus-attached SSD: the tables with the very highest change rates in the most highly loaded systems only.

You can put individual InnoDB tables on different drives, allowing use of SSD for fast storage and SATA for bulk.

Hyperthreading on is likely to be a good choice in most cases. MySQL 5.6 scales up to somewhere in the range of 32-48 cores with InnoDB and hyperthreading counts as an extra core for this purpose. For 5.5 that would be about 16 and before that about 8 cores. If you have more physical cores either without hyperthreading or more when it is enabled, experiment to determine the optimal number to use for MySQL. There is no fixed answer because it depends on workload properties.

 

Thread pool

Use the thread pool if you routinely run with more than about 128 concurrently active connections. Use it to keep the server at the optimal number of concurrently running operations, which is typically in the range between 32 and 48 threads on high core count servers in MySQL 5.6. If not using the thread pool, use innodb_thread_concurrency if you see that your server has trouble with a build-up of queries above about 128 or so concurrently running operations inside InnoDB. InnoDB shows positive scalability up to an optimal number of running jobs, then negative scalability but innodb_thread_concurrency  = 0 has lower overhead when that regulating is not needed, so there is some trade off in throughput stability vs raw performance. The value for peak throughput depends on the application and hardware. If you see a benchmark that compares MySQL with a thread pool to MySQL without, but which does not set innodb_thread_concurrency, that is an indication that you should not trust the benchmark result: no production 5.6 server should be run with thousands of concurrently running threads and no limit to InnoDB concurrency.

 

Background

Here are more details of why some of these changes should be made.

innodb_stats_persistent = 1

Enables persistent statistics in InnoDB, producing more stable and usually better query optimiser decisions. Very strongly recommended for all servers. With persistent statistics you should run ANALYZE TABLE periodically to update the statistics. Once a week or month is probably sufficient for tables that have fairly stable or gradually changing sizes. For tables that are small or have very rapidly changing contents more frequent will be beneficial. There are minimal possible disadvantages, mainly the need for ANALYZE TABLE sometimes.
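A sketch of what this looks like in practice (the table name below is just a placeholder):

-- innodb_stats_persistent is normally set in my.cnf, but the variable is also dynamic
SET GLOBAL innodb_stats_persistent = ON;

-- Refresh the persistent statistics periodically, e.g. from a weekly job
ANALYZE TABLE mydb.orders;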

innodb_read_io_threads = 16, innodb_write_io_threads = 4

Increases the number of threads used for some types of InnoDB operation, though not the foreground query processing work. That can help the server to keep up with heavy workloads. No significant negative effects for most workloads, though sometimes contention for disk resources between these threads and foreground threads might be an issue if disk utilisation is near 100%.

table_open_cache_instances = 16

Improves the speed of operations involving tables at higher concurrency levels, important for reducing the contention in this area to an insignificant level. No significant disadvantages. This is unlikely to be a bottleneck until 24 cores are in full use but given the lack of cost it is best to set it high enough and never worry about it.

metadata_locks_hash_instances = 256

Reduces the effects of locking during the metadata locking that is used mainly for consistency around DDL. This has been an important bottleneck. As well as the general performance benefit, the hashing algorithm used has been shown to be non-ideal for some situations and that also makes it desirable to increase this value above the default, to reduce the chance of encountering that issue. We’re addressing that hash also but this will still be a useful setting with no significant negatives.

innodb_flush_neighbors = 0

When set to 1 InnoDB will look to flush nearby data pages as an optimisation for spinning disks. That optimisation is harmful for SSDs because it increases the number of writes. Set to 0 for data on SSDs, 1 for spinning disks. If mixed, 0 is probably best.

innodb_log_file_size = 2000M

This is a critical setting for workloads that do lots of data modification and severe adverse performance will result if it is set too small. You must check the amount of log space used and ensure that it never reaches 75%. You must also consider the effect of your adaptive flushing settings and ensure that the percentage of the log space used does not cause excessive flushing. You can do that by using larger log files or having adaptive flushing start at a higher percentage. There is a trade off in this size because the total amount of log file space will usually be cached in operating system caches due to the nature of the read-modify-write operations performed. You must allow for this in the memory budget of the server to ensure that swapping does not occur. On SSD systems you can significantly extend the life of the drive by ensuring that this is set to a suitably high value to allow lots of dirty page caching and write combining before pages are flushed to disk.

table_definition_cache

Reduces the need to open tables to get dictionary information about the table structures. If set too low this can have a severe negative performance effect. There is little negative effect for the size range given on the table definition cache itself. See the Performance Schema portion of the memory notes above for critical memory usage considerations.

table_open_cache

Reduces the need to open tables to access data. If set too low this can have severe negative performance effects. There is little negative effect for the size range given on the table open cache itself. See the Performance Schema portion of the memory notes above for critical memory usage considerations.

sort_buffer_size = 32k

The key cost here is reduced server speed from setting this too high. Many common recommendations to use several megabytes or more have been made in a wide range of published sources and these are harmful for OLTP workloads, which normally benefit most from 32k or other small values. Do not set this to significantly larger values such as above 256k unless you see very excessive numbers of Sort_merge_passes – many hundreds or thousands per second on busy servers. Even then, it is far better to adjust the setting only in the connection of the few queries that will benefit from the larger size. In cases where it is impossible to adjust settings at the session level and when the workload is mixed it can be useful to use higher than ideal OLTP values to address the needs of the mixture of queries.

 

Other observations

Query cache

The query cache is effectively a single-threaded bottleneck. It can help performance at low query rates and concurrency, perhaps up to 4 cores routinely used. Above that it is likely to become a serious bottleneck. Leave this off unless you want to test it with your workload, and have measurements that will tell you if it is helping or hurting. Ensure that Qcache_free_blocks in global status is not above 10,000. 5,000 is a good action level. Above these levels the CPU time used in scans of the free list can be an issue, check with FLUSH QUERY CACHE, which defragments the free list, the change in CPU use is the cost of the free list size you had. Reducing the size is the most effective way to manage the free list size. Remember that the query cache was designed for sizes of up to a few tens of megabytes, if you’re using hundreds of megabytes you should check performance with great care, it’s well outside of its design limitations. Also check for waits with:

SELECT EVENT_NAME AS nm, COUNT_STAR AS cnt, sum_timer_wait, CONCAT(ROUND( sum_timer_wait / 1000000000000, 2), ' s') AS sec
FROM performance_schema.events_stages_summary_global_by_event_name WHERE COUNT_STAR > 0 ORDER BY SUM_TIMER_WAIT DESC LIMIT 20;

Also see MySQL Query Cache Fragmentation Slows Down the Server (Doc ID 1308051.1).

 

Sync_binlog and innodb_flush_log_at_trx_commit

The 1 setting for these causes one fsync each at every transaction commit in 5.5 and earlier. From 5.6 concurrent commit support helps greatly to reduce that, but you should still use care with these settings. Sync_binlog=1 can be expected to cause perhaps a 20% throughput drop with concurrent commit and 30 connections trying to commit, an effect that shrinks as the count of actively working connections increases, up to a peak throughput at about 100 working connections. To check the effect, set sync_binlog to 0 and observe, then set innodb_flush_log_at_trx_commit = 0 and observe. Try innodb_flush_log_at_trx_commit = 2 also; it has less overhead than 1 and more than 0. Finally try both at 0. The speed increase from the 0 settings will be greatest on spinning disks at low concurrency and lowest at higher concurrency on fast SSD or with write-caching disk controllers.
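Both variables are dynamic, so a quick A/B test can be sketched like this (remember to restore the durable settings afterwards):

SET GLOBAL sync_binlog = 0;
SET GLOBAL innodb_flush_log_at_trx_commit = 2;
-- ... run the benchmark and observe throughput ...
SET GLOBAL sync_binlog = 1;
SET GLOBAL innodb_flush_log_at_trx_commit = 1;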

Note that it is mandatory to use innodb_flush_log_at_trx_commit=1 to get full durability guarantees. Write-caching disk controllers with battery backup are the typical way to combine full durability with a low performance penalty.

 

 

 

Bugs that affect upgrades and usage for 5.6 compared to 5.5 and earlier

http://bugs.mysql.com/bug.php?id=69174

Innodb_max_dirty_pages_pct is effectively broken at present, only working when the server has been idle for a second or more. There are a range of implications:

1. In past versions the limit on innodb_log_file_size sometimes made it necessary to use this setting to avoid hitting 75% of log space use and having a production disruption incident due to hitting async flushing at 75%. The much gentler flushing batches from innodb_max_dirty_pages_pct were normally acceptable, and it wasn't uncommon for systems with large buffer pools and high needs to have innodb_max_dirty_pages_pct set to values in the 2-5% range just for this reason. In 5.6 you have two possibilities that should work better:

1a. You can use larger values for innodb_log_file_size. That will let you use more of your buffer pool for write combining and reduce total io operations, instead of being forced to do lots of avoidable ones just to avoid reaching 75% of the log file use. Be sure you allow for the RAM your OS will use for buffering the log files, assume as much RAM use as the total log file space you set. This should greatly increase the value of larger buffer pools for high write load workloads.

1b. You can set innodb_adaptive_flushing_lwm to avoid reaching 75% of log space use. The highest permitted value is 70%, so adaptive flushing will start to increase the flushing rate before the server gets to 75% of the log file use. 70% is a good setting for systems with low write rates or very fast disk systems that can easily handle a burst of writes. For others you should adjust to whatever lower value it takes to produce a smooth transition from innodb_io_capacity based flushing to adaptive flushing. The default is 10%, which is probably too low for most production systems, but reasonable for a default that has to handle a wide range of possible cases.

2. You can’t effectively use the normal practice of gradually reducing innodb_max_dirty_pages_pct before a shutdown, to reduce outage duration. The best workaround at present is to set innodb_io_capacity to high values so it will cause more flushing.

3. You can’t use innodb_max_dirty_pages_pct to manage crash recovery time, something it could do with less disruptive writing than the alternative of letting the server hit async flushing at 75% of log file space use, after deliberately setting innodb_log_file_size too low. The workarounds are to use higher than desirable innodb_io_capacity and smaller than desirable innodb_log_file_size. Both cause unnecessary flushing compared to using innodb_max_dirty_pages_pct for this task. Before using a too small innodb_log_file_size, experiment with innodb_io_capacity and innodb_adaptive_flushing_lwm. Also ensure that innodb_io_capacity_max is set to around twice innodb_io_capacity, rarely up to four or more times. This may eliminate the issue with less redundant io than very constrained log file sizes because adaptive flushing will increase the writing rate as the percentage of log space used increases, so you should be able to reach almost any target recovery time limit, though still at the cost of more io than using innodb_max_dirty_pages_pct to do it only when a hard cap is reached.

4. You can’t use innodb_max_dirty_pages_pct to effectively regulate the maximum percentage of dirty pages in the buffer pool, constraining them to a target value. This is likely to be of particular significance during data loading and with well cached workloads where you want to control the split between pages used for caching modified data and pages used for caching data used purely for reads.

 

The workaround for this is to regard innodb_adaptive_flushing_lwm as equivalent to the use of innodb_max_dirty_pages_pct for normal production and set it to something like 60% with a suitable value of innodb_io_capacity for the times when the workload hasn’t reached that amount of log file usage. Start low like 100 and gradually increase so that at medium load times it just about keeps up. Have innodb_io_capacity_max set to a relatively high value so that as soon as the low water mark is passed, lots of extra IO will be done to cap the dirty pages/log space use.

You may then be able to reduce the size of your InnoDB log files if you find that you don’t reach 60% of log space use when you have reached a suitable percentage of dirty pages for the page read/write balance for your server. If you can you should do this because you can reallocate the OS RAM used for caching the bigger log files to the InnoDB buffer pool or other uses.
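Pulling the workaround together, a my.cnf sketch might look like this (the numbers are placeholders to be tuned per server, not recommendations):

# illustrative only
innodb_adaptive_flushing_lwm = 60
innodb_io_capacity           = 100      # start low; raise until medium-load flushing just keeps up
innodb_io_capacity_max       = 2000     # relatively high, to cap dirty pages once the LWM is passed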

 

REFERENCES

http://mysqlserverteam.com/removing-scalability-bottlenecks-in-the-metadata-locking-and-thr_lock-subsystems-in-mysql-5-7/
https://bugs.mysql.com/bug.php?id=68487
http://www.brendangregg.com/linuxperf.html
https://dev.mysql.com/doc/refman/5.6/en/innodb-adaptive-hash.html
https://dev.mysql.com/doc/refman/5.7/en/innodb-adaptive-hash.html
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_read_io_threads
https://dev.mysql.com/doc/refman/5.6/en/server-status-variables.html#statvar_Innodb_page_size
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_buffer_pool_size
NOTE:2014477.1 – MySQL 5.7 Log Messages: page_cleaner: 1000ms intended loop took 8120ms. The settings might not be optimal. (flushed=0 and evicted=25273, during the time.)
NOTE:1308051.1 – MySQL Query Cache Fragmentation Slows Down the Server
https://dev.mysql.com/doc/refman/5.6/en/mysqld-safe.html#option_mysqld_safe_malloc-lib
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_numa_interleave
http://msdn.microsoft.com/en-us/library/ee377084.aspx
https://dev.mysql.com/doc/refman/5.6/en/server-options.html#option_mysqld_init-file
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_log_compressed_pages
http://dimitrik.free.fr/blog/archives/2013/02/mysql-performance-mysql-56-ga-vs-mysql-55-tuning-details.html
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_stats_persistent
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_stats_persistent_sample_pages
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_buffer_pool_instances
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_neighbors
http://mysqlmusings.blogspot.co.uk/2012/06/binary-log-group-commit-in-mysql-56.html
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_io_capacity_max
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_write_io_threads
https://dev.mysql.com/doc/refman/5.6/en/server-options.html#option_mysqld_open-files-limit
https://dev.mysql.com/doc/refman/5.6/en/server-system-variables.html#sysvar_sort_buffer_size
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_checksum_algorithm
NOTE:1604225.1 – Autosized Performance Schema Options in MySQL Server in MySQL 5.6
http://marcalff.blogspot.co.uk/2013/04/on-configuring-performance-schema.html
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_doublewrite
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_page_size
https://dev.mysql.com/doc/refman/5.6/en/server-system-variables.html#sysvar_table_open_cache_instances
https://dev.mysql.com/doc/refman/5.6/en/server-system-variables.html#sysvar_metadata_locks_hash_instances
https://dev.mysql.com/doc/refman/5.6/en/replication-options-binary-log.html#sysvar_binlog_row_image
https://dev.mysql.com/doc/refman/5.6/en/server-system-variables.html#sysvar_table_definition_cache
https://dev.mysql.com/doc/refman/5.6/en/server-system-variables.html#sysvar_table_open_cache
https://dev.mysql.com/doc/refman/5.6/en/server-system-variables.html#sysvar_max_connections
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_lru_scan_depth
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_io_capacity
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_log_file_size
https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_buffer_pool_instances
https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_doublewrite
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_doublewrite
https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_adaptive_hash_index_parts
https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_adaptive_hash_index
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_adaptive_hash_index
http://dimitrik.free.fr/blog/archives/2012/10/mysql-performance-innodb-buffer-pool-instances-in-56.html
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_buffer_pool_instances
https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_page_cleaners

Reference: Recommended Settings for MySQL 5.6, 5.7 Server for Online Transaction Processing (OLTP) and Benchmarking (Doc ID 1531329.1)

A quick look at the Oracle 18c new features


Oracle 18c was released on 2018-02-16. In keeping with Oracle's cloud-first philosophy, 18c is currently available only on Oracle Cloud and on Engineered Systems; if you want to install 18c on traditional hardware, you will probably have to wait until the second half of 2018.

Below is a quick review, from my own perspective, of the 18c new features worth paying attention to (there are of course many other features worth a look):

(1) Availability
1. Oracle Data Guard Multi-Instance Redo Apply Supports Use of Block Change Tracking Files for RMAN Backups
Multi-Instance Redo Apply (also called MIRA; see page 48 of my earlier "Oracle 12.2 New Features" notes) can now be used together with Block Change Tracking (BCT) backups. For very large databases where both primary and standby are RAC and backups run on the standby, this is a very effective form of incremental backup.

2. Automatic Correction of Non-logged Blocks at a Data Guard Standby Database
Two new standby nologging modes have been added (mainly to speed up data loading on the primary):
The first is Standby Nologging for Data Availability: the commit of the load operation is delayed until every standby has applied the data.

SQL> ALTER DATABASE SET STANDBY NOLOGGING FOR DATA AVAILABILITY;

The second is Standby Nologging for Load Performance. It is similar to the mode above, but when a network bottleneck is hit during the load, the data is simply not sent to the standby. This protects load performance at the cost of data missing on the standby, and the missing data is later fetched again from the primary.

SQL> ALTER DATABASE SET STANDBY NOLOGGING FOR LOAD PERFORMANCE;

3. Shadow Lost Write Protection
A shadow tablespace (note: a bigfile tablespace) is created to provide lost-write protection, so you no longer need an ADG standby just for extra lost-write protection.
MySQL folks, does this remind you of anything? Doesn't it look a lot like the doublewrite buffer? Heh.

4. Backups from non-CDBs are usable after migration to CDB
A former non-CDB can be plugged into an existing CDB as a PDB and its pre-migration backups remain usable.

5. Support for PDBs as Shards and Catalogs
At last, a shard can be a PDB! But! But! But! Only a single PDB per CDB is supported. Ha, which is almost the same as not supporting it at all.

6. User-Defined Sharding Method
This feature existed in the 12.2 beta but was pulled from the official release. Now it has been released again.

7. Consistency Levels for Multi-Shard Queries
A new MULTISHARD_QUERY_DATA_CONSISTENCY initialization parameter can be set before running a multi-shard query to avoid SCN synchronization across shards.

8. Manual termination of run-away queries
You can now kill a single statement without disconnecting its session: ALTER SYSTEM CANCEL SQL.

ALTER SYSTEM CANCEL SQL 'SID, SERIAL, @INST_ID, SQL_ID';

(2) Big Data and Data Warehousing
9. Approximate Top-N Query Processing
Note that 18c adds APPROX_COUNT and APPROX_SUM to be used together with APPROX_RANK.

10. LOB support with IMC, Big Data SQL
LOB columns are now supported in the In-Memory column store.

(3) Database Overall
11. Copying a PDB in an Oracle Data Guard Environment
Two new parameters make it easier to create PDBs in an ADG environment.
One is STANDBY_PDB_SOURCE_FILE_DIRECTORY, which lets the standby find the data files automatically (before 18c, when you plugged a PDB into a CDB that had a standby, you had to copy the files to the PDB's OMF path on the standby by hand).

The other is STANDBY_PDB_SOURCE_FILE_DBLINK, which lets the standby find the files automatically during a remote clone (before 18c a local clone did not need the data files copied, but a remote clone required a manual copy).

12. PDB Lockdown Profile Enhancements
PDB lockdown profiles can now be created in the application root and in the CDB root. If you are not yet familiar with application roots and lockdown profiles, see pages 7 and 40 of my earlier "Oracle 12.2 New Features" notes.

You can also create a new PDB lockdown profile based on an existing one.

18c ships with three default lockdown profiles: PRIVATE_DBAAS, SAAS and PUBLIC_DBAAS.

13. Refreshable PDB Switchover
PDB refresh has long been called the poor man's ADG, and in 18c it keeps getting more useful: switchover is now supported, in both planned and unplanned flavors.
A planned switchover can be switched back and is mainly used to balance load across CDBs.
An unplanned switchover is mainly for when the PDB master fails, so the whole CDB does not have to switch over.
If you are not yet familiar with PDB refresh, see page 37 of my earlier "Oracle 12.2 New Features" notes.

14. PDB Snapshot Carousel
A carousel of PDB snapshots: by default 8 snapshots are kept and one is taken every 24 hours.

ALTER PLUGGABLE DATABASE SNAPSHOT MODE EVERY 24 HOURS;

15. New Default Location of Oracle Database Password File
Note that the password file now lives under ORACLE_BASE rather than ORACLE_HOME.

16. Read-Only Oracle Home
A read-only Oracle home can be enabled at install time with DBCA or with roohctl -enable.
Run the orabasehome command to check whether the current Oracle home is read-only: if its output equals $ORACLE_HOME, the home is read-write; if the output is ORACLE_BASE/homes/HOME_NAME, the home is read-only.
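A minimal check along the lines described above:

orabasehome
echo $ORACLE_HOME
# if the two outputs match, the home is read-write; otherwise it is read-only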

17. Online Merging of Partitions and Subpartitions
Partitions and subpartitions can now be merged online; note that the ONLINE keyword is required.
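A hedged sketch of the syntax (table and partition names are made up):

ALTER TABLE sales MERGE PARTITIONS sales_q1, sales_q2
  INTO PARTITION sales_h1 UPDATE INDEXES ONLINE;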

18. Concurrent SQL Execution with SQL Performance Analyzer
SPA trials can now run concurrently (serial is still the default), so SPA tests finish faster.

(4) Performance
19. Automatic In-Memory
Automatic In-Memory uses the Heat Map to evict rarely accessed In-Memory columns when memory is under pressure.

20. Database In-Memory Support for External Tables
External tables now support the In-Memory feature.

21. Memoptimized Rowstore
The SGA gains a memoptimize pool, sized by the MEMOPTIMIZE_POOL_SIZE parameter. When fast lookup is enabled, this memory area is used for fast key-based lookups.
To enable fast lookup you add a keyword to the CREATE TABLE statement, as in the sketch below:
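(The original post does not show the DDL. As a hedged sketch, the keyword is, to my knowledge, MEMOPTIMIZE FOR READ; the table needs a primary key, and the names below are made up.)

CREATE TABLE iot_readings (
  sensor_id  NUMBER PRIMARY KEY,
  reading    NUMBER,
  read_time  DATE
) MEMOPTIMIZE FOR READ;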

Queries by primary key can then use fast lookup.

The Memoptimized Rowstore should greatly speed up the high-frequency primary-key lookups typical of IoT workloads.

(5) RAC and Grid
22. ASM Database Cloning
PDBs can now be cloned at the ASM level, built on ASM flex disk groups. For flex disk groups, see page 17 of my earlier "Oracle 12.2 New Features" notes.


23. Converting Normal or High Redundancy Disk Groups to Flex Disk Groups without Restricted Mount

Heh, more encouragement to move to flex disk groups.

(6) Security
24. Integration of Active Directory Services with Oracle Database
Integration with Microsoft Active Directory. Before 18c you had to go through Oracle Enterprise User Security (EUS); now centrally managed users (CMU) can map AD users and groups directly to Oracle users and roles.

(7) Miscellaneous
25. New initialization parameters:

ADG_ACCOUNT_INFO_TRACKING

FORWARD_LISTENER

INMEMORY_AUTOMATIC_LEVEL

INMEMORY_OPTIMIZED_ARITHMETIC

MEMOPTIMIZE_POOL_SIZE

MULTISHARD_QUERY_DATA_CONSISTENCY

OPTIMIZER_IGNORE_HINTS

OPTIMIZER_IGNORE_PARALLEL_HINTS

PARALLEL_MIN_DEGREE

PRIVATE_TEMP_TABLE_PREFIX

STANDBY_PDB_SOURCE_FILE_DBLINK

STANDBY_PDB_SOURCE_FILE_DIRECTORY

TDE_CONFIGURATION

UNIFIED_AUDIT_SYSTEMLOG

WALLET_ROOT

OPTIMIZER_IGNORE_PARALLEL_HINTS deserves a special mention: in a pure OLTP system you can finally switch off the uncontrolled parallelism added by developers (who usually don't even specify a degree). :)

26. dbms_session.sleep
You can now simply run:

exec dbms_session.sleep(3);

so there is no longer any need to grant the dbms_lock package separately.

Looking at these new features, what do you think? "DBAs are dead"? "Oracle is going self-driving"? Do these features still need a DBA? :)


References:
1. Oracle Database Database New Features Guide, 18c
2. ORACLE 18C: ORACLE 18C.. NEW FEATURES.. WHAT’S NEWS..
3. Franck Pachot (@FranckPachot)

Deploying and using Outline


Outline is a censorship-circumvention tool developed by Jigsaw, a project under Alphabet, and Alphabet is Google's parent company.
So now you see: this is a tool from Google.

Outline's official site is:
https://getoutline.org/en/home

Outline has a server side and a client side.
1. Clients are available for many platforms, including Android, iOS and so on:

The iOS download is here; at the moment it is still available in the China App Store.

2. The server side has to be installed on a server you run yourself. The installation is very simple, but you still need a computer to drive it, so first download Outline Manager:

We will use the Mac version as the example; the Mac Outline Manager download is here.

Download and install it:


Launch Outline Manager from Launchpad; it will walk you through installing the Outline server on your own machine.


By default it uses servers from the cloud provider DigitalOcean.


Of course you can also use a server from any other cloud provider:


We will take the "other cloud server" route as the example; click Get Started:

See? Very simple, only two steps.

So on my cloud server I just run the following command:

wget -qO- https://raw.githubusercontent.com/Jigsaw-Code/outline-server/master/src/server_manager/install_scripts/install_server.sh | bash


Note that this requires Docker to be installed and running on the server, and the firewall to be turned off. If not, you may hit the same error I did:

[root@vultr outline_server]# wget -qO- https://raw.githubusercontent.com/Jigsaw-Code/outline-server/master/src/server_manager/install_scripts/install_server.sh | bash
> Verifying that Docker is installed .......... Docker CE must be installed, please run "curl -sS https://get.docker.com/ | sh" or visit https://docs.docker.com/install/

Sorry! Something went wrong. If you can't figure this out, please copy and paste all this output into the Outline Manager screen, and send it to us, to see if we can help you.
[root@vultr outline_server]#


So first I need to install Docker:

[root@vultr outline_server]# curl -sS https://get.docker.com/ | sh

# Executing docker install script, commit: e749601
+ sh -c 'yum install -y -q yum-utils'
+ sh -c 'yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo'
Loaded plugins: fastestmirror
adding repo from: https://download.docker.com/linux/centos/docker-ce.repo
grabbing file https://download.docker.com/linux/centos/docker-ce.repo to /etc/yum.repos.d/docker-ce.repo
repo saved to /etc/yum.repos.d/docker-ce.repo
+ '[' edge '!=' stable ']'
+ sh -c 'yum-config-manager --enable docker-ce-edge'
Loaded plugins: fastestmirror
============================================================================================== repo: docker-ce-edge ==============================================================================================
[docker-ce-edge]
async = True
bandwidth = 0
base_persistdir = /var/lib/yum/repos/x86_64/7
baseurl = https://download.docker.com/linux/centos/7/x86_64/edge
cache = 0
cachedir = /var/cache/yum/x86_64/7/docker-ce-edge
check_config_file_age = True
compare_providers_priority = 80
cost = 1000
deltarpm_metadata_percentage = 100
deltarpm_percentage =
enabled = 1
enablegroups = True
exclude =
failovermethod = priority
ftp_disable_epsv = False
gpgcadir = /var/lib/yum/repos/x86_64/7/docker-ce-edge/gpgcadir
gpgcakey =
gpgcheck = True
gpgdir = /var/lib/yum/repos/x86_64/7/docker-ce-edge/gpgdir
gpgkey = https://download.docker.com/linux/centos/gpg
hdrdir = /var/cache/yum/x86_64/7/docker-ce-edge/headers
http_caching = all
includepkgs =
ip_resolve =
keepalive = True
keepcache = False
mddownloadpolicy = sqlite
mdpolicy = group:small
mediaid =
metadata_expire = 21600
metadata_expire_filter = read-only:present
metalink =
minrate = 0
mirrorlist =
mirrorlist_expire = 86400
name = Docker CE Edge - x86_64
old_base_cache_dir =
password =
persistdir = /var/lib/yum/repos/x86_64/7/docker-ce-edge
pkgdir = /var/cache/yum/x86_64/7/docker-ce-edge/packages
proxy = False
proxy_dict =
proxy_password =
proxy_username =
repo_gpgcheck = False
retries = 10
skip_if_unavailable = False
ssl_check_cert_permissions = True
sslcacert =
sslclientcert =
sslclientkey =
sslverify = True
throttle = 0
timeout = 30.0
ui_id = docker-ce-edge/x86_64
ui_repoid_vars = releasever,
   basearch
username =

+ sh -c 'yum makecache'
Loaded plugins: fastestmirror
base                                                                                                                                                                                       | 3.6 kB  00:00:00
docker-ce-edge                                                                                                                                                                             | 2.9 kB  00:00:00
docker-ce-stable                                                                                                                                                                           | 2.9 kB  00:00:00
extras                                                                                                                                                                                     | 3.4 kB  00:00:00
updates                                                                                                                                                                                    | 3.4 kB  00:00:00
(1/14): docker-ce-edge/x86_64/filelists_db                                                                                                                                                 | 8.5 kB  00:00:00
(2/14): docker-ce-edge/x86_64/primary_db                                                                                                                                                   |  15 kB  00:00:00
(3/14): docker-ce-stable/x86_64/primary_db                                                                                                                                                 |  12 kB  00:00:00
(4/14): docker-ce-edge/x86_64/other_db                                                                                                                                                     |  62 kB  00:00:00
(5/14): docker-ce-stable/x86_64/other_db                                                                                                                                                   |  66 kB  00:00:00
(6/14): docker-ce-stable/x86_64/filelists_db                                                                                                                                               | 7.3 kB  00:00:00
(7/14): extras/7/x86_64/prestodelta                                                                                                                                                        | 129 kB  00:00:00
(8/14): extras/7/x86_64/filelists_db                                                                                                                                                       | 709 kB  00:00:00
(9/14): updates/7/x86_64/prestodelta                                                                                                                                                       | 960 kB  00:00:00
(10/14): base/7/x86_64/other_db                                                                                                                                                            | 2.5 MB  00:00:01
(11/14): updates/7/x86_64/other_db                                                                                                                                                         | 734 kB  00:00:00
(12/14): extras/7/x86_64/other_db                                                                                                                                                          | 121 kB  00:00:00
(13/14): updates/7/x86_64/filelists_db                                                                                                                                                     | 4.2 MB  00:00:00
(14/14): base/7/x86_64/filelists_db                                                                                                                                                        | 6.7 MB  00:00:01
Loading mirror speeds from cached hostfile
 * base: repo1.dal.innoscale.net
 * extras: repo1.ash.innoscale.net
 * updates: mirror.nodesdirect.com
Metadata Cache Created
+ sh -c 'yum install -y -q docker-ce'
warning: /var/cache/yum/x86_64/7/docker-ce-edge/packages/docker-ce-18.03.0.ce-1.el7.centos.x86_64.rpm: Header V4 RSA/SHA512 Signature, key ID 621e9f35: NOKEY
Public key for docker-ce-18.03.0.ce-1.el7.centos.x86_64.rpm is not installed
Importing GPG key 0x621E9F35:
 Userid     : "Docker Release (CE rpm) <docker@docker.com>"
 Fingerprint: 060a 61c5 1b55 8a7f 742b 77aa c52f eb6b 621e 9f35
 From       : https://download.docker.com/linux/centos/gpg
If you would like to use Docker as a non-root user, you should now consider
adding your user to the "docker" group with something like:

  sudo usermod -aG docker your-user

Remember that you will have to log out and back in for this to take effect!

WARNING: Adding a user to the "docker" group will grant the ability to run
         containers which can be used to obtain root privileges on the
         docker host.
         Refer to https://docs.docker.com/engine/security/security/#docker-daemon-attack-surface
         for more information.
[root@vultr outline_server]#
[root@vultr outline_server]#


Then start the Docker service:

[root@vultr outline_server]# service docker start
Redirecting to /bin/systemctl start docker.service
[root@vultr outline_server]#


Then stop the firewall:

[root@vultr outline_server]# service firewalld stop

Now run the install again:

[root@vultr outline_server]# wget -qO- https://raw.githubusercontent.com/Jigsaw-Code/outline-server/master/src/server_manager/install_scripts/install_server.sh | bash
> Verifying that Docker is installed .......... OK
> Verifying that Docker daemon is running ..... OK
> Creating persistent state dir ............... OK
> Generating secret key ....................... OK
> Generating TLS certificate .................. OK
> Generating SHA-256 certificate fingerprint .. OK
> Starting Shadowbox .......................... Unable to find image 'quay.io/outline/shadowbox:stable' locally
stable: Pulling from outline/shadowbox
605ce1bd3f31: Pulling fs layer
9d1b67fd48b4: Pulling fs layer
f87706f29a6f: Pulling fs layer
b50c2fcde876: Pulling fs layer
e1ecd3c15a4b: Pulling fs layer
f72ac4625f86: Pulling fs layer
98be2229c9b1: Pulling fs layer
5b2bb8abc0c7: Pulling fs layer
3852ab6d98b2: Pulling fs layer
8219c6ace457: Pulling fs layer
88c337662eb5: Pulling fs layer
5ce0d168fc22: Pulling fs layer
170df050f533: Pulling fs layer
b50c2fcde876: Waiting
e1ecd3c15a4b: Waiting
f72ac4625f86: Waiting
98be2229c9b1: Waiting
5b2bb8abc0c7: Waiting
3852ab6d98b2: Waiting
8219c6ace457: Waiting
88c337662eb5: Waiting
5ce0d168fc22: Waiting
170df050f533: Waiting
605ce1bd3f31: Verifying Checksum
605ce1bd3f31: Download complete
f87706f29a6f: Verifying Checksum
f87706f29a6f: Download complete
9d1b67fd48b4: Verifying Checksum
9d1b67fd48b4: Download complete
b50c2fcde876: Verifying Checksum
b50c2fcde876: Download complete
e1ecd3c15a4b: Verifying Checksum
e1ecd3c15a4b: Download complete
605ce1bd3f31: Pull complete
98be2229c9b1: Verifying Checksum
98be2229c9b1: Download complete
5b2bb8abc0c7: Verifying Checksum
5b2bb8abc0c7: Download complete
3852ab6d98b2: Verifying Checksum
3852ab6d98b2: Download complete
f72ac4625f86: Verifying Checksum
f72ac4625f86: Download complete
8219c6ace457: Verifying Checksum
8219c6ace457: Download complete
88c337662eb5: Verifying Checksum
88c337662eb5: Download complete
170df050f533: Verifying Checksum
170df050f533: Download complete
5ce0d168fc22: Verifying Checksum
5ce0d168fc22: Download complete
9d1b67fd48b4: Pull complete
f87706f29a6f: Pull complete
b50c2fcde876: Pull complete
e1ecd3c15a4b: Pull complete
f72ac4625f86: Pull complete
98be2229c9b1: Pull complete
5b2bb8abc0c7: Pull complete
3852ab6d98b2: Pull complete
8219c6ace457: Pull complete
88c337662eb5: Pull complete
5ce0d168fc22: Pull complete
170df050f533: Pull complete
Digest: sha256:ed974a668b0c858781188882cde0c802afa9a36337587884a4e7ff6a5e96ec5b
Status: Downloaded newer image for quay.io/outline/shadowbox:stable
OK
> Starting Watchtower ......................... Unable to find image 'v2tec/watchtower:latest' locally
latest: Pulling from v2tec/watchtower
a5415f98d52c: Pulling fs layer
c3f7208ad77c: Pulling fs layer
169c1e589d74: Pulling fs layer
a5415f98d52c: Verifying Checksum
a5415f98d52c: Download complete
c3f7208ad77c: Verifying Checksum
c3f7208ad77c: Download complete
169c1e589d74: Verifying Checksum
169c1e589d74: Download complete
a5415f98d52c: Pull complete
c3f7208ad77c: Pull complete
169c1e589d74: Pull complete
Digest: sha256:4cb6299fe87dcbfe0f13dcc5a11bf44bd9628a4dae0035fecb8cc2b88ff0fc79
Status: Downloaded newer image for v2tec/watchtower:latest
OK
> Waiting for Outline server to be healthy .... OK
> Creating first user ......................... OK
> Adding API URL to config .................... OK
> Checking host firewall ...................... OK

CONGRATULATIONS! Your Outline server is up and running.

To manage your Outline server, please copy the following text (including curly
brackets) into Step 2 of the Outline Manager interface:

{
  "apiUrl": "https://11.22.33.44:51714/-9w7ZBvaEt88dwpb1dASFD",
  "certSha256": "2349DDF1D15SGDEESE504TSREQQ59060B42044B04A47A32635ASB4EE249HSFES"
}

If have connection problems, it may be that your router or cloud provider
blocks inbound connections, even though your machine seems to allow them.

- If you plan to have a single access key to access your server make sure
  ports 51714 and 50581 are open for TCP and UDP on
  your router or cloud provider.
- If you plan on adding additional access keys, you’ll have to open ports
  1024 through 65535 on your router or cloud provider since the Outline
  Server may allocate any of those ports to new access keys.

[root@vultr outline_server]#

Note: the apiUrl/certSha256 block above is what you paste into Step 2 of Outline Manager:


Click Done and it connects to the remote server. Then click ADD KEY in the UI:


After the key is generated, click Share.


A share link is generated; send it to someone else or to yourself.


Open the link and you will see "connect to this server"; click it and you get an address starting with ss://. Enter that address into the iPhone client and tap Add Server:


Then you are ready to go.

The whole internet is now reachable without obstruction.

Note 1: if you remove a key in Outline Manager, the key you handed out becomes invalid and the iPhone (or other) client can no longer connect with it.
Note 2: this is a global proxy with no rule support, so keep an eye on your traffic.
Note 3: if you do want rule-based routing, that is easy too: paste the ss:// address into Shadowrocket and it is automatically converted into IP, password, port and cipher, so you can use it in Shadowrocket with rules.

Finally, a quick look under the hood.
Outline still talks the Shadowsocks protocol; it just wraps it in Docker: the ss server is packaged into a Docker image, and that image is deployed onto your machine.
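If you want to confirm this on the server, a simple check is the following (the container name shadowbox is what the install script used in my output above):

docker ps --filter name=shadowbox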

Sending keepalived logs to a separate file on CentOS 7


Installing keepalived:

cd 
./configure --prefix=/usr/local/keepalived

make &&  make install

mkdir /etc/keepalived
mkdir /etc/keepalived/scripts
cp /usr/local/keepalived/etc/keepalived/keepalived.conf /etc/keepalived/
cp /root/keepalived-2.0.6/keepalived/etc/init.d/keepalived  /etc/init.d/
cp /usr/local/keepalived/sbin/keepalived /sbin/keepalived
cp /usr/local/keepalived/etc/sysconfig/keepalived /etc/sysconfig/
chmod +x /etc/init.d/keepalived

By default keepalived writes its log into /var/log/messages; we want to split it out into its own file.

On CentOS 6 this is enough:

(1) First edit /etc/sysconfig/keepalived: comment out the old line and add the new one:
#KEEPALIVED_OPTIONS="-D"
KEEPALIVED_OPTIONS="-D -d -S 0" 

(2) Then edit /etc/rsyslog.conf and add:
local0.* /var/log/keepalived.log

On CentOS 7 you additionally need to edit /lib/systemd/system/keepalived.service:

## CentOS 7 only: CentOS 7 drives services through systemctl/systemd, so /lib/systemd/system/keepalived.service has to be changed.

Change:
EnvironmentFile=-/usr/local/keepalived/etc/sysconfig/keepalived
ExecStart=/usr/local/keepalived/sbin/keepalived $KEEPALIVED_OPTIONS
to:
EnvironmentFile=/etc/sysconfig/keepalived
ExecStart=/sbin/keepalived $KEEPALIVED_OPTIONS

Then reload the unit files:
systemctl daemon-reload

The overall flow is:

1. keepalived is started with systemctl start keepalived;
2. on startup, systemd reads the unit file /lib/systemd/system/keepalived.service;
3. in that unit file:
3.1 ExecStart=/sbin/keepalived $KEEPALIVED_OPTIONS, i.e. keepalived is started with the options taken from the environment file;
3.2 the environment file is read via EnvironmentFile=/etc/sysconfig/keepalived;
4. $KEEPALIVED_OPTIONS is set in /etc/sysconfig/keepalived; we set KEEPALIVED_OPTIONS="-D -d -S 0", where -S selects the syslog facility and 0 means local0, and /etc/rsyslog.conf routes local0.* to /var/log/keepalived.log;
5. so the log ends up in /var/log/keepalived.log.
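To make the change take effect and verify it, something along these lines should do (assuming both services are managed by systemd):

systemctl restart rsyslog
systemctl restart keepalived
tail -f /var/log/keepalived.log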

Raft protocol study notes


Note: Raft assumes messages are trustworthy by default; it does not defend against malicious (harmful) messages.

(1) There are two styles of consensus:
One is leader-less (symmetric): there is no leader, all nodes are equal, and a client may contact any node.
The other is leader-based (asymmetric): at any point in time there is only one leader, the other nodes accept the leader's decisions, and clients interact only with the leader.

Raft is a leader-based consensus protocol. Note that Raft is a protocol, not a formula: it is a set of agreed-upon rules under which a distributed system reaches consensus.

(2) Server states:
Server states
A node is in one of three states: follower, candidate or leader. The roles are defined as follows:
(1) Leader: handles all client requests and tracks replication progress; at any moment the cluster has only one leader.
(2) Follower: completely passive; it never issues RPCs (Remote Procedure Calls) of its own and only responds to them.
(3) Candidate: used to elect a leader.

Initially every node is a follower (F for short). If a node receives no RPC from the leader (L) within the prescribed time, it starts an election by first becoming a candidate (C).
At a given moment there may be several candidates. They all came from followers, but the moment each follower turned into a candidate differs from node to node, each candidate starts its vote at a slightly different instant, and the network paths to the followers differ, so replies also come back at different times.
If a candidate receives agreement from a majority of nodes within the election period, it becomes the leader; otherwise it starts the vote for the next term.

(3) Terms
Raft divides time into terms. Each term has an election phase and a normal-operation phase (during which there is exactly one leader).
Some terms have no leader at all (the election failed to produce one).
Every node keeps the current term value.
raft term
So at the beginning everyone is a follower and we are in the election phase (blue); once a leader is elected we enter normal operation (green); on a timeout or failure we go back into an election phase, the nodes vote, and when the vote finishes we return to normal operation.
In Term 1 a leader has to be elected from scratch, so the blue part is relatively long. In Term 2 a leader already exists, so only the leader-follower heartbeat needs confirming and the blue part is short. In Term 3 the leader dies and a new vote is needed; once a node wins the vote we enter Term 4. Term 5 is like Term 2.

(4) Persisted node state
1. Before responding to an RPC, every node synchronously persists the following:
Persistent state
Doesn't this look like a tiny database? There is a log (the history of terms and commands) and there is data (the current term and votedFor).

2. State that does not need to be persisted:
Non-persistent state

(5) Heartbeats and timeouts:
1. There are two kinds of RPCs: AppendEntries (log writes and heartbeats), sent by the leader; and Vote requests, sent by candidates.
Timeouts come in two kinds; the first is (to be added; see http://thesecretlivesofdata.com/raft/)
2. Initially every node is a follower.
3. A follower expects RPCs from candidates or the leader.
4. The leader must keep sending AppendEntries (heartbeats) to maintain its leadership.
5. If a follower receives no RPC within the election timeout (typically 100-500 ms), it assumes the leader has died and starts a new election.

(6) Elections:
When a follower starts an election it first picks a timeout between ∆election and 2·∆election.
It increments the current term value.
It changes its state to candidate, casts its first vote for itself, and sends Vote RPCs to all other nodes. Then:
1. If it receives replies from a majority of the nodes within the timeout, it becomes the leader and sends AppendEntries to all other nodes.
2. If another node has already become leader and it receives an AppendEntries request from that node, it steps down to follower.
3. If it hears nothing from the other nodes and the timeout expires, it starts a new election.
4. Once the election completes, how is its correctness guaranteed?
4.1 Safety (at most one leader per term):
(a) each node casts at most one vote per term;
(b) two different candidates cannot both accumulate a majority of votes in the same term.
4.2 Liveness (some candidate must eventually win):
(a) election timeouts are random (between ∆election and 2·∆election);
(b) the node that times out first usually wins the election before the others wake up;
(c) this liveness mechanism works well when ∆election >> broadcast time.

(7) Log structure:
1. The log is persisted on disk for crash recovery.
2. "Committed" means the entry has been written by a majority of the nodes.
3. Eventual consistency.
Log structure
As the figure shows, each log entry has two parts: the term value and the command.

(8) Normal operation:
1. The client sends a command to the leader.
2. The leader appends the command to its log.
3. The leader sends AppendEntries RPCs to the followers.
4. Once the new entry is committed, the leader applies the command to its own state machine and returns the result to the client. In subsequent AppendEntries RPCs the leader tells the followers which entries are committed, and the followers apply those commands to their own state machines.
5. For crashed or slow followers, the leader keeps retrying until it succeeds. Performance in the common case is optimal: one successful RPC round to a majority of the nodes.

(9) Consistency
1. What log consistency means: entries shared by the nodes' logs agree on index and term, and if a given entry is committed then all entries before it are committed as well.
In the figure below the numbers 1-6 are log indexes. All nodes agree on index and term for the shared entries, and since the entry at index=4 with term T2 is committed, all entries before it are committed too.
Consistency in logs

2. The AppendEntries consistency check: every AppendEntries RPC carries the index and term to be processed; the follower must already contain a matching entry, otherwise it rejects the incoming AppendEntries request. Applying this step recursively guarantees consistency.
In the figures below, every incoming AppendEntries RPC carries an index and a term, and in the second figure the follower's entries do not match the leader's, so it rejects the new AppendEntries request for log index 5.
AppendEntries Consistency

(10) How a leader comes to power
1. At the start of a new leader's term:
1.1 the old leader may have left some entries only partially replicated;
1.2 the new leader just performs "normal operation"; it does nothing special;
1.3 the leader's log is "the truth": everything is reconciled against the leader's log;
1.4 the followers' logs eventually converge to the leader's log (eventual consistency);
1.5 repeated crashes can leave many extraneous log entries behind.
At beginning of new leader’s term

2. Safety requirements:
2.1 If a leader has decided that a log entry is committed, that entry will be present in the logs of all future leaders.
2.2 A leader never overwrites entries in its log: only entries present in the leader's log can be committed, and an entry must be committed before it is applied to a state machine.
2.3 Cluster membership changes happen one server at a time, never several at once; even when several servers must change, internally the change is broken into single-server steps. Changing several at once could produce a leader on the old configuration and a leader on the new configuration at the same time, i.e. split brain.

2.4 Membership changes use a two-phase approach (there is a period during which the old configuration C-old and the new configuration C-new are both in effect).


Feature limitations of Azure's managed MySQL


 

Limitations of Microsoft Azure's managed MySQL service (as of April 2018):

Variety
- Supported managed database types: SQL Server; MySQL / MariaDB; PostgreSQL; CosmosDB (MongoDB-like); Redis

High availability
- Supported regions: Azure China accounts support 2 regions; Azure global accounts support 22 regions
- High availability of the managed service itself (e.g. multi-AZ): supported

Backup and recovery
- Online creation of replicas: not supported
- Cloning: supported on Azure China accounts; not supported on Azure global accounts
- Log backups: not supported
- Point-in-time restore (multiple storage engines): not supported; no point-in-time restore menu was found

Replication
- PaaS-level replication over a dedicated line (e.g. a China-US line): only with a proxy configured by the Azure backend
- Replication-lag view in the console: not supported
- Dedicated full-migration tool for the managed service: not supported
- Dedicated incremental-migration tool: supported, via VNET service tunnel
- MySQL GTID-based replication: not supported, but can be enabled by the backend via a support ticket

Privilege management
- root-style user management: not supported

Security and auditing
- VPC: supported
- Security groups: not supported
- Auditing: not supported; no database-audit menu was found
- Storage (at-rest) encryption: not supported; no at-rest encryption menu was found
- Connection encryption (SSL): supported

Monitoring
- Common metrics and alerting: Azure China accounts provide no metrics; Azure global accounts do
- Database-level performance monitoring (wait events, long transactions): not supported
- Viewing database error logs: not supported

Teardown
- Keep a final snapshot in cloud storage on deletion: not supported

Scalability
- Maximum CPU / memory / IOPS: 32 vCores, 160 GB memory, 100-30,000 IOPS
- Online scale-up and scale-down: supported

Suppressing query output in MySQL


Sometimes you want to see how long a statement takes without the output scrolling past. Each database has its own trick:
(1) Oracle: set autotrace trace; to restore, set autotrace off
(2) PostgreSQL: EXPLAIN ANALYZE
(3) MySQL: pager cat > /dev/null; to restore, just type pager

Here is the MySQL version in action:

mysql> pager
Default pager wasn't set, using stdout.
mysql> 
mysql> select count(*) from orasup1;
+----------+
| count(*) |
+----------+
|   960896 |
+----------+
1 row in set (0.60 sec)

mysql> pager cat > /dev/null
PAGER set to 'cat > /dev/null'
mysql> 
mysql> select count(*) from orasup1;
1 row in set (0.65 sec)

mysql> pager
Default pager wasn't set, using stdout.
mysql> 
mysql> select count(*) from orasup1;
+----------+
| count(*) |
+----------+
|   960896 |
+----------+
1 row in set (0.63 sec)

mysql>

Reference: Fun with the MySQL pager command

SQL Server报错The datediff function resulted in an overflow


One of our Zabbix monitors raised this error:

The datediff function resulted in an overflow. The number of dateparts separating two date/time instances is too large. Try to use datediff with a less precise datepart.

The error turned out to come from the following monitoring query:

select count(*) as cnt from sys.sysprocesses 
where  DateDiff(ss,last_batch,getDate())>=10 
and lastwaittype Like 'PAGE%LATCH_%' 
And waitresource Like '2:%'

This query monitors PAGELATCH_UP contention on tempdb. It uses the DATEDIFF function, and when DATEDIFF's return value overflows you get the error above.

So when does DATEDIFF overflow? The documentation defines DATEDIFF as returning an int, whose range is -2,147,483,648 to +2,147,483,647. Our first suspicion was that the difference between the start time and the current time had overflowed.

When could that happen? For a recently started process the difference is tiny and cannot overflow. What about the process whose timestamp is farthest in the past?

SQL Server processes are either client processes or system processes. Client processes are usually recent, so their time difference will not overflow. Could a system process be causing it?

Because system processes are not started by a client, their last_batch is the instance start time, so we checked when the instance started:

SELECT sqlserver_start_time FROM sys.dm_os_sys_info;

It was 2018-04-14 22:02:46.377. Could that cause an overflow?

Back to the documentation:

With a seconds datepart, i.e. DATEDIFF(ss, ...), the two dates can be up to 68 years, 19 days, 3 hours, 14 minutes and 7 seconds apart before overflowing, and our instance had been up nowhere near 68 years.

We also ran the query several times without the WHERE clause and saw nothing earlier than 2018-04-14.

Just as we were getting stuck, we remembered that msawr was deployed on this instance, periodically snapshotting performance metrics, so we went looking for clues there.

Sure enough, msawr showed sessions whose last_batch was earlier than the instance start time: 1900-01-01 00:00:00.000.

The documentation describes last_batch as follows:

last_batch is a datetime value, and per the documentation the default value of the datetime type is 1900:

And the last_batch column is defined as NOT NULL:

In other words, when the value would otherwise be NULL, the datetime default fills it in, which is where 1900-01-01 00:00:00.000 comes from.

So when does sysprocesses end up with a NULL last_batch that gets rendered as 1900-01-01 00:00:00.000? Most articles online point to Microsoft's "INF: Last Batch Date is Seen as 1900-01-01 00:00:00.000" at http://support.microsoft.com/?kbid=306625, but if you click through you will find that article is gone (404).

Fortunately another article provided the hint:

It says:

However, it's possible to create a connection to SQL Server without issuing any RPC calls at all. In this case, the value of last_batch will never have been set and master..sysprocesses will display the value as 1900-01-01 00:00:00.000.

That is, for sessions that were not created through an RPC (remote procedure call), last_batch is never set, stays NULL, and is then displayed as 1900-01-01 00:00:00.000.

Looking at lastwaittype for these sessions, most of them were CXPACKET parallel-query waits.

So these are parallel worker sessions, spawned locally rather than through an RPC. Their last_batch was never updated, stayed NULL, and was rendered as 1900, which caused our overflow.
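As a quick illustration (not part of the original troubleshooting), the overflow is easy to reproduce, since the number of seconds from 1900-01-01 to any recent date is far beyond the int range:

SELECT DATEDIFF(ss, '1900-01-01', GETDATE());
-- raises: The datediff function resulted in an overflow. ...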

The fix is simple: since 1900 is a fixed value, just add the condition and last_batch <> '1900-01-01 00:00:00.000':

select count(*) as cnt from sys.sysprocesses 
where  DateDiff(ss,last_batch,getDate())>=10 
and lastwaittype Like 'PAGE%LATCH_%' 
And waitresource Like '2:%'
and last_batch<>'1900-01-01 00:00:00.000'

Fixing domains that fail to resolve on OpenWrt


A router I flashed with OpenWrt a while back reaches Google and Baidu just fine, but quite a few sites are still unreachable, including my own blog.

A quick check showed that my blog's domain could not be resolved:

root@OpenWrt:/etc/dnsmasq.d# dig oracleblog.org

; <<>> DiG 9.9.4 <<>> oracleblog.org
;; global options: +cmd
;; connection timed out; no servers could be reached
root@OpenWrt:/etc/dnsmasq.d# 
root@OpenWrt:/etc/dnsmasq.d#
root@OpenWrt:/etc/dnsmasq.d#
root@OpenWrt:/etc/dnsmasq.d#
root@OpenWrt:/etc/dnsmasq.d# dig youtube.com

; <<>> DiG 9.9.4 <<>> youtube.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 21312
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1024
;; QUESTION SECTION:
;youtube.com.                   IN      A

;; ANSWER SECTION:
youtube.com.            613     IN      A       216.58.221.238

;; Query time: 33 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Feb 01 15:05:10 CST 2019
;; MSG SIZE  rcvd: 56

root@OpenWrt:/etc/dnsmasq.d#

So it is purely a DNS resolution problem. I resolve names through dnsmasq, and /etc/dnsmasq.d already contains the domains that need special handling, so everything else should fall through to the default configuration.

It turned out that no-resolv and server were not configured at all. Adding the following to /etc/dnsmasq.conf fixed it (see the block between the "Add by Jimmy" markers):

# for targets which are names from DHCP or /etc/hosts. Give host
# "bert" another name, bertrand
# The fields are <cname>,<target>
#cname=bertand,bert
conf-dir=/etc/dnsmasq.d
#Add by Jimmy BEGIN HERE 
no-poll
no-resolv
all-servers
cache-size=5000
server=114.114.114.114
#Add by Jimmy END HERE
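
After editing the configuration, dnsmasq has to be restarted for the change to take effect; on OpenWrt that is typically:

/etc/init.d/dnsmasq restart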

root@OpenWrt:/etc/dnsmasq.d# dig oracleblog.org

; <<>> DiG 9.9.4 <<>> oracleblog.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 24044
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4000
;; QUESTION SECTION:
;oracleblog.org.                        IN      A

;; ANSWER SECTION:
oracleblog.org.         300     IN      A       45.76.217.207

;; Query time: 259 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Feb 01 16:24:10 CST 2019
;; MSG SIZE  rcvd: 59

root@OpenWrt:/etc/dnsmasq.d#

Best practices for AWS RDS version upgrades


This post sat in my drafts for quite a while. I started it last October, but other work kept getting in the way, which is why "now" in my charts means October 2018. I finally finished it over the New Year holiday.

Being force-upgraded by AWS RDS is a frustrating experience: once a version falls out of support, the forced upgrade hurts availability. Rather than being upgraded passively, it is better to define a proactive upgrade strategy.

1. The AWS RDS upgrade cycle:
According to the Amazon RDS FAQs, AWS RDS supports each major version for at least 3 years and each minor version for at least 1 year.

From talking with AWS we learned that an RDS release usually follows roughly 5 months after the corresponding community release.

So the cycle is: about 5 months after a community release, AWS ships the matching RDS version; each major version is then supported for at least 3 years and each minor version for at least 1 year.

2. What happens when an RDS version expires:
According to the Amazon RDS FAQs, when a major or minor version reaches the end of its support period, Amazon notifies customers in advance (6 months ahead for major versions, 3 months for minor versions). Once the notification period ends, AWS automatically force-upgrades the database to a newer version (even if the customer has disabled automatic minor version upgrades). During the upgrade the application cannot connect to the database, so the business is impacted.

  • Note 1: whether major or minor, once a version passes the end of Amazon's support period it faces a forced upgrade.
  • Note 2: a minor version upgrade consists of a backup, the upgrade itself, and another backup. In our experience the first and last backups do not block normal access, while the upgrade step does. The whole process takes about 30 minutes, of which roughly 3.5 minutes affect application access, but measure the actual impact with your own tests.
  • Note 3: starting one week before the deadline of the notification period, you can no longer modify the instance at all, including creating a replica or changing the maintenance window. You can, however, restore a database from a snapshot to test how long the upgrade takes.
  • Note 4: a minor version upgrade upgrades the replicas first and then the primary.

3. What the upgrade does internally:

Namely:
a) Before the upgrade, take a snapshot; how long this takes depends on the database size.
b) Perform a slow shutdown, i.e. set global innodb_fast_shutdown=0 and then shut down (see the sketch after this list). Because of the slow shutdown, dirty buffers are flushed to disk, the insert buffer is merged to disk (into the system tablespace, ibdata1), and a full purge is done (unneeded undo pages are cleaned up).
c) Start MySQL on the new engine binaries with remote network access disabled.
d) Run mysql_upgrade to upgrade the data dictionary.
e) Run RDS-specific scripts to upgrade the RDS-provided tables and stored procedures.
f) Restart the instance and re-enable remote network connections.
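On a self-managed MySQL, the slow-shutdown part of step b) amounts to something like the following (RDS does this for you):

SET GLOBAL innodb_fast_shutdown = 0;
-- then shut the server down, e.g. with: mysqladmin shutdown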

  • Note 1: in some cases mysql_upgrade physically rebuilds tables, so table size affects the upgrade time and the real duration must be measured. For example, upgrading from MySQL 5.6.4 to 5.7 rebuilds tables because the storage format of TIME, DATETIME and TIMESTAMP changed in 5.6.4.
  • Note 2: a major upgrade cannot skip a major version. To go from 5.5.46 to 5.7.19 you must first upgrade 5.5.46 to a 5.6 release such as 5.6.37 and then to 5.7.19, so the business impact is that of two upgrades, not one. For that reason we do not plan upgrades that skip a major version (such as 5.5 straight to 5.7).

4. Version release roadmap:
From the community release dates and the dates AWS published each version, we can draw the release roadmaps below.
MySQL:

Postgresql:

  • Note 1: the first light-green cell marks the community GA of the first release in a series, the grey cells mark the GA dates of the later community minor releases, and the corresponding AWS releases are the colored cells.
  • Note 2: AWS normally supports a minor version for at least a year (12 months), but some minors have already been supported for more than 12 months and could be retired at any time, so the chart only runs up to the current date (October 2018). (A bar that stops does not mean the version is unsupported; it means the AWS release is more than 12 months old and may be end-of-lifed and forced offline at any point after that.)

5. Upgrade best practices:
5.1 Major version upgrades:
a) Create two replica instances.
b) Upgrade one of them to the higher version; it still keeps replicating from the primary.
c) Create a DMS instance, configure the source and target endpoints, and create the task; when creating the task choose "changes only" and untick "Start task on create".
d) Start of the outage: promote the newly created replica to a standalone primary.
e) Click Start on the DMS task and wait for it to finish the full data comparison and begin capturing incremental changes.
f) Switch the application to the higher-version database.

  • Note 1: for tables created before 5.6.4 and upgraded past 5.6.4, you need ALTER TABLE table_name FORCE to rebuild them before CREATE INDEX can run as online DDL (see the sketch after this list).
  • Note 2: a major upgrade needs application performance validation; capture at least a week of SQL and replay it to compare performance.
  • Note 3: after the upgrade, to cut physical reads and get data into memory quickly, you can prewarm with mysqldump.
  • Note 4: to reduce downtime, note that the full-data comparison triggered by starting the DMS task runs during the outage, so raising the primary's IOPS speeds that step up and shortens the outage.
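A small illustration of notes 1 and 3 (schema and table names here are made up):

-- rebuild a table created before 5.6.4 so its temporal columns use the new format
ALTER TABLE mydb.mytable FORCE;

-- rough prewarm idea: a full dump reads every row, pulling the data through the buffer pool
-- (run from a shell: mysqldump mydb > /dev/null)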


aws mysql major version upgrade best practise.pdf

5.2 Minor version upgrades:

Option 1:
a) Create a replica instance, or use an existing one.
b) Upgrade the replica to the higher version; it keeps replicating from the primary.
c) Start of the outage: promote the higher-version replica to primary.
d) Point the application at the higher-version database. The connection string can be prepared in advance, so only an application restart is needed.

aws mysql minor version upgrade best practise.pdf


Option 2:
a) Upgrade the replica to the higher version first; AWS requires replicas to be upgraded before the primary in any case.
b) Start of the outage: stop the application's connections to the database and begin upgrading the primary.
c) Upgrade the primary to the higher version.
d) Restore the application connections.

aws mysql minor version upgrade best practise_2.pdf

 

  • Note 1: Option 1 is the approach AWS recommends, but Option 2 is also a good fit for small systems.
  • Note 2: with Option 1, the impact window is the replica promotion time plus the application restart. In one of our tests the promotion took about 3 minutes 2 seconds; with the restart, roughly three and a half minutes in total.
  • Note 3: with Option 2, one of our test databases took about 34 minutes end to end (including the pre- and post-upgrade backups that AWS performs itself), but the database stays available during those backups; the real time the application could not connect was 3 minutes 32 seconds.
  • Note 4: both options give roughly the same unavailability, about three and a half minutes. Option 1 carries a risk, though: if we are upgrading because a forced minor upgrade is imminent and we are already at the maintenance deadline, then even though we never touched the primary, a failure that forces us to fall back to the primary could collide with the forced upgrade being triggered, leaving things in an uncontrolled state. That is why we chose Option 2.
  • Note 5: which option to choose should ultimately be decided by rehearsing the upgrade in a test environment.

6. Summary:
We can therefore adopt the following proactive upgrade strategy:

(1) Disable all automatic minor version upgrades.

(2) As described above, standardize new MySQL installations on 5.7.23.

(3) Within one year, move existing MySQL 5.5 instances to 5.5.61 and MySQL 5.6 instances to 5.6.41. This avoids forced upgrades of unsupported minor versions, and the next forced-upgrade date for these two versions is no earlier than September 2019. (The same reasoning applies to PostgreSQL.)

(4) Within one year, upgrade the remaining MySQL 5.5 instances to 5.6; within two years, upgrade MySQL 5.6 to 5.7; within two to three years, consolidate on MySQL 8.0. This removes the operational burden of running many versions side by side. (The same reasoning applies to PostgreSQL.)

(5) Later upgrades then follow a cadence of one minor version per year and one major version every three years, in line with the AWS RDS support rules.



References:

Upgrading the MySQL DB Engine

AWS RDS Blog

AWS RDS forum

What’s New with AWS

What’s New with AWS – RDS/

Changes in MySQL 5.7

MySQL 8.0 Release Notes 

MySQL 5.7 Release Notes 

MySQL 5.6 Release Notes

MySQL 5.5 Release Notes

Check and Upgrade MySQL Tables

Amazon RDS FAQs

Best Practices for Upgrading Amazon RDS for MySQL and Amazon RDS for MariaDB

Innodb三大特性之insert buffer

InnoDB Insert Buffer(插入缓冲)

Upgrading the PostgreSQL DB Engine

PostgreSQL Release Notes
