Upgrade etcd from 3.4 to 3.5
Processes, checklists, and notes on upgrading etcd from 3.4 to 3.5
In the general case, upgrading from etcd 3.4 to 3.5 can be a zero-downtime, rolling upgrade:
- one by one, stop the etcd v3.4 processes and replace them with etcd v3.5 processes
- after running all v3.5 processes, new features in v3.5 are available to the cluster
Before starting an upgrade, read through the rest of this guide to prepare.
Upgrade checklists
NOTE: When migrating from v2 with no v3 data, etcd server v3.2+ panics when it restores from existing snapshots but finds no v3 ETCD_DATA_DIR/member/snap/db file. This happens when the server had migrated from v2 with no previous v3 data. The panic also prevents accidental v3 data loss (e.g. the db file might have been moved). etcd requires that post-v3 migration only happens with v3 data present. Do not upgrade to newer v3 versions until the v3.0 server contains v3 data.
NOTE: If your cluster enables auth, a rolling upgrade from 3.4 or an older version isn’t supported because 3.5 changes the format of WAL entries related to auth.
Highlighted breaking changes in 3.5.
Deprecated etcd_debugging_mvcc_db_total_size_in_bytes Prometheus metrics
v3.5 promoted the etcd_debugging_mvcc_db_total_size_in_bytes Prometheus metric to etcd_mvcc_db_total_size_in_bytes, in order to encourage etcd storage monitoring. v3.5 completely deprecates etcd_debugging_mvcc_db_total_size_in_bytes.
-etcd_debugging_mvcc_db_total_size_in_bytes
+etcd_mvcc_db_total_size_in_bytes
Note that etcd_debugging_* namespace metrics have been marked as experimental. As we improve the monitoring guide, we may promote more metrics.
Deprecated etcd_debugging_mvcc_put_total Prometheus metrics
v3.5 promoted the etcd_debugging_mvcc_put_total Prometheus metric to etcd_mvcc_put_total, in order to encourage etcd storage monitoring. v3.5 completely deprecates etcd_debugging_mvcc_put_total.
-etcd_debugging_mvcc_put_total
+etcd_mvcc_put_total
Note that etcd_debugging_* namespace metrics have been marked as experimental. As we improve the monitoring guide, we may promote more metrics.
Deprecated etcd_debugging_mvcc_delete_total Prometheus metrics
v3.5 promoted the etcd_debugging_mvcc_delete_total Prometheus metric to etcd_mvcc_delete_total, in order to encourage etcd storage monitoring. v3.5 completely deprecates etcd_debugging_mvcc_delete_total.
-etcd_debugging_mvcc_delete_total
+etcd_mvcc_delete_total
Note that etcd_debugging_* namespace metrics have been marked as experimental. As we improve the monitoring guide, we may promote more metrics.
Deprecated etcd_debugging_mvcc_txn_total Prometheus metrics
v3.5 promoted the etcd_debugging_mvcc_txn_total Prometheus metric to etcd_mvcc_txn_total, in order to encourage etcd storage monitoring. v3.5 completely deprecates etcd_debugging_mvcc_txn_total.
-etcd_debugging_mvcc_txn_total
+etcd_mvcc_txn_total
Note that etcd_debugging_* namespace metrics have been marked as experimental. As we improve the monitoring guide, we may promote more metrics.
Deprecated etcd_debugging_mvcc_range_total Prometheus metrics
v3.5 promoted the etcd_debugging_mvcc_range_total Prometheus metric to etcd_mvcc_range_total, in order to encourage etcd storage monitoring. v3.5 completely deprecates etcd_debugging_mvcc_range_total.
-etcd_debugging_mvcc_range_total
+etcd_mvcc_range_total
Note that etcd_debugging_* namespace metrics have been marked as experimental. As we improve the monitoring guide, we may promote more metrics.
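The five renames above are mechanical, so alert rules and dashboards that still reference the old names can usually be updated with a text substitution. The following is a minimal sketch, assuming the rule files are plain text; the function name rename_etcd_metrics and the file names in the usage comment are hypothetical.

```shell
# Rewrite the five deprecated etcd_debugging_mvcc_* metric names listed
# above to their promoted etcd_mvcc_* names. Reads stdin, writes stdout.
# Any other etcd_debugging_* metric is left untouched.
rename_etcd_metrics() {
  sed -E 's/etcd_debugging_mvcc_(db_total_size_in_bytes|put_total|delete_total|txn_total|range_total)/etcd_mvcc_\1/g'
}

# Example (hypothetical file names):
#   rename_etcd_metrics < etcd-alerts.yml > etcd-alerts.v35.yml
```

Writing to a new file rather than editing in place keeps the old rules available for comparison until the upgrade is complete.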
Deprecated etcd --logger capnslog
v3.4 defaults to --logger=zap in order to support multiple log outputs and structured logging.
etcd --logger=capnslog has been deprecated in v3.5, and now --logger=zap is the default.
-etcd --logger=capnslog
+etcd --logger=zap --log-outputs=stderr
+# to write logs to stderr and a.log file at the same time
+etcd --logger=zap --log-outputs=stderr,a.log
v3.4 added etcd --logger=zap support for structured logging and multiple log outputs. The main motivation is to promote automated etcd monitoring, rather than looking back through server logs when things start breaking. Future development will make etcd log as little as possible, and make etcd easier to monitor with metrics and alerts.
Deprecated etcd --log-output
v3.4 renamed etcd --log-output to --log-outputs to support multiple log outputs.
etcd --log-output has been deprecated in v3.5.
-etcd --log-output=stderr
+etcd --log-outputs=stderr
Deprecated etcd --debug flag (now --log-level=debug)
The etcd --debug flag has been deprecated.
-etcd --debug
+etcd --log-level debug
Deprecated etcd --log-package-levels
The etcd --log-package-levels flag for capnslog has been deprecated.
Now, etcd --logger=zap is the default.
-etcd --log-package-levels 'etcdmain=CRITICAL,etcdserver=DEBUG'
+etcd --logger=zap --log-outputs=stderr
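The logging flag changes above can be applied to start scripts or unit files with a small filter. This is a sketch under stated assumptions, not an official migration tool: the function name migrate_etcd_log_flags is hypothetical, it only handles the spellings shown in this guide, and it leaves --log-package-levels alone because that flag has no one-to-one replacement and needs a manual edit.

```shell
# Translate deprecated etcd logging flags into their v3.5 equivalents.
# Reads a command line or unit file on stdin and writes the result to stdout.
#   --logger=capnslog   -> --logger=zap
#   --log-output=...    -> --log-outputs=...
#   --debug             -> --log-level=debug
# --log-package-levels is intentionally not rewritten.
migrate_etcd_log_flags() {
  sed -E \
    -e 's/--logger[= ]capnslog/--logger=zap/g' \
    -e 's/--log-output([= ])/--log-outputs\1/g' \
    -e 's/--debug([[:space:]]|$)/--log-level=debug\1/g'
}

# Example:
#   echo 'etcd --debug --log-output=stderr' | migrate_etcd_log_flags
```

Review the rewritten command before using it, since flags passed in other forms (for example through environment variables) are not covered.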
Server upgrade checklists
Upgrade requirements
To upgrade an existing etcd deployment to 3.5, the running cluster must be 3.4 or greater. If it’s before 3.4, please upgrade to 3.4 before upgrading to 3.5.
Also, to ensure a smooth rolling upgrade, the running cluster must be healthy. Check the health of the cluster by using the etcdctl endpoint health command before proceeding.
Preparation
Before upgrading etcd in production, always test the services relying on etcd in a staging environment.
Before beginning, download the snapshot backup. Should something go wrong with the upgrade, it is possible to use this backup to downgrade back to the existing etcd version. Please note that the snapshot command only backs up the v3 data. For v2 data, see backing up v2 datastore.
Mixed versions
While upgrading, an etcd cluster supports mixed versions of etcd members, and operates with the protocol of the lowest common version. The cluster is only considered upgraded once all of its members are upgraded to version 3.5. Internally, etcd members negotiate with each other to determine the overall cluster version, which controls the reported version and the supported features.
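To illustrate the "lowest common version" rule, the sketch below picks the smallest version from a list of member versions. It only demonstrates the rule; it is not how etcd negotiates the cluster version internally, and the function name lowest_member_version is hypothetical.

```shell
# Print the lowest version from a list of "major.minor.patch" strings
# given one per line on stdin. Each component is compared numerically.
lowest_member_version() {
  sort -t. -k1,1n -k2,2n -k3,3n | head -n 1
}

# Example: two members already upgraded, one still on v3.4.
#   printf '3.5.0\n3.4.0\n3.5.0\n' | lowest_member_version
```

With one member still on 3.4.0 the result is 3.4.0, which matches the guide's statement that the cluster keeps operating with the v3.4 protocol until every member runs v3.5.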
Limitations
Note: If the cluster only has v3 data and no v2 data, it is not subject to this limitation.
If the cluster is serving a v2 data set larger than 50MB, each newly upgraded member may take up to two minutes to catch up with the existing cluster. In other words, it is safest to wait for 2 minutes between upgrading each member. Check the size of a recent snapshot to estimate the total data size.
For a much larger total data size, 100MB or more, this one-time process might take even more time. Administrators of etcd clusters of this magnitude are welcome to contact the etcd team before upgrading, and we’ll be happy to provide advice on the procedure.
Downgrade
If all members have been upgraded to v3.5, the cluster will be upgraded to v3.5, and downgrade from this completed state is not possible. If any single member is still v3.4, however, the cluster and its operations remain “v3.4”, and it is possible from this mixed cluster state to return to using a v3.4 etcd binary on all members.
Please download the snapshot backup to make downgrading the cluster possible even after it has been completely upgraded.
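Whether the cluster is still in the mixed state can be read from the etcdcluster field of the /version endpoint. The following is a minimal sketch, assuming the JSON layout shown in the procedure later in this guide; the function name rollback_possible is hypothetical.

```shell
# Succeed if the /version JSON on stdin reports a cluster version of 3.4,
# meaning at least one member has not been upgraded yet and returning to
# the v3.4 binary is still possible.
rollback_possible() {
  grep -Eq '"etcdcluster":"3\.4(\.[0-9]+)?"'
}

# Example:
#   curl -s http://localhost:2379/version | rollback_possible \
#     && echo "cluster version is still 3.4"
```

Once the check fails on an upgraded cluster, the snapshot backup is the only remaining way back.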
Upgrade procedure
This example shows how to upgrade a 3-member v3.4 etcd cluster running on a local machine.
Step 1: check upgrade requirements
Is the cluster healthy and running v3.4.x?
etcdctl --endpoints=localhost:2379,localhost:22379,localhost:32379 endpoint health
<<COMMENT
localhost:2379 is healthy: successfully committed proposal: took = 2.118638ms
localhost:22379 is healthy: successfully committed proposal: took = 3.631388ms
localhost:32379 is healthy: successfully committed proposal: took = 2.157051ms
COMMENT
curl http://localhost:2379/version
<<COMMENT
{"etcdserver":"3.4.0","etcdcluster":"3.4.0"}
COMMENT
curl http://localhost:22379/version
<<COMMENT
{"etcdserver":"3.4.0","etcdcluster":"3.4.0"}
COMMENT
curl http://localhost:32379/version
<<COMMENT
{"etcdserver":"3.4.0","etcdcluster":"3.4.0"}
COMMENT
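The version check in this step can be scripted so that it fails loudly when a member is not on v3.4.x. This is a sketch, assuming plain-HTTP endpoints as in this example and the /version JSON layout shown above; the function names server_version and all_members_are_v34 are hypothetical.

```shell
# Extract the "etcdserver" field from the JSON returned by /version.
server_version() {
  sed -n 's/.*"etcdserver":"\([^"]*\)".*/\1/p'
}

# Succeed only if every endpoint (host:port) given as an argument
# reports a 3.4.x server version.
all_members_are_v34() {
  for ep in "$@"; do
    v=$(curl -s "http://$ep/version" | server_version)
    case "$v" in
      3.4.*) ;;
      *) echo "$ep reports '$v', expected 3.4.x" >&2; return 1 ;;
    esac
  done
}

# Example:
#   all_members_are_v34 localhost:2379 localhost:22379 localhost:32379 \
#     && echo "ready to upgrade"
```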
Step 2: download snapshot backup from leader
Download the snapshot backup to provide a downgrade path should any problems occur.
The etcd leader is guaranteed to have the latest application data, so fetch the snapshot from the leader:
curl -sL http://localhost:2379/metrics | grep etcd_server_is_leader
<<COMMENT
# HELP etcd_server_is_leader Whether or not this member is a leader. 1 if is, 0 otherwise.
# TYPE etcd_server_is_leader gauge
etcd_server_is_leader 1
COMMENT
curl -sL http://localhost:22379/metrics | grep etcd_server_is_leader
<<COMMENT
etcd_server_is_leader 0
COMMENT
curl -sL http://localhost:32379/metrics | grep etcd_server_is_leader
<<COMMENT
etcd_server_is_leader 0
COMMENT
etcdctl --endpoints=localhost:2379 snapshot save backup.db
<<COMMENT
{"level":"info","ts":1526585787.148433,"caller":"snapshot/v3_snapshot.go:109","msg":"created temporary db file","path":"backup.db.part"}
{"level":"info","ts":1526585787.1485257,"caller":"snapshot/v3_snapshot.go:120","msg":"fetching snapshot","endpoint":"localhost:2379"}
{"level":"info","ts":1526585787.1519694,"caller":"snapshot/v3_snapshot.go:133","msg":"fetched snapshot","endpoint":"localhost:2379","took":0.003502721}
{"level":"info","ts":1526585787.1520295,"caller":"snapshot/v3_snapshot.go:142","msg":"saved","path":"backup.db"}
Snapshot saved at backup.db
COMMENT
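Finding the leader by hand works for three members but is easy to get wrong in larger clusters. The sketch below automates the metrics lookup shown in this step, assuming plain-HTTP endpoints and the etcd_server_is_leader line format shown above; the function names reports_leader and find_leader are hypothetical.

```shell
# Succeed if the Prometheus metrics text on stdin marks this member as leader.
reports_leader() {
  grep -q '^etcd_server_is_leader 1$'
}

# Print the first endpoint (host:port) whose metrics report leadership.
# Returns non-zero if no endpoint claims to be the leader.
find_leader() {
  for ep in "$@"; do
    if curl -sL "http://$ep/metrics" | reports_leader; then
      echo "$ep"
      return 0
    fi
  done
  return 1
}

# Example:
#   leader=$(find_leader localhost:2379 localhost:22379 localhost:32379) &&
#     etcdctl --endpoints="$leader" snapshot save backup.db
```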
Step 3: stop one existing etcd server
When each etcd process is stopped, expected errors will be logged by other cluster members. This is normal since a cluster member connection has been (temporarily) broken:
{"level":"info","ts":1526587281.2001143,"caller":"etcdserver/server.go:2249","msg":"updating cluster version","from":"3.0","to":"3.4"}
{"level":"info","ts":1526587281.2010646,"caller":"membership/cluster.go:473","msg":"updated cluster version","cluster-id":"7dee9ba76d59ed53","local-member-id":"7339c4e5e833c029","from":"3.0","from":"3.4"}
{"level":"info","ts":1526587281.2012327,"caller":"api/capability.go:76","msg":"enabled capabilities for version","cluster-version":"3.4"}
{"level":"info","ts":1526587281.2013083,"caller":"etcdserver/server.go:2272","msg":"cluster version is updated","cluster-version":"3.4"}
^C{"level":"info","ts":1526587299.0717514,"caller":"osutil/interrupt_unix.go:63","msg":"received signal; shutting down","signal":"interrupt"}
{"level":"info","ts":1526587299.0718873,"caller":"embed/etcd.go:285","msg":"closing etcd server","name":"s1","data-dir":"/tmp/etcd/s1","advertise-peer-urls":["http://localhost:2380"],"advertise-client-urls":["http://localhost:2379"]}
{"level":"info","ts":1526587299.0722554,"caller":"etcdserver/server.go:1341","msg":"leadership transfer starting","local-member-id":"7339c4e5e833c029","current-leader-member-id":"7339c4e5e833c029","transferee-member-id":"729934363faa4a24"}
{"level":"info","ts":1526587299.0723994,"caller":"raft/raft.go:1107","msg":"7339c4e5e833c029 [term 3] starts to transfer leadership to 729934363faa4a24"}
{"level":"info","ts":1526587299.0724802,"caller":"raft/raft.go:1113","msg":"7339c4e5e833c029 sends MsgTimeoutNow to 729934363faa4a24 immediately as 729934363faa4a24 already has up-to-date log"}
{"level":"info","ts":1526587299.0737045,"caller":"raft/raft.go:797","msg":"7339c4e5e833c029 [term: 3] received a MsgVote message with higher term from 729934363faa4a24 [term: 4]"}
{"level":"info","ts":1526587299.0737681,"caller":"raft/raft.go:656","msg":"7339c4e5e833c029 became follower at term 4"}
{"level":"info","ts":1526587299.073831,"caller":"raft/raft.go:882","msg":"7339c4e5e833c029 [logterm: 3, index: 9, vote: 0] cast MsgVote for 729934363faa4a24 [logterm: 3, index: 9] at term 4"}
{"level":"info","ts":1526587299.0738947,"caller":"raft/node.go:312","msg":"raft.node: 7339c4e5e833c029 lost leader 7339c4e5e833c029 at term 4"}
{"level":"info","ts":1526587299.0748374,"caller":"raft/node.go:306","msg":"raft.node: 7339c4e5e833c029 elected leader 729934363faa4a24 at term 4"}
{"level":"info","ts":1526587299.1726425,"caller":"etcdserver/server.go:1362","msg":"leadership transfer finished","local-member-id":"7339c4e5e833c029","old-leader-member-id":"7339c4e5e833c029","new-leader-member-id":"729934363faa4a24","took":0.100389359}
{"level":"info","ts":1526587299.1728148,"caller":"rafthttp/peer.go:333","msg":"stopping remote peer","remote-peer-id":"b548c2511513015"}
{"level":"warn","ts":1526587299.1751974,"caller":"rafthttp/stream.go:291","msg":"closed TCP streaming connection with remote peer","stream-writer-type":"stream MsgApp v2","remote-peer-id":"b548c2511513015"}
{"level":"warn","ts":1526587299.1752589,"caller":"rafthttp/stream.go:301","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"stream MsgApp v2","remote-peer-id":"b548c2511513015"}
{"level":"warn","ts":1526587299.177348,"caller":"rafthttp/stream.go:291","msg":"closed TCP streaming connection with remote peer","stream-writer-type":"stream Message","remote-peer-id":"b548c2511513015"}
{"level":"warn","ts":1526587299.1774004,"caller":"rafthttp/stream.go:301","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"stream Message","remote-peer-id":"b548c2511513015"}
{"level":"info","ts":1526587299.177515,"caller":"rafthttp/pipeline.go:86","msg":"stopped HTTP pipelining with remote peer","local-member-id":"7339c4e5e833c029","remote-peer-id":"b548c2511513015"}
{"level":"warn","ts":1526587299.1777067,"caller":"rafthttp/stream.go:436","msg":"lost TCP streaming connection with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"7339c4e5e833c029","remote-peer-id":"b548c2511513015","error":"read tcp 127.0.0.1:34636->127.0.0.1:32380: use of closed network connection"}
{"level":"info","ts":1526587299.1778402,"caller":"rafthttp/stream.go:459","msg":"stopped stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"7339c4e5e833c029","remote-peer-id":"b548c2511513015"}
{"level":"warn","ts":1526587299.1780295,"caller":"rafthttp/stream.go:436","msg":"lost TCP streaming connection with remote peer","stream-reader-type":"stream Message","local-member-id":"7339c4e5e833c029","remote-peer-id":"b548c2511513015","error":"read tcp 127.0.0.1:34634->127.0.0.1:32380: use of closed network connection"}
{"level":"info","ts":1526587299.1780987,"caller":"rafthttp/stream.go:459","msg":"stopped stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"7339c4e5e833c029","remote-peer-id":"b548c2511513015"}
{"level":"info","ts":1526587299.1781602,"caller":"rafthttp/peer.go:340","msg":"stopped remote peer","remote-peer-id":"b548c2511513015"}
{"level":"info","ts":1526587299.1781986,"caller":"rafthttp/peer.go:333","msg":"stopping remote peer","remote-peer-id":"729934363faa4a24"}
{"level":"warn","ts":1526587299.1802843,"caller":"rafthttp/stream.go:291","msg":"closed TCP streaming connection with remote peer","stream-writer-type":"stream MsgApp v2","remote-peer-id":"729934363faa4a24"}
{"level":"warn","ts":1526587299.1803446,"caller":"rafthttp/stream.go:301","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"stream MsgApp v2","remote-peer-id":"729934363faa4a24"}
{"level":"warn","ts":1526587299.1824749,"caller":"rafthttp/stream.go:291","msg":"closed TCP streaming connection with remote peer","stream-writer-type":"stream Message","remote-peer-id":"729934363faa4a24"}
{"level":"warn","ts":1526587299.18255,"caller":"rafthttp/stream.go:301","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"stream Message","remote-peer-id":"729934363faa4a24"}
{"level":"info","ts":1526587299.18261,"caller":"rafthttp/pipeline.go:86","msg":"stopped HTTP pipelining with remote peer","local-member-id":"7339c4e5e833c029","remote-peer-id":"729934363faa4a24"}
{"level":"warn","ts":1526587299.1827736,"caller":"rafthttp/stream.go:436","msg":"lost TCP streaming connection with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"7339c4e5e833c029","remote-peer-id":"729934363faa4a24","error":"read tcp 127.0.0.1:51482->127.0.0.1:22380: use of closed network connection"}
{"level":"info","ts":1526587299.182845,"caller":"rafthttp/stream.go:459","msg":"stopped stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"7339c4e5e833c029","remote-peer-id":"729934363faa4a24"}
{"level":"warn","ts":1526587299.1830168,"caller":"rafthttp/stream.go:436","msg":"lost TCP streaming connection with remote peer","stream-reader-type":"stream Message","local-member-id":"7339c4e5e833c029","remote-peer-id":"729934363faa4a24","error":"context canceled"}
{"level":"warn","ts":1526587299.1831107,"caller":"rafthttp/peer_status.go:65","msg":"peer became inactive","peer-id":"729934363faa4a24","error":"failed to read 729934363faa4a24 on stream Message (context canceled)"}
{"level":"info","ts":1526587299.1831737,"caller":"rafthttp/stream.go:459","msg":"stopped stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"7339c4e5e833c029","remote-peer-id":"729934363faa4a24"}
{"level":"info","ts":1526587299.1832306,"caller":"rafthttp/peer.go:340","msg":"stopped remote peer","remote-peer-id":"729934363faa4a24"}
{"level":"warn","ts":1526587299.1837125,"caller":"rafthttp/http.go:424","msg":"failed to find remote peer in cluster","local-member-id":"7339c4e5e833c029","remote-peer-id-stream-handler":"7339c4e5e833c029","remote-peer-id-from":"b548c2511513015","cluster-id":"7dee9ba76d59ed53"}
{"level":"warn","ts":1526587299.1840093,"caller":"rafthttp/http.go:424","msg":"failed to find remote peer in cluster","local-member-id":"7339c4e5e833c029","remote-peer-id-stream-handler":"7339c4e5e833c029","remote-peer-id-from":"b548c2511513015","cluster-id":"7dee9ba76d59ed53"}
{"level":"warn","ts":1526587299.1842315,"caller":"rafthttp/http.go:424","msg":"failed to find remote peer in cluster","local-member-id":"7339c4e5e833c029","remote-peer-id-stream-handler":"7339c4e5e833c029","remote-peer-id-from":"729934363faa4a24","cluster-id":"7dee9ba76d59ed53"}
{"level":"warn","ts":1526587299.1844475,"caller":"rafthttp/http.go:424","msg":"failed to find remote peer in cluster","local-member-id":"7339c4e5e833c029","remote-peer-id-stream-handler":"7339c4e5e833c029","remote-peer-id-from":"729934363faa4a24","cluster-id":"7dee9ba76d59ed53"}
{"level":"info","ts":1526587299.2056687,"caller":"embed/etcd.go:473","msg":"stopping serving peer traffic","address":"127.0.0.1:2380"}
{"level":"info","ts":1526587299.205819,"caller":"embed/etcd.go:480","msg":"stopped serving peer traffic","address":"127.0.0.1:2380"}
{"level":"info","ts":1526587299.2058413,"caller":"embed/etcd.go:289","msg":"closed etcd server","name":"s1","data-dir":"/tmp/etcd/s1","advertise-peer-urls":["http://localhost:2380"],"advertise-client-urls":["http://localhost:2379"]}
Step 4: restart the etcd server with the same configuration
Restart the etcd server with the same configuration but with the new etcd binary.
-etcd-old --name s1 \
+etcd-new --name s1 \
  --data-dir /tmp/etcd/s1 \
  --listen-client-urls http://localhost:2379 \
  --advertise-client-urls http://localhost:2379 \
  --listen-peer-urls http://localhost:2380 \
  --initial-advertise-peer-urls http://localhost:2380 \
  --initial-cluster s1=http://localhost:2380,s2=http://localhost:22380,s3=http://localhost:32380 \
  --initial-cluster-token tkn \
  --initial-cluster-state new
The new v3.5 etcd will publish its information to the cluster. At this point, the cluster still operates with the v3.4 protocol, which is the lowest common version.
{"level":"info","ts":1526586617.1647713,"caller":"membership/cluster.go:485","msg":"set initial cluster version","cluster-id":"7dee9ba76d59ed53","local-member-id":"7339c4e5e833c029","cluster-version":"3.0"}
{"level":"info","ts":1526586617.1648536,"caller":"api/capability.go:76","msg":"enabled capabilities for version","cluster-version":"3.0"}
{"level":"info","ts":1526586617.1649303,"caller":"membership/cluster.go:473","msg":"updated cluster version","cluster-id":"7dee9ba76d59ed53","local-member-id":"7339c4e5e833c029","from":"3.0","from":"3.4"}
{"level":"info","ts":1526586617.1649797,"caller":"api/capability.go:76","msg":"enabled capabilities for version","cluster-version":"3.4"}
{"level":"info","ts":1526586617.2107732,"caller":"etcdserver/server.go:1770","msg":"published local member to cluster through raft","local-member-id":"7339c4e5e833c029","local-member-attributes":"{Name:s1 ClientURLs:[http://localhost:2379]}","request-path":"/0/members/7339c4e5e833c029/attributes","cluster-id":"7dee9ba76d59ed53","publish-timeout":7}
Verify that each member, and then the entire cluster, becomes healthy with the new v3.5 etcd binary:
etcdctl endpoint health --endpoints=localhost:2379,localhost:22379,localhost:32379
<<COMMENT
localhost:32379 is healthy: successfully committed proposal: took = 2.337471ms
localhost:22379 is healthy: successfully committed proposal: took = 1.130717ms
localhost:2379 is healthy: successfully committed proposal: took = 2.124843ms
COMMENT
Un-upgraded members will log warnings like the following until the entire cluster is upgraded.
This is expected and will cease after all etcd cluster members are upgraded to v3.5:
:41.942121 W | etcdserver: member 7339c4e5e833c029 has a higher version 3.5.0
:45.945154 W | etcdserver: the local etcd version 3.4.0 is not up-to-date
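A restarted member may need a moment before it reports healthy, so a scripted upgrade should wait for it before moving on to the next member. The following is a minimal sketch built on the etcdctl endpoint health command used in this step; the function name wait_until_healthy, the default of 30 attempts, and the 1-second interval are illustrative assumptions.

```shell
# Poll `etcdctl endpoint health` for one endpoint until it succeeds or the
# attempts run out.
#   $1: endpoint (host:port)
#   $2: maximum number of attempts (optional, defaults to 30)
wait_until_healthy() {
  endpoint=$1
  attempts=${2:-30}
  while [ "$attempts" -gt 0 ]; do
    if etcdctl --endpoints="$endpoint" endpoint health >/dev/null 2>&1; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 1
  done
  echo "$endpoint did not become healthy in time" >&2
  return 1
}

# Example:
#   wait_until_healthy localhost:2379 || exit 1
```

Stopping the rollout when a member fails to recover keeps the remaining v3.4 members available.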
Step 5: repeat step 3 and step 4 for the rest of the members
When all members are upgraded, the cluster will report that it has upgraded to 3.5 successfully:
Member 1:
{"level":"info","ts":1526586949.0920913,"caller":"api/capability.go:76","msg":"enabled capabilities for version","cluster-version":"3.5"}
{"level":"info","ts":1526586949.0921566,"caller":"etcdserver/server.go:2272","msg":"cluster version is updated","cluster-version":"3.5"}
Member 2:
{"level":"info","ts":1526586949.092117,"caller":"membership/cluster.go:473","msg":"updated cluster version","cluster-id":"7dee9ba76d59ed53","local-member-id":"729934363faa4a24","from":"3.4","from":"3.5"}
{"level":"info","ts":1526586949.0923078,"caller":"api/capability.go:76","msg":"enabled capabilities for version","cluster-version":"3.5"}
Member 3:
{"level":"info","ts":1526586949.0921423,"caller":"membership/cluster.go:473","msg":"updated cluster version","cluster-id":"7dee9ba76d59ed53","local-member-id":"b548c2511513015","from":"3.4","from":"3.5"}
{"level":"info","ts":1526586949.0922918,"caller":"api/capability.go:76","msg":"enabled capabilities for version","cluster-version":"3.5"}
etcdctl endpoint health --endpoints=localhost:2379,localhost:22379,localhost:32379
<<COMMENT
localhost:2379 is healthy: successfully committed proposal: took = 492.834µs
localhost:22379 is healthy: successfully committed proposal: took = 1.015025ms
localhost:32379 is healthy: successfully committed proposal: took = 1.853077ms
COMMENT
curl http://localhost:2379/version
<<COMMENT
{"etcdserver":"3.5.0","etcdcluster":"3.5.0"}
COMMENT
curl http://localhost:22379/version
<<COMMENT
{"etcdserver":"3.5.0","etcdcluster":"3.5.0"}
COMMENT
curl http://localhost:32379/version
<<COMMENT
{"etcdserver":"3.5.0","etcdcluster":"3.5.0"}
COMMENT
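The final version check can also be scripted. This sketch reads the /version output of every member and succeeds only when both fields report 3.5.x. It assumes the field order shown above (etcdserver before etcdcluster); the function name upgrade_complete is hypothetical.

```shell
# Read one /version JSON document per line on stdin and succeed only if
# every non-empty line reports both etcdserver and etcdcluster as 3.5.x.
upgrade_complete() {
  while IFS= read -r line; do
    case "$line" in
      '') ;;  # ignore blank separator lines
      *'"etcdserver":"3.5.'*'"etcdcluster":"3.5.'*) ;;
      *) echo "not fully upgraded: $line" >&2; return 1 ;;
    esac
  done
}

# Example: `echo` adds a newline after each response so that every
# document ends up on its own line.
#   for ep in localhost:2379 localhost:22379 localhost:32379; do
#     curl -s "http://$ep/version"; echo
#   done | upgrade_complete && echo "cluster is on 3.5"
```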
Last modified August 19, 2023: etcd-io/website#479 Use new and better canonical link to Google Groups (cd8b01f)