MemSQL partition offline or recovery failed

In this article, I will demonstrate a few steps to recover an offline MemSQL partition. Due to high memory usage or other reasons, MemSQL leaf nodes may go offline. If this happens, how can we recover the nodes or partitions?

Error:

If memory usage is high, you will get the below error when you run a query (most queries will fail):

15:32:57,045 - ERROR - Error occurred while moving data from  Error: 1712 (HY000): Leaf Error (prod-leaf-node-002:3306): Memory usage by MemSQL for tables (106804.00 Mb) has reached the value of 'maximum_table_memory' global variable (106804 Mb). See https://docs.memsql.com/troubleshooting/latest/memory-errors for additional information
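Before restarting anything, it is worth confirming how close the leaf actually is to the limit. A minimal check, run against the affected leaf node (the variable name comes from the error message above; exact status counter names can vary slightly between MemSQL versions):

memsql> SHOW VARIABLES LIKE 'maximum_table_memory';
memsql> SHOW STATUS EXTENDED LIKE '%memory%';

If table memory is at or near maximum_table_memory, writes will keep failing until some memory is freed or the limit is raised.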

Check the MemSQL node status.

Log in to the master aggregator and execute the below command. It will show all nodes and their status (node status can also be found in MemSQL Studio).

$ memsql-admin list-nodes;
+------------+------------+--------------------+------+---------------+--------------+---------+----------------+--------------------+--------------+
| MemSQL ID  | Role       | Host               | Port | Process State | Connectable? | Version | Recovery State | Availability Group | Bind Address |
+------------+------------+--------------------+------+---------------+--------------+---------+----------------+--------------------+--------------+
| 191AF2836C | Master     | prod-master-node-1 | 3306 | Running       | True         | 6.8.8   | Online         |                    | 0.0.0.0      |
| BC71158704 | Aggregator | prod-aggre-node-1  | 3306 | Running       | True         | 6.8.8   | Online         |                    | 0.0.0.0      |
| 4A1AA67A16 | Leaf       | prod-leaf-node-001 | 3306 | Running       | True         | 6.8.8   | Online         | 1                  | 0.0.0.0      |
| 1F1DEDDE2B | Leaf       | prod-leaf-node-002 | 3306 | Running       | True         | 6.8.8   | Recovering     | 2                  | 0.0.0.0      |
| 7AA394AD48 | Leaf       | prod-leaf-node-003 | 3306 | Running       | True         | 6.8.8   | Online         | 1                  | 0.0.0.0      |
+------------+------------+--------------------+------+---------------+--------------+---------+----------------+--------------------+--------------+
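The same information is also available from SQL. If you only have a database client at hand, running SHOW LEAVES on the master aggregator lists every leaf with its host, port, availability group, and state:

memsql> SHOW LEAVES;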

In the list-nodes output above you can see that one leaf node's status is "Recovering". In our case we used MemSQL pipelines for ingesting data, so to free up some memory we stopped all the pipelines.

To stop all pipelines, execute the below query.

memsql> stop all pipelines;
Query OK, 0 rows affected (1.21 sec)

memsql> show pipelines;
+------------------------+---------+-----------+
| Pipelines_in_mytest_db | State   | Scheduled |
+------------------------+---------+-----------+
| pl_test_pipe1          | Stopped | False     |
| pl_test_pipe2          | Stopped | False     |
| pl_test_pipe3          | Stopped | False     |
+------------------------+---------+-----------+
3 rows in set (0.00 sec)
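SHOW PIPELINES only covers the current database. If you ingest into several databases, a quick way to confirm that nothing is still running is to query information_schema.PIPELINES (the table and column names below are the ones I have seen in MemSQL 6.x; verify on your version):

memsql> SELECT DATABASE_NAME, PIPELINE_NAME, STATE FROM information_schema.PIPELINES WHERE STATE <> 'Stopped';

An empty result set means all pipelines are stopped.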

It may take up to two hours to recover the node; MemSQL takes care of the recovery automatically.
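Rather than re-running the status command by hand, you can poll it from the master aggregator; a simple option, assuming the standard Linux watch utility is installed:

$ watch -n 60 memsql-admin list-nodes

The leaf should move from Recovering to Online once it finishes replaying its snapshots and logs.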

Recovery failed

If the node status is RecoveryFailed, follow the steps below.

Execute memsql-admin list-nodes to see the status.

$ memsql-admin list-nodes;
+------------+------------+--------------------+------+---------------+--------------+---------+----------------+--------------------+--------------+
| MemSQL ID  | Role       | Host               | Port | Process State | Connectable? | Version | Recovery State | Availability Group | Bind Address |
+------------+------------+--------------------+------+---------------+--------------+---------+----------------+--------------------+--------------+
| 191AF2836C | Master     | prod-master-node-1 | 3306 | Running       | True         | 6.8.8   | Online         |                    | 0.0.0.0      |
| BC71158704 | Aggregator | prod-aggre-node-1  | 3306 | Running       | True         | 6.8.8   | Online         |                    | 0.0.0.0      |
| 4A1AA67A16 | Leaf       | prod-leaf-node-001 | 3306 | Running       | True         | 6.8.8   | Online         | 1                  | 0.0.0.0      |
| 1F1DEDDE2B | Leaf       | prod-leaf-node-002 | 3306 | Running       | True         | 6.8.8   | RecoveryFailed | 2                  | 0.0.0.0      |
| 7AA394AD48 | Leaf       | prod-leaf-node-003 | 3306 | Running       | True         | 6.8.8   | Online         | 1                  | 0.0.0.0      |
| 5GFCD39SI  | Leaf       | prod-leaf-node-004 | 3306 | Running       | True         | 6.8.8   | Online         | 1                  | 0.0.0.0      |
+------------+------------+--------------------+------+---------------+--------------+---------+----------------+--------------------+--------------+

Here you can see that the recovery state of "prod-leaf-node-002" is RecoveryFailed. To find out which partitions have failed, execute the below command.

memsql -h0 -uroot -p -e "SHOW CLUSTER STATUS;" > /tmp/output.txt

It will dump the cluster status into the /tmp/output.txt file. Open this file and search for the keyword "unrecoverable" to find the failed partitions.
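A quick way to pull out only the failed partitions is to grep the dump for that keyword:

grep -i unrecoverable /tmp/output.txt

The matching lines will look similar to the ones below.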

 

23    prod-leaf-node-002    3306    mytest_db_14    slave     unrecoverable    S8365:6006        NULL    NULL    0       vm-prod-leaf-node-003    3306    Slave
23    prod-leaf-node-002    3306    my_test_29      master    online           8230:157880960    NULL    NULL    NULL    NULL                     NULL    Master

Here the partition mytest_db_14 is in an unrecoverable state and it is a slave partition, so we drop it and rebalance the database. Before dropping a partition, make sure it is a slave partition and not a master; a slave partition is only a replica, so its data still exists in the master copy on another leaf.
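One way to double-check the role before dropping anything is SHOW PARTITIONS, which lists every partition of the database with its host, port, and role (the database name below is the one from this example; adjust it for your cluster):

memsql> SHOW PARTITIONS ON mytest_db;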

I recommend doing this in a screen session (the Linux screen command), as the rebalance can take a long time.

memsql> DROP PARTITION mytest_db:14 ON 'prod-leaf-node-002':3306;
Query OK, 1 row affected, 2 warnings (0.84 sec)

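Before running the rebalance, you can optionally preview the operations it would perform. EXPLAIN REBALANCE PARTITIONS prints the planned moves without executing them (verify that your MemSQL version supports it):

memsql> EXPLAIN REBALANCE PARTITIONS ON mytest_db;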

memsql> REBALANCE PARTITIONS ON mytest_db;
Query OK, 1 row affected (1 hour 52 min 23.58 sec)

Change the partition name, hostname, and port number based on your cluster. Note that the partition appears as mytest_db_14 in the cluster status output, but in the DROP PARTITION query it must be written as mytest_db:14.

Once the rebalancing is done, run the cluster status command again and make sure there are no unrecoverable partitions left. Also, start all the pipelines that were stopped earlier, as shown below.
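To verify, repeat the cluster status dump and grep for the keyword again; it should return nothing:

memsql -h0 -uroot -p -e "SHOW CLUSTER STATUS;" | grep -i unrecoverable

Then restart ingestion. The pipeline names below are the example ones from earlier; replace them with your own (your MemSQL version may also support START ALL PIPELINES as the counterpart of STOP ALL PIPELINES):

memsql> START PIPELINE pl_test_pipe1;
memsql> START PIPELINE pl_test_pipe2;
memsql> START PIPELINE pl_test_pipe3;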

These steps will help you recover the partition, though I recommend contacting the MemSQL support team before applying them to a production cluster.
