Memcached problems(?)


#1

Hi all,

I see tons of messages like:
[12/21/2017 05:02:35 PM] …/common/obj-cache.c(106): Failed to set BLOCKEXISTENCE2-e612a24e-02f8-44fa-8700-ea2b4e05e7bf-8e47e5016965f1d85b0fb87a5ad3765b7b7bd3e6 to memcached: SERVER IS MARKED DEAD.
[12/21/2017 05:02:35 PM] …/common/obj-cache.c(106): Failed to set BLOCKEXISTENCE2-e612a24e-02f8-44fa-8700-ea2b4e05e7bf-d35cfdf918728a179159893ea9cbc4b540a5633c to memcached: SERVER IS MARKED DEAD.
[12/21/2017 05:02:38 PM] …/common/obj-cache.c(106): Failed to set 996762a1-36bc-42d8-99fd-9bff3d1ca46b-d524f40b26c055507df74b7bbdf87193d4ed27fd to memcached: SERVER IS MARKED DEAD.
[12/21/2017 05:02:38 PM] …/common/obj-cache.c(106): Failed to set 442a5890-86b8-4a96-8589-d06860390364-ef4e3620cda97a241d1aa71372d1969a17136071 to memcached: SERVER IS MARKED DEAD.

However, all my memcached servers are up and running, so I do not really understand the meaning of these messages, nor do I have any clue what I should do to get rid of them.

Any hint appreciated.
Hp

I am using:
Seafile Pro 6.1.8
memcached 1.4.21
on Debian Jessie


#2

Did you try turning it off and on again? :scream: Clear the cache, check the firewall.


#3

You mean turning memcached off and on? Yes. Furthermore, I can talk to all my memcached daemons; they are up and running.
(I have a Seafile cluster.)

Clearing the cache sounds like a very good idea. Which cache do you mean? Does memcached have a cache that can be cleared? (Sounds a bit funny, a cache of the cache.) Any hints welcome; I am not at all an expert with memcached.

There is no firewall issue; as mentioned, I can talk from all cluster machines to all memcached instances.

many thanks!


#4

Restarting Seafile should solve the problem. The memcached client library sometimes has inconsistent state about the memcached servers.


#5

I have had these errors for a long time. Rebooting the whole cluster makes absolutely no difference.

Could this come from the fact that this is a Seafile cluster? When I reboot the cluster, I do it one machine after the other. So, except for the machine that is rebooted last, all the others have seen outages of the other machines.

I always see these errors on the machine that is the primary server (handling the requests from the clients and the web interface).

If you think this is a configuration issue, let me know which parts of the configuration you would like to see.


#6

What do you use as a load balancer? Nginx or HAProxy?


#7

I have no load balancing. It is a hot-standby setup.


#8

Ehm, OK. Put a load balancer in front of it and try again. Should be simple.


#9

Ehm, no. How would this help with my memcached errors?


#10

Right, this would not help. Can you try using just one memcached server?


#11

I could, but what would that achieve?

My Seafile installation works perfectly, so I am not sure whether these memcached error messages are real errors or just hiccups of memcached. In other words, these error messages have no negative impact on the user experience. Sometimes we see an "internal server error" when users try to access encrypted libraries, but I am not sure whether this is related to memcached.

Would clearing the memcached cache help? Or is this a dangerous operation?


#12

That’s related to memcached, and your system should be slower than normal. To me it looks like one or more memcached servers just don’t work.

I have already seen such errors when more than one memcached server is used and just one is dead. So it would be better if you could test with just one. And that’s one of the reasons why Redis will be supported in future Seafile versions.


#13

You can try to access your memcached nodes from the frontend nodes with telnet on port 11211.
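The telnet check can also be scripted. A minimal sketch in Python, sending the plain-text `stats` command of the memcached protocol to each node (the IPs and port are the ones from the config posted later in this thread; adjust as needed):

```python
import socket

def fetch_stats(host, port=11211, timeout=3.0):
    """Send the plain-text 'stats' command and return the raw response."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stats\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
            if b"END\r\n" in data:  # memcached terminates the stats dump with END
                break
        return b"".join(chunks).decode("ascii")

def parse_stats(raw):
    """Turn 'STAT <name> <value>' lines into a dict."""
    stats = {}
    for line in raw.splitlines():
        parts = line.split(" ", 2)
        if parts[0] == "STAT" and len(parts) == 3:
            stats[parts[1]] = parts[2]
    return stats

if __name__ == "__main__":
    for host in ("10.65.16.115", "10.65.16.116", "10.65.16.117"):
        try:
            stats = parse_stats(fetch_stats(host))
            print(f"{host}: uptime {stats.get('uptime', '?')}s, "
                  f"version {stats.get('version', '?')}")
        except OSError as exc:
            print(f"{host}: NOT reachable ({exc})")
```

Note that this only proves the daemons answer on the wire; it says nothing about whether the client library inside Seafile has marked a server dead.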


#14

Thanks for pointing this out.

That is nice to hear.
On the primary server, which sees a lot of these memcached errors, I can telnet to all three memcached servers without any problems. I also took the opportunity to clear all memcached caches (using flush_all in the telnet session). Please believe me, my memcached servers are running.


#15

Unfortunately, the flush_all did not help.

It may be worth mentioning that I am using the Ceph backend.


#16

Can you post your config files?


#17

These are my settings related to memcached.

seahub_settings.py:

CACHES = {
    'default': {
        'BACKEND': 'django_pylibmc.memcached.PyLibMCCache',
        'LOCATION': ['10.65.16.116:11211', '10.65.16.117:11211', '10.65.16.115:11211'],
        'OPTIONS': {
            'ketama': True,
            'remove_failed': 1,
            'retry_timeout': 3600,
            'dead_timeout': 3600,
        },
    },
}
AVATAR_FILE_STORAGE = 'seahub.base.database_storage.DatabaseStorage'

COMPRESS_CACHE_BACKEND = 'django.core.cache.backends.locmem.LocMemCache'

seafile.conf:

[cluster]
enabled = true
memcached_options = --SERVER=10.65.16.116 --SERVER=10.65.16.117 --SERVER=10.65.16.115 --POOL-MIN=10 --POOL-MAX=100 --RETRY-TIMEOUT=3600

[block_backend]
name = ceph 
ceph_config = /etc/ceph/ceph.conf 
pool = seafile-blocks
memcached_options = --SERVER=10.65.16.116 --SERVER=10.65.16.117 --SERVER=10.65.16.115 --POOL-MIN=10 --POOL-MAX=100 --RETRY-TIMEOUT=3600

[commit_object_backend]
name = ceph
ceph_config = /etc/ceph/ceph.conf
pool = seafile-commits
memcached_options = --SERVER=10.65.16.116 --SERVER=10.65.16.117 --SERVER=10.65.16.115 --POOL-MIN=10 --POOL-MAX=100 --RETRY-TIMEOUT=3600

[fs_object_backend]
name = ceph
ceph_config = /etc/ceph/ceph.conf
pool = seafile-fs
memcached_options = --SERVER=10.65.16.116 --SERVER=10.65.16.117 --SERVER=10.65.16.115 --POOL-MIN=10 --POOL-MAX=100 --RETRY-TIMEOUT=3600

#18

Looks good so far. Again, I think it would be worth trying it with just one server.
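For reference, the single-server test suggested here would mean pointing both config files at one node. A sketch of the seahub side, assuming you keep 10.65.16.116 (any of the three would do):

```python
# seahub_settings.py — single memcached node for testing.
# With one server there is no key distribution, so the ketama and
# remove_failed/retry_timeout/dead_timeout failover options can be dropped.
CACHES = {
    'default': {
        'BACKEND': 'django_pylibmc.memcached.PyLibMCCache',
        'LOCATION': '10.65.16.116:11211',
    }
}
```

Correspondingly, each `memcached_options` line in seafile.conf would shrink to a single server, e.g. `memcached_options = --SERVER=10.65.16.116 --POOL-MIN=10 --POOL-MAX=100`. This takes the multi-server ejection logic, which is the code path that marks servers dead, out of the picture entirely.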


#19

We are talking about a production environment here. I am willing to do experiments to some extent, but I would first like to understand what we would gain by doing them. Why should it work better with one server? And how would this help us solve my current error messages?
Please understand that I cannot undertake such experiments lightly.
And be assured, I am very grateful for your help.


#20

As i already told you, there may be problems if you use more than one memcached servers. So i would like to see what happens if you just use one.

And if just one works, this is better than the current state:


You wrote that you’re just using hot standby setup, so if one webserver is down you still have to do work manually to switch over to the other system, that’s not HA, so why do you want to have memcached HA? - Certainly not if it does not work properly.