it seems that during a period of several days, many uploads and files synchronized by the client were stored incorrectly. I.e. those files appear in the folder/library view in the web frontend, but they cannot be downloaded. similarly, for libraries containing such files, the sync client throws a “server error”.
reverting the library to an earlier (consistent) state and re-uploading the files solves the problem.
I have two questions however:
what could be the cause of such incorrectly stored files?
is there an easier way to remedy the situation?
it is still an open question whether seaf-gc and seaf-fsck work on such “restored” libraries. for the libraries with incorrectly stored files, these tools abort.
What’s the version of the client causing this issue? Are there any error messages in the server’s seafile.log when the client uploads the corrupted files?
If the files are corrupted, fsck should be able to reset them to empty files without affecting other files. I remember there are a few bug fixes related to missing blocks for fsck. Since you’re on version 7.1.7, they may not be fixed in that version. You can upgrade to the latest 7.1 version or to 8.0 (which is stable too).
hm, I guess it did not matter which client it was. It happened both with the sync client and with uploads done over the web interface.
I will check the logs.
yes, I should definitely update to 8.0. If I understand correctly, then even a newer fsck could not really repair the files (because the data is not available), but it will repair the library in such a way that the clients are able to sync again.
one last urgent question @Jonathan: is there a fast way to find all libraries with missing blocks? I mean faster than using fsck.
I see the following things in the logs that I do not understand:
various messages of the form:
Dec 28 09:18:05 filipe 2021-12-28 09:18:05,448 [WARNING] django.request:152 get_response Not Found: /api2/repos/0be1dd52-530f-48d9-b259-2151d3efbbed/
or similar:
Dec 28 09:50:41 filipe 2021-12-28 09:50:41,053 [WARNING] django.request:152 get_response Not Found: /f/ca2e045464be4f7e9091/
several messages like:
Dec 28 15:23:03 filipe seaf-server[9522]: zip-download-mgr.c(808): Zip progress info not found for token 8aa75b17-76ae-450b-963c-bd33918c53f1: invalid token or related zip task failed.
Dec 28 15:23:03 filipe 2021-12-28 15:23:03,446 [ERROR] seahub.api2.endpoints.query_zip_progress:34 get Zip progress info not found.
this is the first missing block error I see:
Dec 30 07:46:31 filipe seaf-server[9522]: …/common/block-backend-ceph.c(559): [Block bend] Failed to stat block e430a8b6: No such file or directory.
without any other error message from seafile preceding it.
sometimes I see:
Dec 30 10:59:23 filipe seaf-server[9522]: pack-dir.c(165): Failed to stat block c88b2351-7dae-463f-9eea-2244fe833083:63799a37ed73b99393237fdbea5ecd467d21d7d3
Dec 30 10:59:23 filipe seaf-server[9522]: pack-dir.c(478): Failed to archive dir AFD69E97-0D25-4A8F-A8D8-9E68C3E8D688.snapshots in repo c88b2351.
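regarding my earlier question about finding all affected libraries faster than fsck: since these messages carry the repo and block ids, one option might be to scan seafile.log for them directly. a rough sketch only; the regex is derived from the two message formats shown above and may need adjusting for other variants:

```python
import re

# Matches both variants seen in seafile.log above:
#   "Failed to stat block <repo-uuid>:<block-sha1>"   (pack-dir.c)
#   "Failed to stat block <block-id>: No such file"   (block-backend-ceph.c)
PAT = re.compile(
    r"Failed to stat block "
    r"(?:(?P<repo>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}):)?"
    r"(?P<block>[0-9a-f]{8,40})"
)

def repos_with_missing_blocks(log_lines):
    """Collect repo ids (and block ids that appear without a repo id)
    from an iterable of seafile.log lines."""
    repos, orphan_blocks = set(), set()
    for line in log_lines:
        m = PAT.search(line)
        if m is None:
            continue
        if m.group("repo"):
            repos.add(m.group("repo"))
        else:
            orphan_blocks.add(m.group("block"))
    return repos, orphan_blocks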
in ceph I see no error messages, everything seems to be fine.
how could I check if the blocks exist in the (ceph) cluster?
we are using nautilus with pretty much the default settings. as far as I know this means that the PGs are scrubbed. these are the settings related to scrub:
I don’t know the exact structure anymore, but there should be an object named by its checksum. Maybe the library was used as a namespace, or there was one namespace for blocks, one for commits and one for fs, with objects stored under their checksum, or under the checksum below the library (UUID). It could also be that the first two characters of the checksum form a level above (ab/cdef… instead of abcdef…).
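Under those layout guesses, one could enumerate the candidate object names for one of the failing blocks and probe each with `rados -p <pool> stat <name>` (the pool name depends on your seafile.conf). A sketch of the name variants only; these layouts are assumptions, not the confirmed storage format:

```python
def candidate_object_names(repo_id, block_id):
    """Candidate RADOS object names for one block, covering the
    layout guesses above: flat checksum, two-character fan-out
    (ab/cdef...), and the same two variants namespaced by repo id.

    Probe each candidate with e.g.:
        rados -p <blocks-pool> stat <name>
    """
    prefix, rest = block_id[:2], block_id[2:]
    return [
        block_id,                       # abcdef...
        f"{prefix}/{rest}",             # ab/cdef...
        f"{repo_id}/{block_id}",        # <repo-uuid>/abcdef...
        f"{repo_id}/{prefix}/{rest}",   # <repo-uuid>/ab/cdef...
    ]
```

If one of the variants stats successfully for a known-good block, the same variant failing for a reported block would confirm the data really is missing in ceph.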
Can a request timeout with ceph? Is there a log for timeouts?
Is the garbage collector working on the production database?
hi @shoeper, I am quite confident that the blocks are actually missing in ceph. why?
most of the blocks are found (so seafile can access ceph without problems)
it worked a long time in this configuration
there are no timeouts and no instability in the network
load on ceph is low
it is consistent: a block that is not found is never found (it is not the case that it sometimes works and sometimes does not)
I believe that there is an inconsistency between what seafile expects should be on the (block) storage and what actually is on the storage.
furthermore it seems to be the case that for some period of time (roughly 29.12. up to 2.1.) all additions (and probably also modifications) of files led to the missing blocks error. in other words, during that time it seems that seafile was adding files to its metadata (mysql?) but did not actually write the data to the storage.
I cannot be sure of that, it is very difficult/time consuming to confirm that from the logs.
anyway, after restarting all seafile nodes (I am using a 3-node seafile cluster), all the new additions/modifications work fine. I have no explanation for this, and from the seafile logs there are no indications of such a problem.
it looks like seafile was “thinking” that everything was fine.
We had one drive in the ceph cluster with smartd warnings. just to be on the safe side, I took the whole node out of the ceph cluster, but this had no effect. I am not sure how ceph would deal with faulty data from a problematic drive, but since we have a redundancy of 3, I would expect that we would have at least two valid copies of each seafile block.
If it happens at certain period and for all file uploads, it’s likely some configuration or environmental change. It’s unlikely that Seafile reports upload success without actually saving data for all files.
what happens exactly when fsck resets the files with missing blocks? will the sync client then attempt to sync the local (non-empty) file to the server, or will it sync the empty file from the server and overwrite the local (non-empty) version of the file?
you may be right. could you indicate what changes in the environment/configuration could have such an effect? I am not aware of any such changes, but as the developer of seafile you have a better understanding than me of the conditions under which this could happen.
hm, it seems I found the problem. as mentioned, we are using a seafile cluster as described in the docs. this worked very well for many years. I just realized that the mysql galera cluster running on the seafile servers was broken: two nodes (the background node and one of the frontend nodes) thought they were not in a galera cluster, and the second frontend node was in a galera cluster of size one. in other words, there was no mysql replication working between the seafile servers.
this is a very bad situation and so far I have no clue how this could have happened.
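to catch this earlier next time, the standard Galera status variables (wsrep_cluster_status, wsrep_cluster_size, wsrep_local_state_comment) can be polled on each node. a sketch of the health logic only; fetching the variables (SHOW GLOBAL STATUS LIKE 'wsrep_%') via a MySQL client is omitted, and the expected cluster size of 3 is specific to our setup:

```python
def galera_node_healthy(status, expected_cluster_size=3):
    """Check a node's Galera status for obvious problems.

    `status` is a dict of wsrep_* status variables as reported by
    SHOW GLOBAL STATUS LIKE 'wsrep_%' (values are strings, as the
    MySQL client returns them). Returns a list of problem strings;
    an empty list means the node looks healthy.
    """
    problems = []
    if status.get("wsrep_cluster_status") != "Primary":
        problems.append("node is not in the primary component")
    size = int(status.get("wsrep_cluster_size", 0))
    if size != expected_cluster_size:
        problems.append(
            f"cluster size is {size}, expected {expected_cluster_size}"
        )
    if status.get("wsrep_local_state_comment") != "Synced":
        problems.append("node is not synced")
    return problems
```

note that a node which (re)bootstrapped into its own single-node cluster, like our second frontend node, can still report Primary and Synced, so comparing the cluster size against the expected node count is the decisive check here.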
people/clients access seafile over only one of the frontend nodes, except when this primary node is rebooted, but of course we have background jobs (gc, spam, elasticsearch, …) running on the background node.
@Jonathan do you think that could explain the missing blocks we see?