Real-time backup hangs for large library

We have a real-time backup setup where a Seafile Server pod on Kubernetes is backed up to another pod in a different cluster. The backup has had no issues with most of the libraries, but one of the bigger ones (2.7TB) was producing errors. The errors were the following:

[03/12/2021 08:03:02 AM] http-tx-mgr.c(1984): Repo 58946ebf sync status: [Sync init] success, transition to [Get head commit id].
[03/12/2021 08:03:02 AM] http-tx-mgr.c(1984): Repo 58946ebf sync status: [Get head commit id] success, transition to [Get diff commit ids].
[03/12/2021 08:03:04 AM] http-tx-mgr.c(1971): Failed to sync repo 58946ebf, error in [Sync error].
[03/12/2021 08:03:32 AM] http-tx-mgr.c(1984): Repo 58946ebf sync status: [Sync init] success, transition to [Get head commit id].
[03/12/2021 08:03:32 AM] http-tx-mgr.c(1984): Repo 58946ebf sync status: [Get head commit id] success, transition to [Get diff commit ids].
[03/12/2021 08:03:34 AM] http-tx-mgr.c(1971): Failed to sync repo 58946ebf, error in [Sync error].

For other libraries, we had fixed that issue by running seaf-fsck.sh on the library on the primary server and, if it reported no errors, forcing a sync. So after running seaf-fsck.sh, I forced a sync on this library and got the following in the logs:

cat logs/seafile.log | grep 58946ebf:

[03/12/2021 09:46:54 AM] http-tx-mgr.c(1984): Repo 58946ebf sync status: [Sync init] success, transition to [Get head commit id].
[03/12/2021 09:46:54 AM] http-tx-mgr.c(1984): Repo 58946ebf sync status: [Get head commit id] success, transition to [Get diff commit ids].
[03/12/2021 09:47:16 AM] http-tx-mgr.c(1984): Repo 58946ebf sync status: [Get diff commit ids] success, transition to [Get diff commits].
[03/12/2021 01:59:10 PM] http-tx-mgr.c(1984): Repo 58946ebf sync status: [Get diff commits] success, transition to [Get fs].

cat logs/slow_logs/seafile_slow_storage.log | grep 58946ebf:

2021/03/12:09:47:36 - commits - write - 58946ebf-d080-46de-971f-52e9401ba79c - e9cf614225a9187b38bd143a4c61651b8f2ce0a6 - 3.167
2021/03/12:09:47:40 - commits - write - 58946ebf-d080-46de-971f-52e9401ba79c - 896bdf05b837c5462bdf386a0e36aa5c9bcc1c98 - 5.000
2021/03/12:09:47:45 - commits - write - 58946ebf-d080-46de-971f-52e9401ba79c - 9e29fe2a04f8b6ed67d7696a655c4920bbe68549 - 5.000
[…]
2021/03/12:13:58:55 - commits - write - 58946ebf-d080-46de-971f-52e9401ba79c - 3db652c7cf9f9c5222584d129236b1b9645bf0a9 - 5.000
2021/03/12:13:59:00 - commits - write - 58946ebf-d080-46de-971f-52e9401ba79c - 69f652d1301a198bb03fe3e8340c08403956e903 - 4.927
2021/03/12:13:59:05 - commits - write - 58946ebf-d080-46de-971f-52e9401ba79c - 23952b36ce9d12a17e4cdf28cdd96179cbfeb1a5 - 5.000

It is not showing any errors, but there has been no mention of that library in the logs since then, and yet the backup is still shown as ongoing (it has now been running for 10 days):
./seafile-server-latest/seaf-backup-cmd.sh status

Total number of libraries: 175
Number of synchronized libraries: 173
Number of libraries waiting for sync: 0
Number of libraries syncing: 1
Number of libraries failed to sync: 1
List of syncing libraries:
58946ebf-d080-46de-971f-52e9401ba79c
List of libraries failed to sync:
b531cce8-5171-4499-a360-b3ebea49e10c

Also, df -h shows no change in used disk space (still at 1.2TB out of 4.9TB, which accounts for all the other libraries).
Has anyone seen this kind of strange behavior? Does anyone know what could be blocking this process, or where I could find other logs that would help identify the root of this issue?

Also, to help me troubleshoot this issue, I need a deeper understanding of how the backup server syncs with the primary server. I understand that it uses the same syncing algorithm and protocol as the sync client, based on the commit history of the libraries, but I’m uncertain about the specifics. Is it using scp? rsync? Some proprietary secret sauce? Digging into the libraries in the Seafile server and the seaf-backup-cmd scripts, I see mentions of RPC, but if I understand correctly, that is just a type of API, like REST. I haven’t been able to dig much deeper than that.

In the “Get fs” stage, the primary server has to traverse the entire library history to collect all fs object IDs to send to the backup server. When the library has many files and a long history, this can take a very long time. A solution could be to shorten the history of that library and then run GC on it. That would also reduce the number of blocks that need to be backed up.
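
In pseudo-code, the traversal looks roughly like this. It is only a simplified sketch to show where the time goes, not the actual server code; the Commit and FsObject types and the in-memory stores are illustrative stand-ins for Seafile’s real storage layer.

from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

@dataclass
class FsObject:
    obj_id: str
    is_dir: bool
    children: List[str] = field(default_factory=list)  # child fs object IDs (directories only)

@dataclass
class Commit:
    commit_id: str
    root_id: str                # fs object ID of the snapshot's root directory
    parent_id: Optional[str]    # previous commit in the library history

def collect_fs_ids(head: Commit,
                   commits: Dict[str, Commit],
                   fs_store: Dict[str, FsObject]) -> Set[str]:
    """Collect every fs object ID reachable from every commit in the history."""
    needed: Set[str] = set()
    commit: Optional[Commit] = head
    while commit is not None:                 # one iteration per commit kept in history
        stack = [commit.root_id]
        while stack:                          # depth-first walk of that snapshot's tree
            obj_id = stack.pop()
            if obj_id in needed:              # already collected via an earlier commit
                continue
            needed.add(obj_id)
            obj = fs_store[obj_id]            # stand-in for reading the fs object from storage
            if obj.is_dir:
                stack.extend(obj.children)
        commit = commits.get(commit.parent_id) if commit.parent_id else None
    return needed

# The outer loop runs once per commit in the retained history, so a long
# history multiplies the work; truncating history and running GC shrinks
# both this traversal and the set of blocks that has to be backed up.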

I’ll try that out, but before I do, is it really expected for this to take more than 11 days when we have a 30-day history for this library? Should I expect it to take much longer? Do you have any metrics to give us an idea of how long it should take depending on the size of the library and the history length setting?

Just to confirm: the library is 2.7TB, contains 58583 files, and has a 30-day history setting.

What about my second question, concerning how syncs happen and how the files are transferred over?

@eric.dupuis-cengn How long does the sync stay in the [Get fs] state in the log? Perhaps the request timed out and is being retried many times, so it takes very long but never succeeds.

Roughly, the flow is as follows (a simplified sketch follows the list):

  1. The commit object list is retrieved for the library. The primary runs a diff algorithm to calculate which commit objects need to be transferred.
  2. All the file trees under those commits are traversed and a complete fs object list is sent.
  3. Blocks are calculated from the fs object list and sent over.
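
Put together, the three phases look roughly like this in pseudo-code. Again, this is only an illustrative sketch; the primary/backup objects and their methods are hypothetical stand-ins, not Seafile’s actual RPC or HTTP API. The traversal in step 2 is the same walk sketched in my earlier reply.

from typing import Set

def backup_sync(primary, backup, repo_id: str) -> None:
    # 1. Commit objects: diff the primary's commit list against what the
    #    backup already has and transfer the missing commits.
    head = primary.get_head_commit(repo_id)
    missing_commits = primary.diff_commits(repo_id,
                                           backup.get_head_commit(repo_id),
                                           head)
    backup.put_commits(repo_id, missing_commits)

    # 2. Fs objects: traverse the file trees under those commits and send
    #    the complete list of fs (file/directory metadata) objects.
    fs_ids: Set[str] = set()
    for commit in missing_commits:
        fs_ids |= primary.traverse_fs_tree(repo_id, commit.root_id)
    backup.put_fs_objects(repo_id, [primary.read_fs_object(repo_id, i) for i in fs_ids])

    # 3. Blocks: file-type fs objects reference content blocks by ID;
    #    compute that list and transfer the blocks the backup is missing.
    block_ids = {b for i in fs_ids for b in primary.blocks_of(repo_id, i)}
    for block_id in backup.filter_missing_blocks(repo_id, block_ids):
        backup.put_block(repo_id, block_id, primary.read_block(repo_id, block_id))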

What protocol does it use to transfer the blocks over?

The message you quoted is the last message I see concerning repo 58946ebf sync status in the backup server’s seafile.log, and it has now been that way for 12 days and counting. The last entry in seafile.log for that repo on the primary server was 13 days ago, and it is unrelated.

What are my next steps to help you pinpoint this issue?

  • I can see the snapshots of the files in the History of the library, but not how many there are or how much space they are taking. Is there a way to determine that? Could that be contributing to the long sync we’ve been seeing, or to it timing out?
  • Are there any other logs that would provide you with more information? Would any of them give us a better indication of whether or not this is a timeout issue, as you mentioned? Would those be on the primary server or the backup server?
  • We could potentially set up a call with remote control if you need to take a hands-on look at the server.

Did you get this fixed? I don’t really want to clean the history or anything else; I just want to transfer my server over to another one.