Large database file

Hi,
I’m currently using the latest version of Seafile on a Raspberry Pi (6.2.2). It was upgraded a couple of times, so I don’t know when the problem started, but I presume sometime between 5.x and 6.2.2.

The issue I have is with the database size. While the actual data I have in Seafile is around 6 GB (checked from the web page), the database is using a whopping 46 GB (I have it set to 2 x 20 GB because I have two users).
It’s not so much an issue with the size on disk, but when I make a backup it means the server is down for quite a while (100 Mbit network), and each backup is 40 GB in size.


The script I use to back up the data goes like this (I’ve been using it forever and it has always worked fine). I have also tried manual garbage collection with the -r parameter, with no difference whatsoever.


#!/bin/bash
BACKUPDIR=/home/RaspberryPiBackup
SEAFILEDIR=/home/seafile
SEAFILEDATA=/mnt/USB64G/seafile-data

#remove old backups
find $BACKUPDIR/SeaFile* -mtime +15 -exec rm {} \;

#stop seafile
$SEAFILEDIR/seafile-server-latest/seafile.sh stop
sleep 2
$SEAFILEDIR/seafile-server-latest/seahub.sh stop
sleep 2
$SEAFILEDIR/seafile-server-latest/seaf-gc.sh
sleep 2

#backup seafiledir
tar -zcvf $BACKUPDIR/SeaFileDir_$(date +"%Y-%m-%d-%H-%M-%S").tar.gz $SEAFILEDIR
tar -zcvf $BACKUPDIR/SeaFileData_$(date +"%Y-%m-%d-%H-%M-%S").tar.gz $SEAFILEDATA

#start seafile
$SEAFILEDIR/seafile-server-latest/seafile.sh start
sleep 15
$SEAFILEDIR/seafile-server-latest/seahub.sh start
sleep 2


Does anyone have the same problem or know how to “fix” this? Thanks in advance for any help.

First of all: the seafile-data folder is not the database. It contains the file content blocks.

Seafile keeps removed data. Every library has settings where you can see the trashed/deleted files and set the history time.

I recommend using rsync for backups. You don’t have to remove old backups and copy the whole seafile-data directory; rsync copies only the changed files.
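For example, something along these lines (the backup host and target path are just placeholders):

rsync -avH --delete /mnt/USB64G/seafile-data/ backuphost:/backups/seafile-data/

-a keeps permissions and timestamps, -H preserves hard links, and --delete removes blocks on the backup side that GC has already removed on the server.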

https://manual.seafile.com/maintain/seafile_gc.html

If you set a history length limit on some libraries, the outdated blocks in those libraries will also be removed.
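If I remember the linked manual right, there is also a dry-run mode, so you can see what GC would reclaim before actually deleting anything (run with Seafile stopped, paths as in your script):

#show how much garbage could be removed, without deleting anything
/home/seafile/seafile-server-latest/seaf-gc.sh --dry-run
#actually remove the garbage blocks
/home/seafile/seafile-server-latest/seaf-gc.sh
#also remove the blocks of deleted libraries
/home/seafile/seafile-server-latest/seaf-gc.sh -r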

Agreed. As I’m not 100% sure the GC part was understood correctly, here it is in other words:

To free any disk space, one needs to define a data retention period for each library in Seahub (once per library, unless one is satisfied with the defaults) and run seaf-gc on a regular basis.
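For the “regular basis” part, a minimal sketch of a cron-driven GC run (paths taken from the script above; as far as I know the community edition needs the server stopped while GC runs):

#!/bin/bash
#run-gc.sh (hypothetical name): stop Seafile, collect garbage, start it again
SEAFILEDIR=/home/seafile
$SEAFILEDIR/seafile-server-latest/seahub.sh stop
$SEAFILEDIR/seafile-server-latest/seafile.sh stop
$SEAFILEDIR/seafile-server-latest/seaf-gc.sh
$SEAFILEDIR/seafile-server-latest/seafile.sh start
$SEAFILEDIR/seafile-server-latest/seahub.sh start

A crontab entry such as 0 3 * * 0 /home/seafile/run-gc.sh >> /home/seafile/gc-cron.log 2>&1 would run it every Sunday at 03:00.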

Ohhh, thank you for the explanation, now it makes sense. You can mark this topic as resolved. Now I just have to figure out how to use rsync :slight_smile:

Glad to help. You have to mark a reply as the solution.

So do you think I should stop seafile and seahub while doing the rsync to the remote, in case a library is being changed at the moment of the rsync, or is it fine and rsync will just sync the changes the next day?

That’s really hard to say.

Yes - you will be sure about data consistency (data vs. database).
No - consistency may break, but nothing that seaf-fsck cannot deal with.

Seafile chunks files into small blocks and saves them on disk (in the seafile-data folder) under a hash signature. This hash is saved to the database (MySQL, SQLite, …).
When you do the DB backup and then the data backup while someone is modifying files, the library will be inconsistent with the database. That’s why there is the seaf-fsck script, which checks the latest hash of the data against the hash in the database - and the file contents too, but that’s a longer conversation. When you use the --repair flag on seaf-fsck, Seafile will restore the library history to the hash saved in the database. That’s why the Seafile manual says “Backup database first”.
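In practice that order is just the following (a sketch only - the database names are assumptions for a MySQL setup; with the default SQLite setup you would copy the .db files instead):

#1. database first
mysqldump -u seafile -p --databases ccnet_db seafile_db seahub_db > /backup/seafile-dbs.sql
#2. then the block store
rsync -avH --delete /mnt/USB64G/seafile-data /backup/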

Conclusion: when you restore all the backups and run seaf-fsck, a file modified while the backup was running will be lost - but it will be backed up in the next backup period.

The choice is yours; both ways work.

Suggestion: I’ve been using live backups for years now, every day at 3 AM … I’ve never had a problem with restoring, but there’s no one except me who works at that time. And downtime is the worst solution for me, and bad for automatic backups, because:

  • you have to stop seafile/seahub from cron - then wait and check whether it’s really down (maybe check the PID files?)
  • then start the backup - and wait until it ends (how do you check that it has finished - a regexp on a log file?)
  • then start seafile/seahub again - it would be good to have some health check that everything works, and again, how? (a rough sketch of these checks follows below)

For example, we have a data store of more than 6 TB - what if the backup takes several hours and your employees are already coming in to work? What if Seafile just doesn’t start for some reason?
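To illustrate what those checks could look like, here is a rough sketch (the process name, port and timings are assumptions, not anything from the Seafile docs):

#!/bin/bash
#stop -> check -> backup -> start -> check
SEAFILEDIR=/home/seafile

$SEAFILEDIR/seafile-server-latest/seahub.sh stop
$SEAFILEDIR/seafile-server-latest/seafile.sh stop

#wait until the seaf-server process is really gone instead of a fixed sleep
for i in $(seq 1 30); do
    pgrep -f seaf-server > /dev/null || break
    sleep 2
done

#... run the actual backup here (tar or rsync) ...

$SEAFILEDIR/seafile-server-latest/seafile.sh start
$SEAFILEDIR/seafile-server-latest/seahub.sh start

#crude health check: seahub should answer on its port (8000 by default) again
sleep 15
curl -fsS -o /dev/null http://localhost:8000/ \
    || echo "$(date): seahub did not come back up" >> /home/seafile/backup-alert.log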

Thanks, now it’s clear. I wrote the reply below before I read the whole thing and understood it.

I’m not worried about a few files missing as long as they get updated on the next rsync; after many of those, if the last few files are missing at the end, that’s OK. I kind of thought rsync wouldn’t see the files underneath the seafile blocks, but it does. I just rsync’ed my data and started an upload, then did the rsync again and the ISO file was added, and then again and more of it was sent. So it does see the files inside the blocks, because it clearly showed the name of the file I was uploading - I’ll post the output at the end. Makes me feel good.

  • That’s why there is the seaf-fsck script, which checks the latest hash of the data against the hash in the database - and the file contents too, but that’s a longer conversation.
    I don’t mind long conversations, hah. So let’s say I ran it like this for years and then did seaf-fsck - would it miss only the latest missing files, or would something get corrupted every day, if it’s daily?
    I did run fsck yesterday to test it out.

  • When you use the --repair flag on seaf-fsck, Seafile will restore the library history to the hash saved in the database. That’s why the Seafile manual says “Backup database first”.
    Well, maybe I have a bit more reading to do :), but that’s the path I’m taking - database first, then data - because I don’t want to be rebuilding anything in the database anytime soon.

  • Conclusion: when you restore all the backups and run seaf-fsck, a file modified while the backup was running will be lost - but it will be backed up in the next backup period.
    Oh OK, well, I was reading and replying along and I just saw this line, so I should be good, at least 99.2%. Thank you, much appreciated.

  • Also, thanks for letting me know how you do backups, it helps a lot. I also don’t want downtime for other people, so I will do live rsyncs with the database first.

  • And thanks for letting me know that the seafile stop and start scripts need checking up on. Since this is going to be a cron job, I won’t be there to check whether it’s up or down, so better not to stop it, like you said - what if it doesn’t start :slight_smile:

So here it is. What I did was: rsync, then start uploading an ISO and rsync twice, then stop the upload and rsync again, and then it gets deleted. Nice.
I guess you don’t need to see this, but I will leave it here for someone else to see, in case they need to know.

me@seafile:~$ sudo rsync -avzHP --delete /opt/seafile-data backup.backup:/home/me
sending incremental file list

sent 154,902 bytes  received 1,673 bytes  104,383.33 bytes/sec
total size is 2,950,509,888  speedup is 18,844.07
me@seafile:~$ sudo rsync -avzHP --delete /opt/seafile-data backup.backup:/home/me
sending incremental file list
seafile-data/logs/var-log/nginx/access.log.1
         90,741 100%   10.73MB/s    0:00:00 (xfr#1, ir-chk=1042/1071)
seafile-data/logs/var-log/nginx/seahub.access.log.1
        330,587 100%   26.27MB/s    0:00:00 (xfr#2, ir-chk=1019/1071)
seafile-data/seafile/seafile-data/httptemp/
seafile-data/seafile/seafile-data/httptemp/debian-8.11.1-amd64-DVD-1.isoJRLTG1
        506,534 100%   11.23MB/s    0:00:00 (xfr#3, ir-chk=1004/1090)

sent 662,252 bytes  received 5,341 bytes  445,062.00 bytes/sec
total size is 2,951,017,436  speedup is 4,420.38
me@seafile:~$ sudo rsync -avzHP --delete /opt/seafile-data backup.backup:/home/me
sending incremental file list
seafile-data/seafile/seafile-data/httptemp/debian-8.11.1-amd64-DVD-1.isoJRLTG1
      3,029,670 100%   15.28MB/s    0:00:00 (xfr#1, ir-chk=1004/1090)

sent 1,287,218 bytes  received 6,004 bytes  862,148.00 bytes/sec
total size is 2,953,540,572  speedup is 2,283.86
me@seafile:~$ sudo rsync -avzHP --delete /opt/seafile-data backup.backup:/home/me
sending incremental file list
deleting seafile-data/seafile/seafile-data/httptemp/debian-8.11.1-amd64-DVD-1.isoJRLTG1
seafile-data/logs/var-log/nginx/error.log.1
         80,320 100%   75.93MB/s    0:00:00 (xfr#1, ir-chk=1035/1070)
seafile-data/logs/var-log/nginx/seafhttp.access.log.1
        696,883 100%  221.53MB/s    0:00:00 (xfr#2, ir-chk=1027/1070)
seafile-data/seafile/seafile-data/httptemp/

sent 155,543 bytes  received 7,506 bytes  108,699.33 bytes/sec
total size is 2,950,511,390  speedup is 18,095.86
me@seafile:~$ sudo rsync -avzHP --delete /opt/seafile-data backup.backup:/home/me
sending incremental file list

sent 154,903 bytes  received 1,677 bytes  313,160.00 bytes/sec
total size is 2,950,511,390  speedup is 18,843.48

would it miss only the latest missing files or would something get corrupted every day

Yes, you will be missing only the latest files :slight_smile:

And thanks for letting me know that the seafile stop and start scripts need checking up on. Since this is going to be a cron job, I won’t be there to check whether it’s up or down, so better not to stop it, like you said - what if it doesn’t start

That’s why I mentioned it. I’ve had situations several times where the seahub service looks like it’s running (systemd is OK, the PID files exist), but seahub shows an error page caused by, for example, file permissions - typically after upgrading the server. And this is really hard to check (never mind resolve :smiley: ) from some automatic job or service.

BTW, our server’s file system crashed three weeks ago with 3.7 TB of data. The restore was easy. If I can make a recommendation for everyone: do your backups with ZFS or virtual machine snapshots, or make an “in-house” backup by rsync to another machine on the same network and send the backup somewhere else from there. Restoring a bigger data store by just copying seafile-data (cp, rsync or whatever) will take ages - for example, the first 3.7 TB with 3 years of library history took around 4 days to copy back from an external USB 3.0 HDD.

Remember that uploading/copying one big file is faster than the same amount of data in many small files - in our case, around 5 000 000 000 files had to be copied.
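For the ZFS route, the idea is roughly this (the pool/dataset names, snapshot names and backup host are made up):

#take a snapshot of the dataset that holds seafile-data
zfs snapshot tank/seafile@day1
#first backup: send the whole snapshot to another machine
zfs send tank/seafile@day1 | ssh backuphost zfs receive backup/seafile
#later backups: send only the difference between two snapshots
zfs snapshot tank/seafile@day2
zfs send -i tank/seafile@day1 tank/seafile@day2 | ssh backuphost zfs receive backup/seafile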

This is too much text for me.

Just keep the following in mind

  • Make sure the garbage collector does not run while you are backing up data
  • Do a database backup
  • Back up the files. It is no issue if new data is added (or removed, or renamed) in the meantime, but on restore you’ll see exactly the data that was there as of the database backup, and it’ll be consistent
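A minimal sketch of that checklist as one live-backup script (the pgrep pattern and database names are assumptions; the paths are the ones from the rsync output above):

#!/bin/bash
#1. make sure the garbage collector is not running right now
if pgrep -f seaf-gc > /dev/null; then
    echo "GC is running, skipping this backup run" >&2
    exit 1
fi
#2. database backup first (use a credentials file instead of -p in a real cron job)
mysqldump -u seafile -p --databases ccnet_db seafile_db seahub_db > /backup/seafile-dbs.sql
#3. then the files, live
rsync -avzHP --delete /opt/seafile-data backup.backup:/home/me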

Yess :slight_smile:

And you made me make a good decision today.

Wow, that’s a long time.

ZFS is awesome, but not an option for me here. When I first did a zfs send of my files from a NAS to another local machine, it was very fast.

Do you know what happened to make it crash?

And that was only half of it … add another 4 days of copying to the external disk from the backup NAS 40 km away from the server.

That was only a suggestion - use what you want - but snapshot backups are easy to make, move/copy and, above all, restore, because you are moving only one or a few files. You can run the server in a virtual environment like Proxmox or VMware, which have their own snapshot backups.

The crash didn’t come from the FS itself. I don’t know how (nobody around me knows either), but the datastore management in VMware ESXi did something bad and destroyed the file system in the datastore where the virtual disk files were. I was able to boot up the virtual machines and they looked like they were working, but if you tried to write to some folder, the whole system froze - even when I tried to move the whole virtual disk files. I had to destroy the whole datastore to get it working again. S.M.A.R.T. and other diagnostic apps didn’t report any problems, and after reinstalling the virtual machines everything works :slight_smile:

Oh, so you work with ESXi. I liked ESXi, I tried it for a bit, then went on to try XCP-ng and liked that even more because of the VM snapshot copies and their appliance.
But now I’m on bhyve and FreeBSD because of ZFS - also good, but there’s a lot to learn.

Well that’s good :slight_smile:

That took two hours :slight_smile: because of the disk IO on the VPS.

Thanks again