I have seafile client 6.1.2 installed on an Ubuntu 16.04 LTS machine. I’m trying to sync a library containing a large file (several hundred GB) and a few smaller files from this machine to my server. The server is seafile 6.2.5 running behind an Apache proxy to provide HTTPS.
The other files are synced properly until seafile tries to sync the large file, at which point seafile fills up the entire / filesystem (I get a warning about there only being some tens of KB left on the filesystem), then it just seems to give up. Whatever is taking up space (I assume temp/index files) is deleted, the large file and any small files that haven’t been synced simply aren’t uploaded, and the library is marked as synced with a green check.
I haven’t been able to find any guidance on how much free space seafile requires, but there’s about 50gb free on /. Based on this https://github.com/haiwen/seafile/issues/871 I believe 50gb should be enough. The library is also not an old one (the server started on seafile 5) as mentioned in that github issue.
I realize syncing files this large is probably an edge case, but seafile says that it supports this. However, even if this upload failed, the library definitely should not be marked as synced/good when even a simple compare of the file names would show that it isn’t.
Sorry, misunderstood you. I thought you were talking about the server. I’m not certain how the client interacts with Linux, as I use Windows and Mac for syncing. I have about 250 gb of storage used on my server. On my clients that sync with it, only about 30 Mb is used by Seafile.
I would use a program on Linux to discover which folders are eating up space. That should help you narrow down which of Seafile’s local folders are having the issue and what type files they are.
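As a rough illustration of that kind of check, here is a minimal Python sketch that ranks the immediate subfolders of a directory by total file size. The function names `dir_size` and `largest_subdirs` are made up for this example, and it sums apparent file sizes rather than allocated disk blocks, so the numbers may differ slightly from what the filesystem reports.

```python
import os

def dir_size(path):
    """Total size in bytes of regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            full = os.path.join(root, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total

def largest_subdirs(parent, top=10):
    """Return (size, path) pairs for the immediate subdirectories, largest first."""
    sizes = []
    with os.scandir(parent) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((dir_size(entry.path), entry.path))
    return sorted(sizes, reverse=True)[:top]
```

Calling `largest_subdirs(os.path.expanduser("~"))` while the client is indexing should show which folder is growing; hidden folders are included.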
I had some more time to look into this. The directory that’s filling up is /home/[my username]/Seafile/.seafile-data/storage/blocks/[a UUID]/
Inside that there are a bunch of small folders with 2 character alpha numeric names (indexed chunks of the file to be uploaded, I assume).
This appears to be the relevant part from the logs:
[05/05/18 13:14:56] …/common/fs-mgr.c(584): Failed to write chunk 77c87a82ccf2e442369d65daf82de694f09787fd.
[05/05/18 13:14:56] CDC: failed to write chunk.
[05/05/18 13:14:56] …/common/fs-mgr.c(697): Failed to chunk file with CDC.
I assume it’s because there’s no more space on the disk. After that the files get deleted, the repository changes to good / synced (!) despite the big file and any files that haven’t been synced yet not being uploaded.
Seafile is also using protocol 2, as per the log:
[05/03/18 05:08:25] http-tx-mgr.c(4288): Download with HTTP sync protocol version 2.
I checked that folder on my client. It’s empty. Your client is queuing up files and leaving them there, most likely due to lack of disk space. What is the possibility of replacing the hdd? What size is it?
The files are only there while it’s indexing files… Otherwise it’s empty. As I’ve said, this happens when trying to sync a very large file.
To be clear: Seafile tries to sync a very large file, starts indexing the file, fills up the disk, can’t write a chunk, gives up syncing the file, then deletes the chunks and marks the library as synced with no issues.
Based on what I’ve read, seafile shouldn’t need double the space to sync large files, but maybe shoeper is right. That just seems like such a shit design… and https://github.com/haiwen/seafile/issues/871 explicitly says that no longer happens… idk. Really beginning to think I made a huge mistake and should’ve just set up rsync and lived without webdav and other niceties.
It makes more sense from a programming perspective to create all the chunks first and then send them. If it were to send chunks while it was creating other chunks, you would have the same problem since sending the chunks is slower than creating them due to the network bottleneck and I/O operations on the server. Eventually, with a large file, the creation of chunks would outpace the sending of them and fill up the hard drive.
The only way around it from a programming perspective that I know of would be to have new chunks created as old chunks were sent, and that would be a performance hit on transfers. Not very efficient.
Now, there may be a way to specify a different folder/hard drive for the cache, but I don’t know for sure. If it doesn’t have that option, it might be a nice option to see in a future release of the client.
However, the easiest solution would be to replace the hard drive. Hard drives are dirt cheap these days.
The seafile client is running on a VM, and anyway, buying a 2TB+ harddrive (assuming my data set doesn’t grow at all) just for seafile to cache is ridiculous. What if my dataset grows? or someone has a truly massive file? Asking people to double up all their storage just for seafile isn’t reasonable.
I get that programming it to first make all the chunks and then start sending them is easier and more straightforward, but including logic to only write up to X blocks (or use Y space on disk), send those, then repeat isn’t that much more complicated. It seems like an odd omission since seafile already supports delta sync, which is fairly complicated, and the promotional material talks about supporting large files. In fact, I chose seafile because it was the only self-hosted-cloud-thing that supports delta sync at all.
Either way, the bigger problem here is the fact that the library stops syncing and then fails silently. How could I trust a file sync system that does this? This is by far the biggest problem.
P.S. I think a feature to limit cache to X blocks or Y amount of space would be far more useful than changing the location of the cache. Or even simpler, a “one block at a time” mode. Generate a block then send it, repeat until finished, would be one check box in the options.
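A minimal Python sketch of the "limit the cache to X blocks" idea, to show roughly how much logic is involved. This is not Seafile's implementation: it assumes fixed-size chunks for simplicity (the log above shows the client uses CDC), it holds chunks in memory instead of on disk, and `upload_bounded` and the `send` callback are hypothetical names.

```python
import hashlib
import queue
import threading

def upload_bounded(path, send, chunk_size=8 * 1024 * 1024, max_pending=4):
    """Stream a file to send(chunk_id, data), queuing at most max_pending chunks."""
    pending = queue.Queue(maxsize=max_pending)  # put() blocks while the queue is full
    errors = []

    def consumer():
        while True:
            item = pending.get()
            if item is None:  # sentinel: the producer has finished
                return
            if errors:
                continue  # keep draining so the producer never blocks forever
            try:
                send(*item)
            except Exception as exc:
                errors.append(exc)

    worker = threading.Thread(target=consumer)
    worker.start()
    try:
        with open(path, "rb") as f:
            while not errors:
                data = f.read(chunk_size)
                if not data:
                    break
                pending.put((hashlib.sha1(data).hexdigest(), data))
    finally:
        pending.put(None)
        worker.join()
    if errors:
        raise errors[0]  # fail loudly instead of reporting success
```

With `max_pending=1` this behaves much like the "one block at a time" mode: reading stalls whenever sending falls behind, so the amount of buffered data stays at a few chunks regardless of file size. A failed `send` is re-raised to the caller rather than swallowed.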
The one block at a time, or a few blocks at a time would take a hit on performance. It’s more than a matter of “easier and more straightforward”. There’s also the issue of file corruption, extended lock times, etc…
Caching is and has been a disk hog for decades. At least Seafile deletes its cache files. I can’t say that about many of the other programs I use. My temp folder on one of my machines fills up so quickly I have to stay on top of it. I generate about 100 gb of tmp and cache files per week, and those programs don’t delete their cache files.
Windows is very notorious for caching just about everything, up to and including updates, installer packages, drivers, etc, and never deletes them.
However, I do agree with you that you should get some kind of notification in the client that the file failed to upload. Just dumping it to the log file is not enough.
BTW… If I’m understanding you correctly, you’ve installed Ubuntu in a VM and you run the client there? What is the host OS? Any time I run Linux in a VM (and I avoid that if at all possible), I dedicate at least 128 gb for the hard drive. The only time I ever really use Linux in a VM any longer is for test purposes. I just had too many problems keeping the VM running, and it was often painfully slow.
But that is about datasets. In your case it is a single file and not a set of many files.
I agree that it should be improved, e.g. by limiting the allowed space for file indexing to 1 GB, but I would prefer the client to do it in memory (with a limited amount of RAM, or block by block) instead of writing the data to disk first, transferring it in the next step, and removing the temporary files directly after transferring the chunks.
I haven’t looked into the algorithm in detail and am not sure whether it needs to have the blocks locally to let the deduplication work, but maybe it would be enough to compute a chunk (or a few in parallel to allow using the capacity of gbit links), ask the server whether it exists and only upload it if not.
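The "compute a chunk, ask the server, upload only if missing" loop could look roughly like the sketch below. Everything here is an assumption for illustration: `InMemoryServer`, `has_block`, and `put_block` are invented stand-ins and not Seafile's API, and fixed-size chunks with SHA-1 IDs are used instead of the client's CDC.

```python
import hashlib

class InMemoryServer:
    """Hypothetical stand-in for a remote block store (not Seafile's API)."""

    def __init__(self):
        self.blocks = {}

    def has_block(self, block_id):
        return block_id in self.blocks

    def put_block(self, block_id, data):
        self.blocks[block_id] = data

def upload_missing_blocks(path, server, chunk_size=8 * 1024 * 1024):
    """Upload only blocks the server lacks; return (block_ids, uploaded_count)."""
    block_ids = []
    uploaded = 0
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            block_id = hashlib.sha1(data).hexdigest()
            if not server.has_block(block_id):  # deduplication check
                server.put_block(block_id, data)
                uploaded += 1
            block_ids.append(block_id)  # file = ordered list of block IDs
    return block_ids, uploaded
```

Only one chunk is held locally at a time here, so nothing has to be written to disk before the transfer. Whether the real deduplication can work this way depends on details of the algorithm that this sketch does not model.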
My CS professor used to say ‘It doesn’t matter how fast you made it if it doesn’t work’. I also wasn’t aware that syncing per-block or per-set-number-of-blocks just magically caused “file corruption”, “extended lock times” and whatever “etc…” supposedly is.
Aw jeez, I didn’t realize some other programs (and your personal configurations) were trash! I should be so very thankful seafile isn’t completely trash! How could I have been so arrogant to expect this to not fail silently! Oh the hubris!
Again, you misunderstand. This is not even in the log file. Seafile believes the library synced properly. The only error in the log file is the one I’ve provided about failing to write a chunk.
I don’t see how your inability to use a VM or inexplicable advice about 128gb virtual disks is relevant here. I very much doubt seafile is aware it’s running on a VM in the first place.
Ah, so it is.
At least for my case I’d be willing to take a pretty significant performance hit; I have a 1gbps link between the computers. I don’t think anyone with huge files like this is expecting instant results or going to be limited by anything but network speed anyway.
Indeed - this is the headline here. Silent failures are just about the worst thing outside of outright data loss.
“Not Enough Disk Space to Complete Operation” would be a nice addition to any program rather than just dumping an error to a log file. I agree with you on that. However, there are many programs out there, even programs that are paid programs, that don’t check for available disk space prior to running an operation. It is, and has been an issue that has plagued software for decades.
Fortunately, it’s gotten better over time, but just in the last 10 years or so, programmers have gotten lax on the issue since disk space is so cheap and most people have enough as they are not power users. So, programmers assume that people will have enough space and simply leave the code out. The result? It bombs during the operation and the user is left with no clue or some general error, such as “unhandled exception” or “unknown error”
Recently, I had to upgrade a boot drive due to lack of disk space because the backup program I was using would just hang… No errors… No logs… Just hung and kept on trying. Took me a spell to figure out it was because the volume shadow copy it created was too large to fit in the remaining space.
I’ve just recategorized this thread to a “Feature Request” since we don’t have a “Bug Report” section. In a future version of the client, we need code that reports back to the user that there is a lack of disk space to complete the operation.
As have I. However, his problem is that his file is larger than the available disk space, and it appears that you must have free space at least equal to the size of the file you are trying to transfer. However, the main issue here is that it does not alert the user that the file failed to sync, except in a log file, and then it shows all files synced in the client even though they haven’t.
The only way to fix that is for an alert to be added to the client in a future version. Seafile should, at a minimum, report that it could not sync all files, and detecting a lack of available disk space for an operation is a simple enough task in a program. So, I moved it to the Feature Request category since there is no Bug Report category.
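For reference, a pre-flight free-space check of the kind described could be sketched in Python as follows. It assumes the worst case observed in this thread, namely that the cache needs about as much space as the file itself; `require_free_space`, `NotEnoughDiskSpace`, and the 1 GiB `reserve` default are hypothetical choices for this example, not anything the Seafile client provides.

```python
import os
import shutil

class NotEnoughDiskSpace(Exception):
    """Raised when the cache volume cannot hold the file plus a safety margin."""

def require_free_space(file_path, cache_dir, reserve=1024 ** 3):
    """Return free bytes on cache_dir's volume, or raise if the file will not fit."""
    # Assumption: indexing may need as much cache space as the file is large.
    needed = os.path.getsize(file_path) + reserve
    free = shutil.disk_usage(cache_dir).free
    if free < needed:
        raise NotEnoughDiskSpace(
            f"Not enough disk space to complete operation: "
            f"need {needed} bytes in {cache_dir}, only {free} free"
        )
    return free
```

A client that ran such a check before indexing could surface the exception in the UI and leave the library marked as not synced, rather than only writing a chunk error to the log.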