I have been using rclone to sync files between two Seafile servers, and I find that some files are synced again and again even though they have already been synced successfully from src to dest. It is because some files share the same IDs on the two different Seafile servers. I want to know how a file ID is generated. Why do some files share the same ID on two different Seafile servers (e.g. report.pptx on server A and report.pptx on server B share an id: 40f4d2fd5857d3342380d15f48d4392995ebc4a4)?
This is down to the way Seafile stores data. If you upload a file (say report.pptx), it gets chopped up into chunks, and those chunks (or blocks) are stored in /path/to/storage/blocks/library_id/2_digit_hex_number/
There are some other files created, like one that lists what blocks to read in what order to get back to the original file, but I’m ignoring those.
These block files get hashed (sha1), and the first 2 digits of the hash decide what directory they go in, and the rest of the hash is the filename.
Here’s an example from my seafile server:
seafile@seafile:/seafile-data/data/storage/blocks/05edf9f4-6163-4eb1-a69c-d8b7bda4cc88$ sha1sum ff/*
ff3bf73aac049482931309ae5c9877d3c3367fa6 ff/3bf73aac049482931309ae5c9877d3c3367fa6
ff549fb5fff680ac9dd87da83627d33f703da769 ff/549fb5fff680ac9dd87da83627d33f703da769
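To make the naming rule concrete, here is a minimal Python sketch of how a block's location could be derived from its contents. The function name block_path and its parameters are my own for illustration, not anything from Seafile's code; the directory layout just mirrors the listing above.

```python
import hashlib
from pathlib import PurePosixPath


def block_path(storage_root, library_id, block_bytes):
    """Return where a block would live, following the layout shown above.

    Illustrative only: block_path is a hypothetical helper, not a Seafile API.
    """
    digest = hashlib.sha1(block_bytes).hexdigest()  # 40 hex characters
    # The first 2 hex digits pick the directory, the remaining 38 are the filename.
    return PurePosixPath(storage_root) / "blocks" / library_id / digest[:2] / digest[2:]


path = block_path(
    "/seafile-data/data/storage",
    "05edf9f4-6163-4eb1-a69c-d8b7bda4cc88",
    b"hello",
)
print(path)
```

Because the path depends only on the bytes in the block, the same block always lands at the same name, whichever server it is stored on.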
Now, logically the next question is “why do such a complicated thing?”. This is the heart of the de-duplicated storage. If you upload that same file again with a different name, almost no additional space is used because that file’s content will hash to the same value (and so point to the same block files). If you upload a slightly modified version of the file, there might only be one new block, and several blocks that are the same as the old version.
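A toy version of that de-duplication, as a rough sketch. It assumes fixed-size 4-byte blocks purely to keep the demo readable; Seafile's real chunking and block sizes are different, and BlockStore is a made-up class, not part of Seafile.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small, only so the demo is easy to follow


class BlockStore:
    """Hypothetical in-memory content-addressed store (not Seafile code)."""

    def __init__(self):
        self.blocks = {}  # block id (sha1 hex) -> block bytes

    def put_file(self, data):
        """Store a file's blocks and return the ordered list of block ids."""
        ids = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            block_id = hashlib.sha1(block).hexdigest()
            self.blocks.setdefault(block_id, block)  # identical blocks are stored once
            ids.append(block_id)
        return ids


store = BlockStore()
original_ids = store.put_file(b"aaaabbbbcccc")
count_after_first = len(store.blocks)

copy_ids = store.put_file(b"aaaabbbbcccc")      # same content, "different name"
count_after_copy = len(store.blocks)

edited_ids = store.put_file(b"aaaaXbbbcccc")    # one block changed
count_after_edit = len(store.blocks)

print(count_after_first, count_after_copy, count_after_edit)  # 3 3 4
```

The second upload adds nothing, and the edited version adds exactly one new block while reusing the other two.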
This also allows efficient operation of the client. The sync client can do this dividing of files into blocks, and send only the blocks the server doesn’t already have, so small changes to large files don’t waste huge amounts of network bandwidth.
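The client-side idea can be sketched the same way. This is only an illustration under the same assumption of tiny fixed-size blocks; blocks_to_upload and server_block_ids are hypothetical names, and the real sync protocol is more involved than a set lookup.

```python
import hashlib

BLOCK_SIZE = 4  # demo value, not Seafile's real block size


def split_into_blocks(data, size=BLOCK_SIZE):
    """Map block id (sha1 hex) -> block bytes for every block of the file."""
    return {
        hashlib.sha1(data[i:i + size]).hexdigest(): data[i:i + size]
        for i in range(0, len(data), size)
    }


def blocks_to_upload(local_data, server_block_ids):
    """Return only the blocks whose ids the server does not already have."""
    local_blocks = split_into_blocks(local_data)
    return {bid: blk for bid, blk in local_blocks.items() if bid not in server_block_ids}


# The server already holds the old version of the file.
server_block_ids = set(split_into_blocks(b"aaaabbbbcccc"))

# The client edited the last part of the file.
to_send = blocks_to_upload(b"aaaabbbbdddd", server_block_ids)
print(len(to_send), list(to_send.values()))  # 1 [b'dddd']
```

Only the changed block crosses the network; the two unchanged blocks are recognised by id and skipped.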
More detail if you want it can be found here:
https://manual.seafile.com/12.0/develop/data_model/#fs
According to your explanation, if two identical files are placed on different Seafile servers, would their file IDs be exactly the same? On my two servers (you can think of them as mirrors of each other), why do some identical files have the same ID, while other identical files have different IDs?
The file ID is the sha1sum hash of the file's content. If the IDs are different then the file contents are different!
laptop$ sha1sum 202501.xlsx
3bc285c5d92888372f3a792d5a0685df5831176e 202501.xlsx
seafile# ls seafile-data/storage/blocks/$LIBRARYID/3b/c285c5d92888372f3a792d5a0685df5831176e
seafile-data/storage/blocks/$LIBRARYID/3b/c285c5d92888372f3a792d5a0685df5831176e
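If you want to run the same comparison on copies of a file pulled from each server, a small Python helper does what sha1sum does above. The helper names are mine, and the demo uses throwaway temp files rather than real Seafile data. Note that matching a whole-file hash to a block name, as in the listing above, would only work when the file fits in a single block; a file split into several blocks has one hash per block.

```python
import hashlib
import os
import tempfile


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Hash a file's content in pieces, giving the same digest as sha1sum."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def same_content(path_a, path_b):
    """True when both files hash to the same value."""
    return sha1_of_file(path_a) == sha1_of_file(path_b)


# Demo with temporary files standing in for copies from server A and server B.
with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "from_server_a.bin")
    b = os.path.join(tmp, "from_server_b.bin")
    c = os.path.join(tmp, "edited.bin")
    for path, data in ((a, b"same bytes"), (b, b"same bytes"), (c, b"same bytes!")):
        with open(path, "wb") as f:
            f.write(data)
    identical = same_content(a, b)
    differs = same_content(a, c)

print(identical, differs)  # True False
```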