Data recovery without commits folder

Hi everyone,

I’m running a Seafile server on a Raspberry Pi, but due to a faulty hard drive I’ve lost the /commits and /fs folders from /storage. Don’t ask me how; I only realized a few weeks ago that I don’t have them anymore… As a result, I can no longer access any of my files in the library. The /blocks folder, however, seems to be intact.

I tried using seaf-fsck, but it didn’t work, likely because I don’t have all the necessary data. I also have a Seafile client on a Windows laptop, and I noticed that I can navigate through the files there (though I can’t open any), and it has /commits and /fs folders. I tried copying these folders to the server, but I still couldn’t restore access to the files — apparently, the contents of the Windows client’s folders don’t match what’s in /blocks on the server.

I also attempted using a recovery script I found on GitHub, but this didn’t work either, again because the data in /commits and /blocks don’t seem to align.

I’m really hoping to recover my library data — I have over 200GB of family photos and videos with no backup. Any suggestions or ideas would be greatly appreciated. I’m even open to a brute-force solution to restore files from blocks, if that’s possible and if that’s my only option… Please help!

Thanks!

That sucks. I know in your place I would be anxious, and getting no replies would only make that worse. So while I’m not an expert, I’ll reply with what I know (or think I know), and hopefully someone will be eager to step in with a “well, actually” if I am wrong.

I don’t think you will get far without the FS files. These hold the metadata about files and directories, so without them you don’t have the filenames or the directory structure. But even worse, I’m pretty sure they also hold the ordered list of blocks that make up the contents of each file. I couldn’t find much documentation about that, but here’s something: Data Model · seafile-docs

So a few things you should do first (and probably have already done, but just in case):

  • Back up what’s there. Or at least snapshot your server VM (if it is a VM). If anything you do makes things worse, it would be nice to be able to get back to the current state.
  • Check the server’s disks for problems. If this is a hardware server, check the disk(s) for SMART errors (smartctl), and run the SMART self-test on them. If a disk is failing, do what you can to get the data off to another disk now, and repair later.
  • Check the server’s filesystem for problems. The specific how depends a lot on how you set up the server, but it’s something like fsck, xfs_repair, or zfs scrub. This filesystem level needs to be sound for anything at the Seafile level to work; see the example commands after this list.
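To make those checks concrete, here’s roughly what I’d run. The device names are assumptions, so substitute your actual disk and partition, and only run a repair with the filesystem unmounted:

$ sudo smartctl -H -a /dev/sda     # SMART health summary, attributes, and error log
$ sudo smartctl -t long /dev/sda   # start a long self-test; check the results later with -a
$ sudo fsck -n /dev/sda1           # read-only check first (ext4); drop -n only when unmounted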

Now, with that out of the way, I have some hopeful news. I compared the storage/commits and storage/fs directories for a library on my server to those on a client (a Linux client, but I hope it’s close enough). The server has lots more commit files, but when the same filename existed in both places, the contents of those files were identical. I think both sides keep their own chain of commits tracking what has changed, but end up with an exact copy of the commit they make when uploading/downloading changes between them.

The server had more than 2x as many files in fs/, but again, any file that existed in both places was bit-for-bit identical. I suspect the server has lots more because it keeps older versions of files that have since been edited or deleted, while the client doesn’t keep history like that. As far as I know, blocks on the server aren’t deleted except during seaf-gc (garbage collection), so even if you end up with an older state of the files, all the blocks needed should still be there.
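If you want to repeat that comparison against your client’s copy, diff can do it in one shot. The paths here are just placeholders for wherever your two trees ended up; diff -rq prints “Only in …” for one-sided files and “differ” when contents don’t match:

$ diff -rq /path/to/server/storage/fs/<library-id> /path/to/client-copy/fs/<library-id>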

So I think you are on the right track with copying the commits and fs from the client. You just need to make sure the owner of all of those files matches the blocks. On my server that’s seafile:seafile, so sudo chown -R seafile:seafile commits fs would set that.

I would also try using seaf-fuse to see if you can get to the files. AFAIK this doesn’t need the database, and it can get around at least some kinds of brokenness in libraries. So if this works but the library can’t be fully fixed, you can at least get to the data and copy it into a new library before deleting the broken one.
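If you haven’t used seaf-fuse before, mounting goes roughly like this on a standard server install (the mount point is my assumption; run it from the seafile-server directory as your seafile user):

$ mkdir -p /mnt/seafile-fuse
$ ./seaf-fuse.sh start /mnt/seafile-fuse
$ ls /mnt/seafile-fuse      # libraries should show up as plain directories
$ ./seaf-fuse.sh stop       # unmount when done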

One more thing I noticed when looking at this. The blocks (but not the commits or fs files) are named for their sha1 hash. So you can see if your block files are corrupt in some way by hashing them:

$ cd storage/blocks/aeb20487-5bfc-40dd-8253-c53124dd2a73
$ sudo sha1sum ff/*
ff005940f6242423790d281be6c553a643d1b182  ff/005940f6242423790d281be6c553a643d1b182
ff057ace37bd62dc3794172d21749fba9d97a702  ff/057ace37bd62dc3794172d21749fba9d97a702
...

The first 2 characters of the hash are the directory name, and the rest should match the filename. It might be worth spot-checking a few files, since if the contents of the blocks are lost, so is the data.

Good luck, and ask if there’s anything more I can help with.

Thanks for your reply, I really appreciate it! I’ll try your ideas in the next day(s) and come back with an answer.

It’s been a while; I’ve tried some things so far…

  1. Put the commits and fs folders from the client onto the server, as I already mentioned, and tried again to run the server and get to the data. This didn’t work… I can see the files, but they all have size=0.
  2. Mounted the folder with seaf-fuse, but the same thing happened: I could see the files, but all of them have size=0.
  3. I also checked the content of some files from the fs folder. The only files I could find there have the following format:
"dirents": [
  {
    "id": "soemthing",
    "mode": "3",
    ...
  }
],
"type": 3,
"version": 1

None of them has the block order for any file, or anything like that…
  4. I also tried looking over the blocks folder. I guess I could recover all the images, or at least some of them (~50GB), using a Python script, because it looks like they were not split into blocks due to their size. For video files, I can see only the first few seconds and then the video stops, of course, because it was split. I also saw that each block is ~8MB.
Now I am looking for a way to put these blocks back together into the original videos, but I’m not sure how I can do it…
@tomservo, I couldn’t understand the last part you explained above, with the sha1 hash for the blocks. How can I tell from that whether my files are corrupt?

First, to answer your question about the sha1: each block’s file name is created from the sha1 hash of that block’s contents. This way, when adding new data, it’s easy to chop the file up into block(s), hash each block, and then either write it to disk or just throw it away because a file with that name already exists (since that should be a block with the same content). If a file is edited, the changed parts become new blocks; the block files should never be edited, just added and removed.

So we can use this in reverse to see if a file’s contents have changed. If even one bit in the file is different, the hash will be different (and generally wildly different, not just a tiny bit). So “sha1sum ff/*” asked sha1sum to calculate the sha1 hash of every file in the ff/ directory. The results of that look like:

ff005940f6242423790d281be6c553a643d1b182  ff/005940f6242423790d281be6c553a643d1b182
ff057ace37bd62dc3794172d21749fba9d97a702  ff/057ace37bd62dc3794172d21749fba9d97a702

You should see the hash matching the file name (except for the first 2 characters, which come from the directory name). So if they don’t match, that file’s contents have changed since it was first uploaded to Seafile. I was trying to suggest that you pick a couple of samples and just compare by eye, like all of 01/* and ff/*: if the last 8 characters match, then the file is probably fine, but if they don’t match, that file is broken. If none match, then it’s probably already a lost cause.
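Rather than eyeballing it, a small loop can check every block for you. This is just a sketch; run it from inside the library’s blocks directory (it strips the leading "./" and the one "/" out of each path to rebuild the expected hash):

find . -type f | while read -r f ; do
    expected="$(echo "$f" | sed 's|^\./||; s|/||')"
    actual="$(sha1sum "$f" | cut -d' ' -f1)"
    [[ "$actual" == "$expected" ]] || echo "MISMATCH: $f"
done

Anything it prints is a block whose contents no longer match its name.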

That file you have seems weird to me. My files in fs/ contain no readable text at all. But that gave me a clue that led to someone else’s work along these lines: https://awant.medium.com/seafile-data-structure-c8a1e62a64e4

And that led to a long ramble that isn’t going to be practical for you. Feel free to skip to the next post for the practical plan B (or is it C?).

According to that site, the files in fs/ are all compressed with zlib (explaining why mine don’t have any readable text). It seems that isn’t true of your files, so I suspect the Windows client doesn’t use zlib for some reason.

With a bit of Python I was able to decompress some, and they look like what’s reported in that blog. Some are the “dirents” files like you quoted, which seem to be a directory (a list of files and sub-directories). Entries that are files have a size number; sub-directories don’t. The id is the name of another file in the “fs/” directory with the / missing (so “id”: “713c4175bb8ab07ac7b006d159aaa7e7af8aa90f” points to file 71/3c4175bb8ab07ac7b006d159aaa7e7af8aa90f).
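In case you want to peek inside one yourself, here’s the decompression as a shell one-liner calling Python (the object path is just the example id from above; yours will differ, and your files may not be compressed at all):

$ python3 -c "import sys, zlib; sys.stdout.buffer.write(zlib.decompress(open(sys.argv[1], 'rb').read()))" 71/3c4175bb8ab07ac7b006d159aaa7e7af8aa90f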

The other kind of file in fs/ doesn’t have the “dirents” string, and looks like this:

{
    "block_ids": [
        "0b750676df87beacf1e42a3e3e33ecbdcc98ac0d",
        "6633a66e9e2fb14cde76177173430af23e9bd325",
        "a3352a4ef431be9a031f5fc598da5fa8d522fb17",
        "629f832541e6f61a1047eee4da0f27de56b6b961"
    ],
    "size": 4622322,
    "type": 1,
    "version": 1
}

That lists the 4 blocks from the blocks dir (again, you need to add the / after the second character) that have to be read in order to make up the original file. Since you don’t have any of these, we don’t know the connection between a filename and the block(s) for that file.
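So for this example, rebuilding the original file would just be concatenating those blocks in order from inside the library’s blocks directory:

cat 0b/750676df87beacf1e42a3e3e33ecbdcc98ac0d \
    66/33a66e9e2fb14cde76177173430af23e9bd325 \
    a3/352a4ef431be9a031f5fc598da5fa8d522fb17 \
    62/9f832541e6f61a1047eee4da0f27de56b6b961 > recovered_file

The result should come out to exactly 4622322 bytes, matching the “size” field.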

It sounds like we don’t have the fs/ files we need, so this is the plan for doing without. You already worked out most of it: a lot of your files are complete in a single block.

So my basic idea is these steps:

  1. Duplicate the block files with hardlinks. A hardlink is a second filename pointing to the same file on disk, so this won’t use much more disk space, but it gives you the originals as an easy “undo” path.

  2. Step through the files to find all the files that have a JPG file header, use the exif metadata inside them to get the original creation date, and rename them like “/recovered/jpg/date-time.jpg”.

  3. Use something like ImageMagick’s identify program to find any incomplete images and move them to an “incomplete files” directory.

  4. Do something like step 2 again to find the files with a video header and move them out. Any file remaining after that is probably the middle or end of an image or video.

  5. From there I don’t have a good idea how to stitch the remaining files together, other than a script that would try every combination in order until the video looks right. This seems like it would be crazy if you have more than a few dozen video files.

For 1, something like this:

mkdir /recovery_files/ /recovered_jpgs /incomplete_jpgs
cp -rl blocks/* /recovery_files/

You might need to put /recovery_files somewhere else, since it needs to be on the same filesystem as the blocks for the hardlinks to work.

For 2:
I think a script that does things like this:
cd /recovery_files
Do a loop through each file (like find . -type f | while read -r block_file ; do).

Use the file command to find out if the file has a JPEG header, like if file "$block_file" | grep -q JPEG ; then, and continue to 3.

For 3:
For those that were JPEG, use exiftool to get the date the picture was taken and then move the file to another directory with that date as the file name. It might need a mechanism to fall back to the file’s timestamp if there isn’t a date in the exif data. And it’s probably worth checking that there isn’t already a file with the new name, and appending a number if there is.
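For 5:
I don’t have the search part, but the “video looks right” test could at least be automated. Assuming ffmpeg is installed, decoding a candidate to nowhere and failing on the first error is a common trick (the block names here are made up):

cat block_A block_B block_C > candidate.mp4
ffmpeg -v error -xerror -i candidate.mp4 -f null - && echo "decodes cleanly"

That still leaves trying the orderings, which is the part that seems crazy with thousands of files.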

If you would like, I could try making a first draft of that image-moving script. But it will be a couple of days before I have time to do that.

I’m sorry, I thought I would get back to this sooner than I did. Anyway, I made a library with a bunch of pictures to simulate yours for testing, and I wrote this script. It’s ugly, but I hope it helps. First, here are the steps to use it:

  1. Save this script as /tmp/move_images, and make it executable (chmod +x /tmp/move_images).
  2. Install needed programs: “sudo apt install imagemagick exiftool”
  3. Go to the storage/blocks directory for your library. For me that would be “cd /seafile-data/data/storage/blocks/a36e41ca-1fbd-4c6a-92c9-64da51ad028f”
  4. Make a new directory for the library. In this directory we will make a hard-link of the original files so that any changes we make will not change the originals. This way if anything goes wrong, we can just delete this directory and start again from here: “mkdir ../working”
  5. Now we make the hardlinks: “cp -rl * ../working/”
  6. And move into the working directory: “cd ../working”
  7. Make directories for the results: “mkdir ../complete_images ../incomplete_images”
  8. And finally run the script. Make sure you are inside the “working” directory, and then run “/tmp/move_images”

That could run for a while (my testing took about 5 minutes for about 8,000 files), but when it is done it will have found all files with a header that looks like a JPG, renamed each to the image’s creation date and time (based on what’s in the exif metadata, or the file’s date if there is no exif data), and moved them into ../complete_images (or, if the file appears to be incomplete, into ../incomplete_images).

#!/bin/bash
#

move_complete_to="../complete_images"
move_incomplete_to="../incomplete_images"

if [[ "$(which exiftool)" == "" ]] ; then
    echo "Couldn't find exiftool"
    exit 1
fi

if [[ "$(which identify)" == "" ]] ; then
    echo "Couldn't find identify from the imagemagick package"
    exit 1
fi

# loop through every file in the current directory and subdirectories
find . -type f | while read -r block_file ; do
    echo "Considering file $block_file"
    if file "$block_file" | grep -q JPEG ; then
	# if the file command says this is a jpeg file
	complete=unknown
	date=unknown
	dest_dir=unknown
	dest_file=unknown
	count=1
	continu=true
 
	# let's see if this seems to be a complete file
	if identify -regard-warnings "$block_file" > /dev/null 2>&1 ; then
	    complete=yes
	else
	    complete=no
	fi

	# let's get the creation date for this file if we can
	date="$(exiftool -CreateDate "$block_file" | sed 's/^.*: \([0-9]*\):\([0-9]*\):\([0-9]*\) \([0-9]*\):\([0-9]*\):\([0-9]*\)$/\1-\2-\3 \4:\5:\6/g' || echo "error")"

	# If getting the date from the exif info failed or came back empty
	# (exiftool prints nothing when there is no CreateDate tag),
	# let's use the file's timestamp instead
	if [[ -z "$date" || "$(echo "$date" | grep -o error)" == "error" ]] ; then
	    date="$(stat -c %y "$block_file")"
	fi

	# Normalize the time and date format to something nice
	date="$(date --date="$date" +%F_%T)"

	# Now if the file is complete, move to the complete file directory
	# and if it isn't complete, move to the incomplete file directory
	if [[ $complete == yes ]] ; then
	    dest_dir="$move_complete_to"
	else
	    dest_dir="$move_incomplete_to"
	fi
	
	# If there is already a file with that date and time, add a number to
	# the end until we get a unique filename
	if [[ ! -f "${dest_dir}/${date}.jpg" ]] ; then
	    dest_file="${dest_dir}/${date}.jpg"
	else
	    while [[ $continu == true ]] ; do
		if [[ ! -f "${dest_dir}/${date}_${count}.jpg" ]] ; then
		    dest_file="${dest_dir}/${date}_${count}.jpg"
		    continu=false
		else
		    ((count++))
		    if [[ $count -gt 1000 ]] ; then
			echo "more than 1000 files with the same date and time?"
			echo "I think something went wrong!"
			exit 1
		    fi
		fi
	    done 
	fi

	#Finally move the file
	echo "   Moving to $dest_file"
	mv "$block_file" "$dest_file"
    fi
done



Apologies again for my delayed response — it’s been a while. In the meantime, I’ve tried several approaches, including running your script. Thank you so much for your help! With a few tweaks, I managed to recover the JPG files and the first few seconds of the video files. However, I’m still working on figuring out how to recover the rest.
I haven’t yet tried the brute-force method you suggested. With thousands of files to process, I’m uncertain it would be feasible. Manually combining them all and checking each one to see if the video is complete seems like too much…
I might just have to accept the current results and leave it as it is…

I haven’t thought of any other way to fix your videos, but I sure wouldn’t look at hundreds or thousands of possibilities for each file. There might be some clues in the video files that can help (maybe some blocks get numbered sequentially or something), but I kinda doubt it.

Thanks for the follow-up. I am glad you got at least some of your files recovered.