GC has to be run multiple times before all unused blocks are removed

Hi,

I have a library that is set to keep full history. I recently finished the project and would like to clean up the redundant data. I opened the Trash in that library and cleaned all trash and history files. Then I ran GC on that library using the command:
./seaf-gc.sh 3b0601d9-0225-44c8-b61a-4231e03d8d9d

The first run shows:
[08/24/16 11:52:13] gc-core.c(577): GC finished for repo 3b0601d9. 7670 blocks total, about 10896 reachable blocks, 1052 blocks are removed.
Then I ran the same command again, and it shows:
[08/24/16 11:52:17] gc-core.c(577): GC finished for repo 3b0601d9. 6618 blocks total, about 10896 reachable blocks, 95 blocks are removed.
Run it again and it shows:
[08/24/16 11:52:20] gc-core.c(577): GC finished for repo 3b0601d9. 6523 blocks total, about 10896 reachable blocks, 12 blocks are removed.
Again:
[08/24/16 11:52:24] gc-core.c(577): GC finished for repo 3b0601d9. 6511 blocks total, about 10896 reachable blocks, 5 blocks are removed.
Again:
[08/24/16 11:52:27] gc-core.c(577): GC finished for repo 3b0601d9. 6506 blocks total, about 10896 reachable blocks, 0 blocks are removed.

In other words, I have to run GC 4 times consecutively on the same library before all unused blocks are cleaned up. Is this a bug? If not, why does it behave like this?
Is there any way to run it repeatedly until all unused blocks are removed?
Thanks.

Note that before running GC, I also ran it in dry-run mode:
./seaf-gc.sh --dry-run 3b0601d9-0225-44c8-b61a-4231e03d8d9d

and it shows:
[08/24/16 11:52:00] gc-core.c(582): GC finished for repo 3b0601d9. 7670 blocks total, about 10896 reachable blocks, 1052 blocks can be removed.

This is not a bug. It is a property of the GC algorithm, which uses a Bloom filter to track which blocks are still in use. A Bloom filter uses much less memory than a hash table, but it can return false positives when you look up a block, so some “dead” blocks may be mistaken for live ones and missed. We choose the size of the filter so that the false positive probability is relatively low. After multiple runs, all garbage will be cleaned up.
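
For anyone who wants to automate the repeated runs, here is a minimal sketch of a wrapper that reruns a GC command until its log reports that 0 blocks were removed. It is not part of Seafile: the function names `removed_count` and `gc_until_clean` and the run limit are hypothetical, the match pattern is taken from the log lines quoted in this thread, and it assumes the log line is written to the command's stdout or stderr.

```shell
#!/bin/sh
# Sketch only: helper names are made up for illustration, not Seafile tools.

# Read GC output on stdin and print the N from the last
# "N blocks are removed" line (prints nothing if no such line exists).
removed_count() {
    sed -n 's/.* \([0-9][0-9]*\) blocks are removed.*/\1/p' | tail -n 1
}

# Usage: gc_until_clean MAX_RUNS COMMAND [ARGS...]
# Runs COMMAND repeatedly until it reports 0 removed blocks.
# Returns 0 when clean, 1 if MAX_RUNS is reached, 2 if no count is found.
gc_until_clean() {
    gc_max_runs=$1
    shift
    gc_run=1
    while [ "$gc_run" -le "$gc_max_runs" ]; do
        # Capture stdout and stderr, since the log destination is assumed.
        gc_removed=$("$@" 2>&1 | removed_count)
        if [ -z "$gc_removed" ]; then
            echo "run $gc_run: no removed-block count found in the output" >&2
            return 2
        fi
        echo "run $gc_run: $gc_removed blocks removed"
        if [ "$gc_removed" -eq 0 ]; then
            return 0
        fi
        gc_run=$((gc_run + 1))
    done
    echo "stopped after $gc_max_runs runs with blocks still being removed" >&2
    return 1
}

# Example call (run from the directory that contains seaf-gc.sh):
# gc_until_clean 10 ./seaf-gc.sh 3b0601d9-0225-44c8-b61a-4231e03d8d9d
```

The run limit is there so that the loop cannot spin forever if the output format differs from the lines shown in this thread.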

Thanks for the clarification. It is quite clear now.

@Jonathan Maybe you should mention that in the manual…
