Leafing through CVMFS Catalogs
I’m mainly writing this down so I remember how to dig into these bits of CVMFS internals next time[1].
As has happened a couple of times, I discovered today that the campus firewall was blocking one specific file on the CVMFS Stratum 0 and Stratum 1 servers hosted inside the campus network. Fetching the file from inside the network works as expected, but from outside it fails:
[user@stratum1 ~]$ cvmfs_server snapshot repo.example.org
Replicating from catalog at /gtdb
Processing chunks [8056 registered chunks]: ........
failed to download http://stratum0.example.org/cvmfs/repo.example.org/data/73/2104e02468491fa4e61599c176b16c98a052d4P (15 - host serving data too slowly)
couldn't reach Stratum 0 - please check the network connection
terminate called after throwing an instance of 'ECvmfsException'
what(): PANIC: /home/sftnight/jenkins/workspace/CvmfsFullBuildDocker/CVMFS_BUILD_ARCH/docker-x86_64/CVMFS_BUILD_PLATFORM/cc9/build/BUILD/cvmfs-2.11.5/cvmfs/swissknife_pull.cc : 286
Download error
Additional testing with curl shows that no bytes are returned for the request and eventually the connection times out:
[user@stratum1 ~]$ curl -sv \
http://stratum0.example.org/cvmfs/repo.example.org/data/73/2104e02468491fa4e61599c176b16c98a052d4P >/dev/null
* Trying 192.0.2.42...
* TCP_NODELAY set
* Connected to stratum0.example.org (192.0.2.42) port 80 (#0)
> GET /cvmfs/repo.example.org/data/73/2104e02468491fa4e61599c176b16c98a052d4P HTTP/1.1
> Host: stratum0.example.org
> User-Agent: curl/7.61.1
> Accept: */*
>
0 0 0 0 0 0 0 0 --:--:-- 0:10:13 --:--:-- 0* Recv failure: Connection timed out
0 0 0 0 0 0 0 0 --:--:-- 0:10:14 --:--:-- 0
* Closing connection 0
curl: (56) Recv failure: Connection timed out
When I described the problem to the firewall admins, they sent back a few lines from the log showing the denial: the file type (or “application” in Palo Altese) was detected as flash (as in, Adobe/Shockwave Flash). Why is my CVMFS repo full of public genomic data serving up Adobe Flash?
Well, it’s not. CVMFS is a chunked, content-hash-addressed filesystem. I had recently published the Genome Taxonomy Database and suspected that the file in question (2104e02468491fa4e61599c176b16c98a052d4P) was a chunk from that rather large database. But it’s best to confirm such suspicions.
To do that, I need to know what file the chunk belongs to.
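The catalogs refer to objects by their full content hash, not by path. Judging from the paths in this post, that hash is the two-character directory name followed by the file name minus its trailing letter (P for this chunk, C for the catalog further down). Here is a minimal Python sketch of that mapping, assuming 40-character SHA-1 hashes as seen here; the name `object_hash` is made up for illustration:

```python
import re

# Matches ".../data/<2 hex>/<38 hex><optional type letter>" as seen in this post.
# Assumption: plain 40-character SHA-1 hashes; other hash algorithms are not handled.
OBJECT_RE = re.compile(r"/data/([0-9a-f]{2})/([0-9a-f]{38})([A-Z]?)$")


def object_hash(path_or_url):
    """Return (hex_hash, suffix) for a CVMFS object path or URL."""
    match = OBJECT_RE.search(path_or_url)
    if match is None:
        raise ValueError(f"not a recognizable object path: {path_or_url!r}")
    prefix, rest, suffix = match.groups()
    return prefix + rest, suffix
```

It accepts either the Stratum 0 URL from the failed download or the local path under /srv/cvmfs, since only the tail of the string is inspected.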
Locate the revision that added the chunk
There’s probably a more clever way to do this but I started with the mod date on the hash file:
[user@stratum0 ~]$ ls -lh /srv/cvmfs/repo.example.org/data/73/2104e02468491fa4e61599c176b16c98a052d4P
-rw-r--r-- 1 user user 2.4M Apr 4 14:04 /srv/cvmfs/repo.example.org/data/73/2104e02468491fa4e61599c176b16c98a052d4P
This should point to the revision/tag (unless you have a lot of them around that time):
[user@stratum0 ~]$ cvmfs_server tag repo.example.org | grep -B1 '4 Apr 2025'
stuff │ 41 │ 1 Apr 2025 12:34:05 │ │ Stuff
gtdb │ 42 │ 4 Apr 2025 14:09:38 │ │ GTDB database
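If there are many tags around that time, eyeballing the grep output gets tedious. The rough Python sketch below picks the earliest tag whose timestamp is at or after the chunk's mod time, on the reasoning that a chunk is written shortly before its revision is published. It assumes the column layout shown above, an English locale for the month names, and that both timestamps are in the same timezone. The function names are hypothetical.

```python
from datetime import datetime


def parse_tag_line(line):
    """Split one row of `cvmfs_server tag` output into (name, revision, timestamp)."""
    fields = [field.strip() for field in line.replace("│", "|").split("|")]
    name, revision, stamp = fields[0], int(fields[1]), fields[2]
    return name, revision, datetime.strptime(stamp, "%d %b %Y %H:%M:%S")


def first_tag_at_or_after(lines, mtime):
    """Return the earliest tag published at or after mtime, or None."""
    candidates = []
    for line in lines:
        try:
            tag = parse_tag_line(line)
        except (ValueError, IndexError):
            continue  # header, separator, or otherwise unparseable row
        if tag[2] >= mtime:
            candidates.append(tag)
    return min(candidates, key=lambda tag: tag[2], default=None)
```

Feeding it the two rows above with the chunk's mod time of 4 Apr 2025 14:04 selects the gtdb tag.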
Identify files changed in that revision
The diff subcommand shows me what files were added, removed, or changed:
[user@stratum0 ~]$ cvmfs_server diff -s stuff -d gtdb repo.example.org
d(# regular files): 10
d(# symlinks): 0
d(# directories): 1
d(# catalogs): 1
/ modify directory [link-count, timestamp]
/gtdb add directory +4096 bytes
/gtdb/GTDB add file +33731108154 bytes
/gtdb/GTDB_h add file +13565000049 bytes
/gtdb/GTDB.index add file +2601811970 bytes
/gtdb/GTDB.dbtype add file +4 bytes
/gtdb/GTDB.lookup add file +3621260940 bytes
/gtdb/GTDB.source add file +786640 bytes
/gtdb/GTDB.version add file +28 bytes
/gtdb/GTDB_h.index add file +2559366433 bytes
/gtdb/GTDB_mapping add file +1562822676 bytes
/gtdb/.cvmfscatalog add file +0 bytes
/gtdb/GTDB_h.dbtype add file +4 bytes
/gtdb/GTDB_taxonomy add file +9978672 bytes
Locate the catalog for the offending chunk
Knowing what’s changed, now I can locate the catalog corresponding to those changes:
[user@stratum0 ~]$ cvmfs_server list-catalogs -h repo.example.org
218f6ec14c5f8cf2028d9f419e57157f0dbe8d2d /
├─ ae762dea060a55e6d1d075168bbbf8cd440ca6b2 /gtdb
...
Extract the catalog and verify the filename
With the catalog identified, it needs to be extracted:
[user@stratum0 ~]$ cat /srv/cvmfs/repo.example.org/data/ae/762dea060a55e6d1d075168bbbf8cd440ca6b2C \
| cvmfs_swissknife zpipe -d > /tmp/catalog.sqlite
I can then verify that the chunk belongs to the GTDB file:
[user@stratum0 ~]$ sqlite3 /tmp/catalog.sqlite
SQLite version 3.34.1 2021-01-20 14:10:07
Enter ".help" for usage hints.
sqlite> SELECT c.name
FROM chunks ch
JOIN catalog c
ON ch.md5path_1 = c.md5path_1 AND ch.md5path_2 = c.md5path_2
WHERE ch.hash = X'732104e02468491fa4e61599c176b16c98a052d4';
GTDB
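The same lookup can be scripted, which is handy when there are several blocked chunks to check. This is a small sketch using Python's standard sqlite3 module. It relies only on the table and column names used in the query above, and the function name `files_for_chunk` is made up.

```python
import sqlite3


def files_for_chunk(conn, chunk_hash_hex):
    """Return the names of catalog entries that own the given chunk hash."""
    query = """
        SELECT DISTINCT c.name
        FROM chunks ch
        JOIN catalog c
          ON ch.md5path_1 = c.md5path_1 AND ch.md5path_2 = c.md5path_2
        WHERE ch.hash = ?
        ORDER BY c.name
    """
    # Hashes are stored as raw bytes, hence the X'...' literal in the SQL shell.
    rows = conn.execute(query, (bytes.fromhex(chunk_hash_hex),)).fetchall()
    return [name for (name,) in rows]


# Example use against the extracted catalog (path taken from the step above):
#   conn = sqlite3.connect("/tmp/catalog.sqlite")
#   print(files_for_chunk(conn, "732104e02468491fa4e61599c176b16c98a052d4"))
```

The hash passed in is the full 40-character value: the 73 directory name plus the file name without its trailing P.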
Check file contents
But why stop there? A little digging suggests that the file magic for a Flash file should be ASCII CWS, FWS, or FLV, so let’s confirm:
[user@stratum0 ~]$ cat /srv/cvmfs/repo.example.org/data/73/2104e02468491fa4e61599c176b16c98a052d4P \
| cvmfs_swissknife zpipe -d | head -c 3 ; echo
CWS
And there we are, a Flash application, according to Palo Alto. Viewing the rest of the contents of the chunk also confirms that the data after CWS do not resemble Adobe Flash.
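For checking more than one chunk, the magic test is easy to script. The sketch below assumes each stored object is a plain zlib stream (which is what decompressing with zpipe -d suggests) and uses the three magic strings mentioned above. The helper names are hypothetical.

```python
import zlib

# Magic strings taken from the text above; this is not an exhaustive list.
FLASH_MAGICS = (b"CWS", b"FWS", b"FLV")


def leading_bytes(compressed, count=3):
    """Decompress just enough of a zlib stream to return its first bytes."""
    decompressor = zlib.decompressobj()
    return decompressor.decompress(compressed, count)


def looks_like_flash(compressed):
    """True if the decompressed data starts with one of the Flash magics."""
    return leading_bytes(compressed) in FLASH_MAGICS


# Example use: reading the first few KiB of the object file is enough.
#   with open("/srv/cvmfs/repo.example.org/data/73/"
#             "2104e02468491fa4e61599c176b16c98a052d4P", "rb") as fh:
#       print(looks_like_flash(fh.read(4096)))
```

A match only means the first three bytes coincide with a Flash signature, which is all it took to trip the firewall here.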
[1] I do confess, I let ChatGPT have a crack at it and it gave me some comically wrong answers, including: 1. making up the command cvmfs_find <repo> --catalogs, which it claims is a “helper tool to search metadata, including nested catalogs”, 2. making up the cvmfs_server catalog-chroot subcommand “to mount the catalog structure”, and 3. making up the cvmfs_swissknife cat subcommand to extract catalogs. Oh you wacky LLM and your confidently incorrect assertions. You’d make a great redditor.