/gap/ - /gif/ archival project #3

Torrents of full images from /gif/ - Adult GIF, released monthly. >>>/gif/ is an NSFW 4chan board. Info:
- Contents: porn, random videos, LiveLeak-esque videos, and other videos which were too interesting for weak video sharing sites like YouTube.
- Stats so far: more than half a million files*, totaling more than one terabyte.
- Why? Among other reasons, no one else is archiving /gif/, so I decided to.

Previous: >>1231730

== History ==

/gap/ began in 2022-10. Archive releases from 2022-10 to 2024-03 contain only the full image files, not the threads (plain text). Starting with this thread, 2024-05 included, each release going forward should also contain threads.

== How does it work? ==

Every 24 hours, this happens:
1. Full image links obtained 24 hours ago get downloaded.
2. All /gif/ threads get downloaded in API/JSON format.
3. Full image links get extracted from the JSONs. Goto 1.

*more than half a million video files (gif, webm)

bump, not sure if you have, but it would be good to publish your exact setup and workflow for this, just in case someone else wants to use it for another board, do it for /gif/ themselves, or continue doing it one day if you disappear

>trying to find archive of two specific threads, grab torrents
>it's a zip instead of folders i can pick through properly but whatever
>open one today
>it's a flatfile dump
>no index file, no metadata, filenames are presumably md5
>go to archivedmoe to find the thread to pull the hashes out
>the archive for that is 404 for some reason
I don't want to bitch too much because this is all still better than nothing, but please, I beg: an SQLite file, OR simply putting the thread number in the name.

>>1311696
>flatfile dump
Is that what that's called? In the past I called a similar thing a "simple many-file folder" (which contains zero folders).

>simply putting the thread number in the [file]name.
Sounds like a bad idea. One reason: the same file can appear in multiple threads. It could carry the thread number of only the first thread it showed up in, but still.

More data stuff - unrelated to your post... I run an IPFS node which is consistently online on one computer. I run another IPFS node which is inconsistently/temporarily online on another computer. I have a CID which is only on the temporarily-online one. After recursively providing that CID to the DHT from the temporary one (for hours, probably still doing it), I saw that its storage went from 666 GB total to 669 GB. Conclusion: recursively providing a DAG/CID which you don't have to the DHT seems to make you download it (if running as read-write, which you are almost certainly doing).
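For reference, the provide step was something like this (a sketch assuming a Kubo/go-ipfs node; <cid> is a placeholder for the actual CID, and newer versions spell it "ipfs routing provide" instead of "ipfs dht provide"):
>ipfs dht provide -r <cid>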
>>1310152
is 2024_05 available yet?

>>1310152
Fuckin piece of shit, I found a fault showing that 4chan /gif/ 2024-06 wasn't being downloaded for days. I think I fixed it now. I'll check logs and stuff going forward to make sure it's working correctly.

>>1311996
= the day that I fixed that problem; it has worked fine since then.

>>1310152
What are you using for this? GChan?

>>1310152
Where is the magnet link?

>>1314207
See the previous thread. I didn't share 2024-06 and 2024-05 yet.

>>1313666
No

>>1310155
Post source code plz

>>1314635
>See the previous thread. I didn't share 2024-06 and 2024-05 yet.
But the previous thread is 404 already

Fuck cloudflare captcha

>>1310152
>>1310815
>>1314645
Setup: a GNU/Linux computer; set the .sh files to executable by running "chmod +x file.sh". I use simple/crappy code to download this stuff. My code does not let /gif/ users do remote code execution, because it parses JSON in a way that only treats bytes belonging to the JSON structure as special, never the contents (same thing with the older versions of the code, which parsed HTML instead).

Folder: /path/4changif
Folder: /path/4changif/test
Folder: /path/4changif/threads
- Make sure you have those 3 folders created (replace "/path/4changif" with whatever you have).
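- For example, this one command creates all three (assuming the same base path):
>mkdir -p /path/4changif/{test,threads}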

File: cron.sh
- ipfs://bafkreidzrvqgtjebkj7uqzibqjz6jri3ei3tvsjscf5i2lebejr5s3wgga / https://web.archive.org/web/20240628134551/https://sabrig1480.xyz/FSrXIjIxT1z4ndG9dx018rGq6MGD3cF6BUPNTXs2lu4
- Checks if cron0.sh is running; if not, runs cron0.sh. Specify the full path to "cron0.sh".
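- In case those links die, the logic is roughly the following (a sketch of what the description above says, not the real file; it assumes cron0.sh sits in the base path):
>#!/bin/bash
># start cron0.sh unless a copy of it is already running
>pgrep -f cron0.sh > /dev/null || /path/4changif/cron0.sh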

File: cron0.sh
- ipfs://bafkreiclpya533snasw6crtg4f6dgdoikkuudsgnmvz4uw4s53f572ugvy / https://aralper.xyz/ciw3Aigohm54YTN-EwQO0C3xPOckjS3BjjYCipRjqf0
- Main downloader. Change "basepath1='/path/4changif'" to whatever folder you have set up to download /gif/. When first running it, manually run lines 11 through 37 to kick things off: join lines 11-37 into one command, then run that command. After doing that, the downloads all happen automatically.
- In depth on each line:
1 = Bash shebang.
2 = basepath1 variable.
3 = HDD history.
6 = if statement 1: an 86400-second wait between obtaining links and downloading the files of those same links; runs if above that number.
9 = runs the commands to download files, logs it.
10 = clears the commands to run.
12 = does stuff; gets thread OP numbers from https://a.4cdn.org/gif/catalog.json
14 = threadcount variable: the number of all OPs.
16 = while loop 1, to go over all the OPs.
18 = selects a specific OP (var ii).
20 = debug output.
22 = downloads a thread: https://a.4cdn.org/gif/thread/$ii.json > $basepath1/threads/$ii.json.$now

1/?

>>1314886
24 = imgcount variable: the number of all images in thread ii, calculated as the count of JSON parts from
>jq ".posts[].ext, .posts[].tim, .posts[].md5" | grep -v "^null$"
divided by 3.
26 = filename variable: array of POSIX-time filenames from the middle $imgcount lines of those JSON parts.
28 = ext variable: array of file extensions from the top $imgcount lines of those JSON parts.
30 = md5 variable: array of Base64(MD5) strings from the bottom $imgcount lines of those JSON parts (formatted to "standard" URL-safe strings).
32 = while loop 2: saves the command to download each image into a text file ("TZ=UTC wget -nc https://i.4cdn.org/gif/${filename[$n]}${ext[$n]}" -> "$basepath1/test/${md5[$n]}"; commands accumulate in $basepath1/torun.txt); end of while loop 2.
34-39 = iterate while loop 1, end while loop 1, end if statement 1.
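If you just want the shape of it, here's a condensed sketch of that same logic (NOT cron0.sh itself: it skips the 86400-second wait and the logging, runs torun.txt immediately, and uses a rearranged-but-equivalent jq filter; the [ -e ] no-clobber test stands in for wget -nc):
>#!/bin/bash
>basepath1='/path/4changif'
>now=$(date +%s)
># thread OP numbers from the catalog
>for ii in $(curl -s https://a.4cdn.org/gif/catalog.json | jq '.[].threads[].no'); do
>  curl -s "https://a.4cdn.org/gif/thread/$ii.json" > "$basepath1/threads/$ii.json.$now"
>  # tim = server-assigned POSIX-time filename, ext = extension, md5 = Base64(MD5)
>  jq -r '.posts[] | select(.tim != null) | "\(.tim)\(.ext) \(.md5)"' \
>    "$basepath1/threads/$ii.json.$now" | while read -r file md5; do
>    safe=${md5//\//_}; safe=${safe//+/-}  # make the Base64 string URL-safe
>    # queue a no-clobber download of the image, saved under its md5
>    echo "[ -e $basepath1/test/$safe ] || TZ=UTC wget -q -O $basepath1/test/$safe https://i.4cdn.org/gif/$file" >> "$basepath1/torun.txt"
>  done
>done
>bash "$basepath1/torun.txt"  # execute the queued downloads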

Ignore the "In depth on each line" section if you just want to use the script and don't care about how exactly the code works. You can also replace each case of "/gif/" with "/mlp/" if you want to download another board (one-liner sketch below the magnet). The script skips downloading janny-deleted files, which is good and bad. Bad if it was some harmless video that got deleted because the poster was too based and got his post deleted over politics. There's no web archive of 4chan /gif/ files, so there's nothing to fall back on to check whether a deleted file was harmless. In the /mlp/ example, there is. I don't have anything that specifically records "found deleted", but if you use this script on other boards you can look at cronlog2.txt for 404'd files, then check those against what's saved in desuarchive.org, for example. And since I brought it up, here's a one-terabyte torrent that an anon (not me) downloaded from desuarchive /mlp/ and other captures of /mlp/:
>magnet:?xt=urn:btih:9671fb0855c7931fe98f03f7612c18010fb10121&dn=4chan-mlp&tr=udp%3a%2f%2fopen.stealth.si%3a80%2fannounce&tr=udp%3a%2f%2ftracker.openbittorrent.com%3a6969%2fannounce
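Re: swapping boards, something like this would do it in one go (a sketch; cron0_mlp.sh is a hypothetical name, and you should eyeball the result before running it):
>sed 's|/gif/|/mlp/|g' cron0.sh > cron0_mlp.sh && chmod +x cron0_mlp.sh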

2/?

>>1314887
Run "crontab -e" (NOT as sudo) and put this in there:
>0 * * * * /path/4changif/cron.sh
crontab runs it hourly; cron*.sh only does the daily work once every 24 hours. I guess I could simplify it so crontab just runs it daily instead of hourly->daily, but what I have works to only download daily, so whatever.
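If you want the hourly->daily guard spelled out, the core of it is a timestamp comparison like the following (a sketch; I'm assuming time.txt is where the last run's POSIX time lives, check the real cron0.sh):
>now=$(date +%s)
>last=$(cat /path/4changif/time.txt 2>/dev/null || echo 0)
># only do the daily work if 86400+ seconds have passed since the last run
>if [ $((now - last)) -ge 86400 ]; then
>  echo "$now" > /path/4changif/time.txt
>  # ...daily download work goes here...
>fi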

File: 404.txt
File: addext.sh
File: howto.txt
File: 4chan_gif_2024_03_empty.txt
- see the latest torrent, magnet:?xt=urn:btih:84b2a6b0865a26bac9b7deef0ba63f893d6931c4&dn=4chan_gif_2024_03.zip

File: cronlog2.txt
File: cronlog1.txt
- automatically created, see the latest (4chan_gif_2024_03) for one of those

File: time.txt
File: chkcmd.txt
File: torun.txt
- automatically created

3/3 for now.

>>1310152
>>1314207
>>1314635
Hey OP, not to be mean or anything, but why did you create this thread? Just post the fucking magnet link; the previous thread is long gone from the archive.