Last spring at work, I was tasked with archiving all the digital content made by the Association pour une Solidarité Syndicale Étudiante (ASSÉ), a student union federation that was scheduled to shut down six months later.
Now that it's done, I have to say it was quite a demanding task: ASSÉ was founded in 2001 and had neither proper digital archiving policies nor good web practices in general.
The goal was not only to archive those web resources, but also to make sure they would remain easily accessible online. I thus decided to create a meta-site regrouping and presenting all of them.
All in all, I archived:
- a Facebook page
- a Twitter account
- a YouTube account
- multiple ephemeral websites
- old handcrafted PHP4 websites that I had to partially re-write
- a few crummy Wordpress sites
- 2 old phpBB forums running PHP3
- a large Mailman2 mailing list
- a large Yahoo! Group mailing list
Here are the three biggest challenges I faced during this project:
The Twitter API has stupid limitations
The Twitter API won't give you more than an account's last 3200 tweets. When you need to automate the retrieval of more than 5500 tweets, you know you're entering a world of pain.
Long story short, I ended up writing the crummy shell script below to parse each tweet's mobile HTML page, statify all the Twitter links and push the result to a Wordpress site using Ozh' Tweet Archive Theme. The URL list itself was generated with ArchiveTeam's web crawler.
Of course, once that was done I turned the Wordpress instance into a static website. I personally think the result looks purty.
Here's the shell script I wrote - displayed here for archival purposes only. Let's pray I don't ever have to do this again. Please don't run this, as it might delete your grandma.
#!/bin/bash

cat "$1" | while read -r line
do
# get the ID
id=$(echo $line | sed 's@https://mobile.twitter.com/.\+/status/@@')
# download the whole HTML page
html=$(curl -s $line)
# get the date
date=$(echo "$html" | grep -A 1 '<div class="metadata">' | grep -o "[0-9].\+20[0-9][0-9]" | sed 's/ - //' | date -f - +"%F %k:%M:%S")
# extract the tweet
tweet=$(echo "$html" | grep '<div class="dir-ltr" dir="ltr">')
# we strip the HTML tags for the title
title=$(echo $tweet | sed -e 's/<[^>]*>//g')
# get CSV list of tags
tags=$(echo "$tweet" | grep -io "\#[a-z]\+" | sed ':a;N;$!ba;s/\n/,/g')
# get a CSV list of links
links=$(echo "$tweet" | grep -Po "title=\"http.*?>" | sed 's/title=\"//; s/">//' | sed ':a;N;$!ba;s/\n/,/g')
# get a CSV list of usernames
usernames=$(echo "$tweet" | grep -Po ">\@.*?<" | sed 's/>//; s/<//' | sed ':a;N;$!ba;s/\n/,/g')
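# get the image link, if the tweet has one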
image_link=$(echo "$html" | grep "<img src=\"https://pbs.twimg.com/media/" | sed 's/:small//')
# remove twitter cruft
tweet=$(echo $tweet | sed 's/<div class="dir-ltr" dir="ltr"> /<p>/' | perl -pe 's@<a href="/hashtag.*?dir="ltr">@<span class="hashtag hashtag_local">@g')
# expand links
if [ -n "$links" ]
then
IFS=',' read -ra link <<< "$links"
for i in "${link[@]}"
do
tweet=$(echo $tweet | perl -pe "s@<a href=\"*.+?rel=\"nofollow noopener\"dir=\"ltr\"data-*.+?</a>@<a href='$i'>$i</a>@")
done
fi
# replace hashtags by links
if [ -n "$tags" ]
then
IFS=',' read -ra tag <<< "$tags"
for i in "${tag[@]}"
do
plain=$(echo $i | sed -e 's/#//')
tweet=$(echo $tweet | sed -e "s@$i@#<a href=\"https://oiseau.asse-solidarite.qc.ca/index.php/tag/$plain\">$plain@")
done
fi
# replace usernames by links
if [ -n "$usernames" ]
then
IFS=',' read -ra username <<< "$usernames"
for i in "${username[@]}"
do
plain=$(echo $i | sed -e 's/\@//')
tweet=$(echo $tweet | perl -pe "s@<a href=\"/$plain.*?</a>@<span class=\"username username_linked\">\@<a href=\"https://twitter.com/$plain\">$plain</a></span>@i")
done
fi
# replace images
tweet=$(echo $tweet | perl -pe "s@<a href=\"http://t.co*.+?data-pre-embedded*.+?</a>@<span class=\"embed_image embed_image_yes\">$image_link</span>@")
echo $tweet | sudo -u twitter wp-cli post create - --post_title="$title" --post_status="publish" --tags_input="$tags" --post_date="$date" > tmp
post_id=$(grep -Po "[0-9]{4}" tmp)
sudo -u twitter wp-cli post meta add $post_id ozh_ta_id $id
echo "$post_id created"
rm tmp
done
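For what it's worth, the script expects a text file with one mobile.twitter.com status URL per line (the list the crawler produced) and has to run on the Wordpress host as a user that can sudo to the local twitter user wp-cli runs as. The script and file names below are made up for the sake of the example:
bash archive-tweets.sh tweet-urls.txt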
Does anyone ever update phpBBs?
What's worse than a phpBB forum? Two phpBB 2.0.x forums using PHP3 and last updated in 2006.
I had to resort to unholy methods just to get those things running again so I could wget the crap out of them.
By the way, the magic wget command to grab a whole website looks like this:
wget --mirror -e robots=off --page-requisites --adjust-extension -nv --base=./ --convert-links --directory-prefix=./ -H -D www.foo.org,foo.org http://www.foo.org/
Depending on the website you are trying to archive, you might have to play with other obscure parameters. I sure had to. All the credit for that command goes to Koumbit's wiki page on the dark arts of website statification.
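For the record, the obscure parameters I end up playing with most often are the politeness and filtering knobs. Something along these lines, where the values and the user agent string are arbitrary examples:
# the same mirroring options as above, but slower and pickier:
#   --wait / --random-wait   don't hammer the server
#   --user-agent             some sites block wget's default user agent
#   --reject                 skip file types you don't care about
wget --mirror -e robots=off --page-requisites --adjust-extension -nv \
     --base=./ --convert-links --directory-prefix=./ -H -D www.foo.org,foo.org \
     --wait=1 --random-wait --user-agent="Mozilla/5.0 (X11; Linux x86_64)" \
     --reject "*.iso,*.zip" http://www.foo.org/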
Archiving mailing lists
mailman2 is pretty great. You can get a dump of an email list pretty easily, and mailman3's web frontend, the lovely hyperkitty, is well, lovely.

Importing a legacy mailman2 mbox went without a hitch thanks to the awesome hyperkitty_import importer. Kudos to the Debian Mailman Team for packaging this in Debian for us.
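For reference, on Debian the whole import boils down to pointing the hyperkitty_import management command at the legacy mbox. The exact django-admin incantation depends on how mailman3-web is set up on your machine, and the list address and paths below are only examples:
sudo -u www-data django-admin hyperkitty_import \
    --pythonpath /usr/share/mailman3-web --settings settings \
    -l mylist@lists.example.org /path/to/mylist.mbox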
But what about cramming a Yahoo! Group mailing list into hyperkitty? I wouldn't recommend it. After way too many hours spent battling character encoding errors, I just decided that people who wanted to read obscure emails from 2003 would have to deal with broken accents and shit. But hey, it kinda works!
Oh, and yes, archiving a Yahoo! Group with an old broken Perl script wasn't an easy task either. Hell, I kept getting blacklisted by Yahoo! for scraping more data than they liked. I ended up patching together the results of multiple runs over a few weeks to get the full mbox and attachments.
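In case you are wondering what patching together looked like in practice: each run gave me a partial mbox, so the end result was not much smarter than concatenating the runs in chronological order and importing the whole thing. I'm assuming here that hyperkitty_import skips messages it has already archived; if yours doesn't, deduplicate on Message-ID first. The file names are made up:
# glue the partial runs together, oldest first
cat yahoo-run-*.mbox > yahoo-group-full.mbox

# rough sanity check: count the messages in the resulting mbox
grep -c '^From ' yahoo-group-full.mbox

# then feed yahoo-group-full.mbox to hyperkitty_import as shown above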
By the way, if anyone knows how to tell hyperkitty to stop at a certain year (i.e. not display links for 2019 when the list stopped in 2006), please ping me.