Archiving 20 years of online content

2019-09-19 - Louis-Philippe Véronneau

Last spring at work, I was tasked with archiving all of the digital content made by the Association pour une Solidarité Syndicale Étudiante (ASSÉ), a student union federation that was scheduled to shut down six months later.

Now that I've done it, I have to say it was quite a demanding task: ASSÉ was founded in 2001 and had neither proper digital archiving policies nor good web practices in general.

The goal was not only to archive those web resources, but also to make sure they stayed easily accessible online. I thus decided to create a meta-site regrouping and presenting all of them.

All in all, I archived:

  • a Facebook page
  • a Twitter account
  • a YouTube account
  • multiple ephemeral websites
  • old handcrafted PHP4 websites that I had to partially re-write
  • a few crummy Wordpress sites
  • two old phpBB forums using PHP3
  • a large Mailman2 mailing list
  • a large Yahoo! Group mailing list

Here are the three biggest challenges I faced during this project:

The Twitter API has stupid limitations

The Twitter API won't give you more than an account's last 3,200 tweets. When you need to automate the retrieval of more than 5,500 of them, you know you're entering a world of pain.

Long story short, I ended up writing this crummy shell script to parse the HTML, statify all the Twitter links and push the resulting code to a Wordpress site using Ozh' Tweet Archive Theme. The URL list was generated using ArchiveTeam's web crawler.

Of course, once that was done, I turned the Wordpress instance into a static website. I personally think the result looks purty.

Here's the shell script I wrote - displayed here for archival purposes only. Let's pray I don't ever have to do this again. Please don't run this, as it might delete your grandma.

#!/bin/bash
# Usage: ./tweets-to-wp.sh url-list.txt
# Each line of the input file is the URL of a single tweet.
cat "$1" | while read -r line; do
  # get the tweet ID from the URL
  id=$(echo "$line" | sed 's@.\+/status/@@')
  # download the whole HTML page
  html=$(curl -s "$line")
  # get the date
  date=$(echo "$html" | grep -A 1 '<div class="metadata">' | grep -o "[0-9].\+20[0-9][0-9]" | sed 's/ - //' | date -f - +"%F %k:%M:%S")
  # extract the tweet
  tweet=$(echo "$html" | grep '<div class="dir-ltr" dir="ltr">')
  # strip the HTML tags for the title
  title=$(echo "$tweet" | sed -e 's/<[^>]*>//g')
  # get a CSV list of tags
  tags=$(echo "$tweet" | grep -io "\#[a-z]\+" | sed ':a;N;$!ba;s/\n/,/g')
  # get a CSV list of links
  links=$(echo "$tweet" | grep -Po "title=\"http.*?>" | sed 's/title=\"//; s/">//' | sed ':a;N;$!ba;s/\n/,/g')
  # get a CSV list of usernames
  usernames=$(echo "$tweet" | grep -Po ">\@.*?<" | sed 's/>//; s/<//' | sed ':a;N;$!ba;s/\n/,/g')
  # get the attached image, if any
  image_link=$(echo "$html" | grep "<img src=\"" | sed 's/:small//')

  # remove twitter cruft
  tweet=$(echo "$tweet" | sed 's/<div class="dir-ltr" dir="ltr"> /<p>/' | perl -pe 's@<a href="/hashtag.*?dir="ltr">@<span class="hashtag hashtag_local">@g')

  # expand links
  if [ -n "$links" ]; then
    IFS=',' read -ra link <<< "$links"
    for i in "${link[@]}"; do
      tweet=$(echo "$tweet" | perl -pe "s@<a href=\"*.+?rel=\"nofollow noopener\"dir=\"ltr\"data-*.+?</a>@<a href='$i'>$i</a>@")
    done
  fi

  # replace hashtags by links
  if [ -n "$tags" ]; then
    IFS=',' read -ra tag <<< "$tags"
    for i in "${tag[@]}"; do
      plain=$(echo "$i" | sed -e 's/#//')
      tweet=$(echo "$tweet" | sed -e "s@$i@#<a href=\"$plain\">$plain</a>@")
    done
  fi

  # replace usernames by links
  if [ -n "$usernames" ]; then
    IFS=',' read -ra username <<< "$usernames"
    for i in "${username[@]}"; do
      plain=$(echo "$i" | sed -e 's/\@//')
      tweet=$(echo "$tweet" | perl -pe "s@<a href=\"/$plain.*?</a>@<span class=\"username username_linked\">\@<a href=\"$plain\">$plain</a></span>@i")
    done
  fi

  # replace images
  tweet=$(echo "$tweet" | perl -pe "s@<a href=\"*.+?data-pre-embedded*.+?</a>@<span class=\"embed_image embed_image_yes\">$image_link</span>@")

  # create the Wordpress post and link it back to the original tweet ID
  echo "$tweet" | sudo -u twitter wp-cli post create - --post_title="$title" --post_status="publish" --tags_input="$tags" --post_date="$date" > tmp
  post_id=$(grep -Po "[0-9]{4}" tmp)
  sudo -u twitter wp-cli post meta add "$post_id" ozh_ta_id "$id"
  echo "$post_id created"
  rm tmp
done

Does anyone ever update phpBBs?

What's worse than a phpBB forum? Two phpBB 2.0.x forums using PHP3 and last updated in 2006.

I had to resort to unholy methods just to get those things running again so I could wget the crap out of them.

By the way, the magic wget command to grab a whole website looks like this:

wget --mirror -e robots=off --page-requisites --adjust-extension -nv --base=./ --convert-links --directory-prefix=./ -H -D <domain-list> <url>

Here, <domain-list> is a comma-separated list of the domains wget is allowed to span to (that's what -H is for) and <url> is the entry point of the site.

Depending on the website you are trying to archive, you might have to play with other obscure parameters. I sure had to. All the credit for that command goes to Koumbit's wiki page on the dark arts of website statification.

Archiving mailing lists

mailman2 is pretty great. You can get a dump of an email list pretty easily, and mailman3's web frontend, the lovely hyperkitty, is, well, lovely. Importing a legacy mailman2 mbox went without a hitch thanks to the awesome hyperkitty_import importer. Kudos to the Debian Mailman Team for packaging this in Debian for us.
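For the record, the import boils down to a single Django management command. The list address and paths below are made up, and the exact wrapper and user depend on how mailman3 is deployed on your system:

```shell
# Hypothetical invocation -- adjust the wrapper, user and paths to your setup.
# hyperkitty_import reads a mailman2 mbox and imports it into the
# hyperkitty archive of the given list.
sudo -u www-data django-admin hyperkitty_import \
    -l discussion@lists.example.org \
    /var/lib/mailman/archives/private/discussion.mbox/discussion.mbox
```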

But what about cramming a Yahoo! Group mailing list into hyperkitty? I wouldn't recommend it. After way too many hours spent battling character encoding errors, I just decided that people who wanted to read obscure emails from 2003 would have to deal with broken accents and shit. But hey, it kinda works!
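To give an idea of what that battle looks like, here is a minimal sketch (the function name is made up, and the real import needed many more per-message fixes) of the kind of best-effort header decoding involved:

```python
# Best-effort decoding of MIME-encoded mail headers: replace bytes that
# don't decode cleanly instead of crashing on them.
import email.header

def decode_header_best_effort(value):
    """Decode a possibly MIME-encoded header into a unicode string."""
    parts = []
    for chunk, charset in email.header.decode_header(value):
        if isinstance(chunk, bytes):
            try:
                # fall back to latin-1 and replace undecodable bytes
                parts.append(chunk.decode(charset or "latin-1", errors="replace"))
            except LookupError:
                # bogus charset names do show up in old archives
                parts.append(chunk.decode("latin-1", errors="replace"))
        else:
            parts.append(chunk)
    return "".join(parts)
```

The point is simply to never let one mangled header abort a whole import run.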

Oh, and yes, archiving a Yahoo! Group with an old borken Perl script wasn't an easy task. Hell, I kept getting blacklisted by Yahoo! for scraping more data than they liked. I ended up patching together the results of multiple runs over a few weeks to get the full mbox and attachments.
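Patching runs together is mostly a deduplication problem: each scrape overlaps the previous one, so you merge them and keep one copy of each Message-ID. A sketch of the idea (function name and file layout are made up, not my actual code):

```python
# Merge several partial mbox scrapes into one, deduplicating on Message-ID.
import mailbox

def merge_mboxes(output_path, input_paths):
    """Merge mbox files, keeping one copy of each Message-ID."""
    seen = set()
    merged = mailbox.mbox(output_path)
    try:
        for path in input_paths:
            for msg in mailbox.mbox(path):
                msg_id = msg.get("Message-ID")
                if msg_id in seen:
                    continue  # already copied from an earlier run
                if msg_id is not None:
                    seen.add(msg_id)
                merged.add(msg)
    finally:
        merged.flush()
        merged.close()
```

Messages without a Message-ID are kept unconditionally, since there is no safe way to tell them apart.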

By the way, if anyone knows how to tell hyperkitty to stop at a certain year (i.e. not display links for 2019 when the list stopped in 2006), please ping me.