<p>Archiving 20 years of online content, by Louis-Philippe Véronneau (https://veronneau.org/), 2019-09-19.</p>
<p>Last Spring at work I was tasked with archiving all of the digital content made
by <em>Association pour une Solidarité Syndicale Étudiante (ASSÉ)</em>, a student union
federation that was slated to shut down six months later.</p>
<p>Now that I've done it, I have to say it was quite a demanding task: ASSÉ was
founded in 2001 and had neither proper digital archiving policies nor good web
practices in general.</p>
<p>The goal was not only to archive those web resources, but also to make sure they
stayed easily accessible online. I thus decided to create <a href="https://asse-solidarite.qc.ca">a meta site</a>
bringing together and presenting all of them.</p>
<p>All in all, I archived:</p>
<ul>
<li>a Facebook page</li>
<li>a Twitter account</li>
<li>a YouTube account</li>
<li>multiple ephemeral websites</li>
<li>old handcrafted PHP4 websites that I had to partially re-write</li>
<li>a few crummy WordPress sites</li>
<li>two old phpBB forums using PHP3</li>
<li>a large Mailman2 mailing list</li>
<li>a large Yahoo! Group mailing list</li>
</ul>
<p>Here are the three biggest challenges I faced during this project:</p>
<h2>The Twitter API has stupid limitations</h2>
<p>The Twitter API won't give you more than an account's last 3,200 posts. When you
need to automate the retrieval of more than 5500 tweets, you know you're
entering a world of pain.</p>
<p>Long story short, I ended up writing this <em>crummy</em> shell script to parse the
HTML, statify all the Twitter links and push the resulting code to a WordPress
site using <a href="https://github.com/ozh/ozh-tweet-archive-theme">Ozh' Tweet Archive Theme</a>. The URL list was generated
using the ArchiveTeam's web crawler.</p>
<p>Of course, once done I turned the WordPress site into a static website. I personally
think the result <a href="https://oiseau.asse-solidarite.qc.ca/">looks purty</a>.</p>
<p>Here's the shell script I wrote - displayed here for archival purposes only.
Let's pray I don't ever have to do this again. Please don't run this, as it
might delete your grandma.</p>
<div class="highlight"><pre><span></span><code>cat<span class="w"> </span><span class="nv">$1</span><span class="w"> </span><span class="p">|</span><span class="w"> </span><span class="k">while</span><span class="w"> </span><span class="nb">read</span><span class="w"> </span>line
<span class="k">do</span>
<span class="w"> </span><span class="c1"># get the ID</span>
<span class="w"> </span><span class="nv">id</span><span class="o">=</span><span class="k">$(</span><span class="nb">echo</span><span class="w"> </span><span class="nv">$line</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>sed<span class="w"> </span><span class="s1">'s@https://mobile.twitter.com/.\+/status/@@'</span><span class="k">)</span>
<span class="w"> </span><span class="c1"># download the whole HTML page</span>
<span class="w"> </span><span class="nv">html</span><span class="o">=</span><span class="k">$(</span>curl<span class="w"> </span>-s<span class="w"> </span><span class="nv">$line</span><span class="k">)</span>
<span class="w"> </span><span class="c1"># get the date</span>
<span class="w"> </span><span class="nv">date</span><span class="o">=</span><span class="k">$(</span><span class="nb">echo</span><span class="w"> </span><span class="s2">"</span><span class="nv">$html</span><span class="s2">"</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>grep<span class="w"> </span>-A<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="s1">'<div class="metadata">'</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>grep<span class="w"> </span>-o<span class="w"> </span><span class="s2">"[0-9].\+20[0-9][0-9]"</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>sed<span class="w"> </span><span class="s1">'s/ - //'</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>date<span class="w"> </span>-f<span class="w"> </span>-<span class="w"> </span>+<span class="s2">"%F %k:%M:%S"</span><span class="k">)</span>
<span class="w"> </span><span class="c1"># extract the tweet</span>
<span class="w"> </span><span class="nv">tweet</span><span class="o">=</span><span class="k">$(</span><span class="nb">echo</span><span class="w"> </span><span class="s2">"</span><span class="nv">$html</span><span class="s2">"</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>grep<span class="w"> </span><span class="s1">'<div class="dir-ltr" dir="ltr">'</span><span class="k">)</span>
<span class="w"> </span><span class="c1"># we strip the HTML tags for the title</span>
<span class="w"> </span><span class="nv">title</span><span class="o">=</span><span class="k">$(</span><span class="nb">echo</span><span class="w"> </span><span class="nv">$tweet</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>sed<span class="w"> </span>-e<span class="w"> </span><span class="s1">'s/<[^>]*>//g'</span><span class="k">)</span>
<span class="w"> </span><span class="c1"># get CSV list of tags</span>
<span class="w"> </span><span class="nv">tags</span><span class="o">=</span><span class="k">$(</span><span class="nb">echo</span><span class="w"> </span><span class="s2">"</span><span class="nv">$tweet</span><span class="s2">"</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>grep<span class="w"> </span>-io<span class="w"> </span><span class="s2">"\#[a-z]\+"</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>sed<span class="w"> </span><span class="s1">':a;N;$!ba;s/\n/,/g'</span><span class="k">)</span>
<span class="w"> </span><span class="c1"># get a CSV list of links</span>
<span class="w"> </span><span class="nv">links</span><span class="o">=</span><span class="k">$(</span><span class="nb">echo</span><span class="w"> </span><span class="s2">"</span><span class="nv">$tweet</span><span class="s2">"</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>grep<span class="w"> </span>-Po<span class="w"> </span><span class="s2">"title=\"http.*?>"</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>sed<span class="w"> </span><span class="s1">'s/title=\"//; s/">//'</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>sed<span class="w"> </span><span class="s1">':a;N;$!ba;s/\n/,/g'</span><span class="k">)</span>
<span class="w"> </span><span class="c1"># get a CSV list of usernames</span>
<span class="w"> </span><span class="nv">usernames</span><span class="o">=</span><span class="k">$(</span><span class="nb">echo</span><span class="w"> </span><span class="s2">"</span><span class="nv">$tweet</span><span class="s2">"</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>grep<span class="w"> </span>-Po<span class="w"> </span><span class="s2">">\@.*?<"</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>sed<span class="w"> </span><span class="s1">'s/>//; s/<//'</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>sed<span class="w"> </span><span class="s1">':a;N;$!ba;s/\n/,/g'</span><span class="k">)</span>
<span class="w"> </span><span class="nv">image_link</span><span class="o">=</span><span class="k">$(</span><span class="nb">echo</span><span class="w"> </span><span class="s2">"</span><span class="nv">$html</span><span class="s2">"</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>grep<span class="w"> </span><span class="s2">"<img src=\"https://pbs.twimg.com/media/"</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>sed<span class="w"> </span><span class="s1">'s/:small//'</span><span class="k">)</span>
<span class="w"> </span><span class="c1"># remove twitter cruft</span>
<span class="w"> </span><span class="nv">tweet</span><span class="o">=</span><span class="k">$(</span><span class="nb">echo</span><span class="w"> </span><span class="nv">$tweet</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>sed<span class="w"> </span><span class="s1">'s/<div class="dir-ltr" dir="ltr"> /<p>/'</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>perl<span class="w"> </span>-pe<span class="w"> </span><span class="s1">'s@<a href="/hashtag.*?dir="ltr">@<span class="hashtag hashtag_local">@g'</span><span class="k">)</span>
<span class="w"> </span><span class="c1"># expand links</span>
<span class="k">if</span><span class="w"> </span><span class="o">[</span><span class="w"> </span>!<span class="w"> </span>-z<span class="w"> </span><span class="nv">$links</span><span class="w"> </span><span class="o">]</span>
<span class="k">then</span>
<span class="w"> </span><span class="nv">IFS</span><span class="o">=</span><span class="s1">','</span><span class="w"> </span><span class="nb">read</span><span class="w"> </span>-ra<span class="w"> </span>link<span class="w"> </span><span class="o"><<<</span><span class="w"> </span><span class="s2">"</span><span class="nv">$links</span><span class="s2">"</span>
<span class="w"> </span><span class="k">for</span><span class="w"> </span>i<span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="s2">"</span><span class="si">${</span><span class="nv">link</span><span class="p">[@]</span><span class="si">}</span><span class="s2">"</span>
<span class="w"> </span><span class="k">do</span>
<span class="w"> </span><span class="nv">tweet</span><span class="o">=</span><span class="k">$(</span><span class="nb">echo</span><span class="w"> </span><span class="nv">$tweet</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>perl<span class="w"> </span>-pe<span class="w"> </span><span class="s2">"s@<a href=\"*.+?rel=\"nofollow noopener\"dir=\"ltr\"data-*.+?</a>@<a href='</span><span class="nv">$i</span><span class="s2">'></span><span class="nv">$i</span><span class="s2"></a>@"</span><span class="k">)</span>
<span class="w"> </span><span class="k">done</span>
<span class="w"> </span><span class="k">fi</span>
<span class="w"> </span><span class="c1"># replace hashtags by links</span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="o">[</span><span class="w"> </span>!<span class="w"> </span>-z<span class="w"> </span><span class="nv">$tags</span><span class="w"> </span><span class="o">]</span>
<span class="w"> </span><span class="k">then</span>
<span class="w"> </span><span class="nv">IFS</span><span class="o">=</span><span class="s1">','</span><span class="w"> </span><span class="nb">read</span><span class="w"> </span>-ra<span class="w"> </span>tag<span class="w"> </span><span class="o"><<<</span><span class="w"> </span><span class="s2">"</span><span class="nv">$tags</span><span class="s2">"</span>
<span class="w"> </span><span class="k">for</span><span class="w"> </span>i<span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="s2">"</span><span class="si">${</span><span class="nv">tag</span><span class="p">[@]</span><span class="si">}</span><span class="s2">"</span>
<span class="w"> </span><span class="k">do</span>
<span class="w"> </span><span class="nv">plain</span><span class="o">=</span><span class="k">$(</span><span class="nb">echo</span><span class="w"> </span><span class="nv">$i</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>sed<span class="w"> </span>-e<span class="w"> </span><span class="s1">'s/#//'</span><span class="k">)</span>
<span class="w"> </span><span class="nv">tweet</span><span class="o">=</span><span class="k">$(</span><span class="nb">echo</span><span class="w"> </span><span class="nv">$tweet</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>sed<span class="w"> </span>-e<span class="w"> </span><span class="s2">"s@</span><span class="nv">$i</span><span class="s2">@#<a href=\"https://oiseau.asse-solidarite.qc.ca/index.php/tag/</span><span class="nv">$plain</span><span class="s2">\"></span><span class="nv">$plain</span><span class="s2">@"</span><span class="k">)</span>
<span class="w"> </span><span class="k">done</span>
<span class="w"> </span><span class="k">fi</span>
<span class="w"> </span><span class="c1"># replace usernames by links</span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="o">[</span><span class="w"> </span>!<span class="w"> </span>-z<span class="w"> </span><span class="nv">$usernames</span><span class="w"> </span><span class="o">]</span>
<span class="w"> </span><span class="k">then</span>
<span class="w"> </span><span class="nv">IFS</span><span class="o">=</span><span class="s1">','</span><span class="w"> </span><span class="nb">read</span><span class="w"> </span>-ra<span class="w"> </span>username<span class="w"> </span><span class="o"><<<</span><span class="w"> </span><span class="s2">"</span><span class="nv">$usernames</span><span class="s2">"</span>
<span class="w"> </span><span class="k">for</span><span class="w"> </span>i<span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="s2">"</span><span class="si">${</span><span class="nv">username</span><span class="p">[@]</span><span class="si">}</span><span class="s2">"</span>
<span class="w"> </span><span class="k">do</span>
<span class="w"> </span><span class="nv">plain</span><span class="o">=</span><span class="k">$(</span><span class="nb">echo</span><span class="w"> </span><span class="nv">$i</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>sed<span class="w"> </span>-e<span class="w"> </span><span class="s1">'s/\@//'</span><span class="k">)</span>
<span class="w"> </span><span class="nv">tweet</span><span class="o">=</span><span class="k">$(</span><span class="nb">echo</span><span class="w"> </span><span class="nv">$tweet</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>perl<span class="w"> </span>-pe<span class="w"> </span><span class="s2">"s@<a href=\"/</span><span class="nv">$plain</span><span class="s2">.*?</a>@<span class=\"username username_linked\">\@<a href=\"https://twitter.com/</span><span class="nv">$plain</span><span class="s2">\"></span><span class="nv">$plain</span><span class="s2"></a></span>@i"</span><span class="k">)</span>
<span class="w"> </span><span class="k">done</span>
<span class="w"> </span><span class="k">fi</span>
<span class="w"> </span><span class="c1"># replace images</span>
<span class="w"> </span><span class="nv">tweet</span><span class="o">=</span><span class="k">$(</span><span class="nb">echo</span><span class="w"> </span><span class="nv">$tweet</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>perl<span class="w"> </span>-pe<span class="w"> </span><span class="s2">"s@<a href=\"http://t.co*.+?data-pre-embedded*.+?</a>@<span class=\"embed_image embed_image_yes\"></span><span class="nv">$image_link</span><span class="s2"></span>@"</span><span class="k">)</span>
<span class="nb">echo</span><span class="w"> </span><span class="nv">$tweet</span><span class="w"> </span><span class="p">|</span><span class="w"> </span>sudo<span class="w"> </span>-u<span class="w"> </span>twitter<span class="w"> </span>wp-cli<span class="w"> </span>post<span class="w"> </span>create<span class="w"> </span>-<span class="w"> </span>--post_title<span class="o">=</span><span class="s2">"</span><span class="nv">$title</span><span class="s2">"</span><span class="w"> </span>--post_status<span class="o">=</span><span class="s2">"publish"</span><span class="w"> </span>--tags_input<span class="o">=</span><span class="s2">"</span><span class="nv">$tag</span><span class="s2">"</span><span class="w"> </span>--post_date<span class="o">=</span><span class="s2">"</span><span class="nv">$date</span><span class="s2">"</span><span class="w"> </span>><span class="w"> </span>tmp
<span class="nv">post_id</span><span class="o">=</span><span class="k">$(</span>grep<span class="w"> </span>-Po<span class="w"> </span><span class="s2">"[0-9]{4}"</span><span class="w"> </span>tmp<span class="k">)</span>
sudo<span class="w"> </span>-u<span class="w"> </span>twitter<span class="w"> </span>wp-cli<span class="w"> </span>post<span class="w"> </span>meta<span class="w"> </span>add<span class="w"> </span><span class="nv">$post_id</span><span class="w"> </span>ozh_ta_id<span class="w"> </span><span class="nv">$id</span>
<span class="nb">echo</span><span class="w"> </span><span class="s2">"</span><span class="nv">$post_id</span><span class="s2"> created"</span>
rm<span class="w"> </span>tmp
<span class="k">done</span>
</code></pre></div>
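<p>If you would rather poke at the idioms than run the whole thing, here is a minimal sketch that isolates two of them: stripping the URL prefix to get the tweet ID, and the <code>sed ':a;N;$!ba;s/\n/,/g'</code> loop that joins lines into a comma-separated list. The sample URL and tweet text are made up for illustration, and GNU <code>sed</code> and <code>grep</code> are assumed, since <code>\+</code> is a GNU extension.</p>

```shell
# Hypothetical inputs, not taken from the real archive.
url='https://mobile.twitter.com/example/status/1234567890'
text='Manif demain #greve #asse'

# Drop everything up to and including "/status/" to keep only the ID.
id=$(printf '%s\n' "$url" | sed 's@https://mobile.twitter.com/.\+/status/@@')

# grep -o prints one hashtag per line; the sed loop appends every following
# line to the pattern space (N), then turns the embedded newlines into commas.
tags=$(printf '%s\n' "$text" | grep -io '#[a-z]\+' | sed ':a;N;$!ba;s/\n/,/g')

echo "$id"
echo "$tags"
```

<p>Nothing here touches the network or WordPress, so unlike the script above it should be safe for your grandma.</p>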
<h2>Does anyone ever update phpBBs?</h2>
<p>What's worse than a phpBB forum? Two phpBB 2.0.x forums using PHP3 and last
updated in 2006.</p>
<p>I had to resort to unholy methods just to get those things running again so
I could <code>wget</code> the crap out of them.</p>
<p>By the way, the magic <code>wget</code> command to grab a whole website looks like this:</p>
<pre>
wget --mirror -e robots=off --page-requisites --adjust-extension -nv --base=./ --convert-links --directory-prefix=./ -H -D www.foo.org,foo.org http://www.foo.org/
</pre>
<p>Depending on the website you are trying to archive, you might have to play with
other obscure parameters. I sure had to. All the credit for that command goes
to <a href="https://wiki.koumbit.net/Fossilisation#Prendre_une_copie_statique_du_site">Koumbit's wiki page on the dark arts of website statification</a>.</p>
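<p>To make the flags in that command a bit less magic, here is a rough sketch that wraps it in a small shell function with a comment per option. The function name <code>mirror_site</code> and the <code>MIRROR_RUN</code> switch are my own inventions for illustration. By default it only prints the command it would run.</p>

```shell
# mirror_site HOST: print the mirror command for HOST, or actually run it
# when MIRROR_RUN=1 is set. Both names are hypothetical.
mirror_site() {
    host=$1
    bare=${host#www.}   # also allow the domain without the "www." prefix

    # --mirror            recursive download with timestamping
    # -e robots=off       ignore robots.txt
    # --page-requisites   also fetch the images, CSS and JS each page needs
    # --adjust-extension  append .html or .css where the content type calls for it
    # -nv                 less verbose output
    # --convert-links     rewrite links so the local copy can be browsed offline
    # -H -D list          span hosts, but only those in the comma-separated list
    # --base and --directory-prefix are kept exactly as in the command above
    set -- --mirror -e robots=off --page-requisites --adjust-extension -nv \
        --base=./ --convert-links --directory-prefix=./ \
        -H -D "$host,$bare" "http://$host/"

    if [ "${MIRROR_RUN:-0}" = 1 ]; then
        wget "$@"
    else
        echo "wget $*"
    fi
}

mirror_site www.foo.org
```

<p>Printing first and running second is a deliberate choice: a recursive <code>wget</code> with <code>robots=off</code> is not something you want to fire at the wrong host by accident.</p>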
<h2>Archiving mailing lists</h2>
<p>Mailman 2 is pretty great: you can get a dump of a mailing list pretty easily, and
Mailman 3's web frontend, the lovely <a href="https://gitlab.com/mailman/hyperkitty">hyperkitty</a>, is, well, lovely.
Importing a legacy Mailman 2 <code>mbox</code> went without a hitch thanks to the awesome
<code>hyperkitty_import</code> importer. Kudos to the Debian Mailman Team for packaging
this in Debian for us.</p>
<p>But what about cramming a Yahoo! Group mailing list in hyperkitty? I wouldn't
recommend it. After way too many hours spent battling character encoding errors
I just decided people who wanted to read obscure emails from 2003 would have
to deal with broken accents and shit. <a href="https://support.asse-solidarite.qc.ca/list/asse-support@groupesyahoo.ca/">But hey, it kinda works!</a></p>
<p>Oh, and yes, archiving a Yahoo! Group with <a href="https://sourceforge.net/projects/grabyahoogroup/">an old borken Perl script</a>
wasn't an easy task. Hell, I kept getting blacklisted by Yahoo! for scraping
more data than they liked. I ended up patching together the results of
multiple runs over a few weeks to get the full mbox and attachments.</p>
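<p>For the curious, one way such a patch-up could be done is to concatenate the <code>mbox</code> files from each run and drop messages whose <code>Message-ID</code> has already been seen. The sketch below is an illustration under that assumption, not a record of the exact steps used for this archive. The function name <code>merge_mboxes</code> and the file names are hypothetical, it ignores attachments stored outside the mbox, and it assumes the <code>Message-ID</code> header sits on a single line.</p>

```shell
# merge_mboxes FILE...: concatenate mbox files, keeping only the first copy
# of each Message-ID. Messages without a Message-ID are always kept.
merge_mboxes() {
    awk '
        function emit() {
            # Print the buffered message unless its ID was already seen.
            if (msg != "" && (id == "" || !(id in seen))) {
                printf "%s", msg
                if (id != "") seen[id] = 1
            }
            msg = ""; id = ""
        }
        /^From / { emit() }   # an mbox "From " line starts a new message
        id == "" && tolower($1) == "message-id:" { id = $2 }
        { msg = msg $0 "\n" }
        END { emit() }
    ' "$@"
}

# Hypothetical usage:
#   merge_mboxes run1.mbox run2.mbox run3.mbox > full.mbox
```

<p>Keeping the first copy rather than the last is arbitrary. If later runs tend to be more complete, passing the files in reverse order would favour them instead.</p>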
<p>By the way, if anyone knows how to tell hyperkitty to stop at a certain year
(i.e. not display links for 2019 when the list stopped in 2006), please ping
me.</p>