
Category: nuxx.net

Archiving Gallery 2 with HTTrack

Along with the static copy of the MediaWiki, I’ve been wanting to make a static, archival copy of the Gallery 2 install that I’ve used for 15+ years to share photos at nuxx.net/gallery. After a bit of work with HTTrack I was able to do so, resulting in a copy served from static files at the same URL, with images accessible via the same paths.

The result is that I no longer need to run the aging Gallery 2 software, yet links and embedded images that point to my photo gallery did not break.

In the last few years traffic has dropped off, I haven’t posted many new things there, and it seems like the old Internet practice of pointing people to a personal photo gallery is nearly dead. I believe that blog posts, such as this one, with links to specific photos, are where effort should be put. While there are 18+ years of personal history in digital images in my gallery, it doesn’t get used the way it was 10 years ago.

On the technical side, the relatively ancient (circa 2008) Gallery 2 install and the ~90GB of data in it have occasionally been a burden. I had to maintain an old copy of PHP just for this app, which made updating things a pain. While there is a recent project, Gallery the Revival, which aims to bring Gallery to newer versions of PHP, it is based around Gallery 3, and a migration to that brings about its own problems, including breaking static links.

I’m still not sure if I want to keep the gallery online but static as it is now, put the web app back up, completely take it off the internet and host it privately at home, or what… but figuring out how to create an archive has given me options.

What follows are my notes on how I used HTTrack, a package specifically designed for mirroring websites, to archive nuxx.net’s Photo Gallery. I encountered a few bumps along the way, so this details each one and how it was overcome, resulting in the current static copy. To find each of these I’d start HTTrack, let it run for a while, check for errors, fix them, then try again. Eventually I got it to archive cleanly with zero errors:

Gallery Bug 83873

During initial runs, HTTrack finished after ~96MB (out of ~90GB of images) saved, reporting that it was complete. The main portions of the site looked good, but many sub-albums or original-resolution images were zero-byte HTML files on disk and displayed blank in the browser. This was caused by Gallery bug 83873, triggered by using HTTPS on the site. It seems to be fixed by adding the following line just before line 780 in .../modules/core/classes/GallerySession.class:

GalleryCoreApi::requireOnce('modules/core/classes/GalleryTranslator.class');

This error was found via the following entry in Apache’s error log:

AH01071: Got error 'PHP message: PHP Fatal error: Class 'GalleryTranslator' not found in /var/www/vhosts/nuxx.net/gallery/modules/core/classes/GallerySession.class on line 780\n', referer: http://nuxx.net/gallery/

Minimize External Links / Footers

To clean things up further and make the static copy of the site as simple as possible, I also minimized external links by commenting out the external Gallery links and version number in the footer, via .../themes/themename/templates/local/theme.tpl and .../themes/themename/templates/local/error.tpl:

<div id="gsFooter">
{*
{g->logoButton type="validation"}
*{g->logoButton type="gallery2"}
*{g->logoButton type="gallery2-version"}
*{g->logoButton type="donate"}
*}
</div>

Remove Details from EXIF/IPTC Plugin

The EXIF/IPTC Plugin for Gallery is excellent because it shows metadata embedded in the original photo, including things like date/time, camera model, and location. This presents as a simple Summary view and a lengthier Details view. Unfortunately, when the site is indexed by HTTrack, selecting the Details view (done via JavaScript) returns a server error. This shows up in the HTTrack UI as an increasing error count as those pages are queried.

To avoid a broken link on every page, I modified the plugin to remove the Summary/Details view selector so it would only display the Summary, and used the plugin configuration to ensure that every field I wanted was shown in the summary.

To make this change copy .../modules/exif/templates/blocks/ExifInfo.tpl to .../modules/exif/templates/blocks/local/ExifInfo.tpl (to create a local copy, per the Editing Templates doc). Then edit the local copy and comment out lines 43 through 60 so that only the Summary view is displayed:

{* {if ($exif.mode == 'summary')}
* {g->text text="summary"}
* {else}
* <a href="{g->url arg1="controller=exif.SwitchDetailMode"
* arg2="mode=summary" arg3="return=true"}" onclick="return exifSwitchDetailMode({$exif.blockNum},{$item.id},'summary')">
* {g->text text="summary"}
* </a>
* {/if}
* &nbsp;&nbsp;
* {if ($exif.mode == 'detailed')}
* {g->text text="details"}
* {else}
* <a href="{g->url arg1="controller=exif.SwitchDetailMode"
* arg2="mode=detailed" arg3="return=true"}" onclick="return exifSwitchDetailMode({$exif.blockNum},{$item.id},'detailed')">
* {g->text text="details"}
* </a>
* {/if}
*}

Disable Extra Plugins

Finally, I disabled a bunch of plugins which wouldn’t be useful in a static copy of the site and which create a number of interconnected links that would make a mirror of the site overly complicated:

  • Search: Can’t search a static site.
  • Google Map Module: Requires a maps API key, which I don’t want to mess with.
  • New Items: There’s nothing new getting posted to a static site.
  • Slideshow: Not needed.

Fix Missing Files

My custom theme, which was based on matrix, linked to some images in the matrix directory which were no longer present in newer versions of the themes, so HTTrack would get 404 errors on these. I copied these files from my custom theme to the .../themes/matrix/images directory to fix this.

Clear Template / Page Cache

After making changes to templates it’s a good idea to clear all the template caches so all pages render with the above changes. While all these steps may be overkill, I do the following:

  1. Go into Site Admin → Performance and set Guest Users and Registered Users to No acceleration.
  2. Uncheck Enable template caching and click Save.
  3. Click Clear Saved Pages to clear any cached pages.
  4. Re-enable template caching and Full acceleration for Guest Users (which HTTrack will be working as).

PANIC! : Too many URLs : >99999

If your Gallery has a lot of images, HTTrack could quit with the error PANIC! : Too many URLs : >99999. Mine did, so I had to run it with the -#L1000000 argument, raising the limit to 1,000,000 URLs from the default 99,999.

Run HTTrack

After all of this, I ran the httrack binary with the security (bandwidth, etc) limits disabled (--disable-security-limits) and used its wizard mode to set up the mirror. The URL to be archived was https://nuxx.net/gallery/, stored in an appropriately named project directory, with no other settings.

CAUTION: Do not disable security limits if you don’t have good control over the site you are mirroring and the bandwidth between the two. HTTrack has very sane rate-limiting defaults that keep its mirroring behavior polite; it’s not wise to override them unless you have good control of both the source and destination sites.

When httrack begins it shows no progress on screen, so I quit with Ctrl-C, switched to the project directory, and ran httrack --continue to allow the mirror to continue and show status info on the screen (the screenshot above). The argument --continue can be used to restart an interrupted mirror, and --update can be used to freshen up a complete mirror.

Alternately, the following command puts this all together, without the wizard:

httrack https://nuxx.net/gallery/ -W -O "/home/username/websites/nuxx.net Photo Gallery" -%v --disable-security-limits -#L1000000

As HTTrack spiders the site it comes across external links and needs to know what to do with them. Because I didn’t specify an action for external links on the command line, it prompts with the question “A link, [linkurl], is located beyond this mirror scope.”. Since I’m not interested in mirroring any external sites (mostly links to recipes or company websites) I answer * which is “Ignore all further links and do not ask any more questions” (text in httrack.c). (I was unable to figure out how to suppress this via a command line option before getting a complete mirror, although it’s likely possible.)

Running from a Dedicated VM

I ran this mirror task from a Linode VM located in the same region as the VM hosting nuxx.net. This results in all traffic flowing over the private network, avoiding bandwidth charges.

Because of the ~90GB of images, I set up a Linode 8GB, which has 160GB of disk, 8GB of RAM, and 4 CPUs. This should provide plenty of space for the mirror, with enough resources to allow the tool to work. This VM costs $40/mo (or $0.06/hr), which I find plenty affordable for getting this project done. The mirror took N days to complete, after which I tar’d it up and copied it a few places before deleting the VM.

By having a separate VM I didn’t have to worry about any dependencies or package problems, and I could delete it after the work was done. All I needed to do on this VM was create a user, put it in the sudoers file, install screen (sudo apt-get install screen) and httrack (sudo apt-get install httrack), and get things running.

Wrapping It All Up

After the mirror was complete I replaced my .../gallery directory with the .../gallery directory from the HTTrack output directory and all was good.


Archiving MediaWiki with mwoffliner and zimdump

For a number of years on nuxx.net I used MediaWiki to host technical content. The markup language is nearly perfect for this sort of content, but in recent years I haven’t been doing as much of this and maintaining the software became a bit of a hassle. In order to still make the content available but get rid of the actual software, I moved all the content to static HTML files.

These files were created by creating a ZIM file — commonly used for offline copies of a website — and then extracting it. The extracted files, a static copy of the MediaWiki-based site, were then made available using Apache.

You can get the ZIM file here, or browse the new static pages here.

Here’s the general steps I used to make it happen.

Create ZIM file: mwoffliner --mwUrl="https://nuxx.net/" --adminEmail=steve@nuxx.net --redis="redis://localhost:6379" --mwWikiPath="/w/" --customZimFavicon=favicon-32x32.png

Create HTML Directory from ZIM File: zimdump -D mw_archive outfile.zim

Note: There are currently issues with zimdump putting the URL-encoded %2f sequence in filenames instead of creating directory paths. This is openzim/zim-tools issue #68, and it will need to be fixed by hand.

Consider using find . -name "*%2f*" to find affected files, then use rename 's/.{4}(.*)/$1/' * (or so) to fix the filenames after moving them into appropriate subdirectories.
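As a rough sketch, that cleanup can be scripted. The loop below is my own illustration, not from the original workflow (test it on a copy of the extracted tree first): it splits each filename on the literal %2f sequences and recreates the directory structure. It is demonstrated here on scratch files in a temp directory; run the loop inside the real extracted directory to use it.

```shell
# Illustrative only: rebuild directory paths from zimdump output where
# "/" ended up as a literal "%2f" in filenames (openzim/zim-tools #68).
cd "$(mktemp -d)"
touch 'A%2fMain_Page' 'A%2fTalk%2fMain_Page'   # sample damaged names

for f in *%2f*; do
  [ -e "$f" ] || continue          # skip if the glob matched nothing
  dir="${f%%"%2f"*}"               # text before the first %2f -> directory
  rest="${f#*"%2f"}"               # remainder, possibly with more %2f
  while [ "${rest#*"%2f"}" != "$rest" ]; do
    dir="$dir/${rest%%"%2f"*}"     # peel off further path components
    rest="${rest#*"%2f"}"
  done
  mkdir -p "$dir"
  mv -- "$f" "$dir/$rest"
done
```

The `[ -e "$f" ]` guard makes the loop a no-op in a directory with no damaged names, so it is safe to re-run.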

If using Apache (as I am) create .htaccess to set MIME Types Appropriately, turning off the rewrite engine so higher-level redirects don’t affect things:

<FilesMatch "^[^.]+$">
ForceType text/html
</FilesMatch>

RewriteEngine Off

Link to http://sitename.com/outdir/A/Main_Page to get to the original main wiki page. In my case, http://nuxx.net/wiki_archive/A/Main_Page.

 


No More Tables

For the last ten (or so) years that I’ve been posting to a weblog (first as c0nsumer on LiveJournal and now here on nuxx.net/blog) I’ve regularly posted images at the top of the post. Embarrassingly, up until today I’ve been using a templatized HTML table with 1px of padding and a black background to make the 1px black border around each image:

<center><table cellpadding=1><tr><td bgcolor="black"><a href=""><img src="" height= width= border=0 title=""></a></td></tr></table></center>

I’ve known that this is the wrong way to go for a while now, but not knowing much about CSS I didn’t want to take the time to learn what was needed to change things for the better. I also had something that worked, cross-posted properly to LiveJournal, and wasn’t hard to maintain. One thing it didn’t afford me was the ability to use WordPress’s visual editor; something which would allow me to easily create more image-laden posts and edit posts more quickly.

With the recent implementation of the new MMBA Trail Guide and some updates that needed to be done I’d been reworking a few different parts of the server, and it was time to change some things on this, my personal site. The main page had been MediaWiki (MW)-based for a while, but I now prefer WordPress (WP) for a main-website CMS, especially as I make blog posts far more frequently than the long-form technical writing that MediaWiki is best for. I started by upgrading MW and returning it to a more default theme, then moving WP to be the main page reached when one visits nuxx.net. Content on MW was then adjusted to house only Technical Pages, and links to the most useful pages were added to WP.

The result is that the main page of nuxx.net is now WordPress-based, bringing all the ease-of-writing features that WordPress is known for. MediaWiki remains present, but has been relegated to a repository for technical info that’d be difficult to write up cleanly in WP; something I intend to continue using whenever I work on detailed technical topics.

Thanks to help from my friend Rob I was able to get my head around using Chrome’s Elements panel to figure out what was needed to style the images with a nice 1px border without using a silly table. Hopefully I’ll stick with CSS in the future, avoiding more silly hacks like using tables in 2013. All posts going back to the beginning of the year have been updated to remove the table border, but older posts will end up stuck with a 6px border: 1px for the original padding plus 5px added by a margin on the images. I don’t think this is terrible; it’s probably just part of the price of progress. I had to do it at some point.
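For reference, the CSS equivalent of that table hack is tiny. This rule is illustrative only (the selector is a guess, not the actual stylesheet rule used on nuxx.net):

```css
/* Illustrative: a 1px black border plus the 5px margin on post images,
   replacing the padded single-cell table. Selector is hypothetical. */
.entry-content img {
    border: 1px solid #000;
    margin: 5px;
}
```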

Going forward I may also move some of the less-technical content (such as a journal written while on a solo cruise to Alaska in 2003, or mixes) to WordPress just as I did with the About page, but I’ve yet to decide on that. For now I’ll just enjoy the enhanced writing capabilities, growing CSS knowledge, and improved writing tools.


How To Disable IPv6 w/ Sendmail on FreeBSD 9.0-RELEASE

Due to some issues with Comcast flagging some email I’ve been sending via IPv6 as spam, I wanted to keep mail from being sent this way. Comcast publishes this document explaining how to keep IPv6 mail from being blocked, but I’ve got some rDNS issues to sort out before I can work through all of those. So, in the meantime, I simply wanted to stop sending mail over IPv6.

It took a bit to figure out how to disable IPv6 in the base Sendmail, but now that I’ve got it done I figured I’d share. This is on 9.0-RELEASE, but I’m sure it applies to many other recent FreeBSD versions:

Edit /etc/make.conf to ensure that IPv6 is turned off for Sendmail compiles. Add this line to the file:

SENDMAIL_CFLAGS= -UNETINET6

Rebuild Sendmail as described here in the FreeBSD Handbook:

# cd /usr/src/lib/libsmutil
# make cleandir && make obj && make
# cd /usr/src/lib/libsm
# make cleandir && make obj && make
# cd /usr/src/usr.sbin/sendmail
# make cleandir && make obj && make && make install

Then, go into your Sendmail config directory (/etc/mail) and, if you haven’t done so before, run make all to build your machine-specific Sendmail config files.

Edit hostname.mc and locate the line that reads DAEMON_OPTIONS(`Name=IPv6, Family=inet6, Modifiers=O') and comment it out by adding a dnl in front of it:

dnl DAEMON_OPTIONS(`Name=IPv6, Family=inet6, Modifiers=O')

Compile the Sendmail config and restart Sendmail:

make install
make restart

And, now you’re done! Look at /var/log/maillog to ensure that mail is no longer being delivered via IPv6.


Résumé Updated for 2012

Updating one’s résumé can be quite a pain especially if done under duress, so I like to periodically update it so that a fairly fresh copy is readily available. This afternoon I put the finishing touches on the most updated version, one which takes into account some changes at work, stuff that I’ve done with CRAMBA and the MMBA, and a few other newly-acquired skills.

If you’d like to see a copy of my resume it can be found at nuxx.net/resume.


Time Machine for… FreeBSD?

This week I finally got around to writing a new backup script for my webserver. I have it automatically pushing backups to a device at home, but in the past I’d only been doing a nightly rsync with --delete and periodic offline backups. The problem with this was that should something happen to my server and cause a data loss, but not be noticed before the next backup ran, the current backup would be modified to reflect the now-compromised data, potentially causing massive data loss. Clearly this was a bad thing, and something had to be done.

A new backup scheme was devised, and now that the new scripts are tweaked I wanted to present them here. rsync is still being used, but thanks to its glorious --link-dest option, which creates hard links where it can, files already stored on disk (say, from a previous version of the backup) are reused, saving space. This is how Apple's Time Machine works, just without the nice GUI. The result is that I have a series of directories from backup.0 up through potentially backup.30 on the target, each containing a different backup. The suffixed number represents how many versions old the backup is. These versions are generally created once per day, but on days where the backup does not complete successfully the version is not incremented.

To start, there is a script called dailybackup.sh which runs once per day on banstyle.nuxx.net. This script pushes a backup to a Mac at home as follows:

  1. If needed, remotely execute rotatebackup.sh on the backup server. This will move backup.0 to backup.1, backup.1 to backup.2, keeping no more than 30 backups. The need to rotate backups is determined by the presence of backup.0/backup_complete. If there is no backup_complete file we know that the previous backup was not successful and that we should reuse backup.0.
  2. Create a new backup.0 and populate it with a backup_started flag file.
  3. Run the backup job via rsync.
  4. If the job completes successfully (exits with 0 or 24), continue. Exit code 24 indicates that some files disappeared during backup, and as mail files (amongst others) tend to move and be deleted by users during the backup job, this is not a critical error for us.
  5. Remove backup_started and create the backup_complete flag.

Copies of the aforementioned scripts can be found here, if you’d like to look at / use them: dailybackup.sh · rotatebackups.sh

These scripts assume the presence of backup.0, a full copy of your backup, which you’ll have to create yourself before use. There’s also likely some necessary changes for your environment, most likely in some of the variables set at the top of the scripts, such as the number of days for which to keep backups and logs, the target hostname, SSH port, username, etc.


Busy, Busy, Busy…

A very small owl sitting on a branch outside of the window at Rochester Mills Brewery.

I’ve been really, really busy lately. This isn’t a bad thing, I just haven’t had enough time to get everything done that I’d hoped to. Lately I’ve had the MMBA website move, really bad weather on Saturday, shopping (REI, IKEA, Target, Meijer, etc) on Sunday, work then the MMBA Metro North quarterly meeting today, and now I’m making tapioca pudding.

I still have to find time (hopefully tomorrow) to fix a friend’s NAS, finish up the x0xb0x, and whatever else comes up. For now, though, have some moblog photos:

· A very small owl sitting on a branch outside of the window at Rochester Mills Brewery.
· Bags and carts at Ikea on Ford Road.
· Partially eaten veggie burger from J. Alexanders in Somerset.
· The urinal at J. Alexanders is a nice, old style model.
· After buying gas I bought this very large apple fritter.
· I do wonder why this person doesn’t just disable their touchpad.
· Partially eaten rosemary bread with jalapeno havarti melted on the top.
· Waiting for biryani at Rangoli Express #1.

Also, this evening’s fortune (6):

Last login: Mon Jan 12 19:55:22 2009 from adsl-75-45-241-
Copyright (c) 1980, 1983, 1986, 1988, 1990, 1991, 1993, 1994
        The Regents of the University of California.  All rights reserved.

FreeBSD 7.0-RELEASE (BANSTYLE) #4: Tue Dec  9 00:07:44 EST 2008
 
Snow Day -- stay home.
 
c0nsumer@banstyle:~>

Funny that, considering the current forecast. A snow day would be rather nice, actually.


MMBA Site Moved

Michigan Mountain Biking Association web site (mmba.org) soft launch after moving to nuxx.net for hosting.

Here’s the result of something I’ve been working on for the last couple months. The new Michigan Mountain Biking Association web site has launched, and it is now hosted here on my server. This is the soft launch of the site, as we should have a new unified theme / design across the main site, forum, and other places soon. However, we wanted to get the new site itself up and running because the old one was causing us a few problems.

I’m really, really glad we got this done. Now, time for bed.


Attributor Corporation

StatPress in WordPress on the nuxx.net Blog showing a bunch of requests from Attributor Corporation.

Do any of you who run blogs ever notice occasional rashes of indexing from 64.41.128.xxx? I’ve noticed this every few days when poking around in the copy of StatPress Reloaded which is running here to monitor pageviews and such.

It turns out that these queries are from Attributor Corporation, who regularly indexes blogs and such to look for copyright violations, duplications of text / image / video content, etc.

Attributor’s FAQ states that…

Attributor is the world’s first web-wide content tracking and analysis platform that enables publishers to build value with their content wherever it appears on the Internet.

With Attributor, publishers can now program when, how and where their content is presented across the web and social networks. Advanced fingerprinting algorithms, a large scale crawling infrastructure and detailed contextual analysis provide publishers with web-wide visibility of their articles, images or videos. Using the Attributor platform, customers can monitor licensed uses, identify new sales leads and revenue-sharing opportunities, and derive more links and better search engine placement.

The FAQ then goes on to talk about how they don’t want to immediately send out DMCA notices for such things, but instead enhance monetization by sending requests to those copying content asking for appropriate links back, attributions, etc. They also claim that their tool (Dashboard) can take Creative Commons licenses into account and help ensure that the license is being followed accordingly.

I don’t really mind, since all this content is fairly original and put out for everyone to see and read, but it is interesting to see the scanning actually happening.
