To Those That Read firsttube.com via RSS

To those of you that read firsttube.com via RSS, I’m sorry about the recent difficulties. My conversion to WordPress is almost entirely complete, including handling all old links, etc. I have a few legacy things left to fix. In the meantime, I realize that my feed has been screwy for the last few days and I’m sorry about that.

It should be fixed now, which likely means another 20 dupes or so; I can’t control that. But I can tell you that I think we’re all caught up. Thanks for hanging with me.

Hacking WordPress, Day Two

Thus far, my move to WordPress has been an adventure. Here are a few lessons learned.

First off, I was very excited about the features of WordPress, most specifically the API and the rich text WYSIWYG editor in the backend. I’ve done a lot of work on Small Axe’s backend, but it’s still nothing compared to WordPress.

When I imported my stuff, it worked well, but the “slugs” (URL-friendly post titles) did not convert properly. They were converted into WordPress-friendly, properly escaped slugs. The problem was that my slugs needed to stay intact, because I didn’t want all my old links to break.

Understanding the way WordPress functions is really tough for a WP newbie, because the code is so spread out yet compact, voluminous yet digestible. Start with index.php, move on to wp-blog-header.php, then into wp-settings.php, and then you find the massive list of files in the wp-includes directory. You’ll dig all over, chasing includes inside includes inside includes. I finally found a great article that tries to explain the WordPress slug architecture. It’s fairly complex. Much of it lives in /wp-includes/query.php. However, my problem was very specific.

Many of my post slugs had periods in them. The period does not interfere with the URL, but WordPress doesn’t like them, and somewhere in the massive beast they were being stripped out. So I had to find the code that “gets” posts. Lo and behold, there is a function called “get_posts” that lives in /wp-includes/query.php. I kept poking around. Keep digging and, eventually, you’ll find yourself in wp-includes/formatting.php. And there it is.

Post slugs get sanitized (like virtually all input, which is strictly sanitized) by a function called sanitize_title_with_dashes(). This function generates the slug. To allow dots in your slugs, just replace lines 366 and 367 (on WordPress 2.6.0) with this:

$title = preg_replace('/&.+?;/', '', $title); // kill entities
$title = preg_replace('/[^%a-z0-9 _.-]/', '', $title);

Then your slugs will no longer have their periods stripped. Of course, I don’t recommend you actually use periods; I just wanted them to work when fetching old posts created before I knew any better.
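To see what the one-character change does, here is a minimal standalone sketch that mimics just those two lines outside of WordPress. The helper name strip_slug_chars() and the sample title are made up for illustration, the entity pattern is written as /&.+?;/ (my assumption about what that line is meant to match), and the real sanitize_title_with_dashes() does considerably more than this.

```php
<?php
// Hypothetical helper: imitates only the two lines discussed above,
// not the full WordPress sanitize_title_with_dashes() function.
function strip_slug_chars(string $title, bool $allowDots): string
{
    $title = strtolower($title);
    $title = preg_replace('/&.+?;/', '', $title); // kill entities

    // The only difference between the two variants is the "." in the class.
    $pattern = $allowDots ? '/[^%a-z0-9 _.-]/' : '/[^%a-z0-9 _-]/';
    $title = preg_replace($pattern, '', $title);

    // Collapse runs of whitespace into a single dash, then trim stray dashes.
    $title = preg_replace('/\s+/', '-', $title);
    return trim($title, '-');
}

echo strip_slug_chars('Firefox 3.0 &amp; Me', false), "\n"; // firefox-30-me
echo strip_slug_chars('Firefox 3.0 &amp; Me', true), "\n";  // firefox-3.0-me
```

With the stock character class the period disappears and the old link breaks; with the dot added to the class the slug keeps its original shape.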

After that adventure, I have to tell you, I’m really loving WordPress. There are some incredible plugins that have extended its functionality in amazing ways for me. So far, so good.

Export Blogsome, Export Slashdot Journal

I was recently issued a challenge: back up a Blogsome blog and the content of a Slashdot journal and merge them into a single database. I foolishly accepted this challenge, knowing that Blogsome is based on WordPress. Come to find out, Blogsome doesn’t allow you to back up or export your WordPress content. Also, Slashdot doesn’t provide you a way to export or back up your journal. The prize was sweet: a brand new, fairly expensive, unlocked mobile phone.

If you want to make a mirror of your Blogsome blog, you can use a single very powerful command to generate a snapshot of it from any Linux machine or Windows with Cygwin installed:

wget -k -m -r http://url

But this will only create a static HTML mirror of your website. It won’t allow you to manipulate content or put it into another database. That leaves only one way to do it: request the entire site page by page, and parse each page individually. RSS is not reliable here, as most people have it set to only 15 or so items, and parsing an enormous page may make PHP or your server run out of memory or allotted script execution time.
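To give a feel for what parsing a single page involves, here is a rough sketch using PHP’s DOM extension. It is not my actual script: the markup, class names, and example.blogsome.com URLs are assumptions (Blogsome themes vary, so the XPath queries would need adjusting), and the HTML is inlined here where the real thing would fetch each page over HTTP.

```php
<?php
// Hypothetical page fragment; real Blogsome themes use different markup.
$html = <<<HTML
<div class="post" id="post-12">
  <h2><a href="http://example.blogsome.com/2008/07/01/hello.world/">Hello World</a></h2>
  <div class="entry"><p>First post.</p></div>
</div>
<div class="post" id="post-13">
  <h2><a href="http://example.blogsome.com/2008/07/02/second/">Second</a></h2>
  <div class="entry"><p>Another one.</p></div>
</div>
HTML;

// Pull every entry out of one page of HTML.
function extract_entries(string $html): array
{
    libxml_use_internal_errors(true); // tolerate sloppy real-world HTML
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $entries = [];
    foreach ($xpath->query('//div[@class="post"]') as $post) {
        $link = $xpath->query('.//h2/a', $post)->item(0);
        $body = $xpath->query('.//div[@class="entry"]', $post)->item(0);
        $entries[] = [
            // "post-12" becomes 12
            'postid' => (int) preg_replace('/\D+/', '', $post->getAttribute('id')),
            'url'    => $link->getAttribute('href'),
            'title'  => trim($link->textContent),
            'body'   => trim($body->textContent),
            'flag'   => 0, // comments not fetched yet
        ];
    }
    return $entries;
}

foreach (extract_entries($html) as $entry) {
    echo $entry['postid'], ' ', $entry['url'], "\n";
}
```

Working one page at a time like this keeps each request small, which is the whole point when memory and execution time are limited.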

It’s a multi-step process, to be certain. It was actually painful to go through the process. Requesting over and over, debugging the script line by line. It takes several steps to get things right. But, eventually, I did it. I was able to export a Blogsome blog in its entirety – every entry, all the categories, all the comments with emails and websites …everything.

Slashdot was the same. It took some tinkering, but eventually, I was able to back up a very lengthy Slashdot journal. Again, in its entirety. I got every post, the date, etc. It was not simple, but it worked well. And I not only got them backed up, I merged them into a single database serving up… Small Axe 0.6 (which was a whole adventure of its own, taking current firsttube.com code and “neutralizing” it). Suffice it to say, when I saw all entries working and served in Small Axe, I had a huge smile. It turns out that the person I was doing this for decided against Small Axe (only because it doesn’t yet offer all the bells and whistles WordPress does, even if WordPress is a beast). But that was irrelevant: the hard part was getting the data properly, and that’s done. Migrating to *any* blog database is possible if you have the time, inclination, and skill to write a SQL export/import script.

Here’s how it works:

* Cycle through each page of the Blogsome blog. On each page, we get the entries, the URL, the postid, and other relevant info. We set a flag on each item to 0.
* As we retrieve the items, we correct the path to images, spacing, smilies, etc.
* Then we cycle through each page individually. We have all the URLs already, so we go through each one and parse comments. It’s important to know that the blog owner’s comments are marked up differently. As we get the comments, we update the flag and let the script run on its own. Our 900+ items took about an hour by meta-refreshing the fully rendered page every 3 seconds.
* As we go through the comments, we strip tags we don’t want, we fix emoticons, we fix internal links, spacing, etc. We must expose emails temporarily if we want them to transfer over.
* Finally, we import them all into a central database with an agreed schema.
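The flag-and-refresh idea in the steps above can be sketched roughly like this. It is a simplified illustration and not the script I ran: the queue is an in-memory array (mine lived in a database), the names process_batch() and clean_comment() are invented for the example, and the smiley markup is an assumption.

```php
<?php
// Hypothetical queue of post URLs; flag 0 means "comments not fetched yet".
$queue = [
    ['url' => 'http://example.blogsome.com/a/', 'flag' => 0],
    ['url' => 'http://example.blogsome.com/b/', 'flag' => 0],
    ['url' => 'http://example.blogsome.com/c/', 'flag' => 0],
];

// Keep a small whitelist of tags and turn smiley images back into text.
function clean_comment(string $html): string
{
    $html = preg_replace('/<img[^>]*alt="([^"]*)"[^>]*>/i', '$1', $html);
    $html = strip_tags($html, '<a><p><em><strong>');
    return trim(preg_replace('/\s+/', ' ', $html));
}

// Handle at most $batchSize unflagged items; return how many are still left.
function process_batch(array &$queue, int $batchSize, callable $handler): int
{
    $done = 0;
    foreach ($queue as &$item) {
        if ($item['flag'] === 1) {
            continue; // already processed on an earlier request
        }
        if ($done === $batchSize) {
            break;    // leave the rest for the next refresh
        }
        $handler($item['url']); // fetch and parse comments here
        $item['flag'] = 1;
        $done++;
    }
    unset($item);
    return count(array_filter($queue, fn($i) => $i['flag'] === 0));
}

$remaining = process_batch($queue, 2, function (string $url): void {
    // Placeholder: a real handler would request $url and store its comments.
});

if ($remaining > 0) {
    // Ask the browser to reload in 3 seconds and pick up the next batch.
    echo '<meta http-equiv="refresh" content="3">', "\n";
}
```

Because each request only touches a couple of items and the flag records progress, the script can be interrupted and resumed without redoing work.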

If you have a Blogsome blog and/or a Slashdot journal you need backed up, I can help you do it. It’s not a simple process, but it is very accurate and preserves whatever data is exposed via HTML. So for the right barter, I would be very motivated to help. If I can simplify the process, I may create an open script to do this. But for now, I’ve got the code.

On a related note, I’ll probably release an updated version of Small Axe sometime in the not too distant future, because the amount of changing I’ve done and all the features I’ve just implemented are killer. Small Axe is FAR from being tested to WordPress caliber, but it’s SUPER simple and can do all the basics of a normal blog, including templating, smart per-domain caching, blocking by IP, username, email, or keyword, Gravatar support, tons of configuration options, RSS and Atom support, threaded commenting, post locking, post expiring, browser identification, slug-based permalinking, and much more.