My First Rust Program - A Web Server Using Nickel.rs and rust-postgres

Rust is a new programming language, from Mozilla, which appeared on my radar recently.

I wanted to get my feet wet by writing a basic HTTP server that could read and write data from a database, and here is my first attempt: nickel-postgres

There are many things that I would like to improve about it (see the numerous //TODO comments), but I thought I would share my rookie attempt!

Many thanks to Steve Fackler, author of the rust-postgres library, for giving me a couple of pointers in the right direction.

Quick Select Algorithm, a JavaScript Implementation

The function description comment says it all.

/*
 * Places the `k` smallest elements (by `propName`) in `arr` in the first `k` indices: `[0..k-1]`
 * Modifies the passed in array *in place*
 * Returns a slice of the wanted elements, for convenience
 * Efficient mainly because it never performs a full sort.
 *
 * The only guarantees are that:
 *
 * - The `k`th element is in its final sort index (if the array were to be sorted)
 * - All elements before index `k` are smaller than the `k`th element
 *
 * [Reference](http://en.wikipedia.org/wiki/Quickselect)
 */
function quickSelectInPlace(arr, k, propName) {
    if (!arr || arr.length <= k || typeof propName !== 'string') {
        throw new Error('Invalid arguments to quickSelectInPlace');
    }
    var len = arr.length;

    var from = 0;
    var to = len - 1;
    while (from < to) {
        var left = from;
        var right = to;
        var pivot = arr[Math.ceil((left + right) * 0.5)][propName];

        while (left < right) {
            if (arr[left][propName] >= pivot) {
                var tmp = arr[left];
                arr[left] = arr[right];
                arr[right] = tmp;
                --right;
            }
            else {
                ++left;
            }
        }

        if (arr[left][propName] > pivot) {
            --left;
        }

        if (k <= left) {
            to = left;
        }
        else {
            from = left + 1;
        }
    }
    return arr.slice(0, k);
}
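As a usage sketch, here is the function applied to some illustrative sample data (the function is repeated so that the snippet stands alone):

```javascript
// quickSelectInPlace, as defined above.
function quickSelectInPlace(arr, k, propName) {
    if (!arr || arr.length <= k || typeof propName !== 'string') {
        throw new Error('Invalid arguments to quickSelectInPlace');
    }
    var from = 0;
    var to = arr.length - 1;
    while (from < to) {
        var left = from;
        var right = to;
        var pivot = arr[Math.ceil((left + right) * 0.5)][propName];
        while (left < right) {
            if (arr[left][propName] >= pivot) {
                var tmp = arr[left];
                arr[left] = arr[right];
                arr[right] = tmp;
                --right;
            }
            else {
                ++left;
            }
        }
        if (arr[left][propName] > pivot) {
            --left;
        }
        if (k <= left) {
            to = left;
        }
        else {
            from = left + 1;
        }
    }
    return arr.slice(0, k);
}

// Select the 3 cheapest items; `items` is illustrative data.
var items = [
    { price: 9 }, { price: 2 }, { price: 7 },
    { price: 4 }, { price: 1 }, { price: 8 }
];
var cheapest = quickSelectInPlace(items, 3, 'price');
console.log(cheapest.map(function (item) {
    return item.price;
}).sort(function (a, b) { return a - b; }));
// → [ 1, 2, 4 ]
```

Note that the input array is partially reordered as a side effect, which is why the function name ends in InPlace.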

Deploying an ember-cli app to Heroku - Demo Apps Only!

Deploying to Heroku is easy... if you can figure out all of the hidden gotchas!

No dev dependencies

Heroku runs npm install --production when it builds your app, which skips devDependencies entirely. That means you cannot rely on devDependencies, nor on any globally installed npm packages.

Since ember-cli installs itself locally by default, the only package you would otherwise need globally is bower, so add it as a regular dependency:

npm install --save bower

Except ember-cli

... which must be in both dependencies and devDependencies.

This is because the ember command inspects the project's package.json, looking for ember-cli, to determine whether the project is indeed an ember-cli app. If it does not find it there, it will display an error saying that you need to run the command from within a folder containing an ember-cli app.
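A sketch of the resulting package.json fragment - the version numbers here are illustrative, not prescriptive:

```json
{
  "dependencies": {
    "bower": "^1.3.0",
    "ember-cli": "0.0.40"
  },
  "devDependencies": {
    "ember-cli": "0.0.40"
  }
}
```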

If this is too much trouble for what it is worth, simply issue this command instead:

heroku config:set NODE_ENV=staging

... so that Heroku will run npm install instead of npm install --production when it spins up the dyno.

Server on web Proc only

The process that runs the server must be named web - not main or anything else. If you want to reach a server running on a Heroku dyno from port 80 externally, that server must run in a Proc named web. I wish Heroku's documentation actually stated this explicitly.

web: npm run start

Use scripts in package.json

Node.js packages may define an optional scripts section in their package.json file. For ember-cli apps, use scripts.postinstall to run bower install, and scripts.start to run ember serve.

Use the PORT environment variable

When running ember serve, do not use a hardcoded default port number. Whenever Heroku spins up a dyno (which happens at least once per deploy), it assigns a new port number (among other things), and this is the port that Heroku forwards to from port 80.

"scripts": {
    "start": "./node_modules/ember-cli/bin/ember serve --environment=production --port=${PORT}",
    "build": "./node_modules/ember-cli/bin/ember build",
    "test": "./node_modules/ember-cli/bin/ember test",
    "postinstall": "./node_modules/bower/bin/bower install"
},

Note that npm install is not necessary in scripts.postinstall - Heroku runs it automatically for all Node.js projects.

A Word of Caution

You should not use ember serve to deploy production apps, as it entails potential security and performance problems. But of course, sometimes you simply want to deploy a demo app, and in those cases deploying an ember-cli app like this works quite well.

New to Heroku? - Quick Run-down

Heroku is a cloud hosting service which allows you to spin instances up and down on the fly. You can operate it entirely from the command line by installing the Heroku Toolbelt, and deployment happens by pushing to a git remote hosted on Heroku.

If deploying to Heroku for the first time, you will need to set up the prerequisites on your computer:

wget -qO- https://toolbelt.heroku.com/install-ubuntu.sh | sh
# for other OS'es: https://toolbelt.heroku.com/
ssh-keygen -f ~/.ssh/id_rsa_heroku # generate a key pair just for Heroku
echo "Host heroku.com" >> ~/.ssh/config
echo " IdentityFile ~/.ssh/id_rsa_heroku" >> ~/.ssh/config
chmod 600 ~/.ssh/config
heroku keys:add

To get a Node.js app up and running on Heroku, first create the app; then, when ready for deployment:

git init # if you have not done so already
git add . && git commit -a # commit whatever should be deployed
heroku create name-of-your-app
git push heroku master

Heroku's git repository has a post-receive hook that runs upon each push, which will attempt to (re)install and (re)deploy your app; the push will only succeed if the installation and deployment succeed.

Migrating from Tumblr and Wordpress to Docpad - Extract and Transform

In the previous post, I made the case for static site generation. Let us take a look at how to extract data from tumblr and wordpress blogs, and transform it for docpad, a static site generator.

Get Your Node On

mkdir blog-extract
cd blog-extract
npm init # accept all the defaults, it is not very important
npm install --save request mkdirp moment tumblr.js
touch index.js

Edit index.js, and add the following:

var fs = require('fs');
var url = require('url');
var path  = require('path');
var mkdirp = require('mkdirp');
var moment = require('moment');
var request = require('request');
var tumblr = require('tumblr.js');

Now we have a shiny new Node.js project ready to go, with batteries (dependencies) included.

Wordpress Posts API

Wordpress exposes a JSON API that allows you to extract your posts. There is almost no set-up required, as no authentication is needed.

In order to get our posts, we can follow these instructions.

Extract and transform

With the API documentation in hand, we can now write some code to automate the process - we certainly do not want to issue multiple wget or curl calls, and then copy the results into new files by hand. I might do that for a couple of posts, but since I am dealing with about 80 posts here, that would be far too time consuming an endeavour!

var pos, step, total;

var wordpressSite = 'yourblogname.wordpress.com'; // replace with your own
pos = 0;
step = 20;
total = 0;
do {
    /*
     * Here we do the queries, and be sure to set total so that it loops more than once
     * The looping is necessary, because you cannot download all posts at once
     * and we must paginate the requests
     */
} while (pos < total);

That is the basic run loop. Within the run loop, we perform the requests to the wordpress API server:

    var reqUrl = 'https://public-api.wordpress.com/rest/v1/sites/'+wordpressSite+'/posts/?number='+step+'&offset='+pos;
    request(reqUrl, function(err, resp, body) {
        if (err || resp.statusCode !== 200) {
            console.log(err);
            return;
        }
        body = JSON.parse(body);
        if (body.total_posts > total) {
            //set total count, should only happen the first time
            total = body.total_posts;
        }
        //parse each of the posts in the response
        body.posts.forEach(function(post) {
            //transform the post into the format required by docpad
            //and write to file
        });
    });
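One gotcha with the snippet above: request is asynchronous, so a plain do/while loop will not wait for each response before issuing the next one. A pattern that does sequence the pages is to recurse from the callback. Here is a self-contained sketch of that pattern, using a hypothetical fakeFetch stand-in (which answers synchronously here, for brevity) instead of the real HTTP request:

```javascript
// Hypothetical stand-in for the paginated wordpress API: pretends the
// server holds 45 posts, and answers with the same response shape
// (total_posts, posts) used above.
function fakeFetch(pos, step, callback) {
    var totalPosts = 45;
    var posts = [];
    for (var i = pos; i < Math.min(pos + step, totalPosts); ++i) {
        posts.push({ id: i });
    }
    callback(null, { total_posts: totalPosts, posts: posts });
}

var fetched = [];

function fetchPage(pos, step, total) {
    fakeFetch(pos, step, function(err, body) {
        if (err) {
            console.log(err);
            return;
        }
        if (body.total_posts > total) {
            total = body.total_posts; // set on the first response
        }
        body.posts.forEach(function(post) {
            fetched.push(post); // transform and write to file in the real version
        });
        pos += step;
        if (pos < total) {
            // only request the next page once this one has arrived
            fetchPage(pos, step, total);
        }
    });
}

fetchPage(0, 20, 0);
console.log(fetched.length); // → 45
```

In the real script, replace fakeFetch with the request call shown above; since real responses arrive asynchronously, the recursion is what guarantees the pages are fetched one after another.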

We can take a look at what the API response for each blog post looks like in these instructions.

The format that we need to translate to consists of two important parts:

  • Directory and file name
  • Metadata

The third part is the post's contents, but that can be copied verbatim without any transformation.

For a default docpad blog configuration, this would usually be: src/documents/posts/slug-for-this-post.html

We can check this by looking at docpad.coffee, and inspecting docpadConfig.collections.posts:

`@getCollection('documents').findAllLive({relativeDirPath: 'posts'}, [date: -1])`

We are, however, not going to put our extracted files in the posts folder. Instead, we will create a separate wordpressposts folder for the wordpress posts, and configure docpad to look there as well. This configuration is covered at the end, so if you want to test things out right away, skip to the bottom of the post.

I am using the docpad-plugin-dateurls plugin, so that the URL path of each post will match the default wordpress URL paths. Here, we want the directory and file name to follow this pattern: src/documents/wordpressposts/YYYY-MM-DD-slug-for-post.html

        var postUrl = url.parse(post.URL);
        var pathname = postUrl.pathname;
        if (pathname.charAt(pathname.length - 1) === '/') {
            pathname = pathname.slice(0, -1);
        }
        pathname = pathname.slice(1).replace( /\//g , '-');
        var filename = path.normalize('src/documents/wordpressposts/'+pathname+'.html');

For the metadata, we use moment to format the date and time:

        var title = post.title && post.title.replace(/"/g, '\\"');
        var date = moment(post.date).format('YYYY-MM-DD hh:mm');
        var tags = Object.keys(post.tags).join(', ');
        var contents = '---\n'+
            'layout: post\n'+
            'comments: true\n'+
            'title: '+title+'\n'+
            'date: '+date+'\n'+
            'original-url: '+post.URL+'\n'+
            'dateurls-override: '+postUrl.pathname+'\n'+
            'tags: '+tags+'\n'+
            '---\n\n'+post.content;

Finally, write the output to file:

        var dirname = path.dirname(filename);
        mkdirp.sync(dirname); // ensure the directory exists before writing
        fs.writeFile(path.normalize(filename), contents, function(err) {
            if (err) {
                console.log('Error', filename, err);
                return;
            }
            console.log('Written', filename);
        });

Tumblr Posts API

Tumblr is a little more involved than Wordpress: in order to query any of their API, you need a tumblr account (which you probably already have, since you are extracting your posts from it), and you must register a tumblr app to obtain an API key. Copy your "OAuth Consumer Key", and you are good to go.

Once that is done, we simply need to follow this section in the documents. The upside of this slightly higher complexity is that tumblr provides a Node.js client library that makes it easier to call the tumblr API, avoiding the raw HTTP requests we had to make for the Wordpress API.

Extract and transform

var tumblrSite = 'bguiz.tumblr.com'; // replace with your own
var client = tumblr.createClient({
    consumer_key: 'sfsdfsdfsdfjkjksjdfhkjkjhkjshdfkjhkjhskdjfhkjhkjhd' //replace with your own
});
pos = 0;
step = 20;
total = 0;
do {
    /*
     * Perform the paginated requests
     */
} while (pos < total);

Performing the requests:

    client.posts(tumblrSite, {
        offset: pos,
        limit: step,
    }, function(err, data) {
        if (err || ! data) {
            console.log(err, data);
            return;
        }
        if (data.total_posts > total) {
            //set total count, should only happen the first time
            total = data.total_posts;
        }
        data.posts.forEach(function(post) {
            //transform the post into the format required by docpad
            //and write to file
        });
    });

Here, we want the directory and file name to follow this pattern: src/documents/tumblrposts/YYYY-MM-DD-slug-for-post.html

        var ts = moment(post.timestamp*1000);
        var postUrl = url.parse(post.post_url);
        var dateStr = ts.format('YYYY-MM-DD hh:mm');
        var filename = 'src/documents/tumblrposts/'+ts.format('YYYY-MM-DD')+
            '-'+postUrl.pathname.split('/').slice(-1)[0]+'.html';

For the metadata, we want to set the dateurls-override property. Note that this feature is not yet available in docpad-plugin-dateurls, and you will need my patch for this to work. To get it, modify package.json in your root folder, replacing the version number of the plugin with an explicit git URI, like so:

"docpad-plugin-dateurls": "git+ssh://git@github.com/bguiz/docpad-plugin-dateurls.git#exclude-option",

This tells npm to install the package not from the default npm registry, but by cloning a git repository. Unfortunately, this also means docpad will not be able to run the plugin yet, as npm does not run the prepublish step (which compiles the plugin) when installing from a git URL. To work around this, for now, you need to do the following:

npm install
docpad run # fails "Error: Cannot find module 'node_modules/docpad-plugin-dateurls/out/dateurls.plugin.js'"
cd node_modules/docpad-plugin-dateurls
cake compile && cake install
ls out #you should see dateurls.plugin.js
cd ../..
docpad run # success!

For tumblr posts, the default URL path follows the format /post/12345678/slug-for-this-post, and if we migrate posts from the old blog to the new blog without preserving it, any links to the site - especially external ones - will be broken. That makes for a really annoying experience for anyone visiting your site, so it is best to preserve URLs where possible; hence the need to override the default URLs.

        var title = post.title && post.title.replace(/"/g, '\\"');
        var tags = post.tags.join(', ');
        var contents = '---\n'+
            'layout: post\n'+
            'comments: true\n'+
            'title: '+title+'\n'+
            'date: '+dateStr+'\n'+
            'original-url: '+post.post_url+'\n'+
            'dateurls-override: '+postUrl.pathname+'\n'+
            'tags: '+tags+'\n'+
            '---\n\n'+post.body;

Finally, write the output to file:

        var dirname = path.dirname(filename);
        mkdirp.sync(dirname); // ensure the directory exists before writing
        fs.writeFile(path.normalize(filename), contents, function(err) {
            if (err) {
                console.log('Error', filename, err);
                return;
            }
            console.log('Written', filename);
        });

Docpad Configuration Changes

We edit docpad.coffee, in the root directory of the docpad project, modifying docpadConfig.collections.posts to look like this instead:

@getCollection('documents').findAllLive({relativeDirPath: {'$in' : ['docpadposts', 'tumblrposts', 'wordpressposts']}}, [date: -1])

All the wordpress posts should be in src/documents/wordpressposts, and the tumblr posts in src/documents/tumblrposts. When writing any new docpad posts, save them in src/documents/docpadposts.

If you have any docpadConfig.environments configured, be sure to modify each of their collections.posts accordingly too.

That is all there is to do for now. Execute docpad run, and visit the newly extracted blog in a browser!

Where to from here?

One task in blog extraction that we have not covered here is handling static assets - most notably images - that may have been hosted on your previous blogs. If you hosted these on CDNs, they will continue to work; otherwise, you will need to extract them too.

Another extraction task that we have not covered is links between posts. Since we have preserved the path of each post's URL here, these should not pose a problem.

Handling the static assets involves parsing the URLs in each post's content, be it href attributes in <a> tags or src attributes in <img> tags, and downloading and saving the referenced files too.

Migrating from Tumblr and Wordpress to Docpad - Static Site Generation

I currently write my blog using tumblr, and previously I blogged using wordpress. While both of these are great platforms, they share common pitfalls when it comes to giving you control over your writing.

I wanted to be able to have a copy of all the assets that comprise my blog, in its entirety, on my hard disk, and be able to modify and publish them as I pleased. I also wanted to be able to include fancier things in my pages - like embed a Github gist, or create my own d3 visualisation, or, well why not take it to an extreme, create an AngularJs app running within one of my posts; and I wanted to be able to do all of these things without having to log into some website hosted in a far away country, and wait for all those bytes to fly across several oceans and back each time.

Flexibility and control - that is key.

Enter Static Site Generators

For a blog, the contents are almost static. The server only needs to send a different response for a page when that page has been modified by the author. The exception to this is comments, but with the advent of disqus, that is no longer even a consideration.

A content management system, including both tumblr and wordpress, builds each page on demand, which can be an expensive operation, as it involves database queries, assembly of templates, et cetera. Quite often, when a CMS-driven site receives a lot of concurrent visitors, its response times start to lag noticeably. To work around this, it has become common practice to cache the results of each dynamically generated page, using tools like memcached.

Static site generation is all about taking caching to the next level. The author of the site knows exactly when the previous cache needs to be invalidated: when they write a new post or update an existing one. Why not, at that point in time, generate the cache contents and upload them directly to the server? Well, that is exactly what static site generators do; the static files are the cache.

What about collaboration?

One of the big advantages of a CMS is that it enables collaboration. If everyone just logs into the same website, be it wordpress.org or tumblr.com, and makes their edits on the site, then there is only one copy of the site, and therefore it is easy to manage collaboration on its contents.

Indeed that is a very direct and simple solution that addresses collaboration. We do, however, have a more sophisticated solution, that is already readily available: distributed version control systems. Tools such as git and mercurial have solved the distributed collaboration problem in a rather elegant way. All collaborators get to keep a copy of the site that they are contributing to on their own computers, and thus get the benefits that come along with that. When they are done writing a post, they simply have to push their latest contributions to the master copy. There are built in mechanisms to resolve any conflicts, for example, if two collaborators edit the same file.

Docpad

After reviewing the top few in this humungous list, I have decided that Docpad suits my needs the best, and I should be able to hit the ground running. I will give it a go, and the best part is, if I do not like it, my data is not stuck on some server somewhere - it will all be on my computer, and easily moved to a different static site generator.

In the next post, I will be tackling that very problem: With hosted CMSs, like tumblr and wordpress, getting your data out can be a little tricky; as can be transforming it such that it can be used in a static site generator.