Whitey's Coding Blog: 2012

Friday, June 1, 2012

Hadoop - Removing Empty Output Files

Sometimes you have a Map Reduce job that runs 100/1000's of mappers or reducers where a number of tasks don't output any K/V pairs. Unfortunately Hadoop still creates the part-m-xxxxx or part-r-xxxxx files, but the contents of those files are empty.

With a little code however, you can extend most FileOutputFormat sub-classes to avoid committing these files to HDFS when the tasks complete. You need to override two methods from the OutputFormat:

getRecordWriter - Wrap the writer to track that something was output via the Context.write(K, V) method
getOutputCommitter - Extend the OutputCommitter to override the needsTaskCommit(Context) method

Here's an example for extending SequenceFileOutputFormat:

Thursday, May 24, 2012

Building Hadoop Source Jars

Script to build yourself the source jars for hadoop core:

Download a hadoop release from http://www.apache.org/dyn/closer.cgi/hadoop/common/
Unpack the tar.gz to a folder (in my case /opt/hadoop/)
Copy the script below into a file names mk-source.sh (Amend the VERSION and HADOOP_SRC variables for your environment)

Optionally install the sources into your local maven repository

mvn install:install-file \
    -Dfile=/opt/hadoop/hadoop-0.20.2/src/hadoop-core-0.20.2-sources.jar \
    -DgroupId=org.apache.hadoop \
    -DartifactId=hadoop-core \
    -Dversion=0.20.2 \
    -Dpackaging=java-sources \
    -DgeneratePom=false

Tuesday, May 22, 2012

Hadoop Local Map Aggregation

The much cited 'word-count' Hadoop map-reduce example has (in my opinion) some fundamental bad-practices with regards to map-reduce processing. In it, the mapper outputs a single for each word processed in the input file. As a simple example on a small dataset this is fine to teach the principals, but too often i see people apply this method of element 'counting' to huge datasets.

There are a number of improvements that can be made, and in this post, I outline a map-aggregation library i've put together (available on github) to address these problems

Limited domain keys

Given a body of text, you can expect the vast majority of words to be contained within a small set or domain of words. A number of words such as 'a, the, and' etc are used over and over again in bodies, and your mapper can be designed to account for this by using a concept of local map-aggregation.

Rather than outputting a pair for each word observed, you can maintain a local map of and update the IntWritable value as each word is discovered. You can easily store a map of some 60,000 word-frequency pairs in a in-memory map, and at the end of your mapper, dump the contents of the map to the output context. This will greatly reduce the volume of data sent between the mappers and reducers. Of course you could also implement this using a Combiner, but you're still hitting the local disk, and paying the cost of a map-side sort of this data.

map-aggregation-lib - amended word count example

Amending the word count mapper to use the map-aggregation-lib is easy - using an instance of the IntWritableAccumulateOutputAggregator class, we can aggregate the word counts in memory rather than delegating all the work to the combiner:

Monday, April 9, 2012

Part 2 - x509 Authentication with Spring, Eclipse, Jetty and Maven

Introduction

I've seen a few posts to Stack Overflow recently regarding x509 authentication using Spring. I know from my own experience that finding a single tutorial that has everything in one place is difficult, so I'd thought I would put one together that covers pretty much everything you need to get a simple web application going

I'm going to break this up into 4 parts:

Part 1 : Generation of a client & server self-signed certificates (with common self-signed CA root certificate)
Part 2 : Maven web application archetype generation and maven-jetty-plugin configuration
Part 3 : Using a simple in-memory authorization provider
Part 4 : Web application debugging using Eclipse

Part 2 - Maven web application

I am going to present the generation of a simple Java web application using a Maven archetype, and configuration of the maven-jetty-plugin to allow local testing

Maven archetype generation

I'm using a Maven archetype to build a template for my web application project, which can be easily accomplished with the following command: After which you'll have a Maven managed Java project, with templated folders and files for a simple web application:

./whitey-webapp
./whitey-webapp/src
./whitey-webapp/src/main
./whitey-webapp/src/main/resources
./whitey-webapp/src/main/webapp
./whitey-webapp/src/main/webapp/WEB-INF
./whitey-webapp/src/main/webapp/WEB-INF/web.xml
./whitey-webapp/src/main/webapp/index.jsp
./whitey-webapp/pom.xml

Maven-jetty-plugin configuration

We now want to configure the maven-jetty-plugin, which will allow us to bring up a jetty web container instance and host our web application.

Open up the pom.xml and add the following plugin definition into your project->build->plugins section: You'll notice we're referencing the server certificate and trust store created in part 1 - be sure to copy them into the src/test/certs folder before continuing (you'll need to create the directory). You should now be able to use Maven to run the jetty:run goal and test your jetty configuration: Note: If you get an error about the org.maven.plugins:maven-jetty-plugin not existing then you'll need to add a section to your ~/.m2/settings.xml file:

Browser validation

So now you've got the jetty server up and running, open up your browser and go to https://localhost:8443/whitey-webapp, you should get an error to the effect of ssl_error_bad_cert_alert (Firefox). This is for a number of reasons:

Your browser doesn't trust the server, and doesn't have a client certificate compatible for the server's trust store
The server requires a client certificate (that's the needClientAuth=true portion of the maven-jetty-plugin configuration)

To enable us to successfully connect to the server, you'll now need to configure your browser. Without going into details for each and every browser out there you need to complete two things:

Import the CA root certificate (ca.crt) generated in part 1 into your browsers list of trusted Authorities
Import the client certificate (client.p12) generated in part 1 into your browsers list of user certificates

With this complete, point your browser at https://localhost:8443/whitey-webapp. You'll most probably get a warning message - because the server certificate's common name (Test Server) doesn't match the name of the server (localhost). You should be able to convince your browser to ignore this message and continue.

With any luck, you should see a Hello World web page, which has been delivered over SSL

Part 3 - x509 Authentication with Spring, Eclipse, Jetty and Maven

Introduction

Part 1 : Generation of a client & server self-signed certificates (with common self-signed CA root certificate)
Part 2 : Maven web application archetype generation and maven-jetty-plugin configuration
Part 3 : Using a simple in-memory authorization provider
Part 4 : Web application debugging using Eclipse

Part 3 - Using a simple authorization provider

In this installment I'm going to configure spring security for x509 Pre authentication, extract the users name from their client certificate and look up their credentials in an in-memory authorization provider

Maven Dependencies

We need to add a Spring filter to our web application, this filter will be responsible for providing the security layer for the web app. Firstly we need to add some Spring dependencies to our Maven pom.xml:

Web Application Deployment Descriptor

Next we need to amend the deployment descriptor (web.xml) to add in a Spring DelegatingProxyFilter filter and a listener:

Base Spring Configuration

Now we need to create the base Spring configuration XML files. For this we need to create the file referenced in the web.xml: /WEB-INF/applicationContext.xml: This configuration file simply calls out to another configuration file, this time via the classpath prefix. This allows us to amend the classpath order during testing, allowing for different settings for test vs production environments.

Security Configuration

Now for the security configuration file: src/main/webapp/WEB-INF/classes/config/security.xml:

Displaying User Information

To display the user information, we can amend the index.jsp to dump the contents of the request's user Principal and the user name:

Testing It All Works

To test everything we just configured, restart your jetty test server Point your browser to https://localhost:8443/whitey-webapp/ and again, with some luck, you should get the following:

Spring Security

User principal: org.springframework.security.web.authentication.preauth.PreAuthenticatedAuthenticationToken@4032854a: Principal: org.springframework.security.core.userdetails.User@bb6965d9: Username: jsmith; Password: [PROTECTED]; Enabled: true; AccountNonExpired: true; credentialsNonExpired: true; AccountNonLocked: true; Granted Authorities: ROLE_ADMIN,ROLE_USER; Credentials: [PROTECTED]; Authenticated: true; Details: org.springframework.security.web.authentication.WebAuthenticationDetails@957e: RemoteIpAddress: 127.0.0.1; SessionId: null; Granted Authorities: ROLE_ADMIN, ROLE_USER
User name: jsmith

Wednesday, April 4, 2012

Part 1 - x509 Authentication with Spring, Eclipse, Jetty and Maven

Introduction

Part 1 : Generation of a client & server self-signed certificates (with common self-signed CA root certificate)
Part 2 : Maven web application archetype generation and maven-jetty-plugin confguration
Part 3 : Using a simple in-memory authorization provider
Part 4 : Web application debugging using Eclipse

Part 1 - Certificate generation

I am going to present the generation of the client and server certificates signed by a common CA root certificate. Using a common root certificate is more like what you would have in a production environment and makes the creation of a simple trust stores easy.

Automated Script

So time for a bit of an explanation. We need to create 4 components:

CA root certificate - This is our Certificate Authority (CA). Using this certificate, we can create certificates for the client and server which are issued by the CA.
- CA.key - Private key for the CA
- CA.crt - Public key for the CA
Server certificate - This is our Server certificate - used by the web server to provide identity information about itself to clients.
- server.key - Private key for the server
- server.csr - Certificate request for the server (used by the CA when creating the server public key)
- server.crt - Public key for the server
Client certificate - This is our Client certificate - used by the browser or another client (java) to provide identity to the server.
- client.key - Private key for the client
- client.csr - Certificate request for the client (used by the CA when creating the client public key)
- client.crt - Public key for the client
Java trust store - This is our store of certificates that should be trusted. It's used by the web server or java client to validate whether the certificate for the other side of the link can be used to communicate. By using a common CA root certificate, we can create and use a single truststore on both the web server and java client.