Friday, August 31, 2012

Publishing Your Dissertation With LaTeX

I published my doctoral dissertation. As I had written it in LaTeX in the first place, the publisher agreed to let me do the typesetting myself. I believed this would spare me a load of extra work converting everything to MS Word, along with all the annoyances that come with that. While I am happy with the result in terms of typesetting quality, doing it all in LaTeX turned out to be quite a messy process that took me almost a year.

1. Fundamentals: jurabib hacking

Most of the complexity was hidden in subtle problems I had not foreseen. The main obstacle, which consumed most of my time, was caused by a decision I had made years earlier: to use the jurabib package to manage my references. By the time I was ready to publish the dissertation, the package was no longer actively maintained. Its successor, biblatex, looked very promising but lacked one feature I could not live without: support for archival records, which is essential for historians. I ended up adapting the official jurabib package to my needs, which took me at least three months, working in my spare time after my full-time job.

Before that I had only been a user of LaTeX. While I felt comfortable with the higher-level interface, I had to learn (La)TeX internals from scratch in order to do more serious work. I got there mostly by reading parts of Knuth's TeXbook and picking up hints from the Internets. What I found out about jurabib was not very encouraging. Some implementation details seemed like a hack to me once I started to understand what was going on, and my modifications did not make it any nicer or more robust; I just tried to coerce it into following my will. The result is probably not worth sharing, but it did what I needed it to do.

2. XeLaTeX or LaTeX?

I first jumped onto the XeLaTeX train, which seemed to make my life easier because I was able to use the OpenType fonts installed on my system. At the same time I had to convert everything into UTF-8 because, for some unknown reason, I could not get the

\XeTeXinputencoding

switch to work. But the showstopper was that the quality of the typesetting simply did not match the printed books I used as references. The main reason is that XeLaTeX does not yet support the microtypography features that are available in LaTeX through the microtype package. So I reverted to LaTeX and used fontools' autoinst, which is basically a wrapper around the LCDF TypeTools. It turned out to be not hard at all to use my system's OpenType fonts that way.
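
For reference, the pdfLaTeX route is roughly this ("MyFont" is a hypothetical family name standing in for the actual fonts): run autoinst once over the font files and register the generated map file with updmap,

autoinst MyFont-*.otf

and the preamble then only needs the generated style file plus microtype:

\usepackage[T1]{fontenc}
\usepackage{MyFont}    % style file written by autoinst
\usepackage{microtype} % character protrusion and font expansion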

3. Adaptation to my publisher's requirements

My only guidance as to what the result should look like was an example of another book published in the same series and some general information from my publisher. I used that, together with Mac OS X's Preview app, to get a feel for the sizes and proportions of the intended result, and used WhatTheFont to identify the fonts. It felt weird figuring out the layout that way. I think this awkwardness was due to the publisher's workflow clearly being geared towards an MS Word-centric approach, where all the typesetting happens at the publishing house. I had used KOMA-Script to typeset the initial version of the text, which I submitted to my university. This turned out to be the next roadblock: there was a stark dissonance between what my publisher demanded and what the author of the KOMA-Script classes deemed acceptable in his effort to educate users towards his understanding of typesetting. Still, I was reluctant to abandon the KOMA-Script packages completely, as they offered much-valued functionality in other areas.

My main difficulty was font sizes, where I had to do nasty things like this:


\KOMAoption{fontsize}{10.5pt}
\makeatletter
% Hack: force \normalsize to an 11pt font on a 12.75pt baseline,
% overriding the 10.5pt base size set above.
\def\normalsize{%
  \@setfontsize\normalsize\@xipt{12.75}%
  \abovedisplayskip 11\p@ \@plus3\p@ \@minus6\p@
  \abovedisplayshortskip \z@ \@plus3\p@
  \belowdisplayshortskip 6.5\p@ \@plus3.5\p@ \@minus3\p@
  \belowdisplayskip \abovedisplayskip
  \let\@listi\@listI
}
\makeatother

I simply did not find another solution. If I had had the time, the clean solution would probably have been to write my own document class based on one of the default LaTeX classes instead.

4. Indices

Just one thing: make sure they are sorted correctly even if you have words/titles with umlauts in them ...
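
makeindex's default sort order tends to treat umlauts as raw bytes and pushes them to the end of the alphabet; xindy gets this right. A minimal sketch using the texindy wrapper with German DIN sorting on a UTF-8 encoded index file (the file name is hypothetical):

texindy -L german-din -C utf8 dissertation.idx

This sorts ä, ö and ü together with a, o and u, as DIN 5007 prescribes.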

Conclusion

What does this all mean? Don't do LaTeX? Not at all, quite the opposite: do it! Apart from the jurabib issue, which was really painful, the points mentioned here were lessons to be learned rather than insurmountable obstacles. So by all means do it, but choose your packages wisely. Using TeX saved me a lot of tedious manual work in preparing the indices and managing references, and it gave me professional-grade typesetting on top of that.

Thursday, August 30, 2012

The Cost of Apache Commons HashCode

I had a discussion at work today about the cost of object creation in Java. It was about the way you are supposed to use Apache Commons' HashCodeBuilder and EqualsBuilder, creating a new object on every single call to hashCode() or equals():

@Override
public int hashCode(){
    return new HashCodeBuilder()
        .append(name)
        .append(length)
        .append(children)
        .toHashCode();
}
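
The matching equals() follows the same pattern; a sketch with the same illustrative fields (the surrounding class, called MyValueType here, is made up for the example):

@Override
public boolean equals(Object obj) {
    if (obj == this) {
        return true;
    }
    if (!(obj instanceof MyValueType)) {
        return false;
    }
    MyValueType other = (MyValueType) obj;
    return new EqualsBuilder()
        .append(name, other.name)
        .append(length, other.length)
        .append(children, other.children)
        .isEquals();
}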

While I was pretty sure that the cost of creating these short-lived objects on a modern JVM is negligible in most use cases, I had no numbers to prove it.

Time for a biased benchmark of the simplest kind, like the following:

package org.blogspot.pbrc.benchmarks;

import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class Runner {

    private static final int NUM_ELEMS = 100000;
    private static final int NUM_RUNS = 100;

    public static void main(String[] args) {
        String[] keys = createKeys();
        Integer[] numerics = createNumericValues();

        // warm up the JVM so the JIT gets a chance to compile the hot paths
        for (int i = 0; i < 5; i++) {
            manualEqualsTest(i, keys, numerics);
            autoEqualsTest(i, keys, numerics);
        }

        long time = 0L;
        for (int i = 0; i < NUM_RUNS; i++) {
            time += manualEqualsTest(i, keys, numerics);
        }
        System.out.format("%s %.4fsecs\n", "Manual avg:",
                (time / Double.valueOf(NUM_RUNS)) / 1000000000.0);

        long autotime = 0L;
        for (int i = 0; i < NUM_RUNS; i++) {
            autotime += autoEqualsTest(i, keys, numerics);
        }
        System.out.format("%s %.4fsecs\n", "Apache commons avg:",
                (autotime / Double.valueOf(NUM_RUNS)) / 1000000000.0);
    }

    private static long autoEqualsTest(int runNumber, String[] keys,
            Integer[] numerics) {
        Set<CommonsEqualsAndHashcode> autoObjs =
                new HashSet<CommonsEqualsAndHashcode>(NUM_ELEMS);

        long start = System.nanoTime();
        for (int i = 0; i < NUM_ELEMS; i++) {
            autoObjs.add(new CommonsEqualsAndHashcode(keys[i], numerics[i]));
        }
        return System.nanoTime() - start;
    }

    private static long manualEqualsTest(int runNumber, String[] keys,
            Integer[] numerics) {
        Set<ManualEqualsAndHashCode> valueObjs =
                new HashSet<ManualEqualsAndHashCode>(NUM_ELEMS);

        long start = System.nanoTime();
        for (int i = 0; i < NUM_ELEMS; i++) {
            valueObjs.add(new ManualEqualsAndHashCode(keys[i], numerics[i]));
        }
        return System.nanoTime() - start;
    }

    private static Integer[] createNumericValues() {
        Integer[] result = new Integer[NUM_ELEMS];
        Random rand = new Random();
        for (int i = 0; i < NUM_ELEMS; i++) {
            result[i] = rand.nextInt();
        }
        return result;
    }

    private static String[] createKeys() {
        String[] result = new String[NUM_ELEMS];
        RandomString rand = new RandomString(32);
        for (int i = 0; i < NUM_ELEMS; i++) {
            result[i] = rand.nextString();
        }
        return result;
    }
}
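
The RandomString helper used in createKeys() is not part of the JDK and its implementation is not shown here; a minimal sketch that matches its usage above (a constructor taking the length, nextString() returning a fresh random string) would be:

package org.blogspot.pbrc.benchmarks;

import java.util.Random;

public class RandomString {

    private static final char[] ALPHABET =
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".toCharArray();

    private final Random random = new Random();
    private final char[] buf;

    public RandomString(int length) {
        this.buf = new char[length];
    }

    // fills the buffer with random characters from ALPHABET
    public String nextString() {
        for (int i = 0; i < buf.length; i++) {
            buf[i] = ALPHABET[random.nextInt(ALPHABET.length)];
        }
        return new String(buf);
    }
}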

I used two otherwise identical immutable value types: one with hand-crafted equals() and hashCode(), the other using the Apache Commons HashCodeBuilder and EqualsBuilder.

package org.blogspot.pbrc.benchmarks;

public class ManualEqualsAndHashCode {

    public final String val;
    public final Integer numeric;

    public ManualEqualsAndHashCode(String val, Integer numeric) {
        this.val = val;
        this.numeric = numeric;
    }

    @Override
    public boolean equals(Object obj) {
        if (obj == this) {
            return true;
        }
        if (!(obj instanceof ManualEqualsAndHashCode)) {
            return false;
        }
        ManualEqualsAndHashCode other = (ManualEqualsAndHashCode) obj;
        return other.val.equals(this.val) && other.numeric.equals(this.numeric);
    }

    @Override
    public int hashCode() {
        // classic Effective Java recipe: start from a non-zero constant
        // and combine each field with a small odd prime
        int result = 17;
        result = 31 * result + val.hashCode();
        result = 31 * result + numeric.hashCode();
        return result;
    }
}
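
The Commons-based counterpart is not shown in the post either; a minimal sketch mirroring ManualEqualsAndHashCode, assuming the commons-lang3 builders (in commons-lang 2.x they live in org.apache.commons.lang.builder instead):

package org.blogspot.pbrc.benchmarks;

import org.apache.commons.lang3.builder.EqualsBuilder;
import org.apache.commons.lang3.builder.HashCodeBuilder;

public class CommonsEqualsAndHashcode {

    public final String val;
    public final Integer numeric;

    public CommonsEqualsAndHashcode(String val, Integer numeric) {
        this.val = val;
        this.numeric = numeric;
    }

    @Override
    public boolean equals(Object obj) {
        if (obj == this) {
            return true;
        }
        if (!(obj instanceof CommonsEqualsAndHashcode)) {
            return false;
        }
        CommonsEqualsAndHashcode other = (CommonsEqualsAndHashcode) obj;
        // allocates a fresh EqualsBuilder on every call
        return new EqualsBuilder()
                .append(val, other.val)
                .append(numeric, other.numeric)
                .isEquals();
    }

    @Override
    public int hashCode() {
        // allocates a fresh HashCodeBuilder on every call
        return new HashCodeBuilder()
                .append(val)
                .append(numeric)
                .toHashCode();
    }
}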

I got the following results on a 2 GHz Intel Core 2 Duo with 4 GB of 1067 MHz DDR3 RAM, running OS X 10.8.1 (12B19) with Java 1.7.0_06:


Manual avg: 0.4052secs
Apache commons avg: 0.4508secs

In this particular scenario, the overhead of using the Apache Commons builders amounts to about ten percent. Depending on where you're coming from, this might be a price worth paying. But of course this test scenario of just putting items into a collection is highly synthetic, so your mileage may vary.

Now go and find more flaws in the benchmark!