Saturday, October 20, 2012

Mythbuster












I'm asked to use an automated code review tool at work. It's not FindBugs; nevermind the name. Given a repository URL, it checks for architectural integrity, programming practices, and documentation. It reports scores on the academic scale, with zero being the worst and 4.0 being the best possible score. It flags offensive components and presents citations to buttress its arguments.

I'm particularly bothered by one offense against architectural integrity. It says that instantiation of objects inside a loop is inefficient and gives the lowest possible architectural score. The goal is to avoid instantiation of too many objects.

How on earth is a developer supposed to populate a list of objects?

Besides, this might have been a serious issue on the earliest versions of the JVM. But object creation now rivals C in speed, and newer generational garbage collectors are also big improvements.

Rather than rail against the injustice, I decided to create a small benchmark and see for myself.

I started with a simple Person class:

package cast;

import org.apache.commons.lang3.StringUtils;

import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;

/**
 * Person
 * @author mduffy
 * @since 9/19/12 7:38 AM
 */
public class Person {
    private static final DateFormat BIRTH_DATE_FORMATTER;
    
    private final String firstName;
    private final String lastName;
    private final Date birthDate;

    static {
        BIRTH_DATE_FORMATTER = new SimpleDateFormat("yyyy-MMM-dd");
        BIRTH_DATE_FORMATTER.setLenient(false);
    }
    
    public Person(String firstName, String lastName, Date birthDate) {
        this.firstName = (StringUtils.isBlank(firstName) ? "" : firstName);
        this.lastName = (StringUtils.isBlank(lastName) ? "" : lastName);
        this.birthDate = ((birthDate == null) ? new Date() : new Date(birthDate.getTime()));
    }

    public String getFirstName() {
        return firstName;
    }

    public String getLastName() {
        return lastName;
    }

    public Date getBirthDate() {
        return new Date(birthDate.getTime());
    }

    @Override
    public String toString() {
        final StringBuilder sb = new StringBuilder();
        sb.append("Person");
        sb.append("{firstName='").append(firstName).append('\'');
        sb.append(", lastName='").append(lastName).append('\'');
        sb.append(", birthDate=").append(BIRTH_DATE_FORMATTER.format(this.birthDate));
        sb.append('}');
        return sb.toString();
    }
}

I needed to keep statistics on wall time and heap used, so I wrote a Statistics class:

package cast;

import org.apache.commons.lang3.StringUtils;

import java.util.Collection;

/**
 * Statistics accumulates simple statistics for a given quantity "on the fly" - no array needed.
 * Resets back to zero when adding a value will overflow the sum of squares.
 * @author mduffy
 * @since 9/19/12 8:16 AM
 */
public class Statistics {
    private String quantityName;
    private int numValues;
    private double x;
    private double xsq;
    private double xmin;
    private double xmax;

    /**
     * Constructor
     */
    public Statistics() {
        this(null);
    }

    /**
     * Constructor
     * @param quantityName to describe the quantity (e.g. "heap size")
     */
    public Statistics(String quantityName) {
        this.quantityName = (StringUtils.isBlank(quantityName) ? "x" : quantityName);
        this.reset();
    }

    /**
     * Reset the object in the event of overflow by the sum of squares
     */
    public synchronized void reset() {
        this.numValues = 0;
        this.x = 0.0;
        this.xsq = 0.0;
        this.xmin = Double.MAX_VALUE;
        this.xmax = -Double.MAX_VALUE;
    }

    /**
     * Add a List of values
     * @param values to add to the statistics
     */
    public synchronized void addAll(Collection values) {
        for (Double value : values) {
            add(value);
        }
    }

    /**
     * Add an array of values
     * @param values to add to the statistics
     */
    public synchronized void allAll(double [] values) {
        for (double value : values) {
            add(value);
        }
    }
    
    /**
     * Add a value to current statistics
     * @param value to add for this quantity
     */
    public synchronized void add(double value) {
        double vsq = value*value;
        ++this.numValues;
        this.x += value;
        this.xsq += vsq; // TODO: how to detect overflow in Java?
        if (value < this.xmin) {
            this.xmin = value;
        }
        if (value > this.xmax) {
            this.xmax = value;
        }
    }

    /**
     * Get the current value of the mean or average
     * @return mean or average if one or more values have been added or zero for no values added
     */
    public synchronized double getMean() {
        double mean = 0.0;
        if (this.numValues > 0) {
            mean = this.x/this.numValues;
        }
        return mean;
    }

    /**
     * Get the current min value
     * @return current min value or Double.MAX_VALUE if no values added
     */
    public synchronized double getMin() {
        return this.xmin;
    }

    /**
     * Get the current max value
     * @return current max value or Double.MIN_VALUE if no values added
     */
    public synchronized double getMax() {
        return this.xmax;
    }

    /**
     * Get the current standard deviation
     * @return standard deviation for (N-1) dof or zero if one or fewer values added
     */
    public synchronized double getStdDev() {
        double stdDev = 0.0;
        if (this.numValues > 1) {
            stdDev = Math.sqrt((this.xsq-this.x*this.x/this.numValues)/(this.numValues-1));
        }
        return stdDev;
    }

    /**
     * Get the current number of values added
     * @return current number of values added or zero if overflow condition is encountered
     */
    public synchronized int getNumValues() {
        return this.numValues;
    }

    @Override
    public String toString() {
        final StringBuilder sb = new StringBuilder();
        sb.append("Statistics");
        sb.append("{quantityName='").append(quantityName).append('\'');
        sb.append(", numValues=").append(numValues);
        sb.append(", xmin=").append(xmin);
        sb.append(", mean=").append(this.getMean());
        sb.append(", std dev=").append(this.getStdDev());
        sb.append(", xmax=").append(xmax);
        sb.append('}');
        return sb.toString();
    }
}

Then I wrote a test method to exercise creating a million instances and compared the two idioms:

package cast;

import java.util.Date;
import java.util.LinkedList;
import java.util.List;
import java.util.Random;

/**
 * InstantiationInsideLoopTest attempts to answer the question: Is it bad to instantiate an object inside a loop
 * with a modern JVM?
 * @author mduffy
 * @since 9/19/12 7:35 AM
 */
public class InstantiationInsideLoopTest {
    private static final int DEFAULT_INSTANCES = 1000000;
    private static final int DEFAULT_REPETITIONS = 5;
    private static final String[] FIRST_NAMES = {
            "Abel", "Bob", "Charlie", "David", "Edward", "Fred",
    };
    private static final String[] LAST_NAMES = {
            "Smith", "Jones", "Carver", "Brown",
    };

    private Random random;
    private Statistics begHeapStats     = new Statistics("beg heap (MB)");
    private Statistics endHeapStats     = new Statistics("end heap (MB)");
    private Statistics wallTimeStats    = new Statistics("wall time (sec)");


    public static void main(String[] args) {
        int numRepetitions = ((args.length > 0) ? Integer.valueOf(args[0]) : DEFAULT_REPETITIONS);
        int numInstances = ((args.length > 1) ? Integer.valueOf(args[1]) : DEFAULT_INSTANCES);
        boolean isInside = (args.length > 2) && "inside".equalsIgnoreCase(args[2]);
        System.out.println(String.format("# reps: %d # instances: %d inside? %b", numRepetitions, numInstances, isInside));
        InstantiationInsideLoopTest tester = new InstantiationInsideLoopTest();
        if (isInside) {
            tester.instantiateInside(numInstances, numRepetitions);
        } else {
            tester.instantiateOutside(numInstances, numRepetitions);
        }
    }

    public InstantiationInsideLoopTest() {
        this(System.currentTimeMillis());
    }

    public InstantiationInsideLoopTest(long seed) {
        this.random = new Random(seed);
    }

    private void instantiateOutside(int numInstances, int numRepetitions) {
        this.begHeapStats = new Statistics("beg heap (MB)");
        this.endHeapStats = new Statistics("end heap (MB)");
        this.wallTimeStats = new Statistics("wall time (sec)");
        for (int i = 0; i < numRepetitions; ++i) {
            long begTime = System.currentTimeMillis();
            long begHeap = Runtime.getRuntime().freeMemory();
            List persons = new LinkedList();
            Person p;
            for (int j = 0; j < numInstances; ++j) {
                p = new Person(getFirstName(), getLastName(), getBirthDate());
                persons.add(p);
            }
            long endHeap = Runtime.getRuntime().freeMemory();
            long endTime = System.currentTimeMillis();
            this.begHeapStats.add(begHeap/1024L/1024L);
            this.endHeapStats.add(endHeap/1024L/1024L);
            this.wallTimeStats.add((endTime - begTime)/1000.0);
//          summarize("outside", i, begTime, endTime, maxHeap, begHeap, endHeap, persons);
        }
        System.out.println(this.begHeapStats);
        System.out.println(this.endHeapStats);
        System.out.println(this.wallTimeStats);
    }

    private void instantiateInside(int numInstances, int numRepetitions) {
        this.begHeapStats = new Statistics("beg heap (MB)");
        this.endHeapStats = new Statistics("end heap (MB)");
        this.wallTimeStats = new Statistics("wall time (sec)");
        for (int i = 0; i < numRepetitions; ++i) {
            long begTime = System.currentTimeMillis();
            long begHeap = Runtime.getRuntime().freeMemory();
            List persons = new LinkedList();
            for (int j = 0; j < numInstances; ++j) {
                Person p = new Person(getFirstName(), getLastName(), getBirthDate());
                persons.add(p);
            }
            long endHeap = Runtime.getRuntime().freeMemory();
            long endTime = System.currentTimeMillis();
            this.begHeapStats.add(begHeap/1024L/1024L);
            this.endHeapStats.add(endHeap/1024L/1024L);
            this.wallTimeStats.add((endTime-begTime)/1000.0);
//            summarize("inside", i, begTime, endTime, maxHeap, begHeap, endHeap, persons);
        }
        System.out.println(this.begHeapStats);
        System.out.println(this.endHeapStats);
        System.out.println(this.wallTimeStats);
    }

    private String getFirstName() {
        return FIRST_NAMES[this.random.nextInt(FIRST_NAMES.length)];
    }

    private String getLastName() {
        return LAST_NAMES[this.random.nextInt(LAST_NAMES.length)];
    }

    private Date getBirthDate() {
        return new Date(this.random.nextLong());
    }

    private void summarize(String testName, int repetition, long begTime, long endTime, long maxHeap, long begHeap, long endHeap, List objects) {
        System.out.println(String.format("test name    : %s", testName));
        System.out.println(String.format("repetition # : %d", repetition));
        System.out.println(String.format("list size    : %d", objects.size()));
        System.out.println(String.format("beg heap (MB): %d", begHeap / 1024L / 1024L));
        System.out.println(String.format("end heap (MB): %d", endHeap / 1024L / 1024L));
        System.out.println(String.format("max heap (MB): %d", maxHeap / 1024L / 1024L));
        System.out.println(String.format("wall time (s): %10.3f", (endTime - begTime) / 1000.0));
        for (int i = 0; i < 5; ++i) {
            System.out.println(objects.get(i));
        }
    }
}
I ran this test case on two different machines - one running Windows XP, another using Windows 7 - and both running Sun JVMs for Java 6. The results consistently said that the idiom recommended by the automated code review suite was incorrect. I understand the dangers of benchmarks like these, but it's comforting to attempt to be scientific rather than merely complain.
profile for duffymo at Stack Overflow, Q&A for professional and enthusiast programmers

3 comments:

Pete Kirkham said...

Looking at the disassembly, the only difference between the loops is that the variable slots for `i` and `p` the are reversed, which I would have thought should not make any difference once the code is compiled and they are mapped into native registers (I'm using javac 1.6). The scope of `p` doesn't effect the reachability of the object it refers to, as it isn't used outside the loop. I don't see why the tool would favour one form over the other, unless it has some very simple heuristic that looks at the scope of `p` and incorrectly decides it's used outside the loop, and so doesn't trigger the rule.

Pete Kirkham said...

Looking at the disassembly, the only difference between the loops is that the variable slots for `i` and `p` the are reversed, which I would have thought should not make any difference once the code is compiled and they are mapped into native registers (I'm using javac 1.6). The scope of `p` doesn't effect the reachability of the object it refers to, as it isn't used outside the loop. I don't see why the tool would favour one form over the other, unless it has some very simple heuristic that looks at the scope of `p` and incorrectly decides it's used outside the loop, and so doesn't trigger the rule.

Pete Kirkham said...

Looking at the disassembly, the only difference between the loops is that the variable slots for `i` and `p` the are reversed, which I would have thought should not make any difference once the code is compiled and they are mapped into native registers (I'm using javac 1.6). The scope of `p` doesn't effect the reachability of the object it refers to, as it isn't used outside the loop. I don't see why the tool would favour one form over the other, unless it has some very simple heuristic that looks at the scope of `p` and incorrectly decides it's used outside the loop, and so doesn't trigger the rule.