Oh, the joys of goosing the performance while reducing the memory hit!
One of our customers needs to create million-row spreadsheets. Their use case makes a lot of sense. The people who get the spreadsheets use pivot tables and filtering to get just the parts they need, and this way anyone can pull any of the data.
But when they first tried this, it wasn’t pretty.
When you do something a million times, the time and memory hits add up.
In addition, when you have a million of something, code that runs just fine for 40-page spreadsheets suddenly slows to a crawl and/or gobbles up a ton of memory.
Not good when your goal always remains to deliver reporting “faster.” So we worked through how to speed up the process.
Once we got their basic sample cooking (just one minute to complete), I then had a sales engineer here create some more sophisticated samples—and that exposed an additional set of problems.
So we worked on it some more and got it to the point where we can routinely generate a million-row spreadsheet (XLSX file) in 1 to 4 minutes.
The Big Lessons We Learned
- Turn debug mode off when measuring. Twice I freaked out at how slow it was after some changes, only to realize all the debug checking was on. (We run our code here with debug enabled, and the engine is about 4–10 times slower because of all the checks. But that's how we find subtle issues.)
- We had two places in the code where we stored objects in an array and, when we needed a specific one, walked the array to find it. We normally had 5–20 objects in the list, so it was no big deal. But in one of these samples we had 450K objects. That was a killer. So we moved both to a tree, and that time hit disappeared.
- Optimizing for speed and optimizing for reduced memory consumption are very different things. They're not antagonistic to each other; they're just different problems to solve. Converting a string to lowercase once and then using it 3 times in a method is a performance win too small to measure. But it's a significant garbage-collection win when that code runs 2 million times – we create one string instead of three.
- Switch from linked lists to stacks. We had a couple of small lists that we kept as linked lists due to the way we used them. Both have objects added and removed, which is generally not a big deal. But over the course of 1 million rows, that was a lot of list nodes being created and then later collected by the GC. We switched these to a stack, got almost the same performance, and eliminated several hundred thousand allocations/collections. (The win here: every element added to a linked list requires creating a node object, and every removal leaves that node for the GC to collect later. Pushing onto a stack, when the total size isn't changing appreciably, is just an assignment into an existing array.)
- You can skip the final Engine step. The final step of the Engine, before writing out the final report, is to do page layout. However, when generating an XLSX, PPTX, or HTML report, the only useful information from this is the number of pages generated. (For DOCX/RTF it is also needed to populate fields like NUMPAGES, TOC, etc.) We added a mode to eliminate this step. Skipping this step saved about 1/3 of the total time.
- Common numeric formatting takes noticeable time. Initially 20% of the time was spent turning numbers into strings (in the XLSX file all numbers are stored as strings), both for cell values and for formatted cells containing numbers. We got this down to 8% by adding special code to handle the common numeric formats. I would not have expected this to be a noticeable component, yet after doing everything we can think of, it's still 8% of the total.
- Add a cache for the CultureInfo object. In the number -> string formatting in the .NET engine, we get a CultureInfo object for the locale to format the string. Inside the CultureInfo object, a string was being created on each use, and again, that was 2 million creations of the same exact string. So we added a cache here (and in 4 other places for other object types) where we save the CultureInfo object for a given locale (the locale generally never changes), and on the next call, if it's the same locale, we just reuse that object. Big savings for the garbage collector. BTW, if you need a cache in Java, use LinkedHashMap. You can override its removeEldestEntry() method, which is called on each add, to throw away the oldest entry when the map grows past a defined size.
- Pool objects when possible. We also had a couple of objects we could pool. We need an object to iterate over the content of every paragraph in every cell. Rather than recreate that object for each paragraph, we keep a small pool of them and just reinitialize the pooled object.
- Some things are just annoyances that we can't change. One thing that was annoying: 1.2% of the time goes to calling BigDecimal.doubleValue() when all we need to know is whether a Number object is <, ==, or > 0. I wish they had added to Number the ability to get this property. And because we're processing a Number, doing an instanceof against each possible type just moves the time hit somewhere else.
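To illustrate the array-to-tree change above, here's a minimal Java sketch. The `StyleIndex` and `find` names are invented for illustration, and a `TreeMap` stands in for whatever tree structure the engine actually uses:

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch: replacing a linear array walk with a keyed tree lookup.
// StyleIndex and the String payload are invented names, not the engine's code.
public class StyleIndex {
    private final Map<Integer, String> byId = new TreeMap<>();

    public void add(int id, String value) {
        byId.put(id, value);
    }

    // O(log n) lookup instead of scanning a 450K-element array (O(n)).
    public String find(int id) {
        return byId.get(id);
    }
}
```

With 450K entries, each lookup drops from a full array walk to a handful of comparisons; a `HashMap` would serve equally well if ordering doesn't matter.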
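The lowercase example above, as a tiny sketch (method and variable names are invented):

```java
// Illustrative: hoist the conversion so one string is allocated instead of three.
public class TagMatcher {
    public static boolean matchesAny(String tag, String a, String b, String c) {
        String lower = tag.toLowerCase(); // one allocation...
        // ...three uses, instead of calling toLowerCase() in each comparison
        return lower.equals(a) || lower.equals(b) || lower.equals(c);
    }
}
```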
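In Java, the linked-list-to-stack swap above might look like this; `ArrayDeque` is one stack implementation that avoids per-element node allocation (the class and method names here are illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative: a stack over a reused backing array. Unlike LinkedList, where
// every add allocates a node object the GC later collects, pushing onto an
// ArrayDeque whose size stays roughly constant is just an index move and an
// assignment into the existing array.
public class ScopeStack {
    private final Deque<String> stack = new ArrayDeque<>();

    public void enter(String scope) { stack.push(scope); }
    public String leave()           { return stack.pop(); }
    public String current()         { return stack.peek(); }
}
```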
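As a rough illustration of the numeric-formatting fast path above (this is a guess at the shape, not Windward's actual code): handle the cheap, frequent case directly and fall back to the general formatter for everything else.

```java
// Hypothetical fast path: integral values that fit in a long get the cheap
// conversion; anything else falls through to the general path (represented
// here by Double.toString as a stand-in).
public class FastNumber {
    public static String toCellString(double value) {
        long asLong = (long) value;
        if (asLong == value) {
            return Long.toString(asLong); // common case: integral value
        }
        return Double.toString(value);    // stand-in for the general formatter
    }
}
```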
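The LinkedHashMap trick above looks like this; `removeEldestEntry()` is the member that gets called on each add. A `Locale` key stands in for the CultureInfo case:

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// A size-bounded cache on top of LinkedHashMap. removeEldestEntry() is invoked
// on every put; returning true evicts the eldest entry, so the map never grows
// past maxSize.
public class LocaleCache<V> extends LinkedHashMap<Locale, V> {
    private final int maxSize;

    public LocaleCache(int maxSize) {
        super(16, 0.75f, true); // access order, so "eldest" = least recently used
        this.maxSize = maxSize;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Locale, V> eldest) {
        return size() > maxSize;
    }
}
```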
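A minimal sketch of the pooling idea above, with an invented `ParagraphIterator` standing in for the real iterator object:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative object pool: acquire() reuses a parked object when one is
// available instead of allocating a fresh one; release() parks it for the
// next caller. The object is reinitialized via reset() on each acquire.
public class IteratorPool {
    public static final class ParagraphIterator {
        public String paragraph;
        public int position;

        void reset(String paragraph) {
            this.paragraph = paragraph;
            this.position = 0;
        }
    }

    private final Deque<ParagraphIterator> free = new ArrayDeque<>();

    public ParagraphIterator acquire(String paragraph) {
        ParagraphIterator it = free.isEmpty() ? new ParagraphIterator() : free.pop();
        it.reset(paragraph);
        return it;
    }

    public void release(ParagraphIterator it) {
        free.push(it);
    }
}
```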
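For the BigDecimal annoyance above, the instanceof dispatch the post mentions would look roughly like this. `BigDecimal` and `BigInteger` do expose `signum()` directly, but as noted, the type checks just move the cost rather than remove it:

```java
import java.math.BigDecimal;
import java.math.BigInteger;

// Roughly the instanceof dispatch described in the post. BigDecimal and
// BigInteger already have signum(); for everything else we still fall back
// to a numeric conversion.
public class Sign {
    public static int signum(Number n) {
        if (n instanceof BigDecimal) return ((BigDecimal) n).signum();
        if (n instanceof BigInteger) return ((BigInteger) n).signum();
        if (n instanceof Long || n instanceof Integer
                || n instanceof Short || n instanceof Byte) {
            return Long.signum(n.longValue());
        }
        double d = n.doubleValue();
        return d < 0 ? -1 : (d > 0 ? 1 : 0);
    }
}
```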
There was a lot more that went into this, but the above gives you most of the big wins that might be of use to you.
The end result? We can generate a 1-million-row XLSX file in 1–4 minutes, depending on what's in the template. Add in lots of out tags returning numbers with esoteric formatting, and you'll be up at the 4-minute end of the scale. And you can do this on a 16GB system (although 24GB would be better). I think that's pretty cool.
Have you generated million-row spreadsheets? Talk about it in The Windward Studio LinkedIn Group!
Author: David Thielen
Dave, Windward's founder and CEO, is passionate about building superb software teams from scratch and dramatically improving the productivity of existing software teams. He's really proud that he once created a game so compelling (Enemy Nations) that a now-professional World of Warcraft player lost his job for playing it incessantly on company time. You can read more from Dave on his personal blog, and at Huffington Post.
Other posts by David Thielen