When writeBatch to an HBase table generates an OOM (out-of-memory) error

Whenever I start to work with a new technology stack I spend the first few years knee-deep in doubt.

Not “why isn’t this thing working?” doubt. That kind of doubt comes after a few years of experience, when I’m sure that what I am trying to do should work.
The first few years are full of “I wonder what this thing is doing?” doubt, or “I wonder if I’m using this thing in the right way?” doubt.

 

Today, a Spark job (that batched data into an HBase table) threw an out-of-memory exception after inserting about 30% of the desired payload.
The knee-jerk response was to throw more hardware at it, but a bit of tinkering turned up a few discoveries. The most important was that the HTable class does some interesting things internally with threads: it appears to create a thread for each HTable.put(), and those threads are not cleaned up until you call HTable.close().
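
To make the problem concrete, here is roughly the shape of the write path I suspect was at fault (a sketch of my assumption, not the job’s actual code; tableConnection and tableName are the same placeholder names used in the fix below):

	// Sketch of the leaky pattern: per-put resources accumulate and are only
	// released by close(), which this method never calls.
	public void writeBatchLeaky(List<Put> puts) throws IOException {
		final Table table = tableConnection.getTable(tableName);
		for (Put put : puts) {
			table.put(put);
		}
		// no table.close() here, so nothing is released between batches
	}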

 

Today’s solution was to get a BufferedMutator for the HBase table and to explicitly call .flush() and .close() after every write.

	public void writeBatch(List<Put> puts) throws IOException {
		final BufferedMutator bufferedMutator = tableConnection.getBufferedMutator(tableName);
		try {
			bufferedMutator.mutate(puts);
			bufferedMutator.flush();
		} finally {
			// Close even if mutate/flush throws, so the mutator's resources are released.
			bufferedMutator.close();
		}
	}

This is probably not the correct way to do it, but it delivered the expected result (leave a comment if you can see something glaringly wrong).
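
One alternative I may try later (purely a sketch under assumptions — the class name and the 4 MB buffer size are made up, so take it with a grain of salt): keep a single BufferedMutator open for the life of the job, let it flush itself as its write buffer fills, and only close it once at the end.

	import java.io.IOException;
	import java.util.List;
	import org.apache.hadoop.hbase.TableName;
	import org.apache.hadoop.hbase.client.BufferedMutator;
	import org.apache.hadoop.hbase.client.BufferedMutatorParams;
	import org.apache.hadoop.hbase.client.Connection;
	import org.apache.hadoop.hbase.client.Put;

	public class HBaseBatchWriter implements AutoCloseable {
		private final BufferedMutator mutator;

		public HBaseBatchWriter(Connection tableConnection, TableName tableName) throws IOException {
			// One long-lived mutator; the 4 MB buffer size is an arbitrary assumption.
			this.mutator = tableConnection.getBufferedMutator(
					new BufferedMutatorParams(tableName).writeBufferSize(4L * 1024 * 1024));
		}

		public void writeBatch(List<Put> puts) throws IOException {
			mutator.mutate(puts); // buffered client-side, flushed automatically when the buffer fills
		}

		@Override
		public void close() throws IOException {
			mutator.close(); // flushes whatever is still buffered, then releases resources
		}
	}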