
High memory consumption while optimizing using InfluxDB data feed



  • Problem:

    I'm using InfluxDB to store historical data, so naturally the InfluxDB data feed is used for backtesting and optimizing the strategy.

    Trying to optimize the strategy on a ~10-year 5-minute data set with a few parameter ranges (resulting in 90 iterations) ran out of memory on my dev machine (12 cores, 12 GB RAM).

    Here are the cerebro flags I was using:

        cerebro = bt.Cerebro(maxcpus=args.maxcpus,
                             live=False,
                             runonce=True,
                             exactbars=False,
                             optdatas=True,
                             optreturn=True,
                             stdstats=False,
                             quicknotify=True)


    Analysis:

    After a bit of debugging, the problem appears to be in the InfluxDB data feed implementation, which lacks proper support for the preload function.

    In the current implementation, the data is loaded from the Influx database during the InfluxDB.start method, and the result-set is kept in memory for the lifetime of the InfluxDB instance. Even if cerebro preloads all the data, the result-set (which is no longer needed in that case) will still be in memory.
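
    A minimal sketch of that pattern (hypothetical class and attribute names, not the real feed code): the whole query result is materialized in start() and stays referenced by the feed instance, even after every bar has been delivered.

    ```python
    class CachedQueryFeed:
        """Illustrates the issue: the full result-set lives as long as the feed."""

        def __init__(self, rows):
            self._rows = rows  # stands in for the InfluxDB client/query

        def start(self):
            # Entire result-set materialized here and kept on self forever
            self.dbars = list(self._rows)
            self.biter = iter(self.dbars)

        def _load(self):
            # Delivers one bar at a time, but self.dbars is never released
            try:
                return next(self.biter)
            except StopIteration:
                return None
    ```

    Even after cerebro has copied all bars into its own line buffers, `self.dbars` still pins a second full copy of the data.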

    This is problematic when running an optimization, where multiprocessing.Pool and Pool.imap are used to run the strategy with all its parameter permutations concurrently.

    The way multiprocessing.Pool works (with the default start method on Linux, at least) is that the main process is simply forked for each worker process, and each worker inherits the main process's memory, which includes the memory allocated for the aforementioned result-set in the InfluxDB data feed. In addition, for each run of the strategy, the cerebro instance is serialized (pickled) and passed to a worker process; once again this includes the InfluxDB data feed's result-set, since the feed is directly referenced by the cerebro instance. This unnecessarily increases the memory pressure during the optimization. Please correct me if I'm wrong.
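
    The pickling side of this is easy to demonstrate in isolation (stand-in classes below, not backtrader's actual ones): anything reachable from the serialized object travels with it, so a cached result-set inflates every payload sent to the workers.

    ```python
    import pickle

    class Feed:
        def __init__(self, nrows):
            self.resultset = list(range(nrows))  # stands in for the query rows

    class Engine:  # stands in for the cerebro instance
        def __init__(self, feed):
            self.datas = [feed]

    heavy = len(pickle.dumps(Engine(Feed(100_000))))
    light = len(pickle.dumps(Engine(Feed(0))))
    print(heavy, light)  # the cached result-set dominates the pickled payload
    ```

    With 90 permutations, this payload is serialized and shipped 90 times on top of the memory each forked worker already inherited.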

    Here are the results of a memory profiler run (until the swap was exhausted, at which point the program practically hung):

    before_change.png

    Only 36 of the 90 iterations completed before the program hung.

    The jumps in memory load around times 200 and 350 are most probably caused by the worker processes starting a new iteration (I can't fully explain it yet, but it is likely because the gc is not triggered while a new cerebro instance is being serialized).

    Attempt to solve:

    Adding support for the preload function in the InfluxDB feed seems to substantially decrease the memory pressure during the optimization.
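
    Conceptually (a rough sketch of the idea only, with hypothetical names; backtrader's actual preload API and my patch differ in detail): when cerebro preloads, the feed can deliver every bar up front and then drop its reference to the result-set, so neither the forked workers nor the pickled copies carry it any longer.

    ```python
    class PreloadableFeed:
        """Sketch: release the result-set once the bars are handed over."""

        def __init__(self, rows):
            self._rows = rows  # stands in for the InfluxDB query result

        def preload(self, lines):
            # Push every bar into the engine's own buffers...
            for bar in self._rows:
                lines.append(bar)
            # ...then release the result-set: only `lines` holds the data now
            self._rows = None
    ```

    After preload, pickling the feed (as part of the cerebro instance) no longer drags the full result-set along for every permutation.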

    Here is the resulting memory consumption (the optimization completed, iterating over all 90 permutations):

    after_change_with_separate_mem_subprocess.png

    The memory usage is much lower now (4.5 GB vs. 12 GB+ before the fix).
    Still, it would be interesting to try to reduce the jumps in memory load after each iteration.

    The patch can be found in my fork: https://github.com/vladisld/backtrader/commit/ee71ecabf66f7d37afd38b1bf7d06f24f58de596



  • It seems the gradual memory consumption increase in each worker process could indeed be explained by the serialized Cerebro object passed to the worker process not being released in a timely manner.

    Using Pool.imap with chunksize=1 illustrates this (the chunksize argument controls how many work items are dispatched to a worker process at a time):

    after_change_chunk_1.png

    This reduces the memory consumption even further and, in my case, does not cause any significant performance degradation (since each work item takes far longer than the per-dispatch overhead), as is clearly visible in the graph above.
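
    For reference, the call pattern looks like this (a self-contained toy, not backtrader's optimization code; run_strategy stands in for one pickled-cerebro backtest iteration). The fork start method is used explicitly to match the Linux behaviour discussed above.

    ```python
    import multiprocessing as mp

    def run_strategy(params):
        # Stands in for one backtest iteration over a parameter permutation
        return params * 2

    # 'fork' is the default start method on Linux, as discussed above
    ctx = mp.get_context('fork')
    with ctx.Pool(processes=2) as pool:
        # chunksize=1: items are handed to the workers one at a time, so the
        # serialized payload for each iteration can be released sooner
        results = list(pool.imap(run_strategy, range(6), chunksize=1))
    print(results)  # [0, 2, 4, 6, 8, 10]
    ```

    imap preserves the input order, so the results come back in the same order as the parameter permutations regardless of which worker handled them.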

