
How to speed up backtest


  • administrators

    After much consideration there isn't a good in-code solution for replay data, because the timestamps have to be checked to re-construct the bars.

    Have a data feed which reads input formatted to indicate replaying: X ticks will be read and the data feed will not increase its own length until a marker is seen, at which point the length is increased. With that in mind, there is no actual replaying being calculated each and every time, because the actual replay sequence is already in the data feed (frozen on disk)
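
    A purely illustrative sketch of that idea (hypothetical file format, not a backtrader feed): each row of the pre-built file is an intermediate state of a bar plus a final flag marking when the next bar starts, so the reader only grows its buffer when the marker is seen and no replay logic runs during the backtest.

    import csv

    bars = []  # the feed's buffer; its length only grows when a marker is seen
    with open('replay_prebuilt.csv') as f:      # hypothetical file name
        for row in csv.reader(f):
            new_bar = row[-1]                   # marker column: '1' starts a new bar
            bar = [float(v) for v in row[:-1]]  # float timestamp plus ohlcv values
            if not bars or new_bar == '1':
                bars.append(bar)                # marker seen: length increases
            else:
                bars[-1] = bar                  # same bar: overwrite in place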


  • administrators

    Some tests have been made with buffers other than the standard distribution array.array. Summary of results

    • Using pandas.Series as a direct replacement is impossible: due to how appending data actually works, the times increase exponentially

    • Using bcolz.carray and the built-in append functionality (touted as much lighter than anything you can do with numpy) is not such a disaster as pandas.Series, but the actual execution times still increase by several orders of magnitude

    • Reading CSV data with array.array, moving it to a bcolz.carray or pandas.Series buffer, pre-allocating all indicator/observer buffers with either bcolz.carray or pandas.Series, and then vectorizing the indicator operations (for example with pandas.Series.rolling(window=x, center=False).mean(), sketched below) brings the performance close to that of purely using array.array

      For 22 years of daily data (approximately 22 x 256 = 5632 bars) and 1 single SimpleMovingAverage

      • Standard array.array: 0.7s
      • Approach pandas.Series: 1.4s

    It actually seems that array.array (at least for a small number of bars) fares better than using pandas.Series, even with a vectorized operation.
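
    As a reference for what "vectorized" means above, a minimal sketch (dummy data, not the actual test code) of the rolling-mean SMA versus the bar-by-bar equivalent:

    import numpy as np
    import pandas as pd

    closes = pd.Series(np.random.random(5632) * 100.0)   # dummy close prices

    # vectorized SMA: one call over the whole buffer
    sma_vec = closes.rolling(window=20, center=False).mean()

    # per-bar equivalent, roughly what an event-driven indicator does in next()
    sma_loop = pd.Series(index=closes.index, dtype=float)
    for i in range(19, len(closes)):
        sma_loop.iloc[i] = closes.iloc[i - 19:i + 1].mean()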

    There is one last step which may be still worth trying:

    • Modify the CSV data feed to directly use pandas and avoid the copying of data from array.array to pandas.Series

    A real potential route of optimization (but one requiring a real rework of the entire architecture) could be dask, to have calculations bypass the GIL and use all available cores (the backtrader machinery would create the needed task-flow graph taking calculation dependencies into account). A long route in any case
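
    As a rough illustration of that dask idea only (the dependency wiring below is hand-made; backtrader does not generate such a graph today), independent tasks can be computed on separate processes, sidestepping the GIL:

    import numpy as np
    from dask import delayed

    closes = np.random.random(214911) * 100.0   # dummy close prices

    @delayed
    def sma(values, period):
        # simple moving average via convolution
        kernel = np.ones(period) / period
        return np.convolve(values, kernel, mode='valid')

    @delayed
    def crossover(fast, slow):
        # sign of the fast-minus-slow difference on the overlapping part
        n = min(len(fast), len(slow))
        return np.sign(fast[-n:] - slow[-n:])

    fast = sma(closes, 10)             # independent: can run in parallel
    slow = sma(closes, 50)             # independent: can run in parallel
    signal = crossover(fast, slow)     # depends on both SMAs

    result = signal.compute(scheduler='processes')  # processes sidestep the GIL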



  • @backtrader Very interesting. Out of curiosity, were your tests run with Python3?


  • administrators

    Both 2 and 3 ... and 3 had the worst overall results, with a couple of combinations being incredibly slow.


  • administrators

    For the sake of completeness, and to make sure that 5000-odd bars is not simply too small a sample ... a test with a complete 1-year set made up of 1-minute bars.

    Summarizing what the test does:

    • Preload the data
    • Create a single SMA of period=20
    • Do nothing in the strategy's next
    • 214911 minutes of data
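
    The original test script is not reproduced here; a rough reconstruction of that setup might look like this (file name and CSV format are assumptions):

    import backtrader as bt

    class DoNothing(bt.Strategy):
        def __init__(self):
            # the single SMA of period=20 used in the test
            self.sma = bt.indicators.SimpleMovingAverage(self.data, period=20)

        def next(self):
            pass  # do nothing on each of the 1-minute bars

    cerebro = bt.Cerebro()
    data = bt.feeds.GenericCSVData(
        dataname='oneyear-1minute.csv',          # hypothetical file
        timeframe=bt.TimeFrame.Minutes,
        dtformat='%Y-%m-%d %H:%M:%S',
    )
    cerebro.adddata(data)        # data is preloaded by default
    cerebro.addstrategy(DoNothing)
    cerebro.run()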

    Test using pd.Series as the holding buffer and with the vectorized version of the SMA

    Overall Start : 2017-02-18 08:22:18.043000
    Cerebro Create: 2017-02-18 08:22:18.045000
    Cerebro Start : 2017-02-18 08:22:18.046000
    Strategy Start: 2017-02-18 08:22:25.986000
    Strategy End  : 2017-02-18 08:23:25.616000
    Strategy Time : 0:00:59.630000
    Cerebro End   : 2017-02-18 08:23:25.641000
    Cerebro Time  : 0:01:07.595000
    

    Test using the standard array.array in the python distribution

    Overall Start : 2017-02-18 08:24:16.984000
    Cerebro Create: 2017-02-18 08:24:16.986000
    Cerebro Start : 2017-02-18 08:24:16.987000
    Strategy Start: 2017-02-18 08:24:24.609000
    Strategy End  : 2017-02-18 08:24:46.384000
    Strategy Time : 0:00:21.775000
    Cerebro End   : 2017-02-18 08:24:46.385000
    Cerebro Time  : 0:00:29.398000
    

    Quick Summary:

    • Data loading time is just under 8 seconds in both cases
    • The pd.Series vectorized approach takes more than twice as long as the standard array.array version (roughly 60s vs 22s in the strategy phase)

  • administrators

    As mentioned above, the data was still loaded using array.array and then moved into pandas.Series. The final test comprises:

    • Use of pandas.read_csv
    • Translation of the pandas datetimes to the timestamp format used in backtrader, which is the same as in matplotlib
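
    A minimal sketch of those two steps (file name and column layout are assumptions); matplotlib's date2num produces the same float timestamps backtrader uses internally:

    import pandas as pd
    from matplotlib.dates import date2num

    df = pd.read_csv('oneyear-1minute.csv',      # hypothetical file
                     index_col=0, parse_dates=True)

    # convert the whole DatetimeIndex in one pass instead of per-row conversions
    dtfloats = date2num(df.index.to_pydatetime())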

    The result when loading the 214911 bars:

    • Loading time has been reduced to 3.5 seconds from the previous just-under-8 seconds
    • Backtesting time remains the same, with the pd.Series version still taking well over 1 minute

    The branch has been uploaded to the repository (branch: numpylines), should anyone have any interest at all. Directly using numpy arrays was the first approach, which then moved to dynamic buffers with bcolz, pandas, etc.

    The data loader which directly uses pd.DataFrame and pd.Series is called BacktraderCSVData2

    In any case there is a positive point from this: it may be possible to avoid copying some data when binding lines, by simply replacing object references in some cases. This will be attempted in the development branch.


  • administrators

    @Ed-Bartosh said in How to speed up backtest:

    @backtrader ok, create a new thread for this. I'll be happy to explain. Here it would be off-topic I guess.

    Thread created here: https://community.backtrader.com/post/923



  • @backtrader This is very interesting indeed! I'll definitely give this loader a try. Thank you very much!


  • administrators

    The only indicator on that branch which has been changed is the SimpleMovingAverage



  • @backtrader said in How to speed up backtest:

    The only indicator on that branch which has been changed is the SimpleMovingAverage

    Thanks for pointing this out. Fortunately this is not a problem for me at all as I'm not using backtrader indicators.



  • @backtrader I played with the numpylines branch and didn't see noticeable improvements. As you've mentioned, an empty backtest still runs for more than one minute :( Any other ideas on how to speed it up?


  • administrators

    There are for sure some areas where small optimizations may be possible. For example:

    • datetime conversions

      It would be sensible to think that if each data feed makes an initial conversion of the current timestamp to a Python datetime.datetime instance, the rest of the platform could use this, rather than performing local conversions.
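
      A small sketch of the idea, using matplotlib's num2date as a stand-in for the conversion and a one-entry cache (names here are illustrative, this is not backtrader code):

      from functools import lru_cache
      from matplotlib.dates import num2date

      @lru_cache(maxsize=1)
      def bar_datetime(ts):
          # ts: float timestamp of the current bar; only the first caller per
          # bar pays the conversion cost, later callers reuse the cached result
          return num2date(ts)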

    But there is for sure one area which was designed with ease of use in mind, to create a clear boundary between the past, the present and the future, and to offer a fixed reference point in time:

    • 0 indexing

    References to 0 are translated back to the current position in the array, and accesses to [0] are ubiquitous in the core; this is for sure something that has an impact on performance.
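
    As a toy illustration of that cost (not backtrader's actual implementation), every [0] or [-1] access has to be translated to a physical position in the underlying buffer first:

    import array

    class RelativeBuffer:
        def __init__(self):
            self.array = array.array('d')
            self.idx = -1                # physical index of the "current" element

        def append(self, value):
            self.array.append(value)
            self.idx += 1

        def __getitem__(self, ago):
            # ago == 0 -> current value, ago == -1 -> previous value, ...
            return self.array[self.idx + ago]

    buf = RelativeBuffer()
    for price in (100.0, 101.5, 102.25):
        buf.append(price)

    print(buf[0], buf[-1])   # 102.25 101.5 -- each access pays the translation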



  • @backtrader thank you for the suggestions! I'll try to look at this.

    My algo, which operates on 3 securities with 2 time frames (minute and daily), runs two times faster after optimisations! Most of the performance gains came from switching to the PandasDirectData feed, truncating the data before loading it into backtrader, getting rid of replaydata and disabling datetime conversion when calling pd.read_csv. Thank you very much for your suggestions again!

    I found it quite generic and useful to truncate the data to the fromdate-todate range. My csv files contain data for 10 years. When I run a backtest for only one year of data I simply truncate the rest of the data before loading it into backtrader. This speeds up loading the data quite a lot.
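
    A sketch of that truncation step, assuming a datetime-indexed DataFrame like the sample further below (file name and dates are illustrative):

    import backtrader as bt
    import pandas as pd

    df = pd.read_csv('tenyears-1minute.csv', index_col=0, parse_dates=True)

    # keep only the backtest window before handing the frame to backtrader
    df = df.loc['2014-01-01':'2014-12-31']

    data = bt.feeds.PandasDirectData(dataname=df, openinterest=-1)  # -1: no open interest column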

    I'm going to try the numpylines branch again, as I suspect I did something wrong since it didn't show any performance improvements. Can you suggest how to modify the PandasDirectData loader to utilize your changes?
    Here is an example of my data:

                                   2      3      4      5      6
    2014-11-17 09:31:00-05:00  96.06  96.20  96.05  96.05    955
    2014-11-17 09:32:00-05:00  96.13  96.32  96.10  96.32   1341
    2014-11-17 09:33:00-05:00  96.32  96.45  96.29  96.37    522
    2014-11-17 09:34:00-05:00  96.30  96.33  96.20  96.20    208
    2014-11-17 09:35:00-05:00  96.21  96.27  96.12  96.12    941
    2014-11-17 09:36:00-05:00  96.11  96.36  96.05  96.23   2350
    ...
    

    It's indexed by datetime as you can see. Is it mandatory to index the data, btw? Does it make it faster for backtrader to load?


  • administrators

    It would be something like this:

    import pandas as pd

    from backtrader import feed
    from backtrader.utils import date2num


    class PandasDirectData_NumPyLines(feed.DataBase):
        params = (
            ('datetime', 0),
            ('open', 1),
            ('high', 2),
            ('low', 3),
            ('close', 4),
            ('volume', 5),
            ('openinterest', 6),
        )
    
        datafields = [
            'datetime', 'open', 'high', 'low', 'close', 'volume', 'openinterest'
        ]
    
        def start(self):
            super(PandasDirectData_NumPyLines, self).start()
            self._df = self.p.dataname  # the DataFrame is passed as dataname
    
        def preload(self):
            # Set the standard datafields - except for datetime
            for datafield in self.datafields[1:]:
                # get the column index from the parameters
                colidx = getattr(self.params, datafield)
    
                if colidx < 0:
                    # column not present -- skip
                    continue
    
                # bind the DataFrame column directly as the line buffer
                l = getattr(self.lines, datafield)
                l.array = self._df.iloc[:, colidx]
    
            # datetime is taken from the DataFrame index and converted to the
            # float timestamps used internally (matplotlib-compatible)
            field0 = self.datafields[0]
            dts = pd.Series(pd.to_datetime(self._df.index))
            getattr(self.l, field0).array = dts.apply(date2num)
    
            self._last()
            self.home()
    
    

    Where datetime is directly taken from the index. The default column offsets for the other fields are probably off by one because of that, but they can luckily be configured
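
    For the data sample shown earlier (datetimes in the index, then open/high/low/close/volume as positional columns and no open interest), a possible instantiation could be the following; the column numbers and file name are assumptions for illustration:

    import backtrader as bt

    df = pd.read_csv('minute-data.csv', index_col=0, parse_dates=True)

    cerebro = bt.Cerebro()
    data = PandasDirectData_NumPyLines(
        dataname=df,
        open=0, high=1, low=2, close=3, volume=4,
        openinterest=-1,    # negative index: column not present, skipped
    )
    cerebro.adddata(data)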