How to speed up backtest



  • A backtest of an empty strategy over half a year of minute data loaded from CSV takes more than half a minute.

    Is there any way to speed it up? Is it possible to cache the data and load it from the cache? Would using a pandas data feed help?

    Here is cProfile output:

             46334362 function calls (45321121 primitive calls) in 38.307 seconds
    
       Ordered by: cumulative time
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.001    0.001   38.309   38.309 test.py.profile:5(<module>)
            1    0.000    0.000   37.961   37.961 test.py.profile:44(main)
            1    0.000    0.000   37.959   37.959 cerebro.py:707(run)
            1    0.000    0.000   37.959   37.959 cerebro.py:839(runstrategies)
            1    0.029    0.029   29.416   29.416 feed.py:615(preload)
        53237    1.488    0.000   29.386    0.001 feed.py:402(load)
       195560    0.481    0.000   13.270    0.000 feed.py:626(_load)
       195560    3.825    0.000   12.344    0.000 csvgeneric.py:73(_loadline)
            1    0.338    0.338    8.541    8.541 cerebro.py:1240(_runonce)
       195560    1.834    0.000    6.143    0.000 tzinfo.py:244(localize)
       195560    0.547    0.000    4.865    0.000 {built-in method strptime}
       408508    0.223    0.000    4.714    0.000 lineseries.py:516(forward)
       408508    0.676    0.000    4.490    0.000 lineseries.py:228(forward)
       195560    2.409    0.000    4.317    0.000 _strptime.py:295(_strptime)
        53236    0.309    0.000    4.127    0.000 strategy.py:249(_oncepost)
      1741579    2.519    0.000    3.815    0.000 linebuffer.py:247(forward)
        53236    0.350    0.000    2.876    0.000 strategy.py:304(_next_observers)
        53236    0.190    0.000    2.789    0.000 feed.py:307(advance)
       391120    0.479    0.000    2.522    0.000 tzinfo.py:189(normalize)
       142324    0.083    0.000    2.182    0.000 lineseries.py:519(backwards)
       142324    0.365    0.000    2.099    0.000 lineseries.py:235(backwards)
      1955600    1.888    0.000    1.888    0.000 {method 'replace' of 'datetime.datetime' objects}
       996268    1.098    0.000    1.734    0.000 linebuffer.py:260(backwards)
       391120    0.669    0.000    1.673    0.000 tzinfo.py:179(fromutc)
      1724188    1.339    0.000    1.619    0.000 linebuffer.py:214(__setitem__)
       391123    1.055    0.000    1.501    0.000 dateintern.py:168(date2num)
      3245685    1.175    0.000    1.497    0.000 {getattr}
       195560    1.089    0.000    1.386    0.000 dateintern.py:115(num2date)
        53235    0.550    0.000    1.358    0.000 feed.py:290(_tick_fill)
    ...
    

    And the code:

    import sys
    import os
    import datetime
    
    from pytz import timezone
    
    import backtrader as bt
    
    class Strategy(bt.Strategy):
        """Trading strategy class."""
    
        params = (
            ('start_date', datetime.datetime(2015, 6, 17)),
            ('end_date', datetime.datetime(2015, 12, 31)),
            ('timezone', timezone('US/Eastern')),
        )
    
        def __init__(self):
            pass
    
        def next(self):
            return
    
    class MyCSVData(bt.feeds.GenericCSVData):
        params = (
            ('dtformat', '%Y%m%d'),
            ('tmformat', '%H%M'),
    
            ('datetime', 0),
            ('time', 1),
            ('open', 2),
            ('high', 3),
            ('low', 4),
            ('close', 5),
            ('volume', 6),
            ('openinterest', -1))
    
    def main(argv):
        cerebro = bt.Cerebro()
        cerebro.addstrategy(Strategy)
    
        # Datas are in a subfolder of the samples. Need to find where the script is
        # because it could have been called from anywhere
        modpath = os.path.dirname(os.path.abspath(argv[0]))
    
        # load all required datafeeds
        stock = 'AAPL'
        data = MyCSVData(
                dataname=os.path.join(modpath, 'datas', '%s.txt' % stock),
                # Do not pass values before this date
                fromdate=Strategy.params.start_date,
            # Do not pass values after this date
            todate=Strategy.params.end_date,
            reverse=False,
                tzinput=Strategy.params.timezone)
    
        # Add the minute Data Feed to Cerebro
        cerebro.adddata(data, name="%s_min" % stock)
    
        # Add replayed Data Feed (daily)
        #cerebro.replaydata(data, timeframe=bt.TimeFrame.Days, name="%s_day" % stock)
    
        # Set our desired cash start
        cerebro.broker.setcash(100000)
    
        # Run over everything
        cerebro.run()
    
    if __name__ == '__main__':
        sys.exit(main(sys.argv))
    


  • I used a .csv file with 2 years of price data. If I use the full .csv file with 6 years of data, the same code takes 96 seconds to run.

    If I uncomment this line:

    cerebro.replaydata(data, timeframe=bt.TimeFrame.Days, name="%s_day" % stock)
    

    it takes 120 seconds to run.

    As I'm going to use at least 3 tickers, every backtest run will take even longer, so caching/preloading data would help a lot, I guess.



  • @Ed-Bartosh zipline takes 3 seconds to run an empty algo, and the time doesn't seem to depend on the number of symbols in the strategy. They preload data into so-called data bundles and keep it in SQLite or bcolz, so caching helps.


  • administrators

    What zipline actually does is unknown. Due to an early question in the tracker, some concise execution samples were looked at for a very rough comparison, but the actual source code has never been consulted.

    Some comment somewhere mentioned that in the new release (it may already have been in the previous one) some things have been rewritten in Cython to speed things up.

    But without knowing what zipline actually supports or doesn't, any comparison is really meaningless. backtrader synchronizes data feeds with different trading days. It would only be a random guess to make a statement about the same in zipline. Now, and thanks to the persistence of @randyt, support for the proper synchronization of multi-timezone intraday and/or daily timeframes is reaching its final stages.

    One of the major differences is probably the pure Python approach of backtrader vs the Pandas/NumPy approach in zipline. This was intentional, to be able to run backtrader where Pandas may not be available (yes, those places do exist) but a regular Python distribution is.

    The built-in array.array module is probably orders of magnitude slower than Pandas, although this is only an educated guess.

    One of the reasons is due to the fulfillment (in the author's opinion) of the main goal of backtrader. There are actually 2 goals:

    1. Ease of use
    2. Go back to #1

    Fulfilling that means that several lazily evaluated objects are created in the background when operations are requested and/or lines are created. And being all of them array.array based, the impact is heavy on execution time.

    The only thing you may do to decrease some time is run cerebro with stdstats=False, which deactivates the standard observers and really saves some time. But it won't make it amazingly fast.

    With regards to caching ... it won't play a real role. You may experiment with this feed, which pre-caches itself during __init__ and serves the values according to the data feed API over _load (the only value actually served is the timestamp; the others are left as NaN):

    import backtrader as bt

    class MyData(bt.DataBase):
        params = (
            ('startdate', 1500 * 365),  # Some days before the start of year 1500
            ('mysize', 252 * 9 * 60),  # 120960 - with default - not beyond 2017
        )

        def __init__(self):
            # Pre-cache all timestamps in memory during construction
            self.dts = [self.p.startdate + x for x in range(self.p.mysize)]

        def _load(self):
            try:
                # Serve the next pre-cached timestamp; the other lines stay NaN
                self.lines.datetime[0] = self.dts[len(self) - 1]
            except IndexError:
                return False  # cache exhausted

            return True
    


  • Thank you for the detailed explanations!

    What zipline actually does is unknown. Due to an early question in the tracker, some concise execution samples were looked at for a very rough comparison, but the actual source code has never been consulted.

    Yeah, I read that. It's a bit confusing, as it misses killer backtester features that zipline doesn't have and points to something not very important from my point of view. I can give you more details if you want, but probably not in this topic.

    One of the major differences is probably the pure python approach of backtrader vs the Pandas/NumPy approach in zipline.

    Yep, that makes sense. This is what I was afraid of :(

    This was intentional, to be able to run backtrader where Pandas may not be available (yes, those places do exist) but a regular Python distribution is.

    May I ask what kind of systems are targeted by this design decision? According to what you're saying, you expect people to be surprised that "those places do exist". I am surprised :)

    Ease of use

    This is the whole point of this conversation. The thing is that it's not easy to use when you have to wait 2 extra minutes for every run. This slows me down quite a lot. Fortunately I don't use a lot of symbols and timeframes, so I can bear with this, considering the many great features that I enjoy very much.

    BTW, backtrader feels much faster than zipline at executing the algo. Unfortunately this is "compensated" by the slow loading speed.

    The only thing you may do to decrease some time is run cerebro with stdstats=False, which deactivates the standard observers and really saves some time. But it won't make it amazingly fast.

    Thanks, I'll try this.

    I'm also going to make loading faster by reading the CSV into pandas and removing data that is outside the test date range. Considering the one extra minute of difference between processing 2 and 6 years of data, it should help.
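    As a sketch of that idea (the sample rows, file name, and date window here are made-up placeholders matching the MyCSVData column layout above), pre-filtering with pandas could look like this:

    ```python
    import io

    import pandas as pd

    # Hypothetical in-memory sample in the MyCSVData column layout:
    # date, time, open, high, low, close, volume
    raw_csv = io.StringIO(
        "20150616,0931,127.0,127.5,126.8,127.2,1000\n"
        "20150617,0931,127.2,128.0,127.1,127.9,1200\n"
        "20151230,1600,108.0,108.5,107.9,108.2,900\n"
        "20160104,0931,102.6,103.0,102.0,102.5,1500\n"
    )

    cols = ['date', 'time', 'open', 'high', 'low', 'close', 'volume']
    df = pd.read_csv(raw_csv, header=None, names=cols,
                     dtype={'date': str, 'time': str})

    # Keep only the rows inside the backtest window; since the dates are
    # zero-padded %Y%m%d strings, plain string comparison orders correctly
    start, end = '20150617', '20151231'
    df = df[(df['date'] >= start) & (df['date'] <= end)]

    # Write the trimmed file once and point the data feed at it on later runs
    df.to_csv('AAPL_trimmed.txt', header=False, index=False)
    ```

    The trimmed file keeps the exact GenericCSVData format, so the feed definition itself would not need to change.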


  • administrators

    @Ed-Bartosh said in How to speed up backtest:

    May I ask what kind of systems are targeted by this design decision? According to what you're saying you expect people to be surprised that "those places do exist". I am surprised

    It was for example mentioned in this blog post: Blog - Saving Memory and directly quoting it:

    • Be pure Python to allow running in embedded systems if need be

      A scenario in the future could have backtrader connected to a 2nd machine which provides the live feed, whilst backtrader itself runs inside a Raspberry Pi or something even more limited like an ADSL router (AVM Fritz!Box 7490 with a Freetz image)

    The latter being where you cannot get Pandas (at least not yet)

    This is the whole point of this conversation. The thing is that It's not easy to use when you have to wait 2 extra minutes

    Ease of use to quickly prototype and implement ideas.


  • administrators

    @Ed-Bartosh said in How to speed up backtest:

    It's a bit confusing as it misses killer backtester features that zipline doesn't have and points out to something not very important from my point of view. I can give you more details if you want, but probably not in this topic.

    It would really be interesting to learn about those "killer" features.



  • @backtrader ok, create a new thread for this. I'll be happy to explain. Here it would be off-topic I guess.



  • @Ed-Bartosh btw, the "stdstats=False" approach didn't help. It saved 3 seconds out of more than 2 minutes of backtest run.

    Can you point out which parts of the backtester would have to be changed to support pandas? How complex would this task be from your point of view?



  • @Ed Bartosh could you describe in a bit more detail what you are doing in terms of datasets, etc? I ask not to be argumentative, but to understand where I might also start running into these delays you mention.

    FWIW, I'm predominantly working with daily data and am running backtests with about 20 years of daily data for at least 2 instruments and have never seen a run take more than a minute.

    That said, I can imagine that a lot could be done here to focus on a few key areas of the system with pandas. If it were pluggable, that could still allow the system to satisfy @backtrader's requirements.

    I also think that creating ways to interact with data sources stored in memory could make a big difference in some of these run times. A little profiling would probably help identify some of the low-hanging fruit there.
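    For anyone wanting to do that profiling, a minimal run with the standard library looks like this (the `busy_work` function is just a placeholder for a real backtest, e.g. `cerebro.run()`):

    ```python
    import cProfile
    import io
    import pstats

    def busy_work():
        # Stand-in for the code being profiled, e.g. cerebro.run()
        return sum(i * i for i in range(100000))

    profiler = cProfile.Profile()
    profiler.enable()
    busy_work()
    profiler.disable()

    # Show the hottest calls by cumulative time, like the cProfile output
    # in the first post of this thread
    stream = io.StringIO()
    pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(10)
    print(stream.getvalue())
    ```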

    I too would like to hear about the "killer" features you are missing.



  • @RandyT said in How to speed up backtest:

    @Ed Bartosh could you describe in a bit more detail what you are doing in terms of datasets, etc? I ask not to be argumentative, but to understand where I might also start running into these delays you mention.

    Sure. Here is an "empty" strategy that raises ZeroDivisionError on the first call of the "next" method:

    #!/usr/bin/env python
    
    import sys
    import os
    import datetime
    
    from pytz import timezone
    
    import backtrader as bt
    
    class Strategy(bt.Strategy):
        """Trading strategy class."""
    
        params = (
            ('stocks', ('AAPL', 'QQQ', 'HSY')),
            ('start_date', datetime.datetime(2015, 6, 17)),
            ('end_date', datetime.datetime(2015, 12, 31)),
            ('timezone', timezone('US/Eastern')),
        )
    
        def __init__(self):
            pass
    
        def next(self):
            1/0
    
    class MyCSVData(bt.feeds.GenericCSVData):
        params = (
            ('dtformat', '%Y%m%d'),
            ('tmformat', '%H%M'),
    
            ('datetime', 0),
            ('time', 1),
            ('open', 2),
            ('high', 3),
            ('low', 4),
            ('close', 5),
            ('volume', 6),
            ('openinterest', -1))
    
    def main(argv):
        cerebro = bt.Cerebro()
        cerebro.addstrategy(Strategy)
    
        # Datas are in a subfolder of the samples. Need to find where the script is
        # because it could have been called from anywhere
        modpath = os.path.dirname(os.path.abspath(argv[0]))
    
        # load all required datafeeds
        for stock in Strategy.params.stocks:
            data = MyCSVData(dataname=os.path.join(modpath, 'datas', '%s.txt' % stock),
                             fromdate=Strategy.params.start_date,
                             todate=Strategy.params.end_date,
                             reverse=False,
                             tzinput=Strategy.params.timezone)
            # Add the minute Data Feed to Cerebro
            cerebro.adddata(data, name="%s_min" % stock)
    
            # Add replayed Data Feed (daily)
            cerebro.replaydata(data, timeframe=bt.TimeFrame.Days, name="%s_day" % stock)
    
        # Set our desired cash start
        cerebro.broker.setcash(100000)
    
        # Run over everything
        cerebro.run()
    
    if __name__ == '__main__':
        sys.exit(main(sys.argv))
    
    

    Running the above code with 6 years of data (from 20100104,931 to 20161121,1533 to be exact) in the minute timeframe takes 1 min 6 sec on my system (MacBook Pro, 3.1 GHz Intel Core i7, 16 GB 1867 MHz DDR3):

    $ time ./test-load.py 
    Traceback (most recent call last):
    ...
      File "./test-load.py", line 25, in next
        1/0
    ZeroDivisionError: integer division or modulo by zero
    
    real	1m5.829s
    user	1m3.669s
    sys	0m1.034s
    

    With 2 securities (added 'QQQ' to Strategy.params.stocks) it takes 2 min:

    $ time ./test-load.py 
    ...
      File "./test-load.py", line 25, in next
        1/0
    ZeroDivisionError: integer division or modulo by zero
    
    real	2m1.363s
    user	1m56.855s
    sys	0m1.888s
    

    With 3 securities (added 'HSY' to Strategy.params.stocks) it takes 3 min 37 sec:

    $ time ./test-load.py 
    Traceback (most recent call last):
    ...
      File "./test-load.py", line 25, in next
        1/0
    ZeroDivisionError: integer division or modulo by zero
    
    real	3m37.998s
    user	3m15.327s
    sys	0m4.490s
    

    You can imagine how many days users would wait if they wanted to run a strategy that operates on, for example, all S&P 500 securities on the Raspberry Pi mentioned above :)



  • I generated a quick callgraph which might give us a bit more insight as to where we are spending time. I'll see about generating the same with your example.

    https://dl.dropboxusercontent.com/u/4390007/bt.pdf



  • @RandyT you can look at the profiler output in the first post of this thread for more info.


  • administrators

    @Ed-Bartosh said in How to speed up backtest:

    btw, the "stdstats=False" approach didn't help. It saved 3 seconds out of more than 2 minutes of backtest run.

    It wasn't meant to give you a 50% savings.

    Can you point out which parts of the backtester would have to be changed to support pandas? How complex would this task be from your point of view?

    It's not just about supporting pandas, as in making pd.Series the underlying buffer. There is plenty of pure-Python code overriding operators and managing the indirection that makes 0 the pivot point referencing the current data and -1 the last data that was delivered, while also keeping single-line objects and multiple-line objects compatible.

    In the case of exactbars=1, the entire buffering system is switched to collections.deque and there is no real speed improvement.

    As an experiment, it may be worth (or fun?) to produce a version which can switch the underlying buffer for lines to pd.Series.


  • administrators

    @Ed-Bartosh said in How to speed up backtest:

    You can imagine how many days users would wait if they want to run a strategy that operates for example all S&P 500 securities on the Raspberry Pi mentioned above

    The Raspberry Pi would be doing live trading with a 2-line logic:

    1. Buy when it's cheap
    2. Sell when it's expensive. Go back to 1


  • @backtrader said in How to speed up backtest:

    As an experiment, it may be worth (or fun?) to produce a version which can switch the underlying buffer for lines to pd.Series

    I can at least help test it. Not sure you need help with coding, but I can do that too :)



  • @backtrader said in How to speed up backtest:

    The Raspberry Pi would be doing live trading with a 2 line logic:

    Buy when it's cheap
    Sell when it's expensive. Go back to 1

    I'm pretty sure this use case doesn't cover more than 0.5% of the backtrader user base. However, many more people would be interested in backtesting strategies that involve many symbols.

    PS: Don't get me wrong, please. I'm not questioning your design decision. I just want my algo to run faster :)

    PPS: a quick search showed that pandas can be installed on a Raspberry Pi: http://hilpisch.com/rpi/02_data_analytics.html


  • administrators

    The Raspberry Pi was never in doubt, because there are several complete Linux distributions for it. The other target (a freetzed Fritz!Box 7490) has no pandas, numpy or anything else.


  • administrators

    The example above uses replaydata. If this were not used, something that could be considered to speed things up would be:

    • Replace the load cycle in a data feed with a plain one which simply loops over the data.

    This precludes the usage of filters (including resampling and replaying). With that in mind, the usage of pd.Series would make sense when one is, for example, using a PandasData feed.

    In that case, instead of copying the data from the DataFrame, each column of the DataFrame would take the place of a given line, and a pointer could be used to indicate the current location when accessing the data.
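    A minimal illustration of that pointer idea (this is not real backtrader API — the class and method names here are invented for the sketch) could look like:

    ```python
    import pandas as pd

    class ColumnCursor:
        """Hypothetical sketch: lines map directly onto DataFrame columns
        and a moving index marks the current bar instead of copying data."""

        def __init__(self, df):
            # No per-bar copying: keep references to the column arrays
            self.columns = {name: df[name].to_numpy() for name in df.columns}
            self.length = len(df)
            self.idx = -1  # positioned before the first bar

        def advance(self):
            # Move the pointer one bar forward; False once data is exhausted
            self.idx += 1
            return self.idx < self.length

        def get(self, line, ago=0):
            # ago=0 is the current bar, ago=-1 the previous one, and so on
            return self.columns[line][self.idx + ago]

    # Usage sketch with a tiny close-only frame
    df = pd.DataFrame({'close': [100.0, 101.0, 102.0]})
    cur = ColumnCursor(df)
    cur.advance()  # first bar
    cur.advance()  # second bar
    print(cur.get('close'), cur.get('close', ago=-1))
    ```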



  • @backtrader Unfortunately my algo uses replaydata. Can you suggest any optimisations for this?

    So far I've managed to make the algo run faster by loading the CSV into a pandas DataFrame and truncating data that's not used in the algo. I also played with pickling/unpickling the DataFrame, but that didn't help much.

