How to speed up backtest
-
A backtest of an empty strategy over a half-year period of minute data loaded from CSV takes more than half a minute.
Is there any way to speed it up? Is it possible to cache the data and load it from a cache? Would using a pandas data feed help?
Here is the cProfile output:
```
46334362 function calls (45321121 primitive calls) in 38.307 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001   38.309   38.309 test.py.profile:5(<module>)
        1    0.000    0.000   37.961   37.961 test.py.profile:44(main)
        1    0.000    0.000   37.959   37.959 cerebro.py:707(run)
        1    0.000    0.000   37.959   37.959 cerebro.py:839(runstrategies)
        1    0.029    0.029   29.416   29.416 feed.py:615(preload)
    53237    1.488    0.000   29.386    0.001 feed.py:402(load)
   195560    0.481    0.000   13.270    0.000 feed.py:626(_load)
   195560    3.825    0.000   12.344    0.000 csvgeneric.py:73(_loadline)
        1    0.338    0.338    8.541    8.541 cerebro.py:1240(_runonce)
   195560    1.834    0.000    6.143    0.000 tzinfo.py:244(localize)
   195560    0.547    0.000    4.865    0.000 {built-in method strptime}
   408508    0.223    0.000    4.714    0.000 lineseries.py:516(forward)
   408508    0.676    0.000    4.490    0.000 lineseries.py:228(forward)
   195560    2.409    0.000    4.317    0.000 _strptime.py:295(_strptime)
    53236    0.309    0.000    4.127    0.000 strategy.py:249(_oncepost)
  1741579    2.519    0.000    3.815    0.000 linebuffer.py:247(forward)
    53236    0.350    0.000    2.876    0.000 strategy.py:304(_next_observers)
    53236    0.190    0.000    2.789    0.000 feed.py:307(advance)
   391120    0.479    0.000    2.522    0.000 tzinfo.py:189(normalize)
   142324    0.083    0.000    2.182    0.000 lineseries.py:519(backwards)
   142324    0.365    0.000    2.099    0.000 lineseries.py:235(backwards)
  1955600    1.888    0.000    1.888    0.000 {method 'replace' of 'datetime.datetime' objects}
   996268    1.098    0.000    1.734    0.000 linebuffer.py:260(backwards)
   391120    0.669    0.000    1.673    0.000 tzinfo.py:179(fromutc)
  1724188    1.339    0.000    1.619    0.000 linebuffer.py:214(__setitem__)
   391123    1.055    0.000    1.501    0.000 dateintern.py:168(date2num)
  3245685    1.175    0.000    1.497    0.000 {getattr}
   195560    1.089    0.000    1.386    0.000 dateintern.py:115(num2date)
    53235    0.550    0.000    1.358    0.000 feed.py:290(_tick_fill)
...
```
And the code:
```python
import sys
import os
import datetime

from pytz import timezone

import backtrader as bt


class Strategy(bt.Strategy):
    """Trading strategy class."""
    params = (
        ('start_date', datetime.datetime(2015, 6, 17)),
        ('end_date', datetime.datetime(2015, 12, 31)),
        ('timezone', timezone('US/Eastern')),
    )

    def __init__(self):
        pass

    def next(self):
        return


class MyCSVData(bt.feeds.GenericCSVData):
    params = (
        ('dtformat', '%Y%m%d'),
        ('tmformat', '%H%M'),
        ('datetime', 0),
        ('time', 1),
        ('open', 2),
        ('high', 3),
        ('low', 4),
        ('close', 5),
        ('volume', 6),
        ('openinterest', -1),
    )


def main(argv):
    cerebro = bt.Cerebro()
    cerebro.addstrategy(Strategy)

    # Datas are in a subfolder of the samples. Need to find where the script is
    # because it could have been called from anywhere
    modpath = os.path.dirname(os.path.abspath(argv[0]))

    # load all required datafeeds
    stock = 'AAPL'
    data = MyCSVData(
        dataname=os.path.join(modpath, 'datas', '%s.txt' % stock),
        # Do not pass values before this date
        fromdate=Strategy.params.start_date,
        # Do not pass values after this date
        todate=Strategy.params.end_date,
        reverse=False,
        tzinput=Strategy.params.timezone)

    # Add the minute Data Feed to Cerebro
    cerebro.adddata(data, name="%s_min" % stock)

    # Add replayed Data Feed (daily)
    # cerebro.replaydata(data, timeframe=bt.TimeFrame.Days, name="%s_day" % stock)

    # Set our desired cash start
    cerebro.broker.setcash(100000)

    # Run over everything
    cerebro.run()


if __name__ == '__main__':
    sys.exit(main(sys.argv))
```
-
I used a .csv file with 2 years of price data. If I use the full .csv file with 6 years of data, the same code takes 96 seconds to run.
If I uncomment this line:
```python
cerebro.replaydata(data, timeframe=bt.TimeFrame.Days, name="%s_day" % stock)
```
it takes 120 seconds to run.
As I'm going to use at least 3 tickers, every backtest run will take even longer, so caching/preloading the data would help a lot, I guess.
-
@Ed-Bartosh zipline takes 3 sec to run an empty algo and the time seems not to depend on the number of symbols in the strategy. They preload data into so-called data bundles and keep it in sqlite or bcolz, so caching helps.
-
What `zipline` actually does is unknown. Due to an early question in the tracker, some concise execution samples were looked into to do a very rough comparison, but the actual source code has never been consulted. Some comment somewhere mentioned that in the new release (it may have already been in the previous one) some things have been rewritten in Cython to speed things up.
But not knowing what `zipline` actually supports or doesn't, any comparison is really meaningless. backtrader synchronizes data feeds with different trading days. It would only be a random guess to make a statement about the same in `zipline`. Now, and thanks to the persistence of @randyt, support for the proper synchronization of multi-timezone intraday and/or daily timeframes is touching its final stages.
One of the major differences is probably the pure Python approach of backtrader vs the Pandas/NumPy approach in `zipline`. This was intentional, to be able to run backtrader where `Pandas` may not be available (yes, those places do exist) but a regular Python distribution is. The built-in `array.array` module is probably some orders of magnitude slower than `Pandas`, although this is only an educated guess.
One of the reasons is the fulfillment (in the author's opinion) of the main goal of backtrader. There are actually 2 goals:
- Ease of use
- Go back to #1
Fulfilling that means that several lazily evaluated objects are created in the background when operations are requested and/or lines are created. And with all of them being `array.array` based, the impact on execution time is heavy.
The only thing you may do to decrease some time is run `cerebro` with `stdstats=False`, which deactivates the standard observers and really saves some time. But it won't make it amazingly fast.
With regards to caching ... it won't play a real role. You may experiment with this feed, which pre-caches itself during `__init__` and serves the values according to the data feed API over `_load`. (The only value actually served is the timestamp; the others are left as `NaN`.)
```python
class MyData(bt.DataBase):
    params = (
        ('startdate', 1500 * 365),  # Some days before the start of year 1500
        ('mysize', 252 * 9 * 60),  # 120960 - with default - not beyond 2017
    )

    def __init__(self):
        self.dts = [self.p.startdate + x for x in range(self.p.mysize)]

    def _load(self):
        try:
            self.lines.datetime[0] = self.dts[len(self) - 1]
        except IndexError:
            return False

        return True
```
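For reference, a minimal sketch of how such a pre-cached feed could be driven (nothing is assumed beyond the `MyData` feed above; the strategy is just the no-op base class):

```python
import backtrader as bt

# Sketch: drive the pre-cached MyData feed defined above.
# stdstats=False switches off the standard observers, as suggested earlier.
cerebro = bt.Cerebro(stdstats=False)
cerebro.adddata(MyData())
cerebro.addstrategy(bt.Strategy)  # no-op strategy, only there to drive the loop
cerebro.run()
```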
-
Thank you for the detailed explanations!
What zipline actually does is unknown. Due to an early question in the tracker, some concise execution samples were looked into to do a very rough comparison, but the actual source code has never been consulted.
Yeah, read that. It's a bit confusing, as it misses killer backtester features that zipline doesn't have and points to something not very important from my point of view. I can give you more details if you want, but probably not in this topic.
One of the major differences is probably the pure python approach of backtrader vs the Pandas/NumPy approach in zipline.
Yep, that makes sense. This is what I was afraid of :(
This was intentional, to be able to run backtrader where Pandas may not be available (yes, those places do exist) but a regular Python distribution is.
May I ask what kind of systems are targeted by this design decision? According to what you're saying you expect people to be surprised that "those places do exist". I am surprised :)
Ease of use
This is the whole point of this conversation. The thing is that it's not easy to use when you have to wait 2 extra minutes for every run. This slows me down quite a lot. Fortunately I don't use a lot of symbols and time frames, so I can bear with it, considering the many great features that I enjoy very much.
BTW, backtester feels much faster than zipline in executing the algo. Unfortunately this is "compensated" by slow loading speed.
The only thing you may do to decrease some time is run cerebro with stdstats=False, which deactivates the standard observers and really saves some time. But it won't make it amazingly fast.
Thanks, I'll try this.
I'm also going to make the loading faster by reading the CSV into pandas and removing data that is outside the test date range. Considering the extra minute of difference between processing 2 and 6 years of data, it should help.
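For what it's worth, a rough sketch of that trimming step (the file name and column layout are assumptions based on the feed shown earlier): parse the CSV once, keep only the backtest window and write a smaller file for the existing GenericCSVData feed to load.

```python
import pandas as pd

# Assumed layout: date,time,open,high,low,close,volume with no header,
# matching the GenericCSVData parameters shown earlier in the thread.
cols = ['date', 'time', 'open', 'high', 'low', 'close', 'volume']
df = pd.read_csv('datas/AAPL.txt', header=None, names=cols,
                 dtype={'date': str, 'time': str})

# Build a datetime from the date and zero-padded time columns
dt = pd.to_datetime(df['date'] + df['time'].str.zfill(4), format='%Y%m%d%H%M')

# Keep only the rows inside the backtest window and write a smaller CSV
mask = (dt >= pd.Timestamp(2015, 6, 17)) & (dt <= pd.Timestamp(2015, 12, 31))
df[mask].to_csv('datas/AAPL_trimmed.txt', header=False, index=False)
```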
-
@Ed-Bartosh said in How to speed up backtest:
May I ask what kind of systems are targeted by this design decision? According to what you're saying you expect people to be surprised that "those places do exist". I am surprised
It was for example mentioned in this blog post: Blog - Saving Memory and directly quoting it:
- Be pure Python to allow running in embedded systems if need be
- A scenario in the future could have backtrader connected to a 2nd machine which provides the live feed, whilst backtrader itself runs inside a Raspberry Pi or something even more limited like an ADSL Router (AVM Fritz!Box 7490 with a Freetz image)
The latter being where you cannot get `Pandas` (at least not yet).
This is the whole point of this conversation. The thing is that it's not easy to use when you have to wait 2 extra minutes
Ease of use means being able to quickly prototype and implement ideas.
-
@Ed-Bartosh said in How to speed up backtest:
It's a bit confusing, as it misses killer backtester features that zipline doesn't have and points to something not very important from my point of view. I can give you more details if you want, but probably not in this topic.
It would really be interesting to learn about those "killer" features.
-
@backtrader OK, create a new thread for this and I'll be happy to explain. Here it would be off-topic, I guess.
-
@Ed-Bartosh btw, the "stdstats=False" approach didn't help. It saved 3 sec out of more than 2 minutes of backtest run.
Can you point out which parts of the backtester have to be changed to support pandas? How complex would this task be from your point of view?
-
@Ed Bartosh could you describe in a bit more detail what you are doing in terms of datasets, etc? I ask not to be argumentative, but to understand where I might also start running into these delays you mention.
FWIW, I'm predominantly working with daily data and am running backtests with about 20 years of daily data for at least 2 instruments and have never seen a run take more than a minute.
That said, I can imagine that a lot could be done here by focusing on a few key areas of the system with pandas. If it were pluggable, that could still allow the system to satisfy @backtrader's requirements.
I also think that creating ways to interact with data sources stored in memory could make a big difference in some of these run times. A little profiling would probably help identify some of the low-hanging fruit there.
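For instance, backtrader already ships a `PandasData` feed that can be handed a DataFrame held in memory; a minimal sketch with made-up values:

```python
import datetime
import pandas as pd
import backtrader as bt

# Tiny made-up frame with a datetime index and the standard OHLCV columns,
# which PandasData picks up by name.
idx = pd.date_range(datetime.datetime(2015, 6, 17, 9, 31), periods=3, freq='min')
df = pd.DataFrame({'open': [1.0, 1.1, 1.2], 'high': [1.1, 1.2, 1.3],
                   'low': [0.9, 1.0, 1.1], 'close': [1.05, 1.15, 1.25],
                   'volume': [100, 200, 300], 'openinterest': [0, 0, 0]},
                  index=idx)

cerebro = bt.Cerebro()
cerebro.adddata(bt.feeds.PandasData(dataname=df))
cerebro.addstrategy(bt.Strategy)  # no-op strategy just to run the loop
cerebro.run()
```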
I too would like to hear about the "killer" features you are missing.
-
@RandyT said in How to speed up backtest:
@Ed Bartosh could you describe in a bit more detail what you are doing in terms of datasets, etc? I ask not to be argumentative, but to understand where I might also start running into these delays you mention.
Sure. Here is an "empty" strategy that raises ZeroDivisionError on the first call of the "next" method:
```python
#!/usr/bin/env python
import sys
import os
import datetime

from pytz import timezone

import backtrader as bt


class Strategy(bt.Strategy):
    """Trading strategy class."""
    params = (
        ('stocks', ('AAPL', 'QQQ', 'HSY')),
        ('start_date', datetime.datetime(2015, 6, 17)),
        ('end_date', datetime.datetime(2015, 12, 31)),
        ('timezone', timezone('US/Eastern')),
    )

    def __init__(self):
        pass

    def next(self):
        1/0


class MyCSVData(bt.feeds.GenericCSVData):
    params = (
        ('dtformat', '%Y%m%d'),
        ('tmformat', '%H%M'),
        ('datetime', 0),
        ('time', 1),
        ('open', 2),
        ('high', 3),
        ('low', 4),
        ('close', 5),
        ('volume', 6),
        ('openinterest', -1),
    )


def main(argv):
    cerebro = bt.Cerebro()
    cerebro.addstrategy(Strategy)

    # Datas are in a subfolder of the samples. Need to find where the script is
    # because it could have been called from anywhere
    modpath = os.path.dirname(os.path.abspath(argv[0]))

    # load all required datafeeds
    for stock in Strategy.params.stocks:
        data = MyCSVData(
            dataname=os.path.join(modpath, 'datas', '%s.txt' % stock),
            fromdate=Strategy.params.start_date,
            todate=Strategy.params.end_date,
            reverse=False,
            tzinput=Strategy.params.timezone)

        # Add the minute Data Feed to Cerebro
        cerebro.adddata(data, name="%s_min" % stock)

        # Add replayed Data Feed (daily)
        cerebro.replaydata(data, timeframe=bt.TimeFrame.Days, name="%s_day" % stock)

    # Set our desired cash start
    cerebro.broker.setcash(100000)

    # Run over everything
    cerebro.run()


if __name__ == '__main__':
    sys.exit(main(sys.argv))
```
Running the above code with 6 years of data (from 20100104,931 to 20161121,1533 to be exact) in the minute time frame takes 1 min 6 sec on my system (MacBook Pro, 3.1 GHz Intel Core i7, 16 GB 1867 MHz DDR3):
```
$ time ./test-load.py
Traceback (most recent call last):
  ...
  File "./test-load.py", line 25, in next
    1/0
ZeroDivisionError: integer division or modulo by zero

real    1m5.829s
user    1m3.669s
sys     0m1.034s
```
With 2 securities (added 'QQQ' to Strategy.params.stocks) it takes 2 min:
```
$ time ./test-load.py
...
  File "./test-load.py", line 25, in next
    1/0
ZeroDivisionError: integer division or modulo by zero

real    2m1.363s
user    1m56.855s
sys     0m1.888s
```
With 3 securities (added 'HSY' to Strategy.params.stocks) it takes 3 min 37 sec:
```
$ time ./test-load.py
Traceback (most recent call last):
  ...
  File "./test-load.py", line 25, in next
    1/0
ZeroDivisionError: integer division or modulo by zero

real    3m37.998s
user    3m15.327s
sys     0m4.490s
```
You can imagine how many days users would wait if they want to run a strategy that operates on, for example, all S&P 500 securities on the Raspberry Pi mentioned above :)
-
I generated a quick callgraph which might give us a bit more insight as to where we are spending time. I'll see about generating the same with your example.
-
@RandyT you can look at the profiler output in the first post of this thread for more info.
-
@Ed-Bartosh said in How to speed up backtest:
btw, "stdstats=False" approach didn't help. It saved 3 sec out of more than 2 minutes of backtest run.
It wasn't meant to give you a 50% savings.
Can you point out which parts of the backtester have to be changed to support pandas? How complex would this task be from your point of view?
It's not just supporting `pandas`, as in making `pd.Series` the underlying buffer. There is plenty of code which is pure Python, doing the overriding of operators and managing the indirection of the logic to make `0` the pivoting point pointing to the current data and `-1` the last data that was delivered. And also having single-line objects and multiple-line objects being compatible.
In the case of `exactbars=1`, the entire buffering system is switched to `collections.deque` and there is no real speed improvement.
As an experiment, it may be worth (or fun?) to produce a version which can switch the underlying buffer for lines to `pd.Series`.
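A toy illustration of that 0 / -1 pivoting convention (not actual backtrader code, just a sketch of the idea):

```python
# Toy sketch of the indexing convention described above: 0 is the current
# value, -1 the previously delivered one, and the buffer grows via forward().
class ToyLine:
    def __init__(self):
        self.buf = []  # backtrader uses array.array underneath, not a list

    def forward(self, value):
        self.buf.append(value)

    def __getitem__(self, ago):
        return self.buf[len(self.buf) - 1 + ago]  # ago=0 current, ago=-1 previous

line = ToyLine()
line.forward(10.0)
line.forward(11.0)
assert line[0] == 11.0 and line[-1] == 10.0
```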
-
@Ed-Bartosh said in How to speed up backtest:
You can imagine how many days users would wait if they want to run a strategy that operates for example all S&P 500 securities on the Raspberry Pi mentioned above
The Raspberry Pi would be doing live trading with a 2-line logic:
- Buy when it's cheap
- Sell when it's expensive. Go back to 1
-
@backtrader said in How to speed up backtest:
As an experiment, it may be worth (or fun?) to produce a version which can switch the underlying buffer for lines to pd.Series
I can help at least to test it. Not sure you need help with coding, but I can do that too :)
-
@backtrader said in How to speed up backtest:
The Raspberry Pi would be doing live trading with a 2 line logic:
Buy when it's cheap
Sell when it's expensive. Go back to 1
I'm pretty sure this use case doesn't cover more than 0.5% of the backtrader user base. However, many more people would be interested in backtesting strategies that involve many symbols.
PS: Don't get me wrong, please. I'm not questioning your design decision. I just want my algo to run faster :)
PPS: a quick search showed that pandas can be installed on a Raspberry Pi: http://hilpisch.com/rpi/02_data_analytics.html
-
The Raspberry Pi was never in doubt, because there are several complete Linux distributions. The other target (a freetzed Fritz!Box 7490) has no `pandas`, `numpy` or anything else.
-
The example from above uses `replaydata`. If this were not used, something that could be considered to speed things up would be:
- Replace the `load` cycle in a data feed with a plain one which simply loops over the data. This precludes the usage of filters (including resampling and replaying).
With that in mind, the usage of `pd.Series` would make sense when one is, for example, using a `PandasData` feed. In that case, instead of copying the data from the `DataFrame`, each column of the `DataFrame` would take the place of a given line and a pointer could be used to indicate the current location at which to access the data.
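A toy sketch of that DataFrame-columns-as-lines idea (made-up names, not backtrader code):

```python
import pandas as pd

# Sketch: the DataFrame columns stand in for the lines and a single pointer
# marks the current bar instead of copying values into per-line buffers.
class FrameBackedLines:
    def __init__(self, df):
        self.df = df
        self.idx = -1  # pointer to the current row

    def advance(self):
        self.idx += 1
        return self.idx < len(self.df)  # False once the data is exhausted

    def get(self, column, ago=0):
        # ago=0 -> current bar, ago=-1 -> previous bar, mirroring line indexing
        return self.df[column].iloc[self.idx + ago]

frame = pd.DataFrame({'close': [10.0, 10.5, 10.2]})
lines = FrameBackedLines(frame)
while lines.advance():
    print(lines.get('close'))
```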
-
@backtrader Unfortunately my algo uses replaydata. Can you suggest any optimisations for this?
So far I managed to make the algo run faster by loading the CSV into a pandas DataFrame and truncating the data that's not used in the algo. I also played with pickling/unpickling the DataFrame, but that didn't help much.
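For reference, a minimal sketch of that pickle-based caching step (file names are made up): parse the CSV once, store the DataFrame, and reload the pickle on later runs (as noted above, the gain was limited in practice).

```python
import os
import pandas as pd

# Made-up file names; the column layout matches the CSV used earlier.
CSV, CACHE = 'datas/AAPL.txt', 'datas/AAPL.pkl'

if os.path.exists(CACHE):
    df = pd.read_pickle(CACHE)  # fast path: no CSV parsing on later runs
else:
    df = pd.read_csv(CSV, header=None,
                     names=['date', 'time', 'open', 'high', 'low',
                            'close', 'volume'])
    df.to_pickle(CACHE)  # cache the parsed frame for the next run
```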