How to speed up backtest
After much consideration, there isn't a good in-code solution for replaydata because the timestamps have to be checked to reconstruct the bars.
One alternative: have a data feed which reads input that is already formatted to indicate replaying. X ticks are read without the data feed increasing its own length, until a marker is seen and the length is increased. With that in mind, no replaying is actually calculated on each and every run, because the replay sequence is already in the data feed (frozen on disk).
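A minimal sketch of the marker idea, in plain Python rather than as a backtrader feed. The class name `PrecomputedReplayFeed` and the row layout (open, high, low, close, volume, marker) are assumptions made for illustration only: a marker of `1` means "increase the length", `0` means "overwrite the bar currently being built".

```python
import csv
import io


class PrecomputedReplayFeed:
    """Hypothetical feed: each row already holds the aggregated bar state."""

    def __init__(self, fileobj):
        self._reader = csv.reader(fileobj)
        self.bars = []  # delivered bars; the last one may still be forming

    def __len__(self):
        return len(self.bars)

    def next(self):
        """Consume one row; return False once the input is exhausted."""
        row = next(self._reader, None)
        if row is None:
            return False
        *values, marker = row
        bar = [float(v) for v in values]
        if marker == "1" or not self.bars:
            self.bars.append(bar)   # marker seen: the length increases
        else:
            self.bars[-1] = bar     # same bar, updated in place: length unchanged
        return True


# Three ticks building one bar, then the first tick of the next bar
data = io.StringIO(
    "10,10,10,10,1,1\n"
    "10,11,10,11,2,0\n"
    "10,12,9,12,3,0\n"
    "12,12,12,12,1,1\n"
)
feed = PrecomputedReplayFeed(data)
lengths = []
while feed.next():
    lengths.append(len(feed))
```

Nothing is re-aggregated at run time here: the cost of building the replay sequence is paid once, when the file is written.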
Some tests have been made with buffers other than the standard distribution array.array. Summary of results:
- pandas.Series as a direct replacement is impossible: due to how appending data actually works, the times increase exponentially
- bcolz.carray and the built-in append functionality (touted as much lighter than anything you can do with numpy) is not such a disaster as pandas.Series, but the actual execution times increase by several orders of magnitude
- Reading CSV data with array.array, moving it to a pandas.Series buffer and then pre-allocating all indicators/observers buffers with either pandas.Series and then vectorizing the indicator operations (for example with pandas.Series.rolling(window=x, center=False).mean()) brings the performance close to that of purely using
For 22 years of daily data (approximately 22 x 256 = 5632 bars) and 1 single
It actually seems that array.array (at least for a small amount of bars) fares better than using pandas.Series, even with a vectorized operation.
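For anyone wanting to repeat this kind of comparison, here is a rough sketch of the two styles being compared: a moving average computed bar by bar on an array.array buffer, and the same average vectorized on a pandas.Series. The function names are made up for this example and this is not the code on the test branch. Timing each function (for instance with `timeit`) on your own data is left out on purpose.

```python
import array

import pandas as pd


def sma_array(values, period):
    """Simple moving average computed bar by bar on an array.array buffer."""
    out = array.array("d", [float("nan")] * len(values))
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= period:
            running -= values[i - period]   # drop the value leaving the window
        if i >= period - 1:
            out[i] = running / period
    return out


def sma_series(values, period):
    """The same average, vectorized over a pandas.Series buffer."""
    series = pd.Series(list(values), dtype="float64")
    return series.rolling(window=period, center=False).mean()


closes = array.array("d", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
by_loop = sma_array(closes, 3)
by_pandas = sma_series(closes, 3)
```

Both versions leave the first `period - 1` slots as NaN, so the results can be compared position by position.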
There is one last step which may be still worth trying:
- Modify the CSV data feed to directly use pandas and avoid the copying of data from
A real potential route of optimization (but with a real rework of the entire architecture) could be:
- dask to have calculations bypass the GIL and use all available cores (the backtrader machinery would create the needed task flow diagram, taking calculation dependencies into account). A long route in any case.
@backtrader Very interesting. Out of curiosity, were your tests run with Python3?
2 and 3 ... and 3 had the worst overall results with a couple of combinations being really incredibly slow.
For the sake of completeness and to make sure that 5000 and something bars is not really too small ... a test with a complete 1 year set made up of
Summarizing what the test does:
- Preload the data
- Create a single
- Do nothing in the strategy's
- 214911 minutes of data
pd.Series as the holding buffer and with the vectorized version of the
Overall Start : 2017-02-18 08:22:18.043000
Cerebro Create: 2017-02-18 08:22:18.045000
Cerebro Start : 2017-02-18 08:22:18.046000
Strategy Start: 2017-02-18 08:22:25.986000
Strategy End  : 2017-02-18 08:23:25.616000
Strategy Time : 0:00:59.630000
Cerebro End   : 2017-02-18 08:23:25.641000
Cerebro Time  : 0:01:07.595000
Test using the standard array.array in the Python distribution
Overall Start : 2017-02-18 08:24:16.984000
Cerebro Create: 2017-02-18 08:24:16.986000
Cerebro Start : 2017-02-18 08:24:16.987000
Strategy Start: 2017-02-18 08:24:24.609000
Strategy End  : 2017-02-18 08:24:46.384000
Strategy Time : 0:00:21.775000
Cerebro End   : 2017-02-18 08:24:46.385000
Cerebro Time  : 0:00:29.398000
- Data loading time is roughly below 8 seconds
- The pd.Series vectorized approach takes over 100% more time than the standard
As mentioned above, the data was still loaded using array.array and then moved into pandas.Series. The final test comprises:
- Use of
- Translation of pandas datetime to the timestamp used in backtrader, which is the same as in
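As a rough illustration of the translation step, the sketch below assumes the numeric timestamp is a float counting days (the proleptic Gregorian ordinal of the date) with the time of day as the fractional part. The helper name `datetime_to_daynum` is made up for this example and only handles naive datetimes. In real code, backtrader's own date2num should be used instead.

```python
import math
from datetime import datetime


def datetime_to_daynum(dt):
    """Hypothetical helper: naive datetime -> float day number.

    Assumption: integer part is dt.toordinal(), fraction is the time of day.
    """
    return math.fsum((
        float(dt.toordinal()),
        dt.hour / 24.0,
        dt.minute / 1440.0,
        dt.second / 86400.0,
        dt.microsecond / 86400e6,
    ))


noon = datetime(2017, 2, 18, 12, 0)
daynum = datetime_to_daynum(noon)
```

Because pandas.Timestamp is a subclass of datetime.datetime, the same helper also accepts the values found in a pandas datetime index.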
The result when loading the 214911 bars:
- Loading time has been reduced to 3.5 seconds from the previous roughly under
- Backtesting time remains the same with the pd.Series version still well over
The branch has been uploaded to the repository ... (branch - numpylines) (directly using numpy arrays was the 1st approach, which then moved to dynamic buffers with pandas ...), should anyone have any interest at all.
The data loader which directly uses
From this there is anyhow a positive point in that it may be possible to avoid copying some data when binding lines by just replacing object references in some cases. It will be attempted in the
- Use of
@backtrader ok, create a new thread for this. I'll be happy to explain. Here it would be off-topic I guess.
Thread created here: https://community.backtrader.com/post/923
@backtrader This is very interesting indeed! I'll definitely give this loader a try. Thank you very much!
The only indicator on that branch which has been changed is the SimpleMovingAverage
Thanks for pointing this out. Fortunately this is not a problem for me at all as I'm not using backtrader indicators.
@backtrader played with numpylines branch. Didn't see noticeable improvements. As you've mentioned empty backtest still runs more than one minute :( Any other ideas how to speed it up?
There are for sure some areas where small optimizations may be possible. For example:
It would be sensible to think that if each data feed makes an initial conversion of the current timestamp to a Python datetime.datetime instance, the rest of the platform could use it rather than performing local conversions.
But there is for sure an area which was designed with ease of use in mind, to create a clear boundary between the past, the present and the future and to offer a fixed reference point in time:
0 are translated back to the current position in the array and the access to
are ubiquitous in the core, and this is for sure something that has an impact on performance.
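To picture where that cost comes from, here is a deliberately simplified sketch of a buffer addressed relative to a moving "current" position. The class `RelativeLine` is hypothetical and is not backtrader's actual implementation. It only shows that every access involves an extra translation from a relative index to an absolute array position.

```python
import array


class RelativeLine:
    """Hypothetical buffer where index 0 is the current bar, -1 the previous."""

    def __init__(self, values):
        self.array = array.array("d", values)
        self.idx = -1  # absolute position of the current bar; -1 = not started

    def advance(self):
        self.idx += 1

    def __getitem__(self, ago):
        # Each access pays for this relative -> absolute translation
        pos = self.idx + ago
        if not 0 <= pos < len(self.array):
            raise IndexError("relative index %d is out of range" % ago)
        return self.array[pos]


line = RelativeLine([10.0, 11.0, 12.0])
line.advance()
line.advance()
current, previous = line[0], line[-1]
```

The convenience is that strategy code never needs to know the absolute position. The price is the arithmetic and bounds handling repeated on every single access.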
@backtrader thank you for the suggestions! I'll try to look at this.
My algo, which operates 3 securities with 2 time frames (minute and daily), runs two times faster after optimisations! Most of the performance gains came from switching to the PandasDirectData feed, truncating the data before loading it into backtrader, getting rid of replaydata and disabling datetime conversion when calling pd.read_csv. Thank you very much for your suggestions again!
I found it quite generic and useful to truncate the data to the fromdate-todate range. My csv files contain data for 10 years. When I run backtest for only one year of data I simply truncate the rest of data before loading into backtrader. This speeds up loading data quite a lot.
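A rough sketch of how such a truncation could look. It assumes timestamps stored as fixed-format "YYYY-MM-DD HH:MM:SS" strings without a timezone offset, so plain string comparison is enough to cut the range and only the rows that are kept get parsed into datetimes. The helper name `load_truncated` and the sample data are made up for this example.

```python
import io

import pandas as pd

CSV_TEXT = """datetime,open,high,low,close,volume
2014-11-14 09:31:00,95.90,96.00,95.80,95.95,700
2014-11-17 09:31:00,96.06,96.20,96.05,96.05,955
2014-11-17 09:32:00,96.13,96.32,96.10,96.32,1341
2014-11-18 09:31:00,96.40,96.50,96.30,96.45,400
"""


def load_truncated(source, fromdate, todate):
    """Keep rows with fromdate <= timestamp < todate (all given as strings)."""
    df = pd.read_csv(source, index_col=0)   # the index stays as plain strings
    # Fixed-format ISO strings sort chronologically, so compare them as text
    mask = (df.index >= fromdate) & (df.index < todate)
    df = df[mask]
    df.index = pd.to_datetime(df.index)     # parse only what is kept
    return df


df = load_truncated(io.StringIO(CSV_TEXT), "2014-11-17", "2014-11-18")
```

The resulting DataFrame holds only the requested range and can then be handed to the data feed as usual.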
I'm going to try numpylines branch again as I suspect I did something wrong that it didn't show any performance improvements. Can you suggest how to modify PandasDirectData loader to utilize your changes?
Here is an example of my data:
                               2      3      4      5     6
2014-11-17 09:31:00-05:00  96.06  96.20  96.05  96.05   955
2014-11-17 09:32:00-05:00  96.13  96.32  96.10  96.32  1341
2014-11-17 09:33:00-05:00  96.32  96.45  96.29  96.37   522
2014-11-17 09:34:00-05:00  96.30  96.33  96.20  96.20   208
2014-11-17 09:35:00-05:00  96.21  96.27  96.12  96.12   941
2014-11-17 09:36:00-05:00  96.11  96.36  96.05  96.23  2350
...
It's indexed by datetime, as you can see. Is it mandatory to index the data, btw? Does it make backtrader faster to load it?
Would be something like this
import pandas as pd
from backtrader import feed
from backtrader.utils.date import date2num


class PandasDirectData_NumPyLines(feed.DataBase):
    # Column offsets into the DataFrame; a negative value marks an absent column
    params = (
        ('datetime', 0),
        ('open', 1),
        ('high', 2),
        ('low', 3),
        ('close', 4),
        ('volume', 5),
        ('openinterest', 6),
    )

    datafields = [
        'datetime', 'open', 'high', 'low', 'close', 'volume', 'openinterest'
    ]

    def start(self):
        super(PandasDirectData_NumPyLines, self).start()
        self._df = self.p.dataname  # the DataFrame passed as dataname

    def preload(self):
        # Set the standard datafields - except for datetime
        for datafield in self.datafields[1:]:
            # get the column index
            colidx = getattr(self.params, datafield)
            if colidx < 0:
                # column not present -- skip
                continue

            line = getattr(self.lines, datafield)
            line.array = self._df.iloc[:, colidx]

        # datetime comes from the DataFrame index, converted with date2num
        field0 = self.datafields[0]
        dts = pd.to_datetime(self._df.index)
        getattr(self.lines, field0).array = dts.map(date2num)

        self._last()
        self.home()
datetime is taken directly from the index. The default column offsets for the other fields are probably off by 1 because of it, but luckily they can be configured.