How to speed up backtest
-
After much consideration, there isn't a good in-code solution for `replaydata`, because the timestamps have to be checked to reconstruct the bars.

An alternative: have a data feed which reads input formatted to indicate replaying. X ticks are read and the data feed does not increase its own length until a marker is seen, at which point the length is increased. With that approach there is no actual replaying being calculated each and every time, because the replay sequence is already in the data feed (frozen on disk).
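As an illustration of the idea only (this is a hypothetical file format and reader, not something in backtrader): a plain parser for a pre-recorded replay file, where a per-row flag marks the tick that completes a bar, so the feed only grows its length when the flag is set.

```python
import csv
from datetime import datetime

def read_frozen_replay(path):
    """Yield ((dt, o, h, l, c, v), bar_closed) tuples from a pre-recorded
    replay file. Hypothetical format: dt, open, high, low, close, volume,
    closed -- where closed == 1 marks the tick that freezes the current bar."""
    with open(path) as f:
        for row in csv.reader(f):
            dt = datetime.strptime(row[0], '%Y-%m-%d %H:%M:%S')
            o, h, l, c, v = (float(x) for x in row[1:6])
            yield (dt, o, h, l, c, v), row[6] == '1'

# A feed built on this would overwrite its last bar while bar_closed is False
# and only advance (increase its length) when bar_closed is True -- no replay
# logic has to be recomputed from timestamps at runtime.
```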
-
Some tests have been made with buffers other than the standard distribution `array.array`. Summary of results:
- Using `pandas.Series` as a direct replacement is impossible: due to how appending data actually works, the times increase exponentially
- Using `bcolz.carray` and the built-in `append` functionality (touted as much lighter than anything you can do with `numpy`) is not such a disaster as `pandas.Series`, but the actual execution times still increase by several orders of magnitude
- Reading CSV data with `array.array`, moving it to a `bcolz.carray` or `pandas.Series` buffer, pre-allocating all indicator/observer buffers with either `bcolz.carray` or `pandas.Series`, and then vectorizing the indicator operations (for example with `pandas.Series.rolling(window=x, center=False).mean()`) brings the performance close to that of purely using `array.array` (see the sketch below)
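For reference, the kind of vectorized indicator calculation referred to above looks like this (the window size and dummy data are made up for the example):

```python
import numpy as np
import pandas as pd

closes = pd.Series(np.random.random(512) * 100.0)  # dummy close prices

# one vectorized call computes the SMA over the entire buffer, instead of
# recomputing the window average bar by bar in next()
sma = closes.rolling(window=20, center=False).mean()
```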
For 22 years of daily data (approximately 22 x 256 = 5632 bars) and 1 single `SimpleMovingAverage`:

- Standard `array.array`: 0.7s
- Approach with `pandas.Series`: 1.4s
It actually seems that `array.array` (at least for a small number of bars) fares better than `pandas.Series`, even with a vectorized operation.

There is one last step which may still be worth trying:
- Modify the CSV data feed to directly use `pandas` and avoid the copying of data from `array.array` to `pandas.Series`
A real potential route of optimization (but requiring a real rework of the entire architecture) could be `dask`, to have calculations bypass the `GIL` and use all available cores (the backtrader machinery would create the needed task-flow graph, taking calculation dependencies into account). A long route in any case. A rough sketch of the idea is below.
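A very rough sketch of what such a `dask` task flow might look like (the indicator functions, data and scheduler choice here are assumptions for illustration, not backtrader code):

```python
import dask
import numpy as np
import pandas as pd

closes = pd.Series(np.random.random(100000) * 100.0)  # dummy close prices

@dask.delayed
def sma(series, period):
    return series.rolling(window=period).mean()

@dask.delayed
def crossover(fast, slow):
    # +1 on an upward cross, -1 on a downward cross
    return (fast > slow).astype(int).diff()

# the graph encodes the calculation dependencies: both SMAs can run in
# parallel, the crossover waits for both of them
signal = crossover(sma(closes, 10), sma(closes, 50))

# the process scheduler sidesteps the GIL and can use all available cores
result = signal.compute(scheduler='processes')
```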
-
@backtrader Very interesting. Out of curiosity, were your tests run with Python3?
-
2 and 3 ... and 3 had the worst overall results with a couple of combinations being really incredibly slow.
-
For the sake of completeness, and to make sure that 5000-and-something bars is not really too small ... a test with a complete 1-year set made up of `1-minute` bars.

Summarizing what the test does (an equivalent setup is sketched after the list):
- Preload the data
- Create a single `SMA` of `period=20`
- Do nothing in the strategy's `next`
- 214911 minutes of data
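A sketch of an equivalent benchmark (the file name is a placeholder and the timing output is reduced to the overall run; the original test also prints the intermediate Cerebro/Strategy timestamps):

```python
import datetime
import backtrader as bt

class BenchStrategy(bt.Strategy):
    def __init__(self):
        # a single indicator, nothing else
        self.sma = bt.indicators.SimpleMovingAverage(self.data, period=20)

    def next(self):
        pass  # do nothing -- only the platform overhead is measured

cerebro = bt.Cerebro()
cerebro.adddata(bt.feeds.BacktraderCSVData(dataname='minute_bars.csv'))
cerebro.addstrategy(BenchStrategy)

start = datetime.datetime.now()
cerebro.run(preload=True)  # preload the data, as in the test
print('Cerebro Time :', datetime.datetime.now() - start)
```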
Test using `pd.Series` as the holding buffer and with the vectorized version of the `SMA`:

```
Overall Start : 2017-02-18 08:22:18.043000
Cerebro Create: 2017-02-18 08:22:18.045000
Cerebro Start : 2017-02-18 08:22:18.046000
Strategy Start: 2017-02-18 08:22:25.986000
Strategy End  : 2017-02-18 08:23:25.616000
Strategy Time : 0:00:59.630000
Cerebro End   : 2017-02-18 08:23:25.641000
Cerebro Time  : 0:01:07.595000
```
Test using the standard `array.array` in the Python distribution:

```
Overall Start : 2017-02-18 08:24:16.984000
Cerebro Create: 2017-02-18 08:24:16.986000
Cerebro Start : 2017-02-18 08:24:16.987000
Strategy Start: 2017-02-18 08:24:24.609000
Strategy End  : 2017-02-18 08:24:46.384000
Strategy Time : 0:00:21.775000
Cerebro End   : 2017-02-18 08:24:46.385000
Cerebro Time  : 0:00:29.398000
```
Quick summary:

- Data loading time is just under 8 seconds in both cases
- The `pd.Series` vectorized approach takes over 100% more time than the standard `array.array` version (roughly 60s vs 22s in the strategy phase)
-
As mentioned above, the data was still loaded using `array.array` and then moved into `pandas.Series`. The final test comprises (see the sketch after the list):

- Use of `pandas.read_csv`
- Translation of the `pandas` datetime to the timestamp used in backtrader, which is the same as in `matplotlib`
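A minimal sketch of those two steps (the file name is a placeholder; as noted above, backtrader's internal timestamp is the same float format `matplotlib` uses):

```python
import pandas as pd
from matplotlib.dates import date2num

# parse the whole file, including the datetime index, in one go
df = pd.read_csv('minute_bars.csv', index_col=0, parse_dates=True)

# vectorized conversion to the float timestamps backtrader stores internally
timestamps = date2num(df.index.to_pydatetime())
```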
The result when loading the 214911 bars:

- Loading time has been reduced to `3.5` seconds from the previous roughly under `8` seconds
- Backtesting time remains the same, with the `pd.Series` version still well over `1 minute`
The branch has been uploaded to the repository ... (branch: `numpylines`) (directly using `numpy` arrays was the 1st approach, which then moved to dynamic buffers with `bcolz`, `pandas` ...), should anyone have any interest at all. The data loader which directly uses `pd.DataFrame` and `pd.Series` is called `BacktraderCSVData2`.
From this there is anyhow a positive point: it may be possible to avoid copying some data when binding lines by just replacing object references in some cases. It will be attempted in the `development` branch.
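The difference being hinted at, in plain Python terms (illustrative only, not the actual line-binding code):

```python
import array

source = array.array('d', [1.0, 2.0, 3.0])

# copying: every element is duplicated into a new buffer
copied = array.array('d', source)

# binding by reference: both names point at the same underlying buffer,
# so no data is moved at all
bound = source
```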
-
@Ed-Bartosh said in How to speed up backtest:
@backtrader ok, create a new thread for this. I'll be happy to explain. Here it would be off-topic I guess.
Thread created here: https://community.backtrader.com/post/923
-
@backtrader This is very interesting indeed! I'll definitely give this loader a try. Thank you very much!
-
The only indicator on that branch which has been changed is the `SimpleMovingAverage`
-
@backtrader said in How to speed up backtest:
The only indicator on that branch which has been changed is the SimpleMovingAverage
Thanks for pointing this out. Fortunately this is not a problem for me at all as I'm not using backtrader indicators.
-
@backtrader played with the numpylines branch. Didn't see noticeable improvements. As you've mentioned, the empty backtest still runs for more than one minute :( Any other ideas how to speed it up?
-
There are for sure some areas where small optimizations may be possible. For example:

- `datetime` conversions

  It would be sensible to think that if each data feed makes an initial conversion of the current timestamp to a Python `datetime.datetime` instance, the rest of the platform could use this, rather than performing local conversions.
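A hypothetical sketch of that kind of one-time conversion and reuse (the class and method names are made up, not backtrader code; the converter passed in would be something like backtrader's own `num2date`):

```python
class CachedDatetimeFeed(object):
    """Convert the current bar's float timestamp once and reuse it."""

    def __init__(self, num2date):
        self._num2date = num2date  # float timestamp -> datetime.datetime

    def _on_new_bar(self, dt_float):
        self._dt = self._num2date(dt_float)  # single conversion per bar

    def current_datetime(self):
        # downstream code reads the cached instance instead of converting
        # the same float timestamp again locally
        return self._dt
```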
But there is for sure an area which was designed with ease of use in mind, to create a clear boundary between the past, the present and the future, and to offer a fixed reference point in time: `0` indexing.

References to `0` are translated back to the current position in the array, and accesses to `[0]` are ubiquitous in the core; this is for sure something that has an impact on performance. A toy model of that translation is shown below.
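A toy model of the indexing scheme (this is not backtrader's actual `LineBuffer`, just an illustration of the translation cost):

```python
class ToyLineBuffer(object):
    def __init__(self):
        self.array = []
        self.idx = -1  # current position, advanced once per bar

    def forward(self, value):
        self.array.append(value)
        self.idx += 1

    def __getitem__(self, ago):
        # [0] is "now", [-1] the previous bar ... every single access pays
        # for translating the relative index back to an absolute position
        return self.array[self.idx + ago]

buf = ToyLineBuffer()
for price in (10.0, 10.5, 11.0):
    buf.forward(price)

print(buf[0], buf[-1])  # 11.0 10.5
```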
-
@backtrader thank you for the suggestions! I'll try to look at this.
My algo, which operates 3 securities with 2 time frames (minute and daily), runs two times faster after the optimisations! Most of the performance gains came from switching to the `PandasDirectData` feed, truncating the data before loading it into backtrader, getting rid of `replaydata`, and disabling datetime conversion when calling `pd.read_csv`. Thank you very much for your suggestions again!
I found it quite generic and useful to truncate the data to the `fromdate`-`todate` range. My CSV files contain data for 10 years. When I run a backtest for only one year of data, I simply truncate the rest of the data before loading it into backtrader (something along the lines of the snippet below). This speeds up data loading quite a lot.
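For reference, something along these lines (the file name and dates are placeholders):

```python
import pandas as pd

df = pd.read_csv('bars_10_years.csv', index_col=0, parse_dates=True)

# keep only the backtest window before handing the frame to backtrader
df = df.loc['2014-01-01':'2014-12-31']
```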
I'm going to try the `numpylines` branch again, as I suspect I did something wrong given that it didn't show any performance improvements. Can you suggest how to modify the `PandasDirectData` loader to utilize your changes?
Here is an example of my data:

```
                                2      3      4      5     6
2014-11-17 09:31:00-05:00   96.06  96.20  96.05  96.05   955
2014-11-17 09:32:00-05:00   96.13  96.32  96.10  96.32  1341
2014-11-17 09:33:00-05:00   96.32  96.45  96.29  96.37   522
2014-11-17 09:34:00-05:00   96.30  96.33  96.20  96.20   208
2014-11-17 09:35:00-05:00   96.21  96.27  96.12  96.12   941
2014-11-17 09:36:00-05:00   96.11  96.36  96.05  96.23  2350
...
```
It's indexed by datetime, as you can see. Is it mandatory to index the data, btw? Does it make backtrader load it faster?
-
It would be something like this:
```python
import pandas as pd

from backtrader import feed
from backtrader.utils import date2num  # float timestamps, matplotlib-compatible


class PandasDirectData_NumPyLines(feed.DataBase):
    params = (
        ('datetime', 0),
        ('open', 1),
        ('high', 2),
        ('low', 3),
        ('close', 4),
        ('volume', 5),
        ('openinterest', 6),
    )

    datafields = [
        'datetime', 'open', 'high', 'low', 'close', 'volume', 'openinterest'
    ]

    def start(self):
        super(PandasDirectData_NumPyLines, self).start()
        self._df = self.p.dataname

    def preload(self):
        # Set the standard datafields - except for datetime
        for datafield in self.datafields[1:]:
            # get the column index
            colidx = getattr(self.params, datafield)
            if colidx < 0:
                # column not present -- skip
                continue

            # bind the DataFrame column directly as the line buffer
            l = getattr(self.lines, datafield)
            l.array = self._df.iloc[:, colidx]

        # datetime is taken from the DataFrame index and converted to
        # backtrader's float timestamp representation
        field0 = self.datafields[0]
        dts = pd.Series(pd.to_datetime(self._df.index))
        getattr(self.l, field0).array = dts.apply(date2num)

        self._last()
        self.home()
```
Where `datetime` is directly taken from the index. The default column offsets for the other fields are probably off by one because of it, but they can luckily be configured.
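Given the sample data shown earlier (open/high/low/close/volume in positions 0-4 of the frame, datetime in the index, no open interest), usage would presumably look something like this (file name assumed):

```python
import backtrader as bt
import pandas as pd

df = pd.read_csv('minute_bars.csv', index_col=0, parse_dates=True)

data = PandasDirectData_NumPyLines(
    dataname=df,
    open=0, high=1, low=2, close=3, volume=4,
    openinterest=-1,  # not present in the frame -> skipped by preload()
)

cerebro = bt.Cerebro()
cerebro.adddata(data)
cerebro.run(preload=True)
```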