Adding InfluxDB Datafeeds
-
Hi All,
Recently started using an InfluxDB database instead of CSV files due to memory limitations.
While those limitations do still exist, they are not as severe.
I've been adding data feeds as shown below:
import datetime
import backtrader as bt

cerebro = bt.Cerebro()

symbols = get_sp500_tickers().head(150)
for symbol in symbols:
    data = bt.feeds.InfluxDB(host='localhost', port='8086',
                             username='root', password='root',
                             database='test_datat', dataname=symbol,
                             timeframe=bt.TimeFrame.Minutes, compression=1,
                             startdate=datetime.datetime(2020, 1, 2),
                             todate=datetime.datetime(2020, 5, 22),
                             high='high', low='low', open='open',
                             close='close', volume='volume',
                             ointerest='openinterest')
    cerebro.adddata(data)
    print(symbol)

cerebro.run(runonce=True, stdstats=False)
Just wondering:
- Is this the optimal way of adding these kinds of feeds?
- Is there anything I can do to reduce the memory consumption?
Thanks
Alfred Sisley
-
You may try to play with exactbars, of course - but this will only possibly save the memory of the data stored in the LineBuffer objects. See here for more info.
As for the data stored in the result set inside the InfluxDB feed objects - there it may become more involved. See my previous post about it here.
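For reference, exactbars is just a run parameter (a minimal sketch; the feed/strategy setup is assumed to be the one from your snippet above):

import backtrader as bt

cerebro = bt.Cerebro()
# ... adddata / addstrategy as in the snippet above ...

# exactbars=True (or 1) keeps only the minimum look-back period in each
# LineBuffer - maximum saving, but preloading, runonce and plotting are
# disabled; the negative levels (-1, -2) save less but keep those usable
cerebro.run(exactbars=1, stdstats=False)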
-
@vladisld Great post!
-
Very much enjoyed seeing the graphic representation of memory usage. How would one do something similar? Was this done with a particular utility program?
-
Regarding your fix on GitHub, how would I go about incorporating it into the code I'm currently using for InfluxDB? Is it as simple as adding the missing functions?
def _preload(self):
def preload(self):
-
You mentioned "Using Pool.imap chunksize=1 illustrates this (specifying chunksize causes the Pool to use each worker process only for chunksize work items)". Would this look like cerebro.run(chunksize=1)?
As always, immensely appreciate your insights.
Alfred Sisley
-
@AlfredSisley said in Adding InfluxDB Datafeeds:
How would one do something similar? Was this done with a particular utility program?
@AlfredSisley said in Adding InfluxDB Datafeeds:
Is it as simple as adding the missing functions?
You may just replace the influxfeed.py.
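If it helps to locate the installed file you'd be replacing, a trivial sketch (this assumes the influxdb client package is installed, which it must be if the feed already works for you):

import backtrader.feeds.influxfeed as influxfeed

# prints the path of the stock influxfeed.py inside your environment's
# site-packages - this is the file to swap for the patched version
print(influxfeed.__file__)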
@AlfredSisley said in Adding InfluxDB Datafeeds:
Would this look like cerebro.run(chunksize=1)?
No, it is not a run method parameter. It is a multiprocessing.Pool.imap parameter. This is only relevant if you're using optstrategy - it has no impact on plain backtesting.
For this to work you need to modify the Cerebro engine. There you'll find some pretty intense code that requires a good understanding of the pickling mechanism and the multiprocessing package.
Besides the change to the chunksize, the other problem is the way the Cerebro engine instance is pickled over to the worker processes. The fix there is much more involved. I have some dirty changes in my fork - but they are definitely not at production quality level. See here - you may take it if you are brave enough :-)
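To make the chunksize behaviour concrete outside of backtrader, here is a small standalone illustration using plain multiprocessing (nothing backtrader-specific in it):

import os
from multiprocessing import Pool

def work(item):
    # report which worker process handled which item
    return item, os.getpid()

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        # with chunksize=1, items are dispatched one at a time, so
        # consecutive items tend to land on different workers; a larger
        # chunksize hands each worker a whole batch at once
        for item, pid in pool.imap(work, range(8), chunksize=1):
            print(item, pid)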
-
Thanks @vladisld!
-
I've made the changes to influxfeed.py, and I believe there was an improvement in run time.
-
Memory Usage @ 5 tickers:
-
Memory Usage @ 10 tickers:
Is there anything noticeable in the charts above that would suggest issues with the code?
Is there any way of unloading the historical data from memory at any point in the process and just maintaining the buy/sell decisions?
At this point, is my only alternative to upgrade my machine?
Much Appreciated!
Alfred Sisley
-
Have you experimented with exactbars, as I suggested earlier?
-
@vladisld I added cerebro.run(exactbars=1) and the program managed to run through 100 tickers, which is good news!
However, as you will see below, while it capped my memory usage, it also increased the time required.
It took around 16 seconds per ticker (100 tickers).
Below is a before/after example of memory consumption using 5 tickers.
For 500 tickers, I would be looking at 8,000 seconds (or roughly 2 hours) to run a backtest.
At this point, what else can I do to materially speed things up?
5 Tickers Before:
5 Tickers After:
Thanks
Alfred Sisley
-
Did you try backtesting each ticker in parallel (if it makes sense, of course)? A rough sketch below.
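Just to sketch the idea - a bare one-Cerebro-per-ticker skeleton, not a portfolio-level setup; get_sp500_tickers, the feed parameters and the strategy placeholder are taken from your earlier snippet:

import datetime
from multiprocessing import Pool

import backtrader as bt

def run_one(symbol):
    # each worker builds its own engine and feed from scratch -
    # the Cerebro instance itself never travels between processes
    cerebro = bt.Cerebro()
    data = bt.feeds.InfluxDB(
        host='localhost', port='8086', username='root', password='root',
        database='test_datat', dataname=symbol,
        timeframe=bt.TimeFrame.Minutes, compression=1,
        startdate=datetime.datetime(2020, 1, 2),
        todate=datetime.datetime(2020, 5, 22))
    cerebro.adddata(data)
    # cerebro.addstrategy(MyStrategy)  # your strategy goes here
    cerebro.run(exactbars=1, stdstats=False)
    return symbol

if __name__ == '__main__':
    symbols = list(get_sp500_tickers().head(150))
    with Pool(processes=4) as pool:
        for done in pool.imap_unordered(run_one, symbols):
            print(done, 'done')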
-
@vladisld said in Adding InfluxDB Datafeeds:
if it makes sense, of course
@vladisld If you mean running one ticker at a time, I did not - I was hoping to get results at a portfolio level.
Given the current limitations, I will try this.
At this point, short of changing the guts of cerebro, is there anything I can do on the software side to improve things? If not, would a better machine solve the problem?
Current laptop: i7-6500U (4 cores hyperthreaded), 12GB RAM.
Thanks
Alfred Sisley
-
Talking about HW specs makes it a completely different discussion of course, however with all that said:
An i7-6500U (4 cores hyperthreaded) + 12GB probably buys you a decent machine for development and basic backtesting. However, you'll hit its limits pretty fast (if not already) once you start optimizing.
In my case, a similar machine is used for development, but for optimizing a more capable server is used (an old dual Xeon E5-2690 v2 - 48 cores + 256GB + a small NVMe drive). Usually I'm running optimizations with up to 4096 configs - so something that would take months on my dev machine may take a day or less on this server.
So the answer is yes - a more capable, dedicated machine may improve things for heavy loads. And it is actually not that expensive - I got mine for less than $1,500 on eBay.
-
@vladisld said in Adding InfluxDB Datafeeds:
Talking about HW specs makes it a completely different discussion of course, however with all that said:
I'll look into upgrading my machine.
Are you surprised by how long it takes me to run my simple skeleton backtests given my hardware?
Does it make you think there's something very wrong with my code?
Sounds like you have no problem backtesting with similar hardware specs, but I can't seem to get anywhere.
Appreciate your thoughts.
Alfred Sisley
-
@AlfredSisley said in Adding InfluxDB Datafeeds:
Are you surprised by how long it takes me to run my simple skeleton backtests given my hardware?
Hard to tell - I can't directly compare with my local results. Usually I'm backtesting a single ticker at a time, so my mileage may vary.
What's your timing for backtesting a single ticker over, say, 5-10 years of 1- or 5-minute bars?
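For comparing numbers, the simplest possible timing wrapper (stdlib only; cerebro is assumed to be already set up with a single feed and your strategy):

import time

start = time.perf_counter()
cerebro.run(exactbars=1, stdstats=False)
print('elapsed: %.1f s' % (time.perf_counter() - start))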
-
Hi @vladisld,
I have to say “thank you” - because of your posts I started using InfluxDB to store my data. I spent some time figuring out how to load filtered data from AWS S3 buckets into Influx, but finally solved that. What is still not clear to me, and what relates to this topic, is how to correctly feed data into backtrader. I tried the built-in influxfeed.py, but there are a few mistakes, e.g. the open price is calculated from the “open price mean”.
Would you be so kind as to share good practices for working with influxfeed, and the influxfeed code you are using? That would help me a lot.
Thank you in advance
-
@luzzalgos said in Adding InfluxDB Datafeeds:
open price calculated from “open price mean”
I'm not sure it is a problem, as long as the requested timeframe/compression matches the timeframe and compression stored in the database.
I do have a slightly modified version of influxfeed.py in my fork that doesn't use 'mean' for calculating the open price (purely from a performance perspective, although it still uses grouping) and includes more fixes to support proper 'preload' functionality and a reduced memory footprint:
https://github.com/vladisld/backtrader/blob/lazytrader/backtrader/feeds/influxfeed.py
As for good practices - I don't have many - it works pretty well as it is. As with all databases, there are many ways of tuning performance in InfluxDB itself - starting with engine parameters, indexing, proper tag selection, hardware upgrades (in case you are serious about your data) and so on - however this is a far bigger topic and not suitable for a simple post (just google 'influxdb performance tuning' - there are a lot of resources and tutorials).
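For the “open price mean” concern specifically, it may be worth pulling a few raw rows straight from Influx and comparing them with what the feed delivers for the same range. A minimal sketch with the influxdb client (assuming, as the feed does, that each ticker is its own measurement; 'AAPL' and the connection details are placeholders):

from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086,
                        username='root', password='root',
                        database='test_datat')
# fetch a handful of raw bars for one measurement/ticker and compare
# them against the bars the backtrader feed produces for that range
result = client.query('SELECT open, high, low, close, volume '
                      'FROM "AAPL" LIMIT 5')
for point in result.get_points():
    print(point)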