SQLite Database - IndexError: array index out of range
-
Hi All,
Been using BT for a couple of weeks now. I'm looking to use a SQLite database as the datafeed, since CSVs lead to an "insufficient memory" state and don't run all the way through (6-month period, 1-minute data, 500 stocks).
I've set up the database on my drive and uploaded it with my data (source: IQFeed).
The script I use to transfer CSV data to the database is this:
import sqlite3
import time
import datetime
import random
import pandas as pd
from sp500tickers import get_sp500_tickers

file_path = r"C:\Users\julia\Desktop\Python\Project - Automated_Trading\Tickers"

conn = sqlite3.connect('test_database.db')
c = conn.cursor()

def create_table():
    c.execute('CREATE TABLE IF NOT EXISTS stuffToPlot (ticker TEXT, datetime TEXT, open REAL, low REAL, high REAL, close REAL, volume INTEGER, openinterest INTEGER)')

def data_entry():
    tickers = get_sp500_tickers().head(1)
    for ticker in tickers:
        data = file_path + '/' + ticker + '.csv'
        df = pd.read_csv(data, header=None)
        df.columns = ['datetime', 'open', 'low', 'high', 'close', 'volume', 'openinterest']
        df['ticker'] = ticker  # Add ticker column
        df.set_index('datetime')  # note: not assigned back, so df is unchanged (harmless here since to_sql uses index=False)
        print(df.index)
        df.to_sql('stuffToPlot', conn, if_exists='append', index=False)
        conn.commit()

data_entry()
Screenshots of the raw data and of the SQLite database are omitted here. I've included all the tickers in a single database table; also note the ticker column.
I'm getting the following error:
Traceback (most recent call last):
  File "c:/Users/julia/Desktop/Python/Project - Automated_Trading/Algo_sql.py", line 254, in <module>
    cerebro.run()
  File "C:\Users\julia\AppData\Local\Programs\Python\Python38-32\lib\site-packages\backtrader\cerebro.py", line 1127, in run
    runstrat = self.runstrategies(iterstrat)
  File "C:\Users\julia\AppData\Local\Programs\Python\Python38-32\lib\site-packages\backtrader\cerebro.py", line 1293, in runstrategies
    self._runonce(runstrats)
  File "C:\Users\julia\AppData\Local\Programs\Python\Python38-32\lib\site-packages\backtrader\cerebro.py", line 1652, in _runonce
    strat._once()
  File "C:\Users\julia\AppData\Local\Programs\Python\Python38-32\lib\site-packages\backtrader\lineiterator.py", line 297, in _once
    indicator._once()
  File "C:\Users\julia\AppData\Local\Programs\Python\Python38-32\lib\site-packages\backtrader\lineiterator.py", line 297, in _once
    indicator._once()
  File "C:\Users\julia\AppData\Local\Programs\Python\Python38-32\lib\site-packages\backtrader\linebuffer.py", line 630, in _once
    self.oncestart(self._minperiod - 1, self._minperiod)
  File "C:\Users\julia\AppData\Local\Programs\Python\Python38-32\lib\site-packages\backtrader\lineroot.py", line 165, in oncestart
    self.once(start, end)
  File "C:\Users\julia\AppData\Local\Programs\Python\Python38-32\lib\site-packages\backtrader\linebuffer.py", line 672, in once
    dst[i] = src[i + ago]
IndexError: array index out of range
For simplicity, I've kept next() blank.
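For reference, it's just a skeleton strategy along these lines (a minimal sketch; the class name is illustrative and the indicators set up in __init__ are omitted):

import backtrader as bt

class BlankStrategy(bt.Strategy):
    def __init__(self):
        # indicators are set up here as usual (omitted)
        pass

    def next(self):
        pass  # intentionally left blank while debugging the data feed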
The datafeed class looks like (inspired by online examples):
class MySQLData(DataBase):
    params = (
        ('ticker', ''),
        ('fromdate', datetime.date(2020, 1, 2)),
        ('todate', datetime.date(2020, 5, 22)),
        ('name', ''),
        ('compression', 1),
        ('timeframe', bt.TimeFrame.Minutes),
    )

    def start(self):
        DB_LOCATION = r"C:\Users\julia/test_database.db"
        self.connection = sqlite3.connect(DB_LOCATION)
        self.cur = self.connection.cursor()
        # Note: in SQLite, single-quoted 'openinterest' and 'ticker' are string
        # literals, not column names, so WHERE 'ticker' = ? compares the word
        # "ticker" against the parameter (use backticks or double quotes for
        # identifiers).
        self.result = self.connection.execute(
            "SELECT `datetime`,`open`,`low`,`high`,`close`,`volume`,'openinterest', 'ticker' "
            "FROM `stuffToPlot` WHERE 'ticker' =" + '?' +
            " AND `datetime` between '" + self.p.fromdate.strftime('%Y-%m-%d') +
            "' and '" + self.p.todate.strftime('%Y-%m-%d') +
            "' ORDER BY `datetime` ASC",
            (str(self.p.ticker),))

    def stop(self):
        self.connection.close()
        self.cur.close()

    def _load(self):
        one_row = self.result.fetchone()
        if one_row is None:
            return False
        # date2num() expects a datetime object; one_row[0] comes back as TEXT,
        # so it would need to be parsed (e.g. with datetime.strptime) first.
        self.lines.datetime[0] = date2num(one_row[0])
        self.lines.open[0] = float(one_row[1])
        self.lines.high[0] = float(one_row[3])
        self.lines.low[0] = float(one_row[2])
        self.lines.close[0] = float(one_row[4])
        self.lines.volume[0] = int(one_row[5])
        self.lines.openinterest[0] = int(one_row[6])
        return True
In the main, I run a loop to add all the feeds:
if __name__ == '__main__':
    symbols = ['MMM']
    for s in symbols:
        data = MySQLData(ticker=s, name=s,
                         timeframe=bt.TimeFrame.Minutes, compression=1,
                         fromdate=datetime.date(2020, 1, 2),
                         todate=datetime.date(2020, 5, 22),
                         datetime=0, open=1, low=2, high=3, close=4,
                         volume=5, openinterest=6, nullvalue=0.0,
                         dtformat=('%Y-%m-%d %H:%M:%S'),
                         tmformat=('%H:%M:%S'))
        cerebro.adddata(data)
    cerebro.run()
It's entirely possible that this is more of a SQLite question but any help on this is greatly appreciated!
Alfred Sisley
-
I find something like this a bit easier to work with. Obviously add in datetime variables where appropriate. If you need to, you can extend the pandas data class to modify feeds and line locations (a sketch of that follows after the output below), or you can just adjust the columns in your dataframe using .loc.
conn = sqlite3.connect("data/TestDatabase.db")
tickers = ["ES"]

for t in tickers:
    sql = "SELECT datetime, open, high, low, close, vol FROM stuffToPlot WHERE datetime > '2020-01-01' and ticker='{}'".format(
        t
    )
    dataframe = pd.read_sql(sql, con=conn)
    dataframe["datetime"] = pd.to_datetime(dataframe["datetime"])
    dataframe = dataframe.set_index("datetime")
    dataframe = dataframe.rename(columns={'vol': 'volume'})
    data = bt.feeds.PandasData(dataname=dataframe)
    print(dataframe.head())
    cerebro.adddata(data)
Printing out...
dt 2020-01-02 09:31:00, o 3235.25 h 3237.00 l 3234.75 c 3237.00 v 16808
dt 2020-01-02 09:32:00, o 3237.00 h 3237.00 l 3234.00 c 3234.75 v 10439
dt 2020-01-02 09:33:00, o 3234.50 h 3235.50 l 3233.75 c 3234.75 v 8203
dt 2020-01-02 09:34:00, o 3234.75 h 3238.00 l 3234.75 c 3237.50 v 8664
dt 2020-01-02 09:35:00, o 3237.25 h 3238.25 l 3236.25 c 3236.25 v 7889
dt 2020-01-02 09:36:00, o 3236.50 h 3239.50 l 3236.00 c 3239.00 v 11918
dt 2020-01-02 09:37:00, o 3238.75 h 3239.25 l 3237.75 c 3238.75 v 7558
dt 2020-01-02 09:38:00, o 3238.75 h 3239.00 l 3237.25 c 3237.75 v 6557
dt 2020-01-02 09:39:00, o 3237.75 h 3238.25 l 3235.75 c 3237.00 v 7043
dt 2020-01-02 09:40:00, o 3236.75 h 3238.25 l 3236.75 c 3238.00 v 6474
dt 2020-01-02 09:41:00, o 3238.00 h 3239.50 l 3237.75 c 3238.25 v 10372
dt 2020-01-02 09:42:00, o 3238.25 h 3238.50 l 3237.50 c 3237.75 v 5401
dt 2020-01-02 09:43:00, o 3238.00 h 3238.25 l 3237.25 c 3237.50 v 4868
dt 2020-01-02 09:44:00, o 3237.50 h 3238.50 l 3237.25 c 3238.00 v 4275
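If your table's column names or order don't line up with what PandasData expects, one option is a small subclass like this rough sketch (untested against your schema; adjust the names to whatever your dataframe actually holds):

import backtrader as bt

class MyPandasData(bt.feeds.PandasData):
    # Map dataframe columns to backtrader lines by name.
    params = (
        ('datetime', None),        # None here: take the datetime from the dataframe index
        ('open', 'open'),          # map by column name
        ('high', 'high'),
        ('low', 'low'),
        ('close', 'close'),
        ('volume', 'volume'),
        ('openinterest', None),    # None here: column not present in the frame
    )

With the default parameters PandasData will also autodetect columns by (case-insensitive) name, so a subclass is only needed when your names or positions differ.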
-
@run-out said in SQLite Database - IndexError: array index out of range:
Obviously add in datetime variables where appropriate.
Thanks @run-out. I previously toyed around with going from SQLite to pandas df to Cerebro but it still resulted in insufficient memory.
Looking to pull data straight from the database, which will hopefully result in being able to run my strategy on 500 tickers.
Any suggestions? Do you use a great number of tickers in your strategies, if so, how?
Alfred Sisley
-
I ran a loop 500 times to pull the ES minute data I have (RTH only) and without the open interest column, and it loaded into memory, although slowly.
Is there not some way you can scale back your test? Perhaps run it on multiple sets of 50 tickers? Or restrict trading hours and reduce the data that way? Drop open interest and maybe volume if not using them?
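To put the "multiple sets of 50" idea into code, something like this rough sketch could work (MySQLData and get_sp500_tickers are from your earlier scripts; MyStrategy is a placeholder for whatever strategy you add):

import backtrader as bt
from sp500tickers import get_sp500_tickers

def chunks(seq, size):
    """Yield successive slices of seq of length size."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

symbols = list(get_sp500_tickers())    # the full ticker list
for batch in chunks(symbols, 50):
    cerebro = bt.Cerebro()             # a fresh engine per batch keeps memory bounded
    for s in batch:
        cerebro.adddata(MySQLData(ticker=s, name=s))
    cerebro.addstrategy(MyStrategy)    # placeholder for your strategy class
    cerebro.run()

The obvious trade-off is that signals are then only compared within each batch of 50, not across all 500 tickers at once.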
One thing I did, where possible, was use one of my testing variables to screen the universe of tickers before the backtest. This allowed me to reduce 5000 stocks to 600 stocks for the entire backtest (daily frequency).
All in all, you need to find a way to have less data, or get a better machine! Hope that helps. :)
-
@run-out said in SQLite Database - IndexError: array index out of range:
This allowed me to reduce 5000 stocks to 600 stocks for the entire backtest (daily frequency).
Dropping columns (volume & open interest) absolutely helped but still not enough.
My strategy consists of scanning 500 stocks every minute for a signal (all during RTH), then buying on that signal. I don't think I can exclude any data given this?
Wondering if you could recommend specs for a new machine that would be able to handle this amount of data? Or perhaps a particular VPS?
Your insight is greatly appreciated @run-out !
Alfred Sisley
-
Have you considered (or do you have any experience with) using a time-series database instead of SQLite, like InfluxDB for example? Shouldn't it be more optimized for fetching time ranges than SQLite?
-
@vladisld Have not considered nor do I have any experience with InfluxDB. But having poked around the forum, it appears to be the database of choice.
Looking through the documentation, I put the following script together to write DataFrames to the database; however, the database appears empty.
Code:
from influxdb import DataFrameClient
from influxdb import client as influxdb
import pandas as pd

dbname = 'demo'
client = DataFrameClient(host='127.0.0.1', port=8086, username='root',
                         password='root', database=dbname)
client.create_database(dbname)

df = pd.read_csv(r"C:\Users\julia\Desktop\Python\Project - Automated_Trading\Tickers\A.csv",
                 header=None)
df.columns = ['datetime', 'open', 'low', 'high', 'close', 'volume', 'openinterest']
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y-%m-%d %H:%M:%S')
df = df.set_index('datetime')

client.write_points(df, measurement='enroll_pd', protocol='json', tags=df.columns)

# Note: the data was written to measurement 'enroll_pd', but this queries a
# measurement named 'demo', which does not exist.
print(client.query("select * from demo"))
Would it be possible to suggest what I'm doing wrong, or perhaps provide Python code that would be better suited?
Currently getting data from IQFeed as CSVs in the column order used in the scripts above: datetime, open, low, high, close, volume, openinterest (sample screenshot omitted).
Greatly appreciate your help!
Alfred Sisley
-
There is a good sample in the code, iqfeed-to-influxdb.py, that does exactly that.
-
@vladisld Thanks for the link.
Getting the following error:
usage: iqfeed-to-influxdb.py [-h] (--ticker TICKER | --ticker-list TICKER_LIST)
                             [--dbhost DBHOST] [--dbport DBPORT]
                             [--iqhost IQHOST] [--iqport IQPORT]
                             [--username USERNAME] [--password PASSWORD]
                             [--database DATABASE] [--fromdate FROMDATE]
                             [--todate TODATE] [--debug] [--info]
iqfeed-to-influxdb.py: error: one of the arguments --ticker --ticker-list is required
I'm not exactly sure where in the code to add my tickers.
I believe it's here:
def get_tickers_from_file(self, filename):
    """Load ticker list from txt file"""
    if not os.path.exists(filename):
        log.error("Ticker List file does not exist: %s", filename)
    tickers = []
    with io.open(filename, 'r') as fd:
        for ticker in fd:
            tickers.append(ticker.rstrip())
    return tickers
I've added a global variable called 'filename' with the name of the file as a string, but that doesn't seem to work.
Any insight on how to get started with this? Appreciate your patience...
Alfred Sisley
-
Not using it myself (I have my own utility). It seems you just need to supply a ticker name (or a filename containing ticker names, one per line) on the command line. It will connect to the local IQFeed client by itself, download the history data and store it in the InfluxDB database.
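Judging from the usage message you posted, the invocation would be something along these lines (the ticker file name, database and dates are just examples):

python iqfeed-to-influxdb.py --ticker-list sp500_tickers.txt --database test_data --fromdate 2020-01-02 --todate 2020-05-22

# or, for a single symbol:
python iqfeed-to-influxdb.py --ticker MMM --database test_data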
I suggest you just study the source code and see if you can use some ideas from it in your own system.
-
Thanks @vladisld. I managed to download from iqfeed + create influxdb database.
Now trying to use the code at the link below to access the data:
https://github.com/mementum/backtrader/blob/master/backtrader/feeds/influxfeed.py

But I'm getting the following error (same error as in the original question):
File "c:/Users/julia/Desktop/Python/Project - Automated_Trading/Algo_influxdb.py", line 304, in <module> cerebro.run() File "C:\Users\julia\AppData\Local\Programs\Python\Python38-32\lib\site-packages\backtrader\cerebro.py", line 1127, in run runstrat = self.runstrategies(iterstrat) File "C:\Users\julia\AppData\Local\Programs\Python\Python38-32\lib\site-packages\backtrader\cerebro.py", line 1293, in runstrategies self._runonce(runstrats) File "C:\Users\julia\AppData\Local\Programs\Python\Python38-32\lib\site-packages\backtrader\cerebro.py", line 1652, in _runonce strat._once() File "C:\Users\julia\AppData\Local\Programs\Python\Python38-32\lib\site-packages\backtrader\lineiterator.py", line 297, in _once indicator._once() File "C:\Users\julia\AppData\Local\Programs\Python\Python38-32\lib\site-packages\backtrader\lineiterator.py", line 297, in _once indicator._once() File "C:\Users\julia\AppData\Local\Programs\Python\Python38-32\lib\site-packages\backtrader\linebuffer.py", line 630, in _once self.oncestart(self._minperiod - 1, self._minperiod) File "C:\Users\julia\AppData\Local\Programs\Python\Python38-32\lib\site-packages\backtrader\lineroot.py", line 165, in oncestart self.once(start, end) File "C:\Users\julia\AppData\Local\Programs\Python\Python38-32\lib\site-packages\backtrader\linebuffer.py", line 672, in once dst[i] = src[i + ago] IndexError: array index out of range
Would you know what this error is exactly (date related?) and how to address it?
Code:
TIMEFRAMES = dict(
    (
        (bt.TimeFrame.Seconds, 's'),
        (bt.TimeFrame.Minutes, 'ms'),
        (bt.TimeFrame.Days, 'd'),
        (bt.TimeFrame.Weeks, 'w'),
        (bt.TimeFrame.Months, 'm'),
        (bt.TimeFrame.Years, 'y'),
    )
)


class InfluxDB(feed.DataBase):
    frompackages = (
        ('influxdb', [('InfluxDBClient', 'idbclient')]),
        ('influxdb.exceptions', 'InfluxDBClientError')
    )

    params = (
        ('host', 'localhost'),
        ('port', '8086'),
        ('username', 'root'),
        ('password', 'root'),
        ('database', 'test_data'),
        ('dataname', 'test_data'),
        ('timeframe', bt.TimeFrame.Minutes),
        ('startdate', '2020-01-02 9:30:00'),
        ('high', 'high'),
        ('low', 'low'),
        ('open', 'open'),
        ('close', 'close'),
        ('volume', 'volume'),
        ('ointerest', 'openinterest'),
    )

    def start(self):
        super(InfluxDB, self).start()
        try:
            self.ndb = idbclient(self.p.host, self.p.port, self.p.username,
                                 self.p.password, self.p.database)
        except InfluxDBClientError as err:
            print('Failed to establish connection to InfluxDB: %s' % err)

        tf = '{multiple}{timeframe}'.format(
            multiple=(self.p.compression if self.p.compression else 1),
            timeframe=TIMEFRAMES.get(self.p.timeframe, 'ms'))

        if not self.p.startdate:
            st = '<= now()'
        else:
            st = '>= \'%s\'' % self.p.startdate

        # The query could already consider parameters like fromdate and todate
        # to have the database skip them and not the internal code
        qstr = ('SELECT mean("{open_f}") AS "open", mean("{high_f}") AS "high", '
                'mean("{low_f}") AS "low", mean("{close_f}") AS "close", '
                'mean("{vol_f}") AS "volume", mean("{oi_f}") AS "openinterest" '
                'FROM "{dataname}" '
                'WHERE datetime {begin} '
                'GROUP BY time({timeframe}) fill(none)').format(
                    open_f=self.p.open, high_f=self.p.high,
                    low_f=self.p.low, close_f=self.p.close,
                    vol_f=self.p.volume, oi_f=self.p.ointerest,
                    timeframe=tf, begin=st, dataname=self.p.database)

        try:
            dbars = list(self.ndb.query(qstr).get_points())
        except InfluxDBClientError as err:
            print('InfluxDB query failed: %s' % err)

        self.biter = iter(dbars)

    def _load(self):
        try:
            bar = next(self.biter)
        except StopIteration:
            return False

        self.l.datetime[0] = date2num(dt.datetime.strptime(bar['datetime'],
                                                           '%Y-%m-%d %H:%M:%S'))
        self.l.open[0] = bar['open']
        self.l.high[0] = bar['high']
        self.l.low[0] = bar['low']
        self.l.close[0] = bar['close']
        self.l.volume[0] = bar['volume']

        return True
and...
data = bt.feeds.InfluxDB(database='test_data')
cerebro.adddata(data)
Alfred Sisley
-
It seems you modified the existing InfluxDB feed (which is working OK BTW; not optimal, but OK).
'WHERE datetime {begin} '
The default index column in InfluxDB is called 'time', if I'm not mistaken.

Try to use the original feed, and only modify it if it is not working. From my experience it works OK without any changes needed.
-
@vladisld You were right, I was making my life harder by modifying the code.
However, while the script now runs all the way through, it doesn't appear to make it to next(). I've retained only skeleton code (removed indicators and some other classes/functions) to see if that would have an impact; nothing changed.
It seems as though the data is not getting loaded.
Since you say that the InfluxDB feed is indeed working, it makes me believe the issue comes from how I'm writing data to the InfluxDB database, or from the naming conventions in my database.
Would there be anything in my script that would explain this?
import pandas as pd
from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086, username='root',
                        password='root', database='test_data')
client.create_database('test_data')

filename = r"C:\Users\julia\Desktop\Python\Project - Automated_Trading\Tickers\A.csv"
df = pd.read_csv(filename)
df.columns = ['datetime', 'open', 'low', 'high', 'close', 'volume', 'openinterest']

for row_index, row in df.iterrows():
    tags = row[0]
    open_ = row[1]
    low_ = row[2]
    high_ = row[3]
    close_ = row[4]
    volume_ = row[5]
    openinterest_ = row[6]

    json_body = [
        {
            "measurement": "quotes",
            "tags": {
                "datetime": tags
            },
            "fields": {
                "open": open_,
                "low": low_,
                "high": high_,
                "close": close_,
                "volume": volume_,
                "openinterest": openinterest_
            }
        }
    ]
    print(json_body)
    client.write_points(json_body, protocol='json')

query = 'select open from quotes;'
result = client.query(query)
print("Result: {0}".format(result))
Also, I've defined data as follows:
data = bt.feeds.InfluxDB(host='localhost', port='8086', username='root',
                         password='root', database='test_data',
                         dataname='test_data',
                         timeframe=bt.TimeFrame.Minutes, compression=1,
                         startdate=datetime.datetime(2020, 1, 2),
                         todate=datetime.datetime(2020, 5, 22),
                         high='high', low='low', open='open', close='close',
                         volume='volume', ointerest='openinterest')
A snippet of how my IQFeed data looks (screenshot omitted).
I've got a feeling this has something to do with datetime formatting, but I can't seem to pinpoint exactly how or where.
Again, appreciate the multiple responses.
Alfred Sisley
-
@AlfredSisley, it's worth taking a look at the InfluxDB docs, especially the Concepts section.
It's pretty simple after all; there are several concepts that are very similar to the SQL world:
InfluxDB Database - corresponds to the SQL Database
InfluxDB Measurement - corresponds to the SQL Table
InfluxDB Tags and Fields - correspond to the SQL Table Columns
and so on... Please pay particular attention to the difference between tags and fields.
Unfortunately, the InfluxDB feed uses slightly different terminology ('dataname' instead of 'measurement').

Here in your case you're initializing the InfluxDB feed with database 'test_data' and measurement 'test_data'. However, you write to the database using measurement = "quotes". That could be a source of error.

Additional points:
IMHO there is no need to define the "datetime" tag at all. Every InfluxDB measurement has a "time" column by default. Instead, just define the value of the "time" column in your json structure:
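For example, a single point for your data could look roughly like this (a sketch with made-up values; note the measurement matches the dataname you later give the feed, and the timestamp goes into "time" rather than into a tag):

json_body = [
    {
        "measurement": "test_data",        # same name the feed gets as dataname
        "time": "2020-01-02T09:31:00Z",    # goes into InfluxDB's time column
        "fields": {
            "open": 75.15,
            "low": 75.10,
            "high": 75.25,
            "close": 75.20,
            "volume": 1200,
            "openinterest": 0,
        },
    }
]
client.write_points(json_body)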
Here is a sample of what I'm using in my system:
# dllist / dllistnode presumably come from the 'llist' package
from llist import dllist, dllistnode


class Record(dllistnode):
    def get_fields(self):
        return {}

    def get_tags(self):
        return {}


class BarRecord(Record):
    __slots__ = ('timestamp', 'high_p', 'low_p', 'open_p', 'close_p', 'volume', 'oi')

    def __init__(self, timestamp, high_p, low_p, open_p, close_p, volume, oi):
        self.timestamp = timestamp
        self.high_p = high_p
        self.low_p = low_p
        self.open_p = open_p
        self.close_p = close_p
        self.volume = volume
        self.oi = oi

    def get_fields(self):
        return {
            'high_p': self.high_p,
            'low_p': self.low_p,
            'open_p': self.open_p,
            'close_p': self.close_p,
            'volume': self.volume,
            'oi': self.oi,
        }

    def get_tags(self):
        return {}


class RecordsBuffer:
    def __init__(self):
        self.records = dllist()

    def append(self, record):
        self.records.append(record)

    def empty(self):
        return self.records.size == 0

    def to_json(self, measurement_name, filter_f=lambda x: True):
        json = []
        for bar in self.records:
            if filter_f(bar):
                json_point = {
                    "measurement": measurement_name,
                    "fields": bar.get_fields(),
                    "tags": bar.get_tags(),
                    "time": bar.timestamp
                }
                json.append(json_point)
        return json
-
@vladisld You nailed it. We have data connection!
It's taking me 20 minutes to write around 40,000 lines of data (roughly 5 months of 1 minute data for 1 ticker)
Out of curiosity, since you are also writing to the DB with JSON, are you experiencing similar "limitations"? Or is it my implementation?
Wondering if I need to re-write a script using "line protocol" to speed things up.
Many thanks.
Alfred Sisley
-
Check your memory consumption (swap disk usage if you are on Linux). It should work much faster than that (IIRC I was able to write 10 years of 1-minute data for 1 ticker in well under a minute, without a lot of optimization).
There is a lot of info in the InfluxDB forum, as well as on their YouTube channel (example), about optimizing the configuration and the general machine requirements.
What was your experience with SQLite ?
-
@AlfredSisley said in SQLite Database - IndexError: array index out of range:
It's taking me 20 minutes to write around 40,000 lines of data
Are you sure only the storage time is being measured, and not also the IQFeed fetch time?
-
@vladisld Enjoyed the video. I believe the issue in the code was that it was iterating over rows and writing one row at a time to the database...
Can now write 13 years of 1-minute data in ~50 seconds. Needed to set batch_size=10000 for it to work on a large data set.

Problem solved. Now the real work begins.
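For reference, the fix boils down to collecting all the points first and writing them in a single call (a sketch based on my earlier script; the client then uploads in chunks):

import pandas as pd
from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086, username='root',
                        password='root', database='test_data')

df = pd.read_csv(r"C:\Users\julia\Desktop\Python\Project - Automated_Trading\Tickers\A.csv",
                 header=None)
df.columns = ['datetime', 'open', 'low', 'high', 'close', 'volume', 'openinterest']

points = []
for _, row in df.iterrows():
    points.append({
        "measurement": "test_data",                  # matches the feed's dataname
        "time": pd.to_datetime(row["datetime"]),     # InfluxDB's time column, not a tag
        "fields": {
            "open": float(row["open"]),
            "low": float(row["low"]),
            "high": float(row["high"]),
            "close": float(row["close"]),
            "volume": int(row["volume"]),
            "openinterest": int(row["openinterest"]),
        },
    })

# One call instead of one write per row; batch_size makes the client
# upload the points in chunks of 10,000.
client.write_points(points, batch_size=10000)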
Thank you!