When I run 1 month of data on 100 stocks at 5-minute intervals, the whole thing finishes in about 60s. Bump that up to 2 months and it hangs; after ~30 minutes I killed it and got this error:
Traceback (most recent call last):
File "C:\Users\Dane\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\arrays\datetimes.py", line 2081, in objects_to_datetime64ns
require_iso8601=require_iso8601,
File "pandas\_libs\tslib.pyx", line 364, in pandas._libs.tslib.array_to_datetime
File "pandas\_libs\tslib.pyx", line 447, in pandas._libs.tslib.array_to_datetime
ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "pandas\_libs\tslibs\timezones.pyx", line 176, in pandas._libs.tslibs.timezones.get_utcoffset
AttributeError: 'tzoffset' object has no attribute '_utcoffset'
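The ValueError is pandas refusing to turn tz-aware datetime.datetime objects into datetime64 values unless it's told utc=True. A minimal, hypothetical sketch of that situation (made-up timestamps, not my actual code):

import pandas as pd
from datetime import datetime, timezone, timedelta

# two fixed-offset stamps like the ones in my CSVs (EDT and EST)
stamps = [
    datetime(2020, 10, 30, 9, 30, tzinfo=timezone(timedelta(hours=-4))),
    datetime(2020, 11, 2, 9, 30, tzinfo=timezone(timedelta(hours=-5))),
]

# with utc=True pandas is happy and the index can be converted back to Eastern
idx = pd.to_datetime(stamps, utc=True).tz_convert("US/Eastern")
print(idx)
# converting the same list without utc=True is the situation the ValueError above complains about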
After some more investigation: two months over a different date range runs fine. The original run was 10-02-2020 to 12-02-2020, so I cut it down to 1 stock and kept shrinking the date range. 10-02 to 10-15 is fast (0.6s), while 10-20 to 11-02 takes 6.8s.
I'm using polygon.io for my data. On the first run, when the data comes straight from the API rather than from my CSV cache, any date range is fast. Here's how I request the bars and build the DataFrame:
url = self.base_url + f"/v2/aggs/ticker/{ticker}/range/{mult}/{timespan}/{from_}/{to}?unadjusted={unadjusted}&sort=asc&limit={limit}&apiKey={self.key}"
...
key_changes = []
for r in res_json["results"]:
    dt = self._transform_ts_to_dt(r["t"])
    if dt.time() < time(16, 0, 0) and dt.time() >= time(9, 30, 0):
        key_changes.append(
            {"Open": r["o"], "High": r["h"], "Low": r["l"], "Close": r["c"], "Volume": r["v"], "Datetime": dt}
        )
new_df = pd.DataFrame(key_changes).set_index("Datetime")
return new_df
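(`_transform_ts_to_dt` isn't shown above. It converts polygon's `t` field, a Unix timestamp in milliseconds, into an Eastern-time datetime, roughly like this standalone sketch; the real thing is a method on the class:)

from datetime import datetime
import pytz

EASTERN = pytz.timezone("US/Eastern")

def transform_ts_to_dt(ts_ms):
    # fromtimestamp() with an explicit tz does the UTC -> US/Eastern conversion
    return datetime.fromtimestamp(ts_ms / 1000, tz=EASTERN)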
...
# then cache the frame to csv
ticker_data.to_csv(file_name)
If I already have that date range cached for a particular stock, I read it back instead:
file_data = pd.read_csv(file_name, index_col="Datetime", parse_dates=True)
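I don't do anything to the index after reading it back. If the CSV round trip is the problem, the obvious normalisation would be something like this (a sketch of what I could try, not what currently runs):

import pandas as pd

file_name = "AAPL_2020-10-20_2020-11-02.csv"  # placeholder name
file_data = pd.read_csv(file_name, index_col="Datetime")
# parse the offset-bearing strings through UTC, then convert to a single Eastern tz
file_data.index = pd.to_datetime(file_data.index, utc=True).tz_convert("US/Eastern")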
Finally, I add the data feed like this:
data = PandasExtra(
    dataname=ticker_data,
    tz=pytz.timezone('US/Eastern'),
    timeframe=bt.TimeFrame.Minutes,
    compression=5,
    sessionstart=datetime.time(9, 30, 0),
    sessionend=datetime.time(16, 0, 0),
)
data.addfilter(bt.filters.SessionFilter(data))
data.addfilter(bt.filters.SessionFiller(data, fill_vol=0))
added_data = True
cerebro.adddata(data, name=ticker)
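(In case it matters, PandasExtra follows the usual pattern for a bt.feeds.PandasData subclass carrying a custom line; something like this sketch, with the line name being illustrative:)

import backtrader as bt

class PandasExtra(bt.feeds.PandasData):
    # one custom line on top of the standard OHLCV fields
    lines = ('extra',)
    params = (
        ('datetime', None),  # take the datetime from the DataFrame index
        ('extra', -1),       # -1 = auto-detect the column by name
    )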
I've looked at both spreadsheets and can't see any difference. Since I'm using the SessionFiller I also checked how many intervals have zero volume, and it's about the same in both.
I wanted to attach both the 10-02 to 10-15 spreadsheet (runs fine) and the 10-20 to 11-02 spreadsheet (runs slow), but apparently I don't have the privileges to attach documents. If you think you need them, could I get those privileges? For now I'll paste a few lines from each below.
Good data
Datetime,Open,High,Low,Close,Volume
2020-10-02 09:30:00-04:00,17.83,18.0622,17.7659,18.0622,7782
2020-10-02 09:35:00-04:00,18.0101,18.12,18.0101,18.1,3254
2020-10-02 09:40:00-04:00,18.12,18.18,18.12,18.15,1529
2020-10-02 09:45:00-04:00,18.21,18.25,18.21,18.25,800
2020-10-02 09:50:00-04:00,18.1,18.19,18.08,18.19,2242
Apparently bad data
Datetime,Open,High,Low,Close,Volume
2020-10-20 09:30:00-04:00,15.155,15.28,15.151,15.215,5200
2020-10-20 09:35:00-04:00,15.28,15.335,15.26,15.32,1902
2020-10-20 09:40:00-04:00,15.3475,15.39,15.345,15.35,3221
2020-10-20 09:45:00-04:00,15.37,15.54,15.31,15.54,6953
2020-10-20 09:50:00-04:00,15.41,15.5,15.34,15.35,2845
2020-10-20 09:55:00-04:00,15.34,15.46,15.34,15.46,6611
2020-10-20 10:00:00-04:00,15.46,15.49,15.44,15.49,3368
2020-10-20 10:10:00-04:00,15.49,15.49,15.48,15.48,720
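If a few pasted lines aren't enough, this is the kind of side-by-side check I could run on the two cached files instead of eyeballing them (file names below are placeholders):

import pandas as pd

for name in ("fast_2020-10-02_2020-10-15.csv", "slow_2020-10-20_2020-11-02.csv"):
    raw = pd.read_csv(name)
    parsed = pd.read_csv(name, index_col="Datetime", parse_dates=True)
    print(name)
    print("  rows:", len(parsed))
    print("  index dtype:", parsed.index.dtype)
    print("  first/last bar:", parsed.index.min(), parsed.index.max())
    print("  utc offsets in the raw strings:", raw["Datetime"].str[-6:].unique())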
Thanks