Incorrect/corrupted data retrieved when using YahooFinanceData API
-
Hello again BT community!
I have noticed an interesting problem which occurs on certain Yahoo symbols (I think - I have to admit, I haven't tested very extensively) when using the YahooFinanceData API. I notice that for the affected symbols, the OHLCV (open, high, low, close, volume) values are incorrect up to a certain point in time, after which they are correct.
To demonstrate this odd behavior, I have written a small script. Please execute the code below:
from datetime import datetime, timedelta, date import backtrader as bt # get Yahoo finance data lookback_timedelta_days = timedelta(days=365) now_datetime = datetime.now() to_date = datetime(now_datetime.year, now_datetime.month, now_datetime.day) from_date = to_date - lookback_timedelta_days yahoo_finance_data = bt.feeds.YahooFinanceData(dataname="BMW.F", fromdate=from_date, todate=to_date, adjclose=False) # start cerebro cerebro = bt.Cerebro() cerebro.adddata(data=yahoo_finance_data) # run cerebro cerebro_result = cerebro.run() # capture results date_series = cerebro_result[0].data.datetime.array open_series = list(cerebro_result[0].data.open.array) high_series = list(cerebro_result[0].data.high.array) low_series = list(cerebro_result[0].data.low.array) close_series = list(cerebro_result[0].data.close.array) volume_series = list(cerebro_result[0].data.volume.array) with open('results.txt', 'w') as file_obj: file_obj.write('Date,Open,High,Low,Close,Volume') for day_iter in range(len(date_series)): this_date = date.fromordinal(int(date_series[day_iter])) this_open = open_series[day_iter] this_high = high_series[day_iter] this_low = low_series[day_iter] this_close = close_series[day_iter] this_volume = volume_series[day_iter] file_obj.write('%s,%f,%f,%f,%f,%d\n' % (this_date, this_open, this_high, this_low, this_close, this_volume))
Sorry for the messy coding. It's just intended as a demo. As you see, we retrieve the historical OHLCV data for the symbol "BMW.F" going a year back and then write it to a file (results.txt) with the aim that we can compare this file against the one retrieved from Yahoo's web-interface (go here, then click "download data"). Please save the file and remove the "adjusted close" column because we don't need it (I would have attached the file, but uploads are not allowed).
If you compare the two files, you will see that there are big differences in the OHLC values up to May 11 2017. There are never any errors for the volume information, by the way. After this point, there are still differences, but they are very small (maybe precision errors?). Please see the image below to see what I mean. Up to May 11 2017, we see consistent differences on the order of ~3 Euros. After this point, the difference is on the order of ~<=0.01 Euros.
I have also created a candlestick chart using the data retrieved from the YahooFinanceData API. It confirms what we see above. Before May 11, the data looks extremely weird where not only are the values wrong, but (almost) every day, the close is lower than the open. After May 11, we see that reality is reflected more accurately.
Finally, as I said, this doesn't seem to happen for all symbols. Here are a few others where I notice this behavior with the date where bad data becomes good (ARL.DE: June 1 2017; ADJ.DE; May 4 2017; AIRA.DE; April 11 2017, etc.).
Thanks for reading this long post. I hope you can figure out what the issue is! Please let me know if I can provide any additional help.
pip show backtrader Name: backtrader Version: 1.9.51.121 Summary: BackTesting Engine Home-page: https://github.com/mementum/backtrader Author: Daniel Rodriguez Author-email: danjrod@gmail.com License: GPLv3+ Location: /usr/local/lib/python2.7/dist-packages Requires:
-
Hi
i went through your problem.
by changing the data source to
yahoo_finance_data = bt.feeds.YahooFinanceData(dataname="BMW.F", fromdate=from_date, todate=to_date, adjclose=False, swapcloses=True, round=False)
you will get the correct result it seems.
in the yahoo feed, there is an assumption
# In v7, the final seq is "adj close", close, volume
but the data downloaded has the seq close, adj close, volume
this can be reverted by swapcloses=TrueAt least, thats how i understood the code in the yahoo feed :)
-
Unless you have been living under a rock, YahooFinanceData cannot be used reliably any longer and has nothing to do with backtrader. You have to manually filter the data and understand where the oddities introduced by Yahoo are present.
- Docs -YahooFinanceData gets displayed invalid due to changed adjusted close
- Quant.StackExchange - Is Yahoo! Finance data good or bad now?
Plus several other hundred links about the topic.
-
Thanks, @dasch and @backtrader for your quick answers!
@dasch: I tried your suggestion, but unfortunately, the results are still inconsistent. One good tip, however, from your answer was to use
round=False
. Thanks for that! I guess we will never get the correct results due to the seemingly inconsistent behavior of the data returned by Yahoo. From the BT data feed reference:If True the allegedly adjusted close and non-adjusted close will be swapped. The downloads with the new v7 API show at random times the closing prices swapped. There is no known pattern
If I use your suggestion, i.e.,
adjclose=False, swapcloses=True, round=False
, we get the following:From Yahoo: Date,Open,High,Low,Close,Adj Close,Volume 2016-07-27,76.838997,78.705002,76.838997,78.142998,75.114166,12079 Ours: Date,Open,High,Low,Close,Volume 2016-07-27,79.937379,81.878627,79.937379,78.142998,12079
We see that the close (and volume) values match, but nothing else does.
Now, if I use
adjclose=True, swapcloses=True, round=False
, we get the following:From Yahoo: Date,Open,High,Low,Close,Adj Close,Volume 2016-07-27,76.838997,78.705002,76.838997,78.142998,75.114166,12079 Ours: Date,Open,High,Low,Close,Volume 2016-07-27,76.838997,78.705002,76.838997,75.114166,11610
We see that the everything matches except for the close and volume values (exactly the opposite of the previous case). After a certain date, they do start matching, but I can't rely on something like this.
So it seems that if I want to get all the OHLCV values correct, I have to use a combination of the two approaches above.
@backtrader: I suppose I have been living under a rock. So far, I have only required the close values and since I always used
adjclose=False
, I got the proper values. Now, I need the other values too and therefore have only started venturing out from under my rock just now. For the record, I never meant to imply that there is some bug with BT. Please don't take it personally. I have always maintained in all my other posts that BT is an absolutely amazing product. Unfortunately, I do not see any alternative to using YahooFinanceData because I require mostly German and Indian equities and so far, I haven't been able to find an alternative service that provides these free of charge (I'm only doing this as a hobby in my free time and would therefore be unwilling to pay for a service). If you know a better source, I am all ears. Thank you again, for your continued support! -
there is a adjustment in the volume in the data parsing when adjclose=False, maybe altering the feed would help you.
By commenting line 159-160 in yahoo.py out, i get:
2016-07-27,73.860708,75.654387,73.860708,75.114166,12079
It is this code part:
# in v7 "adjusted prices" seem to be given, scale back for non adj if not self.params.adjclose: o *= adjfactor h *= adjfactor l *= adjfactor c = adjustedclose # v *= adjfactor # except volume which is the same as in v1 else: v /= adjfactor # rescale vol down
-
@backtrader hope you don't mind, i added a param to the yahoo feed. this would allow to skip the adjustment of the volume
diff backtrader_orig/feeds/yahoo.py backtrader/feeds/yahoo.py 54a55,58 > > - ``adjvolume`` (default: ``True``) > > Whether to adjust the volume if the values are not adjusted. 82a87 > ('adjvolume', True), 159c164 < else: --- > elif self.params.adjvolume:
-
Like @backtrader said, this is really a mess with yahoo and it is a unreliable source, today i had to set:
yahoo_finance_data = bt.feeds.YahooFinanceData(dataname="BMW.F", fromdate=from_date, todate=to_date, adjclose=True, swapcloses=False, round=False)
to get this lines:
2016-07-27,76.838997,78.705002,76.838997,75.114166,12079 2016-10-13,76.349998,76.500000,75.307999,73.177269,6697 2017-03-20,83.000000,83.000000,82.000000,79.483971,8858 2017-06-09,84.377998,84.610001,84.166000,84.539001,1182 2017-07-14,84.192001,84.317001,83.599998,83.889000,1188
You should try to find a more reliable source. I would like to help, but i have no idea, where you could find it. Anyway, i hope, this helps you a bit. :)
-
Quandl
is supported too. TheWIKI
dataset is a community effort to manage and curate EOD data. It has over 3000+ entries (quickly recalling) and is free.It may, or may not, have what you are looking for.
-
Hi again @dasch and @backtrader,
As usual, thanks for your quick replies! Instead of retrieving the data from every symbol twice (once adjusted, once not) and then taking the correct values from each, I decided to see if I could switch from Yahoo to Quandl instead. I found the BMW ticker (from the
FSE
dataset, which seems to be free to access) and decided to take it for a test run. Sadly, I didn't get very far. Here's the code (pretty much the same as before except the data acquisition part):from datetime import datetime, timedelta, date import backtrader as bt # get Yahoo finance data lookback_timedelta_days = timedelta(days=365) now_datetime = datetime.now() to_date = datetime(now_datetime.year, now_datetime.month, now_datetime.day) from_date = to_date - lookback_timedelta_days # OLD FEED FROM YAHOO # yahoo_finance_data = bt.feeds.YahooFinanceData(dataname="BMW.F", fromdate=from_date, todate=to_date, adjclose=False, # swapcloses=False, round=False) # NEW FEED FROM QUANDL quandl_finance_data = bt.feeds.Quandl(dataname="BMW_X", fromdate=from_date, todate=to_date, dataset='FSE', round=False, adjclose=False) # start cerebro cerebro = bt.Cerebro() cerebro.adddata(data=quandl_finance_data) # run cerebro cerebro_result = cerebro.run() # capture results date_series = cerebro_result[0].data.datetime.array open_series = list(cerebro_result[0].data.open.array) high_series = list(cerebro_result[0].data.high.array) low_series = list(cerebro_result[0].data.low.array) close_series = list(cerebro_result[0].data.close.array) volume_series = list(cerebro_result[0].data.volume.array) with open('results.txt', 'w') as file_obj: file_obj.write('Date,Open,High,Low,Close,Volume\n') for day_iter in range(len(date_series)): this_date = date.fromordinal(int(date_series[day_iter])) this_open = open_series[day_iter] this_high = high_series[day_iter] this_low = low_series[day_iter] this_close = close_series[day_iter] this_volume = volume_series[day_iter] file_obj.write('%s,%f,%f,%f,%f,%d\n' % (this_date, this_open, this_high, this_low, this_close, this_volume))
I get this error:
/usr/bin/python2.7 <bla>/btYahooIssue.py Traceback (most recent call last): File "<bla>/btYahooIssue.py", line 22, in <module> cerebro_result = cerebro.run() File "/usr/local/lib/python2.7/dist-packages/backtrader/cerebro.py", line 1073, in run runstrat = self.runstrategies(iterstrat) File "/usr/local/lib/python2.7/dist-packages/backtrader/cerebro.py", line 1149, in runstrategies data.preload() File "/usr/local/lib/python2.7/dist-packages/backtrader/feed.py", line 682, in preload while self.load(): File "/usr/local/lib/python2.7/dist-packages/backtrader/feed.py", line 476, in load _loadret = self._load() File "/usr/local/lib/python2.7/dist-packages/backtrader/feed.py", line 704, in _load return self._loadline(linetokens) File "/usr/local/lib/python2.7/dist-packages/backtrader/feeds/quandl.py", line 111, in _loadline v = float(linetokens[next(i)]) ValueError: could not convert string to float: Process finished with exit code 1
What am I doing wrong?
-
you did not configure the data feed correctly. you may find some details here:
-
@dasch, it still doesn't work even if I provide the API key. I get the same error with this line:
quandl_finance_data = bt.feeds.Quandl(dataname="BMW_X", fromdate=from_date, todate=to_date, round=False, adjclose=False, apikey="xxxx", dataset="FSE")
However, if I switch to the
WIKI
dataset (instead ofFSE
) and use a ticker likeYHOO
, everything works fine. So I'm guessing the problem is with the usage of theFSE
dataset. -
oh, i see.
(110)_loadline() -> v = float(linetokens[next(i)]) (Pdb) linetokens ['2016-07-27', '77.64', '78.8', '77.51', '78.36', '', '1971401.0', '154530852.0', '', '', '']
The feed does not have a volume but the feed expect it to have it.
-
i mean, the data has no volume, so it breaks here, can't really help you with that.
-
a quick fix would be something like this in the feed src:
o = float(linetokens[next(i)] or 0) h = float(linetokens[next(i)] or 0) l = float(linetokens[next(i)] or 0) c = float(linetokens[next(i)] or 0) v = float(linetokens[next(i)] or 0)
-
@dasch, yeah I thought so too.
@backtrader, can this be handled internally such that a warning is thrown and the empty value is interpreted as 0/NaN/infinity/etc. so that the process may continue to execute?
-
One more thing: I think the column with the empty value does not represent the volume. From the Quandl web UI for BMW (click on
Table
), we see the following columns:Date,Open,High,Low,Close,Change,Traded Volume,Turnover,Last Price of the day,Daily Traded Units,Daily Turnover
And, correspondingly, we see that
linetokens
is:['2016-07-27', '77.64', '78.8', '77.51', '78.36', '', '1971401.0', '154530852.0', '', '', '']
So we are expecting to see the volume where
Change
actually is.I checked 3 different databases: WIKI, FSE and NSE. The columns returned are different in each case.
WIKI: Date,Open,High,Low,Close,Volume,Ex-Dividend,SplitRatio,Adj.Open,Adj.High,Adj.Low,Adj.Close,Adj.Volume NSE: Date,Open,High,Low,Last,Close,TotalTradeQuantity,Turnover(Lacs) FSE: Date,Open,High,Low,Close,Change,TradedVolume,Turnover,LastPriceoftheDay,DailyTradedUnits,DailyTurnover
So, @backtrader, I guess you will have to treat each one differently. One way to retrieve the columns programmatically is to query the database like this:
https://www.quandl.com/api/v3/datasets.json?database_code=<WIKI/NSE/FSE/ETC>&per_page=100&sort_by=id&page=1&api_key=<XXXX>
. Every "dataset" node contains a child called "column_names" which is a list of the columns returned.Hope this helps.
-
There is a potential quick solution: make the
Quandl
data feed a subclass of GenericCSV and let the users manipulate the notion of where the expected columns are.A long term solution would also take into account the column names to dynamically create extra lines and make all the columns available (obviously things like a space between words would have to be converted to something like an underscore to let them be proper Python identifiers)
-
@backtrader, I'm not entirely sure what you mean with the first option, but if it is what I think it is, the way the
Quandl
API would be used would be similar to howGenericCSV
works. I'm OK with this. That said, I'm also OK with option 2.I appreciate your looking into this and look forward to a resolution of the matter. Of course, I trust your judgment regarding which option you plan to implement. Personally, I feel that option 2 would be more user-friendly. Please let us know when a solution is available.
Thanks, as usual, for your effective and quick replies!
-
Hi @backtrader,
Any update/ETA on this?
-
Not at the moment