Incorrect/corrupted data retrieved when using YahooFinanceData API



  • Hello again BT community!

    I have noticed an interesting problem which occurs on certain Yahoo symbols (I think - I have to admit, I haven't tested very extensively) when using the YahooFinanceData API. I notice that for the affected symbols, the OHLCV (open, high, low, close, volume) values are incorrect up to a certain point in time, after which they are correct.

    To demonstrate this odd behavior, I have written a small script. Please execute the code below:

    from datetime import datetime, timedelta, date
    import backtrader as bt
    
    # get Yahoo finance data
    lookback_timedelta_days = timedelta(days=365)
    now_datetime = datetime.now()
    to_date = datetime(now_datetime.year, now_datetime.month, now_datetime.day)
    from_date = to_date - lookback_timedelta_days
    yahoo_finance_data = bt.feeds.YahooFinanceData(dataname="BMW.F", fromdate=from_date, todate=to_date, adjclose=False)
    
    # start cerebro
    cerebro = bt.Cerebro()
    cerebro.adddata(data=yahoo_finance_data)
    
    # run cerebro
    cerebro_result = cerebro.run()
    
    # capture results
    date_series = cerebro_result[0].data.datetime.array
    open_series = list(cerebro_result[0].data.open.array)
    high_series = list(cerebro_result[0].data.high.array)
    low_series = list(cerebro_result[0].data.low.array)
    close_series = list(cerebro_result[0].data.close.array)
    volume_series = list(cerebro_result[0].data.volume.array)
    with open('results.txt', 'w') as file_obj:
        file_obj.write('Date,Open,High,Low,Close,Volume')
        for day_iter in range(len(date_series)):
            this_date = date.fromordinal(int(date_series[day_iter]))
            this_open = open_series[day_iter]
            this_high = high_series[day_iter]
            this_low = low_series[day_iter]
            this_close = close_series[day_iter]
            this_volume = volume_series[day_iter]
            file_obj.write('%s,%f,%f,%f,%f,%d\n' % (this_date, this_open, this_high, this_low, this_close, this_volume))
    

    Sorry for the messy coding. It's just intended as a demo. As you see, we retrieve the historical OHLCV data for the symbol "BMW.F" going a year back and then write it to a file (results.txt) with the aim that we can compare this file against the one retrieved from Yahoo's web-interface (go here, then click "download data"). Please save the file and remove the "adjusted close" column because we don't need it (I would have attached the file, but uploads are not allowed).

    If you compare the two files, you will see that there are big differences in the OHLC values up to May 11 2017. There are never any errors for the volume information, by the way. After this point, there are still differences, but they are very small (maybe precision errors?). Please see the image below to see what I mean. Up to May 11 2017, we see consistent differences on the order of ~3 Euros. After this point, the difference is on the order of ~<=0.01 Euros.
    alt text

    I have also created a candlestick chart using the data retrieved from the YahooFinanceData API. It confirms what we see above. Before May 11, the data looks extremely weird where not only are the values wrong, but (almost) every day, the close is lower than the open. After May 11, we see that reality is reflected more accurately.
    alt text

    Finally, as I said, this doesn't seem to happen for all symbols. Here are a few others where I notice this behavior with the date where bad data becomes good (ARL.DE: June 1 2017; ADJ.DE; May 4 2017; AIRA.DE; April 11 2017, etc.).

    Thanks for reading this long post. I hope you can figure out what the issue is! Please let me know if I can provide any additional help.

    pip show backtrader
    
    Name: backtrader
    Version: 1.9.51.121
    Summary: BackTesting Engine
    Home-page: https://github.com/mementum/backtrader
    Author: Daniel Rodriguez
    Author-email: danjrod@gmail.com
    License: GPLv3+
    Location: /usr/local/lib/python2.7/dist-packages
    Requires: 
    


  • Hi

    i went through your problem.

    by changing the data source to

    yahoo_finance_data = bt.feeds.YahooFinanceData(dataname="BMW.F", fromdate=from_date, todate=to_date, adjclose=False, swapcloses=True, round=False)

    you will get the correct result it seems.

    in the yahoo feed, there is an assumption

    # In v7, the final seq is "adj close", close, volume
    

    but the data downloaded has the seq close, adj close, volume
    this can be reverted by swapcloses=True

    At least, thats how i understood the code in the yahoo feed :)


  • administrators

    Unless you have been living under a rock, YahooFinanceData cannot be used reliably any longer and has nothing to do with backtrader. You have to manually filter the data and understand where the oddities introduced by Yahoo are present.

    Plus several other hundred links about the topic.



  • Thanks, @dasch and @backtrader for your quick answers!

    @dasch: I tried your suggestion, but unfortunately, the results are still inconsistent. One good tip, however, from your answer was to use round=False. Thanks for that! I guess we will never get the correct results due to the seemingly inconsistent behavior of the data returned by Yahoo. From the BT data feed reference:

    If True the allegedly adjusted close and non-adjusted close will be swapped. The downloads with the new v7 API show at random times the closing prices swapped. There is no known pattern

    If I use your suggestion, i.e., adjclose=False, swapcloses=True, round=False, we get the following:

    From Yahoo: 
    Date,Open,High,Low,Close,Adj Close,Volume
    2016-07-27,76.838997,78.705002,76.838997,78.142998,75.114166,12079
    
    Ours: 
    Date,Open,High,Low,Close,Volume
    2016-07-27,79.937379,81.878627,79.937379,78.142998,12079
    

    We see that the close (and volume) values match, but nothing else does.

    Now, if I use adjclose=True, swapcloses=True, round=False, we get the following:

    From Yahoo: 
    Date,Open,High,Low,Close,Adj Close,Volume
    2016-07-27,76.838997,78.705002,76.838997,78.142998,75.114166,12079
    
    Ours: 
    Date,Open,High,Low,Close,Volume
    2016-07-27,76.838997,78.705002,76.838997,75.114166,11610
    

    We see that the everything matches except for the close and volume values (exactly the opposite of the previous case). After a certain date, they do start matching, but I can't rely on something like this.

    So it seems that if I want to get all the OHLCV values correct, I have to use a combination of the two approaches above.

    @backtrader: I suppose I have been living under a rock. So far, I have only required the close values and since I always used adjclose=False, I got the proper values. Now, I need the other values too and therefore have only started venturing out from under my rock just now. For the record, I never meant to imply that there is some bug with BT. Please don't take it personally. I have always maintained in all my other posts that BT is an absolutely amazing product. Unfortunately, I do not see any alternative to using YahooFinanceData because I require mostly German and Indian equities and so far, I haven't been able to find an alternative service that provides these free of charge (I'm only doing this as a hobby in my free time and would therefore be unwilling to pay for a service). If you know a better source, I am all ears. Thank you again, for your continued support!



  • there is a adjustment in the volume in the data parsing when adjclose=False, maybe altering the feed would help you.

    By commenting line 159-160 in yahoo.py out, i get:

    2016-07-27,73.860708,75.654387,73.860708,75.114166,12079
    

    It is this code part:

                # in v7 "adjusted prices" seem to be given, scale back for non adj
                if not self.params.adjclose:
                    o *= adjfactor
                    h *= adjfactor
                    l *= adjfactor
                    c = adjustedclose
                    # v *= adjfactor  # except volume which is the same as in v1
                else:
                    v /= adjfactor  # rescale vol down
    


  • @backtrader hope you don't mind, i added a param to the yahoo feed. this would allow to skip the adjustment of the volume

    diff backtrader_orig/feeds/yahoo.py backtrader/feeds/yahoo.py
    54a55,58
    >         
    >       - ``adjvolume`` (default: ``True``)
    > 
    >         Whether to adjust the volume if the values are not adjusted.
    82a87
    >         ('adjvolume', True),
    159c164
    <             else:
    ---
    >             elif self.params.adjvolume:
    


  • Like @backtrader said, this is really a mess with yahoo and it is a unreliable source, today i had to set:

    yahoo_finance_data = bt.feeds.YahooFinanceData(dataname="BMW.F", fromdate=from_date, todate=to_date, adjclose=True,  swapcloses=False, round=False)
    

    to get this lines:

    2016-07-27,76.838997,78.705002,76.838997,75.114166,12079
    2016-10-13,76.349998,76.500000,75.307999,73.177269,6697
    2017-03-20,83.000000,83.000000,82.000000,79.483971,8858
    2017-06-09,84.377998,84.610001,84.166000,84.539001,1182
    2017-07-14,84.192001,84.317001,83.599998,83.889000,1188
    

    You should try to find a more reliable source. I would like to help, but i have no idea, where you could find it. Anyway, i hope, this helps you a bit. :)


  • administrators

    Quandl is supported too. The WIKI dataset is a community effort to manage and curate EOD data. It has over 3000+ entries (quickly recalling) and is free.

    It may, or may not, have what you are looking for.



  • Hi again @dasch and @backtrader,

    As usual, thanks for your quick replies! Instead of retrieving the data from every symbol twice (once adjusted, once not) and then taking the correct values from each, I decided to see if I could switch from Yahoo to Quandl instead. I found the BMW ticker (from the FSE dataset, which seems to be free to access) and decided to take it for a test run. Sadly, I didn't get very far. Here's the code (pretty much the same as before except the data acquisition part):

    from datetime import datetime, timedelta, date
    import backtrader as bt
    
    # get Yahoo finance data
    lookback_timedelta_days = timedelta(days=365)
    now_datetime = datetime.now()
    to_date = datetime(now_datetime.year, now_datetime.month, now_datetime.day)
    from_date = to_date - lookback_timedelta_days
    
    # OLD FEED FROM YAHOO
    # yahoo_finance_data = bt.feeds.YahooFinanceData(dataname="BMW.F", fromdate=from_date, todate=to_date, adjclose=False,
    #                                                swapcloses=False, round=False)
    
    # NEW FEED FROM QUANDL
    quandl_finance_data = bt.feeds.Quandl(dataname="BMW_X", fromdate=from_date, todate=to_date, dataset='FSE', round=False,
                                          adjclose=False)
    
    # start cerebro
    cerebro = bt.Cerebro()
    cerebro.adddata(data=quandl_finance_data)
    
    # run cerebro
    cerebro_result = cerebro.run()
    
    # capture results
    date_series = cerebro_result[0].data.datetime.array
    open_series = list(cerebro_result[0].data.open.array)
    high_series = list(cerebro_result[0].data.high.array)
    low_series = list(cerebro_result[0].data.low.array)
    close_series = list(cerebro_result[0].data.close.array)
    volume_series = list(cerebro_result[0].data.volume.array)
    with open('results.txt', 'w') as file_obj:
        file_obj.write('Date,Open,High,Low,Close,Volume\n')
        for day_iter in range(len(date_series)):
            this_date = date.fromordinal(int(date_series[day_iter]))
            this_open = open_series[day_iter]
            this_high = high_series[day_iter]
            this_low = low_series[day_iter]
            this_close = close_series[day_iter]
            this_volume = volume_series[day_iter]
            file_obj.write('%s,%f,%f,%f,%f,%d\n' % (this_date, this_open, this_high, this_low, this_close, this_volume))
    

    I get this error:

    /usr/bin/python2.7 <bla>/btYahooIssue.py
    Traceback (most recent call last):
      File "<bla>/btYahooIssue.py", line 22, in <module>
        cerebro_result = cerebro.run()
      File "/usr/local/lib/python2.7/dist-packages/backtrader/cerebro.py", line 1073, in run
        runstrat = self.runstrategies(iterstrat)
      File "/usr/local/lib/python2.7/dist-packages/backtrader/cerebro.py", line 1149, in runstrategies
        data.preload()
      File "/usr/local/lib/python2.7/dist-packages/backtrader/feed.py", line 682, in preload
        while self.load():
      File "/usr/local/lib/python2.7/dist-packages/backtrader/feed.py", line 476, in load
        _loadret = self._load()
      File "/usr/local/lib/python2.7/dist-packages/backtrader/feed.py", line 704, in _load
        return self._loadline(linetokens)
      File "/usr/local/lib/python2.7/dist-packages/backtrader/feeds/quandl.py", line 111, in _loadline
        v = float(linetokens[next(i)])
    ValueError: could not convert string to float: 
    
    Process finished with exit code 1
    

    What am I doing wrong?



  • you did not configure the data feed correctly. you may find some details here:

    https://community.backtrader.com/topic/491/quandl-example/



  • @dasch, it still doesn't work even if I provide the API key. I get the same error with this line:

    quandl_finance_data = bt.feeds.Quandl(dataname="BMW_X", fromdate=from_date, todate=to_date, round=False, adjclose=False, apikey="xxxx", dataset="FSE")
    

    However, if I switch to the WIKI dataset (instead of FSE) and use a ticker like YHOO, everything works fine. So I'm guessing the problem is with the usage of the FSE dataset.



  • oh, i see.

    (110)_loadline()
    -> v = float(linetokens[next(i)])
    (Pdb) linetokens
    ['2016-07-27', '77.64', '78.8', '77.51', '78.36', '', '1971401.0', '154530852.0', '', '', '']
    

    The feed does not have a volume but the feed expect it to have it.



  • i mean, the data has no volume, so it breaks here, can't really help you with that.



  • a quick fix would be something like this in the feed src:

           o = float(linetokens[next(i)] or 0)
           h = float(linetokens[next(i)] or 0)
           l = float(linetokens[next(i)] or 0)
           c = float(linetokens[next(i)] or 0)
           v = float(linetokens[next(i)] or 0)
    


  • @dasch, yeah I thought so too.

    @backtrader, can this be handled internally such that a warning is thrown and the empty value is interpreted as 0/NaN/infinity/etc. so that the process may continue to execute?



  • One more thing: I think the column with the empty value does not represent the volume. From the Quandl web UI for BMW (click on Table), we see the following columns:

    Date,Open,High,Low,Close,Change,Traded Volume,Turnover,Last Price of the day,Daily Traded Units,Daily Turnover
    

    And, correspondingly, we see that linetokens is:

    ['2016-07-27', '77.64', '78.8', '77.51', '78.36', '', '1971401.0', '154530852.0', '', '', '']
    

    So we are expecting to see the volume where Change actually is.

    I checked 3 different databases: WIKI, FSE and NSE. The columns returned are different in each case.

    WIKI: Date,Open,High,Low,Close,Volume,Ex-Dividend,SplitRatio,Adj.Open,Adj.High,Adj.Low,Adj.Close,Adj.Volume
    
    NSE: Date,Open,High,Low,Last,Close,TotalTradeQuantity,Turnover(Lacs)
    
    FSE: Date,Open,High,Low,Close,Change,TradedVolume,Turnover,LastPriceoftheDay,DailyTradedUnits,DailyTurnover
    

    So, @backtrader, I guess you will have to treat each one differently. One way to retrieve the columns programmatically is to query the database like this: https://www.quandl.com/api/v3/datasets.json?database_code=<WIKI/NSE/FSE/ETC>&per_page=100&sort_by=id&page=1&api_key=<XXXX>. Every "dataset" node contains a child called "column_names" which is a list of the columns returned.

    Hope this helps.


  • administrators

    There is a potential quick solution: make the Quandl data feed a subclass of GenericCSV and let the users manipulate the notion of where the expected columns are.

    A long term solution would also take into account the column names to dynamically create extra lines and make all the columns available (obviously things like a space between words would have to be converted to something like an underscore to let them be proper Python identifiers)



  • @backtrader, I'm not entirely sure what you mean with the first option, but if it is what I think it is, the way the Quandl API would be used would be similar to how GenericCSV works. I'm OK with this. That said, I'm also OK with option 2.

    I appreciate your looking into this and look forward to a resolution of the matter. Of course, I trust your judgment regarding which option you plan to implement. Personally, I feel that option 2 would be more user-friendly. Please let us know when a solution is available.

    Thanks, as usual, for your effective and quick replies!



  • Hi @backtrader,

    Any update/ETA on this?


  • administrators

    Not at the moment


Log in to reply
 

Looks like your connection to Backtrader Community was lost, please wait while we try to reconnect.