Navigation

    Backtrader Community

    • Register
    • Login
    • Search
    • Categories
    • Recent
    • Tags
    • Popular
    • Users
    • Groups
    • Search
    For code/output blocks: Use ``` (aka backtick or grave accent) in a single line before and after the block. See: http://commonmark.org/help/

    Incorrect/corrupted data retrieved when using YahooFinanceData API

    General Code/Help
    yahoofinancedat baddata corrupteddata
    3
    20
    5447
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • Z
      Zubin Bharucha last edited by Zubin Bharucha

      Hello again BT community!

      I have noticed an interesting problem which occurs on certain Yahoo symbols (I think - I have to admit, I haven't tested very extensively) when using the YahooFinanceData API. I notice that for the affected symbols, the OHLCV (open, high, low, close, volume) values are incorrect up to a certain point in time, after which they are correct.

      To demonstrate this odd behavior, I have written a small script. Please execute the code below:

      from datetime import datetime, timedelta, date
      import backtrader as bt
      
      # get Yahoo finance data
      lookback_timedelta_days = timedelta(days=365)
      now_datetime = datetime.now()
      to_date = datetime(now_datetime.year, now_datetime.month, now_datetime.day)
      from_date = to_date - lookback_timedelta_days
      yahoo_finance_data = bt.feeds.YahooFinanceData(dataname="BMW.F", fromdate=from_date, todate=to_date, adjclose=False)
      
      # start cerebro
      cerebro = bt.Cerebro()
      cerebro.adddata(data=yahoo_finance_data)
      
      # run cerebro
      cerebro_result = cerebro.run()
      
      # capture results
      date_series = cerebro_result[0].data.datetime.array
      open_series = list(cerebro_result[0].data.open.array)
      high_series = list(cerebro_result[0].data.high.array)
      low_series = list(cerebro_result[0].data.low.array)
      close_series = list(cerebro_result[0].data.close.array)
      volume_series = list(cerebro_result[0].data.volume.array)
      with open('results.txt', 'w') as file_obj:
          file_obj.write('Date,Open,High,Low,Close,Volume')
          for day_iter in range(len(date_series)):
              this_date = date.fromordinal(int(date_series[day_iter]))
              this_open = open_series[day_iter]
              this_high = high_series[day_iter]
              this_low = low_series[day_iter]
              this_close = close_series[day_iter]
              this_volume = volume_series[day_iter]
              file_obj.write('%s,%f,%f,%f,%f,%d\n' % (this_date, this_open, this_high, this_low, this_close, this_volume))
      

      Sorry for the messy coding. It's just intended as a demo. As you see, we retrieve the historical OHLCV data for the symbol "BMW.F" going a year back and then write it to a file (results.txt) with the aim that we can compare this file against the one retrieved from Yahoo's web-interface (go here, then click "download data"). Please save the file and remove the "adjusted close" column because we don't need it (I would have attached the file, but uploads are not allowed).

      If you compare the two files, you will see that there are big differences in the OHLC values up to May 11 2017. There are never any errors for the volume information, by the way. After this point, there are still differences, but they are very small (maybe precision errors?). Please see the image below to see what I mean. Up to May 11 2017, we see consistent differences on the order of ~3 Euros. After this point, the difference is on the order of ~<=0.01 Euros.
      alt text

      I have also created a candlestick chart using the data retrieved from the YahooFinanceData API. It confirms what we see above. Before May 11, the data looks extremely weird where not only are the values wrong, but (almost) every day, the close is lower than the open. After May 11, we see that reality is reflected more accurately.
      alt text

      Finally, as I said, this doesn't seem to happen for all symbols. Here are a few others where I notice this behavior with the date where bad data becomes good (ARL.DE: June 1 2017; ADJ.DE; May 4 2017; AIRA.DE; April 11 2017, etc.).

      Thanks for reading this long post. I hope you can figure out what the issue is! Please let me know if I can provide any additional help.

      pip show backtrader
      
      Name: backtrader
      Version: 1.9.51.121
      Summary: BackTesting Engine
      Home-page: https://github.com/mementum/backtrader
      Author: Daniel Rodriguez
      Author-email: danjrod@gmail.com
      License: GPLv3+
      Location: /usr/local/lib/python2.7/dist-packages
      Requires: 
      
      1 Reply Last reply Reply Quote 0
      • D
        dasch last edited by

        Hi

        i went through your problem.

        by changing the data source to

        yahoo_finance_data = bt.feeds.YahooFinanceData(dataname="BMW.F", fromdate=from_date, todate=to_date, adjclose=False, swapcloses=True, round=False)

        you will get the correct result it seems.

        in the yahoo feed, there is an assumption

        # In v7, the final seq is "adj close", close, volume
        

        but the data downloaded has the seq close, adj close, volume
        this can be reverted by swapcloses=True

        At least, thats how i understood the code in the yahoo feed :)

        1 Reply Last reply Reply Quote 0
        • B
          backtrader administrators last edited by backtrader

          Unless you have been living under a rock, YahooFinanceData cannot be used reliably any longer and has nothing to do with backtrader. You have to manually filter the data and understand where the oddities introduced by Yahoo are present.

          • Docs -YahooFinanceData gets displayed invalid due to changed adjusted close
          • Quant.StackExchange - Is Yahoo! Finance data good or bad now?

          Plus several other hundred links about the topic.

          1 Reply Last reply Reply Quote 0
          • Z
            Zubin Bharucha last edited by Zubin Bharucha

            Thanks, @dasch and @backtrader for your quick answers!

            @dasch: I tried your suggestion, but unfortunately, the results are still inconsistent. One good tip, however, from your answer was to use round=False. Thanks for that! I guess we will never get the correct results due to the seemingly inconsistent behavior of the data returned by Yahoo. From the BT data feed reference:

            If True the allegedly adjusted close and non-adjusted close will be swapped. The downloads with the new v7 API show at random times the closing prices swapped. There is no known pattern

            If I use your suggestion, i.e., adjclose=False, swapcloses=True, round=False, we get the following:

            From Yahoo: 
            Date,Open,High,Low,Close,Adj Close,Volume
            2016-07-27,76.838997,78.705002,76.838997,78.142998,75.114166,12079
            
            Ours: 
            Date,Open,High,Low,Close,Volume
            2016-07-27,79.937379,81.878627,79.937379,78.142998,12079
            

            We see that the close (and volume) values match, but nothing else does.

            Now, if I use adjclose=True, swapcloses=True, round=False, we get the following:

            From Yahoo: 
            Date,Open,High,Low,Close,Adj Close,Volume
            2016-07-27,76.838997,78.705002,76.838997,78.142998,75.114166,12079
            
            Ours: 
            Date,Open,High,Low,Close,Volume
            2016-07-27,76.838997,78.705002,76.838997,75.114166,11610
            

            We see that the everything matches except for the close and volume values (exactly the opposite of the previous case). After a certain date, they do start matching, but I can't rely on something like this.

            So it seems that if I want to get all the OHLCV values correct, I have to use a combination of the two approaches above.

            @backtrader: I suppose I have been living under a rock. So far, I have only required the close values and since I always used adjclose=False, I got the proper values. Now, I need the other values too and therefore have only started venturing out from under my rock just now. For the record, I never meant to imply that there is some bug with BT. Please don't take it personally. I have always maintained in all my other posts that BT is an absolutely amazing product. Unfortunately, I do not see any alternative to using YahooFinanceData because I require mostly German and Indian equities and so far, I haven't been able to find an alternative service that provides these free of charge (I'm only doing this as a hobby in my free time and would therefore be unwilling to pay for a service). If you know a better source, I am all ears. Thank you again, for your continued support!

            1 Reply Last reply Reply Quote 0
            • D
              dasch last edited by

              there is a adjustment in the volume in the data parsing when adjclose=False, maybe altering the feed would help you.

              By commenting line 159-160 in yahoo.py out, i get:

              2016-07-27,73.860708,75.654387,73.860708,75.114166,12079
              

              It is this code part:

                          # in v7 "adjusted prices" seem to be given, scale back for non adj
                          if not self.params.adjclose:
                              o *= adjfactor
                              h *= adjfactor
                              l *= adjfactor
                              c = adjustedclose
                              # v *= adjfactor  # except volume which is the same as in v1
                          else:
                              v /= adjfactor  # rescale vol down
              
              1 Reply Last reply Reply Quote 0
              • D
                dasch last edited by

                @backtrader hope you don't mind, i added a param to the yahoo feed. this would allow to skip the adjustment of the volume

                diff backtrader_orig/feeds/yahoo.py backtrader/feeds/yahoo.py
                54a55,58
                >         
                >       - ``adjvolume`` (default: ``True``)
                > 
                >         Whether to adjust the volume if the values are not adjusted.
                82a87
                >         ('adjvolume', True),
                159c164
                <             else:
                ---
                >             elif self.params.adjvolume:
                
                1 Reply Last reply Reply Quote 0
                • D
                  dasch last edited by

                  Like @backtrader said, this is really a mess with yahoo and it is a unreliable source, today i had to set:

                  yahoo_finance_data = bt.feeds.YahooFinanceData(dataname="BMW.F", fromdate=from_date, todate=to_date, adjclose=True,  swapcloses=False, round=False)
                  

                  to get this lines:

                  2016-07-27,76.838997,78.705002,76.838997,75.114166,12079
                  2016-10-13,76.349998,76.500000,75.307999,73.177269,6697
                  2017-03-20,83.000000,83.000000,82.000000,79.483971,8858
                  2017-06-09,84.377998,84.610001,84.166000,84.539001,1182
                  2017-07-14,84.192001,84.317001,83.599998,83.889000,1188
                  

                  You should try to find a more reliable source. I would like to help, but i have no idea, where you could find it. Anyway, i hope, this helps you a bit. :)

                  1 Reply Last reply Reply Quote 0
                  • B
                    backtrader administrators last edited by

                    Quandl is supported too. The WIKI dataset is a community effort to manage and curate EOD data. It has over 3000+ entries (quickly recalling) and is free.

                    It may, or may not, have what you are looking for.

                    1 Reply Last reply Reply Quote 0
                    • Z
                      Zubin Bharucha last edited by

                      Hi again @dasch and @backtrader,

                      As usual, thanks for your quick replies! Instead of retrieving the data from every symbol twice (once adjusted, once not) and then taking the correct values from each, I decided to see if I could switch from Yahoo to Quandl instead. I found the BMW ticker (from the FSE dataset, which seems to be free to access) and decided to take it for a test run. Sadly, I didn't get very far. Here's the code (pretty much the same as before except the data acquisition part):

                      from datetime import datetime, timedelta, date
                      import backtrader as bt
                      
                      # get Yahoo finance data
                      lookback_timedelta_days = timedelta(days=365)
                      now_datetime = datetime.now()
                      to_date = datetime(now_datetime.year, now_datetime.month, now_datetime.day)
                      from_date = to_date - lookback_timedelta_days
                      
                      # OLD FEED FROM YAHOO
                      # yahoo_finance_data = bt.feeds.YahooFinanceData(dataname="BMW.F", fromdate=from_date, todate=to_date, adjclose=False,
                      #                                                swapcloses=False, round=False)
                      
                      # NEW FEED FROM QUANDL
                      quandl_finance_data = bt.feeds.Quandl(dataname="BMW_X", fromdate=from_date, todate=to_date, dataset='FSE', round=False,
                                                            adjclose=False)
                      
                      # start cerebro
                      cerebro = bt.Cerebro()
                      cerebro.adddata(data=quandl_finance_data)
                      
                      # run cerebro
                      cerebro_result = cerebro.run()
                      
                      # capture results
                      date_series = cerebro_result[0].data.datetime.array
                      open_series = list(cerebro_result[0].data.open.array)
                      high_series = list(cerebro_result[0].data.high.array)
                      low_series = list(cerebro_result[0].data.low.array)
                      close_series = list(cerebro_result[0].data.close.array)
                      volume_series = list(cerebro_result[0].data.volume.array)
                      with open('results.txt', 'w') as file_obj:
                          file_obj.write('Date,Open,High,Low,Close,Volume\n')
                          for day_iter in range(len(date_series)):
                              this_date = date.fromordinal(int(date_series[day_iter]))
                              this_open = open_series[day_iter]
                              this_high = high_series[day_iter]
                              this_low = low_series[day_iter]
                              this_close = close_series[day_iter]
                              this_volume = volume_series[day_iter]
                              file_obj.write('%s,%f,%f,%f,%f,%d\n' % (this_date, this_open, this_high, this_low, this_close, this_volume))
                      

                      I get this error:

                      /usr/bin/python2.7 <bla>/btYahooIssue.py
                      Traceback (most recent call last):
                        File "<bla>/btYahooIssue.py", line 22, in <module>
                          cerebro_result = cerebro.run()
                        File "/usr/local/lib/python2.7/dist-packages/backtrader/cerebro.py", line 1073, in run
                          runstrat = self.runstrategies(iterstrat)
                        File "/usr/local/lib/python2.7/dist-packages/backtrader/cerebro.py", line 1149, in runstrategies
                          data.preload()
                        File "/usr/local/lib/python2.7/dist-packages/backtrader/feed.py", line 682, in preload
                          while self.load():
                        File "/usr/local/lib/python2.7/dist-packages/backtrader/feed.py", line 476, in load
                          _loadret = self._load()
                        File "/usr/local/lib/python2.7/dist-packages/backtrader/feed.py", line 704, in _load
                          return self._loadline(linetokens)
                        File "/usr/local/lib/python2.7/dist-packages/backtrader/feeds/quandl.py", line 111, in _loadline
                          v = float(linetokens[next(i)])
                      ValueError: could not convert string to float: 
                      
                      Process finished with exit code 1
                      

                      What am I doing wrong?

                      1 Reply Last reply Reply Quote 0
                      • D
                        dasch last edited by

                        you did not configure the data feed correctly. you may find some details here:

                        https://community.backtrader.com/topic/491/quandl-example/

                        Z 1 Reply Last reply Reply Quote 0
                        • Z
                          Zubin Bharucha @dasch last edited by Zubin Bharucha

                          @dasch, it still doesn't work even if I provide the API key. I get the same error with this line:

                          quandl_finance_data = bt.feeds.Quandl(dataname="BMW_X", fromdate=from_date, todate=to_date, round=False, adjclose=False, apikey="xxxx", dataset="FSE")
                          

                          However, if I switch to the WIKI dataset (instead of FSE) and use a ticker like YHOO, everything works fine. So I'm guessing the problem is with the usage of the FSE dataset.

                          1 Reply Last reply Reply Quote 0
                          • D
                            dasch last edited by

                            oh, i see.

                            (110)_loadline()
                            -> v = float(linetokens[next(i)])
                            (Pdb) linetokens
                            ['2016-07-27', '77.64', '78.8', '77.51', '78.36', '', '1971401.0', '154530852.0', '', '', '']
                            

                            The feed does not have a volume but the feed expect it to have it.

                            1 Reply Last reply Reply Quote 0
                            • D
                              dasch last edited by

                              i mean, the data has no volume, so it breaks here, can't really help you with that.

                              Z 1 Reply Last reply Reply Quote 0
                              • D
                                dasch last edited by dasch

                                a quick fix would be something like this in the feed src:

                                       o = float(linetokens[next(i)] or 0)
                                       h = float(linetokens[next(i)] or 0)
                                       l = float(linetokens[next(i)] or 0)
                                       c = float(linetokens[next(i)] or 0)
                                       v = float(linetokens[next(i)] or 0)
                                
                                1 Reply Last reply Reply Quote 0
                                • Z
                                  Zubin Bharucha @dasch last edited by

                                  @dasch, yeah I thought so too.

                                  @backtrader, can this be handled internally such that a warning is thrown and the empty value is interpreted as 0/NaN/infinity/etc. so that the process may continue to execute?

                                  Z 1 Reply Last reply Reply Quote 0
                                  • Z
                                    Zubin Bharucha @Zubin Bharucha last edited by Zubin Bharucha

                                    One more thing: I think the column with the empty value does not represent the volume. From the Quandl web UI for BMW (click on Table), we see the following columns:

                                    Date,Open,High,Low,Close,Change,Traded Volume,Turnover,Last Price of the day,Daily Traded Units,Daily Turnover
                                    

                                    And, correspondingly, we see that linetokens is:

                                    ['2016-07-27', '77.64', '78.8', '77.51', '78.36', '', '1971401.0', '154530852.0', '', '', '']
                                    

                                    So we are expecting to see the volume where Change actually is.

                                    I checked 3 different databases: WIKI, FSE and NSE. The columns returned are different in each case.

                                    WIKI: Date,Open,High,Low,Close,Volume,Ex-Dividend,SplitRatio,Adj.Open,Adj.High,Adj.Low,Adj.Close,Adj.Volume
                                    
                                    NSE: Date,Open,High,Low,Last,Close,TotalTradeQuantity,Turnover(Lacs)
                                    
                                    FSE: Date,Open,High,Low,Close,Change,TradedVolume,Turnover,LastPriceoftheDay,DailyTradedUnits,DailyTurnover
                                    

                                    So, @backtrader, I guess you will have to treat each one differently. One way to retrieve the columns programmatically is to query the database like this: https://www.quandl.com/api/v3/datasets.json?database_code=<WIKI/NSE/FSE/ETC>&per_page=100&sort_by=id&page=1&api_key=<XXXX>. Every "dataset" node contains a child called "column_names" which is a list of the columns returned.

                                    Hope this helps.

                                    1 Reply Last reply Reply Quote 0
                                    • B
                                      backtrader administrators last edited by

                                      There is a potential quick solution: make the Quandl data feed a subclass of GenericCSV and let the users manipulate the notion of where the expected columns are.

                                      A long term solution would also take into account the column names to dynamically create extra lines and make all the columns available (obviously things like a space between words would have to be converted to something like an underscore to let them be proper Python identifiers)

                                      Z 1 Reply Last reply Reply Quote 0
                                      • Z
                                        Zubin Bharucha @backtrader last edited by

                                        @backtrader, I'm not entirely sure what you mean with the first option, but if it is what I think it is, the way the Quandl API would be used would be similar to how GenericCSV works. I'm OK with this. That said, I'm also OK with option 2.

                                        I appreciate your looking into this and look forward to a resolution of the matter. Of course, I trust your judgment regarding which option you plan to implement. Personally, I feel that option 2 would be more user-friendly. Please let us know when a solution is available.

                                        Thanks, as usual, for your effective and quick replies!

                                        Z 1 Reply Last reply Reply Quote 0
                                        • Z
                                          Zubin Bharucha @Zubin Bharucha last edited by

                                          Hi @backtrader,

                                          Any update/ETA on this?

                                          1 Reply Last reply Reply Quote 0
                                          • B
                                            backtrader administrators last edited by

                                            Not at the moment

                                            1 Reply Last reply Reply Quote 0
                                            • 1 / 1
                                            • First post
                                              Last post
                                            Copyright © 2016, 2017, 2018, 2019, 2020, 2021 NodeBB Forums | Contributors