
Data loading time ISSUE



  • Hi,
    I’ve tried to import CSV data for 1 year, 5 years and 10 years of minute data across different environments. I have had some mixed (and not so good) performance.

    I came up with a data loading time table as below:

    [Attachment: screenshot of the data loading time table (Google Sheets)]

    Specs:
    Server: Google Cloud instance, Intel Ivy Bridge, 8 vCPUs, 13 GB memory, 50GB SSD for boot and local disk
    2016 MacBook Pro: Processor 2.9 GHz Intel Core i5, Memory 8GB, SSD
    Desktop: Processor 2.7 GHz Intel Xeon®, Memory 8GB, SSD

    The times recorded above are measured from when cerebro.run() is called to when __init__ is called in the strategy class.
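As an aside, for measuring elapsed intervals like this, time.perf_counter() is usually preferred over time.time(), since it is a monotonic, high-resolution clock. A minimal sketch (the summed range is just a stand-in for the work being timed):

```python
import time

# time.perf_counter() is a monotonic, high-resolution clock, which makes it
# the idiomatic choice for measuring elapsed intervals (time.time() can jump
# if the system clock is adjusted).
start = time.perf_counter()
total = sum(range(1_000_000))  # stand-in for the work being timed
elapsed = time.perf_counter() - start
print("done in %.3f s" % elapsed)
```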

    For data feeds, I tried MT4CSVData (which uses GenericCSVData and simply modifies the params) and PandasData, as per this thread: #How to speed up backtest

    I’ve read that pandas.read_csv is faster for importing CSV data. However, from the testing table, MT4CSVData actually has somewhat better results…?

    In addition, one of the posts also mentioned that for 214,911 lines (bars), loading time was reduced from roughly under 8 seconds to 3.5 seconds after optimization.

    I’m curious how I can achieve better performance like that; is there anything I am doing wrong?

    Code below:

    for MT4CSVData

    from __future__ import (absolute_import, division, print_function,
                            unicode_literals)
    import argparse
    import time

    import backtrader as bt
    import backtrader.feeds as btfeed
    
    class TestStrategy(bt.Strategy):
    
        def __init__(self):
    
            print("data import done %s" % time.time())
    
        def next(self):
            pass
            
    
    def parse_args():
        parser = argparse.ArgumentParser(description='')
    
        parser.add_argument('--data', '-d', required=True,
                            help='data into the system')
    
        return parser.parse_args()
    
    
    if __name__ == '__main__':
        
        args = parse_args()
        cerebro = bt.Cerebro()
    
        cerebro.addstrategy(TestStrategy)
        
        data = btfeed.MT4CSVData(
            dataname=args.data
        )
        cerebro.adddata(data)
    
        print("data import start %s" % time.time())
        cerebro.run()
    

    for PandasData:

    from __future__ import (absolute_import, division, print_function,
                            unicode_literals)

    import argparse
    import time

    import backtrader as bt
    import backtrader.feeds as btfeed
    import pandas
    
    class TestStrategy(bt.Strategy):
    
        def __init__(self):
            print("data import done %s " % time.time())
        def next(self):
            pass
    
    def parse_args():
        parser = argparse.ArgumentParser(description='')
    
        parser.add_argument('--data', '-d', required=True,
                            help='data into the system')
    
        return parser.parse_args()
    
    if __name__ == '__main__':
        args = parse_args()
        cerebro = bt.Cerebro()
        cerebro.addstrategy(TestStrategy)
    
        datapath = args.data
        dataframe = pandas.read_csv(datapath,
                                    header=0,
                                    parse_dates=True,
                                    index_col=0)

        data = bt.feeds.PandasData(dataname=dataframe,
                                   timeframe=bt.TimeFrame.Minutes)
    
        cerebro.adddata(data)
        print("data import start %s " % time.time())
        cerebro.run()
    
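One common speed-up for pandas.read_csv is to parse the datetime column with an explicit format string, since otherwise pandas may fall back to slow per-row inference. A sketch under assumptions about the file layout (the column order and the "%Y.%m.%d %H:%M" format below are hypothetical; adjust them to match your CSV):

```python
import pandas

# Sketch: parse the index with an explicit format instead of letting
# pandas infer it row by row. The "%Y.%m.%d %H:%M" format and the
# column layout are assumptions -- adjust them to match your CSV.
def load_csv_fast(path):
    df = pandas.read_csv(path, header=0, index_col=0)
    df.index = pandas.to_datetime(df.index, format='%Y.%m.%d %H:%M')
    return df
```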

  • administrators

    One very simple question: where is the ISSUE?



  • thanks for the reply.

    I was just wondering if:

    1. Is there something inefficient/incorrect I am doing, or a way to improve the data loading process?
    2. Is it normal for pandas to be slower than MT4CSVData?

    When working with larger data (e.g. 20 years of 1-minute CSV data), importing can take quite a long time.

    I should say, however, that the experience so far has been MUCH better than Zipline.

    Great library and documentation from what I've seen so far!


  • administrators

    @btalgo said in Data loading time ISSUE:

    1. Is there something inefficient/incorrect I am doing, or a way to improve the data loading process?

    Python is slow as a byproduct of the high degree of introspection and dynamism it allows

    @btalgo said in Data loading time ISSUE:

    2. Is it normal for pandas to be slower than MT4CSVData?

    Without knowing how the pandas CSV-loading code is implemented: the call is meant to support a much wider range of use cases, which probably means a lot more logic is involved in parsing and in being resilient to errors in the CSV. There is also probably a sweet spot in data size beyond which it becomes worth using. But I haven't tried.

    If you really want to load things faster, you should keep your data in a database that can quickly give you just the right amount of data, rather than having to scan through entire portions of a file. See for example: arctic
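A lightweight sketch of the same "parse once, store, reload fast" idea (a database such as arctic is the heavier-weight version of this): cache the parsed DataFrame as a pickle next to the CSV and reuse it on subsequent runs. The `.pkl` naming convention here is just an illustration:

```python
import os
import pandas

# Sketch: parse the CSV once, cache the DataFrame as a pickle, and load
# the binary cache on later runs. The cache path convention is arbitrary.
def load_cached(csv_path):
    cache_path = csv_path + '.pkl'
    if os.path.exists(cache_path):
        return pandas.read_pickle(cache_path)  # fast binary load
    df = pandas.read_csv(csv_path, header=0, parse_dates=True, index_col=0)
    df.to_pickle(cache_path)  # slow CSV parse happens only once
    return df
```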