datatable

A Python package for manipulating 2-dimensional tabular data structures


This is a Python package for manipulating 2-dimensional tabular data structures (aka data frames). It is close in spirit to pandas or SFrame; however, we put a specific emphasis on speed and big-data support. As the name suggests, the package is closely related to R's data.table and attempts to mimic its core algorithms and API.

Currently, datatable is in beta and under active development, so some features may still be missing. Python 3.6+ is required.

Project goals

datatable started in 2017 as a toolkit for performing big data (up to 100GB) operations on a single-node machine, at the maximum speed possible. Such requirements are dictated by modern machine-learning applications, which need to process large volumes of data and generate many features in order to achieve the best model accuracy. The first user of datatable was Driverless.ai.

The set of features that we want to implement with datatable is at least the following:

  • Column-oriented data storage.

  • Native-C implementation for all datatypes, including strings. Packages such as pandas and numpy already do that for numeric columns, but not for strings.

  • Support for date-time and categorical types. Object type is also supported, but promotion into object is discouraged.

  • All types should support null values, with as little overhead as possible.

  • Data should be stored on disk in the same format as in memory. This will allow us to memory-map data on disk and work on out-of-memory datasets transparently.

  • Work with memory-mapped datasets to avoid loading into memory more data than necessary for each particular operation.

  • Fast data reading from CSV and other formats.

  • Multi-threaded data processing: time-consuming operations should attempt to utilize all cores for maximum efficiency.

  • Efficient algorithms for sorting/grouping/joining.

  • Expressive query syntax (similar to data.table); see the sketch after this list.

  • LLVM-based lazy computation for complex queries (code generated, compiled and executed on-the-fly).

  • LLVM-based user-defined functions.

  • Minimal amount of data copying, copy-on-write semantics for shared data.

  • Use "rowindex" views in filtering/sorting/grouping/joining operators to avoid unnecessary data copying.

  • Interoperability with pandas / numpy / pure Python: users should be able to convert to another data-processing framework with ease.

  • Restrictions: Python 3.6+, 64-bit systems only.
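
A minimal sketch of the query syntax mentioned above (the frame and column names are illustrative):

    from datatable import dt, f, by

    DT = dt.Frame(id=[1, 1, 2, 2, 3],
                  x=[0.5, 1.5, 2.5, 3.5, 4.5])

    # rows where x > 1; grouped sum of x per id
    DT[f.x > 1, {"total": dt.sum(f.x)}, by(f.id)]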

Installation

On macOS, Linux, and Windows, installing datatable is as easy as

pip install datatable

On all other platforms, a source distribution will be needed. For more information, see the Build instructions.
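
A quick smoke test to verify the install:

    import datatable as dt

    print(dt.__version__)     # confirm the package loads
    dt.Frame(A=[1, 2, 3])     # build a tiny frame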

See also

Owner
H2O.ai: Fast Scalable Machine Learning For Smarter Applications
Comments
  • [ENH] `nth` function

    Implement the dt.nth(cols, n=0) function to return the nth row (also per group) for the specified columns. If n goes out of bounds, an NA row is returned.

    Closes #3128
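
    A minimal usage sketch, assuming the dt.nth() behavior described above (hypothetical data; per-group selection via by()):

    from datatable import dt, f, by

    DT = dt.Frame(G=[1, 1, 2, 2], A=[10, 20, 30, 40])

    DT[:, dt.nth(f.A, n=0), by(f.G)]   # first row of A within each group
    DT[:, dt.nth(f.A, n=5), by(f.G)]   # n out of bounds: an NA row per group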

  • Implement cumulative functions

    The list of functions to be implemented and the corresponding PRs:

    • [x] cumsum() https://github.com/h2oai/datatable/pull/3257
    • [x] cumprod() https://github.com/h2oai/datatable/pull/3304
    • [x] cummax() https://github.com/h2oai/datatable/pull/3288
    • [x] cummin() https://github.com/h2oai/datatable/pull/3288
    • [x] cumcount() https://github.com/h2oai/datatable/pull/3310
    • [x] ngroup() - not strictly cumulative https://github.com/h2oai/datatable/pull/3310
    • [x] fillna() for forward/backward fill https://github.com/h2oai/datatable/pull/3311
    • [x] fillna() for filling with a value https://github.com/h2oai/datatable/pull/3344
    • [ ] ~~rank~~ continued on #3148
    • [ ] ~~rolling aggregations~~ continued on #1500
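
    A minimal sketch of the cumulative API, assuming the cumsum() and by() signatures from the PRs above (hypothetical data):

    from datatable import dt, f, by

    DT = dt.Frame(G=[1, 1, 2], A=[3, 1, 2])

    DT[:, dt.cumsum(f.A)]            # running sum down the whole column
    DT[:, dt.cumsum(f.A), by(f.G)]   # running sum restarted for each group
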
  • Mac M1 import error

    Mac M1, Big Sur 11.4; Python 3.8.8 on a Miniforge conda environment; datatable 1.0.0, installed via pip install git+https://github.com/h2oai/datatable. Import error:

    Traceback (most recent call last):
      File "/Users/zwang/miniforge3/envs/tf24/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3437, in run_code
        exec(code_obj, self.user_global_ns, self.user_ns)
      File "<ipython-input-4-98efda56b751>", line 1, in <module>
        import datatable as dt
      File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
        module = self._system_import(name, *args, **kwargs)
      File "/Users/zwang/miniforge3/envs/tf24/lib/python3.8/site-packages/datatable/__init__.py", line 23, in <module>
        from .frame import Frame
      File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
        module = self._system_import(name, *args, **kwargs)
      File "/Users/zwang/miniforge3/envs/tf24/lib/python3.8/site-packages/datatable/frame.py", line 23, in <module>
        from datatable.lib._datatable import Frame
      File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
        module = self._system_import(name, *args, **kwargs)
      File "/Users/zwang/miniforge3/envs/tf24/lib/python3.8/site-packages/datatable/lib/__init__.py", line 31, in <module>
        from . import _datatable as core
      File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
        module = self._system_import(name, *args, **kwargs)
    ImportError: dlopen(/Users/zwang/miniforge3/envs/tf24/lib/python3.8/site-packages/datatable/lib/_datatable.cpython-38-darwin.so, 2): no suitable image found.  Did find:
    	/Users/zwang/miniforge3/envs/tf24/lib/python3.8/site-packages/datatable/lib/_datatable.cpython-38-darwin.so: mach-o, but wrong architecture
    	/Users/zwang/miniforge3/envs/tf24/lib/python3.8/site-packages/datatable/lib/_datatable.cpython-38-darwin.so: mach-o, but wrong architecture
    
  • [ENH] Column aliasing

    This PR implements column aliasing as proposed in #2684. We couldn't name the method .as(), though, because as is a reserved Python keyword, so we use .alias() instead. Column aliasing is now also available in the group-by clause.

    Closes #2504
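
    A minimal sketch of the new method (hypothetical data):

    from datatable import dt, f, by

    DT = dt.Frame(A=[1, 1, 2], B=[4, 5, 6])

    DT[:, f.A.alias("key")]                  # select A under the new name "key"
    DT[:, dt.count(), by(f.A.alias("key"))]  # aliasing inside the group-by clause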

  • memory leak and speed concerns

    import numpy as np
    import lightgbm_gpu as lgb
    import scipy
    import pandas as pd
    from sklearn.utils import shuffle
    from h2oaicore.metrics import def_rmse
    import datatable as dt
    
    def set_dt_col(train_dt, name, value):
        # Overwrite a single column of the frame, addressed by index or name
        if isinstance(name, int):
            name = train_dt.names[name]
        train_dt[:, name] = dt.Frame(value)
        return train_dt
    
    nrow = 4000
    ncol = 5000
    X = np.random.randn(nrow, ncol)
    y = np.random.randn(nrow)
    model = lgb.LGBMRegressor(objective='regression', n_jobs=20)  # 40 very slow
    model.fit(X, y)
    
    X_dt = dt.Frame(X)
    cols_actual = list(X_dt.names)
    
    do_numpy = False
    score_f = def_rmse
    preds = model.predict(X)
    main_metric = score_f(actual=y, predicted=preds)
    seed = 1234
    def go():
        feature_importances = {}
        for n in range(ncol):
            print(n, flush=True)
            if do_numpy:
                shuf = shuffle(X[:,n].ravel())
                X_tmp = X # .copy()
                X_tmp[:,n] = shuf
                new_preds = model.predict(X_tmp)
                metric = score_f(actual=y, predicted=new_preds)
                col = "C" + str(n)
                feature_importances[col] = main_metric - metric
            else:
                col = cols_actual[n]
                shuf = shuffle(X_dt[:, col].to_numpy().ravel(), random_state=seed)
                X_tmp = set_dt_col(dt.Frame(X_dt), col, shuf)
                new_preds = model.predict(X_tmp)
    
                metric = score_f(actual=y, predicted=new_preds)
                feature_importances[col] = main_metric - metric
        return feature_importances
    
    print(go())
    

    Related to permutation variable importance.

    With do_numpy = False (i.e. using dt), I see resident memory slowly creep up from about 0.8GB to 1.6GB by n=1800; by n=4000 it's using 2.7GB.

    With do_numpy = True (no dt at all), resident memory never changes over all n.

    I thought at one point I saw this only with LightGBM and not XGBoost, but I'm not sure.

    Unit tests like this numpy version by Microsoft show that LightGBM itself is not leaking: https://github.com/Microsoft/LightGBM/issues/1968

    These two cases aren't doing exactly the same thing: the numpy version keeps shuffling the same original X, while the dt version, I think, essentially holds two copies, although the other original X_dt columns are not modified. @st-pasha can confirm.

    One could add X_tmp = X.copy(), but that's not quite fair: it makes a full copy, while dt should get away with overwriting only a single column.

    Perhaps the flaw is how we are using dt and the frames?

  • segfault on Ubuntu 20.04 when in combination with LightGBM

    # on host
    cd /tmp/
    wget https://files.slack.com/files-pri/T0329MHH6-F013VU6RW94/download/dt_lgb.gz?pub_secret=fb7b5f3988
    mv 'dt_lgb.gz?pub_secret=fb7b5f3988' dt_lgb.gz
    tar xfz dt_lgb.gz
    docker pull ubuntu:20.04
    docker run -t -v `pwd`:/tmp --security-opt seccomp=unconfined -i ubuntu:20.04 /bin/bash
    
    # on Ubuntu 20.04
    chmod 1777 /tmp
    apt-get update
    DEBIAN_FRONTEND=noninteractive apt-get install -y software-properties-common
    add-apt-repository -y ppa:deadsnakes/ppa
    apt-get update
    apt-get install -y python3.6 python3.6-dev virtualenv libgomp1 gdb vim valgrind
    
    # repro failure
    virtualenv -p python3.6 blah
    source blah/bin/activate
    pip install datatable
    pip install lightgbm
    pip install pandas
    cd /tmp/
    python lgb_prefit_df669346-4e47-4ecf-b131-0838ae8f9474.py
    

    fails with:

    /blah/lib/python3.6/site-packages/lightgbm/basic.py:1295: UserWarning: categorical_feature in Dataset is overridden.
    New categorical_feature is []
      'New categorical_feature is {}'.format(sorted(list(categorical_feature))))
    /blah/lib/python3.6/site-packages/lightgbm/basic.py:842: UserWarning: categorical_feature keyword has been found in `params` and will be ignored.
    Please use categorical_feature argument of the Dataset constructor to pass this parameter.
      .format(key))
    Segmentation fault (core dumped)
    
  • Support for Apache Arrow

    Is there any reason why you did not go with the Apache Arrow format from the beginning?

    It would at least be nice if you allowed to_arrow_table and from_arrow_table conversions.
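
    In the meantime, a workaround sketch that bridges through pandas (assuming pyarrow and pandas are installed; to_arrow_table / from_arrow_table themselves do not exist yet):

    import datatable as dt
    import pyarrow as pa

    DT = dt.Frame(A=[1, 2, 3])
    tbl = pa.Table.from_pandas(DT.to_pandas())   # datatable -> Arrow
    DT2 = dt.Frame(tbl.to_pandas())              # Arrow -> datatable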

  • Aggregator in datatable

    • Is there something datatable can't do just yet, but you think it'd be nice if it did? Aggregate

    • Is it related to some problem you're trying to solve? Solve slow reading of NFF format files.

    • What do you think the API for your feature should be? See the API in the Java code. The required methods are in the base class DataSource.

    See Java code in https://github.com/h2oai/vis-data-server/blob/master/library/src/main/java/com/h2o/data/Aggregator.java

    Plus other classes in that package for support. All of this should be done in C++.

  • Steps towards Python 3.11 support

    • Replace "Py_TYPE(obj) = type" with: "Py_SET_TYPE(obj, type)"
    • Replace "Py_REFCNT(Py_None) += n" with: "Py_SET_REFCNT(Py_None, Py_REFCNT(Py_None) + n)"
    • Add pythoncapi_compat.h to get Py_SET_TYPE() and Py_SET_REFCNT() on Python 3.9 and older. File copied from: https://github.com/pythoncapi/pythoncapi_compat

    On Python 3.10, Py_REFCNT() can no longer be used to set a reference count:

    • https://docs.python.org/dev/c-api/structures.html#c.Py_REFCNT
    • https://docs.python.org/dev/whatsnew/3.10.html#id2

    On Python 3.11, Py_TYPE() can no longer be used to set an object type:

    • https://docs.python.org/dev/c-api/structures.html#c.Py_TYPE
    • https://docs.python.org/dev/whatsnew/3.11.html#id2
  • Switch back to the Apache-v2 license

    The vast majority of Python packages use Apache, MIT, BSD, or similar permissive licenses. It would be courteous to the broader Python community, and would invite broader collaboration and contribution, if we did the same.

    Historically, this project was Apache-licensed from the very first commit. However, sometime before the public release we switched to the MPL-2 license, the idea being to have the same license as the R data.table project (which at that time had switched from GPL to MPL too). Unfortunately, we failed to grasp the primary difference between the R and Python communities at that point: the majority of R packages are licensed under the GPL, and within such an environment an MPL-licensed project can be integrated freely and is seen as more open than the others. Within the Python community, on the contrary, an MPL license is more restrictive and is eyed with suspicion. In fact, the MPL creates a perfectly tangible barrier: the ASF puts it on its Category B list of software that can be integrated only in binary form, not in source form.

    Please share your thoughts and comments.

  • FTRL algo does not work properly on views

    Hi,

    I'm trying to use datatable's FTRL-Proximal algo on a dataset, and it behaves strangely: log loss increases with the number of epochs.

    Here is the code I use:

    import numpy as np
    import datatable as dt
    from datatable.models import Ftrl
    from sklearn.metrics import log_loss  # assumed metric; not shown in the report

    train_dt = dt.fread('dt_ftrl_test_set.csv.gz')
    features = [f for f in train_dt.names if f not in ['HasDetections']]
    for n in range(10):
        ftrl = Ftrl(nepochs=n+1)
        ftrl.fit(train_dt[:, features], train_dt[:, 'HasDetections'])
        # trn_ is a row selector defined elsewhere in the reporter's session
        print(log_loss(np.array(train_dt[trn_, 'HasDetections'])[:, 0], np.array(ftrl.predict(train_dt[trn_, features]))))
    

    The output is

    0.6975873940617929
    0.7004277294410224
    0.7030339011892597
    0.705290424565774
    0.7072685897773024
    0.7091474008277487
    0.7108282513596036
    0.7123130263929156
    0.713890830846544
    0.7151695514165213
    

    My own version of FTRL trains correctly, with the following output:

    time_used:0:00:01.026606	epoch: 0   rows:10001	t_logloss:0.59638
    time_used:0:00:01.715622	epoch: 1   rows:10001	t_logloss:0.52452
    time_used:0:00:02.436984	epoch: 2   rows:10001	t_logloss:0.48113
    time_used:0:00:03.158367	epoch: 3   rows:10001	t_logloss:0.44260
    time_used:0:00:03.851369	epoch: 4   rows:10001	t_logloss:0.39633
    time_used:0:00:04.553488	epoch: 5   rows:10001	t_logloss:0.38197
    time_used:0:00:05.264179	epoch: 6   rows:10001	t_logloss:0.35380
    time_used:0:00:05.973398	epoch: 7   rows:10001	t_logloss:0.32839
    time_used:0:00:06.688121	epoch: 8   rows:10001	t_logloss:0.32057
    time_used:0:00:07.394217	epoch: 9   rows:10001	t_logloss:0.29917
    
    • Your environment? Ubuntu 16.04, clang+llvm-7.0.0-x86_64-linux-gnu-ubuntu-16.04, Python 3.6; datatable is compiled from source.

    Let me know if you need more.

    I guess I'm missing something, but I could not find anything in the unit tests.

    Thanks for your help.

    P.S.: make test results and the dataset I use are attached: datatable_make_test_results.txt, dt_ftrl_test_set.csv.gz
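
    A possible workaround sketch, assuming the bug is specific to views (as the issue title suggests): materialize the selections into plain frames before fitting, using Frame.materialize().

    X = train_dt[:, features]        # column selections may be backed by views
    X.materialize()                  # force a plain in-memory frame
    y = train_dt[:, 'HasDetections']
    y.materialize()
    ftrl.fit(X, y)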

  • [ENH] nth function

    Implement the dt.nth(cols, n) function to return the nth row (also per group) for the specified columns. If n goes out of bounds, an NA row is returned.

    Closes #3128

  • `fread()` doesn't support unicode in file names on Windows

    I just started trying datatable and found that if the file path contains Chinese characters, an IOError occurs. The same file under an all-English path has no such problem. The error message is attached at the end. I don't know whether a solution already exists; I tried to search, but did not find one.

    IOError                                   Traceback (most recent call last)
    <timed exec> in <module>
    
    IOError: Unable to obtain size of D:/测试.csv: [errno 2] No such file or directory
    
  • DT[f.A == "", :] is bugged for columns with all empty strings

    from datatable import dt, f
    
    DT = dt.Frame({"A": ["", ""]})
    DT[f.A == "", dt.count()][0, 0]
    # 0
    

    If any value in the column is not an empty string, it works as expected.

    Workaround:

    DT[dt.str.len(f.A) == 0, dt.count()][0, 0]
    # 2
    
  • Is it possible to read data from gcs://?

    Hi guys, is it possible to read data using fread() from gcs://? I don't see it in the docs, and I don't see any reference in the code either.

    Thank you! V
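
    fread() has no native gcs:// support today; a workaround sketch using the gcsfs package (an assumption: gcsfs is installed and credentials are configured; the bucket and path are hypothetical):

    import datatable as dt
    import gcsfs

    fs = gcsfs.GCSFileSystem()
    with fs.open("my-bucket/data.csv", "rb") as fh:
        DT = dt.fread(fh.read())   # fread() accepts the raw bytes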

Universal 1d/2d data containers with Transformers functionality for data analysis.

XPandas (extended Pandas) implements 1D and 2D data containers for storing type-heterogeneous tabular data of any type, and encapsulates feature extra

Mar 14, 2022
A package which efficiently applies any function to a pandas dataframe or series in the fastest available manner

swifter A package which efficiently applies any function to a pandas dataframe or series in the fastest available manner. Blog posts Release 1.0.0 Fir

Jan 4, 2023
NumPy and Pandas interface to Big Data

Blaze translates a subset of modified NumPy and Pandas-like syntax to databases and other computing systems. Blaze allows Python users a familiar inte

Jan 1, 2023
High performance datastore for time series and tick data

Arctic TimeSeries and Tick store Arctic is a high performance datastore for numeric data. It supports Pandas, numpy arrays and pickled objects out-of-

Dec 23, 2022
A pure Python implementation of Apache Spark's RDD and DStream interfaces.

pysparkling Pysparkling provides a faster, more responsive way to develop programs for PySpark. It enables code intended for Spark applications to exe

Dec 6, 2022
Google Project: Search and auto-complete sentences within given input text files, manipulating data with complex data-structures.

Auto-Complete Google Project In this project there is an implementation for one feature of Google's search engines - AutoComplete. Autocomplete, or wo

Jun 20, 2022
Data Structures and Algorithms Python - Practice data structures and algorithms in python with few small projects

Data Structures and Algorithms All the essential resources and template code nee

Dec 1, 2022
Security-TXT is a python package for retrieving, parsing and manipulating security.txt files.

Feb 7, 2022
Data Structures and algorithms package implementation

Documentation Simple and Easy Package --This is package for enabling basic linear and non-linear data structures and algos-- Data Structures Array Sta

Oct 30, 2021
Research on Tabular Deep Learning (Python package & papers)

Research on Tabular Deep Learning For paper implementations, see the section "Papers and projects". rtdl is a PyTorch-based package providing a user-f

Dec 30, 2022
Glyph-graph - A simple, yet versatile, package for graphing equations on a 2-dimensional text canvas

Glyth Graph Revision for 0.01 A simple, yet versatile, package for graphing equations on a 2-dimensional text canvas List of contents: Brief Introduct

Oct 21, 2022
A python application for manipulating pandas data frames from the comfort of your web browser

A python application for manipulating pandas data frames from the comfort of your web browser. Data flows are represented as a Directed Acyclic Graph, and nodes can be ran individually as the user sees fit.

Jan 4, 2023
Pretty-print tabular data in Python, a library and a command-line utility. Repository migrated from bitbucket.org/astanin/python-tabulate.

python-tabulate Pretty-print tabular data in Python, a library and a command-line utility. The main use cases of the library are: printing small table

Jan 6, 2023
Single API for reading, manipulating and writing data in csv, ods, xls, xlsx and xlsm files

pyexcel - Let you focus on data, instead of file formats Support the project If your company has embedded pyexcel and its components into a revenue ge

Dec 29, 2022
pytest plugin for manipulating test data directories and files

pytest-datadir pytest plugin for manipulating test data directories and files. Usage pytest-datadir will look up for a directory with the name of your

Dec 21, 2022
A Login/Registration GUI Application with SQLite database for manipulating data.

Login-Register_Tk A Login/Registration GUI Application with SQLite database for manipulating data. What is this program? This program is a GUI applica

Feb 1, 2022
Tidy data structures, summaries, and visualisations for missing data

naniar naniar provides principled, tidy ways to summarise, visualise, and manipulate missing data with minimal deviations from the workflows in ggplot

Dec 22, 2022
Out-of-Core DataFrames for Python, ML, visualize and explore big tabular data at a billion rows per second 🚀

What is Vaex? Vaex is a high performance Python library for lazy Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular data

Jan 1, 2023
A Python toolkit for processing tabular data

meza: A Python toolkit for processing tabular data Index Introduction | Requirements | Motivation | Hello World | Usage | Interoperability | Installat

Dec 19, 2022
Python library to extract tabular data from images and scanned PDFs

Overview ExtractTable - API to extract tabular data from images and scanned PDFs The motivation is to make it easy for developers to extract tabular d

Dec 31, 2022