xarray: N-D labeled arrays and datasets

xarray (formerly xray) is an open source project and Python package that makes working with labelled multi-dimensional arrays simple, efficient, and fun!

Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like arrays, which allows for a more intuitive, more concise, and less error-prone developer experience. The package includes a large and growing library of domain-agnostic functions for advanced analytics and visualization with these data structures.

Xarray was inspired by and borrows heavily from pandas, the popular data analysis package focused on labelled tabular data. It is particularly tailored to working with netCDF files, which were the source of xarray's data model, and integrates tightly with dask for parallel computing.

Why xarray?

Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called "tensors") are an essential part of computational science. They are encountered in a wide range of fields, including physics, astronomy, geoscience, bioinformatics, engineering, finance, and deep learning. In Python, NumPy provides the fundamental data structure and API for working with raw ND arrays. However, real-world datasets are usually more than just raw numbers; they have labels which encode information about how the array values map to locations in space, time, etc.

Xarray doesn't just keep track of labels on arrays -- it uses them to provide a powerful and concise interface. For example:

  • Apply operations over dimensions by name: x.sum('time').
  • Select values by label instead of integer location: x.loc['2014-01-01'] or x.sel(time='2014-01-01').
  • Mathematical operations (e.g., x - y) vectorize across multiple dimensions (array broadcasting) based on dimension names, not shape.
  • Flexible split-apply-combine operations with groupby: x.groupby('time.dayofyear').mean().
  • Database-like alignment based on coordinate labels that smoothly handles missing values: x, y = xr.align(x, y, join='outer').
  • Keep track of arbitrary metadata in the form of a Python dictionary: x.attrs.
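
A minimal sketch of these labels in action (the values are illustrative; every call shown is standard xarray API):

    import numpy as np
    import pandas as pd
    import xarray as xr

    # A small labelled 2-D array: time along one axis, city along the other.
    x = xr.DataArray(
        np.random.randn(3, 2),
        coords={"time": pd.date_range("2014-01-01", periods=3), "city": ["NYC", "LA"]},
        dims=("time", "city"),
        attrs={"units": "degC"},
    )

    x.sum("time")                       # reduce over a dimension by name
    x.sel(time="2014-01-01")            # select by label, not integer position
    x.groupby("time.dayofyear").mean()  # split-apply-combine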

Documentation

Learn more about xarray in its official documentation at https://xarray.pydata.org/

Contributing

You can find information about contributing to xarray at our Contributing page.

Get in touch

  • Ask usage questions ("How do I?") on Stack Overflow.
  • Report bugs, suggest features or view the source code on GitHub.
  • For less well-defined questions or ideas, or to announce other projects of interest to xarray users, use the mailing list.

NumFOCUS

Xarray is a fiscally sponsored project of NumFOCUS, a nonprofit dedicated to supporting the open source scientific computing community. If you like Xarray and want to support our mission, please consider making a donation to support our efforts.

History

xarray is an evolution of an internal tool developed at The Climate Corporation. It was originally written by Climate Corp researchers Stephan Hoyer, Alex Kleeman and Eugene Brevdo and was released as open source in May 2014. The project was renamed from "xray" in January 2016. Xarray became a fiscally sponsored project of NumFOCUS in August 2018.

License

Copyright 2014-2019, xarray Developers

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

xarray bundles portions of pandas, NumPy and Seaborn, all of which are available under a "3-clause BSD" license:

  • pandas: setup.py, xarray/util/print_versions.py
  • NumPy: xarray/core/npcompat.py
  • Seaborn: _determine_cmap_params in xarray/core/plot/utils.py

xarray also bundles portions of CPython, which is available under the "Python Software Foundation License" in xarray/core/pycompat.py.

xarray uses icons from the icomoon package (free version), which is available under the "CC BY 4.0" license.

The full text of each of these licenses is included in the licenses directory.

Comments
  • WIP: Zarr backend

    • [x] Closes #1223
    • [x] Tests added / passed
    • [x] Passes git diff upstream/master | flake8 --diff
    • [x] Fully documented, including whats-new.rst for all changes and api.rst for new API

    I think that a zarr backend could be the ideal storage format for xarray datasets, overcoming many of the frustrations associated with netcdf and enabling optimal performance on cloud platforms.

    This is a very basic start to implementing a zarr backend (as proposed in #1223); however, I am taking a somewhat different approach. I store the whole dataset in a single zarr group. I encode the extra metadata needed by xarray (so far just dimension information) as attributes within the zarr group and child arrays. I hide these special attributes from the user by wrapping the attribute dictionaries in a "HiddenKeyDict", so that they can't be viewed or modified.

    I have no tests yet (:flushed:), but the following code works.

    from xarray.backends.zarr import ZarrStore
    import xarray as xr
    import numpy as np

    # Build a small chunked dataset with variable, coordinate and global attributes.
    ds = xr.Dataset(
        {'foo': (('y', 'x'), np.ones((100, 200)), {'myattr1': 1, 'myattr2': 2}),
         'bar': (('x',), np.zeros(200))},
        {'y': (('y',), np.arange(100)),
         'x': (('x',), np.arange(200))},
        {'some_attr': 'copana'}
    ).chunk({'y': 50, 'x': 40})

    # Round-trip the dataset through a zarr group on disk.
    zs = ZarrStore(store='zarr_test')
    ds.dump_to_store(zs)
    ds2 = xr.Dataset.load_store(zs)
    assert ds2.equals(ds)
    

    There is a very long way to go here, but I thought I would just get a PR started. Here are some questions that would help me move forward:

    1. What is "encoding" at the variable level? (I have never understood this part of xarray.) How should encoding be handled with zarr?
    2. Should we encode / decode CF for zarr stores?
    3. Do we want to always automatically align dask chunks with the underlying zarr chunks?
    4. What sort of public API should the zarr backend have? Should you be able to load zarr stores via open_dataset? Or do we need a new method? I think .to_zarr() would be quite useful. (A sketch of that eventual API follows this list.)
    5. zarr arrays are extensible along all axes. What does this imply for unlimited dimensions?
    6. Is any autoclose logic needed? As far as I can tell, zarr objects don't need to be closed.
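
    For context on question 4: the public API that eventually emerged is the to_zarr / open_zarr pair. A minimal round-trip sketch against current xarray, assuming the zarr and dask packages are installed:

    import numpy as np
    import xarray as xr

    ds = xr.Dataset({"foo": (("y", "x"), np.ones((100, 200)))}).chunk({"y": 50, "x": 40})

    # Write the whole dataset into a single zarr group on disk ...
    ds.to_zarr("zarr_test", mode="w")

    # ... then read it back lazily and compare.
    ds2 = xr.open_zarr("zarr_test")
    assert ds2.load().equals(ds)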
  • CFTimeIndex

    • [x] closes #1084
    • [x] passes git diff upstream/master | flake8 --diff
    • [x] tests added / passed
    • [x] whatsnew entry

    This work-in-progress PR is a start on implementing a NetCDFTimeIndex, a subclass of pandas.Index which closely mimics pandas.DatetimeIndex but uses netcdftime._netcdftime.datetime objects. Currently implemented in the new index are:

    • Partial datetime-string indexing (using strictly ISO8601-format strings, using a date parser implemented by @shoyer in https://github.com/pydata/xarray/issues/1084#issuecomment-274372547)
    • Field-accessors for year, month, day, hour, minute, second, and microsecond, to enable groupby operations on attributes of date objects

    This index is meant as a step towards improving the handling of non-standard calendars and dates outside the range Timestamp('1677-09-21 00:12:43.145225') to Timestamp('2262-04-11 23:47:16.854775807').


    For now I have pushed only the code and some tests for the new index; I want to make sure the index is solid and well-tested before we consider integrating it into any of xarray's existing logic or writing any documentation.

    Regarding the index, there are a couple of outstanding issues (that at least I'm aware of):

    1. Currently one can create nonsensical datetimes using netcdftime._netcdftime.datetime objects. This means one can attempt to index with an out-of-bounds string or datetime without raising an error. Could this possibly be addressed upstream? For example:
    In [1]: from netcdftime import DatetimeNoLeap
    
    In [2]: DatetimeNoLeap(2000, 45, 45)
    Out[2]: netcdftime._netcdftime.DatetimeNoLeap(2000, 45, 45, 0, 0, 0, 0, -1, 1)
    
    2. I am looking to enable this index to be used in pandas.Series and pandas.DataFrame objects as well; this requires implementing a get_value method. I have taken @shoyer's suggested simplified approach from https://github.com/pydata/xarray/issues/1084#issuecomment-275963433 and tweaked it to also allow for slice indexing, so I think this is most of the way there. A remaining to-do for me, however, is to allow integer indexing outside of iloc, e.g. indexing a pandas.Series series with the syntax series[1] or series[1:3].

    Hopefully this is a decent start; in particular I'm not an expert in writing tests so please let me know if there are improvements I can make to the structure and / or style I've used so far. I'm happy to make changes. I appreciate your help.
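
    For reference, the index ultimately shipped as CFTimeIndex, created for example by xr.cftime_range. A small sketch of the two features above against released xarray, assuming cftime is installed:

    import numpy as np
    import xarray as xr

    # Dates on a no-leap (365-day) calendar, outside the range pandas.Timestamp supports.
    times = xr.cftime_range("0001-01-01", periods=730, freq="D", calendar="noleap")
    da = xr.DataArray(np.arange(730), coords=[times], dims="time")

    da.sel(time="0001-02")             # partial datetime-string indexing
    da.groupby("time.month").mean()    # field accessors enable groupby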

  • ENH: use `dask.array.apply_gufunc` in `xr.apply_ufunc`

    Use dask.array.apply_gufunc in xr.apply_ufunc for multiple outputs when dask='parallelized'; add/fix tests. (A usage sketch follows the checklists below.)

    • [x] Closes #1815, closes #4015
    • [x] Tests added
    • [x] Passes isort -rc . && black . && mypy . && flake8
    • [x] Fully documented, including whats-new.rst for all changes and api.rst for new API

    Remaining Issues:

    • [ ] find a fitting name for the current dask_gufunc_kwargs
    • [ ] rephrase the dask docs to fit the new behaviour
    • [ ] combine output_core_dims and output_sizes, e.g. xr.apply_ufunc(..., output_core_dims=[{"abc": 2}])
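
    A minimal sketch of the behaviour this PR targets, multiple outputs with dask='parallelized', assuming dask is installed (the min_max kernel is purely illustrative):

    import numpy as np
    import xarray as xr

    def min_max(x):
        # gufunc-style kernel: consumes the core dimension, returns two outputs
        return x.min(axis=-1), x.max(axis=-1)

    da = xr.DataArray(np.random.randn(4, 6), dims=("y", "x")).chunk({"y": 2})

    # Dispatches through dask.array.apply_gufunc under the hood.
    lo, hi = xr.apply_ufunc(
        min_max,
        da,
        input_core_dims=[["x"]],
        output_core_dims=[[], []],
        dask="parallelized",
        output_dtypes=[da.dtype, da.dtype],
    )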
  • Multidimensional groupby

    Many datasets have a two-dimensional coordinate variable (e.g. longitude) which is different from the logical grid coordinates (e.g. nx, ny). (See #605.) For plotting purposes, this is solved by #608. However, we still might want to split / apply / combine over such coordinates. That has not been possible, because groupby only supports creating groups on one-dimensional arrays.

    This PR overcomes that issue by using stack to collapse multiple dimensions in the group variable. A minimal example of the new functionality is

    >>> da = xr.DataArray(
    ...     [[0, 1], [2, 3]],
    ...     coords={'lon': (['ny', 'nx'], [[30, 40], [40, 50]]),
    ...             'lat': (['ny', 'nx'], [[10, 10], [20, 20]])},
    ...     dims=['ny', 'nx'],
    ... )
    >>> da.groupby('lon').sum()
    <xarray.DataArray (lon: 3)>
    array([0, 3, 3])
    Coordinates:
      * lon      (lon) int64 30 40 50
    

    This feature could have broad applicability for many realistic datasets (particularly model output on irregular grids): for example, averaging non-rectangular grids zonally (i.e. in latitude), binning in temperature, etc.

    If you think this is worth pursuing, I would love some feedback.

    The PR is not complete. Some items to address are

    • [x] Create a specialized grouper to allow coarser bins. By default, if no grouper is specified, the GroupBy object uses all unique values to define the groups. With a high-resolution dataset, this could balloon to a huge number of groups. With the latitude example, we would like to be able to specify e.g. 1-degree bins. Usage would be da.groupby('lon', bins=range(-90, 90)). (See the sketch after this list.)
    • [ ] Allow specification of which dims to stack. For example, stack in space but keep time dimension intact. (Currently it just stacks all the dimensions of the group variable.)
    • [x] A nice example for the docs.
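
    For reference, the binning item above landed in released xarray as a separate groupby_bins method rather than a bins keyword; a sketch of both spellings:

    import xarray as xr

    da = xr.DataArray(
        [[0, 1], [2, 3]],
        coords={'lon': (['ny', 'nx'], [[30, 40], [40, 50]]),
                'lat': (['ny', 'nx'], [[10, 10], [20, 20]])},
        dims=['ny', 'nx'],
    )

    da.groupby('lon').sum()                              # one group per unique value
    da.groupby_bins('lon', bins=[25, 35, 45, 55]).sum()  # coarser, user-defined bins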
  • release v0.18.0

    As discussed in the meeting, we should issue a release soon with the new backend refactor and the new docs theme.

    Here's a list of blockers:

    • [x] #5231
    • [x] #5073
    • [x] #5235

    Would be nice to include, and look done:

    • [x] #5244
    • [x] #5258
    • [x] #5101
    • [x] ~#4866~ (we should let this sit on master for a while to find bugs)
    • [x] #4902
    • [x] ~#4972~ (this should probably also sit on master for a while)
    • [x] #5227
    • [x] #4740
    • [x] #5149

    Somewhat important, but no PR yet:

    • [x] ~#5175~ (as pointed out by @shoyer, this is really a new feature, not a regression, it can wait)

    @TomNicholas and @alexamici volunteered to handle this. I can be online at release time to help with things if needed.

    Release instructions are here: https://github.com/pydata/xarray/blob/master/HOW_TO_RELEASE.md

    IIRC they'll need to be added to the PyPI list and RTD list.

  • WIP: indexing with broadcasting

    • [x] Closes #1444, closes #1436
    • [x] Tests added / passed
    • [x] Passes git diff master | flake8 --diff
    • [x] Fully documented, including whats-new.rst for all changes and api.rst for new API

    xref https://github.com/pydata/xarray/issues/974#issuecomment-313977794
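
    For context, a minimal sketch of the broadcasting (vectorized) indexing this PR introduced, as it behaves in released xarray:

    import numpy as np
    import xarray as xr

    da = xr.DataArray(np.arange(12).reshape(3, 4), dims=("x", "y"))

    # DataArray indexers sharing a dimension are broadcast against each other,
    # selecting the points (x=0, y=1) and (x=2, y=3) instead of an outer product.
    pts = da.isel(
        x=xr.DataArray([0, 2], dims="points"),
        y=xr.DataArray([1, 3], dims="points"),
    )
    assert pts.dims == ("points",)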

  • Appending to zarr store

    This pull request makes it possible to append an xarray Dataset to an existing Zarr store.

    • [x] Closes #2022
    • [x] Tests will be added. Wanted to get an opinion if this is what is imagined by the community
    • [x] Fully documented, including whats-new.rst for all changes and api.rst for new API

    To filter the data written to the array, the dimension along which the data will be appended has to be stated explicitly (see the sketch below). If someone has an idea how to overcome this, I would be more than happy to incorporate the necessary changes into the PR. Cheers, Jendrik
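
    For reference, the feature shipped as an append_dim argument to to_zarr; a minimal sketch (the store path is illustrative), assuming zarr is installed:

    import numpy as np
    import pandas as pd
    import xarray as xr

    ds = xr.Dataset(
        {"foo": (("time",), np.zeros(3))},
        coords={"time": pd.date_range("2000-01-01", periods=3)},
    )
    ds.to_zarr("example.zarr", mode="w")

    # Explicitly state the dimension along which to append; the existing
    # arrays in the store are extended along it.
    more = ds.assign_coords(time=ds.time + pd.Timedelta(days=3))
    more.to_zarr("example.zarr", append_dim="time")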
  • Integration with dask/distributed (xarray backend design)

    Dask (https://github.com/dask/dask) currently provides on-node parallelism for medium-size data problems. However, large climate data sets constitute a big-data problem and will require multiple-node parallelism to analyze. A likely solution is the integration of distributed (https://github.com/dask/distributed) with dask. Distributed is now integrated with dask and its benefits are already starting to be realized, e.g., see http://matthewrocklin.com/blog/work/2016/02/26/dask-distributed-part-3.

    Thus, this issue is designed to identify, at a high level, the steps needed to perform this integration. As stated by @shoyer, it will

    "definitely require some refactoring of the xarray backend system to make this work cleanly, but that's OK -- the xarray backend system is indicated as experimental/internal API precisely because we hadn't figured out all the use cases yet."

    To be honest, I've never been entirely happy with the design we took there (we use inheritance rather than composition for backend classes), but we did get it to work for our use cases. Some refactoring with an eye towards compatibility with dask distributed seems like a very worthwhile endeavor. We do have the benefit of a pretty large test suite covering existing use cases.

    Thus, we have the chance to make xarray big-data capable as well as provide improvements to the backend.

    To this end, I'm starting this issue to help begin the design process following the xarray mailing list discussion some of us have been having (@shoyer, @mrocklin, @rabernat).

    Task To Do List:

    • [x] Verify asynchronous access error for to_netcdf output is resolved (e.g., https://github.com/pydata/xarray/issues/793)
    • [x] LRU-cached file IO supporting serialization to robustly support HDF/NetCDF reads
  • Html repr

    This PR supersedes #1820 - see that PR for original discussion. See this gist to try out the new MultiIndex and options functionality.

    • [x] Closes #1627, closes #1820
    • [x] Tests added
    • [x] Passes black . && mypy . && flake8
    • [x] Fully documented, including whats-new.rst for all changes and api.rst for new API

    TODO:

    • [x] Add support for Multi-indexes
    • [x] Probably good to have some opt-in or fallback system for cases where we (or users) know that the rendering will not work (see the note after this list)
    • [x] Add some tests
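
    For reference, the opt-in mentioned above shipped as the display_style option:

    import xarray as xr

    # Opt in to the rich HTML repr (later made the notebook default);
    # display_style="text" restores the plain repr.
    xr.set_options(display_style="html")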
  • Fixes OS error arising from too many files open

    Previously, DataStore did not judiciously close files, so opening a large number of files could trigger an OSError about too many open files. This merge provides a solution for the netCDF, scipy, and h5netcdf backends.

  • cov() and corr() - finalization

    Dear maintainers,

    this is my first PR and contribution to xarray. I have read the contributing guidelines and did my best to conform to your code of conduct. However, I would be happy if you could take a closer look to check that everything is correct and give feedback for improvement!

    As I don't have push access to @hrishikeshac's fork, I created this PR to add the final changes for PR #2652.

    • [x] Closes #2652
    • [x] Tests had been added in #2652
    • [ ] Passes black . && mypy . && flake8
      • [x] black
      • [x] flake8
      • [ ] mypy
    • [x] Fully documented, including whats-new.rst for all changes and api.rst for new API
      • [x] whats-new.rst
      • [x] api.rst
    • [x] example notebook for usage of corr() and cov()
      • [x] where and how to write the examples?
      • [x] covariance
      • [x] correlation
    • [x] moved implementation of cov() and corr() to computation
    • [x] add test_cov()
    • [x] remove da.cov and da.corr
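
    For reference, the functions landed as top-level xr.cov and xr.corr; a minimal usage sketch:

    import numpy as np
    import xarray as xr

    da_a = xr.DataArray(np.random.randn(5, 3), dims=("time", "space"))
    da_b = xr.DataArray(np.random.randn(5, 3), dims=("time", "space"))

    xr.cov(da_a, da_b, dim="time")    # covariance along "time"
    xr.corr(da_a, da_b, dim="time")   # Pearson correlation along "time"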
  • cannot set figsize when plotting map with cartopy

    What happened?

    I wish to set the figsize when plotting a map using DataArray.plot, but it throws ValueError: cannot use subplot_kws with existing ax. It seems that figsize creates a traditional Cartesian axes first, which prevents subplot_kws from setting up a GeoAxes with a cartopy projection.

    What did you expect to happen?

    A map plot with its figsize adjustable.

    Nevertheless, I can work around this problem by first creating a new figure with matplotlib.pyplot.figure and then plotting with DataArray.plot:

    import xarray as xr
    import matplotlib.pyplot as plt
    import cartopy.crs as ccrs
    
    air = xr.tutorial.open_dataset("air_temperature").air
    fig = plt.figure(figsize=(8, 3))
    p = air.isel(time=0).plot.pcolormesh(
        subplot_kws=dict(projection=ccrs.PlateCarree(), facecolor="gray"),
        transform=ccrs.PlateCarree(),
    )
    

    Minimal Complete Verifiable Example

    import xarray as xr
    import cartopy.crs as ccrs
    
    air = xr.tutorial.open_dataset("air_temperature").air
    p = air.isel(time=0).plot.pcolormesh(
        figsize=(7, 3.5),
        subplot_kws=dict(projection=ccrs.PlateCarree(), facecolor="gray"),
        transform=ccrs.PlateCarree(),
    )
    

    MVCE confirmation

    • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
    • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
    • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
    • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

    Relevant log output

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    Input In [1], in <cell line: 5>()
          2 import cartopy.crs as ccrs
          4 air = xr.tutorial.open_dataset("air_temperature").air
    ----> 5 p = air.isel(time=0).plot.pcolormesh(
          6     figsize=(7, 3.5),
          7     subplot_kws=dict(projection=ccrs.PlateCarree(), facecolor="gray"),
          8     transform=ccrs.PlateCarree(),
          9 )
    
    File /srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/plot/plot.py:1310, in _plot2d.<locals>.plotmethod(_PlotMethods_obj, x, y, figsize, size, aspect, ax, row, col, col_wrap, xincrease, yincrease, add_colorbar, add_labels, vmin, vmax, cmap, colors, center, robust, extend, levels, infer_intervals, subplot_kws, cbar_ax, cbar_kwargs, xscale, yscale, xticks, yticks, xlim, ylim, norm, **kwargs)
       1308 for arg in ["_PlotMethods_obj", "newplotfunc", "kwargs"]:
       1309     del allargs[arg]
    -> 1310 return newplotfunc(**allargs)
    
    File /srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/plot/plot.py:1210, in _plot2d.<locals>.newplotfunc(darray, x, y, figsize, size, aspect, ax, row, col, col_wrap, xincrease, yincrease, add_colorbar, add_labels, vmin, vmax, cmap, center, robust, extend, levels, infer_intervals, colors, subplot_kws, cbar_ax, cbar_kwargs, xscale, yscale, xticks, yticks, xlim, ylim, norm, **kwargs)
       1206 if "imshow" == plotfunc.__name__ and isinstance(aspect, str):
       1207     # forbid usage of mpl strings
       1208     raise ValueError("plt.imshow's `aspect` kwarg is not available in xarray")
    -> 1210 ax = get_axis(figsize, size, aspect, ax, **subplot_kws)
       1212 primitive = plotfunc(
       1213     xplt,
       1214     yplt,
       (...)
       1221     **kwargs,
       1222 )
       1224 # Label the plot with metadata
    
    File /srv/conda/envs/notebook/lib/python3.9/site-packages/xarray/plot/utils.py:443, in get_axis(figsize, size, aspect, ax, **kwargs)
        440     raise ValueError("cannot provide `aspect` argument without `size`")
        442 if kwargs and ax is not None:
    --> 443     raise ValueError("cannot use subplot_kws with existing ax")
        445 if ax is None:
        446     ax = _maybe_gca(**kwargs)
    
    ValueError: cannot use subplot_kws with existing ax
    

    Anything else we need to know?

    No response

    Environment

    INSTALLED VERSIONS

    commit: None python: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) [GCC 10.3.0] python-bits: 64 OS: Linux OS-release: 5.14.0-1051-oem machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.12.2 libnetcdf: 4.8.1

    xarray: 2022.6.0 pandas: 1.5.0 numpy: 1.22.4 scipy: 1.9.1 netCDF4: 1.6.1 pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: 1.6.2 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: None bottleneck: 1.3.5 dask: 2022.9.1 distributed: 2022.9.1 matplotlib: 3.6.0 cartopy: 0.21.0 seaborn: 0.12.0 numbagg: None fsspec: 2022.8.2 cupy: None pint: None sparse: None flox: None numpy_groupies: None setuptools: 65.3.0 pip: 22.2.2 conda: None pytest: None IPython: 8.5.0 sphinx: None

  • Can't unstack concatenated DataArrays

    What happened?

    I had a collection of DataArrays with a stacked dimension (dimension whose corresponding index is a MultiIndex). I concatenated them into a single DataArray, then tried to unstack the stacked dimension, which failed. Performing the operations in the other order works (unstacking each DataArray, then concatenating the unstacked arrays).

    What did you expect to happen?

    I expected that concatenating the arrays then unstacking them would produce the same array as unstacking them then concatenating them, but with the possibility of saving the intermediate concatenated-but-still-stacked DataArray for later use as a template.

    Minimal Complete Verifiable Example

    import pandas as pd
    import xarray
    index = pd.MultiIndex.from_product([range(3), range(5)])
    arr = xarray.DataArray.from_series(pd.Series(range(15), index=index)).stack(index0=["level_0", "level_1"])
    arr.unstack("index0")
    
    arr2 = xarray.concat([arr, arr], dim="index2")
    arr2.unstack("index0")
    

    MVCE confirmation

    • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
    • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
    • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
    • [X] New issue — a search of GitHub Issues suggests this is not a duplicate.

    Relevant log output

    <xarray.DataArray (level_0: 3, level_1: 5)>
    array([[ 0,  1,  2,  3,  4],
           [ 5,  6,  7,  8,  9],
           [10, 11, 12, 13, 14]])
    Coordinates:
      * level_0  (level_0) int64 0 1 2
      * level_1  (level_1) int64 0 1 2 3 4
    
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "~/.conda/envs/plotting/lib/python3.10/site-packages/xarray/core/dataarray.py", line 2402, in unstack
        ds = self._to_temp_dataset().unstack(dim, fill_value, sparse)
      File "~/.conda/envs/plotting/lib/python3.10/site-packages/xarray/core/dataset.py", line 4618, in unstack
        raise ValueError(
    ValueError: cannot unstack dimensions that do not have exactly one multi-index: ('index0',)
    

    Anything else we need to know?

    The eventual problem to which I wish to apply the solution has two stacked dimensions rather than one, but that's likely irrelevant.

    Environment

    INSTALLED VERSIONS

    commit: None python: 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:35:26) [GCC 10.4.0] python-bits: 64 OS: Linux OS-release: 3.10.0-1160.76.1.el7.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.12.1 libnetcdf: 4.8.1

    xarray: 2022.6.0 pandas: 1.4.2 numpy: 1.22.3 scipy: 1.8.0 netCDF4: 1.6.0 pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: 1.5.1.1 nc_time_axis: None PseudoNetCDF: None rasterio: None cfgrib: None iris: 3.2.1.post0 bottleneck: 1.3.5 dask: 2022.7.1 distributed: 2022.7.1 matplotlib: 3.5.1 cartopy: 0.20.3 seaborn: 0.12.0 numbagg: None fsspec: 2022.5.0 cupy: None pint: None sparse: None flox: None numpy_groupies: None setuptools: 61.3.1 pip: 22.0.4 conda: 4.14.0 pytest: 7.1.3 IPython: None sphinx: None

  • Convert xarray dataset to pandas dataframe is much slower in newest xarray version

    What is your issue?

    Converting an xarray dataset to pandas dataframe has become much slower in the newest xarray version.

    I want to read in very large netcdf files, extract a slice, and convert the slice to a pandas dataframe. For an input size of 2 GB, xarray version 0.21.0 takes 3 seconds, whereas xarray version 2022.6.0 takes 44 seconds. See the table below for more tests with increasing dataset size.

    Number of NetCDF input files (~1 GB per file) |    2 |    5 |   10 |   15 |   20 |   30 |    40
    --------------------------------------------- | ---- | ---- | ---- | ---- | ---- | ---- | -----
    Older xarray version 0.21.0 (m:ss)            | 0:03 | 0:02 | 0:04 | 0:06 | 0:09 | 0:13 |  0:17
    Newer xarray version 2022.6.0 (m:ss)          | 0:44 | 1:30 | 2:46 | 4:01 | 5:23 | 7:56 | 10:29

    Here is my code:

    import xarray as xr

    # Read in a list of netcdf files and combine into a single dataset.
    with xr.open_mfdataset(infile_list, combine='by_coords') as ds:

        # Extract the data for a single location (the nearest grid point)
        # using the provided coordinates (lat/lon).
        ds_slice = ds.sel(lon=-84.725, lat=42.3583, method='nearest')

        # Convert the xarray dataset to a pandas dataframe.
        # This is now the slow part since the xarray library was updated.
        df = ds_slice.to_dataframe()
    

    The netcdf files I am reading in are about 1 GB each, containing daily weather data for the entire CONUS. There is 1 file per year, so if I read in 2 files, the dimensions are (lon: 1386, lat: 585, day: 731, crs: 1) with coordinates of lon, lat, day, and crs. They include 8 float data variables.

  • File lock not being released using rasterio engine

    What happened?

    Opening an ASCII raster file with xarray via the rasterio engine (rioxarray extension) and then trying to delete the ASCII file fails with a PermissionError: [WinError 32] The process cannot access the file because it is being used by another process.

    Doing the same with the direct rioxarray.open_rasterio() method works fine - the file can be deleted.

    What did you expect to happen?

    Expect that the file lock is released upon the context manager closing so that I can delete the input file.

    Minimal Complete Verifiable Example

    # This doesn't work:
    with xarray.open_dataset(ascii_file, engine="rasterio") as ds:
        ds = ds.load()
    ascii_file.unlink()
    
    > PermissionError: [WinError 32] The process cannot access the file because it is being used by another process
    
    # This works as expected:
    with rioxarray.open_rasterio(ascii_file) as ds:
        ds = ds.load()
    ascii_file.unlink()
    

    MVCE confirmation

    • [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
    • [X] Complete example — the example is self-contained, including all data and the text of any traceback.
    • [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
    • [x] New issue — a search of GitHub Issues suggests this is not a duplicate.

    Relevant log output

    No response

    Anything else we need to know?

    No response

    Environment

    xarray: 2022.6.0 pandas: 1.5.0 numpy: 1.23.3 scipy: 1.9.1 netCDF4: 1.6.1 pydap: None h5netcdf: None h5py: None Nio: None zarr: None cftime: 1.6.2 nc_time_axis: None PseudoNetCDF: None rasterio: 1.3.2 cfgrib: None iris: None bottleneck: None dask: None distributed: None matplotlib: None cartopy: None seaborn: None numbagg: None fsspec: None cupy: None pint: None sparse: None flox: None numpy_groupies: None setuptools: 65.3.0 pip: 22.2.2 conda: None pytest: 7.1.3 IPython: None sphinx: None rioxarray: 0.12.2

  • add dictionary-based integer assignment example (GH7043)

    I have attempted to add an example of dictionary-based assignment to the documentation, following the discussion with @mathause and @dcherian. Any further suggestions are appreciated. (A minimal sketch of the pattern follows.)
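
    For reference, a minimal sketch of the pattern the example documents (dictionary-based positional assignment, using standard xarray indexing):

    import numpy as np
    import xarray as xr

    da = xr.DataArray(np.zeros((2, 3)), dims=("x", "y"))

    # Assign by integer position along named dimensions, independent of axis order.
    da[dict(x=0, y=slice(1, None))] = 1.0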

  • Should Xarray have a read_csv method?

    Is your feature request related to a problem?

    Most users of Xarray/Pandas start with an IO call of some sort. In Xarray, our open_dataset(..., engine=engine) interface provides an extensible interface to more complex backends (NetCDF, Zarr, GRIB, etc.). For tabular data types, we have traditionally pointed users to Pandas. While this works for users who are comfortable with Pandas, it is an added hurdle for users getting started with Xarray.

    Describe the solution you'd like

    It should be easy and obvious how a user can get a CSV (or other tabular data) into Xarray. Ideally, we don't force the user to use a third-party library.

    Describe alternatives you've considered

    I can think of three possible solutions:

    1. We expose a new function read_csv; it might do something like this:
    import pandas as pd
    import xarray as xr

    def read_csv(filepath_or_buffer, **kwargs):
        df = pd.read_csv(filepath_or_buffer, **kwargs)
        ds = xr.Dataset.from_dataframe(df)
        return ds
    
    2. We develop a storage backend to support reading CSV-like data (a sketch of this option follows below):
    ds = open_dataset(filepath, engine='csv')
    
    3. We take (1) as an example and put it in Xarray's documentation, explicitly showing how you would use Pandas to produce a Dataset from a CSV.
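
    For option 2, a sketch of what a minimal backend could look like using xarray's documented BackendEntrypoint interface; the class name CSVBackendEntrypoint is hypothetical, and it would need to be registered under the xarray.backends entry-point group for engine='csv' to resolve:

    import pandas as pd
    import xarray as xr
    from xarray.backends import BackendEntrypoint

    class CSVBackendEntrypoint(BackendEntrypoint):
        # Once registered, open_dataset(path, engine="csv") dispatches here.
        def open_dataset(self, filename_or_obj, *, drop_variables=None, **kwargs):
            df = pd.read_csv(filename_or_obj, **kwargs)
            ds = xr.Dataset.from_dataframe(df)
            if drop_variables:
                ds = ds.drop_vars(drop_variables)
            return ds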