Combines Bayesian analyses from many datasets.

PosteriorStacker

Introduction

Fitting a model to a data set gives posterior probability distributions for a parameter of interest. But how do you combine such probability distributions if you have many datasets?

This question arises frequently in astronomy when analysing samples, and trying to infer sample distributions of some quantity.

PosteriorStacker derives the sample distribution of a quantity from the posterior distributions of a number of objects.

Method

The method is described in Appendix A of Baronchelli, Nandra & Buchner (2020).

hbm.png

The inputs are posterior samples of a single parameter, for a number of objects. These need to come from pre-existing analyses, under a flat parameter prior.

The hierarchical Bayesian model (illustrated above) models the sample distribution as a Gaussian with unknown mean and standard deviation. The per-object parameters are also unknown, but integrated out numerically using the posterior samples.
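As a rough illustration of the marginalisation described above, the following sketch evaluates the log-likelihood of a Gaussian sample distribution by averaging the Gaussian density over each object's posterior samples. It is not taken from posteriorstacker.py: the function name gaussian_population_loglike, the array layout (one row per object) and the toy data are assumptions made for this example.

```python
import numpy as np

def gaussian_population_loglike(samples, mean, std):
    """Log-likelihood of a Gaussian sample distribution (illustrative sketch).

    samples: array of shape (n_objects, n_samples) holding posterior samples
             obtained under a flat prior (assumed layout).
    """
    samples = np.asarray(samples, dtype=float)
    # Gaussian density evaluated at every posterior sample
    density = np.exp(-0.5 * ((samples - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))
    # Monte Carlo integral over each object's parameter: average over its samples
    per_object = density.mean(axis=1)
    # objects are independent, so their log-likelihoods add;
    # the tiny offset avoids log(0)
    return np.log(per_object + 1e-300).sum()

rng = np.random.default_rng(1)
# toy input: 5 objects with 400 posterior samples each
centres = rng.normal(0.0, 10.0, size=(5, 1))
toy = rng.normal(loc=centres, scale=5.0, size=(5, 400))

print(gaussian_population_loglike(toy, mean=0.0, std=10.0))
print(gaussian_population_loglike(toy, mean=500.0, std=10.0))
```

A sampler such as UltraNest would call a function like this many times with different values of mean and std. Because the integral is a finite average over samples, the result is only as accurate as the number of samples per object allows.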

In addition to the Gaussian model (as in the paper), a histogram model (using a flat Dirichlet prior distribution) is computed, which is non-parametric and more flexible. Both models are inferred using UltraNest.
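To make the histogram model more concrete, here is a minimal sketch under stated assumptions. The bin weights are drawn from a flat Dirichlet prior by normalising exponential variates, and the likelihood averages the piecewise-constant density over each object's posterior samples. The function names and the normalisation by bin width are assumptions for illustration and may differ from what posteriorstacker.py does.

```python
import numpy as np

def flat_dirichlet_transform(u):
    """Map unit-cube coordinates to bin weights following a flat Dirichlet prior."""
    # -log(u) is exponentially distributed; normalising gives Dirichlet(1, ..., 1)
    x = -np.log(np.asarray(u, dtype=float))
    return x / x.sum()

def histogram_loglike(samples, weights, low, high):
    """Log-likelihood of a histogram sample distribution with the given bin weights."""
    samples = np.asarray(samples, dtype=float)
    weights = np.asarray(weights, dtype=float)
    edges = np.linspace(low, high, len(weights) + 1)
    # fraction of each object's posterior samples that falls into each bin
    counts = np.array([np.histogram(row, bins=edges)[0] for row in samples])
    fractions = counts / samples.shape[1]
    # density inside a bin is weight / bin width
    density = weights / np.diff(edges)
    # averaging the density over an object's samples integrates out its parameter
    per_object = fractions @ density
    return np.log(per_object + 1e-300).sum()

rng = np.random.default_rng(2)
toy = rng.normal(0.0, 10.0, size=(5, 400))
w = flat_dirichlet_transform(rng.uniform(size=11))
print(w.sum(), histogram_loglike(toy, w, -80, 80))
```

Samples outside the range from low to high contribute nothing in this sketch, so the range should cover essentially all posterior samples.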

The output is visualised in a publication-ready plot.

Synopsis of the program:

$ python3 posteriorstacker.py --help
usage: posteriorstacker.py [-h] [--verbose VERBOSE] [--name NAME]
                           filename low high nbins

Posterior stacking tool.

Johannes Buchner (C) 2020-2021

Given posterior distributions of some parameter from many objects,
computes the sample distribution, using a simple hierarchical model.

The method is described in Baronchelli, Nandra & Buchner (2020)
https://ui.adsabs.harvard.edu/abs/2020MNRAS.498.5284B/abstract
Two computations are performed with this tool:

- Gaussian model (as in the paper)
- Histogram model (using a Dirichlet prior distribution)

The histogram model is non-parametric and more flexible.
Both models are computed using UltraNest.
The output is plotted.

positional arguments:
  filename           Filename containing posterior samples, one object per line
  low                Lower end of the distribution
  high               Upper end of the distribution
  nbins              Number of histogram bins

optional arguments:
  -h, --help         show this help message and exit
  --verbose VERBOSE  Show progress
  --name NAME        Parameter name (for plot)

Johannes Buchner (C) 2020-2021 

Licence

AGPLv3 (see COPYING file). Contact me if you need a different licence.

Install

Clone or download this repository. You need to install the ultranest Python package (e.g., with pip).

Tutorial

In this tutorial you will learn:

  • How to find an intrinsic distribution from data with asymmetric error bars and upper limits
  • How to use PosteriorStacker

Let's say we want to find the intrinsic velocity dispersion given some noisy data points.

Our data are velocity measurements of a few globular clusters in a dwarf galaxy, each fitted with some model.

Preparing the inputs

For generating the demo input files and plots, run:

$ python3 tutorial/gendata.py
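If you want to feed your own analyses into the tool instead, the help text only asks for a file with one object per line. A minimal sketch of writing such a file, assuming whitespace-separated values and the same number of samples for every object (both assumptions). The file name my_posteriorsamples.txt and the synthetic chains are placeholders for your own results.

```python
import numpy as np

rng = np.random.default_rng(42)

# stand-ins for real posterior chains of 3 objects (synthetic, for illustration)
chains = [
    rng.normal(-5.0, 3.0, size=5000),
    rng.normal(12.0, 8.0, size=3000),
    rng.normal(0.0, 20.0, size=4000),
]

nsamples = 1000
# thin every chain to the same length so the file forms a rectangular table
rows = [rng.choice(chain, size=nsamples, replace=False) for chain in chains]

# one object per line, samples separated by spaces
np.savetxt("my_posteriorsamples.txt", np.array(rows))

# read the file back to check its shape: (objects, samples)
loaded = np.loadtxt("my_posteriorsamples.txt")
print(loaded.shape)  # (3, 1000)
```

Remember that the samples must come from analyses with a flat prior on the parameter, as noted in the Method section.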

Visualise the data

Let's plot the data first to see what is going on:

example.png

Caveat on language: These are not actually "the data" (which are counts on a CCD). Instead, this is an intermediate representation of a posterior/likelihood, assuming flat priors on velocity.

Data properties

This scatter plot shows:

  • large, sometimes asymmetric error bars
  • intrinsic scatter

Resampling the data

We could also represent each data point by a cloud of samples. Each sample represents a possible true velocity of that object.

example-samples.png
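One way to build such a cloud from a value with asymmetric error bars is to draw from a two-piece (split) normal distribution. This is only an illustrative assumption: tutorial/gendata.py may generate its samples differently, and the numbers below are made up.

```python
import numpy as np

def split_normal_samples(rng, centre, err_lo, err_hi, size):
    """Draw samples from a two-piece normal with different lower and upper widths."""
    # pick the upper side with probability proportional to its width,
    # which keeps the density continuous at the centre
    upper = rng.uniform(size=size) < err_hi / (err_lo + err_hi)
    # half-normal offsets, scaled by the width of the chosen side
    offset = np.abs(rng.normal(size=size))
    return np.where(upper, centre + err_hi * offset, centre - err_lo * offset)

rng = np.random.default_rng(0)
# hypothetical measurement: 10 km/s with error bars of -2 and +8 km/s
cloud = split_normal_samples(rng, centre=10.0, err_lo=2.0, err_hi=8.0, size=400)
print(np.percentile(cloud, [16, 50, 84]))
```

Stacking one such cloud per object, row by row, gives the kind of table the tool reads.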

Running PosteriorStacker

We run the script with a range from -80 to +80 km/s and 11 histogram bins:

$ python3 posteriorstacker.py posteriorsamples.txt -80 +80 11 --name="Velocity [km/s]"
fitting histogram model...
[ultranest] Sampling 400 live points from prior ...
[ultranest] Explored until L=-1e+01
[ultranest] Likelihood function evaluations: 114176
[ultranest] Writing samples and results to disk ...
[ultranest] Writing samples and results to disk ... done
[ultranest]   logZ = -20.68 +- 0.06865
[ultranest] Effective samples strategy satisfied (ESS = 684.4, need >400)
[ultranest] Posterior uncertainty strategy is satisfied (KL: 0.46+-0.08 nat, need <0.50 nat)
[ultranest] Evidency uncertainty strategy is satisfied (dlogz=0.14, need <0.5)
[ultranest]   logZ error budget: single: 0.07 bs:0.07 tail:0.41 total:0.41 required:<0.50
[ultranest] done iterating.

logZ = -20.677 +- 0.424
  single instance: logZ = -20.677 +- 0.074
  bootstrapped   : logZ = -20.676 +- 0.123
  tail           : logZ = +- 0.405
insert order U test : converged: False correlation: 377.0 iterations

    bin1                0.051 +- 0.046
    bin2                0.052 +- 0.051
    bin3                0.065 +- 0.058
    bin4                0.062 +- 0.057
    bin5                0.108 +- 0.085
    bin6                0.31 +- 0.14
    bin7                0.16 +- 0.10
    bin8                0.051 +- 0.050
    bin9                0.047 +- 0.044
    bin10               0.048 +- 0.047
    bin11               0.047 +- 0.045
fitting gaussian model...
[ultranest] Sampling 400 live points from prior ...
[ultranest] Explored until L=-4e+01
[ultranest] Likelihood function evaluations: 4544
[ultranest] Writing samples and results to disk ...
[ultranest] Writing samples and results to disk ... done
[ultranest]   logZ = -47.33 +- 0.07996
[ultranest] Effective samples strategy satisfied (ESS = 1011.4, need >400)
[ultranest] Posterior uncertainty strategy is satisfied (KL: 0.46+-0.07 nat, need <0.50 nat)
[ultranest] Evidency uncertainty strategy is satisfied (dlogz=0.17, need <0.5)
[ultranest]   logZ error budget: single: 0.13 bs:0.08 tail:0.41 total:0.41 required:<0.50
[ultranest] done iterating.

logZ = -47.341 +- 0.440
  single instance: logZ = -47.341 +- 0.126
  bootstrapped   : logZ = -47.331 +- 0.173
  tail           : logZ = +- 0.405
insert order U test : converged: False correlation: 13.0 iterations

    mean                -0.3 +- 4.7
    std                 11.6 +- 5.2

Vary the number of samples to check numerical stability!
plotting results ...

Note the parameters of the fitted Gaussian distribution above: the standard deviation (11.6 +- 5.2) is quite small, which was the point of the original paper. A corner plot is saved at posteriorsamples.txt_out_gauss/plots/corner.pdf

Visualising the results

Here is the output plot, converted to png for this tutorial with:

$ convert -density 100 posteriorsamples.txt_out.pdf out.png

out.png

In black, we see the non-parametric fit. The red curve shows the Gaussian model.

The histogram model indicates that a more heavy-tailed distribution may be better.

The error bars in gray are the result of naively averaging the posteriors. This is not a statistically meaningful procedure, but it can give you ideas about which models to try for the sample distribution.
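For comparison, naive averaging can be sketched as follows: histogram each object's posterior samples and average the histograms over the objects. How the tool computes the gray error bars is not stated here, so using the object-to-object scatter as the error bar is an assumption, as are the function name and the toy data.

```python
import numpy as np

def naive_stack(samples, low, high, nbins):
    """Average the per-object posterior histograms (not a proper inference)."""
    samples = np.asarray(samples, dtype=float)
    edges = np.linspace(low, high, nbins + 1)
    # normalised histogram of each object's posterior samples
    per_object = np.array(
        [np.histogram(row, bins=edges)[0] / len(row) for row in samples]
    )
    # mean over objects, with the object-to-object scatter as a rough error bar
    return edges, per_object.mean(axis=0), per_object.std(axis=0)

rng = np.random.default_rng(3)
toy = rng.normal(0.0, 10.0, size=(6, 400))
edges, stacked, scatter = naive_stack(toy, -80, 80, 11)
print(stacked)
```

The averaged histogram is broadened by the measurement uncertainties, which is why it tends to be wider than the inferred sample distribution.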

Output files

  • posteriorsamples.txt_out.pdf contains a plot.
  • posteriorsamples.txt_out_gauss contains the ultranest analysis output assuming a Gaussian distribution.
  • posteriorsamples.txt_out_flexN contains the ultranest analysis output assuming a histogram model.
  • The directories include diagnostic plots, corner plots and posterior samples of the distribution parameters.

With these output files, you can:

  • plot the sample parameter distribution
  • report the mean and spread, and their uncertainties
  • split the sample by some parameter, and plot the sample mean as a function of that parameter.
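Reporting the mean and spread with their uncertainties could look like the sketch below. It assumes that the output directory contains a text file of equally weighted posterior samples with the parameter names in the first line (for UltraNest this is typically chains/equal_weighted_post.txt, but check your output directory, since the exact location and layout are assumptions here). The demo writes a synthetic file in that layout so the example runs on its own.

```python
import os
import tempfile

import numpy as np

def summarise_posterior(path):
    """Print and return the mean and standard deviation of each parameter column."""
    with open(path) as f:
        names = f.readline().split()  # assumed: header line with parameter names
    draws = np.loadtxt(path, skiprows=1, ndmin=2)
    summary = {}
    for name, column in zip(names, draws.T):
        summary[name] = (column.mean(), column.std())
        print(f"{name:10s} {column.mean():.2f} +- {column.std():.2f}")
    return summary

# self-contained demo with a synthetic file in the assumed layout
demo = os.path.join(tempfile.mkdtemp(), "equal_weighted_post.txt")
fake_draws = np.column_stack([np.full(10, 1.0), np.full(10, 2.0)])
np.savetxt(demo, fake_draws, header="mean std", comments="")
result = summarise_posterior(demo)
```

To study a trend, run the tool separately on each subsample and compare the summaries of the resulting output directories.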

If you want to adjust the plot, just edit the script.

If you want to try a different distribution, adapt the script. It uses UltraNest for the inference.
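As an example of such an adaptation, the sketch below swaps the Gaussian density for a heavier-tailed Student-t density while keeping the averaging over posterior samples. It is a hypothetical illustration, not part of posteriorstacker.py: the function name, the parameters loc, scale and dof, and the array layout (one row per object) are assumptions.

```python
import math

import numpy as np

def student_t_population_loglike(samples, loc, scale, dof):
    """Log-likelihood of a Student-t sample distribution (illustrative sketch)."""
    samples = np.asarray(samples, dtype=float)
    z = (samples - loc) / scale
    # log of the normalisation constant of the Student-t density
    log_norm = (
        math.lgamma((dof + 1) / 2)
        - math.lgamma(dof / 2)
        - 0.5 * math.log(dof * math.pi)
        - math.log(scale)
    )
    density = np.exp(log_norm - 0.5 * (dof + 1) * np.log1p(z**2 / dof))
    # average over each object's samples, then sum the logs over objects
    return np.log(density.mean(axis=1) + 1e-300).sum()

# toy check: an object far from the centre is penalised less by heavy tails
outlier = np.full((1, 400), 50.0)
print(student_t_population_loglike(outlier, loc=0.0, scale=10.0, dof=2))
print(student_t_population_loglike(outlier, loc=0.0, scale=10.0, dof=1e6))
```

For large dof the Student-t density approaches the Gaussian, so the Gaussian model is recovered as a limiting case. In a real adaptation, loc, scale and dof would become free parameters with their own priors.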

Take-aways

  • PosteriorStacker computed an intrinsic distribution from a set of uncertain measurements.
  • This tool can combine arbitrary pre-existing analyses.
  • No assumptions about the posterior shapes were necessary -- multi-modal and asymmetric posteriors work fine.