# Combines Bayesian analyses from many datasets.

## PosteriorStacker

Combines Bayesian analyses from many datasets.

### Introduction

Fitting a model to a data set gives posterior probability distributions for a parameter of interest. But how do you combine such probability distributions if you have many datasets?

This question arises frequently in astronomy when analysing samples, and trying to infer sample distributions of some quantity.

PosteriorStacker allows deriving sample distributions from posterior distributions from a number of objects.

### Method

The method is described in Appendix A of Baronchelli, Nandra & Buchner (2020). The inputs are posterior samples of a single parameter, for a number of objects. These need to come from pre-existing analyses, under a flat parameter prior.

The hierarchical Bayesian model (illustrated above) models the sample distribution as a Gaussian with unknown mean and standard deviation. The per-object parameters are also unknown, but integrated out numerically using the posterior samples.

Additional to the Gaussian model (as in the paper), a histogram model (using a flat Dirichlet prior distribution) is computed, which is non-parametric and more flexible. Both models are inferred using UltraNest.

The output is visualised in a publication-ready plot.

Synopsis of the program:

```\$ python3 posteriorstacker.py --help
usage: posteriorstacker.py [-h] [--verbose VERBOSE] [--name NAME]
filename low high nbins

Posterior stacking tool.

Johannes Buchner (C) 2020-2021

Given posterior distributions of some parameter from many objects,
computes the sample distribution, using a simple hierarchical model.

The method is described in Baronchelli, Nandra & Buchner (2020)
Two computations are performed with this tool:

- Gaussian model (as in the paper)
- Histogram model (using a Dirichlet prior distribution)

The histogram model is non-parametric and more flexible.
Both models are computed using UltraNest.
The output is plotted.

positional arguments:
filename           Filename containing posterior samples, one object per line
low                Lower end of the distribution
high               Upper end of the distribution
nbins              Number of histogram bins

optional arguments:
-h, --help         show this help message and exit
--verbose VERBOSE  Show progress
--name NAME        Parameter name (for plot)

Johannes Buchner (C) 2020-2021
```

### Licence

AGPLv3 (see COPYING file). Contact me if you need a different licence.

### Install

Clone or download this repository. You need to install the ultranest python package (e.g., with pip).

## Tutorial

In this tutorial you will learn:

• How to find a intrinsic distribution from data with asymmetric error bars and upper limits
• How to use PosteriorStacker

Lets say we want to find the intrinsic velocity dispersion given some noisy data points.

Our data are velocity measurements of a few globular cluster velocities in a dwarf galaxy, fitted with some model.

### Preparing the inputs

For generating the demo input files and plots, run:

```\$ python3 tutorial/gendata.py
```

### Visualise the data

Lets plot the data first to see what is going on: Caveat on language: These are not actually "the data" (which are counts on a CCD). Instead, this is a intermediate representation of a posterior/likelihood, assuming flat priors on velocity.

### Data properties

This scatter plot shows:

• large, sometimes asymmetric error bars
• intrinsic scatter

### Resampling the data

We could also represent each data point by a cloud of samples. Each point represents a possible true solution of that galaxy. ## Running PosteriorStacker

We run the script with a range limit of +-100 km/s:

```\$ python3 posteriorstacker.py posteriorsamples.txt -80 +80 11 --name="Velocity [km/s]"
fitting histogram model...
[ultranest] Sampling 400 live points from prior ...
[ultranest] Explored until L=-1e+01
[ultranest] Likelihood function evaluations: 114176
[ultranest] Writing samples and results to disk ...
[ultranest] Writing samples and results to disk ... done
[ultranest]   logZ = -20.68 +- 0.06865
[ultranest] Effective samples strategy satisfied (ESS = 684.4, need >400)
[ultranest] Posterior uncertainty strategy is satisfied (KL: 0.46+-0.08 nat, need <0.50 nat)
[ultranest] Evidency uncertainty strategy is satisfied (dlogz=0.14, need <0.5)
[ultranest]   logZ error budget: single: 0.07 bs:0.07 tail:0.41 total:0.41 required:<0.50
[ultranest] done iterating.

logZ = -20.677 +- 0.424
single instance: logZ = -20.677 +- 0.074
bootstrapped   : logZ = -20.676 +- 0.123
tail           : logZ = +- 0.405
insert order U test : converged: False correlation: 377.0 iterations

bin1                0.051 +- 0.046
bin2                0.052 +- 0.051
bin3                0.065 +- 0.058
bin4                0.062 +- 0.057
bin5                0.108 +- 0.085
bin6                0.31 +- 0.14
bin7                0.16 +- 0.10
bin8                0.051 +- 0.050
bin9                0.047 +- 0.044
bin10               0.048 +- 0.047
bin11               0.047 +- 0.045
fitting gaussian model...
[ultranest] Sampling 400 live points from prior ...
[ultranest] Explored until L=-4e+01
[ultranest] Likelihood function evaluations: 4544
[ultranest] Writing samples and results to disk ...
[ultranest] Writing samples and results to disk ... done
[ultranest]   logZ = -47.33 +- 0.07996
[ultranest] Effective samples strategy satisfied (ESS = 1011.4, need >400)
[ultranest] Posterior uncertainty strategy is satisfied (KL: 0.46+-0.07 nat, need <0.50 nat)
[ultranest] Evidency uncertainty strategy is satisfied (dlogz=0.17, need <0.5)
[ultranest]   logZ error budget: single: 0.13 bs:0.08 tail:0.41 total:0.41 required:<0.50
[ultranest] done iterating.

logZ = -47.341 +- 0.440
single instance: logZ = -47.341 +- 0.126
bootstrapped   : logZ = -47.331 +- 0.173
tail           : logZ = +- 0.405
insert order U test : converged: False correlation: 13.0 iterations

mean                -0.3 +- 4.7
std                 11.6 +- 5.2

Vary the number of samples to check numerical stability!
plotting results ...
```

Notice the parameters of the fitted gaussian distribution above. The standard deviation is quite small (which was the point of the original paper). A corner plot is at posteriorsamples.txt_out_gauss/plots/corner.pdf

### Visualising the results

Here is the output plot, converted to png for this tutorial with:

```\$ convert -density 100 posteriorsamples.txt_out.pdf out.png
``` In black, we see the non-parametric fit. The red curve shows the gaussian model.

The histogram model indicates that a more heavy-tailed distribution may be better.

The error bars in gray is the result of naively averaging the posteriors. This is not a statistically meaningful procedure, but it can give you ideas what models you may want to try for the sample distribution.

### Output files

• posteriorsamples.txt_out.pdf contains a plot,
• posteriorsamples.txt_out_gauss contain the ultranest analyses output assuming a Gaussian distribution.
• posteriorsamples.txt_out_flexN contain the ultranest analyses output assuming a histogram model.
• The directories include diagnostic plots, corner plots and posterior samples of the distribution parameters.

With these output files, you can:

• plot the sample parameter distribution
• report the mean and spread, and their uncertainties
• split the sample by some parameter, and plot the sample mean as a function of that parameter.

If you want to adjust the plot, just edit the script.

If you want to try a different distribution, adapt the script. It uses UltraNest for the inference.

### Take-aways

• PosteriorStacker computed a intrinsic distribution from a set of uncertain measurements
• This tool can combine arbitrarily pre-existing analyses.
• No assumptions about the posterior shapes were necessary -- multi-modal and asymmetric works fine.
###### Johannes Buchner ### A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning

imbalanced-learn imbalanced-learn is a python package offering a number of re-sampling techniques commonly used in datasets showing strong between-cla

Aug 5, 2022

### A Python library for detecting patterns and anomalies in massive datasets using the Matrix Profile

matrixprofile-ts matrixprofile-ts is a Python 2 and 3 library for evaluating time series data using the Matrix Profile algorithms developed by the Keo

Jul 30, 2022

### Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.

Petastorm Contents Petastorm Installation Generating a dataset Plain Python API Tensorflow API Pytorch API Spark Dataset Converter API Analyzing petas

Aug 2, 2022

### Meerkat provides fast and flexible data structures for working with complex machine learning datasets.

Meerkat makes it easier for ML practitioners to interact with high-dimensional, multi-modal data. It provides simple abstractions for data inspection, model evaluation and model training supported by efficient and robust IO under the hood.

Jul 26, 2022

### This repository has datasets containing information of Uber pickups in NYC from April 2014 to September 2014 and January to June 2015. data Analysis , virtualization and some insights are gathered here

uber-pickups-analysis Data Source: https://www.kaggle.com/fivethirtyeight/uber-pickups-in-new-york-city Information about data set The dataset contain

Nov 3, 2021

### Interactive Web App with Streamlit and Scikit-learn that applies different Classification algorithms to popular datasets

Interactive Web App with Streamlit and Scikit-learn that applies different Classification algorithms to popular datasets Datasets Used: Iris dataset,

Nov 18, 2021

### PLUR is a collection of source code datasets suitable for graph-based machine learning.

PLUR (Programming-Language Understanding and Repair) is a collection of source code datasets suitable for graph-based machine learning. We provide scripts for downloading, processing, and loading the datasets. This is done by offering a unified API and data structures for all datasets.

May 24, 2022

### A Python 3.6+ package to run .many files, where many programs written in many languages may exist in one file.

RunMany Intro | Installation | VSCode Extension | Usage | Syntax | Settings | About A tool to run many programs written in many languages from one fil

May 22, 2022

### Python script that analyses the given datasets and comes up with the best polynomial regression representation with the smallest polynomial degree possible

Python script that analyses the given datasets and comes up with the best polynomial regression representation with the smallest polynomial degree possible, to be the most reliable with the least complexity possible

Jan 5, 2022

### aka "Bayesian Methods for Hackers": An introduction to Bayesian methods + probabilistic programming with a computation/understanding-first, mathematics-second point of view. All in pure Python ;)

Bayesian Methods for Hackers Using Python and PyMC The Bayesian method is the natural approach to inference, yet it is hidden from readers behind chap

Aug 3, 2022

### pyhsmm - library for approximate unsupervised inference in Bayesian Hidden Markov Models (HMMs) and explicit-duration Hidden semi-Markov Models (HSMMs), focusing on the Bayesian Nonparametric extensions, the HDP-HMM and HDP-HSMM, mostly with weak-limit approximations.

Bayesian inference in HSMMs and HMMs This is a Python library for approximate unsupervised inference in Bayesian Hidden Markov Models (HMMs) and expli

Jul 21, 2022

### Bayesian-Torch is a library of neural network layers and utilities extending the core of PyTorch to enable the user to perform stochastic variational inference in Bayesian deep neural networks

Bayesian-Torch is a library of neural network layers and utilities extending the core of PyTorch to enable the user to perform stochastic variational inference in Bayesian deep neural networks. Bayesian-Torch is designed to be flexible and seamless in extending a deterministic deep neural network architecture to corresponding Bayesian form by simply replacing the deterministic layers with Bayesian layers.

Jul 29, 2022

### Hierarchical-Bayesian-Defense - Towards Adversarial Robustness of Bayesian Neural Network through Hierarchical Variational Inference (Openreview)

Towards Adversarial Robustness of Bayesian Neural Network through Hierarchical V

Jul 26, 2022

### Distributed Grid Descent: an algorithm for hyperparameter tuning guided by Bayesian inference, designed to run on multiple processes and potentially many machines with no central point of control

Distributed Grid Descent: an algorithm for hyperparameter tuning guided by Bayesian inference, designed to run on multiple processes and potentially many machines with no central point of control.

Jan 1, 2022

### This is a cryptocurrency trading bot that analyses Reddit sentiment and places trades on Binance based on reddit post and comment sentiment. If you like this project please consider donating via brave. Thanks.

This is a cryptocurrency trading bot that analyses Reddit sentiment and places trades on Binance based on reddit post and comment sentiment. The bot f

Jul 31, 2022

### Functional Data Analysis, or FDA, is the field of Statistics that analyses data that depend on a continuous parameter.

Functional Data Analysis Python package

Jul 30, 2022

### A collection of resources/tools and analyses for the angr binary analysis framework.

Awesome angr A collection of resources/tools and analyses for the angr binary analysis framework. This page does not only collect links and external r

Jul 17, 2022

### Open Data Cube analyses continental scale Earth Observation data through time

Open Data Cube Core Overview The Open Data Cube Core provides an integrated gridded data analysis environment for decades of analysis ready earth obse

Aug 2, 2022

### PyIOmica (pyiomica) is a Python package for omics analyses.

PyIOmica (pyiomica) This repository contains PyIOmica, a Python package that provides bioinformatics utilities for analyzing (dynamic) omics datasets.

Jun 29, 2022
###### A python library for Bayesian time series modeling

PyDLM Welcome to pydlm, a flexible time series modeling library for python. This library is based on the Bayesian dynamic linear model (Harrison and W

Jul 29, 2022
###### ArviZ is a Python package for exploratory analysis of Bayesian models

ArviZ (pronounced "AR-vees") is a Python package for exploratory analysis of Bayesian models. Includes functions for posterior analysis, data storage, model checking, comparison and diagnostics

Aug 5, 2022
May 11, 2022
###### Bonsai: Gradient Boosted Trees + Bayesian Optimization

Bonsai is a wrapper for the XGBoost and Catboost model training pipelines that leverages Bayesian optimization for computationally efficient hyperparameter tuning.

May 2, 2022
Jan 4, 2022
###### Fourier-Bayesian estimation of stochastic volatility models

fourier-bayesian-sv-estimation Fourier-Bayesian estimation of stochastic volatility models Code used to run the numerical examples of "Bayesian Approa

Jun 20, 2022
###### BASTA: The BAyesian STellar Algorithm

BASTA: BAyesian STellar Algorithm Current stable version: v1.0 Important note: BASTA is developed for Python 3.8, but Python 3.7 should work as well.

Jun 27, 2022
###### Bayesian optimization based on Gaussian processes (BO-GP) for CFD simulations.

BO-GP Bayesian optimization based on Gaussian processes (BO-GP) for CFD simulations. The BO-GP codes are developed using GPy and GPyOpt. The optimizer

Mar 31, 2022
###### A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

Light Gradient Boosting Machine LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed a

Aug 8, 2022
###### A Python Module That Uses ANN To Predict A Stocks Price And Also Provides Accurate Technical Analysis With Many High Potential Implementations!

Stox A Module to predict the "close price" for the next day and give "technical analysis". It uses a Neural Network and the LSTM algorithm to predict

Dec 20, 2021