# Multicore t-SNE

This is a multicore modification of Barnes-Hut t-SNE by L. Van der Maaten with python and Torch CFFI-based wrappers. This code also works faster than sklearn.TSNE on 1 core.

# What to expect

Barnes-Hut t-SNE is done in two steps.

• First step: an efficient data structure for nearest neighbours search is built and used to compute probabilities. This can be done in parallel for each point in the dataset, this is why we can expect a good speed-up by using more cores.

• Second step: the embedding is optimized using gradient descent. This part is essentially consecutive so we can only optimize within iteration. In fact some parts can be parallelized effectively, but not all of them a parallelized for now. That is why second step speed-up will not be that significant as first step sepeed-up but there is still room for improvement.

So when can you benefit from parallelization? It is almost true, that the second step computation time is constant of D and depends mostly on N. The first part's time depends on D a lot, so for small D time(Step 1) << time(Step 2), for large D time(Step 1) >> time(Step 2). As we are only good at parallelizing step 1 we will benefit most when D is large enough (MNIST's D = 784 is large, D = 10 even for N=1000000 is not so much). I wrote multicore modification originally for Springleaf competition, where my data table was about 300000 x 3000 and only several days left till the end of the competition so any speed-up was handy.

# Benchmark

### 1 core

Interestingly, that this code beats other implementations. We compare to sklearn (Barnes-Hut of course), L. Van der Maaten's bhtsne, py_bh_tsne repo (cython wrapper for bhtsne with QuadTree). perplexity = 30, theta=0.5 for every run. In fact py_bh_tsne repo works at the same speed as this code when using more optimization flags for compiler.

This is a benchmark for 70000x784 MNIST data:

Method Step 1 (sec) Step 2 (sec)
MulticoreTSNE(n_jobs=1) 912 350
bhtsne 4257 1233
py_bh_tsne 1232 367
sklearn(0.18) ~5400 ~20920

I did my best to find what is wrong with sklearn numbers, but it is the best benchmark I could do (you can find test script in python/tests folder).

### Multicore

This table shows a relative to 1 core speed-up when using n cores.

n_jobs Step 1 Step 2
1 1x 1x
2 1.54x 1.05x
4 2.6x 1.2x
8 5.6x 1.65x

# How to use

Python and torch wrappers are available.

## Python

### Install

#### Directly from pypi

pip install MulticoreTSNE

#### From source

Make sure cmake is installed on your system, and you will also need a sensible C++ compiler, such as gcc or llvm-clang. On macOS, you can get both via homebrew.

To install the package, please do:

git clone https://github.com/DmitryUlyanov/Multicore-TSNE.git
cd Multicore-TSNE/
pip install .


Tested with both Python 2.7 and 3.6 (conda) and Ubuntu 14.04.

### Run

You can use it as a near drop-in replacement for sklearn.manifold.TSNE.

from MulticoreTSNE import MulticoreTSNE as TSNE

tsne = TSNE(n_jobs=4)
Y = tsne.fit_transform(X)


Please refer to sklearn TSNE manual for parameters explanation.

This implementation n_components=2, which is the most common case (use Barnes-Hut t-SNE or sklearn otherwise). Also note that some parameters are there just for the sake of compatibility with sklearn and are otherwise ignored. See MulticoreTSNE class docstring for more info.

#### MNIST example

from sklearn.datasets import load_digits
from MulticoreTSNE import MulticoreTSNE as TSNE
from matplotlib import pyplot as plt

embeddings = TSNE(n_jobs=4).fit_transform(digits.data)
vis_x = embeddings[:, 0]
vis_y = embeddings[:, 1]
plt.scatter(vis_x, vis_y, c=digits.target, cmap=plt.cm.get_cmap("jet", 10), marker='.')
plt.colorbar(ticks=range(10))
plt.clim(-0.5, 9.5)
plt.show()


### Test

You can test it on MNIST dataset with the following command:

python MulticoreTSNE/examples/test.py <n_jobs>


#### Note on jupyter use

To make the computation log visible in jupyter please install wurlitzer (pip install wurlitzer) and execute this line in any cell beforehand:

%load_ext wurlitzer


Memory leakages are possible if you interrupt the process. Should be OK if you let it run until the end.

## Torch

To install execute the following command from repository folder:

luarocks make torch/tsne-1.0-0.rockspec


or

luarocks install https://raw.githubusercontent.com/DmitryUlyanov/Multicore-TSNE/master/torch/tsne-1.0-0.rockspec


You can run t-SNE like that:

tsne = require 'tsne'

Y = tsne(X, n_components, perplexity, n_iter, angle, n_jobs)


torch.DoubleTensor type only supported for now.

# Future work

• Allow other types than double
• Improve step 2 performance (possible)

# Citation

@misc{Ulyanov2016,
author = {Ulyanov, Dmitry},
title = {Multicore-TSNE},
year = {2016},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/DmitryUlyanov/Multicore-TSNE}},
}


Of course, do not forget to cite L. Van der Maaten's paper

###### Dmitry Ulyanov
Co-Founder at in3D, Phd @ Skoltech
• #### installation fails on macOS

I followed the instructions on this link to get a version of gcc that supports openmp.

but it looks like the install script isn't using gcc:

-- The C compiler identification is AppleClang 8.0.0.8000038
-- The CXX compiler identification is AppleClang 8.0.0.8000038
-- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc


help?

• #### Will there be a version for Windows?

Whilst the compilation and installation worked fine on Windows 8.1, running the code in Python results in

OSError: cannot load library \lib\site-packages\MulticoreTSNE/libtsne_multicore.so: error 0x7e

I guess that's since windows rather expects a DLL than a .so library. Unfortunately my CMAKE skills are not sufficient to adjust the current build instructions to also produce a .dll on Windows - so here's me hoping that someone might fix that.

• #### A range of refreshments ...

fixes https://github.com/DmitryUlyanov/Multicore-TSNE/issues/27 fixes https://github.com/DmitryUlyanov/Multicore-TSNE/issues/24 fixes https://github.com/DmitryUlyanov/Multicore-TSNE/issues/14 fixes https://github.com/DmitryUlyanov/Multicore-TSNE/issues/13 closes https://github.com/DmitryUlyanov/Multicore-TSNE/pull/16

• #### Improve Windos compatibility and test on AppVeyor

This PR ~~improves~~enables building on Windos, proved by example on AppVeyor. If you like it, consider enabling AppVeyor for this project. Two tests needed to be adapted a bit, I guess due to their imprecise, arbitrary-set constraints. I know not better.

Updated the readme to what seem to be the current build instructions. I don't think anything is needed besides cmake and a compiler. OpenMP is optional, pending compiler availability. Please feel free to edit as you see fit.

OpenMP requirement was loosened because recent MSVC compilers only support 2.0 version of that standard from 1998.

Would you be interested in maintaining this package also on PyPI? I'd be happy to follow up with a small addendum to Travis and AppVeyor recipes, which would result in you just needing to issue the following few steps to publish a new release, making it accessible to install on all (three) platforms:

git commit setup.py -m "bump version to 1.1 or whatever"
git tag 1.1
git push --tags
# The tagged commit build results would be automatically pushed to PyPI

• #### package is now pip-installable

Hi Dmtry!

This PR fixes problem #6

I did this by using setuptools install function instead of the one contained in distutils. In addition to that I slightly rewrote the custom install function in setup.py. This was necessary, since the original part didn't abort the installation process in the case cmake wasn't installed. For completeness I also added a gitignore file.

• #### Use option random_state / Set seed via srand (random_state);

Hi Dmitry,

currently, the option random_state is avoided and thereby every tSNE plot looks different. Would you consider setting a seed for the initialization as described above, in random_state != None? If you want, I can make a pull request for that.

Cheers, Alex

• #### remove verbose from example

Running the example as-is, I get:

Mar 6, 2021
###### A high performance implementation of HDBSCAN clustering. http://hdbscan.readthedocs.io/en/latest/

HDBSCAN Now a part of scikit-learn-contrib HDBSCAN - Hierarchical Density-Based Spatial Clustering of Applications with Noise. Performs DBSCAN over va

Dec 29, 2022
###### The implementation of the paper "HIST: A Graph-based Framework for Stock Trend Forecasting via Mining Concept-Oriented Shared Information".

The HIST framework for stock trend forecasting The implementation of the paper "HIST: A Graph-based Framework for Stock Trend Forecasting via Mining C

Jan 3, 2023
###### Implementation of SOMs (Self-Organizing Maps) with neighborhood-based map topologies.

py-self-organizing-maps Simple implementation of self-organizing maps (SOMs) A SOM is an unsupervised method for learning a mapping from a discrete ne

Nov 22, 2022
###### A Python Binder that merge 2 files with any extension by creating a new python file and compiling it to exe which runs both payloads.

Update ! ANONFILE MIGHT NOT WORK ! About A Python Binder that merge 2 files with any extension by creating a new python file and compiling it to exe w

Oct 12, 2022
###### Debugging, monitoring and visualization for Python Machine Learning and Data Science

Welcome to TensorWatch TensorWatch is a debugging and visualization tool designed for data science, deep learning and reinforcement learning from Micr

Dec 27, 2022
###### Python module for drawing and rendering beautiful atoms and molecules using Blender.

Batoms is a Python package for editing and rendering atoms and molecules objects using blender. A Python interface that allows for automating workflows.

Jul 6, 2022
Aug 11, 2022
###### Python scripts for plotting audiograms and related data from Interacoustics Equinox audiometer and Otoaccess software.

audiometry Python scripts for plotting audiograms and related data from Interacoustics Equinox 2.0 audiometer and Otoaccess software. Maybe similar sc

Jun 15, 2022
###### Drug design and development team HackBio internship is a virtual bioinformatics program that introduces students and professional to advanced practical bioinformatics and its applications globally.

-Nyokong. Drug design and development team HackBio internship is a virtual bioinformatics program that introduces students and professional to advance

Aug 4, 2022
###### The Timescale NFT Starter Kit is a step-by-step guide to get up and running with collecting, storing, analyzing and visualizing NFT data from OpenSea, using PostgreSQL and TimescaleDB.

Timescale NFT Starter Kit The Timescale NFT Starter Kit is a step-by-step guide to get up and running with collecting, storing, analyzing and visualiz

Dec 24, 2022
###### Plot and save the ground truth and predicted results of human 3.6 M and CMU mocap dataset.

Visualization-of-Human3.6M-Dataset Plot and save the ground truth and predicted results of human 3.6 M and CMU mocap dataset. human-motion-prediction

Nov 18, 2022
###### Visual Python is a GUI-based Python code generator, developed on the Jupyter Notebook environment as an extension.

Visual Python is a GUI-based Python code generator, developed on the Jupyter Notebook environment as an extension.

Jan 3, 2023