README.md

Project Description

TODO

Installation

Prerequisites:

For the graph implementation specifically, you need to install graphframes manually from a third party, since the official release is incompatible with Spark 3.x (pull request pending). A prebuilt copy is supplied in the spark-packages directory.
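One way to use a prebuilt jar like the one in spark-packages is to pass it to spark-submit via --jars, so Spark ships it to the workers. The helper below is only a sketch of assembling that invocation; the jar filename, application path, and helper name are assumptions, not taken from this repository's scripts.

```python
# Hypothetical sketch: build a spark-submit command that attaches every
# prebuilt jar found in the spark-packages directory via --jars.
import shlex
from pathlib import Path


def build_submit_cmd(app="src/spark/main.py", jar_dir="spark-packages"):
    # Collect all prebuilt jars (e.g. the supplied graphframes build) so
    # spark-submit distributes them alongside the application.
    jars = ",".join(str(p) for p in sorted(Path(jar_dir).glob("*.jar")))
    cmd = ["spark-submit"]
    if jars:
        cmd += ["--jars", jars]
    cmd.append(app)
    return shlex.join(cmd)
```

The actual submit.sh / submit_graph.sh scripts in this repository may pass additional options; this only illustrates the --jars mechanism.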

Setting up

  • Modify settings.json to reflect your setup. If you are running everything locally, you can use start_services.sh to start all services at once. It may take a few minutes for Cassandra to become available.
  • Load the development database by running python3 setup.py from the project root. By default, this loads small_test_data.csv into the transactions table.
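A script consuming settings.json can fail fast if the configuration is incomplete rather than failing mid-job. The sketch below assumes key names (a Cassandra host and a checkpoint directory) that are not documented here, so treat them as placeholders for whatever keys your settings.json actually contains.

```python
# Hedged sketch of a settings.json loader with early validation.
# The required key names below are assumptions, not the repository's schema.
import json


def load_settings(path="settings.json"):
    with open(path) as f:
        settings = json.load(f)
    # Fail early on obviously missing configuration instead of letting the
    # Spark job or Cassandra connection fail later with a vaguer error.
    for key in ("cassandra_host", "checkpoint_dir"):  # assumed key names
        if key not in settings:
            raise KeyError(f"settings.json is missing required key: {key}")
    return settings
```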

Deploying:

  • Start the Spark workload by running either submit.sh (slow) or submit_graph.sh (faster).
  • If you need to clean out the database, you can run python3 clean.py. Be aware that this wipes all table definitions and data.
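Wiping both data and table definitions corresponds to issuing DROP TABLE statements in CQL. The sketch below shows the kind of statement a reset script like clean.py might generate; the keyspace name is an assumption, and only the transactions table is known from the setup step above.

```python
# Hedged sketch of the CQL a destructive reset might issue. The "dev"
# keyspace is a hypothetical default; "transactions" comes from setup.py's
# default target table.
def drop_statements(keyspace="dev", tables=("transactions",)):
    # DROP TABLE IF EXISTS removes the table definition along with its data,
    # matching the warning that clean.py wipes definitions as well as rows.
    return [f"DROP TABLE IF EXISTS {keyspace}.{t};" for t in tables]
```

After a wipe, rerun python3 setup.py to recreate and reload the development database.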