
README.md

Project Description

TODO

Installation

Prerequisites:

For the graph implementation specifically, you need to install GraphFrames manually from a third-party build, since the official release is incompatible with Spark 3.x (a pull request is pending). A prebuilt copy is supplied in the spark-packages directory.
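The submit scripts are presumably already wired up for this, but as a rough, hypothetical illustration of the idea, a locally supplied jar can be attached to a PySpark session through the spark.jars config (the jar filename below is a placeholder, not necessarily the file shipped in spark-packages):

```python
# Hypothetical sketch: pointing Spark at a prebuilt GraphFrames jar kept in
# spark-packages/. The jar filename is a placeholder; use whichever file is
# actually present in that directory.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("graph-clustering")
    .config("spark.jars", "spark-packages/graphframes.jar")  # placeholder name
    .getOrCreate()
)

# With the jar on the classpath (and the GraphFrames Python bindings on
# PYTHONPATH), the usual import becomes available:
# from graphframes import GraphFrame
```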

Setting up

  • Modify settings.json to reflect your setup. If you are running everything locally, you can use start_services.sh to start everything in one go. It may take a few minutes for Cassandra to become available.
  • Load the development database by running python3 setup.py from the project root. By default this loads small_test_data.csv into the transactions table (see the sketch after this list).
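As a rough sketch of what that load looks like with the Spark Cassandra connector (this is not the actual setup.py; the keyspace name "dev" and the CSV options are assumptions, only the transactions table name comes from this README):

```python
# Assumed sketch of the development-data load, not the project's real script.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load-dev-data").getOrCreate()

# Read the bundled test data; header/schema inference are assumptions about
# the CSV layout.
df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("small_test_data.csv")
)

# Write into Cassandra via the Spark Cassandra connector. The keyspace name
# "dev" is a placeholder.
(
    df.write
    .format("org.apache.spark.sql.cassandra")
    .options(table="transactions", keyspace="dev")
    .mode("append")
    .save()
)
```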

Deploying:

  • Start the Spark workload by running either submit.sh (slow) or submit_graph.sh (faster).
  • If you need to clean out the database, run python3 clean.py. Be aware that this wipes all table definitions and data (see the sketch after this list).
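For illustration only, a wipe of that kind can be done with the DataStax cassandra-driver; this is not the actual clean.py, and the keyspace name and contact point are assumptions:

```python
# Hypothetical sketch of wiping the schema the way clean.py's warning
# describes. The keyspace name "dev" and the local contact point are
# assumptions, not taken from the project.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])  # assumes a locally running Cassandra node
session = cluster.connect()

# Dropping the keyspace removes every table definition and all data in it.
session.execute("DROP KEYSPACE IF EXISTS dev")

cluster.shutdown()
```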