Job Pipeline

What is a Job?

A Job is a script that executes a Data App with a full set of parameters to create and update the data of a single table. All Chainslake jobs are organized in the jobs folder of the data-builder repository.

Example: binance/cex/trade_minute

$CHAINSLAKE_HOME_DIR/spark/script/chainslake-run.sh --class chainslake.cex.Main \
    --name BinanceTradeMinute \
    --master local[4] \
    --driver-memory 4g \
    --conf "spark.app_properties.app_name=binance_cex.trade_minute" \
    --conf "spark.app_properties.binance_cex_url=$BINANCE_CEX_URL" \
    --conf "spark.app_properties.number_re_partitions=4" \
    --conf "spark.app_properties.quote_asset=USDT" \
    --conf "spark.app_properties.wait_milliseconds=1" \
    --conf "spark.app_properties.config_file=binance/application.properties"

Job configuration

  • --class: The main class of the app's package (chainslake.cex.Main in the example above)
  • --name: The name of the job
  • --master: The number of cores used to run the job (local[4] = 4 local cores)
  • --driver-memory: The amount of RAM allocated to the job's driver
  • --conf "spark.app_properties.app_name=": The name of the app that the job will execute
  • --conf "spark.app_properties.config_file=": The path to the job's configuration file; these spark.app_properties.* settings reach the app at runtime, as sketched below
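
For illustration only, the sketch below shows how the spark.app_properties.* values passed with --conf might surface inside a running app. Chainslake apps are launched through chainslake-run.sh with a JVM main class (chainslake.cex.Main above), so this PySpark snippet and its variable names are assumptions, not the actual app code.

# Hypothetical sketch: reading job configuration inside a Spark driver.
# The real Chainslake apps are JVM-based; this only illustrates how
# --conf "spark.app_properties.<key>=<value>" settings become visible.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

app_name = spark.conf.get("spark.app_properties.app_name")
config_file = spark.conf.get("spark.app_properties.config_file")
partitions = int(spark.conf.get("spark.app_properties.number_re_partitions", "4"))

print(f"Running {app_name} with config {config_file} and {partitions} partitions")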

What is a Pipeline?

A Pipeline is a directed graph that connects jobs in the order they are executed. Chainslake uses Airflow to automate and run jobs as a DAG, ensuring that jobs always execute in the correct order and that the number of concurrent jobs stays under control.

You can find the entire Chainslake pipeline in the airflow/dags directory of the data-builder project.
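
For illustration only, a minimal Airflow DAG that wires two Chainslake-style jobs together might look like the sketch below. The dag_id, task ids, schedule, and script paths are assumptions, not the actual DAGs in airflow/dags.

# Hypothetical sketch of a Chainslake-style Airflow pipeline;
# job names and script paths are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="binance_cex_example",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
    max_active_tasks=2,  # cap the number of jobs running concurrently
) as dag:
    # Trailing spaces keep Airflow from treating the .sh paths as Jinja template files.
    trade_minute = BashOperator(
        task_id="binance_cex_trade_minute",
        bash_command="bash $CHAINSLAKE_HOME_DIR/jobs/binance/cex/trade_minute.sh ",
    )
    trade_hour = BashOperator(
        task_id="binance_cex_trade_hour",
        bash_command="bash $CHAINSLAKE_HOME_DIR/jobs/binance/cex/trade_hour.sh ",
    )

    # The dependency arrow gives Airflow the directed graph it needs
    # to keep execution order correct.
    trade_minute >> trade_hour

Each BashOperator simply runs one of the job scripts from the jobs folder, and the >> dependencies define the directed graph that Airflow uses to enforce execution order.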

How to build and submit jobs and pipelines?

As with Data Apps, you need to use data-builder to build and submit your work:

  1. Fork the data-builder project to your account on GitHub
  2. Code and test your app locally
  3. Commit and push your work to your repository
  4. Create a pull request to merge your work into the main branch of data-builder