Apache Spark data pipeline osDQ online with Winfy

We host the Apache Spark data pipeline osDQ application so that you can run it on our online workstations, either with Wine or directly.


Quick description of Apache Spark data pipeline osDQ:

This is an offshoot of the open source data quality (osDQ) project: https://sourceforge.net/projects/dataquality/

This subproject creates an Apache Spark based data pipeline in which a JSON metadata file drives the data processing, data pipeline, data quality, data preparation, and data modeling features for big data. It uses the Java API of Apache Spark and can also run in local mode.

Get a JSON example at https://github.com/arrahtech/osdq-spark
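To illustrate the general idea of a metadata-driven pipeline, here is a minimal sketch in plain Java. It is not osDQ code and does not follow the osDQ JSON schema: the class `MetadataPipelineSketch`, the `Step` type, and the step names `filter_min` and `scale` are all hypothetical, and the steps are built in code rather than parsed from a file. It only shows how a list of step descriptions can be turned into functions and applied in order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch: none of these names come from osDQ or its JSON schema.
class MetadataPipelineSketch {

    /** One pipeline step as it might be described in a metadata file. */
    static final class Step {
        final String type;   // hypothetical step kind: "filter_min" or "scale"
        final double value;  // hypothetical numeric parameter for the step

        Step(String type, double value) {
            this.type = type;
            this.value = value;
        }
    }

    /** Turns a step description into a function over a column of numbers. */
    static UnaryOperator<List<Double>> compile(Step step) {
        switch (step.type) {
            case "filter_min":
                // Keep only values greater than or equal to the threshold.
                return rows -> {
                    List<Double> out = new ArrayList<>();
                    for (double v : rows) {
                        if (v >= step.value) out.add(v);
                    }
                    return out;
                };
            case "scale":
                // Multiply every value by the given factor.
                return rows -> {
                    List<Double> out = new ArrayList<>();
                    for (double v : rows) out.add(v * step.value);
                    return out;
                };
            default:
                throw new IllegalArgumentException("Unknown step type: " + step.type);
        }
    }

    /** Applies the steps in the order they are listed. */
    static List<Double> run(List<Step> steps, List<Double> input) {
        List<Double> current = input;
        for (Step step : steps) {
            current = compile(step).apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<Step> steps = Arrays.asList(new Step("filter_min", 10.0), new Step("scale", 0.5));
        System.out.println(run(steps, Arrays.asList(4.0, 10.0, 30.0))); // prints [5.0, 15.0]
    }
}
```

The point of the design is that changing the pipeline means changing the step list, not the runner. For the real metadata format, refer to the JSON example in the repository.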

How to run

Unzip the zip file

Windows : java -cp .\lib\*;osdq-spark-0.0.1.jar org.arrah.framework.spark.run.TransformRunner -c .\example\samplerun.json

Mac / UNIX : java -cp "./lib/*:./osdq-spark-0.0.1.jar" org.arrah.framework.spark.run.TransformRunner -c ./example/samplerun.json

On Windows, you need a Hadoop distribution unzipped on a local drive and HADOOP_HOME set to its location. Also copy winutils.exe into HADOOP_HOME\bin.
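The two platform differences above (the classpath separator and the Windows-only Hadoop requirement) can be checked before launching. The following is a minimal sketch, not part of osDQ: the class `RunPreflightSketch` and its method names are made up for illustration. It uses only standard Java calls (`File.pathSeparator`, `System.getenv`) and assumes winutils.exe is expected at HADOOP_HOME\bin\winutils.exe as described above.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not shipped with osDQ.
class RunPreflightSketch {

    /** Builds the -cp value with the platform separator (";" on Windows, ":" elsewhere). */
    static String classpath(String libDir, String jar) {
        return libDir + File.separator + "*" + File.pathSeparator + jar;
    }

    /** Returns the problems found with the given HADOOP_HOME value; an empty list means OK. */
    static List<String> checkHadoopHome(String hadoopHome) {
        List<String> problems = new ArrayList<>();
        if (hadoopHome == null || hadoopHome.trim().isEmpty()) {
            problems.add("HADOOP_HOME is not set");
            return problems;
        }
        File home = new File(hadoopHome);
        if (!home.isDirectory()) {
            problems.add("HADOOP_HOME is not a directory: " + hadoopHome);
            return problems;
        }
        File winutils = new File(new File(home, "bin"), "winutils.exe");
        if (!winutils.isFile()) {
            problems.add("Missing " + winutils.getPath());
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println("Classpath: " + classpath("lib", "osdq-spark-0.0.1.jar"));
        boolean windows = System.getProperty("os.name").toLowerCase().startsWith("windows");
        if (windows) {
            // Only Windows needs the Hadoop distribution and winutils.exe.
            for (String problem : checkHadoopHome(System.getenv("HADOOP_HOME"))) {
                System.out.println("Problem: " + problem);
            }
        }
    }
}
```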

Features:
  • Create data pipelines using Join, Filter, Aggregate, and Case statements
  • Data quality - replace, drop, join
  • Data profiling, column-based profiling
  • Fuzzy join - cosine distance and others
  • Classification and sampling - random forest, multi-class neural network
  • Data normalization - z-score, standard deviation, ratio score
  • Sampling - random, stratified, key-based
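Two of the measures named in the list, z-score normalization and cosine distance, are standard formulas and can be sketched in plain Java. This is an illustration only, not osDQ's implementation: the class `FeatureFormulasSketch` is hypothetical, the z-score here uses the population standard deviation, and the cosine distance compares character bigram counts. Whether osDQ makes the same choices is an assumption.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of two textbook formulas; not taken from osDQ.
class FeatureFormulasSketch {

    /** Z-score: (x - mean) / population standard deviation. */
    static double[] zScores(double[] values) {
        int n = values.length;
        double mean = 0.0;
        for (double v : values) mean += v;
        mean /= n;
        double variance = 0.0;
        for (double v : values) variance += (v - mean) * (v - mean);
        double std = Math.sqrt(variance / n);
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            // A constant column has no spread; map it to 0 instead of dividing by zero.
            out[i] = std == 0.0 ? 0.0 : (values[i] - mean) / std;
        }
        return out;
    }

    /** Counts lower-cased character bigrams, e.g. "abc" gives {ab=1, bc=1}. */
    static Map<String, Integer> bigramCounts(String s) {
        Map<String, Integer> counts = new HashMap<>();
        String t = s.toLowerCase();
        for (int i = 0; i + 1 < t.length(); i++) {
            counts.merge(t.substring(i, i + 2), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine distance = 1 - cosine similarity of the two bigram count vectors. */
    static double cosineDistance(String a, String b) {
        Map<String, Integer> ca = bigramCounts(a);
        Map<String, Integer> cb = bigramCounts(b);
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : ca.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = cb.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (int c : cb.values()) normB += c * c;
        // Strings shorter than two characters have no bigrams; treat them as maximally distant.
        if (normA == 0.0 || normB == 0.0) return 1.0;
        return 1.0 - dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(zScores(new double[] {2, 4, 4, 4, 5, 5, 7, 9})));
        System.out.println(cosineDistance("night", "nacht")); // prints 0.75
    }
}
```

In a fuzzy join, two keys would be treated as a match when their distance falls below a chosen threshold; the threshold value is a tuning decision.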


Audience: Information Technology, Other Audience, Architects.
User interface: Console/Terminal.
Programming Language: Java, Scala.
Categories:
Data Warehousing, Business Intelligence, ETL

©2024. Winfy. All Rights Reserved.

By OD Group OU – Registry code: 1609791 -VAT number: EE102345621.