Ian Mayo edited this page Feb 17, 2015 · 25 revisions

Welcome to the data-wrangler wiki!

[mockup image]

External docs

  • Here's a mockup containing the current (July 2014) thoughts.
  • Here are some high level thoughts regarding the app: high level thoughts

Operations

Application level

  • open data file for wrangling
  • open data file for wrangling using pre-existing input schema
  • convert data file to alternate format using pre-existing conversion schema
  • merge two input files (user specifies key/foreign key)
  • load file from cloud-storage (columns will be auto-assigned)
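The "merge two input files" operation above amounts to a key/foreign-key join. A minimal sketch of that idea in Python (illustrative only - the app itself is Eclipse-based, and the row/column names here are invented for the example):

```python
def merge_rows(left, right, key, foreign_key):
    """Inner-join two lists of dict-rows where left[key] == right[foreign_key].

    The user would pick `key` and `foreign_key` in the UI; on a column-name
    clash the left-hand file's value wins in this sketch.
    """
    index = {row[foreign_key]: row for row in right}
    merged = []
    for row in left:
        match = index.get(row[key])
        if match is not None:
            combined = dict(match)
            combined.update(row)  # left-hand values take precedence
            merged.append(combined)
    return merged

# Hypothetical data: a track file keyed by TgtId, and a lookup file keyed by Id
tracks = [{"TgtId": "T1", "Speed": "12.0"}, {"TgtId": "T2", "Speed": "8.5"}]
names = [{"Id": "T1", "Name": "Alpha"}, {"Id": "T3", "Name": "Gamma"}]
print(merge_rows(tracks, names, key="TgtId", foreign_key="Id"))
```

Rows without a match are dropped here (an inner join); a real implementation would let the user choose whether unmatched rows are kept or discarded.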

File level

  • just view Head rows (10 rows?) - default for huge files
  • just view Tail rows (10 rows?)
  • re-sample at alternate frequency (only columns that have been annotated with type/units)
  • export using pre-existing output schema (essential cols for that schema must have been specified)
  • delete incomplete/invalid rows
  • interpolate incomplete/invalid cells (for annotated columns)
  • specify delimiter(s)
  • specify fixed widths
  • specify message type column (after doing this, there's a set of column definitions for each message type)
  • filter for message type
  • provide persistent document annotations for file type (Source, link to format reference)
  • register input schema for re-use in application
  • export input schema for exchange
  • push file to cloud storage
  • provide version tracking for all file modifications (with comments - see Version History below)
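Of the file-level operations, "re-sample at alternate frequency" is the least obvious, and it only makes sense for columns annotated with a type/units. A sketch of the underlying calculation, assuming linear interpolation onto a regular time grid (the interpolation method and function name are this example's assumptions, not the app's design):

```python
def resample(times, values, step):
    """Linearly interpolate (time, value) samples onto a regular grid.

    `times` must be sorted ascending; `step` is in the same units as `times`.
    Returns a list of (time, value) pairs from times[0] to times[-1].
    """
    out = []
    t = times[0]
    i = 0
    while t <= times[-1]:
        # advance to the source interval containing t
        while times[i + 1] < t:
            i += 1
        t0, t1 = times[i], times[i + 1]
        v0, v1 = values[i], values[i + 1]
        frac = (t - t0) / (t1 - t0)
        out.append((t, v0 + frac * (v1 - v0)))
        t += step
    return out

# irregular samples at t = 0, 10, 30 resampled onto a 10-unit grid
print(resample([0, 10, 30], [0.0, 10.0, 20.0], 10))
```

The same interpolation machinery would also serve the "interpolate incomplete/invalid cells" operation, filling a gap from the annotated values on either side of it.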

Column level

  • specify data type & units
  • add calculated column based on selected one(s) [see below for calculation examples]
  • apply time offset (for time data type)
  • view statistical overview (5 number summary for quantitative, other summaries for qualitative - think DescriptiveStatistics in Apache Commons Math)
  • view xy-plot (if quantitative, plot against time if time column annotated)
  • view value listing (if qualitative; only the first few values when there are many)
  • remove outliers (if quantitative, configurable)
  • mark as identity column (used for grouping data for export, e.g. TgtId)
  • mark as category value (used to allow group rows by category)
  • delete column
  • hide (ignore) column
  • provide persistent column annotations
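The statistical overview and outlier removal above are closely related: a 5 number summary yields the quartiles, and a common outlier rule (used here as an assumption - the operation is described as "configurable") discards values more than 1.5 interquartile ranges outside the quartiles. A sketch:

```python
import statistics


def five_number_summary(values):
    """Min, lower quartile, median, upper quartile, max of a numeric column."""
    s = sorted(values)
    q1, med, q3 = statistics.quantiles(s, n=4, method="inclusive")
    return {"min": s[0], "q1": q1, "median": med, "q3": q3, "max": s[-1]}


def remove_outliers(values, k=1.5):
    """Drop values outside [q1 - k*IQR, q3 + k*IQR]; `k` is the configurable knob."""
    s = five_number_summary(values)
    iqr = s["q3"] - s["q1"]
    lo, hi = s["q1"] - k * iqr, s["q3"] + k * iqr
    return [v for v in values if lo <= v <= hi]


print(five_number_summary([1, 2, 3, 4, 5]))
print(remove_outliers([1, 2, 3, 4, 5, 100]))
```

In the real app these would only be offered for columns already annotated as quantitative, matching the column-annotation operations listed above.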

Row level

  • delete row
  • insert new row (duplicating previous one)

Cell level

  • replace value with interpolated one

Calculations/Operations

A series of operations will be available for column-related activities. These operations will typically produce one new column from one existing column, but scripting (or a custom UI) will allow 1..* input columns to generate 1..* output columns:

  • inverse
  • parse date/time text
  • parse latitude/longitude text
  • add time offset
  • units conversions (time, distance, area, velocity, acceleration)
  • multiply by value
  • add value
  • smooth (range of smoothing types)
  • apply user-provided script (see below)
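As a concrete example of one of these operations, here is a sketch of "parse latitude/longitude text", assuming a degrees-minutes-seconds format with a hemisphere letter. The accepted format and function name are this example's assumptions; real input files would need their own per-schema patterns:

```python
import re


def parse_latlong(text):
    """Parse 'DD MM SS.S H' text (H in NSEW) into signed decimal degrees.

    Example: '50 36 30.0 N' -> 50.608333...; S and W hemispheres are negative.
    """
    m = re.match(r"\s*(\d+)[^\d]+(\d+)[^\d]+([\d.]+)\s*([NSEW])", text)
    if not m:
        raise ValueError(f"unrecognised coordinate: {text!r}")
    deg, mins, secs, hemi = m.groups()
    value = int(deg) + int(mins) / 60 + float(secs) / 3600
    return -value if hemi in "SW" else value


print(parse_latlong("50 36 30.0 N"))
print(parse_latlong("004 30 00.0 W"))
```

Most of the other operations (time offset, unit conversion, multiply/add) are simple element-wise functions of the same shape: one input cell in, one output cell out.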

Applicable technologies

Version History

Eclipse includes a mature implementation of Version Control via the Git distributed version control system.

Git can be unwieldy for non-developers, but interactions with it will be hidden within the Data Wrangler. So, all the user will encounter is:

  • the ability to provide a comment explaining each file/data change
  • the ability to view the history of either the current folder, the current file, or the current column in the current file
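The user-facing model above is just "comment in, history out"; Git does the heavy lifting underneath. A minimal sketch of that surface (the class and method names are invented for illustration - the real implementation would delegate to Eclipse's Git tooling rather than store snapshots itself):

```python
import hashlib


class VersionHistory:
    """User-visible slice of version tracking: each change carries a comment."""

    def __init__(self):
        self._commits = []  # (comment, short content hash), oldest first

    def record(self, comment, content):
        """Record one file/data change; in the real app this would be a Git commit."""
        digest = hashlib.sha1(content.encode()).hexdigest()[:8]
        self._commits.append((comment, digest))

    def history(self):
        """Return the change comments, oldest first (cf. `git log --reverse`)."""
        return [comment for comment, _ in self._commits]


h = VersionHistory()
h.record("initial import", "a,b\n1,2\n")
h.record("deleted invalid rows", "a,b\n1,2\n3,4\n")
print(h.history())
```

Per-file and per-column history would be filtered views of the same commit stream.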

Screenshot of version history: Version Snapshot