Ian Mayo edited this page Feb 17, 2015 · 25 revisions

Welcome to the data-wrangler wiki!

[mockup image]

External docs

  • Here's a mockup containing the current (July 2014) thoughts.
  • Here are some high level thoughts regarding the app: high level thoughts

Operations

Application level

  • open data file for wrangling
  • open data file for wrangling using pre-existing input schema
  • convert data file to alternate format using pre-existing conversion schema
  • merge two input files (user specifies key/foreign key)
  • load file from cloud-storage (columns will be auto-assigned)
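The "merge two input files" operation above amounts to a key/foreign-key join. A minimal sketch of that idea in Python (illustrative only - the app itself is Eclipse-based, and the row/column names here are invented for the example):

```python
def merge_rows(left, right, key, foreign_key):
    """Inner-join two lists of dict-rows where left[key] == right[foreign_key].

    The user would pick `key` and `foreign_key` in the UI; on a column-name
    clash the left-hand file's value wins in this sketch.
    """
    index = {row[foreign_key]: row for row in right}
    merged = []
    for row in left:
        match = index.get(row[key])
        if match is not None:
            combined = dict(match)
            combined.update(row)  # left-hand values take precedence
            merged.append(combined)
    return merged

# Hypothetical data: a track file keyed by TgtId, and a lookup file keyed by Id
tracks = [{"TgtId": "T1", "Speed": "12.0"}, {"TgtId": "T2", "Speed": "8.5"}]
names = [{"Id": "T1", "Name": "Alpha"}, {"Id": "T3", "Name": "Gamma"}]
print(merge_rows(tracks, names, key="TgtId", foreign_key="Id"))
```

Rows without a match are dropped here (an inner join); a real implementation would let the user choose whether unmatched rows are kept or discarded.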

File level

  • just view Head rows (10 rows?) - default for huge files
  • just view Tail rows (10 rows?)
  • re-sample at alternate frequency (only columns that have been annotated with type/units)
  • export using pre-existing output schema (essential cols for that schema must have been specified)
  • delete incomplete/invalid rows
  • interpolate incomplete/invalid cells (for annotated columns)
  • specify delimiter(s)
  • specify fixed widths
  • specify message type column (after doing this, there's a set of column definitions for each message type)
  • filter for message type
  • provide persistent document annotations for file type (Source, link to format reference)
  • register input schema for re-use in application
  • export input schema for exchange
  • push file to cloud storage
  • provide version tracking for all file modifications (with comments - see Version History below)
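Of the file-level operations, "re-sample at alternate frequency" is the least obvious, and it only makes sense for columns annotated with a type/units. A sketch of the underlying calculation, assuming linear interpolation onto a regular time grid (the interpolation method and function name are this example's assumptions, not the app's design):

```python
def resample(times, values, step):
    """Linearly interpolate (time, value) samples onto a regular grid.

    `times` must be sorted ascending; `step` is in the same units as `times`.
    Returns a list of (time, value) pairs from times[0] to times[-1].
    """
    out = []
    t = times[0]
    i = 0
    while t <= times[-1]:
        # advance to the source interval containing t
        while times[i + 1] < t:
            i += 1
        t0, t1 = times[i], times[i + 1]
        v0, v1 = values[i], values[i + 1]
        frac = (t - t0) / (t1 - t0)
        out.append((t, v0 + frac * (v1 - v0)))
        t += step
    return out

# irregular samples at t = 0, 10, 30 resampled onto a 10-unit grid
print(resample([0, 10, 30], [0.0, 10.0, 20.0], 10))
```

The same interpolation machinery would also serve the "interpolate incomplete/invalid cells" operation, filling a gap from the annotated values on either side of it.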

Column level

  • specify data type & units
  • add calculated column based on selected one(s) [see below for calculation examples]
  • apply time offset (for time data type)
  • view statistical overview (5 number summary for quantitative, other summaries for qualitative - think DescriptiveStatistics in Apache Commons Math)
  • view xy-plot (if quantitative, plot against time if time column annotated)
  • view value listing (if qualitative; only the first few values when there are many)
  • remove outliers (if quantitative, configurable)
  • mark as identity column (used for grouping data for export, e.g. TgtId)
  • mark as category value (used to allow group rows by category)
  • delete column
  • hide (ignore) column
  • provide persistent column annotations
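The statistical overview and outlier removal above are closely related: a 5 number summary yields the quartiles, and a common outlier rule (used here as an assumption - the operation is described as "configurable") discards values more than 1.5 interquartile ranges outside the quartiles. A sketch:

```python
import statistics


def five_number_summary(values):
    """Min, lower quartile, median, upper quartile, max of a numeric column."""
    s = sorted(values)
    q1, med, q3 = statistics.quantiles(s, n=4, method="inclusive")
    return {"min": s[0], "q1": q1, "median": med, "q3": q3, "max": s[-1]}


def remove_outliers(values, k=1.5):
    """Drop values outside [q1 - k*IQR, q3 + k*IQR]; `k` is the configurable knob."""
    s = five_number_summary(values)
    iqr = s["q3"] - s["q1"]
    lo, hi = s["q1"] - k * iqr, s["q3"] + k * iqr
    return [v for v in values if lo <= v <= hi]


print(five_number_summary([1, 2, 3, 4, 5]))
print(remove_outliers([1, 2, 3, 4, 5, 100]))
```

In the real app these would only be offered for columns already annotated as quantitative, matching the column-annotation operations listed above.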

Row level

  • delete row
  • insert new row (duplicating previous one)

Cell level

  • replace value with interpolated one

Calculations/Operations

A series of operations will be available for column-related activities. These operations will typically produce one new column from one existing column, but scripting (or a custom UI) will allow 1..* input columns to generate 1..* output columns:

  • inverse
  • parse date/time text
  • parse latitude/longitude text
  • add time offset
  • units conversions (time, distance, area, velocity, acceleration)
  • multiply by value
  • add value
  • smooth (range of smoothing types)
  • apply user-provided script (see below)
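As a concrete example of one of these operations, here is a sketch of "parse latitude/longitude text", assuming a degrees-minutes-seconds format with a hemisphere letter. The accepted format and function name are this example's assumptions; real input files would need their own per-schema patterns:

```python
import re


def parse_latlong(text):
    """Parse 'DD MM SS.S H' text (H in NSEW) into signed decimal degrees.

    Example: '50 36 30.0 N' -> 50.608333...; S and W hemispheres are negative.
    """
    m = re.match(r"\s*(\d+)[^\d]+(\d+)[^\d]+([\d.]+)\s*([NSEW])", text)
    if not m:
        raise ValueError(f"unrecognised coordinate: {text!r}")
    deg, mins, secs, hemi = m.groups()
    value = int(deg) + int(mins) / 60 + float(secs) / 3600
    return -value if hemi in "SW" else value


print(parse_latlong("50 36 30.0 N"))
print(parse_latlong("004 30 00.0 W"))
```

Most of the other operations (time offset, unit conversion, multiply/add) are simple element-wise functions of the same shape: one input cell in, one output cell out.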

Applicable technologies

Version History

Eclipse includes a mature implementation of Version Control via the Git distributed version control system.

Git can be unwieldy for non-developers, but interactions with it will be hidden within the Data Wrangler. So, all the user will encounter is:

  • the ability to provide a comment explaining each file/data change
  • the ability to view the history of either the current folder, the current file, or the current column in the current file
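The user-facing model above is just "comment in, history out"; Git does the heavy lifting underneath. A minimal sketch of that surface (the class and method names are invented for illustration - the real implementation would delegate to Eclipse's Git tooling rather than store snapshots itself):

```python
import hashlib


class VersionHistory:
    """User-visible slice of version tracking: each change carries a comment."""

    def __init__(self):
        self._commits = []  # (comment, short content hash), oldest first

    def record(self, comment, content):
        """Record one file/data change; in the real app this would be a Git commit."""
        digest = hashlib.sha1(content.encode()).hexdigest()[:8]
        self._commits.append((comment, digest))

    def history(self):
        """Return the change comments, oldest first (cf. `git log --reverse`)."""
        return [comment for comment, _ in self._commits]


h = VersionHistory()
h.record("initial import", "a,b\n1,2\n")
h.record("deleted invalid rows", "a,b\n1,2\n3,4\n")
print(h.history())
```

Per-file and per-column history would be filtered views of the same commit stream.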

Screenshot of version history: Version Snapshot