Home
Ian Mayo edited this page Feb 17, 2015 · 25 revisions
Welcome to the data-wrangler wiki!
- Here's a mockup containing the current (July 2014) thoughts.
- Here is a link to some high level thoughts regarding the app: high level thoughts
- open data file for wrangling
- open data file for wrangling using pre-existing input schema
- convert data file to alternate format using pre-existing conversion schema
- merge two input files (user specifies key/foreign key)
- load file from cloud-storage (columns will be auto-assigned)
- just view Head rows (10 rows?) - default for huge files
- just view Tail rows (10 rows?)
- re-sample at alternate frequency (only columns that have been annotated with type/units)
- export using pre-existing output schema (essential cols for that schema must have been specified)
- delete incomplete/invalid rows
- interpolate incomplete/invalid cells (for annotated columns)
- specify delimiter(s)
- specify fixed widths
- specify message type column (after doing this, there's a set of column definitions for each message type)
- filter for message type
- provide persistent document annotations for the file type (source, link to format reference)
- register input schema for re-use in application
- export input schema for exchange
- push file to cloud storage
- provide version tracking for all file modifications (with comments - see Version History below)
- specify data type & units
- add calculated column based on selected one(s) [see below for calculation examples]
- apply time offset (for time data type)
- view statistical overview (five-number summary for quantitative columns, alternatives for qualitative - see Apache Commons Math's DescriptiveStatistics)
- view xy-plot (if quantitative, plot against time if time column annotated)
- view value listing (if qualitative; only the first few values if there are many)
- remove outliers (if quantitative, configurable)
- mark as identity column (used for grouping data for export, e.g. TgtId)
- mark as category value (used to allow group rows by category)
- delete column
- hide (ignore) column
- provide persistent column annotations
- delete row
- insert new row (duplicating previous one)
- replace value with interpolated one
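The cell-level interpolation mentioned above could work along these lines. This is a minimal sketch, assuming a column is held as a list with `None` marking incomplete/invalid cells; the `interpolate_column` name is illustrative, not part of any existing API:

```python
def interpolate_column(values):
    """Fill None cells by linear interpolation between the nearest
    valid neighbours on each side (gaps at the edges are left as-is)."""
    result = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        # nearest valid neighbour below and above the gap
        lo = next((j for j in range(i - 1, -1, -1) if values[j] is not None), None)
        hi = next((j for j in range(i + 1, len(values)) if values[j] is not None), None)
        if lo is None or hi is None:
            continue  # cannot interpolate without a value on both sides
        frac = (i - lo) / (hi - lo)
        result[i] = values[lo] + frac * (values[hi] - values[lo])
    return result
```

A multi-cell gap is filled proportionally, e.g. `[1.0, None, None, 4.0]` becomes `[1.0, 2.0, 3.0, 4.0]`.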
A series of operations will be available for column-related activities. These operations will typically produce one new column from one existing column, but scripting (or a custom UI) will allow 1-* input columns to generate 1-* output columns:
- inverse
- parse date/time text
- parse latitude/longitude text
- add time offset
- units conversions (time, distance, area, velocity, acceleration)
- multiply by value
- add value
- smooth (range of smoothing types)
- apply user-provided script (see below)
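The 1-to-1 operations above could be modelled as simple cell-wise callables that the application applies down a column. A minimal sketch, with all names illustrative (the eventual implementation would live in the Eclipse/Java stack):

```python
# Column operations as callables applied cell-wise; None (invalid)
# cells pass through unchanged.
def multiply_by(factor):
    return lambda cell: cell * factor

def add_offset(offset):
    return lambda cell: cell + offset

def apply_op(column, op):
    """Apply one column operation, producing a new column."""
    return [op(cell) if cell is not None else None for cell in column]

# e.g. a units conversion: speed in m/s to knots (1 m/s ~ 1.9438 kn)
speeds_ms = [5.0, None, 10.0]
speeds_kn = apply_op(speeds_ms, multiply_by(1.9438))
```

User-provided scripts would slot into the same shape: any callable from cell(s) in to cell(s) out.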
- Eclipse's Nebula NatTable widget will provide valuable UI foundation
- Eclipse's native Git support will provide the backbone for data storage and versioning.
- Eclipse's EASE (Eclipse Advanced Scripting Environment) can offer great versatility to scripting-aware users, enabling additional data conversions & operations.
- Apache Commons Math's DescriptiveStatistics can quickly summarise a dataset
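The five-number summary (min, Q1, median, Q3, max) referred to above is the kind of output DescriptiveStatistics provides. A minimal Python sketch of the same idea; note the quartile convention chosen here is an assumption, since several are in common use:

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for a quantitative column.
    Quartiles use the 'inclusive' convention (one of several options)."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (min(data), q1, q2, q3, max(data))
```

For example, `five_number_summary([1, 2, 3, 4, 5, 6, 7])` gives `(1, 2.5, 4.0, 5.5, 7)`.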
Eclipse includes a mature implementation of Version Control via the Git distributed version control system.
This can be unwieldy for non-developers, but interactions with it will be hidden within the Data Wrangler, so all the user will encounter is:
- the ability to provide a comment explaining each file/data change
- the ability to view the history of either the current folder, the current file, or the current column in the current file
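From the user's point of view, the hidden Git layer boils down to comments attached to changes, queryable per folder or file. A toy in-memory sketch of that model (illustrative only; the real storage would be Git, e.g. via Eclipse's JGit):

```python
import datetime

class VersionHistory:
    """Toy model of the comment-per-change history the UI would show.
    A real implementation would delegate to Git rather than a list."""
    def __init__(self):
        self._commits = []  # (timestamp, path, comment)

    def record_change(self, path, comment):
        self._commits.append((datetime.datetime.now(), path, comment))

    def history_for(self, path):
        """Comments for one file, or for a folder via its path prefix."""
        return [c for (_, p, c) in self._commits if p.startswith(path)]

history = VersionHistory()
history.record_change("tracks/sensor.csv", "Removed outliers from speed column")
history.record_change("tracks/sensor.csv", "Applied 2h time offset")
```

Per-column history would follow the same pattern, keyed on file plus column rather than file alone.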
Screenshot of version history: