Initial commit

2019-02-06 12:24:39 -08:00 · 2019-02-06 12:24:39 -08:00 · 26c5433d16
commit 26c5433d16
21 changed files with 1968 additions and 0 deletions
--- a/README.md
+++ b/README.md
@ -0,0 +1,268 @@
+# csv-sanity
+
+Preserve your sanity is a world full of malformed, poorly validated CSV files.
+Sanitize and transform large CSVs with millions of records quickly and
+efficiently.
+
+**NOTE:** csv-sanity is in an alpha state and is subject to breaking changes.
+The ruleset file syntax in particular is likely to change in the near future.
+I've personally used csv-sanity on a number of projects and it has been
+incredibly helpful, but as with most alpha software csv-sanity is provided
+as-is and provides no warranty or guarantee. Use at your own risk and double
+check your transformed files!
+
+## Purpose
+
+The CSV format is not well-standardized and has many shortfalls when it comes to
+storing large numbers of records with complex data formats, but CSVs are
+ubiquitous in many realms as a neutral interchange format that most CRMs and
+database software can parse and understand.
+
+But what happens when your CRM can only parse ISO 8601 formatted dates and the
+CSV you inherited has dates in another format such as the following:
+
+```csv
+id,name,signup_date
+2,John Doe,11/22/2017
+3,Jane Doe,11/28/2017
+```
+
+Or you received a CSV of people who you need to contact via a personalized
+email, but your contacts' names in the CSV are in ALL CAPS:
+
+```csv
+id,first_name,last_name
+2,JOHN,DOE
+3,JANE,DOE
+```
+
+Or you have a CSV that has valid values for the vast majority of records, but 1
+out of every 20k records has nonsense values that cause your entire import to
+abort:
+
+```csv
+id,fist_name,last_name,party_registration
+2,Jane,Doe,REP
+3,John,Doe,DEM
+345,Josh,Smith,HAHAHAHA
+```
+
+Or even a CSV that has a few malformed records due to unescaped commas:
+
+```csv
+id,first_name,last_name,email
+2,Jane,Doe,jane@example.com
+3,John,Doe,"i,don't,follow,the,rules"@example.com
+```
+
+These are all real problems I've encountered with CSVs over the years. If the
+CSV is small enough they can be corrected by hand, but for CSVs with 10k, 100k
+or even millions of records correcting by hand simply isn't a viable option.
+
+`csv-sanity` aims to solve the issue of sanitizing large, poorly-validated CSVs.
+
+## Usage
+
+`csv-sanity` is an executable that takes an input CSV to process and a JSON
+ruleset file defining the transformation rules to apply:
+
+```bash
+csv-sanity [-r RULESET_FILE] <INPUT_FILE>
+```
+
+If a path to a ruleset file is not provided via the `-r` option, `csv-sanity`
+will look for a file named "ruleset.json" in the current directory.
+
+By default, `csv-sanity` outputs two files to the current directory:
+output.csv, which contains the processed CSV with validated and transformed
+records, and errors.csv, which contains a list of records and fields that
+couldn't be processed and reasons they were rejected. The paths where the output
+and error files are output can be overridden via the `-o FILE_PATH` and
+`-e FILE_PATH` options, respectively.
+
+## ruleset.json Syntax
+
+Ruleset files are JSON files that define a collection of transformation rules
+and the fields to which they should be applied.
+
+The following is an example ruleset JSON file:
+
+```json
+{
+    "rules": [
+        {
+            "applicability": {
+                "Global": [],
+            },
+            "transformer": {
+                "None": {
+                    "regex": "\\A(?:[:cntrl:]|\\s)*\\z"
+                }
+            },
+            "priority": -10
+        },
+        {
+            "applicability": {
+                "Global": [],
+            },
+            "transformer": {
+                "Trim": {}
+            },
+            "priority": -10
+        },
+        {
+            "applicability": {
+                "Fields": {
+                    "field_names": [
+                        "first_name",
+                        "last_name"
+                    ]
+                }
+            },
+            "transformer": {
+                "Capitalize": {}
+            }
+        }
+    ]
+}
+```
+
+Every ruleset.json file is a JSON object with a single "rules" field with an
+array of rule objects.
+
+Rules are objects with two fields:
+
+- **"applicability"**: specifies whether a rule applies globally or only to a
+    predefined set of fields (specified as the column headers in the CSV being
+    processed)
+- **"transformer"**: a transformer object, which specifies how the applicable
+    fields should be transformed.
+
+### Transformers
+
+
+
+#### Capitalize
+
+Transforms string fields into Capital Case.
+
+#### Choice
+
+Only accepts a pre-defined list of acceptable values and rejects the rest.
+
+#### Date
+
+```json
+{
+    "Date": {
+        "input_formats": [
+            "%m/%d/%Y"
+        ],
+        "output_formats": "%F"
+    }
+}
+```
+
+Attempt to parse fields with a list of datetime formats via
+[time::strptime](https://docs.rs/time/0.1.37/time/fn.strptime.html). See the
+docs for the [time](https://docs.rs/time/0.1.37/time/index.html) crate for
+details on datetime formating syntax.
+
+#### Email
+
+```json
+{
+    "Email": {}
+}
+```
+
+Attempt to parse fields as email addresses, rejecting any fields that appear to
+be invalid email addresses.
+
+#### None
+
+```json
+{
+    "None": {
+        "regex": "\\A(?:[:cntrl:]|\\s)*\\z"
+    }
+}
+```
+
+Replace matched fields with a blank value. Useful as a global rule for
+normalizing blank fields in a CSV file.
+
+#### Number
+
+```json
+{
+    "Number": {}
+}
+```
+
+Attempt to parse fields as whole integers, rejecting any fields that cannot be
+parsed.
+
+#### PhoneNumber
+
+```json
+{
+    "PhoneNumber": {}
+}
+```
+
+Attempt to parse files as US, NANP-formatted phone numbers, transforming them
+into a standard international format of `+1 <area_code> <exchange_code> <subscriber_number>`.
+
+#### Regex
+
+```json
+{
+    "Regex": {
+        "regex": "\\A([A-Z])[A-Z]+\\z",
+        "template": "$1"
+    }
+}
+```
+
+Match fields against the provided regex pattern and transform them according to
+the template string, replacing capture groups placeholders. See the
+[Regex::replace](https://docs.rs/regex/0.2.1/regex/struct.Regex.html#method.replace)
+in the regex crate docs for details.
+
+#### RegexMatch
+
+```json
+{
+    "RegexMatch": {
+        "regex": "\\A[A-Z]{2,3}\\z",
+        "negate": false
+    }
+}
+```
+
+Reject any fields that fail to match against the provided regex pattern. If
+`negate` is `true`, the reject any fields that match the provided regex pattern
+instead.
+
+#### Trim
+
+```json
+{
+    "Trim": {}
+}
+```
+
+Trim leading and trailing whitespace from fields. Useful as a global rule to
+normalize fields and remove useless whitespace.
+
+#### Zipcode
+
+```json
+{
+    "Zipcode": {}
+}
+```
+
+Attempt to parse fields as US zip codes in the formats "xxxxx" and "xxxxx-xxxx",
+rejecting any fields that fail to match that format.