284 lines
6.6 KiB
Markdown
284 lines
6.6 KiB
Markdown
# csv-sanity
|
|
|
|
Preserve your sanity is a world full of malformed, poorly validated CSV files.
|
|
Sanitize and transform large CSVs with millions of records quickly and
|
|
efficiently.
|
|
|
|
**NOTE:** csv-sanity is in an alpha state and is subject to breaking changes.
|
|
The ruleset file syntax in particular is likely to change in the near future.
|
|
I've personally used csv-sanity on a number of projects and it has been
|
|
incredibly helpful, but as with most alpha software csv-sanity is provided
|
|
as-is and provides no warranty or guarantee. Use at your own risk and double
|
|
check your transformed files!
|
|
|
|
## Purpose
|
|
|
|
The CSV format is not well-standardized and has many shortfalls when it comes to
|
|
storing large numbers of records with complex data formats, but CSVs are
|
|
ubiquitous in many realms as a neutral interchange format that most CRMs and
|
|
database software can parse and understand.
|
|
|
|
But what happens when your CRM can only parse ISO 8601 formatted dates and the
|
|
CSV you inherited has dates in another format such as the following:
|
|
|
|
```csv
|
|
id,name,signup_date
|
|
2,John Doe,11/22/2017
|
|
3,Jane Doe,11/28/2017
|
|
```
|
|
|
|
Or you received a CSV of people who you need to contact via a personalized
|
|
email, but your contacts' names in the CSV are in ALL CAPS:
|
|
|
|
```csv
|
|
id,first_name,last_name
|
|
2,JOHN,DOE
|
|
3,JANE,DOE
|
|
```
|
|
|
|
Or you have a CSV that has valid values for the vast majority of records, but 1
|
|
out of every 20k records has nonsense values that cause your entire import to
|
|
abort:
|
|
|
|
```csv
|
|
id,fist_name,last_name,party_registration
|
|
2,Jane,Doe,REP
|
|
3,John,Doe,DEM
|
|
345,Josh,Smith,HAHAHAHA
|
|
```
|
|
|
|
Or even a CSV that has a few malformed records due to unescaped commas:
|
|
|
|
```csv
|
|
id,first_name,last_name,email
|
|
2,Jane,Doe,jane@example.com
|
|
3,John,Doe,"i,don't,follow,the,rules"@example.com
|
|
```
|
|
|
|
These are all real problems I've encountered with CSVs over the years. If the
|
|
CSV is small enough they can be corrected by hand, but for CSVs with 10k, 100k
|
|
or even millions of records correcting by hand simply isn't a viable option.
|
|
|
|
`csv-sanity` aims to solve the issue of sanitizing large, poorly-validated CSVs.
|
|
|
|
## Usage
|
|
|
|
`csv-sanity` is an executable that takes an input CSV to process and a JSON
|
|
ruleset file defining the transformation rules to apply:
|
|
|
|
```bash
|
|
csv-sanity [-r RULESET_FILE] <INPUT_FILE>
|
|
```
|
|
|
|
If a path to a ruleset file is not provided via the `-r` option, `csv-sanity`
|
|
will look for a file named "ruleset.json" in the current directory.
|
|
|
|
By default, `csv-sanity` outputs two files to the current directory:
|
|
output.csv, which contains the processed CSV with validated and transformed
|
|
records, and errors.csv, which contains a list of records and fields that
|
|
couldn't be processed and reasons they were rejected. The paths where the output
|
|
and error files are output can be overridden via the `-o FILE_PATH` and
|
|
`-e FILE_PATH` options, respectively.
|
|
|
|
## ruleset.json Syntax
|
|
|
|
Ruleset files are JSON files that define a collection of transformation rules
|
|
and the fields to which they should be applied.
|
|
|
|
The following is an example ruleset JSON file:
|
|
|
|
```json
|
|
{
|
|
"rules": [
|
|
{
|
|
"applicability": {
|
|
"Global": [],
|
|
},
|
|
"transformer": {
|
|
"None": {
|
|
"regex": "\\A(?:[:cntrl:]|\\s)*\\z"
|
|
}
|
|
},
|
|
"priority": -10
|
|
},
|
|
{
|
|
"applicability": {
|
|
"Global": [],
|
|
},
|
|
"transformer": {
|
|
"Trim": {}
|
|
},
|
|
"priority": -10
|
|
},
|
|
{
|
|
"applicability": {
|
|
"Fields": {
|
|
"field_names": [
|
|
"first_name",
|
|
"last_name"
|
|
]
|
|
}
|
|
},
|
|
"transformer": {
|
|
"Capitalize": {}
|
|
}
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
Every ruleset.json file is a JSON object with a single "rules" field with an
|
|
array of rule objects.
|
|
|
|
Rules are objects with two fields:
|
|
|
|
- **"applicability"**: specifies whether a rule applies globally or only to a
|
|
predefined set of fields (specified as the column headers in the CSV being
|
|
processed)
|
|
- **"transformer"**: a transformer object, which specifies how the applicable
|
|
fields should be transformed.
|
|
|
|
### Transformers
|
|
|
|
#### Capitalize
|
|
|
|
```json
|
|
{
|
|
"Capitalize": {}
|
|
}
|
|
```
|
|
|
|
Transforms string fields into Capital Case.
|
|
|
|
#### Choice
|
|
|
|
```json
|
|
{
|
|
"Choice": {
|
|
"choices": [
|
|
"A",
|
|
"B",
|
|
"C"
|
|
]
|
|
}
|
|
}
|
|
```
|
|
|
|
Only accepts a pre-defined list of acceptable values and rejects the rest.
|
|
|
|
#### Date
|
|
|
|
```json
|
|
{
|
|
"Date": {
|
|
"input_formats": [
|
|
"%m/%d/%Y"
|
|
],
|
|
"output_formats": "%F"
|
|
}
|
|
}
|
|
```
|
|
|
|
Attempt to parse fields with a list of datetime formats via
|
|
[time::strptime](https://docs.rs/time/0.1.37/time/fn.strptime.html). See the
|
|
docs for the [time](https://docs.rs/time/0.1.37/time/index.html) crate for
|
|
details on datetime formating syntax.
|
|
|
|
#### Email
|
|
|
|
```json
|
|
{
|
|
"Email": {}
|
|
}
|
|
```
|
|
|
|
Attempt to parse fields as email addresses, rejecting any fields that appear to
|
|
be invalid email addresses.
|
|
|
|
#### None
|
|
|
|
```json
|
|
{
|
|
"None": {
|
|
"regex": "\\A(?:[:cntrl:]|\\s)*\\z"
|
|
}
|
|
}
|
|
```
|
|
|
|
Replace matched fields with a blank value. Useful as a global rule for
|
|
normalizing blank fields in a CSV file.
|
|
|
|
#### Number
|
|
|
|
```json
|
|
{
|
|
"Number": {}
|
|
}
|
|
```
|
|
|
|
Attempt to parse fields as whole integers, rejecting any fields that cannot be
|
|
parsed.
|
|
|
|
#### PhoneNumber
|
|
|
|
```json
|
|
{
|
|
"PhoneNumber": {}
|
|
}
|
|
```
|
|
|
|
Attempt to parse files as US, NANP-formatted phone numbers, transforming them
|
|
into a standard international format of `+1 <area_code> <exchange_code> <subscriber_number>`.
|
|
|
|
#### Regex
|
|
|
|
```json
|
|
{
|
|
"Regex": {
|
|
"regex": "\\A([A-Z])[A-Z]+\\z",
|
|
"template": "$1"
|
|
}
|
|
}
|
|
```
|
|
|
|
Match fields against the provided regex pattern and transform them according to
|
|
the template string, replacing capture groups placeholders. See the
|
|
[Regex::replace](https://docs.rs/regex/0.2.1/regex/struct.Regex.html#method.replace)
|
|
in the regex crate docs for details.
|
|
|
|
#### RegexMatch
|
|
|
|
```json
|
|
{
|
|
"RegexMatch": {
|
|
"regex": "\\A[A-Z]{2,3}\\z",
|
|
"negate": false
|
|
}
|
|
}
|
|
```
|
|
|
|
Reject any fields that fail to match against the provided regex pattern. If
|
|
`negate` is `true`, the reject any fields that match the provided regex pattern
|
|
instead.
|
|
|
|
#### Trim
|
|
|
|
```json
|
|
{
|
|
"Trim": {}
|
|
}
|
|
```
|
|
|
|
Trim leading and trailing whitespace from fields. Useful as a global rule to
|
|
normalize fields and remove useless whitespace.
|
|
|
|
#### Zipcode
|
|
|
|
```json
|
|
{
|
|
"Zipcode": {}
|
|
}
|
|
```
|
|
|
|
Attempt to parse fields as US zip codes in the formats "xxxxx" and "xxxxx-xxxx",
|
|
rejecting any fields that fail to match that format.
|