
Comma-Separated Values (CSV) is an old data format that came into use several decades ago and still shows up in low-tech solutions and legacy systems.
I’ve worked on too many teams that used complicated implementations with nested loops to iterate through each CSV row, didn’t leverage native CSV support for header key mapping, and so forth.
In this article, I’ll walk you through parsing CSVs with improved error handling using Ruby’s native CSV gem and dry-rb’s Dry Schema gem, all with minimal effort.
Quick Start
For those who would like to get started quickly, here’s the working implementation that this article will be delving into.
#! /usr/bin/env ruby
# frozen_string_literal: true

# Save as `snippet`, then `chmod 755 snippet`, and run as `./snippet`.

require "bundler/inline"

gemfile true do
  source "https://rubygems.org"

  gem "amazing_print"
  gem "debug"
  gem "dry-schema"
  gem "dry-monads"
  gem "refinements"
end

require "csv"

Dry::Schema.load_extensions :monads

include Dry::Monads[:result]

using Refinements::Hashes

Schema = Dry::Schema.Params do
  before(:key_coercer) { |result| result.to_h.symbolize_keys! }

  required(:body).array(:hash) do
    required(:book).filled(:string)
    required(:author).filled(:string)
    required(:price).filled(:float)
    required(:created_at).filled(:date_time)
  end
end

class Parser
  HEADERS = {
    "Book" => :book,
    "Author" => :author,
    "Price" => :price,
    "CreatedAt" => :created_at
  }.freeze

  def initialize schema: Schema, headers: HEADERS, client: CSV
    @schema = schema
    @headers = headers
    @client = client
  end

  def call(body) = schema.call(body: csv(body)).to_monad

  private

  attr_reader :schema, :headers, :client

  def csv body
    client.instance(body, headers: true, header_converters: proc { |key| headers[key] })
          .to_a
          .map(&:to_h)
  end
end

result = Parser.new.call <<~BODY
  Book,Author,Price,CreatedAt
  Mystics,urGoh,10.50,2022-01-01
  Skeksis,skekSil,20.75,2022-02-13
BODY
case result
in Success(schema) then ap schema.to_h[:body]
in Failure(schema) then ap schema.errors.to_h[:body]
end
When running the above script, you’ll get the following output:
[
  {
    :book => "Mystics",
    :author => "urGoh",
    :price => 10.5,
    :created_at => #<DateTime: 2022-01-01T00:00:00+00:00 ((2459581j,0s,0n),+0s,2299161j)>
  },
  {
    :book => "Skeksis",
    :author => "skekSil",
    :price => 20.75,
    :created_at => #<DateTime: 2022-02-13T00:00:00+00:00 ((2459624j,0s,0n),+0s,2299161j)>
  }
]
If you tweak the CSV body so it is malformed:
Book,Author,Price,CreatedAt
Mystics,,10.50,2022-01-01
Skeksis,skekSil,20.75,
…then you’ll get the following errors when running the script:
{
  0 => {
    :author => [
      "must be filled"
    ]
  },
  1 => {
    :created_at => [
      "must be filled"
    ]
  }
}
That’s a lot of power with only a little bit of code, but you might have questions about the implementation, so let’s break it down next.
Breakdown
We’ll start at the top and work our way down.
Pragmas
#! /usr/bin/env ruby
# frozen_string_literal: true
Pragmas (also known as magic comments) ensure the script runs as a Ruby program and that all strings are frozen for improved performance. You can learn more about pragmas via my Pragmater gem if you like.
Dependencies
Using a Bundler Inline script ensures dependencies are installed before the rest of the script executes. This is definitely handy for small scripts like this, but you can always use my Rubysmith gem if you need more firepower.
As for the dependencies themselves, here are the details:
- Amazing Print - I’m using this for pretty printing hashes at the end of the script via the ap message. I’ll touch upon this more later.
- Debug - This is Ruby’s new debugger and is great for adding binding.break breakpoints to your code for debugging purposes.
- Dry Schema - Provides a powerful DSL for analyzing and validating data structures. This is the primary power of this script and I will expand upon it further soon.
- Dry Monads - Blends Functional Programming with our Object Oriented Design. This lends itself well to the pattern matching at the end of the script.
- Refinements - This is my Ruby gem which refines core primitives and enhances the language without resorting to monkey patching.
Setup
Once our dependencies are installed, there is a tiny bit of setup required:
require "csv"
Dry::Schema.load_extensions :monads
include Dry::Monads[:result]
using Refinements::Hashes
First, you’ll need to require the CSV gem so you can parse CSV content. Next, teach Dry Schema to use monads (and include the result monad) so you can pattern match. Finally, you can use my Hash refinement to symbolize and coerce the schema keys since Ruby doesn’t have native support for key symbolization.
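Here’s a quick illustration of that refinement in action (a minimal sketch, shown in isolation):

using Refinements::Hashes

{"book" => "Mystics", "author" => "urGoh"}.symbolize_keys!
# => {:book => "Mystics", :author => "urGoh"}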
Schema
Now that we understand the dependencies used and the setup, we can talk about Dry Schema usage, which is the heart of our solution:
Schema = Dry::Schema.Params do
  before(:key_coercer) { |result| result.to_h.symbolize_keys! }

  required(:body).array(:hash) do
    required(:book).filled(:string)
    required(:author).filled(:string)
    required(:price).filled(:float)
    required(:created_at).filled(:date_time)
  end
end
Dry Schema provides parameter and JSON schema support by default. The difference, even though we are dealing with a CSV, is what kind of type coercion is used, but I’ll let you read the Dry Schema documentation to learn more. I do want to point out that, with both Params and JSON, all keys are strings, which is why I use my Refinements gem to coerce the keys into symbols within the key_coercer before block. I prefer using symbols as keys when possible.

Next up is the body of the CSV hash. Given the schema above, this equates to the following:
[
  {
    book: "Mystics",
    author: "urGoh",
    price: 10.5,
    created_at: #<DateTime: 2022-01-01T00:00:00+00:00 ((2459581j,0s,0n),+0s,2299161j)>
  }
]
Each element in the array is a CSV row converted to a hash. I’m also expecting each CSV row to have certain columns, which are:
- book: Must be filled as a string.
- author: Must be filled as a string.
- price: Must be filled as a float.
- created_at: Must be filled as a date/time.
Dry Schema makes it convenient to define which keys and values are required as well as what you want the values coerced into. Normally, this would require additional hand-written code, but that work is now avoided.
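To see that coercion in isolation, you can message the schema directly with raw string values. Here’s a quick sketch (note how the price and created_at strings are cast into their proper types):

Schema.call(body: [{book: "Mystics", author: "urGoh", price: "10.50", created_at: "2022-01-01"}])
      .to_h
# => {body: [{book: "Mystics", author: "urGoh", price: 10.5, created_at: #<DateTime 2022-01-01...>}]}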
Parser
With our schema defined, we can move on to the second part of this puzzle, which is our CSV parser. Here’s the code for review:
class Parser
  HEADERS = {
    "Book" => :book,
    "Author" => :author,
    "Price" => :price,
    "CreatedAt" => :created_at
  }.freeze

  def initialize schema: Schema, headers: HEADERS, client: CSV
    @schema = schema
    @headers = headers
    @client = client
  end

  def call(body) = schema.call(body: csv(body)).to_monad

  private

  attr_reader :schema, :headers, :client

  def csv body
    client.instance(body, headers: true, header_converters: proc { |key| headers[key] })
          .to_a
          .map(&:to_h)
  end
end
The core structure of this class is based on the Command and Barewords patterns, which I’ve detailed before. Where things get interesting is with the initial parsing of the CSV and, later, when the body is consumed by the schema. Let’s start with the parsing of the CSV:
client.instance(body, headers: true, header_converters: proc { |key| headers[key] })
      .to_a
      .map(&:to_h)
Here you’re asking the CSV client to build a CSV instance with headers enabled. The headers are important because you’ll need them in order to build a row of key/value hash pairs which we can hand off to the schema. The last thing to do is assign a header_converters closure which knows how to translate each header key into a symbol which your schema will understand. This means that if your header is "Book" then it is looked up in the headers hash and translated to the :book symbol. Same goes for "Author" as :author, and so forth. Here’s a line-by-line breakdown so you can see the evolution of the CSV object being transformed before being handed off to the schema:
client.instance(body, headers: true, header_converters: proc { |key| headers[key] })
      .tap { |object| puts object.inspect }
      .to_a
      .tap { |object| puts object.inspect }
      .map(&:to_h)
      .tap { |object| puts object.inspect }
The above will yield the following output, but I’ve added comments to make each step clearer:
# Step 1 - The CSV instance is initialized.
#<CSV io_type:StringIO encoding:UTF-8 lineno:0 col_sep:"," row_sep:"\n" quote_char:"\"" headers:true>

# Step 2 - The CSV instance is converted to an array of CSV rows.
[
  #<CSV::Row book:"Mystics" author:"urGoh" price:"10.50" created_at:"2022-01-01">,
  #<CSV::Row book:"Skeksis" author:"skekSil" price:"20.75" created_at:"2022-02-13">
]

# Step 3 - Each CSV row is converted into an array of hashes which Dry Schema can consume.
[
  {:book=>"Mystics", :author=>"urGoh", :price=>"10.50", :created_at=>"2022-01-01"},
  {:book=>"Skeksis", :author=>"skekSil", :price=>"20.75", :created_at=>"2022-02-13"}
]
That’s a lot of power from the CSV gem in only a few lines of code. 🎉
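As an aside, the CSV gem does ship with a built-in :symbol header converter, but it downcases each header and strips non-word characters, which means "CreatedAt" would become :createdat instead of :created_at. That’s why this script relies on a custom proc plus the HEADERS map:

client.instance body, headers: true, header_converters: :symbol
# Headers become :book, :author, :price, and :createdat, the last of which won't satisfy the schema.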
Now we can feed that information to our schema via this last line of code:
schema.call(body: csv(body)).to_monad
With the CSV parsed, all you have to do is message the schema with the CSV array and ask that the result be converted to a monad for pattern matching later.
Parsing
With parsing understood, now you can call it:
result = Parser.new.call <<~BODY
  Book,Author,Price,CreatedAt
  Mystics,urGoh,10.50,2022-01-01
  Skeksis,skekSil,20.75,2022-02-13
BODY
For illustration purposes, I’m inlining the CSV body via a heredoc, but you could also message the parser with contents read from a file.
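For example, reading from a file might look like this (books.csv is a hypothetical path whose contents share the same column structure):

result = Parser.new.call File.read("books.csv")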
Pattern Matching
At this point, we are at the end of the script where you can pattern match on the result monad as follows:
case result
in Success(schema) then ap schema.to_h[:body]
in Failure(schema) then ap schema.errors.to_h[:body]
end
The benefit of having the schema answer a monad is that you’ll always know the result is either a Success or a Failure. That’s it.
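As an aside, because the result is a Dry Monads result object, you could also chain combinators instead of pattern matching. A minimal sketch:

result.fmap { |schema| schema.to_h[:body] }      # Transforms the payload when a Success.
      .or { |schema| schema.errors.to_h[:body] } # Handles the errors when a Failure.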
Normally, you’d use the success or failure to process the result, such as messaging an API client or updating the UI. Instead, I’m using Amazing Print (i.e. ap) to print out the success or failure result for illustration purposes. I’ll leave it up to you to wire up whatever downstream processing you’d need next. 🚀
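For instance, downstream wiring might look like the following sketch, where BookImporter and logger are hypothetical stand-ins for whatever your application provides:

case result
in Success(schema) then BookImporter.call schema.to_h[:body]
in Failure(schema) then logger.error schema.errors.to_h[:body]
end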
Next Steps
So far, I’ve only talked about Dry Schema as the primary solution, but I want to highlight that if you need richer error handling or customized rules, you’ll want to reach for Dry Validation, which is built on top of Dry Schema.
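To give you a taste of the difference, here’s a minimal Dry Validation sketch with a custom rule. The positive price rule is only an example and not part of the script above:

require "dry/validation"

class BookContract < Dry::Validation::Contract
  params do
    required(:book).filled(:string)
    required(:price).filled(:float)
  end

  rule(:price) do
    key.failure("must be greater than zero") if value <= 0
  end
end

BookContract.new.call(book: "Mystics", price: "-1.0").errors.to_h
# => {price: ["must be greater than zero"]}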
By the way, if it helps, both Dry Schema and Dry Validation are infinitely better than dealing with Active Model Validations or Action Controller Strong Parameters, so the sooner you move away from them, the happier and more efficient you’ll be.
Conclusion
I hope you’ve enjoyed learning how to parse CSVs, complete with error handling, in only a few lines of code that leverage the Command pattern along with native CSV and Dry Schema support.
Enjoy and may your CSV parsing implementations be fun to work with!