When was the last time you got something useful out of a 100 000-line CSV file in Excel? Probably never, since it would crash. I am sure there are Excel tools that handle this, but here is an example of how I use PowerShell to get information from large CSV outputs.
Table of Contents
Introduction
Dataset
Download the CSV to follow along:
CISA Known Exploited Vulnerabilities Catalog **
** as of June 2022 the column in the CSV we are interested in is named “vendorProject”
Problem Statement
Our end goal:
How many entries does each vendor have in the CISA Known Exploited Vulnerabilities catalog?
PowerShell Commands Used
How we get there:
LINE 1 imports the CSV file as $csv
LINE 2 gets each unique vendor as a string and stores it as $vendor_all
LINE 3 counts the number of times the string is found in the vendorProject column of $csv
for each of the entries in $vendor_all
Worked Example
LINE 1 is self-explanatory.
In LINE 2, $csv | select -expand "vendorProject"
fetches all the entries in the vendorProject column of the CSV file. However, it selects every entry — we are only interested in each unique entry i.e. only the first occurrence of each vendor, so we pipe it to sort -u
.
Note that
select
is the alias ofSelect-Object
as opposed toSelect-String
. You can check this withalias select
— you can do the same withalias sort
to determine its “long-form”. Because PowerShell makes it easy to check aliases, I fall in the camp that prefers it to make scripts less unwieldy.
Finally, we store the output as $vendor_all
so we can use it in LINE 3.
LINE 3 is where the fun is.
Because we want to count the number of entries in the catalog for each vendor, let’s begin with a ForEach-Object block — convince yourself that “%” is an alias with alias -Definition ForEach-Object
:
%{}
The vendors we are looking for are in $vendor_all
so we pipe them in. Within the %{}
script block, the $_
variable represents the current object of the loop. Convince yourself by running:
$vendor_all | %{$_}
We will store this current object as the separate variable $vendor
as we will be using a second script block within this script block, so $_
will represent a new “current object” instead… Hopefully this will make sense when we reach that bit.
$vendor_all | %{$vendor = $_}
Now that we have got $vendor_all
in our script block, we want get $csv
in so we have something to match $vendor
against. We begin on a new line within the script block:
$vendor_all | %{
$vendor = $_
$csv
}
The column we want to match against is vendorProject, so let’s fetch it. Note
- that “?” is an alias for Where-Object, and
- how
$_
now represents the current object from$csv
instead!
$vendor_all | %{
$vendor = $_
$csv | ?{$_."vendorProject"}
}
And for each new piping of $csv
, look for each occurrence of vendorProject that matches the current $vendor
string from $vendor_all
.
$vendor_all | %{
$vendor = $_
$csv | ?{$_."vendorProject" -eq $vendor}
}
Then count it:
$vendor_all | %{
$vendor = $_
$csv | ?{$_."vendorProject" -eq $vendor} | measure | select Count
}
This outputs just the numbers. So we can actually see which vendor it is in the first place, let’s place that entire expression in (brackets)
so it executes first, then print $vendor
and a tab
character in front of it:
$vendor_all | %{
$vendor = $_
$vendor + "`t" + ( $csv | ?{$_."vendorProject" -eq $vendor} | measure | select Count )
}
Finally, we can turn it into a one liner by using a “;” in place of the new line:
$vendor_all | %{ $vendor = $_ ; $vendor + "`t" + ( $csv | ?{$_."vendorProject" -eq $vendor} | measure | select Count ) }
Hopefully, you get an output similar to below:
Extra Credit
To be honest, that LINE 3 command is just the beginning — I have found real value in querying a second column if the first column meets a certain criteria. Take the following Problem Statement for example:
How many products does each vendor have in the CISA Known Exploited Vulnerabilities catalog?
The data of interest is in a different product column depending on what is found in the vendorProject column.
Take a look at the following, the difference being in the LINE 3 script which actually do not use any commands not already covered: