GNU Unix utilities are always fun to decode and understand, and this time its the good ole wc.
In this article we are going to attempt to decode and understand how does this simple word/line/byte counting utility program works and may be we try to achieve the same in our favorite language go.
If you were to decode the overall design and control-flow for most of the unix utilities, there is a common architectural style that can be inferred. For wc, I have tried to capture the control-flow using the above image, ofcourse I have simplified some stages. If you wish to jump straight to the code, here is the repo for the code written so far wc-go
For more patient folks, who would like to learn feel free to keep reading.
Show me the code
Now how do we find the source code for this utility. Remember wc is one of the core utilities of GNU operating system, so a simple google search should show you either the git mirror or take you to https://www.gnu.org/software/coreutils, where you can find github link to browse the source code, among other options.
Once we have the source code (I am following the code here), we can start to understand in more detail.
Program Structure
Setup n Initialization
As you can see there are a lot of flags and variables, that are being declared before we even get to the main() function.
Here are some that will be used to define behaviour while printing output. https://github.com/goj/coreutils/blob/rm-d/src/wc.c#L53C1-L72C29
/* Size of atomic reads. */
#define BUFFER_SIZE (16 * 1024)
/* Cumulative number of lines, words, chars and bytes in all files so far.
max_line_length is the maximum over all files processed so far. */
static uintmax_t total_lines;
static uintmax_t total_words;
static uintmax_t total_chars;
static uintmax_t total_bytes;
static uintmax_t max_line_length;
/* Which counts to print. */
static bool print_lines, print_words, print_chars, print_bytes;
static bool print_linelength;
/* The print width of each count. */
static int number_width;
/* True if we have ever read the standard input. */
static bool have_read_stdin;
Kind of self-explanatory.
For instance, these are the options that correspond to the flags that are available to end-user while using wc. https://github.com/goj/coreutils/blob/rm-d/src/wc.c#L92
static struct option const longopts[] =
{
{"bytes", no_argument, NULL, 'c'},
{"chars", no_argument, NULL, 'm'},
{"lines", no_argument, NULL, 'l'},
{"words", no_argument, NULL, 'w'},
{"files0-from", required_argument, NULL, FILES0_FROM_OPTION},
{"max-line-length", no_argument, NULL, 'L'},
{GETOPT_HELP_OPTION_DECL},
{GETOPT_VERSION_OPTION_DECL},
{NULL, 0, NULL, 0}
};
There are some more local flags we can see inside the main()
function. I am adding comments inline to explain wherever it makes sense.
bool ok; // The final return status for the program
int optc; // The next char option to process
int nfiles; // Number of files to count
char **files; // Input file names
char *files_from = NULL; // Name of the reference file holding all target file names
struct fstatus *fstatus; // A custom defined struct holding stat() info and execution status
struct Tokens tok; // This is a token struct based on something called obstacks used in GNU, for input file list.
initialize_main (&argc, &argv); // Redirection and wildcarding when done by utility itself. Generally a no-op
set_program_name (argv[0]); // Sets program name as `wc`
setlocale (LC_ALL, ""); // Sets locale to system's default
bindtextdomain (PACKAGE, LOCALEDIR); // Binds translation files to use for this package for internatiolization
textdomain (PACKAGE); // Sets domain to use for translations
atexit (close_stdout); // Wrapper around on_exit
/* Line buffer stdout to ensure lines are written atomically and immediately
so that processes running in parallel do not intersperse their output. */
setvbuf (stdout, NULL, _IOLBF, 0);
print_lines = print_words = print_chars = print_bytes = false; // Initialize flags with false value
print_linelength = false;
total_lines = total_words = total_chars = total_bytes = max_line_length = 0; //Initialize counters as 0
Parse Flags
Flag parsing is achieved using getopt.c
file from GNU coreutils library called gnulib, which implements or wraps functionality for command-line options parsing.
Covering this library code probably requires a post of its own, but lets see what more we can find out about this library.
Source code for getopt.c
can be found here
Parsing stage is essentially doing
- Utility (wc in our case) defines valid options, check the options called
static struct option const longopts[] =
we have talked of earlier. getopt.c
uses this utility provided list of options to parse, which program input are flags and which ones are positional arguments. For instance, for a sample run likegetopt can figure out thatwc -l --bytes filename1.txt
-l
is a short-option,--bytes
is a long-option andfilename1.txt
is a positional argument.- Figure out if we are counting bytes, characters, words or lines, if no flag is set then compute all ?
- Is the max line length flag is set using
-L
flag. - Is there a file being given as input which contains paths to other target files to analyze. (this is an interesting one as I was inititally unaware of such an option existing.), this is the
--files0-from
option. Condition is file-names in the given file have to be\0
orNUL
terminated. So, basically something like this is totally fine and works!!
Lets say we have a list of file names, these files have maybe special characters or even spaces in them.We can simply write these out to another file withfile1.txt file with spaces.txt file\nwith\nnewlines.txt
\0
termination.And now wc can totally process this list of file namesprintf "file1.txt\0file with spaces.txt\0file\nwith\nnewlines.txt\0" > filelist
outputswc --files0-from=filelist
10 20 100 file1.txt 5 10 50 file with spaces.txt 3 6 30 file\nwith\nnewlines.txt 18 36 180 total
- If no flags, help or version command is in the arguments provided, program prints the usage and exits out as
EXIT_FAILURE
Execute
This is the meat of the program, if we simply count the number of lines, out of the 800 (ignoring the author comments, include statements, variable declarations etc), execution is roughly 600 lines or so.
Loosely these are the steps of algorithm that this program is taking for Execute phase.
- Wait for user-input, could be a single file, multiple files, a file with multiple file paths, or stream of user-entered text, terminated by
EOF
or[^D]
in some environments. - Tokenize the input or file list (in-case of list of files)
- Compute status of each file using fstat() and define a custom fstatus struct
- Compute the optimum number of widths based on arguments and number of files.
- Loop over each file
- Verify file access
- Unable to open -
EXIT_FAILURE
- Unable to read from input source -
EXIT_FAILURE
- Unable to allocate enough RAM to read input -
EXIT_FAILURE
- Unable to open -
- Process file as per the flags( counting bytes, lines, words etc)
- At the end of this file, print relevant counts and update running totals
- Verify file access
- Once all files processed, print the totals in new line -
EXIT_SUCCESS
Go Program
It is vital to remember that it is never a good idea to simply try to translate a program written in one language to another language line-by-line. The essence and purpose can be achieved, but trying to translate is not a good idea.
There are open source libraries (like Cobra and Viper ) that are already battle-tested to create great command-line tools in go. You can check out the list of other cool cmd-line tools that use Cobra here. We are going to use Cobra for our use-case to begin with.
This post got pretty long, so here is the code I managed to put together in golang. I believe as of this writing, the stream input and processing list of files is not implemented yet, along with some other safety features.
If you would like to review the code or have a suggestion for me, please feel free to leave a comment or open a github issue. I would greatly appreciate that.