Monk Coding Challenge 1: wc tool in C • ShellMonk

Intro

Hello, dear reader. Welcome to the first installment of the Monk Coding Challenges, where we write the wc tool in plain, old C. If you want to skip the text and go straight to the conclusion, jump to the final verdict.

C and me

My first experience with C was in high school. Before that, I mostly coded in Visual Basic 6, and after being code-shamed several times by some random online elitists for using VB6 instead of something “more serious,” I decided to start learning C.

The experience was, to put it lightly, painful. At first, I could not understand why this language didn’t have a cute editor for drawing user interfaces as Visual Basic did. It took me a while to figure out that to create a simple window, you need to write about 70 lines of obscure Win32 API code, which I still don’t like and which was the main reason I switched to Linux and have kept it as my primary operating system to this day. Thanks to the widespread internet piracy of the early 2000s, I managed to download two books, The C Programming Language and Advanced Programming in the UNIX Environment, from some shady “warez” websites (I’m not proud of it, but to redeem myself, I purchased the physical copies a few years later).

After months of pain, I reached a magical point where I seriously fell in love with computing. I was learning about POSIX, low-level memory management, file descriptors, signals, syscalls, and the ingenious simplicity of the UNIX realm. I felt like I was seeing through the Matrix for the first time, and C was the little boat I was floating on in the ocean of computation. Spending days debugging in gdb was soul-crushing sometimes, but life was beautiful.

Thank you, C.

Solving the challenge

Language choice for the first challenge was easy. I saw the task and wanted to let some C through my fingers. The last time I wrote some serious C code was a decade ago, so I might be a bit RUSTy (he-he, pun intended).

NOTE: Code intentionally wasn’t ported to Windows

First of all, two points in developing a solution require us to engage a bit more of our neurons:

Input: We need to read from both a file and standard input (note: the file can be huge, and we don’t know the final size of the standard input stream)
Counting: words = String.split(' ') is not going to work on streams, and character != byte
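To make the "character != byte" point concrete, here is a minimal sketch that counts code points in a UTF-8 string by hand. The helper name utf8_chars() is made up for this illustration, and it assumes the input is valid UTF-8; it is not part of the actual solution, which relies on the wide-character functions instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts UTF-8 code points in a NUL-terminated string.
 * Assumes valid UTF-8. Continuation bytes look like 10xxxxxx and are
 * skipped; every other byte starts a new code point. */
size_t utf8_chars(const char *s) {
  assert(s != NULL);
  size_t n = 0;
  for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++) {
    if ((*p & 0xC0) != 0x80) {
      n++;
    }
  }
  return n;
}
```

For the word "naïve" encoded as UTF-8, strlen() reports 6 bytes while utf8_chars() reports 5 characters, because "ï" takes two bytes.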

1. Input

Since we need to read from both a (maybe huge) file and standard input (of unknown size) in the same way, we cannot load everything into memory and then run our counting algorithm; we need to read everything as a stream. The beauty of UNIX’s “everything is a file” philosophy shines here. What if I told you that you could treat files on your disk, network sockets, standard input, and many other abstractions that seem unrelated at first glance in exactly the same way? If you don’t trust me, let me show you something:

// ...
/**
 * counter structure containing counted bytes, 
 * characters, words and lines
 */
typedef struct  {  
  unsigned long bytes;
  unsigned long chars;
  unsigned long words;
  unsigned long lines;
} Counter;

// ...
int main(int argc, char ** argv) {
  // ...
  FILE *inputfd;

  // if file is provided, open it
  // if not, default to standard input
  if(filename) {
    inputfd = fopen(filename, "r");
    if(inputfd == NULL) {
      fprintf(stderr, "[ERROR] Cannot open file: %s\n", filename);
      exit(EXIT_FAILURE);
    }
  } else {
    inputfd = stdin;
  }

  // count values from file descriptor
  Counter counter = read_from_fd(inputfd);

  // close the stream
  fclose(inputfd);

  // ...
}

That is the power of file descriptors, a helpful I/O abstraction from the POSIX API. (Strictly speaking, FILE * is a C standard library stream, but on POSIX systems it is a buffered wrapper around a file descriptor.) Although low-level, file descriptors are one of the building blocks of POSIX-compliant operating systems. If you invest time in understanding them, you can add a potent tool to your programming arsenal (if you want to play with them, try writing a reverse shell, it’s easier than you think; hint: dup2() is your friend).
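If you want to see a raw descriptor at work, here is a small sketch using read() directly. The function name count_bytes_fd() and the 4096-byte buffer are my own choices for the illustration, not something taken from the solution. The same function should work unchanged on a regular file, a pipe, a socket, or standard input (fileno(stdin) gives you the descriptor behind a FILE * stream).

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: reads fd until end of input and returns the number
 * of bytes seen, or -1 on a read error. It never asks what kind of object
 * sits behind the descriptor. */
long count_bytes_fd(int fd) {
  char buf[4096];
  long total = 0;

  assert(fd >= 0);
  for (;;) {
    ssize_t n = read(fd, buf, sizeof buf);
    if (n > 0) {
      total += n;          /* got some data, keep going */
    } else if (n == 0) {
      return total;        /* end of input */
    } else if (errno != EINTR) {
      return -1;           /* real error; EINTR just means "try again" */
    }
  }
}
```

Calling count_bytes_fd(STDIN_FILENO) in a tiny main() gives you a poor man's wc -c that can sit at the end of a shell pipeline.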

2. Counting

As already mentioned, loading everything into the memory is a bad idea. That’s why we’re working with streams.

Now, the counting algorithm is our meat here. We have to do it in a single pass over the stream, in O(n) time and constant memory, since we cannot load everything into memory. Surprisingly, that’s pretty easy to achieve with some creativity. Here’s my simple implementation of the read_from_fd() function (feel free to optimize or change it however you like; I’d be glad if you did):

/**
 * @brief function that counts bytes, chars, words and lines
 * 
 * @param fd input stream (FILE *) we are reading from
 * @return Counter struct defined before
 */
Counter read_from_fd(FILE *fd) {

  // current wide character
  wint_t wc;

  // wopen = word open
  // meaning that word is being read
  bool wopen = false;

  // fancy way of initializing structs
  Counter cnt = { .bytes = 0, 
                  .chars = 0, 
                  .words = 0, 
                  .lines = 0 };

  // scratch buffer for wctomb(); MB_LEN_MAX comes from <limits.h>
  // TODO: This can and should be improved by buffering,
  //       but reading one char at a time 
  //       works for now
  char buff[MB_LEN_MAX];

  // loop until we reach the end of the wide char stream
  while(WEOF != (wc = fgetwc(fd))) {
    cnt.chars++;
    // convert wide character to multibyte
    // and add the length to the sum of bytes
    // (wctomb() returns -1 if wc cannot be encoded)
    int len = wctomb(buff, wc);
    if(len > 0) cnt.bytes += len;

    // if new line, add newline, obviously
    if(wc == L'\n') cnt.lines++;

    // check if wide character is 
    // whitespace - ' ', \n, \t, \r, etc.
    bool space = iswspace(wc);

    // counting the words, nice little algo
    if(wopen) {
      if(space) {
        wopen = false;
      }
    } else {
      if(!space) {
        wopen = true;
        cnt.words++;
      }
    }
  }

  return cnt;
}

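If you look closely at the code snippet above, you’ll notice that we’re also handling wide characters. The C standard library (which POSIX includes) helps again here with the fgetwc(), wctomb() and iswspace() functions. They follow the current locale, so I’m assuming the elided part of main() calls setlocale() before reading. Pretty neat, isn’t it?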
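The TODO about buffering could go roughly like this. This is only a sketch under simplifying assumptions: it works on raw bytes and uses isspace(), so it counts bytes, words and lines but not multibyte characters, and the names ChunkCounter, count_chunk() and count_stream() are invented for the example. The important detail is that the "word open" flag lives in the struct, so a word split across two chunks is still counted once.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical byte-oriented counter; in_word is carried across chunks. */
typedef struct {
  unsigned long bytes, words, lines;
  bool in_word;
} ChunkCounter;

/* Feeds one chunk of input into the counter. */
void count_chunk(ChunkCounter *c, const char *buf, size_t len) {
  assert(c != NULL && buf != NULL);
  for (size_t i = 0; i < len; i++) {
    unsigned char ch = (unsigned char)buf[i];
    c->bytes++;
    if (ch == '\n') c->lines++;
    if (isspace(ch)) {
      c->in_word = false;
    } else if (!c->in_word) {
      c->in_word = true;   /* first byte of a new word */
      c->words++;
    }
  }
}

/* Reads the stream in 64 KiB chunks instead of one character at a time. */
ChunkCounter count_stream(FILE *fp) {
  ChunkCounter c = {0};
  char buf[65536];
  size_t n;
  while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
    count_chunk(&c, buf, n);
  }
  return c;
}
```

Feeding "hello wo" and then "rld\nbye\n" as two separate chunks yields 16 bytes, 3 words and 2 lines, the same as feeding the whole text at once.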

NOTE: You can find full solution here.

Final verdict

Okay, it’s time to sum everything up.

C is a procedural, imperative little language with a simple and unsafe static type system. That means that you need to write every abstraction you want yourself. The good news is that you can emulate pretty much any abstract construct you can find in other languages (objects, interfaces, polymorphism, functional composition, even monads), but be prepared to do a lot of typing and debugging. At the end of the day, you can view C as a glorified assembly with nice syntactic sugar.
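As a taste of what "writing the abstraction yourself" looks like, here is one common way to fake an interface with a function pointer. All names here (Shape, Rect, Square, total_area()) are invented for the illustration, and this is just one of several possible patterns.

```c
#include <assert.h>
#include <stddef.h>

/* "Interface": every shape knows how to compute its own area. */
typedef struct Shape {
  double (*area)(const struct Shape *self);
} Shape;

/* "Implementations": the Shape member comes first, so a pointer to it
 * can be converted back to a pointer to the enclosing struct. */
typedef struct { Shape base; double w, h; } Rect;
typedef struct { Shape base; double side; } Square;

double rect_area(const Shape *self) {
  const Rect *r = (const Rect *)self;
  return r->w * r->h;
}

double square_area(const Shape *self) {
  const Square *s = (const Square *)self;
  return s->side * s->side;
}

/* Works on any mix of shapes without knowing their concrete types. */
double total_area(const Shape *const *shapes, size_t n) {
  double sum = 0.0;
  for (size_t i = 0; i < n; i++) {
    assert(shapes[i] != NULL && shapes[i]->area != NULL);
    sum += shapes[i]->area(shapes[i]);
  }
  return sum;
}
```

You would build a Rect with { { rect_area }, 2.0, 3.0 } and a Square with { { square_area }, 4.0 }, put the addresses of their base members into an array, and let total_area() dispatch through the pointers. It works, but notice how much of the bookkeeping a higher-level language would have done for you.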

Pros

Coding in C feels like playing with lightning that Zeus himself crafted on Mount Olympus. You know you can do anything you can imagine, and you cannot ignore that feeling. Of course, you can also severely hurt yourself (and others) if you’re not careful enough. With great power comes great responsibility. C can give you low-level access to the internals of the machine you’re working on that high-level languages can’t. If you’re skilled enough, you can squeeze out every last clock cycle of your CPU and achieve insane performance.

C is fun if you’re crazy enough. Being unsafe and simple makes it fun to experiment with and do nasty hacks like this.

The ecosystem is HUGE. Some of the world’s largest and most complex codebases are written in C. C is like Rule 34 of programming - if it exists, there’s a C library for it.

Cons

C is a simple language if we look at its grammar, type system, or just the sheer number of keywords (32 in C89; later standards added a few more). You can learn C syntax in one afternoon but spend decades mastering its practical usage.

But although simple, C is challenging to program in. Bad memory management costs our industry billions of USD per year. Memory leaks, buffer overflows, segfaults, dangling pointers, all those nasty phenomena that are making the lives of thousands of developers worldwide miserable are far too easy to introduce with C. And we’re not even talking about concurrently executed code. That’s a world of pain in itself.

When and where to use it

You need low-level access to the hardware (drivers, kernels, network stacks, etc.)
You need the best possible performance and don’t want to deal with assembly
You are working on a system whose interfaces are defined in C style
You have it as part of your university curriculum
You want to flex on your friends and colleagues

C in 2023 and beyond

According to the TIOBE index, C is alive and well and will likely last for years (probably decades). There’s simply too much critical C code in the world that cannot be easily replaced. Those (usually massive) systems must be maintained, expanded, and debugged. Although we’ve seen the rise of “C/C++ replacement” languages in recent years, with candidates like Rust, Go, Zig, Nim, and Odin, I doubt anything will replace C anytime soon. Even C++ failed to do so.

However, it hurts me to say this, but I would only recommend choosing C as a primary language for new projects, especially critical ones, if you really need to. C’s benefits are not enough to justify the risks you are introducing with it. There are safer and more modern alternatives.

Outro

I always like to say how, in IT, soft skills are essential only if you lack hard skills, and yet, here I am, writing a blog post ten times longer than the source code I was describing in it. Strange world we live in.

Anyhow, dear reader, if you came this far, thank you for reading. See you in the next challenge.