

the database discovery

This is probably my most interesting story so far at this job. No lie, I really did discover a database in production that no one else knew existed.

It starts when Kobi, AppCard's Operations Director, approached me one day and said, "Hey Viet, can you look into why one of our jbrains wasn't backed up?".

For context, jbrains are the on-prem devices AppCard deploys to our customers (the grocery stores). These brains sit in the grocery store's network and communicate with the various Point-of-Sale devices to administer coupons, the loyalty system, etc. The jbrain is highly configurable, as each grocer has different needs and integrations.

I knew the jbrain's "backups" were really just daily copies of these configuration files, stored on a server in AWS (we will leave aside the question of why not S3). With these files, a replacement jbrain can be "built" with the same configurations in case of hardware failure or the like.

After confirming that the jbrain in question indeed had no backups, and that other jbrains were missing their backups too, the only suspects were a bug in the backup process or a bug on the backup server. Now, I know this backup server: the tech support guys and I use it every day to do our work, but it's a holy mess of scripts created by half a dozen sysadmins that I never got a knowledge transfer on, so we couldn't start our search there. How about the process? Did we know how the backup process worked? Of course we didn't. And just as obviously, the guys who actually built it were long gone and didn't leave behind any documentation on either the process or the server. The only clue I had was someone mentioning: "I think it's scheduled to run daily at 1 or 2AM or something".

Now that could mean anything, but to me, that sounded like a crontab. At the very least, I hoped the crontab existed on the backup server and not some other server, 'cause oh boy do we have a lot of servers (as an aside, this monstrosity of complexity is being worked on, with no end in sight). I found a way to output every possible cronjob (across users and cron directories), and nestled in all those jobs was one labeled "daily jbrain backups". Aha!

But wait, that backup script is in perl. I didn't know perl, but I had the spirit of all engineers: the conviction that we can figure anything out. It's actually quite an intuitive language, and all you really need to debug is the ability to print to stdout.

I quickly found that this backup perl script relies on a text file with a list of stores to know what to back up. Grepping that list, we could see it was certainly missing many, many stores. So what populates this file?

At this point I could have done a combination of find/grep, but thankfully I noticed that this text file was last modified at 11PM on the dot the previous day. Lol, crontab again it is. I scanned the crontab output from the previous step, and what do you know: another perl script.

This time, I noticed something peculiar. The perl script started calling /usr/bin/mysql with some variables. Chasing down these variables led to some env files, and at this point I realized it was calling a database that I didn't know about. This database wasn't in my training, it wasn't ever mentioned by the support engineers, it wasn't on the Google Sheet listing the databases maintained by the ex-database administrator, and it wasn't part of my knowledge transfer with the ex-sysadmin either.

I called Kobi and told him the situation, and we shared the kind of chuckle reserved for situations of absurdity.

Back to work, I obviously started by logging into this MariaDB lost through time. There were only a few tables, nothing mind-blowing. Armed with the perl script, I traced what it was doing with the database. Once I figured out how to run the script, the bug quickly became apparent: an error the script didn't handle when it tried to insert rows into the database. For a moment, it was the developer's happy debug loop of modify, run, read, until eureka!

Anyway, the issue itself isn't important (it's fixed by now: ancillary data was missing because new jbrains had received a recent upgrade), but the discovery of the database is. This database, until we can move on from it, is a critical part of the company's infrastructure. Its existence, even mostly unmanaged as it is now, changed how development for operations can move forward. We started documenting it. Though opportunities are few, future development did consider whether we could use that database. Once I figured out how to get myself superuser access, I even started adding new tables for my own development needs.

Looking back, I think of this story as a fond discovery. The CTO was definitely pleased to hear about the find. And I think it's a lesson in how effective but forgotten scripts and software can quietly run for years, until the day something breaks.

PS: We are starting to centralize the various perl and bash scripts across servers and put them under version control. Not forgetting those too!

the data recovery

Many developers will have done this, and some probably do it as a daily routine, but my recent work on a data recovery job felt like the latest expression of my career's progress so far.

The Problem

After being notified by some customers, AppCard discovered that a real-time SQS data queue provided by a third party hadn't provided real-time data in a while. Though we were able to quickly notify the third party to bring that system back online, we still had approximately 4 days of data missing and unprocessed.

{% include centerImage.html url="/assets/DataRecovery/not_my_problem.gif" desc="What I wanted us to say to them but they said this to us first" title="The 3rd-party didn't say this, but more like 'we don't want to deal with this'" alt="Jimmy Fallon on The Tonight Show saying 'This sounds more like a you problem'" %}

Based on business considerations, we decided it would be best if we could recover the data without needing help from the third party (instead of telling them that they should be doing this, because after all it's their fault). When the integration lead hesitated to take on this responsibility due to allocation constraints, I volunteered for the challenge. There were two questions I had to answer before even committing (because free credits for customers are expensive, but these checks are simple): 1. Is it possible to retrieve the data from the third party's available API? 2. How long would it take to implement this?

{% include centerImage.html url="/assets/DataRecovery/give_the_money.gif" desc="How I imagine any average customer hearing about missing data" title="The greed of man is insatiable" alt="Scene from the show Friends where Phoebe grabs Ross then threateningly says 'Give me your money, Punk'" %}

The Fix

Firing up a Jupyter notebook, I got to work. Quickly, with the right secrets pulled from the right place and just by reading the documentation, I was able to confirm that the third party's API seemed to provide the data we needed (we'll leave aside the question of why we rely on an SQS instead of this API ;) ). Additionally, after quickly skimming through our integration subsystem, I identified a location in the flow where the right data could be injected with the right dummy setup.

{% include centerImage.html url="/assets/DataRecovery/in_theory_possible.gif" desc="I was 70% sure I could do it" title="The line between confidence and arrogance is thin" alt="Some dude on a red couch saying 'In theory it's possible'" %}

Gauging my own speed of development, considering that realistically I only grasped maybe 60-70% of how to use the API or the integration subsystem, and adding some buffer, I estimated 2 days for implementation and 1 day to run the recovery process. I presented my findings to the business and tech leads that afternoon, and they gave me the green light to go ahead.

{% include centerImage.html url="/assets/DataRecovery/you_got_this.gif" desc="I didn't include a few worrying discussions of possible side-effects" title="Bill Murray would make a great tech lead" alt="Bill Murray in a suit with left eyebrow raised while holding a wine glass on his left hand and pointing at the screen with his right hand at the viewer with caption 'You Got This'" %}

Our async infrastructure and integrations are already built on the Python framework Celery, convenient ground for this one-off development. The simple overview of the job: it would pull data for 100 transactions at a time, process it, and repeat until it hit a transaction outside the 4-day gap. I made sure to provide sufficient optional parameterization in case I needed to restart the jobs if they failed or stopped unexpectedly. Since we can only deploy once a day, a struggling but still-running process is better than waiting for the following day to fix the code and start over. This also meant an almost excessive amount of logging, so as to have intimate visibility into how the recovery task was going and to have the necessary parameters at hand if the job needed restarting.

{% include centerImage.html url="/assets/DataRecovery/laying_train_tracks.gif" desc="Conceptual visual of my architecture" title="I'm Gromit" alt="The beagle Gromit from the series Wallace and Gromit riding a toy train and laying down the train tracks for that toy train as fast as he can so he won't crash" %}

Once I felt comfortable, I tested my task in our pre-production environment, though admittedly our pre-production data is very different from real production. There were immediate hiccups once this was merged into production: one of our assumptions turned out to be incorrect, and sometimes the async job didn't automatically repeat even though there was more data in the gap to query. Thankfully, because of the logging, I could manually re-trigger the jobs with the right parameters. This meant more human intervention, but it still allowed the job to finish.

{% include centerImage.html url="/assets/DataRecovery/phew.gif" desc="I didn't do this cause I was sitting next to the business, but I was this internally" title="A lot of internal self-praise" alt="Some guy wiping his brow" %}

The Conclusion

In the end, almost none of our customers even noticed the data gap. Shoppers got their points, and we didn't need to give anyone any extra credit. My teammates could focus on other tasks while I proved to myself that I can solve vague, unknown problems on my own. This mini-project was well-delivered, well-scheduled, and had real, immediate business impact on the bottom line. Coming home that day, I felt like I'd earned my paycheck.

{% include centerImage.html url="/assets/DataRecovery/honest_work.jpg" desc="Professional pride feels good" title="Couldn't find the gif for this" alt="The meme with the farmer and caption 'It ain't much, but it's honest work'" %}

fintech 1 interview

I got through the online assessment and first-round interview with a NY fintech company. Here are my reflections on the two parts. A learning experience, that's for sure.

5 interview questions: C++, Area 2

Area Number Two: Object-Oriented Programming

We shouldn't hire SDEs (arguably excepting college hires) who aren't at least somewhat proficient with OOP. I'm not claiming that OOP is good or bad; I'm just saying you have to know it, just like you have to know the things you can and can't do at an airport security checkpoint.

Two reasons:

1) OO has been popular/mainstream for more than 20 years. Virtually every programming language supports OOP in some way. You can't work on a big code base without running into it.

2) OO concepts are an important building block for creating good service interfaces. They represent a shared understanding and a common vocabulary that are sometimes useful when talking about architecture.

So you have to ask candidates some OO stuff on the phone.

a) Terminology

The candidate should be able to give satisfactory definitions for a random selection of the following terms:

  • class, object (and the difference between the two)

A class, by analogy, is the blueprint/template for an object, while an object is an instance of the class (like a built house).

In C++, classes are simple:

class cPoint
{
    public: 
        int X;
        int Y;
};

There is another holdover from the days of C:

typedef struct
{
    int X;
    int Y;
} sPoint;

or just:

struct sPoint
{
    int X;
    int Y;
};

  • instantiation

  • method (as opposed to, say, a C function)

  • virtual method, pure virtual method

  • class/static method

  • static/class initializer

  • constructor

  • destructor/finalizer

  • superclass or base class

  • subclass or derived class

  • inheritance

  • encapsulation

  • multiple inheritance (and give an example)

  • delegation/forwarding

  • composition/aggregation

  • abstract class

  • interface/protocol (and different from abstract class)

  • method overriding

  • method overloading (and difference from overriding)

  • polymorphism (without resorting to examples)

  • is-a versus has-a relationships (with examples)

  • method signatures (what's included in one)

  • method visibility (e.g. public/private/other)

These are just the bare basics of OO. Candidates should know this stuff cold. It's not even a complete list; it's just off the top of my head.

Again, I'm not advocating OOP, or saying anything about it, other than that it's ubiquitous so you have to know it. You can learn this stuff by reading a single book and writing a little code, so no SDE candidate (except maybe a brand-new college hire) can be excused for not knowing this stuff.

I draw a distinction between "knows it" and "is smart enough to learn it." Normally I allow people through for interviews if they've got a gap in their knowledge, as long as I think they're smart enough to make it up on the job.

But for these five areas, I expect candidates to know them. It's not just a matter of being smart enough to learn them. There's a certain amount of common sense involved; I can't imagine coming to interview at Amazon and not having brushed up on OOP, for example. But these areas are also so fundamental that they serve as real indicators of how the person will do on the job here.

b) OO Design

This is where most candidates fail with OO. They can recite the textbook definitions, and then go on to produce certifiably insane class designs for simple problems. For instance:

They may have Person multiple-inherit from Head, Body, Arm, and Leg.

They may have Car and Motorcycle inherit from Garage.

They may produce an elaborate class tree for Animals, and then declare an enum ("Lion = 1, Bear = 2", etc.) to represent the type of each animal.

They may have exactly one static instance of every class in their system.

(All these examples are from real candidates I've interviewed in the past 3 weeks.)

Candidates who've only studied the terminology without ever doing any OOP often don't really get it. When they go to produce classes or code, they don't understand the difference between a static member and an instance member, and they'll use them interchangeably.

Or they won't understand when to use a subclass versus an attribute or property, and they'll assert firmly that a car with a bumper sticker is a subclass of car. (Yep, 2 candidates have told me that in the last 2 weeks.)

Some don't understand that objects are supposed to know how to take care of themselves. They'll create a bunch of classes with nothing but data, getters, and setters (i.e., basically C structs), and some Manager classes that contain all the logic (i.e., basically C functions), and voila, they've implemented procedural programming perfectly using classes.

Or they won't understand the difference between a char*, an object, and an enum. Or they'll think polymorphism is the same as inheritance. Or they'll have any number of other fuzzy, weird conceptual errors, and their designs will be fuzzy and weird as well.

For the OO-design weeder question, have them describe:

What classes they would define.

What methods go in each class (including signatures).

What the class constructors are responsible for.

What data structures the class will have to maintain.

Whether any Design Patterns are applicable to this problem.

Here are some examples:

Design a deck of cards that can be used for different card game applications.

Likely classes: a Deck, a Card, a Hand, a Board, and possibly Rank and Suit. Drill down on who's responsible for creating new Decks, where they get shuffled, how you deal cards, etc. Do you need a different instance for every card in a casino in Vegas?

Model the Animal kingdom as a class system, for use in a Virtual Zoo program.

Possible sub-issues: do they know the animal kingdom at all? (I.e. common sense.) What properties and methods do they immediately think are the most important? Do they use abstract classes and/or interfaces to represent shared stuff? How do they handle the multiple-inheritance problem posed by, say, a tomato (fruit or veggie?), a sponge (animal or plant?), or a mule (donkey or horse?)

Create a class design to represent a filesystem.

Do they even know what a filesystem is, and what services it provides? Likely classes: Filesystem, Directory, File, Permission. What's their relationship? How do you differentiate between text and binary files, or do you need to? What about executable files? How do they model a Directory containing many files? Do they use a data structure for it? Which one, and what performance tradeoffs does it have?

Design an OO representation to model HTML.

How do they represent tags and content? What about containment relationships? Bonus points if they know that this has already been done a bunch of times, e.g. with DOM. But they still have to describe it.

The following commonly-asked OO design interview questions are probably too involved to be good phone-screen weeders:

Design a parking garage.

Design a bank of elevators in a skyscraper.

Model the monorail system at Disney World.

Design a restaurant-reservation system.

Design a hotel room-reservation system.

A good OO design question can test coding, design, domain knowledge, OO principles, and so on. A good weeder question should probably just target whether they know when to use subtypes, attributes, and containment.

5 interview questions

It's not 5 interview questions, it's 5 categories of interview questions.

Goal: do these all in C++, Java, and Rust

Steve Yegge influenced a lot of people with this post.
