VCDX Journey – Part 2 – Where I Messed It Up


This post will cover those bits where I messed it up. In case you missed Part 1 of this series, check it out here.

So in no particular order:

Insufficient technical depth

This might be a strange one to some, especially people just starting out on this path. You'll likely hear "keep your presentation at the logical level" from a lot of people. This is true for the most part, and you should absolutely avoid getting tangled in a time-wasting explanation of, say, how multi-NIC vMotion works. Instead, focus on how that decision fulfilled a business need (things like minimizing maintenance windows). But I realized after the first two attempts that the panelists were prodding me for depth of knowledge here and there. I looked at my feedback for both attempts and that appeared to confirm it for me. You might think my documentation was weak in the sections where I was asked for more technical depth. Well, maybe that's what the panelists thought. We'll never know the rubric till we become panelists ourselves ;). Anyway, it's a given they're asking you questions to help you score points. It's also commonly said in study groups that if you've used (or substituted into the submitted design) an all-flash VSAN, which removes those pesky LUN numbering and sizing decisions, RAID sets and so on, the panel perhaps thinks you've taken the easier route and avoided having to discuss design decisions that would've been made with, say, a tiered/traditional SAN. Whatever the case may be, the questions appear to be aimed at getting the candidate to justify and score better. Very few people can explain this better than my friend Jason Grierson has done here.

So for my third attempt, I changed my presentation strategy. I mixed it up – threw in a dash of physical-level decisions and the associated justification here and there (especially when discussing cluster, storage and network design). This let me show some of the technical depth the panel appeared to be after. One of my close mates and study crewmen, Kim Bottu, had told me at least twice that I was too light on technical depth. I should have heeded that advice. No point in dwelling on what could've been or should've been.

Rushing it

I rushed my second defense, 100%. My crew thinks I rushed the third one too – well, maybe. But the second one, for sure. I also got rattled when I saw Grant Orchard and Hugo Phan in the room, even though Grant's a good friend and was acting as the proctor.

Back to the rushing-it part: I went back in angst after my first go. I didn't prepare as well as I did for my third attempt. I didn't improve my documentation package. I didn't go in armed with all that performance testing and validation, failure-scenario testing and production-ready state analysis data. I didn't work on the design scenario enough (more on this in a bit). I should've listened in on other people's mocks. In other words, I should've taken the time to prepare better, not just harder. Takeaway: don't go in out for vengeance. Be honest with yourself, put yourself in the panelists' chairs and ask those difficult questions. Like someone said, if you can't explain it simply, you don't understand it well enough.

Not enough documentation

Sure enough, they only ask for those 5-6 PDFs in the submitted package. That's all I gave them for the first and, mostly, the second submission. My crew advised me to give them more. Thing is:

  • Production readiness testing/analysis – You've designed the solution, great. But how does the panel know it worked for the needs of the business? This analysis should tell them that.
  • Requirements Traceability Matrix – There's got to be something graphical that maps requirements back to your decisions. This matrix should tell them that (see the sketch after this list for the idea).
  • It was pointed out to me in my second defense that I was missing metrics, triggers, alerts and the actions taken. So for the third submission, I gave them a PDF distilling the SOPs, parts of the design and the implementation plans to show what actions would be taken, when they'd be taken, at what trigger point and as a result of which metric.
  • Hope you get the drift.
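If it helps, here's a minimal sketch of what I mean by the traceability piece. The requirement IDs, decisions and document references below are made-up examples for illustration, not entries from my actual submission:

```python
# Hypothetical requirements-traceability sketch (example entries only).
# Each requirement maps to the design decision(s) that satisfy it and the
# document/section where the justification lives.
traceability = [
    # (requirement ID, requirement summary, design decisions, documented in)
    ("R001", "99.9% availability for Tier-1 apps",
     ["HA enabled on all clusters", "N+1 host sizing"], "Architecture 3.2"),
    ("R002", "Maintenance without service outage",
     ["Multi-NIC vMotion", "Fully automated DRS"], "Architecture 4.1"),
    ("R003", "Recover Tier-1 apps within 4 hours",
     ["Array-based replication to the DR site"], "DR Plan 2.3"),
]

# Print a simple matrix so every requirement visibly traces to a decision.
for req_id, summary, decisions, doc_ref in traceability:
    print(f"{req_id} | {summary} | {'; '.join(decisions)} | {doc_ref}")
```

The real artefact can be a table or a diagram; the point is that a panelist should be able to pick any requirement and land on the decision (and document) that addresses it.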

Here's what I submitted for the third go (call it overkill if you like – I know they definitely won't go in and read every one of these docs, but completeness doesn't hurt). There were 83 documents in total, including 55 SOPs; bundle the SOPs into one file and the DRS rules documents into another and that's about 25 files, and they are:

[Image: Submission – list of submitted documents]

Not enough quantification of performance-based design decisions

I got caught out big time on this one in my second defense. Now, I can't reveal what they cornered me on, but I can give you an example: PVSCSI drivers. We know these drivers are great for workloads driving high I/O (think a million-plus IOPS). So there's this performance improvement we know of. But how much? Performance of what? What was gained, in which case, for which VMs, and why not other VMs? You MUST be able to quantify any performance gains. Defending a design decision with:

  • oh, this is how Ops wanted it, or how they've been doing it all along.
  • this is how I standardized it.
  • the vendor recommended it.
  • the vendor wouldn’t support without it.

is highly unlikely to fly with the panel. Now, the last bullet point may be up for debate, and maybe I just couldn't convince the panel with that justification even though that's how it actually was in the real solution I designed. I cannot recommend enough that you defend performance-based decisions with:

  • testing results before and after the change (of, say, the paravirtual SCSI drivers in this case) with hard numbers.
  • sufficient and succinct justification.

Here’s an example to hopefully get the message across:

[Image: IOPSsettings]
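And if numbers help, here's a bare-bones sketch of the kind of before/after comparison I mean. Every figure below is a placeholder made up for illustration, not a result from my testing:

```python
# Hypothetical before/after storage test comparison (placeholder numbers only).
# The point: capture hard numbers for the same workload before and after the
# change (e.g. LSI Logic SAS vs PVSCSI) and state exactly what improved.
before = {"adapter": "LSI Logic SAS", "iops": 52_000, "avg_latency_ms": 4.8, "cpu_cost_pct": 14.0}
after  = {"adapter": "PVSCSI",        "iops": 61_000, "avg_latency_ms": 3.9, "cpu_cost_pct": 10.5}

iops_gain = (after["iops"] - before["iops"]) / before["iops"] * 100
latency_gain = (before["avg_latency_ms"] - after["avg_latency_ms"]) / before["avg_latency_ms"] * 100
cpu_gain = (before["cpu_cost_pct"] - after["cpu_cost_pct"]) / before["cpu_cost_pct"] * 100

print(f"IOPS improved by {iops_gain:.1f}%, latency by {latency_gain:.1f}%, "
      f"CPU cost per I/O by {cpu_gain:.1f}% for the tested workload.")
```

The tooling doesn't matter. What matters is being able to say "for this workload, IOPS went up by X% and latency dropped by Y% after the change, and that's why the decision stands."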

Insufficient justification

This is probably the difference between a pass and a fail. After all, if decisions cannot be justified to the panelists' satisfaction, the number isn't achieved. I've probably thrown some tidbits into the sections above already, but if a design decision cannot be adequately justified as:

  • helping solve a business problem, and/or
  • helping solve a technical issue.

you’d be better off:

  • explaining it better, or
  • taking the decision out completely or simplifying the situation.

The aim should be to use the fewest possible turns of the knobs to achieve an objective. Looking back at two of my design decisions, which I believe cost me my number in the third attempt, I shot down my chances by persisting with them. See, the thing is, something may have worked really well in the real world, but this is a vendor certification in which a candidate is expected to answer and justify in a certain way.

Example 1 – you disabled Admission Control and monitored with vROps – great, maybe it worked for the real design – BUT be prepared to explain it to the folks on the other side of the table. Alternatively, just roll with the default of enabling it (you'd need to justify it regardless, but it might just be easier).

Example 2 – you used the Ingestion API for log forwarding, great – BUT, again be prepared with adequate and succinct justification!

Simply put, there were probably more than enough instances (especially in the design scenario) where the panel wasn't satisfied with my replies, resulting in me not getting that elusive number.

Insufficient understanding of failure scenarios

Don’t take this one lightly. I’ll provide two examples here of what I mean.

Say your array has 4 controllers. Say they operate in pairs. You gotta know:

  • What’d happen if one controller in one pair fails.
  • What’d happen if a pair fails.
  • What’d happen if one controller fails in each pair.

Say you have 4 chassis of converged technology in 2 racks. You gotta know:

  • What’d happen if a chassis fails.
  • What’d happen if a rack fails.
  • What’d happen if the site or ISL fails.

Now while we’re all technically minded and would love to give a technical explanation of overcommitment, memory ballooning and what not, that’s not quite what the panel’s after. They’re after things like:

  • Impact on service availability.
  • Impact on service recoverability.
  • Impact on costs.
  • Impact on operational capability and efficiency.
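If it helps to make that concrete, here's a rough prep-sheet sketch I could have used to force myself to write down the business impact of every failure scenario before walking into the room. The scenarios and impact notes are hypothetical examples, not from my design:

```python
# Hypothetical failure-scenario prep sheet (examples only).
# For every failure mode, note the *business* impact first; keep the deep
# technical explanation in your back pocket for when the panel digs.
impact_categories = ["availability", "recoverability", "cost", "operations"]

scenarios = {
    "one controller in a pair fails": {
        "availability": "no outage, array runs on the surviving controller",
        "recoverability": "none required, hot-swap replacement",
        "cost": "vendor support call, no capital cost",
        "operations": "reduced performance headroom to monitor until replaced",
    },
    "a whole chassis fails": {
        "availability": "HA restarts affected VMs on the remaining chassis",
        "recoverability": "RTO bounded by HA restart time for Tier-1 apps",
        "cost": "possible SLA penalties if restart exceeds the agreed window",
        "operations": "cluster runs at reduced capacity until hardware is replaced",
    },
}

for scenario, impacts in scenarios.items():
    print(f"\nIf {scenario}:")
    for category in impact_categories:
        print(f"  - Impact on {category}: {impacts[category]}")
```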

I didn’t demonstrate enough understanding of some of the above business impacts. Simple as that. Gotta be:

  • Talking business impacts.
  • Clinical and concise.
  • Willing and able to go level 300 with a technical explanation.

Insufficient design scenario practice and/or weak strategy

Again, in some study circles there's this belief that the design scenario carries less weight than the defense. No one but the panelists and the program manager knows the scoring rubric. Having done this thing three times, I believe it's got just as much weight as the defense, if not more. Sure, some people reckon they had a great defense and a poor design scenario and still passed (I was like that in my third go and still didn't pass). Let's not discuss how or why they passed and you/I didn't. This section's about how I messed the scenario up each time, as much as I hate to admit it:

  • Designing for the SLA. Surely I should have learned after two attempts and a number of mocks. The pressure got me just about every time. The third time I was humming along nicely, on track to get that number, when the panelists threw a wrench into my wheels and caused a train wreck. I can't write publicly about what happened, but the sudden change in direction threw me off. It shouldn't have, I know. I messed up the RPOs and RTOs, simply put (there's a rough sketch after this list of the sanity check I should have been doing). The panelists smelled blood, homed in for the kill and did it well by confusing me further. In reality, they were probably just trying to see how I'd react to a tough situation. I lost the number in those last 10 or so minutes.
  • Too quick to draw conclusions. This applied more to my first two attempts; in the third I was good in this regard. I assumed, too conveniently, that they had the same availability SLA for all the applications they told me about. I also resorted to using hardware-provided clustering in the second attempt. What was I effing thinking 🙁
  • Risks in the scenario. I didn't talk enough about the risks associated with my design decisions. While I went with a VMware technology where you start small and scale as you go – happy days, you'd think – I didn't call out things like sudden growth spikes, the need for staff training, and the availability and performance impact. Further, I didn't throw in any mitigations either, quite obviously.
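On the RPO/RTO point above, here's the kind of simple sanity check I should have been running in my head for every application the panel threw at me. The applications, figures and protection choices below are hypothetical:

```python
# Hypothetical RPO/RTO sanity check (applications and figures are examples only).
# For each application, confirm the chosen protection method actually meets
# the stated SLA instead of assuming one SLA fits all.
apps = [
    # (app, required RPO min, required RTO min, replication interval min, recovery time min)
    ("Order processing", 15, 60, 5, 30),
    ("Reporting",        240, 480, 60, 120),
    ("Legacy billing",   15, 60, 60, 240),   # protection choice doesn't fit the SLA
]

for name, rpo, rto, repl_interval, recovery_time in apps:
    rpo_ok = repl_interval <= rpo      # worst-case data loss is one replication interval
    rto_ok = recovery_time <= rto      # worst-case downtime is the recovery/restart time
    verdict = "OK" if (rpo_ok and rto_ok) else "redesign or call it out as a risk"
    print(f"{name}: RPO {'met' if rpo_ok else 'missed'}, "
          f"RTO {'met' if rto_ok else 'missed'} -> {verdict}")
```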

Am I going to keep going till I get my number? I leave the below picture of Stan Wawrinka’s tattoo of a Samuel Beckett quote for you to figure out.

[Image: Fail Again – Stan Wawrinka's tattoo of the Samuel Beckett quote]

Part 3 of this series will discuss the design scenario part of the process.

