AWS Intern Reflection: Takeaways Beyond Tech

Summer really flew by. These four months at AWS Vancouver were honestly the most unforgettable and fulfilling summer of my life. Besides Vancouver’s amazing weather, the project I worked on brought me huge challenges and satisfaction, plus the whole team had such an awesome vibe - it was basically a top-tier internship experience.

This blog isn’t your typical technical post. I won’t (and can’t) discuss specific technical solutions. Instead, I want to focus on what I learned throughout the internship and the moments that inspired me - just sharing my thoughts as an intern facing a massive and complex engineering system.


My team is called Ingestion Hub, and we mainly maintain a service called vended logs. Simply put, when users subscribe to an AWS cloud service, they can configure it to deliver its logs to other accounts. It’s basically about transmitting and distributing information, and the main challenge is how to efficiently and reliably process and forward these log messages through an extremely large and complex infrastructure.

I was responsible for optimizing a very specific but critical problem: when a user suddenly sends a huge amount of logs - like instantly producing ten times their normal traffic - certain modules in the system would “overload,” affecting the stability of the entire system. My task was to design and implement an automated “sideline” mechanism to handle this situation.

Sounds straightforward, right? But when I actually started designing, a bunch of questions popped up:

  • How do you define “overload”? Which system metrics should we monitor (CPU? Memory? Queue length?) to accurately judge this? How do we avoid “false positives”?
  • What algorithm should we use for detection? Should we monitor data volume (byte size) or rate (TPS)? How do we ensure the algorithm works in all edge cases?
  • How do we replay after “sidelining”? How do we design an elegant reprocessing mechanism to ensure sidelined data gets reingested without mixing with normal traffic or genuinely bad data?
  • How do we ensure the whole process is efficient and stable? The sidelining and restarting itself can’t become a new performance bottleneck or introduce new bugs.
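To make the first two questions concrete, here is a minimal sketch of a rate-based overload detector that diverts excess traffic to a sideline queue. Everything here is hypothetical (the class names, the sliding-window approach, the 10x-baseline threshold) - it is not the actual AWS design, just one simple way the idea could look:

```python
from collections import deque


class OverloadDetector:
    """Flags a log source as overloaded when its recent rate (TPS)
    exceeds a multiple of its normal baseline rate."""

    def __init__(self, baseline_tps: float, multiplier: float = 10.0,
                 window_secs: float = 5.0):
        self.threshold_tps = baseline_tps * multiplier
        self.window_secs = window_secs
        self.arrivals: deque = deque()  # timestamps inside the window

    def record(self, now: float) -> bool:
        """Record one message arrival; return True if the source is overloaded."""
        self.arrivals.append(now)
        # Evict arrivals that have fallen out of the sliding window.
        while self.arrivals and now - self.arrivals[0] > self.window_secs:
            self.arrivals.popleft()
        current_tps = len(self.arrivals) / self.window_secs
        return current_tps > self.threshold_tps


def route(message, detector, now, main_queue, sideline_queue):
    """Divert traffic from an overloaded source to the sideline queue,
    where it can be replayed later without destabilizing the main path."""
    if detector.record(now):
        sideline_queue.append(message)
    else:
        main_queue.append(message)
```

A real system would have to answer the harder questions too (avoiding false positives on short bursts, replaying the sideline queue without creating a second overload), but even this toy version shows why the metric choice matters: counting messages gives TPS-based detection, while summing payload sizes in the same window would give byte-based detection instead.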

On my first day, my mentor was super honest with me - this project wasn’t easy, and if I could complete it independently, it would be at least L5 level work (one level up from new grad). Later, in my first 1-on-1 with my manager, she even tried to comfort me: “Don’t worry, we’ll definitely consider the project difficulty when making return offer decisions.” This made me feel both stressed and weirdly excited.

Lucky for me, the whole team was incredibly supportive. My mentor, for the first two weeks, would have an hour-long 1-on-1 with me every single day without fail, helping me understand the system and answering my questions.

Even so, facing the objective complexity of the system, I was still in a “completely lost” state for the first few weeks. The part I needed to work on (my scope) was probably only 10% of the whole system, but to understand that 10%, I had to understand the context of the remaining 90%. At first, I could only take the information I received and the code I read and force myself to understand them as simplified “definitions,” but these “definitions” were often inaccurate or even wrong. And because of these understanding gaps, it was really hard to “align” with people who actually understood the system at this stage (sorry for using such corporate speak but it really fits here…). Communication only became effective when both sides had similar levels of understanding.

I still remember that during my Design Doc review in week four, I totally didn’t understand the questions our L7 engineer was asking, but by my final demo at the end of the internship, I could actually hold decent discussions.

My mentor taught me a super effective method: read code with hypotheses. First, form an initial mental model of how the system works through docs and architecture diagrams, then take that model into the code to find evidence that validates or disproves your assumptions. This turns passive “reading” into an active “detective game” - pretty effective actually.

This process also made me deeply realize the huge difference between engineering practice and studying in school. When you learn math in school, any formula or theorem can be derived all the way down to basic axioms (like the 1+1=2 level), and there are no unexplainable “gaps” in the knowledge system. But in large-scale software engineering, you can’t possibly understand every detail. Every engineer can only stop at a certain abstraction layer and trust that the next layer of abstraction is reliable. You can’t review all the knowledge points before starting like in school - the only path is to learn while doing, and do while learning.

It’s like building an airplane - engineers need to understand aerodynamics, materials science, and engine principles, but they don’t need to derive everything from Newton’s laws and the periodic table. They work on an already-built abstraction layer and trust that the underlying physics is solid. An engineer’s job is also to build new bridges between these complex abstraction layers.

Halfway through my internship, after basically completing the core functionality, I went back to read tons of code and documentation, and only then did I feel like I truly understood the abstractions needed for my part. I could finally understand some of the team discussions about existing system problems. This experience of “quantity leading to quality” was really amazing.


Unlike many interns who just implement based on their mentor’s design, my mentor and the whole team really encouraged me to think and design independently. My mentor especially focused on training my understanding of project logic: Why did we design it this way? What’s the root cause behind it?

For intern projects, the team usually already has a preliminary solution. I got a draft design document right from the start - basically giving me partial conclusions. If I hadn’t actively thought about “why these conclusions,” and just kept implementing, I could easily go off track on the details (which I actually did at first).

This ability to ask “Why” is also something AWS values deeply when reviewing major system incidents (COE, Correction of Error). The classic example is the “Lincoln Memorial” story (as retold by ChatGPT):

Background: The Lincoln Memorial’s exterior stone was experiencing serious aging and erosion. Management’s first reaction was: spend big money to replace with more durable stone or increase maintenance.

But by asking “Why” layer by layer, they dug out the root cause:

  • Why was the memorial’s stone aging so badly? → Because the building surface was frequently washed with strong cleaning agents.
  • Why did they need to clean so frequently? → Because there was lots of bird poop on the walls, affecting appearance.
  • Why were there so many birds here? → Because the memorial had tons of spiders around - a feast for birds.
  • Why were there so many spiders? → Because lots of moths and insects gathered here at night.
  • Why were there so many insects? → Because the memorial’s night lighting system turned on too early at dusk, and the bright lights attracted them.

Final conclusion: The root cause of stone aging was turning on the lights too early.

Solution: Not spending huge money on new stone, but adjusting the lighting strategy - delaying the lights by one hour. Fewer insects meant fewer spiders and birds, cleaning frequency dropped dramatically, and the memorial’s stone was protected.

This story really inspired me. When I got stuck during my internship and asked experienced engineers on the team for help, I could deeply feel this mindset of getting to the root cause. They wouldn’t just give me an answer, but would ask me questions in return, guiding me to find the core of the problem myself. This is something I need to keep working on.