
Adventures in Sysadmin

minas.morgul.net is the hub of much of my digital life. It also provides services for quite a few friends, ranging from backup DNS to mailing lists and IRC. It lives in a datacenter 3000 miles away from where I live, with conditioned power, climate control, etc. It's got redundant power supplies, RAID disks, remote console, and most of the other stuff you'd expect from a machine that's supposed to be up and running non-stop. There are a few things, though, that can't really be made redundant. (At least not cheaply.) CPUs are one of those things...

Having established some background, the story picks up this past Thursday morning, when I awoke to find some strange log messages on minas. Some things that don't usually crash had crashed overnight. Some web app stuff had failed in strange ways. My initial suspicion was that somebody was probing some web apps, possibly looking for security flaws to exploit. My fears only increased when some of the commands I was running during my investigation also started crashing. Had somebody broken in and modified the system to hide their presence, but done so in a sloppy way that left things unstable? w(1), a tool to report some system information including the people who are logged in, crashed with a "bus error". Not good.

After some time spent thinking about the best way to recover from a security compromise, which would have been difficult without physical access to the system, I started noticing additional puzzling behavior. The programs that I had seen crash didn't actually seem to *always* crash. Sometimes they'd run just fine. df(1) would sometimes report meaningful disk usage numbers, but other times report wacky numbers that made no sense. The bus errors I'd been seeing typically happen when a program tries to access memory outside of its legal address space. Math errors when calculating pointer addresses. Math errors when calculating disk usage. Hmm. What does math in a computer? The CPU. How many CPUs are in this system? 2. Could it be that one of two CPUs is failing, and that any time the scheduler places a process on that CPU it is susceptible to crashing? How might I find out? Here's where things start to get interesting.

I started looking into possible ways to manipulate the kernel's scheduler to see if I might be able to control which CPU a given process runs on. I discovered the taskset(1) program, which can adjust a process's "CPU affinity". Using this tool, it's possible to target a specific CPU when launching a process. It's also possible to manipulate the CPU affinity of an already running process. Child processes inherit their parent's CPU affinity. So, to start with, it should be pretty easy to determine whether or not a given CPU is bad:

$ taskset -c 0 uptime
Floating point exception 
$ taskset -c 1 uptime 
 14:39:25 up 4 days, 16:28, 6 users, load average: 0.03, 0.22, 0.26
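
The same check can be looped over every CPU in the box. What follows is only a rough sketch, not something from the original session: probe_cpus is a made-up helper name, and it assumes the util-linux taskset(1) is installed. It runs one command pinned to each CPU number it is given and reports whether that command exited cleanly.

```shell
#!/bin/sh
# Sketch: run one command pinned to each listed CPU and report the result.
# probe_cpus is a hypothetical helper name; taskset(1) comes from util-linux.
probe_cpus() {
    cmd=$1
    shift
    for cpu in "$@"; do
        # "taskset -c N CMD" launches CMD with its affinity limited to CPU N.
        if taskset -c "$cpu" "$cmd" >/dev/null 2>&1; then
            echo "cpu $cpu: ok"
        else
            # In the else branch, $? still holds the failed command's status.
            echo "cpu $cpu: FAILED (exit status $?)"
        fi
    done
}

# Example, mirroring the transcript above:
#   probe_cpus uptime 0 1
```

A single pass proves little on a flaky CPU, so it is probably worth running the probe several times before trusting an "ok".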

This was reliable and repeatable. CPU 0 is apparently bad. Time to move long-running processes off of it. To ensure that I got all children, I restarted some daemons (cron, apache) with taskset. My shell processes all got migrated to CPU 1. I set the CPU affinity of init. The server became, in essence, a uniprocessor box when it had previously been a dual-processor system. It has been running for two days like this, and seems reliable.
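
For the record, moving processes that are already running looks roughly like the sketch below. The helper name migrate_pids and the DRY_RUN variable are invented for illustration; the taskset options (-p for an existing PID, -a for all of its threads, -c for a CPU list) are the util-linux ones. By default it only prints the commands, since re-pinning everything on a live box is not something to do by accident.

```shell
#!/bin/sh
# Sketch: move already-running processes onto one CPU with taskset(1).
# migrate_pids and DRY_RUN are hypothetical names used for illustration.
migrate_pids() {
    target_cpu=$1
    shift
    for pid in "$@"; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            # Default: only print the command that would be run.
            echo "taskset -a -p -c $target_cpu $pid"
        else
            # -p operates on an existing PID, -a covers all of its threads,
            # -c takes a CPU list instead of a bitmask.
            taskset -a -p -c "$target_cpu" "$pid"
        fi
    done
}

# Example (needs root to touch other users' processes):
#   DRY_RUN=0 migrate_pids 1 $(ps -e -o pid=)
```

Because children inherit affinity, pinning init this way only affects processes started afterwards; anything already running has to be moved or restarted individually.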

I have no idea what will happen if this host reboots. I actually am not sure how it is that the system hasn't crashed. One thing whose CPU affinity can't be adjusted is the kernel itself. I am not sure, but I believe that it would still be running on CPU 0. But maybe that's not always true. Does CPU affinity affect system calls as well? That might explain why it's still running *now*, but what will happen when it reboots? I'm familiar with the 'maxcpus' kernel parameter, but that doesn't actually let me specify which CPU is used. I suspect that setting maxcpus=1 will put everything on CPU 0 and I'll be hosed. There's also an isolcpus kernel parameter. This one seems a bit more promising. It essentially lets you tell the kernel never to put a process on the given CPU(s). It's normally used for realtime stuff, where you want to dedicate a specific CPU to your realtime process. But the kernel still needs to actually *boot*. Is it actually going to be able to do so? I don't really want to find out, and I don't want to depend on it being able to do so. Time to think about disaster recovery plans.
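
If I did go the isolcpus route, the parameter would be appended to the kernel command line in the bootloader config as isolcpus=0. The sketch below is one way to confirm, after a reboot, which of these parameters the running kernel was actually given. kparam is a made-up helper name, and none of this answers the real question of whether the kernel can boot at all with a bad CPU 0.

```shell
#!/bin/sh
# Sketch: print the value of one parameter from a kernel command line string.
# kparam is a hypothetical helper; the command line is passed in as text so
# the function can be tried without touching the running system.
kparam() {
    name=$1
    cmdline=$2
    # Unquoted on purpose: split the command line into whitespace-separated words.
    for word in $cmdline; do
        case $word in
            "$name"=*)
                echo "${word#*=}"
                return 0
                ;;
        esac
    done
    # Parameter not present.
    return 1
}

# Example against the running kernel:
#   kparam isolcpus "$(cat /proc/cmdline)" || echo "isolcpus not set"
```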

In any case, this was an awfully interesting problem, and I think the solution was kind of neat as well. I've never had to deal with such a situation before.

On a related note, does anybody have an old Opteron 248 you want to part with?

Letter to my congressman regarding SOPA

I'm writing as a constituent to inquire about your position regarding
H.R. 3261, the "Stop Online Piracy Act", and to encourage you to
vigorously oppose this bill when it reaches the House floor. The Act has
numerous problems and poses a major threat to the continued health of
the Internet as a medium for open communication. The Act introduces
technical and financial burdens that effectively impact every site that
provides a mechanism for the publishing of user-generated content, while
also eliminating judicial oversight. This is likely to have a chilling
effect: the cost of allowing users to contribute content to the web
will be too high to be justifiable in many cases. Ultimately, under
H.R. 3261, the Internet will cease to exist as
a vibrant social commons and start to resemble the cable TV system,
dominated by a small number of large corporate interests and with high
entry and operating costs.

I hope this message is just one of many that you'll receive in
opposition to this Act, and that I can count on your vocal opposition to
it. Thank you for your time and for your continued service in Congress.

transitions, again

Today my manager and I announced my upcoming departure from Mozilla.
I've only worked there a short while, so this feels really abrupt to me,
as I'm sure it does to the many Mozillians that I work with every day.
Because ultimately the details of my departure and the various decisions
leading up to it are private, and I don't want to say anything that
might be misconstrued or misinterpreted, I won't get into that sort of
thing here. The major thought that I want to convey regarding Mozilla
is that it's an awesome organization, and an incredibly important one.
I really think that its significance is under-represented, generally. I
wish the entire Mozilla ecosystem nothing but success.

My next stop will be Seattle, for a systems engineering position within
EC2 at Amazon Web Services. I really look forward to starting work
there, as well as exploring a new city!

Oh, and did I mention that I got married? Yeah, it's been a busy
week...

World IPv6 Day at Mozilla

The Internet changed yesterday. Did you notice? If not, we did it right. Mozilla was one of hundreds of participants in World IPv6 Day, both "the largest experiment in Internet history" and "the nerdiest holiday ever".

Mozilla added IPv6 connectivity to the following sites:

* www.mozilla.org
* www.mozilla.com
* wiki.mozilla.org
* addons.mozilla.org

In addition, we've been running IPv6 on our desktops, laptops, and other devices in our Mountain View, CA office for several months.

Making major architectural changes to something as large and widely distributed as the Internet is not an easy task. The IPv6 migration effort has been under way since the mid-1990s, and is likely to take another decade or longer. Yesterday, however, was a unique and significant milestone in that long process. For the first time ever, users with IPv6 network connectivity would use the new version of the protocol by default when accessing major Internet sites. This is significant because IPv4 and IPv6 will need to coexist for years to come, and major web sites will need to reliably serve users regardless of the protocol in use. Even users who only have IPv4 connectivity, which is still the vast majority of the Internet, participated in World IPv6 Day by helping site administrators around the world gain experience in running in "dual-stack" mode.
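
In practice, "dual-stack" means a hostname resolves to both IPv4 and IPv6 addresses. As a rough illustration only (stack_mode is a hypothetical helper, not a tool Mozilla used), the sketch below reads a list of addresses, one per line, and reports which protocol families are present. The first column of `getent ahosts` output is in the right shape for it.

```shell
#!/bin/sh
# Sketch: classify a host as ipv4-only, ipv6-only, or dual-stack from a list
# of its addresses (one per line on stdin; extra columns are ignored).
# stack_mode is a hypothetical helper name.
stack_mode() {
    v4=0
    v6=0
    while read -r addr _rest || [ -n "$addr" ]; do
        case $addr in
            *:*) v6=1 ;;   # any colon means an IPv6 literal
            *.*) v4=1 ;;   # otherwise a dotted quad is IPv4
        esac
    done
    case "$v4$v6" in
        11) echo "dual-stack" ;;
        10) echo "ipv4-only" ;;
        01) echo "ipv6-only" ;;
        *)  echo "no addresses" ;;
    esac
}

# Example (requires working DNS):
#   getent ahosts www.mozilla.org | stack_mode
```

Note that what the resolver returns can depend on the local machine's own configuration, so a result other than "dual-stack" does not by itself prove the site lacks IPv6.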


While World IPv6 Day only required a 24-hour commitment from website operators, we at Mozilla rather enjoy living in the future, and don't plan on going back to the old IPv4-only Internet. Barring something unforeseen, we don't plan on shutting down our ability to serve our web sites via IPv6 in the foreseeable future. To the contrary, we expect to be adding IPv6 to more services in the coming weeks. For example, two major services that we provide, irc.mozilla.org and ftp.mozilla.org, weren't ready in time for World IPv6 Day, but we expect to have them working via IPv6 this summer.

IPv6 is important to the future of the Internet, not only because it will allow continued growth into new regions of the world and new markets such as mobile, but also because it reinforces the end-to-end principle that is fundamental to open Internet access. Although World IPv6 Day is drawing to a close, the effort behind it is ongoing. Mozilla looks forward to a continuing role at the forefront of the evolution of the Internet.

Fun with JavaScript

Somewhere along the line recently, I took an interest in JavaScript
programming. I wrote some bad JS code way way back in 1999 while
working for a small, long gone ISP, but had spent very little time in it
since. When I last wrote JS, there was no XHR, JSON, Firebug, jQuery,
prototype.js, etc. Netscape was pushing some new "layers" thing that
was supposed to be the basis of their DHTML implementation, and
Microsoft was doing something completely different. At the time, the
standard approach was generally to implement most functionality on the
server and interact with it via CGI forms. JS was used to implement some
UI stuff, but not to do any real work. It was a miserable experience.

Writing JavaScript today is still a pretty miserable experience, but
several things have changed. The industry has a lot more experience
with the language, so it's pretty easy to learn what aspects of it are
to be avoided and which ones embraced. The tools and libraries are much
more mature and generally better. The implementations (browser-based
and otherwise) are much better. And, maybe most significantly, I'm a
much better programmer than I was in 1999.

Today I watched the first part of Douglas Crockford's "JavaScript:
The Good Parts" talk on Safari via my ACM membership. It's been an
entertaining and informative review of the history of the language and
made me feel a little better about having avoided it for so long. There
really are lots of major issues with the language design, and it's sort
of amazing that it's been so successful. The environment in which it
was born, in the bad old days of the Netscape/IE browser wars, was not
conducive to success. The haste with which it was shipped, the weird
relationship between Netscape and Sun: these were all major obstacles
to its success. But the features that make the language interesting
(notably, in my opinion, first-class functions and the prototypal object
system), are really neat and keep programmers just happy enough that
they're able to overlook some of the uglier areas, or in some cases
avoid them completely.

In any case, I've had fun hacking on the simple photo gallery
application I've been working on, and look forward to whatever my next
web programming project turns out to be...

Leaving Yahoo!

I announced to my manager a couple of weeks ago that I'd be leaving
Yahoo! on March 3. Yesterday I informed the rest of my team. The final
decision to leave has been surprisingly difficult, and still has me
feeling very unsettled.

I came to Yahoo! just over a year ago, after almost 10 years at my
previous job. Leaving after such a short time is strange, especially
since there is a whole lot of stuff left for me to learn and do. People
don't always have a lot of respect for this company, and that's
unfortunate. Yahoo! has a whole lot of really cool technology and
really dedicated people working on it. The ease with which my team
could launch a new version of our software to a globally distributed
cluster of many thousands of busy Linux servers, with no service
downtime, is really awesome. The frameworks and tools for managing
hosts and services are really well designed and scale amazingly well.
It's not perfect, and I'm sure there are some places that do these
things better, but not many. Some of these tools have been released
under an open source license. Hopefully many more will follow.
I hope that, if this happens, it'll spark some interest in some of the
other tools that Yahoo! has developed internally.

So, if Yahoo!'s systems are as cool as I make them sound, why am I
leaving? It's a tough question to answer. Some of it is personal. I
never felt a true sense of ownership of "Yahoo!" as a whole. I tried,
but it was hard to feel like I could contribute to improving the Yahoo!
user experience. The culture at Yahoo! simply doesn't seem to encourage
this. Dogfooding is encouraged, but only superficially. Looking around
a bit, you're probably more likely to see people using web search and
email options from the company's competitors than you are the company's
own offerings. The only service that gets much internal use is Instant
Messenger. (The ironic thing is that many engineers use Linux desktops,
which are officially supported by corporate IT, yet there is no
officially supported YIM client for Linux.)

My desire to leave Yahoo! basically comes down to my desire to work on
things that I care about with other people who also care. I find such
an environment to be far more stimulating, and far more effective when
it comes to driving improvements in the products. In part, this is why
I find open source so compelling. Nobody works on open source just for
the paycheck. They work on it because they use the software. They
write code knowing that other people are going to read it. People who
are passionate about making improvements can simply do so. With that in
mind, I'm happy to say that, on March 7, I will be joining Mozilla
Corporation. Stay tuned...

Stuart Meyerhans, 1916-2011

It's amazing to ponder the fact that, when I look at any photograph
taken over almost the past century, my grandfather existed somewhere on
Earth in that same sliver of time. In any of the famous pictures from
the Great Depression, the second World War, the Atomic Age of the 1950's
American Dream, and the Space Age of the 1960's, Stuart Meyerhans was
somewhere when that photo was taken, doing something, being someone.
From the early days of radio and TV to today's era of global digital
connectivity, he was thinking, feeling, listening, and speaking. Today,
for the first time since Woodrow Wilson was president, there is no
Stuart Meyerhans in the world. It's hard to conceive of all that he
witnessed, all that he feared, loved, wondered, saw, and heard. It's
hard to imagine that today's photographs will capture a world without
him in it.

Rest in peace, Grandpa.

updating choqok packaging

It's been a little too long since I kept the Debian Choqok packages as
up to date as I'd like. This has led to some issues, since bug #591100
really should have been fixed in time for squeeze. Then, when upstream
stopped pushing their svn changes to gitorious and moved their actual
development to kde.org's local git hosting, all my branches got screwed
up, leading to further delays. I think this is fully resolved at this
point. (I really, really hope so! It was a painful experience!) So, I
hope to get a new choqok package uploaded real soon!