News: Skype Outage: What happened on August 16

  1. Skype Outage: What happened on August 16 (16 messages)

    The Skype blog has posted an explanation of what effectively knocked out the network over the weekend. A minor update forced a large number of users to flood the network while logging on, which exposed a bug. Skype is obviously a good example of a "big enterprise application" with critical functionality. (Some people were calling Skype a "common carrier," for all intents and purposes, saying that an outage was not acceptable in any way.) An outage like this can drastically affect the "five nines" (where a service is up "99.999% of the time") - and some have criticized Skype for not managing the process better, asking why Skype didn't roll back the release, for example. Rolling back might not have been an option: apart from the cost of deploying client updates, the problem was apparently in the server software, not the client software, so a rollback probably wouldn't have avoided the allocation problem. That said: how would you have addressed a similar problem on your servers?
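To put the "five nines" figure in perspective, it can be turned into a yearly downtime budget. This is a minimal sketch, assuming an average year of 365.25 days; the function names are made up for illustration, and the 48-hour outage used in the usage note is a hypothetical input, not a duration taken from the Skype blog.

```python
# Assumption: an average year of 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year permitted at `availability` (0..1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


def achieved_availability(outage_hours: float) -> float:
    """Yearly availability if the only downtime was one outage of `outage_hours`."""
    return 1.0 - (outage_hours * 60.0) / MINUTES_PER_YEAR


if __name__ == "__main__":
    print(f"five nines budget: {downtime_budget_minutes(0.99999):.2f} min/year")
    # Hypothetical 48-hour outage, for illustration only.
    print(f"after a 48 h outage: {achieved_availability(48):.5f}")
```

At five nines the budget works out to roughly 5.26 minutes per year, so any outage measured in hours uses up many years' worth of that budget at once.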
  2. That said: how would you have addressed a similar problem on your servers?
    Shit happens.
  3. That said: how would you have addressed a similar problem on your servers?
    Shit happens.
    Do you work for Sun?
  4. That said: how would you have addressed a similar problem on your servers?
    Shit happens.
    Do you work for Sun?
    LOL :-) I guess we all work for SUN, somehow.
  5. I'm sure your answer would be acceptable to Wall Street. eBay's share price may have been affected. :) http://mashable.com/2007/08/16/skype-outage/
  6. I'm sure your answer would be acceptable to Wall Street. eBay's share price may have been affected. :)
    http://mashable.com/2007/08/16/skype-outage/
    http://finance.yahoo.com/q/bc?s=EBAY&t=my
  7. I'm sure your answer would be acceptable to Wall Street. eBay's share price may have been affected. :)
    http://mashable.com/2007/08/16/skype-outage/

    http://finance.yahoo.com/q/bc?s=EBAY&t=my
    The stock is down about 7% since its high last week. But that slide started before the Skype outage.
  8. The stock is down about 7% since its high last week. But that slide started before the Skype outage.
    So what? Shit happens anyway. PS: the current 52wk Range is 25.25 - 37.44.
  9. Inaccurate Post

    "...and some have criticized Skype for not managing the process better, asking why Skype didn't roll back the release, for example." This leads the reader to believe that the update in question came from Skype - it did not. The update in question was a Windows patch, sent to all Windows Update users, that required a reboot. Apparently, the number of patches in this "Patch Tuesday" was larger than normal, thus requiring more reboots. The unintended DDoS occurred when, upon rebooting, all of these clients auto-logged in to Skype. At least, that's the story according to Skype. -John Mark, Community Manager, Hyperic, Inc. http://www.hyperic.com/
  10. What I would like to know more about is what kind of development process Skype uses. Do they use TDD? Are they Agile? Waterfall? Chaotic? How did this kind of bug slip through?
  11. What I would like to know more about is what kind of development process Skype uses. Do they use TDD? Are they Agile? Waterfall? Chaotic? How did this kind of bug slip through?
    Some of their job openings mention that they "value experience with agile methodologies". Anyway we're talking about millions of clients connecting in a short time, and a presumably fairly complex network resource allocation algorithm. Agile or not, it's not exactly the easiest scenario to simulate in a test environment. "There certainly are programming tasks that can't be driven solely by tests (or at least, not yet). Security software and concurrency, for example, are two topics where TDD is insufficient to mechanically demonstrate that the goals of the software have been met. [...] Subtle concurrency problems can't be reliably duplicated by running the code." -- Kent Beck, Test-Driven Development: By Example (2002)
  12. What I would like to know more about is what kind of development process Skype uses. Do they use TDD? Are they Agile? Waterfall? Chaotic? How did this kind of bug slip through?

    Some of their job openings mention that they "value experience with agile methodologies".

    Anyway we're talking about millions of clients connecting in a short time, and a presumably fairly complex network resource allocation algorithm. Agile or not, it's not exactly the easiest scenario to simulate in a test environment.
    I don't doubt that at all. The question is what they did or didn't do to try to ensure this would not happen. "It's hard" is just an excuse. If they realized this was hard to test, they should have used some other strategy to try to ensure correctness. Perhaps FSV or other tools that verify models statically could be used. If only testing was used, this is evidence of the popular delusion that complex software can be tested 'completely'. Given the amount of time it took them to fix the issue, I'm guessing their engineers were blindsided by it.
  13. Perhaps FSV or other tools that verify models statically could be used.
    FSV = Finite-State Verification
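To make the FSV idea concrete, here is a toy explicit-state check: it enumerates every reachable state of a small model and looks for one that breaks an invariant. This is only a sketch under loud assumptions: the three client states, the `capacity` limit, and the `admission_control` flag are all invented for illustration and say nothing about Skype's actual allocation algorithm.

```python
from collections import deque

OFFLINE, CONNECTING, ONLINE = "offline", "connecting", "online"


def successors(state, capacity, admission_control):
    """Yield every state reachable in one step from `state` (a tuple of client states)."""
    connecting = state.count(CONNECTING)
    for i, s in enumerate(state):
        if s == OFFLINE:
            # A client starts logging in; the guarded model refuses when full.
            if not admission_control or connecting < capacity:
                yield state[:i] + (CONNECTING,) + state[i + 1:]
        elif s == CONNECTING:
            # Login completes and frees the login slot.
            yield state[:i] + (ONLINE,) + state[i + 1:]
        else:
            # An online client drops off (for example, a reboot).
            yield state[:i] + (OFFLINE,) + state[i + 1:]


def find_violation(n_clients, capacity, admission_control):
    """Breadth-first search of the state space.

    Returns a shortest trace to a state with more than `capacity` clients
    connecting at once, or None if no such state is reachable.
    """
    start = (OFFLINE,) * n_clients
    parent = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state.count(CONNECTING) > capacity:
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for nxt in successors(state, capacity, admission_control):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


if __name__ == "__main__":
    print(find_violation(3, 2, admission_control=False))  # counterexample trace
    print(find_violation(3, 2, admission_control=True))   # None: invariant holds
```

The point of the toy is that the "everyone logs in at once" interleaving is found by exhaustive search rather than by hoping a test happens to hit it. Real models are far larger, which is where dedicated tools come in.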
  14. If the engineers were not blindsided, it wouldn't have been an issue.
  15. Question of test procedures

    What I would like to know more about is what kind of development process Skype uses. Do they use TDD? Are they Agile? Waterfall? Chaotic? How did this kind of bug slip through?
    As far as I understand, the bug would only have been found if they had installed the upgrade in a small test network before installing it in the actual network. I can't say whether that would have taken too much effort or cost too much money, since I'm more of a software guy than a network guy. However, using an agile development style or not would not have prevented the problem from happening. IMHO it's a matter of the test and roll-out procedure. Regards, Oliver
  16. Re: Question of test procedures

    What I would like to know more about is what kind of development process Skype uses. Do they use TDD? Are they Agile? Waterfall? Chaotic? How did this kind of bug slip through?

    As far as I understand, the bug would only have been found if they had installed the upgrade in a small test network before installing it in the actual network. I can't say whether that would have taken too much effort or cost too much money, since I'm more of a software guy than a network guy. However, using an agile development style or not would not have prevented the problem from happening. IMHO it's a matter of the test and roll-out procedure.

    Regards, Oliver
    Just to be clear, the update was not to Skype. The update was to Windows. And the problem wasn't caused by the Windows update itself; it was caused by a large number of people joining the network at the same time. Testing is not the only way to verify software. It's possible to catch a bug without writing a single test. In fact, it's been known for decades that testing alone cannot ensure program correctness for most non-trivial specifications. The way people talk about testing, you'd think this wasn't the case.
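One common way to soften "a large number of people joining the network at the same time" is to have each client wait a random, growing delay before every login attempt, so that rebooted machines do not all reconnect in the same instant. This is a minimal sketch of exponential backoff with full jitter; `try_login` is a hypothetical callable standing in for whatever the client does to log in, and the `base` and `cap` values are arbitrary assumptions, not anything Skype has described.

```python
import random
import time


def backoff_delays(max_attempts, base=1.0, cap=300.0, rng=None):
    """Yield one random delay per attempt, uniform in [0, min(cap, base * 2**attempt)]."""
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng.uniform(0.0, ceiling)


def login_with_backoff(try_login, max_attempts=8, base=1.0, cap=300.0,
                       sleep=time.sleep, rng=None):
    """Wait a jittered delay before each attempt; return True on the first success."""
    for delay in backoff_delays(max_attempts, base=base, cap=cap, rng=rng):
        # Sleeping before the first attempt too spreads out the initial surge.
        sleep(delay)
        if try_login():
            return True
    return False


if __name__ == "__main__":
    # Hypothetical login that fails twice, then succeeds; sleeping is stubbed out.
    outcomes = iter([False, False, True])
    print(login_with_backoff(lambda: next(outcomes), sleep=lambda d: None))
```

This only addresses the client side. A server would still want its own limit on concurrent logins, since it cannot rely on every client behaving well.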
  17. I would agree, "shit happens". I don't think there is any product on the market that has not had a problem in production. I am sure the Skype team learned from it and the next release will go smoother. BTW, people could have switched to Gizmo during the outage. I actually like Gizmo's conference room support, and their Linux support is outstanding compared to Skype's. Best, Dmitriy GridGain - Grid Computing Made Easy