Tuesday, April 17, 2012

Instagram - the architecture that is now worth $1B

  • Amazon shop: they use many of Amazon's services. With only 3 engineers, they don’t have the time to look at self-hosting.
  • 100+ EC2 instances total for various purposes.
  • Ubuntu Linux 11.04 (“Natty Narwhal”). It has been solid; other Ubuntu versions froze on them.
  • Amazon’s Elastic Load Balancer routes requests and 3 nginx instances sit behind the ELB.
  • SSL terminates at the ELB, which lessens the CPU load on nginx.
  • Amazon’s Route53 for the DNS.
  • 25+ Django application servers on High-CPU Extra-Large machines.
  • Traffic is CPU-bound rather than memory-bound, so High-CPU Extra-Large machines are a good balance of memory and CPU.
  • Gunicorn as their WSGI server. Apache was harder to configure and more CPU-intensive.
  • Fabric is used to execute commands in parallel on all machines. A deploy takes only seconds.
  • PostgreSQL (users, photo metadata, tags, etc) runs on 12 Quadruple Extra-Large memory instances.
  • Twelve PostgreSQL replicas run in a different availability zone.
  • PostgreSQL instances run in a master-replica setup using Streaming Replication. EBS is used for snapshotting, to take frequent backups.
  • EBS is deployed in a software RAID configuration. Uses mdadm to get decent IO.
  • All of their working set is stored in memory. EBS doesn’t support enough disk seeks per second.
  • Vmtouch (portable file system cache diagnostics) is used to manage what data is in memory, especially when failing over from one machine to another, where there is no active memory profile already.
  • XFS as the file system. Used to get consistent snapshots by freezing and unfreezing the RAID arrays when snapshotting.
  • Pgbouncer is used to pool connections to PostgreSQL.
  • Several terabytes of photos are stored on Amazon S3.
  • Amazon CloudFront as the CDN.
  • Redis powers their main feed, activity feed, sessions system, and other services.
  • Redis runs on several Quadruple Extra-Large Memory instances. They occasionally shard across instances.
  • Redis runs in a master-replica setup. Replicas constantly save to disk. EBS snapshots back up the DB dumps. Dumping the DB on the master was too taxing.
  • Apache Solr powers the geo-search API. They like its simple JSON interface.
  • 6 memcached instances for caching. They connect using pylibmc and libmemcached. Amazon's ElastiCache service isn't any cheaper.
  • Gearman is used to asynchronously share photos to Twitter, Facebook, etc.; notify real-time subscribers of a new photo posted; and perform feed fan-out.
  • 200 Python workers consume tasks off the Gearman task queue.
  • Pyapns (Apple Push Notification Service) handles over a billion push notifications. Rock solid.
  • Munin to graph metrics across the system and alert on problems. They write many custom plugins using Python-Munin to graph signups per minute, photos posted per second, etc.
  • Pingdom for external monitoring of the service.
  • PagerDuty for handling notifications and incidents.
  • Sentry for Python error reporting.
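
To make the "feed fan-out" item a bit more concrete, here is a minimal sketch of fan-out on write. It is not Instagram's code: the follower graph, the feed cap, and the function names are all hypothetical, and in-memory deques stand in for the per-user Redis lists that a background worker (a Gearman worker in their setup) would presumably push to.

```python
from collections import defaultdict, deque

FEED_LENGTH = 3  # hypothetical cap on how many photo IDs a feed keeps

# Assumption: one capped, newest-first list per user. In a real deployment
# this would live in Redis rather than in process memory.
feeds = defaultdict(lambda: deque(maxlen=FEED_LENGTH))

# Hypothetical follower graph: author -> users who follow that author.
followers = {"alice": ["bob", "carol"], "bob": ["carol"]}


def fan_out(author, photo_id):
    """Background task: push a new photo ID onto every follower's feed."""
    for follower in followers.get(author, []):
        # Newest first; a full deque silently drops its oldest entry.
        feeds[follower].appendleft(photo_id)


def read_feed(user):
    """Reading a feed is a cheap lookup because the work was done at write time."""
    return list(feeds[user])
```

The design trade-off is that posting becomes more expensive (one write per follower), which is why it makes sense to run it asynchronously off a task queue, while reading a feed stays cheap.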
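
The freeze, snapshot, unfreeze sequence mentioned for XFS can be sketched as follows. This is an illustration under assumptions, not their actual tooling: `take_snapshot` is a hypothetical callable that starts an EBS snapshot for one volume, and the mount point and volume IDs are placeholders. The only real command used is `xfs_freeze`, with `-f` to freeze and `-u` to unfreeze a mounted file system.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def frozen(mount_point, run=subprocess.check_call):
    """Freeze an XFS file system for the duration of the with-block."""
    run(["xfs_freeze", "-f", mount_point])  # block new writes, flush pending ones
    try:
        yield
    finally:
        # Always unfreeze, even if a snapshot call raised.
        run(["xfs_freeze", "-u", mount_point])


def snapshot_volumes(mount_point, volume_ids, take_snapshot,
                     run=subprocess.check_call):
    """Start a snapshot of every volume in the RAID array while it is frozen.

    take_snapshot is a hypothetical callable (volume_id -> snapshot_id);
    plug in whatever EBS client is in use.
    """
    with frozen(mount_point, run):
        return [take_snapshot(volume_id) for volume_id in volume_ids]
```

Because all member volumes of the array are snapshotted while writes are paused, the snapshots describe the same point in time. Passing a different `run` makes it possible to dry-run the sequence without touching a real file system.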
