“250 milliseconds, either slower or faster, is close to the magic number for competitive advantage on the Web” – Harry Shum, Executive Vice President of Technology and Research, Microsoft.
According to a report published by the site-optimization company Strangeloop (now Radware), 57% of online customers will abandon a website after waiting 3 seconds for it to load. Of those, 80% will never return, and half of these will tell others about their bad experience. Add to this the fact that Internet users are reported to have a faulty perception of time spent ‘waiting’ (they perceive load times as 15% slower than they actually are), and we can imagine how crucial a role Web Performance (speed, stability and availability) plays in determining the success or failure of an enterprise in this increasingly online world. It directly impacts user retention, online feedback, number of downloads (for mobile apps), conversions and thus revenue.
The quantitative rewards of a fast and robust online presence are illustrated by two examples: President Obama’s fundraising site raised an additional $34 million for his campaign in 2011 after increasing page speed by 60%, and Intuit saw a 14% increase in conversions after cutting its load time in half (Ref.). Conversely, poor performance brings with it the cost of lost opportunity, a potential loss of customer loyalty and severe damage to brand reputation.
An ideal way to mitigate such risks is to estimate the impact of web performance and downtime on business revenue by juxtaposing production load statistics with the business revenue model, and then to use the findings to optimize capacity planning and to fund performance tuning and testing activities appropriately.
Large enterprises mostly have dedicated teams to avert any ‘slowness’ in production. These teams are armed with state-of-the-art 24x7 real-time monitoring and diagnostic tools equipped with dashboards and alerts/triggers, self-healing mechanisms, graceful failovers (like an Oracle RAC implementation) and multiple levels of redundancy and data backups that take over from the primary configuration in case things go south.
The precursor to such monitoring, and arguably a much more vital step, is to test the code and all the production hardware on which it is going to be deployed against predefined performance goals for throughput and latency. This is done through a battery of tests (load, stress, scalability, soak), usually performed over a period of time before release by a dedicated Performance Testing and Engineering team (or by the development team itself). Since prevention is always better than cure, almost 90% of performance issues in production can be prevented if the hardware, configurations and code are thoroughly tested together by performance experts and the results are carefully analyzed (avoiding wrongful extrapolations) for any anomalies and deviations from the expected behavior agreed upon by all the stakeholders.
This is easier said than done. The Performance Engineering team usually faces a plethora of challenges in this process, the most important being: the absence of a 100% mirror of the production environment (due to cost constraints); the lack of enough unique test data for applications under heavy load (to the tune of millions of hits per hour); the stubbing or virtualization of test data to compensate for that lack (which fails to simulate realistic, production-like loads); and lastly the common practice, driven by time constraints, of targeting only business-critical and high-volume business flows for load and stress testing. The neglected flows become potential threats, as a simple memory leak or unexpected load behavior in any of them can trigger a slowdown and an eventual crash over time.
How the end user perceives application or website performance depends on a host of factors: the end device being used for access, the network speed in that geographic location, code behavior and hardware capacity at different levels (back end, middleware and front end), the overall user load at that point in time, and a multitude of configuration settings (like thread pool configuration, message queue lengths etc.) on the application, database and web servers.
A Performance Engineer in the true sense is responsible for tuning all these components and the configuration settings within them so that, working together, they deliver peak performance at minimum cost, almost with the precision of conducting a 1000-piece symphony orchestra. Though this might sound overwhelming at first, and admittedly takes years to master, performance experts know from experience that most performance issues can usually be attributed to a few broad areas and improperly configured parameters that went unnoticed or were overlooked by the developer and tester during the normal development cycle.
The remainder of this article is a non-exhaustive list of such probable problem areas and configurations to inspect when facing performance degradation. The list is largely agnostic to any specific technology, database or middleware/web server (though it mostly revolves around Java Enterprise implementations) and is based on my personal experience in similar roles. It also covers detecting and resolving the related performance issues using appropriate tools and methodology, and assumes that the reader has a basic understanding of performance engineering terms and concepts. Parts of it can also be used as a high-level checklist for a Performance Engineer to refer to before finally stamping the code and hardware configuration fit to go live.
- Memory Leaks: Memory leaks are one of the most common causes of response times that degrade over time in Java code. They arise from objects that remain referenced after they are no longer needed, and their effect is aggravated by improper JVM configuration. If left unchecked, such leaks can bring the application down completely (OutOfMemoryError). There are multiple ways to check for memory leaks depending on tool and time availability, such as interpreting the output of -XX:+PrintGCDetails directly on the console or monitoring GC behavior on Introscope (a saw-tooth pattern in which heap usage after each collection keeps climbing). To dig deeper, heap dumps can be taken (jmap) and analyzed with tools like IBM PMAT or the Eclipse MAT plugin, which give in-depth information on the classes and objects causing the leak. Memory leaks are resolved by having the developer handle the discovered classes/objects at code level and by tuning the MemArgs settings for the JVM. JVM tuning is an extensive topic in itself, requires a thorough understanding of memory management in Java and is beyond the scope of this article. At a high level, it involves choosing the most suitable GC algorithms for minor and major GC and setting optimum values for the Xmx, Xms, SurvivorRatio, NewSize and MaxTenuringThreshold parameters according to the expected allocation rate. It is common practice to test different JVM configuration settings under the same load and select the one that gives the best performance. One can refer to this wonderful guide for in-depth information on JVM tuning for best performance. Tools: JMX clients (JConsole, JVisualVM), jstat (command-line tool), GCViewer, HPjmeter, Introscope LeakHunter, profilers (hprof, AProf).
- Thread Blocks & Deadlocks: Abnormally high CPU usage, requests timing out and abysmally slow processing can be caused by thread blocks (one thread holding a lock and preventing others from obtaining it), deadlocks (thread A needs thread B’s lock to continue its task while, at the same time, thread B needs thread A’s lock) or threads waiting at the CPU level. The best way to identify and fix a thread issue is to take thread dumps (also known as Java dumps) at frequent intervals (using the jstack utility) and analyze them with tools like TDA (Thread Dump Analyzer) to check the number of blocked threads, waiting threads or a potential deadlock pattern (such patterns can be found manually too, but tools make life easier). The getThreadInfo call on the JVM thread management bean (ThreadMXBean) returns information that is difficult to obtain from thread dumps, such as the amount of time that threads have spent waiting or blocked, which helps to list threads that have been inactive for abnormally long periods. Tools: Thread Dump Analyzer, Introscope (can take thread dumps directly), Samurai (open source), JConsole.
- Gridlocks: Too much synchronization, introduced to avoid thread deadlocks, might inadvertently ‘single-thread’ the application. This leads to very slow response times coupled with low CPU utilization, as each thread reaches the synchronized code and enters a waiting state. It can be confirmed by checking for a large number of threads in a wait state (Waiting or Timed Waiting) across multiple thread dumps taken consecutively. Gridlocks can be avoided at the code level by using immutable resources where possible and using synchronization sparingly.
- Thread Pool Size: In web applications, the thread pool size determines the number of concurrent requests that can be handled at any given time. A properly sized thread pool allows as many requests to run as the hardware and application can support without straining the hardware. A larger-than-optimum thread pool adds unnecessary overhead on the CPU, which can further slow down responses, while a smaller-than-optimum pool leaves many requests waiting for a thread, again slowing down transactions. An ideal way to determine the pool size is to estimate the average number of requests in the system using Little’s Law (average service time multiplied by arrival rate) and size the thread pool to match this number (keeping a buffer for occasional spikes), while also ensuring that this size is comfortably supported by the hardware (i.e. the number of CPUs at its disposal). If there is a sudden increase in user load (beyond the expected spikes) or even gradual growth over time (for example, with an increasing customer base), the thread pool size needs to be realigned accordingly for responses to remain as fast as before.
- Database Connection Pool Configuration: Database connections are relatively expensive to create. Hence they are created beforehand and reused whenever access to the database is needed. Pooling also limits the load that a particular application can place on the database, since too much load can crash the database and impact other applications (in the case of a shared database). The pool size has to be optimized according to the expected load and the hardware capacity of the database server. As with thread pools, a small DB connection pool forces business transactions to wait for a connection to become available. This can be confirmed by monitoring the queue time and queue length, both of which will increase rapidly. Also, the majority of business transactions will be waiting on a DataSource.getConnection() call while the database shows low resource utilization. Conversely, a very large (larger than optimum) pool allows too much load to flow to the database and can slow down business transactions across all application servers accessing it. This is characterized by high SQL query processing times and high CPU utilization on the database, observable in DB logs or AWR reports. The application will be waiting on DB query executions (PreparedStatement.execute()). In general, the golden value for the pool size is just below the saturation point.
- Poor/Corrupt Table Indexes: This is a very common cause of SQL queries taking a long time to process. Once the query causing the slowness has been identified (say, from an AWR report or DB logs), we need to list all the tables referenced in that query and check the indexes on all of them. More often than not, if the query has shown a sudden degradation in performance and has not itself been modified, it can be fixed simply by rectifying or rebuilding the faulty indexes.
- Badly Designed Stored Procedures/SQL Queries: Existing stored procedures are often modified (calls to new tables, new join statements etc.) to support new functionality. If there is a noticeable degradation in query processing time compared to a previous codebase, these modified procedures should be the first suspects. It is quite possible that something performance-impacting has crept into the query (an increased number of sorts or full table scans) and is causing the behavior. There are multiple query optimization tools available in the market, like DB Optimizer by Idera, SQL Sentry Plan Explorer and the Query Profiler in dbForge Studio, but the best way around such issues is to get hold of a DB expert (if you are not one) and let them tune the query.
- Non-effective bundling / Excessive HTTP requests: While testing front-end systems, if a degradation in page load times is observed at the browser, one of the easiest ways to do root cause analysis is to watch the Network tab of Chrome's Developer Tools with the page under inspection open. This lists all the objects being downloaded (stylesheets, images etc.) with their sizes and total count, along with the time the browser takes to download each of them. The list can be sorted by download time or object size and then compared with the older codebase from before the degradation (if available), looking at the number of components downloaded and their respective load times to pinpoint the cause. An increased number of downloaded components (without any change in functionality) requires more HTTP(S) requests, adds to the page load time and indicates non-effective bundling/packaging of objects, which needs to be fixed at the code level. For a long-running test, the Web Page Diagnostics tool offered by HP LoadRunner provides similar metrics on component-level response times for a web page.
- Faulty Methods in Specific Transactions: If specific transactions in test or production appear to be problematic with respect to response times, we can trace their execution paths and component response times using Introscope Transaction Tracer. Transaction Tracer not only picks out poorly performing transactions based on the given filters but also helps to identify the components responsible for that behavior within those transactions. Introscope also supports Dynamic Instrumentation on the fly (attaching byte code to return more metrics from a particular method selected for deeper investigation). The identified components need to be isolated and fixed (mostly at code level) to get things back to normal.
- Poor load distribution / Improper F5 Configuration: Load balancers are used to increase the concurrency and reliability of large systems (through redundancy). If they do not have enough resources or are not configured properly (for example, incorrect weights assigned in the Weighted Round Robin (WRR) or Weighted Least Connection (WLC) algorithms), they can actually decrease application performance by putting too much traffic on a single server in the cluster or by acting as single points of failure. A load balancing issue leads to unequal distribution of load across servers, which can easily be confirmed by comparing CPU usage, memory utilization and logging activity across all the servers sharing the load (the instances not taking load will be idle). The overload of traffic on the remaining servers, if left unchecked, can cause severe performance degradation and might eventually lead to a crash because there is not enough hardware available to handle the excess load. Apart from the load balancer configuration, such behavior can also be caused by some server instances not coming up properly to take load after a restart, a new code deployment or a configuration change.
- Too Many Context Switches: A high context-switching rate (switching of the CPU from one thread to another) indicates an excess of threads competing for the processors on the system. Context switching is a computationally intensive task, and an unusually high switching rate will adversely impact the performance of multiprocessor computers. Excessive switching can be caused by excessive page faults due to insufficient memory, or by a processor bottleneck under load. It can be monitored using sar -w and /proc/[pid]/status on the server, or by enabling rstatd on the server being tested and observing the ‘Context switch rate’ metric in LoadRunner. The context-switching rate can be kept in check by reducing the number of active threads on the system through thread pooling, or by disabling hyper-threading if it is enabled. The ultimate solution is to bring in a more powerful processor or simply add another one to the existing setup.
- JDBC Connection Leaks: Connection leaks happen when we open a connection to the database from our application and forget to close it, or when for some reason the code fails to close it. Leaked connections are never returned to the pool, which eventually leaves no connections available for new requests and causes major timeout and slowdown issues. Such leaks can be investigated by analyzing thread dumps for the total number of created, active and idle threads, or by monitoring JDBC pool utilization using Introscope. Once a leak is confirmed, there are multiple ways of resolving it depending on the implementation: for example, using the WebLogic Profile Connection Leak mechanism to pinpoint the root cause, using Spring JDBC templates, or using connection pool implementations that offer the option of forcefully closing connections (or notifying about connection leakage) based on predefined conditions (the Tomcat connection pool has the logAbandoned and suspectTimeout properties, which configure the pool to log possible leak suspects).
- Saturated Message Queues (MQs): A much larger than expected load can saturate the message queues and cause timeouts at the application level. If there is a message queue implementation in place and a slowdown is observed beyond a particular load during testing, it is a good idea to check the queues for saturation and flush them (in test) if they are saturated. Queue lengths can be monitored using SiteScope or by logging into the MQ servers. In case of repeated saturation, the queues need to be reconfigured to an optimal length, since very large queue lengths can also cause a slowdown by hogging too much memory. In IBM WebSphere MQ, the relevant attribute to configure is the maximum queue depth.
- Saturated Disk Space by excessive logging: As trivial as it might sound, it is quite possible for disk space to fill up earlier than expected (on account of some recurring error making the logs grow 100 times faster than usual, or an inadvertent change of logging mode from info to a more verbose level such as debug). Response times first slow down, and once the disk is full the server refuses to process any further requests. With hundreds of metrics to monitor and inspect, this is one area that tends to be overlooked at times. Disk space utilization can be checked using the du/df commands or monitored in real time through SiteScope. Once the problem is detected, we need to figure out the cause (look for errors and the logging mode), fix it, clean up the disk space (or take a backup) and then start all over again.
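The thread checks described under Thread Blocks & Deadlocks can also be done from inside the JVM. The following is only a minimal sketch, assuming the standard java.lang.management API; the class name ThreadHealthCheck and its two helper methods are hypothetical names chosen for illustration, not part of any tool mentioned above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper class (illustration only): inspects the current JVM's threads.
class ThreadHealthCheck {

    /** Returns the IDs of deadlocked threads, or an empty array if none are found. */
    static long[] findDeadlockedThreadIds() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // findDeadlockedThreads also covers java.util.concurrent locks, but is optional;
        // fall back to monitor-only detection when the JVM does not support it.
        long[] ids = threads.isSynchronizerUsageSupported()
                ? threads.findDeadlockedThreads()
                : threads.findMonitorDeadlockedThreads();
        return ids == null ? new long[0] : ids; // both calls return null when no deadlock exists
    }

    /** Counts live threads that are currently in the BLOCKED state. */
    static int countBlockedThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        int blocked = 0;
        for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds())) {
            // Entries are null for threads that died between the two calls.
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                blocked++;
            }
        }
        return blocked;
    }

    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Blocked/waited times are only collected when contention monitoring is enabled;
        // otherwise getBlockedTime() and getWaitedTime() return -1.
        if (threads.isThreadContentionMonitoringSupported()) {
            threads.setThreadContentionMonitoringEnabled(true);
        }
        System.out.println("Deadlocked threads: " + findDeadlockedThreadIds().length);
        System.out.println("Blocked threads: " + countBlockedThreads());
        for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds())) {
            if (info != null) {
                System.out.printf("%s state=%s blockedMs=%d waitedMs=%d%n",
                        info.getThreadName(), info.getThreadState(),
                        info.getBlockedTime(), info.getWaitedTime());
            }
        }
    }
}
```

Run periodically (or exposed through a JMX client), a check like this complements thread dumps rather than replacing them: it reports counts and timings, while the dumps still show the stack traces needed to find the offending code.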
In addition to the problem areas listed above, it is a good idea to keep an eye on the Stall Count metric in Introscope: consistently high values imply a slow backend, periodically high values imply load-related bottlenecks in the system, and a progressively increasing count implies resource leaks. For more advanced diagnostics, custom ProbeBuilder Directives (PBDs) can be written and deployed, and the metrics configured through them can then be monitored graphically via Introscope.
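The Little’s Law sizing described under Thread Pool Size can be sketched in a few lines. This is an illustrative calculation only: the class PoolSizing, the method requiredThreads and the headroom parameter are hypothetical names, and the input figures are made-up examples rather than measurements.

```java
// Hypothetical helper (illustration only): estimates a thread pool size from Little's Law.
class PoolSizing {

    /**
     * Little's Law: average requests in the system = arrival rate x average service time.
     * The headroom fraction (e.g. 0.25 for 25%) is an assumed buffer for occasional spikes.
     */
    static int requiredThreads(double arrivalRatePerSec, double avgServiceTimeSec, double headroom) {
        if (arrivalRatePerSec < 0 || avgServiceTimeSec < 0 || headroom < 0) {
            throw new IllegalArgumentException("inputs must be non-negative");
        }
        double inFlight = arrivalRatePerSec * avgServiceTimeSec; // average concurrent requests
        return (int) Math.ceil(inFlight * (1.0 + headroom));
    }

    public static void main(String[] args) {
        // Example figures (assumed): 200 requests/s, 250 ms average service time, 25% buffer.
        int threads = requiredThreads(200.0, 0.25, 0.25);
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("Estimated pool size: " + threads);
        System.out.println("CPUs available on this host: " + cpus);
    }
}
```

With these example inputs, 200 requests per second at 250 ms each gives 50 requests in flight on average, and the 25% buffer brings the estimate to 63 threads. The estimate is only a starting point: it still has to be checked against the available CPUs and validated under load, as noted above.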