Tuesday, January 5, 2016

Large Data Sets in Salesforce Apex (a.k.a. Why the fuck is this query timing out?)

This post is for anyone struggling to work with large data sets in Apex, particularly if you're aggregating that information in some way, but some of this can be useful even for returning detailed records. I've spent the last few months creating a Visualforce reporting page that aggregates just about every record created in a selected time period in my client's Salesforce instance. When I began the project, they were having trouble pulling data for their larger offices even one office at a time. My goal was to create a page with tabs for various metric categories (revenue, new/lost customers, payroll, survey feedback, lead and marketing spend analysis) that, in each tab, pulls together metrics for all of their offices (a little more than 200) for comparison.

Their business deals in a high volume of relatively small and often recurring transactions with customers. To give you a rough idea of scale, a single month's worth of revenue data is spread across about 70,000 records, and they were hoping to run a report for a 6 month time span. Some of the metrics they wanted required analysis deeper than simple sums and averages, notably customer attrition. All of this was further complicated by the fact that the data needs to be grouped by office, and many of those offices are composed of multiple locations whose metrics need to be rolled up to a parent office.

On to the meat of it. Few of the optimization tips below are new; many are best practices or tips that other Salesforce experts have already put out there. But I wanted to compile them in one place, with some additional practical advice on how and when to use them.

JavaScript Remoting (if your end goal is Visualforce)

Probably the most obvious piece of advice on this topic. I won't go into the technical details too much here since there's a wealth of information already out there, but the advantages are twofold: the remote calls to Salesforce are asynchronous (more tips on that below), so you can break up your calls into manageable chunks, and you don't have to deal with the Visualforce View State (hooray!). As awesome as that all sounds, it also means you'll need to be fairly adept at JavaScript, and you're on your own for generating the UI: no Visualforce tags to automatically render and format everything prettily for you.
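To make that concrete, here's a minimal sketch of the server side. The class, method, object, and field names (MyReportController, getRevenueMetrics, Revenue__c, and so on) are hypothetical stand-ins, not my client's actual schema:

    global with sharing class MyReportController {
        // Returns total revenue per office for a date range. Because this is
        // a @RemoteAction, the page's JavaScript calls it asynchronously and
        // no view state is involved.
        @RemoteAction
        global static Map<String, Decimal> getRevenueMetrics(String startDate, String endDate) {
            Date startD = Date.valueOf(startDate);
            Date endD = Date.valueOf(endDate);
            Map<String, Decimal> totalsByOffice = new Map<String, Decimal>();
            for (AggregateResult ar : [SELECT Office__c, SUM(Amount__c) total
                                       FROM Revenue__c
                                       WHERE Service_Date__c >= :startD
                                         AND Service_Date__c <= :endD
                                       GROUP BY Office__c]) {
                totalsByOffice.put(String.valueOf(ar.get('Office__c')),
                                   (Decimal) ar.get('total'));
            }
            return totalsByOffice;
        }
    }

The page's JavaScript then invokes it with something like MyReportController.getRevenueMetrics(start, end, callback), and the callback fires whenever Salesforce responds.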


Test Data

Ideally, you should be building this in a Full sandbox. If your requirements are such that you need advice like this to achieve your goals, you have a sound argument for your employer/client as to why they should spend the money on one. If that's not an option, you'll at least want a Developer Pro sandbox loaded with as much test data as possible, but be warned that your page may work beautifully in the sandbox and then crap out when pulling real data in production.


Analyzing Performance Gains

Let me caution that some of these tips may work in some scenarios but not others. Removing a non-selective filter from a query and applying that filter logic to the results returned instead may help one query but hurt the performance of another (that was my experience). So you'll need to get into your debug logs and look at timestamps to verify which method of getting the results you need works best. And if JavaScript is doing a lot of the legwork, you might need to sit and time things with a stopwatch (I did).

And on that note, be aware of Salesforce's query cache. The first time you run a query, Salesforce puts the results into a cache that sticks around for a couple of hours. So the next time you run the same query, even if you've made no changes, you'll find that it runs lightning quick. This means you'll have to keep adjusting your query criteria (the easiest option being to query a different date range) to ensure any performance gain you're seeing is due to the optimizations you're making and not that cache. There were times I thought I'd made grand progress, only to have the same query time out the next day once the cache had been cleared. Disheartening.

Optimizing Your WHERE Clause

I was already familiar with many of these concepts, but a few queries with extraordinarily stubborn time-out issues taught me some new tricks.

A great resource on this is the Dreamforce presentation "Faster SOQL? Yes, Please" by John Tan (@johntansfdc) and Chris Peterson (@ca_peterson), which includes some of these tips and more. I wish I had happened upon it before I learned some of these the hard way! At the risk of flooding them with requests for help, those two are the true query optimization gurus.

Making Filters Selective

A selective filter is one that leverages indexed fields in Salesforce, which can significantly improve performance. The "Make SOQL query selective" page details what this means and points to some additional resources. If you're having issues with a query that filters on a field that's not indexed by default (the way Lookup and External Id fields are), you may be able to request that Salesforce add a custom index to that field (this is restricted to certain field types). Use the Query Plan tool in the Developer Console to help tune your queries and identify fields that may require custom indexing.
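For illustration, here's a sketch of the difference, using hypothetical object and field names (Job__c, Office__c, Service_Date__c, Status__c):

    // Likely NOT selective: Status__c is an ordinary picklist with no index,
    // so on a large object the optimizer may resort to a full scan.
    List<Job__c> risky = [SELECT Id FROM Job__c WHERE Status__c = 'Active'];

    // Selective: Office__c is a lookup, which is indexed by default, and the
    // date range narrows the scan further.
    Set<Id> officeIds = new Map<Id, Office__c>([SELECT Id FROM Office__c]).keySet();
    Date startDate = Date.today().addMonths(-6);
    Date endDate = Date.today();
    List<Job__c> better = [SELECT Id FROM Job__c
                           WHERE Office__c IN :officeIds
                             AND Service_Date__c >= :startDate
                             AND Service_Date__c <= :endDate];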

I also just discovered another tool if you want to dive even deeper: the Query Resource Feedback parameter, available via the REST API. This will actually detail how the Salesforce query optimizer is evaluating each criterion in your query. An overview and links to additional resources can be found on the Salesforce Developers site.
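The request is a plain REST GET against the query resource, with the SOQL passed in an explain parameter instead of q (the API version and query below are placeholders):

    GET /services/data/v35.0/query/?explain=SELECT+Id+FROM+Job__c+WHERE+Status__c+=+'Active'

The response lists each query plan the optimizer considered, with details like its relative cost and which indexed fields it can lean on.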

Remove Non-Selective Filters: Non-Aggregate Queries

The flip side of the advice above: if all of the filters in a query are selective except one or two, it may be more effective to take those non-selective filters out and manually filter the results instead. That's a little abstract, so let's try an example. I had a few queries with multiple criteria on indexed fields that also included a filter on a picklist field (where Status__c = 'Active'). I altered the query to remove that Status criteria and instead group the results by Status. I then looped through the results and compiled the data just for the rows where the Status was "Active". The end results were the same, but the time to get there dropped by about 30 seconds. This same advice applies to any "not equals" (<>) criteria you may have, which isn't selective even if the field is indexed.
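Here's roughly what that change looks like, again with hypothetical names:

    Date startD = Date.today().addMonths(-1), endD = Date.today();

    // Status__c is dropped from the WHERE clause (where it wasn't selective)
    // and added to GROUP BY; Apex then keeps only the 'Active' rows.
    Map<Id, Decimal> activeTotals = new Map<Id, Decimal>();
    for (AggregateResult ar : [SELECT Office__c, Status__c, SUM(Amount__c) total
                               FROM Job__c
                               WHERE Service_Date__c >= :startD
                                 AND Service_Date__c <= :endD
                               GROUP BY Office__c, Status__c]) {
        if ((String) ar.get('Status__c') == 'Active') {
            activeTotals.put((Id) ar.get('Office__c'), (Decimal) ar.get('total'));
        }
    }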

Be aware that what you're hoping for, but won't necessarily get, is a trade-off where the reduction in query time outweighs the additional time spent manually processing the results. It may be that your data returns so many results that any performance gain on the query itself is completely lost as your code loops through and processes those results.

Remove Non-Selective Filters: Aggregate Queries

Same idea as above, but in aggregate queries you have the option of migrating those non-selective filters from your WHERE clause into a HAVING clause, as sketched below.
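A minimal sketch with the same hypothetical names; HAVING is applied after the grouping, so only the selective date filters drive the row scan:

    Date startD = Date.today().addMonths(-1), endD = Date.today();

    // The Status__c condition moves out of WHERE and into HAVING
    // (a field in HAVING must also appear in the GROUP BY).
    List<AggregateResult> rows = [SELECT Office__c, SUM(Amount__c) total
                                  FROM Job__c
                                  WHERE Service_Date__c >= :startD
                                    AND Service_Date__c <= :endD
                                  GROUP BY Office__c, Status__c
                                  HAVING Status__c = 'Active'];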

Add Selective Filters, Just Cause

You might be able to boost the performance of a query by adding a selective filter that, logically speaking, you don't really need. The simplest example would be adding a filter on the RecordTypeId field even though your other filters could only possibly return that Record Type anyway. A more elaborate example: I worked with a custom object used to store payroll details that included a master-detail field to the Contact object. The only Contacts that can possibly be in that field are employees, so I didn't need to include that as a criterion in addition to the date range. But running a separate query to generate a Set of employee Contact Ids and then including that as a filter in the payroll query notably improved performance. Go figure.
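In code, that looks something like this (Payroll_Detail__c, Employee__c, Pay_Date__c, and the 'Employee' Record Type are all hypothetical):

    // Step 1: a cheap query for the Ids that must be in the results anyway.
    Set<Id> employeeIds = new Map<Id, Contact>(
        [SELECT Id FROM Contact WHERE RecordType.DeveloperName = 'Employee']).keySet();

    // Step 2: the logically redundant, but selective, IN filter on the
    // master-detail field gives the optimizer an indexed path into the data.
    Date startD = Date.today().addMonths(-1), endD = Date.today();
    List<AggregateResult> payroll = [SELECT Employee__c, SUM(Amount__c) total
                                     FROM Payroll_Detail__c
                                     WHERE Pay_Date__c >= :startD
                                       AND Pay_Date__c <= :endD
                                       AND Employee__c IN :employeeIds
                                     GROUP BY Employee__c];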

Nix Those "Or" Statements

"Or" statements that include fields can kill query performance, so you want to get rid of them wherever you can.  This is easier said than done; sometimes life demands complex boolean logic.  In the scenarios where I had no choice but to include some "or" logic, I ended up breaking the query up into separate queries and then compiling the results together.

Use Aggregate Queries

This may also be obvious depending on your database background, but whenever possible let Salesforce's query engine do the heavy lifting on grouping and mathematical operations on your data. For some of the metrics I was generating, particularly those related to customer attrition, it was impossible to get the results with a single aggregate query. Even so, a series of aggregate queries used to gather the criteria needed for the final metric got me there. And for metrics where I was able to shift from the old method of looping through individual detail records and compiling data for calculations to an aggregate query that summed and averaged it all for me, the performance gain was extraordinary.
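To give a flavor of the "series of aggregate queries" approach, here's a deliberately simplified attrition sketch (Start_Date__c and Cancel_Date__c are assumed fields, and the business definition is reduced to its bare bones):

    Date periodStart = Date.today().addMonths(-6);
    Date periodEnd = Date.today();

    // Query 1: customers active when the period began.
    Integer activeAtStart = [SELECT COUNT() FROM Account
                             WHERE Start_Date__c < :periodStart
                               AND (Cancel_Date__c = null
                                    OR Cancel_Date__c >= :periodStart)];

    // Query 2: customers lost during the period.
    Integer lost = [SELECT COUNT() FROM Account
                    WHERE Cancel_Date__c >= :periodStart
                      AND Cancel_Date__c <= :periodEnd];

    // The final metric is computed from the aggregate results, not by
    // looping over individual records.
    Decimal attritionPct = activeAtStart == 0 ? 0 : (100.0 * lost) / activeAtStart;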


Field History Conundrum

A very specific issue, but one I suspect other people have struggled with as well. A number of metrics required determining a customer's Account "Status" (a custom field) or Job "Status" at a specific point in time. Their Status may be "Canceled" right now, but I needed to determine whether they were "Active" 6 months ago. That information was available in Salesforce's Field History tracking data, but was rendered effectively useless by two limitations of that data:
  • You cannot run aggregate SOQL on Field History data.
  • You cannot filter Field History data by the New Value being tracked, only by Field Name.
So to get the very basic information of which customer Accounts were active on a certain date, their old logic had to loop through hundreds of thousands of Field History records to pick out the relevant ones. It barely worked when pulling information for a single office, and was completely infeasible for pulling all offices at once.

So I, annoyingly, recreated that field history logic with a new trigger. It works almost identically to the standard field history, but I limited it to tracking only the Status field and added a lookup back to the previous Status History record so there was a quick way to find customers who, for example, went from Canceled to Active. The trigger was pretty simple (see the sketch below), but the downside was a ton of new data in Salesforce: the built-in field history tracking doesn't count toward data storage space, but the new custom object obviously does. For the new reporting capabilities it provided, they decided it was worth it.
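A minimal sketch of that trigger, assuming a hypothetical Status_History__c object with Account__c, Old_Value__c, New_Value__c, and Previous_History__c fields (and a Status__c field on Account); none of this is the client's actual schema:

    trigger TrackAccountStatus on Account (after update) {
        // Find the most recent existing history record per account so each
        // new record can link back to its predecessor.
        Map<Id, Id> latestByAccount = new Map<Id, Id>();
        for (Status_History__c h : [SELECT Id, Account__c
                                    FROM Status_History__c
                                    WHERE Account__c IN :Trigger.newMap.keySet()
                                    ORDER BY CreatedDate DESC]) {
            if (!latestByAccount.containsKey(h.Account__c)) {
                latestByAccount.put(h.Account__c, h.Id);
            }
        }

        List<Status_History__c> histories = new List<Status_History__c>();
        for (Account acc : Trigger.new) {
            Account oldAcc = Trigger.oldMap.get(acc.Id);
            if (acc.Status__c != oldAcc.Status__c) {
                histories.add(new Status_History__c(
                    Account__c = acc.Id,
                    Old_Value__c = oldAcc.Status__c,
                    New_Value__c = acc.Status__c,
                    Previous_History__c = latestByAccount.get(acc.Id)));
            }
        }
        insert histories;
    }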


Streamline Data Returned

You can lose performance on the browser side if you're returning more data from a @RemoteAction call than you really need. If you're returning SObject data, make sure it doesn't include extraneous fields, and the same goes if you're returning a custom Apex class (which is converted to a JSON object in JavaScript). While you're testing, it's worth throwing the returned object into a console.log statement so you can sanity-check that it's as lean as it can be.
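One easy way to keep the payload lean is a purpose-built wrapper class that carries only what the page renders (the names here are hypothetical):

    public class OfficeMetric {
        // Only the fields the page actually displays; anything more is
        // wasted serialization and transfer time on every remote call.
        public String officeName;
        public Decimal revenue;
        public Integer newCustomers;
        public Decimal attritionPct;
    }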


Use Recursive Logic to Break Up Unwieldy Queries

This should be a last resort after you've exhausted every optimization you can come up with, but let's say you have and you're still hitting time-out issues in Visualforce (I've had, and solved, similar issues in triggers and Batch Apex this way as well). For each of the blocks of metrics I had to return (revenue, payroll, etc.), I painstakingly figured out the maximum date range that could be returned at once. When determining that, don't forget my note above on the query cache. I then wrote logic to break the date range selected by the user into manageable chunks.

So at the end of one RemoteAction call, the logic checks whether additional calls are required and, if so, recursively jumps back and makes the RemoteAction call for a new date range, all the while merging the results into one persistent JavaScript object. A more concrete example: the user selects a date range of 6 months for revenue metrics, which I determined can only be returned 31 days at a time. The JavaScript makes the same call 6 times, once for each month. I actually have a quick bit of logic that always restricts each call to a single calendar month, since that's how the results are reported; it's just easier than grouping the results by month in Apex. Once everything has been queried and compiled, the logic that actually renders the results on the page is called.
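Here's a stripped-down sketch of that recursion, reusing the hypothetical MyReportController.getRevenueMetrics remote action from earlier (the merge is naive; real code would reconcile overlapping keys and handle errors):

    var merged = {};

    function fetchChunk(start, end, onComplete) {
        // Query at most 31 days per remote call.
        var chunkEnd = new Date(start);
        chunkEnd.setDate(chunkEnd.getDate() + 30);
        if (chunkEnd > end) { chunkEnd = end; }

        Visualforce.remoting.Manager.invokeAction(
            'MyReportController.getRevenueMetrics',
            start.toISOString().slice(0, 10),
            chunkEnd.toISOString().slice(0, 10),
            function(result, event) {
                if (!event.status) { return; }
                for (var key in result) { merged[key] = result[key]; }
                if (chunkEnd < end) {
                    var nextStart = new Date(chunkEnd);
                    nextStart.setDate(nextStart.getDate() + 1);
                    fetchChunk(nextStart, end, onComplete); // recurse for the next chunk
                } else {
                    onComplete(merged); // all chunks queried; render the page
                }
            }
        );
    }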

In some scenarios, I was also able to break the logic into multiple calls (one for marketing spend, one for new leads, one for new recurring customers) and then compile and analyze that data with JavaScript.

This method is far from ideal, and bear in mind that performance becomes very dependent on the user's machine. It may work fine on your slick development computer, but if any of your users aren't as technologically endowed, they're going to sit and wait while their browser processes the results. But if you've really exhausted the options for optimizing the Apex and are still hitting time-outs, this could be your saving grace.

I'd recommend setting this up so you can easily change the amount of data called at once (a date range is the easiest lever, if you can use one), since you may need to adjust it as the data grows or (hopefully) Salesforce performance improves. I have one variable where I set the number of days to call at once, and a quick routine that sets that variable based on the type of metric being returned.
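A tiny sketch of that routine (the metric names and chunk sizes here are made up):

    // Days of data to request per remote call, tuned per metric by trial
    // and error; adjust as data volumes grow or performance improves.
    function chunkDaysFor(metricType) {
        switch (metricType) {
            case 'revenue': return 31;
            case 'payroll': return 62;
            case 'leads':   return 92;
            default:        return 31;
        }
    }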