Tuesday, January 5, 2016

Large Data Sets in Salesforce Apex (a.k.a Why the fuck is this query timing out?)

This post is for anyone struggling to work with large data sets in Apex, particularly if you're aggregating that information in some way, but some of this can be useful even for returning detailed records. I've spent the last few months creating a Visualforce reporting page that aggregates just about every record created in a selected time period in my client's Salesforce instance.  When I began the project, they were having trouble pulling data for their larger offices even one office at a time, and my goal was to create a page with tabs for various metric categories (revenue, new/lost customers, payroll, survey feedback, lead and marketing spend analysis) and in each tab, pull all offices' (a little more than 200) metrics together for comparison.

Their business deals in a high volume of relatively small and often recurring transactions with customers.  To give you a rough idea of scale, a single month's worth of revenue data is spread across about 70,000 records, and they were hoping to run a report for a 6 month time span.  Some of the metrics they wanted required analysis deeper than simple sums and averages, notably customer attrition. All of this was further complicated by the fact that they need to group the data by office, and many of those offices are comprised of multiple locations whose metrics need to be rolled up to a parent office.

On to the meat of it.  Few of the tips below are new; many are best practices or tips that other Salesforce experts have already put out there, but I wanted to compile them in one place with some additional practical advice on how and when to use them.

Javascript Remoting (if your end goal is Visualforce)

Probably the most obvious piece of advice on this topic.  I won't go into the technical details too much here since there's a wealth of information already out there, but the advantages come from two things: the asynchronous nature of the remote calls to Salesforce (more tips on that below), which lets you break your calls into manageable chunks, and the fact that you don't have to deal with the Visualforce View State (hooray!).  As awesome as that all sounds, it also means you'll need to be fairly adept at Javascript, and you're on your own for generating the UI- no Visualforce tags to automatically render and format everything prettily for you.
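To make that concrete, here's a minimal sketch of the Apex side of a remoting call.  Everything here- the controller name, the Revenue__c object and its fields- is made up for illustration; the Javascript on the page calls the method through the generated remoting proxy and renders the returned map however you like.

global with sharing class RevenueMetricsController {

    //Called from the page's Javascript; returns total revenue keyed by office
    @RemoteAction
    global static Map<String, Decimal> getRevenueByOffice(String startDate, String endDate) {
        Date startDt = Date.valueOf(startDate);   //expects 'yyyy-mm-dd' strings from the page
        Date endDt = Date.valueOf(endDate);
        Map<String, Decimal> revenueByOffice = new Map<String, Decimal>();
        for (AggregateResult ar : [
                Select Office__c office, SUM(Amount__c) total
                From Revenue__c
                Where Transaction_Date__c >= :startDt And Transaction_Date__c <= :endDt
                Group By Office__c]) {
            revenueByOffice.put(String.valueOf(ar.get('office')), (Decimal)ar.get('total'));
        }
        return revenueByOffice;
    }
}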


Test Data

Ideally, you should be building this in a full sandbox. If your requirements are such that you need advice like this to achieve your goals, you have a sound argument for your employer / client as to why they should spend the money paying for one.  If that's not an option, you'll at least want a Developer Pro sandbox with as much test data as possible, but be warned that your page may work beautifully in the sandbox and then crap out when pulling real data in production.


Analyzing Performance Gains

Let me caution that some of these tips may work in some scenarios but not others.  Removing a non-selective filter from a query and applying that filter logic to the results returned instead may help with one query, but hurt the performance of another query (that was my experience).  So you'll need to get into your debug logs and look at timestamps to verify which method of getting to the results works best.  And if Javascript is doing a lot of the legwork, you might need to sit and time things with a stopwatch (I did).

And on that note, be aware of Salesforce's query cache.  The first time you run a query, Salesforce puts the results into a cache that sticks around for a couple of hours.  So the next time you run the same query, even if you've made no changes, you'll find it runs lightning quick.  This means you'll have to keep adjusting your query criteria (the easiest option being to query a different date range) to ensure any performance gain you're seeing is due to your optimizations and not the cache.  There were times I thought I'd made grand successes, only to have the same query time out the next day when the cache had been cleared- disheartening.

Optimizing Your WHERE Clause

I was already familiar with many of these concepts, but a few queries with time-out issues that were extraordinarily hard to resolve taught me some new tricks.

A great resource on this is the Dreamforce presentation "Faster SOQL? Yes, Please" by John Tan (@johntansfdc) and Chris Peterson (@ca_peterson), which includes some of these tips and more.  I wish I had happened upon it before I learned some of these the hard way! At the risk of flooding them with requests for help, those two are the true query optimization gurus.

Making Filters Selective

A selective filter is one that leverages indexed fields in Salesforce, which can significantly improve performance.  The "Make SOQL query selective" page details what this means and points to some additional resources. If you're having issues with a query that filters on a field that isn't indexed by default (the way Lookup and External Id fields are), you may be able to ask Salesforce to add a custom index to that field (this is restricted to certain field types).  Use the Query Plan tool in the Developer Console to help tune your queries and identify fields that may need custom indexing.
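For illustration, here's a hedged example of the difference- Revenue__c, Office__c, Status__c and the date field are stand-ins for your own objects and fields, and the Query Plan tool will tell you whether the optimizer actually treats your version as selective.

//Bind variables for the examples below
Set<Id> officeIds = new Map<Id, Office__c>([Select Id From Office__c]).keySet();
Date startDate = Date.today().addMonths(-1);
Date endDate = Date.today();

//Selective: Office__c is a lookup (indexed by default) and the date range is narrow
List<Revenue__c> selective = [
    Select Id, Amount__c From Revenue__c
    Where Office__c In :officeIds
    And Transaction_Date__c >= :startDate And Transaction_Date__c <= :endDate];

//Not selective: Status__c is a plain picklist with no index, so this filter can't
//help the optimizer unless Salesforce adds a custom index to the field
List<Revenue__c> notSelective = [
    Select Id, Amount__c From Revenue__c
    Where Status__c = 'Active'
    And Transaction_Date__c >= :startDate And Transaction_Date__c <= :endDate];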

I also just discovered another tool if you want to dive even deeper- the Query Resource Feedback parameter available via the REST API.  It details how the Salesforce query optimizer is evaluating each criterion in your query.  An overview and links to additional resources can be found here on the Salesforce Developers site.

Remove Non-Selective Filters- Non-Aggregate Queries

The flip side of the advice above: if all of the filters in a query are selective except one or two, it may be more effective to take those non-selective filters out and manually filter that data out of the results.  That's a little abstract, so here's an example.  I had a few queries with multiple criteria on indexed fields that also included a filter on a picklist field (Where Status__c = 'Active').  I altered the query to remove that Status criteria and instead group the results by Status, then looped through the results and compiled the data just for the rows where the Status was "Active".  The end result was the same, but the time to get there dropped by about 30 seconds.  The same advice applies to any "not equals" (!= or <>) criteria you may have, which aren't selective even if they're on an indexed field.
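Here's roughly what that looked like, with invented names (Job__c standing in for the real object) and the same sort of bind variables as the earlier sketch:

Set<Id> officeIds = new Map<Id, Office__c>([Select Id From Office__c]).keySet();
Date startDate = Date.today().addMonths(-6);
Date endDate = Date.today();

//The old filter "And Status__c = 'Active'" is gone; instead group by Status
//and keep only the 'Active' rows when compiling the results
Decimal activeTotal = 0;
for (AggregateResult ar : [
        Select Status__c st, SUM(Amount__c) total
        From Job__c
        Where Office__c In :officeIds
        And Start_Date__c >= :startDate And Start_Date__c <= :endDate
        Group By Status__c]) {
    Decimal groupTotal = (Decimal)ar.get('total');
    if ((String)ar.get('st') == 'Active' && groupTotal != null) {
        activeTotal += groupTotal;
    }
}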

Be aware that what you're hoping for, but may not necessarily get, is a reduction in query time large enough to make the additional time spent manually processing the results worth it.  It might be that your query returns so many results that any performance gain on the query itself is completely lost as your code loops through and processes them.

Remove Non-Selective Filters- Aggregate Queries

Same idea as above, but in aggregate queries you have the option of migrating those non-selective filters from your WHERE clause to a HAVING clause. More details here.
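A hedged before-and-after, using the same made-up Job__c object as earlier:

Date startDate = Date.today().addMonths(-6);
Date endDate = Date.today();

//Before: the non-selective picklist filter sits in the Where clause
List<AggregateResult> withWhereFilter = [
    Select Office__c office, COUNT(Id) cnt
    From Job__c
    Where Start_Date__c >= :startDate And Start_Date__c <= :endDate
    And Status__c = 'Active'
    Group By Office__c];

//After: filter only on the indexed date range, add Status to the grouping,
//and let Having discard the groups you don't want
List<AggregateResult> withHaving = [
    Select Office__c office, Status__c st, COUNT(Id) cnt
    From Job__c
    Where Start_Date__c >= :startDate And Start_Date__c <= :endDate
    Group By Office__c, Status__c
    Having Status__c = 'Active'];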

Add Selective Filters, Just Cause

You might be able to boost the performance of a query by adding a selective filter that, logically speaking, you don't really need.  The simplest example would be adding a filter on the RecordTypeId field even though your other filters could only possibly return that Record Type anyway.  A more elaborate example: I worked with a custom object storing payroll details that included a master-detail field to the Contact object.  The only Contacts that could possibly be in that field were employees, so I didn't need to include that as a criterion in addition to the date range.  But running a separate query to generate a Set of employee Contact Ids and then including that as a filter in the payroll query notably improved performance.  Go figure.
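A sketch of that pattern- Payroll__c and all the field names here are invented for the example:

Date startDate = Date.today().addMonths(-1);
Date endDate = Date.today();

//Small, fast query to gather the employee Contact Ids up front
Set<Id> employeeIds = new Map<Id, Contact>(
    [Select Id From Contact Where Is_Employee__c = true]).keySet();

//Feeding those Ids in as an extra (logically redundant) selective filter
//noticeably sped up the payroll query
List<AggregateResult> payroll = [
    Select Employee__c emp, SUM(Amount__c) total
    From Payroll__c
    Where Pay_Date__c >= :startDate And Pay_Date__c <= :endDate
    And Employee__c In :employeeIds
    Group By Employee__c];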

Nix Those "Or" Statements

"Or" statements that include fields can kill query performance, so you want to get rid of them wherever you can.  This is easier said than done; sometimes life demands complex boolean logic.  In the scenarios where I had no choice but to include some "or" logic, I ended up breaking the query up into separate queries and then compiling the results together.

Use Aggregate Queries

This may also be obvious depending on your database background, but whenever possible let Salesforce's query engine do the heavy lifting on grouping and mathematical operations on your data.    For some of the metrics I was generating, particularly related to customer attrition, it was impossible to get the results with a single aggregate query.  Even so, a series of aggregate queries used to gather the criteria needed for the final metric got me there.  And for metrics where I was able to shift from the old method of looping through individual detail records and compiling data for calculations, to an aggregate query that summed/averaged it all up for me, the performance gain was extraordinary. 
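For example, the shift from looping over detail records to a single aggregate query looks roughly like this (names invented):

Date startDate = Date.today().addMonths(-1);
Date endDate = Date.today();

//Old approach: pull every detail record and do the math in Apex
Decimal total = 0;
Integer recordCount = 0;
for (Revenue__c r : [Select Amount__c From Revenue__c
                     Where Transaction_Date__c >= :startDate And Transaction_Date__c <= :endDate]) {
    if (r.Amount__c != null) {
        total += r.Amount__c;
        recordCount++;
    }
}

//New approach: one aggregate row per office, with the query engine doing the work
for (AggregateResult ar : [
        Select Office__c office, SUM(Amount__c) total, AVG(Amount__c) avgAmount, COUNT(Id) cnt
        From Revenue__c
        Where Transaction_Date__c >= :startDate And Transaction_Date__c <= :endDate
        Group By Office__c]) {
    System.debug(ar);
}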


Field History Conundrum

A very specific issue, but one I suspect other people have struggled with as well.  There were a number of metrics which required determining a customer's Account "Status" (custom field) or Job "Status" at a specific point in time.  So their Status may be "Canceled" right now, but I needed to determine if they were "Active" 6 months ago.  That information was available in Salesforce's Field History tracking data, but was rendered effectively useless by two limitations of the Field History data:
  • You cannot run aggregate SOQL queries on Field History data
  • You cannot filter Field History data by the New Value that's being tracked, only by Field Name.
So to get the very basic information of which customer Accounts were active on a certain date, their old logic had to loop through hundreds of thousands of Field History records to pick out the ones that were active on a certain date.  It barely worked to pull information for just one office, and was completely unfeasible for pulling all offices at once.

So I, annoyingly, recreated that field history logic with a new trigger.  It works almost identically to the standard field history, but I limited it to only track the Status field and added in a lookup back to the previous Status History record so there was a quick way to find customers who, for example, went from Canceled status to Active status.  The trigger was pretty simple, but the downfall was a ton of new data in Salesforce- the built-in field history tracking doesn't count toward data storage space but the new custom object obviously does.  For the new reporting capabilities it provided, they decided it was worth it.
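The trigger itself was along these lines- a stripped-down sketch with invented names (Status_History__c and its fields); the real version also set a lookup back to the previous history record for that Account so status transitions could be queried directly.

trigger trackAccountStatus on Account (after insert, after update) {
    List<Status_History__c> histories = new List<Status_History__c>();
    for (Account a : Trigger.new) {
        String oldStatus = Trigger.isInsert ? null : Trigger.oldMap.get(a.Id).Status__c;
        //only write a history row when the Status actually changed
        if (a.Status__c != oldStatus) {
            histories.add(new Status_History__c(
                Account__c = a.Id,
                Old_Status__c = oldStatus,
                New_Status__c = a.Status__c,
                Change_Date__c = System.today()));
        }
    }
    if (!histories.isEmpty()) {
        insert histories;
    }
}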


Streamline Data Returned

You can lose performance on the browser side if you're returning more data from a @RemoteAction call than you really need.  If you're returning SObject data, make sure it doesn't include extraneous fields; the same goes if you're returning a custom Apex class (which is converted to a JSON object in Javascript).  While you're testing, it's worth throwing the returned object into a console.log statement in Javascript so you can sanity check that it's as lean as it can be.
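One cheap way to keep the payload lean is to return a purpose-built wrapper class rather than SObjects- something like this (field names invented), holding only what the page renders:

//Return type for the @RemoteAction- just the values the page actually displays,
//instead of SObjects dragging along every queried field
public class OfficeMetric {
    public String officeName;       //rolled-up parent office name
    public Decimal revenue;         //summed revenue for the selected range
    public Integer newCustomers;
    public Integer lostCustomers;
}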


Use Recursive Logic to Break Up Unwieldy Queries

This should be a last resort after you've done every optimization you can come up with, but let's say you have and you're still hitting time-out issues in Visualforce (I've hit and solved similar issues in triggers and Batch Apex this way as well).  For each block of metrics I had to return (revenue, payroll, etc.), I had to painstakingly figure out the maximum date range I could return at once- and when determining that, don't forget my note above on the query cache. I then wrote the logic to break the date range selected by the user into manageable chunks.

So at the end of one RemoteAction call, the logic checks to see if additional calls are required and, if so, recursively jumps back and makes the RemoteAction call for a new date range- all the while merging the results into one persistent Javascript object.  A more concrete example: the user selects a date range of 6 months to return revenue metrics, which I determined can only be returned 31 days at a time.  The Javascript makes the same call 6 times, once for each month.  I actually have a quick bit of logic that always restricts each call to a single calendar month, since that's how the results are reported, so it's easier than grouping the results by month in Apex.  Once everything has been queried and compiled, the logic to actually render the results on the page is called.
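The chunking itself lived in Javascript in my case, but the same idea expressed in Apex looks something like this- a hypothetical helper, not the code from the page: split the selected range into calendar-month pieces, then make one call per piece.

public class DateChunker {
    //Splits a date range into calendar-month chunks, e.g. Jan 15 - Mar 10 becomes
    //[Jan 15 - Jan 31], [Feb 1 - Feb 28], [Mar 1 - Mar 10]
    public static List<Date[]> splitByMonth(Date startDate, Date endDate) {
        List<Date[]> chunks = new List<Date[]>();
        Date chunkStart = startDate;
        while (chunkStart <= endDate) {
            //last day of chunkStart's month, capped at the overall end date
            Date monthEnd = chunkStart.toStartOfMonth().addMonths(1).addDays(-1);
            Date chunkEnd = monthEnd < endDate ? monthEnd : endDate;
            chunks.add(new Date[]{ chunkStart, chunkEnd });
            chunkStart = chunkEnd.addDays(1);
        }
        return chunks;
    }
}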

In some scenarios, I was able to break the logic up into multiple calls- one for marketing spend, one for new leads, one for new recurring customers- and then compile and analyze that data with Javascript.

This method is far from ideal, and bear in mind that performance becomes very dependent on the user's machine.  It may work fine on your slick development computer, but if any of your users aren't as technologically endowed, they're going to sit and wait longer while their browser processes the results.  But if you've really explored all the options for optimizing the Apex and are still hitting time-outs, this could be your saving grace.

I'd recommend setting this up so you can easily change the amount of data called at once (date range is the easiest option if you can) since you may need to adjust it as either the data expands or (hopefully) Salesforce performance increases.  I have one variable where I set the number of days to be called at once, and a quick routine that sets that variable based on the type of metric being returned.  




Friday, November 30, 2012

Amazon.com knew I was pregnant before my mother

All it took was a purchase of one nursing bra and Amazon began flooding me with advertisements for pregnancy and baby-related products.  Some people find this kind of targeted advertising convenient.  I find it invasive and annoying.  I don't even mind it if I'm outright searching for something and they show me related products, but now whatever I'm doing in Amazon, I'm bombarded with suggestions on what I might want for my breasts, nether region or yet-to-be-born baby.

With the "convenience" that comes with having your every digital move catalogued and analyzed, we lose a significant amount of privacy.  Today it's my own view of Amazon, Google, Facebook, etc. that's affected by my purchases, searches and comments; what of the day when Amazon starts recommending to my friends and family viewing my Wish List that I might appreciate some nursing pads or stretch mark cream?  And if you think they won't, you're taking the concept of "corporations are people too, my friend" too seriously.  I'm sure there are plenty of Amazon employees and even executives who would oppose such tactics, but ultimately it's the bottom-line that will drive such decisions; after all it's their legal obligation to their shareholders to scrabble for every dime they can.

I quit Facebook the day I commented that I was going to Boston and started getting advertisements for Boston restaurants and events in my sidebar.  I knew that day was coming, though, and had always been conscientious about what I posted there.  I don't think people, particularly young people, are always that aware.  Once it's online, assume it's there forever.  Go ahead and delete or close your account thinking that's wiped it out; it's still on some server somewhere.

Think Facebook won't turn over a log of your full history (deleted or not) to a government intelligence agency if asked, even without a warrant?  Did you think the same thing about phone companies before the Bush administration made it clear that saying the word "terrorist" absolves both the government and telecommunications companies of having to adhere to any laws, right down to the very Constitution they supposedly hold so dear?  And it's not just the government you have to worry about, Facebook has already "accidentally" revealed what was supposed to be private information of users to third-party application developers when those users installed their apps.

Whatever privacy scandals we hear about related to social media and search engines (try googling "Google Germany Privacy" for some insight on how they push the boundaries) are only what they've been caught doing, and assuredly just the tip of the iceberg; advances in technology and the evolution of artificial intelligence will bring about privacy breaches we can't even yet fathom.  It's a strange, new world we live in.

There is a bit of "turnabout is fair play" going on in the world though.  The same companies that push the boundaries of the trust we put in them with our personal data have put their own trust in the government and companies of China as they use factories there for manufacturing and set up branch offices.  China doesn't have the same ideas about intellectual property as we do.  Companies have found themselves victims of corporate espionage, theft and recreation of their designs, and even, disturbingly, of monitoring devices built into electronics that were not in the design.  The Chinese government pays lip service to such breaches, but rarely takes any action on the matter.  Be patriotic and come back to the USA, guys; we may expect living wages and safe, non-toxic work environments, but at least you can have us arrested and sue us if we steal from or sabotage you.

Friday, October 5, 2012

Zoo Life- The End

My stint as a docent at the Prospect Park Zoo came to an end a few months back and I thought I'd take a moment to talk about why.  Partly just for the irony of it; you'll see what I mean.

Here's how the story began, but not why I ultimately quit.  I was part of the animal handling program there.  This had been going on for a number of years; docents who had gone through training and been certified were able to present animals to the public during tours and scheduled presentations on  weekends.  Not all of the zoo animals mind you, only ones the animal keepers had designated as tractable based on species and the individual, like chinchillas, chickens and some snake and lizard species.  It was a rewarding program for the docents and a nice chance for the public to get up-close-and-personal with some of the animals, including a chance to touch them.

Rather suddenly, the program was ended.  To my knowledge, there was no unfortunate incident that brought this on, and that's the kind of rumor that spreads pretty quickly.  This seemed more like a policy decision from the managerial level, probably from someone new to the zoo who was worried about liability.  That's a guess though; we were just told that the program was no more.  I was disappointed, but also recognized that I was just a volunteer there and as such didn't have any say in policy.  Some of the other docents took it a lot harder; there were meetings (including tearful pleas), emails and petitions on the web to appeal the end of the program.  Frankly, it was all a little melodramatic for my tastes.

At this point though, I'm still a loyal volunteer and staying out of the fray.  I even received a call from the NY Post (blech) on the matter and had nothing but good things to say about the zoo.  Until the day I was informed that a new condition of being a volunteer was that I had to sign an agreement saying that I would never make any public (via media, Facebook, email or however) negative statements about the zoo.  Excuse me?  I had no problem with restrictions about what I could say while I was at the zoo acting as a representative, but trying to restrict my free speech as an individual outside of the zoo?  That was unacceptable to me.  If I was an employee, it would be illegal to require that of me; we have whistleblower protection laws in this country for that.  If they did ask their employees to sign the same agreement, I hope the illegality of it comes back to bite them in the ass later.

I don't like it when people try to inappropriately assert authority over me, or anyone.  People who do so are generally bad human beings even beyond their pathetic little power grabs.  Being the CEO of my own company has made me rather sensitive to it.  So I told them I wouldn't sign and why, and left the zoo on my final day of volunteering sniffling as I did so.  I really did enjoy my time there until some asshole who didn't even have the courage to identify themselves came along and ruined it for me.  God I hate having principles sometimes.

Thursday, October 4, 2012

The Joy of Salesforce Governor Limits

Working within the governor limits in Salesforce can be a real balancing act.  I understand why they're there; it's a shared environment and you can't have developers writing code that bogs down the servers for everyone else. Especially because, let's face it, there are a lot of really shitty developers out there. But that doesn't mean I don't curse their name when I see those dreaded LimitException errors.

I started writing Apex as soon as it came out, back when the governor limits were a lot more stringent and nifty tricks for avoiding them, like future methods, didn't exist yet.  Nowadays, I have a number of clients processing large amounts of data and/or implementing very complex business processes on their data, so I've come up with quite a few tips and tricks for dealing with the limits that I thought I'd share.

Bulk Proofing Code and Future Methods

I'm lumping these together because you don't really need me to explain them to you.  Unlike many other bloggers, I'm not interested in proving how clever I am by regurgitating information already published much more eloquently by Salesforce themselves, so I'll just refer you to their documentation.
If you want more info on those, Google is your friend.  If you can't figure out how to bulk-proof your code, you probably have no business writing code; this is shared-environment coding 101 (and not so bad for developers working within their own environments to know either).

Two drawbacks of Future methods that are worth noting if you're considering implementing them-
  • Your future method may be generating records or updating fields that your users are expecting to see instantly, and it can be confusing for them if they don't. Generally it happens fast enough that all they have to do is refresh the page to see it, but asynchronous logic is a little abstract to explain to a non-developer (and even to a lot of developers).
  • If your future method hits an error, either your user will never know it or you'll have to write logic that emails someone when an error is hit- but again, that's confusing to an end user.  Be prepared to fix errors in future methods that you learn about by email, and remember to go back and clean up any data affected when the method didn't complete correctly (a bare-bones sketch of the error email idea follows).
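That error email can be as simple as a try/catch around the future method's body- a quick sketch, with the class name, method name and address all invented:

public class clsStatistics {
    @future
    public static void syncStatistics(Set<Id> recordIds) {
        try {
            //...the real processing goes here...
        } catch (Exception e) {
            //the user who triggered this will never see the exception, so tell someone
            Messaging.SingleEmailMessage mail = new Messaging.SingleEmailMessage();
            mail.setToAddresses(new String[]{ 'admin@example.com' });
            mail.setSubject('Future method syncStatistics failed');
            mail.setPlainTextBody('Failed for ' + recordIds.size() + ' records: ' + e.getMessage());
            Messaging.sendEmail(new Messaging.SingleEmailMessage[]{ mail });
        }
    }
}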

Chunking Future Methods

I've had scenarios where DML operations on multiple records, set off either by Apex or by data imports, fire a future method and then hit governor limits within that method.  Each time you run code you can set off up to 10 future methods, so if governor limits are an issue in your future method, consider breaking whatever you're sending into it into manageable chunks.  An example of some trigger logic that does exactly that here-

//Breaking this into chunks of 30 to avoid governor limits
Map<Integer, Set<Id>> mapSetStats = new Map<Integer, Set<Id>>();
for (Attendance__c a : Trigger.new){
    //pick up newly inserted 'Attended' records, or updates that moved a record out of 'Attended'
    if ((Trigger.isInsert && a.Status__c == 'Attended')
     || (Trigger.isUpdate
         && a.Status__c != Trigger.oldMap.get(a.Id).Status__c
         && Trigger.oldMap.get(a.Id).Status__c == 'Attended')
    ){
        //for the first Id, or when the last Set in the Map already holds 30 Ids, add a new Set to the map
        if (mapSetStats.size() == 0 || mapSetStats.get(mapSetStats.size()).size() == 30){
            mapSetStats.put(mapSetStats.size() + 1, new Set<Id>());
        }
        mapSetStats.get(mapSetStats.size()).add(a.Id);
    }
}
//Sync Monthly/Annual Client Statistics- one future call per chunk:
if (mapSetStats.size() > 0){
    for (Integer i : mapSetStats.keySet()){
        clsAttendance.syncClientStatistics(mapSetStats.get(i), new Set<Id>());
    }
}

In this example, "mapSetStats" is populated with individual Sets of Ids, each of which holds no more than 30 Ids.  I happen to know that users will never update more than 300 records at a time (10 future calls * 30-record chunks), so I can get away with this.  If chunking isn't feasible for any reason, then you'll need another solution, such as....

Business Logic in Scheduled Code

I've used this not only for getting around governor limits but also for limits on the other side of external web service callouts. Here's the basic idea, with a sketch after the list-
  • Instead of sending records to a Future method for processing, find a way to gather the records that require processing in a query.  In some scenarios, I've created a checkbox field specifically for this purpose and set it to true in a trigger, but sometimes the very nature of what you need to accomplish (e.g. a blank field that needs to be filled in) means you don't have to explicitly mark those records to identify them in a query.
  • Create a Batchable class which queries for the records that have been marked in the step above.  For details on this, see Salesforce's article on Using Batch Apex.  
  • Create a Schedulable class which performs your logic on records from your Batchable class.  Again, you don't need me to figure this out, see Apex Scheduler.
  • When you use Database.executeBatch to get records from the Batchable class, you can pass in an optional second parameter to set the limit on the number of records returned- you just need to figure out what's the largest number of records you can process at once without hitting the limits.
  • If you did need to explicitly mark records for processing, like my checkbox example above, make sure you un-mark them when the processing is complete.  Otherwise, the same records will just keep being updated in an endless loop.
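Put together, the skeleton looks something like this- Invoice__c and Needs_Processing__c are stand-ins for whatever object and flag you use, and the scope size of 200 is just a guess you'd tune:

global class ProcessMarkedRecords implements Database.Batchable<SObject>, Schedulable {

    global Database.QueryLocator start(Database.BatchableContext bc) {
        //only pick up the records the trigger flagged for processing
        return Database.getQueryLocator(
            [Select Id, Needs_Processing__c From Invoice__c Where Needs_Processing__c = true]);
    }

    global void execute(Database.BatchableContext bc, List<Invoice__c> scope) {
        for (Invoice__c inv : scope) {
            //...the real business logic goes here...
            inv.Needs_Processing__c = false;   //un-mark so the record isn't processed again
        }
        update scope;
    }

    global void finish(Database.BatchableContext bc) {}

    //Schedulable entry point- kicks off the batch with a scope size small enough to stay under the limits
    global void execute(SchedulableContext sc) {
        Database.executeBatch(new ProcessMarkedRecords(), 200);
    }
}

Scheduling it to run hourly is then a one-liner in anonymous Apex: System.schedule('Process marked records', '0 0 * * * ?', new ProcessMarkedRecords());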
As with Future methods, users will need to understand that there will be a delay in the updates that happen as a result of scheduled code.  I've been lucky enough that in the cases where I have had to use scheduled code, users haven't had an issue with that.  If your users do have a problem with it, remind them that the fancy, magical updates they're looking for were probably absolutely impossible for them to accomplish just a few years ago so they'll have to shut up and deal.  Depending on your relationship with them, you might need to re-phrase that.

Sunday, September 2, 2012

Using Third-Party Generated Certificates in HTTPRequest Calls

I've been doing various integrations with external web services lately and have learned some things along the way. The latest integration I did required two-way authentication.  The Salesforce documentation makes some assumptions about these integrations-
  1. That the service you're integrating with provides a parseable WSDL.
  2. That the service you're integrating with allows you to upload a signed certificate that was generated from Salesforce for the two-way integration.
And if both of those hold true, the Salesforce documentation (and various related blog posts) will be sufficient for you.  However-
  1. Not all services provide a WSDL, and even those that do might include XML elements that are not supported by Salesforce.  In this case, you're on your own for writing an Apex class to interact with the web service.
  2. Some services provide a WSDL that Salesforce can parse, but you'll find that you need to edit the resulting Apex class to include properties that were excluded.  Though I didn't dig in to see whether this was the fault of the WSDL file itself or the parser, I suspect the latter.  How to go about editing that generated Apex class is worth its own post.
  3. Some services provide you with a client certificate file that you need to include in your request- you don't have the option of generating one in Salesforce and getting it signed.
None of these are insurmountable, but they'll take a lot more leg-work on your part and in the case of having to provide a third-party generated client certificate in an HTTP request, I found myself having to piece together how to do so from various forum posts.  So I'll focus on that today.  

So an external service (in my case, the credit card processor First Data) provides you with a client certificate file and a password for that file.  Now what?  In Salesforce's documentation for the HTTPRequest class you'll notice a method called "setClientCertificate" with a note that this method is deprecated and you should really use "setClientCertificateName".  But if you've been provided with the client certificate, "setClientCertificate" is what you'll need.  Here's how to get it to work-
  • Upload the client certificate as a Static Resource
  • In Apex, query for that file and base64 encode the body of the file into a string variable.
  • In your HTTPRequest variable, use the setClientCertificate method with the base64 encoded certificate as the first argument, and the certificate password as the second argument.
Oh sure, it's easy when you know how to do it.  Note that, alternatively, you can base64 encode the certificate yourself and use the resulting string in setClientCertificate, but the above seemed a little more elegant.  The actual code to accomplish this-


//Pull the client certificate that was uploaded as a Static Resource
//and base64 encode its body
String cert;
for (StaticResource sr : [Select Id, Body from StaticResource where Name = :<certfilename>]){
    cert = EncodingUtil.base64Encode(sr.Body);
}

//Build the request and attach the encoded certificate plus its password
HttpRequest hReq = new HttpRequest();
hReq.setEndpoint(<url>);
hReq.setMethod('POST');
hReq.setClientCertificate(cert, <certpassword>);

If you were lucky enough to have a WSDL that Salesforce was able to parse, you'll be setting the "clientCert_x" variable of the stub to the base64 encoded string and the "clientCertPasswd_x" variable to the password.

Once again, hope this saves someone some time!