Tuesday, July 29, 2014

Endeca data modeling for best performance

ECommerce is all about selling products online. For capitalizing a sell opportunity and leaving an impression on customer, you have to present your products as quickly as possible in a nice clean way.


First thought that comes into mind to achieve ‘as quickly as possible’ requirement is Cache;


  • Browser cache – e.g. local downloaded objects
  • CDN cache – e.g. can cache whole page static or dynamic (with ajax calls for loading customer specific data), cache static content, cache redirects, cache error pages/responses  
  • Web server cache – e.g. cache assets in disk/memory, persistent connection
  • Applications cache – e.g. cache to reduce db calls, static objects holding common properties
  • Database cache – e.g. query caching, pre-compiled queries


Any medium or large eCommerce application considers some sort of caching strategies in the architecture and it should be. But caching can’t resolve response time consumed in first hit when nothing was cached. Caching at any level needs some unique way to identify your second hit is exactly same to first one – cache key. What if there are infinite combinations and all possible combinations cannot be cached or what if you have too many session variables that make a customer session so unique that becomes impossible to cache and customer starts experiencing delay in page responses.


So apart from caching you have to work on basics and basic is data modeling. You should always model your data in a way that application should not end up doing iteration/consolidation or lot of manipulation. If Data returned from layer 1 to layer 2 is in the form that can be presented directly without iteration or with some iteration then you have done a good job. e.g. A nested for loop is always one of the biggest killer for eCommerce applications and application performance goes down and down as your data grow with time.  


Let me explain above principle with a data modeling assignment that I recently completed.


Suppose your eCommerce database stores products information into hierarchical format as shown in below –
One product that comes into three colors and each color has three size variants. It means this product information is distributed into 13 objects.


Above is very common structure for any eCommerce site especially for retailers those deal in apparel domain.


You are asked to design data model for search engine so that application should able to display color swatches on product listing page with price ranges. E.g. Color red, size M=$5 L=$7 XL=$10 XXL=$12, so your price range becomes $5-$12.


For example, Below screen shot displays swatches with price ranges on product listing page.


A search engine no matter it is Solr or Endeca always manages data into flat structure format called a Record or document. They are built to present data quickly instead of taking joins across tables at run time. That is one of the basic difference between search engine and relational databases.


In your case you have product information distributed at three levels.  


So you have following options


Option 1 – Index data at Base product level. This means a record have to store all color and sizes information as multi valued attributes.


Information distributed in 13 objects as in above diagram need to be mapped into one single record.
This become even worse when you have to store multi value attributes for image urls for corresponding color.


If you have a naming convention followed for image urls then you might not require index image urls otherwise your record will be very messy and web application fetching such records will end up splitting multi value data, identify and pick corresponding urls. Lots of lot iteration we are building in this design.


This design will become a bottleneck as your business grows and you enrich your products with more attributes at color or size level.


I will never recommend this solution.


Option 2 – Index data at color level and ask search engine to group them on basis of base product id. I mean use parent product id as rollup key (Endeca specific concept).


This way data available at 13 objects level need to be mapped into three records and if you index base product id then you can use this id as rollup key (group by).


So you will get an aggregated record list with a representative record that will be presenting one of the records values among three by default. In Endeca you have good control on representative record e.g. though this record is one of them but it can have attributes that represent lowest and highest price values among three. (Called derived attributes in Endeca). This lowest and highest price becomes your price range.
Since your records representing color so you have to keep size and price data as multivalue attributes. But on search/listing pages you generally don’t need size specific information. It is required on product detail page which you generally don’t render through search engine.


The best part of this approach is search engine gives you data in almost same format that you want to display on listing pages. So fast rendering is possible with less iteration.


Option 3 - Index data at size level and use base product id as rollup key (group by)


13 objects are now mapped into 9 size records with a common base product key.
Duplicity of records is high in this solution and web application has to identify color by consolidating all size records. This can be a performance hit because you have to iterate all size records and consolidate color and other attributes.


Generally you don’t need size specific information on listing pages, if that is the case then I will recommend option 2 is the best way of modeling this requirement.


I hope I could link my principal of ‘less iteration’ with the very basic concept of ‘data modelling’.


Thanks.
http://www.linkedin.com/in/sumitag

1 comment:

Unknown said...

Why should you choose us as your outsourced partner for Amazon Product Listing Services. Well, to begin there is no match to the quality and accuracy level of our work. Our team will come up with customized solutions as per the client’s needs and budget. Mail - info@123edata.com, Call- 1-8772841032, 9958399732