Tuesday 21 February 2012

Monday 20 February 2012

Reverse Replication woes - solved

Hot off the press/keyboard (i.e. not fully tested). With the help of an Adobe support engineer in Basel and an on-site Adobe consultant we discovered what the root cause of the reverse replication problem was.

When a user voted in a poll, the new vote AND ALL previous votes were being reverse replicated. This caused a MASSIVE workload on the Author because each node in /var/replication/outbox did not contain just the 1 corresponding vote; it contained ALL of the votes, including the new one. This explains why the Author would take 20 minutes to process just 10 nodes in the outbox.

The root cause was in the structure we were using (abbreviated here):-

/content
  /usergenerated
    /somepoll
      /poll1 [cq:Page]
        /jcr:content [cq:PageContent]
          /question
            /answers
              /1
                /12423434
                /12312323
              /2
                /23463456

Each vote is added under the /answers node as type "nt:unstructured" with its various properties. But, on each submission of a vote, the custom code (a custom SlingPostServlet) was setting the 3 magic properties (cq:distribute, jcr:lastModified & jcr:lastModifiedBy) on "/poll1/jcr:content". This causes the page "poll1" to be marked for reverse replication and, with it, all its sub-nodes (aggregated).
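
To get a feel for why this hurts, here is a rough back-of-the-envelope model in Python. It is not CQ code: the function names are made up for illustration, and it assumes that an outbox item for the aggregated page carries every vote that exists at that moment, which is what we observed.

```python
def votes_in_outbox_item(existing_votes: int, whole_page_aggregated: bool) -> int:
    """Votes carried by the outbox item created for one new vote (simplified model)."""
    if whole_page_aggregated:
        # The whole "poll1" page is marked, so every earlier vote rides along.
        return existing_votes + 1
    # Each vote is its own page, so only the new vote is packed.
    return 1


def total_votes_unpacked(num_votes: int, whole_page_aggregated: bool) -> int:
    """Total votes the Author has to unpack after num_votes submissions."""
    return sum(
        votes_in_outbox_item(n, whole_page_aggregated) for n in range(num_votes)
    )


if __name__ == "__main__":
    print(total_votes_unpacked(1000, whole_page_aggregated=True))
    print(total_votes_unpacked(1000, whole_page_aggregated=False))
```

Under this model the work grows roughly with the square of the number of votes when the whole page is aggregated, and only linearly when each vote is replicated on its own.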

The solution was to make each vote node an individual page in its own right, as follows :-

/content
  /usergenerated
    /somepoll
      /poll1
        /question
          /answers
            /1
              /12423434 [cq:Page]
                /jcr:content [cq:PageContent]
              /12312323 [cq:Page]
                /jcr:content [cq:PageContent]
            /2
              /23463456 [cq:Page]
                /jcr:content [cq:PageContent]

Then ensure that the 3 magic properties are set on the jcr:content node of each vote node. Note: DO NOT have a jcr:content node anywhere in the intermediate hierarchy because this interferes with the firing of the outbox manager. I think it sees a jcr:content node and assumes that there must be a page there but, because there isn't a page there, it aborts and nothing appears in the outbox. I suffered with this problem when I kept "poll1/jcr:content" in the path (i.e. /content/usergenerated/somepoll/poll1/jcr:content/question/answers/1/12423434/jcr:content).
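
A small guard for the path rule above could look something like this. It is only a Python sketch of the check (our real code lives in a Java servlet), and the helper name has_intermediate_jcr_content is hypothetical; it simply looks for a jcr:content segment anywhere before the last one.

```python
def has_intermediate_jcr_content(path: str) -> bool:
    """True if a jcr:content segment appears anywhere except as the final segment."""
    segments = [segment for segment in path.split("/") if segment]
    # The last segment is allowed to be jcr:content (the vote page's own content node).
    return "jcr:content" in segments[:-1]


if __name__ == "__main__":
    bad = ("/content/usergenerated/somepoll/poll1/jcr:content"
           "/question/answers/1/12423434/jcr:content")
    good = ("/content/usergenerated/somepoll/poll1"
            "/question/answers/1/12423434/jcr:content")
    print(has_intermediate_jcr_content(bad))
    print(has_intermediate_jcr_content(good))
```

Running a check like this before creating the vote node would have caught the layout that left my outbox empty.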

NB, due to our environment, we needed to use a custom SlingPostServlet and start the reverse replication in our project. However, the above structure should work with the normal OOTB page manager activated reverse replication.

The HTML form to post these votes would be something like this :-

<form action="/content/usergenerated/somepoll/poll1/question/answers/1/123456789" method="post" enctype="multipart/form-data">
  <!-- the posted node itself becomes the vote page -->
  <input type="hidden" name="./jcr:primaryType" value="cq:Page" />
  <!-- its jcr:content child holds the answer -->
  <input type="hidden" name="./jcr:content/jcr:primaryType" value="cq:PageContent" />
  <input type="hidden" name="./jcr:content/answer" value="my_chosen_answer" />

  <input type="hidden" name=":redirect" value="/content/website/thankyou.html" />
  <button type="submit">Submit</button>
</form>
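
For scripted testing, the same POST can be built without a browser. The following is a minimal Python sketch, not what we run in production: build_vote_request and the host value are assumptions for illustration, it sends the fields URL-encoded rather than as multipart/form-data, and it uses the plain "./" prefix for the property names.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Parent path of the answer options, taken from the structure above.
POLL_ANSWERS = "/content/usergenerated/somepoll/poll1/question/answers"


def build_vote_request(host: str, option: str, vote_id: str, answer: str) -> Request:
    """Build (but do not send) a POST that creates one vote page."""
    fields = [
        ("./jcr:primaryType", "cq:Page"),
        ("./jcr:content/jcr:primaryType", "cq:PageContent"),
        ("./jcr:content/answer", answer),
        (":redirect", "/content/website/thankyou.html"),
    ]
    url = f"{host}{POLL_ANSWERS}/{option}/{vote_id}"
    # urlencode percent-encodes the names and values, so the body is pure ASCII.
    body = urlencode(fields).encode("ascii")
    return Request(url, data=body, method="POST")


if __name__ == "__main__":
    # "http://localhost:4503" is a placeholder publish host, not our real one.
    request = build_vote_request("http://localhost:4503", "1", "123456789", "my_chosen_answer")
    print(request.get_method(), request.full_url)
    print(request.data.decode("ascii"))
```

Passing the returned object to urllib.request.urlopen would actually submit the vote; authentication and the custom servlet that sets the 3 magic properties are deliberately left out of this sketch.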

Thursday 16 February 2012

Reverse Replication woes

So, in my previous post I said how wonderful FP37434 is (the replication stabilisation FP). Unfortunately, it did not solve our problem and we now have a large volume of content to reverse replicate (~50k nodes in /var/replication/outbox across all our publish servers).

We are currently facing 2 problems. First, when the RR (reverse replication) agent polls, the publish server with FP37434 exhibits a huge native memory leak (approx 8GB of native memory is being claimed), causing a great deal of paging on the system.

Second, when we batch this down to only 10 items in the outbox, the author takes 30 minutes to process those 10 nodes.

Adding extra logging (com.day.cq.replication.content.durbo) at DEBUG level shows that the Author is doing valid work for 30 minutes processing just 10 nodes from the outbox.

It turns out that when a node is added to /content/usergenerated/path/to/something, CQ appears to be packing all of the pre-existing sibling nodes into the newly created node under /var/replication/outbox. You can see this by analysing the nodes inside the outbox. This is why 10 nodes take 30 minutes for the author to process: it's actually unpacking 10000 nodes.

This probably also explains why our CQ author is performing slowly.

Hopefully, I will remember to post the solution here when we get to it ... :-)