Status for Andrew DeFaria: June 2006 Archives

June 27, 2006

Salira Vob Corruption

  • Cleaned up Multisite Packets
  • Cleaned up sons-sc-cc:/Windows/temp and sons-clearcase salira vob cleartext pools due to disk space crunch
  • Ran dbcheck on salira vob to fix corruption
  • Tested changing mastership of a test branch

Time spent: 7 hours

Cleaning up Multisite Packets

First order of business was to clean up, as much as possible, the multisite packets that reside in the shipping bays for both sons-clearcase and sons-sc-cc. As per my prior work there seem to be huge sync packets to process, which takes time. I wanted to attempt a chmaster on an older branch to see how that change propagates from sons-clearcase -> sons-sc-cc. Part of the chmaster involves informing the other replica of the change. This happens through the normal multisite syncreplica, so if the bays are full of huge packets I need to process them first.

One problem I hit was running out of space on sons-sc-cc. Normally this is not a problem as there is enough space on the C drive where the vobs reside, but with these huge packets going back and forth I was running out of space. Cleaned up some space and attempted to import all packets on sons-sc-cc.

I also attempted to scrub the cleartext pool on sons-clearcase, which has grown to 4 gig! The cleartext pool is only a cache, and since Clearcase can reconstruct it at any time I figured I could save 4 gig.
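Since the import stalled on disk space, a quick pre-flight check could compare the size of the packets waiting in a shipping bay against the free space on that drive. This is only a minimal sketch and not part of MultiSite: the function name `packets_fit` and the `headroom` safety factor are my own assumptions.

```python
import os
import shutil


def packets_fit(bay_dir, headroom=1.5):
    """Compare pending packet size in bay_dir with free space on its drive.

    headroom is a hypothetical safety factor, not a MultiSite setting.
    Returns (fits, packet_bytes, free_bytes).
    """
    packet_bytes = 0
    for root, _dirs, files in os.walk(bay_dir):
        for name in files:
            packet_bytes += os.path.getsize(os.path.join(root, name))
    free_bytes = shutil.disk_usage(bay_dir).free
    return free_bytes >= packet_bytes * headroom, packet_bytes, free_bytes
```

Pointing it at the bay directory before kicking off an import would at least tell me whether to clean up first.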

Testing chmaster

Verified that I cannot check out, and back in, an element on the rel_1.0 branch from a view on sons-sc-cc. I then attempted to transfer mastership of the rel_1.0 branch -> sons-sc-cc but received the following error:

[ccadmin] sons-clearcase:ct chmaster SantaClara brtype:rel_1.0@\\salira
cleartool: Error: Branch type "rel_1.0" has branches (with default mastership) that have outstanding checkouts.

Indeed, there are still checkouts on the rel_1.0 branch in, for example, the view YXiu_view_desktop (e.g. salira/neopon/build/makefile).

Ran dbcheck on salira vob to fix corruption

10:40 Pm: Decided to give up on the testing of chmaster and get the vob fixed. Locked salira vob. Started copy of db.

10:43 Pm: Started keybuild procedure. Keybuild failed with:

db_VISTA Version 3.20
Key File Build Utility
Copyright (C) 1985-1990 Raima Corporation, All Rights Reserved

initializing key file: vob_db.k01
initializing key file: vob_db.k02
initializing key file: vob_db.k03
initializing key file: vob_db.k04
processing data file: vob_db.d01, total records = 3555277
 record:       9000
 record:      19000
 record:      29000
 record:      39000
 record:      49000
 record:      59000
 record:      69000
 record:      79000
 record:      89000
 record:      99000
 record:     109000
 record:     119000
 record:     129000
 record:     139000
 record:     149000
 record:     159000

keybuild failed with an exit code of 58. Ran keybuild again... This seems to be going better... Did d01 file. Proceeded to work on the d02 file then (11:07 Pm):

record:   863000
*** db_VISTA database error -901 - system error Bad read 863475 863474
processing data file vob_db.d02, total records = 1
record: 1
key file rebuild completed

Hmmm... Doesn't seem like the key file rebuild was really completed. I wonder... Should I try again? Trying again...
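The trap here is that keybuild printed its completion message right after a database error. A small check along these lines could flag such a run as suspect. It is a sketch under assumptions: the helper names are mine, the patterns come only from the output pasted above, and I am assuming the second run could have exited 0 (I did not record its exit code).

```python
import re

# Matches lines such as "*** db_VISTA database error -901 - system error"
DB_ERROR_RE = re.compile(r"db_VISTA database error (-?\d+)")


def keybuild_errors(output):
    """Return the db_VISTA error codes found in keybuild output, as ints."""
    return [int(code) for code in DB_ERROR_RE.findall(output)]


def keybuild_ok(exit_code, output):
    """Trust the run only if it exited 0, logged no db_VISTA error,
    and printed the completion message."""
    return (
        exit_code == 0
        and not keybuild_errors(output)
        and "key file rebuild completed" in output
    )
```

The first run (exit code 58) and the second run (error -901) would both be rejected by `keybuild_ok`, even though the second one claims to have completed.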

Third time's a charm, they say! Keybuild ran to completion, but for a while it was touch and go as sons-clearcase was not responding. Now, however, I can import the packets that were stuck... Well, most of them:

Applied sync. packet sync_SantaClara_26-Jun-06.02.00.01_5308 to VOB \\sons-clearcase\VOBs\salira.vbs
Multitool.exe: Error: Database identifier (dbid) not found in database: "\salira".
Multitool.exe: Error: Could not get oplog entry with order:2886884 from replica:
China with oplog_id:376595: reference to non-existent ClearCase object.
Multitool.exe: Error: Could not check oplog entry for divergence: reference to non-existent ClearCase object.
Multitool.exe: Error: Cannot apply sync. packet sync_China_26-Jun-06.16.32.42_3292_1 to VOB replica \\sons-clearcase\VOBs\salira.vbs: reference to non-existent ClearCase object

Damn. Ran syncreplica -import again and everything got processed. I'm glad it's processed but I can't help but wonder why I hit these errors...
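Because a second import went through cleanly, one way to handle this in a scheduled job is to retry a few times while keeping the error lines from each attempt for later study. The sketch below is hypothetical: `run_import` stands for whatever function actually runs the import and returns its text output, and treating any line containing "Error:" as a failure is my assumption based on the messages above.

```python
def multitool_errors(output):
    """Return the lines that multitool flagged as errors."""
    return [line for line in output.splitlines() if "Error:" in line]


def import_until_clean(run_import, max_attempts=3):
    """Call run_import() until it reports no errors or attempts run out.

    run_import is a caller-supplied function returning the command's text
    output. Returns (succeeded, history), where history holds the error
    lines of every attempt so they can be examined afterwards.
    """
    history = []
    for _ in range(max_attempts):
        errors = multitool_errors(run_import())
        history.append(errors)
        if not errors:
            return True, history
    return False, history
```

Keeping `history` matters more than the retry itself, since the errors that vanish on the second pass are exactly the ones I still cannot explain.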

June 26, 2006

dbcheck

  • Ran dbcheck on salira vob

Time spent: 2 hours

Frank W O'Keefe wrote:

Hello Andrew,

For the error: 06/23/06 07:48:04 db_server(10104): Error: db_server.exe(10104): Error: Database identifier 427883 not found in "../db__obj.c" line 731.

This could possibly mean there is an issue with the VOBs database. Unfortunately I cannot determine which VOB this is for? I would need you to run a "dbcheck" on the VOB that is reporting this error. Unfortunately I was seeing this error many times in the logs so I cannot tell for which VOB it is reporting this on.

(10104) in the error is the process id that is/was running. This may help in finding the VOB.

I'm pretty sure I know the vob in question - their main vob (\salira).

The following URL is to the instructions on running dbcheck. http://www-1.ibm.com/support/docview.wss?uid=swg21122748

I tried following that by using the method of lock vob, copy the vob database files, unlock vob, dbcheck the copy. Every time I got a -4 error, so I went back to lock vob, dbcheck, unlock vob.

I was surprised to see some stuff come out on stderr:

[ccadmin] sons-clearcase:/apps/Rational/ClearCase/etc/utils/dbcheck -r1 -a -k -p8192 vob_db > C:\\cygwin\\tmp\\dbcheck.txt

Processing delete chain:  75 nodes on delete chain.
Processing nodes:
+++....

Eventually it finished stating:

Database consistency check completed

169 errors were encountered in 167 records/nodes

Also, I am going to send you a URL to a technote about this PC's heap size. I see messages indicating that you may need to adjust the heap settings for this host.

http://www-1.ibm.com/support/docview.wss?uid=swg21142584

Depending on the dbcheck output, we may need to get a copy of the VOB's db directory but I rather hold off on that request until we see what the dbcheck reports.

I've attached the dbcheck output.
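To compare dbcheck runs over time it helps to pull the final tally out of the saved output. This is a minimal sketch: the function name is mine and the pattern is based only on the one summary line quoted above, so other dbcheck versions may word it differently.

```python
import re

# Based on: "169 errors were encountered in 167 records/nodes"
SUMMARY_RE = re.compile(
    r"(\d+) errors? (?:was|were) encountered in (\d+) records?/nodes?"
)


def dbcheck_summary(text):
    """Return (error_count, record_count) from dbcheck output, or None
    if no summary line is present."""
    match = SUMMARY_RE.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

A result of `None` means the summary line was not found at all, which is different from a run that reports zero errors.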

June 22, 2006

Vob corruption/Email/Chmaster

  • Fixed problem with email from Multisite jobs
  • Investigated vob corruption
  • Looked into chmaster

Total time: 5 hours

Email server for Multisite messages

Multisite needs to send email if there is a problem with synchronization. The setting for which SMTP server to use is in the Clearcase Control Panel under Advanced. Somehow that got set to sons-exch02 which is no longer a valid SMTP server. Changed this to sons-exch01.

Database identifier not found when doing syncreplica import

Dear IBM/Rational Tech Support: My name is Andrew DeFaria and I perform Clearcase/Clearquest consultant services. One of my clients, Salira Optical Network Systems (a former employer of mine), has been experiencing a problem, described below, and has asked me to look into it for them. They are also in the process of migrating to newer server hardware and migrating up to the latest version of Clearcase/Clearquest. I've been performing this migration. So far we have the new server up and have Multisite replicating things between 3 "sites" - a remote one in Shanghai (sons-cc) and two in Santa Clara: the old server (sons-clearcase) and the new server (sons-sc-cc).

I had been working on this problem for a while last night. I was seeing what you guys were seeing - the db_server process will be running wildly and taking up 50% of the CPU. This seems to happen whenever the scheduled syncreplica -import runs on sons-clearcase. As a result sons-clearcase is not being synced. As far as I can tell this syncreplica -import never finishes and the db_server process consumes 50% of the CPU until killed.

I've also seen several of the following error in the db_server log:

Database identifier <x> not found in "../db__obj.c" line 731

While the line number remains the same, the ids I've seen are 427883, 427919 and 427922.

Thinking that this was some sort of vob database corruption I ran checkvob and it reported some minor missing references to source containers. I then ran it in fix mode which cleared up the missing source containers but the missing db identifiers remain.

Next I tried running recoverpacket hoping to set the epoch numbers back a few days and thinking maybe the syncreplica would repair itself. On sons-sc-cc (SantaClara replica) I issued the following command:

[ccadmin] sons-sc-cc:mt recoverpacket -since 20-Jun-05 SantaClara@\salira
Using epoch information from Monday, June 19, 2006 11:00:03 Pm
Epoch row for replica "US" successfully reset

Then back on sons-clearcase (US replica) I issued:

[ccadmin] sons-clearcase:mt syncreplica -export -fship SantaClara@\salira

This went on to create a huge packet (growing to over 1 gig!) before the scheduled syncreplica -import started and tied up the db_server process.

Searching IBM/Rational support, the closest thing I see is multitool syncreplica -export fails with Database identifier 0 not found in "../db__ver.c" line 505. I know this speaks of syncreplica -export and references db__ver.c, not db__obj.c, but it is the closest problem report that I could find. And it has the ominous note of:

Note: This defect may also occur in ClearCase MultiSite 2002.05 (5.0), however, the fix will not be back patched, you either need to back out of the patch that introduced this, or upgrade to a later version of ClearCase MultiSite to recover.

The clearcase version on sons-clearcase is:

[ccadmin] sons-clearcase:ct -ver
ClearCase version 2002.05.00 (Tue Oct 30 08:27:59 2001)
clearcase patch p2002.05.00 NT-8 (Mon Jun 10 14:44:04 2002)
clearcase patch p2002.05.00 NT-12 (Thu Sep 12 11:15:10 2002)
@(#) MVFS version 2002.05.00+ (May 25 2002 03:14:49)
cleartool                         2002.05.00 (Fri Oct 26 20:24:09  2001)
db_server                         2002.05.00+ (Fri Aug 30 11:48:28 2002)

As I said, we are in the process of migrating to 2003.06 and we are already halfway there - however, at this point people have not yet fully migrated their views over to the new server and the old server still serves Clearcase licenses.

Finally, as I am only a part time consultant at Salira you may wish to contact Jeff Stribling (408-845-5200) directly to gather more info and possibly try some solutions. My contact information is at http://defaria.com/contact.php but realize that during the day I'm at another client.
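The db_server log errors quoted in the message above can be tallied rather than eyeballed. The sketch below assumes the log line format shown in this archive; the function name is hypothetical and the pattern tolerates the missing space in "foundin" in case that is how the log really prints it.

```python
import re
from collections import Counter

DBID_RE = re.compile(
    r'Database identifier (\d+) not found\s*in "([^"]+)" line (\d+)'
)


def missing_dbids(log_text):
    """Count how often each (dbid, source file, line) triple is reported."""
    return Counter(
        (int(dbid), src, int(line))
        for dbid, src, line in DBID_RE.findall(log_text)
    )
```

Run over the whole db_server log, this would show whether the same few ids keep recurring or whether new ones appear after each import attempt.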

Chmaster

There are a few Clearcase objects that can have mastership. These are:

  1. Label types
  2. Branch types
  3. Trigger types
  4. Hyperlink types
  5. Attribute types
  6. Element types

1 & 2 above are the ones that concern me and that need to eventually get transferred. #3 has already been done by my mktriggers script, which added the triggers to the vobs over on sons-sc-cc long ago. Salira doesn't really use #4, 5 and 6 (there are only the predefined types for those).

It might be good for you, or perhaps Vijay, to experiment a bit on changing the mastership of a branch type, a branch that is not heavily used. I would:

  • Set up a new view on sons-sc-cc oriented to working on this test branch that is mastered by sons-clearcase
  • Verify that there is a problem and how it manifests itself attempting to use this view on sons-sc-cc. IOW, while working on this new view on sons-sc-cc verify that you cannot checkout to this test branch because it's mastered at sons-clearcase
  • Transfer mastership of this test branch over to sons-sc-cc
  • Test that the problem has gone away

Of course I realize that the overriding problem is the current vob database problem on sons-clearcase described in my earlier email. Assuming that that's fixed and multisiting is working...

You can see these types by right clicking on the vob in the Clearcase Explorer on sons-clearcase and selecting Explore Types. You can double click on branch type and find your test branch. Right click on it and select Properties. Go to the Mastership tab and click on Change. Select SantaClara (that's sons-sc-cc). You might also need to perform a syncreplica (run the scheduled job for sync export on sons-clearcase and the sync import on sons-sc-cc).

There are about 30 branch types that are mastered on sons-clearcase. There are far more label types that are mastered on sons-clearcase. IOW while I might change mastership by hand for the 30 or so branch types, I wouldn't want to change mastership for all those labels.
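For the 30 or so branch types, the by-hand work could be reduced to generating one chmaster command per type. This is a sketch under assumptions: the command shape copies the `ct chmaster SantaClara brtype:rel_1.0@\\salira` form from the June 27 entry (assuming `ct` is an alias for `cleartool`), has not been checked against the cleartool reference page, and the function name is hypothetical.

```python
def chmaster_commands(replica, vob_tag, branch_types):
    """Build one argument list per branch type, ready for subprocess.run().

    The command shape mirrors the one used in this log
    (chmaster <replica> brtype:<name>@<vob>); it is an assumption, not
    something verified against the cleartool documentation.
    """
    return [
        ["cleartool", "chmaster", replica, f"brtype:{name}@{vob_tag}"]
        for name in branch_types
    ]
```

For example, `chmaster_commands("SantaClara", "\\salira", ["rel_1.0"])` yields a single command for the rel_1.0 branch type. Any branch type with outstanding checkouts would still be refused, as the error in the June 27 entry shows.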

What I have not verified is what happens if someone working on sons-sc-cc checks in a bug and the trigger attempts to move that pre-existing label (this was an update to an old bug) to point to the new version about to be checked in. Since that label is mastered by sons-clearcase, will that be a problem?

If your test above regarding changing the mastership of a test branch type is successful, you may wish to move a more heavily used branch type's mastership in the same manner, then instruct the engineers involved in that branch to move to the new server. Then the next branch type, etc.