Killa had told Trudy that he suspected a problem with memory on the web services server. It was way under the required amount, and Trudy had arranged to get more put in. In the meantime, Killa was going to show Trudy how to split up the crawl into smaller groups.
Killa Hertz & The Case of the Missing Documents – Part 8
Trudy was sitting next to me. She had her notebook with her. This one had little flowers in the corner. She wrote down the date and looked at me expectantly.
“Ok Trudy, what we need to do here is split up the crawl. At the moment, all the documents are being crawled in one go. As you mentioned, it takes about a week. If it fails, then you gotta start it again, and hope that it isn’t going to fail a second time.”
Trudy’s head nodded as if it was on a spring.
“We need to split up the documents so that we can crawl them in smaller groups. That way, if a crawl does fail, you only have to crawl that small group again.” Trudy scribbled furiously.
“So, let’s look at the docs. Normally, I find the best way to split them up is by size. Let’s have a look at the spread.”
I fired up Documentum Administrator and typed in a DQL command to get the maximum size, the minimum size, and the average size of the documents. The smallest was 0 bytes. The largest was just over a gigabyte. The average size was just under 1 megabyte.
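For readers following along, the statistics query might look something like the sketch below. The story does not show the exact command, so the object type dm_document, the column aliases, and the helper name size_summary_dql are all assumptions. The attribute r_content_size is the same one used in the content source filter later on.

```python
def size_summary_dql(object_type="dm_document"):
    """Return a DQL statement reporting max, min and average content size.

    The object type is an assumption; swap in whichever type the
    content source is actually configured to crawl.
    """
    return (
        "SELECT MAX(r_content_size) AS max_size, "
        "MIN(r_content_size) AS min_size, "
        "AVG(r_content_size) AS avg_size "
        f"FROM {object_type}"
    )


if __name__ == "__main__":
    # Print the statement so it can be pasted into a DQL editor.
    print(size_summary_dql())
```

The printed statement can be pasted into the DQL editor in Documentum Administrator. The results come back in bytes.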
“Ok Trudy – it looks like you’ve got lots of small documents, and a bunch of large documents.” She looked at the results from the DQL query, and I heard her inhale quickly.
“Wow – we’ve got a document that’s over a gigabyte – I didn’t know that,” she said. “Do you know what it is?”
I didn’t look at her, but just said, “That’s up to you to find out. I just want to make sure everything can be crawled.” She wrote down the size of the largest file and put a big circle around it.
Carrying on, I said, “We’ll make a new content source. But first – is there a Test system running here?” I didn’t want to do anything on their production system.
Quickly Trudy opened another browser session and opened the Administrator for the TEST environment. After a bit of clicking, I got to the screen displaying the content sources. “Trudy – I’m gonna make a new content source that only crawls documents under 250 KB.”
Luckily Mike had given me a few tips, so I knew that the content source couldn’t be created through the normal SharePoint method. At the top of the screen was an extra tab titled “Wingspan Connectors”. (Wingspan was the company that sold eResults). I clicked on it and was presented with more tabs and a screen that displayed the current content source.
“Look here, Trudy, this is where the content source needs to be created.” At the top of the page was a drop-down with the words “Add New”. “What I recommend that you do is create a new content source using exactly the same details as the current one. Make sure that the name of the docbase it will be connecting to is the same, and also that you’ve selected the right Item Types – these are the tables in the docbase that define which objects you want crawled, as well as the source for the crawled metadata. Give it a different name, of course.”
“Down here at the bottom,” I said, using the mouse pointer to make sure it was really clear, “this is where you need to specify the document sizes.” There was a section called “Custom Filter”. “Because you have already defined which tables will be used, you only have to type in the criteria portion of the query. Like this:”
ANY r_version_label='CURRENT' AND r_content_size < 256000
Trudy scribbled furiously on her pad. The kid obviously wanted to make sure that she didn’t miss anything.
“Once that is done, click on the other subtabs and make sure that everything matches the existing content source. Once that’s done – click on Save.” Trudy stopped me. She wanted to grab a screen shot of what I had done. Smart idea – a picture is often worth a thousand words. I clicked on Save. The original Content Management screen appeared.
“Then,” I said as I reached for my coffee, “then, you need to make sure that the crawl properties are properly picked up. You’ll see here, next to the new content source, is a link titled “Metadata”. Click on this, and the metadata window opens up. By clicking on ‘Generate’ eResults will read the tables you’ve selected and will create a crawled property for each column heading.”
As Trudy wrote this all down I added, “Even though you can get eResults to create corresponding Managed Properties, in this case, you don’t have to.”
Trudy looked up. “Umm – why’s that?”
“Well, eResults needs to know what column headings to use as crawled properties. However, the managed properties are the things that SharePoint looks after. When the original content source was created, the managed properties were generated. So these already exist.”
Trudy wrote it all down. She was on her fifth page, and she wrote the date on the top of each page, along with a page number.
“Now – you want to make sure that this “connection” will work.” I clicked back to the Content management page. Next to the name of the listed content source was a link ‘Test’. “You click on this, and then on this Run button that is now enabled. This will cause eResults to make a connection based on the settings we have just put in.”
While I explained this to Trudy, I carried out the same actions. There was a small delay, and then the screen was filled with information. The time the crawl started was listed, as well as an assortment of other data indicating that eResults was able to connect to the docbase. Under this, the crawled properties that were found for the document were listed, as well as the values of the crawled properties. “See,” I said, “it’s working.”
Trudy’s eyes lit up. “Cool!”
“Now, you need to create other content sources. Even though you can use whatever size limit you want, I recommend the following:
- documents equal to, or larger than 250 KB, and less than 1 MB
- documents equal to, or larger than 1 MB, and less than 100 MB
- documents equal to, or larger than 100 MB.
It looks like it’s the large documents that are causing the web service server to choke. Fortunately, there aren’t many of these.”
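To make the four size bands concrete, here is a minimal sketch that builds the Custom Filter text for each content source and checks that every document size lands in exactly one band. The first band reproduces the filter Killa typed (250 KB as 256000 bytes). The byte values for 1 MB and 100 MB, the band names, and the helper names are assumptions for illustration, not anything eResults itself defines.

```python
KB = 1024
MB = 1024 * KB

# (hypothetical name, inclusive lower bound, exclusive upper bound) in bytes.
# 250 * KB == 256000, matching the filter used for the first content source.
BANDS = [
    ("small", 0, 250 * KB),
    ("medium", 250 * KB, 1 * MB),
    ("large", 1 * MB, 100 * MB),
    ("huge", 100 * MB, None),  # None means no upper limit
]


def custom_filter(lower, upper):
    """Build the criteria text for one content source's Custom Filter box."""
    parts = ["ANY r_version_label='CURRENT'"]
    if lower > 0:
        parts.append(f"r_content_size >= {lower}")
    if upper is not None:
        parts.append(f"r_content_size < {upper}")
    return " AND ".join(parts)


def band_for(size):
    """Return the one band that a document of this size (in bytes) falls in."""
    matches = [name for name, lo, hi in BANDS
               if size >= lo and (hi is None or size < hi)]
    # A gap or an overlap between bands would show up here.
    assert len(matches) == 1, f"size {size} matched {matches}"
    return matches[0]


FILTERS = {name: custom_filter(lo, hi) for name, lo, hi in BANDS}

if __name__ == "__main__":
    for name, text in FILTERS.items():
        print(f"{name}: {text}")
```

Using an inclusive lower bound and an exclusive upper bound on every band means a document sitting exactly on a boundary is crawled by one content source only, which avoids both gaps and duplicates in the index.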
Trudy was busy taking screen shots, and making notes. “Thanks, Killa. I’ll get these created before the weekend. But do you think that it’s really necessary if we are putting in extra memory?”
I looked at Trudy. “Yes – the memory will help, but to be double sure, the separate content sources mean that if there are any problems, you won’t have to waste another week.” She smiled stupidly. “Oh yeah – you mentioned that.”
“One last thing – once you’ve set up the extra content sources, disable indexing on the original content source. This will clear the index for that content source without affecting the index from the SharePoint content sources. And it will also prevent having duplicate data in the index.”
I grabbed my hat and my jacket. That hot weather was now a thunderstorm.
“Once you’ve got the content sources, and the extra memory set up, kick off a crawl. Let me know how it goes.”
to be continued …
… Part 9