May 14, 2008

Loading Documents

I tried loading a test document into the archive today, and there are a few things to be aware of when loading documents:

First, since each page is its own file, there needs to be a way to tell people which file represents the first page, second page, etc. There is a "description" field for each file, and I suggest we use it to designate the order of the pages. It would also be good if it designated whether the file was the large archival file or the display image.

So:

"First page: Large archival quality"

Is a possible model.

Second, I'm seeing problems with the transcripts. Since the file names are going to be part of the URLs, they can't contain spaces. The transcripts also don't seem to have any line breaks, which means that when loaded into a web browser, they scroll off to the side. I am going to try to fix these problems, but it will probably have to be done manually.
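If the manual fixes get tedious, both transcript problems could probably be scripted. The following is only a rough sketch, not something we have run against the real files: the function names, the choice of an underscore as the replacement for spaces, and the 80-character line width are all assumptions.

```python
import re
import textwrap


def url_safe_name(filename: str) -> str:
    """Replace each run of whitespace in a filename with one underscore.

    The underscore is an assumed replacement character.
    """
    return re.sub(r"\s+", "_", filename.strip())


def wrap_transcript(text: str, width: int = 80) -> str:
    """Hard-wrap a transcript so that no line is longer than `width`.

    Paragraphs are assumed to be separated by blank lines; a transcript
    with no line breaks at all is treated as one long paragraph.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    wrapped = [textwrap.fill(" ".join(p.split()), width=width) for p in paragraphs]
    return "\n\n".join(wrapped)
```

For example, url_safe_name("Barclay tr January 17 1862.txt") gives "Barclay_tr_January_17_1862.txt" (a made-up file name, used only for illustration).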

May 13, 2008

Names

For the purposes of forming titles, the name of Barclay's mother is "Mary Elanor Paxton Barclay" and the name of his sister is "Hannah Moore Barclay."

May 7, 2008

Scanning standards

I'm relocating this from the staff wiki, since all the rest of the information about the project is here. Note that some of this information conflicts with earlier posts. Where that is the case, this information takes precedence.

Scanning Standards

For the Barclay letters project:

Approximate time per image: 4 mins 30 secs

1. Open the HP solution center.

2. Choose "scan picture."

3. Make sure the initial scan screen is set to scan a color picture, that it is scanning from the glass, that it is saving to a file, and that it is saving as a .tiff file. The resolution should be set to 300. I have tried to make all these settings the defaults, but check just in case.

4. Click Scan

5. On the next screen, set the filename and the base scan name. Files should be saved to c:/digital project. Then click OK.

6. On this next screen, you need to adjust the lighten/darken and sharpness settings. These are found on the right side of the screen. The lighten/darken settings should be: Highlights -20, Shadows 0, Midtones 0. Sharpness should be high. The scanning software will give you attitude about some of these settings later. Just remember: it's wrong, and it needs to use your settings.

7. Now adjust the scan area by grabbing and moving the dotted lines.

8. Now hit Accept. The software will start giving you grief here about your settings. Make sure you tell it to use yours and not the recommended scanning settings.

9. Now, open the file in Photoshop. As you are opening the file, you will need to right-click on it, choose "rename," and use your delete and arrow keys to remove the last three numbers in the scan filename. The scanner appends these automatically, and I have not been able to make it stop.

10. Depending on how you scanned it, you may need to rotate the image so the text is reading left to right. Do this by going to Image->Rotate Canvas.

11. Go to Image->Image Size and reduce the size to APPROXIMATELY 800 pixels wide by 600 pixels high. This is not a hard and fast rule: the letters are different shapes, and the exact measurements you use will vary. Use your judgement.

12. Go to File->Save As. Set the "format" drop down to "jpeg." Adjust the filename. Hit Save.
In the image quality dialogue, set the quality slider to the exact middle. Click OK.

13. Rinse and repeat.
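Since the letters are different shapes, the resize in step 11 comes down to a little arithmetic. Here is a minimal sketch of one way to read the rule: shrink the scan until it fits inside an 800 by 600 box while keeping its proportions. The function name and the "fit inside the box" interpretation are assumptions, and the final call is still a matter of judgement.

```python
def web_size(width: int, height: int, max_w: int = 800, max_h: int = 600) -> tuple[int, int]:
    """Scale (width, height) down to fit inside max_w x max_h, keeping proportions.

    Images that already fit are returned unchanged (never enlarged).
    """
    scale = min(max_w / width, max_h / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


# An illustrative scan: 8.5 x 11 inches at 300 dpi is 2550 x 3300 pixels.
print(web_size(2550, 3300))
```

A tall page ends up well under 800 pixels wide under this reading, which is one reason the 800 by 600 figure is only approximate.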

File Naming
Files are named by the following protocol:

First, the date of the letter in numeric YYYY_MM_DD format: 1867_01_20

Then "archive" or "web" depending on whether the image is a tiff or a jpeg.

Then the sequence numbers, separated by underscores: 1_5 is the first of five sides.

Example: 1856_06_02_archive_1_4.tiff

Is the first page of a four-page letter written on June second, 1856, and this particular image is an archival tiff.
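To make the protocol concrete, here is a small sketch that builds and checks names of this shape, following the underscore-separated example above. The function names are made up for illustration, and the ".jpg" extension for web images is an assumption, since only the ".tiff" example is given.

```python
import re
from datetime import date

# date _ archive|web _ page _ total . extension
NAME_PATTERN = re.compile(
    r"^(\d{4})_(\d{2})_(\d{2})_(archive|web)_(\d+)_(\d+)\.(tiff|jpg)$"
)


def build_name(written: date, kind: str, page: int, total: int) -> str:
    """Build a scan filename from the letter date, image kind, and side numbers."""
    if kind not in ("archive", "web"):
        raise ValueError(f"kind must be 'archive' or 'web', got {kind!r}")
    if not 1 <= page <= total:
        raise ValueError(f"page {page} is not between 1 and {total}")
    ext = "tiff" if kind == "archive" else "jpg"  # "jpg" is an assumed extension
    stamp = f"{written.year:04d}_{written.month:02d}_{written.day:02d}"
    return f"{stamp}_{kind}_{page}_{total}.{ext}"


def parse_name(filename: str) -> dict:
    """Split a scan filename into its parts; raise ValueError if it is malformed."""
    m = NAME_PATTERN.match(filename)
    if m is None:
        raise ValueError(f"not a valid scan filename: {filename!r}")
    year, month, day, kind, page, total, _ext = m.groups()
    return {
        "written": date(int(year), int(month), int(day)),  # rejects e.g. month 20
        "kind": kind,
        "page": int(page),
        "total": int(total),
    }


print(build_name(date(1856, 6, 2), "archive", 1, 4))
```

Because parse_name builds a real date, a name with the month and day swapped (a month of 20, say) is caught before it is loaded.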

Storage
We are storing the files at c:/digital project. Kyle is trying to back them up to a network drive periodically.

February 4, 2008

Workflow

Here is how documents get scanned and loaded into Dspace:

* Jean scans the documents according to the scanning guidelines we've established (they are posted to this blog). All files, transcripts and image files alike, use the same naming convention:

(author's last name)-(type)-(month)-(day)-(year)

"type" is:

tr = transcript

ar = archival tiff

ds = display jpg

So for example:

Barclay-tr-January-17-1862.txt

Is a typical filename for a transcript.

* As they are available, Jean loads the documents into Dspace at the following URL: http://dspace.nitle.org/handle/10090/1065. She makes a "first pass" at filling out the metadata using the guidelines posted on this blog. Each submission will contain three files: an archival tiff, a display jpeg, and a transcript file in text format.

* Once submitted, the item goes through an accept/reject step, which Kyle is responsible for. Primarily, this is so that the filenames and types can be checked to make sure all the requisite files are present and uncorrupted.

* The submission then goes to Holt so that the metadata can be checked and expanded. Specifically, his responsibility is to expand the description and subject keyword fields.

* Once Holt has done this, the submission goes to Vaughan for a final metadata check. When he is satisfied, it goes into the public archive.
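Part of the accept/reject step could be checked mechanically. The sketch below only tests whether a submission's file names follow the naming convention above and whether all three types are present. It cannot tell whether a file is corrupted. The function name is hypothetical, and apart from the ".txt" transcript the file extensions in the example are assumptions.

```python
import re

TYPES = {"tr": "transcript", "ar": "archival tiff", "ds": "display jpg"}

# (author's last name)-(type)-(month)-(day)-(year).(extension)
PATTERN = re.compile(
    r"^(?P<author>[^-]+)-(?P<type>tr|ar|ds)-(?P<month>[A-Za-z]+)"
    r"-(?P<day>\d{1,2})-(?P<year>\d{4})\.\w+$"
)


def missing_types(filenames: list[str]) -> list[str]:
    """Return the type codes (ar, ds, tr) not found among well-formed filenames.

    Names that do not follow the convention are ignored, so a misnamed
    file shows up as a missing type.
    """
    found = set()
    for name in filenames:
        m = PATTERN.match(name)
        if m:
            found.add(m.group("type"))
    return sorted(set(TYPES) - found)


submission = [
    "Barclay-tr-January-17-1862.txt",
    "Barclay-ar-January-17-1862.tiff",  # extension assumed for illustration
]
print(missing_types(submission))
```

An empty list means the transcript, archival tiff, and display jpg are all accounted for.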

Revised (again) Metadata framework

Fields not mentioned in this document should be left blank.

Authors

Should be filled in with the name of the writer of the letter.

Title

This would follow the format: Letter, (writer) to (recipient), (date in Month, day, year format)

Publisher

Leave this blank. The metadata template will fill this in with the information:

"Washington and Lee Special Collections"


Date of Issue

This should be filled in with the date the letter was written.

Type

Set to "other"

Language

English (United States)

Hit the "next" button.

Subject Keywords

This will contain zero to an unlimited number of key phrases or words that describe important concepts in the letter. These phrases are not part of a formal subject classification system. They are selected by our volunteer and expanded and vetted by our historian.

Description

This field will contain a one to three sentence description of the letter created by our volunteer and vetted by our historian.
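As a quick illustration of the framework, here is a sketch of how the Title format and the fixed values could be generated for the "first pass." Everything here is hedged: the function names and dictionary keys are just labels taken from this post (not Dspace's internal field names), the "January 17, 1862" punctuation is one reading of "Month, day, year," the ISO form for Date of Issue is an assumption, and the writer, recipient, and date are illustrative rather than a real item.

```python
from datetime import date

MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def letter_title(writer: str, recipient: str, written: date) -> str:
    """Format: Letter, (writer) to (recipient), (Month day, year)."""
    when = f"{MONTHS[written.month - 1]} {written.day}, {written.year}"
    return f"Letter, {writer} to {recipient}, {when}"


def first_pass_record(writer: str, recipient: str, written: date) -> dict:
    """Fields a first pass can fill in; Publisher is left to the template."""
    return {
        "Authors": writer,
        "Title": letter_title(writer, recipient, written),
        "Date of Issue": written.isoformat(),  # assumed date format
        "Type": "other",
        "Language": "English (United States)",
    }


# Illustrative values only.
print(letter_title("John Barclay", "Hannah Moore Barclay", date(1862, 1, 17)))
```

Subject Keywords and Description are left out on purpose, since those are written by our volunteer and vetted by our historian.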

January 22, 2008

We Got Problems...

I took our prototype metadata framework and tried using it to load a sample item into the database yesterday. There were a couple of problems:

* While the "date.created" field does exist, the only way to load any information into this field is to go in as an admin AFTER the entire submission process is over and put it in manually. Worse than this, you can't search on this field. At all. The only date Dspace seems to care about is the date the item was loaded into the system.

* Likewise, the "publisher" field has to be filled out manually after the entire submission process is over, and this also cannot be searched on.

* The "type" field, which we had wanted to be able to set as "letter," has only fixed choices, and "letter" is not one of them. I was able to manually edit this field, but only after the entire submission process was done.

January 18, 2008

Revised Framework

On this one, I am labeling everything with the field into which it will go.

Publisher

"Washington and Lee Special Collections"

This information is uniform for all items.

Date.created

This should be filled in with the date the letter was written, in dd/mm/yyyy format.

Title

We are still working on how to formulate unique titles for each document.

Description

This field will contain a one to three sentence description of the letter created by our volunteer and vetted by our historian.

Keywords

This will contain zero to an unlimited number of key phrases or words that describe important concepts in the letter. These phrases are not part of a formal subject classification system. They are selected by our volunteer and expanded and vetted by our historian.

Type

We had wanted this field to be "letter," but the choices seem to be fixed, and letter is not one of them. So I am using "image."

Though it is not metadata per se, the title of the collection within Dspace shall be:

"The letters of John Barclay (MSS number)"

We had discussed the necessity of creating a page or document describing the collection. There are a few ways to go about doing this:

* Within Dspace, each collection has a "description" field. To see what this looks like when displayed, go here: http://dspace.nitle.org/handle/10090/1065

* I could try to find an existing field in the database to use for a URL, as we discussed previously.

December 12, 2007

Framework

From Vaughan's notes, with my own expansions:

Name of institution

This should probably be kept in the "publisher" field. We need a standardized name ("Washington and Lee?" "Washington & Lee?" "W&L?").

Name of Repository

Hmmm. I'm not seeing a field that leaps out as a logical place to put this. We could make it part of the description?

Name of collection

I'm not sure this should even go in the metadata, because we could actually use the name of the collection in the real world as the name of the collection in Dspace, where it would serve the same purpose.

MSS number of collection

The most logical place I can see would be an "identifier" field. These fields are usually used for identifiers that are unique to the item, though (an example would be the ISBN).

Are we sure we need this? If the document is available on the web, why does it matter to the user how we classify or organize the physical material? Ideally, they wouldn't even need to access the physical version. Ever.

Date of Document

I'm assuming this refers to when the letter was written. Given that, the best place would probably be the "date.created" field.

Identification of document

Is this the title? If so, it should def. go in the "title" field.

Summary Info

Definitely the "description" field.

Keywords

I think these can either be placed in the description field, or in the subject field. If we actually have some home-grown subject headings, then we ought to put those there and put keywords in the description field.

Answers to Questions

Some of the things I was asked to look into last time:

1. Character limits in the Dspace archive: There appear to be none anywhere. I tried loading an item with a page's worth of text from Wikipedia entered for every free-text field...and it took the information and displayed it. However, there may still be character limits in the external services that might harvest our data, so it would still be wise to keep the fields reasonably short.

2. Copyright: I called Sally Waint, who is our resident copyright expert. She said we are good to go.

3. Coverage/inclusive dates fields: These fields are free-text fields, just like most of the other fields. It seems we can put in dates in any format we like: "7/10/06" or "last Tuesday." I don't see anything in the Dublin Core standard about how dates should be entered, but I will keep looking.

November 12, 2007

Metadata Frameworks

I've been doing some searching for metadata information for the past few days and, unfortunately, am not having a lot of luck. Some of the letter projects are using metadata in ways that we can't because of technical limitations. Many simply do not have any metadata information readily available, or the information they do have available is at a level of detail that's not really sufficient for our needs. And some of them are part of much larger projects that have complicated metadata schemas designed for a large collection made up of many different kinds of objects: letters, maps, photos, etc.

We may not need a template from another project to begin with. Our archive is actually pretty rigid about the types of data it expects to receive. It is using a framework called Dublin Core, which calls for specific information to be attached to every uploaded object. You can read more about the Dublin Core elements at their webpage.

To summarize, there are fifteen elements that are used to describe any object uploaded into the archive. We cannot add more elements. We could choose to "ignore" certain elements by not filling them out with any information. Each element has a name and a general indication of what type of data should be entered into it. For example, the "title" element is defined as "the name by which the resource is formally known." The descriptions of the information that goes into each element are fairly vague; it is up to us to decide what "title" means in the context of a specific collection.

So the question before us is not "what types of information do we want to keep about this project." It is instead, "How will we use these fifteen elements to describe the items in this collection?"

Take the "title" example above. Do we want a title for every letter? If so, how do we form that title? Do we take the letter writer's name, append the date of the letter, and enter that as "title?" Or do we form a title in some other, completely different way? Or not use that element at all (leave it blank)?
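One way to picture the constraint is as a form with exactly fifteen blanks. The sketch below lists the fifteen Dublin Core element names and refuses anything else. It is only an illustration of the idea, not how our archive stores records, and the helper names and the sample title are made up.

```python
DUBLIN_CORE_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def blank_record() -> dict:
    """Every element present but empty, i.e. 'ignored' until we fill it in."""
    return {element: "" for element in DUBLIN_CORE_ELEMENTS}


def fill(record: dict, **values: str) -> dict:
    """Return a copy of the record with values set; unknown elements are rejected."""
    unknown = sorted(set(values) - set(DUBLIN_CORE_ELEMENTS))
    if unknown:
        raise ValueError(f"not Dublin Core elements: {unknown}")
    return {**record, **values}


# A hypothetical title formed from the writer's name and the letter date.
record = fill(blank_record(), title="Barclay, January 17, 1862")
print(sum(1 for value in record.values() if value), "of", len(record), "elements used")
```

Trying to add a sixteenth element, such as a separate "recipient" blank, raises an error, which is the whole point: whatever we want to record has to fit somewhere in these fifteen.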