Monday, January 18, 2010

Learning XML

As I discussed earlier, the most important prerequisites for SOA are XML and Java. I have already covered the basics of Java, so now I will cover the basic concepts of XML.

XML - Markup language.

Let's take a Word document or a Notepad file in which you have written down many fields and properties about a person. If I ask you for a method to extract details about that person, such as the name, age and other properties, you will probably not be able to come up with a formula. That is because the text is free-form and does not follow any particular pattern. To overcome this issue the W3C has standardized a format called XML. It is a markup language which helps us identify each element and its corresponding value. It looks similar to HTML, but HTML is focused on the presentation layer and is commonly used together with JSP. Also, the Word format is not public and is not available to everyone, whereas XML is nowadays a common format for data transfer.

Suppose we have a Java server at one end and a C server at the other. XML is then a common format which can be used to send data from one server to the other, because it describes the data in a way that the servers at both ends can interpret. We can use different tools for XML programming, such as a plain text editor, PSPad, TextPad, EditiX or XMLSpy. Many of these tools come with a lot of predefined templates that make the work very easy, so we will use Notepad instead and write everything by hand, which helps us understand every detail.

So we can define XML as a markup language which provides a universal format for structured documents and data on the web. It provides custom tags for the definition, transmission, validation and interpretation of data.

XML standard

1>XML language specification - It deals with how an XML document is structured and how parsers validate it.

2>DTD - Document Type Definition. It is used to validate the XML document.

3>XML namespace - If there are elements with the same name but different definitions, XML namespaces help in differentiating them.

4>XPath - XML Path Language. It provides the syntax for searching an XML document for a specific entry or piece of data.

5>XML Schema Definition (XSD) - It provides a more standard way of structuring XML documents and validating them.

6>XSL - Extensible Stylesheet Language. It is used by XSLT (XSL Transformations) to transform an XML document.



Once we have written an XML document, a parser checks it for two things

1>Well-formedness - this has to do with the syntax; in Java terms we can think of it as compilation.

2>Validation - this has to do with validating the XML document against one of two kinds of definition

a>DTD - Document Type Definition
b>XSD - XML Schema Definition.

Parser

We basically use a DOM (Document Object Model) parser for parsing XML documents. DOM is a W3C standard. It defines an API to access and manipulate XML. When the DOM API is used, the whole XML document is loaded into memory as a tree structure.
Using the DOM API to process a large XML document can consume a lot of memory and slow things down, so for such cases we use another, stream-based API called SAX (Simple API for XML).


SAX (Simple API for XML) is an event-based parser and is not a W3C standard. It reads the document as a serialized stream of data and does not keep it in memory. As it reads, it reports events such as the start and end of each element to the application.
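
To make the difference between the two parsers concrete, here is a small sketch in Python using only its standard library (xml.dom.minidom for DOM and xml.sax for SAX). The sample products document and the names names_with_dom, NameHandler and names_with_sax are made up for this illustration; the same idea applies to the Java DOM and SAX APIs.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document used only for this illustration.
XML_TEXT = """<?xml version="1.0"?>
<products>
  <product type="electronic"><name>Apple ipod</name></product>
  <product type="book"><name>Learning XML</name></product>
</products>"""


def names_with_dom(text):
    # DOM: the whole document is parsed into an in-memory tree first,
    # and only then do we walk the tree to find the <name> elements.
    document = minidom.parseString(text)
    return [node.firstChild.data for node in document.getElementsByTagName("name")]


class NameHandler(xml.sax.ContentHandler):
    # SAX: the parser calls these methods as it streams through the input.
    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False
        self._buffer = []

    def startElement(self, tag, attrs):
        if tag == "name":
            self._in_name = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect it in a buffer.
        if self._in_name:
            self._buffer.append(content)

    def endElement(self, tag):
        if tag == "name":
            self.names.append("".join(self._buffer))
            self._in_name = False


def names_with_sax(text):
    handler = NameHandler()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.names


print(names_with_dom(XML_TEXT))
print(names_with_sax(XML_TEXT))
```

Both functions return the same list of names; the DOM version holds the full tree in memory while the SAX version only keeps the text it is currently collecting.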



Let's start with how we write XML

Whenever we write an XML document the first statement should be

<?xml version="1.0"?>


This is the XML declaration (it looks like a processing instruction). It tells the parser that this is an XML document and which XML version it uses. It can also contain an encoding attribute, but that is optional.

Next we have

<!DOCTYPE ……> which is a document type declaration.

It says where to locate the DTD (internal or external).


<!--This is a comment -->

These things are called the prolog.

Again, a name written between & and ; (for example &lt;) is an entity reference.

&apos;, &quot; and &lt; are some of the predefined built-in ones.
But we can also have user-defined ones such as &author_name; and &copy_right;.

Consider this example

<product id="123">
</product>

Here <product id="123"> is the start tag and

</product> is the end tag.

product is called an element of the XML document and id is an attribute of that element.


Whatever we define above the top-level element is called the prolog.

The DOCTYPE must be followed by the root element. There can be only one root element in an XML document.

When we define a DOCTYPE we use either SYSTEM or PUBLIC to indicate whether the DTD is on the local machine or published on the internet.

In our case we will be using SYSTEM for most of our examples.

When we write a simple XML document and open it in a browser, we find that it opens with coloured fields. Although we have not defined any XSL, the browser attaches its own default stylesheet, which relies on JavaScript.


There are certain rules which must be followed for an XML document to be well formed

1>All open tags must be closed, including empty tags.

2>An empty tag can be closed within the opening tag itself, as in <name/>

3>Tags must be properly nested.

4>All attribute values must be enclosed within single or double quotes.

5>For a given element, an attribute can be used only once.

6>There should be only one root element.

7>Tags are case sensitive. The start and end tags must use the same case.
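
The rules above can be tried out quickly with any XML parser. The following is a minimal sketch in Python using the standard xml.etree.ElementTree module; the helper name is_well_formed and the sample strings are made up for this illustration. The parser only checks well-formedness here, not validity against a DTD or XSD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    # The parser raises ParseError as soon as a well-formedness rule is broken.
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed('<product id="123"></product>'))  # follows all the rules
print(is_well_formed('<product id="123">'))            # rule 1: tag never closed
print(is_well_formed('<a><b></a></b>'))                # rule 3: wrong nesting
print(is_well_formed('<product id=123/>'))             # rule 4: unquoted attribute value
print(is_well_formed('<product id="1" id="2"/>'))      # rule 5: attribute used twice
print(is_well_formed('<a/><b/>'))                      # rule 6: two root elements
print(is_well_formed('<Product></product>'))           # rule 7: case mismatch
```

Only the first document passes; each of the others breaks exactly one of the rules listed above.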

We can use a browser to check the well-formedness of an XML document; however, it does not validate the document against a DTD/XSD.

We can use an IDE like Eclipse or JDeveloper to validate the XML document, but these are very heavy applications and take a lot of memory. So instead we will use a small tool, in our case XMLStarlet. You can download it from the internet. Download the .zip file and unzip it.

Now add the directory where xml.exe is present to the Path environment variable.

Now let's write a simple XML document to understand its different fields and properties. Take this example

<?xml version="1.0" ?>


<products>

<product type="electronic">

<id>1001</id>

<name>Apple ipod</name>

<price>25$</price>

</product>

</products>



In this example products is the root element. The products element can have many product sub-elements.
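
Because every value in this document sits inside a named element or attribute, a program can pull the details out reliably, which is exactly what was hard with a free-form Word or Notepad file. A minimal sketch in Python with the standard xml.etree.ElementTree module follows; the function name read_products is made up for this illustration.

```python
import xml.etree.ElementTree as ET

PRODUCTS_XML = """<?xml version="1.0" ?>
<products>
<product type="electronic">
<id>1001</id>
<name>Apple ipod</name>
<price>25$</price>
</product>
</products>"""


def read_products(text):
    # The root element is <products>; each <product> child becomes one dict.
    root = ET.fromstring(text)
    return [
        {
            "type": product.get("type"),          # attribute of the element
            "id": product.findtext("id"),         # text of the <id> sub-element
            "name": product.findtext("name"),
            "price": product.findtext("price"),
        }
        for product in root.findall("product")
    ]


for item in read_products(PRODUCTS_XML):
    print(item)
```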

We have already discussed entities in XML documents.

An entity is used as replacement text: it is referenced by writing its name between an ampersand (&) and a semicolon (;).

For example, we cannot use the < character directly, because it would be read as the beginning of a tag. So we write &lt; instead; this entity represents the less-than sign <. Similarly, there are predefined entities for other special characters which cannot be used directly in your XML document.
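
As a small illustration of the predefined entities, the sketch below (Python standard library; the sample text and variable names are made up) escapes the special characters before placing text inside an element, and shows that the parser turns the entities back into the original characters.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_text = "price < 30 & stock > 0"

# escape() replaces &, < and > with &amp;, &lt; and &gt;
escaped = escape(raw_text)
print(escaped)  # price &lt; 30 &amp; stock &gt; 0

# The escaped text can be placed safely inside an element.
document = "<note>" + escaped + "</note>"

# On parsing, the entity references are replaced by the characters they stand for.
parsed_text = ET.fromstring(document).text
print(parsed_text)  # price < 30 & stock > 0
```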


Important differences between XML and HTML

XML is a language for describing data, while HTML is a language for formatting data in a web browser. XML contains user-defined markup elements and HTML contains predefined markup elements. XML is extensible; HTML is not.

XHTML - Extensible Hypertext Markup Language
It is a successor to HTML. It is designed to conform to the XML standard and its well-formedness rules. It is a way to reproduce and extend HTML documents.

==================================================================
DTD - Document Type Definition

A Document Type Definition contains specific instructions that the XML parser interprets to check the validity of the document. A DTD may be stored in an external file or internally within the XML document. When we define a DTD externally, it is pointed to by a URL (Uniform Resource Locator), which can be a file on the disk.




The DTD allows us to define

1>Elements
2>Attributes
3>Entities
4>Notations

A DTD validates the XML document. This is needed because we may require a particular format: for example, if the name, age and sex of an employee are mandatory, we can state in the DTD that they must be provided, otherwise the XML document will not be valid.



The DTD declarations look like this

ELEMENTS

<!ELEMENT name of element (content model)>


ATTRIBUTES

<!ATTLIST element-name attribute-name type default>


ENTITIES

<!ENTITY entity-name "replacement text">



NOTATIONS

<!NOTATION notation_name SYSTEM "text">


Example of a simple DTD element

<!ELEMENT employees (employee)>
<!ELEMENT employee (name)>
<!ELEMENT name (#PCDATA)>


So a valid XML document for this DTD will be

<?xml version="1.0" ?>
<employees>
<employee>
<name>Arpit rahi</name>
</employee>
</employees>

Now let's try to understand the DTD that we have defined. The first statement,
<!ELEMENT employees (employee)>, means there is an employees element in the XML document which has a sub-element employee.

The next statement, <!ELEMENT employee (name)>, says that employee is an element and that it contains another element, name, between its start and end tags.

The third statement is <!ELEMENT name (#PCDATA)>.
(#PCDATA) stands for parsed character data. It means that the content of name must be text only.


DOCTYPE and ELEMENT are markup keywords and must be written in capitals.

The order of the element declarations is not important in a DTD.
We will also do some testing in order to understand how it works.
As agreed, we will use a small tool to check the XML.
We will download xmlstarlet-1.0.1 (search for it on Google) and start working with it.
Download and extract the file.

Now go to the desktop, right-click on My Computer and select Properties. Go to the Advanced tab.
Go to Environment Variables, choose Path and add

C:\xml-starlet\xmlstarlet-1.0.1;
It should be the first entry in your Path.
Save and exit, then open a command prompt and type xml




We will first check the well-formedness of the XML document. Let's suppose we have the following employees.xml file

<?xml version = "1.0" ?>
<!DOCTYPE employees SYSTEM "employees.dtd">
<employees>
<employee>
<name>Arpit rahi</name>
</employee>
</employees>


In order to check its well-formedness, we first create a folder and copy employees.xml into it. Then we navigate to that folder and run the following command

xml val -w employees.xml




You can change the start and end tags to verify that the tool really checks the well-formedness of the XML document.

=============================================================================================

Referencing the DTD

A DTD is declared in an XML document after the XML declaration and before the start of the root element. If it is defined externally, it can be referenced from the XML document in the following ways.

<!DOCTYPE employees SYSTEM "employees.dtd">

Here employees.dtd has to be in the same folder as the XML document. If you are using a DTD which is publicly available, you can reference it using the PUBLIC keyword, which takes a public identifier followed by the location of the DTD

<!DOCTYPE employees PUBLIC "public identifier" "URL of the DTD">

However, you can also define the DTD within the XML document itself.

So if I write the DTD and the XML in the same document, my previous example will look like this


<?xml version="1.0" ?>
<!DOCTYPE employees[
<!ELEMENT employees (employee)>
<!ELEMENT employee (name)>
<!ELEMENT name (#PCDATA)>]>
<employees>
<employee>
<name>Arpit rahi</name>
</employee>
</employees>




Keep in mind that the root element must come after the DOCTYPE.

We will check validation both for an embedded DTD and for a DTD defined outside the XML file.

First of all, the embedded DTD.

We will save the XML document written above as employees.xml. Here we name the file after the root element of the document; this is just a convention that keeps things easy to follow.

Since it uses an embedded DTD, we use the following command to validate the XML document against it

xml val -E employees.xml




Now we will see how to declare and validate the document when the DTD is defined externally.


Now our employees.xml will look like

<?xml version ="1.0"?>
<!DOCTYPE employees SYSTEM "employees.dtd">
<employees>
<employee>
<name>Arpit rahi</name>
</employee>
</employees>

and our employees.dtd will look like this

<!ELEMENT employees (employee)>
<!ELEMENT employee (name)>
<!ELEMENT name (#PCDATA)>

and we validate our XML document against the externally defined DTD as follows

xml val -d employees.dtd employees.xml




=================================================================

Element declaration in DTD - An element declaration consists of the ELEMENT keyword followed by the element name and the content model, as we have seen earlier. Now we will look at the different kinds of content model.

<!ELEMENT element-name (content model)>

1>Empty

An empty content model means the element cannot contain any sub-element or any text data. However, it can have attributes. It is declared as

<!ELEMENT element-name EMPTY>

Examples are-

<name/>

<name></name>

<name title="rahi"/>

2>Child Element

These are the enclosed sub-elements, which can be a single element or a sequence of elements separated by commas.

Eg. <!ELEMENT name (first_name,last_name)>

It can also contain a choice between elements, which is represented by the | sign.

Eg. <!ELEMENT name (full_name |(first_name,last_name))>

It says to give either the full name or the first and last name.

3>Mixed type

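Mixed content allows text, optionally mixed with child elements. (#PCDATA) on its own is the simplest example of the mixed type.

Eg-

<join-date>10th April,2008</join-date>

The same information could instead be modelled with child elements as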

<join-date>
<date>10</date>
<month>April</month>
<year>2008</year>
</join-date>



4>Any type

As the name suggests, it can contain any type of data or elements. It should be avoided because it defeats the whole purpose of validation.

=========================================================================

Cardinality of Elements

It indicates the number of child elements permitted.

The different cardinality symbols are

1>No symbol - The element is mandatory and must occur exactly once.

Eg. <!ELEMENT employees (employee)>

2>Question mark (?) - The element is optional. You can either have one or not, i.e. in mathematical terms zero or one. The symbols are always written as a suffix.


<!ELEMENT employees (employee?)>

It means the employees element can have either one employee element or none.


3>Asterisk (*) - There can be zero or more occurrences.

<!ELEMENT employees (employee*)>

This statement means the employees element can have zero or more employee elements.

4>Plus sign (+) - It means one or more.
<!ELEMENT employees (employee+)>
This statement says that there must be at least one employee element within the employees element, and there can be more than one.
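
The four cardinality symbols boil down to a minimum and maximum number of occurrences. The sketch below expresses that in Python; it is only an illustration of the counting rules and not a DTD validator, and the names CARDINALITY and child_count_allowed are made up for this example.

```python
import xml.etree.ElementTree as ET

# (minimum, maximum) occurrences for each symbol; None means no upper limit.
CARDINALITY = {
    "": (1, 1),      # no symbol: exactly one
    "?": (0, 1),     # zero or one
    "*": (0, None),  # zero or more
    "+": (1, None),  # one or more
}


def child_count_allowed(xml_text, child_name, symbol):
    # Count the direct children with the given name and compare with the limits.
    root = ET.fromstring(xml_text)
    count = len(root.findall(child_name))
    minimum, maximum = CARDINALITY[symbol]
    return count >= minimum and (maximum is None or count <= maximum)


no_employee = "<employees></employees>"
two_employees = "<employees><employee/><employee/></employees>"

print(child_count_allowed(no_employee, "employee", "?"))    # True: zero is allowed
print(child_count_allowed(no_employee, "employee", "+"))    # False: at least one needed
print(child_count_allowed(two_employees, "employee", "?"))  # False: at most one allowed
print(child_count_allowed(two_employees, "employee", "*"))  # True: any number allowed
```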

========================================================================
Attribute Declaration
The syntax for declaring an attribute is as follows
<!ATTLIST element-name attribute-name type default>
element-name and attribute-name are self-explanatory.
The attribute type can be specified as
CDATA, enumerated, ENTITY, ENTITIES, ID, IDREF, IDREFS, NMTOKEN,
NMTOKENS, NOTATION

An attribute default can be specified as
#IMPLIED, #REQUIRED, #FIXED or any literal value.
#REQUIRED - It indicates that the attribute must be specified.
#IMPLIED - It indicates that the attribute is optional.
#FIXED - It indicates that the attribute is a constant: a single fixed value is supplied in the DTD.

Attribute Type
CDATA type - let's look at an example of a DTD and a matching XML document

<!ELEMENT employees (first_name,last_name)>
<!ATTLIST employees emp_id CDATA #IMPLIED>
<!ELEMENT first_name (#PCDATA)>
<!ELEMENT last_name (#PCDATA)>
So this DTD says that employees is an element which has two sub-elements, first_name and last_name. It also says that employees has an attribute named emp_id which is of character data type and is optional. So a valid XML document for this DTD will be
<?xml version ="1.0"?>
<!DOCTYPE employees SYSTEM "employees.dtd">
<employees emp_id = "20" >
<first_name>Arpit</first_name>
<last_name>Rahi</last_name>
</employees>



Enumerated data type-
It is used for a choice from a list of values. We will write a DTD and then the XML document to understand this
<!ELEMENT employees (first_name,last_name)>
<!ATTLIST employees gender (male|female) #IMPLIED>
<!ELEMENT first_name (#PCDATA)>
<!ELEMENT last_name (#PCDATA)>
This DTD says that gender can take only two values: either male or female. So the corresponding XML document will be
<?xml version ="1.0"?>
<!DOCTYPE employees SYSTEM "employees.dtd">
<employees gender="male">
<first_name>Arpit</first_name>
<last_name>Rahi</last_name>
</employees>
Default attribute values
A quoted default value can be specified in the DTD after the attribute type.
<!ELEMENT employee (first_name,last_name)>
<!ATTLIST employee emp_id (10|20|30) '20'>
employees.xml
<?xml version ="1.0"?>
<!DOCTYPE employees SYSTEM "employees.dtd">
<employees emp_id = "20" >
<first_name>Arpit</first_name>
<last_name>Rahi</last_name>
</employees>

employees.dtd
<!ELEMENT employees (first_name,last_name)>
<!ATTLIST employees emp_id (10|20|30) '20'>
<!ELEMENT first_name (#PCDATA)>
<!ELEMENT last_name (#PCDATA)>

A default value is required when you use the #FIXED keyword.



Entities in XML
XML, as per the W3C standard, provides some built-in entities, eg-
Less than < is represented as &lt;
We also have character references. They represent Unicode characters by their code point.
Eg- &#169; represents the Unicode character ©. A character reference starts with & and must be followed by a # sign.

Entity declaration
The general syntax for an entity declaration is
<!ENTITY entity_name "replacement text">
eg- <!ENTITY dude "Arpit Rahi">
So from then on we can use the entity dude for Arpit Rahi.
i.e if we have a statement
<employee>
&dude; is great
</employee>
then the XML parser will interpret it as
<employee>
Arpit Rahi is great
</employee>
The entity reference must start with & and end with ;
Eg
Our employees.xml will look like this
<?xml version ="1.0"?>
<!DOCTYPE employees SYSTEM "employees.dtd">
<employees emp_id = "20" >
<first_name>&dude;</first_name>
<last_name>Rahi</last_name>
</employees>

and employees.dtd
<!ELEMENT employees (first_name,last_name)>
<!ATTLIST employees emp_id (10|20|30) '20'>
<!ENTITY dude "Arpit">
<!ELEMENT first_name (#PCDATA)>
<!ELEMENT last_name (#PCDATA)>

Now if you open the XML document in Internet Explorer, or any browser that loads the external DTD, you will see Arpit instead of &dude;
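
The entity replacement can also be observed from a program. The sketch below uses Python's standard xml.etree.ElementTree module. Note one assumption: that parser does not load external DTD files, so for this illustration the entity declaration is moved into an internal DTD inside the document instead of employees.dtd.

```python
import xml.etree.ElementTree as ET

# Same employees document as above, but with the entity declared in an
# internal DTD, because this parser does not read external DTD files.
EMPLOYEES_XML = """<?xml version="1.0"?>
<!DOCTYPE employees [
<!ENTITY dude "Arpit">
]>
<employees emp_id="20">
<first_name>&dude;</first_name>
<last_name>Rahi</last_name>
</employees>"""

root = ET.fromstring(EMPLOYEES_XML)

# The parser has already replaced &dude; with its replacement text.
first_name = root.findtext("first_name")
print(first_name)  # Arpit
```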

======================================================

Entities can also be declared externally.
1>Parsed
<!ENTITY entity-name SYSTEM "file.txt|URL">
So we can have our employees.xml as
<?xml version ="1.0"?>
<!DOCTYPE employees [<!ENTITY employeeInfo SYSTEM "employee.xml">]>
<employees>
&employeeInfo;
</employees>

As you can see, it references another XML file, employee.xml, and uses it as an entity, so our employee.xml should look like this
<?xml version ="1.0" ?>
<employee>
Arpit
</employee>

2>Unparsed, which requires a notation declaration.
An unparsed external entity is a way to include data that will not be parsed or validated by the XML parser. It is used because sometimes we need to include data which is not well-formed XML. To convert a parsed external entity declaration into an unparsed external entity declaration, add
1>The NDATA keyword after the file name.
2>A NOTATION name after the NDATA keyword.

A parameter entity, which can be used only inside the DTD, is declared as
<!ENTITY % entity-name "replacement text">
It is then referenced in the DTD as
%entity-name;
<!ELEMENT employee (%employee_element;)>


These days DTDs are rarely used and have largely been replaced by XSD, so we are not going deep into DTD.
===============================================================================

Understanding XML with namespace
An XML namespace is identified by a case-sensitive Internationalized Resource Identifier (IRI) reference (URL or URN). It provides a unique name for a collection of elements and attributes.
The IRI is a string of characters and it can be
1>A Uniform Resource Identifier (URI) or Uniform Resource Locator (URL)
It can be a web address like http://abc.com/name.
One important point to note here is that this web address need not exist. It is just a unique string and is not checked to be a valid web address.
2>A Uniform Resource Name (URN) such as urn:abc:name. A URN starts with the letters urn followed by a colon (:), a namespace identifier (NID), in our case abc, then another colon (:) and a namespace-specific string (NSS), in our case name.

Now let's try to understand how there can be ambiguity in XML and how it is resolved by a namespace. Let's take two XML documents as an example.
<?xml version="1.0" ?>
<employee>
<name>Arpit</name>
</employee>

and the other one
<?xml version="1.0" ?>
<employee>
<name>
<f_name>Arpit</f_name>
<l_name>Rahi</l_name>
</name>
</employee>
Here the name element has a different content model in each of the two XML documents. As long as they are separate documents, when we refer to name it is clear which information we are after. But consider this example
<?xml version="1.0" ?>
<employee>
<name>Arpit</name>
<name>
<f_name>Arpit</f_name>
<l_name>Rahi</l_name>
</name>
</employee>

Here, if we say name, it is difficult to tell which name we are pointing to, because there is no clear indication of which name element, with which content model, is meant. XML namespaces provide a way to differentiate elements and their attributes, so now we will see how to declare an XML namespace.
The default namespace is written like this
<employee xmlns="http://www.abc.com/name">
Here xmlns is the attribute that declares the XML namespace and employee is the root element of the XML document.
With a namespace prefix it is defined as
<emp:employee xmlns:emp="http://www.abc.com/name">
or
<emp:employee xmlns:emp="urn:abc:name">
The prefix is optional and, when used, is written after xmlns, separated by a colon.
Attributes which are specified without a prefix are not associated with any XML namespace. One important point to keep in mind is that the namespace string does not have to refer to an actual document or page. Namespaces are very similar to packages in Java: just as a package in Java can have many reusable classes and interfaces, a namespace in XML can have many reusable elements and attributes.

Namespaces are declared as attributes of an element. As we have seen, we used <employee xmlns="http://www.abc.com/name"> for the namespace declaration. xmlns is a reserved word which is used only to declare a namespace; in other words, it is used for binding namespaces.

Also, the namespace http://www.w3.org/2001/XMLSchema is a namespace reserved by the W3C for XML Schema.
There are a few important things to take care of regarding the prefix in an XML namespace.
1>It can contain any XML name character except a colon.
2>Multiple prefixes can be declared, each with a different name.
3>It can be overridden in a child element, eg-
<?xml version="1.0" ?>
<emp:employee xmlns:emp="urn:abc:name">
<emp:first_name>Arpit</emp:first_name>
<emp:last_name xmlns:emp="urn:abc:last_name">Rahi</emp:last_name>
</emp:employee>
A good programming practice is not to reuse the same prefix for different namespaces in the same XML document, as it becomes too confusing to understand and differentiate.
So now we will go back to our previous example and see how to identify which name element belongs to which content model.

<?xml version ="1.0" ?>
<emp:employee xmlns:emp="urn:abc:employee">
<emp:name>Arpit</emp:name>
<id:name xmlns:id="urn:abc:name-ns">
<id:f_name>Arpit</id:f_name>
<id:l_name>Rahi</id:l_name>
</id:name>
</emp:employee>
As you can see, the first name is identified by the prefix emp and the second name by the prefix id; this is how we differentiate between the elements in an XML document. The scope of a namespace declaration extends from the start tag to the end tag of the element on which it is declared.
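
A program reading this document can pick the right name element by its namespace. Here is a minimal sketch in Python with the standard xml.etree.ElementTree module; the NAMESPACES mapping and the variable names are made up for this illustration, and the prefixes in the mapping only need to match the namespace strings, not the prefixes used in the file.

```python
import xml.etree.ElementTree as ET

EMPLOYEE_XML = """<?xml version="1.0" ?>
<emp:employee xmlns:emp="urn:abc:employee">
<emp:name>Arpit</emp:name>
<id:name xmlns:id="urn:abc:name-ns">
<id:f_name>Arpit</id:f_name>
<id:l_name>Rahi</id:l_name>
</id:name>
</emp:employee>"""

# Prefix-to-namespace mapping used only for the lookups below.
NAMESPACES = {"emp": "urn:abc:employee", "id": "urn:abc:name-ns"}

root = ET.fromstring(EMPLOYEE_XML)

# Internally each tag is stored together with its namespace string.
print(root.tag)  # {urn:abc:employee}employee

# The two name elements are now unambiguous.
simple_name = root.findtext("emp:name", namespaces=NAMESPACES)
first_name = root.findtext("id:name/id:f_name", namespaces=NAMESPACES)
last_name = root.findtext("id:name/id:l_name", namespaces=NAMESPACES)
print(simple_name, first_name, last_name)
```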
If we are using a tool like JDeveloper, it provides a function to check the XML namespaces. It is called show xmlns: just select the XML file and right-click on it to get the option. You need to have the xmlparserv2.jar file in your classpath to invoke it.

=====================================================================================

XSD Schema - I will cover it in the next post, because this post would otherwise become too long
