[Thread Prev][Thread Next][Index]

Re: [las_users] how to frequently update thredds dataset in las



Roland Schweitzer wrote:
Hi Robert,

You are struggling with issues that have been on our radar screen for a while. I appreciate having the details of your use case so we can make sure the solutions we're working on meet your needs as well as possible. Below are some questions, some ideas and a few solutions.
Roland, thank YOU and your team for your dedication to this project, which has opened up new opportunities.

Robert Fuller wrote:
Hi,

We're setting up a configuration where a Thredds dataset is accessible through LAS. The dataset is a regularly updated Thredds aggregation of ncml files, an oceanographic forecast generated with ROMS. The Thredds dataset has the same URL before and after the update.

At the moment we are using addXML to regenerate the relevant section of las.xml, then using tomcat manager to reload the las servlet to pick up these changes (in normal operation only the time arange element will change in the las dataset).

This method of regenerating the las dataset is not ideal for a couple of reasons:

1. Ideally we would use a natural dataset id rather than the one generated from addXML (the id is based on a hash of the thredds url, so it is persistent, which is good, but it is not humanly interesting)
Can you explain to me why the ID should be "humanly interesting". I've gotten some push back in the past from other LAS developers about using a hash for the ID but I haven't been convinced there's a better solution. Of course, internally LAS uses the ID to reference datasets and if you were to create, save or type a LAS URL by hand you'd have to know the ID, but the user interaction via the standard LAS user interface should not directly require a human to know the ID. There's probably a use case for having a human readable ID so I'd be interested to hear about how it would help you.

If you're talking about the data set name that shows up in the LAS UI, you can control that with a switch on the addXML command line to either pass in the name you want or to tell addXML which global attribute to use for the name.
Yes, in the LAS GUI the IDs are irrelevant.

One case where it would be handy to use some humanly interesting IDs is a deployment where we are representing the same datasets on dev, test and production servers. I want to provide some simple script fragments (las xml and wml) to some users/administrators who are not familiar with this domain. It would be 'nice' if, other than the host name, these scripts worked out of the box without too much thinking on the part of these users.

2. Ideally we would use natural variable names rather than those generated by addXML (persistent but not humanly interesting)
AddXML does the best job it can extracting the variable name from the actual metadata in the file, but it could do better. As I type this I realize it probably doesn't look for the CF standard name. Right now it looks for a "long_name" attribute and uses that. If no long_name is available it uses the actual netCDF variable name. Would it help if it looked at the CF standard names (which are pretty ugly, what with all those underscores and no capitalization), or if you could pass in the name of an attribute to use for the variable name?

The key to being successful with this issue is to work on addXML until it can produce LAS configuration you like, because addXML will be the basis of some automatic update facilities we're building right now.
Actually, what shows up in the gui is good. I'm happy with that.

My one-line Perl script fixes my addXML output to something that works for me. For the NE Atlantic output in foo.xml I use this line:

perl -p -i -e 's/([\w\d-]*?)(-){0,1}([xyzt]-){0,1}(id-[a-z0-9]+\b)/ne_atl_rect$2$1/g' foo.xml

3. After a number of reloads of the LAS servlet, the Tomcat process runs short on PermGen space, which requires restarting the Tomcat server to resolve.

I have looked into a couple of ways of improving the situation:

i. Use a script to generate the LAS dataset XML, setting the arange to the new values, then reload the LAS servlet. This would address most of the issues noted above, but not number 3.

ii. Modify the las.xml dataset in such a way that it will still work after the THREDDS dataset has been updated. I've tried two options:
a. Remove the time arange from the LAS dataset. This breaks the LAS GUI (the date control vanishes) and also the WMS service (including GetCapabilities).
b. Set the time arange to an early start date with a large number of steps. This works in the LAS GUI provided the user knows the correct date to pick, but still breaks WMS.
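For what it's worth, option b amounts to a time axis along these lines in las.xml. This is a sketch only: the axis element name, start date and size below are made up, following the usual LAS axis/arange convention.

```xml
<!-- Hypothetical time axis for option b: early start, oversized range -->
<roms-time-axis type="t" units="day">
    <arange start="2000-01-01" step="1" size="10000"/>
</roms-time-axis>
```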

My suspicion is that if I do not reload the LAS servlet we will also encounter problems with LAS caching older views.
Not sure what you mean by this. If the only thing that happens is that the time range of data set extends, then a cache hit on a plot for Monday should still be valid on Tuesday even though the data set only extended to Monday when the plot was made.
You never know, our forecast could change ;-). The image cached yesterday may have shown sunshine tomorrow, but today the model is forecasting snow for tomorrow. When the dataset has been updated we need to clear any cached results. That is to say, the dataset does not simply extend (rather than extending, it slides), and some of the data within the set may be replaced.


I welcome suggestions on other options for updating the las dataset after the thredds dataset has been updated.
We have some help for this problem coming out in the next release (which will be sometime in the next couple of weeks -- I promise :-}). The next release will include a beta version of the LAS manager's interface. One feature of the interface is a reinit process. Using this interface you can go to a particular URL (which is access-controlled by the same mechanisms that control access to the THREDDS Data Server installed with LAS) and ask LAS to reload its configuration. Since this is an internal process within the running LAS servlet, it should not have the same problems as having the container reload the servlet.
kewl.

You probably want to hit the reinit URL from a process instead of a browser and I'm not sure exactly how to accomplish that right now, but some other folks we're working with want to do this also so we'll figure out a way.
We use wget.
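A cron entry along these lines would cover the process-driven case; the reinit URL below is a placeholder, since the beta manager's interface and its actual path haven't been published yet:

```
# crontab fragment: ask LAS to reinitialize every morning at 05:15
# (the host and the ?reinit query string are hypothetical)
15 5 * * *  wget -q -O /dev/null "http://las.example.org/las/ManagerUI?reinit=true"
```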

The manager's interface also includes a cache manager's interface where you can empty the entire cache or clear the cache of only those files associated with a particular dataset.
perfect.

Finally, in a future release -- not the one coming up this month, but relatively soon after we will have some new configuration options in LAS. This will allow you to configure an LAS directly from the THREDDS URL and specify how often you want LAS to re-initialize its configuration from that catalog. The new config will look something like this:

<dataset src="http://pcmdi3.llnl.gov/thredds/esgcet/catalog.xml" src_type="THREDDS" update_time="23:00" update_interval="24 hours">
    <properties>
        <addXML>
            <esg>true</esg>
            <units_format>yyyy-M-d</units_format>
            <categories>true</categories>
            <global_title_attribute>experiment_id</global_title_attribute>
        </addXML>
    </properties>
</dataset>
OK. For us the on-demand feature mentioned above will be more useful than the schedule, but I can appreciate this use case too.


In this case, LAS will add the data sets and variables it finds in the THREDDS catalog using the addXML parameters in the properties section. This means the dates will be parsed using "yyyy-M-d", the categories matching the THREDDS hierarchy will be included and the data sets will be named using the value of the "experiment_id" global attribute in the file. (The "esg" parameter means the catalog includes ESG metadata and LAS will use that metadata when building its configuration.)

Using the update_time and update_interval, LAS will mark each dataset with an "expires" attribute. Once the update_interval has passed (24 hours in this case), the next time 11pm rolls around LAS will re-initialize itself. During that process, the configuration for this THREDDS catalog will be regenerated because its expires date will have passed.

If a catalog is marked with an expires date, but that date has still not passed then the configuration for that catalog will be read from the cache. If the original configuration did not include an update_time and update_interval, the configuration will always be read from the cache and the catalog will never be re-read unless the configuration for that catalog is not found in the cache for some reason.

Finally, LAS will compute the next time it should reinitialize based on the minimum time to the next "expires" time.
Yes that sounds interesting and useful alright.

This means the addXML logic will now be internal to LAS, which is why it's critical for us to figure out how to get a configuration that meets your needs directly from addXML. Help me out with suggestions for how addXML could do a better job reading the catalogs you're interested in using.
My idea is to add an "-idbase" command line option which would be used instead of the sha1 hash when generating the IDs. That way I can call addXML -idbase ne_atl_rect, which will have the same effect as my Perl one-liner!
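As a sketch, the proposed option would be invoked like so (the -idbase flag does not exist in addXML yet, the catalog URL is a placeholder, and the wrapper script name may differ by installation):

```
addXML.sh -t http://ourhost/thredds/roms/catalog.xml -idbase ne_atl_rect
```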

Thanks again, Roland, yer doing great work!

Roland


Thanks,
Robert.




--
Robert Fuller, Applepie Solutions,
5 Woodlands Avenue, Renmore, Galway, Ireland.
+353.86.0507760  http://www.aplpi.com
Registered in Ireland, no. 289353


Dept of Commerce / NOAA / OAR / PMEL / TMAP